Computer science
Computer facial animation
Animation
Artificial intelligence
Autoregressive model
Codebook
Facial motion capture
Computer vision
Speech recognition
Computer animation
Pattern recognition (psychology)
Face detection
Facial recognition system
Computer graphics (images)
Mathematics
Econometrics
Authors
Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, Tien-Tsin Wong
Identifier
DOI: 10.1109/cvpr52729.2023.01229
Abstract
Speech-driven 3D facial animation has been widely studied, yet there is still a gap in achieving realism and vividness due to the highly ill-posed nature of the problem and the scarcity of audio-visual data. Existing works typically formulate the cross-modal mapping as a regression task, which suffers from the regression-to-mean problem and leads to over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code-query task in a finite proxy space of a learned codebook, which effectively promotes the vividness of the generated motions by reducing the cross-modal mapping uncertainty. The codebook is learned by self-reconstruction over real facial motions and is thus embedded with realistic facial-motion priors. Over the discrete motion space, a temporal autoregressive model is employed to sequentially synthesize facial motions from the input speech signal, which guarantees lip sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. A user study further confirms the superiority of our approach in perceptual quality. Code and a video demo are available at https://doubiiu.github.io/projects/codetalker.
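To make the two-stage idea in the abstract concrete, here is a minimal, illustrative PyTorch sketch, not the authors' CodeTalker implementation: a codebook module that quantizes continuous facial-motion features to their nearest learned codes, and a toy autoregressive predictor that queries one code index per frame from speech features while feeding back the previously chosen code. All class names, dimensions, and the GRU-based predictor are assumptions made for illustration.

```python
# Illustrative sketch only (assumed names/dims); not the paper's architecture.
import torch
import torch.nn as nn

class MotionCodebook(nn.Module):
    """Discrete proxy space: maps continuous motion features to nearest codes."""
    def __init__(self, num_codes: int = 256, code_dim: int = 64):
        super().__init__()
        # In the paper this codebook is learned by self-reconstruction
        # over real facial motions; here it is just randomly initialized.
        self.codes = nn.Embedding(num_codes, code_dim)

    def quantize(self, z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # z: (T, code_dim) continuous per-frame motion features.
        dists = torch.cdist(z, self.codes.weight)   # (T, num_codes) distances
        idx = dists.argmin(dim=-1)                  # nearest-code index per frame
        return self.codes(idx), idx                 # quantized features + indices

class SpeechToCode(nn.Module):
    """Toy autoregressive predictor: next code index from speech + past codes."""
    def __init__(self, num_codes: int = 256, code_dim: int = 64, audio_dim: int = 128):
        super().__init__()
        self.fuse = nn.GRU(audio_dim + code_dim, 128, batch_first=True)
        self.head = nn.Linear(128, num_codes)

    @torch.no_grad()
    def generate(self, audio: torch.Tensor, codebook: MotionCodebook) -> torch.Tensor:
        # audio: (1, T, audio_dim); emit one code index per frame, feeding
        # the previously selected code embedding back into the next step.
        T = audio.shape[1]
        prev = torch.zeros(1, 1, codebook.codes.embedding_dim)
        h, out = None, []
        for t in range(T):
            step = torch.cat([audio[:, t : t + 1], prev], dim=-1)
            y, h = self.fuse(step, h)
            idx = self.head(y).argmax(dim=-1)       # greedy code query
            out.append(idx)
            prev = codebook.codes(idx)              # feed back chosen code
        return torch.cat(out, dim=1)                # (1, T) code indices

codebook = MotionCodebook()
feats = torch.randn(10, 64)                  # continuous per-frame motion features
quantized, idx = codebook.quantize(feats)    # discrete proxy of the motion
indices = SpeechToCode().generate(torch.randn(1, 10, 128), codebook)
print(quantized.shape, idx.shape, indices.shape)
```

The sketch shows why restricting generation to codebook lookups reduces mapping uncertainty: the predictor chooses among a finite set of realistic motion codes rather than regressing arbitrary continuous values, so it cannot average ambiguous targets into over-smoothed motion. The greedy argmax query here is a stand-in for whatever decoding strategy the paper actually uses.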