Keywords
Computer science
Robustness (evolution)
Artificial intelligence
Convolutional neural network
Face (sociological concept)
Speech recognition
Computer vision
Naturalness
Coding (set theory)
Programming language
Gene
Sociology
Quantum mechanics
Set (abstract data type)
Physics
Chemistry
Biochemistry
Social science
Authors
Pengfei Li, Huihuang Zhao, Qingyun Liu, Peng Tang, Lin Zhang
Identifier
DOI:10.1016/j.compeleceng.2023.109049
Abstract
In this paper, we present TellMeTalk, an innovative approach for generating expressive talking face videos from multimodal inputs. Our approach is robust across identities, languages, expressions, and head movements. It overcomes four key limitations of existing talking face video generation methods: (1) reliance on single-modal learning from audio or text, which forgoes the complementary nature of multimodal inputs; (2) reliance on traditional convolutional neural network generators, which restrict the capture of spatial features; (3) the absence of natural head movements and expressions; and (4) artifacts, prominent boundaries caused by image overlapping, and unclear mouth regions. To address these challenges, we propose a face motion network that imbues character images with facial expressions and head movements. We also take text and reference audio as input to generate personalized audio. Furthermore, we introduce a generator equipped with a cross-attention module and Fast Fourier Convolution (FFC) blocks to model spatial dependencies. Finally, a face restoration module is designed to reduce artifacts and prominent boundaries. Extensive experiments demonstrate that our method produces high-quality expressive talking face videos. Compared to state-of-the-art approaches, it achieves superior video quality and more precise lip synchronization. The source code is available at https://github.com/lifemo/TellMeTalk.
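The abstract names Fast Fourier Convolution blocks as the mechanism by which the generator models image-wide spatial dependencies. As a rough illustration only (not the authors' implementation; the class names, the 50/50 local/global channel split, and the normalization choices below are all assumptions), a minimal FFC-style block in PyTorch could look like this:

```python
import torch
import torch.nn as nn


class SpectralTransform(nn.Module):
    """Global branch of an FFC block: a pointwise convolution applied in
    the 2-D Fourier domain, so every output position has an image-wide
    receptive field (unlike an ordinary local convolution)."""

    def __init__(self, channels):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis,
        # so the spectral conv operates on 2 * channels feature maps.
        self.conv = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels * 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")          # complex, (b, c, h, w//2+1)
        freq = torch.cat([freq.real, freq.imag], dim=1)  # (b, 2c, h, w//2+1)
        freq = self.conv(freq)
        real, imag = freq.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")


class FFCBlock(nn.Module):
    """Minimal FFC block: channels are split into a local branch
    (ordinary 3x3 conv) and a global branch (spectral transform),
    with cross-connections exchanging information between the two."""

    def __init__(self, channels, global_ratio=0.5):  # split ratio is an assumption
        super().__init__()
        self.c_g = int(channels * global_ratio)
        self.c_l = channels - self.c_g
        self.local_to_local = nn.Conv2d(self.c_l, self.c_l, 3, padding=1)
        self.local_to_global = nn.Conv2d(self.c_l, self.c_g, 3, padding=1)
        self.global_to_local = nn.Conv2d(self.c_g, self.c_l, 3, padding=1)
        self.global_to_global = SpectralTransform(self.c_g)

    def forward(self, x):
        x_l, x_g = x[:, : self.c_l], x[:, self.c_l :]
        out_l = self.local_to_local(x_l) + self.global_to_local(x_g)
        out_g = self.local_to_global(x_l) + self.global_to_global(x_g)
        return torch.cat([out_l, out_g], dim=1)


if __name__ == "__main__":
    # Shape check on a dummy feature map.
    x = torch.randn(1, 64, 96, 96)
    print(FFCBlock(64)(x).shape)  # torch.Size([1, 64, 96, 96])
```

The design point this sketch illustrates is why FFC suits the stated goal: the spectral branch mixes all spatial positions in a single layer, so global structure (head pose, face outline) can inform local synthesis (the mouth region) without stacking many convolutional layers.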