Computer science
Natural language processing
Grammar
Linguistics
Context (archaeology)
Language model
Artificial intelligence
Reading (process)
Coding (set theory)
Mandarin Chinese
Decoding methods
Language acquisition
Syntax
Context model
Natural language
Speech recognition
Linear subspace
Semantics (computer science)
Phrase structure rules
Language identification
Second language
Language understanding
Computational linguistics
Visual language
Identification (information)
DOI:10.1109/taslp.2023.3282109
Abstract
We observe that in lip reading the language is transformed locally rather than globally, i.e., speaking and writing follow the same basic grammar rules. In this work, we present a cross-modal language model to tackle the lip-reading challenge on silent videos. Compared to previous works, we consider multi-motion-informed contexts composed of multiple lip-motion representations from different subspaces to guide decoding via the source-target attention mechanism. We present a piece-wise pre-training strategy inspired by multi-task learning: a visual module is pre-trained to generate multi-motion-informed contexts for cross-modality, and a decoder is pre-trained to generate texts for language modeling. Our final large-scale model outperforms baseline models on four datasets: LRS2, LRS3, LRW, and GRID. We will open-source our code on GitHub.
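The abstract describes decoding guided by multi-motion-informed contexts through source-target (cross) attention. The following PyTorch sketch is only a rough illustration of that idea, not the authors' released implementation: the class name, the per-subspace linear projections, and all hyperparameters (d_model, n_heads, n_subspaces) are hypothetical choices made for this example.

```python
# Hypothetical sketch: a decoder layer whose text queries attend, via
# source-target attention, to several lip-motion representations projected
# into different subspaces (an assumed stand-in for "multi-motion-informed
# contexts"; not the paper's actual architecture).
import torch
import torch.nn as nn


class MultiMotionCrossAttentionDecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_subspaces=3):
        super().__init__()
        # One projection per subspace turns the shared visual features into
        # distinct lip-motion representations (illustrative design choice).
        self.subspace_proj = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_subspaces)]
        )
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text_emb, visual_feat, causal_mask=None):
        # text_emb:    (batch, text_len, d_model) embeddings of tokens decoded so far
        # visual_feat: (batch, vid_len, d_model)  lip-motion features from the visual module
        # Build the multi-motion-informed context by stacking per-subspace views
        # along the sequence axis.
        contexts = torch.cat([proj(visual_feat) for proj in self.subspace_proj], dim=1)

        # Masked self-attention over the partially generated text.
        x, _ = self.self_attn(text_emb, text_emb, text_emb, attn_mask=causal_mask)
        x = self.norm1(text_emb + x)

        # Source-target attention: text queries attend to the motion contexts.
        y, _ = self.cross_attn(x, contexts, contexts)
        y = self.norm2(x + y)

        return self.norm3(y + self.ffn(y))


if __name__ == "__main__":
    layer = MultiMotionCrossAttentionDecoderLayer()
    text = torch.randn(2, 10, 512)   # partially decoded token embeddings
    video = torch.randn(2, 40, 512)  # lip-motion features for 40 frames
    print(layer(text, video).shape)  # torch.Size([2, 10, 512])
```

Under this reading, the "piece-wise" pre-training mentioned in the abstract would pre-train the visual module producing visual_feat and the text decoder separately before joint fine-tuning; the sketch above only covers the fusion step.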