Computer Science
Perception
Artificial Intelligence
Speech Recognition
Computer Vision
Audiovisual
Natural Language Processing
Pattern Recognition (Psychology)
Psychology
Multimedia
Neuroscience
Authors
Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang
Identifier
DOI: 10.1109/tpami.2024.3479776
Abstract
In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely-studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It consists of three separate encoders for single modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain the VALOR model: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into the same common space, simultaneously building vision-language, audio-language, and audiovisual-language alignment. MGC learns to generate text tokens under conditions of vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, containing 1 million audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR can learn strong multimodal correlations and generalize to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
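To make the Multimodal Grouping Alignment (MGA) idea more concrete, the sketch below shows one plausible way to align text with vision, audio, and a fused audiovisual representation in a shared embedding space. It is a minimal illustration, not the authors' implementation: the encoders are replaced by placeholder random embeddings, the audiovisual fusion is a simple average, and the symmetric InfoNCE-style loss and temperature value are assumptions rather than details given in the abstract.

```python
import torch
import torch.nn.functional as F

# Placeholder per-modality embeddings for a batch of N clips; in VALOR these
# would come from the three separate single-modality encoders.
N, D = 8, 256
vision = F.normalize(torch.randn(N, D), dim=-1)
audio = F.normalize(torch.randn(N, D), dim=-1)
text = F.normalize(torch.randn(N, D), dim=-1)

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss between two sets of unit-norm embeddings."""
    logits = a @ b.t() / temperature      # pairwise cosine similarities
    targets = torch.arange(a.size(0))     # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Grouped alignment: text is contrasted with vision alone, audio alone, and a
# fused audiovisual embedding (averaging is an assumed placeholder for fusion).
audiovisual = F.normalize(vision + audio, dim=-1)
loss_mga = (contrastive_loss(vision, text) +
            contrastive_loss(audio, text) +
            contrastive_loss(audiovisual, text)) / 3
print(loss_mga.item())
```

The point of the grouping is that the same text embedding serves as the anchor for all three modality groups, so vision-language, audio-language, and audiovisual-language alignment are learned jointly in one common space rather than with separate projection heads.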