Computer science
Question answering
Pattern
Modality (human–computer interaction)
Task (project management)
Perspective (graphical)
Audiovisual
Feature (linguistics)
Feature extraction
Artificial intelligence
Information retrieval
Natural language processing
Multimedia
Linguistics
Social science
Philosophy
Management
Sociology
Economics
Authors
Zailong Chen, Lei Wang, Peng Wang, Peng Gao
Identifier
DOI:10.1109/tcsvt.2023.3318220
Abstract
As a newly emerging task, audio-visual question answering (AVQA) has attracted research attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses new challenges due to the higher complexity of feature extraction and fusion brought by the multimodal inputs. First, AVQA requires a more comprehensive understanding of the scene, which involves both audio and visual information; second, in the presence of more information, feature extraction has to be better connected with the given question; third, features from different modalities need to be sufficiently correlated and fused. To address these challenges, this work proposes a novel framework for the multimodal question answering task. It characterises an audio-visual scene at both the global and local levels, and within each level, the features from different modalities are well fused. Furthermore, the given question is utilised to guide not only the feature extraction at the local level but also the final fusion of global and local features to predict the answer. Our framework provides a new perspective for audio-visual scene understanding by focusing on both general and specific representations and by aggregating multiple modalities while prioritising question-related information. As experimentally demonstrated, our method significantly improves on existing audio-visual question answering performance, with average absolute gains of 3.3% and 3.1% on the MUSIC-AVQA and AVQA datasets, respectively. Moreover, an ablation study verifies the necessity and effectiveness of our design. Our code will be publicly released.
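The abstract describes question-guided extraction and fusion of audio and visual features without giving implementation detail. The sketch below is not the authors' method or released code; it is a minimal illustration of one common way such question-guided cross-modal fusion can be realised, where a pooled question embedding attends over per-segment audio and per-frame visual features and the conditioned summaries are concatenated for answer prediction. All module names, dimensions, the number of answer classes, and the single-layer attention design are assumptions made for illustration only.

# Minimal sketch (assumed design, not the paper's architecture) of
# question-guided audio-visual fusion for AVQA-style answer prediction.
import torch
import torch.nn as nn


class QuestionGuidedFusion(nn.Module):
    def __init__(self, dim=512, num_answers=42):  # num_answers is a placeholder
        super().__init__()
        # Cross-modal attention: the question embedding is the query over each modality.
        self.audio_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Fuse the question-conditioned audio and visual summaries with the question.
        self.fuse = nn.Sequential(
            nn.Linear(3 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_answers),
        )

    def forward(self, audio_feats, visual_feats, question_emb):
        # audio_feats:  (B, Ta, dim)  per-segment audio features
        # visual_feats: (B, Tv, dim)  per-frame or per-region visual features
        # question_emb: (B, dim)      pooled question embedding
        q = question_emb.unsqueeze(1)                          # (B, 1, dim)
        audio_ctx, _ = self.audio_attn(q, audio_feats, audio_feats)
        visual_ctx, _ = self.visual_attn(q, visual_feats, visual_feats)
        fused = torch.cat([audio_ctx.squeeze(1),
                           visual_ctx.squeeze(1),
                           question_emb], dim=-1)              # (B, 3*dim)
        return self.fuse(fused)                                # answer logits


if __name__ == "__main__":
    model = QuestionGuidedFusion()
    audio = torch.randn(2, 10, 512)    # e.g. 10 audio segments
    visual = torch.randn(2, 16, 512)   # e.g. 16 sampled frames
    question = torch.randn(2, 512)
    print(model(audio, visual, question).shape)  # torch.Size([2, 42])

In this sketch the question conditions both modalities before fusion, which mirrors the abstract's emphasis on using the question to guide feature extraction and the final fusion; the paper's actual global/local two-level design is not reproduced here.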