Computer science
Leverage (statistics)
Question answering
Artificial intelligence
Exploit
Pattern
Semantics (computer science)
Graph
Modality (human–computer interaction)
Comprehension
Natural language processing
Machine learning
Theoretical computer science
Sociology
Programming language
Computer security
Social science
Authors
Yun Liu,Xiaoming Zhang,Feiran Huang,Bo Zhang,Zhoujun Li
Source
Journal: IEEE Transactions on Image Processing
[Institute of Electrical and Electronics Engineers]
Date: 2022-01-01
Volume/pages: 31: 1684-1696
Citations: 11
Identifier
DOI:10.1109/tip.2022.3142526
Abstract
Due to the rich spatio-temporal visual content and complex multimodal relations, Video Question Answering (VideoQA) has become a challenging task and attracted increasing attention. Current methods usually leverage visual attention, linguistic attention, or self-attention to uncover latent correlations between video content and question semantics. Although these methods exploit interactive information between different modalities to improve comprehension ability, inter- and intra-modality correlations cannot be effectively integrated in a uniform model. To address this problem, we propose a novel VideoQA model called Cross-Attentional Spatio-Temporal Semantic Graph Networks (CASSG). Specifically, a multi-head multi-hop attention module with diversity and progressivity is first proposed to explore fine-grained interactions between different modalities in a crossing manner. Then, heterogeneous graphs are constructed from the cross-attended video frames, clips, and question words, in which multi-stream spatio-temporal semantic graphs are designed to synchronously reason over inter- and intra-modality correlations. Finally, a global and local information fusion method is proposed to coalesce the local reasoning vector learned from the multi-stream spatio-temporal semantic graphs with the global vector learned from another branch to infer the answer. Experimental results on three public VideoQA datasets confirm the effectiveness and superiority of our model compared with state-of-the-art methods.
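The abstract does not give the exact formulation of the cross-attention module, but the general idea of one modality attending over another can be sketched generically. The following is a minimal, hypothetical illustration of one multi-head cross-attention hop (question words attending over video frame features), with random matrices standing in for the learned projections; it is not the paper's CASSG implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_hop(query, context, num_heads=4, seed=0):
    """One generic cross-attention hop: `query` attends over `context`.

    query:   (Lq, d) array, e.g. question word features
    context: (Lc, d) array, e.g. video frame/clip features
    Returns a (Lq, d) array of context-aware query features.
    """
    Lq, d = query.shape
    assert d % num_heads == 0
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned weight matrices (illustrative only).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Project and split into heads: (heads, length, dh).
    Q = (query @ Wq).reshape(Lq, num_heads, dh).transpose(1, 0, 2)
    K = (context @ Wk).reshape(-1, num_heads, dh).transpose(1, 0, 2)
    V = (context @ Wv).reshape(-1, num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (heads, Lq, Lc).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)
    out = softmax(scores) @ V                      # (heads, Lq, dh)
    # Merge heads back to (Lq, d).
    return out.transpose(1, 0, 2).reshape(Lq, d)

# Toy example: 5 question words attend over 8 video frames, d = 16.
q = np.random.default_rng(1).standard_normal((5, 16))
v = np.random.default_rng(2).standard_normal((8, 16))
attended = cross_attention_hop(q, v)
print(attended.shape)  # (5, 16)
```

A "multi-hop" variant would repeat such a hop, feeding each hop's output back as the next query; the crossing described in the abstract would run the hop in both directions (question→video and video→question).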