Computer science
Question answering
Graph
Artificial intelligence
Pattern
Natural language processing
Representation (politics)
Visualization
Information retrieval
Theoretical computer science
Social science
Sociology
Politics
Political science
Law
Authors
Wenbo Zheng, Lan Yan, Fei-Yue Wang
Identifier
DOI:10.1109/tsmc.2023.3319964
Abstract
While texts related to images convey fundamental messages for scene understanding and reasoning, text-based visual question answering tasks concentrate on visual questions that require reading texts from images. However, most current methods feed multimodal features that are independently extracted from a given image into a reasoning model without considering their inter- and intra-relationships across the three modalities (i.e., scene texts, questions, and images). To this end, we propose a novel text-based visual question answering model, multimodal graph reasoning. Our model first extracts intramodality relationships by taking the representations from identical modalities as semantic graphs. Then, we present graph multihead self-attention, which boosts each graph representation through graph-by-graph aggregation to capture the intermodality relationship. It is a case of "so many heads, so many wits" in the sense that as more semantic graphs are involved in this process, each graph representation becomes more effective. Finally, these representations are reprojected, and we perform answer prediction with their outputs. The experimental results demonstrate that our approach realizes substantially better performance compared with other state-of-the-art models.
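For intuition only, the sketch below illustrates the kind of pipeline the abstract describes: per-modality semantic-graph features are fused by multi-head self-attention across the three modalities, reprojected, and passed to an answer classifier. This is a minimal, hypothetical sketch, not the authors' implementation; the layer sizes, the mean-pooling of graph nodes, the use of torch.nn.MultiheadAttention, and the classifier head are all assumptions introduced here for illustration.

```python
# Minimal sketch (assumed design, not the paper's code): modality graphs are
# pooled, fused with multi-head self-attention, reprojected, and classified.
import torch
import torch.nn as nn


class GraphMultiheadSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_answers: int = 100):
        super().__init__()
        # Self-attention across the three modality graphs (scene texts, question, image).
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.reproject = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, scene_text_graph, question_graph, image_graph):
        # Each input: (batch, num_nodes, dim). Pool each semantic graph into one
        # modality vector (mean over nodes is an assumption made here).
        graphs = torch.stack(
            [g.mean(dim=1) for g in (scene_text_graph, question_graph, image_graph)],
            dim=1,
        )  # (batch, 3, dim)
        # Graph-by-graph aggregation: every modality attends to all modalities.
        fused, _ = self.attn(graphs, graphs, graphs)
        fused = self.reproject(fused)   # reproject the boosted representations
        pooled = fused.mean(dim=1)      # combine modalities for prediction
        return self.classifier(pooled)  # answer logits


if __name__ == "__main__":
    model = GraphMultiheadSelfAttention()
    b, d = 2, 256
    logits = model(torch.randn(b, 10, d), torch.randn(b, 8, d), torch.randn(b, 36, d))
    print(logits.shape)  # torch.Size([2, 100])
```

In this reading, adding more semantic graphs simply adds more rows to the attended sequence, which is one way to interpret the "so many heads, so many wits" remark: each modality's representation is refined by attending to every other graph involved.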