Computer science
Closed captioning
Question answering
Artificial intelligence
Natural language processing
Information retrieval
Visualization
Image (mathematics)
Authors
Yiwei Ma,Jiayi Ji,Xiaoshuai Sun,Yiyi Zhou,Yongjian Wu,Feiyue Huang,Rongrong Ji
Identifier
DOI:10.1109/tmm.2022.3164787
Abstract
Attention has become an indispensable component of models for various multimedia tasks such as Image Captioning (IC) and Visual Question Answering (VQA). However, most existing attention modules are designed to capture spatial dependency and remain insufficient for semantic understanding, e.g., the categories of objects and their attributes, which is also critical for image captioning. To compensate for this defect, we propose a novel attention module termed Channel-wise Attention Block (CAB) to model channel-wise dependency for both the visual and linguistic modalities, thereby improving semantic learning and multi-modal reasoning simultaneously. Specifically, CAB has two novel designs to tackle the high overhead of channel-wise attention: the reduction-reconstruction block structure and the gating-based attention prediction. Based on CAB, we further propose a novel Semantic-enhanced Dual Attention Transformer (termed SDATR), which combines the merits of spatial and channel-wise attention. To validate SDATR, we conduct extensive experiments on the MS COCO dataset and achieve new state-of-the-art performance: a CIDEr score of 134.5 on the COCO Karpathy test split and 136.0 on the official online testing server. To examine the generalization of SDATR, we also apply it to the task of visual question answering, where superior performance gains are also observed. The code and models are publicly available at https://github.com/xmu-xiaoma666/SDATR.
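The abstract's core idea, channel-wise attention with a reduction-reconstruction structure and gating-based prediction, can be illustrated with a generic squeeze-and-excitation-style sketch. This is an assumption-laden illustration of the general technique, not the paper's exact CAB module; all function and weight names below are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w_reduce, w_expand):
    """Generic channel-wise attention sketch (not the paper's CAB).

    features: (N, C) array of N region features with C channels.
    w_reduce: (C, C // r) down-projection ("reduction" step).
    w_expand: (C // r, C) up-projection ("reconstruction" step).
    Returns features rescaled per channel by gates in (0, 1).
    """
    pooled = features.mean(axis=0)                # (C,) global channel descriptor
    hidden = np.maximum(pooled @ w_reduce, 0.0)   # reduction + ReLU
    gates = sigmoid(hidden @ w_expand)            # gating-based channel weights
    return features * gates                       # broadcast gates over all regions

# Toy usage with random features and weights
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))      # 5 regions, 8 channels
w_reduce = rng.standard_normal((8, 2))   # reduction ratio r = 4
w_expand = rng.standard_normal((2, 8))
out = channel_attention(feats, w_reduce, w_expand)
```

The reduction step shrinks the channel descriptor before the gate is predicted, which is how this family of modules keeps the overhead of channel-wise attention low; the sigmoid gate then scales each channel without mixing spatial positions.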