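Closed captioning
Computer science
Transformer
Locality
Grid
Graph
Artificial intelligence
Dual (grammatical number)
Information retrieval
Image (mathematics)
Data mining
Theoretical computer science
Physics
Literature
Philosophy
Art
Quantum mechanics
Voltage
Linguistics
Mathematics
Geometry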
Authors
Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, Rongrong Ji
Source
Journal: Proceedings of the ... AAAI Conference on Artificial Intelligence
[Association for the Advancement of Artificial Intelligence (AAAI)]
Date: 2021-05-18
Volume (issue), pages: 35 (3): 2286-2293
Citations: 240
Identifier
DOI:10.1609/aaai.v35i3.16328
Abstract
Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which in contrast are the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. Concretely, in DLCT, these two features are first processed by a novel Dual-way Self Attention (DWSA) to mine their intrinsic properties, where a Comprehensive Relation Attention component is also introduced to embed the geometric information. In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noise caused by the direct fusion of these two features, where a geometric alignment graph is constructed to accurately align and reinforce region and grid features. To validate our model, we conduct extensive experiments on the highly competitive MS-COCO dataset, and achieve new state-of-the-art performance on both local and online test sets, i.e., 133.8% CIDEr on Karpathy split and 135.4% CIDEr on the official split.
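The idea of a geometric alignment graph that constrains cross attention between region and grid features can be illustrated with a small sketch. The abstract does not give the construction, so everything below is an assumption made for illustration: the rule that a region and a grid cell are connected when their boxes overlap, the uniform grid, and the function names `grid_boxes`, `overlaps`, `alignment_graph`, and `masked_attention` are all hypothetical and are not taken from the paper's code.

```python
import math


def grid_boxes(n, width, height):
    """Boxes (x1, y1, x2, y2) of an n x n uniform grid, in row-major order."""
    cw, ch = width / n, height / n
    return [(c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)
            for r in range(n) for c in range(n)]


def overlaps(a, b):
    """True if boxes a and b share an area of positive size."""
    return (min(a[2], b[2]) > max(a[0], b[0])
            and min(a[3], b[3]) > max(a[1], b[1]))


def alignment_graph(regions, grids):
    """Assumed rule: adj[i][j] is True when region i overlaps grid cell j."""
    return [[overlaps(r, g) for g in grids] for r in regions]


def masked_attention(scores, mask):
    """Row-wise softmax over scores, restricted to entries where mask is True.

    A row with no allowed entry gets all-zero weights.
    """
    out = []
    for row, mrow in zip(scores, mask):
        kept = [s for s, m in zip(row, mrow) if m]
        if not kept:
            out.append([0.0] * len(row))
            continue
        top = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - top) if m else 0.0 for s, m in zip(row, mrow)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Two hypothetical region boxes on a 100 x 100 image with a 2 x 2 grid.
grids = grid_boxes(2, 100, 100)
regions = [(10, 10, 40, 40), (30, 30, 80, 80)]
adj = alignment_graph(regions, grids)

# Dummy attention scores (all zero) stand in for query-key products.
weights = masked_attention([[0.0] * 4, [0.0] * 4], adj)
```

In this toy setup the first region falls entirely inside the top-left cell, so it can attend only to that cell, while the second region touches all four cells and spreads its attention across them. A real model would replace the zero scores with learned query-key products and would most likely use a tensor library rather than nested lists.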