Computer science
Artificial intelligence
Position (finance)
Feature (linguistics)
Image (mathematics)
Sentence
Matching (statistics)
Object (grammar)
Embedding
Attention network
Relation (database)
Task (project management)
Modality (human-computer interaction)
Information retrieval
Pattern recognition (psychology)
Data mining
Economics
Management
Philosophy
Statistics
Linguistics
Mathematics
Finance
Authors
Yaxiong Wang,Hao Yang,Xiuxiu Bai,Xueming Qian,Lin Ma,Jing Lü,Biao Li,Xin Fan
Identifier
DOI:10.1109/tmm.2020.3024822
Abstract
Bi-directional image-text retrieval and matching have attracted much attention recently. This cross-domain task demands a fine-grained understanding of both modalities in order to learn a measure between data from the two modalities. In this paper, we propose a novel position focused attention network to investigate the relation between the visual and the textual views. This work integrates prior object position information to enhance visual-text joint-embedding learning. The image is first split into blocks, which are treated as the basic position cells, and the position of each image region is inferred. Then, we propose a position attention mechanism to model the relations between an image region and the position cells. Finally, we generate a valuable position feature to further enhance the region representation and to model a more reliable relationship between the visual image and the textual sentence. Experiments on the popular Flickr30K and MS-COCO datasets show the effectiveness of the proposed method. Beyond the public datasets, we also conduct experiments on our collected large-scale practical news dataset (Tencent-News) to validate the practical application value of the proposed method. To the best of our knowledge, this is the first attempt to evaluate image-text matching in such a practical application setting. Our method achieves competitive performance on all three datasets.
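The abstract describes the position attention step only at a high level. The following is a minimal PyTorch-style sketch of how such a mechanism could be wired, not the authors' implementation: the grid-cell layout, the learnable cell embeddings, the scaled dot-product attention weights, and the additive fusion of the position feature with the region feature are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionAttention(nn.Module):
    """Illustrative position attention: each detected image region attends
    over learnable embeddings of the grid cells (position blocks) it overlaps.
    This is a sketch of the idea in the abstract, not the paper's code."""

    def __init__(self, num_cells: int, embed_dim: int):
        super().__init__()
        self.cell_embed = nn.Embedding(num_cells, embed_dim)  # one vector per position cell
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)

    def forward(self, region_feats: torch.Tensor, cell_ids: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, R, D) visual features of R image regions
        # cell_ids:     (B, R, T) indices of the T position cells each region covers
        cells = self.cell_embed(cell_ids)                    # (B, R, T, D)
        q = self.query(region_feats).unsqueeze(2)            # (B, R, 1, D)
        k = self.key(cells)                                  # (B, R, T, D)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5         # scaled dot-product, (B, R, T)
        attn = F.softmax(scores, dim=-1)
        pos_feat = (attn.unsqueeze(-1) * cells).sum(dim=2)   # (B, R, D) position feature
        return region_feats + pos_feat                       # position-enhanced region expression


# Toy usage: a 16 x 16 grid of position cells, 3 regions per image,
# each region assumed to overlap 4 cells (all values are illustrative).
if __name__ == "__main__":
    pa = PositionAttention(num_cells=256, embed_dim=1024)
    feats = torch.randn(2, 3, 1024)
    cells = torch.randint(0, 256, (2, 3, 4))
    print(pa(feats, cells).shape)  # torch.Size([2, 3, 1024])
```

In this sketch each region's position feature is a weighted sum of the embeddings of the cells it overlaps, and the additive fusion at the end is just one way to combine it with the visual feature; the encoding and fusion scheme in the actual model may differ.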