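Closed captioning
Computer science
Discriminative
Correctness
Artificial intelligence
Object (grammar)
Inference
Grounding
Natural language processing
Word (group theory)
Probabilistic logic
Vocabulary
Image (mathematics)
Machine learning
Pattern recognition (psychology)
Computer vision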
Authors
Wenhui Jiang, Minwei Zhu, Yuming Fang, Guangming Shi, Xiaowei Zhao, Yang Liu
Source
Journal: IEEE Transactions on Image Processing
[Institute of Electrical and Electronics Engineers]
Date: 2022-01-01
Volume/Issue: 1-1
Identifier
DOI: 10.1109/tip.2022.3177318
Abstract
Attention mechanisms have been extensively adopted in vision-and-language tasks such as image captioning. They encourage a captioning model to dynamically ground appropriate image regions when generating words or phrases, which is critical for alleviating object hallucination and language bias. However, current studies show that the grounding accuracy of existing captioners is still far from satisfactory. Recently, much effort has been devoted to improving grounding accuracy by linking words to the full content of objects in images. However, because grounding annotations are noisy and object appearance varies widely, such strict word-object alignment regularization may not be optimal for improving captioning performance. In this paper, to improve both grounding and captioning performance, we propose a novel grounding model that implicitly links words to the evidence in the image. The proposed model encourages the captioner to dynamically focus on informative regions of the objects, which may be either discriminative parts or the full object content. With relaxed constraints, the proposed captioning model can capture correct linguistic characteristics and visual relevance, and thus generate more grounded image captions. In addition, we propose a novel quantitative metric for evaluating the correctness of the soft attention mechanism that considers the overall contribution of all object proposals when generating certain words. The proposed grounding model can be seamlessly plugged into most attention-based architectures without introducing additional inference complexity. We conduct extensive experiments on the Flickr30k [1] and MS COCO [2] datasets, demonstrating that the proposed method consistently improves both grounding and captioning. Moreover, the proposed attention evaluation metric shows better consistency with captioning performance.