Visual Cluster Grounding for Image Captioning

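Closed captioning, Computer science, Discriminative model, Correctness, Artificial intelligence, Object (grammar), Inference, Grounding, Natural language processing, Word (group theory), Probabilistic logic, Vocabulary, Image (mathematics), Machine learning, Pattern recognition (psychology), Computer vision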

Authors

Wenhui Jiang, Minwei Zhu, Yuming Fang, Guangming Shi, Xiaowei Zhao, Yang Liu

Source

Journal: IEEE Transactions on Image Processing [Institute of Electrical and Electronics Engineers]
Date: 2022-01-01 Volume/Issue: 1-1

Identifiers

DOI: 10.1109/tip.2022.3177318

Abstract

Attention mechanisms have been extensively adopted in vision and language tasks such as image captioning. They encourage a captioning model to dynamically ground appropriate image regions when generating words or phrases, which is critical for alleviating object hallucination and language bias. However, current studies show that the grounding accuracy of existing captioners is still far from satisfactory. Recently, much effort has been devoted to improving grounding accuracy by linking words to the full content of objects in images. However, because of noisy grounding annotations and large variations in object appearance, such strict word-object alignment regularization may not be optimal for improving captioning performance. In this paper, to improve both grounding and captioning performance, we propose a novel grounding model that implicitly links words to the evidence in the image. The proposed model encourages the captioner to dynamically focus on informative regions of the objects, which may be either discriminative parts or the full object content. With relaxed constraints, the proposed captioning model can capture correct linguistic characteristics and visual relevance, and then generate more grounded image captions. In addition, we propose a novel quantitative metric for evaluating the correctness of the soft attention mechanism by considering the overall contribution of all object proposals when generating certain words. The proposed grounding model can be seamlessly plugged into most attention-based architectures without adding inference complexity. We conduct extensive experiments on the Flickr30k [1] and MS COCO [2] datasets, demonstrating that the proposed method consistently improves both grounding and captioning. In addition, the proposed attention evaluation metric shows better consistency with captioning performance.
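
The abstract does not define the proposed attention metric, so the following is only an illustrative sketch of the general idea of scoring soft attention by the combined contribution of all object proposals, not the authors' formulation. It assumes axis-aligned boxes, an IoU threshold of 0.5, and hypothetical helper names (`iou`, `attention_mass_on_target`); it reports the share of attention weight that falls on proposals overlapping the annotated box for a word.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def attention_mass_on_target(
    weights: Sequence[float],
    proposals: Sequence[Box],
    gt_box: Box,
    iou_thresh: float = 0.5,  # assumed threshold, not taken from the paper
) -> float:
    """Fraction of total attention weight placed on proposals that
    overlap the annotated box by at least `iou_thresh`."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    on_target = sum(
        w for w, p in zip(weights, proposals) if iou(p, gt_box) >= iou_thresh
    )
    return on_target / total


# Toy example: three proposals, two of which overlap the annotated box.
proposals = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 60, 60)]
weights = [0.5, 0.3, 0.2]
score = attention_mass_on_target(weights, proposals, gt_box=(0, 0, 10, 10))
print(f"attention mass on target: {score:.2f}")
```

Unlike a check that looks only at the single highest-weighted region, a score of this kind uses every proposal's weight, which is the property the abstract attributes to the proposed metric.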
