Closed captioning
Top-down and bottom-up design
Computer science
Object (grammar)
Inference
Sentence
Artificial intelligence
Benchmark (surveying)
Focus (optics)
Coding (set theory)
Image (mathematics)
Natural language processing
Computer vision
Programming language
Geography
Set (abstract data type)
Physics
Optics
Geodesy
Authors
Yingwei Pan,Yehao Li,Ting Yao,Tao Mei
Abstract
The bottom-up and top-down attention mechanism has revolutionized image captioning by enabling object-level attention for multi-step reasoning over all detected objects. However, when humans describe an image, they draw on subjective experience to focus on only a few salient objects worth mentioning, rather than every object in the image. The focused objects are further arranged in linguistic order, yielding an "object sequence of interest" that composes an enriched description. In this work, we present the Bottom-up and Top-down Object inference Network (BTO-Net), which exploits the object sequence of interest as a top-down signal to guide image captioning. Technically, conditioned on the bottom-up signals (all detected objects), an LSTM-based object inference module is first learned to produce the object sequence of interest, which acts as a top-down prior mimicking the subjective experience of humans. Next, the bottom-up and top-down signals are dynamically integrated via an attention mechanism for sentence generation. Furthermore, to prevent a cacophony of intermixed cross-modal signals, a contrastive learning-based objective restricts the interaction between bottom-up and top-down signals, leading to reliable and explainable cross-modal reasoning. Our BTO-Net achieves competitive performance on the COCO benchmark, in particular 134.1% CIDEr on the COCO Karpathy test split. Source code is available at https://github.com/YehLi/BTO-Net .
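The core idea in the abstract — attending over bottom-up object features and a top-down "object sequence of interest", then fusing the two contexts at each decoding step — can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, the random features standing in for detected objects and inferred objects, and the fixed 50/50 fusion gate are all assumptions (BTO-Net learns the integration dynamically and trains the inference module with an LSTM and a contrastive objective).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    """Dot-product attention: weight each object feature by its
    relevance to the decoder query, return the fused context."""
    scores = features @ query           # (N,) one score per object
    weights = softmax(scores)           # (N,) attention distribution
    return weights @ features, weights  # context (d,), weights (N,)

rng = np.random.default_rng(0)
d = 8                                       # assumed feature dimension
bottom_up = rng.standard_normal((5, d))     # stand-in: all detected objects
top_down = rng.standard_normal((3, d))      # stand-in: object sequence of interest
query = rng.standard_normal(d)              # stand-in: decoder hidden state

ctx_bu, w_bu = attend(query, bottom_up)     # attend over bottom-up signals
ctx_td, w_td = attend(query, top_down)      # attend over top-down signals
context = 0.5 * ctx_bu + 0.5 * ctx_td       # fixed gate here; learned in BTO-Net
```

Each attention distribution sums to one over its own signal set, so the fused `context` vector blends object-level evidence with the top-down prior before word prediction.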