Topics
Computer science
Margin (machine learning)
Artificial intelligence
Object (grammar)
Word (group theory)
Dependency (UML)
Rank (graph theory)
Parsing
Natural language processing
Image (mathematics)
Pattern recognition (psychology)
Simplicity (philosophy)
Dependency grammar
Semantics (computer science)
Cognitive neuroscience of visual object recognition
Visualization
Object detection
Machine learning
Mathematics
Philosophy
Geometry
Combinatorics
Programming language
Epistemology
Authors
Viet-Quoc Pham, Nao Mishima
Identifier
DOI: 10.1109/icassp49357.2023.10096489
Abstract
Weakly supervised visual grounding aims to predict the image region that corresponds to a given linguistic query when the mapping between the target object and the query is unknown during training. The state-of-the-art method uses a vision-language pre-training model to obtain Grad-CAM heatmaps that match each query word to an image region, and uses the combined heatmap to rank region proposals. In this paper, we propose two simple but effective methods for improving this approach. First, we propose a target-aware cropping approach that encourages the model to learn both object-level and scene-level semantic representations. Second, we apply dependency parsing to extract the words related to the target object, and put extra emphasis on these words when combining the heatmaps. Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and RefCOCOg by a notable margin.
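The two scoring steps described in the abstract — up-weighting dependency-parsed target words when combining per-word Grad-CAM heatmaps, then ranking region proposals with the combined map — can be illustrated in a few lines. The sketch below is not the authors' implementation: the spaCy-based heuristic for picking target words, the `emphasis` weight, and the assumed shapes of `heatmaps` (one H×W map per query token) and `boxes` are illustrative choices, not details from the paper.

```python
# Minimal sketch (not the paper's code) of dependency-weighted heatmap
# combination and proposal ranking. Assumes spaCy and NumPy are installed,
# and that `heatmaps` holds one Grad-CAM map per query token.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def target_word_mask(query: str) -> np.ndarray:
    """Flag tokens describing the target object (assumed heuristic: the head
    noun of the first noun chunk plus its adjective/compound/numeral
    modifiers)."""
    doc = nlp(query)
    keep = set()
    chunks = list(doc.noun_chunks)
    if chunks:
        head = chunks[0].root  # e.g. "car" in "the red car on the left"
        keep.add(head.i)
        keep.update(c.i for c in head.children
                    if c.dep_ in {"amod", "compound", "nummod"})
    return np.array([tok.i in keep for tok in doc])

def combine_heatmaps(heatmaps: np.ndarray, mask: np.ndarray,
                     emphasis: float = 2.0) -> np.ndarray:
    """Weighted average of per-token (T, H, W) heatmaps; target words count
    `emphasis` times as much as the rest (the weighting scheme is assumed)."""
    weights = np.where(mask, emphasis, 1.0)
    weights /= weights.sum()
    return np.tensordot(weights, heatmaps, axes=1)  # -> (H, W)

def rank_proposals(heatmap: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Order (x1, y1, x2, y2) proposal boxes by mean heatmap energy inside."""
    scores = [heatmap[y1:y2, x1:x2].mean() if (y2 > y1 and x2 > x1) else 0.0
              for x1, y1, x2, y2 in boxes.astype(int)]
    return np.argsort(scores)[::-1]  # best proposal first
```

In the paper the per-word heatmaps come from Grad-CAM on a vision-language pre-training model; here they are taken as given, and the target-aware cropping step is omitted because the abstract does not specify how the crops are chosen.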