Image Captioning under Extreme Occlusion Settings

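Closed captioning; Computer science; Artificial intelligence; Automatic summarization; Computer vision; Image (mathematics); Set (abstract data type); Task (project management); Regularization (linguistics); Natural language processing; Pattern recognition (psychology); Encoder; Embedding; Speech recognition; Autoencoder; Encoding (memory); Redundancy (engineering); Preprocessor; Semantics (computer science); Segmentation; Task analysis; Block (permutation group theory); Decoding methods; Gesture; Visualization; Image quality; Similarity (geometry); Language model; Convolutional neural network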

Author

RUI DAVID FREITAS CARDOSO

Source

Journal: RCAAP Project by FCT - Portuguese National Funding Agency for Science, Research and Technology - RCAAP Search Portal. Date: 2025-11-18

Link

handle.net

Abstract

Image captioning is a research area in Artificial Intelligence (AI) that aims to generate coherent and contextually accurate textual descriptions of images. Its practical applications include image retrieval, video summarization, and enhancing human–computer interaction in areas such as robotics and virtual reality. Vision-Language Models (VLMs) are well suited to this multimodal task and often rely on pretrained vision encoders such as Contrastive Language-Image Pre-training (CLIP). However, CLIP underperforms when faced with occluded objects, where crucial visual cues are missing. In this work, we investigate whether a lightweight unified multimodal decoder that does not use pretrained data can outperform CLIP-based baselines under the same settings. Given an input image, we learn a model that generates a textual caption with just a few selected patches of the image as context. The baseline experiment replaces CLIP's embeddings with flattened patches in the text sequence, and subsequent experiments iteratively extend this setup to probe different aspects of the methodology. Specifically, we ask: (i) does inserting patch embeddings both before and after the text sequence improve alignment between modalities? (ii) can replacing a single occluded CLIP embedding with multiple patch tokens under the same occlusion conditions enhance semantic recovery? (iii) do convolutionally preprocessed patches yield more informative visual representations? (iv) does adding two-dimensional positional encoding improve spatial awareness? (v) how sensitive is caption quality to the specific set of randomly sampled patches? (vi) can additional regularization to align patch embeddings further strengthen visual grounding? Most of our results show consistent gains over the baseline, narrowing the gap to using CLIP embeddings.
Nonetheless, the unified decoder lags behind CLIP on standard captioning metrics (BLEU@4, METEOR, CIDEr, SPICE), suggesting either the need for substantially larger models and datasets, or that architectures with uni-modal encoders, e.g. image-specific encoders, remain better suited for robust captioning under extreme partial occlusion.
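
The occlusion setting described in the abstract, where only a few randomly sampled, flattened patches of the image are available as context, can be illustrated with a short sketch. This is not the author's implementation: the function name `sample_flat_patches`, the non-overlapping grid, the 16-pixel patch size, the choice of four patches, and uniform sampling without replacement are all assumptions made for illustration.

```python
import numpy as np


def sample_flat_patches(image, patch_size, num_patches, rng):
    """Split an (H, W, C) image into non-overlapping patches and keep a random few.

    Hypothetical helper. Returns the kept patches flattened to shape
    (num_patches, patch_size * patch_size * C) and their (row, col) grid
    positions, shape (num_patches, 2).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    if num_patches > rows * cols:
        raise ValueError("num_patches exceeds the number of patches in the grid")

    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows * cols, p * p * C)
    grid = image.reshape(rows, patch_size, cols, patch_size, c)
    flat = grid.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)

    # Assumption: uniform sampling without replacement; all other patches
    # are treated as occluded and never shown to the model.
    keep = np.sort(rng.choice(rows * cols, size=num_patches, replace=False))
    positions = np.stack([keep // cols, keep % cols], axis=1)
    return flat[keep], positions


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)  # stand-in for a real image
patches, positions = sample_flat_patches(image, patch_size=16, num_patches=4, rng=rng)
print(patches.shape)    # (4, 768)
print(positions.shape)  # (4, 2)
```

In a setup like the one the abstract outlines, each flattened patch would presumably be projected to the decoder's embedding width and placed in the same sequence as the caption tokens. The returned grid positions are the kind of input a two-dimensional positional encoding (question iv) would need, and changing the random seed is one way to probe sensitivity to the sampled patch set (question v).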
