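Closed captioning
Computer science
Artificial intelligence
Automatic summarization
Computer vision
Image (mathematics)
Set (abstract data type)
Task (project management)
Regularization (linguistics)
Natural language processing
Pattern recognition (psychology)
Encoder
Embedding
Speech recognition
Autoencoder
Encoding (memory)
Redundancy (engineering)
Preprocessor
Semantics (computer science)
Segmentation
Task analysis
Block (permutation group theory)
Decoding methods
Gesture
Visualization
Image quality
Similarity (geometry)
Language model
Convolutional neural network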
Author
RUI DAVID FREITAS CARDOSO
Source
Journal: RCAAP Project by FCT - Portuguese National Funding Agency for Science, Research and Technology - RCAAP Search Portal
Date: 2025-11-18
Abstract
Image captioning is a research area in Artificial Intelligence (AI) that aims to generate coherent and contextually accurate textual descriptions of images. Its practical applications include image retrieval, video summarization and enhancing human–computer interaction in areas like robotics and virtual reality. Vision-Language Models (VLMs) are well suited to this multimodal task and often rely on pretrained vision encoders such as Contrastive Language-Image Pre-training (CLIP). However, CLIP underperforms when faced with occluded objects, where crucial visual cues are missing. In this work, we investigate whether a lightweight unified multimodal decoder that does not use pretrained data can outperform CLIP-based baselines under the same settings. Given an input image, we learn a model that generates a textual caption with just a few selected patches of the image as context. The baseline experiment replaces CLIP's embeddings with flattened patches in the text sequence, and subsequent experiments iteratively extend this setup to probe different aspects of the methodology. Specifically, we ask: (i) does inserting patch embeddings both before and after the text sequence improve alignment between modalities? (ii) can replacing a single occluded CLIP embedding with multiple patch tokens under the same occlusion conditions enhance semantic recovery? (iii) do convolutionally preprocessed patches yield more informative visual representations? (iv) does adding two-dimensional positional encoding improve spatial awareness? (v) how sensitive is caption quality to the specific set of randomly sampled patches? (vi) can additional regularization to align patch embeddings further strengthen visual grounding? Most of our results show consistent gains over the baseline, narrowing the gap to using CLIP embeddings.
Nonetheless, the unified decoder lags behind CLIP on standard captioning metrics (BLEU@4, METEOR, CIDEr, SPICE), suggesting either that substantially larger models and datasets are needed, or that architectures with unimodal encoders, e.g. image-specific encoders, remain better suited for robust captioning under extreme partial occlusion.
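
To make the "few selected patches" setup more concrete, here is a minimal NumPy sketch of how an image might be cut into non-overlapping patches, randomly subsampled, and flattened into vectors that could sit alongside text tokens. It is an illustration under stated assumptions, not the author's implementation: the function name `sample_flat_patches`, the patch size, the number of kept patches and the uniform sampling scheme are all hypothetical. The returned grid coordinates are the kind of input a two-dimensional positional encoding, as in question (iv), would need.

```python
import numpy as np


def sample_flat_patches(image, patch_size, num_patches, rng):
    """Hypothetical helper: keep a random subset of flattened image patches.

    image: array of shape (H, W, C); H and W must be multiples of patch_size.
    Returns (flat, positions): flat has shape (num_patches, patch_size**2 * C),
    positions has shape (num_patches, 2) with the (row, col) grid index of each patch.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size

    # (H, W, C) -> (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh * gw, p * p * C)
    patches = (
        image.reshape(gh, patch_size, gw, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, -1)
    )

    # Assumption: patches are drawn uniformly without replacement,
    # then sorted so they keep raster order.
    idx = np.sort(rng.choice(gh * gw, size=num_patches, replace=False))
    rows, cols = np.divmod(idx, gw)
    return patches[idx], np.stack([rows, cols], axis=1)


# Usage: a dummy 32x32 RGB image, 8x8 patches, 4 of the 16 patches kept.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
flat, positions = sample_flat_patches(image, patch_size=8, num_patches=4, rng=rng)
```

Re-running the call with different seeds gives different patch subsets, which is one simple way to probe the sensitivity raised in question (v).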