Image Captioning under Extreme Occlusion Settings

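Closed captioning; Computer science; Artificial intelligence; Automatic summarization; Computer vision; Image (mathematics); Set (abstract data type); Task (project management); Regularization (linguistics); Natural language processing; Pattern recognition (psychology); Encoder; Embedding; Speech recognition; Autoencoder; Encoding (memory); Redundancy (engineering); Preprocessor; Semantics (computer science); Segmentation; Task analysis; Block (permutation group theory); Decoding methods; Gesture; Visualization; Image quality; Similarity (geometry); Language model; Convolutional neural network
Author
RUI DAVID FREITAS CARDOSO
Source
Journal: RCAAP Project by FCT - Portuguese National Funding Agency for Science, Research and Technology - RCAAP Search Portal
Abstract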

Image captioning is a research area in Artificial Intelligence (AI) that aims to generate coherent and contextually accurate textual descriptions of images. Its practical applications include image retrieval, video summarization, and enhancing human–computer interaction in areas such as robotics and virtual reality. Vision-Language Models (VLMs) are well suited to this multimodal task and often rely on pretrained vision encoders such as Contrastive Language-Image Pre-training (CLIP). However, CLIP underperforms when faced with occluded objects, where crucial visual cues are missing. In this work, we investigate whether a lightweight unified multimodal decoder that does not use pretrained data can outperform CLIP-based baselines under the same settings. Given an input image, we learn a model that generates a textual caption with just a few selected patches of the image as context. The baseline experiment replaces CLIP's embeddings with flattened patches in the text sequence, and subsequent experiments iteratively extend this setup to probe different aspects of the methodology. Specifically, we ask: (i) does inserting patch embeddings both before and after the text sequence improve alignment between modalities? (ii) can replacing a single occluded CLIP embedding with multiple patch tokens under the same occlusion conditions enhance semantic recovery? (iii) do convolutionally preprocessed patches yield more informative visual representations? (iv) does adding two-dimensional positional encoding improve spatial awareness? (v) how sensitive is caption quality to the specific set of randomly sampled patches? (vi) can additional regularization to align patch embeddings further strengthen visual grounding? Most of our results show consistent gains over the baseline, narrowing the gap to CLIP embeddings.
Nonetheless, the unified decoder lags behind CLIP on standard captioning metrics (BLEU@4, METEOR, CIDEr, SPICE), suggesting either the need for substantially larger models and datasets, or that architectures with unimodal encoders (e.g., image-specific encoders) remain better suited for robust captioning under extreme partial occlusion.
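
To make the setup concrete, the following is a minimal NumPy sketch of how a few randomly sampled, flattened patches with a two-dimensional positional encoding could be turned into visual tokens. It is an illustration under stated assumptions, not the thesis implementation: the function names, the patch size, the number of kept patches, the sinusoidal form of the encoding, and the random projection standing in for a learned layer are all hypothetical and do not come from the abstract.

```python
import numpy as np


def sample_patches(image, patch_size, num_patches, rng):
    """Cut an HxWxC image into non-overlapping patches and keep a random subset.

    Returns (flat_patches, coords). flat_patches has shape
    (num_patches, patch_size * patch_size * C); coords holds the (row, col)
    grid index of each kept patch. All parameter values are illustrative.
    """
    h, w, c = image.shape
    rows, cols = h // patch_size, w // patch_size
    # Crop so the image divides evenly, then rearrange it into a patch grid.
    grid = image[: rows * patch_size, : cols * patch_size]
    grid = grid.reshape(rows, patch_size, cols, patch_size, c).swapaxes(1, 2)
    flat = grid.reshape(rows * cols, patch_size * patch_size * c)
    # Keep only a few patches; the rest of the image is treated as occluded.
    keep = np.sort(rng.choice(rows * cols, size=num_patches, replace=False))
    coords = np.stack([keep // cols, keep % cols], axis=1)
    return flat[keep], coords


def positional_encoding_2d(coords, dim):
    """Sinusoidal 2D encoding (an assumed form): half the channels encode
    the patch row, the other half encode the patch column."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    quarter = dim // 4
    freqs = 1.0 / (10000.0 ** (np.arange(quarter) / quarter))
    row = coords[:, :1] * freqs   # shape (n, quarter)
    col = coords[:, 1:2] * freqs  # shape (n, quarter)
    return np.concatenate(
        [np.sin(row), np.cos(row), np.sin(col), np.cos(col)], axis=1
    )


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real RGB image
patches, coords = sample_patches(image, patch_size=8, num_patches=4, rng=rng)

# Hypothetical random projection standing in for a learned embedding layer.
proj = rng.normal(size=(patches.shape[1], 64)) * 0.02
tokens = patches @ proj + positional_encoding_2d(coords, 64)
print(tokens.shape)  # (4, 64)
```

In a decoder of the kind described above, tokens like these would be placed in the same sequence as the caption's text tokens. Re-running with a different seed changes which patches are kept, which is the kind of variation question (v) examines.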
