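Computer science
Bridging (networking)
Discriminant
Transformer
Closed captioning
Artificial intelligence
Redundancy (engineering)
Modal verb
Natural language processing
Speech recognition
Image (mathematics)
Engineering
Electrical engineering
Voltage
Operating system
Chemistry
Polymer chemistry
Computer network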
Authors
Jinhong Hu, Guojin Zhong, Jin Yuan, Wenbo Pan, Xiaoping Wang
Identifier
DOI:10.1109/tip.2025.3573471
Abstract
Recent advances in diffusion models (DMs) have demonstrated strong capabilities in generating images and text. This paper is the first to introduce DMs to image change captioning (ICC), proposing a novel Context-aware Contrastive Diffusion model with Mediator-bridging Cross-modal Transformer (MCT-CCDiff) that predicts visual difference descriptions conditioned on two similar images. Technically, MCT-CCDiff develops a Text Embedding Contrastive Loss (TECL) that leverages both positive and negative samples to distinguish text embeddings more effectively, yielding more discriminative text representations for ICC. To predict visual difference descriptions accurately, MCT-CCDiff introduces a Mediator-bridging Cross-modal Transformer (MCTrans) that explores the cross-modal correlations between visual differences and the corresponding text through a lightweight mediator, mitigating interference from visual redundancy and reducing interaction overhead. Additionally, it incorporates context-augmented denoising to better capture the contextual relationships among caption words; this is implemented by a revised diffusion loss that provides a tighter optimization bound and thereby improves optimization for high-quality text generation. Extensive experiments on four benchmark datasets demonstrate that our MCT-CCDiff significantly outperforms state-of-the-art ICC methods, marking a notable advance in generating precise visual difference descriptions.
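The abstract does not give the form of TECL. As a rough illustration of how a contrastive loss can use one positive and several negative samples to separate text embeddings, the sketch below implements a generic InfoNCE-style loss. This is an assumption for illustration, not the paper's loss: the function names (`cosine`, `info_nce_loss`), the cosine similarity, and the `temperature` value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's TECL).

    Computes -log( exp(s_pos / t) / (exp(s_pos / t) + sum_k exp(s_neg_k / t)) ),
    where s_* are cosine similarities to the anchor and t is the temperature.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D "text embeddings" for demonstration only.
aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy case the loss is small when the anchor matches its positive and large when it matches a negative instead, which is the behaviour a contrastive objective relies on to make embeddings more discriminative.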