Artificial intelligence
Computer science
Computer vision
Image segmentation
Segmentation
Medical imaging
Image (mathematics)
Natural language processing
Authors
Q.-W. Pan,Zhengrong Li,Guang Yang,Qing Yang,Bing Ji
Identifier
DOI: 10.1109/TMI.2025.3622492
Abstract
The disparity between image and text representations, often referred to as the modality gap, remains a significant obstacle for Vision Language Models (VLMs) in medical image segmentation. This gap complicates multi-modal fusion and thereby restricts segmentation performance. To address this challenge, we propose the Evidence-driven Vision Language Model (EviVLM), a novel paradigm that integrates Evidential Learning (EL) into VLMs to systematically measure and mitigate the modality gap for enhanced multi-modal fusion. To drive this paradigm, an Evidence Affinity Map Generator (EAMG) is proposed to collect complementary cross-modal evidence by learning a global cross-modal affinity map, thus refining the modality-specific evidence embeddings. Evidence Differential Similarity Learning (EDSL) is further proposed to collect consistent cross-modal evidence by performing Bias-Variance Decomposition on the differential matrix derived from the bidirectional similarity matrices between image and text evidence embeddings. Finally, subjective logic is used to map the collected evidence to opinions, and a combination rule based on Dempster-Shafer theory is introduced for opinion aggregation, thereby quantifying the modality gap and facilitating effective multi-modal integration. Experimental results on three public medical image segmentation datasets validate that the proposed EviVLM achieves state-of-the-art performance. Code is available at: https://github.com/QingtaoPan/EviVLM.
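The abstract does not give the exact formulation of the opinion-aggregation step, so the following is only a minimal sketch of the general technique it names: per-class evidence mapped to subjective-logic opinions (belief masses plus an uncertainty mass derived from Dirichlet parameters) and two modalities' opinions fused with a reduced Dempster-Shafer combination rule. The function names, the toy evidence values, and the NumPy implementation are assumptions for illustration, not the authors' code.

```python
import numpy as np

def evidence_to_opinion(evidence):
    """Map non-negative per-class evidence to a subjective-logic opinion.

    Dirichlet parameters alpha_k = e_k + 1; belief b_k = e_k / S and
    uncertainty u = K / S, where S = sum_k alpha_k and K is the class count.
    """
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.size
    alpha = evidence + 1.0
    S = alpha.sum()
    belief = evidence / S
    uncertainty = K / S
    return belief, uncertainty

def ds_combine(b1, u1, b2, u2):
    """Fuse two opinions with a reduced Dempster-Shafer combination rule.

    The conflict C between the two opinions rescales the fused masses, so
    larger cross-modal disagreement leaves more mass on uncertainty.
    """
    # Conflict: belief mass the two sources place on *different* classes.
    C = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)
    scale = 1.0 / (1.0 - C)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    u = scale * (u1 * u2)
    return b, u

# Toy example: hypothetical per-class evidence from an image branch and a text branch.
img_evidence = np.array([4.0, 1.0, 0.5])
txt_evidence = np.array([3.0, 0.5, 0.2])

b_img, u_img = evidence_to_opinion(img_evidence)
b_txt, u_txt = evidence_to_opinion(txt_evidence)
b_fused, u_fused = ds_combine(b_img, u_img, b_txt, u_txt)

print("fused belief:", np.round(b_fused, 3), "fused uncertainty:", round(u_fused, 3))
```

Under this reading, the fused uncertainty offers a scalar measure of cross-modal disagreement, which is one plausible way the paradigm described above could quantify the modality gap; the paper and linked repository should be consulted for the actual formulation.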