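Computer science
Phrase
Grounding
Language model
Robustness (evolution)
Security token
Closed captioning
Machine learning
Usability
Artificial intelligence
Natural language processing
Image (mathematics)
Human-computer interaction
Quantum mechanics
Chemistry
Gene
Computer security
Biochemistry
Physics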
Authors
Ke Zou, Yang Bai, Bo Liu, Yidi Chen, Zhihao Chen, Yang Zhou, Xuedong Yuan, Meng Wang, Xiaojing Shen, Xiaochun Cao, Yih Chung Tham, Huazhu Fu
Identifier
DOI:10.1109/tpami.2025.3596878
Abstract
Medical phrase grounding is crucial for identifying relevant regions in medical images based on phrase queries, facilitating accurate image analysis and diagnosis. However, current methods rely on manual extraction of key phrases from medical reports, which reduces efficiency and increases the workload for clinicians. Additionally, the lack of model confidence estimation limits clinical trust and usability. In this paper, we introduce a novel task, Medical Report Grounding (MRG), which aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. To address this challenge, we propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases by embedding a unique token, <BOX>, into the vocabulary to enhance detection capabilities. A vision encoder-decoder processes the embedded token and input image to generate grounding boxes. Critically, uMedGround incorporates an uncertainty-aware prediction model, significantly improving the robustness and reliability of grounding predictions. Experimental results demonstrate that uMedGround outperforms state-of-the-art medical phrase grounding methods and fine-tuned large visual-language models, validating its effectiveness and reliability. This study represents the first exploration of the MRG task. Additionally, we demonstrate the applicability of uMedGround in medical visual question answering and class-based localization tasks, where it highlights visual evidence aligned with key diagnostic phrases, supporting clinicians in interpreting various types of textual inputs, including free-text reports, visual question answering queries, and class labels.
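The idea of adding a dedicated <BOX> token to the vocabulary can be illustrated with a toy example. The sketch below is not the uMedGround implementation: it uses a hypothetical whitespace tokenizer, a made-up report sentence, and hypothetical helper names (`build_vocab`, `encode`, `box_positions`) purely to show how a reserved token can mark the positions in generated text where a grounding box is expected. In a full system the model's embedding at each such position would presumably be handed to the box decoder; that wiring is an assumption here and is not shown.

```python
# Toy illustration only; all names and the sample sentence are hypothetical.
BOX_TOKEN = "<BOX>"  # special token named in the abstract


def build_vocab(texts, special_tokens=(BOX_TOKEN,)):
    """Assign integer ids to tokens, reserving the lowest ids for special tokens."""
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for text in texts:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert whitespace-separated text into a list of token ids."""
    return [vocab[tok] for tok in text.split()]


def box_positions(token_ids, vocab):
    """Return the indices at which the <BOX> token occurs."""
    box_id = vocab[BOX_TOKEN]
    return [i for i, t in enumerate(token_ids) if t == box_id]


# Hypothetical generated output: each diagnostic phrase is followed by <BOX>.
report = "small left pleural effusion <BOX> and cardiomegaly <BOX>"
vocab = build_vocab([report])
ids = encode(report, vocab)
positions = box_positions(ids, vocab)
print(positions)
```

Each returned index marks a point in the sequence where one grounding box would be requested, so the number of <BOX> tokens equals the number of boxes to predict.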