Computer science
Closed captioning
Consistency (knowledge base)
Artificial intelligence
Sample (material)
Task (project management)
Graph
Natural language processing
Machine learning
Pattern recognition (psychology)
Theoretical computer science
Image (mathematics)
Chromatography
Economics
Chemistry
Management
Authors
Zhanyu Wang,Lei Wang,Xiu Li,Luping Zhou
Identifier
DOI:10.1109/tpami.2025.3562866
Abstract
Radiographic images are similar to each other, making it challenging for diagnostic captioning to narrate fine-grained visual differences of clinical importance. In this paper, we propose a self-boosting framework integrating two novel strategies to learn tightly correlated image and text features for diagnostic captioning. The first strategy explicitly aligns image and text features by training an auxiliary task of image-text matching (ITM) jointly with the main task of report generation (RG) as two branches of a network model. The ITM branch explicitly learns image-text alignment and provides highly correlated visual and textual features for the RG branch to generate high-quality reports. The high-quality reports generated by the RG branch are, in turn, utilized as additional, harder negative samples that push the ITM branch towards better image-text alignment. The two branches thus improve each other progressively, so that the whole model is self-boosted without requiring external resources. The second strategy aligns the image-sample space and the report-sample space to achieve consistent image and text feature embeddings. To this end, the sample graph of the embedded ground-truth reports is built and used as the target for training the sample graph of the embedded images, so that the fine discrepancies among the ground-truth reports can be captured by the learned visual feature embeddings. Our proposed framework demonstrates its superiority on two medical report generation benchmarks, including MIMIC-CXR, the largest such dataset.
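The abstract describes two training objectives: an ITM branch that treats RG-generated reports as hard negatives, and a sample-graph alignment between image and report embedding spaces. The following is a minimal sketch of how such losses could look, assuming PyTorch; it is not the authors' implementation, and names such as `itm_head`, the embedding dimension, and the use of cosine-similarity graphs with an MSE objective are illustrative assumptions.

```python
# Minimal, illustrative sketch of the two strategies described in the abstract.
# Assumes PyTorch; all module names and shapes are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def itm_loss_with_hard_negatives(img_emb, gt_txt_emb, gen_txt_emb, itm_head):
    """Image-text matching: ground-truth reports are positives, while reports
    generated by the RG branch serve as additional hard negatives for the same images."""
    pos_logits = itm_head(torch.cat([img_emb, gt_txt_emb], dim=-1))   # (B, 1)
    neg_logits = itm_head(torch.cat([img_emb, gen_txt_emb], dim=-1))  # (B, 1)
    logits = torch.cat([pos_logits, neg_logits], dim=0)
    labels = torch.cat([torch.ones_like(pos_logits), torch.zeros_like(neg_logits)], dim=0)
    return F.binary_cross_entropy_with_logits(logits, labels)

def sample_graph_alignment_loss(img_emb, txt_emb):
    """Sample-graph alignment: pairwise similarities among embedded ground-truth
    reports form the target graph; the image-sample graph is trained to match it."""
    img_graph = F.normalize(img_emb, dim=-1) @ F.normalize(img_emb, dim=-1).t()  # (B, B)
    txt_graph = F.normalize(txt_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()  # (B, B)
    return F.mse_loss(img_graph, txt_graph.detach())  # report graph acts as a fixed target

# Hypothetical usage inside one training step (random tensors stand in for encoder outputs):
B, D = 8, 512
img_emb = torch.randn(B, D, requires_grad=True)   # visual features from the image encoder
gt_txt_emb = torch.randn(B, D)                    # embeddings of ground-truth reports
gen_txt_emb = torch.randn(B, D)                   # embeddings of reports from the RG branch
itm_head = nn.Linear(2 * D, 1)

loss = itm_loss_with_hard_negatives(img_emb, gt_txt_emb, gen_txt_emb, itm_head) \
       + sample_graph_alignment_loss(img_emb, gt_txt_emb)
loss.backward()
```

In this sketch the report-sample graph is detached so it serves purely as a target, mirroring the abstract's statement that the ground-truth report space guides the visual embedding space rather than the reverse.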