Computer science
Modal verb
Artificial intelligence
Standardization
Data mining
Pattern recognition (psychology)
Information retrieval
Chemistry
Polymer chemistry
Operating system
Authors
Yiming Cao, Lizhen Cui, Lei Zhang, Fuqiang Yu, Ziheng Cheng, Zhen Li, Yonghui Xu, Miao Chen
Identifiers
DOI: 10.1007/978-3-031-30675-4_30
Abstract
Automatic medical image report generation has attracted extensive research interest in medical data mining, which effectively alleviates doctors' workload and improves report standardization. The mainstream approaches adopt the Transformer-based Encoder-Decoder architecture to align the visual and linguistic features. However, they rarely consider the importance of cross-modal interaction (e.g., the interaction between images and reports) and do not adequately explore the relations between multi-modal medical data, leading to inaccurate and incoherent reports. To address these issues, we propose a Cross-modal Memory Transformer model (CMT) to process multi-modal medical data (i.e., medical images, medical terminology knowledge, and medical report text), and leverage the relations between multi-modal medical data to generate accurate medical reports. To explore the interaction of cross-modal information, we design a novel cross-modal feature memory decoder to memorize the relations between image and report features. Furthermore, the multi-modal feature fusion module in CMT exploits the multi-modal medical data to adaptively measure the contribution of multi-modal features for word generation, which improves the accuracy of generated reports. Extensive experiments on three real datasets demonstrate that our proposed CMT outperforms benchmark methods on automatic metrics.
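The abstract's multi-modal feature fusion module "adaptively measures the contribution of multi-modal features for word generation". The paper does not give implementation details here, but the core idea — gating each modality's features with softmax-normalized, learned scores before combining them — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation; the names `fuse_features` and `gate_weights` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_features(modal_feats, gate_weights):
    """Adaptively fuse per-modality feature vectors (illustrative sketch).

    modal_feats: one feature vector per modality, all of equal length
                 (e.g., image, terminology, and report-text features).
    gate_weights: one score vector per modality; its dot product with the
                  corresponding features gives that modality's gate score.
    Returns the softmax-weighted combination of the modality features.
    """
    scores = [sum(f * w for f, w in zip(feat, gw))
              for feat, gw in zip(modal_feats, gate_weights)]
    alphas = softmax(scores)  # per-modality contributions, sum to 1
    dim = len(modal_feats[0])
    return [sum(a * feat[i] for a, feat in zip(alphas, modal_feats))
            for i in range(dim)]
```

In a trained model the gate scores would be produced by learned projections at every decoding step, so the mix of image, terminology, and text evidence can shift from word to word; here they are fixed vectors purely for illustration.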