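Embedding
Image (mathematics)
Computer science
Security token
Space (punctuation)
Computer vision
Artificial intelligence
Medical imaging
Image processing
Computer security
Operating system
Authors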
Yan Yang,Jun Yu,Zhenqi Fu,Ke Zhang,Ting Yu,Xianyun Wang,Hanliang Jiang,Junhui Lv,Qingming Huang,Weidong Han
Identifier
DOI:10.1109/tmi.2024.3412402
Abstract
Medical image reporting, which aims to automatically generate diagnostic reports from medical images, has garnered growing research attention. In this task, learning cross-modal alignment between images and reports is crucial. However, the exposure bias problem in autoregressive text generation poses a notable challenge, as the model is optimized with a word-level loss function under the teacher-forcing strategy. To this end, we propose a novel Token-Mixer framework that learns to bind image and text in one embedding space for medical image reporting. Concretely, Token-Mixer enhances cross-modal alignment by matching image-to-text generation with text-to-text generation, which suffers less from exposure bias. The framework contains an image encoder, a text encoder, and a text decoder. In training, images and paired reports are first encoded into image tokens and text tokens, and these tokens are randomly mixed to form mixed tokens. Then, the text decoder accepts image tokens, text tokens, or mixed tokens as prompt tokens and conducts text generation for network optimization. Furthermore, we introduce a tailored text decoder and an alternative training strategy that integrate well with our Token-Mixer framework. Extensive experiments on three publicly available datasets demonstrate that Token-Mixer successfully enhances image-text alignment and thereby attains state-of-the-art performance. The related code is available at https://github.com/yangyan22/Token-Mixer.
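The token-mixing step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the tokens are mixed, so everything here is an assumption made for illustration: the position-wise swap, the equal token counts, the `mix_prob` parameter, and the function names `mix_tokens` and `choose_prompt` are hypothetical and are not taken from the authors' repository.

```python
import random

def mix_tokens(image_tokens, text_tokens, mix_prob=0.5, rng=None):
    """Position-wise random mix of two equal-length token sequences.

    ASSUMPTION: the paper's actual mixing rule is not given in the abstract;
    this sketch picks, at each position, the text token with probability
    `mix_prob` and the image token otherwise.
    """
    if len(image_tokens) != len(text_tokens):
        raise ValueError("this sketch assumes equal-length token sequences")
    if rng is None:
        rng = random.Random()
    return [
        txt if rng.random() < mix_prob else img
        for img, txt in zip(image_tokens, text_tokens)
    ]

def choose_prompt(image_tokens, text_tokens, rng=None):
    """Pick image, text, or mixed tokens as the decoder prompt.

    ASSUMPTION: a uniform choice among the three prompt types; the abstract
    only states that the decoder accepts any of the three.
    """
    if rng is None:
        rng = random.Random()
    mode = rng.choice(["image", "text", "mixed"])
    if mode == "image":
        return mode, list(image_tokens)
    if mode == "text":
        return mode, list(text_tokens)
    return mode, mix_tokens(image_tokens, text_tokens, rng=rng)

# Toy usage: each "token" is a short embedding vector (plain lists here).
image_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
text_tokens = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
mode, prompt = choose_prompt(image_tokens, text_tokens, rng=random.Random(0))
print(mode, len(prompt))
```

In a real training loop the chosen prompt tokens would be passed to the text decoder, which is then optimized on report generation. That part is omitted here because it depends on model details the abstract does not give.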