TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis

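Computer science; Pattern; Transformer; Artificial intelligence; Modal verb; Feature extraction; Pattern recognition (psychology); Data mining; Computer vision; Voltage; Engineering; Social science; Electrical engineering; Sociology; Chemistry; Polymer chemistry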

Authors

Yilan Zhang,Fengying Xie,Jianqi Chen,Jie Liu

Source

Journal: Computers in Biology and Medicine [Elsevier]
Date: 2023-05-01 Volume/issue: 157: 106712-106712 Citations: 3

Links

arxiv.org nih.gov doi.org

Identifier

DOI: 10.1016/j.compbiomed.2023.106712

Abstract

Multi-modal skin lesion diagnosis (MSLD) has achieved remarkable success with modern computer-aided diagnosis (CAD) technology based on deep convolutions. However, information aggregation across modalities in MSLD remains challenging because of severely unaligned spatial resolution (e.g., dermoscopic image and clinical image) and heterogeneous data (e.g., dermoscopic image and patients’ meta-data). Limited by intrinsic local attention, most recent MSLD pipelines using pure convolutions struggle to capture representative features in shallow layers; fusion across modalities is therefore usually done at the end of the pipeline, even at the last layer, leading to insufficient information aggregation. To tackle this issue, we introduce a pure transformer-based method, which we refer to as “Throughout Fusion Transformer (TFormer)”, for sufficient information integration in MSLD. Unlike existing convolution-based approaches, the proposed network uses a transformer as the feature extraction backbone, yielding more representative shallow features. We then carefully design a stack of dual-branch hierarchical multi-modal transformer (HMT) blocks to fuse information across different image modalities stage by stage. With the aggregated information of the image modalities, a multi-modal transformer post-fusion (MTP) block is designed to integrate features across image and non-image data. This strategy, in which the image modalities are fused first and the heterogeneous data afterwards, lets us divide and conquer the two major challenges while ensuring that inter-modality dynamics are effectively modeled. Experiments on the public Derm7pt dataset validate the superiority of the proposed method. TFormer achieves an average accuracy of 77.99% and a diagnostic accuracy of 80.03%, outperforming other state-of-the-art methods. Ablation experiments also suggest the effectiveness of our designs.
The code is publicly available at https://github.com/zylbuaa/TFormer.git.
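
The two-step fusion order described in the abstract (image modalities first, stage by stage, then non-image data) can be illustrated with a rough NumPy sketch. This is not the authors' implementation; their code is in the repository linked above. Every function name, the plain unparameterized cross-attention, the residual connections, the token counts, the embedding size, and the choice of three stages are assumptions made purely for illustration. The sketch also omits the learned projections and the hierarchical backbone that a real model would have.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens):
    # Scaled dot-product attention without learned weights (an
    # illustrative simplification): queries come from one modality,
    # keys and values from the other.
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)
    return softmax(scores) @ context_tokens

def dual_branch_fusion(derm, clin):
    # One hypothetical fusion stage: each image branch attends to the
    # other and keeps its own tokens through a residual connection.
    derm_out = derm + cross_attention(derm, clin)
    clin_out = clin + cross_attention(clin, derm)
    return derm_out, clin_out

def post_fusion(derm, clin, meta):
    # Hypothetical post-fusion step: the metadata token queries the
    # pooled features of the two already-fused image branches.
    image_tokens = np.stack([derm.mean(axis=0), clin.mean(axis=0)])  # (2, d)
    fused = meta + cross_attention(meta, image_tokens)               # (1, d)
    return fused[0]

rng = np.random.default_rng(0)
dim = 8
derm = rng.normal(size=(16, dim))  # 16 dermoscopic-image tokens (assumed)
clin = rng.normal(size=(9, dim))   # 9 clinical-image tokens: counts may differ
meta = rng.normal(size=(1, dim))   # one embedded metadata token (assumed)

for _ in range(3):  # number of stages chosen arbitrarily
    derm, clin = dual_branch_fusion(derm, clin)

feature = post_fusion(derm, clin, meta)
print(feature.shape)
```

Because attention operates on token sets rather than pixel grids, the two image branches can exchange information even when their token counts differ, which is one way to read the abstract's point about unaligned spatial resolution.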
