Computer science
Computer vision
Fusion
Transformer
Artificial intelligence
Infrared
Engineering
Physics
Optics
Electrical engineering
Voltage
Philosophy
Linguistics
Authors
Zhishe Wang, Fan Yang, Jing Sun, Jiawei Xu, Fengbao Yang, Xiaomei Yan
Identifier
DOI: 10.1016/j.knosys.2024.111949
Abstract
Existing deep learning-based methods often follow either image-level or feature-level fusion frameworks that extract features uniformly or separately, ignoring specialized interactive information learning, which may limit fusion performance. To tackle this challenge, we devise a powerful fusion baseline via adaptive interactive Transformer learning, namely AITFuse. Unlike previous methods, our network alternately incorporates local and global relationships through the collaborative learning of CNN and Transformer. In particular, we propose a cascaded token-wise and channel-wise Vision Transformer architecture with different attention mechanisms to model long-range contexts and allow feature communication across different tokens and independent channels in an interactive manner. On this basis, the modal-specific feature rectification module employs a self-attention operation to revise distinctive features within the same domain for efficient encoding. Meanwhile, the cross-modal feature integration module constructs a cross-attention mechanism to fuse complementary characteristics from different domains for multi-level decoding. In addition, we discard the learnable position embedding so that our fusion model can handle images of arbitrary size without splitting operations. Extensive experiments on mainstream datasets and downstream tasks demonstrate the rationality and superiority of AITFuse. The code will be available at https://github.com/Zhishe-Wang/AITFuse.
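As a rough illustration of the mechanisms described in the abstract, the PyTorch-style sketch below shows (i) token-wise self-attention over spatial tokens, (ii) channel-wise attention computed through a C x C affinity map so it is independent of spatial resolution (and hence of image size), and (iii) cross-modal cross-attention between infrared and visible features. This is a minimal sketch under assumed tensor shapes; the module names TokenWiseAttention, ChannelWiseAttention, and CrossModalFusion are hypothetical and do not correspond to the released AITFuse implementation.

import torch
import torch.nn as nn


class TokenWiseAttention(nn.Module):
    """Self-attention over spatial tokens: each of the N tokens attends to all others."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, N, C)
        y, _ = self.attn(x, x, x)
        return self.norm(x + y)


class ChannelWiseAttention(nn.Module):
    """Attention across channels via a C x C affinity map, independent of spatial size."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # each (B, N, C)
        scale = q.shape[1] ** -0.5               # normalize by token count
        attn = (q.transpose(1, 2) @ k) * scale   # (B, C, C) channel affinity
        attn = attn.softmax(dim=-1)
        y = (attn @ v.transpose(1, 2)).transpose(1, 2)   # back to (B, N, C)
        return self.norm(x + self.proj(y))


class CrossModalFusion(nn.Module):
    """Cross-attention fusion: infrared tokens query visible tokens and vice versa."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.ir_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_from_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.merge = nn.Linear(dim * 2, dim)

    def forward(self, ir, vis):                  # both (B, N, C)
        ir_enh, _ = self.ir_from_vis(ir, vis, vis)   # IR queries visible features
        vis_enh, _ = self.vis_from_ir(vis, ir, ir)   # visible queries IR features
        return self.merge(torch.cat([ir + ir_enh, vis + vis_enh], dim=-1))


if __name__ == "__main__":
    B, H, W, C = 1, 32, 48, 64                   # arbitrary spatial size; no position embedding used
    ir, vis = torch.randn(B, H * W, C), torch.randn(B, H * W, C)
    tok, chn, fuse = TokenWiseAttention(C), ChannelWiseAttention(C), CrossModalFusion(C)
    ir_feat, vis_feat = chn(tok(ir)), chn(tok(vis))  # cascaded token-wise then channel-wise attention
    print(fuse(ir_feat, vis_feat).shape)         # torch.Size([1, 1536, 64])

Because the channel-wise affinity is C x C and the token-wise attention carries no learned position embedding, the sketch accepts any H and W without splitting the image into fixed-size patches, mirroring the arbitrary-size property claimed in the abstract.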