Computer science
Computer vision
Artificial intelligence
Image processing
Image (mathematics)
Authors
Chunlei Meng, Wei Lin, Bowen Liu, Hongda Zhang, Zhongxue Gan, Chun Ouyang
Identifier
DOI:10.1109/jbhi.2024.3525054
Abstract
Vision transformers have achieved remarkable success in image classification, and dual-branch vision transformers generate richer features by exploiting feature fusion. Inspired by this, a dual-branch vision transformer with a Real-Time Share feature encoder is proposed for retinal image classification tasks. The approach processes image patches of two sizes (base and large) through two independent branches and performs multi-stage Real-Time feature fusion via the Real-Time Share feature encoder. This encoder lets the branches complement each other's features at every encoding stage, facilitating finer feature learning and enriching the self-attention information passed to subsequent stages, which significantly boosts feature representation and classification performance. Additionally, a straightforward and effective fusion method, L-Times Attention Fusion, is proposed: vector concatenation for the Real-Time Share feature in the first L-1 encoding stages and element-wise addition for overall feature fusion at the L-th stage, yielding more efficient feature integration. The method was validated on a retinal image dataset. Results show that the approach outperforms the recent Cross-ViT by 5.61% in average Top-1 accuracy, with lower FLOPs and fewer model parameters and without relying on pre-trained weights, highlighting stronger self-learned feature capability and reduced dependence on extensive pre-training data.
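The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the described scheme, not the authors' code. The patch sizes (16 and 32), embedding width, depth L=4, number of classes, and the exact share mechanism (exchanging CLS tokens by concatenation in the first L-1 stages, element-wise addition of the two CLS tokens at the L-th stage) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into patches, embed them, and prepend a CLS token."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)       # (B, N, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos      # CLS token at index 0


class DualBranchRTS(nn.Module):
    """Hypothetical dual-branch encoder with Real-Time Share fusion.

    Assumption: in stages 1..L-1 each branch receives the other branch's
    CLS token by concatenation ('vector concatenation'), attends over it,
    then drops it; at stage L the two CLS tokens are fused by element-wise
    addition before classification.
    """

    def __init__(self, dim=192, depth=4, heads=3, num_classes=5):
        super().__init__()
        self.embed_base = PatchEmbed(patch_size=16, dim=dim)   # "base" patches
        self.embed_large = PatchEmbed(patch_size=32, dim=dim)  # "large" patches

        def block():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=dim * 4,
                batch_first=True, norm_first=True)

        self.blocks_base = nn.ModuleList([block() for _ in range(depth)])
        self.blocks_large = nn.ModuleList([block() for _ in range(depth)])
        self.depth = depth
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tb, tl = self.embed_base(x), self.embed_large(x)
        for i in range(self.depth - 1):                   # first L-1 stages
            cls_b, cls_l = tb[:, :1], tl[:, :1]           # current CLS tokens
            tb = self.blocks_base[i](torch.cat([tb, cls_l], dim=1))[:, :-1]
            tl = self.blocks_large[i](torch.cat([tl, cls_b], dim=1))[:, :-1]
        tb = self.blocks_base[-1](tb)                     # L-th (final) stage
        tl = self.blocks_large[-1](tl)
        fused = tb[:, 0] + tl[:, 0]                       # element-wise addition
        return self.head(fused)


model = DualBranchRTS()
print(model(torch.randn(2, 3, 224, 224)).shape)           # torch.Size([2, 5])
```

Sharing only one token per branch keeps the extra fusion cost small relative to full-sequence exchange, which would be one plausible way to stay below Cross-ViT's FLOPs as the abstract claims; the authors' actual share mechanism may differ.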