Computer science
Image retrieval
Artificial intelligence
Transformer
Pattern recognition
Training
Automatic image annotation
Computer vision
Information retrieval
Image
Authors
Xue Li, Jiong Yu, Shaochen Jiang, Hongchun Lu, Ziyang Li
Identifier
DOI: 10.1109/tmm.2023.3304021
Abstract
The recently developed vision transformer (ViT) has achieved promising results on image retrieval compared to convolutional neural networks. However, most ViT-based image retrieval methods use the original ViT model to extract only global features, ignoring the importance of local features for retrieval. In this work, we propose MSViT, a vision transformer-based multiscale feature fusion image retrieval method that fuses global and local features. The central challenge is to strengthen the feature representation ability of the transformer model and thereby improve the performance of the retrieval model. First, we propose a transformer-based two-branch network that obtains features at different scales by processing image patches of different granularities. Second, we present a multiscale feature fusion strategy that efficiently and effectively fuses the different-sized features from the two branches. Finally, to make fuller use of label information in supervising network training, we optimize the construction rules for the triplet data. Comparisons with ten CNN-based and six transformer-based image retrieval methods on four publicly available image datasets show that our method outperforms the state-of-the-art methods, and ablation experiments show that the multiscale feature fusion strategy and the improved triplet loss function both contribute to MSViT's performance.
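As a rough illustration of the two-branch multiscale design the abstract describes, the PyTorch sketch below pairs a coarse-patch branch (few tokens, global context) with a fine-patch branch (many tokens, local detail) and fuses their pooled outputs into one retrieval embedding. The paper's exact architecture is not given here, so every module name, patch size, depth, and the concat-plus-linear fusion head are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-branch ViT with multiscale fusion (assumed design).
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into patches and linearly embed them."""
    def __init__(self, patch_size=16, dim=384):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim)


class TwoBranchViT(nn.Module):
    def __init__(self, dim=384, depth=4, heads=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Coarse branch: large patches -> few tokens -> global features.
        self.coarse_embed = PatchEmbed(patch_size=32, dim=dim)
        self.coarse = nn.TransformerEncoder(layer, depth)
        # Fine branch: small patches -> many tokens -> local features.
        self.fine_embed = PatchEmbed(patch_size=16, dim=dim)
        self.fine = nn.TransformerEncoder(layer, depth)
        # Hypothetical fusion head: concatenate the two scales, project back.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):
        g = self.coarse(self.coarse_embed(x)).mean(dim=1)  # global descriptor
        l = self.fine(self.fine_embed(x)).mean(dim=1)      # local descriptor
        z = self.fuse(torch.cat([g, l], dim=-1))
        return nn.functional.normalize(z, dim=-1)          # retrieval embedding


emb = TwoBranchViT()(torch.randn(2, 3, 224, 224))  # (2, 384) embeddings
```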
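The abstract also says the triplet construction rules are optimized to better exploit label supervision, but does not spell those rules out. For context only, this sketch shows a standard in-batch hard triplet construction with a margin loss; the paper's improved rules would replace the simple hardest-positive/hardest-negative selection used here.

```python
# Baseline in-batch triplet loss (not the authors' improved construction).
import torch
import torch.nn.functional as F


def triplet_loss(emb, labels, margin=0.3):
    """Hardest positive and hardest negative per anchor within a batch."""
    dist = torch.cdist(emb, emb)                       # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # label-match mask
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    # Hardest positive: farthest same-label sample (excluding self).
    pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    # Hardest negative: closest different-label sample.
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(pos - neg + margin).mean()


labels = torch.tensor([0, 0, 1, 1])
loss = triplet_loss(F.normalize(torch.randn(4, 384), dim=-1), labels)
```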