Computer science
Image retrieval
Inference
Software deployment
Key (lock)
Feature extraction
Encoding (memory)
Artificial intelligence
Feature (linguistics)
Information retrieval
Image (mathematics)
Computer vision
Software engineering
Computer security
Linguistics
Philosophy
Abstract
In recent years, the CLIP model has achieved remarkable success in image-text retrieval tasks through contrastive learning. However, CLIP still exhibits certain limitations when handling complex backgrounds and small objects. To address these challenges, this paper proposes two key innovations. First, during inference, a YOLOv10 model is employed to detect and crop small objects and essential background regions in the image, enhancing CLIP's ability to comprehend complex scenes. Second, the Next-ViT network is used as the backbone of the image encoder; its more efficient multi-scale feature extraction improves retrieval accuracy on small objects while making the model easier to deploy in industrial settings. Experimental results demonstrate that these two innovations significantly enhance CLIP's performance on image-text retrieval and strike a balance between accuracy and efficiency across various vision tasks.
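The inference-time idea described above (detect and crop salient regions, encode the crops alongside the full image, then score candidate texts by embedding similarity) can be sketched as follows. This is a minimal, library-free illustration, not the paper's implementation: the real system uses YOLOv10 for detection and a Next-ViT-backed CLIP encoder, whereas here `encode` is a stand-in toy encoder and mean-pooling of the crop embeddings is an assumed fusion rule.

```python
import numpy as np

def crop_regions(image, boxes):
    """Crop detected regions (in the paper, boxes come from YOLOv10).
    image: H x W x C array; boxes: list of (x1, y1, x2, y2) pixel coords."""
    return [image[y1:y2, x1:x2] for (x1, y1, x2, y2) in boxes]

def encode(image, dim=4):
    """Toy stand-in for the image encoder (the paper uses Next-ViT in CLIP):
    per-channel means tiled to a fixed-size, L2-normalized embedding."""
    v = image.reshape(-1, image.shape[-1]).mean(axis=0)
    v = np.resize(v, dim)
    return v / (np.linalg.norm(v) + 1e-8)

def retrieval_scores(image, boxes, text_embs):
    """Fuse the global embedding with crop embeddings (mean pooling is an
    illustrative choice), then score texts by cosine similarity.
    text_embs: N x dim array of L2-normalized text embeddings."""
    embs = [encode(image)] + [encode(c) for c in crop_regions(image, boxes)]
    fused = np.mean(embs, axis=0)
    fused /= np.linalg.norm(fused) + 1e-8
    return text_embs @ fused  # one cosine-similarity score per text
```

In a real deployment the detector would also filter crops by confidence and class, and the crop and global similarities might be weighted rather than averaged; those details are outside what the abstract specifies.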