Authors
Qiang Li, Feng Zhao, Linlin Zhao, Liu Mao-kai, Yubo Wang, Shuo Zhang, Yuanyuan Guo, Shibo Wang, Weigang Wang
Abstract
Multimodal image-text retrieval aims to bridge the gap between visual and textual data, enabling efficient and accurate matching between images and text. Since manually labeled data are usually expensive, many researchers have turned to low-quality multimodal data collected from the web in bulk, which challenges a model's generalization performance and prediction accuracy. To address this issue, we construct a multimodal image-text retrieval system based on the fusion of pre-trained models. First, we enhance the diversity of the original data using the MixGen augmentation algorithm to improve the model's generalization performance. Next, through comparative experiments among three candidate models, we select Chinese-CLIP as the most suitable foundation model. Finally, we construct a comprehensive ensemble of three Chinese-CLIP base models tailored to the characteristics of each task: a prediction-based fusion model for the text-to-image task and a feature-based fusion model for the image-to-text task. Extensive experiments show that our model outperforms state-of-the-art single foundation models in generalization, especially on low-quality image-text pairs and small datasets in the Chinese context.
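The three ingredients named in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes MixGen's published recipe (linearly interpolate two images, concatenate their captions), that prediction-based fusion averages the per-model image-text similarity matrices, and that feature-based fusion averages L2-normalized embeddings before scoring. All function names and array shapes here are hypothetical.

```python
import numpy as np

def mixgen(img_a, img_b, txt_a, txt_b, lam=0.5):
    """MixGen-style augmentation (sketch): interpolate the two images
    with weight lam and concatenate the two captions."""
    mixed_img = lam * img_a + (1.0 - lam) * img_b
    mixed_txt = txt_a + " " + txt_b
    return mixed_img, mixed_txt

def prediction_fusion(score_mats):
    """Prediction-based fusion (assumed): average the image-text
    similarity matrices produced by each base model."""
    return np.mean(score_mats, axis=0)

def feature_fusion(img_feats, txt_feats):
    """Feature-based fusion (assumed): average L2-normalized embeddings
    across base models, renormalize, then score with a dot product."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    img = l2norm(np.mean([l2norm(f) for f in img_feats], axis=0))
    txt = l2norm(np.mean([l2norm(f) for f in txt_feats], axis=0))
    return img @ txt.T  # (num_images, num_texts) similarity matrix

# Toy usage: two 2x2 "images" and two captions
mixed_img, mixed_txt = mixgen(np.ones((2, 2)), np.zeros((2, 2)),
                              "a cat", "a dog", lam=0.5)
fused_scores = prediction_fusion([np.eye(2), 1.0 - np.eye(2)])
```

In the ensemble described above, `prediction_fusion` would be applied to the retrieval scores of the three Chinese-CLIP models for text-to-image queries, and `feature_fusion` to their embeddings for image-to-text queries; the choice of which fusion level suits which direction is task-specific, per the abstract.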