RAPID: Zero-Shot Domain Adaptation for Code Search with Pre-Trained Models

Authors
Guodong Fan, Shizhan Chen, Cuiyun Gao, Jianmao Xiao, Tao Zhang, Zhiyong Feng
Source
Journal: ACM Transactions on Software Engineering and Methodology [Association for Computing Machinery]
Volume/Issue: 33 (5): 1-35
Identifier
DOI: 10.1145/3641542
Abstract

Code search, the task of identifying the code snippets most relevant to a given natural language query, plays a crucial role in software maintenance. However, current approaches rely heavily on labeled data for training, so their performance drops in cross-domain scenarios, including domain- or project-specific settings. This decline can be attributed to their limited ability to capture the semantics of such scenarios. To tackle this problem, we propose RAPID, a zeRo-shot domAin adaPtation framework with pre-traIned moDels for code search. The framework first generates synthetic data by pseudo labeling and then trains CodeBERT on sampled synthetic data. To reduce the influence of noisy synthetic data and enhance model performance, we propose a mixture sampling strategy that obtains hard negative samples during training. Specifically, the mixture sampling strategy considers both relevancy and diversity to select data that are hard for the models to distinguish. To validate the effectiveness of our approach in zero-shot settings, we conduct extensive experiments and find that RAPID outperforms the CoCoSoDa and UniXcoder models by an average of 15.7% and 10%, respectively, as measured by the MRR metric. When trained on full data, our approach yields an average improvement of 7.5% under the MRR metric using CodeBERT. We observe that as the model's performance on zero-shot tasks improves, the impact of hard negatives diminishes. Our observations also indicate that fine-tuning CodeT5 to generate pseudo labels can enhance the performance of the code search model, and that using only 100-shot samples can yield results comparable to the supervised baseline. Furthermore, we evaluate the effectiveness of RAPID on real-world code search tasks in three GitHub projects through both human and automated assessments. Our findings reveal that RAPID exhibits superior performance, e.g., an average improvement of 18% under the MRR metric over the top-performing model.
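The abstract does not spell out how the mixture sampling strategy combines relevancy and diversity. As an illustrative sketch only (not the paper's exact algorithm), one common way to realize such a trade-off is greedy MMR-style selection over embedding similarities: a candidate is a harder negative the more similar it is to the query (relevancy), while diversity penalizes candidates close to negatives already selected. The function name, the `alpha` trade-off weight, and the cosine-similarity scoring below are assumptions for illustration.

```python
import numpy as np

def mixture_sample_hard_negatives(query_emb, cand_embs, k=2, alpha=0.5):
    """Greedily select k hard negatives (illustrative sketch).

    Score = alpha * relevancy - (1 - alpha) * redundancy, where
    relevancy is cosine similarity to the query (more similar ->
    harder negative) and redundancy is the max cosine similarity
    to negatives already selected (penalizes near-duplicates).
    """
    # L2-normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    rel = c @ q  # relevancy of each candidate to the query

    selected = []
    remaining = list(range(len(cand_embs)))
    while len(selected) < k and remaining:
        if not selected:
            # First pick: purely the most query-similar candidate.
            scores = {i: rel[i] for i in remaining}
        else:
            sel = c[selected]
            scores = {
                i: alpha * rel[i] - (1 - alpha) * np.max(sel @ c[i])
                for i in remaining
            }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Under this sketch, the first negative chosen is always the candidate most similar to the query; subsequent picks trade relevancy against redundancy with already-chosen negatives via `alpha`.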
