CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models

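Benchmark (surveying), Computer science, Natural language processing, Artificial intelligence, Chinese, China, Information retrieval, Linguistics, History, Geography, Cartography, Philosophy, Archaeology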

Authors

Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, Enhong Chen

Source

Journal: ACM Transactions on Information Systems [Association for Computing Machinery]
Date: 2024-10-19 Citations: 9

Links

acm.org arxiv.org arxiv.org doi.org

Identifier

DOI: 10.1145/3701228

Abstract

Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources. This method addresses common LLM limitations, including outdated information and the tendency to produce inaccurate “hallucinated” content. However, evaluating RAG systems is a challenge. Most benchmarks focus primarily on question answering applications, neglecting other potential scenarios where RAG could be beneficial. Accordingly, in the experiments, these benchmarks often assess only the LLM components of the RAG pipeline or the retriever in knowledge-intensive scenarios, overlooking the impact of external knowledge base construction and the retrieval component on the entire RAG pipeline in non-knowledge-intensive scenarios. To address these issues, this paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios. Specifically, we refer to the CRUD actions that describe interactions between users and knowledge bases, and also categorize the range of RAG applications into four distinct types: Create, Read, Update, and Delete (CRUD). “Create” refers to scenarios requiring the generation of original, varied content. “Read” involves responding to intricate questions in knowledge-intensive situations. “Update” focuses on revising and rectifying inaccuracies or inconsistencies in pre-existing texts. “Delete” pertains to the task of summarizing extensive texts into more concise forms. For each of these CRUD categories, we have developed different datasets to evaluate the performance of RAG systems. We also analyze the effects of various components of the RAG system, such as the retriever, context length, knowledge base construction, and LLM. Finally, we provide useful insights for optimizing the RAG technology for different scenarios.
