Evaluating Large Language Models: A Comprehensive Survey

Compendium, Computer science, Risk analysis (engineering), Business, Geography, Archaeology, Operating system
Authors
Z. J. Guo, Renren Jin, Chuang Liu, Yufei Huang, Dongquan Shi, Supryadi, Lixin Yu, Yan Liu, Jiaxuan Li, Bin Xiong, Deyi Xiong
Source
Journal: Cornell University - arXiv
Identifier
DOI:10.48550/arxiv.2310.19736
Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to a comprehensive review of the evaluation methodologies and benchmarks for these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.
