Computer science
Correctness
Quality (philosophy)
Control (management)
Accountability
Reliability (semiconductor)
Service (business)
Risk analysis (engineering)
Quality of service
Knowledge management
Computer security
Generative model
Domain (mathematical analysis)
Artificial intelligence
Human intelligence
Access control
Process management
Ground truth
Public service
Benchmarking
Subject-matter expert
Language model
Reputation
Machine learning
Scheme (mathematics)
Action (physics)
Data science
Authors
Inbal Yahav,Anat Goldstein,Tomer Geva,Sagi Meir,Onn Shehory
Identifier
DOI:10.1287/isre.2023.0426
Abstract
As businesses increasingly rely on large language models (LLMs) for tasks such as customer service and information retrieval, ensuring the accuracy of their responses is a critical challenge. Traditional verification is costly, slow, and often requires scarce domain experts. We introduce the automated quality evaluation based on textual responses (AQER) framework, a novel, cost-effective method to assess the correctness of free-text answers from both LLMs and human workers without needing preexisting correct answers. AQER works by intelligently aggregating multiple responses to the same question, leveraging the wisdom of the crowd to create a reliable synthetic correct answer, followed by an iterative procedure that accounts for response quality cues. AQER obtains state-of-the-art performance compared with existing automated response evaluation baselines. For managers, AQER offers a scalable, data-driven method to (i) evaluate and select the best-performing LLMs for specific organizational needs and use cases, (ii) continuously monitor artificial intelligence (AI) performance to ensure reliability and accountability across different model versions, and (iii) manage the quality of crowd workers essential for high-quality AI training and validation. AQER thus offers a robust mechanism for improving model performance and mitigating the significant financial and reputational risks associated with deploying untrustworthy generative AI technologies.
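The abstract describes the general idea of aggregating multiple free-text answers into a synthetic reference answer and iteratively re-weighting responders by quality. The sketch below illustrates that general idea only; it is NOT the AQER algorithm from the paper. It uses a simple bag-of-words cosine similarity (an assumption for illustration) and alternates between building a weighted consensus answer and re-estimating each responder's weight from agreement with that consensus, in the spirit of classic wisdom-of-the-crowd aggregation.

```python
# Hypothetical illustration, not the paper's AQER method:
# iterative weighted aggregation of free-text answers to one question.
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def aggregate(responses: dict[str, str], iters: int = 5):
    """responses: responder id -> free-text answer.

    Returns (consensus bag-of-words, responder quality weights).
    """
    bags = {r: Counter(text.lower().split()) for r, text in responses.items()}
    weights = {r: 1.0 for r in responses}  # start with uniform trust
    for _ in range(iters):
        # Build the "synthetic correct answer" as a weighted token mixture.
        consensus: Counter = Counter()
        for r, bag in bags.items():
            for tok, n in bag.items():
                consensus[tok] += weights[r] * n
        # Re-estimate each responder's quality from agreement with consensus.
        weights = {r: cosine(bag, consensus) for r, bag in bags.items()}
    return consensus, weights
```

For example, given three responders who broadly agree and one outlier, the agreeing responders end up with higher weights, and the consensus favors their answer's tokens over the outlier's.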