Keywords
Guideline
Computer science
Benchmark
Scalability
Artificial intelligence
Quality
Machine learning
Domain
Data science
MEDLINE
Expert system
Subject-matter expert
Natural language processing
Applications of artificial intelligence
Software engineering
Data mining
Systematic review
Language model
Data quality
Clinical decision support system
Risk assessment
Information retrieval
Decision support system
Knowledge management
Authors
David Chen,Patrick Li,Ealia Khoshkish,Seungmin Lee,Tony Ning,Umair Tahir,Henry C Y Wong,M. Lee,Srinivas Raman
Identifier
DOI:10.1093/jamia/ocaf223
Abstract
Objectives: To develop AutoReporter, a large language model (LLM) system that automates evaluation of adherence to research reporting guidelines.
Materials and Methods: Eight prompt-engineering and retrieval strategies coupled with reasoning and general-purpose LLMs were benchmarked on the SPIRIT–CONSORT–TM corpus. The top-performing approach, AutoReporter, was validated on BenchReport, a novel benchmark dataset of expert-rated reporting guideline assessments from 10 systematic reviews.
Results: AutoReporter, a zero-shot, no-retrieval prompt coupled with the o3-mini reasoning LLM, demonstrated strong accuracy (CONSORT: 90.09%; SPIRIT: 92.07%) and substantial agreement with human raters (CONSORT: Cohen’s κ = 0.70; SPIRIT: Cohen’s κ = 0.77), with runtimes of 617.26 s (CONSORT) and 544.51 s (SPIRIT) and costs of 0.68 USD (CONSORT) and 0.65 USD (SPIRIT). AutoReporter achieved a mean accuracy of 91.8% and substantial agreement (Cohen’s κ > 0.6) with expert ratings on the BenchReport benchmark.
Discussion: Structured prompting alone can match or exceed fine-tuned domain models while forgoing manually annotated corpora and computationally intensive training.
Conclusion: Large language models can feasibly automate reporting guideline adherence assessments for scalable quality control in scientific research reporting. AutoReporter is publicly accessible at https://autoreporter.streamlit.app.
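The abstract describes AutoReporter's method as a zero-shot, no-retrieval prompt sent to the o3-mini reasoning model, with item-level adherence ratings compared against expert ratings using Cohen's κ. The sketch below is a minimal illustration of that general pattern under stated assumptions, not the authors' actual prompt or pipeline: the prompt wording, the check_item helper, and the example usage are hypothetical; only the OpenAI chat-completions call and the standard Cohen's κ formula are established.

```python
from openai import OpenAI  # assumes the official openai>=1.x Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def check_item(manuscript_text: str, guideline_item: str) -> str:
    """Hypothetical zero-shot adherence check for one reporting-guideline item.

    Returns 'yes' or 'no'. The prompt wording is illustrative only and is not
    taken from the paper.
    """
    prompt = (
        "You are assessing adherence to a research reporting guideline.\n"
        f"Guideline item: {guideline_item}\n"
        "Manuscript excerpt:\n"
        f"{manuscript_text}\n"
        "Answer with a single word, 'yes' or 'no': does the manuscript report this item?"
    )
    resp = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters over the same items: (p_o - p_e) / (1 - p_e)."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n  # observed agreement
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)  # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


# Example usage (hypothetical data): rate each guideline item with the LLM,
# then compare against expert labels.
# model_ratings = [check_item(text, item) for item in consort_items]
# expert_ratings = ["yes", "no", "yes", ...]
# print(cohens_kappa(expert_ratings, model_ratings))
```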