Systematic review
Critical appraisal
Meta-analysis
Computer science
Confidence interval
Medicine
Natural language processing
MEDLINE
Pathology
Alternative medicine
Internal medicine
Biology
Biochemistry
Authors
Diego A. Forero, Sandra E. Abreu, Blanca Elpidia Tovar Riveros, Marilyn H. Oermann
Identifier
DOI:10.1093/jamia/ocaf117
Abstract
Objectives: To explore the performance of 4 large language model (LLM) chatbots for the analysis of 2 of the most commonly used tools for the advanced analysis of systematic reviews (SRs) and meta-analyses.
Materials and Methods: We explored the performance of 4 LLM chatbots (ChatGPT, Gemini, DeepSeek, and QWEN) for the analysis of the ROBIS and AMSTAR 2 tools (sample sizes: 20 SRs), in comparison with assessments by human experts.
Results: Gemini showed the best agreement with human experts for both ROBIS and AMSTAR 2 (accuracy: 58% and 70%, respectively). The second-best LLM chatbots were ChatGPT and QWEN, for ROBIS and AMSTAR 2, respectively.
Discussion: Some LLM chatbots underestimated the risk of bias or overestimated the confidence of the results in published SRs, which is consistent with recent articles on other tools.
Conclusion: This is one of the first studies comparing the performance of several LLM chatbots for the automated analysis of ROBIS and AMSTAR 2.
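As a rough illustration of how a per-item agreement figure such as the reported 58% and 70% accuracies might be derived (the abstract does not describe the exact calculation, and the code below is not from the study), here is a minimal Python sketch that computes simple percent agreement between chatbot-generated and human-expert ratings. The function name, rating labels, and example data are hypothetical assumptions for illustration only.

```python
# Minimal sketch (not from the paper): percent agreement between LLM-generated
# and human-expert ratings on a per-SR basis. All labels and data are hypothetical.

def percent_agreement(llm_ratings, human_ratings):
    """Fraction of items where the LLM rating matches the human-expert rating."""
    assert len(llm_ratings) == len(human_ratings), "rating lists must be the same length"
    matches = sum(l == h for l, h in zip(llm_ratings, human_ratings))
    return matches / len(human_ratings)

# Hypothetical AMSTAR 2 overall-confidence ratings for five systematic reviews.
human = ["high", "low", "critically low", "moderate", "low"]
llm   = ["high", "moderate", "critically low", "moderate", "high"]

print(f"Agreement: {percent_agreement(llm, human):.0%}")  # prints "Agreement: 60%"
```

In practice, studies of this kind often also report chance-corrected agreement (e.g., Cohen's kappa) alongside raw accuracy; the sketch above covers only the simpler percent-agreement case.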