Readability
Calibration
Cronbach's alpha
Likert scale
Medical physics
Cohen's kappa
Kappa
Proportion (ratio)
Usability
Medicine
Computer science
Psychology
Statistics
Clinical psychology
Psychometrics
Mathematics
Machine learning
Developmental psychology
Human-computer interaction
Physics
Geometry
Quantum mechanics
Programming language
Authors
Ömer Faruk Asker, Muhammed Selim Recai, Yunus Emre Genç, Kader Ada Dogan, Tarık Emre Şener, Bahadır Şahin
Source
Journal: BJUI
[Wiley]
Date: 2025-07-31
Abstract
Objective: To evaluate widely used chatbots' accuracy, calibration error, readability, and understandability with objective measurements using 35 questions derived from urology in-service examinations, as the integration of large language models (LLMs) into healthcare has gained increasing attention, raising questions about their applications and limitations.

Materials and Methods: A total of 35 European Board of Urology questions were posed to five LLMs (ChatGPT-4o, DeepSeek-R1, Gemini, Grok-2, and Claude 3.5) with a standardised prompt that was systematically designed and used across all models. Accuracy was calculated by Cohen's kappa for all models. Readability was assessed by Flesch Reading Ease, Gunning Fog, Coleman-Liau, Simple Measure of Gobbledygook, and Automated Readability Index, while understandability was determined by residents' ratings on a Likert scale.

Results: The models and the answer key were in substantial agreement, with a Fleiss' kappa of 0.701 and a Cronbach's alpha of 0.914. For accuracy, Cohen's kappa was 0.767 for ChatGPT-4o, 0.764 for DeepSeek-R1, and 0.765 for Grok-2 (80% accuracy each), followed by 0.729 for Claude 3.5 (77% accuracy) and 0.611 for Gemini (68.4% accuracy). The lowest calibration error was found in ChatGPT-4o (19.2%), and DeepSeek-R1 scored the highest for readability. In the understandability analysis, Claude 3.5 had the highest rating compared with the other models.

Conclusion: Chatbots demonstrated varying strengths across different tasks. DeepSeek-R1, despite being newly released, showed promising results in medical applications. These findings highlight the need for further optimisation to better understand the applications of chatbots in urology.
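As a rough illustration of the evaluation pipeline the abstract describes (not the paper's actual code), the sketch below shows how chance-corrected agreement with an answer key, the listed readability indices, and one common notion of calibration error could be computed in Python. The data, the confidence values, and the calibration-error formula are assumptions for illustration; the paper's exact definitions are not given in the abstract.

```python
# Minimal sketch, assuming scikit-learn, textstat, and numpy are installed.
import numpy as np
import textstat
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: multiple-choice answers (A-E) for a handful of questions.
key = ["A", "C", "B", "D", "E"]     # truncated example answer key
model = ["A", "C", "B", "B", "E"]   # truncated example model answers

# Raw accuracy and chance-corrected agreement (Cohen's kappa) against the key.
accuracy = float(np.mean([m == k for m, k in zip(model, key)]))
kappa = cohen_kappa_score(model, key)
print(f"accuracy={accuracy:.3f}, kappa={kappa:.3f}")

# Readability of a model's explanation text, using the indices named in the
# abstract: Flesch Reading Ease, Gunning Fog, Coleman-Liau, SMOG, and ARI.
explanation = "The bladder stores urine produced by the kidneys."
print(textstat.flesch_reading_ease(explanation))
print(textstat.gunning_fog(explanation))
print(textstat.coleman_liau_index(explanation))
print(textstat.smog_index(explanation))
print(textstat.automated_readability_index(explanation))

# One common definition of calibration error (assumed here, not taken from
# the paper): mean absolute gap between stated confidence and correctness.
confidence = np.array([0.9, 0.8, 0.95, 0.7, 0.85])   # hypothetical confidences
correct = np.array([m == k for m, k in zip(model, key)], dtype=float)
print(f"calibration error = {np.abs(confidence - correct).mean():.3f}")
```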