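Medicine
Pharmacovigilance
Inter-rater reliability
Otorhinolaryngology
Medical diagnosis
Family medicine
Patient safety
Physical therapy
Reliability (semiconductor)
Likert scale
Pediatrics
Clinical history
Physical examination
Objective structured clinical examination
Outpatient department
Medical emergency
Medical physics
Emergency medicine
MEDLINE
Human factors and ergonomics
Occupational safety and health
Alternative medicine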
Authors
Filippo Bruno, Lise Sogalow, Bertrand Blankert, Jerome R Lechien
Abstract
Objective: To compare the clinical and pharmacovigilance performance, stability, and correctability of 3 large language models (LLMs) in otolaryngology outpatient care.
Study design: Prospective case series.
Setting: Multicenter university hospitals.
Methods: Cases of consecutive adults (August‐October 2024) with established primary diagnoses were entered into ChatGPT‐4o, Gemini‐1.5‐Pro, and Claude‐3.5‐Sonnet via standardized prompts, using only history and physical examination findings (no complementary tests). Two blinded otolaryngologists rated clinical accuracy with the Artificial Intelligence Performance Instrument (AIPI); 2 blinded pharmacists rated pharmacological information on a 5‐point Likert scale. Errors were fed back to the models, and all cases were re‐queried one month later. Interrater reliability was assessed with the intraclass correlation coefficient (ICC), stability with Cronbach's α, and group differences with the Kruskal‐Wallis test.
Results: Fifty‐one patients with 60 diagnoses across otolaryngology subspecialties were consecutively recruited (38 females, 74.5%; mean age 42.4 ± 17.4 years). All LLMs recommended significantly more additional examinations than practitioners (P = .001), and the number of recommended additional examinations increased significantly after regeneration for ChatGPT‐4o and Claude‐3.5‐Sonnet. Claude‐3.5‐Sonnet and ChatGPT‐4o outperformed Gemini‐1.5‐Pro for AIPI clinical management (P = .001) and pharmacovigilance findings (P = .001). Interrater reliability was almost perfect for both the physicians (ICC = 0.853) and the pharmacists (ICC = 0.991). All LLMs demonstrated almost perfect clinical stability (α = 0.831‐0.856), though human feedback did not significantly reduce misdiagnosis rates in subsequent interactions.
Conclusion: In outpatient ENT cases using clinical features alone, ChatGPT‐4o and Claude‐3.5‐Sonnet deliver higher clinical and pharmacovigilance performance than Gemini‐1.5‐Pro, with almost perfect interrater reliability and stable outputs. Re‐querying after feedback did not improve accuracy, which calls short‐term correctability into question.