Comparative Accuracy, Stability, and Correctability of Large Language Models in Otolaryngology and Pharmacovigilance

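Medicine; Pharmacovigilance; Inter-rater reliability; Otorhinolaryngology; Medical diagnosis; Family medicine; Patient safety; Physical therapy; Reliability (semiconductor); Likert scale; Pediatrics; Clinical history; Physical examination; Objective structured clinical examination; Outpatient clinic; Medical emergency; Medical physics; Emergency medicine; MEDLINE; Human factors and ergonomics; Occupational safety and health; Alternative medicine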

Authors

Filippo Bruno,Lise Sogalow,Bertrand Blankert,Jerome R Lechien

Source

Journal: Otolaryngology-Head and Neck Surgery [Wiley]
Date: 2025-11-18

Links

wiley.com nih.gov doi.org

Identifiers

DOI: 10.1002/ohn.70070

Abstract

Objective: To compare the clinical and pharmacovigilance performance, stability, and correctability of 3 large language models (LLMs) in otolaryngology outpatient care.

Study design: Prospective case series.

Setting: Multicenter university hospitals.

Methods: Consecutive adults (August‐October 2024) with established primary diagnoses were entered into ChatGPT‐4o, Gemini‐1.5‐Pro, and Claude‐3.5‐Sonnet using only history and physical examination findings (no complementary tests) via standardized prompts. Two blinded otolaryngologists rated clinical accuracy with the Artificial Intelligence Performance Instrument (AIPI); 2 blinded pharmacists rated pharmacological information on a 5‐point Likert scale. Errors were fed back to the models, and all cases were re‐queried one month later. Interrater reliability was assessed with the intraclass correlation coefficient (ICC), stability with Cronbach's α, and group differences with the Kruskal‐Wallis test.

Results: Fifty‐one patients with 60 diagnoses across otolaryngology subspecialties were consecutively recruited (38 females [74.5%]; mean age 42.4 ± 17.4 years). All LLMs recommended significantly more additional examinations than practitioners (P = .001), and the number of recommended additional examinations increased significantly after regenerated inputs for ChatGPT‐4o and Claude‐3.5‐Sonnet. Claude‐3.5‐Sonnet and ChatGPT‐4o outperformed Gemini‐1.5‐Pro for AIPI‐clinical management (P = .001) and pharmacovigilance findings (P = .001). The physicians (ICC = 0.853) and the pharmacists (ICC = 0.991) demonstrated almost perfect interrater reliability. All LLMs demonstrated almost perfect clinical stability (α = 0.831‐0.856), though human feedback did not significantly reduce misdiagnosis rates in subsequent interactions.

Conclusion: In outpatient ENT cases using clinical features alone, ChatGPT‐4o and Claude‐3.5‐Sonnet deliver higher clinical and pharmacovigilance performance than Gemini‐1.5‐Pro, with almost perfect interrater reliability and stable outputs. Re‐querying after feedback did not improve accuracy, which calls short‐term correctability into question.
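
The abstract reports stability as Cronbach's α but does not say how it was computed. As a minimal sketch, the standard formula α = k/(k−1) · (1 − Σ item variances / variance of row totals) can be written in plain Python as below. It assumes each row is one case and each column one query round; the function name `cronbach_alpha` and the demo scores are hypothetical illustrations, not data or code from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per case, one column per item
    (assumed here: one column per query round). Uses sample variance.
    """
    n_cases = len(scores)
    k = len(scores[0])
    if n_cases < 2 or k < 2:
        raise ValueError("need at least 2 cases and 2 items")
    if any(len(row) != k for row in scores):
        raise ValueError("all rows must have the same number of items")

    # Variance of each item (column) across cases.
    item_vars = [variance(col) for col in zip(*scores)]
    # Variance of the per-case total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total score variance is zero; alpha is undefined")

    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical scores for 4 cases over 2 query rounds (not study data).
    demo = [[2.0, 3.0], [4.0, 4.0], [3.0, 5.0], [5.0, 5.0]]
    print(round(cronbach_alpha(demo), 3))  # 0.784 (= 40/51)
```

When the two rounds give identical scores for every case, the function returns 1.0, the upper bound that the "almost perfect" range of 0.831 to 0.856 is judged against.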
