McNemar test
Medicine
Medical diagnosis
Clinical clerkship
Medical physics
Family medicine
Pathology
Statistics
Mathematics
Authors
David Mikhail,Andrew Farah,Jason Milad,Andrew Mihalache,Daniel Milad,Fares Antaki,Michael Balas,Marko M. Popovic,Rajeev H. Muni,Pearse A. Keane,Renaud Duval
Identifier
DOI:10.1001/jamaophthalmol.2025.2918
Abstract
Importance: Large language models (LLMs) are increasingly being explored in clinical decision-making, but few studies have evaluated their performance on complex ophthalmology cases from clinical practice settings. Understanding whether open-weight, reasoning-enhanced LLMs can outperform proprietary models has implications for clinical utility and accessibility.

Objective: To evaluate the diagnostic accuracy, management decision-making, and cost of DeepSeek-R1 vs OpenAI o1 across diverse ophthalmic subspecialties.

Design, Setting, and Participants: This was a cross-sectional evaluation conducted using standardized prompts and model configurations. Clinical cases were sourced from JAMA Ophthalmology's Clinical Challenge articles, containing complex cases from clinical practice settings. Each case included an open-ended diagnostic question and a multiple-choice next-step decision. All cases were included without exclusions, and no human participants were involved. Data were analyzed from March 13 to March 30, 2025.

Exposures: DeepSeek-R1 and OpenAI o1 were evaluated using the Plan-and-Solve Plus (PS+) prompt engineering method.

Main Outcomes and Measures: Primary outcomes were diagnostic accuracy and next-step decision-making accuracy, defined as the proportion of correct responses. Token cost analyses were performed to estimate expenses. Intermodel agreement was evaluated using the Cohen κ, and the McNemar test was used to compare performance.

Results: A total of 422 clinical cases were included, spanning 10 subspecialties. DeepSeek-R1 achieved a higher diagnostic accuracy of 70.4% (297 of 422 cases) compared with 63.0% (266 of 422 cases) for OpenAI o1, a 7.3% difference (95% CI, 1.0%-13.7%; P = .02). For next-step decisions, DeepSeek-R1 was correct in 82.7% of cases (349 of 422 cases) vs OpenAI o1's accuracy of 75.8% (320 of 422 cases), a 6.9% difference (95% CI, 1.4%-12.3%; P = .01). Intermodel agreement was moderate (κ = 0.422; 95% CI, 0.375-0.469; P < .001). DeepSeek-R1 offered lower costs per query than OpenAI o1, with savings exceeding 66-fold (up to 98.5%) during off-peak pricing.

Conclusions and Relevance: DeepSeek-R1 outperformed OpenAI o1 in diagnosis and management across subspecialties while lowering operating costs, supporting the potential of open-weight, reinforcement learning–augmented LLMs as scalable and cost-saving tools for clinical decision support. Further investigations should evaluate safety guardrails and assess performance of self-hosted adaptations of DeepSeek-R1 with domain-specific ophthalmic expertise to optimize clinical utility.
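The statistical approach described above (the McNemar test for paired accuracy comparisons and Cohen κ for inter-model agreement) can be sketched in a few lines of Python. The cell counts below are hypothetical illustrations only: they match the reported marginal accuracies (297/422 and 266/422 correct diagnoses), but the per-case concordance breakdown is not given in the abstract, so the resulting P value and κ will not reproduce the paper's figures.

```python
from math import comb

# Hypothetical 2x2 paired table for N = 422 cases (illustrative only;
# the abstract does not report per-case concordance counts).
# a: both models correct, b: only DeepSeek-R1 correct,
# c: only OpenAI o1 correct, d: both wrong.
a, b, c, d = 210, 87, 56, 69  # a + b = 297 (R1), a + c = 266 (o1)

# McNemar exact test: under H0, the discordant pairs (b vs c)
# split 50/50, so the test is a two-sided binomial test on min(b, c).
n_disc = b + c
k = min(b, c)
p_two_sided = min(1.0, 2 * sum(comb(n_disc, i) for i in range(k + 1)) / 2**n_disc)

# Cohen's kappa for inter-model agreement on correctness:
# observed agreement vs agreement expected from the marginals alone.
N = a + b + c + d
p_obs = (a + d) / N
p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / N**2
kappa = (p_obs - p_exp) / (1 - p_exp)

print(f"McNemar exact P = {p_two_sided:.4f}, Cohen kappa = {kappa:.3f}")
```

In practice these are one-liners via `statsmodels.stats.contingency_tables.mcnemar` and `sklearn.metrics.cohen_kappa_score`; the manual version above just makes the arithmetic behind the abstract's comparison explicit.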