Authors
Sahana Srinivasan, X. C. Ai, Minjie Zou, Ke Zou, Hyunjae Kim, Thaddaeus Wai Soon Lo, Krithi Pushpanathan, Guang Yang, Jocelyn Hui Lin Goh, Yiming Kong, Anran Li, Maxwell Singer, Kai Jin, Fares Antaki, David Z. Chen, Dianbo Liu, Ron A. Adelman, Qingyu Chen, Yih‐Chung Tham
Abstract
Importance: OpenAI’s recent large language model (LLM) o1 has dedicated reasoning capabilities, but it remains untested in specialized medical fields like ophthalmology. Evaluating o1 in ophthalmology is crucial to determine whether its general reasoning can meet specialized needs or whether domain-specific LLMs are warranted.

Objective: To assess the performance and reasoning ability of OpenAI’s o1 compared with other LLMs on ophthalmological questions.

Design, Setting, and Participants: In September through October 2024, the LLMs o1, GPT-4o (OpenAI), GPT-4 (OpenAI), GPT-3.5 (OpenAI), Llama 3-8B (Meta), and Gemini 1.5 Pro (Google) were evaluated on 6990 standardized ophthalmology questions from the Medical Multiple-Choice Question Answering (MedMCQA) dataset. The study did not analyze human participants.

Main Outcomes and Measures: Models were evaluated on performance (accuracy and macro F1 score) and reasoning ability (text-generation metrics: Recall-Oriented Understudy for Gisting Evaluation [ROUGE-L], BERTScore, BARTScore, AlignScore, and Metric for Evaluation of Translation With Explicit Ordering [METEOR]). Mean scores are reported for o1, while mean differences (Δ) from o1’s scores are reported for the other models. Expert qualitative evaluation of o1 and GPT-4o responses assessed usefulness, organization, and comprehensibility using 5-point Likert scales.

Results: The LLM o1 achieved the highest accuracy (mean, 0.877; 95% CI, 0.870 to 0.885) and macro F1 score (mean, 0.877; 95% CI, 0.869 to 0.884) (P < .001). In BERTScore, GPT-4o (Δ = 0.012; 95% CI, 0.012 to 0.013) and GPT-4 (Δ = 0.014; 95% CI, 0.014 to 0.015) outperformed o1 (P < .001). Similarly, in AlignScore, GPT-4o (Δ = 0.019; 95% CI, 0.016 to 0.021) and GPT-4 (Δ = 0.024; 95% CI, 0.021 to 0.026) again performed better (P < .001). In ROUGE-L, GPT-4o (Δ = 0.018; 95% CI, 0.017 to 0.019), GPT-4 (Δ = 0.026; 95% CI, 0.025 to 0.027), and GPT-3.5 (Δ = 0.008; 95% CI, 0.007 to 0.009) all outperformed o1 (P < .001). Conversely, o1 led in BARTScore (mean, –4.787; 95% CI, –4.813 to –4.762; P < .001) and METEOR (mean, 0.221; 95% CI, 0.218 to 0.223; P < .001 except vs GPT-4o). o1 also outperformed GPT-4o in usefulness (o1: mean, 4.81; 95% CI, 4.73 to 4.89; GPT-4o: mean, 4.53; 95% CI, 4.40 to 4.65; P < .001) and organization (o1: mean, 4.83; 95% CI, 4.75 to 4.90; GPT-4o: mean, 4.63; 95% CI, 4.51 to 4.74; P = .003).

Conclusions and Relevance: This study found that o1 excelled in accuracy but was inconsistent on text-generation metrics, trailing GPT-4o and GPT-4; expert reviewers rated o1’s responses as more clinically useful and better organized than GPT-4o’s. While o1 demonstrated promise, its performance on ophthalmology-specific challenges is not fully optimal, underscoring the potential need for domain-specialized LLMs and targeted evaluations.
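For readers unfamiliar with the outcome measures, the sketch below illustrates how the multiple-choice performance metrics (accuracy, macro F1) and one of the text-generation metrics (ROUGE-L) could be computed with standard open-source tooling. This is a minimal illustration under assumed inputs, not the authors' actual evaluation pipeline: the variable names and example strings are hypothetical, and it assumes scikit-learn and Google's rouge-score package are installed; BERTScore, BARTScore, AlignScore, and METEOR would plug into the same loop via their respective packages.

```python
# Illustrative sketch only (not the study's pipeline): compute accuracy,
# macro F1 over predicted option letters, and mean ROUGE-L F-measure over
# generated explanations, for an MCQ benchmark such as MedMCQA.
from sklearn.metrics import accuracy_score, f1_score
from rouge_score import rouge_scorer

# Hypothetical example data: model-chosen option letters vs. gold answers,
# and model-generated explanations vs. reference explanations.
pred_options = ["A", "C", "B", "D"]
gold_options = ["A", "C", "D", "D"]
generated_explanations = [
    "Acute angle-closure glaucoma presents with a mid-dilated pupil and ocular pain.",
    "Central retinal artery occlusion causes sudden painless monocular vision loss.",
]
reference_explanations = [
    "A mid-dilated, non-reactive pupil with pain is typical of acute angle closure.",
    "Sudden painless monocular vision loss suggests central retinal artery occlusion.",
]

# Multiple-choice performance: accuracy and macro-averaged F1 over option labels.
accuracy = accuracy_score(gold_options, pred_options)
macro_f1 = f1_score(gold_options, pred_options, average="macro")

# Text-generation quality: ROUGE-L F-measure for each generated explanation
# against its reference, averaged over the dataset.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(
    scorer.score(ref, gen)["rougeL"].fmeasure
    for ref, gen in zip(reference_explanations, generated_explanations)
) / len(reference_explanations)

print(f"Accuracy: {accuracy:.3f}  Macro F1: {macro_f1:.3f}  ROUGE-L: {rouge_l:.3f}")
```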