Evaluating Large Language Models in Ophthalmology: Systematic Review

Authors
Zitao Zhang, Haiyang Zhang, Z.P. Pan, Zhangqian Bi, Yao Wan, Xuefei Song, Xianqun Fan
Source
Journal: Journal of Medical Internet Research [JMIR Publications]
Volume/Issue: 27: e76947. Cited by: 1
Identifier
DOI: 10.2196/76947
Abstract

Background: Large language models (LLMs) have the potential to revolutionize ophthalmic care, but their evaluation practice remains fragmented. A systematic assessment is crucial to identify gaps and guide future evaluation practices and clinical integration.

Objective: This study aims to map the current landscape of LLM evaluations in ophthalmology and to explore whether performance synthesis is feasible for a common task.

Methods: A comprehensive search of PubMed, Web of Science, Embase, and IEEE Xplore was conducted up to November 17, 2024 (no language limits). Eligible publications quantitatively assessed an existing or modified LLM on ophthalmology-related tasks. Studies without full-text availability or those focusing solely on vision-only models were excluded. Two reviewers screened studies and extracted data across 6 dimensions (evaluated LLM, data modality, ophthalmic subspecialty, medical task, evaluation dimension, and clinical alignment); disagreements were resolved by a third reviewer. Descriptive statistics were analyzed and visualized in Python (with the NumPy, Pandas, SciPy, and Matplotlib libraries). The Fisher exact test compared open-source versus closed-source models. An exploratory random-effects meta-analysis (logit transformation; DerSimonian-Laird τ²) was performed for the diagnosis-making task; heterogeneity was reported with I² and subgrouped by model, modality, and subspecialty.

Results: Of the 817 identified records, 187 studies met the inclusion criteria. Closed-source LLMs dominated: ChatGPT was evaluated in 170 studies, Gemini in 58, and Copilot in 32. Open-source LLMs appeared in only 25 (13.4%) studies overall, but in 17 (77.3%) of the evaluation-after-development studies, versus 8 (4.8%) of the pure-evaluation studies (P<1×10⁻⁵). Evaluations were chiefly text-only (n=168); image-text tasks, despite the centrality of imaging in ophthalmology, were used in only 19 studies. Subspecialty coverage was skewed toward comprehensive ophthalmology (n=72), retina and vitreous (n=32), and glaucoma (n=20); refractive surgery, ocular pathology and oncology, and ophthalmic pharmacology each appeared in 3 or fewer studies. Medical query (n=86), standardized examination (n=41), and diagnosis making (n=29) emerged as the 3 predominant tasks, while research assistance (n=5), patient triaging (n=3), and disease prediction (n=3) received less attention. Accuracy was reported in most studies (n=176), whereas calibration and uncertainty were almost absent (n=5). Real-world patient data (n=45), comparison with human performance (n=63), non-English testing (n=24), and real-world deployment (n=4) remained uncommon. The exploratory meta-analysis pooled 28 diagnostic evaluations from 17 studies: overall accuracy was 0.594 (95% CI 0.488-0.692) with extreme heterogeneity (I²=94.5%). Subgroups remained heterogeneous (I²>80%), and findings were inconsistent (eg, pooled GPT-3.5 outperformed GPT-4).

Conclusions: Evidence on LLM evaluations in ophthalmology is extensive but heterogeneous. Most studies have tested a few closed-source LLMs on text-based questions, leaving open-source systems, multimodal tasks, non-English contexts, and real-world deployment underexamined. High methodological variability precludes meaningful performance aggregation, as illustrated by the heterogeneous meta-analysis. Standardized, multimodal benchmarks and phased clinical validation pipelines are urgently needed before LLMs can be safely integrated into eye care workflows.
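The open- versus closed-source comparison in the Methods can be sketched with SciPy's Fisher exact test. The 2×2 table below is an assumption reconstructed from the counts and percentages reported in the abstract (17 of roughly 22 evaluation-after-development studies vs 8 of roughly 165 pure-evaluation studies); it is illustrative, not data taken from the study's supplementary materials.

```python
from scipy.stats import fisher_exact

# Rows: evaluation-after-development studies, pure-evaluation studies.
# Columns: open-source LLM evaluated (yes / no).
# Cell counts are inferred from the abstract's 17 (77.3%) and 8 (4.8%)
# figures, so the exact table is an assumption.
table = [[17, 22 - 17],
         [8, 165 - 8]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio ~ {odds_ratio:.1f}, p = {p_value:.1e}")
```

With counts this imbalanced the test yields a p-value far below 1×10⁻⁵, consistent with the significance level reported in the Results.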
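The pooling procedure named in the Methods (logit transformation with a DerSimonian-Laird random-effects model) can be sketched as follows. This is a minimal illustration of the standard technique, not the authors' code; the study counts in the usage example are hypothetical.

```python
import numpy as np

def pool_proportions_dl(successes, totals):
    """Pool study-level accuracies via a logit transform and a
    DerSimonian-Laird random-effects model (minimal sketch).

    Assumes 0 < successes < totals for every study (no continuity
    correction). Returns (pooled proportion, 95% CI, tau^2, I^2 %).
    """
    successes = np.asarray(successes, dtype=float)
    totals = np.asarray(totals, dtype=float)
    failures = totals - successes
    y = np.log(successes / failures)         # logit of each accuracy
    v = 1.0 / successes + 1.0 / failures     # within-study variance (delta method)
    w = 1.0 / v
    y_fixed = np.sum(w * y) / np.sum(w)      # fixed-effect mean
    q = np.sum(w * (y - y_fixed) ** 2)       # Cochran's Q
    k = len(y)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)       # DerSimonian-Laird tau^2
    w_re = 1.0 / (v + tau2)                  # random-effects weights
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    ci = y_re + np.array([-1.96, 1.96]) * se
    i2 = 100.0 * max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))  # back-transform logits
    return expit(y_re), expit(ci), tau2, i2
```

Usage with hypothetical per-study correct/total counts: `pool_proportions_dl([30, 45, 10], [50, 60, 40])` returns the pooled accuracy on the proportion scale together with its confidence interval, the between-study variance τ², and the I² heterogeneity statistic that the Results report per subgroup.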