Misinformation
Subspecialty
Adversarial system
Medicine
Neurosurgery
Robustness
Artificial intelligence
Medical education
Computer science
Pathology
Surgery
Computer security
Biochemistry
Chemistry
Gene
Authors
Rohaid Ali,Hael Abdulrazeq,Advait Patil,Michelle Cheatham,Ian D. Connolly,Oliver Y. Tang,Cody Doberstein,Tori Riccelli,Kevin T. Huang,Ganesh M. Shankar,Theresa Williamson,John H. Shin,Bob S. Carter,Radmehr Torabi,Christine K. Lee,Deus Cielo,Albert E. Telfeian,Ziya L. Gokaslan,Aaron Cohen‐Gadol,James Zou
Identifier
DOI:10.3171/2024.12.jns241607
Abstract
OBJECTIVE Large language models (LLMs) have shown promising performance on medical licensing examinations, but their ability to excel in subspecialty domains and their robustness under adversarial conditions remain unclear. Herein, the authors present AtlasGPT, a subspecialty-focused LLM for neurosurgery, and evaluate its performance on a benchmark multiple-choice question bank and under adversarial testing, as well as its ability to generate high-quality explanations.
METHODS AtlasGPT was built by fine-tuning the GPT-4 architecture and applying retrieval-augmented generation over neurosurgical knowledge sources. Its performance was compared with that of GPT-4 and Gemini Advanced on a 149-question neurosurgery examination. Adversarial testing assessed robustness to misinformation. Answer explanations were rated by 15 independent neurosurgeons and compared with the question bank's own explanations.
RESULTS Across all 149 questions and on text-only questions, AtlasGPT (96%) outperformed Gemini Advanced (93%) and GPT-4 (88%) in accuracy. In adversarial testing, in which AtlasGPT was tasked with identifying medical misinformation, it was fooled 14% of the time, compared with 44% for GPT-4 and 68% for Gemini Advanced. Neurosurgeons rated AtlasGPT's answer explanations as significantly more comprehensive, relevant, and better referenced than the question bank's explanations (p < 0.001). AtlasGPT did not demonstrate any evidence of hallucination or other content that would be harmful to patient care or to the surgeon's clinical decision-making.
CONCLUSIONS AtlasGPT demonstrates the potential of subspecialty-focused LLMs to outperform general models, exhibit robustness to misinformation, and generate high-quality explanations. Domain-specific LLMs may improve medical knowledge, decision-making, and educational materials in complex fields like neurosurgery.
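The METHODS section describes grounding the model with retrieval-augmented generation (RAG): relevant passages from a trusted corpus are retrieved and prepended to the prompt before the LLM answers. The abstract does not specify AtlasGPT's retriever, corpus, or prompt format, so the following is only a minimal sketch of the general RAG pattern; the corpus snippets, the keyword-overlap scoring, and the function names (`retrieve`, `build_prompt`) are hypothetical.

```python
# Minimal RAG sketch. Assumptions (not from the paper): a toy in-memory
# corpus and simple word-overlap scoring stand in for AtlasGPT's actual
# neurosurgical knowledge sources and retriever.

def tokenize(text):
    # Naive whitespace tokenizer; a real system would normalize punctuation
    # and use embeddings rather than exact word overlap.
    return set(text.lower().split())

CORPUS = [  # hypothetical neurosurgical knowledge snippets
    "Glioblastoma is the most common malignant primary brain tumor in adults.",
    "The middle meningeal artery is the usual source of epidural hematoma.",
    "Lumbar disc herniation most often occurs at the L4-L5 and L5-S1 levels.",
]

def retrieve(question, corpus, k=1):
    """Rank corpus passages by word overlap with the question, keep top k."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, corpus):
    """Prepend retrieved context so the LLM answers from trusted sources."""
    context = "\n".join(retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("What artery causes an epidural hematoma?", CORPUS)
```

The final `prompt` string would then be sent to the underlying LLM; constraining generation to retrieved, vetted passages is one plausible reason a RAG-based system resists injected misinformation better than a general model.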