Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models

医学甲状腺结节一致性（知识库）医学诊断介绍接收机工作特性恶性肿瘤放射科病理人工智能家庭医学内科学计算机科学

作者

Shaohong Wu,Wenjuan Tong,Ming‐De Li,Shunro Matsumoto,Xiao-Zhou Lu,Ze-Rong Huang,Xin-Xin Lin,Rui-Fang Lu,Ming‐De Lu,Li‐Da Chen,Wei Wang

出处

期刊：Radiology [Radiological Society of North America]
日期：2024-03-01 卷期号：310 (3) 被引量：52

链接

nih.govdoi.org

标识

DOI：10.1148/radiol.232255

摘要

Background Large language models (LLMs) hold substantial promise for medical imaging interpretation. However, there is a lack of studies on their feasibility in handling reasoning questions associated with medical diagnosis. Purpose To investigate the viability of leveraging three publicly available LLMs to enhance consistency and diagnostic accuracy in medical imaging based on standardized reporting, with pathology as the reference standard. Materials and Methods US images of thyroid nodules with pathologic results were retrospectively collected from a tertiary referral hospital between July 2022 and December 2022 and used to evaluate malignancy diagnoses generated by three LLMs-OpenAI's ChatGPT 3.5, ChatGPT 4.0, and Google's Bard. Inter- and intra-LLM agreement of diagnosis were evaluated. Then, diagnostic performance, including accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), was evaluated and compared for the LLMs and three interactive approaches: human reader combined with LLMs, image-to-text model combined with LLMs, and an end-to-end convolutional neural network model. Results A total of 1161 US images of thyroid nodules (498 benign, 663 malignant) from 725 patients (mean age, 42.2 years ± 14.1 [SD]; 516 women) were evaluated. ChatGPT 4.0 and Bard displayed substantial to almost perfect intra-LLM agreement (κ range, 0.65-0.86 [95% CI: 0.64, 0.86]), while ChatGPT 3.5 showed fair to substantial agreement (κ range, 0.36-0.68 [95% CI: 0.36, 0.68]). ChatGPT 4.0 had an accuracy of 78%-86% (95% CI: 76%, 88%) and sensitivity of 86%-95% (95% CI: 83%, 96%), compared with 74%-86% (95% CI: 71%, 88%) and 74%-91% (95% CI: 71%, 93%), respectively, for Bard. Moreover, with ChatGPT 4.0, the image-to-text-LLM strategy exhibited an AUC (0.83 [95% CI: 0.80, 0.85]) and accuracy (84% [95% CI: 82%, 86%]) comparable to those of the human-LLM interaction strategy with two senior readers and one junior reader and exceeding those of the human-LLM interaction strategy with one junior reader. Conclusion LLMs, particularly integrated with image-to-text approaches, show potential in enhancing diagnostic medical imaging. ChatGPT 4.0 was optimal for consistency and diagnostic accuracy when compared with Bard and ChatGPT 3.5. © RSNA, 2024

求助该文献

最长约 10秒，即可获得该文献文件

Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models

今日热心研友