Medicine
Computer-aided design
Coronary angiography
Natural language processing
Cardiology
Myocardial infarction
Engineering
Engineering drawing
Computer science
Authors
Yuli Wang, Wen-Chi Hsu, Yuwei Dai, V. Shi, Gigin Lin, Harrison X. Bai, Cheng Ting Lin
Source
Journal: Circulation
Publisher: Lippincott Williams & Wilkins
Date: 2024-11-12
Volume/Issue: 150 (Suppl_1)
Identifier
DOI: 10.1161/circ.150.suppl_1.4119869
Abstract
Background: The effectiveness of different large language models (LLMs) in extracting important information from CT angiography (CTA) radiology reports is unknown.

Purpose: The Coronary Artery Disease-Reporting and Data System (CAD-RADS) is a standardized reporting system for CTA that categorizes the severity of coronary artery disease to aid clinical decision-making. The purpose of this study was to evaluate the agreement between human readers and LLMs in assigning CAD-RADS categories based on CTA reports.

Materials and Methods: This retrospective study analyzed reports from patients who underwent coronary CTA for screening or diagnosis at Johns Hopkins Hospital. Reports with CAD-RADS 0-5 findings were collected from January 2020 to October 2023. Of 6,212 CTA reports initially retrieved, 590 remained after excluding unstructured reports and retaining only those with CAD-RADS scores. Board-certified radiologists and three large language models (GPT-3.5, GPT-4, and Llama3) assigned CAD-RADS categories based on the original radiologists' findings. Two radiologists reviewed the CAD-RADS scores, with only one discrepancy among the 590 reports, indicating high inter-reader agreement. Agreement between human readers and LLMs on CAD-RADS categories was assessed using Gwet's agreement coefficient (AC1). Frequencies of changes in CAD-RADS category assignment that could affect clinical management (CAD-RADS 0-2 vs. 3-5) were calculated and compared: CAD-RADS 0-2 typically indicates mild disease managed conservatively, while CAD-RADS 3-5 warrants further testing or aggressive intervention to prevent adverse cardiovascular events.

Results: Among the 590 reports, agreement between the original and reviewing radiologists was almost perfect (AC1 = 0.985). Good agreement was also found between the original radiologists and Llama3, GPT-3.5, and GPT-4 (AC1 = 0.861, 0.907, and 0.941, respectively). CAD-RADS category upgrades that could affect clinical management occurred in 0 of 590 reports (0%) for human readers, 56 of 590 (9.4%) for Llama3, 33 of 590 (5.5%) for GPT-3.5, and 23 of 590 (3.9%) for GPT-4; downgrades occurred in 8 of 590 (1.4%) for human readers, 19 of 590 (3.2%) for Llama3, 22 of 590 (3.7%) for GPT-3.5, and 12 of 590 (2.0%) for GPT-4.

Conclusion: LLMs showed good agreement with human readers in assigning CAD-RADS categories from CTA reports. However, they were more likely than radiologists to upgrade the CAD-RADS category.
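The two statistics reported above are straightforward to reproduce. Below is a minimal sketch, assuming two equal-length lists of integer CAD-RADS scores (0-5) from a reference reader and a model; the function names and example data are illustrative assumptions, not the authors' actual code. `gwet_ac1` implements Gwet's first-order agreement coefficient for two raters, and `management_changes` counts upgrades and downgrades across the clinically relevant 0-2 vs. 3-5 threshold.

```python
# Hypothetical sketch of the abstract's two statistics; not the authors' code.
from collections import Counter

def gwet_ac1(rater_a, rater_b):
    """Gwet's AC1 for two raters assigning one of Q categories per item.

    Assumes at least two distinct categories appear across both raters.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    # Observed agreement: fraction of items with identical labels.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Average marginal proportion per category across both raters.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    pi = {k: (counts_a[k] + counts_b[k]) / (2 * n) for k in categories}
    # Chance agreement under Gwet's model: (1 / (Q - 1)) * sum_k pi_k (1 - pi_k).
    p_e = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

def management_changes(reference, model):
    """Count upgrades/downgrades across the CAD-RADS 0-2 vs. 3-5 threshold."""
    severe = lambda score: score >= 3
    upgrades = sum(not severe(r) and severe(m) for r, m in zip(reference, model))
    downgrades = sum(severe(r) and not severe(m) for r, m in zip(reference, model))
    return upgrades, downgrades

if __name__ == "__main__":
    # Hypothetical scores: radiologist vs. one LLM over eight reports.
    human = [0, 1, 2, 3, 4, 5, 2, 3]
    model = [0, 1, 3, 3, 4, 5, 2, 2]
    print(f"AC1 = {gwet_ac1(human, model):.3f}")
    print("upgrades/downgrades across 0-2 vs. 3-5:", management_changes(human, model))
```

In this toy run, scoring one mild case (CAD-RADS 2) as severe (3) registers as an upgrade and the reverse as a downgrade, mirroring the counts reported in the Results; AC1 rewards exact category matches while correcting for chance agreement under Gwet's model.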