Authors
Ish Talati, Juan Manuel Zambrano Chaves, Avisha Das, Imon Banerjee, Daniel L. Rubin
Abstract
BACKGROUND. The increasing complexity and volume of radiology reports present challenges for timely communication of critical findings.

OBJECTIVE. The purpose of this study was to evaluate the performance of two out-of-the-box large language models (LLMs) in detecting and classifying critical findings in radiology reports by use of various prompt strategies.

METHODS. The analysis included 252 radiology reports of varying modalities and anatomic regions that were extracted from the MIMIC-III (Medical Information Mart for Intensive Care) database and divided into a prompt-engineering tuning set of 50 reports, a holdout test set of 125 reports, and a pool of 77 remaining reports used as examples for few-shot prompting. An external test set of 180 chest radiography reports was extracted from the CheXpert Plus database. Reports were manually reviewed to identify critical findings and to classify each finding into one of three categories (true critical findings, known/expected critical findings, and equivocal critical findings). After prompt engineering with various prompt strategies, a final prompt for optimal detection of true critical findings was selected. Two general-purpose LLMs, GPT-4 and Mistral-7B, processed reports in the test sets by use of the final prompt. Evaluation included automated text similarity metrics (BLEU-1 [Bilingual Evaluation Understudy], ROUGE-F1 [Recall-Oriented Understudy for Gisting Evaluation with F1], and G-Eval) and manual performance metrics (precision and recall).

RESULTS. For true critical findings, zero-shot, few-shot static (five examples), and few-shot dynamic (five examples) prompting yielded BLEU-1 of 0.691, 0.778, and 0.748; ROUGE-F1 of 0.706, 0.797, and 0.773; and G-Eval of 0.428, 0.573, and 0.516, respectively. Precision and recall for true critical findings, known/expected critical findings, and equivocal critical findings, respectively, were as follows: 90.1% and 86.9%, 80.9% and 85.0%, and 80.5% and 94.3% in the holdout test set for GPT-4; 75.6% and 77.4%, 34.1% and 70.0%, and 41.3% and 74.3% in the holdout test set for Mistral-7B; 82.6% and 98.3%, 76.9% and 71.4%, and 70.8% and 85.0% in the external test set for GPT-4; and 75.0% and 93.1%, 33.3% and 92.9%, and 34.0% and 80.0% in the external test set for Mistral-7B.

CONCLUSION. Out-of-the-box LLMs were used to detect and classify arbitrary numbers of critical findings in radiology reports. The optimal prompting strategy for detecting true critical findings was the few-shot static approach.

CLINICAL IMPACT. The study shows that contemporary general-purpose LLMs can be adapted to specialized medical tasks with minimal data annotation.
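The METHODS and RESULTS sections describe few-shot static prompting (five fixed example reports) followed by automated scoring with BLEU-1 and ROUGE F1. The sketch below illustrates what such a pipeline could look like; it is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the OpenAI Chat Completions API for the GPT-4 side, the nltk and rouge-score packages for scoring, and hypothetical prompt text, example reports, and reference annotations; the actual prompt wording, example selection, ROUGE variant, and the Mistral-7B and G-Eval components are not specified in the abstract and are omitted here.

```python
# Minimal sketch of few-shot static prompting for critical-finding extraction
# plus automated scoring with BLEU-1 and ROUGE-1 F1. All prompt text, example
# reports, and reference annotations below are hypothetical placeholders.
from openai import OpenAI
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# In the study, five static examples would come from the pool of 77 annotated
# reports; two placeholder examples are shown here for brevity.
FEW_SHOT_EXAMPLES = [
    {"report": "CT head: acute subdural hematoma with midline shift.",
     "findings": "True critical finding: acute subdural hematoma with midline shift."},
    {"report": "Chest radiograph: no acute cardiopulmonary abnormality.",
     "findings": "No critical findings."},
]

def build_messages(report_text: str) -> list[dict]:
    """Assemble a few-shot static prompt: instructions, fixed examples, then the new report."""
    messages = [{"role": "system",
                 "content": ("Identify critical findings in the radiology report and classify "
                             "each as a true, known/expected, or equivocal critical finding.")}]
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": ex["report"]})
        messages.append({"role": "assistant", "content": ex["findings"]})
    messages.append({"role": "user", "content": report_text})
    return messages

def extract_findings(report_text: str) -> str:
    """Send one report through GPT-4 with the static few-shot prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=build_messages(report_text),
        temperature=0,
    )
    return response.choices[0].message.content

def score_against_reference(prediction: str, reference: str) -> dict:
    """Compute BLEU-1 and ROUGE-1 F1 between the model output and the manual annotation."""
    bleu1 = sentence_bleu([reference.split()], prediction.split(), weights=(1.0, 0, 0, 0))
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    rouge1_f1 = scorer.score(reference, prediction)["rouge1"].fmeasure
    return {"bleu1": bleu1, "rouge1_f1": rouge1_f1}
```

In a workflow like the one the abstract describes, scores of this kind would be computed on the tuning set to compare zero-shot, few-shot static, and few-shot dynamic prompts before the selected prompt is applied to the holdout and external test sets, with precision and recall then assessed manually per finding category.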