Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer

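Keywords: Medicine; Classification; Cancer; Radiology; Pancreatic cancer; Natural language processing; Medical physics; Artificial intelligence; Oncology; General surgery; Internal medicine; Computer science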

Authors

Rajesh Bhayana, Bipin Nanda, Taher Dehkharghanian, Yangqing Deng, Nishaant Bhambra, Gavin J.B. Elias, Daksh Datta, Avinash Kambadakone, Chaya Shwaartz, Carol-Anne Moulton, David Henault, Steven Gallinger, Satheesh Krishna

Source

Journal: Radiology [Radiological Society of North America]
Date: 2024-06-01 Volume (issue): 311 (3): e233117 Cited by: 69

Links

nih.gov, doi.org

Identifier

DOI: 10.1148/radiol.233117

Abstract

Background: Structured radiology reports for pancreatic ductal adenocarcinoma (PDAC) improve surgical decision-making over free-text reports, but radiologist adoption is variable. Resectability criteria are applied inconsistently.

Purpose: To evaluate the performance of large language models (LLMs) in automatically creating PDAC synoptic reports from original reports and to explore performance in categorizing tumor resectability.

Materials and Methods: In this institutional review board-approved retrospective study, 180 consecutive PDAC staging CT reports on patients referred to the authors' European Society for Medical Oncology-designated cancer center from January to December 2018 were included. Reports were reviewed by two radiologists to establish the reference standard for 14 key findings and National Comprehensive Cancer Network (NCCN) resectability category. GPT-3.5 and GPT-4 (accessed September 18-29, 2023) were prompted to create synoptic reports from original reports with the same 14 features, and their performance was evaluated (recall, precision, F1 score). To categorize resectability, three prompting strategies (default knowledge, in-context knowledge, chain-of-thought) were used for both LLMs. Hepatopancreaticobiliary surgeons reviewed original and artificial intelligence (AI)-generated reports to determine resectability, with accuracy and review time compared. The McNemar test, t test, Wilcoxon signed-rank test, and mixed effects logistic regression models were used where appropriate.

Results: GPT-4 outperformed GPT-3.5 in the creation of synoptic reports (F1 score: 0.997 vs 0.967, respectively). Compared with GPT-3.5, GPT-4 achieved equal or higher F1 scores for all 14 extracted features. GPT-4 had higher precision than GPT-3.5 for extracting superior mesenteric artery involvement (100% vs 88.8%, respectively). For categorizing resectability, GPT-4 outperformed GPT-3.5 for each prompting strategy. For GPT-4, chain-of-thought prompting was most accurate, outperforming in-context knowledge prompting (92% vs 83%, respectively; P = .002), which outperformed the default knowledge strategy (83% vs 67%, P < .001). Surgeons were more accurate in categorizing resectability using AI-generated reports than original reports (83% vs 76%, respectively; P = .03), while spending less time on each report (58%; 95% CI: 0.53, 0.62).

Conclusion: GPT-4 created near-perfect PDAC synoptic reports from original reports. GPT-4 with chain-of-thought achieved high accuracy in categorizing resectability. Surgeons were more accurate and efficient using AI-generated reports.

© RSNA, 2024. Supplemental material is available for this article. See also the editorial by Chang in this issue.
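
The abstract reports extraction quality as recall, precision, and F1 score, and compares paired accuracies with the McNemar test. As a rough illustration of how such figures are typically computed, here is a minimal Python sketch. The function names and all counts are hypothetical and are not taken from the study, and the abstract does not say whether the authors used the exact or the chi-square form of the McNemar test; the exact binomial form is assumed here.

```python
from math import comb


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts for one extracted feature."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p value.

    b: reports method A categorized correctly and method B incorrectly.
    c: reports method B categorized correctly and method A incorrectly.
    Only these discordant pairs enter the test (assumed exact binomial form).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not study data).
    p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=0)
    print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
    print(f"McNemar exact p={mcnemar_exact(b=0, c=5):.4f}")
```

Because the McNemar test uses only the discordant pairs, two prompting strategies evaluated on the same reports can be compared without assuming the two sets of results are independent.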
