Open-Weight Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports: Assessment of Approaches and Parameters

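Keywords: Computer science; Data extraction; Artificial intelligence; Natural language processing; Benchmark (surveying); Inference; Information retrieval; Machine learning; MEDLINE; Political science; Law; Geodesy; Geography
Authors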
Mohamed Sobhi Jabal,Pranav Warman,Jikai Zhang,Kartikeye Gupta,Ayush Jain,Maciej A. Mazurowski,Walter F. Wiggins,Kirti Magudia,Evan Calabrese
Source
Journal: Radiology [Radiological Society of North America]
Citations: 1
Identifier
DOI:10.1148/ryai.240551
Abstract

“Just Accepted” papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content.

Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weights language models (LMs) and retrieval augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance.

Materials and Methods: This retrospective study used two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for IDH mutation status (January 2017 to July 2021). An automated pipeline was developed to benchmark the accuracy of various LMs and RAG configurations in extracting structured data from reports. The impact of model size, quantization, prompting strategies, output formatting, and inference parameters on model accuracy was systematically evaluated.

Results: The best-performing models achieved up to 98% accuracy in extracting BT-RADS scores from radiology reports and over 90% accuracy in extracting IDH mutation status from pathology reports. The best model was a medical fine-tuned Llama 3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models (mean accuracy, 86% versus 75%; P < .001). Model quantization had minimal impact on performance. Few-shot prompting significantly improved accuracy (mean increase, 32% ± 32%; P = .02). RAG improved performance for complex pathology reports (+48% ± 11%; P = .001) but not for shorter radiology reports (-8% ± 31%; P = .39).

Conclusion: This study demonstrates the potential of open LMs for automated extraction of structured clinical data from unstructured clinical reports, with local, privacy-preserving application. Careful model selection, prompt engineering, and semiautomated optimization using annotated data are critical for optimal performance. ©RSNA, 2025
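The abstract describes benchmarking few-shot prompting by extraction accuracy against annotated labels. The following is a minimal Python sketch of that idea, not the authors' pipeline: the example reports, the prompt wording, the score pattern (a digit 0-4 with an optional a-c suffix), and the names `build_prompt`, `parse_score`, `accuracy`, and `toy_model` are all illustrative assumptions. `toy_model` is a stand-in for a call to a locally hosted LM.

```python
import re
from typing import Callable, Optional

# Hypothetical few-shot examples (report text, annotated label); not from the study.
FEW_SHOT = [
    ("Stable postoperative changes. BT-RADS score: 2.", "2"),
    ("Increased enhancement, favor treatment effect. BT-RADS 3a.", "3a"),
]

def build_prompt(report: str, examples=FEW_SHOT) -> str:
    """Assemble a few-shot prompt: instruction, worked examples, then the target report."""
    parts = ["Extract the BT-RADS score from the report. Answer with the score only."]
    for text, label in examples:
        parts.append(f"Report: {text}\nScore: {label}")
    parts.append(f"Report: {report}\nScore:")
    return "\n\n".join(parts)

def parse_score(output: str) -> Optional[str]:
    """Return the first score-like token in the model output, or None if absent."""
    match = re.search(r"\b([0-4][a-c]?)\b", output.lower())
    return match.group(1) if match else None

def accuracy(model: Callable[[str], str], dataset) -> float:
    """Fraction of reports whose parsed score equals the annotated label."""
    correct = sum(parse_score(model(build_prompt(r))) == gold for r, gold in dataset)
    return correct / len(dataset)

def toy_model(prompt: str) -> str:
    """Stand-in for a local LM: reports the score stated in the final report, if any."""
    target = prompt.rsplit("Report:", 1)[1]
    found = re.search(r"bt-rads(?: score)?:?\s*([0-4][a-c]?)", target.lower())
    return f"Score: {found.group(1)}" if found else "unknown"

if __name__ == "__main__":
    # Hypothetical annotated reports; the third has no stated score, so it is missed.
    dataset = [
        ("Findings improved. BT-RADS score: 1a.", "1a"),
        ("Worsening mass effect. BT-RADS 4.", "4"),
        ("No score documented in this report.", "0"),
    ]
    print(f"accuracy = {accuracy(toy_model, dataset):.2f}")
```

Swapping `toy_model` for a real inference call, and varying `FEW_SHOT` (including an empty list for zero-shot), would allow the same `accuracy` function to compare prompting configurations on an annotated set.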