Fine‐tuning open‐source large language models to improve their performance on radiation oncology tasks: A feasibility study to investigate their potential clinical applications in radiation oncology

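Keywords: Radiation oncology; Modality (human-computer interaction); Medical physics; Computer science; Radiation treatment planning; Task (project management); Medicine; Ranking (information retrieval); Regimen; Radiation therapy; Artificial intelligence; Internal medicine; Management; Economics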

Authors

P. Wang,Zhengliang Liu,Yiwei Li,Jason Holmes,Peng Shu,Lian Zhang,Xiang Li,Quanzheng Li,Brady Laughlin,Diego Santos Toesca,Carlos Vargas,Sujay A. Vora,Samir H. Patel,Terence T. Sio,Tianming Liu,Wei Liu

Source

Journal: Medical Physics [Wiley]
Date: 2025-07-01 Volume (issue): 52 (7)

Links

arxiv.org, nih.gov, doi.org

Identifiers

DOI: 10.1002/mp.17985

Abstract

Background: Clinical practice in radiation oncology involves many steps that rely on the dynamic interplay of abundant text data. Large language models (LLMs) have displayed remarkable capabilities in processing complex text information, but their direct application in specific fields like radiation oncology remains underexplored.

Purpose: This study investigates whether fine‐tuning LLMs with domain knowledge can improve performance on three radiation oncology tasks: Task (1) treatment regimen generation, Task (2) treatment modality selection (photon, proton, electron, or brachytherapy), and Task (3) ICD‐10 code prediction.

Methods: Data for 15 724 patient cases were extracted. Cases in which the patient had a single diagnostic record and a clearly identifiable primary treatment plan were selected for preprocessing and manual annotation, yielding 7903 cases, each with a patient diagnosis, treatment plan, treatment modality, and ICD‐10 code. Each case was used to construct a pair consisting of the patient's diagnostic details and an answer (treatment regimen, treatment modality, or ICD‐10 code, respectively) for supervised fine‐tuning on the three tasks. The open‐source LLaMA2‐7B and Mistral‐7B models were fine‐tuned with the Low‐Rank Approximations method. Accuracy and ROUGE‐1 score were reported for the fine‐tuned and original models. Radiation oncologists performed a clinical evaluation on Task (1), while precision, recall, and F1 score were evaluated for Tasks (2) and (3). One‐sided Wilcoxon signed‐rank tests were used to analyze the results statistically.

Results: Fine‐tuned LLMs outperformed the original LLMs across all tasks (p ≤ 0.001). Clinical evaluation showed that over 60% of the treatment regimens generated by the fine‐tuned LLMs were clinically acceptable. Precision, recall, and F1 score also showed improved performance for the fine‐tuned LLMs.

Conclusion: Fine‐tuned LLMs demonstrated statistically significant improvements over the original LLMs on three clinically important tasks in radiation oncology. This study explored the feasibility of applying fine‐tuned LLMs in radiation oncology and may inspire further development of LLMs to assist with radiation oncology tasks.
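The metrics named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the whitespace tokenization, the function names `rouge1_f1` and `per_class_prf`, and the example strings and labels are all assumptions made for illustration. ROUGE‐1 is computed here as the F1 of clipped unigram overlap, and precision, recall, and F1 are computed for one label treated as the positive class.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1 from clipped unigram overlap (assumes whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each unigram counted at most min(count) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 with `label` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical data, invented for illustration only (not from the study).
reference_regimen = "60 Gy in 30 fractions"
generated_regimen = "60 Gy in 20 fractions"
regimen_score = rouge1_f1(generated_regimen, reference_regimen)

true_modality = ["photon", "proton", "photon", "electron"]
pred_modality = ["photon", "photon", "photon", "electron"]
photon_prf = per_class_prf(true_modality, pred_modality, "photon")
proton_prf = per_class_prf(true_modality, pred_modality, "proton")
```

In this toy example the generated regimen shares four of five unigrams with the reference, which also shows why the abstract pairs ROUGE‐1 with clinician review for Task (1): a regimen can score highly on word overlap while differing in a clinically important number.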
