Medicine
Natural Language Processing
Radiology
Artificial Intelligence
Medical Physics
Speech Recognition
Computer Science
Authors
Reuben Schmidt, Jarrel Seah, Ke Cao, Louis Lim, Wei Xiang Lim, Justin Yeung
Source
Journal: Radiology: Artificial Intelligence
[Radiological Society of North America]
Date: 2024-01-24
Abstract
"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content. This study evaluated the ability of generative large language models (LLMs) to detect speech recognition errors in radiology reports. A dataset of 3,233 CT and MRI reports was assessed by radiologists for speech recognition errors. Errors were categorized as clinically significant or not clinically significant. Performances of five generative LLMs-GPT-3.5-turbo, GPT-4, text-davinci-003, Llama-v2-70B-chat, and Bard-were compared in detecting these errors, using manual error detection as the reference standard. Prompt engineering was used to optimize model performance. GPT-4 demonstrated high accuracy in detecting clinically significant errors (precision 76.9%, recall 100%, F1 86.9%) and not clinically significant errors (93.9% precision, 94.7% recall, 94.3% F1). Text-davinci-003 achieved F1 scores of 72% and 46.6% for clinically significant and not clinically significant errors, respectively. GPT-3.5-turbo obtained 59.1% and 32.2% F1 scores, while Llama-v2-70B-chat scored 72.8% and 47.7%. Bard showed the lowest accuracy, with F1 scores of 47.5% and 20.9%. GPT-4 effectively identified challenging errors of nonsense phrases and internally inconsistent statements. Longer reports, resident dictation, and overnight shifts were associated with higher error rates. In conclusion, advanced generative LLMs show potential for automatic detection of speech recognition errors in radiology reports. ©RSNA, 2024.