Authors
Christopher Y. K. Williams, Charumathi Raghu Subramanian, Syed Salman Ali, Michael Apolinario, Elisabeth Askin, Peter Barish, Monica Cheng, William James Deardorff, Nisha Donthi, Smitha Ganeshan, Owen Huang, Molly A. Kantor, Andrew Lai, Ashley Manchanda, Kendra A. Moore, Anoop Muniyappa, Geethu Nair, Prashant Patel, Lekshmi Santhosh, Susan Schneider
Abstract
Importance: High-quality discharge summaries are associated with improved patient outcomes but contribute to clinical documentation burden. Large language models (LLMs) provide an opportunity to support physicians by drafting discharge summary narratives.

Objective: To determine whether LLM-generated discharge summary narratives are of comparable quality and safety to those of physicians.

Design, Setting, and Participants: This cross-sectional study, conducted at the University of California, San Francisco, included 100 randomly selected inpatient hospital medicine encounters of 3 to 6 days' duration between 2019 and 2022. The analysis took place in July 2024.

Exposure: A blinded evaluation of physician- and LLM-generated narratives was performed in duplicate by 22 attending physician reviewers.

Main Outcomes and Measures: Narratives were reviewed for overall quality, reviewer preference, comprehensiveness, concision, coherence, and 3 error types (inaccuracies, omissions, and hallucinations). Each individual error, and each narrative overall, was assigned a potential-harmfulness score ranging from 0 to 7 on an adapted Agency for Healthcare Research and Quality scale.

Results: Across 100 encounters, LLM- and physician-generated narratives were comparable in overall quality on a Likert scale ranging from 1 to 5 (higher scores indicate higher quality; mean [SD] score, 3.67 [0.49] vs 3.77 [0.57]; P = .21) and in reviewer preference (χ² = 5.2; P = .27). LLM-generated narratives were more concise (mean [SD] score, 4.01 [0.37] vs 3.70 [0.59]; P < .001) and more coherent (mean [SD] score, 4.16 [0.39] vs 4.01 [0.53]; P = .02) than their physician-generated counterparts, but less comprehensive (mean [SD] score, 3.72 [0.58] vs 4.13 [0.58]; P < .001). LLM-generated narratives contained more unique errors (mean [SD], 2.91 [2.54] errors per summary) than physician-generated narratives (mean [SD], 1.82 [1.94] errors per summary). There was no significant difference in the potential for harm between LLM- and physician-generated narratives across individual errors (mean [SD] score, 1.35 [1.07] vs 1.34 [1.05]; P = .99); 6 and 5 individual errors, respectively, scored 4 (potential for permanent harm) or greater. Both LLM- and physician-generated narratives had low overall potential for harm (scores <1 on the 0-7 scale), although LLM-generated narratives scored higher than physician-generated narratives (mean [SD] score, 0.84 [0.98] vs 0.36 [0.70]; P < .001), and only 1 LLM-generated narrative (vs 0 physician-generated narratives) scored 4 or greater.

Conclusions and Relevance: In this cross-sectional study of 100 inpatient hospital medicine encounters, LLM-generated discharge summary narratives were comparable in quality to, and preferred as often as, those generated by physicians. LLM-generated narratives were more likely to contain errors but had low overall harmfulness scores. These results suggest that, in clinical practice, LLM-generated narratives reviewed by a human may offer hospitalists a viable drafting option.
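For readers who want to see how comparisons of this kind are typically computed, the sketch below simulates the quality and preference analyses reported in the Results. It is illustrative only: the scores are randomly generated to match the reported means and SDs, and the abstract does not state which tests the authors used, so a two-sample t-test and a chi-square goodness-of-fit test (five preference levels, consistent with the reported df of 4) are assumed here.

```python
# Illustrative sketch, NOT the authors' analysis code. All data below are
# simulated; the test choices are assumptions, since the abstract does not
# specify the statistical procedures.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-narrative overall quality scores (1-5 Likert) for the
# 100 encounters, drawn to approximate the reported means and SDs
# (LLM: 3.67 [0.49]; physician: 3.77 [0.57]), clipped to the scale range.
llm_quality = rng.normal(3.67, 0.49, 100).clip(1, 5)
md_quality = rng.normal(3.77, 0.57, 100).clip(1, 5)

# Assumed comparison: unpaired two-sample t-test on mean quality.
t_stat, p_val = stats.ttest_ind(llm_quality, md_quality)
print(f"Quality: t = {t_stat:.2f}, P = {p_val:.2f}")

# Hypothetical counts across five preference levels (e.g., strongly prefer
# LLM ... strongly prefer physician). The reported χ² = 5.2 with P = .27
# implies df = 4, i.e., five categories; a goodness-of-fit test against a
# uniform expectation is assumed.
pref_counts = np.array([28, 15, 20, 21, 16])
chi2, p = stats.chisquare(pref_counts)
print(f"Preference: chi2 = {chi2:.1f}, P = {p:.2f}")
```

With paired duplicate reviews of the same encounters, a paired or nonparametric test (e.g., Wilcoxon signed-rank) could equally well have been used; the full article's methods section would determine which.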