Evaluating a Large Language Model in Translating Patient Instructions to Spanish Using a Standardized Framework

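Keywords: Equivalence (formal languages); Medicine; Quality (philosophy); Rating scale; Medical education; Artificial intelligence; Natural language processing; Computer science; Linguistics; Psychology; Epistemology; Developmental psychology; Philosophy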

Authors

Mondira Ray, Daniel J. Kats, Joss Moorkens, Dinesh Rai, Nate Shaar, D Quinones, Amanda VerMeulen, Camila M. Mateo, Ryan Brewster, Alisa Khan, Benjamin Rader, John S. Brownstein, Jonathan D. Hron

Source

Journal: JAMA Pediatrics [American Medical Association]
Date: 2025-07-07; Volume (issue): 179 (9): 1026-1026; Citations: 5

Identifiers

DOI: 10.1001/jamapediatrics.2025.1729

Abstract

Importance: Patients and caregivers who use languages other than English in the US encounter barriers to accessing language-concordant written instructions after clinical visits. Large language models (LLMs), such as OpenAI’s GPT-4o, may improve access to translated patient materials; however, rigorous evaluation is needed to ensure clinical standards are met.

Objective: To determine whether GPT-4o can generate high-quality Spanish translations of personalized patient instructions comparable to those performed by professional human translators.

Design, Setting, and Participants: This cross-sectional study compared LLM translations to professional human translations using equivalence testing. The personalized pediatric instructions used were derived from real clinical encounters at a large US academic medical center and translated between January 2023 and December 2023. Patient instructions in English were translated into Spanish by GPT-4o and professional human translators. The source English texts were translated using GPT-4o on August 2, 2024. Both sets of translations were evaluated by 3 independent professional medical translators.

Exposure: Patient instructions were translated using GPT-4o with an engineered prompt, and these translations were compared with those produced by professional human translators.

Main Outcomes and Measures: The primary outcome was translation quality, assessed using the Multidimensional Quality Metrics (MQM) framework to generate an overall MQM score (rated on a 0-100 scale). Secondary outcomes included a general preference rating and error rates for types of translation errors.

Results: This study included 20 source files of pediatric patient instructions. Equivalence testing showed no significant difference in translation quality between GPT-4o and human translations, with a mean difference of 1.6 points (90% CI, 0.7-2.5), falling within a predefined equivalence margin of plus or minus 5 MQM points. The LLM yielded fewer mistranslation errors, and a mean (SE) of 52% (6%) of professional translator ratings preferred the LLM translations.

Conclusions and Relevance: In this cross-sectional study, GPT-4o generated Spanish translations of pediatric patient instructions that were comparable in quality to those by professional human translators as evaluated using a standardized framework. While human review of LLM translation remains essential in health care, these findings suggest that GPT-4o could reduce the translation workload for Spanish, potentially freeing resources to support languages of lesser diffusion.
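
The equivalence decision reported in the Results can be illustrated with a short sketch. This is not the authors' analysis code. It assumes the common rule that equivalence is concluded when the 90% confidence interval for the mean difference lies entirely inside the predefined margin, which matches how the abstract describes the result. The names `EquivalenceCheck` and `check_equivalence` are hypothetical, and only the figures quoted in the abstract (90% CI 0.7 to 2.5, margin of 5 MQM points) are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EquivalenceCheck:
    """Holds a confidence interval for a mean difference and a symmetric margin."""

    ci_low: float
    ci_high: float
    margin: float

    @property
    def equivalent(self) -> bool:
        # Equivalence is concluded only if the whole interval
        # sits strictly inside (-margin, +margin).
        return -self.margin < self.ci_low and self.ci_high < self.margin


def check_equivalence(ci_low: float, ci_high: float, margin: float) -> EquivalenceCheck:
    """Validate the inputs and build the check (hypothetical helper)."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    return EquivalenceCheck(ci_low=ci_low, ci_high=ci_high, margin=margin)


# Figures quoted in the abstract: 90% CI 0.7 to 2.5, margin of 5 MQM points.
reported = check_equivalence(ci_low=0.7, ci_high=2.5, margin=5.0)
print(reported.equivalent)
```

With the reported interval the check passes because both bounds lie inside the margin. An interval reaching beyond 5 points in either direction would fail it, even if the point estimate itself were small.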
