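Equivalence (formal languages)
Medicine
Quality (philosophy)
Rating scale
Medical education
Artificial intelligence
Natural language processing
Computer science
Linguistics
Psychology
Epistemology
Developmental psychology
Philosophy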
Authors
Mondira Ray, Daniel J. Kats, Joss Moorkens, Dinesh Rai, Nawar M. Shaar, D Quinones, Amanda VerMeulen, Camila M. Mateo, Ryan Brewster, Alisa Khan, Benjamin Rader, John S. Brownstein, Jonathan D. Hron
Identifier
DOI:10.1001/jamapediatrics.2025.1729
Abstract
Importance: Patients and caregivers who use languages other than English in the US encounter barriers to accessing language-concordant written instructions after clinical visits. Large language models (LLMs), such as OpenAI’s GPT-4o, may improve access to translated patient materials; however, rigorous evaluation is needed to ensure clinical standards are met.

Objective: To determine whether GPT-4o can generate high-quality Spanish translations of personalized patient instructions comparable to those performed by professional human translators.

Design, Setting, and Participants: This cross-sectional study compared LLM translations to professional human translations using equivalence testing. The personalized pediatric instructions used were derived from real clinical encounters at a large US academic medical center and translated between January 2023 and December 2023. Patient instructions in English were translated into Spanish by GPT-4o and professional human translators. The source English texts were translated using GPT-4o on August 2, 2024. Both sets of translations were evaluated by 3 independent professional medical translators.

Exposure: Patient instructions were translated using GPT-4o with an engineered prompt, and these translations were compared with those produced by professional human translators.

Main Outcomes and Measures: The primary outcome was translation quality, assessed using the Multidimensional Quality Metrics (MQM) framework to generate an overall MQM score (rated on a 0-100 scale). Secondary outcomes included a general preference rating and error rates for types of translation errors.

Results: This study included 20 source files of pediatric patient instructions. Equivalence testing showed no significant difference in translation quality between GPT-4o and human translations, with a mean difference of 1.6 points (90% CI, 0.7-2.5), falling within a predefined equivalence margin of plus or minus 5 MQM points. The LLM yielded fewer mistranslation errors, and a mean (SE) of 52% (6%) of professional translator ratings preferred the LLM translations.

Conclusions and Relevance: In this cross-sectional study, GPT-4o generated Spanish translations of pediatric patient instructions that were comparable in quality to those by professional human translators as evaluated using a standardized framework. While human review of LLM translation remains essential in health care, these findings suggest that GPT-4o could reduce the translation workload for Spanish, potentially freeing resources to support languages of lesser diffusion.
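
The equivalence criterion reported in the Results can be illustrated with a minimal Python sketch. It checks only whether a confidence interval for the mean difference lies entirely inside a symmetric equivalence margin, using the figures quoted in the abstract (90% CI, 0.7-2.5; margin of plus or minus 5 MQM points). The function name `ci_within_margin` is hypothetical, and this is not the authors' analysis code; the abstract does not say how the interval itself was computed.

```python
def ci_within_margin(ci_lower: float, ci_upper: float, margin: float) -> bool:
    """Return True if the whole interval lies strictly inside (-margin, +margin).

    Under the confidence-interval approach to equivalence testing, a 90% CI
    that falls entirely within the margin supports equivalence.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    return -margin < ci_lower and ci_upper < margin


# Values taken from the abstract: 90% CI of 0.7 to 2.5, margin of 5 MQM points.
equivalent = ci_within_margin(0.7, 2.5, 5.0)
print(f"Interval within equivalence margin: {equivalent}")
```

An interval that crossed either bound, for example 0.7 to 5.5 against the same margin, would fail this check and equivalence could not be claimed.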