Large Language Model Clinical Vignettes and Multiple-Choice Questions for Postgraduate Medical Education

医学教育高等教育梅德林心理学医学政治学法学

作者

Frank I. Jackson,Nathan Keller,Insaf Kouba,Wassil Kouba,Luis A. Bracero,Matthew J. Blitz

出处

期刊：Academic Medicine [Lippincott Williams & Wilkins]
日期：2025-06-23 卷期号：100 (10): 1163-1166 被引量：6

链接

nih.govdoi.org

标识

DOI：10.1097/acm.0000000000006137

摘要

PROBLEM: Clinical vignette-based multiple-choice questions (MCQs) have been used to assess postgraduate medical trainees but require substantial time and effort to develop. Large language models, a type of artificial intelligence (AI), can potentially expedite this task. This report describes prompt engineering techniques used with ChatGPT-4 to generate clinical vignettes and MCQs for obstetrics-gynecology residents and evaluates whether residents and attending physicians can differentiate between human- and AI-generated content. APPROACH: The authors generated MCQs using a structured prompt engineering approach, incorporating authoritative source documents and an iterative prompt chaining technique, to refine output quality. Fifty human-generated and 50 AI-generated MCQs were randomly arranged into 10 quizzes (10 questions each). The AI-generated MCQs were developed in August 2024 and surveys conducted in September 2024. Obstetrics-gynecology residents and attending physician faculty members at Northwell Health or Donald and Barbara Zucker School of Medicine at Hofstra/Northwell completed an online survey, answering each MCQ and indicating whether they believed it was human or AI written or if they were uncertain. OUTCOMES: Thirty-three participants (16 residents, 17 attendings) completed the survey (80.5% response rate). Respondents correctly identified MCQ authorship a median (interquartile range [IQR]) of 39.1% (30.0%-50.0%) of the time, indicating difficulty in distinguishing human- and AI-generated questions. The median (IQR) correct answer selection rate was 62.3% (50.0%-75.0%) for human-generated MCQs and 64.4% (50.0%-83.3%) for AI-generated MCQs ( P = .74). The difficulty (0.69 vs 0.66, P = .83) and discriminatory (0.42 vs 0.38, P = .90) indexes showed no significant differences, supporting the feasibility of large language model-generated MCQs in medical education. NEXT STEPS: Future studies should explore the optimal balance between AI-generated content and expert review, identifying strategies to maximize efficiency without compromising accuracy. The authors will develop practice exams and assess their predictive validity by comparing scores with standardized exam results.

求助该文献

Large Language Model Clinical Vignettes and Multiple-Choice Questions for Postgraduate Medical Education

今日热心研友