Medical education
Higher education
MEDLINE
Psychology
Medicine
Political science
Law
Authors
Frank I. Jackson,Nathan Keller,Insaf Kouba,Wassil Kouba,Luis A. Bracero,Matthew J. Blitz
Source
Journal: Academic Medicine
[Lippincott Williams & Wilkins]
Date: 2025-06-23
Volume (issue): 100 (10): 1163-1166
Citations: 6
Identifier
DOI:10.1097/acm.0000000000006137
Abstract
PROBLEM: Clinical vignette-based multiple-choice questions (MCQs) have been used to assess postgraduate medical trainees but require substantial time and effort to develop. Large language models, a type of artificial intelligence (AI), can potentially expedite this task. This report describes prompt engineering techniques used with ChatGPT-4 to generate clinical vignettes and MCQs for obstetrics-gynecology residents and evaluates whether residents and attending physicians can differentiate between human- and AI-generated content.
APPROACH: The authors generated MCQs using a structured prompt engineering approach, incorporating authoritative source documents and an iterative prompt chaining technique, to refine output quality. Fifty human-generated and 50 AI-generated MCQs were randomly arranged into 10 quizzes (10 questions each). The AI-generated MCQs were developed in August 2024 and surveys conducted in September 2024. Obstetrics-gynecology residents and attending physician faculty members at Northwell Health or Donald and Barbara Zucker School of Medicine at Hofstra/Northwell completed an online survey, answering each MCQ and indicating whether they believed it was human or AI written or if they were uncertain.
OUTCOMES: Thirty-three participants (16 residents, 17 attendings) completed the survey (80.5% response rate). Respondents correctly identified MCQ authorship a median (interquartile range [IQR]) of 39.1% (30.0%-50.0%) of the time, indicating difficulty in distinguishing human- and AI-generated questions. The median (IQR) correct answer selection rate was 62.3% (50.0%-75.0%) for human-generated MCQs and 64.4% (50.0%-83.3%) for AI-generated MCQs (P = .74). The difficulty (0.69 vs 0.66, P = .83) and discriminatory (0.42 vs 0.38, P = .90) indexes showed no significant differences, supporting the feasibility of large language model-generated MCQs in medical education.
NEXT STEPS: Future studies should explore the optimal balance between AI-generated content and expert review, identifying strategies to maximize efficiency without compromising accuracy. The authors will develop practice exams and assess their predictive validity by comparing scores with standardized exam results.
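The abstract reports difficulty and discriminatory indexes without stating how they were computed. As an illustration only, the sketch below uses one common textbook convention: difficulty as the proportion of respondents answering an item correctly, and discrimination as the difference in that proportion between the highest- and lowest-scoring groups. The function name `item_indices`, the 0/1 response matrix, and the 27% group fraction are assumptions made for this example and are not taken from the article.

```python
from typing import List, Tuple


def item_indices(
    responses: List[List[int]], item: int, group_frac: float = 0.27
) -> Tuple[float, float]:
    """Return (difficulty, discrimination) for one item.

    responses[p][i] is 1 if participant p answered item i correctly, else 0.
    Assumes every participant answered every item (a simplification; the
    study split its questions across several quizzes).
    """
    n = len(responses)

    # Difficulty index: proportion of all participants who got the item right.
    difficulty = sum(row[item] for row in responses) / n

    # Rank participants by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)

    # Size of the upper and lower comparison groups (assumed 27% by default).
    k = max(1, round(n * group_frac))
    upper, lower = ranked[:k], ranked[-k:]

    # Discrimination index: proportion correct in the upper group minus
    # proportion correct in the lower group.
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return difficulty, p_upper - p_lower
```

Under this convention, a difficulty near 0.69 means roughly 69% of respondents answered correctly, and a discrimination near 0.4 means higher scorers answered the item correctly noticeably more often than lower scorers.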