Computer Science
Human-in-the-loop
Accountability
Auditing
Scalability
Workflow
Artificial Intelligence
Data Science
Database
Political Science
Economics
Management
Law
Author
Charles MacDonald Burke
Identifier
DOI: 10.3390/educsci15081029
Abstract
Educational assessment relies on well-constructed test items to measure student learning accurately, yet traditional item development is time-consuming and demands specialized psychometric expertise. Automatic item generation (AIG) offers template-based scalability, and recent advances in large language models (LLMs) promise to democratize item creation. However, fully automated approaches risk introducing factual errors, bias, and uneven difficulty. To address these challenges, we propose and evaluate a hybrid human-in-the-loop (HITL) framework for AIG that combines psychometric rigor with the linguistic flexibility of LLMs. In a Spring 2025 case study at Franklin University Switzerland, the instructor collaborated with ChatGPT (o4-mini-high) to generate parallel exam variants for two undergraduate business courses: Quantitative Reasoning and Data Mining. The instructor began by defining “radical” and “incidental” parameters to guide the model. Through iterative cycles of prompting, review, and refinement, the instructor validated content accuracy, calibrated difficulty, and mitigated bias. All interactions (including prompt templates, AI outputs, and human edits) were systematically documented, creating a transparent audit trail. Our findings demonstrate that a HITL approach to AIG can produce diverse, psychometrically equivalent exam forms with reduced development time, while preserving item validity and fairness, and potentially reducing cheating. This offers a replicable pathway for harnessing LLMs in educational measurement without sacrificing quality, equity, or accountability.
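The paper itself includes no code; the minimal Python sketch below is only one assumed rendering of the workflow the abstract describes: “radical” parameters that change what an item measures, “incidental” parameters that change only surface features, and an audit trail that logs every generated item. All identifiers here (TEMPLATE, RADICALS, INCIDENTALS, review) are hypothetical, and the review step is a placeholder for the instructor's human-in-the-loop validation.

```python
import json
import random
from datetime import datetime, timezone

# Hypothetical item template for a quantitative-reasoning question.
# "Radical" parameters drive the computation and thus the item's
# difficulty; "incidental" parameters (names, products) only vary the
# surface story and should leave difficulty unchanged.
TEMPLATE = (
    "{name} buys {qty} units of {product} at {price} CHF each. "
    "What is the total cost?"
)

RADICALS = {"qty": [3, 6, 12], "price": [4.50, 7.25, 9.90]}
INCIDENTALS = {"name": ["Ana", "Luca", "Mei"],
               "product": ["notebooks", "cables", "mugs"]}


def generate_variant(rng: random.Random) -> dict:
    """Instantiate one exam variant and compute its answer key."""
    radical = {k: rng.choice(v) for k, v in RADICALS.items()}
    incidental = {k: rng.choice(v) for k, v in INCIDENTALS.items()}
    stem = TEMPLATE.format(**radical, **incidental)
    answer = round(radical["qty"] * radical["price"], 2)
    return {"stem": stem, "answer": answer,
            "radical": radical, "incidental": incidental}


def review(item: dict) -> bool:
    """Stand-in for the human check: in the study, the instructor
    validates accuracy, calibrates difficulty, and screens for bias."""
    return item["answer"] > 0  # trivially accept; a human would inspect the stem


def main() -> None:
    rng = random.Random(2025)  # fixed seed keeps parallel forms reproducible
    audit_trail = []  # every generated item is logged, accepted or not
    accepted = []
    while len(accepted) < 3:
        item = generate_variant(rng)
        ok = review(item)
        audit_trail.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "item": item,
            "accepted": ok,
        })
        if ok:
            accepted.append(item)
    for entry in audit_trail:
        print(json.dumps(entry, ensure_ascii=False))


if __name__ == "__main__":
    main()
```

Logging rejected items alongside accepted ones is what makes the trail an audit trail in the abstract's sense: the record shows not just the final exam forms but every prompt outcome and human decision that produced them.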