Evaluating ChatGPT-3.5 and Claude-2 in Answering and Explaining Conceptual Medical Physiology Multiple-Choice Questions

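Set (abstract data type), Comprehension, Multiple choice, Artificial intelligence, Process (computing), Computer science, Test (biology), Medicine, Medical education, Significant difference, Internal medicine, Operating system, Paleontology, Biology, Programming language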

Authors

Mayank Agarwal, Ayan Goswami, Priyanka Sharma

Source

Journal: Cureus [Cureus, Inc.]
Date: 2023-09-29; Volume (Issue): 15 (9): e46222-e46222; Citations: 31

Links

cureus.com nih.gov nih.gov nih.gov doi.org

Identifiers

DOI: 10.7759/cureus.46222

Abstract

Background: Generative artificial intelligence (AI) systems such as ChatGPT-3.5 and Claude-2 may assist in explaining complex medical science topics. A few studies have shown that AI can solve complicated physiology problems that require critical thinking and analysis. However, further studies are required to validate the effectiveness of AI in answering conceptual multiple-choice questions (MCQs) in human physiology.
Objective: This study aimed to evaluate and compare the proficiency of ChatGPT-3.5 and Claude-2 in answering and explaining a curated set of MCQs in medical physiology.
Methods: In this cross-sectional study, a set of 55 MCQs covering 10 competencies of medical physiology was purposefully constructed; solving them required comprehension, problem-solving, and analytical skills. The MCQs and a structured prompt for response generation were presented to ChatGPT-3.5 and Claude-2. The explanations provided by both AI systems were documented in an Excel spreadsheet. All three authors rated these explanations on a scale of 0 to 3: 0 for an incorrect explanation, 1 for a partially correct explanation, 2 for a correct explanation with some aspects missing, and 3 for a perfectly correct explanation. Both AI models were evaluated for their ability to choose the correct answer (option) and to provide clear and comprehensive explanations of the MCQs. The Mann-Whitney U test was used to compare AI responses. The Fleiss multi-rater kappa (κ) was used to determine the score agreement among the three raters. The statistical significance level was set at P ≤ 0.05.
Results: Claude-2 answered 40 MCQs correctly, which was significantly higher than the 26 correct responses from ChatGPT-3.5. The rating distribution for the explanations generated by Claude-2 was significantly higher than that of ChatGPT-3.5. The κ values were 0.804 and 0.818 for Claude-2 and ChatGPT-3.5, respectively.
Conclusion: In terms of answering and elucidating conceptual MCQs in medical physiology, Claude-2 surpassed ChatGPT-3.5. However, accessing Claude-2 from India requires the use of a virtual private network, which may raise security concerns.
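
The two statistics named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names `fleiss_kappa` and `mann_whitney_u` and the small rating lists are hypothetical, chosen only to show how agreement among three raters on a 0 to 3 scale and a rank-based comparison of two score sets could be computed. The Mann-Whitney function returns only the U statistic; a P value would normally come from a statistics package.

```python
from collections import Counter


def fleiss_kappa(ratings, categories=(0, 1, 2, 3)):
    """Fleiss' kappa for a list of items, each a list with one score per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # How many raters chose each category, per item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    shares = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_expected = sum(s ** 2 for s in shares)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


def mann_whitney_u(x, y):
    """U statistic for sample x against sample y; tied pairs count as 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )


if __name__ == "__main__":
    # Hypothetical ratings: four explanations, three raters each (not study data).
    example_ratings = [[3, 3, 3], [2, 2, 3], [0, 0, 0], [1, 2, 1]]
    print("Fleiss kappa:", fleiss_kappa(example_ratings))

    # Hypothetical per-question scores for two models (not study data).
    model_a = [3, 3, 2, 1]
    model_b = [1, 2, 0, 0]
    print("Mann-Whitney U:", mann_whitney_u(model_a, model_b))
```

In this sketch, κ close to 1 indicates near-complete agreement among raters, and values near 0 indicate agreement no better than chance.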
