[Preliminary exploration of the applications of five large language models in the field of oral auxiliary diagnosis, treatment and health consultation].

Authors
Chunlei Han, Shizhu Bai, Ting Zhang, Chengming Liu, Yi‐Chun Liu, Xiaowen Hu, Yewang Zhao
Source
Journal: PubMed; Volume (issue): 60 (8): 871-878
Identifier
DOI: 10.3760/cma.j.cn112144-20241107-00418
Abstract

Objective: To evaluate the accuracy of the oral healthcare information provided by different large language models (LLMs) and to explore their feasibility and limitations in oral auxiliary diagnosis, treatment and health consultation.

Methods: Eight items comprising 47 questions on the diagnosis and treatment of oral diseases were designed to assess LLM performance as an artificial intelligence (AI) medical assistant, and five items comprising 35 questions on oral health consultation were designed to assess LLM performance as a simulated doctor. Each question was answered by five LLMs (ERNIE Bot, HuatuoGPT, Tongyi Qianwen, iFlytek Spark and ChatGPT). Two attending physicians, each with more than 5 years of experience, independently rated the responses against the 3C criteria (correct, clear, concise). Inter-rater consistency was assessed with the Spearman rank correlation coefficient, and differences between models were tested with the Kruskal-Wallis test followed by Dunn's post hoc test. In addition, the 600 questions of the 2023 dental licensing examination were used to measure each model's answering time, score and accuracy.

Results: As an AI medical assistant, the LLMs could support doctors in diagnosis and treatment decision-making (inter-rater Spearman coefficient 0.505, P<0.01). As a simulated doctor, they could provide health education to patients (inter-rater Spearman coefficient 0.533, P<0.01).
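The Methods rely on two rank-based statistics: Spearman's rho for agreement between the two raters, and the Kruskal-Wallis H statistic for differences in 3C scores among the five models. The following is a minimal pure-Python sketch of both, assuming ordinal ratings and the standard average-rank treatment of ties. It is not the authors' analysis code: the function names and the sample ratings are hypothetical, and P values and Dunn's post hoc comparisons are omitted. In practice, `scipy.stats.spearmanr` and `scipy.stats.kruskal` compute these statistics together with P values.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Advance j to the last position of the run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with the usual correction for ties."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)  # ranks[i] belongs to pooled[i]
    rank_term, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        rank_term += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over tie groups of size t.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_sum = sum(t ** 3 - t for t in counts.values())
    return h / (1 - tie_sum / (n ** 3 - n))


if __name__ == "__main__":
    # Hypothetical ratings (maximum 4 points) from two raters for the same eight answers.
    rater_a = [3, 2, 2, 1, 3, 0, 2, 4]
    rater_b = [3, 2, 1, 1, 2, 1, 2, 4]
    print("inter-rater rho:", round(spearman_rho(rater_a, rater_b), 3))
    # Hypothetical 3C scores for three models.
    print("H:", round(kruskal_wallis_h([3, 3, 2, 3], [1, 2, 1, 1], [2, 2, 3, 1]), 3))
```

The sketch assumes at least two distinct values in each input; a constant rating vector would make both the Pearson denominator and the tie correction divide by zero.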
The 3C scores are reported as median (lower quartile, upper quartile), with a maximum of 4 points. As an AI medical assistant and as a simulated doctor, respectively, the models scored: ERNIE Bot 2.00 (1.00, 3.00) and 2.00 (1.00, 3.00); HuatuoGPT 1.00 (1.00, 2.00) and 2.00 (1.00, 2.00); Tongyi Qianwen 2.00 (1.00, 2.00) and 2.00 (1.00, 3.00); iFlytek Spark 2.00 (1.00, 2.00) and 2.00 (1.75, 2.25); ChatGPT 3.00 (2.00, 3.00) and 3.00 (2.00, 3.00). The Kruskal-Wallis test showed statistically significant differences in 3C scores among the five models in both roles (all P<0.001). On the dental licensing examination, the five models averaged 370.2 points, an accuracy of 61.7% (370.2/600), and took 94.6 minutes on average. Individually, ERNIE Bot took 115 minutes and scored 363 points (accuracy 60.5%, 363/600); HuatuoGPT took 224 minutes and scored 305 points (50.8%, 305/600); Tongyi Qianwen took 43 minutes and scored 438 points (73.0%, 438/600); iFlytek Spark took 32 minutes and scored 364 points (60.7%, 364/600); and ChatGPT took 59 minutes and scored 381 points (63.5%, 381/600).

Conclusions: Across both roles, AI medical assistant and simulated doctor, ChatGPT performed best, giving answers that were largely correct, clear and concise, followed by ERNIE Bot, Tongyi Qianwen and iFlytek Spark, with HuatuoGPT lagging well behind. On the dental licensing examination, the four models other than HuatuoGPT reached the passing level, and all five finished in far less time than the 8 h allotted under the exam regulations.
LLMs are feasible for use in oral auxiliary diagnosis, treatment and health consultation, and can help both doctors and patients obtain medical information quickly. However, their outputs carry a risk of error (no model achieved the maximum 3C score), so they should be used with prudent judgment.
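The examination figures in the Results are simple ratios, so they can be recomputed directly from the reported times and scores; doing so, for example, gives 73.0% for Tongyi Qianwen's 438 points. The sketch below does this in plain Python. The dictionary and function names are hypothetical, and the pass mark of 360 points (60% of 600) is an assumption: the abstract does not state the pass mark, and 360 is used only because it separates the models the same way the abstract does.

```python
# Dental licensing examination results as reported in the abstract:
# model name -> (answering time in minutes, score out of 600 points).
RESULTS = {
    "ERNIE Bot": (115, 363),
    "HuatuoGPT": (224, 305),
    "Tongyi Qianwen": (43, 438),
    "iFlytek Spark": (32, 364),
    "ChatGPT": (59, 381),
}
TOTAL_POINTS = 600
EXAM_LIMIT_MINUTES = 8 * 60  # 8 h allotted under the exam regulations
# ASSUMPTION: pass mark of 360 points (60% of 600); not stated in the abstract.
ASSUMED_PASS_MARK = 360


def accuracy_pct(score, total=TOTAL_POINTS):
    """Score as a percentage of total points, rounded to one decimal place."""
    return round(100 * score / total, 1)


def summarize(results):
    """Return (mean score, mean accuracy %, mean minutes) across all models."""
    n = len(results)
    mean_score = sum(score for _, score in results.values()) / n
    mean_minutes = sum(minutes for minutes, _ in results.values()) / n
    return round(mean_score, 1), accuracy_pct(mean_score), round(mean_minutes, 1)


if __name__ == "__main__":
    for name, (minutes, score) in RESULTS.items():
        passed = score >= ASSUMED_PASS_MARK
        print(f"{name}: {score}/{TOTAL_POINTS} = {accuracy_pct(score)}%, "
              f"{minutes} of {EXAM_LIMIT_MINUTES} min, pass={passed}")
    print("mean (score, accuracy %, minutes):", summarize(RESULTS))
```

Rounding to one decimal place mirrors how the abstract reports percentages; the means are computed over the five models with equal weight.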