Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study (Preprint)

Preprint, Cervical cancer, Medicine, Computer science, Cancer, World Wide Web, Internal medicine
Authors
Warisijiang Kuerbanjiang, Shengzhe Peng, Yiershatijiang Jiamaliding, Yuexiong Yi
Source
Journal: Journal of Medical Internet Research [JMIR Publications]
Volume/Issue: 27: e63626
Identifier
DOI: 10.2196/63626
Abstract

Background: Cervical cancer remains the fourth leading cause of cancer death among women globally, with a particularly severe burden in low-resource settings. A comprehensive approach, from screening to diagnosis and treatment, is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, though their specific role in cervical cancer management remains underexplored.

Objective: This study aims to systematically evaluate the performance and interpretability of LLMs in cervical cancer management.

Methods: Models were selected from the AlpacaEval leaderboard version 2.0 and based on the computational capacity available to us. The questions input into the models covered general knowledge, screening, diagnosis, and treatment, in accordance with guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality and graded as A, B, C, or D, with corresponding scores of 3, 2, 1, and 0. The effective rate was calculated as the ratio of A and B responses to the total number of designed questions. Local Interpretable Model-Agnostic Explanations (LIME) was used to explain model outputs and enhance physicians' trust in them within the medical context.

Results: Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed based on international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all the models included, ChatGPT-4.0 Turbo ranked first with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without a prompt, outperforming the other 8 models (P<.001). Regardless of prompts, QiZhenGPT consistently ranked among the lowest-performing models, with P<.01 in comparisons against all models except BioMedLM. Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), while medical-specialized models exhibited limited improvement.

Conclusions: Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise in clinical decision-making involving logical analysis. The use of prompts can enhance the accuracy of some models in cervical cancer management to varying degrees. Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study. By contrast, proprietary models, particularly those augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks, such as cervical cancer management. However, this study underscores the need for further research to explore the practical application of LLMs in medical practice.
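
The abstract defines a simple scoring scheme (grades A, B, C, and D mapped to 3, 2, 1, and 0) and an effective rate equal to the ratio of A and B responses to the total number of questions. The Python sketch below illustrates these calculations only; it is not the authors' code, the normal-approximation 95% CI is an assumption (the abstract does not state which CI method was used), and the example grade distribution is hypothetical.

```python
import math
import statistics

# Grade-to-score mapping described in the abstract: A=3, B=2, C=1, D=0.
GRADE_SCORES = {"A": 3, "B": 2, "C": 1, "D": 0}


def evaluate(grades):
    """Summarize one model's graded answers to the standardized question set."""
    scores = [GRADE_SCORES[g] for g in grades]
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores) if n > 1 else 0.0
    # Normal-approximation 95% CI (an assumption; the abstract does not
    # specify the CI method the authors used).
    half_width = 1.96 * sd / math.sqrt(n)
    # Effective rate = (number of A and B responses) / (total questions).
    effective_rate = sum(g in ("A", "B") for g in grades) / n
    return {
        "mean_score": round(mean, 2),
        "ci_95": (round(mean - half_width, 2), round(mean + half_width, 2)),
        "effective_rate": f"{effective_rate:.2%}",
    }


if __name__ == "__main__":
    # Hypothetical grade distribution, chosen only so that the effective rate
    # is 94% (94 A/B answers out of 100), as reported for ChatGPT-4.0 Turbo
    # with a prompt.
    example_grades = ["A"] * 70 + ["B"] * 24 + ["C"] * 4 + ["D"] * 2
    print(evaluate(example_grades))
```

With this hypothetical distribution the script prints a mean score of 2.62, a 95% CI of roughly 2.49-2.75, and an effective rate of 94.00%; the published figures come from the authors' actual grading, not from this sketch.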
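
The interpretability result is reported as a median intersection over union (IoU) of 0.43 between model explanations and human annotations for prompted proprietary models. The abstract does not detail how highlighted tokens were extracted, so the sketch below shows only the standard set-based IoU computation such a comparison implies; the `token_iou` helper and both token sets are hypothetical.

```python
def token_iou(model_tokens: set, human_tokens: set) -> float:
    """Intersection over union of two sets of highlighted tokens."""
    if not model_tokens and not human_tokens:
        return 1.0  # both empty: treat as full agreement
    return len(model_tokens & human_tokens) / len(model_tokens | human_tokens)


if __name__ == "__main__":
    # Hypothetical example: tokens highlighted by a LIME-style explainer on a
    # model response versus tokens marked by a physician annotator.
    lime_highlight = {"HPV", "cytology", "colposcopy", "biopsy"}
    human_annotation = {"HPV", "cytology", "colposcopy", "CIN"}
    print(f"IoU = {token_iou(lime_highlight, human_annotation):.2f}")  # IoU = 0.60
```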