Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models

医学膀胱癌癌症人工智能肿瘤科内科学计算机科学

作者

Kunpeng Li,Li Wang,Shun Wan,Chenyang Wang,Siyu Chen,Shanhui Liu,Yang Li

出处

期刊：Journal of Endourology [Mary Ann Liebert]
日期：2025-03-18 被引量：2

链接

nih.govdoi.org

标识

DOI：10.1089/end.2024.0860

摘要

Background: With the rapid advancement of artificial intelligence in health care, large language models (LLMs) demonstrate increasing potential in medical applications. However, their performance in specialized oncology remains limited. This study evaluates the performance of multiple leading LLMs in addressing clinical inquiries related to bladder cancer (BLCA) and demonstrates how strategic optimization can overcome these limitations. Methods: We developed a comprehensive set of 100 clinical questions based on established guidelines. These questions encompassed epidemiology, diagnosis, treatment, prognosis, and follow-up aspects of BLCA management. Six LLMs (Claude-3.5-Sonnet, ChatGPT-4.0, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) were tested through three independent trials. The responses were validated against current clinical guidelines and expert consensus. We implemented a two-phase training optimization process specifically for GPT-3.5-Turbo to enhance its performance. Results: In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4 (85.67% ± 1.15%). Grok-beta achieved 84.33% ± 1.53% accuracy, whereas Gemini-1.5-Pro and Mistral-Large-2 showed similar performance (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo demonstrated the lowest accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo's accuracy improved to 86.67% ± 1.89%. Following the second phase of optimization, the model achieved 100% accuracy. Conclusion: This study not only establishes the comparative performance of various LLMs in BLCA-related queries but also validates the potential for significant improvement through targeted training optimization. The successful enhancement of GPT-3.5-Turbo's performance suggests that strategic model refinement can overcome initial limitations and achieve optimal accuracy in specialized medical applications.

求助该文献

最长约 10秒，即可获得该文献文件

Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models

今日热心研友