作者
Han Yang,Mingchen Li,Huixue Zhou,Yongkang Xiao,Qian Fang,Shuang Zhou,Rui Zhang
摘要
Abstract Background Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question-answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchmarked individual zero-shot LLMs (GPT-4, Llama2-13B, Vicuna-13B, MedLlama-13B, and MedAlpaca-13B) to assess their baseline performance. Within the benchmark, GPT-4 achieves the best 71% on MedMCQA (medical multiple-choice question answering dataset), Vicuna-13B achieves 89.5% on PubMedQA (a dataset for biomedical question answering), and MedAlpaca-13B achieves the best 70% among all, showing the potential for better performance across different tasks and highlighting the need for strategies that can harness their collective strengths. Ensemble learning methods, combining multiple models to improve overall accuracy and reliability, offer a promising approach to address this challenge. Objective To develop and evaluate efficient ensemble learning approaches, we focus on improving performance across 3 medical QA datasets through our proposed two ensemble strategies. Methods Our study uses 3 medical QA datasets: PubMedQA (1000 manually labeled and 11,269 test, with yes, no, or maybe answered for each question), MedQA-USMLE (Medical Question Answering dataset based on the United States Medical Licensing Examination; 12,724 English board-style questions; 1272 test, 5 options), and MedMCQA (182,822 training/4183 test questions, 4-option multiple choice). We introduced the LLM-Synergy framework, consisting of two ensemble methods: (1) a Boosting-based Weighted Majority Vote ensemble, refining decision-making by adaptively weighting each LLM and (2) a Cluster-based Dynamic Model Selection ensemble, dynamically selecting optimal LLMs for each query based on question-context embeddings and clustering. Results Both ensemble methods outperformed individual LLMs across all 3 datasets. Specifically comparing the best individual LLM, the Boosting-based Majority Weighted Vote achieved accuracies of 35.84% on MedMCQA (+3.81%), 96.21% on PubMedQA (+0.64%), and 37.26% (tie) on MedQA-USMLE. The Cluster-based Dynamic Model Selection yields even higher accuracies of 38.01% (+5.98%) for MedMCQA, 96.36% (+1.09%) for PubMedQA, and 38.13% (+0.87%) for MedQA-USMLE. Conclusions The LLM-Synergy framework, using 2 ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. Through effectively combining the strengths of diverse LLMs, this framework provides a flexible and efficient strategy adaptable to current and future challenges in biomedical informatics.