Authors
Hongyan Long, Yang Deng, Yaoguang Guo, Zhencai Shen, Yuzhu Zhang, Ji Bao, Yang He
Abstract
Background: The application of large language models (LLMs) in medicine is rapidly advancing. However, evaluating LLM capabilities in specialized domains such as traditional Chinese medicine (TCM), which possesses a unique theoretical system and cognitive framework, remains a sizable challenge.

Objective: This study aimed to provide an empirical evaluation of different LLM types in the specialized domain of TCM stroke.

Methods: The Traditional Chinese Medicine-Stroke Evaluation Dataset (TCM-SED), a 203-question benchmark, was systematically constructed. The dataset includes 3 question paradigms (short-answer questions, multiple-choice questions, and essay questions) and covers multiple knowledge dimensions, including diagnosis, pattern differentiation and treatment, herbal formulas, acupuncture, interpretation of classical texts, and patient communication. Gold standard answers were established through a multiexpert cross-validation and consensus process. The TCM-SED was subsequently used to comprehensively test 2 representative LLMs: GPT-4o (a leading international general-purpose model) and DeepSeek-R1 (a large model primarily trained on Chinese corpora).

Results: The test results revealed a differentiation in model capabilities across cognitive levels. In objective sections emphasizing precise knowledge recall, DeepSeek-R1 comprehensively outperformed GPT-4o, achieving an accuracy lead of more than 17 percentage points in the multiple-choice section (96/137, 70.1% vs 72/137, 52.6%, respectively). Conversely, in the essay section, which tested knowledge integration and complex reasoning, GPT-4o's performance notably surpassed that of DeepSeek-R1. For instance, in the interpretation of classical texts category, GPT-4o achieved a scoring rate of 90.5% (181/200), far exceeding that of DeepSeek-R1 (147/200, 73.5%).
Conclusions: This empirical study demonstrates that, within the TCM domain, Chinese-centric models hold a substantial advantage in static knowledge tasks, whereas leading general-purpose models exhibit stronger dynamic reasoning and content generation capabilities. The TCM-SED, developed as the benchmark for this study, serves as an effective quantitative tool for evaluating and selecting appropriate LLMs for TCM scenarios. It also offers a valuable data foundation and a new research direction for future model optimization and alignment.