Authors
Zhen Wang, Zhongle Xu, Yongyong Shi, Junhua Xi, Yanbin Zhang
Abstract
Background: Urodynamic studies (UDS) are essential diagnostic tools in urology, but their interpretation requires significant expertise and is subject to interobserver variability. Large language models (LLMs) have shown promise in various medical diagnostic applications, yet their utility in the automated interpretation of urodynamic parameters remains unexplored.
Objective: To evaluate the diagnostic performance of LLMs in the automated interpretation of urodynamic parameters compared with urologists of different experience levels.
Methods: We analyzed 320 urodynamic studies from patients with various lower urinary tract conditions. Two LLMs (Deepseek-R1 and GPT-4) were used to interpret the urodynamic data, and their diagnostic accuracy was compared with that of junior and senior urologists. Performance was evaluated using receiver operating characteristic (ROC) curves, area under the curve (AUC), diagnostic accuracy, and the QUEST framework (Quality of information, Understanding and reasoning, Expression style, Safety, and Trustworthiness). The study was designed and reported following the TRIPOD + AI statement for reporting prediction models that use machine learning methods.
Results: Deepseek-R1 achieved the highest diagnostic accuracy among the automated systems (92.50%), followed by GPT-4 (85.94%). Both LLMs scored numerically higher than junior urologists (83.75%) and lower than senior urologists (95.94%). The reference standard was established by consensus of three board-certified urodynamics experts with a median of 15 years of experience (range 12–22 years). ROC analysis showed strong performance across different urological conditions, with AUC values of 0.89–0.92 for Deepseek-R1, 0.84–0.88 for GPT-4, 0.81–0.84 for junior urologists, and 0.94–0.95 for senior urologists. In the QUEST framework evaluation, Deepseek-R1 outperformed other systems in information quality, reasoning, expression style, safety, and trustworthiness. Both LLMs demonstrated high clinical utility, with Deepseek-R1 scoring higher than GPT-4 in decision support (4.38/5), time efficiency (2.10/5), and educational value (4.20/5).
Conclusions: LLMs, particularly Deepseek-R1, demonstrate promising capabilities in the automated interpretation of urodynamic parameters, with performance exceeding that of junior urologists and approaching that of senior urologists. These findings suggest potential applications in clinical decision support, training, and quality assurance in urodynamic practice, which could improve diagnostic consistency and access to expert-level interpretation.
Clinical Trial Registration: This study is a retrospective analysis of deidentified patient data and did not involve any direct patient contact or intervention. Ethics approval was therefore waived in accordance with institutional and national guidelines.
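As a rough plausibility check on the reported accuracies, the percentages can be converted back into implied counts of correctly interpreted studies. The sketch below is illustrative only: it assumes each accuracy is a simple proportion over all 320 studies, which the abstract does not state explicitly, and the function name `implied_correct` is hypothetical.

```python
# Back-of-the-envelope check: convert reported accuracies into implied
# counts of correct interpretations.
# ASSUMPTION: every rater was scored on all 320 studies and accuracy is
# a plain proportion (correct / total). The abstract does not confirm this.
N_STUDIES = 320  # sample size given in the Methods

# Accuracy percentages as reported in the Results
reported_accuracy = {
    "Deepseek-R1": 92.50,
    "GPT-4": 85.94,
    "Junior urologists": 83.75,
    "Senior urologists": 95.94,
}


def implied_correct(accuracy_pct: float, n: int = N_STUDIES) -> int:
    """Nearest whole number of correct cases for a reported accuracy."""
    return round(accuracy_pct / 100 * n)


for rater, pct in reported_accuracy.items():
    k = implied_correct(pct)
    # Recompute the percentage from the whole-number count so it can be
    # compared against the reported value.
    print(f"{rater}: ~{k}/{N_STUDIES} correct ({100 * k / N_STUDIES:.2f}%)")
```

Under that assumption, the four reported figures correspond to about 296, 275, 268, and 307 correct interpretations out of 320, and each count reproduces the reported percentage to two decimal places.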