Authors

Qingwen Yang, Jiahui Jiang, Xue Dong, Huai Yang, Qi Wang, Zhenghan Yang, Dawei Yang, Peng Liu

Abstract
The free-text format is widely used in radiology reports for its expressive flexibility; however, its unstructured nature leaves substantial amounts of report data underutilized. A natural language processing (NLP) model that automatically extracts information from free-text radiology reports can contribute significantly to building structured databases, thereby optimizing data utilization. This study aimed to perform a systematic review and meta-analysis evaluating the performance of NLP systems in extracting information from free-text radiology reports.

A systematic literature search was conducted from November 21 to 23, 2024, in PubMed/MEDLINE, Embase, EBSCO, Ovid, Web of Science, and the Cochrane Library. Study quality was assessed using the QUADAS-2 tool. A bivariate random-effects model was applied to obtain pooled sensitivity, specificity, diagnostic odds ratio (DOR), positive likelihood ratio (PLR), negative likelihood ratio (NLR), and area under the summary receiver operating characteristic curve (AUC). Subgroup analyses (e.g., NLP model type, dataset source, and language) and a random-effects multivariable meta-regression based on the restricted maximum likelihood (REML) method were conducted to explore potential sources of heterogeneity. Sensitivity analyses (excluding high-risk studies, the leave-one-out method, and comparison of data-integration strategies) were performed to assess the robustness of the findings.

A total of 28 studies were included in the final analysis, covering 421,692 extracted entities in 51,187 free-text radiology reports. NLP systems achieved high pooled sensitivity (91% [95% CI: 87, 93]) and specificity (96% [95% CI: 93, 97]), with a DOR of 220 (95% CI: 112, 435) and an AUC of 0.98 (95% CI: 0.96, 0.99).
Subgroup analysis revealed significantly better performance for extracting single anatomical sites (AUC 0.99; 95% CI: 0.97, 0.99) than for multiple sites (AUC 0.95; 95% CI: 0.93, 0.97; p = 0.001). No significant differences were observed across NLP model types, dataset sources, external validation, languages, or imaging modalities. Multivariable meta-regression identified anatomical site as the only significant contributor to heterogeneity (coefficient = 2.26; 95% CI: 0.25, 4.27; p = 0.027). Sensitivity analyses confirmed the robustness of the findings, and no evidence of publication bias was detected. NLP models demonstrated excellent performance in extracting information from free-text radiology reports; however, the observed heterogeneity highlights the need for enhanced report standardization and improved model generalizability.
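To illustrate how the summary metrics relate, the sketch below applies the standard point-estimate identities linking sensitivity and specificity to PLR, NLR, and DOR. This is only an arithmetic illustration under the assumption of fixed point estimates; the paper's pooled values come from a bivariate random-effects model, so the DOR derived here (about 243 from sens = 0.91, spec = 0.96) differs from the model-based pooled DOR of 220. The function name is illustrative, not from the study.

```python
def diagnostic_summary(sens, spec):
    """Derive PLR, NLR, and DOR from sensitivity and specificity.

    Standard identities:
        PLR = sens / (1 - spec)
        NLR = (1 - sens) / spec
        DOR = PLR / NLR
    Note: the paper pools these via a bivariate random-effects model,
    so these point-estimate values differ slightly from its results.
    """
    plr = sens / (1 - spec)   # positive likelihood ratio
    nlr = (1 - sens) / spec   # negative likelihood ratio
    dor = plr / nlr           # diagnostic odds ratio
    return plr, nlr, dor


# Using the pooled point estimates reported in the abstract:
plr, nlr, dor = diagnostic_summary(sens=0.91, spec=0.96)
print(f"PLR = {plr:.2f}, NLR = {nlr:.3f}, DOR = {dor:.1f}")
```

The gap between this naive DOR and the pooled DOR of 220 reflects the bivariate model's joint handling of between-study variation in sensitivity and specificity rather than simple arithmetic on the marginal point estimates.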