Computer science
Probability estimation
Test (biology)
Word (group theory)
Statistics
Estimation
Natural language processing
Artificial intelligence
Mathematics
Engineering
Geometry
Biology
Paleontology
Systems engineering
Authors
Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A. Miller, Danielle S. Bitterman, Guanhua Chen, Anoop Mayampurath, Matthew M. Churpek, Majid Afshar
Source
Journal: JAMIA Open
[Oxford University Press]
Date: 2024-12-26
Volume/Issue: 8 (1)
Identifier
DOI: 10.1093/jamiaopen/ooae154
Abstract
To evaluate large language models (LLMs) for pre-test diagnostic probability estimation and compare their uncertainty estimation performance with a traditional machine learning classifier. We assessed 2 instruction-tuned LLMs, Mistral-7B-Instruct and Llama3-70B-chat-hf, on predicting binary outcomes for Sepsis, Arrhythmia, and Congestive Heart Failure (CHF) using electronic health record (EHR) data from 660 patients. Three uncertainty estimation methods (Verbalized Confidence, Token Logits, and LLM Embedding+XGB) were compared against an eXtreme Gradient Boosting (XGB) classifier trained on raw EHR data. Performance metrics included AUROC and Pearson correlation between predicted probabilities. The XGB classifier outperformed the LLM-based methods across all tasks. LLM Embedding+XGB showed the closest performance to the XGB baseline, while Verbalized Confidence and Token Logits underperformed. These findings, consistent across multiple models and demographic groups, highlight the limitations of current LLMs in providing reliable pre-test probability estimations and underscore the need for improved calibration and bias mitigation strategies. Future work should explore hybrid approaches that integrate LLMs with numerical reasoning modules and calibrated embeddings to enhance diagnostic accuracy and ensure fairer predictions across diverse populations. LLMs demonstrate potential but currently fall short in estimating diagnostic probabilities compared to traditional machine learning classifiers trained on structured EHR data. Further improvements are needed for reliable clinical use.
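The sketch below is not the authors' code; it is a minimal illustration of the kind of comparison the abstract describes, under the assumption that synthetic tabular data stand in for the EHR features and that noisy "Yes"/"No" logits stand in for LLM output. It derives a probability from the answer-token logits via a softmax (the Token Logits idea), trains an XGBoost baseline on the raw features, and scores both with the metrics named in the abstract (AUROC and Pearson correlation). All variable names and the logit-generation step are hypothetical.

```python
# Illustrative sketch only (synthetic data); not the study's pipeline.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for structured EHR features and a binary diagnosis label.
n_patients, n_features = 660, 20
X = rng.normal(size=(n_patients, n_features))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n_patients) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline: XGBoost classifier trained on the raw tabular features.
xgb = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
xgb.fit(X_tr, y_tr)
p_xgb = xgb.predict_proba(X_te)[:, 1]

def logits_to_prob(yes_logit, no_logit):
    """Softmax over the two answer tokens -> P(Yes)."""
    z = np.stack([yes_logit, no_logit], axis=-1)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e[..., 0] / e.sum(axis=-1)

# Token-logit method (hypothetical): pretend an LLM answered "Yes"/"No" to a
# diagnostic question and we read off the two answer-token logits with noise.
signal = X_te[:, 0] + 0.5 * X_te[:, 1]
yes_logit = signal + rng.normal(scale=2.0, size=len(signal))
no_logit = -signal + rng.normal(scale=2.0, size=len(signal))
p_llm = logits_to_prob(yes_logit, no_logit)

# Discrimination of each method (AUROC) and agreement between the two
# probability estimates (Pearson correlation), as in the abstract's metrics.
print(f"XGB AUROC:         {roc_auc_score(y_te, p_xgb):.3f}")
print(f"Token-logit AUROC: {roc_auc_score(y_te, p_llm):.3f}")
print(f"Pearson r (XGB vs token-logit): {pearsonr(p_xgb, p_llm)[0]:.3f}")
```

In this toy setup the XGB probabilities come straight from the tabular features while the token-logit probabilities pass through a noisy two-token softmax, which loosely mirrors why the abstract reports closer-to-baseline behavior for embedding-based methods than for logit- or verbalization-based ones.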