Computer science
Closing (real estate)
Set (abstract data type)
Annotation
Subject-matter expert
Shot (pellet)
Recall
Artificial intelligence
Natural language processing
Psychology
Expert system
Programming language
Cognitive psychology
Organic chemistry
Law
Chemistry
Political science
Authors
Valentin Liévin, Christoffer Hother, Ole Winther
Source
Journal: Cornell University - arXiv
Date: 2022-07-17
Citations: 13
Identifier
DOI: 10.48550/arxiv.2207.08143
Abstract
Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama-2, etc.) can be applied to answer and reason about difficult real-world questions. We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios: Chain-of-Thought (CoT, think step by step), few-shot and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason and recall expert knowledge. Finally, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions, but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%) and PubMedQA (78.2%). Open-source models are closing the gap: Llama-2 70B also passed the MedQA-USMLE with 62.5% accuracy.
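The abstract combines zero-shot Chain-of-Thought prompting with ensemble methods to answer multiple-choice medical questions. The sketch below illustrates how such a pipeline could look in Python; it is a minimal illustration, not the authors' code. The `generate` callable, the option format, and the regex-based answer extraction are all assumptions introduced here for clarity, and majority voting over sampled rationales stands in for the "ensemble methods" mentioned in the abstract.

```python
# A minimal sketch of zero-shot Chain-of-Thought prompting with
# majority-vote ensembling over sampled rationales. `generate` is a
# hypothetical stand-in for any LLM completion call.
import collections
import re

COT_TRIGGER = "Let's think step by step."  # zero-shot CoT cue


def build_prompt(question: str, options: dict[str, str]) -> str:
    """Format a USMLE-style multiple-choice question as a CoT prompt."""
    opts = "\n".join(f"{k}) {v}" for k, v in sorted(options.items()))
    return f"Question: {question}\n{opts}\nAnswer: {COT_TRIGGER}\n"


def extract_answer(completion: str) -> str | None:
    """Pull the last option letter (A-D) mentioned in the rationale."""
    letters = re.findall(r"\b([A-D])\b", completion)
    return letters[-1] if letters else None


def ensemble_answer(question, options, generate, n_samples=5):
    """Sample several CoTs (assumes `generate` is stochastic, e.g. a
    non-zero sampling temperature) and return the majority-vote answer."""
    prompt = build_prompt(question, options)
    votes = collections.Counter()
    for _ in range(n_samples):
        answer = extract_answer(generate(prompt))  # hypothetical LLM call
        if answer:
            votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None
```

The vote counts also give a crude per-option distribution, which is one simple way a sampled ensemble can yield the calibrated predictive distributions the abstract refers to.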