Generative grammar
Transformer
Computer science
Proof of concept
Machine learning
Artificial intelligence
Natural language processing
Engineering
Voltage
Electrical engineering
Operating system
Authors
Mikkel Helding Vembye, Julian Christensen, Anja Bondebjerg Mølgaard, Frederikke Lykke Witthöft Schytt
Abstract
Independent human double screening of titles and abstracts is a critical step for ensuring the quality of systematic reviews and the meta-analyses based on them. However, double screening is a resource-demanding procedure that slows the review process. To alleviate this issue, we evaluated the use of OpenAI's generative pretrained transformer (GPT) application programming interface (API) models as an alternative to human second screeners of titles and abstracts. We did so by developing a new benchmark scheme for interpreting the performance of automated screening tools against common human screening performance in high-quality systematic reviews, and by conducting three large-scale experiments on three psychological systematic reviews of varying complexity. Across all experiments, we show that the GPT API models can perform on par with, and in some cases even better than, typical human screening performance in detecting relevant studies, while also showing high exclusion performance. In addition, we introduce multiprompt screening, in which one concise prompt is written for each inclusion/exclusion criterion in a review, and show that it can be a valuable tool for supporting screening in highly complex review settings. To facilitate future implementation, we develop a reproducible workflow and a set of tentative guidelines for when and when not to use GPT API models as independent second screeners of titles and abstracts. Moreover, we present the R package AIscreenR to standardize the suggested application. Our ultimate aim is to make GPT API models acceptable as independent second screeners within high-quality systematic reviews, such as those published in Psychological Bulletin. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
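The multiprompt idea described above (one concise prompt per inclusion/exclusion criterion) can be sketched against the OpenAI API as below. This is a minimal illustration, not the paper's AIscreenR implementation: the criteria, the model name, the prompt wording, and the all-criteria-must-pass decision rule are assumptions made for the example.

```python
# Minimal sketch of multiprompt title/abstract screening:
# one concise prompt per inclusion/exclusion criterion, including a
# record only if every criterion-level call answers "yes".
# Criteria, model, and decision rule are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical review criteria, one per prompt.
CRITERIA = [
    "Does the study report a psychological intervention?",
    "Does the study include a comparison or control group?",
    "Does the study report quantitative outcomes?",
]

PROMPT_TEMPLATE = (
    "You are screening titles and abstracts for a systematic review.\n"
    "Criterion: {criterion}\n"
    "Title: {title}\n"
    "Abstract: {abstract}\n"
    "Answer with exactly one word: yes, no, or unsure."
)

def screen_record(title: str, abstract: str, model: str = "gpt-4o-mini") -> bool:
    """Return True (include) only if the model answers 'yes' to every criterion."""
    for criterion in CRITERIA:
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": PROMPT_TEMPLATE.format(
                    criterion=criterion, title=title, abstract=abstract
                ),
            }],
            temperature=0,  # reduce run-to-run variation for reproducibility
        )
        answer = response.choices[0].message.content.strip().lower()
        if not answer.startswith("yes"):
            return False  # one failed criterion is enough to exclude
    return True

if __name__ == "__main__":
    decision = screen_record(
        title="A randomized trial of mindfulness training in adolescents",
        abstract="We randomized 120 adolescents to mindfulness training or "
                 "a waitlist control and measured anxiety at 8 weeks.",
    )
    print("include" if decision else "exclude")
```

One design note: issuing a separate, narrowly scoped call per criterion keeps each prompt short and auditable, at the cost of more API calls per record than a single combined prompt.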