Medicine
Content (measurement theory)
Natural language processing
Linguistics
Mathematics
Computer science
Mathematical analysis
Philosophy
Authors
Rohan Khera, Aline F. Pedroso, Vipina K. Keloth, Hua Xu, Gisele Sampaio Silva, Lee H. Schwamm
Source
Journal: Stroke (indexed in PubMed)
Date: 2025-08-15
Identifiers
DOI: 10.1161/strokeaha.125.051913
Abstract
Large language models (LLMs) are artificial intelligence (AI) tools that can generate human expert-like content and accelerate the synthesis of scientific literature, but they can also spread misinformation by producing misleading content. This study sought to characterize the linguistic features that distinguish AI-generated from human-authored scientific text and to evaluate the performance of AI detection tools for this task.

We conducted a computational analysis of 34 essays on cerebrovascular topics: 12 generated by LLMs (Generative Pre-trained Transformer [GPT]-4, GPT-3.5, Llama-2, and Bard) and 22 written by human scientists. Each essay was rated as AI-generated or human-authored by up to 38 members of the Stroke editorial board, and we compared the collective performance of these experts with that of GPTZero, a widely used online AI detection tool. To characterize linguistic differences between AI-generated and human-written content, we extracted and compared features spanning syntax (word count, complexity, and so on), semantics (polarity), readability (Flesch scores), grade level (Flesch-Kincaid), and language perplexity (or predictability).

More than 50% of the stroke experts who reviewed the essays correctly identified 10 of the 12 (83.3%) AI-generated essays as AI, but they also misclassified 7 of the 22 (31.8%) human-written essays as AI. GPTZero correctly classified all 12 (100%) AI-generated and 21 of the 22 (95.5%) human-written essays, although the tool relied on only a few key sentences for classification. Compared with human essays (values reported as human versus AI), AI-generated content had a lower word count and complexity, significantly lower perplexity (median, 15.0 versus 7.2; P<0.001), lower readability (Flesch median, 42.1 versus 26.4; P<0.001), and a higher grade level (Flesch-Kincaid median, 13.1 versus 14.8; P=0.006).

Large language models generate scientific content with measurable differences from human-written text, but these differences are not consistently identifiable even by human experts and require dedicated AI detection tools. Given the challenges experts face in distinguishing AI from human content, technology-assisted detection is needed wherever human provenance is essential to safeguarding the integrity of scientific communication.
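The readability and perplexity measures cited above follow standard published definitions. The Python sketch below is a minimal illustration of those formulas, not the authors' actual pipeline: the naive vowel-group syllable counter and the token_log_probs input are simplifying assumptions, with the log-probabilities in practice coming from a language model that scores each token of an essay.

```python
import math
import re

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; production tools use pronunciation dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str):
    # Standard Flesch reading-ease and Flesch-Kincaid grade-level formulas.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences            # words per sentence
    spw = syllables / max(1, len(words))    # syllables per word
    flesch = 206.835 - 1.015 * wps - 84.6 * spw   # higher = easier to read
    fk_grade = 0.39 * wps + 11.8 * spw - 15.59    # approximate US school grade
    return flesch, fk_grade

def perplexity(token_log_probs):
    # Perplexity = exp(mean negative log-likelihood) over the tokens;
    # token_log_probs is a hypothetical list of per-token log-probabilities
    # produced by a scoring language model.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

if __name__ == "__main__":
    sample = "Large language models generate fluent scientific prose. Experts struggle to spot it."
    print(readability(sample))
```

Lower perplexity means the scoring model finds the text more predictable, consistent with the lower medians reported for the AI-generated essays; lower Flesch scores and higher Flesch-Kincaid grades likewise indicate denser, less readable prose.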