Reinforcement learning
Computer science
Cognitive science
Artificial intelligence
Psychology
Social psychology
Authors
DeepSeek-AI,Daya Guo,Dejian Yang,Haowei Zhang,Junxiao Song,Ruoyu Zhang,Runxin Xu,Qihao Zhu,Shirong Ma,Peiyi Wang,Xiao Bi,Xiaokang Zhang,Xingkai Yu,Yu Wu,Zhenhua Wu,Zhibin Gou,Zhihong Shao,Zhuoshu Li,Ziyi Gao,Aixin Liu
Source
Journal: Cornell University - arXiv
Date: 2025-01-01
Citations: 434
Identifiers
DOI:10.4230/oasics.icpec.2025.4
Abstract
As Large Language Models (LLMs) become increasingly integrated into secure software development workflows, a critical question remains unanswered: can these models not only detect insecure code but also reliably classify vulnerabilities according to standardized taxonomies? In this work, we conduct a systematic evaluation of three state-of-the-art LLMs - Llama3, Codestral, and Deepseek R1 - using a carefully filtered subset of the Big-Vul dataset annotated with eight representative Common Weakness Enumeration categories. Adopting a closed-world classification setup, we assess each model's performance in both identifying the presence of vulnerabilities and mapping them to the correct CWE label. Our findings reveal a sharp contrast between high detection rates and markedly poor classification accuracy, with frequent overgeneralization and misclassification. Moreover, we analyze model-specific biases and common failure modes, shedding light on the limitations of current LLMs in performing fine-grained security reasoning. These insights are especially relevant in educational contexts, where LLMs are being adopted as learning aids despite their limitations. A nuanced understanding of their behaviour is essential to prevent the propagation of misconceptions among students. Our results expose key challenges that must be addressed before LLMs can be reliably deployed in security-sensitive environments.
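A closed-world evaluation of this kind can be sketched as follows. Everything in this sketch is illustrative rather than the authors' actual harness: the `query_model` helper is a stand-in for a real call to Llama3, Codestral, or Deepseek R1, the specific CWE labels are a plausible guess at the eight categories, and the demo sample is fabricated for the example.

```python
# Minimal sketch of a closed-world CWE classification benchmark.
# query_model() is a hypothetical placeholder; swap in a real LLM
# client to run an actual evaluation.

from collections import Counter

# Eight representative CWE categories (illustrative subset only; the
# paper's filtered Big-Vul annotation may use different labels).
CWE_LABELS = ["CWE-119", "CWE-20", "CWE-399", "CWE-125",
              "CWE-264", "CWE-200", "CWE-189", "CWE-416"]

def query_model(code_snippet: str) -> str:
    """Placeholder: prompt an LLM to pick one label from CWE_LABELS,
    or answer 'SAFE' if it sees no vulnerability."""
    return "CWE-119"  # stub answer so the sketch runs end to end

def evaluate(samples):
    """samples: list of (code, gold_label) pairs, where gold_label is
    in CWE_LABELS or 'SAFE'. Returns detection rate, fine-grained CWE
    accuracy, and the most frequent label confusions."""
    detected = correct = vulnerable = 0
    confusions = Counter()
    for code, gold in samples:
        pred = query_model(code)
        if gold != "SAFE":
            vulnerable += 1
            if pred != "SAFE":       # model flags *some* vulnerability
                detected += 1
                if pred == gold:     # ...and names the correct CWE
                    correct += 1
                else:
                    confusions[(gold, pred)] += 1
    return {
        "detection_rate": detected / vulnerable,
        "cwe_accuracy": correct / vulnerable,
        "top_confusions": confusions.most_common(3),
    }

if __name__ == "__main__":
    demo = [("char buf[8]; strcpy(buf, user_input);", "CWE-119")]
    print(evaluate(demo))
```

The two metrics separate exactly the contrast the abstract reports: `detection_rate` captures whether the model notices a vulnerability at all, while `cwe_accuracy` requires mapping it to the correct fine-grained label, and a large gap between the two indicates overgeneralization or misclassification.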