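Computer science
Normalization (sociology)
Cluster analysis
Data mining
Source code
Artificial intelligence
Feature learning
Encoder
Feature extraction
Machine learning
Pattern recognition (psychology)
Anthropology
Operating system
Sociology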
Authors
Tongshuai Wu, Liwei Chen, Gewangzi Du, Chenguang Zhu, Ningning Cui, Gang Shi
Identifier
DOI: 10.1093/comjnl/bxad080
Abstract
The key to a deep-learning vulnerability detection framework lies in pre-processing the source code and learning vulnerability features. Traditional source code representation techniques fully normalize user-defined symbols and thereby discard semantic information associated with vulnerabilities. The current mainstream model for learning vulnerability features is the Recurrent Neural Network (RNN), whose sequential structure limits its ability to capture long-range information. This paper proposes a new vulnerability detection framework to address these problems. In the source code pre-processing phase, we propose a new data normalization method: user-defined symbols are clustered with the unsupervised K-means algorithm, and each symbol is then normalized according to its cluster. This preserves the primary semantic information in the source code while keeping the sample data smooth. In the feature extraction stage, we feed the source code, once converted to a text representation, into Bidirectional Encoder Representations from Transformers (BERT) for automatic feature learning, which enhances the extraction of semantic information and the capture of long-range information. Experimental results on real-world data that we collected show that the vulnerability detection precision of this method is 18.3% higher than that of the current mainstream vulnerability detection framework. Furthermore, our method improves on the precision of the state-of-the-art method by 4.2%.
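The cluster-based normalization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how user-defined symbols are turned into vectors, so the hashed character-bigram features in `embed`, the small keyword and library-call lists, the `SYM<n>` placeholder format, and the function names are all assumptions made for illustration. K-means is written out with the standard library only so the example runs without extra dependencies.

```python
import random
import re

# Assumed, deliberately tiny lists; a real pipeline would use a full C lexer.
C_KEYWORDS = {"int", "char", "void", "if", "for", "while", "return", "sizeof"}
LIBRARY_CALLS = {"strcpy", "memcpy", "malloc", "free", "strlen"}
IDENT = re.compile(r"[A-Za-z_]\w*")


def embed(name, dim=16):
    """Hypothetical feature vector: L2-normalized hashed character-bigram counts."""
    vec = [0.0] * dim
    padded = f"^{name.lower()}$"
    for a, b in zip(padded, padded[1:]):
        vec[(ord(a) * 31 + ord(b)) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec]


def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style K-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def normalize_symbols(code, k=2):
    """Replace each user-defined symbol with a placeholder named after its cluster."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    names = sorted({t for t in tokens
                    if IDENT.fullmatch(t)
                    and t not in C_KEYWORDS and t not in LIBRARY_CALLS})
    k = min(k, len(names))
    if k == 0:
        return " ".join(tokens), {}
    labels = kmeans([embed(n) for n in names], k)
    mapping = {n: f"SYM{label}" for n, label in zip(names, labels)}
    return " ".join(mapping.get(t, t) for t in tokens), mapping


sample = "char dst_buf[16]; strcpy(dst_buf, src_buf); int n = strlen(src_buf);"
normalized, mapping = normalize_symbols(sample, k=2)
print(normalized)
print(mapping)
```

Compared with mapping every identifier to one generic token, the placeholder here carries the cluster index, so symbols that land in different clusters stay distinguishable in the normalized text that would later be fed to a model such as BERT.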