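Computer science
Abstract syntax tree
Source code
Artificial intelligence
Abstract syntax
Natural language processing
Programming language
Syntax
Coding (set theory)
Statement (logic)
Automatic summarization
Artificial neural network
Representation (politics)
Machine learning
Politics
Political science
Set (abstract data type)
Law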
Authors
Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, Xudong Liu
Source
Venue: International Conference on Software Engineering
Date: 2019-05-01
Citations: 338
Identifiers
DOI: 10.1109/icse.2019.00086
Abstract
Exploiting machine learning techniques for analyzing programs has attracted much attention. One key problem is how to represent code fragments well for follow-up analysis. Traditional information retrieval based methods often treat programs as natural language texts, which could miss important semantic information of source code. Recently, state-of-the-art studies demonstrate that abstract syntax tree (AST) based neural models can better represent source code. However, the sizes of ASTs are usually large and the existing models are prone to the long-term dependency problem. In this paper, we propose a novel AST-based Neural Network (ASTNN) for source code representation. Unlike existing models that work on entire ASTs, ASTNN splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements. Based on the sequence of statement vectors, a bidirectional RNN model is used to leverage the naturalness of statements and finally produce the vector representation of a code fragment. We have applied our neural network based source code representation method to two common program comprehension tasks: source code classification and code clone detection. Experimental results on the two tasks indicate that our model is superior to state-of-the-art approaches.
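The splitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Python source and the standard-library ast module purely for convenience, and the function names split_into_statements and node_types are hypothetical. It walks an AST in preorder, collects statement nodes as a sequence, and lists the node types of each statement without descending into nested statements. The encoder and the bidirectional RNN mentioned in the abstract are not shown.

```python
import ast


def split_into_statements(source):
    """Return the statement nodes of `source` in preorder.

    Assumption: Python's ast.stmt stands in for the paper's notion of a
    statement; this is an illustration, not the ASTNN code.
    """
    statements = []

    def visit(node):
        if isinstance(node, ast.stmt):
            statements.append(node)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return statements


def node_types(node):
    """List node type names under `node`, skipping nested statements.

    Nested statements are skipped because they appear as their own
    entries in the sequence returned by split_into_statements.
    """
    names = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        if not isinstance(child, ast.stmt):
            names.extend(node_types(child))
    return names


example = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
for stmt in split_into_statements(example):
    print(node_types(stmt))
```

Each printed list corresponds to one small statement tree flattened to its node types. In a full pipeline, each such tree would be encoded to a vector and the resulting sequence passed to a sequence model.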