计算机科学
编码(集合论)
克隆(Java方法)
人工智能
自然语言处理
情报检索
程序设计语言
万维网
数据挖掘
生物
基因
生物化学
集合(抽象数据类型)
作者
Zexuan Wan,Chunli Xie,Quanrun Lv,Y.-H. Fan
标识
DOI:10.1142/s0218194024500323
摘要
Semantic code clone detection is to find code snippets that are structurally or syntactically different, but semantically identical. It plays an important role in software reuse, code compression. Many existing studies have achieved good performance in non-semantic clone, but semantic clone is still a challenging task. Recently, several works have used tree or graph, such as Abstract Syntax Tree (AST), Control Flow Graph (CFG) or Program Dependency Graph (PDG) to extract semantic information from source codes. In order to reduce the complexity of tree and graph, some studies transform them into node sequences. However, this transformation will lose some semantic information. To address this issue, we propose a novel high-performance method that utilizes community detection to extract features of AST while preserving its semantic information. First, based on the AST of source code, we exploit community detection to split AST into different subtrees to extract the underlying semantics information of different code blocks, and use centrality analysis to quantify the semantic information as the weight of AST nodes. Then, the AST is converted into a sequence of tokens with weights, and a Siamese neural network model is used to detect the similarity of token sequences for semantic code clone detection. Finally, to evaluate our approach, we conduct experiments on two standard benchmark datasets, Google Code Jam (GCJ) and BigCloneBench (BCB). Experimental results show that our model outperforms the eight publicly available state-of-the-art methods in detecting code clones. It is five times faster than the tree-based method (ASTNN) in terms of time complexity.
科研通智能强力驱动
Strongly Powered by AbleSci AI