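Glossary
Computer science
Documentation
Identifier
Natural language processing
Domain (mathematical analysis)
Software documentation
Artificial intelligence
Heuristic
Context (archaeology)
Information retrieval
Set (abstract data type)
Internal documentation
Source code
Natural language
Software
Software development
Programming language
Linguistics
Software development process
Software construction
Philosophy
Paleontology
Mathematical analysis
Operating system
Biology
Mathematics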
Authors
Chong Wang, Xin Peng, Mingwei Liu, Zhenchang Xing, Xuefang Bai, Bing Xie, Tuo Wang
Identifier
DOI:10.1145/3338906.3338963
Abstract
A domain glossary that organizes domain-specific concepts and their aliases and relations is essential for knowledge acquisition and software development. Existing approaches use linguistic heuristics or term-frequency-based statistics to identify domain-specific terms from software documentation, and thus the accuracy is often low. In this paper, we propose a learning-based approach for automatic construction of a domain glossary from source code and software documentation. The approach uses a set of high-quality seed terms identified from code identifiers and natural language concept definitions to train a domain-specific prediction model to recognize glossary terms based on the lexical and semantic context of the sentences mentioning domain-specific concepts. It then merges the aliases of the same concepts to their canonical names, selects a set of explanation sentences for each concept, and identifies "is a", "has a", and "related to" relations between the concepts. We apply our approach to the deep learning domain and the Hadoop domain and harvest 5,382 and 2,069 concepts together with 16,962 and 6,815 relations, respectively. Our evaluation validates the accuracy of the extracted domain glossary and its usefulness for the fusion and acquisition of knowledge from different documents of different projects.
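To make the shape of the resulting glossary concrete, the following is a minimal Python sketch of a data structure that could hold canonical concept names, their merged aliases, explanation sentences, and the three relation types named in the abstract. It is not the authors' implementation: the class names (`Concept`, `Glossary`), the methods, and the string normalization used to match aliases are all hypothetical, and the sketch does not cover the learned prediction model that recognizes glossary terms.

```python
from dataclasses import dataclass, field

# The three relation types named in the abstract.
RELATION_TYPES = {"is a", "has a", "related to"}


@dataclass
class Concept:
    canonical: str
    aliases: set = field(default_factory=set)
    explanations: list = field(default_factory=list)


class Glossary:
    """Hypothetical container for concepts, aliases, and relations."""

    def __init__(self):
        self.concepts = {}     # canonical name -> Concept
        self.alias_index = {}  # normalized surface form -> canonical name
        self.relations = set() # (source, relation type, target)

    @staticmethod
    def normalize(term):
        # Assumption: aliases are matched after lowercasing and
        # unifying hyphens, underscores, and whitespace.
        cleaned = term.lower().replace("-", " ").replace("_", " ")
        return " ".join(cleaned.split())

    def add_term(self, term, canonical=None):
        """Register a surface form, merging it into a known concept if possible."""
        key = self.normalize(term)
        name = self.alias_index.get(key) or canonical or term
        concept = self.concepts.setdefault(name, Concept(name))
        if term != name:
            concept.aliases.add(term)
        self.alias_index[key] = name
        self.alias_index.setdefault(self.normalize(name), name)
        return concept

    def resolve(self, term):
        """Map any known surface form to its canonical name (KeyError if unknown)."""
        return self.alias_index[self.normalize(term)]

    def add_relation(self, source, relation, target):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unsupported relation type: {relation}")
        self.relations.add((self.resolve(source), relation, self.resolve(target)))


if __name__ == "__main__":
    glossary = Glossary()
    glossary.add_term("neural network")
    glossary.add_term("convolutional neural network")
    glossary.add_term("CNN", canonical="convolutional neural network")
    glossary.add_relation("CNN", "is a", "neural network")
    print(glossary.resolve("cnn"))
    print(sorted(glossary.relations))
```

Storing relations against canonical names rather than surface forms means that a relation stated for one alias applies to the whole concept, which is one plausible way to support merging knowledge across documents.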