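Computer science
Hierarchy
Information retrieval
Data mining
Information extraction
Data science
Set (abstract data type)
Population
Knowledge extraction
Parsing
Exploit
Focus (optics)
Artificial intelligence
Physics
Sociology
Demography
Economics
Optics
Programming language
Computer security
Market economy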
Authors
Juraj Mavračić,Callum J. Court,Taketomo Isazawa,Stephen R. Elliott,Jacqueline M. Cole
Identifier
DOI:10.1021/acs.jcim.1c00446
Abstract
The ever-growing abundance of data found in heterogeneous sources, such as scientific publications, has forced the development of automated techniques for data extraction. In the physical sciences, the focus has previously been on the precise extraction of individual properties, but attention has recently turned to the extraction of higher-level relationships. Here, we present a framework for the automated population of ontologies, that is, the direct extraction of a larger group of properties linked by a semantic network. We exploit data-rich sources, such as tables within documents, and present a new model concept that enables data extraction for chemical and physical properties with the ability to organize hierarchical data as nested information. Combining these capabilities with automatically generated parsers for data extraction and forward-looking interdependency resolution, we illustrate the power of our approach via the automatic extraction of a crystallographic hierarchy of information. This includes 18 interrelated submodels of nested data, extracted from an evaluation set of scientific articles, yielding an overall precision of 92.2% across 26 different journals. Our method and associated toolkit, ChemDataExtractor 2.0, offer a key step toward the seamless integration of primary literature sources into a data-driven scientific framework.