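Computer science
Search engine indexing
Information retrieval
Relevance (law)
String metric
String (physics)
Cosine similarity
String-searching algorithm
Document retrieval
Process (computing)
Similarity (geometry)
Semantic similarity
Matching (statistics)
Relevance feedback
Document classification
Document processing
Artificial intelligence
Data mining
Pattern matching
Cluster analysis
Image retrieval
Physics
Image (mathematics)
Mathematics
Political science
Operating system
Statistics
Law
Quantum mechanics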
Authors
Muzammil Hussain Shahid, Muhammad Arshad Islam
Identifier
DOI:10.1109/raeecs50817.2020.9265792
Abstract
Portable Document Format (PDF) is a commonly used format for scientific publications. Currently, Automated Compliance Engines and Natural Language Processing (NLP) based systems take an input document and test the compliance and relevance of the document or text. The whole document text is searched for the compliance rules, which is a computationally expensive and slow process. To speed up the compliance checking process and make it cost-efficient, this paper proposes a method based on a Table of Contents (TOC) data structure. This work proposes PDFparser, which performs data indexing, separates heading text from non-heading text, creates a hierarchy of headings, and generates a TOC to reduce the time and space required for semantic-based string searching. Furthermore, NLP-based systems mostly use semantic-based string matching. The proposed PDFparser uses the Cosine Similarity method to compute semantic-based similarity. The proposed method performs 47.2% better than the previous approach of searching the non-indexed whole document, and it decreases search time and space. In the worst-case scenario, where no string match is found, the proposed method performs 20.5% better.
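The idea of matching a query against TOC headings with cosine similarity, rather than against the whole document text, can be illustrated with a minimal Python sketch. This is not the authors' PDFparser: the abstract does not say how text is vectorized, so the sketch assumes plain term-frequency vectors, and the names `tokenize`, `cosine_similarity`, `best_heading`, and the `(level, heading)` TOC layout are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; the paper's tokenizer is not specified.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    # Cosine of the angle between the term-frequency vectors of two strings.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string shares nothing with any other string
    return dot / (norm_a * norm_b)


def best_heading(query, toc):
    # toc is a hypothetical list of (level, heading) pairs.
    # Returns (score, heading) for the closest heading, or (0.0, None) if toc is empty.
    scored = [(cosine_similarity(query, heading), heading) for _, heading in toc]
    return max(scored, key=lambda pair: pair[0], default=(0.0, None))


toc = [
    (1, "Introduction"),
    (1, "Proposed Method"),
    (2, "Data Indexing"),
    (2, "Cosine Similarity"),
    (1, "Results"),
]
score, heading = best_heading("cosine similarity computation", toc)
```

Because only the headings are scored, the work per query grows with the number of TOC entries rather than with the length of the document, which is the kind of saving the abstract describes.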