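Computer science
Jargon
Word embedding
Similarity (geometry)
Construct (Python library)
Word (group theory)
Natural language processing
Artificial intelligence
Quality (philosophy)
Embedding
Linguistics
Image (mathematics)
Epistemology
Philosophy
Programming language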
Authors
Liang Ke,Xinyu Chen,Haizhou Wang
Identifier
DOI:10.1145/3488560.3498469
Abstract
With the continuous development of darknet technology, the scale of the darknet has increased rapidly in recent years, leading to rampant crime in its anonymous trading markets. Monitoring these markets can effectively combat the criminal forces that hide behind them. One of the difficulties in understanding the darknet is that criminals usually use jargon to disguise transactions and thus avoid surveillance. Jargon terms usually distort the original meaning of innocent-looking words in obscure ways, posing significant challenges for crime tracking. Current research on Chinese jargon detection mainly relies on keyword filtering, but such methods have little effect on the complex and ever-changing structure of darknet jargon. We propose a Chinese jargon detection framework based on unsupervised learning. The main idea is to compare the similarity of high-dimensional word embedding features drawn from different corpora in order to find jargon. First, we collect data from six Chinese Tor websites to build a dark corpus dataset. Next, we build a word-based pre-training model called DC-BERT, which can generate high-quality contextual word embeddings. Finally, we construct a cross-corpus jargon detection framework based on similarity analysis, which can effectively detect Chinese jargon in the darknet. The experimental results show that the proposed framework is both innovative and practical, reaching a detection accuracy of 91.5%.
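The cross-corpus comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each word already has one embedding per corpus, that both embeddings live in a comparable vector space, and that cosine similarity with a fixed cutoff is the comparison rule. The function names, the threshold of 0.5, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def find_jargon_candidates(dark_embeddings, benign_embeddings, threshold=0.5):
    """Return (word, similarity) pairs for words whose usage differs across corpora.

    A word present in both corpora is flagged when the similarity between its
    two embeddings falls below `threshold` (a hypothetical cutoff, not a value
    taken from the paper). Results are sorted from least to most similar.
    """
    candidates = []
    shared_words = sorted(dark_embeddings.keys() & benign_embeddings.keys())
    for word in shared_words:
        sim = cosine_similarity(dark_embeddings[word], benign_embeddings[word])
        if sim < threshold:
            candidates.append((word, sim))
    return sorted(candidates, key=lambda pair: pair[1])


# Toy 3-dimensional vectors, invented for illustration only.
dark = {"word_a": [0.9, 0.1, 0.0], "word_b": [0.0, 0.2, 0.9]}
benign = {"word_a": [0.8, 0.2, 0.1], "word_b": [0.9, 0.1, 0.1]}

candidates = find_jargon_candidates(dark, benign)
for word, sim in candidates:
    print(f"{word}: similarity {sim:.2f}")
```

In this toy data only `word_b` is flagged, because its two vectors point in very different directions, while `word_a` is used similarly in both corpora. With contextual models such as the DC-BERT described above, the per-word vectors would first have to be aggregated over many occurrences, a step this sketch leaves out.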