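Computer science
Artificial intelligence
Overhead (engineering)
Forum spam
Focus (optics)
Spam
Machine learning
Transformer
Encoder
Language model
Deep learning
Spambot
Natural language processing
Internet
Speech recognition
World Wide Web
Physics
Voltage
Optics
Operating system
Quantum mechanics
Identifier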
DOI:10.1109/globecom42002.2020.9347970
Abstract
Spam has long plagued Internet users, and detecting it accurately and efficiently remains a critical problem. Many detection approaches have been proposed, including blacklists and whitelists, machine learning methods, and deep learning methods that operate at the content level. Reviewing this prior work, we find that most methods can reach an accuracy of 0.95 when they target a single type of spam in a single language. Today, however, users receive spam of different types, from different sources, and even in different languages. To address this, we develop a novel model based on Google's multilingual bidirectional encoder representations from transformers (M-BERT), and we construct a new bilingual, multi-type spam dataset to train it. In particular, we use optical character recognition (OCR) to extract text from image-based spam. In our experiments, the proposed model reaches an accuracy of 0.9648, outperforming the comparison models. In terms of time overhead, the proposed model costs only 0.3168 seconds per training step, which is acceptable. These results demonstrate that our approach can detect bilingual multi-type spam effectively.
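The flow the abstract describes (extract text from image-based spam with OCR, then classify the text with a single multilingual model) might be sketched roughly as below. This is an illustrative outline, not the authors' implementation: `Message`, `detect_spam`, the 0.5 threshold, and the stand-in `fake_ocr` and `fake_score` functions are all hypothetical names introduced here. In a real system the `ocr` callable would wrap an OCR engine and `score` would wrap a fine-tuned M-BERT classifier; the English and Chinese keywords in the stand-in are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    """A message that may carry text, an image, or both (hypothetical type)."""
    text: Optional[str] = None     # body of a text message, if any
    image: Optional[bytes] = None  # raw image bytes for image-based spam, if any


def detect_spam(
    msg: Message,
    ocr: Callable[[bytes], str],     # assumed: returns text found in an image
    score: Callable[[str], float],   # assumed: returns a spam probability in [0, 1]
    threshold: float = 0.5,          # assumed decision threshold
) -> bool:
    """Route every message type through one text classifier."""
    parts = []
    if msg.text:
        parts.append(msg.text)
    if msg.image is not None:
        # Image-based spam is converted to text first, so the same
        # classifier handles both text and image messages.
        parts.append(ocr(msg.image))
    content = " ".join(p.strip() for p in parts if p and p.strip())
    if not content:
        return False  # nothing to classify
    return score(content) >= threshold


# Stand-ins so the sketch runs without an OCR engine or a trained model.
def fake_ocr(image: bytes) -> str:
    # Pretends the image bytes are the UTF-8 text printed in the image.
    return image.decode("utf-8")


def fake_score(text: str) -> float:
    # Toy keyword rule in place of a multilingual model.
    return 1.0 if ("free prize" in text.lower() or "中奖" in text) else 0.0
```

Passing the OCR step and the classifier in as callables keeps the routing logic independent of any particular OCR library or model checkpoint, so either can be swapped without touching `detect_spam`.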