Computer science
Labeled data
Artificial intelligence
Machine learning
Retraining
Semi-supervised learning
Classifier (UML)
Test data
Supervised learning
Data mining
Artificial neural network
Business
International trade
Programming language
Authors
Taishi Nishiyama, Atsutoshi Kumagai, Kazunori Kamiya, Kenji Takahashi
Identifier
DOI: 10.1109/iscc50000.2020.9219571
Abstract
Machine learning is becoming a key component for automatically detecting malware-infected hosts by analyzing network logs in a security operations center (SOC). However, machine learning usually requires a large amount of labeled training data, which is difficult to acquire since labels are set manually by professional security analysts. On the other hand, abundant unanalyzed logs accumulate in daily operation and remain unlabeled, even though they could compensate for the shortage of labeled training data. This paper proposes SILU, a novel semi-supervised learning method that fully leverages unlabeled data and enhances detection capability without increasing the amount of manually labeled data. SILU learns from combined labeled and unlabeled training data to automatically augment the labeled training data, and then generates a classifier through a screening process. Unlike most semi-supervised learning methods used in cyber security, which use test data as unlabeled training data, SILU does not require retraining every time the test data change, since it can use different datasets for unlabeled training and testing. This keeps detection time practical in SOC operation. In addition, although SILU incorporates a supervised learning method, it does not depend on any specific one; it can therefore be added on to any supervised classifier. Moreover, SILU can suppress the deterioration of classification performance on test data through the screening process. We evaluated SILU using two types of real-world logs: proxy logs from a large enterprise and NetFlow from a large ISP. By evaluating with different types of classifiers, we demonstrated that SILU consistently improves the detection capability of supervised learning methods. SILU also outperforms current semi-supervised methods. As a whole, SILU works as an add-on to existing supervised learning methods with little overhead and performs better than conventional supervised learning methods. Our evaluation also shows that using NetFlow from an ISP as unlabeled training data works better than using only labeled proxy logs from the same enterprise. These results suggest that SILU can extend detection capability further when different organizations, e.g., SOCs and ISPs, collaborate and share unlabeled data.
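The abstract describes SILU only at a high level: augment labeled training data from unlabeled logs, screen the augmented data, and retrain an arbitrary supervised classifier. The paper's actual algorithm is not reproduced here, so the following is a minimal sketch of that general family of techniques, assuming a self-training-style loop with a confidence-based screening step around any scikit-learn classifier. The function name, the confidence threshold, and the screening criterion are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of a semi-supervised add-on in the spirit described by
# the abstract: pseudo-label unlabeled data with a base supervised classifier,
# screen out low-confidence pseudo-labels, and retrain on the augmented set.
# This is NOT the authors' SILU implementation; all names and thresholds are
# assumptions for illustration.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def augment_and_retrain(base_clf, X_labeled, y_labeled, X_unlabeled,
                        confidence_threshold=0.9):
    """Fit base_clf on labeled data, pseudo-label unlabeled data, keep only
    confident pseudo-labels (a simple stand-in for a screening step), and
    retrain on the combined set."""
    # 1. Fit the base supervised classifier on labeled data only.
    clf = clone(base_clf).fit(X_labeled, y_labeled)

    # 2. Pseudo-label the unlabeled data and keep confident predictions.
    proba = clf.predict_proba(X_unlabeled)
    pseudo_labels = clf.classes_[proba.argmax(axis=1)]
    confident = proba.max(axis=1) >= confidence_threshold
    pseudo_X = X_unlabeled[confident]
    pseudo_y = pseudo_labels[confident]

    # 3. Retrain on labeled data plus the screened pseudo-labeled data.
    X_aug = np.vstack([X_labeled, pseudo_X])
    y_aug = np.concatenate([y_labeled, pseudo_y])
    return clone(base_clf).fit(X_aug, y_aug)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for labeled proxy-log features and unlabeled NetFlow features.
    X_lab = rng.normal(size=(100, 5))
    y_lab = (X_lab[:, 0] > 0).astype(int)
    X_unlab = rng.normal(size=(1000, 5))
    model = augment_and_retrain(LogisticRegression(), X_lab, y_lab, X_unlab)
    print(model.predict(rng.normal(size=(10, 5))))
```

Because the wrapper only requires `fit` and `predict_proba`, it can be attached to any supervised classifier, which mirrors the abstract's claim that SILU works as an add-on independent of the underlying supervised learning method.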