Computer science
Labeled data
Artificial intelligence
Machine learning
Retraining
Semi-supervised learning
Classifier (UML)
Test data
Supervised learning
Data mining
Artificial neural network
Business
International trade
Programming language
Authors
Taishi Nishiyama, Atsutoshi Kumagai, Kazunori Kamiya, Kenji Takahashi
Identifier
DOI: 10.1109/iscc50000.2020.9219571
Abstract
Machine learning is becoming a key component for automatically detecting malware-infected hosts by analyzing network logs in a security operations center (SOC). However, machine learning usually requires a large amount of labeled training data, which is difficult to acquire since labels are set manually by professional security analysts. On the other hand, abundant unanalyzed logs accumulate in daily operation and remain unlabeled, even though they could compensate for the shortage of labeled training data. This paper proposes SILU, a novel semi-supervised learning method that fully leverages unlabeled data and enhances detection capability without increasing the amount of manually labeled data. SILU learns from combined labeled and unlabeled training data to automatically augment the labeled training data, and then generates a classifier through a screening process. Unlike most semi-supervised learning methods used in cyber security, which use test data as unlabeled training data, SILU does not require retraining every time the test data change, since it can use different datasets for unlabeled training and testing. This keeps detection time practical in SOC operation. In addition, although SILU incorporates a supervised learning method, it does not depend on any specific one; it can therefore be added on to any supervised classifier. Moreover, SILU can suppress the deterioration of classification performance on test data through the screening process. We evaluated SILU using two types of real-world logs: proxy logs from a large enterprise and NetFlow from a large ISP. By evaluating with different types of classifiers, we demonstrated that SILU consistently improves the detection capability of supervised learning methods. SILU also outperforms current semi-supervised methods. As a whole, SILU works as an add-on to existing supervised learning methods with little overhead and performs better than conventional supervised learning methods. Our evaluation also shows that using NetFlow from an ISP as unlabeled training data works better than using only labeled proxy logs from the same enterprise. These results suggest that SILU can extend detection capability further when different organizations, e.g., SOCs and ISPs, collaborate and share unlabeled data.
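The abstract describes SILU only at a high level: augment labeled training data from unlabeled logs, screen the augmented data, and retrain an arbitrary supervised classifier. The paper's actual algorithm is not reproduced here, so the following is a minimal sketch of that general family of techniques, assuming a self-training-style loop with a confidence-based screening step around any scikit-learn classifier. The function name, the confidence threshold, and the screening criterion are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of a semi-supervised add-on in the spirit described by
# the abstract: pseudo-label unlabeled data with a base supervised classifier,
# screen out low-confidence pseudo-labels, and retrain on the augmented set.
# This is NOT the authors' SILU implementation; all names and thresholds are
# assumptions for illustration.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def augment_and_retrain(base_clf, X_labeled, y_labeled, X_unlabeled,
                        confidence_threshold=0.9):
    """Fit base_clf on labeled data, pseudo-label unlabeled data, keep only
    confident pseudo-labels (a simple stand-in for a screening step), and
    retrain on the combined set."""
    # 1. Fit the base supervised classifier on labeled data only.
    clf = clone(base_clf).fit(X_labeled, y_labeled)

    # 2. Pseudo-label the unlabeled data and keep confident predictions.
    proba = clf.predict_proba(X_unlabeled)
    pseudo_labels = clf.classes_[proba.argmax(axis=1)]
    confident = proba.max(axis=1) >= confidence_threshold
    pseudo_X = X_unlabeled[confident]
    pseudo_y = pseudo_labels[confident]

    # 3. Retrain on labeled data plus the screened pseudo-labeled data.
    X_aug = np.vstack([X_labeled, pseudo_X])
    y_aug = np.concatenate([y_labeled, pseudo_y])
    return clone(base_clf).fit(X_aug, y_aug)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for labeled proxy-log features and unlabeled NetFlow features.
    X_lab = rng.normal(size=(100, 5))
    y_lab = (X_lab[:, 0] > 0).astype(int)
    X_unlab = rng.normal(size=(1000, 5))
    model = augment_and_retrain(LogisticRegression(), X_lab, y_lab, X_unlab)
    print(model.predict(rng.normal(size=(10, 5))))
```

Because the wrapper only requires `fit` and `predict_proba`, it can be attached to any supervised classifier, which mirrors the abstract's claim that SILU works as an add-on independent of the underlying supervised learning method.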