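Computer science
Cluster analysis
Troubleshooting
Data mining
Performance indicator
Cloud computing
Matching (statistics)
Microservices
Service (business)
Machine learning
Operating system
Mathematics
Statistics
Economics
Management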
Authors
Shilin He, Qingwei Lin, Jian-Guang Lou, Hongyu Zhang, Michael R. Lyu, Dongmei Zhang
Identifier
DOI:10.1145/3236024.3236083
Abstract
Logs are often used for troubleshooting in large-scale software systems. For a cloud-based online system that provides 24/7 service, a huge number of logs can be generated every day. However, these logs are generally highly imbalanced, because most logs indicate normal system operations, and only a small percentage of logs reveal impactful problems. Problems that lead to the decline of system KPIs (Key Performance Indicators) are impactful and should be fixed by engineers with a high priority. Furthermore, there are various types of system problems, which are hard to distinguish manually. In this paper, we propose Log3C, a novel clustering-based approach to promptly and precisely identify impactful system problems, by utilizing both log sequences (sequences of log events) and system KPIs. More specifically, we design a novel cascading clustering algorithm, which can greatly reduce clustering time while maintaining high accuracy by iteratively sampling, clustering, and matching log sequences. We then identify the impactful problems by correlating the clusters of log sequences with system KPIs. Log3C is evaluated on real-world log data collected from an online service system at Microsoft, and the results confirm its effectiveness and efficiency. Furthermore, our approach has been successfully applied in industrial practice.
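The sample, cluster, match loop mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract gives no algorithmic details, so everything concrete here is an assumption made for illustration: log sequences are taken to be already encoded as numeric vectors, similarity is Euclidean distance, `cluster` is a simple greedy threshold clustering used as a stand-in (not the clustering method of the paper), and the names `cascading_clustering`, `sample_rate`, and `threshold` are hypothetical.

```python
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster(points, threshold):
    """Stand-in clustering (assumption): greedily keep one representative
    per group of points that lie within `threshold` of each other."""
    reps = []
    for p in points:
        if all(distance(p, r) > threshold for r in reps):
            reps.append(p)
    return reps


def cascading_clustering(vectors, sample_rate=0.1, threshold=0.5, seed=0):
    """Iteratively sample, cluster, and match until every vector is covered.

    Returns the list of representative patterns that were found.
    """
    rng = random.Random(seed)
    remaining = list(vectors)
    patterns = []
    while remaining:
        # 1. Sampling: draw a small random subset of the unmatched data.
        k = max(1, int(len(remaining) * sample_rate))
        sample = rng.sample(remaining, k)
        # 2. Clustering: run the (expensive) clustering on the sample only.
        new_patterns = cluster(sample, threshold)
        patterns.extend(new_patterns)
        # 3. Matching: drop every vector that is close to a new pattern;
        #    whatever is left goes into the next round.
        remaining = [
            v for v in remaining
            if all(distance(v, p) > threshold for p in new_patterns)
        ]
    return patterns


if __name__ == "__main__":
    # Two tight groups of toy "log sequence vectors" (made-up data).
    data = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
    print(len(cascading_clustering(data, threshold=1.0)))
```

Under these assumptions, the expensive clustering step only ever sees a small sample, while the cheaper matching step removes the bulk of the data (typically the dominant, normal-operation sequences) in early rounds. Correlating the resulting clusters with KPIs, the second stage described in the abstract, is not covered by this sketch.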