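Testbed
Computer science
Offset (computer science)
Classifier (UML)
System monitoring
Cluster (spacecraft)
Distributed computing
Embedded system
Artificial intelligence
Operating system
Computer network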
Authors
Andres Quan, Leah Howell, Hugh Greenberg
Identifier
DOI:10.1145/3624062.3624128
Abstract
Identifying system hardware failures and anomalies is a unique challenge in heterogeneous testbed clusters because of variation in the ways that the system log reports errors and warnings. We present a novel approach for the real-time classification of syslog messages generated by a heterogeneous testbed cluster to proactively identify potential hardware issues and security events. By integrating machine learning models with high-performance computing systems, our system facilitates continuous system health monitoring. The paper introduces a taxonomy for classifying system issues into actionable categories of problems, while filtering out groups of messages that system administrators would consider unimportant "noise". Finally, we experiment with using large language models as a message classifier and share our results and experience. The results demonstrate promising performance and more explainable output than currently available techniques, but the computational costs may offset the benefits.