Computer science
Edge computing
Artificial intelligence
Machine learning
Enhanced Data Rates for GSM Evolution (EDGE)
Fault tolerance
Scheduling (production processes)
Reliability (semiconductor)
Latency (audio)
Fault detection and isolation
Distributed computing
Deep learning
Data mining
Engineering
Power (physics)
Operations management
Physics
Telecommunications
Quantum mechanics
Actuator
Authors
Shreshth Tuli, Giuliano Casale, Ludmila Cherkasova, Nicholas R. Jennings
Identifier
DOI: 10.1109/infocom53939.2023.10229049
Abstract
The emergence of latency-critical AI applications has been supported by the evolution of the edge computing paradigm. However, edge solutions are typically resource-constrained, posing reliability challenges due to heightened contention for compute capacities and faulty application behavior in the presence of overload conditions. Although a large amount of generated log data can be mined for fault prediction, labeling this data for training is a manual process and thus a limiting factor for automation. Due to this, many companies resort to unsupervised fault-tolerance models. Yet, failure models of this kind can incur a loss of accuracy when they need to adapt to non-stationary workloads and diverse host characteristics. Thus, we propose a novel modeling approach, DeepFT, to proactively avoid system overloads and their adverse effects by optimizing the task scheduling decisions. DeepFT uses a deep-surrogate model to accurately predict and diagnose faults in the system and co-simulation based self-supervised learning to dynamically adapt the model in volatile settings. Experimentation on an edge cluster shows that DeepFT can outperform state-of-the-art methods in fault-detection and QoS metrics. Specifically, DeepFT gives the highest F1 scores for fault-detection, reducing service deadline violations by up to 37% while also improving response time by up to 9%.
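The abstract's core loop — a learned surrogate scores candidate scheduling decisions for fault risk, and training labels come from a co-simulator rather than manual annotation — can be sketched briefly. The following Python/PyTorch sketch is an illustration only, not the authors' implementation: the architecture, the co_simulate stand-in, and all dimensions (NUM_HOSTS, STATE_DIM) are hypothetical.

import torch
import torch.nn as nn

NUM_HOSTS, STATE_DIM = 4, 8  # hypothetical cluster size and per-host feature width

class SurrogateModel(nn.Module):
    """Maps (system state, candidate scheduling decision) to a fault score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_HOSTS * STATE_DIM + NUM_HOSTS, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),  # predicted fault probability in [0, 1]
        )

    def forward(self, state, decision):
        # state: (batch, NUM_HOSTS, STATE_DIM); decision: (batch, NUM_HOSTS)
        return self.net(torch.cat([state.flatten(1), decision], dim=1))

def co_simulate(state, decision):
    # Toy stand-in for the co-simulator: replay the decision and report
    # whether any host would overload (utilization + placed work > 1.0).
    load = state[:, :, 0] + decision
    return (load.max(dim=1).values > 1.0).float().unsqueeze(1)

model = SurrogateModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

# Self-supervised adaptation: targets come from the co-simulator, so no
# manually labeled fault data is required (the bottleneck the abstract notes).
for step in range(200):
    state = torch.rand(32, NUM_HOSTS, STATE_DIM)   # synthetic telemetry batch
    decision = torch.rand(32, NUM_HOSTS)           # random candidate placements
    loss = loss_fn(model(state, decision), co_simulate(state, decision))
    opt.zero_grad(); loss.backward(); opt.step()

# Proactive scheduling: score one-task-per-host candidates, pick the least risky.
with torch.no_grad():
    state = torch.rand(1, NUM_HOSTS, STATE_DIM)
    candidates = torch.eye(NUM_HOSTS)
    scores = model(state.expand(NUM_HOSTS, -1, -1), candidates)
print("least-risky placement: host", scores.argmin().item())

In the paper this role is played by a trained deep surrogate together with co-simulation of the actual edge cluster; the toy overload rule above merely stands in for that fault signal.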