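Computer science
Fault tolerance
Overhead (engineering)
Asynchronous communication
Frame (networking)
Embedded system
Distributed computing
Parallel computing
Operating system
Computer network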
Authors
Konstantinos Parasyris, Giorgis Georgakoudis, Leonardo Bautista-Gomez, Ignacio Laguna
Identifier
DOI:10.1109/ccgrid51090.2021.00020
Abstract
HPC systems continue to scale by including more hardware components to support larger application deployments. Critically, this scaling tends to decrease the mean time between failures, which makes fault tolerance an increasingly important challenge. The standard practice for fault tolerance in HPC is checkpoint/restart. There have been significant but separate efforts to create fast application-layer checkpoint recovery techniques and fast recovery techniques at the MPI layer. However, those techniques operate in isolation and, although they presuppose each other, they have not been designed to jointly optimize end-to-end application recovery. We present FRAME, a fault-tolerance solution that significantly reduces application recovery time by combining, for the first time, an asynchronous multi-level checkpoint library, called Fault Tolerant Interface (FTI), with an online fault-tolerance solution for MPI, called Reinit. Our approach co-designs optimizations that speed up application recovery. Specifically, FRAME leverages the Reinit-enabled MPI to extract the topology of failures and optimize checkpoint retrieval in FTI, saving significant overhead in identifying and fetching the most recent available checkpoint in the system. The FRAME optimization reduces the time to retrieve checkpoints by up to 67% compared with baseline FTI. Results that include Reinit-based recovery for MPI show that our approach reduces end-to-end recovery time by up to 360% when recovering 1.3 TB of checkpointed data in a large-scale execution deployment of 32,768 MPI ranks.
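The retrieval idea in the abstract (use knowledge of which nodes failed to go straight to the right checkpoint copy instead of searching for it) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three storage levels, the partner-copy mapping, and the functions `pick_level` and `plan_recovery` are hypothetical and are not the FTI or Reinit API, nor FRAME's actual algorithm.

```python
# Hypothetical storage levels, ordered from cheapest to most expensive to read.
# These names are illustrative assumptions, not identifiers from FTI.
LOCAL, PARTNER, PFS = "local", "partner", "pfs"


def pick_level(rank, failed_nodes, node_of, partner_of):
    """Return the cheapest level from which `rank` can restore its checkpoint.

    failed_nodes: set of node ids reported as failed (the "failure topology").
    node_of:      dict mapping each rank to the node it runs on.
    partner_of:   dict mapping each rank to the rank holding a copy of its
                  checkpoint (assumed to live on a different node).
    """
    if node_of[rank] not in failed_nodes:
        return LOCAL    # the rank's own node survived, so its local copy is intact
    if node_of[partner_of[rank]] not in failed_nodes:
        return PARTNER  # local copy lost, but the partner's node survived
    return PFS          # both copies lost: fall back to the parallel file system


def plan_recovery(ranks, failed_nodes, node_of, partner_of):
    """Build a per-rank retrieval plan without probing any storage level."""
    return {r: pick_level(r, failed_nodes, node_of, partner_of) for r in ranks}


# Toy layout: 4 ranks, 2 per node, partner copies stored across nodes.
node_of = {0: 0, 1: 0, 2: 1, 3: 1}
partner_of = {0: 2, 1: 3, 2: 0, 3: 1}
plan = plan_recovery(range(4), {1}, node_of, partner_of)
print(plan)
```

In this toy layout, losing node 1 leaves ranks 0 and 1 reading their local copies while ranks 2 and 3 read from their partners on node 0. The point of the sketch is only that the plan is computed from the failure set alone, with no search across storage levels.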