Abstract
Fault tolerance is becoming increasingly important for upcoming exascale systems that support distributed data processing, because their Mean Time Between Failures (MTBF) is expected to decrease. Addressing the fault tolerance challenge is crucial to ensuring the availability, reliability, dependability, and performance of such systems. Fault tolerance aims to keep a distributed system running, even if at reduced capacity, and to avoid complete data loss in the presence of faults, with minimal impact on system performance. This survey provides a detailed account of the importance of fault tolerance in distributed systems, including a classification of faults, errors, failures, and fault-tolerant techniques (reactive, proactive, and predictive). We collected a corpus of 490 papers published from 2014 to 2023 by searching the Scopus, IEEE Xplore, Springer, and ACM Digital Library databases. After a systematic review, 17 reactive models, 17 proactive models, and 14 predictive models were shortlisted and compared. For each of these categories of fault-tolerant solutions, we also created a taxonomy of the ideas behind the proposed models. Additionally, the survey examines how fault tolerance is incorporated into popular big data processing tools such as Apache Hadoop, Spark, and Flink. Finally, promising future research directions in this domain are discussed.