Fault-Tolerance Strategies for HPC Platforms
1 : University of Tennessee
The NSF SMURFS project explores the impact of faults and failures, fault mitigation strategies, and emerging technologies by providing new analytical and component models for predicting fault-tolerant application behavior at scale. In this talk, I will present recent results from case studies developed in the context of SMURFS, focusing on resource sharing, node provisioning, and resilience strategies.