Evaluating Performance Impacts of Delayed Failure Repairing on Large-Scale Systems

2011 IEEE International Conference on Cluster Computing, 2011
Pages: 532-536
DOI: 10.1109/CLUSTER.2011.71

With rapid advances in technology, we are now moving toward exascale computing. Many experts predict that exascale computers will have millions of nodes, billions of threads of execution, hundreds of petabytes of main memory, and exabytes of persistent storage. For systems of such a scale, frequent failures are becoming a serious concern. One of the most important reasons is that failures are hard to detect in a large-scale system, so failure repair may take substantial time. In this paper, we investigate the effect of delayed repair on two popular types of high-performance computing systems: IBM Blue Gene/P and general clusters. We analyze how delayed failure repair affects job performance when some computing units are faulty but not fixed in time. Our study is based on real workload traces and RAS logs collected from production supercomputing systems. Our trace-based simulations indicate that fast failure detection and recovery are essential for moving toward petascale computing and beyond.