Realistic models and efficient algorithms for fault-tolerant scheduling on heterogeneous platforms
Author :
Benoit, Anne Hakem, Mourad Robert, Yves Laboratoire de l'informatique du parallélisme
Abstract :
(eng) Abstract
Most list scheduling heuristics rely on a simple platform model where
communication contention is not taken into account. In addition, it is
generally assumed that processors in the systems are completely safe. To
schedule precedence graphs in a more realistic framework, we introduce
an efficient fault tolerant scheduling algorithm that is both contentionaware
and capable of supporting " arbitrary fail-silent (fail-stop) processor
failures. We focus on a bi-criteria approach, where we aim at
minimizing the total execution time, or latency, given a fixed number of
failures supported in the system. Our algorithm has a low time complexity,
and drastically reduces the number of additional communications
induced by the replication mechanism. Experimental results fully
demonstrate the usefulness of the proposed algorithm, which leads to
efficient execution schemes while guaranteeing a prescribed level of fault
tolerance.