Iso-level CAFT : how to tackle the combination of communication overhead reduction and fault tolerance scheduling
Author :
Benoit, Anne; Hakem, Mourad; Robert, Yves (Laboratoire de l'informatique du parallélisme)
Abstract :
To schedule precedence task graphs in a more realistic framework, we
introduce an efficient fault-tolerant scheduling algorithm that is both
contention-aware and capable of supporting arbitrary fail-silent (fail-stop)
processor failures. The design of the proposed algorithm, which we
call Iso-Level CAFT, is motivated by (i) the search for a better load balance
and (ii) the generation of fewer communications. These goals
are achieved by scheduling a chunk of ready tasks simultaneously, which
provides a global view of the potential communications. Our goal
is to minimize the total execution time, or latency, while tolerating an
arbitrary number of processor failures. Our approach is based on an
active replication scheme to mask failures, so that there is no need for
detecting and handling such failures. Major achievements include low
complexity and a drastic reduction in the number of additional communications
induced by the replication mechanism. The experimental
results fully demonstrate the usefulness of Iso-Level CAFT.
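
The two ideas named in the abstract, scheduling a chunk of ready tasks together and masking failures through active replication, can be illustrated with a small sketch. This is not the authors' algorithm: the function names `iso_levels` and `replicate`, the parameter `eps` (the number of failures to tolerate), and the round-robin placement are all assumptions made for illustration, and the sketch ignores contention and communication costs entirely. It only shows how tasks of a precedence graph might be grouped by level, and how `eps + 1` replicas of each task could be placed on distinct processors so that any `eps` fail-stop failures leave at least one replica alive.

```python
from collections import defaultdict


def iso_levels(succ):
    """Group the tasks of a DAG into chunks of equal level.

    succ maps each task to the list of its successors. A task's level is
    the length of the longest path from any entry task, so all tasks in
    one chunk are mutually independent. (Hypothetical helper.)
    """
    tasks = set(succ) | {v for vs in succ.values() for v in vs}
    indeg = {t: 0 for t in tasks}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1

    level = {t: 0 for t in tasks}
    ready = [t for t in tasks if indeg[t] == 0]
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in succ.get(u, ()):
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(tasks):
        raise ValueError("the task graph contains a cycle")

    groups = defaultdict(list)
    for t in tasks:
        groups[level[t]].append(t)
    return [sorted(groups[lvl]) for lvl in sorted(groups)]


def replicate(chunk, num_procs, eps):
    """Assign eps + 1 replicas of each task to distinct processors.

    Placement is plain round-robin, a placeholder assumption rather than
    a contention-aware processor selection.
    """
    if eps + 1 > num_procs:
        raise ValueError("need at least eps + 1 processors")
    placement = {}
    start = 0
    for task in chunk:
        placement[task] = [(start + k) % num_procs for k in range(eps + 1)]
        start = (start + eps + 1) % num_procs
    return placement


# A diamond-shaped task graph: a -> {b, c} -> d
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
chunks = iso_levels(graph)
plan = [replicate(chunk, num_procs=4, eps=1) for chunk in chunks]
```

Handling a whole chunk at once is what makes it possible to look at all of its incoming communications together before placing replicas. A real scheduler would replace the round-robin step with a choice driven by estimated finish times and link contention.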