Fault-tolerant parallel architectures.
Fusun Ozguner, Baback Izadi, Shobana Balakrishnan
This research utilizes the capabilities of circuit-switched
communication to present reconfiguration schemes that keep
multiprocessors based on the hypercube, the k-ary n-cube, and the
k-ary tree operational in the presence of faulty processor nodes
and/or faulty communication links. We require that the resultant
reconfiguration cause no modification to the existing communication or
computation algorithms. Our first hardware redundancy scheme is
called the cluster approach. The approach assigns one spare node
to each subset of regular nodes called a cluster. Local reconfiguration
is used to replace the faulty components with spares. Our simulation
results indicate that the approach tolerates a moderate number of
faults. In our second hardware redundancy scheme, called the enhanced
cluster approach, spare nodes of neighboring clusters are
interconnected as well. By utilizing the circuit-switched capabilities
of the spare nodes' communication modules, multiple faulty nodes per
cluster are tolerated. Our theoretical and simulation results indicate
that the approach tolerates significantly more faults than other
proposed schemes in the literature.
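The basic cluster approach can be illustrated with a minimal sketch. The abstract does not say how clusters are formed, so this example assumes (hypothetically) that a cluster is the set of hypercube nodes sharing all but their low-order cluster_dim address bits, and that each cluster has exactly one spare. The names cluster_id and local_reconfigure are illustrative only. The enhanced cluster approach, which links neighboring spares, is not modeled here.

```python
from collections import defaultdict


def cluster_id(node: int, cluster_dim: int) -> int:
    """Assumed clustering: drop the low `cluster_dim` address bits."""
    return node >> cluster_dim


def local_reconfigure(faulty_nodes, cluster_dim):
    """Map each faulty node to the single spare of its own cluster.

    Returns a dict {faulty_node: ("spare", cluster)} or None when some
    cluster holds more faulty nodes than its one spare can replace.
    """
    faults_per_cluster = defaultdict(list)
    for node in faulty_nodes:
        faults_per_cluster[cluster_id(node, cluster_dim)].append(node)

    assignment = {}
    for cid, nodes in faults_per_cluster.items():
        if len(nodes) > 1:
            # One spare per cluster: a second local fault is not covered.
            return None
        assignment[nodes[0]] = ("spare", cid)
    return assignment


# 4-cube with 2-dimensional clusters: faults in different clusters are covered.
print(local_reconfigure([0b0001, 0b1110], cluster_dim=2))
# Two faults in the same cluster exceed the single local spare.
print(local_reconfigure([0b0000, 0b0001], cluster_dim=2))
```

Under these assumptions the basic scheme fails as soon as two faults fall in one cluster, which is the limitation the enhanced cluster approach is described as addressing.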
To allow real-time fault tolerance, the two-stage redundant scheme is
proposed. The scheme uses a global reconfiguration algorithm to assign a
non-local spare node to a faulty node when the task completion
deadline is soft and utilizes local reconfiguration to assign a local
spare node to a faulty node when the completion deadline is hard. In
case there is no hardware redundancy, a gracefully degradable approach
is presented to reconfigure a faulty d-dimensional hypercube. The
approach is optimal since it can always construct a (d-1)-subcube in
the presence of up to 2^{(d-1)} faulty nodes. The approach is extended
to tolerate a combination of faulty nodes and faulty links. In case the
number of faulty nodes of a d-dimensional enhanced cluster hypercube
exceeds the number of available spares, a gracefully degradable approach is
presented to sustain a (d-1)-dimensional subcube. Finally, the
management of the hypercube in the presence of faulty nodes is
examined and a procedure is presented to convert a faulty
d-dimensional hypercube into an enhanced cluster hypercube of
dimension (d-1).
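For comparison, the following sketch shows the plain baseline check for a fault-free (d-1)-subcube, not the reconfiguration procedure of this work. It only looks for a conventional subcube, that is, all nodes whose address has a fixed value in one bit position, and makes no use of circuit switching. The function name fault_free_subcubes is hypothetical.

```python
def fault_free_subcubes(d, faulty_nodes):
    """List the conventional (d-1)-subcubes that contain no faulty node.

    Each result (dim, bit) names the subcube of all nodes whose address
    bit `dim` equals `bit`.
    """
    result = []
    for dim in range(d):
        for bit in (0, 1):
            # The subcube is usable only if no faulty node lies inside it.
            if all(((node >> dim) & 1) != bit for node in faulty_nodes):
                result.append((dim, bit))
    return result


# 3-cube with faults at 000 and 001: two fault-free 2-subcubes remain.
print(fault_free_subcubes(3, [0b000, 0b001]))
# Antipodal faults 000 and 111 hit every conventional 2-subcube.
print(fault_free_subcubes(3, [0b000, 0b111]))
```

Two antipodal faults already defeat this baseline, since every conventional (d-1)-subcube contains one of them. Tolerating more faults therefore requires something beyond a direct subcube search.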
Broadcast schemes using unicast messages and multidestination messages
have been developed for faulty n-cubes with fewer than n faulty
nodes. The multidestination scheme results in a 40% reduction in
broadcast completion time.
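The bound of fewer than n faulty nodes is significant because an n-cube has node connectivity n, so removing fewer than n nodes leaves the healthy nodes connected and a broadcast can still reach all of them. The sketch below is not either broadcast scheme of this work. It is a plain breadth-first search that checks reachability and hop distance from a source while skipping faulty nodes. The name reachable_healthy_nodes is hypothetical.

```python
from collections import deque


def reachable_healthy_nodes(n, faulty_nodes, source):
    """Breadth-first search over an n-cube that never enters a faulty node.

    Returns {node: hop count from source} for every healthy node reached.
    """
    faulty = set(faulty_nodes)
    if source in faulty:
        raise ValueError("source node is faulty")

    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for dim in range(n):
            neighbor = node ^ (1 << dim)  # flip one address bit
            if neighbor not in faulty and neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# 3-cube with two faulty nodes (fewer than n = 3), source at node 0.
hops = reachable_healthy_nodes(3, [0b001, 0b010], source=0)
print(len(hops), hops[0b011])
```

In this example all six healthy nodes are reached, although node 011 is now four hops from the source instead of two, which shows how faults can lengthen paths even when connectivity is preserved.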