Incorporation of fault tolerance
We incorporate the capability of fault tolerance into the task
management system as follows.
We achieve fault tolerance in module allocation by
identifying critical modules whose completion is critical to the
timely completion of the task system, replicating them, and
allocating replicated modules to distinct processing
nodes. Three issues need to be considered:
- Which modules are critical to the timely completion of tasks,
and should be replicated;
- How many copies of each critical module are needed so that
the worst-case probability of dynamic failure and the degree
of fault tolerance can both be guaranteed;
- the assignment and scheduling of the replicas on nodes.
We determine (1) via critical path analysis; (2) by striking a balance
between the degree of fault tolerance and the system capacity; and (3)
by coupling message scheduling with module allocation.
We achieve fault tolerance in load sharing by
- developing an algorithm using a combinatorial
approach to adjust the preferred lists in case of
node failure to retain the desirable features of the
preferred list;
- devising an approach to coordinate nodes to keep
back-up copies of tasks arrived at their neighbor nodes
so that tasks executed/queued at a failed node can be
effectively restored at some other nodes.
- coordinating nodes to restart (from their most recent
checkpoints) tasks that were executed on failed nodes in the
case of node failure.
To exploit checkpointing and rollback recovery techniques,
we develop and implement an effective application-transparent
checkpointing/rollback scheme.
- The checkpointing scheme uses the un-forced
checkpointing policy to reduce process rollback propagation
by dynamically varying checkpoint intervals with respect to the
frequency of message sending.
Additional forced checkpoints are taken only to achieve
checkpoint consistency among processes and to avoid the
domino effect.
-
We incorporate both the global rollback strategy and the minimal
rollback strategy into the checkpointing scheme devised.
The combined checkpointing/rollback scheme is able to
handle out-of-order messages, achieves higher concurrency
during checkpointing/rollback operations, and allows multiple
checkpointing/rollback instances to be simultaneously invoked.
-
To reduce the space overhead, we also develop a garbage
collection approach to determine the global recovery line
before which processes shall never rollback in case of failure.
As a result, the checkpoints before the global recovery line can
be timely purged.
The research results have been reported in the following papers:
- Chao-Ju Hou, Kar Shun Tsoi, and Ching-Chih Han,
"Effective and concurrent checkpointing and recovery in distributed systems,"
IEE Proceedings on Computer and
Digital Techniques, Vol. 144, No. 5, pages 304-316, September, 1997.
- Chao-Ju Hou and Kang G. Shin, "Replication
and allocation of task modules in distributed real-time systems,"
IEEE 24th Annual Int'l Symposium on Fault-tolerant Computing,
pp. 26-35, Austin, Texas, June 15-17, 1995.
Return
to Project Home Page
Date last modified -- August 15, 1998
Direct comments concerning this WWW site to:
jhou@ece.osu.edu