Task Management Project Home Page
----------
Dreese Lab
DARPA

Design and Implementation of Task Management in Distributed Real-Time Systems

----------

More Information

o Project Investigator
A brief bio of the PI.
o People
People who work on the project.
o Our Lab
A sketch of our lab.
o Project Viewgraph
Briefings which explain our work.
o Recent Papers
A list of papers on the project.
o Demos
A set of demonstrations of the implemented software layer.
o Presentation
A set of slides presented at DARPA PI meetings.
o Research Facilities
A listing of the equipment in the lab.
o Related Projects
Other real-time related projects led by the PI.

Overview:

The availability of inexpensive, high-performance processors and high-capacity memory chips has made it attractive to use distributed computing systems for real-time applications. These systems offer several advantages such as parallel computation, performance scaling, and graceful degradation in case of component failures. However, these potentially attractive features cannot be realized without careful coordination of processing nodes and judicious distribution and redistribution of application tasks in the system. In this project, we design, implement and evaluate a task management mechanism in a distributed real-time environment.

The primary objective of this research is to ensure that both periodic and aperiodic tasks execute with logical correctness and complete before their deadlines. To pursue this goal, we study the following problems:

  1. Design of task allocation and load redistribution schemes in distributed real-time systems;
  2. Development of analytic approaches (rather than ad hoc approaches) to incorporate fault tolerance capabilities into the proposed schemes;
  3. Implementation of, and experimentation with, the proposed task management scheme;
  4. Investigation of the underlying time-constrained communication support needed.

Part one is concerned with the design of a task management mechanism in a well-defined analytic framework. Part two deals with strategies to ensure the timely completion of application tasks even in the presence of component failures. Part three implements the proposed mechanism on a specific experimental platform and empirically measures its performance. Part four investigates the type of communication support the underlying communication subsystem must export to facilitate efficient task management in a real-time distributed environment.

The proposed research is a combination of two synergistic components: development of effective schemes in an analytic framework and their validation with software system building and experiments. We will pursue this research by designing and building an experimental software layer that sits between the OS and the application programs, and acts as the agent for managing both periodic and aperiodic tasks using the proposed mechanism.


Recent Accomplishments:

Design of an allocation scheme for periodic task modules
We decompose periodic tasks into a set of communicating modules, represent them by a task flow graph, and then devise a module allocation scheme to allocate periodic task modules in a planning cycle, with the objective of maximizing the probability of completing each task with both logical and timing correctness, subject to task precedence and timing constraints.
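As a rough illustration of the constraints mentioned above, the following sketch checks whether a given allocation of modules to nodes respects precedence and timing constraints within a planning cycle. It is not the project's allocation scheme, which additionally optimizes the probability of timely completion. The names `Placement`, `check_allocation`, and `comm_delay`, and the fixed inter-node communication delay, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    node: str     # processing node the module is allocated to
    start: float  # start time within the planning cycle


def check_allocation(exec_time, edges, placement, deadline, comm_delay=1.0):
    """Return a list of violated constraints; an empty list means feasible.

    exec_time: module name -> execution time
    edges:     (predecessor, successor) pairs of the task flow graph
    placement: module name -> Placement
    comm_delay is an assumed fixed delay between modules on different nodes.
    """
    violations = []
    finish = {m: placement[m].start + exec_time[m] for m in exec_time}

    # Precedence: a successor may start only after its predecessor has
    # finished, plus the communication delay if they run on different nodes.
    for pred, succ in edges:
        ready = finish[pred]
        if placement[pred].node != placement[succ].node:
            ready += comm_delay
        if placement[succ].start < ready:
            violations.append(f"precedence {pred}->{succ}")

    # Exclusivity: modules on the same node must not overlap in time.
    by_node = {}
    for m, p in placement.items():
        by_node.setdefault(p.node, []).append((p.start, finish[m], m))
    for node, slots in by_node.items():
        slots.sort()
        for (_, f1, m1), (s2, _, m2) in zip(slots, slots[1:]):
            if s2 < f1:
                violations.append(f"overlap {m1}/{m2} on {node}")

    # Timing: every module must finish by the task deadline.
    for m, f in finish.items():
        if f > deadline:
            violations.append(f"deadline {m}")
    return violations
```

An allocation scheme could use a check of this kind to discard infeasible candidate allocations before comparing the remaining ones.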
Design of load sharing scheme for aperiodic tasks
We characterize load sharing with three component policies: the transfer policy, the location policy, and the information policy, and carefully tailor each policy to reduce the probabilities of (1) transferring an overflow task to an "incapable node"; (2) multiple nodes sending their overflow tasks to the same node; (3) excessive task transfers; (4) excessive communication and time overheads.
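The three policies can be sketched as follows. This is a minimal, single-process illustration and not the project's implementation: the `Node` class, the queue-length threshold used as the transfer policy, and the immediate load broadcast used as the information policy are all assumptions. The per-node preferred list stands in for the location policy.

```python
class Node:
    def __init__(self, name, capacity, preferred):
        self.name = name
        self.capacity = capacity    # assumed: max tasks the node can guarantee
        self.queue = []
        self.preferred = preferred  # ordered names of candidate receivers
        self.known_load = {}        # last load reported by each other node


def must_transfer(node):
    """Transfer policy: a task overflows when the local queue is full."""
    return len(node.queue) >= node.capacity


def pick_receiver(node, nodes):
    """Location policy: first preferred node believed to have room."""
    for name in node.preferred:
        load = node.known_load.get(name)
        if load is not None and load < nodes[name].capacity:
            return name
    return None


def broadcast_load(node, nodes):
    """Information policy: report the node's queue length to the others."""
    for other in nodes.values():
        if other is not node:
            other.known_load[node.name] = len(node.queue)


def submit(task, node, nodes):
    """Queue a task locally or transfer it once; return the accepting node."""
    if not must_transfer(node):
        node.queue.append(task)
        broadcast_load(node, nodes)
        return node.name
    target = pick_receiver(node, nodes)
    if target is None:
        return None  # no capable node is known; the task is rejected
    nodes[target].queue.append(task)
    broadcast_load(nodes[target], nodes)
    return target
```

Giving each node a differently ordered preferred list makes it less likely that several overloaded nodes choose the same receiver, and transferring a task at most once bounds the number of transfers.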
Incorporation of fault tolerance into the proposed schemes
We achieve fault tolerance in module allocation by identifying the critical modules, those whose completion is essential to the timely completion of the task system, replicating them, and allocating the replicas to distinct processing nodes. In particular, we determine (1) the critical modules via critical path analysis; (2) the number of copies of each critical module needed, by striking a balance between the degree of fault tolerance and the system capacity; and (3) the assignment and scheduling of replicas on nodes. We achieve fault tolerance in load redistribution by (1) adjusting the preferred lists in case of node failure so that they retain their desirable features; (2) coordinating nodes to keep backup checkpoints of tasks that arrived at their neighbor nodes; and (3) coordinating nodes to restart, from their most recent checkpoints, tasks that were executing on nodes that have failed.
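Critical path analysis on a task flow graph can be sketched as below. This is a textbook forward and backward pass, shown only to illustrate the idea; the function name `critical_modules`, the integer execution times, and the omission of communication delays are assumptions, and the project's actual analysis may differ.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_modules(exec_time, edges):
    """Return (modules with zero slack, length of the critical path).

    exec_time: module name -> execution time (integers, to keep the
               equality test below exact)
    edges:     (predecessor, successor) pairs of the task flow graph
    """
    preds = {m: set() for m in exec_time}
    succs = {m: set() for m in exec_time}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    # Predecessors come before successors in this order.
    order = list(TopologicalSorter(preds).static_order())

    # Forward pass: earliest finish time of each module.
    earliest = {}
    for m in order:
        start = max((earliest[p] for p in preds[m]), default=0)
        earliest[m] = start + exec_time[m]
    makespan = max(earliest.values())

    # Backward pass: latest finish time that does not delay the task.
    latest = {}
    for m in reversed(order):
        latest[m] = min(
            (latest[s] - exec_time[s] for s in succs[m]), default=makespan
        )

    # A module is critical when it has no slack.
    critical = [m for m in order if latest[m] == earliest[m]]
    return critical, makespan
```

The modules returned here are the candidates for replication; how many copies to create and where to place them are separate decisions, as described above.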
Implementation of the proposed task management layer
We have implemented the task management system as a portable software layer in the Sun Solaris environment. To facilitate monitoring of the task management system, we have also implemented and incorporated a Java-integrated monitor. The software release information can be found here.
Building of a laboratory testbed
We have built a mini laboratory Myrinet testbed at the Ohio State University for development of the proposed software layers and for technology demonstration.

Current Work:

Testing, refinement, and enhancement of the implemented software layer
We are currently testing, refining, and enhancing the implemented software layer, and will collect empirical performance data for analysis. We also plan to extend process and memory management facilities into a kernel interface server in the OS kernel and to support IPC, event handling, and signal facilities.
Investigation of the communication support needed
We have identified the need for an underlying communication subsystem that supports time-constrained communication for all task management-related activities, and are currently laying out all the necessary network components in a unified QoS framework to provide temporal QoS.
Technology transfer to JPL
The OSU team has joined the JPL-DARPA team for development of Fault Tolerant Embedded Systems, and has presented the fault-tolerant strategies used in this project to the X2000 spacecraft development team at JPL at the 2nd and 3rd DARPA/JPL fault-tolerant computing workshops. The implemented software layer has been transferred to the JPL site via FTP and will be included as one of the X2000 demonstration efforts. OSU will continue to provide technical consultation.

This page was last updated on August 12, 1998. Comments should be sent to Jennifer Hou.