The primary objective of this research is to ensure that both periodic and aperiodic tasks execute with logical correctness and complete before their deadlines. To pursue this goal, we study the following four problems:
Part one is concerned with the design of a task management mechanism in a well-defined analytic framework. Part two deals with strategies to ensure the timely completion of application tasks even in the presence of component failures. Part three implements the proposed mechanism on a specific experimental platform and empirically measures its performance. Part four investigates the type of communication support the underlying communication subsystem must export to facilitate efficient task management in a real-time distributed environment.
The proposed research combines two synergistic components: the development of effective schemes in an analytic framework, and their validation through software system building and experiments. We will pursue this research by designing and building an experimental software layer that sits between the OS and the application programs and acts as the agent for managing both periodic and aperiodic tasks with the proposed mechanism.
Design of an allocation scheme for periodic task modules
We decompose periodic tasks into a set of communicating modules, represent them by a task flow graph, and then devise a module allocation scheme to allocate periodic task modules in a planning cycle, with the objective of maximizing the probability of completing each task with both logical and timing correctness, subject to task precedence and timing constraints.
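The allocation algorithm itself is not spelled out here. As a minimal sketch, assuming an exponential node-failure model, a fixed delay for messages between modules placed on different nodes, and a single end-to-end deadline, the following Python code enumerates every allocation of a small task flow graph and keeps the feasible one with the highest success probability. The module names, execution times, failure rates, deadline, and delay are all hypothetical, and exhaustive search is only practical for toy graphs.

```python
import itertools
import math
from graphlib import TopologicalSorter

# Hypothetical task flow graph: module -> (execution time, predecessor modules).
MODULES = {
    "sense":   (2.0, set()),
    "filter":  (3.0, {"sense"}),
    "plan":    (4.0, {"filter"}),
    "log":     (1.0, {"sense"}),
    "actuate": (2.0, {"plan"}),
}
DEADLINE = 11.0                          # hypothetical end-to-end deadline
COMM_DELAY = 1.5                         # hypothetical inter-node message delay
FAILURE_RATE = {"n0": 0.01, "n1": 0.02}  # hypothetical per-node failure rates


def schedule(assign):
    """Finish time of each module under a simple non-preemptive list schedule."""
    graph = {m: preds for m, (_, preds) in MODULES.items()}
    node_free = {n: 0.0 for n in FAILURE_RATE}
    finish = {}
    for m in TopologicalSorter(graph).static_order():
        exec_time, preds = MODULES[m]
        node = assign[m]
        # Start once the node is idle and every predecessor's output has arrived.
        ready = max(
            (finish[p] + (COMM_DELAY if assign[p] != node else 0.0) for p in preds),
            default=0.0,
        )
        finish[m] = max(ready, node_free[node]) + exec_time
        node_free[node] = finish[m]
    return finish


def success_probability(assign, finish):
    """P(every node survives until its last assigned module finishes), exponential model."""
    busy_until = {}
    for m, node in assign.items():
        busy_until[node] = max(busy_until.get(node, 0.0), finish[m])
    return math.exp(-sum(FAILURE_RATE[n] * t for n, t in busy_until.items()))


def best_allocation():
    """Exhaustive search for the feasible allocation with the highest success probability."""
    modules = list(MODULES)
    best, best_p = None, -1.0
    for nodes in itertools.product(FAILURE_RATE, repeat=len(modules)):
        assign = dict(zip(modules, nodes))
        finish = schedule(assign)
        if max(finish.values()) > DEADLINE:
            continue  # violates the timing constraint
        p = success_probability(assign, finish)
        if p > best_p:
            best, best_p = assign, p
    return best, best_p
```

With the 11-unit deadline, the chain sense, filter, plan, actuate must stay on one node (any split adds a message delay), so `best_allocation()` places that chain on the more reliable node `n0` and the independent `log` module on `n1`. A realistic scheme would replace the enumeration with a heuristic search.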
Design of a load sharing scheme for aperiodic tasks
We characterize load sharing with three component policies: the transfer policy, the location policy, and the information policy, and carefully tailor each policy to reduce the probabilities of (1) transferring an overflow task to an "incapable node"; (2) multiple nodes sending their overflow tasks to the same node; (3) excessive task transfers; and (4) excessive communication and time overheads.
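The three policies can be illustrated with a minimal Python sketch. It assumes, hypothetically, that a node's state is its backlog (total queued execution time), that a node is capable of a task if the task would finish within its laxity there, that the information policy broadcasts a node's state only when its backlog changes by a fixed step, and that an overloaded node probes candidates in the order of its own preferred list. Giving different nodes differently ordered preferred lists is one way to spread overflow tasks and reduce problem (2). All names and constants are illustrative, not the project's implementation.

```python
from dataclasses import dataclass, field

TRANSFER_DELAY = 1.0   # hypothetical cost of moving a task to another node
BROADCAST_STEP = 5.0   # hypothetical: broadcast only when backlog changes this much


@dataclass
class Node:
    name: str
    backlog: float = 0.0                               # queued execution time
    preferred: list = field(default_factory=list)      # hypothetical preferred list
    known_backlog: dict = field(default_factory=dict)  # last broadcast state of others
    last_broadcast: float = 0.0


def capable(backlog, exec_time, laxity):
    """A node is capable of a task if the task would finish within its laxity there."""
    return backlog + exec_time <= laxity


def maybe_broadcast(node, cluster):
    """Information policy: share state only on significant change, to limit overhead."""
    if abs(node.backlog - node.last_broadcast) >= BROADCAST_STEP:
        node.last_broadcast = node.backlog
        for other in cluster.values():
            if other is not node:
                other.known_backlog[node.name] = node.backlog


def submit(node, exec_time, laxity, cluster):
    """Place a task arriving at `node`; return the executing node's name or None."""
    if capable(node.backlog, exec_time, laxity):
        # Transfer policy: keep the task locally whenever it can meet its deadline here.
        target = node
    else:
        # Location policy: first node on the preferred list that appears capable
        # according to the last broadcast, charging the transfer delay to the laxity.
        target = None
        for name in node.preferred:
            if capable(node.known_backlog.get(name, 0.0), exec_time, laxity - TRANSFER_DELAY):
                target = cluster[name]
                break
        # The receiver re-checks its actual backlog, since broadcast state may be stale.
        if target is None or not capable(target.backlog, exec_time, laxity - TRANSFER_DELAY):
            return None
    target.backlog += exec_time
    maybe_broadcast(target, cluster)
    return target.name
```

Because the receiver re-checks its actual backlog, stale broadcast information can cause a transfer to be refused; this sketch simply returns `None` in that case, whereas a fuller scheme might try the next node on the list.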
Incorporation of fault tolerance into the proposed schemes
We achieve fault tolerance in module allocation by identifying critical modules, whose completion is essential to the timely completion of the task system, and replicating them on distinct processing nodes. In particular, we determine (1) the critical modules, via critical path analysis; (2) the number of copies of each critical module, by striking a balance between the degree of fault tolerance and the system capacity; and (3) the assignment and scheduling of replicas on nodes. We achieve fault tolerance in load redistribution by (1) adjusting the preferred lists when a node fails so that they retain their desirable features; (2) coordinating nodes to keep backup checkpoints of tasks that arrive at their neighbor nodes; and (3) coordinating nodes to restart, from their most recent checkpoints, the tasks that were executing on a failed node.
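Steps (1) and (2) of the module-level scheme can be sketched in Python. The code below treats modules with zero slack in the task flow graph as critical (the standard critical path method) and, as a hypothetical replication rule, picks the smallest number of copies whose combined success probability reaches a target. It assumes replicas fail independently and caps the count at the number of copies the system has capacity for; the reliability figures and cap are illustrative assumptions, and replica placement and scheduling, step (3), are not shown.

```python
import math
from graphlib import TopologicalSorter


def critical_modules(exec_time, preds):
    """Modules with zero slack, i.e. on a longest path of the task flow graph."""
    order = list(TopologicalSorter(preds).static_order())
    succs = {m: [] for m in order}
    for m, ps in preds.items():
        for p in ps:
            succs[p].append(m)

    # Forward pass: earliest start times.
    earliest = {}
    for m in order:
        earliest[m] = max((earliest[p] + exec_time[p] for p in preds.get(m, ())), default=0.0)
    makespan = max(earliest[m] + exec_time[m] for m in order)

    # Backward pass: latest start times that do not delay the makespan.
    latest = {}
    for m in reversed(order):
        latest_finish = min((latest[s] for s in succs[m]), default=makespan)
        latest[m] = latest_finish - exec_time[m]

    return {m for m in order if math.isclose(earliest[m], latest[m])}


def replicas_needed(reliability, target, max_copies):
    """Smallest k with 1 - (1 - reliability)**k >= target, capped at max_copies.

    Assumes replicas fail independently; returns max_copies if the target
    cannot be reached within the available capacity.
    """
    for k in range(1, max_copies + 1):
        if 1 - (1 - reliability) ** k >= target:
            return k
    return max_copies
```

For a hypothetical five-module graph with a chain sense, filter, plan, actuate (execution times 2, 3, 4, 2) and a side module log (time 1) that depends only on sense, `critical_modules` reports the four chain modules as critical, while log has 8 units of slack.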
Implementation of the proposed task management layer
We have implemented the task management system as a portable software layer in the Sun Solaris environment. To facilitate monitoring of the task management system, we have also implemented and incorporated a Java-integrated monitor.
Building of a laboratory testbed
We have built a small Myrinet laboratory testbed at The Ohio State University for developing the proposed software layers and for technology demonstration.
Testing, refinement, and enhancement of the implemented software layer
We are currently testing, refining, and enhancing the implemented software layer, and will collect empirical performance data for analysis. We also plan to extend the process and memory management facilities into a kernel interface server in the OS kernel, and to add support for IPC, event handling, and signals.
Investigation of the communication support needed
We have identified the need for an underlying communication subsystem that supports time-constrained communication for all activities related to task management, and are currently laying out the necessary network components in a unified QoS framework to provide temporal QoS.
Technology transfer to JPL
The OSU team has joined the JPL-DARPA team for the development of Fault Tolerant Embedded Systems, and presented the fault-tolerant strategies used in this project to the X2000 spacecraft development team at JPL at the 2nd and 3rd DARPA/JPL fault-tolerant computing workshops. The implemented software layer has been transferred to the JPL site via FTP and will be included as one of the X2000 demonstration efforts. OSU will continue to provide technical consultation.