Research in Heterogeneous Computing

Heterogeneous distributed computing.

Sponsor: NASA Lewis Research Center.

Fusun Ozguner, Mike Iverson

Our research in heterogeneous computing began as an effort to increase the performance of a large, coupled structural analysis application, funded by a grant from NASA's Lewis Research Center. The application consists of a number of dissimilar analysis modules that share data. Initial studies found that its performance could be increased dramatically, with a minimum of programming effort, by executing the different modules on a network of distributed, heterogeneous machines. Much of this gain stems from the fact that some problems are well suited to a specific machine architecture, so a network that offers several different architectures lets each module run on the machine that suits it best. We further examined the problem of assigning the subtasks of the application to the machines, known as the matching and scheduling problem.

With the initiative to construct high performance global networks, we are now examining heterogeneous computing on a larger scale. This concept, called metacomputing, allows multiple users to use a global network of high performance machines as a single computational resource. Such a resource would greatly exceed the computational power of any single machine.

We are specifically working on the following projects. First, determining which machine architecture is best suited to a particular task requires an estimate of that task's execution time on each candidate machine. We have developed a statistical method for estimating these execution times and are developing improvements to it. Second, since any global network of machines will have multiple users, we are examining scheduling policies that produce good application performance while ensuring that computational resources are allocated fairly among those users. Third, we are examining the important issue of fault tolerance, since network and machine failures would be quite common in such a large environment.
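
The matching and scheduling problem described above can be illustrated with a small sketch. This is not the algorithm developed in this project; it is a simple greedy heuristic that assigns each subtask to the machine on which it would finish earliest, and it assumes the subtasks are independent. The module names, machine names, and execution time estimates are hypothetical and chosen only for illustration.

```python
def greedy_match(est_time, machines):
    """Assign each subtask to the machine that would finish it earliest.

    est_time maps subtask -> {machine: estimated execution time in seconds}.
    Subtasks are assumed independent (no data dependencies between them).
    """
    ready_at = {m: 0.0 for m in machines}  # time at which each machine becomes free
    assignment = {}
    for task, costs in est_time.items():
        # Pick the machine with the earliest completion time for this subtask.
        best = min(machines, key=lambda m: ready_at[m] + costs[m])
        assignment[task] = best
        ready_at[best] += costs[best]
    makespan = max(ready_at.values())  # finish time of the busiest machine
    return assignment, makespan


# Hypothetical execution time estimates (seconds) on two machine types.
est_time = {
    "structural": {"vector": 4.0, "workstation": 10.0},
    "thermal": {"vector": 6.0, "workstation": 5.0},
    "fluid": {"vector": 3.0, "workstation": 9.0},
}
machines = ["vector", "workstation"]

assignment, makespan = greedy_match(est_time, machines)
print(assignment, makespan)
```

With these made-up estimates, running everything on the hypothetical vector machine alone would take 13 seconds, while the mixed assignment finishes in 7. The quality of any such assignment depends on how accurate the execution time estimates are.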