Research in Heterogeneous Computing
Heterogeneous distributed computing.
Sponsor: NASA Lewis Research Center.
Fusun Ozguner, Mike Iverson
Our research in heterogeneous computing began with a novel technique for
increasing the performance of a large, coupled structural analysis
application, through a grant from NASA's Lewis Research Center. The
application in question consists of a number of dissimilar analysis
modules, which share data. Initial studies found that the performance
of this application could be increased dramatically (with a minimum
of programming effort) by executing the different modules on a network
of distributed, heterogeneous machines. Much of this performance gain
stems from the observation that some problems are well suited to a
specific machine architecture. By utilizing a network with several
different machine architectures, a net gain in performance can be
realized. We further examined the problem of assigning the subtasks
of the application to the machines (this is called the matching and
scheduling problem).
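
To make the matching and scheduling problem concrete, the following is a minimal sketch of one simple greedy heuristic: each subtask is placed on the machine where it would finish earliest. It is an illustration only, not necessarily the approach used in this project. The module names, machine names, and timings are hypothetical, and the sketch ignores the data shared between modules.

```python
def greedy_schedule(exec_time):
    """Assign each subtask to the machine on which it would finish earliest.

    exec_time maps subtask -> {machine: estimated run time}.
    Returns (assignment, makespan). Data dependencies between
    subtasks are ignored in this sketch (an assumption).
    """
    # Time at which each machine becomes free.
    machines = {m for costs in exec_time.values() for m in costs}
    ready = {m: 0.0 for m in machines}
    assignment = {}
    for task, costs in exec_time.items():
        # Pick the machine with the earliest completion time for this task.
        best = min(costs, key=lambda m: ready[m] + costs[m])
        assignment[task] = best
        ready[best] += costs[best]
    return assignment, max(ready.values())


# Hypothetical analysis modules and machine types, with made-up timings.
example_times = {
    "fluid":      {"vector": 2.0, "workstation": 10.0},
    "thermal":    {"vector": 8.0, "workstation": 3.0},
    "structural": {"vector": 4.0, "workstation": 5.0},
}
example_assignment, example_makespan = greedy_schedule(example_times)
```

In this example the "thermal" module goes to the workstation while the other two modules share the vector machine, so the work completes sooner than it would on either machine alone.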
With the initiative to construct high performance global networks, we
are now examining heterogeneous computing on a larger scale. This
concept, called metacomputing, allows multiple users to use a global
network of high performance machines as a single computational
resource. This computational entity would greatly exceed the
computational power of any single machine.
We are specifically working on the following projects. In order to
determine which machine architecture is best suited to a particular
task, an estimate of its execution time on each machine is required. We
have developed a statistical method for producing this estimate, and
are developing improvements to this algorithm. Since there will be
multiple users in any global network of machines, we are currently
examining scheduling policies which will produce good application
performance, while ensuring that the computational resources are
allocated fairly among multiple users. Furthermore, we are examining
the important issue of fault tolerance, since network and machine
failures would be quite common in such a large environment.
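
As an illustration of what a statistical execution time estimate might look like, the sketch below averages the run times of the most similar past runs of a task on one machine. This is an assumed, simplified approach chosen for the example; the text above does not specify the actual method. The function name, the use of a single "problem size" parameter, and the sample history are all hypothetical.

```python
def estimate_time(history, size, k=3):
    """Estimate run time from past observations on one machine.

    history is a list of (problem_size, observed_seconds) pairs.
    The estimate is the mean run time of the k observations whose
    problem size is closest to `size` (a nearest-neighbor average).
    """
    if not history:
        raise ValueError("no past observations to estimate from")
    # Sort observations by distance in problem size and keep the k closest.
    nearest = sorted(history, key=lambda obs: abs(obs[0] - size))[:k]
    return sum(seconds for _, seconds in nearest) / len(nearest)


# Hypothetical past runs of one task on one machine.
past_runs = [(100, 1.0), (200, 2.0), (400, 4.0), (800, 8.0)]
predicted = estimate_time(past_runs, 300, k=2)
```

Repeating such an estimate for each machine type would give the per-machine timings that a scheduler needs in order to match tasks to architectures.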