Guy E. Blelloch's Home Page

Research

My research has largely been in the interaction of Algorithms and
Programming Languages, much of it in the area of parallel computing.
Here are some of the more recent topics I have worked on with my
students and collaborators.

Parallelism

Benchmarking Parallel Algorithms.
We are developing a benchmark suite of parallel
algorithmic problems for testing machines and programming paradigms.
The idea is that they will be defined in terms of the interface instead
of the code so they can be compared across machine types and programming
paradigms.

Cache efficient parallel algorithms.
We have been developing cost models, algorithms, and schedulers for
taking advantage of locality on parallel cache hierarchies without
worrying about details of the machine.

Synchronization and concurrent
algorithms. We are looking into the idea of adding transactions
on individual memory blocks to the memory system and how this can be
used to simplify many concurrent algorithms. We have also looked into
the idea "room-synchronizations".

Thread Scheduling.
We are developing schedulers for dynamic parallel languages
and studying how they affect time, memory and cache performance.

Data structures and algorithms

Self-adjusting Computation. The idea is to
keep track of dependencies while executing an algorithm so that you
can go back, change history, and propagate the change to the output.

Compact Data Structures.
We look at how to represent graphs, meshes, indices, and sets using an
asymptotically optimal number of bits while supporting asymptotically
optimal access and update times.

We have been developing high-level algorithmic models for accounting
for locality on parallel machines, developing algorithms which are
efficient under these models, developing schedulers for the models,
and proving bounds on runtime based on the models. This includes
work reported in the following papers:

It is often the case that by slightly changing the input to an
algorithm, the execution path will only change slightly. The idea of
self-adjusting computation is to let the algorithm "adjust" to such small
changes by only updating the part of the computation that is required
to be updated. The idea is implemented by maintaining a dependence
graph of the computation and using this graph to propagate changes.
The propagation will both update the output and update the graph
itself (so that it is ready for another change).

We are interested both in techniques for maintaining the
graph, and in proving bounds for how much needs to be updated
for different algorithms.
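
As a rough illustration of the dependence-tracking idea (a toy sketch, not
the actual system from our papers), consider cells that record which
computations read them, so that changing an input re-runs only the
computations that depend on it:

    # Minimal sketch: each modifiable cell records its readers, and a write
    # propagates the change by re-running only those dependent computations.
    class Cell:
        def __init__(self, value):
            self.value = value
            self.readers = set()        # computations that depend on this cell

        def read(self, comp):
            self.readers.add(comp)
            return self.value

        def write(self, value):
            if value != self.value:
                self.value = value
                for comp in list(self.readers):
                    comp.rerun()        # change propagation

    class Computation:
        def __init__(self, fn, out):
            self.fn, self.out = fn, out
            self.rerun()

        def rerun(self):
            self.out.write(self.fn(self))

    # Example: keep `total` equal to a.value + b.value as the inputs change.
    a, b = Cell(1), Cell(2)
    total = Cell(None)
    Computation(lambda c: a.read(c) + b.read(c), total)
    a.write(10)                         # only the sum recomputes
    print(total.value)                  # 12

The real systems also handle allocation, memoization, and reuse of previous
results; this sketch only captures the read-tracking and change-propagation
skeleton.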

A data structure is uniquely represented if its layout in memory depends only on the contents (as seen by the interface). In particular it cannot depend on the ordering that objects were added or deleted. Such data structures have applications to security (to make sure that no information is revealed about the order of operations) and concurrency (to make sure the data structure ends up in the same state, independent of how operations end up being interleaved).
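
One standard route to unique representation (shown here as a hypothetical
Python sketch, not the particular constructions from our papers) is a treap
whose priorities are a hash of the keys: the tree shape then depends only on
the set of keys present, not on the order of insertions.

    import hashlib

    # Treap with priorities derived from hashing the keys: the same key set
    # always yields the same tree, whatever the insertion order.
    def priority(key):
        return hashlib.sha256(str(key).encode()).hexdigest()

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right
            self.pri = priority(key)

    def insert(root, key):
        if root is None:
            return Node(key)
        if key == root.key:
            return root
        if key < root.key:
            root.left = insert(root.left, key)
            if root.left.pri > root.pri:          # rotate right to restore heap order
                l = root.left
                root.left, l.right = l.right, root
                return l
        else:
            root.right = insert(root.right, key)
            if root.right.pri > root.pri:         # rotate left
                r = root.right
                root.right, r.left = r.left, root
                return r
        return root

    def shape(root):
        return None if root is None else (root.key, shape(root.left), shape(root.right))

    t1 = t2 = None
    for k in [3, 1, 4, 1, 5, 9, 2, 6]:
        t1 = insert(t1, k)
    for k in [9, 6, 5, 4, 3, 2, 1]:
        t2 = insert(t2, k)
    print(shape(t1) == shape(t2))   # True: same key set, same tree

Both insertion orders above produce the identical tree, so an observer of the
structure learns nothing about the history of operations.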

Presents the first real-time multiprocessor garbage collection
algorithm with provable bounds on time and space. The algorithm
requires at most 2 (R(1 + 2/k) + N + 5PD) memory locations,
where P is the number of processors, R is the
maximum reachable space during a computation, N is the
maximum number of reachable objects, D is the maximum depth
of any data structure, and k is a parameter specifying how
many locations are copied each time a location is allocated.
Furthermore the client processes are never stopped for more than time
proportional to k non-blocking machine instructions.
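
For a sense of scale (illustrative numbers, not taken from the paper): with
R = 10^6 words, N = 10^5 objects, P = 8, D = 20, and k = 4, the bound is

    2 (R(1 + 2/k) + N + 5PD) = 2 (1,500,000 + 100,000 + 800) = 3,201,600 locations,

roughly 3.2 times the reachable space; increasing k tightens the space bound
at the cost of longer (though still bounded) pauses.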

Many of today's high level parallel languages support dynamic,
fine-grained parallelism. The language implementation typically requires
an efficient scheduling algorithm to assign
computations to processors at runtime.
However, in an attempt to expose a sufficient degree of parallelism to keep all
processors busy, schedulers often create many more parallel threads
than necessary, leading to excessive memory usage.
Further, the order in which the threads are scheduled can
greatly affect the total size of the live data at any
instance during the parallel execution, and unless the threads are scheduled
carefully, the parallel execution of a program may require
much more memory than its serial execution.

Our research involves finding theoretically space- and time-efficient
scheduling algorithms for multithreaded programs, including programs with
nested parallelism and synchronization variables.
In addition to proving the space and time bounds of the resulting parallel
schedules, we demonstrate that they are efficient in practice.
We have implemented a runtime system that uses our algorithm to schedule
threads in computations with nested parallelism.
The following papers describe our results.

Describes a parallel algorithm for scheduling based on the
notion of a parallel depth-first schedule. It proves that any
computation which does W work (total number of operations),
has D depth (longest chain of dependences, or critical path),
and requires S1 space if run sequentially can be
scheduled to run on P processors in O(W/P + D) time
and S1 + O(P D) total space. We show how to generate
such a schedule for any computation off-line and give an algorithm for
nested-parallel computations that can generate an efficient on-line
schedule. The previous best known bounds on space were
S1 P based on results by Blumofe and Leiserson.
Ours are therefore the first bounds that only require an additive term
instead of a multiplicative factor over the sequential space.
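
A toy simulation of the greedy parallel depth-first schedule (a sketch for
intuition only; the paper gives the actual on-line algorithms and proofs):
ready nodes are kept in the priority order given by the sequential
depth-first execution, and at each step the first P of them are run.

    # Toy P-DFS simulation over a DAG given as children lists, in-degrees, and
    # the priority of each node in the sequential depth-first execution.
    def pdfs_schedule(children, in_degree, dfs_priority, P):
        steps = []
        ready = sorted([v for v, d in in_degree.items() if d == 0],
                       key=dfs_priority.get)
        while ready:
            step, ready = ready[:P], ready[P:]     # run the P highest-priority nodes
            steps.append(step)
            for v in step:
                for c in children[v]:
                    in_degree[c] -= 1
                    if in_degree[c] == 0:
                        ready.append(c)
            ready.sort(key=dfs_priority.get)       # keep depth-first order
        return steps

    # Diamond DAG a -> {b, c} -> d with sequential DFS order a, b, c, d.
    children = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    indeg    = {"a": 0, "b": 1, "c": 1, "d": 2}
    prio     = {"a": 0, "b": 1, "c": 2, "d": 3}
    print(pdfs_schedule(children, indeg, prio, P=2))  # [['a'], ['b', 'c'], ['d']]

Intuitively, a greedy schedule in this priority order never strays far from
the sequential execution, which is the source of the S1 + O(P D) space bound.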

Describes an asynchronous version of the scheduling algorithm and the
first implementation of our scheduling ideas. This version is more practical than the
above-mentioned on-line scheduler since it allows processors to run mostly
asynchronously and does not require any preemption (threads can run unimpeded
until they decide to terminate). Experimental results are given for a variety
of applications. The experiments show improved space usage over previous schedulers,
and approximately equal performance.

Describes an on-line scheduling algorithm for languages with
synchronization variables (e.g. futures, i-structures, events).
Presents the first space bounds for such languages since previous
results only applied to nested-parallel or strict languages. The
technique uses the parallel depth-first schedule as described in our
original paper, but requires balanced-trees to properly maintain the
thread priorities on-line.

Applies the ideas from the previous paper to the NESL language by
specifying a provably-efficient language
implementation. The paper describes a "profiling" semantics for
NESL---which is an operational semantics extended with cost measures.
The profiling semantics tracks work and depth by returning a
DAG, and tracks sequential space by threading a store in a fixed
order. The paper then proves the relationship of these cost measures to
running time on various machine models.

We have studied and experimented with parallel algorithms for a
variety of problems, including sorting, Delaunay-triangulation,
graph-connectivity, N-body simulations and set operations. The goal
of our work has been to take key problems and develop the fastest
algorithms we can for them, considering both theoretical and practical
consequences.
We have implemented all the algorithms we describe. In many cases our
algorithms are the fastest available (at least at the time they were
implemented).
I've also written the chapter on parallel algorithms in the CRC Press
Computer Science and Engineering Handbook (with Bruce Maggs).
Our work includes the following.

Describes a new divide-and-conquer parallel algorithm for 2-d Delaunay triangulation,
and its implementation in NESL and MPI.
The algorithm has O(n log n) work and O(log^3 n) depth.
The main idea of the algorithm is to do all the partitioning work on the way down
the recursion as opposed to on the way up (as done by Shamos and Hoey's
algorithm -- the original O(n log n) time sequential algorithm).
Experimental results show a 32-fold speedup
on 64-processors (on a Cray T3D). The speedup is relative to the fastest sequential
implementation of 2-D Delaunay we could find.

Describes parallel algorithms for union, intersection and difference
on ordered sets using random balanced binary trees (treaps), and
presents experimental results for the algorithms on the SGI power
challenge and Sun Enterprise server. For two sets of size n
and m (m <= n) the algorithms run in expected
O(m log (n/m)) work and O(log n)
depth (parallel time) on an EREW PRAM with scan operations. The
experiments show speedups of 6.3 to 6.8 on 8 processors of
the SGI power challenge.
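
The flavor of the algorithm, as a hedged sequential Python sketch (the paper
gives the real parallel version and its analysis): split one treap by the
other's root key and recurse on the two sides; in the parallel algorithm the
two recursive calls run in parallel, giving the expected O(m log (n/m)) work.

    import hashlib

    def pri(key):
        # Key-derived priorities keep the treap's expected balance.
        return hashlib.sha256(str(key).encode()).hexdigest()

    class T:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def split(t, key):
        # Split treap t into (keys < key, keys > key); a key equal to `key` is dropped.
        if t is None:
            return None, None
        if key < t.key:
            l, r = split(t.left, key)
            t.left = r
            return l, t
        if key > t.key:
            l, r = split(t.right, key)
            t.right = l
            return t, r
        return t.left, t.right

    def union(t1, t2):
        if t1 is None: return t2
        if t2 is None: return t1
        if pri(t1.key) < pri(t2.key):     # keep the higher-priority root on top
            t1, t2 = t2, t1
        l, r = split(t2, t1.key)
        t1.left  = union(t1.left, l)      # these two calls run in parallel
        t1.right = union(t1.right, r)     # in the actual algorithm
        return t1

    def keys(t):
        return [] if t is None else keys(t.left) + [t.key] + keys(t.right)

    a = b = None
    for k in [1, 3, 5, 7]: a = union(a, T(k))   # build sets from singletons
    for k in [2, 3, 6]:    b = union(b, T(k))
    print(keys(union(a, b)))   # [1, 2, 3, 5, 6, 7]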

In the early 90s we designed a parallel programming language called NESL (NESted-parallel Language) and implemented it on a
wide variety of parallel machines. The goal has been to experiment
with ideas of nested parallelism, and with language-based cost models.
We have coded up a large number of algorithms in NESL, and in most
cases the code is very much more concise than in other parallel
programming languages. Many of these algorithms are available off of
the NESL algorithm library page, and demos of some of
them are available off of the algorithm animation page.
An online
tutorial
and
interpreter
are also available.
The Chapter on Parallel Algorithms in the CRC Computer Science and
Engineering Handbook uses pseudocode based on NESL.

Perhaps the most important feature of NESL is the model it supplies
for analyzing costs. In particular it defines costs in terms of work
(total number of operations) and depth (longest sequence of
dependences, or critical path) and defines rules for composing these
costs across expressions.
In general when making calls sequentially the sum is taken
of both the work and depth, and when making calls in parallel
the sum is taken of the work and the maximum is taken of the depth.
This idea of using work and depth as cost measures has
been adopted by other languages, such as the CILK programming language.
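
Written out, the composition rules and a familiar instance (quicksort,
analyzed as in the NESL papers) look roughly like:

    sequential composition:  W = W1 + W2        D = D1 + D2
    parallel composition:    W = W1 + W2        D = max(D1, D2)

    quicksort:  W(n) = W(n1) + W(n2) + O(n)          = O(n log n) expected
                D(n) = max(D(n1), D(n2)) + O(log n)  = O(log^2 n) expected

where n1 and n2 are the sizes of the two recursive calls.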

In addition to the NESL home page
the following papers describe our work on NESL.

Describes and motivates the two main ideas behind NESL: nested data
parallelism, and the language based performance model. It goes
through several examples of parallel algorithms written and analyzed
in NESL, including quicksort, primes, sparse matrix vector multiply,
FFT, quickmedian, and a convex-hull algorithm.
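
For flavor, here is a rough Python transcription of the quicksort example
(the paper's actual code is in NESL; this only mirrors its structure): the
comprehensions correspond to data-parallel operations and the two recursive
calls are themselves made in parallel, which is where the nesting comes from.

    def quicksort(a):
        if len(a) < 2:
            return a
        pivot   = a[len(a) // 2]
        lesser  = [x for x in a if x < pivot]    # data-parallel comprehension in NESL
        equal   = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        left, right = quicksort(lesser), quicksort(greater)   # parallel calls in NESL
        return left + equal + right

    print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]

Analyzed with the composition rules above, it does O(n log n) work and
O(log^2 n) depth in expectation.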

Applies the ideas from the previous paper to the NESL language by
specifying a provably-efficient language
implementation. The paper describes a "profiling" semantics for
NESL---which is an operational semantics extended with cost measures.
The profiling semantics tracks work and depth by returning a
DAG, and tracks sequential space by threading a store in a fixed
order. The paper then proves the relationship of these cost measures to
running time on various machine models.

Formally define an abstract cost model with which users of the language
can analyze their code.

Specify a mapping from these costs to running times on a concrete machine
model.

Describe an implementation and prove that this mapping is
achieved with the implementation.

This work was originally motivated
by the need to model time in the parallel programming language NESL,
but we have applied it to other parallel programming models.
We formalize the cost model using a profiling semantics, which
is an operational semantics extended with cost measures.
The following papers describe our results.

Describes a profiling semantics for the lambda-calculus assuming
that in an expression
e1 e2 the two subexpressions e1 and e2 can be evaluated
in parallel, but that they must both fully evaluate before the result of e1
can be applied to the result of e2. This is called call-by-value parallelism.
The semantics tracks work and depth as the
two cost measures. The paper shows several results including
the following.

An implementation of the lambda-calculus on a CREW PRAM that
guarantees that any expression that evaluates in W work and
D depth, will run on P processors in O(W/P + D log P) time.

A simulation of the PRAM on the lambda calculus that guarantees that
any computation that takes T time and M memory on a P processor
PRAM will run in T P log M work and T log M log P depth.

Shows how sorting can be implemented in O(n log n) work and O(log^2 n)
depth in the lambda-calculus.

The first two results imply that the class NC (Nick's class) can be
defined as computations in the lambda-calculus that can run within
polynomial work and polylogarithmic depth.
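
Sketching the argument (filling in the standard steps): if W = poly(n) and
D = polylog(n), then taking P = W processors the first result gives time

    O(W/P + D log P) = O(D log W) = polylog(n),

so the computation is in NC. Conversely, by the second result a PRAM
computation with T = polylog(n) time, P = poly(n) processors, and M = poly(n)
memory translates into

    T P log M = poly(n) work    and    T log M log P = polylog(n) depth.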

This extends the previous work to languages in which, for an expression
e1 e2, the function resulting from evaluating e1 can
start running before the evaluation of e2 is complete. This is the parallelism
available with futures and with lenient languages. Such parallelism
can reduce the depth relative to call-by-value parallelism, but
implementing such parallelism requires synchronizing on individual
variables. The paper describes an implementation that guarantees that
any expression that evaluates in W work and D depth,
will run on P processors in O(W/P + D log P) time on
a CRCW PRAM.

Formally defines a cost model for NESL using a profiling semantics.
This model defines work (W), depth (D), and
sequential space (S1). It then proves a mapping from this
model onto various machine models. For example it shows that on a
CREW PRAM a computation will run in O(W/P + D log P) time and
S1 + O(D P) space.

Pipelining has been used in the design of many PRAM algorithms to
reduce their asymptotic running time, but properly maintaining the
pipeline is very complicated for the programmer. This paper shows
how futures (a parallel language construct) can be used to implement
pipelining without requiring the user to code it explicitly, allowing
for much simpler code and more asynchronous execution. It describes
and analyzes three algorithms using this technique: a parallel merging
algorithm on trees, a parallel version of insertion into and deletion
from randomized balanced trees (treaps), and insertion into 2-3 trees.
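
To illustrate the flavor of pipelining with futures (a toy Python sketch
using thread-pool futures, not the algorithms from the paper): a producer
builds a "future list" whose tail is a future, so a consumer can start
working on the prefix before the producer has finished.

    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor()

    def produce(xs):
        # Build the list incrementally: each cell holds a value and a future
        # for the rest, so downstream work can begin before the whole list exists.
        if not xs:
            return None
        return (xs[0] * xs[0], pool.submit(produce, xs[1:]))

    def consume(cell):
        # Sum the stream, blocking only on the next cell as it is needed.
        total = 0
        while cell is not None:
            value, rest = cell
            total += value
            cell = rest.result()    # wait for just the next element
        return total

    print(consume(produce(list(range(10)))))   # 285

In the algorithms from the paper the analogous trick lets one pass (a merge
or an insertion) start consuming another's output while it is still being
produced, which is what removes the explicit pipeline management.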

This work considers issues of memory performance in shared memory
multiprocessors for algorithms with irregular memory accesses (e.g. pointer
based algorithms).
In particular it develops a model to account for the effect of contention and delay
at the memory banks.
The work was motivated by observed discrepancies between predicted and
actual performance in a number of irregular algorithms implemented for
the Cray C90 when the memory contention at a particular location is
high.
We show experimentally that our model accounts for the performance
of such algorithms much better than previous models do.