The Titanium language draws on the experience from other global
address space languages like Split-C and AC.
Titanium takes its control model, SPMD parallelism, from message
passing libraries like MPI. It takes the communication model from
thread-based languages, such as POSIX threads, in which threads
communicate implicitly by reading and writing to shared variables.
Like Split-C and AC, though, the Titanium address space is partitioned
into separate areas, typically one per processor, to give programmers
control over data layout. Titanium also draws on the experience of
Colella, Hilfinger, and Semenzato on a language for finite
differencing techniques, called FIDIL. Titanium includes support for
multidimensional arrays and a separate abstraction for index spaces to
ease the programming of stencil operations, hierarchical algorithms
such as multigrid, and adaptive mesh refinement (AMR). It also
provides support for immutable classes, region-based memory management,
and operator overloading.
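As an illustration of the array and index-space abstractions described above, the following is a plain-Java sketch (not actual Titanium syntax) of a 5-point stencil over the interior of a 2D grid, the kind of loop that Titanium's unordered `foreach` over a rectangular domain expresses in a single construct:

```java
// Plain-Java sketch of a stencil loop; in Titanium this would be a
// single `foreach` over a domain shrunk by one in each dimension.
public class Stencil {
    // Average each interior cell with its four neighbors.
    static double[][] relax(double[][] u) {
        int n = u.length, m = u[0].length;
        double[][] out = new double[n][m];
        for (int i = 1; i < n - 1; i++)        // interior points only,
            for (int j = 1; j < m - 1; j++)    // like a shrunken domain
                out[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                  + u[i][j-1] + u[i][j+1]);
        return out;
    }
}
```

Because the iterations are independent, the same loop body works unchanged on any patch of a block-structured mesh.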
An overview paper of the Titanium
language is available, along with a
language reference and an incomplete
tutorial. Several of the static
analysis techniques listed below involve language support along with
compiler and runtime support. An
early language design effort was based on C++, and the initial Java
version was a language called Split Java, designed by David Gay
as a course project.

Many parallel programs, including those written in Titanium, Split-C,
and message passing systems like MPI, are written in SPMD style,
i.e. by running the same sequential program on all processes. SPMD
programs include synchronization, but it is easy to write incorrect
synchronization patterns. We propose a system that verifies a
program's synchronization pattern. We also propose language features
to make the synchronization pattern more explicit and easily
checked. We have implemented a prototype of our system for Split-C and
successfully verified the synchronization structure of realistic
programs. This work is presented in a
POPL98 paper and a
master's thesis by David Gay.

Memory Management with Explicit Regions, Alexander Aiken and David Gay

In Titanium, as in Java, a kind of safe memory management is used to
keep programmers from accessing unallocated memory. Rather than
implementing this with garbage collection, Titanium uses a
region-based memory management scheme in which each allocation
specifies a region, and memory is reclaimed by destroying a region,
freeing all the storage allocated therein. This language feature is
based on a study done on region-based memory management in
allocation-intensive C programs, which shows that regions are
competitive with malloc/free and sometimes substantially faster due to
locality improvements. Our implementation uses static analysis to
improve safety and reduce the cost of runtime checking. Experience
with our benchmarks suggests that regions are easy to use. Regions
with static analysis are presented in a PLDI98 paper and a related report describing a fully
dynamic region-based approach.
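The region idiom can be sketched as follows in plain Java; the names (`Region`, `alloc`, `destroy`) are illustrative, not Titanium's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of region-based memory management: every allocation names a
// region, and destroying the region reclaims everything allocated in
// it at once. In a C implementation, destroy() would free the
// region's pages in one step rather than object by object.
public class Region {
    private final List<Object> objects = new ArrayList<>();
    private boolean live = true;

    // Allocate an object "in" this region.
    <T> T alloc(T obj) {
        if (!live) throw new IllegalStateException("region destroyed");
        objects.add(obj);
        return obj;
    }

    // Reclaim all storage in the region in one step.
    void destroy() {
        objects.clear();
        live = false;
    }

    int size() { return objects.size(); }
}
```

The runtime check in `alloc` is the kind of safety check that the static analysis mentioned above tries to eliminate.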

Titanium, Split-C, and most thread-based extensions of C and Fortran
involve separate threads that communicate through a global address
space. Any type of code motion on such explicitly parallel programs
requires a new kind of analysis to ensure that operations reordered on
one processor cannot be observed by another. This analysis is similar
to dependence analysis, but is called cycle detection, because it
involves looking for cycles in the shared variable accesses between
processors. This work is related to a large body of research on
memory consistency models, and in particular, cycle detection can be
used to implement a strong semantic model (such as sequential
consistency) on top of hardware that supports some weaker model. We
make no assumptions on the use of synchronization constructs: our
transformations preserve program meaning even in the presence of race
conditions, user-defined spin locks, or other synchronization
mechanisms built from shared memory. However, programs that use
linguistic synchronization constructs rather than their user-defined
shared memory counterparts will benefit from more accurate analysis
and therefore better optimization.
We demonstrate the use of this analysis for communication optimizations
on distributed memory machines by automatically transforming programs
written in a conventional shared memory style into a Split-C program,
which has primitives for non-blocking memory operations and one-way
communication. A paper
describing the basic analysis, along with a prototype implementation
for Split-C on the Connection Machine CM-5 appeared in the
Journal of Parallel and Distributed Computation, 1996. A
PhD Thesis by
Arvind Krishnamurthy describes more recent work, including
theoretical bounds on the algorithm and performance data from
the Berkeley NOW. The optimizations include message pipelining, to allow
multiple outstanding remote memory operations,
conversion of two-way to one-way communication, and elimination of
communication through data re-use.
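At its core, the analysis builds a graph whose nodes are shared-memory accesses, with edges for program order and for conflicts between processors, and asks whether that graph has a cycle; a cycle means a reordering could be observed by another processor. The graph encoding below is a simplification of the published analysis, shown only to make the "cycle detection" name concrete:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal cycle check over a directed access graph: nodes are shared
// accesses, edges encode program order and inter-processor conflicts.
public class CycleDetect {
    static boolean hasCycle(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) adj.get(e[0]).add(e[1]);
        int[] state = new int[n];   // 0 = unvisited, 1 = on stack, 2 = done
        for (int i = 0; i < n; i++)
            if (state[i] == 0 && dfs(i, adj, state)) return true;
        return false;
    }
    private static boolean dfs(int v, List<List<Integer>> adj, int[] state) {
        state[v] = 1;
        for (int w : adj.get(v)) {
            if (state[w] == 1) return true;      // back edge: cycle found
            if (state[w] == 0 && dfs(w, adj, state)) return true;
        }
        state[v] = 2;
        return false;
    }
}
```

Accesses that lie on no cycle are safe to reorder, which is what enables the message pipelining and one-way communication optimizations above.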

Memory Hierarchy Optimizations in Titanium,
Geoff Pike

One of the most striking trends in machine architecture over the last
decade has been the widening gap between CPU and memory speeds.
Today, the number of cache misses is often a better measure of program
performance than the number of instructions executed. The Titanium
language has an unordered loop construct called "foreach" to simplify
the optimization of loops over arrays. We are developing analyses to
reorder the computation on some of the computational kernels that are
common on block-structured meshes.
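The kind of reordering this enables can be sketched as loop tiling: because `foreach` is unordered, the compiler may traverse an array in cache-sized tiles rather than row by row, with identical results. A minimal plain-Java illustration:

```java
// The same reduction over a 2D array, traversed row-major versus in
// b-by-b tiles. Both orders give identical results because the loop
// is unordered, which is what Titanium's `foreach` promises.
public class Tiled {
    static double sumRowMajor(double[][] a) {
        double s = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++) s += a[i][j];
        return s;
    }
    static double sumTiled(double[][] a, int b) {
        double s = 0;
        for (int ii = 0; ii < a.length; ii += b)
            for (int jj = 0; jj < a[0].length; jj += b)
                for (int i = ii; i < Math.min(ii + b, a.length); i++)
                    for (int j = jj; j < Math.min(jj + b, a[0].length); j++)
                        s += a[i][j];
        return s;
    }
}
```

For stencil kernels, the tiled order reuses cached neighbor values before they are evicted, which is the point of the memory hierarchy optimizations described above.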

Method Inlining Optimization in Titanium,
Dan Bonachea

In this paper,
we discuss the design and implementation of a method
inlining optimization added to the Titanium compiler. We address issues relevant to the
design of any inlining optimization, and specifics relevant to Java and Titanium. Two
heuristics are developed for guiding the inlining decisions, a manual, programmer-assisted
heuristic and an automatic heuristic based on call graph structure and an analysis
of method characteristics. Results are presented detailing the effectiveness of each
heuristic and the performance improvements realized.
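A toy version of an automatic inlining heuristic is sketched below; the thresholds are illustrative only, and the paper's actual heuristics also use call graph structure and per-method characteristics:

```java
// Illustrative inlining decision: inline when the callee is small and
// not recursive, or when the call site is hot and the callee is only
// moderately large. Thresholds here are made up for the sketch.
public class InlineHeuristic {
    static boolean shouldInline(int calleeSize, boolean recursive, int callCount) {
        final int SIZE_LIMIT = 20;    // illustrative instruction-count cap
        final int HOT_CALLS  = 100;   // illustrative hotness threshold
        if (recursive) return false;                 // avoid code-size blowup
        if (calleeSize <= SIZE_LIMIT) return true;   // tiny callee: always inline
        return callCount >= HOT_CALLS && calleeSize <= 3 * SIZE_LIMIT;
    }
}
```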

Titanium also supports using Active Messages 2.0
as the standardized networking interface for some of the older cluster-based parallel backends.
Active Messages is a low-level, high-performance communication paradigm first
proposed by von Eicken et al.
that basically amounts to a super-lightweight RPC mechanism, which is generally implemented
as a zero-copy, fully user-level protocol that is highly tuned to the networking hardware.
Titanium uses several different AM 2.0 implementations for various backends:
Lanai AM, AMMPI, AMUDP, and AMLAPI.
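The core Active Message idea can be sketched in a few lines: a message carries a handler index plus arguments, and the receiver runs the named handler directly on arrival instead of buffering the message for a later receive. The names below are illustrative, not the AM 2.0 API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of active-message dispatch: handlers are registered under
// small integer indices; delivery runs the handler immediately.
public class AMSketch {
    interface Handler { int run(int arg); }
    private final Map<Integer, Handler> table = new HashMap<>();

    void register(int index, Handler h) { table.put(index, h); }

    // "Receive" a message: dispatch straight to the registered handler.
    int deliver(int index, int arg) { return table.get(index).run(arg); }
}
```

Real implementations avoid even this table lookup's overhead by running handlers at interrupt or poll time, entirely at user level.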

Global Address Space Support on the Cray T3D and T3E,
Arvind Krishnamurthy, David E. Culler and Katherine Yelick

We perform an empirical comparison of the Cray-T3E and the Cray-T3D
from the perspective of compiling a global address space language.
Both machines provide an elaborate shell to support different
forms of global memory access. We provide a detailed performance
characterization of the machine primitives and evaluate their utility
in code generation for a parallel language. We observe that the
changes made in the T3E result in fewer options for implementing
remote access and overall a much simpler compilation
strategy. Unfortunately, some of the remote access mechanisms have
worse raw hardware performance on the T3E due to extra logic on the
destination processor and the presence of a second-level cache on the
source processor. However, the language
implementation adds less overhead than a corresponding implementation
on the T3D resulting in better end-to-end performance for some of the
primitives. Papers describing the T3D
support and a comparison
between the T3D and T3E are both available.

Bulk File I/O Extensions to Java, Dan Bonachea

The file I/O classes present in Java have proven too inefficient to meet
the demands of high-performance applications that perform large amounts of
I/O. The inefficiencies stem primarily from the library interface which
requires programs to read arrays a single element at a time. We present two
extensions to the Java I/O libraries which alleviate this problem. The
first adds bulk (array) I/O operations to the existing libraries, removing
much of the overhead currently associated with array I/O. The second is a
new library that adds direct support for asynchronous I/O to enable
masking I/O latency with overlapped computation. The extensions were
implemented in Titanium, a high-performance, parallel dialect of Java. We
present experimental results that compare the performance of the extensions
with the existing I/O libraries on a simple, external merge sort
application. The results demonstrate that our extensions deliver vastly
superior I/O performance for this array-based application.
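The contrast between the two interfaces can be sketched in plain Java: reading n doubles one element at a time through `DataInputStream`, as the stock `java.io` API forces, versus one bulk transfer of the raw bytes followed by in-memory conversion. The bulk path only approximates what the Titanium extension exposes directly:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Element-at-a-time versus bulk reading of an array of doubles.
public class BulkRead {
    static double[] readElementwise(byte[] file, int n) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        double[] a = new double[n];
        for (int i = 0; i < n; i++) a[i] = in.readDouble();  // one call per element
        return a;
    }
    static double[] readBulk(byte[] file, int n) {
        double[] a = new double[n];
        // One bulk transfer plus an in-memory view conversion.
        java.nio.ByteBuffer.wrap(file).asDoubleBuffer().get(a);
        return a;
    }
}
```

The per-element path pays a method call (and, in a real file stream, possibly a system call) per value, which is the overhead the bulk extensions remove.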
The paper (PDF, gzipped-Postscript) describing this work will appear
in the Java Grande 2000 Conference. This copy is posted by permission
of ACM and may not be redistributed. The definitive version of this
paper is located here. A later paper (Postscript) on the same topic is
also available. Also see the Bulk I/O documentation.

Large-scale parallel machines are incorporating increasingly
sophisticated architectural support for user-level messaging and
global memory access, including network co-processors that range
in capability from limited special-purpose communication devices
to general-purpose co-processors with the same computational capabilities
as the main processor. We provide a systematic evaluation of a broad
spectrum of current design alternatives based on our implementations
of a global address language on the Thinking Machines CM-5, Intel
Paragon, Meiko CS-2, Cray T3D, and Berkeley NOW. This evaluation
includes a range of compilation strategies that make varying use of
the network processor; each is optimized for the target architecture
and the particular strategy. We analyze a family of interacting
issues that determine the performance trade-offs in each
implementation, quantify the resulting latency, overhead, and
bandwidth of the global access operations, and demonstrate the effects
on application performance. A
paper describing
this work appeared in ASPLOS '96.

Themis: Enforcing Data Consistency, Ben Liblit and Carleton Miyamoto

Titanium defines a data consistency model equivalent to the Java
specification. This consistency model calls for locally sequential
consistency, with global consistency attained at synchronization
points and arbitrary write reordering. The Titanium language is
targeted at homogeneous multiprocessors on distributed memory
architectures. Accesses to distant memory locations may result in
communication over a network. The Themis system builds a
Titanium-compliant consistency interface for Active Messages running
on a NOW. This paper examines several methods for designing a Themis
system that performs well for Titanium applications. Microbenchmark
tests and a Conjugate Gradient Method test show the viability of each
of these designs. One of the methods, a caching mechanism called
OrderCache, displays promise for reducing communication loads and
exploiting the locality of data within applications.
A paper is available in
postscript or
pdf.
See Ben Liblit's Themis page
for further information.

A Proposal for Threads in Titanium

Titanium is an explicitly-parallel SPMD language, which means that
there is a fixed set of threads started at program startup time. In
most implementations, there is a single thread per processor. While
SPMD programming is easy to understand, it lacks the expressiveness of
multi-threading, which allows the programmer to separate functionally
distinct segments of control flow. In this report we introduce a
multi-threaded extension of Titanium that can coexist with SPMD
programming and the global synchronization primitives, such as
barriers, that are common in SPMD programs. We discuss the changes to
the Titanium language and the consequences on ensuring correct and
deadlock-free code. We also describe a prototype implementation of
this extended Titanium.

Titanium allows for both global synchronization, such as
barriers and reductions, and pairwise synchronization for
mutual exclusion. The language mechanism for the latter is
taken from Java: methods or code blocks may be annotated
with the keyword "synchronized", which guarantees that at
most one thread will be in that block or method at the same
time. The runtime system supports this form of synchronization
by associating locks with objects. In Titanium implementations
on shared memory machines, this is built directly on the
underlying locking mechanism from a system like POSIX threads.
On distributed memory platforms, like the Berkeley NOW,
locking is integrated into the global referencing mechanism so
that the difference between shared and distributed memory
is functionally transparent.
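The Java-inherited mechanism described above looks like this in ordinary Java code, which is also valid Titanium:

```java
// A synchronized method guarantees at most one thread is executing
// it per object at a time; the runtime associates a lock with `this`.
public class Counter {
    private int value = 0;
    synchronized void increment() { value++; }  // acquires the object's lock
    synchronized int get() { return value; }
}
```

On a shared memory machine this maps directly onto a POSIX mutex; on distributed memory, the lock operations travel with the global reference as the text describes.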

As an initial study of extending Titanium to support unstructured
grids, we have developed an interface to the
PETSc
library for finite element problems. PETSc is implemented in C
using an object-oriented style; this work is also exploring
the general problem of interfacing to libraries written in other languages.

Floating Point in Java

Titanium targets scientific applications,
which are dominated by floating point computations. Research
on floating point support in Java and a
proposal for improvement was done
by Joe Darcy.

The software is being used in several current and planned immersed
boundary simulation applications. The two main projects are a larger,
more detailed model of the cochlea, and a distributed implementation
of C. Peskin and D. McQueen's heart model, which we
are working on refining. This work is part of the
NPACI Alpha program.

The goal of this project is twofold. First, we want
to build the CHOMBO
infrastructure for AMR (adaptive mesh
refinement) applications in the Titanium
environment, meanwhile providing a test case for
Titanium's performance and programmability. We hope
new AMR algorithms can be developed more easily in
Titanium by leveraging its strength: language-level,
one-sided, high-performance communication.

Currently, we are developing AMR algorithms for ocean modeling.

AMR Poisson Solver,
Luigi Semenzato, Geoff Pike, and Phillip Colella

This application is an elliptic solver that solves the Poisson
equation with variable coefficients; the algorithm is for a finite
domain in 3 dimensions. The algorithm uses block-structured adaptive
mesh refinement (AMR) that highlights the utility of Titanium's
multidimensional array, point, and domain abstractions. Some examples
of physical phenomenon that are modeled by the Poisson equations are:
electrostatic and gravitational potential, the heat equation, and
vorticity. We have worked with Astrophysists Richard Klein and Chris
McKee to use this solver on problems that arise in the collapse phase
of star formation. One of the challenges in achieving high performance
in a multigrid-style algorithm is the high ratio of communication to
computation. Values from adjacent grid cells sometimes live on
different processors.
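The cross-processor neighbor problem is usually handled with ghost cells: each patch keeps an extra layer mirroring the neighboring patch's edge, and a communication step refreshes it before each relaxation sweep. A minimal sketch, with two side-by-side patches and one ghost column each (the layout and names are illustrative):

```java
// Ghost-cell exchange between two horizontally adjacent patches.
// In each patch, column 0 and column (cols-1) are ghost columns;
// on distributed memory this copy is the communication step.
public class GhostExchange {
    static void exchange(double[][] left, double[][] right) {
        int lc = left[0].length;
        for (int i = 0; i < left.length; i++) {
            left[i][lc - 1] = right[i][1];       // right patch's first real column
            right[i][0]     = left[i][lc - 2];   // left patch's last real column
        }
    }
}
```

Once the ghosts are filled, each patch can relax its interior independently, which is why the communication-to-computation ratio, not the stencil itself, limits performance.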

This algorithm solves the Poisson equation for
constant coefficients over an infinite domain. The current
implementation is for two dimensional domains and is
non-adaptive; there are plans to extend it to three dimensions
in the near future, and possibly to develop an adaptive version
of the algorithm. One of the key features of this algorithm
is its scalability. Rather than performing some kind of
relaxation across the entire domain, complete solves are
done locally on fine grid patches without communication, and
then a single coarse grid is constructed over the entire domain.
There are implementations of this algorithm in both Titanium
and in the Kelp library.

Gas Dynamics,
Peter McCorquodale and Phillip Colella

This code solves the compressible Euler equations for gas dynamics,
which is an example of a hyperbolic system of conservation laws;
a simpler example of such a system is the advection equation. The
algorithm is currently implemented for
two dimensional domains and is nonadaptive, although an adaptive
version is underway and the modification to three dimensions is
planned for the near future. The algorithm uses infrequent
communication, and should therefore scale well to large numbers
of processors. A paper
describing this work is available.

The combined availability of a rapidly growing database of genomic sequences, and the technology to create oligonucleotide
arrays provides novel opportunities for genome-wide experimental measurements of biological systems. One challenge in
exploiting this technology is the selection of oligonucleotide probe sequences for making arrays.
Because of the increasing size of genomic sequence databases, determining an optimal set of probe sequences is expensive
computationally. We've implemented a parallel application in Titanium for selecting optimal oligonucleotide sequences from
the entire genome of simple organisms. Oligos are rated and selected based on melting temperature, secondary structure
(folding properties), and uniqueness amongst the genome of interest.
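One standard ingredient of such a rating is the Wallace rule, a rough melting-temperature estimate for short probes: Tm = 2(A+T) + 4(G+C) degrees Celsius. The sketch below shows only this one criterion; the application described above combines it with folding and genome-wide uniqueness scores:

```java
// Wallace-rule melting temperature for a short oligonucleotide:
// Tm = 2 * (count of A and T) + 4 * (count of G and C), in Celsius.
public class Oligo {
    static int wallaceTm(String seq) {
        int at = 0, gc = 0;
        for (char c : seq.toUpperCase().toCharArray()) {
            if (c == 'A' || c == 'T') at++;
            else if (c == 'G' || c == 'C') gc++;
        }
        return 2 * at + 4 * gc;
    }
}
```

Scoring millions of candidate oligos against a whole genome with criteria like this is what makes the selection problem computationally expensive and worth parallelizing.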