Supplying simulation data to the world.

Three years after its completion, the Millennium Run
(Research Highlight August 2004)
remains the largest simulation
of its type ever carried out. Four different galaxy formation models
have so far been built to populate it with galaxies, and well over 70
papers (follow
this link
for the most recent list)
have been submitted for publication based wholly or partially
on its numerical data. More than half of these are by authors who are
unassociated with the Virgo Consortium, the international
supercomputing collaboration which carried out the simulation. This
has been possible because of a concerted effort to release the data in
easily usable form. Their volume and complexity are such that
sophisticated databases with a high-level query language are needed to
promote effective public access. The Millennium Archive has been built
as one of the principal activities of
GAVO,
the German Astrophysical
Virtual Observatory, and it is currently the largest and most complete
application of Virtual Observatory techniques to the publication of
theoretical data from numerical simulations.

Fig. 1:
The web interface to the Millennium Simulation archive. The
query shown selects all galaxies with redshifts between 1.0 and 1.03
in a 0.1 degree slice in declination from a database containing a deep
mock survey of galaxies on a 1.4° × 1.4° area of the sky.

Fig. 2:
A plot made using the web-tool VOPlot of the positions in
redshift and right-ascension of the galaxies returned by the query of
Fig. 1.

The public dissemination of a large and complex data set such as the
Millennium Run brings challenges which are different from and go
beyond those which must be faced when setting up a public archive for
observational data. Many of these result from the great variety of
relations between the various objects in the database, as well as from
the many properties that can be assigned to each one. In practice,
most users are interested in the properties of dark matter halos and
galaxies, objects created from the simulation output through
post-processing. Dark matter halos are the basic nonlinear units of
the simulated universe. They have properties such as mass, size and
position, in addition to internal substructures (subhalos) which are
the remnants of objects which fell into them during their growth. The
Millennium Archive contains information for about 750 million halos
and subhalos, all linked in a tree structure which describes how each
object was built from those present at the immediately preceding time.
This is the data structure used by the galaxy formation algorithms.
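The linked tree described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Halo` record, its field names and the six toy halos are hypothetical and are not the archive's actual schema. Each halo stores a pointer to the object it becomes part of at the next output time, and inverting those pointers recovers everything a given halo was built from.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Halo:
    # Hypothetical record layout, chosen for illustration only.
    halo_id: int
    snapshot: int                  # output time index; larger means later
    mass: float                    # arbitrary units
    descendant_id: Optional[int]   # halo at the next snapshot, None at the final one


def build_progenitor_index(halos):
    """Invert the descendant pointers: map each halo id to its direct progenitors."""
    index = defaultdict(list)
    for halo in halos:
        if halo.descendant_id is not None:
            index[halo.descendant_id].append(halo.halo_id)
    return index


def all_progenitors(halo_id, index):
    """Collect the ids of every halo in the tree below halo_id (itself excluded)."""
    found = []
    stack = list(index.get(halo_id, []))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(index.get(current, []))
    return found


# Toy tree: halos 1 and 2 merge into 4, halo 3 becomes 5, and 4 and 5 merge into 6.
HALOS = [
    Halo(1, 0, 1.0, 4),
    Halo(2, 0, 2.0, 4),
    Halo(3, 0, 0.5, 5),
    Halo(4, 1, 3.0, 6),
    Halo(5, 1, 0.5, 6),
    Halo(6, 2, 3.5, None),
]

if __name__ == "__main__":
    progenitor_index = build_progenitor_index(HALOS)
    print(sorted(all_progenitors(6, progenitor_index)))
```

Storing only the forward pointer keeps each record small; the reverse lookup is built once and reused for every traversal.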

Galaxy formation is a complicated and uncertain process, and many
physical models for its various aspects must be tried in order to
establish those which best describe observed phenomena. A principal
goal of the Millennium Run project is to provide a framework for
comparing different galaxy formation models to observational data. It
is thus important to make available simulated galaxy catalogues with a
variety of assumptions about the physics of galaxy formation so that
users can get a feel for the uncertainties involved. A galaxy
catalogue for the full Millennium Run has about 1 billion entries. For
each of these galaxies many properties can be calculated by the
formation model and must be stored in the database.

In addition, pointers are
needed to connect the galaxies present at different times, and these
produce a tree data structure which gives the merger history of each
galaxy and which parallels (but is different from) the halo formation
trees.
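As a rough sketch of how such pointers yield a merger history inside a relational database, the example below follows them with a recursive SQL query. Everything in it is an assumption made for illustration: the `galaxy` table, its column names and the toy rows are hypothetical, and SQLite (via Python's standard `sqlite3` module) stands in for whatever database engine the archive actually uses. Galaxies 13 and 14 sit in the same halo at the same time yet remain separate branches, which is one way the galaxy tree can differ from the halo tree.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of toy galaxies (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE galaxy ("
        " galaxy_id INTEGER PRIMARY KEY,"
        " snapshot INTEGER NOT NULL,"
        " halo_id INTEGER NOT NULL,"
        " stellar_mass REAL NOT NULL,"
        " descendant_id INTEGER)"   # pointer to the galaxy at the next snapshot
    )
    rows = [
        (10, 0, 1, 0.2, 13),
        (11, 0, 2, 0.5, 13),
        (12, 0, 3, 0.1, 14),
        (13, 1, 4, 0.8, 15),
        (14, 1, 4, 0.1, 15),    # shares halo 4 with galaxy 13 but is still distinct
        (15, 2, 6, 1.0, None),
    ]
    conn.executemany("INSERT INTO galaxy VALUES (?, ?, ?, ?, ?)", rows)
    return conn


# Start from one galaxy and repeatedly add every galaxy that points at a member
# of the history found so far.
MERGER_HISTORY_SQL = """
WITH RECURSIVE history(galaxy_id, snapshot, stellar_mass) AS (
    SELECT galaxy_id, snapshot, stellar_mass
    FROM galaxy
    WHERE galaxy_id = ?
    UNION ALL
    SELECT g.galaxy_id, g.snapshot, g.stellar_mass
    FROM galaxy AS g
    JOIN history AS h ON g.descendant_id = h.galaxy_id
)
SELECT galaxy_id, snapshot, stellar_mass
FROM history
ORDER BY snapshot DESC, galaxy_id
"""


def merger_history(conn, galaxy_id):
    """Return (galaxy_id, snapshot, stellar_mass) for a galaxy and all its progenitors."""
    return conn.execute(MERGER_HISTORY_SQL, (galaxy_id,)).fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for row in merger_history(connection, 15):
        print(row)
```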

An important issue which has to be addressed comes from the fact that
users wish to use the Millennium Run for a wide variety of purposes
and the view of the data which is most convenient for them depends
on their project. This requires that the data be delivered in a manner
that is more flexible than the traditional download of "flat files".
To this end the MPA/MPE/GAVO group decided to use a relational
database for storing the post-processing results of the Millennium
Run. The main reason for this choice is that relational databases
support a flexible and intuitive query language (SQL), which allows
users to select those objects that are of interest, in a form of
their own choice and without requiring knowledge of the physical
storage of the data. In the database this language is implemented by
query engines that interpret the potentially complex requests and
execute them efficiently.
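To make the idea of selecting objects by their properties concrete, here is a minimal sketch of a query in the spirit of the one described in the caption of Fig. 1: galaxies with redshifts between 1.0 and 1.03 inside a 0.1 degree slice in declination. The table `mock_survey`, its column names and the five rows are hypothetical, and SQLite through Python's standard `sqlite3` module is used only so the example is self-contained; the archive's real tables and engine may differ.

```python
import sqlite3


def build_mock_survey():
    """Create an in-memory toy mock-survey table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mock_survey ("
        " galaxy_id INTEGER PRIMARY KEY,"
        " ra_deg REAL NOT NULL,"      # right ascension in degrees
        " dec_deg REAL NOT NULL,"     # declination in degrees
        " redshift REAL NOT NULL)"
    )
    rows = [
        (1, 0.10, 0.02, 1.01),   # inside both ranges
        (2, 0.55, 0.08, 1.02),   # inside both ranges
        (3, 0.70, 0.05, 0.95),   # redshift too low
        (4, 1.20, 0.50, 1.01),   # outside the declination slice
        (5, 0.90, 0.09, 1.20),   # redshift too high
    ]
    conn.executemany("INSERT INTO mock_survey VALUES (?, ?, ?, ?)", rows)
    return conn


# The user states which rows are wanted and in what form; how the rows are
# found on disk is left to the query engine.
SLICE_QUERY = """
SELECT galaxy_id, ra_deg, redshift
FROM mock_survey
WHERE redshift BETWEEN :z_min AND :z_max
  AND dec_deg BETWEEN :dec_min AND :dec_max
ORDER BY ra_deg
"""


def select_slice(conn, z_min, z_max, dec_min, dec_max):
    """Return (galaxy_id, ra_deg, redshift) rows inside the redshift and declination ranges."""
    params = {"z_min": z_min, "z_max": z_max, "dec_min": dec_min, "dec_max": dec_max}
    return conn.execute(SLICE_QUERY, params).fetchall()


if __name__ == "__main__":
    connection = build_mock_survey()
    for row in select_slice(connection, 1.0, 1.03, 0.0, 0.1):
        print(row)
```

The same table could just as easily be filtered on other columns or joined to other tables, which is the flexibility that a fixed flat-file download lacks.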

Online access to the Millennium database is provided through a
web-based query interface
(see Fig. 1). Besides
offering documentation and example queries, the interface lets users type in their
own SQL queries and execute them. The results can be directly returned to
the user, they can be plotted online (see Fig. 2), or they can
be stored for further analysis in a private database that is assigned
to registered users. This approach is directly modelled on the highly
successful SDSS SkyServer database
(http://cas.sdss.org/dr6/en/).
At the time of writing there are over 160 registered users of the
Millennium Archive site with local disk space allocated for storage
and manipulation of the results of their queries. About 80% of these
have successfully executed queries on the main databases. Roughly half
appear to be already carrying out significant research programmes
(more than 50 successful queries), while about 20% can be
characterised as heavy users (more than 1000 successful queries).
On average over 500 million rows of data are being downloaded from the
site per week. The user group is still growing rapidly and it will
probably be several years before the archive's success in generating
new science from the Millennium Simulation can be properly assessed.