Middleware for Filtering Large Archival Scientific Datasets in a Grid Environment

Increasingly powerful computers have made it possible for computational scientists and engineers to
model physical phenomena in great detail. As a result, overwhelming amounts of data are being
generated by scientific and engineering simulations. In addition, large amounts of data are being
gathered by sensors of various sorts, attached to devices such as satellites and microscopes. The
primary goal of generating data through large scale simulations or sensors is to better understand the
causes and effects of physical phenomena. Thus, the exploration and analysis of large datasets play
an increasingly important role in many domains of scientific research. The continuing increase in the
capabilities of high performance computers and sensor devices implies that datasets with sizes up to
petabytes will be common in the near future. Such vast amounts of data require the use of archival
storage systems distributed across a wide-area network. Simulation or sensor datasets generated or
acquired by one group may need to be accessed over a wide-area network by other groups. Efficient
storage, retrieval, and processing of multiple large scientific datasets on remote archival storage
systems is therefore one of the major challenges that must be addressed to explore and analyze
these datasets effectively. Software support is required to let users obtain the subsets they need
from very large, remotely stored datasets.

DataCutter is a middleware infrastructure that enables processing of scientific datasets stored in
archival storage systems across a wide-area network. It supports subsetting of datasets through
multi-dimensional range queries, as well as application-specific aggregation on scientific
datasets stored in an archival storage system.
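To make the idea of a multi-dimensional range query concrete, the following is a minimal sketch in Python. It is not DataCutter's interface: the `Chunk` record, the `Box` layout (one `(low, high)` pair per dimension), and the function names are assumptions made for illustration. The sketch treats a dataset as a list of chunks, each tagged with an axis-aligned bounding box, and returns the chunks whose box overlaps the query box.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical layout: one (low, high) pair per dimension.
Box = Tuple[Tuple[float, float], ...]


@dataclass
class Chunk:
    """Illustrative stand-in for a piece of a dataset and its spatial extent."""
    name: str
    bounds: Box


def intersects(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes overlap in every dimension."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


def range_query(chunks: List[Chunk], query: Box) -> List[Chunk]:
    """Select the chunks whose bounding box overlaps the query box."""
    return [c for c in chunks if intersects(c.bounds, query)]


chunks = [
    Chunk("a", ((0, 10), (0, 10))),
    Chunk("b", ((20, 30), (0, 10))),
    Chunk("c", ((5, 15), (5, 15))),
]
selected = range_query(chunks, ((8, 12), (8, 12)))
print([c.name for c in selected])
```

Only the selected chunks would then need to be retrieved from storage and passed on for aggregation.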

DataCutter provides a core set of services on top of which application developers can implement more
application-specific services, or which they can combine with existing Grid services such as metadata
management, resource management, and authentication. The main design objective of DataCutter is to
extend the features of the Active Data Repository (ADR), namely support for accessing subsets of
datasets via range queries and user-defined filtering operations, to very large datasets in a shared
distributed computing environment. In ADR, data processing is performed where the data is stored
(i.e., at the data server). In a Grid environment, however, it may not always be feasible to
perform data processing at the server, for several reasons. First, resources at a server (e.g., memory,
disk space, processors) may be shared by many other competing users, so it may not be efficient or
cost-effective to perform all processing at the server. Second, datasets may be stored on distributed
collections of storage systems, so that accessing data from a centralized server may be very
expensive. Moreover, distributed collections of shared computational and storage systems can provide
a more powerful and cost-effective environment than a centralized server, if they can be used
effectively. Therefore, to make efficient use of distributed shared resources within the DataCutter
framework, the application processing structure is decomposed into a set of processes, called filters.
DataCutter uses these distributed processes to carry out a rich set of queries and application-specific
data transformations. Filters can execute anywhere (e.g., on computational farms), but are intended to
run on a machine that is close, in terms of network connectivity, to the archival storage server, or
within a proxy server.
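The decomposition into filters can be illustrated with a small sketch. In DataCutter the filters are distributed processes; the Python below is only an in-process analogy that uses generators as stand-ins for filters and for the streams between them. The record layout `(x, y, value)` and the names `read_filter`, `clip_filter`, and `max_filter` are hypothetical and are not part of DataCutter.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical record layout: (x, y, value).
Record = Tuple[float, float, float]


def read_filter(records: Iterable[Record]) -> Iterator[Record]:
    """Source filter: emits records one at a time, as if read from storage."""
    for record in records:
        yield record


def clip_filter(stream: Iterable[Record],
                x_range: Tuple[float, float],
                y_range: Tuple[float, float]) -> Iterator[Record]:
    """Subsetting filter: passes on only records inside the 2-D query range."""
    (xlo, xhi), (ylo, yhi) = x_range, y_range
    for x, y, value in stream:
        if xlo <= x <= xhi and ylo <= y <= yhi:
            yield (x, y, value)


def max_filter(stream: Iterable[Record]) -> Optional[float]:
    """Aggregation filter: reduces the stream to its largest value."""
    best = None
    for _, _, value in stream:
        if best is None or value > best:
            best = value
    return best


data = [(0, 0, 1.0), (5, 5, 7.0), (9, 9, 3.0), (20, 20, 99.0)]
result = max_filter(clip_filter(read_filter(data), (0, 10), (0, 10)))
print(result)
```

Because each stage only consumes and produces a stream, a stage that discards most of its input (here, the clipping stage) is a natural candidate to run close to the storage server, so that less data crosses the wide-area network.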

Another goal of DataCutter is to provide common support for subsetting very large datasets through
multi-dimensional range queries. A very large dataset may comprise many large data files, and
thus present a large space to index. A single index for such a dataset could be very large and expensive to
query and manipulate. To ensure scalability, DataCutter uses a multi-level hierarchical indexing
scheme.
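The text does not describe how the hierarchy is organized, so the following sketch shows just one possible arrangement, assumed purely for illustration: a two-level index in which a small summary level holds one bounding box per data file, and a detailed level holds the bounding boxes of the chunks within each file. The class `TwoLevelIndex`, its methods, and the use of byte offsets to identify chunks are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one (low, high) pair per dimension.
Box = Tuple[Tuple[float, float], ...]


def intersects(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes overlap in every dimension."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


class TwoLevelIndex:
    """Summary level: one box per file. Detail level: one box per chunk."""

    def __init__(self) -> None:
        self.summary: Dict[str, Box] = {}
        self.detail: Dict[str, List[Tuple[int, Box]]] = {}

    def add_chunk(self, file: str, offset: int, box: Box) -> None:
        """Register a chunk and grow the file's summary box to cover it."""
        self.detail.setdefault(file, []).append((offset, box))
        old = self.summary.get(file)
        if old is None:
            self.summary[file] = box
        else:
            self.summary[file] = tuple(
                (min(olo, lo), max(ohi, hi))
                for (olo, ohi), (lo, hi) in zip(old, box))

    def query(self, q: Box) -> List[Tuple[str, int]]:
        """Prune whole files at the summary level, then scan matching files."""
        hits: List[Tuple[str, int]] = []
        for file, file_box in self.summary.items():
            if not intersects(file_box, q):
                continue  # the detailed index of this file is never consulted
            hits.extend((file, offset)
                        for offset, box in self.detail[file]
                        if intersects(box, q))
        return hits


index = TwoLevelIndex()
index.add_chunk("f1", 0, ((0, 10), (0, 10)))
index.add_chunk("f1", 4096, ((10, 20), (0, 10)))
index.add_chunk("f2", 0, ((100, 110), (0, 10)))
print(index.query(((12, 15), (2, 3))))
```

The point of the split is that the summary level stays small enough to search cheaply, while the larger per-file indices are consulted only for files that can actually contain matching data.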

DataCutter is also being integrated with the Storage Resource Broker (SRB), under development at the
San Diego Supercomputer Center through the NPACI consortium. The SRB provides transparent access to
distributed storage resources in a Grid environment, and DataCutter will enhance the SRB services to
allow for subsetting and filtering of large archival datasets stored through the SRB.