T16 - Best Practices in Ingestion and Data Access at the Infrared Processing and Analysis Center

G. Bruce Berriman, California Institute of Technology

The Infrared Processing and Analysis Center (IPAC) at Caltech hosts the NASA/IPAC
Infrared Science Archive (IRSA) and the Michelson Science Center (MSC) Archive. IRSA is
the steward of the scientific data sets of NASA's Infrared missions, and the MSC facilitates
NASA's planet-finding and exo-planet science program, including multi-mission archives.
Together they serve nearly 30 TB of data across the entire electromagnetic spectrum from
17 missions and projects. They share a common hardware and software architecture. This
presentation describes their best practices in the areas of ingestion and user access.

Ingestion: While some providers are large missions, others are small groups of astronomers
inexperienced in delivering products. Providing standards and interface specifications for
data delivery within a Submission Information Package is necessary for ingestion but has
proven insufficient. Communication with the provider starts at the beginning of the project,
and the provider is asked to deliver draft products for inspection before their pipelines have
entered production. On-line validation tools, whose design has been driven by common
mistakes in data delivery, have proven a powerful aid to providers. Functionality offered
includes validation of the structure and content of catalogs; generation of
documentation of the attributes of catalogs; registration of images on the sky; and
validation of the syntax, content and astrometric accuracy of astronomical images.
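
The kind of catalog check described above can be illustrated with a minimal Python sketch. It is not the IPAC validation tool: the required column names (`ra`, `dec`), the delimited-text input format and the function name `validate_catalog` are all assumptions made for illustration. It checks structure (required columns present) and content (numeric positions within legal ranges) and reports each problem by line.

```python
import csv
import io

# Hypothetical required columns; a real delivery specification would define its own.
REQUIRED_COLUMNS = ("ra", "dec")


def validate_catalog(text, delimiter=","):
    """Return a list of human-readable problems found in a delimited catalog."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []

    # Structure check: every required column must appear in the header.
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
        return problems

    # Content check: positions must be numeric and within legal ranges.
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            ra = float(row["ra"])
            dec = float(row["dec"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric ra/dec")
            continue
        if not 0.0 <= ra < 360.0:
            problems.append(f"line {line_no}: ra {ra} outside [0, 360)")
        if not -90.0 <= dec <= 90.0:
            problems.append(f"line {line_no}: dec {dec} outside [-90, 90]")
    return problems
```

A provider could run such a check on a draft delivery and correct the reported lines before the pipeline enters production.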

Access: The archives must return in real time subsets of the large data sets (catalogs, images and
spectra) that they will curate for the indefinite future. The archives are optimized for efficient
access, maintainability and portability, and are highly fault tolerant. Catalogs are housed in
flat tables on a high-end EMC disk farm configured as RAID 0+1. An Informix DBMS offers
dynamic parallelization of queries, but indexing for spatial queries is resident in memory
outside the database. There are no stored procedures in the DBMS. All queries are composed
through "thin" interfaces that sit atop a component-based architecture of re-usable ANSI-C
modules that are "plugged" together for easy development of new applications. This
architecture enables cost-effective deployment of new access services, such as those provided
for the NASA Stellar and Exo-planet Database, the Cosmic Evolution Survey Archive and the
Keck Observatory Archive.
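
The idea of a memory-resident spatial index held outside the database can be sketched as follows. The abstract does not say which indexing scheme the archives use, so this is only an illustrative stand-in: a hypothetical `DecBandIndex` class that bins row identifiers into declination bands and answers a cone search by scanning the overlapping bands and applying an exact great-circle test. The returned row identifiers would then be used to fetch records from the DBMS.

```python
import math
from collections import defaultdict


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


class DecBandIndex:
    """Hypothetical in-memory index: sources binned by declination band."""

    def __init__(self, band_deg=1.0):
        self.band_deg = band_deg
        self.bands = defaultdict(list)

    def _band(self, dec):
        # Map a declination in [-90, 90] to an integer band number.
        return int(math.floor((dec + 90.0) / self.band_deg))

    def add(self, row_id, ra, dec):
        self.bands[self._band(dec)].append((row_id, ra, dec))

    def cone_search(self, ra, dec, radius_deg):
        """Return row ids within radius_deg of (ra, dec)."""
        # Only bands overlapping the cone's declination range can hold matches.
        lo = self._band(max(-90.0, dec - radius_deg))
        hi = self._band(min(90.0, dec + radius_deg))
        hits = []
        for b in range(lo, hi + 1):
            for row_id, r, d in self.bands.get(b, ()):
                if angular_separation(ra, dec, r, d) <= radius_deg:
                    hits.append(row_id)
        return hits
```

Banding on declination alone keeps the sketch correct across the RA = 0/360 wrap, because the exact separation test handles right ascension; a production index would normally also partition in right ascension or use a hierarchical sky tessellation to prune more candidates.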