Pig as a Solution for Accessing Peta-scale Astronomical Datasets

Zolotukhin, Ivan

The coming decades are widely expected to be an era of data-intensive astronomy. The exponential growth of data volumes, together with clear evidence that public access increases the impact of large projects, poses a major challenge: providing public access interfaces to huge astronomical datasets in a form that the worldwide research community can actually use. Parallel database products either become prohibitively expensive at peta-scale or, having been developed for industry needs, are simply ill-suited to scientific problems. In this talk we discuss a MapReduce-based stack of open-source technologies that appears capable of addressing these issues, namely the Pig platform for analyzing large datasets. Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs in parallel, which enables almost linear scalability, particularly for cross-matching huge catalogs.
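
To make the cross-matching use case concrete, the following is a minimal single-process sketch in Python of how a positional cross-match can be phrased as a map step followed by a reduce step. It is not taken from the talk and is not Pig code. The declination-zone partitioning, the function names, the (id, ra, dec) tuple layout, and the sample coordinates are all assumptions chosen for illustration. In a real MapReduce or Pig job the grouping by zone would be done by the framework across many machines.

```python
import math
from collections import defaultdict


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def zone_of(dec, zone_height_deg):
    """Index of the declination strip (zone) that contains dec."""
    return int(math.floor((dec + 90.0) / zone_height_deg))


def map_phase(catalog_a, catalog_b, zone_height_deg):
    """Map step (assumed scheme): key every source by its declination zone.

    Sources of catalog A go to their own zone only. Sources of catalog B
    are also copied to the two neighbouring zones, so that pairs lying
    across a zone boundary still meet in one group.
    """
    groups = defaultdict(lambda: ([], []))
    for src in catalog_a:
        groups[zone_of(src[2], zone_height_deg)][0].append(src)
    for src in catalog_b:
        z = zone_of(src[2], zone_height_deg)
        for neighbour in (z - 1, z, z + 1):
            groups[neighbour][1].append(src)
    return groups


def reduce_phase(groups, radius_deg):
    """Reduce step: compare A and B sources within each zone group.

    Groups are independent of each other, which is what would allow a
    framework to process them in parallel.
    """
    matches = []
    for zone in sorted(groups):
        a_list, b_list = groups[zone]
        for id_a, ra_a, dec_a in a_list:
            for id_b, ra_b, dec_b in b_list:
                if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                    matches.append((id_a, id_b))
    return matches


def cross_match(catalog_a, catalog_b, radius_deg, zone_height_deg=1.0):
    """Return (id_a, id_b) pairs separated by at most radius_deg."""
    if zone_height_deg < radius_deg:
        # Copying to adjacent zones only covers radii up to one zone height.
        raise ValueError("zone_height_deg must be >= radius_deg")
    return reduce_phase(map_phase(catalog_a, catalog_b, zone_height_deg), radius_deg)


# Hypothetical sample data: (id, ra_deg, dec_deg).
catalog_a = [("a1", 10.0, 0.9999), ("a2", 200.0, -45.0)]
catalog_b = [("b1", 10.0, 1.0001), ("b2", 200.0, -44.0)]

# One-arcsecond match radius; a1 and b1 sit on opposite sides of a zone boundary.
matches = cross_match(catalog_a, catalog_b, radius_deg=1.0 / 3600.0)
print(matches)
```

Because each catalog A source lands in exactly one zone, no pair is reported twice, and because zone groups share no state, the work divides naturally across workers. This is one plausible reading of why such a workload could scale close to linearly; the talk itself does not specify the partitioning scheme.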