My past projects during my graduate studies include ShuffleWatcher, shuffle-aware scheduling in multi-tenant MapReduce clusters (USENIX ATC 2014); Tarazu, optimizing MapReduce on heterogeneous clusters (ASPLOS 2012); PowerTrade, a joint optimization of idle power and cooling power to reduce overall data center power (ASPLOS 2010); and MaRCO, a runtime performance optimization for MapReduce, the well-known programming model for large-volume data analysis in data centers (Tech Report 2007). During this work, I also developed a benchmark suite for MapReduce (details below). I have also worked on architecture support for debugging multithreaded programs on multicores (TimeTraveler, ISCA 2010).

PUMA: MapReduce Benchmarks

MapReduce is a well-known programming model, developed at Google, for processing large amounts of raw data, such as crawled documents or web request logs. This data is usually so large that it must be distributed across thousands of machines to be processed in a reasonable time. Its ease of programming, automatic data management, and transparent fault tolerance have made MapReduce a popular choice for batch processing in large-scale data centers. Map, written by a user of the MapReduce library, takes an input pair and produces a set of intermediate key/value pairs. The library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function through an all-map-to-all-reduce communication called Shuffle. Reduce, also written by the user, receives an intermediate key along with the set of values for that key from Map and merges these values to produce the final output. Hadoop is an open-source implementation of MapReduce that is actively developed and improved by software developers and researchers and is maintained by the Apache Software Foundation. Despite the vast effort devoted to the development of Hadoop MapReduce, there has been little rigorous work on benchmarks.
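
To make the Map, Shuffle, and Reduce steps concrete, here is a minimal single-process sketch of a word count job in Python. It is only an illustration of the programming model, not Hadoop code and not part of the benchmark suite; the names map_fn, shuffle, reduce_fn, and run_job are hypothetical and chosen for this example, and a real cluster would run the map and reduce calls on many machines and move the grouped data over the network.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: take one input pair and emit intermediate (word, 1) pairs."""
    for word in text.split():
        yield word.lower(), 1


def shuffle(mapped_pairs):
    """Shuffle: group all intermediate values by their intermediate key.

    In a real cluster this is the all-map-to-all-reduce communication;
    here it is simulated with an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: merge all values for one key into a final output pair."""
    return key, sum(values)


def run_job(inputs):
    """Run map, shuffle, and reduce in sequence over (doc_id, text) pairs."""
    mapped = [pair for doc_id, text in inputs for pair in map_fn(doc_id, text)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical toy input standing in for crawled documents.
    docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]
    print(run_job(docs))
```

In this sketch the size of the list returned by the map phase stands in for the shuffle volume, and the cost of map_fn and reduce_fn stands in for the computation, which are the two characteristics that distinguish one MapReduce application from another.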

During our work on MapReduce, we developed a benchmark suite that represents a broad range of MapReduce applications, covering both high and low computation and both high and low shuffle volumes. The details of the applications, their code (compatible with Hadoop-0.20 and Hadoop-1.0.0), and the input datasets can be found below.