
MapReduce
Jeffrey Dean and Sanjay Ghemawat

Background context

BIG DATA!
o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc.
o Very useful to be able to crunch through that data (mining / processing). But doing so efficiently requires a large-scale, parallel data processing system.
  - How long does it take to go through 1 TB sequentially? Roughly 3.5 hours at 80 MB/s. What if you have to do multiple scans or phases?
o Parallel processing introduces a host of nasty distribution issues
  - Communication and routing: which nodes should be involved? what transport protocol should be used? how do you manage threads / events / connections? how do you remotely execute your processing code?
  - Fault tolerance and fault detection; more broadly, group membership
  - Load balancing / partitioning of data: heterogeneity of nodes, skew in the data, network topology and bisection bandwidth issues
o Also requires you to think about parallelization strategy: how do you split the work? Algorithmic issues.

Goal of work
o To solve these distribution issues once, in a reusable library
  - to shield the programmer from having to resolve them each time they write a data analysis program
o To obtain adequate throughput and scalability out of the system
o To provide the programmer with a useful conceptual framework for designing their parallel algorithm
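The scan-time estimate above is easy to sanity-check. A back-of-the-envelope sketch (assuming 1 TB = 10^12 bytes and a sustained sequential read rate of 80 MB/s; the 1,000-worker figure is illustrative):

```python
# Back-of-the-envelope check of the sequential-scan claim above.
TB = 10**12
MB = 10**6

scan_seconds = TB / (80 * MB)     # 12,500 s for one pass over 1 TB
scan_hours = scan_seconds / 3600  # ~3.5 hours

print(f"{scan_hours:.1f} hours for one sequential pass")

# Splitting the scan across 1,000 workers (ignoring coordination
# overhead) drops a single pass to ~12.5 seconds:
print(f"{scan_seconds / 1000:.1f} s with 1,000 parallel workers")
```

This is exactly the gap that motivates a parallel framework: multiple scans or phases multiply the hours, while parallelism divides them.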

Map/reduce

Two phases, conceptually, with a hidden intermediate shuffle phase.

Map
o For each key/value pair in the input file, the map() function is invoked
  - the set of input records is split into M different map splits
  - the splits are processed in parallel by different worker invocations
  - each worker serially iterates through its split
  - each map() produces a set of intermediate key/value pairs as output, written into a local file on the map worker and bucketed by the reduce partitioning function into R buckets, i.e., one bucket per reduce task

Shuffle
o For each reduce bucket, a reduce worker is identified; that worker slurps all of its buckets from the map workers' local files
o The reduce worker then aggregates by key and sorts within each key, to get the (k2, list(v2)) form

Reduce
o For each key in the intermediate output, the reduce() function is invoked
  - the invocation accepts the key and a sorted list of values
  - reduce() does some kind of merge or filtering, and emits a list of values to an output file

Examples

sort
o map(linenum, line) { emit(sortkey(line), line) }
o reduce(sortkey, list(line)) { emit(list(line)) }

inverted index
o map(docid, doctext) { foreach word in doctext { emit(word, docid) } }
o reduce(word, list(docid)) { emit(word, sort(list(docid))); }

grep
o map(linenum, line) { if (grep(line, pattern)) emit(linenum, line) }
o reduce(linenum, list(line)) { emit(list(line)) }

join
o much harder (see "A Comparison of Join Algorithms for Log Processing in MapReduce")
o the standard approach is the repartition join, a.k.a. partitioned sort-merge in database parlance
  - imagine you have a log table L and some reference table R that contains user information
  - you want to do an equijoin of L with R, where L.k = R.k
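The map → shuffle → reduce flow and the inverted-index example above can be sketched as a toy, single-process simulation (an illustrative sketch, not Google's implementation; the real system distributes the map and reduce tasks across workers and materializes the intermediate buckets to disk):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy in-memory sketch of map -> shuffle -> reduce.
    The 'shuffle' here is simply grouping intermediate pairs by key."""
    # Map phase: each (k1, v1) input record yields intermediate (k2, v2) pairs.
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    # Reduce phase: reduce_fn is invoked once per key, with all its values.
    output = {}
    for k2 in sorted(intermediate):
        output[k2] = reduce_fn(k2, intermediate[k2])
    return output

# Inverted index from the notes: map emits (word, docid); reduce sorts.
def index_map(docid, doctext):
    for word in doctext.split():
        yield word, docid

def index_reduce(word, docids):
    return sorted(set(docids))

docs = [("d1", "big data map reduce"), ("d2", "map reduce wins")]
index = run_mapreduce(docs, index_map, index_reduce)
# index["map"] == ["d1", "d2"]; index["big"] == ["d1"]
```

The grep and sort examples drop into the same harness by swapping in different map_fn / reduce_fn pairs.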

  - typically L >> R
o implemented as a single MR job; the input is both L and R, and each map task works on a split of L or of R
o map task: tags each record with its originating table, and outputs (join key, tagged record) as the key/value pair
o the shuffle phase will partition across the join key, aggregate the join-key partitions, and sort
o reduce task: for each join key, separate and buffer the input records into two sets according to the tag, then do a cross-product between the two sets
o problems
  - both L and R end up having to be sorted by the shuffle phase
  - both L and R end up having to be sent over the network during the shuffle phase
  - the reduce task might not be able to keep its entire input in memory, for skewed inputs
o can do a broadcast join instead: if R is tiny, replicate R on each map worker and build an in-memory hash table of R in each worker; then hash-join against L in the map phase
o several others as well; which is best depends on a bunch of things: the rest of the MR job, data selectivity and skew, etc.

Semantic issues

In what order are the input values consumed?
o Order is determined by a split configuration value in the user program
o The right way to think about it is as unordered: consumption order shouldn't matter for the correctness of your map/reduce program, but it could affect the performance of the shuffle phase in the middle

In what order are the intermediate values consumed?
o This matters more: you might want the final output to be grouped across reduce keys in some additional way, so you need to control the partitioning of the intermediate files across reduce tasks
o The partition function is specifiable by the mapreduce program (with a default that does hash partitioning)

Should map or reduce have side-effects?
o It's up to you, but...
o the implementation does not guarantee exactly-once semantics; it provides at-least-once (except for deterministic bugs)
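Returning to the repartition join described above: its reduce-side logic can be sketched as follows (a toy in-memory version; the "L"/"R" tag values and the sample records are illustrative, and the buffering step is exactly where the skew problem noted above bites):

```python
from itertools import product

def repartition_join_reduce(join_key, tagged_records):
    """Reduce task for one join key. Assumes the shuffle has already
    grouped (tag, record) pairs by join key; tags mark the origin table."""
    left, right = [], []
    for tag, record in tagged_records:        # separate by originating table
        (left if tag == "L" else right).append(record)
    # Cross-product of the two buffered sets. For skewed keys, both sets
    # must fit in memory -- the hazard called out in the notes.
    return [(join_key, l, r) for l, r in product(left, right)]

rows = repartition_join_reduce(
    "user7",
    [("L", "click /home"), ("L", "click /buy"), ("R", "name=Ada")])
# -> [("user7", "click /home", "name=Ada"),
#     ("user7", "click /buy",  "name=Ada")]
```

The broadcast-join alternative moves the small table R into an in-memory hash table on every map worker, so the join happens in the map phase and neither table goes through the shuffle.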

  - this implies side-effects must be idempotent for deterministic output
  - failure implies side-effects should be atomic for correct output

Should map and reduce functions be deterministic?
o It's up to you, but... for the same reasons, non-determinism confuses the semantics
o if deterministic, worker restart plus an atomic rename to commit output guarantees that the execution is equivalent to some sequential execution
o if non-deterministic, you get a blend of sequential executions (e.g., timestamps or counters go zooey)

What mapreduce is built on

Underlying GFS. Why?
o single namespace for input and output files; simplifies initial data distribution
o why not use GFS for the intermediate files? Probably because they are transient and do not need to be globally readable; it is more efficient to produce them locally and special-purpose the RPC slurp to the reduce tasks

Cluster management software
o failures
o group membership
o load balancing and scheduling

Tweaks / improvements

Combiner: basically a reduce run locally on the output of a map
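A minimal word-count sketch of what the combiner buys (illustrative code, not the real API): pre-aggregating each map task's local output means the shuffle moves one pair per distinct word per map task, instead of one pair per occurrence.

```python
from collections import Counter

def word_count_map(lines):
    """One map task: emit a (word, 1) pair for every occurrence."""
    return [(w, 1) for line in lines for w in line.split()]

def combiner(pairs):
    """A reduce run locally on one map task's output: pre-sum counts so
    the shuffle carries one pair per distinct word, not per occurrence."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return list(totals.items())

raw = word_count_map(["to be or not to be"])
combined = combiner(raw)
# raw has 6 pairs to shuffle; combined has only 4: to=2, be=2, or=1, not=1
```

This only works because word-count's reduce (summing) is associative and commutative; a combiner reuses the reduce logic on partial data, so the reduce function must tolerate being applied in stages.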

I/O types: marshall/unmarshall libraries for parsing input files
Skip bad records: skips bad input that deterministically crashes the user function
Status info: HTTP server with stats on progress, stdin/stdout of tasks, etc.
Counters: propagated to the master, included in the HTTP server output

Comparison to databases

Huge source of controversy in the DB community
o misconceptions in the systems community about the degree of contribution of MR
o parallel databases have much more advanced data processing support, which leads to much, much more efficient processing
o parallel databases support a much richer semantic model (arbitrary SQL, including joins)

Paper by Stonebraker, DeWitt, and others: "A Comparison of Approaches to Large-Scale Data Analysis"

parallel databases support a schema; MR does not
o the structure of the data in MR must be built into the map() and reduce() programs; this hinders data sharing across programs and programmers
o if sharing is done, there is an implicit agreement on a data model; might as well codify it in a schema
o lack of a schema means lack of an automated mechanism to enforce integrity constraints / typing

parallel databases support indexes; MR does not
o selection is accelerated massively with an index, including range queries (or other indexable subset operators)
o there is no index at all in the MR framework; if programmers want one, they have to implement the index themselves, and also enhance the data-fetching mechanism in MR to use it, which means it's hard to share indexes!
parallel databases support the full relational programming model (SQL); MR forces programmers into a more assembly-like model
o e.g., you have to implement your own join strategy if you want one in MR; SQL supports joins naturally
o it's the difference between stating what you want (a relational query) and spelling out how to compute it

query optimization
o parallel databases use the existence of a schema, indices, and the declared query, as well as knowledge about how the data is partitioned and where it lives across the cluster, to support full query optimization: they calculate the most efficient query plan
o MR forces programmers to do this themselves, if they do it at all
  - e.g., selectivity estimation: try to push the most selective (i.e., most pruning) selection operator early, to minimize how much data has to flow between query operators
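In MR, "pushing a selection early" is a manual decision: the programmer applies the filter inside map(), before the shuffle, so rejected records never become intermediate data. A small illustrative sketch (the records and the "us" predicate are made up):

```python
# Manual selection pushdown, MR-style. A query optimizer does this
# automatically from the declared query; in MR it is the programmer's job.
records = [("us", 3), ("de", 1), ("us", 7), ("fr", 2), ("us", 1)]

def map_unpushed(recs):
    # Naive: emit everything and filter later (in reduce) -> big shuffle.
    return [(k, v) for k, v in recs]

def map_pushed(recs, wanted="us"):
    # Pushed-down selection: only matching records flow into the shuffle.
    return [(k, v) for k, v in recs if k == wanted]

unpushed = map_unpushed(records)   # 5 intermediate pairs
pushed = map_pushed(records)       # 3 intermediate pairs
```

On toy data the gap is 5 pairs versus 3; on real inputs it is the difference between shipping and sorting the whole table in the shuffle phase versus only the selected fraction.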

  - you cannot do this in MR, and as a result, massive intermediate data sets end up being materialized locally in files and then shipped / sorted in the shuffle phase
  - e.g., when the reduce phase of MR starts, all the disks of the map workers get hammered by this pull of materialized local files; databases use push to avoid this, and avoid materializing files to disk where possible

Where does MR win?
o loading data into the system: it is easier to bulk-load files into GFS than to pump records into a DB, and current DBs are bad at parallelizing the load phase
o fault tolerance: the unit of work (a map or reduce task) can fail and be restarted in MR. A DB has fault tolerance too, but its unit of restart is the entire query, in part because intermediate results are not materialized on disk
  - implication? as you scale up to larger and larger clusters, it becomes more likely that a query will need to be restarted because of a failure during execution
  - Stonebraker paper: the efficiency gains of DBs mean that you execute on much smaller clusters in practice, so this is a non-issue except for a very small number of players in the world
o approachability: more people are familiar with coding in C++ than with writing SQL

Where does neither win?
o low-latency, incremental calculations (as opposed to high-throughput, batch-oriented programs)
  - Google abandoned MR for its indexing operation in favor of Percolator, a system that can efficiently and transactionally update the existing index as new pages are crawled
  - before Percolator, the entire index had to be recalculated in one giant batch, so updates were delayed for weeks while new data accumulated for the batch

Evaluation

What questions are you curious about?
o What is the performance bottleneck for typical map/reduces? And what performance optimizations does this imply are possible?
o How well utilized are the physical machines during execution?
o What is the scaling bottleneck in the system?
o What are the effects of different scheduling policies on throughput? What happens if you schedule multiple mapreduces concurrently?

What questions did they answer?
o How fast does it run on two programs? (grep, sort) Fast. I think.

GREP

Things to note from this graph:
o exponential ramp-up in rate
  - why? hard to tell; it's an artifact of their scheduler. My guess is that they took inspiration from TCP slow-start to do congestion-based growth. Is there an equivalent of packet loss and backoff? Yes: two tasks co-scheduled on the same node
o peak input rate of 30,000 MB/s
  - over 1,764 nodes, that is about 17 MB/s per node: pretty good sequential throughput!
  - why do they get this? It is a GFS property.
o heavy tail to the right
  - waiting for stragglers
  - very careful scheduling would be needed to avoid this
  - but they care about throughput, not latency, so who cares
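The per-node figure quoted above is just the peak rate divided by the node count:

```python
# Checking the per-node throughput figure from the grep experiment.
peak_mb_per_s = 30_000
nodes = 1_764
per_node = peak_mb_per_s / nodes   # ~17.0 MB/s per node
print(f"{per_node:.1f} MB/s per node")
```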
