What Are Map and Reduce?
Borrowed from functional programming
• Functional operations do not modify data structures; they create new ones.
• Functional operations are stateless and have no side effects, so the order of operations does not matter.

Big Picture of MapReduce
• Input Reader: divides the input into appropriately sized 'splits' (16 to 128 MB).
• Map: partitioning of the data (compute part of a problem across several servers).
• Shuffle: groups together, by key, the values returned by the map function.
• Reduce: processing of the partitions (aggregate the partial results from all servers into a single result set).
• Output Writer: writes the output of the Reducer.
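The phases above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the data flow, not Hadoop code; the function names (map_words, shuffle, reduce_counts) and the sample input are assumptions made for this example.

```python
from collections import defaultdict

def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce: aggregate the values of one key into a single result."""
    return key, sum(values)

# Hypothetical input: each string stands in for one input split.
splits = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = [pair for line in splits for pair in map_words(line)]
word_counts = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

In a real cluster the map calls run on different servers and the shuffle moves data over the network; here everything happens in one process.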

• Client submits the MapReduce job.
• JobTracker coordinates the job run.
• TaskTrackers run the tasks that the job has been split into.
• HDFS is used for sharing job files between the other entities.

WordCount Java Code in Hadoop

General Considerations
• Map execution order is not deterministic.
• Map processing time cannot be predicted.
• Reduce tasks cannot start before all Maps have finished (the dataset needs to be fully partitioned).
• Not suitable for continuous input streams.
• There will be a spike in network utilization after the Map phase and before the Reduce phase.
• Number and size of key/value pairs: object creation and serialisation overhead (Amdahl's law!).
• Aggregate partial results when possible: use Combiners.
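The effect of a combiner on the number of key/value pairs sent over the network can be illustrated with a small sketch. It is plain Python rather than Hadoop code, and the helper names and sample splits are assumptions for this example.

```python
from collections import Counter

def map_without_combiner(lines):
    """Emit one (word, 1) pair per word occurrence."""
    return [(word, 1) for line in lines for word in line.split()]

def map_with_combiner(lines):
    """Aggregate locally: emit one (word, partial_count) pair per distinct word."""
    local = Counter()
    for line in lines:
        local.update(line.split())
    return list(local.items())

# Hypothetical input splits, one per map task.
split_a = ["a a a b", "a b"]
split_b = ["b b a"]

raw_pairs = sum(len(map_without_combiner(s)) for s in (split_a, split_b))
combined = [map_with_combiner(s) for s in (split_a, split_b)]
combined_pairs = sum(len(c) for c in combined)

# The reducer sums the partial counts; the result matches a run without combiners.
totals = Counter()
for partial in combined:
    for word, count in partial:
        totals[word] += count

print(raw_pairs, combined_pairs, dict(totals))
```

This only works because addition is associative and commutative; an aggregate such as a median cannot be combined this way.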

What is a “Design Pattern”?
A design pattern is a general reusable solution to a commonly occurring problem within a given context in software design (GoF).

Part II: MapReduce Design Patterns
A MapReduce design pattern is a template for solving a common and general data manipulation problem with MapReduce. There are 23 patterns in total, in six categories:
1. Summarization: get a top-level view by summarizing and grouping data
2. Filtering: view data subsets such as records generated from one user
3. Data Organization: reorganize data to work with other systems, or to make MapReduce analysis easier
4. Join: analyze different datasets together to discover interesting relationships
5. Metapattern: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
6. Input and Output: customize the way you use Hadoop to load or store data

Pattern Template in this Book
• Name: a well-chosen name for the pattern.
• Intent: a quick problem description.
• Motivation: why you would want to solve this problem or where it would appear.
• Applicability: a set of criteria that must be true to be able to apply this pattern to a problem.
• Structure: the layout of the MapReduce job itself.
• Consequences: the end goal of the output this pattern produces.
• Resemblances: analogies of how this problem would be solved with other languages, like SQL and Pig.
• Known Uses: some common use cases.
• Performance Analysis: the performance profile of the analytic produced by the pattern.


2.1 Summarization Patterns
• Your data is large and vast, with more data coming into the system every day (e.g., web user logs).
• You want to produce a top-level, summarized view of the data.
• You can glean insights not available from looking at a localized set of records alone.
Patterns
• Numerical Summarizations
• Inverted Index Summarizations
• Counting with Counters

Numerical Summarizations 1/4
Intent
• Group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.
Motivation
• Many data sets these days are too large for a human to get any real meaning out of them by reading through them manually, e.g., terabytes of website log files.
• Typical aggregates: minimum, maximum, average, median, and standard deviation.
Applicability
• You are dealing with numerical data or counting.
• The data can be grouped by specific fields.

Numerical Summarizations 2/4
Structure
• Mapper: outputs keys that consist of each field to group by, and values consisting of any pertinent numerical items.
• Reducer: receives the set of numerical values (v1, v2, v3, …, vn) associated with a group-by key and applies the aggregation function λ to them. The result of λ is output with the given input key.
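As a rough illustration of this structure, the sketch below computes minimum, maximum, and count per group in plain Python. The record layout (user, hour) and the function names are assumptions for the example, not part of the pattern.

```python
from collections import defaultdict
from functools import reduce

def map_record(record):
    """Mapper: key is the group-by field, value is a (min, max, count) triple."""
    user, hour = record
    return user, (hour, hour, 1)

def aggregate(a, b):
    """Aggregation function: merges two (min, max, count) triples."""
    return min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2]

# Hypothetical records: (user, hour of day of a comment).
records = [("alice", 9), ("bob", 14), ("alice", 17), ("alice", 11), ("bob", 8)]

groups = defaultdict(list)
for rec in records:
    key, value = map_record(rec)
    groups[key].append(value)

summary = {key: reduce(aggregate, values) for key, values in groups.items()}
print(summary)
```

Because the triple can be merged in any order, the same aggregate function could also serve as a combiner. An average needs a (sum, count) pair instead of a running mean for the same reason.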


Numerical Summarizations 3/4
Consequences
• A set of part files containing a single record per reducer input group. Each record consists of the key and all aggregate values.
Known uses
• Word count, record count
• Min, max, count of a particular event
• Average, median, standard deviation
Resemblances
• SQL: SELECT MIN(numericalcol1), MAX(numericalcol1), COUNT(*) FROM table GROUP BY groupcol2;
• Pig: b = GROUP a BY groupcol2; c = FOREACH b GENERATE group, MIN(a.numericalcol1), MAX(a.numericalcol1), COUNT_STAR(a);

Numerical Summarizations 4/4
Performance analysis
• Aggregations perform well when the combiner is properly used.
• Data skew of reduce groups: if there are many more intermediate key/value pairs with a specific key than with other keys, one reducer is going to have a lot more work to do than the others.

Inverted Index Summarizations 1/4
Intent
• Generate an index from a data set to allow for faster searches. An inverted index stores a mapping from content to its locations.

Inverted Index Summarizations 2/4
Motivation
• To index large data sets on keywords, so that searches can trace terms back to records that contain specific values.
• To improve the search performance of a search engine.
Applicability
• You require quick query responses.
• The results of such a query can be preprocessed and ingested into a database.

Inverted Index Summarizations 3/4 Structure

Inverted Index Summarizations 4/4
Consequences
• “field value” -> [unique IDs of records]
Performance analysis
• Parsing the content in the mapper is the most computationally expensive part.
• The higher the cardinality of the index keys, the more reducers can be used, which increases parallelism.
• The number of content identifiers per key matters: for a very common key such as “the”, a few reducers will take much longer than the others. This may require a custom partitioner.
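A minimal in-memory sketch of building such an index is given below. The document IDs, the texts, and the whitespace tokenizer are assumptions chosen for the example.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Mapper: emit (term, doc_id) once per distinct term in the document."""
    return [(term, doc_id) for term in set(text.lower().split())]

# Hypothetical documents keyed by a unique ID.
docs = {1: "hadoop map reduce", 2: "map side join", 3: "reduce side join"}

# Shuffle and reduce: collect the IDs of all documents containing each term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term, found_in in map_document(doc_id, text):
        index[term].add(found_in)

inverted = {term: sorted(ids) for term, ids in index.items()}
print(inverted["map"], inverted["join"])
```

A lookup for a term is then a single dictionary access instead of a scan over every document.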

Counting with Counters 1/3
Intent
• An efficient means to retrieve count summarizations of large data sets.
Motivation
• A count or summation can tell you a lot about your data as a whole.
• Simply use the framework's counters: no reduce phase and no summation are needed.
Applicability
• You want to gather counts or summations over large data sets.
• The number of counters you are going to create is small.

Counting with Counters 2/3
Structure
• Mapper: processes one input record at a time and increments counters based on certain criteria.
• Counter: (a) incremented by one if counting a single instance; (b) incremented by some number if executing a summation.

Counting with Counters 3/3
Consequences
• The final output is a set of counters grabbed from the job framework (there is no actual output).
Known uses
• Count the number of records (over a given time period).
• Count a small number of unique instances.
• Counters can be used to sum fields of data together.
Performance analysis
• Using counters is very fast, as data is simply read in through the mapper and no output is written.
• Performance depends largely on the number of map tasks being executed and how much time it takes to process each record.
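The idea can be sketched with a plain Python Counter standing in for the framework's counters. The record fields (status, bytes) and the counter names are assumptions for this example.

```python
from collections import Counter

# Stand-in for the job framework's counters (assumption: a plain Counter).
counters = Counter()

def map_record(record):
    """Mapper: increment counters only; nothing is emitted."""
    counters["records"] += 1               # (a) count a single instance
    if record["status"] >= 500:
        counters["server_errors"] += 1
    counters["bytes"] += record["bytes"]   # (b) summation of a field

# Hypothetical log records.
records = [
    {"status": 200, "bytes": 100},
    {"status": 503, "bytes": 20},
    {"status": 404, "bytes": 5},
    {"status": 500, "bytes": 0},
]
for rec in records:
    map_record(rec)

print(dict(counters))
```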

Filtering 1/4
Intent
• Filter out records that are not of interest.
Motivation
• Your data set is large and you want to take a subset of this data to focus on it and perhaps do follow-on analysis.
Applicability
• The data can be parsed into “records” that can be categorized through some well-specified criterion determining whether they are to be kept.

Filtering 2/4
Structure
• No “Reducer”
map(key, record):
    if we want to keep record then
        emit key, value
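The map-only structure above can be sketched as follows, here in the style of a distributed grep. The helper name make_filter_mapper and the sample log lines are assumptions for this example.

```python
import re

def make_filter_mapper(pattern):
    """Build a mapper that keeps only records matching the given regex."""
    regex = re.compile(pattern)

    def mapper(key, record):
        if regex.search(record):
            yield key, record   # keep the record unchanged

    return mapper

# Hypothetical input: (line number, log line) pairs.
lines = ["INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"]

grep_error = make_filter_mapper(r"ERROR")
kept = [pair for i, line in enumerate(lines) for pair in grep_error(i, line)]
print(kept)
```

Since each record is judged on its own, no shuffle or reduce step is involved.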

Filtering 3/4
Consequences
• A subset of the records that pass the selection criteria.
• If the format was kept the same, any job that ran over the larger data set should be able to run over this filtered data set as well.
Known uses
• Closer view of data
• Tracking a thread of events
• Distributed grep
• Data cleansing
• Simple random sampling
• Removing low-scoring data (if you can score your data)

Filtering 4/4
Resemblances
• SQL: SELECT * FROM table WHERE value < 3;
• Pig: b = FILTER a BY value < 3;
Performance analysis
• No reducers.
• Data never has to be transmitted between the map and reduce phases.
• Most of the map tasks pull data off of their locally attached disks and then write back out to that node.
• Both the sort phase and the reduce phase are cut out.

Bloom Filtering 1/5
Intent
• Filter such that we keep records that are members of some predefined set of values (hot values).
Motivation
• To filter the records based on some sort of set membership operation against the hot values.
• The set membership is going to be evaluated with a Bloom filter.
• Example (figure): a Bloom filter with m = 18 bits and k = 3 hash functions; w is not in the set {x, y, z}.

Bloom Filtering 2/5
Applicability
• Data can be separated into records, as in filtering.
• A feature can be extracted from each record that could be in a set of hot values.
• There is a predetermined set of items for the hot values.
• Some false positives are acceptable (i.e., some records will get through when they should not have).

Bloom Filtering 3/5 Structure: Bloom filter training, then the actual filtering

Bloom Filtering 4/5
Consequences
• A subset of the records that passed the Bloom filter membership test.
• The output may contain some false-positive records.
Known uses
• Removing most of the non-watched values
• Prefiltering a data set for an expensive set membership check


Bloom Filtering 5/5
Performance analysis
• Loading up the Bloom filter is not that expensive since the file is relatively small.
• Checking a value against the Bloom filter is also a relatively cheap operation, requiring only O(1) hashing.
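A minimal Bloom filter sketch, using the m = 18 and k = 3 values from the earlier example, is shown below. The class name, the salted SHA-256 hashing scheme, and the sample records are assumptions for this illustration; production filters use larger bit arrays and faster hash functions.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter with m bits and k hash functions."""

    def __init__(self, m=18, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Assumption: derive k hash functions by salting SHA-256 with a seed.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

# Training: load the hot values into the filter.
hot_values = BloomFilter()
for value in ["x", "y", "z"]:
    hot_values.add(value)

# Filtering: a map-only pass keeps records whose feature passes the test.
records = [("x", "r1"), ("w", "r2"), ("z", "r3")]
kept = [rec for rec in records if hot_values.might_contain(rec[0])]
print(kept)
```

Members are never rejected, but with so few bits a non-member such as "w" may occasionally pass, which is the false-positive case mentioned under Applicability.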

Top Ten 1/4
Intent
• Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
Motivation
• Finding the records that are typically the most interesting.
• To find the best records for a specific criterion.
Applicability
• You are able to compare one record to another to determine which is “larger”.
• The number of output records should be significantly smaller than the number of input records; beyond a certain point it makes more sense to do a total ordering of the data set.

Top Ten 4/4
Performance analysis (one single Reducer)
• How many records is the reducer getting? With M map tasks, it receives K*M records.
• The sort can become an expensive operation when it has too many records and has to do most of the sorting on local disk instead of in memory.
• The reducer host will receive a lot of data over the network, creating a network resource hot spot.
• Naturally, scanning through all the data in the reducer will take a long time if there are many records to look through.
• Any sort of memory growth in the reducer has the possibility of blowing through the Java virtual machine's memory.
• Writes to the output file are not parallelized.
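The K*M figure follows from each of the M mappers emitting its local top K, which the single reducer then merges. A small sketch under assumed data (records as (id, score) pairs, K = 3, two splits):

```python
import heapq

K = 3

def map_top_k(split):
    """Mapper: keep only the K highest-scoring records of this split."""
    return heapq.nlargest(K, split, key=lambda rec: rec[1])

def reduce_top_k(candidates):
    """Single reducer: pick the global top K from the K*M candidates."""
    return heapq.nlargest(K, candidates, key=lambda rec: rec[1])

# Hypothetical splits of (record id, score) pairs, one per map task.
splits = [
    [("a", 5), ("b", 42), ("c", 7), ("d", 1)],
    [("e", 30), ("f", 2), ("g", 99), ("h", 8)],
]

candidates = [rec for split in splits for rec in map_top_k(split)]
top = reduce_top_k(candidates)
print(len(candidates), top)
```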

Distinct 1/4
Intent
• To find a unique set of values from similar records.
Motivation
• Reducing a data set to a unique set of values has several uses.
Applicability
• You have duplicate values in the data set; it is pointless to use this pattern otherwise.

Distinct 2/4
Structure
• It exploits MapReduce's ability to group keys together to remove duplicates.
• The mapper transforms the data; the reducer does not do much.
• Duplicate records are often located close to one another in a data set, so a combiner will deduplicate them in the map phase.
• The reducer groups the nulls together by key, so we have one null per key and simply output the key.
map(key, record):
    emit(record, null)

reduce(key, records):
    emit(key)
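The pseudocode above, together with the combiner step, can be simulated in memory as follows. The sample records and the split boundaries are assumptions for the example.

```python
def map_distinct(record):
    """Mapper: the record itself becomes the key; the value is empty."""
    return record, None

# Hypothetical input, divided into two splits.
records = ["u1", "u2", "u1", "u3", "u2", "u1"]
splits = [records[:3], records[3:]]

shuffled_keys = set()
pairs_sent = 0
for split in splits:
    # Combiner: drop duplicates within the split before the shuffle.
    local_keys = {map_distinct(rec)[0] for rec in split}
    pairs_sent += len(local_keys)
    shuffled_keys |= local_keys

# Reducer: each key arrives once per group; output just the key.
distinct = sorted(shuffled_keys)
print(pairs_sent, distinct)
```

Here five pairs cross the shuffle instead of six. The saving grows when duplicates cluster within a split and shrinks to nothing when they are rare.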

Distinct 3/4
Consequences
• The output records are guaranteed to be unique, but their order is not preserved because of the random partitioning of the records.
Known uses
• Deduplicate data
• Getting distinct values
• Protecting from an inner join explosion
Resemblances
• SQL: SELECT DISTINCT * FROM table;
• Pig: b = DISTINCT a;

Distinct 4/4
Performance analysis
• The main consideration is the number of reducers you think you will need.
• If duplicates are very rare within an input split, nearly all of the data is going to be sent to the reduce phase.


2.3 Data Organization Patterns
The value of individual records is often multiplied by the way they are partitioned, sharded, or sorted. This is especially true in distributed systems.
Patterns
• Structured to Hierarchical
• Partitioning
• Binning
• Total Order Sorting
• Shuffling

Structured to Hierarchical 2/3
Structure
• Mapper: loads the data and parses the records into one cohesive format.
• Combiner: is not going to help here.
• Reducer: builds the hierarchical data structure from the list of data items.

Structured to Hierarchical 3/3
Consequences
• The output will be in a hierarchical form, grouped by the key that you specified.
Known uses
• Pre-joining data
• Preparing data for HBase or MongoDB
Performance analysis
• How much data is being sent to the reducers from the mappers.
• The memory footprint of the object that the reducer builds. What happens for a post that has a million comments?
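As a rough sketch of the pattern, the following Python nests comments under their post. The two input tables, the "P"/"C" tags, and the output dictionary layout are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical flat inputs: (post id, text) rows from two data sets.
posts = [(1, "How do combiners work?"), (2, "Top ten pattern")]
comments = [(1, "They pre-aggregate."), (2, "Use one reducer."), (1, "Thanks!")]

# Mappers: tag each row with its origin and key it by post id.
tagged = [(pid, ("P", text)) for pid, text in posts]
tagged += [(pid, ("C", text)) for pid, text in comments]

groups = defaultdict(list)
for post_id, value in tagged:
    groups[post_id].append(value)

def reduce_post(post_id, values):
    """Reducer: build one hierarchical record per post."""
    post = {"id": post_id, "title": None, "comments": []}
    for tag, text in values:
        if tag == "P":
            post["title"] = text
        else:
            post["comments"].append(text)
    return post

hierarchy = {pid: reduce_post(pid, values) for pid, values in groups.items()}
print(hierarchy[1])
```

The reducer holds one whole post in memory, which is the footprint concern raised in the performance analysis.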


Partitioning 1/3
Intent
• Move the records into categories; the order of the records does not matter.
• Take similar records in a data set and partition them into distinct, smaller data sets.
Motivation
• If you want to look at a particular set of data, the data items are normally spread out across the entire data set, which requires a scan of all of the data.
Applicability
• You know ahead of time how many partitions you are going to have, e.g., partitioning by day of the week gives 7 partitions.

Partitioning 2/3 Structure: a custom partitioner determines which partition each record goes to

Partitioning 3/3
Performance analysis
• The resulting partitions will likely not have a similar number of records. Perhaps one partition holds 50% of the data.
• If implemented naively, all of this data will get sent to one reducer and will slow down processing significantly.
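A custom partitioner for the day-of-the-week example might look like the sketch below. It is plain Python, and the function name and sample records are assumptions; in Hadoop the equivalent logic would live in a Partitioner class.

```python
from collections import defaultdict
from datetime import date

NUM_PARTITIONS = 7  # known ahead of time: one per day of the week

def partition_by_weekday(record_date):
    """Partitioner: 0 = Monday ... 6 = Sunday."""
    return record_date.weekday()

# Hypothetical records: (date, payload).
records = [
    (date(2024, 1, 1), "r1"),   # a Monday
    (date(2024, 1, 7), "r2"),   # a Sunday
    (date(2024, 1, 8), "r3"),   # the following Monday
]

partitions = defaultdict(list)
for record_date, payload in records:
    partitions[partition_by_weekday(record_date)].append(payload)

print(dict(partitions))
```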

Binning 1/3
Intent
• For each record in the data set, file each one into one or more categories.
Motivation
• Binning is very similar to partitioning and often can be used to solve the same problem.
• Binning splits data up in the map phase instead of in the partitioner.
• Each mapper will now have one file per possible output bin: 1,000 bins x 1,000 mappers = 1,000,000 files.

Binning 2/3
Structure
• Mapper: if the record meets the criteria for a bin, it is sent to that bin.
• No combiner, partitioner, or reducer is used in this pattern.

Binning 3/3
Performance analysis
• As with other map-only jobs, performance depends on how efficiently the records are processed.
• No sort, shuffle, or reduce needs to be performed.
• Most of the processing is going to be done on data that is local.

Total Order Sorting 1/3
Intent
• Sort your data in parallel on a sort key.
Motivation
• Each reducer sorts its own data by key, but the sort is not global across all data.
• Sorting in parallel is not easy.
Applicability
• Your sort key has to be comparable so the data can be ordered.

Total Order Sorting 2/3
Structure
• Analyze phase: determines the ranges.
  • Idea: partitions that evenly split a random sample should also split the larger data set evenly.
  • The mapper does a random sampling, based on the number of records in the total data set and the percentage of records you need to analyze.
  • Only one reducer is used: it collects the sort keys into a sorted list, and the list of keys is sliced into the data range boundaries.
• Order phase: actually sorts the data.
  • Number of reducers = number of partitions.
  • A custom partitioner loads the data ranges from the partition file.
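The two phases can be sketched in memory as follows. The function names, the 20% sample rate, the fixed random seeds, and the integer keys are assumptions for this example; Hadoop would write the boundaries to a partition file instead of passing a list.

```python
import bisect
import random

def analyze(keys, num_partitions, sample_rate, seed=0):
    """Analyze phase: sample the keys and slice the sorted sample into boundaries."""
    rng = random.Random(seed)
    sample = sorted(k for k in keys if rng.random() < sample_rate)
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition(key, boundaries):
    """Order phase partitioner: find the range that the key falls into."""
    return bisect.bisect_right(boundaries, key)

# Hypothetical data set: 1000 shuffled integer keys, sorted with 4 reducers.
keys = list(range(1000))
random.Random(1).shuffle(keys)

num_reducers = 4
boundaries = analyze(keys, num_reducers, sample_rate=0.2)

parts = [[] for _ in range(num_reducers)]
for key in keys:
    parts[partition(key, boundaries)].append(key)

# Each reducer sorts only its own range; concatenating the outputs in
# partition order yields a globally sorted result.
result = [key for part in parts for key in sorted(part)]
print(boundaries, [len(p) for p in parts])
```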

Shuffling 3/3
Performance analysis
• Nice performance properties.
• Data distribution across reducers is completely balanced.
• With more reducers, the data will be more spread out.
• The size of the files will also be very predictable: each is the size of the data set divided by the number of reducers. This makes it easy to get a specific desired file size as output.

Reduce Side Join 1/3
Intent
• Join multiple large data sets together by some foreign key.
Motivation
• Simple to implement in reducers.
• Supports all the different join operations.
• No limitation on the size of your data sets.
Applicability
• Multiple large data sets are being joined by a foreign key.
• You want the flexibility of being able to execute any join operation.
• You have a large amount of network bandwidth available.
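A reduce side join can be sketched in memory as below: mappers tag each record with its source, the shuffle groups by foreign key, and the reducer pairs up the two sides. The table contents, the "U"/"C" tags, and the how parameter are assumptions for this illustration.

```python
from collections import defaultdict

# Hypothetical data sets keyed by user id.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
comments = [(1, "nice"), (1, "thanks"), (3, "hello"), (4, "orphan")]

def map_tagged(tag, rows):
    """Mapper: key each row by the foreign key and tag it with its data set."""
    return [(key, (tag, value)) for key, value in rows]

# Shuffle: group the tagged values of both data sets by key.
groups = defaultdict(lambda: {"U": [], "C": []})
for key, (tag, value) in map_tagged("U", users) + map_tagged("C", comments):
    groups[key][tag].append(value)

def reduce_join(key, sides, how="inner"):
    """Reducer: pair every user row with every comment row for this key."""
    left, right = sides["U"], sides["C"]
    if how == "left_outer" and not right:
        right = [None]   # keep users that have no comments
    return [(key, l, r) for l in left for r in right]

inner = sorted(row for k, s in groups.items() for row in reduce_join(k, s))
left_outer = [row for k, s in groups.items()
              for row in reduce_join(k, s, how="left_outer")]
print(inner)
```

All records of both data sets pass through the shuffle, which is why network bandwidth appears under Applicability.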

Replicated Join 1/3
Intent
• Eliminates the need to shuffle any data to the reduce phase.
Motivation
• All the data sets except the very large one are essentially read into memory during the setup phase of each map task, which is limited by the JVM heap.
Applicability
• All of the data sets, except for the large one, can fit into the main memory of each map task.

Replicated Join 3/3
Consequences
• Number of part files = number of map tasks.
• The part files contain the full set of joined records.
Performance analysis
• A replicated join can be the fastest type of join executed because there is no reducer required.
• The limiting factor is the amount of data that can be stored safely inside the JVM.
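The map-side lookup at the heart of a replicated join can be sketched as follows. The dictionary standing in for the in-memory small data set and the sample comments are assumptions for this example.

```python
# Setup phase (assumption: the small data set fits in memory as a dict).
small_users = {1: "alice", 2: "bob"}

def map_join(comment):
    """Mapper: join one record of the large data set against the in-memory table."""
    user_id, text = comment
    name = small_users.get(user_id)
    if name is not None:          # inner join: drop records without a match
        yield user_id, name, text

# Hypothetical large data set, streamed through the mapper record by record.
comments = [(1, "nice"), (3, "hello"), (2, "hi")]
joined = [row for comment in comments for row in map_join(comment)]
print(joined)
```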

Composite Join 1/4
Intent
• A join performed on the map side with many very large formatted inputs.
• Completely eliminates the need to shuffle and sort all the data to the reduce phase.
• Requires the data to be already organized or prepared in a very specific way.
Motivation
• Particularly useful if you want to join very large data sets together.
• The data sets must first be sorted by foreign key, partitioned by foreign key, and read in a very particular manner.

Composite Join 2/4
Applicability
• An inner or full outer join is desired.
• All the data sets are sufficiently large.
• All data sets can be read with the foreign key as the input key to the mapper.
• All data sets have the same number of partitions.
• Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set.
• The data sets do not change often (if they have to be prepared).

Composite Join 3/4
Structure
• Map-only.
• The mapper is trivial.
• Two values are retrieved from the input tuple and output to the file system.

Composite Join 4/4
Consequences
• Number of output part files = number of map tasks.
Performance analysis
• Can be executed relatively quickly over large data sets.
• The data preparation cost is the cost of sorting.
• The cost of producing these prepared data sets is averaged out over all of the runs.

Cartesian Product 1/3
Intent
• Pair up and compare every single record with every other record in a data set.
Motivation
• Simply pairs every record of a data set with every record of all the other data sets.
• To analyze relationships between one or more data sets.
Applicability
• You want to analyze relationships between all pairs of individual records.
• You have exhausted all other means to solve this problem.
• You have no time constraints on execution time.

Cartesian Product 2/3
Structure
• Map-only.
• The pairing of records is done by the RecordReader.

Cartesian Product 3/3
Consequences
• The final data set is made up of tuples whose size equals the number of input data sets.
• Every possible tuple combination from the input records is represented in the final output.
Resemblances
• SQL: SELECT * FROM tableA, tableB;
Performance analysis
• A massive explosion in data size: O(n^2).
• If a single input split contains a thousand records, the right input split needs to be read a thousand times before the task can finish.
• If a single task fails for an odd reason, the whole thing needs to be restarted.
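The growth in output size is easy to see with a tiny in-memory sketch; the two input lists are assumptions for this example.

```python
from itertools import product

# Hypothetical input data sets.
left = ["a1", "a2", "a3"]
right = ["b1", "b2"]

# Every record of the left data set is paired with every record of the right.
pairs = list(product(left, right))
print(len(pairs), pairs[:3])
```

With n records on each side the output has n * n tuples, so two inputs of a million records each already produce 10^12 pairs.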

Generating Data 1/3
Intent
• You want to generate a lot of data from scratch.
Motivation
• This pattern does not load data; it generates the data and stores it back in the distributed file system.


Generating Data 2/3 Structure • map-only


Generating Data 3/3
Consequences
• Each mapper outputs a file containing random data.
Performance analysis
• The main consideration is how many worker map tasks are needed to generate the data.
• In general, the more map tasks you have, the faster you can generate data.

External Source Output 1/3
Intent
• To write MapReduce output to a nonnative location (outside of Hadoop and HDFS).
Motivation
• To output data from the MapReduce framework directly to an external source.
• This is extremely useful for direct loading into a system instead of staging the data to be delivered to the external source.

External Source Output 2/3 Structure

External Source Output 3/3
Consequences
• The output data has been sent to the external source and that external source has loaded it successfully.
Performance analysis
• Make sure the receiver of the data can handle the parallel connections.
• Having a thousand tasks writing to a single SQL database is not going to work well.


External Source Input 1/3
Intent
• You want to load data in parallel from a source that is not part of your MapReduce framework.
Motivation
• The typical model for using MapReduce to analyze data is to store the data in HDFS first.
• With this pattern, you can hook up the MapReduce framework to an external source, such as a database or a web service, and pull the data directly into the mappers.

External Source Input 2/3 Structure

External Source Input 3/3
Consequences
• Data is loaded from the external source into the MapReduce job.
• The map phase does not care where that data came from.
Performance analysis
• The bottleneck is the source or the network.
• The source may not scale well with multiple connections (e.g., a single-threaded SQL database).
• If the source is not in the cluster's network, the connections may all go out over a single link on a slower public network.

Partition Pruning 1/3
Intent
• You have a set of data that is partitioned by a predetermined value, which you can use to dynamically load the data based on what is requested by the application.
Motivation
• Loading all of the files is a large waste of processing time.
• By partitioning the data by a common value, you can avoid significant amounts of processing time by looking only where the data would exist.

Partition Pruning 2/3 Structure

Partition Pruning 3/3
Consequences
• Partition pruning changes only the amount of data that is read by the MapReduce job, not the eventual outcome of the analytic.
Performance analysis
• Utilizing this pattern can provide massive gains by avoiding the creation of tasks that would not have generated any output anyway.
• Outside of the I/O, the performance depends on the other pattern being applied in the map and reduce phases of the job.
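A minimal sketch of the idea, assuming the data is partitioned by date and the query asks for a date range; the dictionary of partitions and the prune function are hypothetical stand-ins for directories in the file system and a custom input format.

```python
from datetime import date

# Hypothetical storage: one partition (e.g., one directory) per day.
partitions = {
    date(2024, 1, 1): ["r1", "r2"],
    date(2024, 1, 2): ["r3"],
    date(2024, 1, 3): ["r4", "r5"],
}

def prune(partitions, start, end):
    """Select only the partitions that can contain data for the query range."""
    return {day: recs for day, recs in partitions.items() if start <= day <= end}

selected = prune(partitions, date(2024, 1, 2), date(2024, 1, 3))
records_read = [rec for day in sorted(selected) for rec in selected[day]]
print(records_read)
```

The analytic run over records_read gives the same answer as a run over all partitions followed by a date filter; only the amount of data read differs.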

The End (Finally…)
Thanks for your attention.
• MapReduce has proven to be a useful abstraction.
• It greatly simplifies large-scale computations.
• Hadoop is widely used.
• Focus on problems, and let MapReduce deal with the messy details.
Any questions?