slides - network systems lab @ sfu

Map/Reduce Programming Model
Ahmed Abdelsadek
Outline
• Introduction
• What is Map/Reduce?
• Framework Architecture
• Map/Reduce Algorithm Design
• Tools and Libraries built on top of Map/Reduce
Introduction
• Big Data
• Scaling ‘out’ not ‘up’
• Scaling ‘everything’ linearly with data size
• Data-intensive applications
Map/Reduce
• Origins
• Google Map/Reduce
• Hadoop Map/Reduce
• The Map and Reduce functions are both defined
with respect to data structured in (key, value)
pairs.
Mapper
• The Map function takes a key/value pair, processes it, and generates
zero or more output key/value pairs.
• The input and output types of the mapper can be different from each
other.
Reducer
• The Reduce function takes a key and the list of all values associated
with it, processes them, and generates zero or more output key/value
pairs.
• The input and output types of the reducer can be different from each
other.
Mappers/Reducers
• map: (k1, v1) -> [(k2, v2)]
• reduce: (k2, [v2]) -> [(k3, v3)]
WordCount Example
• Problem: count the number of occurrences of
every word in a text collection.
Map(docid a, doc d)
    for all term t in doc d do
        Emit(term t, count 1)
Reduce(term t, counts [c1, c2, …])
    sum = 0
    for all count c in counts [c1, c2, …] do
        sum = sum + c
    Emit(term t, count sum)
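The pseudocode above can be tried out in a single process. The following is a minimal Python sketch, not Hadoop code: `run_job` is a hypothetical stand-in for the framework that maps, groups values by key, and reduces in sorted key order.

```python
from collections import defaultdict

def map_fn(docid, doc):
    # Emit (term, 1) for every term in the document.
    for term in doc.split():
        yield term, 1

def reduce_fn(term, counts):
    # Sum all counts that were grouped under the same term.
    yield term, sum(counts)

def run_job(inputs, mapper, reducer):
    # Toy stand-in for the framework: map, group by key, then reduce.
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # keys reach the reducer in sorted order
        output.extend(reducer(k2, groups[k2]))
    return output

docs = [("d1", "a b a"), ("d2", "b c")]
word_counts = run_job(docs, map_fn, reduce_fn)
print(word_counts)  # [('a', 2), ('b', 2), ('c', 1)]
```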
Map/Reduce Framework Architecture
and Execution Overview
Architecture - Overview
• Map/Reduce runs on top of DFS
Data Flow
Job Timeline
Job Work Flow
Fault Tolerance
• Task Fails
▫ Re-execution
• TaskTracker Fails
▫ Removes the node from pool of
TaskTrackers
▫ Re-schedule its tasks
• JobTracker Fails
▫ Single point of failure: the job fails
Map/Reduce Framework Features
• Locality
▫ Move code to the data
• Task Granularity
▫ The number of map and reduce tasks should be much larger than the
number of machines, but not excessively so!
▪ Dynamic load balancing!
• Backup Tasks
▫ Work around slow workers (stragglers)
▫ Scheduled when the job is near completion
Map/Reduce Framework Features
• Skipping bad records
▫ Many failures on the same record
• Local execution
▫ Debug in isolation
• Status information
▫ Progress of computations
• User Counters, report progress
▫ Periodically propagated to the master node
Hadoop Streaming and Pipes
• APIs to MapReduce that allow you to write your map
and reduce functions in languages other than Java
• Hadoop Streaming
▫ Uses Unix standard streams as the interface between Hadoop and
your program
▫ You can use any language that can read standard input and write
to standard output
• Hadoop Pipes (for C++)
▫ Pipes uses sockets as the channel to communicate with the
process running the C++ map or reduce function
▫ JNI is not used
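As an illustration of the Streaming idea, here is a hedged Python sketch of a word-count mapper and reducer that exchange tab-separated "key<TAB>value" text lines. The function names are hypothetical, and the functions take any iterable of lines so they can be tried without a cluster; a real Streaming script would loop over standard input and print to standard output instead.

```python
from itertools import groupby

def stream_mapper(lines):
    # Each input line is plain text; emit one "word<TAB>1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reducer(lines):
    # Reducer input arrives sorted by key, one "key<TAB>value" per line,
    # so consecutive lines with the same key can be grouped.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# Local simulation: sorted() stands in for the shuffle and sort phase.
mapped = sorted(stream_mapper(["a b a\n", "b c\n"]))
reduced = list(stream_reducer(mapped))
print(reduced)  # ['a\t2', 'b\t2', 'c\t1']
```

Because the contract is only "read lines, write lines", the same two functions could be written in any language that supports standard input and output.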
Keep in Mind
• Programmer has little control over many aspects of
execution
▫ Where a mapper or reducer runs (i.e., on which node in the
cluster).
▫ When a mapper or reducer begins or finishes
▫ Which input key-value pairs are processed by a specific
mapper.
▫ Which intermediate key-value pairs are processed by a
specific reducer.
Map/Reduce Algorithm Design
Partitioners
• Dividing up the intermediate key space.
• Simplest: Hash value of the key mod the number of
reducers
▫ Assigns roughly the same number of keys to each reducer
▫ Only considers the key and ignores the value
▫ May yield large differences in the number of
values sent to each reducer
• More complex partitioning algorithm to handle
the imbalance in the amount of data associated
with each key
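The simplest partitioner can be sketched as follows. This is an illustration only: `hash_partition` is a hypothetical name, and `zlib.crc32` is used as a stable hash chosen for the example, not the hash Hadoop uses. The skewed input shows how one frequent key sends most values to a single reducer even though keys are spread evenly.

```python
import zlib
from collections import Counter

def hash_partition(key, num_reducers):
    # Stable hash of the key, modulo the number of reducers.
    # The value plays no role in the decision.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# One very frequent key plus a few rare ones.
pairs = [("the", 1)] * 1000 + [("map", 1), ("reduce", 1), ("shuffle", 1)]

# Number of values (not keys) that each reducer would receive.
load = Counter(hash_partition(key, 2) for key, _ in pairs)
print(load)
```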
Combiners
• In WordCount example: the amount of intermediate data is larger
than the input collection itself
• Combiners are an optimization for local aggregation before the
shuffle and sort phase
▫ Compute a local count for a word over all the documents
processed by the mapper
• Think of combiners as “mini-reducers”
▫ However, combiners and reducers are not always interchangeable
• Combiner input and output pairs have the same types as the mapper output pairs
▫ These are also the reducer input types
• Combiner may be invoked zero, one, or multiple times
• Combiner can emit any number of key-value pairs
Complete View of Map/Reduce
Local Aggregation
• Network and disk latency are high!
• Framework features that help local aggregation
▫ A single (Java) Mapper object handles multiple (key, value)
pairs in an input split (preserving state across multiple
calls of the map() method)
▫ Shared in-object data structures and counters
▫ Initialization and finalization code that runs once around all map()
calls in a single task
▫ JVM reuse across multiple tasks on the same machine
Basic WordCount Example
Per-Document Aggregation
• Associative array inside the map() call to sum up term counts within
a single document
• Emits a key-value pair for each unique term, instead of emitting a
key-value pair for each term in the document
▫ substantial savings in the number of intermediate key-value pairs
emitted
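A minimal Python sketch of per-document aggregation, assuming whitespace tokenization (the function name is hypothetical): the associative array lives only for the duration of one map() call.

```python
from collections import Counter

def map_per_document(docid, doc):
    # Associative array local to this call: term -> count within the document.
    counts = Counter(doc.split())
    # One pair per unique term instead of one pair per term occurrence.
    for term, count in counts.items():
        yield term, count

emitted = list(map_per_document("d1", "a b a a"))
print(emitted)  # [('a', 3), ('b', 1)]
```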
Per-Mapper Aggregation
• Associative array inside the Mapper object to sum up term counts
across multiple documents
In-Mapper Combining
• Pros
▫ More control over when local aggregation occurs and exactly
how it takes place (recall: no guarantees on combiners)
▫ More efficient than using actual combiners
▪ No additional overhead from object creation, serializing, reading, and
writing the key-value pairs
• Cons
▫ Breaks the functional programming model, since state is preserved across map() calls (not a big deal!)
▫ Scalability bottleneck
▪ Needs sufficient memory to store intermediate results
▪ Solution: block and flush, i.e., emit the buffered results after every N key-value pairs have been
processed or once M bytes of memory have been used.
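In-mapper combining with block-and-flush might look like the sketch below. It is an illustration, not the Hadoop API: the class, the `emit` callback, and the `max_keys` threshold are hypothetical, and the threshold counts distinct buffered keys rather than processed pairs or bytes.

```python
from collections import Counter

class InMapperCombiningMapper:
    def __init__(self, emit, max_keys=1000):
        self.emit = emit          # callback standing in for the framework's Emit
        self.max_keys = max_keys  # hypothetical flush threshold
        self.counts = Counter()   # state preserved across map() calls

    def map(self, docid, doc):
        for term in doc.split():
            self.counts[term] += 1
            if len(self.counts) >= self.max_keys:
                self.flush()      # block and flush to bound memory use

    def flush(self):
        for term, count in self.counts.items():
            self.emit(term, count)
        self.counts.clear()

    def close(self):
        self.flush()              # finalization: emit whatever is still buffered

out = []
mapper = InMapperCombiningMapper(lambda k, v: out.append((k, v)), max_keys=2)
mapper.map("d1", "a a b")
mapper.map("d2", "a c")
mapper.close()
print(out)  # [('a', 2), ('b', 1), ('a', 1), ('c', 1)]
```

Flushing early means the same key can be emitted more than once, which is fine because the reducer still sums all partial counts.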
Correctness with Local Aggregation
• Combiners are viewed as optional optimizations
▫ Correctness of algorithm should not depend on its computations
• Combiners and reducers are not interchangeable
▫ Unless reduce computation is both commutative and associative
• Make sure of the semantics of your aggregation algorithm
▫ Notice for example
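The example shown on the original slide is not preserved in this text. One common illustration, assumed here, is the arithmetic mean: a reducer that averages cannot be reused as a combiner, because the mean of partial means is generally not the overall mean. Emitting (sum, count) pairs restores correctness.

```python
def mean(values):
    return sum(values) / len(values)

# Wrong: reusing the averaging reducer as a combiner averages partial averages.
wrong = mean([mean([1, 2]), mean([3, 4, 5])])

# Correct: combiners emit (sum, count) pairs, which can be merged in any order.
def combine(partials):
    return (sum(s for s, _ in partials), sum(c for _, c in partials))

partial_a = combine([(1, 1), (2, 1)])
partial_b = combine([(3, 1), (4, 1), (5, 1)])
total, count = combine([partial_a, partial_b])
right = total / count

print(wrong, right)  # 2.75 3.0
```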
Pairs and Stripes
• In some problems, a common approach is to construct
complex keys and values to achieve greater efficiency
• Example: the problem of building a word co-occurrence
matrix from a large document collection
▫ Formally, the co-occurrence matrix of a corpus is a square N x N
matrix, where N is the number of unique words in the corpus
▫ Cell Mij contains the number of times word Wi co-occurred with
word Wj
Pairs Approach
• Mapper: emits each co-occurring word pair as the key and the integer
one as the value
• Reducer: sums up all the values associated with the same co-occurring word pair
Pairs Approach
• Pairs algorithm generates a massive number of
key-value pairs
• Combiners have few opportunities to perform
local aggregation
• The sparsity of the key space also limits the
effectiveness of in-memory combining
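A small Python sketch of the pairs approach follows. It assumes, for illustration only, that two words co-occur when they appear in the same document; `run` is a hypothetical single-process stand-in for the framework.

```python
from collections import defaultdict

def pairs_map(docid, doc):
    words = doc.split()
    for i, w in enumerate(words):
        # Assumption: every other position in the document is a neighbor.
        for u in words[:i] + words[i + 1:]:
            yield (w, u), 1

def pairs_reduce(pair, counts):
    yield pair, sum(counts)

def run(docs):
    # Toy framework: group mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for docid, doc in docs:
        for key, value in pairs_map(docid, doc):
            groups[key].append(value)
    return dict(kv for key in sorted(groups) for kv in pairs_reduce(key, groups[key]))

pair_counts = run([("d1", "a b c"), ("d2", "a b")])
print(pair_counts[("a", "b")])  # 2
```

Even this tiny input produces eight intermediate pairs for six distinct keys, which hints at why the key space is large and sparse.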
Stripes Approach
• Store co-occurrence information in an associative array
• Mapper: emits words as keys and associative arrays as values
• Reducer: element-wise sum of all associative arrays of the same key
Stripes Approach
• Much more compact representation
• Much fewer intermediate key-value pairs
• More opportunities to perform local aggregation
• May become a scalability bottleneck, since each associative
array must fit in memory.
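The stripes approach can be sketched in the same style. As before, the same-document notion of co-occurrence and the `run` helper are assumptions made for the example.

```python
from collections import Counter, defaultdict

def stripes_map(docid, doc):
    words = doc.split()
    for i, w in enumerate(words):
        # One associative array (stripe) per word occurrence: neighbor -> count.
        yield w, Counter(words[:i] + words[i + 1:])

def stripes_reduce(word, stripes):
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # element-wise sum of the associative arrays
    yield word, total

def run(docs):
    # Toy framework: group mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for docid, doc in docs:
        for key, value in stripes_map(docid, doc):
            groups[key].append(value)
    return dict(kv for key in sorted(groups) for kv in stripes_reduce(key, groups[key]))

stripe_counts = run([("d1", "a b c"), ("d2", "a b")])
print(dict(stripe_counts["a"]))  # {'b': 2, 'c': 1}
```

The same input yields five intermediate pairs here (one per word occurrence), each carrying a richer value.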
Which approach is faster?
• APW (Associated Press Worldstream): a corpus of 2.27 million
documents totaling 5.7 GB
Computing Relative Frequencies
• In the previous example, (Wi,Wj) co-occurrence
may be high just because one of the words is
very common!
• Solution: Compute relative frequencies
Relative Frequencies with Stripes
• Straightforward!
• In Reducer:
▫ Sum the counts of all words that co-occur with the key word
▫ Divide each count by that sum to get the relative frequency!
• Lessons:
▫ Use of complex data structures to coordinate distributed
computations
▫ Appropriate structuring of keys and values brings together all the
pieces of data required to perform a computation
• Drawback?
▫ As before, this algorithm also assumes that each associative
array fits into memory (scalability bottleneck!)
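A sketch of the stripes reducer for relative frequencies, assuming the usual definition f(Wj | Wi) = N(Wi, Wj) / (sum over all w of N(Wi, w)). The function name and the sample stripes are hypothetical.

```python
from collections import Counter

def relative_freq_reduce(word, stripes):
    total = Counter()
    for stripe in stripes:
        total.update(stripe)          # element-wise sum of the stripes
    marginal = sum(total.values())    # count of everything co-occurring with word
    yield word, {u: c / marginal for u, c in total.items()}

# Two stripes for the word "a", as two mappers might have emitted them.
result = dict(relative_freq_reduce("a", [Counter(b=1, c=1), Counter(b=1)]))
print(result["a"])
```

Everything needed for the division arrives under one key, which is why this version needs no extra coordination.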
Relative Frequencies with Pairs
• Reducer receives (Wi,Wj) as the key and the counts as value
▫ From this alone it is not possible to compute f(Wj | Wi)
• Hint: reducers, like mappers, can preserve state across multiple
keys
• Solution: at the reducer side, buffer in memory all the words that co-occur with Wi
▫ In essence, building the associative array of the stripes approach
• Problem?
▫ Word pairs can arrive in arbitrary order!
• Solution: we must define the sort order of the pairs
▫ Keys are sorted first by the left word, and then by the right word
• So that: when the left word changes ->
▫ Sum, calculate and emit the results, then flush the memory
Relative Frequencies with Pairs
• Problem?
▫ Same left-word pairs may be sent to different reducers!
• Solution?
▫ We must ensure that all pairs with the same left word are sent to
the same reducer
• How?
▫ Custom partitioners!
▪ Pay attention only to the left word and partition based on its hash
• Will it work?
▫ Yeah!
• Drawback?
▫ Still a scalability bottleneck!
Relative Frequencies with Pairs
• Another approach? With no bottlenecks?
• Can we compute or ‘have’ the sum before processing the pair
counts?
• The notion of ‘before’ and ‘after’ can be seen in the ordering of the
key-value pairs
• The insight lies in properly sequencing the data presented to the
reducer
▫ The programmer should define the sort order of keys so that data needed
earlier is presented earlier to the reducer
• So now, we need two things
▫ Compute the sum for a given word Wi
▫ Send that sum to the reducer before any word pair whose left word
is Wi
Relative Frequencies with Pairs
• How?
• To get the sum
▫ Modify the mapper to additionally emit a ‘special’ key of the form (Wi, *), with a
value of one
• To ensure the order
▫ Define the sort order of the keys so that pairs with the special symbol, of
the form (Wi, *), are ordered before any other key-value pairs where the
left word is Wi
• In addition:
▫ The partitioner must pay attention to only the left word
Relative Frequencies with Pairs
• Example
• Memory bottlenecks?
▫ No!
Order Inversion Design Pattern
• To summarize
▫ Emitting a special key-value pair for getting the sum
▫ Controlling the sort order of the intermediate key
▫ Defining a custom partitioner
▫ Preserving state across multiple keys in the reducer
• A quite common pattern in many problems
• The key insight
▫ Convert the sequencing of computations into a sorting problem
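The four ingredients can be put together in a small single-process sketch. Assumptions made for the example: the special symbol is represented by the empty string so that it sorts before every real word, co-occurrence means "same document", and `partition` uses a toy hash. None of this is Hadoop API.

```python
from itertools import groupby

STAR = ""  # assumption: the empty string sorts before every real word

def oi_map(docid, doc):
    words = doc.split()
    for i, w in enumerate(words):
        for u in words[:i] + words[i + 1:]:
            yield (w, STAR), 1  # contributes to the marginal count of w
            yield (w, u), 1     # the joint count of (w, u)

def partition(key, num_reducers):
    # Only the left word decides the reducer (hypothetical stable hash).
    return sum(key[0].encode("utf-8")) % num_reducers

class OrderInversionReducer:
    def __init__(self):
        self.marginal = 0  # state preserved across keys

    def reduce(self, key, counts):
        left, right = key
        if right == STAR:
            self.marginal = sum(counts)  # arrives first thanks to the sort order
        else:
            yield key, sum(counts) / self.marginal

# Simulate one reducer: sorted() stands in for the framework's sort on keys.
pairs = sorted(kv for doc in ["a b c", "a b"] for kv in oi_map("d", doc))
reducer = OrderInversionReducer()
freqs = {}
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    for k, f in reducer.reduce(key, [v for _, v in group]):
        freqs[k] = f
print(freqs[("c", "a")])  # 0.5
```

The reducer only ever holds one number per left word, so memory use no longer grows with the number of co-occurring words.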
Secondary Sort
• In addition to sorting by key, we also need to sort by value
• Built into Google’s Map/Reduce, but not into Hadoop
• Two main techniques
▫ Buffer all the values for a key in memory and then sort
▪ May lead to too much memory consumption
▫ Value-to-key conversion
▪ Move part of the value into the intermediate key to form a composite
key
▪ We must define the intermediate key sort order
▪ We must define the partitioner so that all pairs associated with the
same original key are sent to the same reducer
▪ The reducer will need to preserve state across multiple pairs
▪ May lead to too many intermediate pairs
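Value-to-key conversion can be illustrated with a hypothetical dataset of (sensor, (timestamp, reading)) records. The record layout, names, and toy hash are assumptions for the example; `sort` stands in for the framework's sort on the composite key.

```python
from itertools import groupby

records = [("s1", (3, 20.5)), ("s2", (1, 7.0)), ("s1", (1, 19.0)), ("s1", (2, 19.8))]

# Value-to-key conversion: move the timestamp from the value into the key.
composite = [((sensor, t), reading) for sensor, (t, reading) in records]

def partition(key, num_reducers):
    # Partition on the original key only, so one sensor stays on one reducer.
    return sum(key[0].encode("utf-8")) % num_reducers

# Sorting on the composite key orders by sensor first, then by timestamp.
composite.sort(key=lambda kv: kv[0])

# The reducer now sees each sensor's readings already in time order.
by_sensor = {
    sensor: [(key[1], reading) for key, reading in group]
    for sensor, group in groupby(composite, key=lambda kv: kv[0][0])
}
print(by_sensor["s1"])  # [(1, 19.0), (2, 19.8), (3, 20.5)]
```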
Relational Joins
• For databases, data-warehousing, and data analytics
• Semi-structured data
• Example of a join
• Where S and T are datasets (relations), k is the key we
want to join on, si and ti are the unique IDs of S and T
respectively, Si and Ti are the rest of the tuple attributes
Reduce-side Join
• One-to-one join
▫ Emit tuple’s join attribute as key, rest of attributes as value
• One-to-many join
▫ Buffer all tuples in memory
▫ Or use the value-to-key conversion pattern
Reduce-side Join
• Many-to-many join
▫ The previous algorithm works as well
▫ Smaller set should come first
▫ Reducer will buffer it in memory
• Lessons
▫ Basic idea is to repartition the two datasets by the join key
▫ Not efficient since it shuffles both datasets across the network
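A reduce-side join can be sketched as below. The two toy relations and the "S"/"T" tags are hypothetical, and the grouping loop stands in for the shuffle that repartitions both datasets by the join key.

```python
from collections import defaultdict

S = [(1, "s-alice"), (2, "s-bob")]  # (join key, rest of tuple)
T = [(1, "t-order-9"), (1, "t-order-7"), (3, "t-order-4")]

def join_map(tag, relation):
    # Emit the join attribute as key; tag each tuple with its source relation.
    for k, rest in relation:
        yield k, (tag, rest)

def join_reduce(k, tagged):
    # Buffer the S side in memory, then cross it with every T tuple.
    s_side = [rest for tag, rest in tagged if tag == "S"]
    for tag, rest in tagged:
        if tag == "T":
            for s in s_side:
                yield k, s, rest

groups = defaultdict(list)
for k, v in list(join_map("S", S)) + list(join_map("T", T)):
    groups[k].append(v)
joined = [row for k in sorted(groups) for row in join_reduce(k, groups[k])]
print(joined)  # [(1, 's-alice', 't-order-9'), (1, 's-alice', 't-order-7')]
```

Keys that appear in only one relation produce no output, so this sketch behaves like an inner join.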
Map-side Joins
• Assume datasets are
▫ Both sorted by the join key
▫ Divided into same number of files
▫ Partitioned in the same manner by the join key
▫ In each file, tuples are sorted by the join key
• We can perform a join by scanning through both datasets
simultaneously
▫ This is known as a merge join
• Parallelize by partitioning and sorting both datasets in the same
way
▫ Map over one of the datasets (the larger one)
▫ Inside the mapper read the corresponding part of the other dataset
▪ Non-local read
▫ Perform the merge join
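The merge join step itself can be sketched in a few lines. To keep the example short, it assumes join keys are unique in `left`; both inputs must already be sorted by key.

```python
def merge_join(left, right):
    # Both inputs are lists of (key, value) sorted by key.
    # Assumption: keys are unique in `left`.
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append((lk, left[i][1], right[j][1]))
            j += 1  # stay on the same left tuple: more right matches may follow
    return out

left = [(1, "a"), (2, "b"), (4, "d")]
right = [(1, "x"), (1, "y"), (3, "z"), (4, "w")]
merged = merge_join(left, right)
print(merged)  # [(1, 'a', 'x'), (1, 'a', 'y'), (4, 'd', 'w')]
```

Each input is scanned once from start to end, which is what makes the pre-sorted, identically partitioned layout so valuable.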
Map-side Joins
• More efficient than a reduce-side join
▫ Does not shuffle the datasets across the network
• Drawback:
▫ Strong assumptions about the format of the input files
• Advice
▫ If used in a workflow with multiple Map/Reduce jobs,
ensure the previous reducer writes its output in a
convenient format.
Memory-backed Join
• If one of the datasets can fit in memory
• Load it in memory
• Map over the other dataset
• Use random access to tuples based on the join key
• Great performance improvement
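A memory-backed join amounts to a hash join performed inside the mapper. In this sketch the function names and sample data are hypothetical, and an ordinary Python dictionary plays the role of the in-memory table.

```python
from collections import defaultdict

def build_table(small):
    # Load the smaller dataset into an in-memory table keyed by the join key.
    table = defaultdict(list)
    for k, rest in small:
        table[k].append(rest)
    return table

def join_map(table, k, rest):
    # Called for each tuple of the larger dataset: random access by join key.
    for match in table.get(k, []):
        yield k, match, rest

small = [(1, "alice"), (2, "bob")]
large = [(1, "order-9"), (3, "order-4"), (1, "order-7"), (2, "order-1")]

table = build_table(small)
joined = [row for k, rest in large for row in join_map(table, k, rest)]
print(joined)
# [(1, 'alice', 'order-9'), (1, 'alice', 'order-7'), (2, 'bob', 'order-1')]
```

No shuffle or sort is needed, since each mapper can answer lookups locally.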
Summary
• In-mapper combining
▫ Aggregates partial results
▫ Emits fewer intermediate pairs
• Pairs and Stripes
▫ Keep track of joint events
▪ One by one (pairs)
▪ In stripe fashion (stripes)
• Order inversion
▫ Convert the sequencing of computations into a sorting problem
• Value-to-key conversion
▫ Scalable solution for secondary sorting
▫ Moving part of the value into the key
Before we go!
• Remember: Limitations of Map/Reduce Model
▫ Map/Reduce is mainly designed for batch processing, not for
online queries
▫ Prevents modifying or adding input data while the job is
running, as well as changing the number of machines.
▫ A Map/Reduce job has a single entry and a single exit
▪ We cannot keep it alive waiting for an event to trigger it
▫ Map/Reduce works on flat files
▪ Lack of schema support
What’s Next?
Map/Reduce vs RDBMS
• A living debate in the database and data analytics communities
• In 2008, D. DeWitt and M. Stonebraker wrote
▫ “MapReduce: A major step backwards”
▫ A giant step backward in the programming paradigm
▫ An implementation that uses brute force instead of indexing
▫ Not novel at all -- well known techniques developed nearly 25 years ago
▫ Missing most of the features that are routinely included in current DBMS
▫ Incompatible with all of the tools DBMS users have come to depend on
• MapReduce is missing features
▫ Indexing, Bulk loader, Updates, Transactions, Integrity constraints,
Referential integrity, Views
• MapReduce is incompatible with the DBMS tools
▫ Report writers, Business intelligence tools, Data mining tools,
Replication tools, Database design tools
Map/Reduce vs RDBMS
• In 2010, the same authors and others wrote
“MapReduce and Parallel DBMSs: Friends or Foes?”
• Where they argue that
▫ Map/Reduce is a complement to DBMSs, not a competitor
▫ They are used in different application domains
• Parallel DBMSs excel at efficient querying of large data sets
• MR-style systems excel at ETL (extract-transform-load) tasks
NoSQL
• Mechanisms for storage and retrieval of data that use looser
consistency models than traditional relational databases
▫ To achieve higher scalability and availability
• Usually in the form of a key-value store
• Often built on top of distributed file systems
• Examples
▫ Google Bigtable
▫ Apache HBase
▫ Apache Cassandra
▫ Amazon Dynamo
Tools on top of Hadoop
• Apache Pig
▫ Apache Pig is a high-level procedural language platform
developed to simplify querying large data sets in Apache
Hadoop and MapReduce
▫ Apache Pig features “Pig Latin”, a relational data-flow
language that enables SQL-like queries to be performed on
distributed datasets within Hadoop applications.
▫ Pig originated as a Yahoo Research project
▫ In 2007, Pig became an open source project of the Apache
Software Foundation.
Apache Pig
• Pig Latin Example
Apache Pig
• Pig execution flow
Tools on top of Hadoop
• Apache Hive
▫ Hive is a data warehouse system for the open source
Apache Hadoop project.
▫ Hive features a SQL-like HiveQL language that facilitates
data analysis and summarization for large datasets stored
in Hadoop-compatible file systems.
▫ Hive originated at Facebook
▫ Later became an open source project under the Apache
Software Foundation.
Apache Hive
• HiveQL Example
Pig vs Hive
• They are/were independent projects and there was no centrally
coordinated goal.
• They were in different spaces early on and have grown to overlap
with time as both projects expand
• Some differences are
▫ Pig Latin is procedural, whereas HiveQL is declarative.
▫ Pig Latin allows developers to insert their own code almost anywhere in
the data pipeline.
• Both compile to Map and Reduce jobs.
Libraries on top of Hadoop
• Mahout
▫ Machine learning library to build scalable machine learning
algorithms.
Libraries on top of Hadoop
• HIPI (Hadoop Image Processing Interface)
▫ Framework that provides an API for performing image processing
tasks in a distributed computing environment
Summary
• Map/Reduce
• Framework Architecture
• Map/Reduce Algorithm Design
• Tools and Libraries built on top of Map/Reduce
Demo
• Starting Hadoop cluster
• Copying data to HDFS
• Compiling our Java Map/Reduce code and
create the Jar file.
• Submit Hadoop job
• Show progress and dash boards
• Retrieve the output from HDFS
• Shut down Hadoop cluster
Appendix
• Hadoop Configurations
• Single Node
▫ Simple guide
▪ http://hadoop.apache.org/docs/stable/single_node_setup.html
▫ More detailed:
▪ http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntulinux-single-node-cluster/
• Cluster setup
▫ Simple guide
▪ http://hadoop.apache.org/docs/stable/cluster_setup.html
▫ More detailed:
▪ http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntulinux-multi-node-cluster/
Appendix
• Packages to install on Linux
▫ Hadoop:
http://apache.mirror.nexicom.net/hadoop/common/hadoop1.1.2/hadoop-1.1.2.tar.gz
▫ Oracle Java 7:
http://download.oracle.com/otn-pub/java/jdk/7u25-b15/jdk-7u25linux-x64.tar.gz
▫ SSH and rsync
$ sudo apt-get install ssh
$ sudo apt-get install rsync
Appendix
• Study materials
▫ “Data-Intensive Text Processing with MapReduce”
Jimmy Lin and Chris Dyer
▫ “Hadoop: The Definitive Guide” Tom White
▫ “MapReduce Design Patterns” Donald Miner and
Adam Shook
Questions?