CS 378 Big Data Programming (Fall 2015). Lecture 2: MapReduce

Transcription

1 CS 378 Big Data Programming Lecture 2: MapReduce

2 MapReduce
Large data sets are not new. What characterizes a problem suitable for MR?
Most or all of the data is processed, but viewed in small increments
For the most part, map and reduce tasks are stateless
Data is written once, read multiple times (a data warehouse has this intended usage: write once)
Unstructured data vs. structured/normalized data
Data pipelines are common: a chain of MR jobs, with intermediate results

4 MapReduce
Tom White, in Hadoop: The Definitive Guide:
"MapReduce works well on unstructured or semi-structured data because it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not intrinsic properties of the data, but they are chosen by the person analyzing the data."

5 MapReduce
When writing a MapReduce program:
You don't know the size of the data
You don't know the extent of the parallelism
MapReduce tries to collocate the data with the compute node:
Parallelize the I/O
Make the I/O local (versus across the network)

6 MapReduce
As the name implies, for each problem we'll write:
A map method/function
A reduce method/function
The terms come from functional programming:
Map: apply a function to each input, output the result
Reduce: given a list of inputs, compute some output value
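These two functional-programming operations can be illustrated with Java's own stream library. This is a minimal sketch of the concepts, not Hadoop code; the input list and the squaring function are made up for the example:

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceTerms {
    public static void main(String[] args) {
        List<Integer> inputs = List.of(1, 2, 3, 4);

        // Map: apply a function to each input, output the result
        List<Integer> squares = inputs.stream()
                .map(x -> x * x)
                .collect(Collectors.toList());

        // Reduce: given a list of inputs, compute some output value
        int sum = squares.stream().reduce(0, Integer::sum);

        System.out.println(squares); // [1, 4, 9, 16]
        System.out.println(sum);     // 30
    }
}
```

In Hadoop the same two roles are played by the map and reduce methods we write for each job, applied in parallel across many machines.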

7 MapReduce in Hadoop
Figure 2-4, Hadoop: The Definitive Guide: MapReduce data flow with multiple reduce tasks
"The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as 'the shuffle,' as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in 'Shuffle and Sort' on page 208."

9 Reduce Function
Reduce input is a stream of key/value-list pairs
These are built from the key/value pairs emitted by the map function
The reduce function processes each input pair in turn
For each input pair, the reduce function can (but isn't required to) emit a key/value pair
The emitted pair is derived from the input key/value-list pair
It does not need to have the same key or value data type

10 WordCount Example
For an input text file of arbitrary size, or
multiple text files of arbitrary size, or
an arbitrary number of documents:
Count the number of occurrences of all the words that appear in the input.
Output:
word1, count
word2, count

11 WordCount Example - Map
Map input is a stream of key/value pairs:
file position in bytes (key), line of text (value)
The map function processes each input pair in turn:
It extracts each word from the line of text
It emits a key/value pair for each word: <the-word, 1>
So for each input pair, the map function emits multiple key/value pairs
The key is a text string (the word); the value is a number
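The map step's logic can be sketched in plain Java. This simulates "emitting" by collecting pairs into a list; a real Hadoop Mapper would write pairs to the framework instead, and the class and method names here are illustrative only:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class WordCountMapSketch {
    // For one input pair (byte offset, line of text), emit a (word, 1) pair per word.
    static List<Entry<String, Integer>> map(long offset, String line) {
        List<Entry<String, Integer>> emitted = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(new SimpleEntry<>(word, 1));
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // The key (byte offset into the file) is ignored by WordCount's map function.
        System.out.println(map(0L, "the cat and the hat"));
    }
}
```

Note that the map function emits one pair per word occurrence, duplicates included; grouping them by word is the framework's job, not the mapper's.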

12 WordCount Example - Reduce
Reduce input is a stream of key/value-list pairs
These are built from the key/value pairs emitted by the map function:
the key is a text string (the word), the value is a list of 1s
Hadoop has grouped the data together by key
The reduce function processes each input pair in turn:
It sums the values in the value-list
For each input pair, the reduce function emits a key/value pair:
the key is a text string (the word), the value is the total count for that word
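The reduce step's logic can be sketched the same way, again in plain Java rather than the Hadoop Reducer API, with illustrative names:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map.Entry;

public class WordCountReduceSketch {
    // For one input pair (word, value-list), sum the list and emit (word, total).
    static Entry<String, Integer> reduce(String word, List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;   // each value is a 1 emitted by the map function
        }
        return new SimpleEntry<>(word, total);
    }

    public static void main(String[] args) {
        // Hadoop hands reduce() the grouped value-list for each distinct word.
        Entry<String, Integer> out = reduce("the", List.of(1, 1, 1));
        System.out.println(out.getKey() + "," + out.getValue()); // the,3
    }
}
```

Because reduce only sums its value-list, it is stateless across input pairs, which is exactly what lets Hadoop run many reduce tasks in parallel.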

13 MapReduce in Hadoop (revisited)
Figure 2-4, Hadoop: The Definitive Guide: MapReduce data flow with multiple reduce tasks (same figure and quoted passage as slide 7).

16 Assignment Artifacts
For each assignment, there will be one or more artifacts to submit:
Java code
Source files in one directory (for easy inspection)
Source files in the src/main/java/ structure (use tar)
Build info: the pom.xml file used for Maven
An initial pom.xml file will be provided, and we'll expand it during the semester
Program outputs
Extracted from HDFS
The artifacts required for each assignment will be listed.
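For reference, a minimal pom.xml for a Hadoop job might look roughly like the sketch below. The project coordinates and the Hadoop version here are placeholders, not the course's actual values; use the provided pom.xml for assignments:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- Placeholder coordinates; the course-provided pom.xml will differ -->
  <groupId>edu.utexas.cs378</groupId>
  <artifactId>wordcount</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <!-- Hadoop client libraries; pin the version to the cluster's release -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.1</version>
    </dependency>
  </dependencies>
</project>
```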
