Introduction: Data analysis at a large scale

Very large data collections (TB to PB) are stored on distributed filesystems:
- query logs
- search engine indexes
- sensor data

We need efficient ways of analyzing, reformatting, and processing them. In particular, we want:
- parallelization of the computation (benefiting from the processing power of all nodes in a cluster);
- resilience to failure.

Introduction: Pushing the program near the data

[Figure: the client node's coordinator dispatches the program to processes running next to the data on each disk; each process returns its result.]

MapReduce: a programming model (inspired by standard functional programming operators) to facilitate the development and execution of distributed tasks.
- Published by Google Labs in 2004 at OSDI [DG04].
- Widely used since then; an open-source implementation is available in Hadoop.

Introduction: MapReduce in brief

The programmer defines the program logic as two functions:
- Map transforms the input into key-value pairs to process;
- Reduce aggregates the list of values for each key.

The MapReduce environment takes care of all distribution aspects. A complex program can be decomposed as a succession of Map and Reduce tasks. Higher-level languages (Pig, Hive, etc.) help with writing distributed applications.

Some terminology: job, task, mapper, reducer

- A MapReduce job is a unit of work that the client node seeks to perform. It consists mainly of the input data, the MapReduce program, and configuration information.
- A MapReduce program might contain one or several jobs, and each job might consist of several map and reduce tasks.
- Mapper and reducer: any node that executes map or reduce tasks, respectively.

Example: term count in MapReduce (input)

URL   Document
u1    the jaguar is a new world mammal of the felidae family.
u2    for jaguar, atari was keen to use a 68k family device.
u3    mac os x jaguar is available at a price of us $199 for apple's new family pack.
u4    one such ruling family to incorporate the jaguar into their name is jaguar paw.
u5    it is a big cat.

More on the map function

    function map(docOffset, document):
        while document.hasNext():
            output(document.nextWord(), 1)

where docOffset is the physical address of the document (handled by MapReduce).

Why do we output a 1 for each term? The input is a data stream with sequential access (accessed through an Iterator in Java).
- More efficient: output partial sums rather than 1's (keep some data in main memory).
- Best: use a combiner (more on this later).

A Java sketch of this function follows.
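In Hadoop's old Java API (the one presented later in these slides), the map function above could look roughly as follows. This is a minimal sketch: the class name TermCountMapper and the use of line offsets as docOffset are illustrative assumptions, not part of the original slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TermCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  // key: byte offset of the document in the input split (the "docOffset",
  // handled by MapReduce); value: the text of the document itself.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer it = new StringTokenizer(value.toString());
    while (it.hasMoreTokens()) {
      word.set(it.nextToken());
      output.collect(word, ONE); // emit (term, 1)
    }
  }
}
```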

More on the reduce function

    function reduce(word, parSums):
        tot = 0
        while parSums.hasNext():
            tot += parSums.next()
        output(word, tot)

Each reduce task consists of aggregating partial sums. Note that reduce tasks are independent from each other and can be executed in parallel: if the user requires more than one reducer, the reduce tasks are executed in parallel (e.g., two reducers, each taking care of 50% of the words in the ideal case); otherwise there is no parallelism in the reduce tasks. A Java sketch follows.
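The matching reducer in the old Java API could be sketched as follows; again, the class name TermCountReducer is an illustrative assumption.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TermCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // word: the key; parSums: the partial sums emitted for this word.
  public void reduce(Text word, Iterator<IntWritable> parSums,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int tot = 0;
    while (parSums.hasNext()) {
      tot += parSums.next().get();
    }
    output.collect(word, new IntWritable(tot));
  }
}
```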

A MapReduce cluster

Nodes inside a MapReduce cluster can be classified as follows:
- A jobtracker acts as a master node. It coordinates all the jobs run on the system, scheduling tasks to run on tasktrackers. It also keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it.
- Several tasktrackers run the computation itself, i.e., the map and reduce tasks, and send progress reports to the jobtracker. Tasktrackers usually also act as datanodes of a distributed filesystem (e.g., HDFS).
- In addition, a client node is where the application is launched.

Processing a MapReduce job

- The input is divided into fixed-size pieces called splits (default size: 64 MB). MapReduce assigns one map task to each split, which runs the user-defined map function (the assignment is based on the data locality principle).
- Map output is stored on a local disk (not HDFS), as it can be thrown away once the reducers have processed it.

Remarks:
- Data locality no longer holds for reducers, since they read from the mappers.
- The split size can be changed by the user (see the fragment below). Small splits imply more parallelism; however, too many splits entail too much overhead in managing them.
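A hedged sketch of how the split size could be changed with the old Java API: the property name mapred.min.split.size belongs to old Hadoop versions (it was renamed later), and conf is assumed to be the job's JobConf, as in the driver sketch shown later.

```java
// Fragment of a job driver: ask for splits of at least 128 MB,
// i.e., fewer but larger map tasks. The effective split size also
// depends on the HDFS block size.
conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
```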

Facts about mappers and reducers

- The number of mappers is a function of the number of map tasks and the number of available nodes in the cluster.
- The assignment of data splits to mappers tries to optimize data locality: the mapper node in charge of a split is, if possible, one that stores a replica of this split.
- The number of reducers is set by the user.
- Map output is assigned to reducers by hashing the key, usually uniformly at random, to balance the load; no data locality is possible here. (See the sketch below.)
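Hadoop's default partitioner implements exactly this hashing scheme; the sketch below shows what it essentially does, written against the old Java API (the class name WordPartitioner is illustrative; the logic mirrors Hadoop's HashPartitioner).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Every occurrence of a key goes to the same reducer, and keys are
// spread (roughly) uniformly over the numReduceTasks reducers.
public class WordPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {}

  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the partition index is non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```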

Failure management

Since tasks are distributed over thousands of machines, faulty machines are relatively common, and starting the job over from the beginning is not a valid option. The jobtracker periodically checks the availability and reachability of the tasktrackers (heartbeats) and whether map or reduce tasks make any progress:
1. if a reducer fails, its task is reassigned to another tasktracker; this usually requires restarting map tasks as well;
2. if a mapper fails, its task is reassigned to another tasktracker;
3. if the jobtracker fails, the whole job must be re-executed.

Adapting algorithms to MapReduce

- Prefer simple map and reduce functions, and try to balance the computational load across the machines (reducers). This is easier when computing matrix-vector multiplications.
- Iterative algorithms: prefer a small number of iterations. The output of each iteration must be stored in HDFS, which entails a large I/O overhead per iteration.

A given application may have:
- many mappers and reducers (parallelism in HDFS I/O and in computing);
- only one reducer;
- zero reducers (e.g., input preprocessing, filtering, ...); see the fragment below.
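The number of reducers is part of the job configuration. A minimal fragment for the old Java API (conf is assumed to be the job's JobConf):

```java
// Map-only job (e.g., preprocessing or filtering): the shuffle and
// reduce phases are skipped and map output goes directly to HDFS.
conf.setNumReduceTasks(0);

// Or request several reducers to parallelize the aggregation:
// conf.setNumReduceTasks(4);
```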

MapReduce optimization: combiners

A map task can produce a large number of pairs with the same key (e.g., (jaguar, 1)), which all need to be sent over the network to the reducers: costly.

Example: a combiner combines these pairs into a single key-value pair:

    (jaguar, 1), (jaguar, 1), (jaguar, 1), (jaguar, 2) → (jaguar, 5)

combiner: list(V) → V is a function executed (possibly several times) to combine the values for a given key, on a mapper node. There is no guarantee that the combiner is ever called.

Easy case: the combiner is the same as the reduce function. This is possible when the aggregate function computed by reduce is commutative and associative (see the fragment below).
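For term count, summing is commutative and associative, so the reducer class itself can serve as the combiner. A fragment for the old Java API, reusing the illustrative TermCountReducer sketch from above (conf is the job's JobConf):

```java
// Pre-aggregate partial counts on each mapper node before the shuffle,
// so fewer (word, count) pairs cross the network.
conf.setCombinerClass(TermCountReducer.class);
conf.setReducerClass(TermCountReducer.class);
```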

MapReduce optimization: compression

Data transfers over the network:
- from datanodes to mapper nodes (usually reduced using data locality);
- from mappers to reducers;
- from reducers to datanodes, to store the final output.

Each of these can benefit from data compression. There is a tradeoff between the volume of data transferred and the (de)compression time. Usually, compressing map outputs using a fast compressor increases efficiency (see the fragment below).
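A hedged fragment showing how map-output compression could be enabled with the old Java API (conf is the job's JobConf; the codec choice is illustrative, and a faster codec than gzip would usually be preferred when available):

```java
import org.apache.hadoop.io.compress.GzipCodec;

// Compress intermediate map output before it is shuffled to the
// reducers: trades CPU time for network volume.
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);
```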

MapReduce optimization: optimizing the shuffle operation

Sorting the pairs (needed to assign data to reducers) can be costly, and sorting is much more efficient in memory than on disk. Increasing the amount of memory available for shuffle operations can therefore greatly increase performance, at the cost of less memory being available for map and reduce tasks (see the fragment below).
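In old Hadoop versions, the in-memory sort buffer was controlled by the io.sort.mb property (renamed in later versions); a hedged fragment, with conf the job's JobConf:

```java
// Enlarge the buffer used to sort map output in memory before spilling
// to disk (100 MB by default in old versions); this leaves less heap
// for the map and reduce functions themselves.
conf.set("io.sort.mb", "200");
```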

MapReduce in Hadoop: Hadoop programming interfaces

There are different APIs for writing Hadoop programs:
- a rich Java API (the main way to write Hadoop programs);
- a Streaming API that can be used to write map and reduce functions in any programming language (using standard inputs and outputs);
- a C++ API (Hadoop Pipes);
- higher-level languages (e.g., Pig, Hive).

Advanced features are only available in the Java API. There are two different Java APIs depending on the Hadoop version; we present the old one. (A driver sketch using it follows.)
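To give a feel for the old Java API, here is a sketch of a driver that wires the term-count job together. The class names TermCount, TermCountMapper and TermCountReducer refer to the illustrative sketches given earlier, not to classes from the original slides.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TermCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TermCount.class);
    conf.setJobName("termcount");

    // Types of the (key, value) pairs output by map and reduce.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(TermCountMapper.class);
    conf.setReducerClass(TermCountReducer.class);

    // Input is read from HDFS; output is written back to HDFS.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf); // submit the job and wait for completion
  }
}
```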

MapReduce in Hadoop: testing and executing a Hadoop job

Required environment:
- JDK on the client, JRE on all Hadoop nodes;
- Hadoop distribution (HDFS + MapReduce) on the client and on all Hadoop nodes;
- SSH servers on each tasktracker, SSH client on the jobtracker (used to control the execution of tasktrackers);
- an IDE (e.g., Eclipse + plugin) on the client.

Three different execution modes:
- local: one mapper and one reducer, run locally from the same JVM as the client;
- pseudo-distributed: mappers and reducers are launched on a single machine, but communicate over the network;
- distributed: over a cluster, for real runs.

MapReduce in Hadoop: debugging MapReduce

- Easiest: debugging in local mode.
- A Web interface gives status information about the job.
- Standard output and error channels are saved on each node and accessible through the Web interface.
- Counters can be used to track side information across a MapReduce job, e.g., the number of invalid input records (see the sketch below).
- Remote debugging is possible but complicated to set up (it is impossible to know in advance where a map or reduce task will be executed).
- IsolationRunner allows running part of the MapReduce job in isolation.
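With the old Java API, such counters are incremented through the Reporter object passed to map and reduce. A minimal sketch, to be read as part of the illustrative TermCountMapper given earlier (the group and counter names are made up for the example):

```java
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  if (value.getLength() == 0) {
    // Counters are aggregated over all tasks of the job and shown
    // both in the Web interface and in the job's final report.
    reporter.incrCounter("TermCount", "EMPTY_RECORDS", 1);
    return;
  }
  // ... tokenize and emit (term, 1) pairs as before ...
}
```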

MapReduce in Hadoop: Hadoop in the cloud

It is possible to set up one's own Hadoop cluster, but it is often easier to use clusters in the cloud that support MapReduce:
- Amazon EC2
- Cloudera
- etc.

It is not always easy to know the cluster's configuration (in terms of racks, etc.) when running in the cloud, which hurts data locality in MapReduce.

Conclusions: MapReduce limitations (1/2)

- High latency. Launching a MapReduce job has a high overhead, and reduce functions are only called after all map functions have succeeded; this makes MapReduce unsuitable for applications needing a quick result.
- Batch processing only. MapReduce excels at processing a large collection, not at retrieving individual items from a collection.
- Write-once, read-many mode. There is no real possibility of updating a dataset using MapReduce: it has to be regenerated from scratch.

Conclusions: Resources

- Original description of the MapReduce framework [DG04]
- Hadoop distribution and documentation available at
- Documentation for Pig is available at
- Excellent textbook on Hadoop [Whi09]
