Transcription

2 OBJECTIVES OF THIS LAB SESSION
The LSDS class has been mostly theoretical so far. The objective of this lab session is to get hands-on experience with Hadoop. I'll give you a short presentation (<< 1h), and after that: exercises. This is all just for fun, no grades!

3 WHAT IS BIG?
Companies and organisations gather *lots* of data nowadays, and they're able to store it because storage has become very cheap! The New York Stock Exchange (NYSE) generates 1 terabyte of data each day. Facebook stores ~250 billion pictures from its users: several petabytes of data! The Large Hadron Collider (LHC) generates about 15 petabytes of data per year! These numbers are from 2014, so they're probably even higher now!

4 WHAT IS BIG?
This data comes from various sources. In Facebook's case it comes from users (as is often the case on the social web); in the LHC's case it comes from machines (a particle accelerator). It can also come from other machines, such as sensor networks (e.g., monitoring temperatures in server farms all across the world, or taxi companies that want to know where all their cars are at any moment), transactions, log files... This data is often not very well structured: images, text files, comments, health data (prescriptions...), etc. How do you store and process this data?

5 HOW DO YOU STORE THIS?
You need many machines, each storing a small part of the data. «Oh, that was easy!»
[Diagram: the data set split into parts 1 through 9, each part stored on a different machine.]

8 HOW DO YOU STORE THIS?
So you need some replication. You can't handle it manually, of course, so you use a Distributed File System (DFS) that does the job for you! In Hadoop, this file system is called HDFS (Hadoop Distributed File System).
[Diagram: the name node keeps the file-to-block mapping (foo.txt: blocks 3, 9, 6; bar.data: blocks 2, 4); a client asks the name node for block #2 of foo.txt, is told it is block 9, and then reads that block directly from the data nodes.] Source: UPenn
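As a small illustration (not part of the original slides), here is a minimal sketch of how a client might read a file from HDFS programmatically, using Hadoop's FileSystem API; the path "/data/foo.txt" and the cluster configuration are assumptions:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS (the name node address) from the cluster configuration.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; the name node resolves it to blocks stored on the data nodes.
            InputStream in = null;
            try {
                in = fs.open(new Path("/data/foo.txt"));
                IOUtils.copyBytes(in, System.out, 4096, false);  // stream the blocks to stdout
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }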

9 HOW DO YOU PROCESS THIS?
Now we know how to store the data, but how do we process it? Historically, we've been using databases for this. That doesn't work anymore! First, because as we saw earlier: lack of structure! Images, comments, log files, prescriptions... You can't put that into a database with a fixed structure, tables, relations, etc. Second, databases don't scale very well... Try doubling the number of nodes with a (distributed) database: you won't be twice as fast (far from it).

10 HOW DO YOU PROCESS THIS?
So what's the alternative? Google invented MapReduce to make the indexer for their search engine scale! The idea of MapReduce: you write a function that you are going to run as a batch process on all of your data, and you want to get one result (which can be large). MapReduce is really good at doing this efficiently! This is a different use case from databases, which are better at frequently accessing small bits of your data, rather than all of your data once in a while in a batch process!

11 HOW DO YOU PROCESS THIS?
How does MapReduce manage to be so efficient at what it does? A very old idea: execute things locally as much as possible and avoid transfers between nodes as much as possible! MapReduce first runs a function f() on all the data. Of course, if two nodes contain the same data, the function is only run on one of them (you just ensure it is run on all of the data). And if a node is dead, the function is run on another node that has the same data. All of this is done automatically!

12 HOW DO YOU PROCESS THIS?
After the Map phase, you have partial results located on all the nodes... So you want to gather and aggregate all of these results into global results! An intermediary phase, the Shuffle phase, brings all the data to one machine (which could be one of the previous ones).
[Diagram: the partial results (RES) from each node are brought together on a single machine.]

13 HOW DO YOU PROCESS THIS?
The Shuffle phase naïvely concatenates the results together. Usually we want a new function g() that will take the concatenated data... and merge it in a smarter way, to produce the result.
[Diagram: the partial results (RES) are fed to g(), which produces the OUTPUT.]

14 HOW DO YOU PROCESS THIS?
MapReduce can seem a bit restrictive: you have to express all of your algorithms with two functions, Map and Reduce. And actually, it is restrictive: you can't express everything with MapReduce. But in practice, you will see that many operations executed on large amounts of data can be expressed following this paradigm! And if you're able to, you can very easily implement an algorithm that scales well... without having to worry about how you distribute, replicate, or transfer the data!

15 HOW DO YOU PROCESS THIS?
In practice, it can get more complicated than this. Among other things: you can alter the Shuffle phase with a Combiner that prepares the data locally after the Map phase, before it is sent to the Reducer (useful for reducing the amount of data transferred). You can also use several Reducers: each will produce part of the data, and the results are stored on HDFS, so they will just look like a bunch of files, which can sometimes be just what you want (just merge them or something)... But pretty often, when you have several Reducers, you'll want to combine the data again, so what do you do? You can run a Map phase on the output of your Reducers again... and you can do this over and over again (iterative MapReduce). See the configuration sketch below.
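As a sketch (not from the original slides), these options correspond to two calls on a Hadoop Job object; the Mapper, Combiner and Reducer class names below are placeholders for your own implementations:

    // Fragment of a hypothetical job driver (a full driver sketch appears later in these notes).
    Job job = Job.getInstance(new Configuration(), "with combiner");
    job.setMapperClass(MyMapper.class);
    job.setCombinerClass(MyCombiner.class);   // pre-aggregates map output locally, before it goes over the network
    job.setReducerClass(MyReducer.class);
    job.setNumReduceTasks(4);                 // 4 Reducers -> 4 output files on HDFS (part-r-00000 ... part-r-00003)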

17 WAIT, BUT THAT'S MAPREDUCE, WHAT'S HADOOP?
MapReduce was invented and is used by Google. Hadoop is a free, open-source implementation of MapReduce. A bit of history... In the early 2000s, Doug Cutting develops two open-source search projects: Lucene, a search indexer used e.g. by Wikipedia, and Nutch, a spider/crawler (with Mike Cafarella). Nutch aims to become a web-scale, crawler-based search engine, written by a few part-time developers and distributed by necessity (too much data). It is able to parse 100 MB of web pages on 4 nodes, but can't scale to the whole web... Source: UPenn

20 WAIT, BUT THAT'S MAPREDUCE, WHAT'S HADOOP?
Cutting is now at Cloudera... Originally a startup, founded by three top engineers from Google, Facebook and Yahoo, and a former executive from Oracle. It has its own version of Hadoop; the software remains free, but the company sells support and consulting services. Cutting was elected chairman of the Apache Software Foundation, and Hadoop is now maintained by the Apache Foundation! Source: UPenn

22 HADOOP IN PRACTICE
Let's start with an example: we have files that contain meteorological data. These files contain records; each record is one line, containing: the code of a weather station, on five digits; the year when the temperature was recorded; and the average temperature for that year, multiplied by ten, on four digits (we'll suppose all temperatures are positive to simplify things; the multiplication avoids floats). There is only one data point per year here, so it's not really Big Data, but this is just a toy example: we could have many more records (one per hour, for instance) and many more fields, such as wind speed, humidity, etc. An example of a record:
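(The record itself is shown as an image on the original slide. As a purely made-up illustration of the format just described, a record might look like "01472 1950 0163": hypothetical station 01472, year 1950, average temperature 16.3 °C stored as 0163.)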

23 HADOOP IN PRACTICE
The input data will look like this: [example records shown on the slide]. The data can be stored in many files: one per weather station, one per year, etc. We'll use Hadoop to calculate the maximum average temperature for each year!

24 HADOOP IN PRACTICE
What will the input of the Map function be? Each line produces a (key, value) pair. We can ignore the key (it's usually the character offset); the value is the contents of the line: (0, …), (20, …), (40, …), (60, …), (80, …), ...

25 HADOOP IN PRACTICE
What will the Map function do? It will discard the key, parse the value, and return (key, value) pairs where the key is the year and the value is the average temperature. The output will be: (1950, 0163), (1950, 0134), (1950, 0131), (1949, 0223), (1949, 0165), ... So basically our Map function will be a Java function that takes two parameters, the key (a number) and the value (a string); it will parse the string using the standard API and produce the (key, value) pairs as its output. A sketch of such a Mapper is shown below.
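As an illustration (not part of the original slides), a minimal Mapper along these lines might look as follows; the substring offsets assume the hypothetical record layout "SSSSS YYYY TTTT" shown earlier, so adapt them to the real format:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Offsets assume "SSSSS YYYY TTTT" (station, year, temperature x 10); purely illustrative.
            String year = line.substring(6, 10);
            int temperature = Integer.parseInt(line.substring(11, 15));
            // Emit (year, temperature); the input key (byte offset) is simply discarded.
            context.write(new Text(year), new IntWritable(temperature));
        }
    }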

26 HADOOP IN PRACTICE
What will the Shuffle phase do? As we've seen earlier, it will concatenate the values for each key. Plus, the keys are sorted: (1949, [0223, 0165]), (1950, [0163, 0134, 0131]), ... We don't have to implement this phase; it's done automatically.

27 HADOOP IN PRACTICE
What will the Reduce phase do? It's just going to calculate the maximum of each list. The input was: (1949, [0223, 0165]), (1950, [0163, 0134, 0131]), ... The output will be: (1949, 0223), (1950, 0163), ... And that's it, we have the result we want! All we have to do is implement two very simple functions in Java, Map and Reduce, and everything else (distribution, replication, load-balancing of keys, fault tolerance with tasks rescheduled onto machines that work, etc.) will be handled by Hadoop!
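Again as an illustration (not from the original slides), a minimal Reducer for this computation could look like the sketch below, assuming the Mapper shown above:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // The shuffle has already grouped all the temperatures for this year; keep the maximum.
            int max = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                max = Math.max(max, value.get());
            }
            context.write(key, new IntWritable(max));
        }
    }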

29 HADOOP IN PRACTICE
The first thing to do when you write a MapReduce program: find the input and output types of the Map and Reduce functions. Map takes pairs like (0, …) and produces pairs like (1950, 0163). Input types = (LongWritable, Text), output types = (Text, IntWritable), for instance. (We could use an IntWritable for the year too, but we never use its numerical properties.) Consequently, the Map class will extend Mapper<LongWritable, Text, Text, IntWritable> and contain this function: public void map(LongWritable key, Text value, Context context)
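To complete the picture (again, not part of the original slides), here is a minimal sketch of the job driver that wires the Mapper and Reducer together; the class name, job name and command-line paths are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperature {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "max temperature");
            job.setJarByClass(MaxTemperature.class);

            // Input and output locations on HDFS, passed on the command line.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(MaxTemperatureMapper.class);
            job.setReducerClass(MaxTemperatureReducer.class);

            // Types of the final (key, value) pairs written by the Reducer.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }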

32 YOUR TURN!
That's enough information to get you started! You can now start working on the exercises, which you will find here: You will probably need more information than just what we saw in these slides... you're expected to use Google and to figure things out on your own! Good luck!
