Transcription

1 MapReduce and Hadoop Distributed File System VIJAY RAO

2 The Context: Big-data
- Man landed on the moon with 32KB of memory (1969); my laptop had 2GB of RAM (2009).
- Google collected 270PB of data in a month (2007) and processed about 20PB a day (2008).
- The 2010 census data is expected to be a huge gold mine of information.
- Mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
- We are in a knowledge economy: data is an important asset to any organization.
- Discovery of knowledge; enabling discovery; annotation of data.
- We are looking at newer programming models, and supporting algorithms and data structures.
- NSF refers to it as data-intensive computing; industry calls it big data and cloud computing.

3 MapReduce CCSCNE 2009, Plattsburgh, April. B. Ramamurthy & K. Madurai

4 What is MapReduce?
- MapReduce is a programming model that Google has used successfully for processing its big data sets (petabytes per day).
- Users specify the computation in terms of a map and a reduce function.
- The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
- The underlying system also handles machine failures, efficient communication, and performance issues.
A sketch of the two functions appears below.
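To make the model concrete, here is a minimal sketch of a map and a reduce function for the word-count problem introduced on the next slide, written against the standard Hadoop mapreduce API. It mirrors the canonical Hadoop WordCount example; the class names are illustrative, not from the original slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit the pair (word, 1) for every word in the line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: for each word, sum the counts emitted by all the mappers.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The runtime, not the user, takes care of splitting the input across mappers, grouping the emitted pairs by key, and feeding each key's values to a reducer.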

5 From CS Foundations to MapReduce
- Consider a large data set: {web, weed, green, sun, moon, land, part, web, green, ...}
- Problem: count the occurrences of the different words in the data set.
- Let's design a solution for this problem. We will start from scratch, add and relax constraints, and do incremental design, improving the solution for performance and scalability.
A single-machine baseline is sketched below.
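As that from-scratch starting point, a single-machine counter might look like the following plain-Java sketch (the class name and hard-coded data are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {
    public static void main(String[] args) {
        String[] data = {"web", "weed", "green", "sun", "moon",
                         "land", "part", "web", "green"};

        // One pass over the data, one counter per distinct word.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : data) {
            counts.merge(w, 1, Integer::sum);
        }
        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```

The slides that follow relax this solution's implicit constraints: all data fits on one machine, there is a single processor, and nothing ever fails.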

10 Addressing the Scale Issue
- A single machine cannot serve all the data: you need a distributed, special-purpose (file) system.
- Large number of commodity hardware disks: say, 1000 disks of 1TB each.
- Issue: with a mean time between failures (MTBF) corresponding to a failure rate of 1/1000, the expected number of failed disks among those 1000 at any given time is 1000 × 1/1000 = 1. Thus failure is the norm, not an exception.
- The file system has to be fault-tolerant: replication, checksums.
- Data-transfer bandwidth is critical (the location of data matters).
- Critical aspects: fault tolerance, replication, load balancing, monitoring.
- Exploit the parallelism afforded by splitting the parsing and counting work.
- Provision and locate the computing at the data locations.
A small example of configuring replication follows.
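As one concrete example of these knobs, HDFS exposes replication as a simple per-file setting. This sketch uses the real Hadoop Configuration/FileSystem API and the real dfs.replication property, but the file path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for newly created files: each block
        // is stored on 3 different datanodes (3 is also the HDFS default).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of an existing (hypothetical) file.
        fs.setReplication(new Path("/data/words.txt"), (short) 3);
    }
}
```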

23 MapReduce programming model
- Determine whether the problem is parallelizable and solvable using MapReduce (for example: is the data write-once-read-many (WORM)? Is the data set large?).
- Design and implement the solution as Mapper classes and a Reducer class.
- Compile the source code against the Hadoop core library.
- Package the code as an executable jar.
- Configure the application (job): the number of mapper and reducer tasks, and the input and output streams.
- Load the data (or use previously loaded data).
- Launch the job and monitor it.
- Study the result.
A driver sketch for these steps follows.
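A hedged sketch of such a driver, again following the standard Hadoop WordCount example; it wires in the Mapper and Reducer sketched earlier, and the input and output paths arrive on the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The Mapper and Reducer classes sketched earlier.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // The number of reduce tasks is configurable; the number of map
        // tasks is derived from the input splits.
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output streams (HDFS paths from the command line).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar, this would be launched with something like hadoop jar wordcount.jar WordCountDriver <input> <output>, after which the job can be monitored from Hadoop's web UI.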

24 MapReduce Characteristics
- Very large scale data: petabytes, exabytes.
- Write-once-read-many (WORM) data: allows for parallelism without mutexes.
- Map and Reduce are the main operations: simple code.
- There are other supporting operations, such as combine and partition (out of the scope of this talk).
- All the map tasks should complete before the reduce operation starts.
- Map and reduce operations are typically performed by the same physical processor.
- The numbers of map tasks and reduce tasks are configurable.
- Operations are provisioned near the data.
- Commodity hardware and storage.
- The runtime takes care of splitting and moving the data for the operations.
- A special distributed file system; example: the Hadoop Distributed File System and the Hadoop runtime.

28 What is Hadoop?
- At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose. GFS is not open source.
- Doug Cutting and Yahoo! built an open source file system modeled on the published GFS design and called it the Hadoop Distributed File System (HDFS).
- The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop. It is open source and distributed by Apache.

29 Basic Features: HDFS
- Highly fault-tolerant
- High throughput
- Suitable for applications with large data sets
- Streaming access to file system data
- Can be built out of commodity hardware
An example of streaming access follows.
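To illustrate the streaming-access point, here is a short sketch that reads an HDFS file sequentially through the FileSystem API (the calls are real Hadoop APIs; the path is hypothetical):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // HDFS is optimized for large sequential reads, not random updates:
        // open the file once and stream through it line by line.
        try (FSDataInputStream in = fs.open(new Path("/data/words.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```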
