5 Distributed File Systems
Past: most computing was done on a single processor, with
- one main memory
- one cache
- one local disk
New challenges:
- Files must be stored redundantly: if one node fails, all of its files would otherwise be unavailable until the node is replaced (see File Management)
- Computations must be divided into tasks, so that a task can be restarted without affecting other tasks
- Use of commodity hardware

9 Distributed File Systems - Parallel Computing Architecture
Large-scale file-system organisation:
- Files are divided into chunks (typically 16-64 MB in size)
- Chunks are replicated n times (default in HDFS: n = 3) on n different nodes; optimally, replicas are placed on different racks to improve fault tolerance
- How to find files?
  - A master node holds a meta-file (directory) with the locations of all copies of every file -> all participants using the DFS know where the copies are located
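To make the directory idea concrete, here is a minimal sketch of such a chunk-location lookup. The class and method names (ChunkInfo, MasterNode, locate) are illustrative and not the actual GFS/HDFS master (NameNode) API.

    # Minimal sketch of a DFS master node's chunk directory (illustrative only;
    # a real master also handles heartbeats, leases, re-replication, etc.).
    from dataclasses import dataclass

    @dataclass
    class ChunkInfo:
        chunk_id: str
        replicas: list[str]   # node addresses holding a copy, ideally on different racks

    class MasterNode:
        def __init__(self):
            # file name -> ordered list of its chunks
            self.directory: dict[str, list[ChunkInfo]] = {}

        def register_file(self, name: str, chunks: list[ChunkInfo]) -> None:
            self.directory[name] = chunks

        def locate(self, name: str) -> list[ChunkInfo]:
            """Return every chunk of a file together with all replica locations."""
            return self.directory[name]

    master = MasterNode()
    master.register_file("web_log.txt", [
        ChunkInfo("web_log.txt-0", ["rack1-node3", "rack2-node7", "rack3-node1"]),
        ChunkInfo("web_log.txt-1", ["rack1-node5", "rack2-node2", "rack3-node4"]),
    ])
    print(master.locate("web_log.txt"))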

13 Motivation: Large-Scale Data Processing
In general, MapReduce:
- can be used to manage large-scale computations in a way that is tolerant of hardware faults
- lets the system itself manage automatic parallelisation and distribution, I/O scheduling, and the coordination of the tasks implemented in map() and reduce(), and cope with unexpected system failures or stragglers
- has several implementations: Google's internal implementation and the open-source implementation Hadoop (using HDFS)

15 Programming Model - Recap: Functional Programming
- MapReduce is inspired by similar primitives in LISP, SML, Haskell and other languages
- The general idea of higher-order functions (map and fold) in functional programming (FP) languages carries over to MapReduce:
  - map in MapReduce <-> map in FP
  - reduce in MapReduce <-> fold in FP
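As a reminder of the two FP primitives, a minimal Python sketch follows; Python merely stands in for any language with map and fold, and the word list and pair shape are illustrative.

    # Functional-programming view of the two primitives MapReduce borrows:
    # map applies a function to every element, fold/reduce combines elements into one value.
    from functools import reduce

    words = ["to", "be", "or", "not", "to", "be"]

    # map in FP: transform every word into a (key, value) pair, as a MapReduce mapper would
    pairs = list(map(lambda w: (w, 1), words))

    # fold in FP: collapse a list of values into a single result, as a MapReduce reducer would
    total = reduce(lambda acc, pair: acc + pair[1], pairs, 0)

    print(pairs)   # [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
    print(total)   # 6 words in total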

18 Programming Model - General Processing of MapReduce
1. Chunks from a DFS are attached to Map tasks, turning each chunk into a sequence of key-value pairs.
2. The key-value pairs are collected by a master controller and sorted by key; the keys are divided among all Reduce tasks.
3. Reduce tasks work on each key separately and combine all the values associated with a specific key.
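A minimal, self-contained word-count sketch of the two user-supplied functions; the small driver that groups pairs by key only stands in for the framework's sort and shuffle and is not Hadoop's API.

    from collections import defaultdict

    def map_fn(chunk: str):
        """Map task: turn one chunk of text into (word, 1) key-value pairs."""
        for word in chunk.split():
            yield (word.lower(), 1)

    def reduce_fn(key: str, values: list[int]):
        """Reduce task: combine all values associated with one key."""
        return (key, sum(values))

    # Tiny stand-in for the framework: run maps, group by key, run reduces.
    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    grouped = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            grouped[key].append(value)

    results = [reduce_fn(key, values) for key, values in sorted(grouped.items())]
    print(results)  # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]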

22 Programming Model - General Processing: Partitioning the Input Data
- Data files are divided into blocks (default in GFS/HDFS: 64 MB) and replicas of each block are stored on different nodes
- The master schedules map() tasks in close proximity to the stored data:
  - map() tasks are executed on the same machine where one replica of the input block is stored (or at least on the same rack, so communication goes through a single network switch)
- Goal: conserve network bandwidth (cf. Grid Computing) -> input data is read at local-disk speed rather than being limited by the rack switches
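A minimal sketch of such a locality preference when the master picks a worker for a map task; the node and rack names and the three-level preference order are illustrative, not the actual scheduler code.

    def pick_worker(replica_nodes: list[str], idle_workers: list[str],
                    rack_of: dict[str, str]) -> str:
        """Prefer a worker holding a replica, then one on the same rack, then any idle worker."""
        # 1. data-local: the worker itself stores a replica of the block
        for worker in idle_workers:
            if worker in replica_nodes:
                return worker
        # 2. rack-local: the worker sits on the same rack as some replica
        replica_racks = {rack_of[node] for node in replica_nodes}
        for worker in idle_workers:
            if rack_of[worker] in replica_racks:
                return worker
        # 3. fall back to any idle worker (data must cross the core switch)
        return idle_workers[0]

    rack_of = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack3"}
    print(pick_worker(replica_nodes=["n1", "n3"], idle_workers=["n2", "n4"],
                      rack_of=rack_of))  # n2 (rack-local)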

24 Programming Model - General Processing: Shuffle and Sort (performing the group-by-key step)
- The input to every reducer is sorted by key
- Shuffle: sort the map outputs and transfer them to the reducers as input
- Mappers need to separate output intended for different reducers
- Reducers need to collect their data from all(!) mappers
- Keys at each reducer are processed in order
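How a mapper decides which reducer a pair is intended for is typically a hash partition over the key; a minimal sketch follows (Hadoop's default partitioner works along these lines, but this is not its actual implementation).

    import zlib
    from collections import defaultdict

    def partition(key: str, num_reducers: int) -> int:
        """Assign a key to a reducer; every mapper uses the same function,
        so all values for one key end up at the same reducer.
        crc32 is used because Python's built-in hash() is salted per process."""
        return zlib.crc32(key.encode()) % num_reducers

    NUM_REDUCERS = 3
    map_output = [("fox", 1), ("the", 1), ("dog", 1), ("the", 1)]

    # Each mapper splits its output into one bucket per reducer ...
    buckets = defaultdict(list)
    for key, value in map_output:
        buckets[partition(key, NUM_REDUCERS)].append((key, value))

    # ... and each reducer later fetches its bucket from every mapper and sorts it by key.
    for reducer_id, pairs in sorted(buckets.items()):
        print(reducer_id, sorted(pairs))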

28 Programming Model - General Processing: Handling Machine Failures and Stragglers
- Stragglers: slow workers that lengthen the completion time of a job
  - Causes: hardware degradation, software misconfiguration, ...
- Close to completion of the job, backup copies of the remaining in-progress tasks are created
- If a task is running slower than expected, another equivalent task is launched as a backup -> speculative execution of tasks
- When one of the tasks completes successfully, any duplicate tasks are killed
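A minimal sketch of the speculative-execution decision; the progress-rate bookkeeping and the slowdown threshold are illustrative assumptions, real schedulers use richer per-job heuristics.

    def needs_backup(task_progress: float, task_elapsed: float,
                     avg_progress_rate: float, slowdown_factor: float = 2.0) -> bool:
        """Launch a backup if this task is progressing much slower than the average task.
        slowdown_factor is an illustrative threshold, not a fixed MapReduce constant."""
        if task_elapsed == 0:
            return False
        rate = task_progress / task_elapsed
        return rate < avg_progress_rate / slowdown_factor

    # Example: most tasks finish their work in ~10 s (rate 0.1/s);
    # one straggler has only done 20% after 30 s -> launch a backup for it.
    print(needs_backup(task_progress=0.2, task_elapsed=30.0, avg_progress_rate=0.1))  # True
    print(needs_backup(task_progress=0.9, task_elapsed=10.0, avg_progress_rate=0.1))  # False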

31 Programming Model - General Processing - Workflow
Workflow of MapReduce (as originally implemented by Google):
1. Initiate the MapReduce environment on a cluster of machines
2. One machine becomes the master; the rest are workers that are assigned tasks by the master
3. A map task reads the contents of an input split and passes them to the MAP function; the results are buffered in memory
4. The buffered (key, value) pairs are written to local disk; the locations of these (intermediate) files are passed back to the master
5. A reduce worker that has been notified by the master uses remote procedure calls to read the buffered data
6. The reduce worker iterates over the sorted intermediate (key, value) pairs and passes them to the REDUCE function
-> On completion of all tasks, the master notifies the user program.
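Steps 3-6 can be mimicked in a few lines of Python; the file layout, the JSON format and the plain file read that stands in for the remote procedure call are all illustrative assumptions, not the real protocol.

    import json, os, tempfile, zlib
    from collections import defaultdict

    workdir = tempfile.mkdtemp()
    master_index = defaultdict(list)      # reducer id -> locations of intermediate files

    def run_map_task(task_id: int, split: str, num_reducers: int = 2) -> None:
        pairs = [(w, 1) for w in split.split()]          # step 3: MAP, buffered in memory
        for r in range(num_reducers):                    # step 4: spill partitions to local disk
            part = [p for p in pairs if zlib.crc32(p[0].encode()) % num_reducers == r]
            path = os.path.join(workdir, f"map{task_id}-part{r}.json")
            with open(path, "w") as f:
                json.dump(part, f)
            master_index[r].append(path)                 # report file location to the master

    def run_reduce_task(reducer_id: int) -> dict:
        grouped = defaultdict(int)
        for path in master_index[reducer_id]:            # step 5: "RPC" read of intermediate data
            with open(path) as f:
                for key, value in json.load(f):
                    grouped[key] += value                # step 6: REDUCE per key
        return dict(sorted(grouped.items()))

    run_map_task(0, "the quick brown fox")
    run_map_task(1, "the lazy dog and the fox")
    print(run_reduce_task(0), run_reduce_task(1))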

38 Example #2: k-means - Classification Step as Map
Classify: assign each point to the nearest centre.
Map:
- Input:
  - the subset of d-dimensional objects handled by each mapper
  - the initial set of centroids
- Output:
  - a list of objects assigned to their nearest centroid; this list will later be read by the reducer program
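A minimal sketch of this classification step as a map function (plain Python rather than an actual Hadoop mapper; squared Euclidean distance is assumed as the distance measure).

    def nearest_centroid(point: list[float], centroids: list[list[float]]) -> int:
        """Return the index of the centroid closest to the point (Euclidean distance)."""
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))

    def kmeans_map(points: list[list[float]], centroids: list[list[float]]):
        """Map step: emit (centroid index, point) pairs; the reducer later averages
        the points of each centroid to produce the updated centroids."""
        for p in points:
            yield (nearest_centroid(p, centroids), p)

    centroids = [[0.0, 0.0], [5.0, 5.0]]
    points = [[0.5, 1.0], [4.0, 5.5], [0.2, 0.1]]
    print(list(kmeans_map(points, centroids)))
    # [(0, [0.5, 1.0]), (1, [4.0, 5.5]), (0, [0.2, 0.1])]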
