4 Introduction
- Huge datasets of heterogeneous data are available and growing fast
- Most of the world's data were created in the last 2 years (IBM)
- 2.5 exabytes of data are created every day
- Walmart collects 2.5 petabytes of data every hour
- 340 million tweets are sent every day

6 Terminology
Batch data
- Static snapshot of a data set
- A batch computation has a start and an end
- Fast processing of complete datasets
Stream data
- Stream of events that flows into the system at a data rate over which we have no control
- A stream computation never ends
- The processing system must keep up with the event rate or degrade gracefully
- Near-real-time answers

11 MapReduce Phases
Data placement
- Data are split into storage blocks
- The first replica is located on the same node as the client
- The second replica is placed on a different rack, chosen at random
- The third replica is placed on the same rack as the second, but on a different node
- Balancer daemon
Input reader
- Input data can be retrieved from several data sources (file system, database, main memory)
- Data are split into FileSplits, the unit of data processed by a map task
- Storage blocks by default
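The replica placement policy above can be sketched as a small function. This is a minimal illustration of the stated rules, not HDFS's actual placement code; the node and rack names are hypothetical.

```python
import random

def place_replicas(client_node, racks):
    """Choose three replica locations following the policy above.
    `racks` maps a rack name to its list of node names (all hypothetical)."""
    # First replica: on the same node as the client.
    first = client_node
    client_rack = next(r for r, nodes in racks.items() if client_node in nodes)
    # Second replica: on a node in a different rack, chosen at random.
    second_rack = random.choice([r for r in racks if r != client_rack])
    second = random.choice(racks[second_rack])
    # Third replica: same rack as the second, but a different node.
    third = random.choice([n for n in racks[second_rack] if n != second])
    return [first, second, third]
```

Spreading the second and third replicas over a second rack trades some write bandwidth for tolerance of a whole-rack failure.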

12 MapReduce Phases
Map function
- Mandatory function
- A new map task is created per FileSplit (block)
- The user cannot control the number of mappers
- Each FileSplit is divided into records, and the map processes each <key, value> record in turn
- The map function outputs the result as a new <key, value> pair
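A map function consumes one <key, value> record and emits new <key, value> pairs. As a sketch, the classic word-count mapper (the example is ours, not from the slides):

```python
def map_fn(key, value):
    """Map one record: `key` is the record offset within the FileSplit,
    `value` is the record text. Emits a <word, 1> pair per word."""
    for word in value.split():
        yield (word.lower(), 1)
```

Each map task applies this function to every record of its FileSplit independently, which is what makes the phase trivially parallel.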

13 MapReduce Phases
Combiner function
- Performs partial merging of the data before sending them over the network
- Executed on each machine that performs a map task
- Usually the same code as the reducer function
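For an associative operation like summing counts, the combiner is just the reducer logic run locally over a mapper's output. A minimal sketch:

```python
from collections import defaultdict

def combine(pairs):
    """Partially merge <key, value> pairs on the mapper side, so far
    fewer bytes cross the network; same logic as a sum reducer."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())
```

Running this over the pairs a single mapper produced collapses repeated keys locally before the shuffle.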

14 MapReduce Phases
Shuffle and Sort phase
- Map outputs are partitioned by key, transferred to the reducers, and merged in sorted key order

15 MapReduce Phases
Reduce function
- Merges the map outputs
- The number of reducers can be set by the user
- The reduce function is invoked once for each distinct intermediate key
- Pairs with the same key are processed as one group
- The input to each reduce task is guaranteed to be processed in increasing key order
Output writer
- Responsible for writing the output to stable storage
- The target data storage can be customized
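The whole pipeline (map, shuffle-and-sort, reduce once per distinct key in increasing key order) can be simulated in a few lines. This is a single-process sketch for illustration only, with a word-count mapper and a sum reducer as the example job:

```python
from itertools import groupby
from operator import itemgetter

def map_words(key, text):
    for word in text.split():
        yield (word, 1)

def reduce_sum(key, values):
    yield (key, sum(values))

def run_job(records, mapper, reducer):
    """Tiny in-memory MapReduce: map every record, shuffle-and-sort
    by key, then invoke the reducer once per distinct key in order."""
    intermediate = sorted(
        (pair for rec in records for pair in mapper(*rec)),
        key=itemgetter(0))
    result = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        result.extend(reducer(key, [v for _, v in group]))
    return result
```

The `sorted` + `groupby` pair plays the role of the shuffle-and-sort phase: it is exactly what guarantees that each reducer sees one key group at a time, in increasing key order.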

16 MapReduce Weaknesses and Limitations
Large files optimization
- How to deal with images? HIPI
Data format management
- Optimized for text inputs; HIPI
Selective access to data
- Hadoop++ provides indexing functionality
- Non-intrusive: indexes are created at data load time and thus add no penalty at query time
- The schema and the MapReduce jobs must be known in advance
High communication cost
- CoHadoop: related data are stored on the same node
- HDFS is extended with a file-level property

18 MapReduce Weaknesses and Limitations
Load balancing
- The runtime of the slowest machine easily dominates the total runtime
- Plain partitioning schemes that are not data-aware do not give good results
- Even when the data are split equally across the available machines, equal runtimes are not guaranteed
Real-time processing
- MapReduce runs on a static snapshot of a data set; the input data set cannot change
- No reducer's input is ready until all mappers have finished
- A MapReduce computation has a start and an end
- MapReduce Online addresses this
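The "slowest machine dominates" point is easy to quantify: a parallel phase finishes only when its largest partition does. A toy model (the per-record cost is an assumption for illustration):

```python
def job_runtime(partition_sizes, time_per_record=1.0):
    """Runtime of a parallel phase is set by the slowest task, so one
    skewed partition dominates no matter how many machines are idle."""
    return max(partition_sizes) * time_per_record

balanced = job_runtime([25, 25, 25, 25])  # 100 records, evenly split
skewed = job_runtime([70, 10, 10, 10])    # same 100 records, skewed
```

Both jobs process the same 100 records on 4 machines, yet the skewed split takes 70 time units against 25, which is why data-aware partitioning matters.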

22 MapReduce Online Main Modifications
- Map tasks were modified to push data to the reducers
- Map buffer with a fixed threshold
- Combiners are applied over the buffered data
- Buffered data are sorted and written to disk
- Files are registered with the TaskTracker
- The TaskTracker sends the files to the reducer as soon as possible
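The buffer-threshold-combine-sort-push sequence above can be sketched as follows. This is our simplified model, not the actual Hadoop patch; `send` stands in for the TaskTracker shipping a spill to the reducer:

```python
from collections import defaultdict

class PipeliningMapper:
    """Map outputs accumulate in a buffer; when a fixed threshold is
    reached, the combiner runs over the buffer, the result is sorted,
    and the spill is pushed toward the reducer right away."""
    def __init__(self, send, threshold=4):
        self.send = send            # callable standing in for the TaskTracker
        self.threshold = threshold  # fixed buffer threshold
        self.buffer = []

    def emit(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        combined = defaultdict(int)  # combiner over the buffered data
        for k, v in self.buffer:
            combined[k] += v
        self.send(sorted(combined.items()))  # sorted spill, sent early
        self.buffer = []
```

Because spills leave the mapper before the map task finishes, the reducer can start consuming (and producing snapshots) long before the map phase ends.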

23 MapReduce Online Main Characteristics
Online aggregation
- The reduce function is applied over the pipelined map outputs
- Snapshots are stored in HDFS
- Snapshots can be used as inputs for the next task
Iteration
- Reducers can pipeline their output to the next map operator, avoiding HDFS storage
- The JobTracker was modified to accept a list of jobs

24 MapReduce Online Main Characteristics
Continuous queries
- Mappers and reducers are fixed
- Reducers are configured to execute periodically
- Map outputs are maintained in a buffer with a unique id
- The reducer informs the JobTracker when its task is finished
- The JobTracker informs the mappers that the data are no longer necessary

25 Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Main goal
- Treat a streaming computation as a series of deterministic batch computations on small time intervals
- Data are received and stored in intervals
Model advantages
- Easy to unify with batch systems: users only need to write one version of their analytic task
- Fault tolerant, with recovery mechanisms similar to those of batch systems
- Consistency is well defined, since each record is processed atomically within the interval in which it arrives
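The core idea can be shown in a few lines: bucket a timestamped event stream into fixed intervals, then run an ordinary deterministic batch computation on each bucket. A minimal sketch (the per-interval computation here is just a count):

```python
def dstream(events, interval):
    """Group a timestamped stream into fixed intervals and treat each
    interval as a small deterministic batch job (the D-Streams idea).
    `events` are (timestamp, value) pairs in arrival order."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // interval), []).append(value)
    # one deterministic batch computation per interval: here, a count
    return {slot: len(vals) for slot, vals in sorted(batches.items())}
```

Because each interval's result is a pure function of the interval's input data, a lost batch can simply be recomputed, which is where the batch-style fault tolerance comes from.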

27 Muppet: MapReduce-Style Processing of Fast Data
Framework specifically developed for fast data
Components
- Event <stream_id, timestamp, key, value>
- Stream: a sequence of events with the same stream_id, in increasing order of timestamp
- Map function: map(event) = event*; memoryless
- Update function: update(event, slate) = event*
Slate
- A slate is determined by the tuple <update U, key k>
- The slate for <U, k> is an in-memory data structure that summarizes all events with key k that the update function U has seen so far
- Time-to-live parameter
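The update/slate contract above can be sketched as follows. Field names and the running-count summary are our assumptions for illustration; a real slate would summarize whatever the application needs:

```python
class Slate:
    """In-memory summary of all events with one key seen by one update
    function; here it just keeps a running count."""
    def __init__(self):
        self.count = 0

def update(event, slate):
    """update(event, slate) -> event*: fold the event into the slate and
    optionally emit follow-up events (none in this sketch)."""
    slate.count += 1
    return []

# one slate per <update, key>; events are <stream_id, timestamp, key, value>
slates = {}
for event in [("clicks", 1, "user42", None), ("clicks", 2, "user42", None)]:
    stream_id, ts, key, value = event
    slate = slates.setdefault(key, Slate())
    update(event, slate)
```

Unlike the memoryless map function, update carries state across events via the slate, which is what lets Muppet compute running aggregates over a stream.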

28 Muppet: MapReduce-Style Processing of Fast Data
Distributed execution
- The workflow is modeled as a directed graph
- Muppet starts a set of workers on each machine
- A hash function is used to distribute events
- A special mapper is used to read from the input stream
Slates
- All events with the same key go to the same updater
- Key-value storage (Cassandra): slates may outgrow memory
- Persistent slates help recover the application from crashes
- Slates can be queried long after the application terminates

29 Muppet: MapReduce-Style Processing of Fast Data
Handling failures
- A worker A determines the worker B to send an event to by hashing the key and destination updater function of the event
- If A cannot contact B, it assumes the machine has failed and reports this to the master
- The master broadcasts the machine failure to all workers
- The hash function is updated
- If an updater fails, its temporary slate data are lost
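Routing by hashing the key plus the destination updater, and rerouting after a failure, can be sketched as below. The hashing scheme and names are illustrative assumptions, not Muppet's actual implementation:

```python
import hashlib

def pick_worker(key, updater, workers):
    """Route an event by hashing its (key, updater) pair onto the
    current list of live workers."""
    digest = hashlib.md5(f"{updater}:{key}".encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]

workers = ["w0", "w1", "w2"]
before = pick_worker("user42", "count_clicks", workers)
workers.remove("w1")  # master broadcast: w1 failed, update the hash space
after = pick_worker("user42", "count_clicks", workers)
```

Note that with this naive modulo scheme a failure remaps many keys, which is exactly why in-memory slate data on a failed updater are lost unless the slate was persisted to the key-value store.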

30 ZooKeeper: Wait-free coordination for Internet-scale systems
- Centralized service to coordinate distributed processes
- Shared hierarchical namespace of data registers (znodes)
- Data are kept in memory
- Znodes are limited in the amount of data they can hold
- The service is replicated over a set of machines
[Figure: znode tree with /cluster_1 and /cluster_2 under the root, /cluster_1 holding children node01, node02 and node03, and a node storing a small payload such as ip: x.x, port: 8045]
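The hierarchical namespace in the figure can be mimicked with a toy in-memory tree. This is not a ZooKeeper client, just a sketch of the znode layout shown above:

```python
class ZNodeTree:
    """Toy hierarchical namespace of data registers (znodes)."""
    def __init__(self):
        self.nodes = {"/": None}

    def create(self, path, data=None):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        self.nodes[path] = data  # znodes hold only a small payload

    def get(self, path):
        return self.nodes[path]

    def children(self, path):
        parent = "" if path == "/" else path
        return sorted(p for p in self.nodes
                      if p != "/" and p.rsplit("/", 1)[0] == parent)

# the layout from the figure
tree = ZNodeTree()
tree.create("/cluster_1")
tree.create("/cluster_2")
tree.create("/cluster_1/node01", {"ip": "x.x", "port": 8045})
tree.create("/cluster_1/node02")
tree.create("/cluster_1/node03")
```

The real service adds what this sketch omits: replication across an ensemble, ordered updates, and watches that notify clients when a znode changes.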

31 Storm: Distributed and fault-tolerant realtime computation
Storm cluster
- Master node: the Nimbus daemon is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures
- Worker nodes: the Supervisor daemon listens for work assigned to its machine and starts and stops worker processes as necessary, based on what Nimbus has assigned to it
- Communication between Nimbus and the Supervisors goes through ZooKeeper
[Figure: Nimbus coordinating several Supervisor nodes through a ZooKeeper ensemble]

32 Storm Components
- Storm runs topologies: graphs of computation in which each node contains processing logic
- Stream: unbounded sequence of tuples
- Spout: reads input data from an external source and emits them as a stream; capable of replaying a tuple
- Bolt: consumes input streams, does some processing, and emits new streams
[Figure: topology with spouts feeding chains of bolts]
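A spout-to-bolt chain like the one in the figure can be sketched with plain generators. This is a single-process illustration of the dataflow, not Storm's API:

```python
class Spout:
    """Reads from an external source and emits tuples as a stream;
    here the 'source' is just a list standing in for a real feed."""
    def __init__(self, records):
        self.records = records

    def stream(self):
        yield from self.records

def split_bolt(tuples):
    """Bolt: input stream -> new stream (split sentences into words)."""
    for sentence in tuples:
        yield from sentence.split()

def count_bolt(tuples):
    """Terminal bolt: fold the word stream into running counts."""
    counts = {}
    for word in tuples:
        counts[word] = counts.get(word, 0) + 1
    return counts

# wire the topology: spout -> split_bolt -> count_bolt
result = count_bolt(split_bolt(Spout(["a b", "b c"]).stream()))
```

In a real topology each node runs as many parallel tasks across the cluster, with Storm routing tuples between them; the chaining of streams is the same idea.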

34 Storm Fault Tolerance
- If a worker process dies, the Supervisor restarts it
- If a node dies, Nimbus reassigns its tasks to another machine
- If a daemon (Nimbus or Supervisor) dies, it is restarted; the state of Nimbus and the workers is saved in ZooKeeper
- Storm guarantees that each message will be fully processed: a tuple is considered "fully processed" when its tuple tree has been completely processed
- The user must specify the links in the tuple tree (anchoring) and when an individual tuple is done (acking)
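Storm tracks "fully processed" with an XOR trick: every tuple id in the tree is XORed into one value when the tuple is anchored and again when it is acked, so the value returns to zero exactly when every tuple has been both created and completed. A minimal sketch of that bookkeeping (the fixed ids are illustrative; Storm uses random 64-bit ids):

```python
class AckerEntry:
    """XOR-based acker state for one spout tuple's tree."""
    def __init__(self):
        self.val = 0

    def anchor(self, tuple_id):
        self.val ^= tuple_id  # a new tuple joins the tree

    def ack(self, tuple_id):
        self.val ^= tuple_id  # the tuple is done
        return self.val == 0  # zero <=> every anchored tuple was acked

entry = AckerEntry()
root, child = 0x9A3F, 0x51C2
entry.anchor(root)
entry.anchor(child)
done_after_child = entry.ack(child)  # tree not yet complete
done_after_root = entry.ack(root)    # now fully processed
```

The appeal of the scheme is that the acker needs constant memory per spout tuple regardless of how large the tuple tree grows.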

40 Conclusions
- Systems are typically developed to solve a specific problem
- There is a lack of heterogeneous systems
- Attempting to build a general-purpose platform for both batch and stream computing would result in a highly complex system that may end up optimal for neither task
