2 Aims and Scope
- Provide an overview of join processing in MapReduce/Hadoop
  - Focus on complex join types beyond equi-joins, e.g., theta-join, similarity join, top-k join, k-NN join
  - On top of MapReduce
  - Binary joins
- Identify common techniques (at a high level) that can be used as building blocks when designing a parallel join algorithm

9 MapReduce
- A programming framework for parallel and scalable processing of huge datasets
- A job in MapReduce is defined as two separate phases
  - Function Map (also called mapper)
  - Function Reduce (also called reducer)
- Data is represented as key-value pairs
  - Map: (k1, v1) -> list(k2, v2)
  - Reduce: (k2, list(v2)) -> list(k3, v3)
- The data output by the Map phase are partitioned and sort-merged in order to reach the machines that compute the Reduce phase
  - This intermediate phase is called Shuffle
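The key-value contract above can be sketched in a few lines of Python, using word count as the classic example. The function names (map_fn, shuffle, reduce_fn) are illustrative, not Hadoop's actual API; the shuffle here is a simplified in-memory stand-in for the framework's partition/sort-merge phase.

```python
def map_fn(k1, v1):
    # Map: (k1, v1) -> list(k2, v2); here k1 is a line offset, v1 a line of text.
    return [(word, 1) for word in v1.split()]

def shuffle(mapped):
    # The framework's shuffle: group all intermediate values by key,
    # and deliver the groups to reducers in sorted key order.
    groups = {}
    for k2, v2 in mapped:
        groups.setdefault(k2, []).append(v2)
    return sorted(groups.items())

def reduce_fn(k2, values):
    # Reduce: (k2, list(v2)) -> list(k3, v3); here, sum the counts.
    return [(k2, sum(values))]

lines = ["hello world", "hello mapreduce"]
mapped = [pair for off, line in enumerate(lines) for pair in map_fn(off, line)]
result = [out for k, vs in shuffle(mapped) for out in reduce_fn(k, vs)]
# result == [("hello", 2), ("mapreduce", 1), ("world", 1)]
```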

12 MapReduce/Hadoop: 6 Basic Steps
Source: C. Doulkeridis and K. Nørvåg. A Survey of Large-Scale Analytical Query Processing in MapReduce. The VLDB Journal, June

13 1. Input Reader
- Reads data from files and transforms them to key-value pairs
- It is possible to support different input sources, such as a database or main memory
- The data form splits, which are the unit of data handled by a map task
- A typical size of a split is the size of a block (default: 64MB in HDFS, but customizable)
[Figure: documents Doc1-Doc5 containing country names (Brazil, Germany, Argentine, Chile, Greece, Japan), grouped into splits and fed to map tasks]

14 2. Map Function
- Takes as input a key-value pair from the Input Reader
- Runs the code (logic) of the Map function on the pair
- Produces as result a new key-value pair
- The results of the Map function are placed in an in-memory buffer, and written to disk when the buffer fills up (e.g., at 80%), producing spill files
- The spill files are merged into one sorted file

16 4. Partition Function
- Default: a hash function is used to partition the results of Map tasks across Reduce tasks
- Usually works well for load balancing
- However, it is often useful to employ other partition functions
  - Such a function can be user-defined
[Figure: keys (Brazil, Germany, Argentine, Chile, Greece) partitioned across reduce tasks]
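A minimal sketch of both options, assuming the country keys from the figure. The function names and the boundary values are illustrative (Hadoop's real interface is the Java Partitioner class); the range partitioner is one example of a user-defined function, useful when reducer output should stay globally sorted.

```python
def hash_partition(key, num_reducers):
    # Default behavior: hash the key modulo the number of reduce tasks.
    return hash(key) % num_reducers

def range_partition(key, boundaries):
    # User-defined example: range partitioning by key; reducer i receives
    # all keys <= boundaries[i], the last reducer gets the rest.
    for i, b in enumerate(boundaries):
        if key <= b:
            return i
    return len(boundaries)

keys = ["Brazil", "Germany", "Argentine", "Chile", "Greece"]
# With boundaries ["Ch", "Gr"]: reducer 0 gets keys up to "Ch",
# reducer 1 gets ("Ch", "Gr"], reducer 2 gets the rest.
assignment = {k: range_partition(k, ["Ch", "Gr"]) for k in keys}
```

Note that the default hash partitioner balances load only when keys hash uniformly; with skewed keys a custom partitioner can split or redirect hot keys.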

17 5. Reduce Function
- The Reduce function is called once for each distinct key and is applied to all values associated with that key
  - All pairs with the same key are processed as a group
- The input to each Reduce task is given in increasing key order
- The user can define a comparator, which will be used for sorting
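The grouping and ordering guarantees above can be sketched as follows; run_reducers is an illustrative stand-in for the framework, and key_cmp plays the role of the user-defined sort comparator.

```python
def run_reducers(pairs, reduce_fn, key_cmp=None):
    # Group all values by key (as the shuffle guarantees).
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    # Call the reduce function once per distinct key, in increasing key
    # order by default, or by the user-supplied comparator key_cmp.
    out = []
    for k in sorted(groups, key=key_cmp):
        out.extend(reduce_fn(k, groups[k]))
    return out

pairs = [("b", 2), ("a", 1), ("b", 3)]
result = run_reducers(pairs, lambda k, vs: [(k, sum(vs))])
# "a" < "b", so key "a" is reduced first: [("a", 1), ("b", 5)]
```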

18 6. Output Writer
- Responsible for writing the output to secondary storage (disk)
- Usually, this is a file
- However, it is possible to modify the function to store data elsewhere, e.g., in a database

23 Map-side Join
- Operates on the map side; no reduce phase
- Requirements for each input
  - Divided into the same number of partitions
  - Sorted by the join key
  - Has the same number of keys
- Advantages
  - Very efficient (no intermediate data, no shuffling; simply scan and join)
- Disadvantages
  - Very strict requirements; in practice an extra MR job is necessary to prepare the inputs
  - High memory requirements (buffers both input partitions)
Source: T. White. Hadoop: The Definitive Guide
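Once both inputs satisfy the requirements (identically partitioned, sorted by the join key), each map task can join its pair of partitions with a single synchronized scan. A minimal sketch of that merge scan, with illustrative names:

```python
def map_side_join(part_r, part_l):
    # part_r, part_l: lists of (key, value), both sorted by the join key.
    out, i, j = [], 0, 0
    while i < len(part_r) and j < len(part_l):
        kr, kl = part_r[i][0], part_l[j][0]
        if kr < kl:
            i += 1
        elif kr > kl:
            j += 1
        else:
            # Matching key: emit the cross product of all R and L tuples
            # sharing it, then advance both scans past the key.
            i2 = i
            while i2 < len(part_r) and part_r[i2][0] == kr:
                j2 = j
                while j2 < len(part_l) and part_l[j2][0] == kr:
                    out.append((kr, part_r[i2][1], part_l[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

r = [(1, "r1"), (2, "r2"), (2, "r2b")]
l = [(2, "l1"), (3, "l2")]
result = map_side_join(r, l)
# key 2 matches, producing (2, "r2", "l1") and (2, "r2b", "l1")
```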

25 Reduce-side Join (a.k.a. Repartition Join)
- The join computation is performed on the reduce side
- Advantages
  - The most general method (similar to a traditional parallel join)
  - Smaller memory footprint (only the tuples of one dataset with the same key are buffered)
- Disadvantages
  - Cost of shuffling: I/O and communication costs for transferring data to reducers
  - High memory requirements for skewed data
Source: T. White. Hadoop: The Definitive Guide
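A minimal sketch of the repartition join, with the shuffle collapsed into an in-memory grouping (names are illustrative): mappers tag each tuple with its source dataset, and each reducer buffers only the R tuples of its key while streaming the L tuples past them.

```python
def tag_map(dataset_name, records):
    # Map phase: emit (join_key, (source_tag, value)) for every tuple.
    return [(key, (dataset_name, value)) for key, value in records]

def join_reduce(key, tagged_values):
    # Buffer only one dataset's tuples for this key (the memory footprint),
    # then stream the other dataset's tuples against the buffer.
    r_side = [v for tag, v in tagged_values if tag == "R"]
    return [(key, rv, lv)
            for tag, lv in tagged_values if tag == "L"
            for rv in r_side]

def repartition_join(r, l):
    # Stand-in for the shuffle: group tagged tuples by join key.
    groups = {}
    for k, tv in tag_map("R", r) + tag_map("L", l):
        groups.setdefault(k, []).append(tv)
    out = []
    for k in sorted(groups):
        out.extend(join_reduce(k, groups[k]))
    return out

result = repartition_join([(1, "r1"), (2, "r2")], [(2, "l1"), (2, "l2")])
# key 1 has no L tuples; key 2 joins "r2" with both "l1" and "l2"
```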

29 Equi-Joins: Broadcast Join
- Advantages
  - Efficient: map-phase only
  - Loads the small dataset R into a hash table: fast probes
  - No pre-processing (e.g., sorting)
- Disadvantages
  - One dataset must be quite small
    - Must fit in memory
    - Must be distributed to all mappers
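A sketch of what each mapper does, assuming the small dataset R has already been replicated to every node (in Hadoop, e.g., via the distributed cache); the function name is illustrative.

```python
def broadcast_join(small_r, l_split):
    # Build phase: load the broadcast dataset R into an in-memory hash table.
    table = {}
    for k, v in small_r:
        table.setdefault(k, []).append(v)
    # Probe phase: a single pass over this mapper's split of the large
    # dataset L, probing the hash table for each tuple.
    return [(k, rv, lv) for k, lv in l_split for rv in table.get(k, [])]

r = [(1, "dept-A"), (2, "dept-B")]          # small dataset, fits in memory
l = [(2, "emp-x"), (3, "emp-y"), (1, "emp-z")]  # one split of the large dataset
result = broadcast_join(r, l)
```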

31 Equi-Joins: Semi-join
- Advantages
  - When R is large, reduces the size of intermediate data and communication costs
  - Useful when many tuples of one dataset do not join with tuples of the other
- Disadvantages
  - Three MR jobs: overheads of job initialization, communication, local and remote I/O
  - Multiple accesses to the datasets
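A sketch of the three phases collapsed into plain functions, one per MR job (a common formulation of the semi-join; the function names and the choice of a broadcast join in the last phase are illustrative): extract the distinct join keys of L, filter R down to the tuples whose key appears in L, then join the filtered R with L.

```python
def phase1_unique_keys(l):
    # Job 1: extract the set of distinct join keys appearing in L.
    return {k for k, _ in l}

def phase2_filter(r, l_keys):
    # Job 2: keep only the R tuples whose key joins with some L tuple.
    return [(k, v) for k, v in r if k in l_keys]

def phase3_join(r_filtered, l):
    # Job 3: join the (now much smaller) filtered R with L,
    # e.g., by broadcasting it and probing a hash table.
    table = {}
    for k, v in r_filtered:
        table.setdefault(k, []).append(v)
    return [(k, rv, lv) for k, lv in l for rv in table.get(k, [])]

r = [(1, "r1"), (2, "r2"), (9, "r9")]
l = [(2, "l1"), (2, "l2")]
r_small = phase2_filter(r, phase1_unique_keys(l))  # only key 2 survives
result = phase3_join(r_small, l)
```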

33 Equi-Joins: Per-Split Semi-join
- Advantages (compared to Semi-join)
  - Makes the 3rd phase cheaper, as it moves only the records of R that will join with each split of L
- Disadvantages (compared to Semi-join)
  - The first two phases are more complicated

36 Join Matrix Representation
- M(i,j) = true, if the i-th tuple of S and the j-th tuple of T satisfy the join condition
- Problem: find a mapping of join-matrix cells to reducers that minimizes job completion time
  - Each join output tuple should be produced by exactly one reducer
Source: Okcan et al., SIGMOD 11
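A small sketch of the matrix view and one simple cell-to-reducer mapping: tiling the matrix into rectangular regions, one per reducer, so every cell (and hence every output tuple) belongs to exactly one reducer. The tiling below is a simplified illustration of the idea, not the paper's actual partitioning algorithm.

```python
def build_matrix(s, t, theta):
    # M[i][j] is True iff tuple i of S and tuple j of T satisfy theta.
    return [[theta(si, tj) for tj in t] for si in s]

def tile_regions(n_rows, n_cols, row_tiles, col_tiles):
    # Map each cell (i, j) to one reducer id by tiling the matrix into
    # row_tiles x col_tiles rectangular regions.
    rh = -(-n_rows // row_tiles)   # ceiling division: tile height
    cw = -(-n_cols // col_tiles)   # tile width
    return {(i, j): (i // rh) * col_tiles + (j // cw)
            for i in range(n_rows) for j in range(n_cols)}

s, t = [1, 2, 3, 4], [10, 20, 30, 40]
m = build_matrix(s, t, lambda a, b: a * 10 < b)   # an example theta condition
regions = tile_regions(len(s), len(t), 2, 2)      # 4 reducers, one per region
# every true cell of m is covered by exactly the reducer regions[(i, j)]
```

Balancing the number of true cells per region, rather than the region areas, is what the paper's partitioning algorithms optimize to minimize job completion time.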

39 In the Paper
- How to perform near-optimal partitioning with strong optimality guarantees
  - For the cross product S x T
- More efficient algorithms (M-Bucket-I, M-Bucket-O) when more statistics are available
Source: Okcan et al., SIGMOD 11

65 Analysis of Join Processing
Source: C. Doulkeridis and K. Nørvåg. A Survey of Large-Scale Analytical Query Processing in MapReduce. The VLDB Journal, June

66 Summary and Open Research Directions

67 Summary
- Processing joins over Big Data in MapReduce is a challenging task
  - Especially for complex join types, large-scale data, skewed datasets
- However, focus on the techniques(!), not the platform
- New techniques are required
- Old techniques (parallel joins) need to be revisited to accommodate the needs of new platforms (such as MapReduce)
- A window of opportunity to conduct research in an interesting topic
