Oracle Blog

The State of the Dolphin...

Thursday Jul 26, 2012

Introduction

"Improving MySQL performance using Hadoop" was the talk which I and Manish Kumar gave at Java One & Oracle Develop 2012, India. Based on the response and interest of the audience, we decided to summarize the talk in a blog post. The slides of this talk can be found here. They also include a screen-cast of a live Hadoop system pulling data from MySQL and working on the popular 'word count' problem.

MySQL and Hadoop are popularly described as 'friends with benefits', and our talk was aimed at showing how!

The benefits of MySQL to developers are speed, reliability, data integrity and scalability. It can successfully process large amounts of data (up to terabytes). But applications that require massively parallel processing may need the benefits of a parallel processing framework such as Hadoop.

Hadoop is a highly scalable distributed framework and extremely powerful in terms of computation. It is fault tolerant and parallelizes data processing across many nodes. Popular users of Hadoop include Yahoo, Facebook, Twitter and Netflix.

Combine the speed and reliability of MySQL with the high scalability of Hadoop and what you get is a powerful system capable of using relational data and crunching it at tremendous speeds!

Hadoop

Hadoop leverages its distributed file system - the Hadoop Distributed File System (HDFS) - to cheaply and reliably replicate chunks of data. HDFS is a distributed, scalable and portable file system written in Java. A Hadoop instance typically has a single name-node, and the cluster of data-nodes forms the HDFS cluster. HDFS uses TCP/IP for communication and stores large files across multiple machines. One drawback of this design is that the single name-node keeps all the file-system metadata in memory, which makes it both a memory bottleneck and a single point of failure.
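Once a cluster is up, day-to-day interaction with HDFS goes through the `hadoop fs` shell. A few representative commands (the paths are illustrative):

```shell
# Copy a local file into HDFS; it is split into blocks (64 MB by default
# in Hadoop 1.x) and each block is replicated, 3 copies by default.
hadoop fs -put /tmp/access.log /user/hadoop/logs/access.log

# List the directory; for files, the second column shows the replication factor.
hadoop fs -ls /user/hadoop/logs

# Read the file back: the name-node supplies the block locations,
# the data-nodes stream the actual bytes.
hadoop fs -cat /user/hadoop/logs/access.log
```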

Submitting a task to a Hadoop system is done by writing a map-reduce job. A map-reduce job splits the input data into independent chunks, each of which is processed by a map task in parallel. During the 'reduce' phase, the map output from the data nodes is merge-sorted so that the key-value pairs for a given key are contiguous. The merged data is read sequentially, and the values are passed to the reduce method with an iterator that reads the input until the next key is encountered.
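To make the phases concrete, the flow can be sketched as a single-process word-count simulation in plain Java. This illustrates the model only - it is not the Hadoop API, and the class and method names are invented for the sketch:

```java
import java.util.*;

// A stdlib-only sketch of the map / shuffle / reduce flow for word count.
// A real Hadoop job would distribute these phases across the cluster.
public class WordCountSketch {

    // Map phase: each input line is turned into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group values by key, sorted by key - this stands in for
    // Hadoop's merge sort that makes all pairs for a given key contiguous.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: iterate over the grouped values and sum them per key.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("to be or not to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```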

Sqoop

One of the popular ways to use MySQL with Hadoop is to 'Sqoop' it! Sqoop was originally developed by Cloudera and serves as an import tool for Hadoop. Recently it graduated from the Apache Incubator to a Top Level Project. It allows users to easily import data from a MySQL database into Hadoop's HDFS so that it can be used in Hive. In its direct mode, Sqoop uses mysqldump to fetch the data stored in MySQL instead of going through JDBC. One useful feature of Sqoop is that it can also be installed on Amazon Elastic MapReduce (EMR).
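As an illustration, a typical Sqoop import from MySQL into HDFS looks like the following (the host, database, table and user names are hypothetical):

```shell
# Import the 'employees' table from MySQL into HDFS.
# --direct   : use mysqldump rather than JDBC for a faster dump
# -P         : prompt for the MySQL password interactively
# -m 4       : split the import across four parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/employees_db \
  --username hadoop_user -P \
  --table employees \
  --direct \
  --target-dir /user/hadoop/employees \
  -m 4
```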

Demo

In the demo session, we showed a live Hadoop system with 1 name node and 2 data nodes. The idea was to show how easily a live Hadoop cluster can be brought up. We demonstrated the various phases of setting up the Hadoop system: installation, formatting the HDFS, editing the configuration files, starting the Hadoop cluster and finally running the map-reduce job. We also addressed questions related to troubleshooting Hadoop deployments and common mistakes made while setting up a Hadoop cluster. After this, we wrote a basic map-reduce job for the word count problem in Java and ran it on our Hadoop cluster. Although the 2-node cluster did not dramatically shorten the task itself, the map-reduce job still returned its results noticeably faster than the same problem solved using simple SQL queries.
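For reference, the bring-up steps we demonstrated map onto a short command sequence for a Hadoop 1.x tarball install (paths assume the distribution is unpacked in $HADOOP_HOME; the jar and class names are placeholders for your own job):

```shell
cd $HADOOP_HOME

# One-time step: format the HDFS metadata store on the name node.
bin/hadoop namenode -format

# Start the name-node, data-nodes, job tracker and task trackers.
bin/start-all.sh

# Copy the local input file into HDFS.
bin/hadoop fs -put input.txt input

# Run the map-reduce job (jar and main class are your own word-count job).
bin/hadoop jar wordcount.jar WordCount input output

# Inspect the reducer output (the part-file name varies by API version).
bin/hadoop fs -cat 'output/part-*'
```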

The following image depicts the word count problem solved using a map-reduce job:

Response

Overall, the response to our session was overwhelming. Several interesting questions came up during the session, and many attendees were interested in other possible ways of integrating MySQL with Hadoop. Some of them came up with unique ideas and use cases in which MySQL's strengths proved to be the perfect missing piece in Hadoop deployments built on typical non-relational databases.

Apart from MySQL, Hadoop, HDFS and Sqoop, related technologies discussed in the session included Hive, Scribe and Thrift. Following are some helpful resources: