just share what can be shared, and write what can be written

Introduction to Apache Sqoop

To use Hadoop for analytics requires loading data into Hadoop clusters and processing it in conjunction with data that resides on enterprise application servers and databases. Loading GBs and TBs of data into HDFS from production databases or accessing it from map reduce applications is a challenging task. While doing so, we have to consider things like data consistency, overhead of running these jobs on production systems and at the end if this process would be efficient or not. Using batch scripts to load data is an inefficient way to go with.

Typical scenario, we want to process data stored in relational Database Management Systems with Map reduce capabilities. E.g. we have say, Legacy data or Lookup Tables etc. to process. Once solution we can think about is directly reading data from RDBMS tables in your mapper and the process the same.

But this would lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS. So in practice, DON’T DO IT.

So what could be the possible solution to process data stored in relational database using Map reduce? Answer is importing that data on HDFS!!

Sqoop is nothing but SQL to Hadoop. Sqoop allows users to import data from their relational databases into HDFS and vice versa. Sqoop is an open source tool written at Cloudera .It allows us to

Import one table

Import complete database

Import selected tables

Import selected columns from a particular table

Filter out certain rows from certain table etc

Sqoop uses Map reduce to fetch data from RDBMS and stores that on HDFS. While doing so, it throttles the number of mappers accessing data on RDBMS to avoid DDoS. By default it uses four mappers but this value is configurable. It’s recommended not to set higher number of mappers, as it may lead to consuming all spool space of the source database.

Sqoop internally uses JDBC interface so it should work with any JDBC compatible database

Starting Sqoop

Once we have installed sqoop, it’s very easy to use sqoop command line tool. To start sqoop, you have to type sqoop with arguments and tools to run.

$ sqoop tool-name [tool-arguments]

Here tool-name would be the operation you would like to perform, e.g. import, export etc. While tool-arguments would be the extra parameters you need to specify which would help execution of tool. Example of tool-arguments could be –connect which would specify the source database URL or –username would allow you to specify the user with which we want to connect the source database.

Sqoop has help tool with it, so list down all available tools, we can run following command