Downloading and Installing Apache Sqoop

This guide walks you through downloading and installing Apache Sqoop. Apache Sqoop supports the Linux operating system, and there are several installation options. One option is the source tarball that is provided with every release. This tarball contains only the source code of the project, so you can’t use it directly; you first need to compile the sources into binary executables.

For your convenience, the Sqoop community provides a binary tarball for each major supported version of Hadoop along with the source tarball. In addition to the tarballs, there are open source projects and commercial companies that provide operating system-specific packages. One such project, Apache Bigtop, provides RPM packages for Red Hat, CentOS, and SUSE, and deb packages for Ubuntu and Debian. The biggest benefit of using packages over tarballs is their seamless integration with the operating system: for example, configuration files are stored in /etc/ and logs in /var/log.

You can download the binary tarballs from the Apache Sqoop website. All binary tarballs contain a .bin__hadoop string embedded in their name, followed by the Apache Hadoop major version that was used to generate them. For Hadoop 1.x, the archive name will include the string .bin__hadoop-1.0.0. While the naming convention suggests this tarball only works with version 1.0.0, in fact, it’s fully compatible not only with the entire 1.0.x release branch but also with version 1.1.0. It’s very important to download the binary tarball created for your Hadoop version. Hadoop has changed internal interfaces between some of the major versions; therefore, a Sqoop tarball that was compiled against Hadoop 1.x will not work with, say, Hadoop 2.x.
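As a quick sanity check before unpacking, the Hadoop version can be read back out of the tarball name. The following is a minimal sketch, assuming the archive ends in .tar.gz and follows the .bin__hadoop-VERSION naming described above; the function name hadoop_version_from_tarball and the example file name are illustrative, not part of Sqoop.

```shell
# Print the Hadoop version embedded in a Sqoop binary tarball name.
# Assumes names of the form <anything>.bin__hadoop-<version>.tar.gz.
hadoop_version_from_tarball() {
  name=$(basename "$1")
  case "$name" in
    *.bin__hadoop-*.tar.gz)
      version=${name##*.bin__hadoop-}   # drop everything up to the marker
      version=${version%.tar.gz}        # drop the archive extension
      printf '%s\n' "$version"
      ;;
    *)
      echo "not a Sqoop binary tarball name: $name" >&2
      return 1
      ;;
  esac
}

# Example (illustrative file name):
#   hadoop_version_from_tarball sqoop-1.4.6.bin__hadoop-1.0.0.tar.gz
```

You can then compare the printed value against the version reported by your Hadoop installation before going any further.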

Installing packages is simpler than using tarballs. They are already integrated with the operating system and will automatically download and install most of the required dependencies during the Sqoop installation.

Installing Apache Sqoop via Bigtop

Bigtop provides repositories that can be easily added into your system in order to find and install the dependencies. Bigtop installation instructions can be found in the Bigtop project documentation. Once Bigtop is successfully deployed, installing Sqoop is very simple and can be done with the following commands:

To install Sqoop on a Red Hat, CentOS, or another yum-based system:

$ sudo yum install sqoop

To install Sqoop on an Ubuntu, Debian, or another deb-based system:

$ sudo apt-get install sqoop

To install Sqoop on a SLES system:

$ sudo zypper install sqoop
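If you script the installation across mixed distributions, the three commands above can be selected automatically. This is only a sketch: the helper names detect_package_manager and sqoop_install_command are made up for illustration, and it assumes the Bigtop repository has already been added to the system.

```shell
# Print the Sqoop install command for a given package manager name.
sqoop_install_command() {
  case "$1" in
    yum)     echo "sudo yum install sqoop" ;;
    apt-get) echo "sudo apt-get install sqoop" ;;
    zypper)  echo "sudo zypper install sqoop" ;;
    *)       echo "unsupported package manager: $1" >&2; return 1 ;;
  esac
}

# Print the name of the first supported package manager found on PATH.
detect_package_manager() {
  for pm in yum apt-get zypper; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "$pm"
      return 0
    fi
  done
  return 1
}

# Example: show (but do not run) the command for this machine.
#   pm=$(detect_package_manager) && sqoop_install_command "$pm"
```

The sketch only prints the command instead of running it, so you can review it before anything is installed.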

Sqoop’s main configuration file, sqoop-site.xml, is available in the configuration directory (conf/ when using the tarball or /etc/sqoop/conf when using Bigtop packages). While you can further customize Sqoop, the defaults will suffice in a majority of cases. All available properties are documented in the sqoop-site.xml file.
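To get a quick overview of what can be tuned, you can list the property names that appear in the file. The sketch below assumes the usual Hadoop-style layout with one <name>...</name> element per line; the function name list_sqoop_properties is hypothetical. It also prints properties that sit inside XML comments, which is convenient if the shipped file keeps its documented defaults commented out.

```shell
# Print every property name found in a Hadoop-style configuration file.
# Assumes each <name>...</name> element sits on its own line.
list_sqoop_properties() {
  site_file=$1
  if [ ! -f "$site_file" ]; then
    echo "no such file: $site_file" >&2
    return 1
  fi
  sed -n 's:.*<name>\(.*\)</name>.*:\1:p' "$site_file"
}

# Example (path for a package-based install, as described above):
#   list_sqoop_properties /etc/sqoop/conf/sqoop-site.xml
```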

Apache Sqoop & Cloudera Quick Start

The Cloudera Quick Start VM already has Sqoop available, so you don’t need to install anything. The VM also comes with the MySQL JDBC driver installed, so you can start practicing right away.

Installing JDBC Drivers

Sqoop requires the JDBC drivers for your specific database server (MySQL, Oracle, etc.) in order to transfer data. They are not bundled in the tarball or packages.

You need to download the JDBC drivers and then install them into Sqoop. JDBC drivers are usually available free of charge from the database vendors’ websites. Some enterprise data stores might bundle the driver with the installation itself. After you’ve obtained the driver, you need to copy the driver’s JAR file(s) into Sqoop’s lib/ directory. If you’re using the Sqoop tarball, copy the JAR files directly into the lib/ directory after unpacking the tarball. If you’re using packages, you will need to copy the driver files into the /usr/lib/sqoop/lib directory.
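The copy step can be wrapped in a small helper that checks both paths first. This is a sketch under stated assumptions: the function name install_jdbc_driver and the example JAR name are illustrative, and the default target is the package location /usr/lib/sqoop/lib mentioned above (pass the tarball’s lib/ directory as the second argument otherwise).

```shell
# Copy a JDBC driver JAR into Sqoop's lib directory.
# Usage: install_jdbc_driver /path/to/driver.jar [sqoop_lib_dir]
install_jdbc_driver() {
  jar=$1
  lib_dir=${2:-/usr/lib/sqoop/lib}   # package default; override for tarballs
  if [ ! -f "$jar" ]; then
    echo "no such JAR: $jar" >&2
    return 1
  fi
  if [ ! -d "$lib_dir" ]; then
    echo "no such directory: $lib_dir" >&2
    return 1
  fi
  cp "$jar" "$lib_dir/"
}

# Example (illustrative JAR name; a package install may need root privileges):
#   install_jdbc_driver ./example-jdbc-driver.jar
```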

Each database vendor has a slightly different method for retrieving the JDBC driver. Most of them make it available as a free download from their websites.

Installing Specialized Connectors

Some database systems provide special connectors that are not part of the Sqoop distribution and that take advantage of advanced database features. If you want to take advantage of these optimizations, you will need to download and install those specialized connectors individually.

On the node running Sqoop, you can install the specialized connectors anywhere on the local filesystem. If you plan to run Sqoop from multiple nodes, you have to install the connector on all of those nodes. To be clear, you do not have to install the connector on all nodes in your cluster, as Sqoop will automatically propagate the appropriate JARs as needed throughout your cluster.

In addition to installing the connector JARs on the local filesystem, you also need to register them with Sqoop. First, create a directory manager.d in the Sqoop configuration directory (if it does not exist already). The configuration directory might be in a different location, based on how you’ve installed Sqoop. With packages, it’s usually in the /etc/sqoop directory, and with tarballs, it’s usually in the conf/ directory. Then, inside this directory, you need to create a file (naming it after the connector is a recommended best practice) that contains the following line:

connector.fully.qualified.class.name=/full/path/to/the/jar
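The registration steps above can be sketched as a small shell helper. Everything named here is an assumption for illustration: the function register_connector, the connector name example, the class com.example.ExampleManagerFactory, and the JAR path are placeholders, and the configuration directory is passed explicitly because its location depends on how Sqoop was installed.

```shell
# Register a connector with Sqoop by writing a file into manager.d.
# Usage: register_connector <name> <class> <jar_path> <sqoop_conf_dir>
register_connector() {
  name=$1        # file name, by convention the connector's name
  class=$2       # fully qualified class name of the connector
  jar=$3         # full path to the connector JAR
  conf_dir=$4    # Sqoop configuration directory
  if [ -z "$name" ] || [ -z "$class" ] || [ -z "$jar" ] || [ -z "$conf_dir" ]; then
    echo "usage: register_connector name class jar conf_dir" >&2
    return 1
  fi
  # Create manager.d if it does not exist yet, then write the single line.
  mkdir -p "$conf_dir/manager.d" || return 1
  printf '%s=%s\n' "$class" "$jar" > "$conf_dir/manager.d/$name"
}

# Example with placeholder values:
#   register_connector example com.example.ExampleManagerFactory \
#     /opt/connectors/example-connector.jar /path/to/sqoop/conf
```

Remember to run the same registration on every node from which you plan to launch Sqoop.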

In addition to the built-in connectors, there are many specialized connectors available for download. Some of them, such as OraOop and the Cloudera Connector for Teradata, are further described in this book. More advanced users can develop their own connectors by following the guidelines listed in the Sqoop Developer’s Guide.