Hadoop in Practice author Alex Holmes explains how you can use the Sqoop relational database import / export system with Hadoop.

WEBINAR:On-Demand

Data splitting

How is Sqoop able to parallelize imports across multiple mappers[1]? In figure 1, we showed how Sqoop's first step is to pull metadata from the database. It inspects the table being imported to determine the primary key and runs a query to determine the lower and upper bounds of the data in the table (shown in figure 3). A somewhat even distribution of data within the minimum and maximum keys is assumed by dividing the delta by the number of mappers. Each mapper is then fed a unique query containing a range of the primary key.

Figure 3. Sqoop preprocessing steps to determine query splits

You can configure Sqoop to use a non-primary key with the --split-by argument. This can be useful in situations where the primary key doesn't have an even distribution of values between the min and max values. For large tables, however, you need to be careful that the column specified in --split-by is indexed to ensure optimal import times.

You can use the --boundary-query argument to construct an alternative query to determine the minimum and maximum values.

Incremental imports

You can also perform incremental imports. Sqoop supports two types: "append," which works for numerical data that is incrementing over time, such as auto-increment keys, and "lastmodified," which works on time-stamped data. In both cases, you need to specify the column using --check-column, the mode via the --incremental argument (the value must be either "append" or "lastmodified"), and, finally, the actual value to use to determine the incremental changes, --last-value. Using our example, if we want to import stock data that is newer than January 1, 2005, we would do the following:

Sqoop jobs and the metastore

You can see in the command output the last value that was encountered for the increment column. How can we best automate a process that can reuse that value? Sqoop has the notion of a job, which can save this information and reuse it in subsequent executions:

This merely saves the notion of this command as a job in something called the Sqoop metastore. A Sqoop metastore keeps track of all jobs. By default, the metastore is contained in your home directory under.sqoop and is only used for your own jobs. If you want to share jobs, you would need to install a JDBC-compliant database and use the --meta-connect argument to specify its location when issuing job commands.

The job create command executed in the previous example didn't do anything other than add the job to the metastore. To run the job, you need to explicitly execute it as shown in this listing:

#4 — The metadata includes the last value of your incremental column. This is actually the time that the command was executed, and not the last value seen in the table. If you are using this feature, make sure that the database server and any clients interacting with the server (including the Sqoop client) have their clocks synced with the Network Time Protocol (NTP).

Unfortunately, the --options-file argument, which referred to your local file with your username and password, doesn't work with jobs in Sqoop. The password also can't be specified when creating the job. Sqoop will instead prompt for the password when running the job. To make this work in an automated script, you need to use Expect, a Linux automation tool, to supply the password from a local file when it detects Sqoop prompting for a password. The source of an Expect script that works with Sqoop is on our GitHub at http://goo.gl/yL4KQ.

Fast MySQL imports

What if you want to bypass JDBC altogether and use the "fast" MySQL Sqoop connector for a high-throughput load into HDFS? This approach uses the utility shipped with MySQL to perform the mysqldump load. You must make sure that mysqldump is in the PATH of the user running the MapReduce job. To enable use of the "fast" connector you must specify the --direct argument.

What are the disadvantages of the fast connectors? First, only MySQL and PostgreSQL are currently supported. Fast connectors also only work with text output files—specifying Avro or Sequence File as the output format of the import won't work.

Importing to Hive

The final step in this technique is to use Sqoop to import your data into a Hive table. The only difference between an HDFS import and a Hive import is that the Hive import has a post-processing step where the Hive table is created and loaded, as shown in figure 4.

Figure 4. Sequence of events of a Sqoop Hive import

When data is loaded into Hive from an HDFS file or directory, such as in the case of Sqoop Hive imports (step 4 in the previous diagram), for the sake of efficiency, Hive moves the directory into its warehouse rather than copying the data (step 5). The HDFS directory that the Sqoop MapReduce job writes to won't exist after the import.

Hive imports are triggered via the --hive-import argument. Just as in the fast connector's case, this option isn't compatible with the --as-avrodatafile and --as-sequencefile options.

You will likely have downstream processing issues if you are importing columns that can contain any of Hive's delimiters ( \n, \r and \01 characters). You have two options in such cases: either specify --hive-drop-import-delims, which will remove conflicting characters as part of the import, or specify --hive-delims-replacement, which will replace them with a different character.

If the Hive table already exists, the data will be appended to the existing table. If this is not the desired behavior, you can use the --hive-overwrite argument to indicate that the existing table should be replaced with the imported data.

Data in Hive can also be compressed. Since the LZOP compression codec is the only splittable codec[2] in Hadoop, it is the codec that should be used for Hive compression. The following example shows how to use the --hive-overwrite in conjunction with enabling LZOP compression. For this to work, you'll need to have built and installed LZOP on your cluster because it isn't bundled with Hadoop (or CDH) by default.

Finally, you can use the --hive-partition-key and the --hive-partition-value to create different Hive partitions based on the value of a column being imported. For example, if you want to partition your input by date, you would do the following:

Now, the previous example is not optimal by any means. Ideally, a single import would be able to create multiple Hive partitions. Because we're limited to specifying a single key and value, we'd need to run the import once per unique partition value, which is laborious. We would be better off importing into a non-partitioned Hive table and then retroactively creating partitions on the table after it had been loaded.

Also, the SQL query that we supply to Sqoop must also take care of filtering out the results, such that only those that match the partition are included. In other words, it would have been useful if Sqoop would have updated the WHERE clause with symbol = "AAPL" rather than having to do this ourselves.

Discussion

Obviously, for Sqoop to work, your Hadoop cluster nodes need to have access to the MySQL database. A common source of error is either misconfiguration or lack of connectivity from the Hadoop nodes. It's probably wise to log onto one of the Hadoop nodes and attempt to connect to the MySQL server using the MySQL client, and/or attempt access with the mysqldump utility (if using a fast connector).

Another important note when using a fast connector is that it is assumed that mysqldump is installed on each Hadoop node and is in the PATH of the user running the map tasks.