Configuring Transient Hive ETL Jobs to Use the Amazon S3 Filesystem

Hive is a popular choice for batch extract-transform-load (ETL) jobs such as cleaning, serializing, deserializing, and transforming data. In
on-premise deployments, ETL jobs operate on data stored in a permanent Hadoop cluster that runs HDFS on local disks. However, ETL jobs are frequently transient and can benefit from cloud deployments
where cluster nodes can be quickly created and torn down as needed. This approach can translate to significant cost savings.

Important: Cloudera components writing data to S3 are constrained by the inherent limitation of Amazon S3 known as "eventual consistency." For
more information, see Data Storage
Considerations.

About Transient Jobs

Most ETL jobs on transient clusters run from scripts that make API calls to a provisioning service such as Cloudera Director. They can be triggered by external events, such as IoT (internet of
things) events like reaching a temperature threshold, or they can be run on a regular schedule, such as every day at midnight.

Transient Jobs Hosted on Amazon S3

Data residing on Amazon S3 and the node running Cloudera Director are the only persistent components. The computing nodes and local storage come and go with each transient workload.

Configuring and Running Jobs on Transient Clusters

Using AWS to run transient jobs involves the following steps, which are documented in an end-to-end example you can download from this Cloudera GitHub repository. Use this example to test transient clusters with Cloudera Director.

Configuring AWS Settings

Best Practices

Networking

Cloudera recommends deploying clusters within a VPC, using Security Groups to control network traffic. Each cluster should have outbound internet connectivity through a NAT (network
address translation) server when you deploy in a private subnet. If you deploy in a public subnet, each cluster needs direct connectivity. Inbound connections should be limited to traffic from
private IPs within the VPC and SSH access through port 22 to the gateway nodes from approved IP addresses. For details about using Cloudera Director to perform these steps, see Setting up the AWS Environment.

Data Access

Create an IAM role that gives the cluster access to S3 buckets. Using IAM roles is a more secure way to provide access to S3 than adding the S3 keys to Cloudera Manager by configuring
core-site.xml safety valves.

AWS Placement Groups

To improve performance, place worker nodes in an AWS placement group. See Placement Groups in the AWS documentation set.

Create the Cluster Configuration File

The cluster configuration file contains the information that Director needs to bootstrap and properly configure a cluster:

Deployment environment configuration.

Instance groups configuration.

List of services.

Pre- and post-creation scripts.

Custom service and role configurations.

Billing ID and license for hourly billing for Director use from Cloudera. See Usage-Based Billing.

Creating the cluster configuration file represents the bulk of the work of configuring Hive to use the S3 filesystem. This GitHub repository contains sample configurations for different cloud providers.

Testing the Cluster Configuration File

After defining the cluster configuration file, test it to make sure it runs without errors:

Use secure shell (SSH) to log in to the server running Cloudera Director.

Run the validate command by passing the configuration file to it:

cloudera-director validate <cluster_configuration_file_name.conf>

If Cloudera Director server is running in a separate instance from the Cloudera Director client, you must run:

Prepare the CDH AMIs

It is not a requirement to have preloaded AMIs containing CDH parcels that are already extracted. However, preloaded AMIs significantly speed up the cluster provisioning process. See
this repo in GitHub for instructions and scripts that create preloaded
AMIs.

After you have created preloaded AMIs, replace the AMI IDs in the cluster configuration file with the new preloaded AMI IDs to ensure that all cluster instances use the preloaded
AMIs.

Add the following code to the cluster configuration file in the cloudera-manager section to make sure that Cloudera Manager does not try to download the
parcels:

Run the Cloudera Director validate command again to test bringing up the cluster. See Testing the Cluster Configuration File. The cluster should come up significantly faster than it did when you tested it before.

Prepare the Job Wrapper Script

Define the Hive query or job that you want to execute and a wrapper shell script that runs required prerequisite commands before it executes the query or job on the transient cluster.
The Director public GitHub repository contains simple examples of a MapReduce job wrapper script and an Oozie job
wrapper script.

For example, the following is a Bash shell wrapper script for a Hive query:

Where query.q contains the Hive query. After you create the job wrapper script, test it to make sure it runs without errors.

Log Collection

Save all relevant log files in S3 because they disappear when you terminate the transient cluster. Use these log files to debug any failed jobs after the cluster is terminated. To save
the log files, add an additional step to your job wrapper shell script.

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required
notices. A copy of the Apache License Version 2.0 can be found here.