Data Engineering with Cloudera Altus

As modern businesses deal with an ever-increasing volume of data and an expanding set of data sources, the data engineering process that enables analysis, visualization, and reporting only becomes more important.

When considering running data engineering workloads in the public cloud, the cloud environment offers capabilities that enable operational models different from on-premises deployments. The key factors here are the presence of a distinct storage layer within the cloud environment and the ability to provision compute resources on demand (e.g., Amazon’s S3 and EC2, respectively). In such an environment, it becomes possible to decouple data storage from computation without creating data silos. When you combine these environmental characteristics with the observation that many common data engineering workloads are repeated and periodic, you can envisage a model where clusters are created for specific workloads and then terminated when the work is done.

This model is attractive both because you pay for resources only while they are in use, and because it avoids the lifecycle-management burden of long-lived clusters (upgrades, reconfiguration, etc.).

Cloudera Altus Data Engineering

This is where Cloudera’s new Altus Data Engineering service comes into the picture. Altus is our new cloud service platform, and our first service is Data Engineering. Following from the previous discussion, the key capabilities we need are creating and terminating clusters when they are needed, and submitting jobs to those clusters. With our Data Engineering service, we have made these fundamental actions as simple and as streamlined as possible, so that customers need only worry about what really matters to them: writing their data engineering workloads and running them.

Connecting Altus to AWS

Once you are logged in to Altus, the first task is to connect your AWS Account to it. We take advantage of AWS cross-account access roles to establish a trust relationship that allows Altus to take actions on your behalf to provision clusters while not requiring Altus to retain any persistent credentials. Alongside this delegated role, you also need to identify certain resources on the AWS side, such as which subnets you want to deploy clusters into and which S3 bucket to store cluster logs in.
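As a sketch of what this trust relationship looks like on the AWS side, the delegated role could be created with the AWS CLI along these lines. Note that the account ID and external ID below are placeholders: Altus supplies the real values during environment setup, and the role name here is purely illustrative.

```shell
# Hypothetical values: the Altus account ID and external ID are supplied by
# Altus during environment setup; do not use these placeholders verbatim.
cat > altus-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::<ALTUS_ACCOUNT_ID>:root" },
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "<EXTERNAL_ID>" } }
  }]
}
EOF

# Create the cross-account role that Altus will assume. No long-lived
# credentials are handed to Altus, only permission to assume this role.
aws iam create-role \
  --role-name altus-cross-account-role \
  --assume-role-policy-document file://altus-trust-policy.json
```

The external ID condition is what prevents a third party from tricking Altus into acting against your account (the "confused deputy" problem), which is why cross-account roles are preferable to sharing access keys.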

Within Altus, we define “Environments” to encapsulate these AWS resources and the trust relationship. You can have as many environments as you need, and you can use them to logically separate where work is done, for example, to separate development and production workloads.

The easiest way to create an Environment is to use the Environment QuickStart, which creates all the AWS-side resources for you, requiring no explicit pre-setup. It does this through CloudFormation in AWS, so you get a separate VPC and associated resources, which keeps any Altus clusters segregated from anything else in your AWS Account.

If you require greater control over AWS resources, for example if you want to use an existing VPC, Altus also provides a step-by-step Configuration Wizard that guides you through the process of creating or modifying the necessary AWS resources.

Creating a cluster with Altus Data Engineering

With an Environment in place, you can now create a cluster. This is a straightforward process requiring far less information than setting up a traditional shared-use cluster: because we know what kind of workload the cluster will run, and that it will run only that kind of workload, we can establish an optimized configuration for the cluster automatically. The primary questions to be answered are which compute service (Hive, Spark, or MapReduce2) you want to use, which Environment to run in, and the physical characteristics of the cluster (instance type and node count). The final section covers the SSH key and the Cloudera Manager credentials you want your cluster to use (Altus Data Engineering clusters include read-only access to Cloudera Manager).
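The same cluster definition can be expressed from the command line. The sketch below is illustrative only: the flag names and values are assumptions based on the shape of the service described here, so consult the Altus documentation for the exact parameters.

```shell
# Illustrative sketch of creating a Spark cluster; flag names and values
# here are assumptions, not a definitive reference. Check the Altus docs.
altus dataeng create-aws-cluster \
  --cluster-name etl-nightly \
  --service-type SPARK \
  --environment-name dev-environment \
  --instance-type m4.xlarge \
  --workers-group-size 10 \
  --cloudera-manager-username guest \
  --cloudera-manager-password '<password>' \
  --public-key file:///path/to/cluster-key.pub
```

Because the service type is declared up front, Altus can lay down a configuration tuned for that one workload type rather than a general-purpose shared cluster.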

Click ‘Create Cluster’ and off we go.

Submitting Jobs with Altus Data Engineering

You don’t need to wait for the cluster to finish provisioning before submitting jobs: Altus Data Engineering maintains an internal job queue, independent of the cluster itself, so any submitted jobs are queued immediately until the cluster is ready.

Job submission is a flexible process: you can submit jobs individually or as a group, and you can choose whether to submit them to an existing cluster, a brand-new one (defined in-line), or a clone of an existing cluster. If submitting to a new cluster, you can also specify that the cluster should be terminated when all jobs have completed. In this way, it is possible to submit a complete workload and have it processed without doing any synchronization or polling to manage the cluster lifecycle.

Job Parameters

Regardless of the kind of job you are submitting, you will generally be working with job resources stored in S3, both data and program files (e.g., JARs, HQL scripts). Program files can be loaded from S3 just like data, by providing an s3a:// URL to those files.
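Putting this together with the submission options above, a Spark job whose JAR lives in S3 could be queued against a self-terminating cluster along these lines. The subcommand, flags, JSON shape, and bucket paths below are illustrative assumptions; see the Altus documentation for the exact job schema.

```shell
# Illustrative: submit a Spark job whose JAR and data live in S3, with the
# cluster terminating once the job queue drains. Flag names and the JSON
# schema are assumptions; check the Altus CLI documentation.
altus dataeng submit-jobs \
  --cluster-name etl-nightly \
  --automatic-termination-condition EMPTY_JOB_QUEUE \
  --jobs '{
    "sparkJob": {
      "jars": ["s3a://my-bucket/jobs/etl-transform.jar"],
      "mainClass": "com.example.EtlTransform",
      "applicationArguments": ["s3a://my-bucket/input/", "s3a://my-bucket/output/"]
    }
  }'
```

Because both the program and its data are addressed via s3a:// URLs, nothing needs to be staged onto the cluster before submission, which is what makes the fire-and-forget lifecycle possible.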

The Altus CLI

All of our discussion so far has been around using the UI, but Altus also comes with a comprehensive CLI that exposes all the functionality Altus provides and can readily be used in scripted or automated scenarios. Additional details on installing and using the CLI are available in the Altus documentation.
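As a minimal sketch of getting started with the CLI (the package name and subcommands here are from memory and may differ; the Altus documentation is authoritative):

```shell
# The Altus CLI is distributed as a Python package; a virtualenv keeps it
# isolated from other Python tooling. Names are assumptions; see the docs.
pip install altuscli
altus configure            # prompts for your Altus API access key ID and private key
altus dataeng list-clusters
```

Once configured, every operation shown in the UI (environments, clusters, job submission) has a CLI counterpart, which is what makes fully scripted workloads practical.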

Connecting to a Cluster

Under normal conditions it will not be necessary to connect directly to a cluster, but during development you’ll probably want to look at execution details in Cloudera Manager or the YARN/Spark History Servers. Network connectivity to cloud clusters can be complex, so we’ve made it easier by implementing a SOCKS proxy helper in our CLI. The SOCKS proxy works over a single SSH connection to the cluster and allows internal DNS names (which don’t resolve outside the cloud environment) to be used normally. Our CLI can set up the proxy automatically, and can even start a Chrome browser session configured to use it (this part is optional; you can manually configure your preferred browser if you wish).

Now, from inside your browser, you can reach Cloudera Manager and then access the History Server UIs as desired.

Finally, you can SSH to the cluster using the key provided during cluster creation.
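Concretely, that is a standard SSH invocation using the private half of the key pair from cluster creation; the node address below is a placeholder, and the login user depends on the machine image in use:

```shell
# Substitute a real node address from your cluster; the login user
# (centos here) depends on the AMI and is an assumption.
ssh -i ~/.ssh/altus-cluster-key.pem centos@<cluster-node-public-dns>
```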

Looking Forward

This blog post has provided a short introduction to the capabilities of Altus Data Engineering. You can consult the in-product documentation for an extended discussion of these capabilities and more complex scenarios.

In future blog posts, we’ll dig deeper into more of these capabilities as well as exciting new features that will be showing up over time. Additional information on Cloudera Altus can be found on our Vision blog: Simplifying Big Data in the Cloud with Cloudera Altus.

3 responses on “Data Engineering with Cloudera Altus”

This is a very interesting service, but I am trying to compare it to Databricks’ offering. How would you differentiate that vs. this? Databricks spins up an AWS cluster underneath, gives you an environment to run your Spark and other jobs, and manages things for you.

This blog comment section isn’t really the right medium for an extensive discussion of the two, but at a high level I’d say the basic concept of managing clusters doing work in the customer’s AWS account is common to both, while the actual clusters and their capabilities are very different when you compare the set of Hadoop ecosystem components available in a CDH cluster vs. that provided by Databricks.

Sudhir,

The most obvious difference between Altus and Director is the form factor. Director still requires you to install and manage some software (Director itself) to use it, while Altus is a fully managed service. Director also provides a more general toolkit that can be used to deploy clusters for a variety of different use cases. Altus is tightly focused and addresses one primary use case today (Data Engineering), although the set will grow over time. In exchange, it provides a more streamlined user experience than Director, with a reduced management footprint. Which is right for you will depend on your use case.