Cloudera now supports Azure Data Lake Store

With the release of Cloudera Enterprise Data Hub 5.12, you can now run Spark, Hive, HBase, Impala, and MapReduce workloads in a Cloudera cluster on Azure Data Lake Store (ADLS). Running on ADLS has the following benefits:

Grow or shrink a cluster independent of the size of the data.

Data persists independently as you spin up or tear down a cluster. Other clusters and compute engines, such as Azure Data Lake Analytics or Azure SQL Data Warehouse, can run workloads on the same data.

Data is encrypted at rest by default using service-managed or customer-managed keys in Azure Key Vault, and is encrypted with SSL while in transit.

High data durability at lower cost. Data Lake Store manages data replication and exposes the data through an HDFS-compatible interface, so you do not have to replicate data both in HDFS and at the cloud storage infrastructure level.

Add a Data Lake Store for cluster-wide access

Step 1: ADLS uses Azure Active Directory for identity management and authentication. To access ADLS from a Cloudera cluster, first create a service principal in Azure AD. You will need the Application ID, Authentication Key, and Tenant ID of the service principal.

Step 2: Grant the service principal created in the previous step permission to access ADLS. To do this, go to the Azure portal, navigate to the Data Lake Store, and select Data Explorer. Then navigate to the target path, select Access, and add the service principal with the appropriate access rights. Refer to this document for details on access control in ADLS.
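To make the role of the three service principal values concrete, here is a minimal Python sketch that maps them onto Hadoop OAuth2 properties and renders them as core-site.xml property entries. The property names (dfs.adls.oauth2.*) and the token endpoint format are assumptions based on the Hadoop ADLS connector of that era, and the function names are hypothetical; verify both against the documentation for your Cloudera release before using them.

```python
from xml.sax.saxutils import escape


def adls_service_principal_properties(application_id, authentication_key, tenant_id):
    """Map the service principal values from Step 1 to Hadoop properties.

    Assumption: the dfs.adls.oauth2.* property names and the Azure AD token
    endpoint format below match your Hadoop/CDH version.
    """
    values = {
        "application_id": application_id,
        "authentication_key": authentication_key,
        "tenant_id": tenant_id,
    }
    for name, value in values.items():
        if not value:
            raise ValueError(f"{name} must not be empty")

    # The Tenant ID is only used to build the OAuth2 token endpoint URL.
    refresh_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    return {
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.client.id": application_id,
        "dfs.adls.oauth2.credential": authentication_key,
        "dfs.adls.oauth2.refresh.url": refresh_url,
    }


def to_core_site_xml(properties):
    """Render the properties as <property> entries for core-site.xml."""
    entries = []
    for name, value in sorted(properties.items()):
        entries.append(
            "<property>\n"
            f"  <name>{escape(name)}</name>\n"
            f"  <value>{escape(value)}</value>\n"
            "</property>"
        )
    return "\n".join(entries)


if __name__ == "__main__":
    # Placeholder values only; never commit a real authentication key.
    props = adls_service_principal_properties(
        "<application-id>", "<authentication-key>", "<tenant-id>"
    )
    print(to_core_site_xml(props))
```

Note that this places the authentication key in clear text in a configuration file, which is acceptable only for illustration; a credential store is preferable for real clusters.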

Specify a Data Lake Store in the Hadoop command line

Instead of, or in addition to, configuring a Data Lake Store for cluster-wide access, you can provide ADLS access information on the command line of a MapReduce or Spark job. With this method, if you use an Azure AD refresh token instead of a service principal and encrypt the credentials in a .JCEKS file under the user’s home directory, you gain the following benefits:

Each user can use their own credentials instead of sharing a cluster-wide credential.

Nobody can see another user’s credentials because they are encrypted in a .JCEKS file in that user’s home directory.

There is no need to store credentials in clear text in a configuration file.

There is no need to wait for someone who has rights to create service principals in Azure AD.

The following steps show how you can set this up using the refresh token obtained by signing in to the Azure cross-platform command-line tool.

Step 1: Sign in to the Azure CLI by running the command “azure login”, then get the refreshToken and _clientId values from .azure/accessTokens.json under the user’s home directory.
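Rather than copying the two values out of the file by hand, you could extract them with a short script. The following Python sketch assumes that accessTokens.json holds a JSON list of token entries, each carrying refreshToken and _clientId fields; the exact layout may differ between CLI versions, and the function name is hypothetical.

```python
import json
from pathlib import Path


def read_refresh_credentials(token_file=None):
    """Return (client_id, refresh_token) from the Azure CLI token cache.

    Assumption: the file is a JSON list of entries (or a single entry) with
    "_clientId" and "refreshToken" keys, as described in Step 1.
    """
    if token_file is None:
        token_file = Path.home() / ".azure" / "accessTokens.json"

    with open(token_file, encoding="utf-8") as handle:
        entries = json.load(handle)

    # Tolerate a single object as well as a list of objects.
    if isinstance(entries, dict):
        entries = [entries]

    # Use the first entry that carries both required fields.
    for entry in entries:
        if "refreshToken" in entry and "_clientId" in entry:
            return entry["_clientId"], entry["refreshToken"]

    raise LookupError(f"no entry with refreshToken and _clientId in {token_file}")


if __name__ == "__main__":
    client_id, refresh_token = read_refresh_credentials()
    # Print only the client ID; the refresh token is a secret.
    print("client id:", client_id)
    print("refresh token length:", len(refresh_token))
```

The script deliberately avoids printing the refresh token, since it grants access to the signed-in user's resources and should not end up in shell history or shared logs.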

Step 2: Run the following commands to set up credentials to access ADLS: