Configuring Extraction for Altus Clusters on AWS

Follow the steps below to configure Cloudera Navigator to extract metadata and lineage from single-user transient clusters deployed to Amazon Web Services using Cloudera Altus. The
Cloudera Navigator extraction process for clusters launched by Cloudera Altus works as follows:

Any HDFS paths in a job, query, or data entity are extracted as proxy entities for the path, similar to how Hive entities are extracted. That means that HDFS is not bulk extracted from
an Altus cluster.

Hive Metastore (HMS) entities are also not bulk extracted. Cloudera Navigator extracts Hive entities used in queries that generate lineage, such as databases, tables, and so on.

Requirements

Cloudera Navigator collects metadata and lineage entities from transient clusters deployed to AWS by Cloudera Altus users. The metadata and lineage data is not collected directly from
the transient clusters but rather from an Amazon S3 bucket that serves as the storage mechanism for the Telemetry Publisher running in the cluster (see How it Works: Background to the Setup Tasks for details).

The setup requires configuring access permissions to the S3 bucket for both Cloudera Altus (an IAM role with an access policy) and Cloudera Navigator (AWS access key credentials). Transient clusters instantiated
by Altus users must have read and write permissions to the Amazon S3 bucket used by Telemetry Publisher. The on-premises centralized Cloudera Navigator instance must have read permissions on the same
Amazon S3 bucket.

A Cloudera Altus user account that can run jobs on transient clusters deployed to AWS.

An AWS IAM user account.

An Amazon S3 bucket to use as the storage location for metadata and lineage.

AWS Credentials for the AWS account hosting the S3 bucket.

Access to the on-premises or persistent Cloudera Manager cluster running Cloudera Navigator. The Cloudera Manager user role of Full Administrator and the ability to log in to the Cloudera Manager Admin Console are required.

Obtaining AWS Access Key Credentials for the Amazon S3 Bucket

You can download AWS access keys when you create an IAM user account through the AWS Management Console. If you are configuring an existing Amazon S3 bucket and you do not have the AWS access keys for it, you can
generate new AWS access keys from the AWS account using either the AWS Management Console or the AWS CLI.

Important: The AWS account where the access keys are generated must have read and write access to the Amazon S3 bucket.

Generating new AWS access keys deactivates any previously issued credentials and makes the newly generated credentials Active for the AWS account. Keep that
in mind if you obtain new AWS access keys to use for the Cloudera Navigator-Cloudera Altus integration.

Note: If you still have the AWS access keys obtained when the account was created, do not generate a new set of AWS access keys unless you want to
change the credentials.

Navigate to the Security credentials section of the Users page in IAM for this account.

Click the Create access key button to generate new AWS access keys. Copy the credentials (the access key ID and secret access key) from the user interface
or download the credentials.csv file for later use.
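
If you download the credentials file, the two values can be read programmatically instead of being copied by hand. The following Python sketch is illustrative only and is not a Cloudera or AWS tool. It assumes the file has a header row with columns named Access key ID and Secret access key; the headers in your file may differ depending on how it was generated. The function name read_access_keys is hypothetical.

```python
import csv
import io


def read_access_keys(csv_text):
    """Return (access_key_id, secret_access_key) from credentials.csv text.

    Assumption: the header row contains 'Access key ID' and
    'Secret access key' columns. Adjust the names if your file differs.
    """
    # Drop a leading byte-order mark if the file was saved with one.
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("\ufeff")))
    for row in reader:
        # Normalize header names so case and stray whitespace do not matter.
        normalized = {
            k.strip().lower(): (v or "").strip()
            for k, v in row.items()
            if k is not None
        }
        key_id = normalized.get("access key id")
        secret = normalized.get("secret access key")
        if key_id and secret:
            return key_id, secret
    raise ValueError("no access key columns found in credentials file")
```

For example, read_access_keys(open("credentials.csv", encoding="utf-8-sig").read()) returns the pair of values to enter later in Cloudera Manager. Treat the secret access key like a password and avoid printing or logging it.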

Cloudera Altus Configuration

Cloudera Altus instantiates single-user transient clusters focused on data engineering workloads that use compute services such as Hive or MapReduce2. The typical deployment scenario
involves running scripts that invoke the Cloudera Altus CLI to instantiate the cluster, in this case, using Amazon Web Services according to the details specified in the Altus environment. An Altus
environment specifies all resources needed by the cluster, including the AWS account that will be used to instantiate the cluster. The Cloudera Altus user account is configured to provide
cross-account access to the AWS account that has permissions to launch AWS Elastic Compute Cloud (EC2) instances and use other AWS resources, including Amazon S3 buckets.

Use the Environment Wizard to specify the Amazon S3 bucket that clusters will use to store metadata and lineage information for collection by Cloudera Navigator. Specifically, the Instance Profile Role page of the
Configuration Wizard lets you enable integration with Cloudera Navigator and specify the Amazon S3 bucket that will hold collected metadata and lineage information.

On the Instance Profile Role page of the Configuration Wizard, complete the following steps:

Click the Enable checkbox for Cloudera Navigator Integration.

In the Cloudera Navigator S3 Data Bucket field, enter the path to the Amazon S3 bucket, including the final /, which
identifies the target as an S3 bucket. For example:

s3a://cluster-lab.example.com/cust-input/
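
When the bucket path is supplied by a script rather than typed into the wizard, a small check can catch a missing trailing slash before the value is used. The following Python sketch assumes the s3a:// scheme shown in the example above; the helper name normalize_bucket_path is hypothetical and is not part of Altus.

```python
def normalize_bucket_path(path):
    """Validate an s3a:// bucket path and ensure it ends with '/'.

    Hypothetical helper: it checks only the form of the string and
    does not contact AWS or verify that the bucket exists.
    """
    path = path.strip()
    prefix = "s3a://"
    if not path.startswith(prefix):
        raise ValueError("path must start with s3a://")
    if not path[len(prefix):].strip("/"):
        raise ValueError("path must name a bucket")
    # The field expects the final '/', so append it if it is missing.
    return path if path.endswith("/") else path + "/"
```

Calling normalize_bucket_path("s3a://cluster-lab.example.com/cust-input") returns the path with the final / appended, and a path that already ends in / is returned unchanged.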

To provide the correct access to the S3 bucket from Altus, you must also create the appropriate policy in the AWS
Management Console and apply the policy to the Amazon S3 bucket. To provide the correct access to the S3 bucket from Navigator, follow the steps in Cloudera Navigator Configuration.
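
As a rough illustration of the two access levels described in this document (read and write for Altus clusters, read for Cloudera Navigator), the following Python sketch builds an IAM policy document as JSON. The specific S3 actions listed are an assumption; this document does not state the exact set that Altus or Navigator requires. The sketch is written as an identity-based policy to attach to an IAM role or user; a bucket policy would also need a Principal element. The function name bucket_policy and its parameters are hypothetical.

```python
import json


def bucket_policy(bucket, prefix="", writable=False):
    """Build an IAM policy document (as a dict) for one S3 bucket.

    Assumption: the action list below is illustrative, not a confirmed
    requirement of Cloudera Altus or Cloudera Navigator.
    """
    object_actions = ["s3:GetObject"]
    if writable:
        # Clusters that write metadata and lineage also need to put objects.
        object_actions.append("s3:PutObject")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": "arn:aws:s3:::" + bucket,
            },
            {
                # Object-level actions apply to keys under the prefix.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": "arn:aws:s3:::{}/{}*".format(bucket, prefix),
            },
        ],
    }


if __name__ == "__main__":
    policy = bucket_policy("cluster-lab.example.com", "cust-input/", writable=True)
    print(json.dumps(policy, indent=2))
```

Calling the function with writable=False yields the read-only variant, which corresponds to the access level described for the Cloudera Navigator instance. Review any generated policy against your organization's security requirements before applying it.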

Cloudera Navigator Configuration

Cloudera Navigator runs in the context of Cloudera Manager Server. Its two role instances, the Navigator Audit Server and Navigator Metadata Server, run as part of the Cloudera Management
Service. The Navigator Metadata Server role instance is the component that extracts metadata and lineage from the Amazon S3 bucket using the AWS Credentials configured for connectivity in the steps
below:

If you already have AWS access keys for the Amazon S3 bucket but have not yet configured them for connectivity, follow the steps in Configuring Connectivity for AWS Credentials.

Important: Cloudera Navigator extracts metadata and lineage for clusters deployed using Altus from one Amazon S3 bucket only. In addition, for
any given Amazon S3 bucket collecting metadata and lineage from Altus clusters, configure only one Cloudera Navigator instance to extract from that Amazon S3 bucket. Using multiple Cloudera Navigator
instances to extract from the same Amazon S3 bucket is not supported and has unpredictable results.

Adding AWS Credentials and Configuring Connectivity

The AWS access keys must be added to the Cloudera Manager Server for use by Cloudera Navigator. These credentials must be from the AWS account hosting the Amazon S3 bucket that is
configured in the Altus environment.

Note: The AWS account associated with these credentials must have cross-account access permissions from the Altus user account that will launch
clusters on AWS and run jobs. These credentials must also have read and write permissions on the S3 bucket because the clusters launched must be able to write metadata and lineage information to the
Amazon S3 bucket as jobs run.

Log in to the Cloudera Manager Admin Console.

Select Administration > External Accounts.

In the AWS Credentials tab, click Add Access Key Credentials.

Enter a meaningful name for the AWS Credential, such as the type of jobs the associated clusters will run (for example, etl-processing). This name is for
your own information and is not checked against any Cloudera Altus or AWS attributes.
