Overview

Shared Data Lake Services provide a way for you to centrally apply and enforce authentication, authorization, and audit policies across multiple ephemeral workload clusters. After you attach a workload cluster to the Data Lake Services, that cluster's workloads run in the shared security context.

While workloads are temporary, the security policies are long-running and shared across all workloads. As your workloads come and go, the instance of Data Lake Services lives on, providing consistent security policy definitions for current and future ephemeral workloads.

Once you’ve created an instance of Data Lake Services - referred to in the cloud controller web UI as a "Data Lake" or "DLS" for simplicity - you can attach it to one or more ephemeral clusters. This allows you to apply authentication, authorization, and audit policies across multiple workload clusters.

Authentication Source: User source for authentication and definition of groups for authorization.

Data Lake Services: Runs Apache Ranger, which is used to configure authorization policies and capture audit events.

Attached Clusters: The clusters attached to the data lake. This is where you run workloads via JDBC and Zeppelin.
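Workloads on attached clusters reach Hive over JDBC (directly or through Zeppelin's JDBC interpreter). As a minimal sketch, a HiveServer2 connection URL for an attached cluster can be assembled as shown below; the host name, port, and database are illustrative placeholders, not values from this document:

```python
# Sketch: assemble a HiveServer2 JDBC URL for an attached cluster.
# Substitute your cluster's actual host; 10000 is HiveServer2's
# conventional default port.

def hive_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2 JDBC connection URL."""
    return f"jdbc:hive2://{host}:{port}/{database}"

url = hive_jdbc_url("master-host.example.com")
print(url)  # jdbc:hive2://master-host.example.com:10000/default
```

A URL of this shape is what you paste into a JDBC client or a Zeppelin JDBC interpreter configuration.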

Architecture

The components of the Shared Data Lake Services include:

Schema (Apache Hive): Provides Hive schema (tables, views, and so on). If you have two or more workloads accessing the same Hive data, you need to share schema across those workloads.

Policy (Apache Ranger): Defines security policies around the Hive schema. If you have two or more users accessing the same data, you need security policies to be consistently available and enforced.
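The Policy component is backed by Ranger, whose admin service also exposes a public v2 REST API for defining policies programmatically. As a hedged sketch (the host, port, service name, database, table, and group below are illustrative placeholders, not values from this document), granting a group SELECT on one Hive table might look like:

```python
import json
import urllib.request

# Sketch: build a Ranger v2 policy payload granting SELECT on one
# Hive table to one group. All names here are placeholders.

def hive_select_policy(service: str, database: str, table: str,
                       group: str) -> dict:
    """Build a Ranger v2 policy payload for a single-table SELECT grant."""
    return {
        "service": service,
        "name": f"select_{database}_{table}",
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": ["*"]},
        },
        "policyItems": [{
            "groups": [group],
            "accesses": [{"type": "select", "isAllowed": True}],
        }],
    }

if __name__ == "__main__":
    payload = hive_select_policy("dl_hive", "sales", "orders", "analysts")
    req = urllib.request.Request(
        "http://ranger-host:6080/service/public/v2/api/policy",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urllib.request.urlopen(req) would submit the policy; it is left
    # commented out because it requires a reachable Ranger admin host.
```

In practice most policies are defined in the Ranger web UI; the API is useful when policies need to be versioned or applied repeatedly.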

Overview of Steps

Once you’ve created a data lake, you can attach it to one or more ephemeral clusters. This option is available in the DATA LAKE SERVICES section when you create a cluster.

Prerequisites

To set up a data lake, you must first set up the following resources:

Amazon S3 bucket: You must have an existing Amazon S3 bucket. This bucket is used to store Ranger audit logs and serves as the default Hive warehouse location.

LDAP/AD instance: You must have an existing LDAP/AD instance, registered as an authentication source in the cloud controller web UI. For instructions, refer to Registering an Authentication Source.

Amazon RDS: You must have an existing Amazon RDS instance (PostgreSQL). You have two options:

Provide an Amazon RDS instance (PostgreSQL) without pre-created databases. When creating a data lake, you provide your endpoint and master credentials, and the cloud controller automatically creates databases for Hive and Ranger.

Provide an Amazon RDS instance (PostgreSQL) with two databases on it - one for Hive and one for Ranger. You register these databases when creating a data lake.
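If you choose the second RDS option, you create the two databases yourself before creating the data lake. A minimal sketch, assuming a psql session connected to the RDS instance as the master user (the database names are illustrative placeholders; use whatever names you will register in the data lake wizard):

```sql
-- Run against the RDS PostgreSQL instance as the master user.
CREATE DATABASE hive;    -- registered as the Hive metastore database
CREATE DATABASE ranger;  -- registered as the Ranger database
```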