Introduction

Hortonworks Data Cloud (HDCloud) for Amazon Web Services (AWS) is a service that allows you to quickly
launch ephemeral clusters for workloads that analyze and process data. Powered by
the Hortonworks Data Platform, Hortonworks Data Cloud is an
easy-to-use solution for handling big data use cases in the cloud: Interactive Analytics (Apache Hive LLAP),
Data Science (Apache Spark and Apache Zeppelin), and ETL (Apache Hive).

Use Cases

Ephemeral on-demand clusters: Spin up a Hadoop cluster within minutes and start running workloads immediately. Instead of working through endless configuration options, choose from a set of prescriptive cluster configurations. Add nodes on demand, and when you are done with your analysis, return the resources to the cloud.

Integrate with Amazon S3: Collect and publish data across applications to Amazon S3, and then use this data for analysis. Because S3 storage is independent of any cluster, you can collect and store data even while no Hadoop clusters are active, and data published to S3 persists after you terminate a cluster.

Automation: Automatically create clusters, run specific jobs, and then terminate the clusters.

Architecture

The following graphic illustrates the high-level architecture of Hortonworks Data Cloud:

Primary Components

The two primary components of Hortonworks Data Cloud are the cloud controller and one or more clusters being managed by that controller. The cloud controller and the cluster nodes run on EC2 instances.

The cloud controller is a web application that communicates with AWS services to create AWS resources on your behalf. Once the AWS resources are in place, the cloud controller uses Apache Ambari to deploy and configure the cluster on the EC2 instances, based on your choice of HDP version and cluster configuration. Once your cluster is deployed, you can use the cloud controller to manage it.

A cluster, used for running workloads, includes three node types: master, worker, and compute.

A master node runs the components for managing cluster resources (including Ambari), storing intermediate data (e.g. HDFS), and processing tasks, as well as other master components.

A worker node runs the components that are used for executing processing tasks (e.g. NodeManager) and handling storing data in HDFS (e.g. DataNode).

A compute node can optionally be used for running data processing tasks (e.g. NodeManager). Compute nodes can run on standard on-demand instances or on spot instances.

For the purposes of instance scaling and management, cluster instances are deployed into three auto scaling groups: one for the master node, one for the worker nodes, and one for the compute nodes. For more information on auto scaling groups, see AWS documentation.
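As an illustrative sketch of this three-group layout (the group names, sizes, and the `ScalingGroup` helper below are invented for illustration, not values or APIs from the product):

```python
# Illustrative model of the three auto scaling groups a cluster is deployed
# into: one for the master node, one for workers, one for compute nodes.
# Names and sizes are placeholders, not the service's actual values.
from dataclasses import dataclass

@dataclass
class ScalingGroup:
    name: str
    desired: int
    min_size: int
    max_size: int

def scale_to(group: ScalingGroup, desired: int) -> ScalingGroup:
    """Clamp the requested capacity to the group's min/max bounds."""
    bounded = max(group.min_size, min(group.max_size, desired))
    return ScalingGroup(group.name, bounded, group.min_size, group.max_size)

cluster = {
    "master": ScalingGroup("master", desired=1, min_size=1, max_size=1),
    "worker": ScalingGroup("worker", desired=3, min_size=1, max_size=10),
    "compute": ScalingGroup("compute", desired=0, min_size=0, max_size=20),
}

# Add compute capacity on demand (e.g. spot instances), then release it later:
cluster["compute"] = scale_to(cluster["compute"], 8)
```

Keeping compute nodes in their own group is what makes it cheap to grow and shrink processing capacity without touching the master or worker groups.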

AWS Lambda is a utility service for running code in AWS. This service is used when deploying the cloud controller into a new VPC to validate that the specified VPC and subnet exist and that the subnet belongs to that VPC.
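A minimal sketch of the kind of check this describes (the dictionaries mirror the shape of EC2's DescribeSubnets response; in a real Lambda function this data would come from the EC2 API, e.g. via boto3, and the IDs below are placeholders):

```python
# Sketch of the validation described above: given subnet metadata in the
# shape of an EC2 DescribeSubnets response, verify that the requested
# subnet exists and belongs to the requested VPC.

def subnet_belongs_to_vpc(subnets: list, subnet_id: str, vpc_id: str) -> bool:
    for subnet in subnets:
        if subnet["SubnetId"] == subnet_id:
            return subnet["VpcId"] == vpc_id
    return False  # the subnet does not exist at all

# Example data mimicking a DescribeSubnets response (IDs are placeholders):
subnets = [
    {"SubnetId": "subnet-0a1b", "VpcId": "vpc-1111"},
    {"SubnetId": "subnet-0c2d", "VpcId": "vpc-2222"},
]

print(subnet_belongs_to_vpc(subnets, "subnet-0a1b", "vpc-1111"))  # True
print(subnet_belongs_to_vpc(subnets, "subnet-0a1b", "vpc-2222"))  # False
```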

Amazon RDS provides a relational database in AWS. This service is used for managing reusable, shared Hive Metastores and as a configuration option when launching the cloud controller.

Network and Security

In addition to the Amazon EC2 instances created for the cloud controller and cluster nodes, Hortonworks Data Cloud deploys the following network and security AWS resources on your behalf:

An Amazon VPC configured with a public subnet: When deploying the cloud controller, you have two options: (1) specify an existing VPC, or (2) have the cloud controller create a new VPC. Each cluster is launched into a separate subnet. For more information, see Security documentation.

An Internet gateway and a route table (as part of VPC infrastructure): An Internet gateway is used to enable outbound access to the Internet from the control plane and the clusters, and a route table is used to connect the subnet to the Internet gateway. For more information on Amazon VPC architecture, see AWS documentation.
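To make the route table's role concrete, here is an illustrative check (the route dictionaries mirror the shape of EC2's DescribeRouteTables output; the gateway ID is a placeholder) for the default route that sends Internet-bound traffic to the gateway:

```python
# Outbound Internet access works because the subnet's route table contains a
# default route (0.0.0.0/0) pointing at the Internet gateway. This helper
# checks for that route; IDs are placeholders, not values from the product.

def has_internet_route(routes: list, igw_id: str) -> bool:
    """True if some route sends 0.0.0.0/0 (all non-local traffic) to the gateway."""
    return any(
        r.get("DestinationCidrBlock") == "0.0.0.0/0" and r.get("GatewayId") == igw_id
        for r in routes
    )

# Example routes mimicking a DescribeRouteTables response:
routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},   # in-VPC traffic
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"},  # Internet traffic
]

print(has_internet_route(routes, "igw-0abc"))  # True
```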

Security groups: to control the inbound and outbound traffic to and from the control plane instance. For more information, see Security documentation.

IAM instance roles: to hold the permissions to create certain resources. For more information, see Security documentation.

Amazon RDS

When creating a cluster, you have an option to have a Hive Metastore database created with the cluster or to use an external
Hive Metastore backed by Amazon RDS. Using an external Amazon RDS database for the Hive Metastore allows
you to preserve the Hive Metastore metadata and reuse it between clusters. For more information, see Managing Metastores documentation.

Furthermore, you have an option to use an external Amazon RDS database to store cloud controller configuration information for upgrade and recovery purposes. For more information, see Amazon RDS Instance documentation.

Amazon S3

Hortonworks Data Cloud provides seamless access to Amazon S3 buckets, in which you can store data for an extended period of time. You
can copy data sets to HDFS for analysis and then copy them back to S3 when done. For more information, see Data Storage on Amazon S3 documentation.

Get Started

This section gets you up and running with Hortonworks Data Cloud in your AWS environment.

The Hortonworks Data Cloud software runs in your AWS environment. You are
responsible for AWS charges incurred while running Hortonworks Data Cloud and the clusters it manages. To learn more about AWS pricing, see the service-specific pricing pages or the AWS Simple Monthly Calculator.

Prerequisites

To use Hortonworks Data Cloud, you need the following:

AWS account: If you already have an AWS account, log in to the AWS Management Console.
Alternatively, you can create a new AWS account.

A key pair in a selected region: The Amazon EC2 instances that you create for Hortonworks Data Cloud
will be accessible by the key pair that you provide during installation. Refer to
the AWS documentation
for instructions on how to create a key pair in a selected region.

AWS Regions

Not all AWS services are supported in all regions (for details, see the AWS Region Table).
Therefore, Hortonworks Data Cloud can only be launched in the following regions:

This cluster configuration includes a Technical Preview of Hive 2 LLAP.

For a full list of services included in each of the configurations, refer to Cluster Services.

Choosing Your Configuration

When creating a cluster, you can choose a more stable cluster configuration for a predictable experience.
Alternatively, you can try the latest capabilities by choosing a more experimental cluster configuration.
The following configuration classification applies:

Stable configurations are the best choice if you want to avoid issues when launching and using clusters.

Use these configurations if you want to use a Technical Preview version of a component in a release of HDP.

These are the most cutting-edge configurations, including Technical Preview components in a Technical Preview HDP release.