How To Set Up a Shared Amazon RDS as Your Hive Metastore

Before CDH 5.10, every CDH cluster had to have its own Apache Hive Metastore (HMS) backend database. This model is ideal when each cluster stores its data locally along with the metadata. In the cloud, however, many CDH clusters run directly on a shared object store (such as Amazon S3), so the data can be shared across multiple clusters and can outlive any single cluster. In this scenario, each cluster has to regenerate and coordinate metadata for the underlying shared data on its own.

From CDH 5.10 onward, clusters running in the AWS cloud can share a single persistent Amazon RDS instance as the HMS backend database. This allows metadata to persist beyond a cluster’s life cycle, so subsequent clusters no longer need to regenerate it.

Advantages of This Approach

Using a shared Amazon RDS server as your HMS backend enables you to deploy and share data and metadata across multiple transient as well as persistent clusters, provided they adhere to the restrictions outlined in the “Supported Scenarios” section below. For example, you can have multiple transient Hive or Apache Spark clusters writing table data and metadata that a persistent Apache Impala (incubating) cluster can subsequently query. Or you might have two or three transient clusters, each handling different types of jobs on different datasets, that spin up, read raw data from S3, do the ETL, write data out to S3, and spin down. In this scenario, you want each cluster to be able to simply point to a permanent HMS and do the ETL. Using RDS as a shared HMS backend database greatly reduces your overhead because you no longer need to recreate the HMS for every cluster and every transient ETL job, day after day.