How Amazon EMR Uses AWS KMS

When you use an Amazon EMR cluster, you can configure
the cluster to encrypt data at rest, which means the cluster encrypts data
before saving it to a persistent storage location. You can encrypt data at rest on
the EMR File
System (EMRFS), on the storage volumes of cluster nodes, or both. To encrypt data
at rest, you
can use a customer master key (CMK) in AWS KMS. The following topics explain how an
Amazon EMR cluster
uses a CMK to encrypt data at rest.

Amazon EMR clusters also encrypt data in transit, which means the cluster
encrypts data before sending it through the network. You cannot use a CMK to encrypt
data in
transit. For more information, see In-Transit Data Encryption in the Amazon EMR Release Guide.

Encrypting Data on the EMR File System (EMRFS)

Amazon EMR clusters use two distributed file systems:

The Hadoop Distributed File System (HDFS). HDFS encryption does not use a CMK in
AWS KMS.

The EMR File System (EMRFS). EMRFS is an implementation of HDFS that allows Amazon
EMR
clusters to store data in Amazon Simple Storage Service (Amazon S3). EMRFS supports
four encryption options, two of
which use a CMK in AWS KMS. For more information about all four of the EMRFS encryption
options, see At-Rest Encryption for Amazon S3 with EMRFS in the Amazon EMR Release
Guide.

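The two EMRFS encryption options that use a CMK use the following encryption features
offered by Amazon S3:

Server-side encryption with AWS KMS-managed keys (SSE-KMS). The Amazon EMR cluster sends
data to Amazon S3, and Amazon S3 uses a CMK to encrypt the data before saving it to an S3 bucket.

Client-side encryption with AWS KMS-managed keys (CSE-KMS). The Amazon EMR cluster uses a
CMK to encrypt the data before sending it to Amazon S3 for storage.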

When you configure an Amazon EMR cluster to encrypt data on EMRFS with SSE-KMS or
CSE-KMS, you
choose the CMK in AWS KMS that you want Amazon S3 or the Amazon EMR cluster to use.
With SSE-KMS, you can
choose the AWS-managed CMK for Amazon S3 with the alias aws/s3,
or a custom CMK that you create. With CSE-KMS, you must choose a custom CMK that you
create.
When you choose a custom CMK, you must ensure that your Amazon EMR cluster has permission
to use
the CMK. For more information, see Add the EMR Instance Role to an AWS KMS CMK in the Amazon EMR Release
Guide.
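
As an illustration of the choice described above, the following Python sketch builds the at-rest portion of an Amazon EMR security configuration for either SSE-KMS or CSE-KMS. The function name and the key ARN in the usage note are hypothetical, and the field names are assumed to follow the Amazon EMR security configuration JSON format; verify them against the Amazon EMR documentation before relying on them.

```python
import json

# The two EMRFS encryption modes that use a CMK in AWS KMS.
KMS_BACKED_MODES = {"SSE-KMS", "CSE-KMS"}


def s3_encryption_config(mode, kms_key_arn):
    """Build a security-configuration document for EMRFS at-rest encryption.

    Field names are an assumption based on the EMR security configuration
    format; this helper itself is hypothetical.
    """
    if mode not in KMS_BACKED_MODES:
        raise ValueError(f"mode must be one of {sorted(KMS_BACKED_MODES)}")
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": mode,
                    "AwsKmsKey": kms_key_arn,
                }
            },
        }
    }


if __name__ == "__main__":
    # Placeholder key ARN, for illustration only.
    arn = "arn:aws:kms:us-west-2:111122223333:key/example-key-id"
    print(json.dumps(s3_encryption_config("CSE-KMS", arn), indent=2))
```

The resulting JSON could then be supplied when you create a security configuration for the cluster.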

For both SSE-KMS and CSE-KMS, the CMK you choose is the master key in an envelope encryption workflow. This means the data is encrypted
with a unique data encryption key (or data key), and this data key is
encrypted under the CMK in AWS KMS. The encrypted data and an encrypted copy of its
data key are
stored together as a single encrypted object in an S3 bucket. For more information
about how
this works, see the following topics.

Process for Encrypting Data on EMRFS with SSE-KMS

When you configure an Amazon EMR cluster to use SSE-KMS, the encryption process works
like
this:

The cluster sends data to Amazon S3 for storage in an S3 bucket.

Amazon S3 sends a GenerateDataKey request to AWS KMS, specifying the key ID of the CMK that you
chose when you configured the cluster to use SSE-KMS. The request includes encryption
context; for more information, see Encryption Context.

AWS KMS generates a unique data encryption key (data key) and then sends two copies
of
this data key to Amazon S3. One copy is unencrypted (plaintext), and the other copy
is
encrypted under the CMK.

Amazon S3 uses the plaintext data key to encrypt the data that it received from the
cluster, and then removes the plaintext data key from memory as soon as possible after use.

Amazon S3 stores the encrypted data and the encrypted copy of the data key together
as a
single encrypted object in an S3 bucket.

The decryption process works like this:

The cluster requests an encrypted data object from an S3 bucket.

Amazon S3 extracts the encrypted data key from the S3 object, and then sends the
encrypted data key to AWS KMS with a Decrypt request. The request includes encryption context; for more
information, see Encryption Context.

AWS KMS decrypts the encrypted data key using the same CMK that was used to encrypt
it, and then sends the decrypted (plaintext) data key to Amazon S3.

Amazon S3 uses the plaintext data key to decrypt the encrypted data, and then removes
the
plaintext data key from memory as soon as possible after use.

Amazon S3 sends the decrypted data to the cluster.
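
Because Amazon S3 makes the GenerateDataKey and Decrypt requests in both SSE-KMS processes, the caller only sees ordinary upload and download requests. The following Python sketch shows what such requests might look like from a generic client. It is not the EMRFS implementation. It assumes that `s3_client` is a boto3 S3 client (for example, `boto3.client("s3")`); the helper names, bucket, and key ID are hypothetical.

```python
def put_object_sse_kms(s3_client, bucket, key, body, kms_key_id):
    """Upload data and ask Amazon S3 to encrypt it under the given CMK.

    Amazon S3, not the caller, requests the data key from AWS KMS.
    """
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ServerSideEncryption="aws:kms",  # request SSE-KMS
        SSEKMSKeyId=kms_key_id,          # CMK that protects the data key
    )


def get_object_bytes(s3_client, bucket, key):
    """Download an object; Amazon S3 decrypts it before returning the bytes."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()
```

Note that the download request carries no key material. The caller still needs permission to use the CMK, because Amazon S3 calls AWS KMS on the caller's behalf.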

Process for Encrypting Data on EMRFS with CSE-KMS

When you configure an Amazon EMR cluster to use CSE-KMS, the encryption process works
like
this:

When it's ready to store data in Amazon S3, the cluster sends a GenerateDataKey request to AWS KMS,
specifying the key ID of the CMK that you chose when you configured the cluster to
use
CSE-KMS. The request includes encryption context; for more information, see Encryption Context.

AWS KMS generates a unique data encryption key (data key) and then sends two copies
of
this data key to the cluster. One copy is unencrypted (plaintext), and the other copy
is
encrypted under the CMK.

The cluster uses the plaintext data key to encrypt the data, and then removes the
plaintext data key from memory as soon as possible after use.

The cluster combines the encrypted data and the encrypted copy of the data key
together into a single encrypted object.

The cluster sends the encrypted object to Amazon S3 for storage.
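
The client-side encryption steps above can be sketched in Python as follows. This is a minimal illustration, not the EMRFS implementation. It assumes that `kms_client` is a boto3 KMS client, that `encrypt_fn` is a caller-supplied cipher function (for example, AES-GCM from a cryptography library), and that a 256-bit data key is requested. The length-prefixed object layout is hypothetical and chosen only to keep the example short.

```python
import struct


def encrypt_for_s3(kms_client, key_id, encryption_context, plaintext, encrypt_fn):
    """Envelope-encrypt `plaintext` and return one combined object (bytes)."""
    # Ask AWS KMS for a new data key under the chosen CMK.
    response = kms_client.generate_data_key(
        KeyId=key_id,
        KeySpec="AES_256",  # assumed key size
        EncryptionContext=encryption_context,
    )
    encrypted_key = response["CiphertextBlob"]  # copy encrypted under the CMK
    data_key = response.pop("Plaintext")        # unencrypted copy
    try:
        # Encrypt the data locally with the plaintext data key.
        ciphertext = encrypt_fn(data_key, plaintext)
    finally:
        # Drop our reference promptly. Python cannot guarantee that the
        # underlying memory is zeroed.
        del data_key
    # Hypothetical layout: 4-byte big-endian length of the encrypted key,
    # then the encrypted key, then the ciphertext.
    return struct.pack(">I", len(encrypted_key)) + encrypted_key + ciphertext
```

The returned bytes are what the caller would then upload to Amazon S3. Only the encrypted copy of the data key leaves the process.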

The decryption process works like this:

The cluster requests the encrypted data object from an S3 bucket.

Amazon S3 sends the encrypted object to the cluster.

The cluster extracts the encrypted data key from the encrypted object, and then
sends the encrypted data key to AWS KMS with a Decrypt request. The request includes encryption context; for more
information, see Encryption Context.

AWS KMS decrypts the encrypted data key using the same CMK that was used to encrypt
it, and then sends the decrypted (plaintext) data key to the cluster.

The cluster uses the plaintext data key to decrypt the encrypted data, and then
removes the plaintext data key from memory as soon as possible after use.
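
The client-side decryption steps above can be sketched in a similar way. This minimal Python illustration assumes that `kms_client` is a boto3 KMS client, that `decrypt_fn` is a caller-supplied cipher function, and that the object uses a hypothetical layout: a 4-byte big-endian length, the encrypted data key, and then the ciphertext. It is not the EMRFS implementation.

```python
import struct


def decrypt_from_s3(kms_client, encryption_context, encrypted_object, decrypt_fn):
    """Split a combined object, unwrap its data key, and decrypt the data."""
    # Extract the encrypted data key from the front of the object.
    (key_len,) = struct.unpack(">I", encrypted_object[:4])
    encrypted_key = encrypted_object[4:4 + key_len]
    ciphertext = encrypted_object[4 + key_len:]

    # Ask AWS KMS to decrypt the data key. The encryption context must match
    # the one used when the data key was generated.
    response = kms_client.decrypt(
        CiphertextBlob=encrypted_key,
        EncryptionContext=encryption_context,
    )
    data_key = response.pop("Plaintext")
    try:
        return decrypt_fn(data_key, ciphertext)
    finally:
        # Drop our reference promptly. Python cannot guarantee that the
        # underlying memory is zeroed.
        del data_key
```

No key ID is passed in the Decrypt request in this sketch. For a symmetric CMK, AWS KMS can identify the key from metadata in the encrypted data key.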

Encrypting Data on the Storage Volumes of Cluster Nodes

An Amazon EMR cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2)
instances. Each instance in the
cluster is called a cluster node or node. Each node
can have two types of storage volumes: instance store volumes, and Amazon Elastic
Block Store (Amazon EBS) volumes.
You can configure the cluster to use Linux Unified Key Setup
(LUKS) to encrypt both types of storage volumes on the nodes (but not the boot
volume of each node). This is called local disk encryption.

When you enable local disk encryption for a cluster, you can choose to encrypt the
LUKS
master key with a CMK in AWS KMS. You must choose a custom CMK that you create; you
cannot use
an AWS-managed CMK. When you choose a custom CMK, you must ensure that your Amazon
EMR cluster
has permission to use the CMK. For more information, see Add the EMR Instance Role to an AWS KMS CMK in the Amazon EMR Release
Guide.
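
To make the permission requirement more concrete, the following Python sketch builds a key policy statement that lets a role call the two AWS KMS operations that appear in the processes on this page, GenerateDataKey and Decrypt. The helper name, statement ID, and role ARN are hypothetical, and the action list is an assumption: the procedure in the Amazon EMR Release Guide might grant additional permissions.

```python
import json


def instance_role_key_statement(role_arn):
    """Return a key policy statement that allows `role_arn` to use the CMK.

    Only the operations named on this page are listed; treat the list as a
    starting point, not as the complete requirement.
    """
    return {
        "Sid": "AllowEmrInstanceRoleToUseKey",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
        "Resource": "*",  # in a key policy, "*" refers to the CMK itself
    }


if __name__ == "__main__":
    # Placeholder account ID and role name, for illustration only.
    role = "arn:aws:iam::111122223333:role/ExampleEmrInstanceRole"
    print(json.dumps(instance_role_key_statement(role), indent=2))
```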

When you enable local disk encryption using a CMK, the encryption process works like
this:

When each cluster node launches, it sends a GenerateDataKey request to AWS KMS,
specifying the key ID of the CMK that you chose when you enabled local disk encryption
for
the cluster.

AWS KMS generates a unique data encryption key (data key) and then sends two copies
of
this data key to the node. One copy is unencrypted (plaintext), and the other copy
is
encrypted under the CMK.

The node uses a base64-encoded version of the plaintext data key as the password that
protects the LUKS master key. The node saves the encrypted copy of the data key on
its
boot volume.

If the node reboots, the rebooted node sends the encrypted data key to AWS KMS with
a
Decrypt request.

AWS KMS decrypts the encrypted data key using the same CMK that was used to encrypt
it,
and then sends the decrypted (plaintext) data key to the node.

The node uses the base64-encoded version of the plaintext data key as the password
to
unlock the LUKS master key.
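
The key handling in the local disk encryption process above can be sketched in Python as follows. This is an illustration only. It assumes that `kms_client` is a boto3 KMS client and that a 256-bit data key is requested. The helper names are hypothetical, and the LUKS operations themselves, along with writing the encrypted data key to the boot volume, are left out.

```python
import base64


def new_luks_passphrase(kms_client, key_id):
    """At first launch: get a data key and derive the LUKS passphrase from it.

    Returns (passphrase, encrypted_data_key). The caller would store the
    encrypted data key on the boot volume.
    """
    response = kms_client.generate_data_key(KeyId=key_id, KeySpec="AES_256")
    passphrase = base64.b64encode(response["Plaintext"])
    return passphrase, response["CiphertextBlob"]


def recover_luks_passphrase(kms_client, encrypted_data_key):
    """After a reboot: decrypt the saved data key and rebuild the passphrase."""
    response = kms_client.decrypt(CiphertextBlob=encrypted_data_key)
    return base64.b64encode(response["Plaintext"])
```

Both functions produce the same passphrase for the same data key, which is what allows a rebooted node to unlock the LUKS master key without storing the plaintext data key on disk.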

Encryption Context

Each AWS service that is integrated with AWS KMS can specify encryption
context when it uses AWS KMS to generate data keys or to encrypt or decrypt data.
Encryption context is additional authenticated information that AWS KMS uses to check
for data
integrity. When a service specifies encryption context for an encryption operation,
it must
specify the same encryption context for the corresponding decryption operation or
decryption
will fail. Encryption context is also written to AWS CloudTrail log files, which can
help you
understand why a given CMK was used. For more information about encryption context,
see Encryption Context.
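
The following Python sketch is a toy model of why decryption fails when the encryption context does not match. It uses only the Python standard library and binds the context to the wrapped data key with an HMAC tag. It is not how AWS KMS is implemented, it is not secure enough for real use, and all function names are hypothetical.

```python
import hashlib
import hmac
import json
import os


def _canonical(context):
    # Serialize deterministically so that key order does not matter.
    return json.dumps(context, sort_keys=True, separators=(",", ":")).encode("utf-8")


def toy_wrap(master_key, data_key, context):
    """Wrap a 32-byte data key and authenticate the context alongside it."""
    if len(data_key) != 32:
        raise ValueError("this toy model only wraps 32-byte data keys")
    nonce = os.urandom(16)
    stream = hmac.new(master_key, b"wrap" + nonce, hashlib.sha256).digest()
    wrapped = bytes(a ^ b for a, b in zip(data_key, stream))
    tag = hmac.new(
        master_key, b"tag" + nonce + wrapped + _canonical(context), hashlib.sha256
    ).digest()
    return nonce + wrapped + tag


def toy_unwrap(master_key, blob, context):
    """Unwrap a data key; raise ValueError if the context does not match."""
    nonce, wrapped, tag = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(
        master_key, b"tag" + nonce + wrapped + _canonical(context), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("encryption context or ciphertext does not match")
    stream = hmac.new(master_key, b"wrap" + nonce, hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(wrapped, stream))
```

In this model, the context is never encrypted or stored in the wrapped key. It only has to be supplied again, unchanged, when the key is unwrapped.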

The following sections explain the encryption context that is used in each Amazon EMR
encryption scenario that uses a CMK.

Encryption Context for EMRFS Encryption with SSE-KMS

With SSE-KMS, the Amazon EMR cluster sends data to Amazon S3, and then Amazon S3 uses
a CMK to encrypt
the data before saving it to an S3 bucket. In this case, Amazon S3 uses the Amazon
Resource Name
(ARN) of the S3 object as encryption context with each GenerateDataKey and Decrypt request that it sends to AWS KMS. The
following example shows a JSON representation of the encryption context that Amazon
S3
uses.

{ "aws:s3:arn" : "arn:aws:s3:::S3_bucket_name/S3_object_key" }
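
When you search AWS CloudTrail log files for requests related to a particular object, it can help to reconstruct this encryption context yourself. The following Python sketch builds it from a bucket name and object key, following the format shown above. The function name and the sample values are hypothetical.

```python
import json


def s3_encryption_context(bucket, object_key):
    """Return the SSE-KMS encryption context for one S3 object."""
    return {"aws:s3:arn": f"arn:aws:s3:::{bucket}/{object_key}"}


if __name__ == "__main__":
    # Sample bucket and key, for illustration only.
    print(json.dumps(s3_encryption_context("example-bucket", "data/part-0000")))
```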

Encryption Context for EMRFS Encryption with CSE-KMS

With CSE-KMS, the Amazon EMR cluster uses a CMK to encrypt data before sending it
to Amazon S3 for
storage. In this case, the cluster uses the Amazon Resource Name (ARN) of the CMK
as
encryption context with each GenerateDataKey and Decrypt
request that it sends to AWS KMS. The following example shows a JSON representation
of the
encryption context that the cluster uses.