Configuring CDH Services for HDFS Encryption

Note: This page contains references to CDH 5 components or features that have been removed from CDH 6. These references are only applicable if you
are managing a CDH 5 cluster with Cloudera Manager 6. For more information, see Deprecated Items.

Important: HDFS encryption does not support file transfer (reading or writing files) between zones through WebHDFS. For web-based file transfer
between encryption zones managed by HDFS, use HttpFS with a load balancer instead.

HBase

Recommendations

Make /hbase an encryption zone. Do not create encryption zones as subdirectories under /hbase, because HBase may need to
rename files across those subdirectories. When you create the encryption zone, name the key hbase-key to take advantage of auto-generated KMS ACLs.

Steps

On a cluster without HBase currently installed, create the /hbase directory and make that an encryption zone.

On a cluster with HBase already installed, perform the following steps:

Stop the HBase service.

Move data from the /hbase directory to /hbase-tmp.

Create an empty /hbase directory and make it an encryption zone.

Use DistCp to copy all data from /hbase-tmp to /hbase, preserving user, group, and permission settings and extended attributes.

Start the HBase service and verify that it is working as expected.

Remove the /hbase-tmp directory.
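The steps for an existing installation can be sketched as an ordered list of HDFS commands. This is a minimal illustration, not a tested procedure: it assumes the key is created with hadoop key create, that the commands run as a user with the required HDFS and KMS privileges, and that -update and -skipcrccheck are added to DistCp (an assumption not stated above) so that contents land directly in /hbase and checksum differences between unencrypted and encrypted files do not fail the copy. The function names are hypothetical. Stopping and starting HBase happens in Cloudera Manager and is not shown.

```python
import shlex

def hbase_migration_commands(key="hbase-key", live="/hbase", tmp="/hbase-tmp"):
    """Commands to run after stopping HBase and before restarting it."""
    return [
        # Move the existing data aside.
        ["hdfs", "dfs", "-mv", live, tmp],
        # Create the zone key, then an empty directory that becomes the zone.
        ["hadoop", "key", "create", key],
        ["hdfs", "dfs", "-mkdir", live],
        ["hdfs", "crypto", "-createZone", "-keyName", key, "-path", live],
        # -pugpx preserves user, group, permissions, and extended attributes.
        # -update and -skipcrccheck are assumptions added for this sketch.
        ["hadoop", "distcp", "-pugpx", "-update", "-skipcrccheck", tmp, live],
    ]

def cleanup_command(tmp="/hbase-tmp"):
    """Run only after HBase has been restarted and verified."""
    return ["hdfs", "dfs", "-rm", "-r", tmp]

for argv in hbase_migration_commands():
    print(shlex.join(argv))
```

Keeping the cleanup step in a separate function mirrors the order above: the temporary copy is removed only after the service has been verified.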

KMS ACL Configuration for HBase

In the KMS ACLs, grant the hbase user and group DECRYPT_EEK permission for the HBase key:
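As a rough illustration, the snippet below renders a kms-acls.xml style property for this grant. It assumes the key is named hbase-key as recommended above, and it relies on the Hadoop ACL string convention of a comma-separated user list, a space, then a comma-separated group list. The helper kms_acl_property is hypothetical, not part of any Cloudera tool.

```python
from xml.sax.saxutils import escape

def kms_acl_property(key_name, operation, users=(), groups=()):
    """Render one key ACL property; the value has the form 'users groups'."""
    value = (",".join(users) + " " + ",".join(groups)).rstrip()
    return (
        "<property>\n"
        f"  <name>key.acl.{key_name}.{operation}</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )

hbase_acl = kms_acl_property("hbase-key", "DECRYPT_EEK",
                             users=["hbase"], groups=["hbase"])
print(hbase_acl)
```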

Hive

HDFS encryption has been designed so that files cannot be moved from one encryption zone to another or from encryption zones to unencrypted directories. Therefore, the landing zone for
data when using the LOAD DATA INPATH command must always be inside the destination encryption zone.

To use HDFS encryption with Hive, ensure you are using one of the following configurations:

Single Encryption Zone

With this configuration, you can use HDFS encryption by having all Hive data inside the same encryption zone. In Cloudera Manager, configure the Hive Scratch Directory (hive.exec.scratchdir) to be inside the encryption zone.

Recommended HDFS Path: /user/hive

To use the auto-generated KMS ACL, make sure you name the encryption key hive-key.

For example, to configure a single encryption zone for the entire Hive warehouse, you can rename /user/hive to /user/hive-old, create an encryption zone at /user/hive, and then distcp all the data from /user/hive-old to /user/hive.

In Cloudera Manager, configure the Hive Scratch Directory (hive.exec.scratchdir) to be inside the encryption zone by setting it to /user/hive/tmp, ensuring that permissions are 1777 on /user/hive/tmp.
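The scratch directory requirement can be checked and prepared with a short sketch like the following. It assumes the zone root is /user/hive and that the directory is created with standard hdfs dfs commands. The function names are hypothetical, and the hive.exec.scratchdir setting itself is still changed in Cloudera Manager.

```python
import posixpath

def is_inside(path, zone_root):
    """True if path is the zone root itself or lies below it."""
    path = posixpath.normpath(path)
    zone_root = posixpath.normpath(zone_root)
    return path == zone_root or path.startswith(zone_root.rstrip("/") + "/")

def scratchdir_commands(zone_root="/user/hive", scratch="/user/hive/tmp"):
    """Commands that create the scratch directory with mode 1777."""
    if not is_inside(scratch, zone_root):
        raise ValueError(f"{scratch} is not inside encryption zone {zone_root}")
    return [
        ["hdfs", "dfs", "-mkdir", "-p", scratch],
        ["hdfs", "dfs", "-chmod", "1777", scratch],
    ]

for argv in scratchdir_commands():
    print(" ".join(argv))
```

The prefix check compares whole path components, so /user/hive-old/tmp is not mistaken for a location inside /user/hive.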

Multiple Encryption Zones

With this configuration, you can use encrypted databases or tables with different encryption keys. To read data from read-only encrypted tables, users must have access to a temporary
directory that is encrypted at least as strongly as the table.

Other Encrypted Directories

LOCALSCRATCHDIR: The MapJoin optimization in Hive writes HDFS tables to a local directory and then uploads them to the
distributed cache. To ensure these files are encrypted, either disable MapJoin by setting hive.auto.convert.join to false, or encrypt the
local Hive Scratch directory (hive.exec.local.scratchdir) using Cloudera
Navigator Encrypt.

DOWNLOADED_RESOURCES_DIR: JARs that are added to a user session and stored in HDFS are downloaded to hive.downloaded.resources.dir on the HiveServer2 local filesystem. To encrypt these JAR files, configure Cloudera Navigator Encrypt to encrypt the directory specified by hive.downloaded.resources.dir.

NodeManager Local Directory List: Hive stores JARs and MapJoin files in the distributed cache. To use MapJoin or encrypt JARs and other resource files,
the yarn.nodemanager.local-dirs YARN configuration property must be configured to a set of encrypted local directories on all nodes.

Changed Behavior after HDFS Encryption is Enabled

Loading data from one encryption zone to another results in a copy of the data. Distcp is used to speed up the process if the size of the files being copied is larger than the value
specified by HIVE_EXEC_COPYFILE_MAXSIZE. The default value of HIVE_EXEC_COPYFILE_MAXSIZE is 32 MB, which you can modify by changing
the value for the hive.exec.copyfile.maxsize configuration property.
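The size rule can be modeled in a few lines. This is a simplified sketch of the decision described above, not Hive's actual code path: the function name is hypothetical, and real deployments may apply additional conditions that are not covered here.

```python
# 32 MB, the value quoted above for hive.exec.copyfile.maxsize.
COPYFILE_MAXSIZE_BYTES = 32 * 1024 * 1024

def copy_strategy(total_bytes, maxsize=COPYFILE_MAXSIZE_BYTES):
    """Return which mechanism the size rule selects for a cross-zone load."""
    return "distcp" if total_bytes > maxsize else "direct copy"

print(copy_strategy(10 * 1024 * 1024))   # small load
print(copy_strategy(500 * 1024 * 1024))  # large load
```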

When loading data to encrypted tables, Cloudera strongly recommends using a landing zone inside the same encryption zone as the table.

Example 1: Loading unencrypted data to an encrypted table - Use one of the following methods:

If you are loading new unencrypted data to an encrypted table, use the LOAD DATA ... statement. Because the source data is not inside the encryption
zone, the LOAD statement results in a copy. For this reason, Cloudera recommends landing data that you need to encrypt inside the destination encryption zone. You can
use distcp to speed up the copying process if your data is inside HDFS.

If the data to be loaded is already inside a Hive table, you can create a new table with a LOCATION inside an encryption zone as follows:
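A statement of this kind might look like the one built below. The table names and the LOCATION path are hypothetical placeholders, and the sketch assumes /user/hive is already an encryption zone. The helper only assembles the HiveQL text and does not execute it.

```python
def ctas_into_zone(new_table, source_table, location):
    """Build a CREATE TABLE ... AS SELECT statement with an explicit LOCATION."""
    if "'" in location:
        raise ValueError("location must not contain a single quote")
    return (
        f"CREATE TABLE {new_table} "
        f"LOCATION '{location}' "
        f"AS SELECT * FROM {source_table}"
    )

# Hypothetical names: copy table 'sales' into the /user/hive zone.
statement = ctas_into_zone("encrypted_sales", "sales", "/user/hive/encrypted/sales")
print(statement)
```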

The location specified in the CREATE TABLE statement must be inside an encryption zone. Creating a table pointing LOCATION to an
unencrypted directory does not encrypt your source data. You must copy your data to an encryption zone, and then point LOCATION to that zone.

Example 2: Loading encrypted data to an encrypted table - If the data is already encrypted, use the CREATE TABLE statement
pointing LOCATION to the encrypted source directory containing the data. This is the fastest way to create encrypted tables.

Users reading data from encrypted tables that are read-only must have access to a temporary directory which is encrypted with at least as strong encryption as the table.

Temporary data is now written to a directory named .hive-staging in each table or partition.

Previously, an INSERT OVERWRITE on a partitioned table inherited permissions for new data from the existing partition directory. With encryption enabled,
permissions are inherited from the table.

KMS ACL Configuration for Hive

When Hive joins tables, it compares the encryption key strength for each table. For this operation to succeed, you must configure the KMS ACL to allow the hive user and group READ access to the Hive key:
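A sketch of such an entry, assuming the key is named hive-key as recommended earlier and that the value follows the Hadoop "users groups" ACL string convention:

```python
import xml.etree.ElementTree as ET

# Build one kms-acls.xml style property granting READ on hive-key.
prop = ET.Element("property")
ET.SubElement(prop, "name").text = "key.acl.hive-key.READ"
ET.SubElement(prop, "value").text = "hive hive"  # user list, space, group list

hive_read_acl = ET.tostring(prop, encoding="unicode")
print(hive_read_acl)
```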

If you have disabled HiveServer2 Impersonation (for example, to use Apache Sentry), you must configure the KMS ACLs to grant DECRYPT_EEK permissions to the hive user, as well as any user accessing data in the Hive warehouse.

Cloudera recommends creating a group containing all Hive users, and granting DECRYPT_EEK access to that group.

For example, suppose user jdoe (home directory /user/jdoe) is a Hive user and a member of the group hive-users. The encryption zone (EZ) key for /user/jdoe is named jdoe-key, and the EZ key for /user/hive is hive-key. The following ACL example demonstrates the required permissions:
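One plausible reading of these requirements, expressed as ACL property names and values, is sketched below. It assumes that with impersonation disabled the hive user performs all reads and therefore also needs jdoe-key to reach data under /user/jdoe. Treat the exact grants as an assumption to validate against your own policy.

```python
def acl_value(users=(), groups=()):
    """Hadoop ACL string: comma-separated users, a space, comma-separated groups."""
    return (",".join(users) + " " + ",".join(groups)).rstrip()

# Impersonation disabled: queries run as the hive user.
acls_without_impersonation = {
    "key.acl.hive-key.DECRYPT_EEK": acl_value(users=["hive"], groups=["hive-users"]),
    "key.acl.jdoe-key.DECRYPT_EEK": acl_value(users=["jdoe", "hive"]),
}

for name, value in acls_without_impersonation.items():
    print(f"{name} = {value!r}")
```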

If you have enabled HiveServer2 impersonation, data is accessed by the user submitting the query or job, and the user account (jdoe in this example) may
still need to access data in their home directory. In this scenario, the required permissions are as follows:
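A sketch of the corresponding grants follows. It assumes that jdoe reaches hive-key through membership in hive-users and that the hive user no longer needs jdoe-key. In Hadoop ACL strings a leading space means "groups only"; whether your KMS configuration preserves that leading space is worth confirming, and a placeholder user name in front of the space is a possible alternative.

```python
def acl_value(users=(), groups=()):
    """Hadoop ACL string: comma-separated users, a space, comma-separated groups."""
    return (",".join(users) + " " + ",".join(groups)).rstrip()

# Impersonation enabled: queries run as the submitting user (jdoe).
acls_with_impersonation = {
    "key.acl.hive-key.DECRYPT_EEK": acl_value(groups=["hive-users"]),  # groups only
    "key.acl.jdoe-key.DECRYPT_EEK": acl_value(users=["jdoe"]),
}

for name, value in acls_with_impersonation.items():
    print(f"{name} = {value!r}")
```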

Hue

Recommendations

Make /user/hue an encryption zone because Oozie workflows and other Hue-specific data are stored there by default. When you create the encryption zone,
name the key hue-key to take advantage of auto-generated KMS ACLs.

Steps

On a cluster without Hue currently installed, create the /user/hue directory and make it an encryption zone.

On a cluster with Hue already installed:

Create an empty /user/hue-tmp directory.

Make /user/hue-tmp an encryption zone.

DistCp all data from /user/hue into /user/hue-tmp.

Remove /user/hue and rename /user/hue-tmp to /user/hue.
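The sequence for an existing installation can be generated as a small script. This sketch assumes the key is created with hadoop key create and adds -update and -skipcrccheck to DistCp, which the steps above do not specify. The function name is hypothetical, and the output is meant to be reviewed before anything is run.

```python
import shlex

def tmp_zone_migration_script(path="/user/hue", key="hue-key"):
    """Build the commands that replace `path` with an encrypted copy of itself."""
    tmp = path + "-tmp"
    steps = [
        ["hadoop", "key", "create", key],
        ["hdfs", "dfs", "-mkdir", tmp],
        ["hdfs", "crypto", "-createZone", "-keyName", key, "-path", tmp],
        # -update and -skipcrccheck are assumptions added for this sketch.
        ["hadoop", "distcp", "-update", "-skipcrccheck", path, tmp],
        ["hdfs", "dfs", "-rm", "-r", path],
        ["hdfs", "dfs", "-mv", tmp, path],
    ]
    return "\n".join(shlex.join(step) for step in steps)

print(tmp_zone_migration_script())
```

Because the path and key are parameters, the same sketch could be reused for any directory that follows this temporary-zone pattern, for example tmp_zone_migration_script("/user/history", "mapred-key").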

KMS ACL Configuration for Hue

In the KMS ACLs, grant the hue and oozie users and
groups DECRYPT_EEK permission for the Hue key:
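For illustration, the grant could be expressed as the property below, assuming the key is named hue-key and that the hue and oozie groups share their users' names:

```python
principals = ["hue", "oozie"]  # used as both user names and group names

hue_acl = {
    "name": "key.acl.hue-key.DECRYPT_EEK",
    # Hadoop ACL string: users, a space, then groups.
    "value": ",".join(principals) + " " + ",".join(principals),
}

print(f"<property><name>{hue_acl['name']}</name>"
      f"<value>{hue_acl['value']}</value></property>")
```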

Impala

Recommendations

If HDFS encryption is enabled, configure Impala to encrypt data spilled to local disk.

In releases earlier than Impala 2.2.0 / CDH 5.4.0, Impala does not support the LOAD DATA statement when the source and destination are in different
encryption zones. If you are running an affected release and need to use LOAD DATA with HDFS encryption enabled, copy the data to the table's encryption zone before
running the statement.

Use Cloudera Navigator to lock down the local directory where Impala UDFs are copied during execution. By default, Impala copies UDFs into /tmp, and you
can configure this location through the --local_library_dir startup flag for the impalad daemon.

Limit the rename operations for internal tables once encryption zones are set up. Impala cannot do an ALTER TABLE RENAME operation to move an internal
table from one database to another, if the root directories for those databases are in different encryption zones. If the encryption zone covers a table directory but not the parent directory
associated with the database, Impala cannot do an ALTER TABLE RENAME operation to rename an internal table, even within the same database.

Avoid structuring partitioned tables where different partitions reside in different encryption zones, or where any partitions reside in an encryption zone that is different from the root
directory for the table. Impala cannot do an INSERT operation into any partition that is not in the same encryption zone as the root directory of the overall table.

If the data files for a table or partition are in a different encryption zone than the HDFS trashcan, use the PURGE keyword at the end of the DROP TABLE or ALTER TABLE DROP PARTITION statement to delete the HDFS data files immediately. Otherwise, the data files are left behind if they
cannot be moved to the trashcan because of differing encryption zones. This syntax is available in Impala 2.3 / CDH 5.5 and higher.
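The recommendation about partition placement lends itself to a simple pre-flight check. The sketch below takes a list of encryption zone roots (for example, gathered by hand from hdfs crypto -listZones) and reports partitions whose zone differs from that of the table root. All paths and function names are hypothetical.

```python
import posixpath

def zone_of(path, zone_roots):
    """Return the zone root that contains path, or None if it is unencrypted."""
    path = posixpath.normpath(path)
    matches = [z for z in zone_roots
               if path == z or path.startswith(z.rstrip("/") + "/")]
    return max(matches, key=len, default=None)

def partitions_outside_table_zone(table_root, partition_dirs, zone_roots):
    """List partitions that are not in the same zone as the table root."""
    table_zone = zone_of(table_root, zone_roots)
    return [p for p in partition_dirs if zone_of(p, zone_roots) != table_zone]

zones = ["/warehouse/sales", "/archive"]
partitions = ["/warehouse/sales/year=2015", "/archive/sales/year=2014"]
print(partitions_outside_table_zone("/warehouse/sales", partitions, zones))
```

Any partition the check reports is one that Impala could not INSERT into, according to the restriction described above.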

Steps

Start every impalad process with the --disk_spill_encryption=true flag set. This encrypts all spilled data using
AES-256-CFB. Set this flag by selecting the Disk Spill Encryption checkbox in the Impala configuration (Impala service > Configuration > Category > Security).

Important: Impala does not selectively encrypt data based on whether the source data is already encrypted in HDFS. This results in at most 15
percent performance degradation when data is spilled.

KMS ACL Configuration for Impala

MapReduce and YARN

MapReduce v1

Recommendations

MRv1 stores both history and logs on local disks by default. Even if you do configure history to be stored on HDFS, the files are not renamed. Hence, no special configuration is
required.

MapReduce v2 (YARN)

Recommendations

Make /user/history a single encryption zone, because history files are moved between the intermediate and done directories, and HDFS encryption does not allow moving encrypted files across encryption zones. When you create the encryption zone, name the key mapred-key to take advantage of auto-generated KMS ACLs.

Steps

On a cluster with MRv2 (YARN) installed, create the /user/history directory and make that an encryption zone.

If /user/history already exists and is not empty:

Create an empty /user/history-tmp directory.

Make /user/history-tmp an encryption zone.

DistCp all data from /user/history into /user/history-tmp.

Remove /user/history and rename /user/history-tmp to /user/history.

KMS ACL Configuration for MapReduce

In the KMS ACLs, grant DECRYPT_EEK permission for the MapReduce key to the
mapred and yarn users and the hadoop group:
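As a sketch, assuming the key is named mapred-key as recommended above, the grant might be rendered like this:

```python
def decrypt_eek_property(key_name, users, groups):
    """Return (name, value) for a DECRYPT_EEK key ACL in 'users groups' form."""
    return (f"key.acl.{key_name}.DECRYPT_EEK",
            ",".join(users) + " " + ",".join(groups))

name, value = decrypt_eek_property("mapred-key", ["mapred", "yarn"], ["hadoop"])
print(f"<property><name>{name}</name><value>{value}</value></property>")
```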

Spark

Recommendations

By default, application event logs are stored at /user/spark/applicationHistory, which can be made into an encryption zone.

Spark also optionally caches its JAR file at /user/spark/share/lib (by default), but encrypting this directory is not required.

Spark does not encrypt shuffle data. To protect it, configure the Spark local directory, spark.local.dir (in Standalone mode), to reside on an encrypted disk.
For YARN mode, make the corresponding YARN configuration changes.

KMS ACL Configuration for Spark

In the KMS ACLs, grant DECRYPT_EEK permission for the Spark key to the
spark user and any groups that can submit Spark jobs:
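The text does not name the Spark key or the submitting groups, so the sketch below uses the hypothetical names spark-key and spark-users purely as placeholders. Substitute the key and groups used in your cluster.

```python
SPARK_KEY = "spark-key"          # hypothetical key name
SUBMIT_GROUPS = ["spark-users"]  # hypothetical group of job submitters

spark_acl = {
    "name": f"key.acl.{SPARK_KEY}.DECRYPT_EEK",
    # Hadoop ACL string: the spark user, a space, then the groups.
    "value": "spark " + ",".join(SUBMIT_GROUPS),
}

print(f"<property><name>{spark_acl['name']}</name>"
      f"<value>{spark_acl['value']}</value></property>")
```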
