Project Rhino Goal: At-Rest Encryption for Apache Hadoop

An update on community efforts to bring at-rest encryption to HDFS — a major theme of Project Rhino.

Encryption is a key requirement for many privacy and security-sensitive industries, including healthcare (HIPAA regulations), card payments (PCI DSS regulations), and the US government (FISMA regulations).

Although network encryption has been provided in the Apache Hadoop platform for some time (since Hadoop 2.02-alpha/CDH 4.1), at-rest encryption, the encryption of data stored on persistent storage such as disk, is not. To meet that requirement in the platform, Cloudera and Intel are working with the rest of the Hadoop community under the umbrella of Project Rhino — an effort to bring a comprehensive security framework for data protection to Hadoop, which also now includes Apache Sentry (incubating) — to implement at-rest encryption for HDFS (HDFS-6134 and HADOOP-10150).

With this work, encryption and decryption will be transparent: existing Hadoop applications will be able to work with encrypted data without modifications. Data will be encrypted with configurable ciphers and cipher modes, allowing users to choose an appropriate level of confidentiality. Because encryption is being implemented directly in HDFS, the full spectrum of HDFS access methods and file formats will be supported.

Design Principles

At-rest encryption for HDFS centers around the concept of an encryption zone, which is a directory where all the contents are encrypted with a unique encryption key. When accessing data within an encryption zone, HDFS clients will transparently fetch the encryption key from the cluster key server to encrypt and decrypt data. Encryption keys can also be rolled in the event of compromise or because of corporate security policy. Subsequent files will be encrypted with the new encryption key, while existing files can be rewritten by the user to be encrypted with the new key.

Access to encrypted data is dependent on two things: appropriate HDFS-level filesystem permissions (i.e. Unix-style permissions and access control lists) as well as appropriate permissions on the key server for the encryption key. This two-fold scheme has a number of nice properties. Through the use of HDFS ACLs, users and administrators can granularly control data access. However, because key server permissions are also required, compromising HDFS is insufficient to gain access to unencrypted data. Importantly, this means that even HDFS administrators do not have full access to unencrypted data on the cluster.

A critical part of this vision is the cluster key server (for example, Cloudera Navigator Key Trustee). Since we foresee customer deployments with hundreds to thousands of encryption zones, an enterprise-grade key server needs to be secure, reliable, and easy-to-use. In highly-regulated industries, robust key management is as valid a requirement as at-rest encryption itself.

An equally important part of this vision is support for hardware-accelerated encryption. Encryption is a business-critical need, but it can be unusable if it carries a significant performance penalty. This requirement is addressed by other Rhino-related contributions from Intel (HDFS-10693), which will provide highly optimized libraries using AES-NI instructions available on Intel processors. By using these Intel libraries, HDFS will be able to provide access to encrypted data at hardware speeds with minimal performance impact.

Conclusion

At-rest encryption has been a major objective for Rhino since the effort’s inception. With the new Cloudera-Intel relationship — and the addition of even more engineering resources at our new Center for Security Excellence — we expect at-rest encryption to be among the first committed features to arise from Rhino, and among the first such features to ship inside Cloudera Enterprise 5.x.

Andrew Wang and Charles Lamb are Software Engineers on the HDFS team at Cloudera. Andrew is also a Hadoop PMC member.

To learn more about comprehensive, compliance-ready security for the enterprise, register for the upcoming webinar, “Compliance-Ready Hadoop,” on June 19 at 10am PT.

Filed under:

Hi Ramki, currently we’re planning on using AES-CTR for file encryption. Since encryption is happening transparently at the HDFS level, where all we see is a bytestream, we can’t do anything special with regard to data types or schema. Hope this clears things up.

Very interesting project, since the homomorphic encryption is still very costly and the best implementation is holded by IBM under legal patent.
Is there any intention to implement any kind of proof of retrievability(PoR) for the whole data stored in HDFS ?