Sunday, 18 January 2015

Reference: Cloudera. Our thanks to Montrial Harrell, Enterprise Architect for the State of Indiana, for the guest post below.

Recently, the State of Indiana has begun to focus on how enterprise data management can help our state’s government operate more efficiently and improve the lives of our residents. With that goal in mind, I began this journey just like everyone else I know: with an interest in learning more about Apache Hadoop.

I started learning Hadoop via a virtual server onto which I installed CDH and worked through a few online tutorials. Then, I learned a little more by reading blogs and documentation, and by trial and error.

Eventually, I decided to experiment with a classic Hadoop use case: extract, load, and transform (ELT). In most cases, ELT lets you offload resource-intensive data transforms to Hadoop's MPP-like processing, thereby cutting resource usage on the current ETL server at a relatively low cost. This functionality is delivered in part by the Hadoop ecosystem project Apache Sqoop.

Preparing for Sqoop

In preparing to use Sqoop, I found that there are two versions inside CDH. The classic version, called Sqoop 1, has a command-line interface (CLI), and its JDBC drivers are stored in /var/lib/sqoop. If you are going to run Apache Oozie jobs that reference Sqoop, you also need to store the driver in /user/oozie/share/lib/sqoop.
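As a rough sketch of how Sqoop 1's CLI drives this kind of ELT offload, a table import might look like the following. (The connection string, credentials, table name, and target directory are illustrative placeholders, not details from my environment.)

```shell
# Hypothetical Sqoop 1 import: pull a relational table into HDFS,
# splitting the work across four parallel map tasks.
# Requires the matching JDBC driver jar in /var/lib/sqoop.
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=StateData" \
  --username etl_user -P \
  --table residents \
  --target-dir /user/etl/residents \
  --num-mappers 4
```

Once the raw data lands in HDFS, the transform step can run on the cluster (with Hive, Pig, or MapReduce) instead of on the ETL server.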

Support for transparent, end-to-end encryption in HDFS is now available and production-ready (and shipping inside CDH 5.3 and later). Here’s how it works.

Apache Hadoop 2.6 adds support for transparent encryption to HDFS. Once configured, data read from and written to specified HDFS directories will be transparently encrypted and decrypted, without requiring any changes to user application code. This encryption is also end-to-end, meaning that data can only be encrypted and decrypted by the client. HDFS itself never handles unencrypted data or data encryption keys. All these characteristics improve security, and HDFS encryption can be an important part of an organization-wide data protection story.
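To make the workflow concrete, here is a minimal sketch of setting up an encryption zone, assuming a Hadoop KMS is already configured. The key name and directory path are illustrative.

```shell
# Create an encryption key in the KMS (key name is a placeholder).
hadoop key create mykey

# Create an empty directory and mark it as an encryption zone
# backed by that key. Must be run as the HDFS superuser.
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName mykey -path /secure

# Files written into the zone are encrypted and decrypted
# transparently on the client side; no application changes needed.
hdfs dfs -put localfile /secure/
```

Reads and writes under /secure behave like any other HDFS path from the application's point of view; only clients with access to the key material through the KMS can see the plaintext.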

Cloudera’s HDFS and Cloudera Navigator Key Trustee (formerly Gazzang zTrustee) engineering teams did this work under HDFS-6134 in collaboration with engineers at Intel as an extension of earlier Project Rhino work. In this post, we’ll explain how it works, and how to use it.

Background

In a traditional data management software/hardware stack, encryption can be performed at different layers, each with different pros and cons:

Application-level encryption is the most secure and flexible approach. The application has ultimate control over what is encrypted and can precisely reflect the requirements of the user. However, writing applications to handle encryption is difficult, and this approach only works for applications that support encryption, which may rule it out for many applications an organization already uses. If encryption isn't integrated into the application well, security can be compromised (keys or credentials can be exposed).

Database-level encryption is similar to application-level encryption. Most database vendors offer some form of it; however, database encryption often comes with performance trade-offs. For example, indexes cannot be encrypted.

Filesystem-level encryption offers high performance, application transparency, and is typically easy to deploy. However, it can’t model some application-level policies. For instance, multi-tenant applications might require per-user encryption. A database might require different encryption settings for each column stored within a single file.

Disk-level encryption is easy to deploy and fast but also quite inflexible. In practice, it protects only against physical theft.

HDFS transparent encryption sits between database- and filesystem-level encryption in this stack. This approach has multiple benefits: