New in CDH 5.3: Apache Sentry Integration with HDFS

Starting in CDH 5.3, Apache Sentry integration with HDFS saves admins a lot of work by centralizing access control permissions across components that utilize HDFS.

It’s been more than a year and a half since a couple of my colleagues here at Cloudera shipped the first version of Sentry (now Apache Sentry (incubating)). The project filled a huge security gap in the Apache Hadoop ecosystem by bringing truly secure and dependable fine-grained authorization, with out-of-the-box integration for Apache Hive. Since then the project has grown significantly, adding support for Impala and Search and the wonderful Hue app, to name a few significant additions.

In order to provide a truly secure and centralized authorization mechanism, Sentry deployments have historically been set up so that all of Hive’s data and metadata are accessible only by HiveServer2, with every other user locked out. This has been a pain point for Sqoop users, as Sqoop does not use the HiveServer2 interface. Hence users with a Sentry-secured Hive deployment were forced to split the import task into two steps: a simple HDFS import followed by manually loading the data into Hive.

With the inclusion of HDFS ACLs and the integration of Sentry into the Hive metastore in CDH 5.1, users were able to improve this situation and get the direct Hive import working again. However, this approach required manual administrator intervention to configure HDFS ACLs according to the Sentry configuration and needed a manual refresh to keep both systems in sync.

One of the major features in the recently released CDH 5.3 is Sentry integration with HDFS, which enables customers to easily share data between Hive, Impala, and all the other Hadoop components that interact with HDFS (MapReduce, Spark, Pig, Sqoop, and so on) while ensuring that user access permissions only need to be set once and are uniformly enforced.

The rest of this post focuses on the example of using Sqoop together with this Sentry feature. Sqoop data can now be imported into Hive without any additional administrator intervention. Because Sentry policies (which tables a user can select from and which tables they can insert into) are exposed directly in HDFS, Sqoop re-uses the same policies that have been configured via GRANT/REVOKE statements or the Hue Sentry App and imports data into Hive without any trouble.

Configuration

In order for Sqoop to seamlessly import into a Sentry-secured Hive instance, the Hadoop administrator needs to follow a few configuration steps to enable all the necessary features. First, your cluster needs to use the Sentry Service as the backend for storing authorization metadata rather than relying on the older policy files.

If you are already using the Sentry Service and GRANT/REVOKE statements, you can jump directly to step 3).

1) Make sure that the Sentry service is running on your cluster. It should appear in the service list.

2) Make sure that Hive is configured to use this service as the backend for Sentry metadata.

3) Finally, enable HDFS integration with Sentry.

Example Sqoop Import

Let’s assume that we have a user jarcec who needs to import data into a Hive database named default. User jarcec is part of a group that is also called jarcec. In real life the group name does not have to match the username.

With an unsecured Hive installation, the Hadoop administrator would have to jump in and grant write privileges to user jarcec for the directory /user/hive/warehouse or one of its subdirectories. With Sentry and HDFS integration, the Hadoop administrator no longer needs to jump in. Instead, Sqoop will reuse the same authorization policies that have been configured through Hive SQL or via the Sentry Hue Application. Let’s assume that user bc is jarcec’s manager and already has the right to grant privileges in the default database.
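Before a role can be granted to anyone, it has to exist. Here is a minimal sketch of that first step, assuming bc creates the role with a CREATE ROLE statement sent through beeline. The JDBC URL is a hypothetical placeholder, authentication options (Kerberos principal, LDAP credentials) are left out because they depend on the cluster, and the run helper is only an illustrative wrapper that prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical HiveServer2 URL; substitute your own host and auth parameters.
HS2_URL="jdbc:hive2://hs2.example.com:10000/default"

# Illustrative helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create the role that will later be granted to the group jarcec.
run beeline -u "$HS2_URL" -e "CREATE ROLE jarcec_role;"
```

Running the script with DRY_RUN=0 sends the statement to HiveServer2 instead of printing it.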

Once the role jarcec_role exists, bc grants it to jarcec’s group, jarcec:

1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> GRANT ROLE jarcec_role TO GROUP jarcec;
No rows affected (0.651 seconds)

Finally, bc can grant access to the database default (or any other) to the role jarcec_role:

1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> GRANT ALL ON DATABASE default TO ROLE jarcec_role;
No rows affected (0.16 seconds)

By executing the steps above, bc has given user jarcec the privilege to perform any action (insert or select) on all objects inside the database default. That includes the ability to create new tables, insert data, or simply query existing tables. With those privileges, user jarcec can run his usual Sqoop import command.
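A typical import of this kind might look like the sketch below. The JDBC URL, database user, and table name are hypothetical placeholders rather than values from this post. The --hive-import option is the standard Sqoop switch that loads the imported data into Hive, and -P prompts for the database password. The run helper is an illustrative wrapper that only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical source database details; none of these come from the post.
JDBC_URL="jdbc:mysql://mysql.example.com/sales"
DB_USER="sqoop_user"
TABLE="customers"

# Illustrative helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Import one table and load it straight into Hive (database "default").
run sqoop import \
  --connect "$JDBC_URL" \
  --username "$DB_USER" \
  -P \
  --table "$TABLE" \
  --hive-import
```

No extra permission-related options are passed: with the Sentry and HDFS integration enabled, the grants issued by bc are what allow jarcec to write into the warehouse directory.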

During the import, Sqoop may log messages about inheriting HDFS permissions. As there is no need to inherit HDFS permissions when Sentry is enabled in HDFS, you can safely ignore such messages.
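Because the Sentry grants are surfaced to HDFS as ACLs, one way to check that the integration is active is to list the ACLs on the warehouse directory. Here is a minimal sketch, assuming the warehouse lives at /user/hive/warehouse as mentioned earlier and that this path is covered by the Sentry-managed prefixes. The run helper is an illustrative wrapper that only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Warehouse location as mentioned in the post; adjust if yours differs.
WAREHOUSE_DIR="/user/hive/warehouse"

# Illustrative helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Show the ACL entries HDFS reports for the warehouse directory.
run hdfs dfs -getfacl "$WAREHOUSE_DIR"
```

After the grants above, you would expect the listing to include an entry for the group jarcec.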

That’s All, Folks!

The latest CDH release, version 5.3.0, brings a bunch of new features. All Sqoop users should particularly check out the Sentry integration with HDFS, as it enables simple and straightforward import into Sentry-secured Hive deployments without the need to manually configure HDFS permissions. The same SQL interface that is used to grant access to various databases and tables is used to determine who can import (or export) data into Hive!

Jarek Jarcec Cecho is an Engineering Manager at Cloudera, responsible for the Data Ingest team (see team blog). Jarcec is also a committer/PMC member for Sqoop, Apache Flume, Apache MRUnit, Apache DataFu (incubating), and Apache Sentry (incubating). He is also the co-author of the Apache Sqoop Cookbook.

3 responses on “New in CDH 5.3: Apache Sentry Integration with HDFS”

Very useful!
I am particularly interested in whether Sentry, from CDH 5.3 onwards, provides any support for HDFS file-level authorization (typically what one would implement using HDFS ACLs), e.g., can I use Sentry to allow only a named user group (defined in our enterprise-wide AD) to read specific files in a directory?

While applying this to our Hadoop cluster, I found a problem related to the hive-site.xml file.
In my case, the /etc/hive/conf/hive-site.xml file on a client server includes properties related to Sentry.
If I use the hive or sqoop command, it fails, reporting that it can’t find a Java class related to Sentry.
Probably the Hive CLI (and Sqoop) does not recognize these Sentry-related properties.

As a workaround, I issued the command “export HIVE_CONF_DIR=/etc/hive/conf/hive-site.xml.hivecli”, which points to the original version that does not use Sentry, before running the hive or sqoop command.