Technical details, ideas and news on data warehousing and big data from the Oracle Team

Thursday Jul 18, 2013

Introduction

Documentation and most discussions are quick to point out that HDFS provides OS-level permissions on files and directories.
However, there is less readily available information about how those
permissions affect access to data in HDFS through higher-level
abstractions such as Hive or Pig. To provide a bit of clarity, I decided
to run through the effects of permissions on different interactions with
HDFS.

The Setup

In this scenario, we have three users: oracle, dan, and not_dan. The
oracle user has captured some data in an HDFS directory. The directory
has 750 permissions: read/write/execute for oracle, read/execute for dan
(through the directory's group), and no access for not_dan. One of the
files in the directory has 700 permissions, meaning that only the oracle
user can read it. Each user will try to do the following tasks:

List the contents of the directory

Count the lines in a subset of files including the file with 700 permissions

Run a simple Hive query over the directory
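To make the octal modes in the setup concrete, here is a minimal sketch that expands a mode such as 750 into the rwx string that a file listing shows. The function name mode_to_rwx is my own and is not part of Hadoop; it only illustrates how each digit maps to the owner, group, and other classes.

```shell
# Hypothetical helper (not a Hadoop command): expand an octal mode such as
# 750 into the nine-character rwx string, one triplet per octal digit.
mode_to_rwx() {
  mode=$1
  out=""
  while [ -n "$mode" ]; do
    rest=${mode#?}            # everything after the first character
    digit=${mode%"$rest"}     # the first character on its own
    case $digit in
      0) out="${out}---" ;;
      1) out="${out}--x" ;;
      2) out="${out}-w-" ;;
      3) out="${out}-wx" ;;
      4) out="${out}r--" ;;
      5) out="${out}r-x" ;;
      6) out="${out}rw-" ;;
      7) out="${out}rwx" ;;
      *) echo "not an octal digit: $digit" >&2; return 1 ;;
    esac
    mode=$rest
  done
  printf '%s\n' "$out"
}

mode_to_rwx 750   # the shared directory: rwxr-x---
mode_to_rwx 700   # the restricted file:  rwx------
```

Read left to right, the three triplets are owner (oracle), group (the class through which dan gets access), and everyone else (not_dan).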

Listing Files

Each user issues the command

hadoop fs -ls /user/shared/moving_average|more

And what do they see?

[oracle@localhost ~]$ hadoop fs -ls /user/shared/moving_average|more

Found 564 items

Obviously, the oracle user can see all the files in its own directory.

Permissions on Hive

In this final test, the oracle user defines an external Hive table over
the shared directory. Each user issues a simple COUNT(*) query against
that table. Interestingly, the results are not the same as piping the
datastream to the shell.

The oracle user's query runs correctly, while both dan and not_dan's queries fail:

As dan

Job Submission failed with exception 'java.io.FileNotFoundException(File /user/shared/moving_average/FlumeData.1374082184056 does not exist)'

So, what's going on here? In each case, the query fails, but for
different reasons. In the case of not_dan, the query fails because the
user has no permissions on the directory. However, the query issued by
dan fails with a FileNotFoundException. Because dan does not have read
permissions on the file with 700 permissions, Hive cannot find all the
files necessary to build the underlying MapReduce job. Thus, the query
fails before being submitted to the JobTracker. The rule, then, becomes
simple: to issue a Hive query, a user must have read permissions on all
files read by the query. If a user has permissions on one set of
partition directories, but not another, they can issue queries against
the readable partitions, but not against the entire table.
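One way to apply this rule before running a query is to scan a directory listing for files that a given permission class cannot read. The sketch below is an assumption-laden illustration, not a Hadoop feature: the function name unreadable_files is hypothetical, the sample listing and its file and group names are made up, and it assumes the usual listing layout in which the first field is the permission string and the last field is the path. It looks only at the mode bits, so it ignores superuser rights and breaks on paths containing spaces.

```shell
# Hypothetical helper: read listing lines on stdin and print the path of every
# plain file whose permission string lacks the read bit for the chosen class.
unreadable_files() {
  case $1 in
    owner) pos=2 ;;   # permission string: type, then owner/group/other triplets
    group) pos=5 ;;
    other) pos=8 ;;
    *) echo "usage: unreadable_files owner|group|other" >&2; return 2 ;;
  esac
  # Only lines starting with "-" are plain files; the "Found N items" header
  # and directory entries are skipped.
  awk -v pos="$pos" '$1 ~ /^-/ && substr($1, pos, 1) != "r" { print $NF }'
}

# Made-up sample listing for illustration only (names are hypothetical).
sample_listing='Found 2 items
-rwxr-x---   1 oracle shared 1024 2013-07-17 10:00 /user/shared/moving_average/a.txt
-rwx------   1 oracle shared 2048 2013-07-17 10:05 /user/shared/moving_average/b.txt'

# Files that a group member such as dan could not read:
printf '%s\n' "$sample_listing" | unreadable_files group
```

Against a live cluster, the same function could be fed from hadoop fs -ls /user/shared/moving_average instead of the sample text. Any path it prints is a file that would keep a user in that class from querying the whole table.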

Summary

In a nutshell, the OS-level permissions of HDFS behave just as we would
expect in the shell. However, problems can arise when tools like Hive or
Pig try to construct MapReduce jobs. As a best practice, permission
structures should be tested against the tools which will access the
data. This ensures that users can read what they are allowed to, in the manner that they need to.