Enabling gphdfs Authentication with a Kerberos-secured Hadoop Cluster

Using external tables and the gphdfs protocol, Greenplum Database can read
files from and write files to a Hadoop File System (HDFS). Greenplum segments read and write
files in parallel from HDFS for fast performance.

When a Hadoop cluster is secured with Kerberos ("Kerberized"), Greenplum Database must be
configured to allow the Greenplum Database gpadmin role, which owns external tables in HDFS,
to authenticate through Kerberos. This topic provides the steps for configuring Greenplum
Database to work with a Kerberized HDFS, including verifying and troubleshooting the
configuration.

Configuring the Greenplum Cluster

The hosts in the Greenplum cluster must have a Java JRE, Hadoop client files, and Kerberos
clients installed.

Follow these steps to prepare the Greenplum cluster.

Install a Java 1.6 or later JRE on all Greenplum cluster hosts.

Match the JRE version to the version the Hadoop cluster is running. You can find the JRE
version by running java -version on a Hadoop node.

(Optional) Confirm that Java Cryptography Extension (JCE) is present.

The
default location of the JCE libraries is
JAVA_HOME/lib/security. If a JDK is installed,
the directory is JAVA_HOME/jre/lib/security. The
files local_policy.jar and
US_export_policy.jar should be present in the JCE
directory.
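A quick way to confirm the policy jars are present is to list them directly (use the
jre/lib/security path instead if a full JDK is installed):

```shell
# List the JCE policy jars; an error means they are missing
ls -l "${JAVA_HOME}/lib/security/"local_policy.jar \
      "${JAVA_HOME}/lib/security/"US_export_policy.jar
```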

The Greenplum cluster and the Kerberos server should, preferably, use
the same version of the JCE libraries. You can copy the JCE files from the Kerberos
server to the Greenplum cluster, if needed.

Set the JAVA_HOME environment variable to the location of the JRE in
the .bashrc or .bash_profile file for the
gpadmin account. For
example:

export JAVA_HOME=/usr/java/default

Source the .bashrc or .bash_profile file to
apply the change to your environment. For
example:

$ source ~/.bashrc

Install the Kerberos client utilities on all cluster hosts. Ensure the libraries match
the version on the KDC server before you install them.

For example, the following
command installs the Kerberos client files on Red Hat or CentOS
Linux:

$ sudo yum install krb5-libs krb5-workstation

Use the
kinit command to confirm the Kerberos client is installed and
correctly configured.
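For example, a password-based login for the gpadmin principal (the realm name
LOCAL.DOMAIN used here is illustrative):

```shell
# Obtain a ticket-granting ticket for the gpadmin principal
# (prompts for the principal's password)
kinit gpadmin@LOCAL.DOMAIN

# List cached tickets to confirm authentication succeeded
klist
```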

Install Hadoop client files on all hosts in the Greenplum cluster. Refer to the
documentation for your Hadoop distribution for instructions.

Set the Greenplum Database server configuration parameters for Hadoop. The
gp_hadoop_target_version parameter specifies the version of the Hadoop
cluster. See the Greenplum Database Release Notes for the target version value that
corresponds to your Hadoop distribution. The gp_hadoop_home parameter
specifies the Hadoop installation
directory.
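For example, using the gpconfig utility (the target version value 'hdp2' and the Hadoop
installation path shown are examples; check the Release Notes for the value that matches
your distribution):

```shell
# Set the Hadoop target version and installation directory
# (values shown are examples)
gpconfig -c gp_hadoop_target_version -v "'hdp2'"
gpconfig -c gp_hadoop_home -v "'/usr/lib/hadoop'"

# Reload the server configuration so the new settings take effect
gpstop -u
```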

Create
a principal for each Greenplum cluster host. Use the same principal name and realm,
substituting the fully-qualified domain name for each host.
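For example, using kadmin.local on the KDC to create one gphdfs service principal per
Greenplum host (the host names and realm are illustrative):

```shell
# Create a service principal for each host, varying only the
# fully-qualified host name
kadmin.local -q "addprinc -randkey gphdfs/mdw.example.com@LOCAL.DOMAIN"
kadmin.local -q "addprinc -randkey gphdfs/sdw1.example.com@LOCAL.DOMAIN"
kadmin.local -q "addprinc -randkey gphdfs/sdw2.example.com@LOCAL.DOMAIN"
```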

Generate a keytab file for each principal that you created (gpadmin and
each gphdfs service principal). You can store the keytab files in any
convenient location (this example uses the directory
/etc/security/keytabs). You will deploy the service principal
keytab files to their respective Greenplum host machines in a later
step.
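For example, the kadmin.local xst (ktadd) command exports each principal's keys to its
own keytab file under /etc/security/keytabs (principal and file names are illustrative):

```shell
# Export each principal's keys to a separate keytab file
kadmin.local -q "xst -k /etc/security/keytabs/gphdfs.service.keytab gpadmin@LOCAL.DOMAIN"
kadmin.local -q "xst -k /etc/security/keytabs/mdw.service.keytab gphdfs/mdw.example.com@LOCAL.DOMAIN"
kadmin.local -q "xst -k /etc/security/keytabs/sdw1.service.keytab gphdfs/sdw1.example.com@LOCAL.DOMAIN"
```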

Edit the hdfs-site.xml client configuration file on all cluster
hosts. Set properties to identify the NameNode Kerberos principals, the location of the
Kerberos keytab file, and the principal it is for:

dfs.namenode.kerberos.principal - the Kerberos principal name the
gphdfs protocol will use for the NameNode, for example
gpadmin@LOCAL.DOMAIN.

dfs.namenode.https.principal - the Kerberos principal name the
gphdfs protocol will use for the NameNode's secure HTTP server, for example
gpadmin@LOCAL.DOMAIN.

com.emc.greenplum.gpdb.hdfsconnector.security.user.keytab.file -
the path to the keytab file for the Kerberos HDFS service, for example
/home/gpadmin/mdw.service.keytab.

com.emc.greenplum.gpdb.hdfsconnector.security.user.name - the
gphdfs service principal for the host, for example
gphdfs/mdw.example.com@LOCAL.DOMAIN.
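Putting the four properties together, the hdfs-site.xml additions might look like the
following fragment (the principal names and keytab path are the examples from above, not
required values):

```xml
<property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>gpadmin@LOCAL.DOMAIN</value>
</property>
<property>
    <name>dfs.namenode.https.principal</name>
    <value>gpadmin@LOCAL.DOMAIN</value>
</property>
<property>
    <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.keytab.file</name>
    <value>/home/gpadmin/mdw.service.keytab</value>
</property>
<property>
    <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.name</name>
    <value>gphdfs/mdw.example.com@LOCAL.DOMAIN</value>
</property>
```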

Troubleshooting HDFS with Kerberos

Forcing Classpaths

If you encounter "class not found" errors when executing SELECT
statements from gphdfs external tables, edit the
$GPHOME/lib/hadoop-env.sh file and add the following lines towards
the end of the file, before the JAVA_LIBRARY_PATH is set. Update the
script on all of the cluster
hosts.

if [ -d "/usr/hdp/current" ]; then
  for f in /usr/hdp/current/**/*.jar; do
    CLASSPATH=${CLASSPATH}:$f
  done
fi

Enabling Kerberos Client Debug Messages

To see debug messages from the Kerberos client, edit the
$GPHOME/lib/hadoop-env.sh client shell script on all cluster hosts
and add the Kerberos debug option to the HADOOP_OPTS variable.
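A minimal sketch; the JVM system property sun.security.krb5.debug enables the JRE's
Kerberos tracing:

```shell
# Append the Kerberos debug flag to any options already set
export HADOOP_OPTS="${HADOOP_OPTS} -Dsun.security.krb5.debug=true"
```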

Adjusting JVM Process Memory on Segment Hosts

Each segment launches a JVM process when reading or writing an external table in HDFS. To
change the amount of memory allocated to each JVM process, configure the
GP_JAVA_OPT environment variable.

Edit the $GPHOME/lib/hadoop-env.sh client shell script on all
cluster hosts.

For example:

export GP_JAVA_OPT=-Xmx1000m

Verify Kerberos Security Settings

Review the /etc/krb5.conf file:

If AES256 encryption is not disabled, ensure that all cluster hosts have the JCE
Unlimited Strength Jurisdiction Policy Files installed.

Ensure all encryption types in the Kerberos keytab file match definitions in the
krb5.conf file.

cat /etc/krb5.conf | egrep supported_enctypes
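To see the encryption types actually present in a keytab, list its entries with klist and
compare them against the supported_enctypes setting (the keytab path is an example):

```shell
# Show the timestamp, key version, and encryption type of each keytab entry
klist -e -k -t /etc/security/keytabs/mdw.service.keytab
```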

Test Connectivity on an Individual Segment Host

Follow these steps to test that a single Greenplum Database host can read HDFS data. This
test method executes the Greenplum HDFSReader Java class at the
command-line, and can help to troubleshoot connectivity problems outside of the database.

Save a sample data file in HDFS.

hdfs dfs -put test1.txt hdfs://namenode:8020/tmp

On the segment host to be tested, create an environment script,
env.sh, like the
following:
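Since the script is not reproduced here, the following is a sketch of what such an
env.sh typically sets: JAVA_HOME, HADOOP_HOME, and a CLASSPATH that puts the Hadoop
client configuration directory first so the Kerberos settings in hdfs-site.xml are
picked up. All paths are assumptions; adjust them to your installation.

```shell
# env.sh -- sketch of an environment for running the HDFSReader by hand
# (all paths are assumptions; adjust to your installation)
export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/usr/lib/hadoop
export GPHOME=/usr/local/greenplum-db

# Hadoop configuration must come first on the classpath so the
# Kerberos settings in hdfs-site.xml are found
export CLASSPATH=/etc/hadoop/conf:${HADOOP_HOME}/lib/*:${GPHOME}/lib/hadoop/*
```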