Kerberos & Hadoop: Securing Big Data (part I)

When I began using Hadoop with Kerberos I felt as if I were in the middle of the ocean. I found plenty of information about Kerberos itself, but it was very difficult to find anything about how to use it with Hadoop, why to use it, and how to configure it to work with Hadoop.

This trilogy of posts will give you a global vision of Kerberos in Hadoop and the keys to integrating Kerberos into your Hadoop cluster.

Why do we have to secure our environments?

The internet is a dangerous place: anything with connectivity is at risk, and today so many things are connected to the internet that the danger is even greater. This is a disturbing picture, but it is necessary to be aware of it and look for a solution. We need to think about keeping our information safe and protecting our data from different sources of danger.

1) Externally: we must prevent external attacks and external access to our information. We can use firewalls to address this issue.

2) Internally: we also need to protect the information from inside the organization. Not all users or applications should have access to all data, and we need to know who can connect to the information and how, especially in a distributed system. In this kind of environment we face three threats:

A user could claim to be another user.

A host could claim to be another host (by changing its IP address).

A user could eavesdrop on a communication and perform a “replay attack” to gain access to an application or bring down a service.

To address these threats we can use the Kerberos authentication system. In the next sections we are going to see how Kerberos gives us this security.

What is Kerberos?

Kerberos is an authentication service developed at MIT, based on the Needham-Schroeder protocol. It is designed to provide strong authentication for client/server applications by using secret-key cryptography. The important thing is that Kerberos ensures that users are who they say they are, and that servers are who they say they are.

Kerberos basics

This authentication system is quite complex and has a lot of parameters. You can find all the details on many internet sites or in the official documentation (http://www.kerberos.org/). Here we only want to give you the key ideas you will need to configure and use it.

In a Kerberos authentication system we are going to have:

“Kerberos Server”: this is the server (also known as the KDC, Key Distribution Center) where you register the users. It holds a database of users and passwords, and this data is stored encrypted.

“Kerberos Clients”: these nodes are able to communicate with the Kerberos Server for authentication tasks. You can execute the Kerberos commands (kinit, kdestroy, klist) on these servers. In the next post we are going to explain these commands.

“Kerberos Ticket”: this element is very important. When you authenticate your user against the Kerberos Server, it gives you a ticket that grants access to the data for a limited lifetime (commonly 24 hours). To get this ticket you need to execute the Kerberos client commands.
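To make the ticket lifecycle concrete, here is a sketch of the three client commands mentioned above. It assumes a reachable KDC; the user alice and the realm EXAMPLE.COM are placeholder values, not from the original post.

```shell
# Request a ticket from the Kerberos Server (prompts for the password).
# "alice" and "EXAMPLE.COM" are placeholders for your own user and realm.
kinit alice@EXAMPLE.COM

# List the tickets currently held in the credential cache: you should see
# a ticket-granting ticket (krbtgt) entry together with its expiry time.
klist

# Destroy the cached tickets when you are done working.
kdestroy
```

These commands only work against a real KDC, so treat the listing as an operational transcript rather than a runnable script.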

Hadoop environment

When we have a Hadoop system, we store a lot of information for many purposes. We need to be sure that this information is safe and that only the users and servers with permissions can access the data.

In Hadoop environments we have a very big ecosystem for exploiting the data, so the applications should be integrated with Kerberos. We should authenticate whoever (user, server or application) wants to access the data. How can we achieve this goal?

The application should be configured to use Kerberos.

The servers should be configured as Kerberos clients.

The users should belong to the Kerberos database.
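As an illustration of the first two points, enabling Kerberos in Hadoop itself comes down to switching the cluster's authentication mode in core-site.xml. This is a minimal sketch: the /tmp path is only for demonstration (on a real cluster the file lives in $HADOOP_CONF_DIR), and a real deployment also needs keytabs and per-service principals.

```shell
# Sketch: the core-site.xml change that switches Hadoop from the default
# "simple" authentication to Kerberos. The /tmp directory is illustrative.
CONF_DIR=/tmp/hadoop-conf-demo
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- "simple" (the default) trusts whatever username the client sends;
       "kerberos" requires every caller to hold a valid Kerberos ticket. -->
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <!-- Also turn on service-level authorization checks. -->
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
EOF
echo "wrote $CONF_DIR/core-site.xml"
```

The two property names are the standard Hadoop security settings; everything around them (paths, the demo directory) is just scaffolding for the example.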

At a high level, when a user wants to exploit the data (for example, using Impala):

The user authenticates against the “Kerberos Server”.

The “Kerberos Server” gives the user a “Kerberos Ticket”.

When you connect to the application (e.g. Impala), the application (configured to use our Kerberos system) uses your “Kerberos Ticket” to try to get access to the cluster. If any of the pieces (server, user, application) is not correctly set up, you can’t access the data.
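The steps above, using the impala-shell client, could look like the following transcript. It assumes a Kerberized cluster; the user, realm and hostname are placeholder values.

```shell
# Step 1: authenticate against the Kerberos Server and obtain a ticket.
# "alice" and "EXAMPLE.COM" are placeholders for your user and realm.
kinit alice@EXAMPLE.COM

# Steps 2-3: connect to Impala with the -k flag, which tells impala-shell
# to authenticate using the Kerberos ticket in your credential cache.
# "impalad-host.example.com" is a placeholder for one of your Impala daemons.
impala-shell -k -i impalad-host.example.com:21000
```

Without a valid ticket in the cache (or with a host or service that is not registered in Kerberos), the connection in the second command is refused, which is exactly the behavior described above.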

Conclusions

Beyond authentication, we should also protect the information within our organization and between work groups. It is likely that we don’t want Human Resources accessing the same information as the finance team. To manage these accesses easily we can use LDAP. If we combine Kerberos and LDAP in our Hadoop cluster we get a protected and robust environment and we can avoid a lot of headaches.

In the next posts we are going to explain how to use Kerberos and how to configure it with LDAP in a Hadoop cluster.