Hadoop and LDAP, as seen through Venetian blinds

My wife recently asked me to mount new Venetian blinds in the kids' bathroom. I thought that I'd be
done in five minutes, but two hours later I still hadn't drilled a single hole.

It is not lack of experience. We have exactly the same type of blinds in four other rooms and they
were all mounted by me. What happened?

As I was approaching the window to see where to drill the holes, I couldn't help but notice that the
blinds were approximately 4 to 5 mm wider than the opening. That meant cutting each slat
individually (there are about 40 slats).

If you have any familiarity with manual work or with software development, you know that anything
can be done quickly if there is no custom work to do. Once you need modifications on top of a
library, or need to slightly modify Venetian blinds, the problems begin. And five minutes ain't
nearly enough!

I have a saw to cut iron, but I didn't have anything for the slats. I turned to my neighbour who
was very helpful and gave me a good saw.

That meant, however, that all the slats had to be removed individually. If you are unfamiliar with
how Venetian blinds are made, this image should help you:

Basically, all slats are held together by three small ropes. The three ropes have to be untied. The
slats can then be removed, cut, and put back in place. The ropes then have to be threaded back in.

This takes a long time. But I wanted to get it done, so I went into the garden with my son, and I
started.

As I was working, one of the neighbours came over to ask if I knew why water was coming out of the
wall of one of the nearby houses[1]. Of course I didn't know, but we went and looked through the
windows: water was dripping from the first to the ground floor, and apparently overflowing to
the outer wall as well.

After 15 minutes the police arrived[2] to determine whether the firefighters had to be called. Five
minutes later the firefighters arrived. They got into the house and found a weed plantation whose
irrigation system had broken down, flooding the whole house.

The whole neighbourhood was in the street, watching what was happening. The firefighters were kind
enough to host all my kids on their truck. Then the real police, with a real car, arrived, and they
were also kind enough to host all my kids in their car.

Meanwhile, only half of my slats were cut, and the other half was waiting.

Long story short, because of this distraction, I needed 3 hours just to cut and re-assemble the
slats.

Mounting them after resizing took five minutes.

What is most peculiar, however, is that I had the same type of experience at work that day, trying
to connect Hadoop and LDAP.

When Kerberos is not used, Hadoop relies on simple security, i.e. it believes you are who you say
you are.

However, to determine which groups a user belongs to, Hadoop by default does a lookup on the
NameNode, using the accounts known to its operating system. If the user does not exist on the
NameNode, the lookup will turn up empty. This means that there is no way to know which groups
someone belongs to.
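To make the "turns up empty" part concrete, here is a small Python analogy of an operating-system group lookup. It is only a sketch of the idea: Hadoop's own default mapping is implemented in Java, and the function name os_groups is made up for this illustration.

```python
import grp
import os
import pwd


def os_groups(username: str) -> list[str]:
    """Return the local OS groups of `username`, or [] if the account is unknown.

    Hypothetical helper: an analogy of a host-local group lookup,
    not Hadoop code.
    """
    try:
        account = pwd.getpwnam(username)
    except KeyError:
        # The account does not exist on this host, so no groups can be found.
        return []

    names = []
    for gid in os.getgrouplist(username, account.pw_gid):
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A gid without a group entry: fall back to the number itself.
            names.append(str(gid))
    return names


# An account that only exists on your laptop, and not on the host doing
# the lookup, resolves to an empty list there.
print(os_groups("no_such_user_4f9c2a"))
```

Run on the host that does the lookup, any user that is not defined there gets an empty list, which is exactly the problem described above.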

Enter LdapGroupsMapping

To fix this, Hadoop provides LdapGroupsMapping to look up a user inside an LDAP directory. I
thought I would give it a crack. How hard can it be?

If you clicked on the previous link, you probably thought that the piece of documentation isn't
enough to really get started. Luckily we have search engines these days, so I stumbled upon a page
by Hortonworks "explaining" how to do it.

I put "explaining" in quotes because you don't get much wiser by looking at it (there are also a
couple of typos). Let's start with the easy part:

hadoop.security.group.mapping should be org.apache.hadoop.security.LdapGroupsMapping. Nothing
to change here;

hadoop.security.group.mapping.ldap.bind.user should be the user that has read access to the LDAP
directory, usually the administrator. In my case it was
cn=Administrator,cn=users,dc=some,dc=domain,dc=com. You have to configure this to your
situation though;

hadoop.security.group.mapping.ldap.bind.password: no comment here;

hadoop.security.group.mapping.ldap.url needs to be the LDAP address. The form is
ldap://address:port. If your LDAP is listening on the standard port 389, you can omit it.
If your LDAP is behind SSL you need to use the ldaps protocol (it then assumes port 636) and
configure some extra SSL properties;

hadoop.security.group.mapping.ldap.base is almost straightforward as well. It is the common
suffix that all users of your LDAP will have. In my example it could be dc=some,dc=domain,dc=com,
which means that users under dc=other,dc=domain,dc=com will not be found.
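As a rough sketch, the easy properties above can be rendered into the property/name/value XML layout that Hadoop reads from core-site.xml. The property names are the ones listed above; the host ldap.example.com, the CHANGE_ME password and the helper to_core_site are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

PREFIX = "hadoop.security.group.mapping"

# Values are illustrative placeholders; adapt them to your own directory.
easy_properties = {
    PREFIX: "org.apache.hadoop.security.LdapGroupsMapping",
    PREFIX + ".ldap.bind.user": "cn=Administrator,cn=users,dc=some,dc=domain,dc=com",
    PREFIX + ".ldap.bind.password": "CHANGE_ME",
    PREFIX + ".ldap.url": "ldap://ldap.example.com:389",
    PREFIX + ".ldap.base": "dc=some,dc=domain,dc=com",
}


def to_core_site(properties: dict[str, str]) -> str:
    """Render a name -> value mapping as a Hadoop <configuration> document."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(configuration)  # pretty-printing, needs Python 3.9+
    return ET.tostring(configuration, encoding="unicode")


core_site_xml = to_core_site(easy_properties)
print(core_site_xml)
```

Generating the fragment from a dictionary is just a convenience to avoid typos in the long property names; writing the XML by hand works equally well.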

Now comes the most challenging part, namely:

hadoop.security.group.mapping.ldap.search.filter.user;

hadoop.security.group.mapping.ldap.search.filter.group;

hadoop.security.group.mapping.ldap.search.attr.member;

hadoop.security.group.mapping.ldap.search.attr.group.name.

To find out what to fill in, we need ldapsearch, a tool available in most Linux distributions.[3]

filter.user

To determine the filter.user property, you need to find the query that returns users.

You need to look into your LDAP to see how users are defined (as opposed to groups). In my case I
was using Amazon AD, so I just looked up how this is defined. I found that using
(&(objectCategory=user)(sAMAccountName=glanzani)) returned my user.
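For reference, this is roughly how such a query can be assembled for ldapsearch. The sketch only builds and prints the command line, so nothing is sent to a server. The flags are the usual OpenLDAP ones (-x simple bind, -H server URI, -D bind DN, -W prompt for the password, -b search base); the host name and the helper ldapsearch_command are assumptions made for the example.

```python
import shlex


def ldapsearch_command(url, bind_dn, base, search_filter, attributes=()):
    """Build the argv for an ldapsearch call (hypothetical helper)."""
    return [
        "ldapsearch",
        "-x",              # simple authentication instead of SASL
        "-H", url,         # LDAP server URI
        "-D", bind_dn,     # DN to bind as
        "-W",              # prompt for the bind password
        "-b", base,        # search base
        search_filter,
        *attributes,       # optional list of attributes to return
    ]


argv = ldapsearch_command(
    url="ldap://ldap.example.com",  # placeholder host
    bind_dn="cn=Administrator,cn=users,dc=some,dc=domain,dc=com",
    base="dc=some,dc=domain,dc=com",
    search_filter="(&(objectCategory=user)(sAMAccountName=glanzani))",
)

# shlex.join quotes the filter so that the shell leaves & and () alone.
print(shlex.join(argv))
```

The quoting matters: pasted unquoted into a shell, the parentheses and the ampersand in the filter would be interpreted by the shell instead of being passed to ldapsearch.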

If you got that far, then you can use the following value for the
hadoop.security.group.mapping.ldap.search.filter.user property:

(&(objectCategory=user)(sAMAccountName=*{0}*))

where {0} is the name of the user Hadoop is looking up. The wildcards are very important here:
if we are searching for the spark user, for example, the filter will return all users whose name
contains the spark string. I will show later why this is extremely important.
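A tiny sketch of what the placeholder and the wildcards do. The template string is written out on the assumption that it simply wraps {0} in wildcards, as described above. The account names are invented, and fnmatchcase is only a stand-in for LDAP substring matching (real directories may, for instance, ignore case).

```python
from fnmatch import fnmatchcase

# Assumed shape of the user filter: {0} wrapped in wildcards.
USER_FILTER_TEMPLATE = "(&(objectCategory=user)(sAMAccountName=*{0}*))"


def expand_user_filter(template: str, username: str) -> str:
    """Substitute the looked-up user name for the {0} placeholder."""
    return template.replace("{0}", username)


def substring_matches(pattern: str, accounts: list[str]) -> list[str]:
    """Emulate an LDAP substring match with shell-style globbing."""
    return [name for name in accounts if fnmatchcase(name, pattern)]


accounts = ["glanzani", "hdfs_user", "spark_user"]  # made-up directory content

print(expand_user_filter(USER_FILTER_TEMPLATE, "spark"))
print(substring_matches("*spark*", accounts))  # the wildcard version
print(substring_matches("spark", accounts))    # an exact match finds nothing
```

The flip side of the wildcards is over-matching: a name such as sparkling_joe would match *spark* too, so this trick relies on account names being chosen with some care.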

filter.group

In a similar fashion, you need to know how Hadoop (or ldapsearch) can find groups. In the case of
Amazon this is (objectCategory=group)[4].

What about non-interactive users?

This section would probably have far fewer workarounds if I knew more about LDAP.

But I'm a data scientist at heart and I want to get things done.

If you have ever dealt with Hadoop, you know that there are a bunch of non-interactive users, i.e.
users who are not supposed to log in, such as hdfs, spark, hadoop, etc. These users
are important to have. However, the groups with the same name are also important to have. For
example when using airflow and launching a spark job, the log folders will be created under the
airflow user, in the spark group.

LDAP, however, doesn't allow you, to my knowledge[6], to have a user and a group with the same
name, as Unix does.

The way I solved it was to create, in LDAP, the spark_user (or hdfs_user or ...) to work around
this limitation. In fact, using the wildcards specified above to match a Hadoop user to an LDAP
user, the flow would be like this:

Here LDAP matches the spark_user, which belongs to the spark group. It doesn't care that I
asked about the spark user. At this point Hadoop creates a query to look up which groups
spark_user belongs to. It will return, in my case, the spark group.
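The two steps can be mimicked with a toy in-memory directory. Everything in this sketch is hypothetical: the entries, the DNs and the helper names are made up, and the real lookups are LDAP searches rather than dictionary scans. It is only meant to show why asking about spark ends up in the spark group.

```python
from fnmatch import fnmatchcase

# Toy directory: account name -> DN, and group name -> DNs of its members.
USERS = {
    "spark_user": "cn=spark_user,cn=users,dc=some,dc=domain,dc=com",
    "hdfs_user": "cn=hdfs_user,cn=users,dc=some,dc=domain,dc=com",
}
GROUPS = {
    "spark": [USERS["spark_user"]],
    "hdfs": [USERS["hdfs_user"]],
}


def find_user_dn(hadoop_user: str):
    """Step 1: wildcard user search, keeping the first matching entry."""
    pattern = f"*{hadoop_user}*"
    matches = [dn for name, dn in USERS.items() if fnmatchcase(name, pattern)]
    return matches[0] if matches else None


def groups_for(hadoop_user: str) -> list[str]:
    """Step 2: return the groups that list the matched DN as a member."""
    dn = find_user_dn(hadoop_user)
    if dn is None:
        return []
    return sorted(name for name, members in GROUPS.items() if dn in members)


print(groups_for("spark"))   # matched through spark_user
print(groups_for("nobody"))  # no matching user, hence no groups
```

Note that the second step works with the entry found in the first step, not with the name Hadoop originally asked about, which is what makes the spark_user workaround possible.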

Great!

Quickly creating users and groups

Creating users and groups with LDAP can be a pain. A quicker way to do so is to use adtool.
Create a ~/.adtool.cfg, with the following content

Wrapping things up

Reading it in a blog post takes maybe three minutes. Once you know the parameters of your AD, you
need five minutes to implement it. But if you don't know all of the above, it can take you a day,
just like mounting my Venetian blinds.

If you have any feedback on how to improve the non-interactive users part, I'd love to hear it.
You can find me on Twitter.