Kerberos, AFS and Batch Systems

Introduction

Kerberos is a system for securely authenticating users in an insecure
network environment. It was developed in the 1980s at MIT as part of
Project Athena. During the 1990s Kerberos V5 was standardized in RFC
1510 and became widely used, especially after Microsoft decided to base
Windows 2000 security on it. Within Kerberos, each user obtains a
Ticket Granting Ticket (TGT), which can be used to acquire dedicated
service tickets. These service tickets are then used to authenticate
the user to the respective service.

AFS (the "Andrew File System") was started at Carnegie Mellon
University as a research project. Using a slightly modified Kerberos V4
system, its developers built a secure network filesystem that supports
multiple file servers and complex access control lists. Today AFS is
used all over the world, especially by universities and other research
institutions. From the Kerberos point of view, AFS is just a service:
it requires the user to hold a valid Kerberos ticket for the service
"AFS". This ticket is often called the "AFS-Token".

A Batch System is needed to control resource usage on special machines
such as a supercomputer or a set of compute servers. Users just define
what their jobs need (e.g. the number of CPUs), and the Batch System
decides if and when a job will get these resources. It then takes care
of starting (and, if necessary, killing) the job. Finally, some
information about the job is saved for accounting or statistics. There
are several commercial and non-commercial Batch Systems available these
days; two important open source systems are Torque/OpenPBS and Sun Grid
Engine.

This article is about the problems that arise when these technologies
are used together. Imagine a user submits a job which uses files from
his home directory in AFS. This requires the Batch System to make sure
that the job holds the user's AFS-Token while running, otherwise the
job would not be able to access the files. Making these files readable
by anyone would allow the Batch System to ignore AFS altogether, but
this is not an option for security reasons. Another problem is the
limited lifetime of the AFS-Token: it has to be prolonged or renewed to
allow the job to access AFS continuously.

OpenAFS homepage (AFS became a commercial product sold by Transarc
during the 1990s. IBM, which had acquired Transarc in 1994, released
the AFS sources in 2000; this initiated the OpenAFS project.)

Kaserver / Kerberos V4 and AFS-Tokens

The Kaserver is a modified Kerberos V4 server used by the original AFS
setup. It has some special features and speaks an additional network
protocol ("Rx") used only by AFS utilities. It is possible to use a
Kerberos V4 server instead of the Kaserver (see "Using MIT's Kerberos
Server with AFS"). The normal lifetime of an AFS-Token is configured
per user on the server; usually it is about 25 hours. The Token is
isolated from the rest of the system in a structure called a Process
Authentication Group (PAG), which is identified by two unique group
IDs.

Renewal of an AFS-Token

One way to provide a long-running job with a token is to reacquire the
token from time to time. The process doing this usually needs some
knowledge of the user's password, though, either by asking for it
during job submission or by requiring the user to store it somewhere.

One tool for this task is the "Password Storage and Retrieval System"
(PSR), which uses asymmetric cryptography to store the user's password
in encrypted form in his AFS space. When a job needs to acquire a new
token, the password is decrypted and used to request a new token. The
secret key needed to decrypt the password is stored only on the machine
that runs the job. If the user changes his password and updates the
encrypted storage as well, the new password is automatically used on
the next renewal.

Another way to store the password is to use a "SRVTAB" file. Such a
file is normally used to store a server key, but it can also hold the
key of a user. The stored key is not the plaintext password but some
kind of hash of it. This way the password itself is not revealed, but
be aware: as far as Kerberos is concerned, the hash can be used just
like the password. So when a job needs to acquire a new token, the hash
is simply used in its place. A description of this technique can be
found in UMich: "How to run long lived jobs with AFS".

Some quick hints for use with KTH Kerberos V4. Create the SRVTAB like
this:

    ksrvutil -f mysrvtab -c example.com add

where example.com is the name of your AFS cell. Enter your username
when prompted for "Name:", the name of your AFS cell in uppercase for
"Realm:", and just press enter for "Instance:" and "Version Number:".
You will then be asked for a password twice; enter your normal AFS
password. The created file "mysrvtab" can be used like this:

    kauth -n myname -f mysrvtab bash

where myname is your username. kauth runs the given command (here:
bash) and repeatedly renews the AFS-Token using the secret from
mysrvtab.

Prolongation of an AFS-Token

A completely different approach is to prolong an AFS-Token, i.e. to
extend its lifetime without acquiring a new one. To do this, one has to
extract the Token from the current environment and decrypt it with the
AFS-specific Kerberos service key (known only to the Kerberos server
and the AFS fileservers). It is then possible to put a new timestamp
into the Token, thereby extending its lifetime. After encrypting it
with the service key again and putting it back into the user's
environment, the user has a Token with an extended lifetime. If this
process is repeated regularly, the Token never expires.
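
The decrypt, re-stamp, re-encrypt cycle described above can be
illustrated with a small Python sketch. It is only a toy model: the
field names ("user", "issued", "lifetime"), the JSON layout and the
XOR-based stand-in cipher are assumptions made for illustration. A real
AFS-Token is a binary Kerberos V4 ticket encrypted with DES, which this
sketch does not attempt to reproduce.

```python
import hashlib
import json
import time


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """Stand-in cipher (XOR with a SHA-256 keystream). Not DES, not secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))


def seal(service_key: bytes, fields: dict) -> bytes:
    """Encrypt the (hypothetical) token fields with the service key."""
    return _toy_cipher(service_key, json.dumps(fields, sort_keys=True).encode())


def unseal(service_key: bytes, blob: bytes) -> dict:
    """Decrypt a token; only possible with the service key."""
    return json.loads(_toy_cipher(service_key, blob).decode())


def prolong(service_key: bytes, blob: bytes, now=None) -> bytes:
    """Replace the issue timestamp and re-encrypt; other fields stay as-is."""
    fields = unseal(service_key, blob)
    fields["issued"] = int(time.time() if now is None else now)
    return seal(service_key, fields)
```

The point of the sketch is that no password is involved: whoever holds
the service key can rewrite the timestamp, which is why that key must
be protected carefully on the machine doing the prolongation.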

To my knowledge this approach was first taken at CERN in the first half
of the 1990s, where the programs GetToken, SetToken and forge were
created. These programs became the basis of CERN's "Authenticated
Remote Control" (ARC) system, and some time later Codine and
LoadLeveler gained support for these tools. (Codine is known today as
Sun Grid Engine, which still contains support for this method.)

I wrote yet another reimplementation, which does not rely on OpenAFS
but uses only Kerberos and "krbafs", as found on any Fedora Core 1
machine.

Kerberos V5 and AFS-Tokens

Kerberos V5 brought some new features, renewable TGTs being the most
notable with regard to this article. Such a ticket can be renewed via
kinit -R without the need to enter the password again. During a
renewal, the kinit command contacts the Kerberos server and asks
whether the renewal is acceptable. This makes it possible to prevent
further use of a stolen TGT, e.g. by disabling the account on the
server. As another security measure, a renewable ticket contains not
only the usual (short) lifetime but also a (long) "renewable lifetime"
that sets an upper limit for ticket renewal. Usually the normal
lifetime is about a day, while the renewable lifetime can last for
months.
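
To make the interplay of the two lifetimes concrete, here is a minimal
Python sketch that computes when kinit -R would have to run. It assumes
that each renewal yields a ticket valid for the normal lifetime again,
capped at the renewable lifetime. The function name and the
safety_margin parameter are hypothetical and not part of any Kerberos
tool.

```python
from datetime import datetime, timedelta


def renewal_times(issued, lifetime, renewable_lifetime,
                  safety_margin=timedelta(minutes=30)):
    """Return (times at which to renew, hard expiry of the ticket).

    Each renewal is scheduled `safety_margin` before the current expiry.
    No renewal can push the expiry beyond issued + renewable_lifetime.
    """
    if safety_margin >= lifetime:
        raise ValueError("safety_margin must be shorter than lifetime")
    hard_limit = issued + renewable_lifetime
    times = []
    expires = min(issued + lifetime, hard_limit)
    while expires < hard_limit:
        renew_at = expires - safety_margin
        times.append(renew_at)
        # A renewed ticket runs for `lifetime` again, capped at the limit.
        expires = min(renew_at + lifetime, hard_limit)
    return times, hard_limit
```

Once the hard expiry is reached, renewal no longer helps and a new TGT
has to be obtained with the password (or another initial credential).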

Creating AFS-Tokens out of Thin Air

A completely different way to provide jobs with AFS-Tokens is to fake
them. This is easy if you know the service key of the AFS service.
(Remember: AFS is a Kerberos service, and the AFS-Token is just a
Kerberos V4 service ticket. It therefore has a well-known structure and
is encrypted with the AFS service key.) An implementation of this
method is available as GSSKLOG, a tool which uses the GSS-API to
authenticate a client to the server and returns a faked AFS-Token if
authentication succeeds.

Batch Systems: Required Steps for AFS-Integration

A complete solution would be to build full Kerberos and AFS support
into the Batch System. But this would probably require considerable
changes to its network communication and internal structures, so a
simpler way is preferable.

When a job is submitted (e.g. qsub)

Kerberos V4 with Renewal: Ask the user for his password or check
for some prepared storage. Possibly attach some information about it to
the job.

Kerberos V4 with Prolongation: Extract the user's Token from the
current PAG on the submit host and attach it to the job.

Kerberos V5: Forward the TGT along with the job.

Fake Tokens: The Batch System must be absolutely certain of the
user's identity.

While the job is queued

Kerberos V4: Do nothing.

Kerberos V5: Renew the TGT repeatedly.

Fake Tokens: Do nothing.
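
The queued phase can be sketched as a small periodic hook in Python.
This is only an illustration: the method labels ("krb4-renewal" and so
on) and the function name are hypothetical, and the -c option used to
select a credential cache is an assumption that should be checked
against the documentation of the kinit in use.

```python
import subprocess


def queued_tick(method, ccache=None, run=subprocess.run):
    """Periodic hook for a queued job.

    Returns None if nothing has to be done, otherwise True or False
    depending on whether the TGT renewal succeeded.
    """
    if method in ("krb4-renewal", "krb4-prolongation", "fake-token"):
        # Nothing to keep alive while the job waits in the queue.
        return None
    if method == "krb5":
        # Renew the forwarded TGT without asking for a password.
        result = run(["kinit", "-R", "-c", ccache])
        return result.returncode == 0
    raise ValueError("unknown method: %r" % (method,))
```

The run parameter exists only so that the hook can be exercised without
a Kerberos installation; a Batch System would call it on a timer that
is shorter than the ticket lifetime.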

When the job is started

Kerberos V4 with Renewal: Instantiate a PAG. Let Kerberos create
a new TGT and a new AFS-Token.

Kerberos V4 with Prolongation: Instantiate a PAG. Prolong the
AFS-Token and insert it into the PAG.