Concurrent programs are incredibly difficult to debug. Race conditions are just the most security-relevant type of concurrency problem. In this chapter, the authors explore race conditions and their security ramifications.

This chapter is from the book

"O let not Time deceive you,
You cannot conquer Time.
In the burrows of the Nightmare
Where Justice naked is,
Time watches from the shadow
And coughs when you would kiss."
W. H. AUDEN"As I Walked Out One
Evening"

Race conditions are among the most common classes of bugs
found in deployed software. They are only possible in environments in which
there are multiple threads or processes occurring at once that may potentially
interact (or some other form of asynchronous processing, such as with UNIX
signals). People who have experience with multithread programming have almost
certainly had to deal with race conditions, regardless of whether they know the
term. Race conditions are a horrible problem because a program that seems to
work fine may still harbor them. They are very hard todetect, especially if
you're not looking for them. They are often difficult to fix, even when you
are aware of their existence. Race conditions are one of the few places where a
seemingly deterministic program can behave in a seriously nondeterministic way.
In a world where multithreading, multiprocessing, and distributed computing are
becoming more and more prevalent, race conditions will continue to become a
bigger and bigger problem.

Most of the time, race conditions present robustness problems. However, there
are plenty of times when race conditions have security implications. In this
chapter we explore race conditions and their security ramifications. It turns
out that file system accesses are subject to security-related race conditions
far more often than people tend to suspect.

What Is a Race Condition?

Let's say that Alice and Bob work at the same company. Through e-mail,
they decide to meet for lunch, agreeing to meet in the lobby at noon. However,
they do not agree on whether they mean the lobby for their office or the
building lobby several floors below. At 12:15, Alice is standing in the company
lobby by the elevators, waiting for Bob. Then it occurs to her that Bob may be
waiting for her in the building lobby, on the first floor. Her strategy for
finding Bob is to take the elevators down to the first floor, and check to see
if Bob is there.

If Bob is there, all is well. If he isn't, can Alice conclude
that Bob is either late or has stood her up? No. Bob could have been sitting in
the lobby, waiting for Alice. At some point, it could have occurred to him that
Alice may be waiting upstairs, at which point he took an elevator up to check.
If Alice and Bob were both on an elevator at the same time, unless it is the
same elevator, they will pass each other during their ride.

When Bob and Alice each assume that the other one is in the other
place and is staying put and both take the elevator, they have been bitten by a
race condition. A race condition occurs when an assumption needs to hold
true for a period of time, but actually may not. Whether it does is a matter of
exact timing. In every race condition there is a window of vulnerability.
That is, there is a period of time when violating the assumption leads to
incorrect behavior. In the case of Alice and Bob, the window of vulnerability is
approximately twice the length of an elevator ride. Alice can step on the
elevator up until the point where Bob's elevator is about to arrive and
still miss him. Bob can step on to the elevator up until the point that
Alice's elevator is about to arrive. We could imagine the door to
Alice's elevator opening just as Bob's door shuts. When the assumption
is broken, leading to unexpected behavior, then the race condition has been
exploited.

When it comes to computer programs, windows of vulnerability can be
large, but often they are small. For example, consider the following Java
servlet:

This tiny piece of code may look straightforward and correct to most
people, but it has a race condition in it, because Java servlets are
multithreaded. The programmer has implicitly assumed that the variable count is
the same when printed as it is after the previous line of code sets its value.
This isn't necessarily the case. Let's say that Alice and Bob both hit
this servlet at nearly the same time. Alice is first; the variable count becomes
1. Bob causes count to be changed to 2, before println in Alice's thread
runs. The result is that Alice and Bob both see 2, when Alice should have seen
1. In this example, the window of vulnerability isn't very large. It is, at
most, a fraction of a second.

Even if we move the increment of the counter into the expression in
which we print, there is no guarantee that it solves our problem. That is, the
following change isn't going to fix the problem:

p.println(++count + " hits so far!");

The reason is that the call to println takes time, as does the evaluation of
the argument. The amount of time may seem really small, maybe a few dozen
instructions. However, this isn't always the case. In a multithread system,
threads usually run for a fraction of a second, then wait for a short time while
other threads get the chance to run. It could be the case that a thread
increments the counter, and then must wait to evaluate the argument and run
println. While that thread waits, some other thread may also increment the
counter.

It is true that the window of vulnerability is very small. In
practice, this means the bug may show up infrequently, if ever. If our servlet
isn't receiving several hits per second, then it is likely never to be a
problem. This alludes to one of the reasons why race conditions can be so
frustrating: When they manifest themselves, reproducing the problem can be
almost impossible. Race conditions tend not to show up in highly controlled test
environments. If you don't have any clue where to begin looking for a
problem, you may never find it. The same sorts of issues hold true even when the
window of opportunity is bigger.

In real-world examples, an attacker with control over machine
resources can increase the odds of exploiting a race condition by slowing down
the machine. Another factor is that race conditions with security implications
generally only need to be exploited once. That is, an attacker can automate code
that repeatedly tries to exploit the race condition, and just wait for it to
succeed. If the odds are one in a million that the attacker will be able to
exploit the race condition, then it may not take too long to do so with an
automated tool.

In general, the way to fix a race condition is to reduce the window
of vulnerability to zero by making sure that all assumptions hold for however
long they need to hold. The main strategy for doing this is to make the relevant
code atomic with respect to relevant data. By atomic, we mean that all the
relevant code executes as if the operation is a single unit, when nothing can
occur while the operation is executing. What's happening with race
conditions is that a programmer assumes (usually implicitly) that certain
operations happen atomically, when in reality they do not. When we must make
that assumption, then we need to find a way to make the operation atomic. When
we don't have to make the assumption, we can code the algorithm
differently.

To make an operation atomic, we usually use locking primitives,
especially in multithread applications. For example, one way to fix our Java
servlet would be to use the object lock on the servlet by using the synchronized
keyword. The synchronized keyword prevents multiple threads from running
code in the same object that is governed by the synchronized keyword. For
example, if we have ten synchronized methods in a Java class, only one thread
can be running any of those methods at any given time. The JVM implementation is
responsible for enforcing the semantics of the synchronized keyword.

The problem with this solution is that it can have a significant
impact on efficiency. In this particular case, we have made it so that only one
thread can run our servlet at a time, because doGet is the entry point. If the
servlet is incredibly popular, or if the servlet takes a long time to run, this
solution won't work very well. People will have to wait to get their chance
inside the servlet, potentially for a long time. The solution is to keep the
code we need to be atomic (often called a critical section) as small as
possible [Silberschatz, 1999]. In Java, we can apply the synchronized keyword to
blocks of code. For example, the following is a much better solution to our
servlet problem:

We could just put the call to println inside the synchronized block,
and avoid the use of a temporary variable. However, println is a method call,
which is somewhat expensive in and of itself. There's no need for it to be
in the block, so we may as well remove it, to make our critical section finish
as quickly as possible.

As we have seen, race conditions may be possible whenever two or more
operations occur and one of the latter operations depends on the first. In the
interval between events, an attacker may be able to force something to happen,
changing the behavior of the system in ways not anticipated by the developer.
Making this all work as an attacker requires a security-critical context, and
explicit attention to timing and knowledge of the assumptions a developer may
have made.

The term race condition implies a race going on between the
attacker and the developer. In fact, the attacker must "race" to
invalidate assumptions about the system that the programmer may have made in the
interval between operations. A successful attack involves a quick-and-dirty
change to the situation in a way that has not been anticipated.