When Good Information Turns Bad

Data expiration (or automatically expiring data) means proactively getting rid of information, based on a time-based notion of staleness.

The first question we need to answer is simply this: why expire data? Why get rid of information? Rather than start by discussing a general theory, I'm going to begin with a series of concrete examples that illustrate types of data that need to be occasionally removed or refreshed programmatically. After these examples, we'll discuss the general theory a little, and then move on to actual code.

Session Keys

Session keys are probably the most common example of automatically-expired data. If you've written a Web application, you know that figuring out whether users are still logged on can be a problem. Because the Web is based entirely on a request-response protocol (Web browsers make requests and get pages back in response; the Web server does not originate any activity), a Web application has no way of distinguishing between "the user is taking a long time to carefully read the document" and "the user closed her Web browser and went home 20 minutes ago."

To work around this problem, Web applications expire sessions based on user inactivity: if a user hasn't requested any HTML pages in a while, her session will be closed and any session information will be removed. Session expiration is often handled by the servlet container; the person writing the servlet simply needs to set a session timeout. But sometimes using the container isn't good enough. For example, suppose that you're writing an application that monitors a factory. One of the HTML screens in your application might automatically update itself every 30 seconds to provide up-to-date information about the assembly line. This HTML screen has the potential to keep a session alive indefinitely: even if the user has stepped away from her desk, the application is still requesting pages. This can be a security hole (or at least a memory leak) if the developers blindly rely on the servlet container to close the session.
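
As a sketch of what handling this outside the container might look like, here is a minimal application-level session registry. All the names here are my own invention, and timestamps are passed in explicitly (rather than read from the system clock) to keep the example easy to follow: genuine user activity touches the session, while the auto-refreshing factory screen would only call isActive.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: an application-level session registry that
// expires sessions on inactivity, independent of the servlet container.
// Calls that represent real user activity touch the session; automatic
// page refreshes deliberately do not.
public class SessionRegistry {
    private final long maxIdleMillis;
    private final Map<String, Long> lastActivity = new HashMap<String, Long>();

    public SessionRegistry(long maxIdleMillis) {
        this.maxIdleMillis = maxIdleMillis;
    }

    // Record genuine user activity (form submissions, navigation).
    public synchronized void touch(String sessionKey, long now) {
        lastActivity.put(sessionKey, now);
    }

    // Automatic refreshes call this instead: the session must still
    // exist, but the refresh does not extend its life.
    public synchronized boolean isActive(String sessionKey, long now) {
        Long last = lastActivity.get(sessionKey);
        return last != null && (now - last) <= maxIdleMillis;
    }

    // Periodically invoked by a background task to drop idle sessions.
    public synchronized void expireIdleSessions(long now) {
        for (Iterator<Long> it = lastActivity.values().iterator(); it.hasNext();) {
            if (now - it.next() > maxIdleMillis) {
                it.remove();
            }
        }
    }
}
```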

Session keys are also often used outside of servlets. Many security-conscious financial institutions consider it a security violation for users to stay logged into an application when they're away from their desk. Accordingly, applications at financial institutions often automatically log out users who have been inactive for a certain period of time. And, of course, this has to be done on the server (the first rule of security is that you have to assume the client's been compromised). Even if the application didn't use the Web at all, and used a Swing-based GUI connecting to an EJB server via RMI-IIOP, you still might want to automatically log out clients. And nowhere in the entire 572 pages of the EJB specification is there anything about automatically expiring session keys.

Limited Duration "Holds"

Buying concert tickets over the World Wide Web often involves several steps that are organized as a transaction. For example, when I purchase tickets from my local ballet company, I go through the following sequence of Web pages.

I start by looking at the schedule and choosing a date and time. Then I click a button that submits the Web form and loads the "What type of ticket" page.

I choose the type of tickets I want to purchase (Grand Circle, Circle, Balcony, etc.) and indicate how many seats. Then I click a button, which submits the form and loads the "Choose your seats" page.

I choose my seats and click the button, which submits the form and loads the order confirmation page.

I click the confirm button.

The interesting thing about this sequence is that the confirmation page explicitly warns me that I haven't actually purchased the seats yet. What I've really done is put a temporary hold on the seats. If I don't click the confirm button within five minutes, the hold will be gone and I'll have to do everything again.

Now, this happens at a much finer-grained level than the idea of a "session." The temporary hold is on a resource that many different clients might want to purchase, and needs to be short-lived (especially relative to the session duration).
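
A minimal sketch of such a hold, with hypothetical names, might look like the following. A hold placed by one buyer blocks other buyers only until it lapses; confirming succeeds only while the hold is still live.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: five-minute seat holds. A hold is finer-grained
// and much shorter-lived than a session; if the buyer does not confirm
// before it expires, the seat becomes available to other clients.
public class SeatHolds {
    public static final long HOLD_MILLIS = 5 * 60 * 1000;
    private final Map<String, Long> holdExpiry = new HashMap<String, Long>();

    // Place a hold; returns false if another buyer still holds the seat.
    public synchronized boolean hold(String seat, long now) {
        Long expiry = holdExpiry.get(seat);
        if (expiry != null && expiry > now) {
            return false;  // still held by another buyer
        }
        holdExpiry.put(seat, now + HOLD_MILLIS);
        return true;
    }

    // Confirming converts a still-valid hold into a purchase.
    public synchronized boolean confirm(String seat, long now) {
        Long expiry = holdExpiry.get(seat);
        return expiry != null && expiry > now;
    }
}
```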

Short-Lived Information

In the previous example the hold was valid as long as it was maintained. The application's goal was to release the hold quickly; however, there are many cases where information simply becomes invalid over time. For example, consider a maritime buoy that is reporting weather conditions at sea. A sea-conditions Web application such as the one for Half-Moon Bay, California doesn't need to query the buoy every time a Web browser asks for the page; boaters don't need up-to-the-microsecond information. But, by the same token, boaters do expect that the information is relatively up-to-date. I'd be very upset if I made sailing plans based on the buoy's Web site and then I found out the information was updated weekly.

What the sea-conditions Web server needs to do is occasionally update its information. It can do this in either of two ways: it can wait for a request to come in, and then realize the buoy information is out-of-date and needs to be refetched; or it can automatically refresh the buoy information every so often. In either case, the server needs to have some idea of whether the information it has is valid or not. The validity is based on how old the information is.
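
The first of those two approaches (refetch on request when the cached reading is stale) can be sketched as follows. The Buoy interface and the thirty-minute window are illustrative assumptions, not part of any real sea-conditions site.

```java
// Hypothetical sketch: refresh-on-request caching of buoy readings.
// Each reading is valid for a fixed interval; a request arriving after
// that interval triggers a refetch from the (slow, remote) buoy.
public class BuoyCache {
    public static final long VALID_MILLIS = 30 * 60 * 1000;  // thirty minutes

    public interface Buoy { String readConditions(); }

    private final Buoy buoy;
    private String cached;
    private long fetchedAt;
    private int fetchCount = 0;

    public BuoyCache(Buoy buoy) { this.buoy = buoy; }

    public synchronized String conditions(long now) {
        if (cached == null || now - fetchedAt > VALID_MILLIS) {
            cached = buoy.readConditions();   // stale: query the buoy again
            fetchedAt = now;
            fetchCount++;
        }
        return cached;
    }

    public synchronized int fetchCount() { return fetchCount; }
}
```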

Freeing up Scarce Resources

Another set of examples comes from data that doesn't become invalid very quickly, and doesn't need to be released because of application logic, but that does need to be released simply because holding on to it is expensive. Examples of this sort of data are harder to come by, but easy to recognize once you have them in hand. The identifying characteristics of such data are:

The information is coarsely grouped into large data sets, and stored in a centralized persistent structure (e.g., a local cache).

The user isn't accessing the information in a fixed or predictable order.

Even if you discard the information, you can get it again.

One simple example of this type of data is a Web browser's cache. The information (HTML text, images, JavaScript, etc.) is grouped into "pages." Users access the pages in an unpredictable order (frequently using the back or forward buttons to skip around). And if the information is discarded, it can always be downloaded again.

When you have this type of data, it's frequently important to occasionally clean up the cache. This cleanup is usually done by discarding data that hasn't been accessed for some specified period of time.
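
A browser-style cache of this sort might be sketched as below (the names are mine). Each entry remembers when it was last accessed; a periodic sweep discards entries that have sat idle too long, on the theory that a discarded page can always be downloaded again.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of a last-access-time cache cleanup.
public class PageCache {
    private static class Entry {
        String content;
        long lastAccess;
    }

    private final long maxIdleMillis;
    private final Map<String, Entry> pages = new HashMap<String, Entry>();

    public PageCache(long maxIdleMillis) { this.maxIdleMillis = maxIdleMillis; }

    public synchronized void put(String url, String content, long now) {
        Entry e = new Entry();
        e.content = content;
        e.lastAccess = now;
        pages.put(url, e);
    }

    // A cache hit refreshes the entry's last-access time.
    public synchronized String get(String url, long now) {
        Entry e = pages.get(url);
        if (e == null) return null;
        e.lastAccess = now;
        return e.content;
    }

    // Discard anything not accessed within maxIdleMillis.
    public synchronized void sweep(long now) {
        for (Iterator<Entry> it = pages.values().iterator(); it.hasNext();) {
            if (now - it.next().lastAccess > maxIdleMillis) it.remove();
        }
    }
}
```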

Remote Stub Caches and Distributed Garbage Collection

Our next example is based on the RemoteStubCache object that was first introduced in my series on Command Objects in distributed programming. Recall that the idea behind RemoteStubCache was simple: client applications often fetch stubs from a naming service based on the name of a server. If client applications don't cache the stub somewhere, then remote method invocation can become much more expensive because every logical call to the server requires two remote calls: one to the naming service to fetch a stub, and one to the actual server.

The problem with building local stub caches is that RMI has a distributed garbage collector. As long as client applications are holding onto stubs, servers that rely on the Unreferenced interface to know when to shut down will stay active. This can cause problems, for example, if you are trying to manage the server lifecycle by using the distributed garbage collector.
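
A greatly simplified stand-in for the idea (not the RemoteStubCache from that series; the NamingService interface and Object-typed "stub" here are placeholders, not RMI types) might look like this: stubs are cached to avoid the naming-service round trip, but each cached stub records its last use so that stubs that have gone unused can be dropped, letting the distributed garbage collector notice that the client no longer references the server.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of a stub cache with time-based release.
public class ExpiringStubCache {
    public interface NamingService { Object lookup(String name); }

    private static class CachedStub { Object stub; long lastUse; }

    private final NamingService namingService;
    private final long maxIdleMillis;
    private final Map<String, CachedStub> cache = new HashMap<String, CachedStub>();

    public ExpiringStubCache(NamingService ns, long maxIdleMillis) {
        this.namingService = ns;
        this.maxIdleMillis = maxIdleMillis;
    }

    public synchronized Object getStub(String serverName, long now) {
        CachedStub c = cache.get(serverName);
        if (c == null) {
            c = new CachedStub();
            c.stub = namingService.lookup(serverName);  // the remote call we cache
            cache.put(serverName, c);
        }
        c.lastUse = now;
        return c.stub;
    }

    // Drop stubs that have gone unused, releasing the remote reference.
    public synchronized void releaseIdleStubs(long now) {
        for (Iterator<CachedStub> it = cache.values().iterator(); it.hasNext();) {
            if (now - it.next().lastUse > maxIdleMillis) it.remove();
        }
    }

    public synchronized boolean contains(String serverName) {
        return cache.containsKey(serverName);
    }
}
```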

Database Connections and Pooling Strategies

A similar problem arises if you use a pool for database connections. A good strategy is often to increase the size of the pool to accommodate current demand, then gradually decrease the size of the pool if demand decreases. By doing so, you get the following benefits:

You get as many connections as you need to handle surges.

You release resources on the database side as soon as you can be sure demand has lessened.

The alternatives usually involve one of the following strategies:

Allocate a large pool of connections and then don't allocate any more, no matter how many connections are needed (have the pool simply block requesting threads until a connection becomes available).

Allocate more connections during peak demand, but close them right away.

The first alternative is often not a bad idea. Peak database connection usage often corresponds to peak application processing. And a case can be made that if the application is busy, you don't want to start creating database connections. Instead, it can be more efficient to block some requests and wait for connections to become available. I have never heard anyone claim the second alternative is a good idea; it often pops up in codebases (as a consequence of some generic pooling algorithm), but it's usually viewed as a low-priority bug.
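
The grow-then-gradually-shrink strategy can be sketched as below. The Connection type here is a placeholder, not java.sql.Connection, and the names are my own: the pool creates connections on demand during surges, parks returned connections with a timestamp, and a periodic reaper closes any connection that has sat idle too long.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a connection pool that shrinks over time.
public class ShrinkingPool {
    public static class Connection {
        boolean closed;
        void close() { closed = true; }
    }

    private static class Idle {
        final Connection conn; final long since;
        Idle(Connection c, long s) { conn = c; since = s; }
    }

    private final long maxIdleMillis;
    // LIFO: recently returned connections are reused first, so the
    // oldest idle connections drift to the tail and age out.
    private final Deque<Idle> idle = new ArrayDeque<Idle>();

    public ShrinkingPool(long maxIdleMillis) { this.maxIdleMillis = maxIdleMillis; }

    // Grow on demand: reuse an idle connection or create a new one.
    public synchronized Connection acquire() {
        return idle.isEmpty() ? new Connection() : idle.pop().conn;
    }

    public synchronized void release(Connection c, long now) {
        idle.push(new Idle(c, now));
    }

    // Gradually shrink: close connections idle longer than the limit.
    public synchronized void reap(long now) {
        while (!idle.isEmpty() && now - idle.peekLast().since > maxIdleMillis) {
            idle.removeLast().conn.close();
        }
    }

    public synchronized int idleCount() { return idle.size(); }
}
```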

Examining the Examples

All of these examples fall into two main camps:

The data has become invalid and should no longer be used.

The resources are no longer needed and can be released.

These two problems look very different at first glance: "the weather might have changed" certainly feels different from "we want to release stubs to remote servers as soon as possible." But the point of listing all these examples back to back is this: in a large number of cases, the correct solution is a time-based expiration strategy:

Timestamp the objects.

Update the time stamp when you use the objects or refresh the information.

Throw away objects whose timestamps have expired.

For data sets, the idea is to use a valid time. This approach says "this data is only valid for this time interval. After that, it is no longer valid and should be thrown away." So, for example, a programmer might decide that information about the wind speed and direction is only valid for thirty minutes. Once that thirty minutes has elapsed, the information is no good and must be fetched again.

For resources, the idea is to use a last access time. For session keys, if the user hasn't accessed the server in a predetermined time period, the server decides that the session is no longer valid. For server resources, the idea is that the information is still valid, but if the client hasn't used it in a while, it should be cleaned up, so that resources can be made available to other processes. In either case, the information doesn't have a fixed expiration date. Instead, it has a rolling window.
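
The two timestamp disciplines can be captured in a few lines of code. The class and method names below are my own invention: a valid-time entry expires a fixed interval after it is fetched or refreshed (the buoy data), while a last-access entry has a rolling window that every use pushes forward (session keys, cached resources).

```java
// Minimal sketch of the two timestamp disciplines.
public class TimestampDisciplines {
    // Fixed window: only refetching the data moves the timestamp.
    public static class ValidTimeEntry {
        private long fetchedAt;
        private final long validFor;
        public ValidTimeEntry(long validFor, long now) { this.validFor = validFor; fetchedAt = now; }
        public void refresh(long now) { fetchedAt = now; }   // data was refetched
        public boolean isExpired(long now) { return now - fetchedAt > validFor; }
    }

    // Rolling window: every use extends the lifetime.
    public static class LastAccessEntry {
        private long lastAccess;
        private final long window;
        public LastAccessEntry(long window, long now) { this.window = window; lastAccess = now; }
        public void touch(long now) { lastAccess = now; }    // object was used
        public boolean isExpired(long now) { return now - lastAccess > window; }
    }
}
```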

It's important to note that all of our examples involved data that could be incorrect (e.g., data for which mild inaccuracies are acceptable, or for which recovery strategies are available). If the data absolutely must be correct, then simple data-expiration algorithms will not suffice. In most cases, if that information is going to be cached, the usual strategy is to expire it using a distributed event model.
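
For contrast, here is a sketch of event-based (rather than time-based) invalidation; all the names are illustrative. The data source notifies registered listeners when a value changes, and the cache evicts the stale entry immediately instead of waiting for a timestamp to lapse.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of event-driven cache invalidation for data
// that must be correct: a change event evicts the entry at once.
public class EventInvalidatedCache {
    public interface ChangeListener { void valueChanged(String key); }

    private final Map<String, String> cache = new HashMap<String, String>();

    // The listener the cache registers with the (possibly remote) source.
    public final ChangeListener listener = new ChangeListener() {
        public void valueChanged(String key) {
            synchronized (EventInvalidatedCache.this) {
                cache.remove(key);   // evict immediately; refetch on next use
            }
        }
    };

    public synchronized void put(String key, String value) { cache.put(key, value); }
    public synchronized String get(String key) { return cache.get(key); }
}
```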

In fact, if you are caching information, you need to ask yourself questions like these:

Is the data likely to change?

Is the change likely to matter?

Can an undetected change cause a programmatic error?

Is validating data at the moment of use a feasible strategy?

Is this information used often?

The answers to these questions will determine whether you use a cache at all; if the information changes more often than you use it, you might as well just fetch it when you need it. If you do use a cache, you need to either manage the cache locally (using a time-based algorithm like the ones in this article) or update it using an event-based algorithm.

Deriving the Standard Expiration Algorithms

In this section, we're going to repeatedly implement algorithms and containers that expire data. The idea is to explore the space of possible solutions, and come up with a set of requirements for a "good" solution. Along the way, we'll also discuss how Tomcat solved the problem of expiring session keys (and discuss a potential flaw in Tomcat's solution).

The First Solution

When faced with time-sensitive data that might need to be expired, programmers often start by using the "thread per information unit" pattern. This code is fairly simple and easy to write. It involves the following steps:

Every instance of an object that might need to be expired gets a time stamp, an expected lifetime, and a background thread (created in its constructor) that will expire the object and then halt.

If the instance is going to be expired based on inactivity, then the methods which cause it to be "marked as active" update the time stamp somehow.

The background thread occasionally checks the instance to see if it needs to be expired.

Thus, for example, a developer might start by writing the following code:

package grosso.firstexample;

import java.util.*;

/*
ExpirableObject.

Abstract superclass for objects which will expire.
One interesting design choice is the decision to use
the expected duration of the object, rather than the
absolute time at which it will expire. Doing things this
way is slightly easier on the client code
(often, the client code can simply pass in a predefined
constant, as is done here with DEFAULT_LIFETIME).
*/

public abstract class ExpirableObject {
public static final long FIFTEEN_MINUTES = 15 * 60 * 1000;
public static final long DEFAULT_LIFETIME = FIFTEEN_MINUTES;

Non-abstract subclasses of ExpirableObject must have an implementation of the expire method, which contains the "death logic" of the object. Each instance of ExpirableObject allocates a thread that, after a certain amount of time has elapsed, calls the expire method.
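
The listing above is excerpted. A minimal completion along the lines just described might look like the following; the touch method and the Expirer inner-class name are my own, not necessarily those of the full original listing.

```java
// Hypothetical completion of ExpirableObject: one background thread
// per instance, created in the constructor, which sleeps until the
// expiration time arrives, calls expire, and then halts.
public abstract class ExpirableObject {
    public static final long FIFTEEN_MINUTES = 15 * 60 * 1000;
    public static final long DEFAULT_LIFETIME = FIFTEEN_MINUTES;

    private long _expirationTime;

    protected ExpirableObject(long lifetime) {
        touch(lifetime);
        Thread expirer = new Thread(new Expirer()); // one thread per instance
        expirer.setDaemon(true);
        expirer.start();
    }

    // "Marking the object as active": pushes expiration into the future.
    public synchronized void touch(long lifetime) {
        _expirationTime = System.currentTimeMillis() + lifetime;
    }

    private synchronized long timeLeft() {
        return _expirationTime - System.currentTimeMillis();
    }

    // The "death logic", supplied by non-abstract subclasses.
    protected abstract void expire();

    private class Expirer implements Runnable {
        public void run() {
            long remaining;
            while ((remaining = timeLeft()) > 0) {
                try {
                    Thread.sleep(remaining);
                } catch (InterruptedException e) {
                    return;
                }
            }
            expire();   // expire the object, then halt
        }
    }
}
```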

This approach is often sufficient for simple applications that have very few pieces of data. But it has a major flaw: using this class creates lots of threads (one per instance of the object in question). As a result, programs which create instances of ExpirableObject for central data structures don't scale very well.

Eliminating the Extra Threads

There's a simple solution to the number-of-threads problem raised by the first solution: instead of each object having its own expiration thread, use a single expiration thread for all of the instances of ExpirableObject. The code now contains two classes, ExpirableObject and GlobalExpirationHandler, and looks something like the following:

package grosso.secondexample;

/*
ExpirableObject.

Abstract superclass for objects which will expire.
During construction, instances of this class register
with the GlobalExpirationHandler, which owns a background
thread that expires objects.
*/

public abstract class ExpirableObject{
public static final long FIFTEEN_MINUTES = 15 * 60 * 1000;
public static final long DEFAULT_LIFETIME = FIFTEEN_MINUTES;
private long _expirationTime;

All that we've really done here is pull the expiration thread out of ExpirableObject and adapt it, using the Expirer inner class, to expire any number of instances of ExpirableObject. Even in an example as simple as this, however, there are five interesting points to note:

We had to be a lot more careful with our expiration thread -- we actually catch Throwable. If we didn't do this, then a single instance of any subclass of Throwable could kill the expiration thread and ruin our expiration strategy. For example, a single NullPointerException thrown by some carelessly written expire method could prevent any objects from being expired.

Performance requires two data structures: an instance of Vector and an instance of HashMap. Iterating over an instance of HashMap is very slow; checking to see whether an instance of Vector contains an object is even slower. Keeping both lets the handler iterate over the Vector while using the HashMap for lookups.

Our thread is now a lot more active! It wakes up fairly often (every fifteen seconds!) and checks to see which objects are due for expiration. If a single thread is expiring many instances, with many different creation times, then it needs to be fairly active. We've traded lots and lots of very inactive threads for one fairly active thread. It's a good tradeoff to make, but we should be aware, as we move forward, that at some point it's going to be worth looking into ways of making the background thread less computationally expensive.

We've named our single background thread (naming threads simplifies debugging) and given it a low priority (data expiration is usually not a high priority task).

We've gained a little bit of flexibility in ExpirableObject by using the new public method setExpiration. Instances of ExpirableObject that are in use can have their expiration delayed (without making the background thread work any harder).

This example also used generic classes (specifically, the generic collection classes). This wasn't really necessary (we didn't really need to use genericized hashtables or vectors), but it's fun and makes the code a little cleaner. As the examples get more elaborate, we'll start incorporating generics into the signatures of all the objects and interfaces. For more information about generics, I recommend the introduction contained in the third article in my series on command objects.