Resources and Caches

Another topic to keep in mind when deploying a large-scale application is how resource-hungry it is: you have to consider what impact each new component is going to have on your overall infrastructure.

Let's look again at a practical example. VNU delivers news on the Web. All of our articles are nicely stored in a huge Oracle database, we have a library of several years accessible online at all times, and at the rate at which James Middleton writes articles, I believe that our Sun StorEdge is going to be full in a few months.

Our idea is that each new story, each new article, might have references in the past and, although most of those references are quite recent, some of them go back quite a long way. If you browse our site and check out the related articles at the bottom of every page, you will understand what I mean.

If we had to query the database for every article you click, we would probably need a Sun E10000 by now, just to pass the articles back to our Web server. But a couple of years ago, a colleague of mine named Julian Mitchell came up with the idea of caching articles in the servlet container itself.

Brilliant solution: for a few pounds, we added some extra RAM to our Web server and gave it all to the Java Virtual Machine. When a user looks for an article, it is usually retrieved not from the database but from the cache held in the Virtual Machine's memory, and we hit the database only if the article is not cached. You can even see this when we restart our application: for the first four to five minutes the site is quite slow, as we are effectively refilling the cache of articles.
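
To make the idea concrete, here is a minimal sketch of such a cache in Java. It is an illustration, not VNU's actual implementation: the ArticleCache name, the loader function standing in for the database query, and the least-recently-used eviction policy are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Bounded least-recently-used cache that loads missing entries on demand. */
class ArticleCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader; // hypothetical stand-in for the database query
    private long hits;
    private long misses;

    ArticleCache(final int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true moves an entry to the tail every time it is read,
        // so the eldest entry is always the least recently used one.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, calling the loader only on a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = loader.apply(key);
        if (value != null) {
            entries.put(key, value); // may evict the least recently used entry
        }
        return value;
    }

    synchronized long hits() { return hits; }
    synchronized long misses() { return misses; }
    synchronized int size() { return entries.size(); }
}
```

Holding the lock while the loader runs keeps the sketch short, but it also means that two misses cannot go to the database at the same time; a production cache would want something finer-grained. The hit and miss counters are there so that the slow warm-up after a restart shows up as a number rather than as a feeling.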

Several products now incorporate this feature as the default. If you look throughout Apache, for example, you will notice that Cocoon has an "embedded" cache storing the result of each atomic operation (XML parsing, XSLT processing, everything) so that it is able to deliver the quite-heavy-to-process XML-based content in basically no time.

If you are going to implement a cache for some part of your application, make sure you know exactly how much RAM you can spare for it, and that you have the right tools for measuring how much memory you are using at any given point. Several books explain the different caching algorithms. My favorites are those about microprocessor hardware architecture: every microprocessor has some sort of cache and, since it needs to be implemented in hardware, the algorithms are usually small and functional.
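
As a starting point for the measuring part, the JVM itself can report its heap usage through java.lang.Runtime. The sketch below is only one way of doing it; the HeapSnapshot class and its field names are made up for this example, and the numbers cover the Java heap only, not the memory of the whole process.

```java
/** Snapshot of JVM heap usage, read from java.lang.Runtime. */
class HeapSnapshot {
    final long usedBytes;      // heap occupied by objects, including garbage not yet collected
    final long committedBytes; // heap the JVM has claimed from the operating system so far
    final long maxBytes;       // ceiling the heap may grow to (set with -Xmx)

    HeapSnapshot(long usedBytes, long committedBytes, long maxBytes) {
        this.usedBytes = usedBytes;
        this.committedBytes = committedBytes;
        this.maxBytes = maxBytes;
    }

    static HeapSnapshot take() {
        Runtime rt = Runtime.getRuntime();
        long committed = rt.totalMemory();
        long used = committed - rt.freeMemory();
        return new HeapSnapshot(used, committed, rt.maxMemory());
    }

    /** Fraction of the maximum heap currently in use. */
    double usedFraction() {
        return (double) usedBytes / maxBytes;
    }

    @Override
    public String toString() {
        long mb = 1024L * 1024L;
        return String.format("heap used=%dMB committed=%dMB max=%dMB",
                usedBytes / mb, committedBytes / mb, maxBytes / mb);
    }

    public static void main(String[] args) {
        System.out.println(HeapSnapshot.take());
    }
}
```

Because the used figure includes garbage that has not been collected yet, a single reading says little; it is the trend over many readings that tells you whether the cache fits in the RAM you set aside for it.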

Of course, VNU's cache is designed to deal with articles; Cocoon's is optimised for XML content. The Java Platform will probably have a caching engine of its own one day; JSR-107 is aimed exactly at that. It specifies an API and semantics for temporary, in-memory caching of Java objects, including object creation, shared access, spooling, invalidation, and consistency across JVMs.

Funnily enough, the JSR was submitted by Oracle. (Maybe they had the same problem we had on our site?)

Tuning and Monitoring the Virtual Machine

The virtual machine is the most critical piece in the overall problem of deploying large-scale Web applications.

Apache, for example, goes to a great deal of trouble to make sure that it won't crash. The architecture of Apache 2.0 was designed to include threads for performance reasons (each one of your processes can serve several concurrent requests), but at the same time to guarantee that if one of the Apache processes dies, you won't have to wait for its restart to continue serving requests. It is usual to see configurations with 4, 8, or 16 Apache processes, each one of them using 64, 128, or 256 threads.

The virtual machine is just one big process with hundreds of threads processing requests but, if this one goes down, you have to wait until it comes back up again and, during that downtime, requests obviously cannot be processed.

One way to overcome this is to load-balance several JVMs with the same set of Web applications deployed, but this is not an easy thing to do.

A much simpler approach is to spread your Web applications across several containers: for example, run each of your applications in a different servlet container, in a different Java Virtual Machine.

This gives you fine-grained control over each individual component of your Web site: you can individually check, for example, how much memory each application requires in relation to the number of requests (and see how well it reacts to spikes of traffic), or what its overall impact on your OS is (top can tell you all about it).

The advantage is that, if one of these falls over, your site (or most parts of it) will still be up and running, and it will take much less time to restart a VM holding one single application than a VM containing four, five, or six of them (less memory, fewer classes to load, fewer JSPs to recompile).

One other important thing to monitor in your virtual machine is the number of file descriptors. As with every other process, the Java Virtual Machine has a limit imposed by the operating system on the number of file descriptors (files and sockets for the most part) it can open at the same time.

Things like Lucene (search engine), JDBC connection pools, and client connections from the Web servers can greatly vary the number of file descriptors opened at any given time by the Virtual Machine.

Most operating systems are quite conservative about the number of file descriptors each process can open (the default is usually 256 or 512), so you want to make sure that your limit is high enough for your Web application. ulimit is a great, but sometimes forgotten, utility: if you start seeing IOExceptions saying that the VM cannot open a file that is actually on the disk and has the right permissions, this limit is the likely cause.
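
The limit can also be watched from inside the Virtual Machine. The sketch below assumes a current HotSpot or OpenJDK JVM on a Unix-like system, where the operating-system management bean implements com.sun.management.UnixOperatingSystemMXBean; on other platforms it simply reports nothing. The FileDescriptorCheck class, the nearLimit helper, and the 90 percent threshold are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

class FileDescriptorCheck {
    /** Returns {open, max}, or null where the JVM does not expose the counts. */
    static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    /** True when the open count has reached the given fraction of the limit. */
    static boolean nearLimit(long open, long max, double threshold) {
        return max > 0 && (double) open / max >= threshold;
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        if (counts == null) {
            System.out.println("file descriptor counts are not available on this JVM");
        } else {
            System.out.println("open=" + counts[0] + " max=" + counts[1]
                    + (nearLimit(counts[0], counts[1], 0.9) ? " (close to the limit)" : ""));
        }
    }
}
```

Logging these two numbers alongside your other measurements makes it much easier to tell a genuine descriptor leak from a limit that was simply set too low.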

When Things Go Wild

If something bad can happen, it will. Murphy's Law is an everyday reality for those involved with production servers.

The first thing to stress is the importance of logging. Logging is quite an expensive operation, and sometimes someone turns it off for performance's sake. This is the first big mistake you can make, because without logs you will not know what actually happened when something crashes. Make sure, too, that what you log is detailed enough to be useful.

Another common mistake is to be overloaded by logging information. In some of the installations I've had the not-so-pleasant experience of visiting, when something crashes, someone has to go (manually) through several megabytes of log files just to figure out at what time things went bad. A good approach is to make sure that each one of your log files contains only relevant information. It is a good idea to split log files into several categories, each one for a particular area you want to focus on, as it is easier to combine log files than to split them.
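
With the java.util.logging package that ships with the platform, splitting by category can be as simple as giving each area its own logger and its own file. This is a rough sketch only: the CategoryLogs class, the "site." prefix, and the category names used below are invented for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

class CategoryLogs {
    /** Returns a logger that writes only to <dir>/<category>.log. Call once per category. */
    static Logger forCategory(Path dir, String category) {
        Logger logger = Logger.getLogger("site." + category); // the prefix is an arbitrary choice
        try {
            // The second argument (true) appends to an existing file instead of truncating it.
            FileHandler handler =
                    new FileHandler(dir.resolve(category + ".log").toString(), true);
            handler.setFormatter(new SimpleFormatter());
            logger.addHandler(handler);
        } catch (IOException e) {
            throw new UncheckedIOException("cannot open log file for " + category, e);
        }
        // Keep these records out of the root logger's handlers (the console by default).
        logger.setUseParentHandlers(false);
        return logger;
    }
}
```

A call such as CategoryLogs.forCategory(logDir, "cache") would then give the caching code a file of its own, separate from, say, the "database" category, and the files can still be merged by timestamp later if you need the whole picture.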

Make sure also that your logs can be easily parsed. Most of the time there is too much data to analyze manually, but a couple of good Perl or bash scripts, together with an introductory book on statistical analysis, can do marvels that you'd never have imagined possible. Remember, for instance, that if you log exception stack traces, these come in a very ugly multi-line format and are not well suited to sharing a file with, say, your Apache error logs.
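
One way to keep stack traces from wrecking a line-oriented log is a formatter that squeezes every record, exception included, onto a single line. The sketch below uses java.util.logging; the OneLineFormatter name and the tab-separated layout are assumptions of this example, not an established format.

```java
import java.time.Instant;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

/** Writes each record as one tab-separated line: time, level, logger, message, causes. */
class OneLineFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        StringBuilder line = new StringBuilder();
        line.append(Instant.ofEpochMilli(record.getMillis())) // ISO-8601 timestamp in UTC
            .append('\t').append(record.getLevel().getName())
            .append('\t').append(record.getLoggerName())
            .append('\t').append(escape(formatMessage(record)));
        // Flatten the exception chain: one field per throwable, keeping only
        // the frame where it was thrown instead of the full stack trace.
        for (Throwable t = record.getThrown(); t != null; t = t.getCause()) {
            line.append('\t').append(escape(t.toString()));
            StackTraceElement[] frames = t.getStackTrace();
            if (frames.length > 0) {
                line.append(" at ").append(frames[0]);
            }
        }
        return line.append('\n').toString();
    }

    /** Escapes the characters that would break the one-record-per-line layout. */
    private static String escape(String text) {
        if (text == null) {
            return "";
        }
        return text.replace("\\", "\\\\")
                   .replace("\t", "\\t")
                   .replace("\n", "\\n")
                   .replace("\r", "\\r");
    }
}
```

Dropping all but the first frame is a deliberate trade-off: you lose detail, but every record stays on one line, so cut, sort, and grep keep working. If you need full traces, send them to a separate file.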

Another thing to remember is to monitor resources. A spike in traffic (the usual one at 10 PM when all geeks turn on their computers and look for news) can put your entire Web server at risk. Things that in the past few months had quite some relevance for us were:

RAM. Across the whole system, RAM consumption might vary quite a lot, so be sure to monitor it and make sure the machine does not swap too much.

Swap. As before, the more swap is in use, the slower performance is. On a side note, remember that, at least on Solaris, /tmp is mounted on your swap, so it's usually a bad idea to store 500 megabytes of tarballs in there just because it's fast.

Web Server Connections. Remember to monitor the state of the Web server, how many processes and threads (the -L option for ps under Solaris tells you about threads) are active, how many clients have active requests, and what they are requesting (Apache's mod_status can tell you all about this). About access logging, please do not trust services such as WebTrends or NetTraffic. They are marvelous marketing tools, but not even close to being a reliable way to figure out the activity on your Web server.

Network traffic. How much data are you sending to or receiving from the network? This is essential to know, for example, in case of denial-of-service attacks.

If you collect all of this information at regular intervals (for example, we monitor each one of these parameters every minute), when something goes wrong you will have the situation pretty much under control. You will start with a rough idea of what actually happened, at what time exactly, and more or less why.
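
For the part of the picture that the Virtual Machine can see by itself, a small scheduled task is enough to take such a sample every minute. This is a sketch under stated assumptions: the ResourceSampler class and its output layout are invented here, and it only covers heap, thread count, and load average; swap, Web server connections, and network traffic still need tools outside the JVM.

```java
import java.lang.management.ManagementFactory;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class ResourceSampler implements Runnable {
    private final Consumer<String> sink; // where each sample line goes, for example a logger

    ResourceSampler(Consumer<String> sink) {
        this.sink = sink;
    }

    /** Builds one tab-separated sample: timestamp, used heap bytes, live threads, load average. */
    static String sample() {
        Runtime rt = Runtime.getRuntime();
        long usedHeap = rt.totalMemory() - rt.freeMemory();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        // Negative on platforms where the load average is not available.
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return Instant.now() + "\t" + usedHeap + "\t" + threads + "\t" + load;
    }

    @Override
    public void run() {
        try {
            sink.accept(sample());
        } catch (RuntimeException e) {
            // An exception escaping from here would cancel all future runs.
            System.err.println("sampling failed: " + e);
        }
    }

    /** Starts sampling immediately and then once per period, on a daemon thread. */
    ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(task -> {
                    Thread thread = new Thread(task, "resource-sampler");
                    thread.setDaemon(true); // never keeps the JVM alive on its own
                    return thread;
                });
        scheduler.scheduleAtFixedRate(this, 0, period, unit);
        return scheduler;
    }
}
```

Something like new ResourceSampler(System.out::println).start(1, TimeUnit.MINUTES) would print one line per minute; in practice you would hand the lines to a log file of their own so they can be lined up against the application logs by timestamp.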

And then, given that you have all the logs available in a nice parsable format, and organized, you will be able to pinpoint exactly what caused the problem, and (hopefully) find a solution in no time.

But that's far from saying that you won't be called at 5:30 AM on a beautiful Sunday morning because the Web site is down (again).

Pier Fumagalli is an Apache Software Foundation member and active in the Jakarta and HTTPD/APR projects. He works for VNU Business Publications in London.