Profiling LAMP Applications with Apache's Blackbox Logs

Most people use the Apache web server for web services because it can scale
to match growth. It works well on low- and high-end server hardware and it has
an extensible API for adding new modules. Also, since HTTP is a stateless
protocol, you can add more web servers to handle the same content.

If your CGI scripts are taking too long, you can modify your code to turn
your Apache server into an application server that runs the scripts in the
background. If you want redundancy, you can put multiple servers in a
load-balanced pool. If your outgoing bandwidth costs are too high, you can use
a distributed caching service like Akamai or Mirror Image.

But by adding more components to your web server environment, you increase
the chance that something will go wrong. When something does go wrong, the
complexity makes the problem analysis more difficult.

In almost all cases, when an Apache web server is performing outside of
normal specifications, a good administrator will check the web server log. The
usual theory is that a web server's slow performance means the server is being
hammered with requests. Maybe a web designer is now serving a five-megabyte
MPEG file, or maybe a worm is running a distributed DoS attack against your
site. In those situations, checking the log file is the right move.

My current work assignment puts me in a team of administrators handling
roughly 20 web servers. The servers are load-balanced and hand out both
static and dynamic content. We proxy almost everything through a distributed
caching service to reduce our bandwidth costs and to improve client load
times.

The setup isn't foolproof. We see balanced traffic 99.9% of the time, but
sometimes a group of cache servers will fail, or new content will be added
without properly set Expires headers. Sometime during my first month, I
experienced my first unexpected traffic surge, and my boss asked me to see if I
could find anything in the log files.

Log File Analysis

When I started my analysis, I assumed that most of the content would be
cached, so I should be seeing many hits from unique IP addresses within the
allowed cache lifetime of each object. It is also possible for a cache server
to expire an object early if it runs out of disk space, or to reload an
expired object that hasn't actually changed. With that in mind, I whipped up a
couple of log-file parsing scripts.

Just to review, here is a tiny clip of a web server log using the common
log file format.
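(The entries below are made-up, illustrative examples; 192.0.2.x is a reserved
documentation address range.)

```
192.0.2.17 - - [12/Mar/2004:10:10:09 -0600] "GET /index.html HTTP/1.0" 200 4523
192.0.2.44 - - [12/Mar/2004:10:10:10 -0600] "GET /images/logo.gif HTTP/1.0" 200 1976
192.0.2.17 - - [12/Mar/2004:10:10:12 -0600] "GET /index.html HTTP/1.0" 304 -
```

Each entry records the client address, the identity and user fields (usually
"-"), the timestamp, the request line, the status code, and the number of
bytes sent ("-" when no body was returned, as with a 304).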

The most obvious information you can get from each entry is the source IP of
the request, the request string, and the time of the request. With a parsing
script, you can also find the size of the largest file served, the highest
number of requests that came in during a single second, a breakdown of all
the source IPs, and the ratio of 200 to 304 responses.
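As a sketch of such a parsing script (the function and field names here are my
own, not taken from any particular tool), the following Python computes those
statistics from common-log-format lines:

```python
import re
from collections import Counter

# Common log format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def summarize(lines):
    """Collect simple traffic statistics from common-log-format lines."""
    ips = Counter()          # hits per source IP
    per_second = Counter()   # hits per timestamp (one-second resolution)
    statuses = Counter()     # hits per HTTP status code
    largest = 0              # largest response body served, in bytes
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue         # skip malformed entries
        ips[m.group('host')] += 1
        per_second[m.group('time')] += 1
        statuses[m.group('status')] += 1
        if m.group('bytes') != '-':
            largest = max(largest, int(m.group('bytes')))
    ratio = statuses['200'] / statuses['304'] if statuses['304'] else None
    return {
        'unique_ips': len(ips),
        'peak_hits_per_second': max(per_second.values(), default=0),
        'largest_object_bytes': largest,
        'ratio_200_to_304': ratio,
    }
```

A real script would also break hits down per URL, but even these few counters
are enough to spot a surge concentrated on one second or one client.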

After some checking I discovered that a few objects had a much higher hit
ratio than others. Those hits all had unique IP addresses, and none of those
addresses had hit other URLs handled by our server. My conclusion was that the
cache servers were being bypassed for a few specific objects, which would
explain the high traffic load. Later, we discovered that a group of URLs
posted on a public web site incorrectly pointed directly at our servers
instead of the cache.

Since that incident, I have worked on more tools for log-file hit analysis.
I took older log files and ran them through more parsing scripts to find out as
much as I could about our traffic patterns.

Unfortunately, I learned that there was performance data I would never be
able to find in the log file. I couldn't see whether web clients were using
keep-alives, which child process handled which request, or how many different
clients were coming from the same source IP. While the common and combined
log formats are great for hit tracking, they aren't suitable for gathering
hardcore performance data.

Fortunately, the Apache web server is known for being incredibly flexible.
The mod_log_config module makes it easy to put together a log file
format that will give you the data you need. The Apache manual lists all of
the logging directives you can use to build your own custom format.
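For example (specifier availability varies by Apache version, and the
"perflog" nickname below is just illustrative; consult the mod_log_config
manual for your release), a custom format could capture the connection
status, the serving child's process ID, and the time taken per request:

```
# %h = client host, %X = connection status ("+" means the connection
# was kept alive; Apache 1.3 used %c instead), %P = pid of the child
# that served the request, %D = time taken to serve the request, in
# microseconds (Apache 2.x).
LogFormat "%h %X %P %t \"%r\" %>s %b %D" perflog
```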

The Blackbox Format

In this article, I discuss a format that uses Apache's best logging
directives to create a running record of what the server is doing as incoming
clients access it. The format is called Blackbox.

A black box is a device that records information on the performance of an
aircraft. Airspeed, direction, and status indicators combine to give an
accurate account of what happened during a flight. If the aircraft develops
problems (or, in the worst case, crashes), technicians can use the data in the
black box to find out what went on within the aircraft.

This log format is meant to supplement the common access log, not to replace
it. Since Apache can handle multiple log files, administrators can have
Apache write to a second log file by adding two configuration directives.
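The pair looks like this (the format string is elided here, since the
Blackbox definition comes later; the nickname and file path are illustrative):

```
LogFormat "..." blackbox
CustomLog /var/log/apache/blackbox_log blackbox
```

The LogFormat directive binds the format string to a nickname, and CustomLog
tells Apache to write a separate log file using that nickname, alongside the
existing access log.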