Profiling LAMP Applications with Apache's Blackbox Logs

Most people use the Apache web server for web services because it can scale
to match growth. It works well on low- and high-end server hardware and it has
an extensible API for adding new modules. Also, since HTTP is a stateless
protocol, you can add more web servers to handle the same content.

If your CGI scripts are taking too long, you can modify your code to turn
your Apache server into an application server that runs the scripts in the
background. If you want redundancy, you can put multiple servers in a
load-balanced pool. If your outgoing bandwidth costs are too high, you can use
a distributed caching service like Akamai or Mirror Image.

But by adding more components to your web server environment, you increase
the chance that something will go wrong. When something does go wrong, the
complexity makes the problem analysis more difficult.

In almost all cases when an Apache web server is performing outside of
normal specifications, a good administrator will check the web server log. This
follows from the most common theory about slow performance: the server is
being hammered with requests. Maybe a web designer is now serving a
five-megabyte MPEG file, or maybe a worm is running a distributed
denial-of-service attack against your site. In these situations, checking the
log file is a good thing.

My current work assignment puts me in a team of administrators handling
roughly 20 web servers. The servers are load-balanced and hand out both
static and dynamic content. We proxy almost everything through a distributed
caching service to reduce our bandwidth costs and to improve client load
times.

It isn't foolproof. We see balanced traffic 99.9% of the time, but
sometimes a group of cache servers will fail, or new content may be added
without properly set Expires headers. Sometime during my first month, I
experienced my first unexpected traffic surge, and my boss asked me to see if I
could find anything in the log files.

Log File Analysis

When I started my analysis, I made the assumption that most of the content
would be cached. I should have been seeing a lot of hits from unique IP
addresses during the allowed cache lifetime of each object. It was also
possible that a cache server would expire an object on its own if it ran out of
disk space, or would try to reload an expired object that had not actually
changed. With that in mind, I whipped up a couple of log-file parsing scripts.

Just to review, here is a tiny clip of a web server log using the common
log file format.
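A few illustrative entries (the addresses, paths, and timestamps here are invented, not taken from a real server) look like this:

```
192.168.1.4 - - [21/Feb/2004:14:31:46 -0600] "GET /index.html HTTP/1.1" 200 2048
192.168.1.4 - - [21/Feb/2004:14:31:47 -0600] "GET /images/logo.gif HTTP/1.1" 200 5120
10.0.0.9 - - [21/Feb/2004:14:32:05 -0600] "GET /index.html HTTP/1.0" 304 -
```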

The most obvious information you can get from each entry is the source IP of
the request, the request string, and the time of the request. With a parsing
script, you can also find the size of the largest file served, the highest
number of requests that came in during one single second, a breakdown of all
the source IPs, and the ratio of 200 to 304 hits.
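As a sketch, a minimal Python parser for the common log format could compute those metrics like this (the field layout follows the CLF; the function and variable names are my own):

```python
import re
from collections import Counter

# Common log format: host ident user [time] "request" status bytes
CLF = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def summarize(lines):
    per_second = Counter()   # hits per timestamp (1-second resolution)
    per_ip = Counter()       # hits per source IP
    status = Counter()       # hits per status code
    largest = 0              # largest object served, in bytes
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue         # skip malformed entries
        per_second[m.group("time")] += 1
        per_ip[m.group("ip")] += 1
        status[m.group("status")] += 1
        if m.group("bytes") != "-":
            largest = max(largest, int(m.group("bytes")))
    peak = max(per_second.values(), default=0)
    ratio = status["200"] / status["304"] if status["304"] else None
    return largest, peak, per_ip, ratio
```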

After some checking, I discovered that a few objects had a much higher hit
ratio than others. Those hits all came from unique IP addresses, and none of
those addresses had hit other URLs handled by our server. My conclusion was
that the cache servers were being bypassed for a few specific objects, which
would explain the high traffic load. Later, we discovered that a group of URLs
had been posted to a public web site, incorrectly pointing directly at us.

Since that incident, I have worked on more tools for log-file hit analysis.
I took older log files and ran them through more parsing scripts to find out as
much as I could about our traffic patterns.

Unfortunately, I learned that there was performance data out there that I
would never be able to find in the log file. I couldn't see whether web clients
were using keep-alives, which child process handled which request, or how many
different clients were coming from the same source IP. While the common and
combined log-file formats are great for hit tracking, they aren't suitable for
gathering hardcore performance data.

Fortunately, the Apache web server is known for being incredibly flexible.
The mod_log_config module makes it easy to put together a log file
format that will give you the data you need. The Apache manual lists all of
the logging directives you can use to build your own custom format.

The Blackbox Format

In this article, I will discuss a format that uses the best logging
directives of Apache to create a running record of what Apache is doing when
incoming clients access the server. The format is called Blackbox.

A black box is a device that records information on the performance of an
aircraft. Airspeed, direction, and status indicators combine to give an
accurate account of what happened during a flight. If the aircraft develops
problems (or, in the worst case, crashes), technicians can use the data in the
black box to find out what went on within the aircraft.

This log format is meant to supplement the common access log file, not to
replace it. Since Apache can handle multiple log files, administrators can have
Apache write to the second log file by adding two configuration directives.

What to Log

The directives listed below are most compatible with Apache 2.0 with the
mod_logio module. Later in the article, I will discuss
cross-compatibility with Apache 1.3 and other server environments.

Source IP, Time, and Request Line

%a, %t, and %r

These directives are already used in the common log file format. They are
the three most obvious request metrics to track.

When logging the remote host, it is important to log the client IP address,
not the hostname. To do this, use the %a directive instead of
%h. Even if HostnameLookups is enabled, %a
will record only the IP. For the purposes of the Blackbox format, reverse DNS
should not be trusted.

The %t directive records the time that the request started. It
could be modified using a strftime format, but it would be better
to keep it as is. That makes it easier to correlate lines between the Blackbox
log file and the common log file.

The %r directive is the first line of text sent by the web
client, which includes the request method, the full URL, and the HTTP protocol.
It is possible to break up this data using individual directives. For example,
you could log a URL without a query string. Again, it's better to keep the
request line intact for comparison.

Process id and Thread id

%{pid}P and %{tid}P

When the Apache server starts, it spawns off child processes to handle
incoming requests. As it runs, it shuts down older processes and adds new ones.
Apache can add additional child processes if it needs to keep up with a high
demand. By recording the process id and thread id (if applicable), you will
have a record of which child process handled an incoming client.

You can also track the number of Apache processes for a given time and
determine when a child process shut down. If you are running an application
handler (mod_perl, mod_python), recording the PID
will make it easier to find out what hits a child process was handling when
debugging an application error.

Connection Status

%X

The connection status directive tells us detailed information about the
client connection. It returns one of three flags: X if the client
aborted the connection before completion, + if the client has
indicated that it will use keep-alives (and request additional URLs), or
- if the connection will be closed after the request.

Keep-alive (persistent connections, the default behavior in HTTP 1.1) lets
a client request multiple files during the same connection. This way, a client
doesn't need to go through the overhead of re-establishing a TCP connection to
retrieve a new file.

For Apache 1.3, use the %c directive in place of the
%X directive.

Status Codes

%s and %>s

There's nothing really new about these directives, since the status code
already appears in the common log file format. The CLF records the status code
after any internal redirections take place, using %>s. For the
Blackbox format, we want to record the status code both before
(%s) and after (%>s) the redirection.

Time to Serve Request

%T and %D

The common log file format cannot accurately determine the amount of time it
takes to serve a file. Some parsing programs will try to make estimates based
on the timestamps of hits from the same source, but such estimates are very
unreliable, especially if the hits are being made in parallel.

These two directives will give you the exact metrics you need. The
%T directive will report the time in seconds it took to handle the
request while the %D directive will report the same time in
microseconds.

Apache 1.3 does not support the %D directive.

Bytes Sent and Received

%I, %O, and %B

Apache 2.0 includes the optional mod_logio module which can
report on how many bytes of traffic a client sent and how many bytes the
server returned.

The %b directive does a good job, but it only reports the bytes
returned in the requested object, excluding the bytes from the HTTP headers.
The header traffic is usually small, but you may want to record it to get a
better idea of what the outgoing TCP traffic on a given interface looks like.
Recording the incoming bytes is helpful when your users are uploading files
with the PUT or POST methods.

Use %I to record incoming bytes, %O to record
outgoing bytes, and %B to record outgoing content bytes. In cases
where no content is returned, the %B directive returns a zero,
whereas %b returns a dash. Since we're dealing with integer
values, it's better to use %B.

Client Source Port

%{REMOTE_PORT}e or %S

Logging the client source TCP port can provide some useful network data and
can help us associate a single client with multiple requests.

If two clients behind the same IP address make simultaneous connections, the
common log file format cannot distinguish between them. If, on the other hand,
a client uses keep-alives, then every hit made over a single TCP session will
share the same client port number.

The port information can indicate how many connections our server is
handling at once, which may help in tuning server TCP/IP settings. It can also
identify which client ports represent legitimate requests if the administrator
is examining a possible SYN-flood attack against a server.

This is the part where Apache doesn't really help you out. The only native
way to record the source port is by using %{REMOTE_PORT}e, which
grabs the value of the REMOTE_PORT environment variable, but
environment variables only come into play when mod_cgi or
mod_cgid is handling the request.

The only way to record the source port for every request is by modifying the
source code of mod_log_config.c or by creating a custom logging
module. In both cases, a new format directive could then record the source port
of the client connection.

If you want to record the remote port, here's the source code modification
for Apache 2.0. All of the changes are made in
modules/loggers/mod_log_config.c in the source distribution.

Find the function definition for log_remote_address. Add the
following function after it:
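A sketch of such a handler follows. The function name and the details are my
own reconstruction of the pattern used by the module's existing handlers, so
treat it as a starting point rather than a finished patch; check it against the
mod_log_config.c in your source tree.

```c
/* Sketch of a custom logging handler for modules/loggers/mod_log_config.c
 * (Apache 2.0).  It follows the same shape as the module's other handlers,
 * such as log_remote_address.  The client socket address lives in the
 * connection record; apr_itoa() allocates the string from the request pool. */
static const char *log_remote_port(request_rec *r, char *a)
{
    return apr_itoa(r->pool, r->connection->remote_addr->port);
}
```

The new handler then needs to be registered under a format character. A
plausible registration, mirroring how the module registers its built-in
directives (the choice of "S" is an assumption consistent with the rest of
this article), would be a line such as
`ap_register_log_handler(p, "S", log_remote_port, 0);` added alongside the
module's other registrations.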

What Not to Log

At this point, there are 14 fields in our log file format, which is getting
pretty close to the limit of readability. Before making the format any bigger,
let's consider what doesn't need to be logged.

A lot of data in the combined log file format is client-supplied, which
means our log analysis depends on the client being truthful. The HTTP username,
ident username, HTTP referrer, and user-agent strings all come from data sent
by the client. For example, some robots pretend to be ordinary web browsers, so
you can't tell when your site is being spidered. It's difficult to verify this
data, so it's best to ignore it.

Keep logging the data in the combined log file format. If you need to refer
to the data at a later time, it shouldn't be too difficult to match a Blackbox
log entry to a combined log entry.

Putting it Together

The following is the configuration syntax for the Blackbox log format. It's
designed to group similar directives together and still be readable when using
the tail command.
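One plausible arrangement of the fourteen directives discussed above is shown
below. The ordering and separators are my own; any layout works as long as
your parsing scripts agree with it.

```
LogFormat "%a/%S %X %t \"%r\" %s/%>s %{pid}P/%{tid}P %T/%D %I/%O/%B" blackbox
CustomLog logs/blackbox_log blackbox
```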

Apache 1.3 Configuration
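A 1.3-compatible variant (again, the exact layout is my own) substitutes
%c for %X and pads the directives 1.3 lacks
(%{tid}P, %D, %I, and %O)
with literal zeros, so the field count matches the 2.0 format:

```
LogFormat "%a/%S %c %t \"%r\" %s/%>s %{pid}P/0 %T/0 0/0/%B" blackbox
CustomLog logs/blackbox_log blackbox
```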

If you decide not to use the remote port logging patch, replace
%S with a literal 0; otherwise, Apache will refuse
to run.

The formats above include all of the directives to log and still remain
cross-compatible between the Apache 1.3 and 2.0 server releases. Even though
1.3 does not support threading or the newer 2.0 logging directives, any parsing
script will work with both versions.

Once the directives are in place, restart Apache. You will need to modify
whatever scripts you use for log file rotation to take the Blackbox log file into
account. You might want to consider retaining at least 10 days of logging data
for your records.

Working with the Blackbox Format

Now that you're logging the data, you'll need to find a way to put it to
work.

The Perl Cookbook featured a field extractor regular expression for the
common log file format. That same code can be modified to extract all of the
fields in the Blackbox format.
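As a sketch, here is an equivalent extractor in Python. It assumes a field
layout of %a/%S %X %t "%r" %s/%>s %{pid}P/%{tid}P %T/%D %I/%O/%B; that
ordering is my own, so adjust the pattern to match your actual
LogFormat line.

```python
import re

# Assumed field layout:
#   %a/%S %X %t "%r" %s/%>s %{pid}P/%{tid}P %T/%D %I/%O/%B
BLACKBOX = re.compile(
    r'^(?P<ip>\S+)/(?P<port>\d+) '        # %a/%S
    r'(?P<conn>[X+\-]) '                  # %X connection status
    r'\[(?P<time>[^\]]+)\] '              # %t (bracketed timestamp)
    r'"(?P<request>[^"]*)" '              # %r request line
    r'(?P<status>\d{3})/(?P<final_status>\d{3}) '   # %s/%>s
    r'(?P<pid>\d+)/(?P<tid>\d+) '         # %{pid}P/%{tid}P
    r'(?P<sec>\d+)/(?P<usec>\d+) '        # %T/%D
    r'(?P<bytes_in>\d+)/(?P<bytes_out>\d+)/(?P<bytes_content>\d+)'  # %I/%O/%B
)

def parse(line):
    """Return the fields of one Blackbox entry as a dict, or None."""
    m = BLACKBOX.match(line)
    return m.groupdict() if m else None
```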

Apache only logs attempted client requests. If someone connects to port 80
and doesn't send any data, nothing will be logged. At a minimum, a client needs
to send some text and one carriage return. Also, any such extraction pattern
assumes that the client transmits a proper request line; if a client only
transmits garbage, the regex will fail.

The timestamps refer to the time the request was started, so it is possible
to see log file entries that are out of order, especially if you are dealing
with a mix of large files and small files being served.

The %T and %D directives refer to the time it
takes to handle the entire transaction, including the amount of time it takes
for the client to transmit data. If you see wide variances in the amount of
time it takes to serve the same file, it may be related to a web client network
problem.

Listed below are a few ideas for potential analysis tools that you can
develop from the Blackbox log.

Performance Graphing

Long-term collection of Blackbox data is better suited to a graphing
environment like RRDTool. You can graph metrics including bytes in and out,
maximum clients per second, and child process lifetime.

To do this, you need to write a program that continually scans the Blackbox
log in short intervals (5 or 10 minutes), grabs all new entries, and then
imports the data into the RRD file.
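A minimal sketch of such a collector is shown below. It assumes each Blackbox
line ends with the %I/%O/%B triple, and that an RRD has
already been created with two data sources (the file name, data source names,
and RRD layout here are hypothetical):

```python
import re
import subprocess

# Hypothetical RRD, created ahead of time with something like:
#   rrdtool create traffic.rrd --step 300 \
#       DS:bytes_in:ABSOLUTE:600:0:U DS:bytes_out:ABSOLUTE:600:0:U \
#       RRA:AVERAGE:0.5:1:2016
TRAFFIC = re.compile(r'(\d+)/(\d+)/\d+\s*$')   # trailing %I/%O/%B triple

def total_traffic(lines):
    """Sum incoming and outgoing bytes over one polling interval."""
    bytes_in = bytes_out = 0
    for line in lines:
        m = TRAFFIC.search(line)
        if m:
            bytes_in += int(m.group(1))
            bytes_out += int(m.group(2))
    return bytes_in, bytes_out

def update_rrd(new_lines, rrdfile="traffic.rrd"):
    """Push one interval's totals into the RRD ("N" stamps it with now)."""
    bytes_in, bytes_out = total_traffic(new_lines)
    subprocess.run(["rrdtool", "update", rrdfile,
                    "N:%d:%d" % (bytes_in, bytes_out)], check=True)
```

A cron job or a small daemon would call this every interval with only the
lines appended since the last run.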

RRDTool collects time-based sampling data for any duration. One data file
can keep data for durations of a week, a month, or even a year. You can merge
data files into a single graph if you want to report on the performance of a
group of load-balanced servers.

Flight Recorder

You have almost enough data in the Blackbox format to see exactly how a
single client handled a series of requests. You can extract a full HTTP session
by filtering on the remote port and IP.
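Assuming each Blackbox line begins with the %a/%S pair (the
layout used elsewhere in this article, which is my own), the filter is a
one-liner:

```python
def session(lines, ip, port):
    """Return every Blackbox entry belonging to one TCP session,
    assuming each line starts with "ip/port " (the %a/%S pair)."""
    prefix = "%s/%s " % (ip, port)
    return [line for line in lines if line.startswith(prefix)]
```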

You could also program an HTTP client to replay the exact same data if you
wanted to try to simulate the client's actions. The downside is that you won't
have a record of all of the client headers passed, such as cookies or
authentication data. Also, the timestamp resolution is one second, which may
not match exactly when the original client transmitted each request.

Final Thoughts

People have been logging web server traffic for as long as the web has been
around, but the emphasis has always been on the content served, not the server
itself. There are tools and modules out there that can monitor performance,
but most of them generate reports of data you can already collect just by
logging it.

The Blackbox format is a simple alternative, since it doesn't require
additional modules. Further, you can examine the data without a running web
server. Getting it up and running requires nothing more than a few Apache
logging directives and an optional patch to the source code.

Chris Josephes
works as a system administrator for Internet Broadcasting.