Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with
their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in
Linux Magazine.
However, the version you are reading here is the one the author
originally submitted for publication, before their editors applied
their creativity.

The server for www.stonehenge.com (and a few other sublease
customers like www.geekcruises.com) uses mod_perl and Apache.

Because the mod_perl processes tend to be heavy, I have the
traditional caching reverse-proxy server stack, so that the mod_perl
processes can be freed the moment they've completed their task, and
some lightweight hits (such as icons) can be served without touching
the backend at all.
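
The details of that stack are beyond this article, but a front-server
httpd.conf excerpt might look roughly like this (the URL path and
backend port here are made up for illustration):

  ProxyPass        /perl/ http://127.0.0.1:8080/perl/
  ProxyPassReverse /perl/ http://127.0.0.1:8080/perl/
  CacheEnable disk /perl/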

For many years, I've been logging the backend processes directly to a
database with DBI, using a custom mod_perl logger-phase handler.
However, because the front processes didn't have mod_perl, the only
logging I could get was a traditional flat file.

Recently, I upgraded to Apache2 and mod_perl2 (which is a bit of a
story in itself, but perhaps for another time). This required
updating my DBI logger a bit. At the same time, I was also asked to
track users to the GeekCruises web site, so I wanted to add
mod_usertrack and also a way to link front hits with back hits for
complete correlation.

At first, I just enabled the DBI logger in the front servers, but
within a few days, things started breaking. Because I use PostgreSQL,
each child process in both the front (many) and back (few) processes
was maintaining a separate connection to the database, and this was
causing me to run out of memory and available database connections.
Even worse, if enough connections were in use, a new connection for
the GeekCruises registration or administration applications would
fail, leaving either the customer or the admin person upset and
frustrated.

So, I quickly concluded that rather than have separate connections
from each child to the database, I needed some sort of aggregator. I
noticed that the automatic log-rotation program included with Apache
(rotatelogs) is basically a separate process, forked during the
startup phase, whose input pipe is shared by all of the children.
After studying the C source code to see how this was accomplished, I
decided I could do the same thing in a few lines of Perl.

The result is that I now have only two connections to the database
(one used by the front cluster of servers, and the other used by the
back cluster); my PostgreSQL server is happy, and so are the folks at
GeekCruises.

The code from the mod_perl handler is in [listing one, below]. The
code for the separate logger program will be presented in the next
part of this article.

Line 1 marks this as Perl code, so that Emacs will switch to the right
editing mode. Line 2 establishes the package for my handler. I put
locally developed modules into the Stonehenge:: prefix to prevent
collisions with CPAN modules. Line 3 enables the traditional compiler
restrictions.

Line 5 reminds me what I need to do to use this module. I try to
leave hints like this to remind myself what I was thinking at the
time. In this case, I'm reminded of the necessary line in
httpd.conf.
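
A minimal sketch of those opening lines, then, might read as follows
(the package name and the usage comment are my reconstruction, not
necessarily what the listing says):

  #! perl
  package Stonehenge::DBILog;
  use strict;

  ## usage: PerlPostConfigHandler Stonehenge::DBILog::post_config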

Line 9 defines the pnote name that will be used to hold the initial
wall clock and CPU time values. I'll set this up as an array ref at
the beginning of the service cycle, and when we arrive at the logging
phase, I can pull the values back to see what to subtract from the
current values to get the deltas.

Line 11 defines the external program to accept the logging values and
write them into the database. When this file is processed through
Template Toolkit (as described in [my column from July 02]) during the
installation process, the value inside the brackets will be replaced
by the path to the root of the installation tree. Beneath that root,
I have the traditional sbin directory for things like apachectl,
and I've added this logger program as well. The logger program
will be described in the next part of this article.
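
In other words, something like this, where the pnote key and the
Template Toolkit variable name are stand-ins of my own:

  my $PNOTE  = 'DBILog_start_times';          # line 9: pnote key
  my $LOGGER = '[* prefix *]/sbin/dbilogger'; # line 11: filled in at install time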

Lines 15 and 16 import the constants needed later in this module.

Line 18 declares the variable that will hold the handle for the pipe
to the logger program. This handle will be set up in the parent
process, before Apache forks each child, and will thus be shared for
the children. This is the key to getting only one logger to activate
for each cluster of front and back servers.
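
Under mod_perl2, those parts presumably look something like this
sketch:

  use Apache2::Const -compile => qw(OK DECLINED);   # lines 15 and 16
  ## (plus whatever Apache2::ServerUtil-style modules the method
  ## calls below require)

  my $logger_fh;   # line 18: opened in the parent, inherited by every child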

Lines 20 to 44 define the subroutine that will be triggered in the
parent Apache process just after the configuration files have been
read, but before Apache begins forking children. Line 21 is copied
from the documentation, and provides me with things I don't
understand, so I mostly ignore them. However, $s is the server
handle, and this is useful.

Line 23 can be uncommented so that I can see that my handler is being
executed in the right process (and only once).

Lines 25 to 33 set up the handlers for the start-of-transaction
(PostReadRequest) and end-of-transaction (Log) phases. I loop through
all of the servers, skipping over any that are virtual servers. There
should be just one real server, and this is where I want to add the
other phase handlers, which I verified by uncommenting line 28. The
corresponding handlers live in the current package, with names that
match the phase names rather than the usual handler.
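
Putting that together, here is a sketch of the subroutine (its name
and the debugging line are my guesses):

  sub post_config {
    my ($conf_pool, $log_pool, $temp_pool, $s) = @_;
    ## warn "post_config running in process $$\n";   # line 23
    for (my $ss = $s; $ss; $ss = $ss->next) {        # lines 25 to 33
      next if $ss->is_virtual;
      $ss->push_handlers(PerlPostReadRequestHandler => \&PostReadRequest);
      $ss->push_handlers(PerlLogHandler             => \&Log);
    }
    ...
  }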

Lines 35 and 36 establish the pipe to the child process for the logger
program. Note that death here means my master server process won't
start, so this logger is critical to the entire operation. Lines 37
to 41 unbuffer this pipe.

Line 43 returns an OK status, which is probably ignored anyway,
because we're not in a phase which is actually serving a response.
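
Filling in the ... from the sketch above, the pipe setup and the
classic select-and-restore unbuffering might be written as:

    open $logger_fh, "|-", $LOGGER
      or die "Cannot fork logger $LOGGER: $!";   # lines 35 and 36
    {
      my $old = select $logger_fh;   # lines 37 to 41: unbuffer the pipe
      $| = 1;
      select $old;
    }
    return Apache2::Const::OK;   # line 43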

Once the child Apache process has accepted a connection and read the
headers, the handler beginning in line 46 is executed.

Line 47 grabs the request object as the only parameter to the handler.
Line 48 rejects subrequests, because we don't want to reset the
counters when a page triggers a subrequest, such as processing a
server-side include or checking the MIME type of a file in a
directory listing.

Line 50 creates a 5-element list, consisting of the UNIX-epoch
wall-clock time, and the four accumulated CPU times (user CPU, system
CPU, child-user CPU, and child-system CPU). These values will form
the ``start'' time for our current hit.

Line 52 pushes these values into the request pnotes structure as an
array reference, so that we may access the values later.

Line 54 may be uncommented to verify that the right values are being
stored.

Line 55 returns DECLINED so that other handlers at this phase will
also be run. (I don't recall for certain, but I suspect returning OK
would work as well; DECLINED is safer and more familiar.)
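
The whole handler is short enough to sketch in full (again, this is
my reconstruction, not the listing verbatim):

  sub PostReadRequest {
    my $r = shift;
    return Apache2::Const::DECLINED if $r->main;   # line 48: skip subrequests

    ## line 50: wallclock, then user/system/child-user/child-system CPU
    my @start = (time, times);
    $r->pnotes($PNOTE => \@start);     # line 52
    ## warn "start times: @start\n";   # line 54
    return Apache2::Const::DECLINED;   # line 55
  }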

Line 58 starts the handler that is called at the end of the request
cycle, during the Log phase. Line 59 grabs the request object so that
we can get further information. Line 60 can be uncommented to see
that we are actually getting into this handler.

Lines 62 to 70 force any children that are running to be finished
before we perform the logging. Generally, child processes are
finished anyway, so this is typically a no-op, but if there are still
some children running that would have been reaped in the cleanup
phase, I want to make sure that their child CPU values don't get
charged to the wrong process.
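
Here is the opening of that handler, with my best guess at the
reaping loop (the listing may well do this differently):

  sub Log {
    my $r = shift;
    ## warn "Log handler in process $$\n";   # line 60

    ## lines 62 to 70: wait for outstanding children, so their CPU
    ## time is folded into our times() totals before we take deltas
    1 while waitpid(-1, 0) > 0;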

Line 72 extracts the 5-element list of time values saved during the
earlier handler. If they don't exist, then line 74 issues a warning,
and line 75 returns immediately, because there are no valid values for
this hit. I have seen this message trigger occasionally, when a
request comes in but the browser disconnects before sending a complete
header, for example.
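
In the sketch, that guard is just:

    my $start = $r->pnotes($PNOTE);   # line 72
    unless ($start) {
      warn "DBILog: no start times for this request\n";   # line 74
      return Apache2::Const::DECLINED;                    # line 75
    }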

Line 79 subtracts the values just obtained from the current values,
using a clever mapping that I figured out one day: I build a new list
by pairing each current value with the corresponding element of the
old list, one by one, and subtracting the old value.
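
One way to write that pairing (my reconstruction of the mapping):

    my @now   = (time, times);
    my @delta = map { $now[$_] - $start->[$_] } 0 .. $#now;   # line 79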

Lines 81 and 82 attempt to extract the $orig (original) and
$last request record from the current request record. When a
request receives an internal redirect, such as for a 404 handler, an
additional request record is added to the chain. I was having
difficulty finding the values I wanted, as you'll see in the code a
few lines later, so I decided to try looking at most of the likely
elements of the request chain.

Line 83 extracts the server information object from the original
request. Line 84 similarly extracts the connection object.
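
Continuing the sketch, that means walking the chain in both
directions and then grabbing the two objects:

    my $orig = $r; $orig = $orig->prev while $orig->prev;   # line 81
    my $last = $r; $last = $last->next while $last->next;   # line 82
    my $s = $orig->server;       # line 83
    my $c = $orig->connection;   # line 84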

Line 85 fetches the server value, so that I can tell whether this is
the lightweight reverse-proxy server or the heavyweight backend
server. In the httpd.conf, I have:

  PerlSetVar "DBILogServer" "[* kind *]"

and ``kind'' expands to ``P'' for the proxy server and ``B'' for the back
server. This string ends up in $server for logging.
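
In the sketch, fetching that per-server value is one line:

    my $server = $orig->dir_config('DBILogServer');   # line 85: "P" or "B"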

Lines 87 to 108 remind me of the SQL that I used to create the table
of request records in my database. To load that data, I create a
hash %data in line 112, and fill it in lines 114 to 136. Each key
corresponds to a column name. The logger program doesn't know the
column names; it simply takes key-value pairs and creates an insert
for those names.

Lines 114 to 120 try to find my unique-id (added by mod_unique_id).
I was having trouble finding it on some of the requests, so this is
probably overkill now, but I'm impatient. Similarly, lines 121 to 124
try to find the cookie I created with my own Perl version of
mod_usertrack.
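
A sketch of that belt-and-suspenders search; UNIQUE_ID is the real
environment name set by mod_unique_id, but the cookie name below is
invented for illustration:

    my %data;                           # line 112: keys are column names
    for my $rr ($r, $orig, $last) {     # lines 114 to 120
      $data{id} ||= $rr->subprocess_env('UNIQUE_ID');
    }
    ## lines 121 to 124: the usertrack cookie, by whatever name it has
    ($data{cookie}) =
      ($orig->headers_in->get('Cookie') || '') =~ /\bSTONEHENGE=([^;\s]+)/;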

Line 125 gets the timestamp of the beginning of the request.
Similarly, lines 126 through 135 fetch other parts of the information
about the request. Line 136 uses a hash slice to set the five values
for the wallclock time and CPU times all at once.
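
For example, the timestamp can come straight from the saved start
time, and the slice might look like this (the column names are my
stand-ins):

    $data{stamp} = $start->[0];   # line 125: epoch time the request began
    @data{qw(wall cpuuser cpusys cpucuser cpucsys)} = @delta;   # line 136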

Once the hash has been created, I need to create a simple string from
the data so that it can be written to the logger, which is expecting
line-oriented input. Lines 138 to 143 create a space-separated list
of values that look like columnname=hexstring, where the hexstring
is the hex-encoded value. I have to encode the value because it might
contain spaces and other odd characters.
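
A compact way to build such a line (sketch):

    my $line = join " ",
      map { "$_=" . unpack("H*", $data{$_}) }   # hex-encode each value
      grep { defined $data{$_} }
      sort keys %data;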

Lines 146 and 147 write the data line to the logger handle, enclosed
in angle brackets, which I use as start and end markers in the logger
to make sure I got a clean record. Modern UNIX-like operating systems
guarantee that a write to a pipe of no more than PIPE_BUF bytes
(typically 4096) is atomic, so the record will not be interleaved
with data from a different writer, and this is the key to allowing
many child processes to all share one logger process.
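
And the sketch's final lines, closing out the Log handler:

    print $logger_fh "<$line>\n";   # lines 146 and 147
    return Apache2::Const::OK;
  }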

Well, that's the server-side part of the process. Next time, I'll
talk about the contents of logger, and show some typical queries
that this database log lets me perform. Until then, enjoy!