The handler( ) function starts by calling an internal function named read_configuration( ),
which, as its name implies, parses the navigation bar configuration
file. If successful, the function returns a custom-designed NavBar
object that implements the methods we need to build the navigation bar
on the fly. As in the server-side includes example, we cache NavBar objects in the package global %BARS and only re-create them when the configuration file changes. The cache logic is all handled internally by read_configuration( ).

If, for some reason, read_configuration( ) returns an undefined value, we decline the transaction by returning DECLINED. Apache will display the page, but the navigation bar will be missing.

As in the server-side include example, we check the MIME type of the requested file. If it isn't of type text/html, then we can't add a navigation bar to it and we return DECLINED to let Apache take its default actions. Otherwise, we attempt to open the file by calling Apache::File's new( ) method. If this fails, we again return DECLINED to let Apache generate the appropriate error message.

my $navbar = make_bar($r, $bar);

Having successfully processed the configuration file and opened the requested file, we call an internal subroutine named make_bar( )
to create the HTML text for the navigation bar. We'll look at this
subroutine momentarily. This fragment of HTML is stored in a variable
named $navbar.

The remaining code should look familiar. We send the
HTTP header and loop through the text in paragraph-style chunks looking
for all instances of the <BODY> and </BODY> tags. When we
find either tag we insert the navigation bar just below or above it. We
use paragraph mode (by setting $/ to the empty string) in order to catch documents that spread the initial <BODY> tag across multiple lines.

The make_bar( ) function is
responsible for generating the navigation bar HTML code. First, it
recovers the current document's URI by calling the Apache request
object's uri( ) method. Next, it calls $bar->urls( ) to fetch the list of partial URIs for the site's major areas and iterates over the areas in a for( ) loop:
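Based on the description that follows, the body of that loop might be sketched like this (the variable names, the red-label markup, and the $TABLEATTS cell formatting are our own assumptions, not necessarily the module's exact code):

```perl
# Hypothetical sketch of the make_bar( ) loop.
my $current_url = $r->uri;
my @cells;
for my $url ($bar->urls) {
    my $label = $bar->label($url);
    my $cell = $current_url =~ /^\Q$url\E/
        ? qq(<FONT COLOR="red">$label</FONT>)   # current area: red label
        : qq(<A HREF="$url">$label</A>);        # other areas: hypertext link
    push @cells, qq(<TD ALIGN=CENTER>$cell</TD>\n);
}
```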

For each URI, the code fetches its human-readable label by calling $bar->label( )
and determines whether the current document is part of the area using a
pattern match. What happens next depends on whether the current
document is part of the area or not. In the former case, the code
generates a label enclosed within a <FONT> tag with the COLOR
attribute set to red. In the latter case, the code generates a
hypertext link. The label or link is then pushed onto a growing array
of HTML table cells.

return qq(<TABLE $TABLEATTS><TR>@cells</TR></TABLE>\n);
}

At the end of the loop, the code incorporates the table cells into a one-row table and returns the HTML to the caller.

Potentially there can be several configuration files,
each one for a different part of the site. The path to the
configuration file is specified by a per-directory Perl configuration
variable named NavConf. We retrieve the path to the configuration file with dir_config( ), convert it into an absolute path name with server_root_relative( ), and test that the file exists with the -e operator.
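In code, that retrieval might read roughly as follows (the fallback file name and the bare early return are assumptions made for illustration):

```perl
# Hypothetical sketch of locating the configuration file inside
# read_configuration( ); the default path is an assumption.
my $conf_file = $r->dir_config('NavConf') || 'conf/navigation.conf';
my $conf_path = $r->server_root_relative($conf_file);
return unless -e $conf_path;
```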

Because we don't want to reparse the configuration each time we need it, we cache the NavBar object in much the same way we did with the server-side include example. Each NavBar object has a modified( ) method that returns the time that its configuration file was modified. The NavBar objects are held in a global cache named %BARS and indexed by the name of the configuration files. The next bit of code calls stat( ) to return the configuration file's modification time--notice that we can stat( ) the _ filehandle because the foregoing -e operation will have cached its results. We then check whether there is already a ready-made NavBar
object in the cache, and if so, whether its modification date is not
older than the configuration file. If both tests are true, we return
the cached object; otherwise, we create a new one by calling the NavBar class's new( ) method.

You'll notice that we use a different technique for finding the modification date here than we did in Apache::ESSI (Example 4-3). In the previous example, we used the -M
file test flag, which returns the relative age of the file in days
since the Perl interpreter was launched. In this example, we use stat( )
to determine the absolute age of the file from the filesystem
timestamp. The reason for this will become clear later, when we modify
the module to handle If-Modified-Since caching.

Toward the bottom of the example is the definition for the NavBar class. It defines three methods named new( ), urls( ), and label( ):

The new( ) method is called to parse a configuration file and return a new NavBar
object. It opens up the indicated configuration file, splits each row
into the URI and label parts, and stores the two parts into a hash.
Since the order in which the various areas appear in the navigation bar
is significant, this method also saves the URIs to an ordered array.

The urls( ) method returns the ordered list of areas, and the label( ) method uses the NavBar object's hash to return the human-readable label for the given URI. If none is defined, it just returns the URL. modified( ) returns the modification time of the configuration file.

Because so much of what Apache::NavBar and Apache::ESSI
do is similar, you might want to merge the navigation bar and
server-side include examples. This is just a matter of cutting and
pasting the navigation bar code into the server-side function
definitions file and then writing a small stub function named NAVBAR( ).
This stub function will call the subroutines that read the
configuration file and generate the navigation bar table. You can then
incorporate the appropriate navigation bar into your pages anywhere you
like with an include like this one:

<!--#NAVBAR-->

Handling If-Modified-Since

One of us (Lincoln) thought the virtual navigation bar
was so neat that he immediately ran out and used it for all documents
on his site. Unfortunately, he had some pretty large (>400 MB) files
there, and he soon noticed something interesting. Before installing the
navigation bar handler, browsers would cache the large HTML files
locally and only download them again when they had changed. After
installing the handler, however, the files were always downloaded. What
happened?

When a browser is asked to display a document that it
has cached locally, it sends the remote server a GET request with an
additional header field named If-Modified-Since. The request looks something like this:
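A representative request might read as follows (the URI and date shown here are invented for illustration):

```
GET /path/to/document.html HTTP/1.0
If-Modified-Since: Tue, 24 Feb 1998 11:19:03 GMT
```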

The server will compare the document's current
modification date to the time given in the request. If the document is
more recent than that, it will return the whole document. Otherwise,
the server will respond with a 304 "not modified" message and the
browser will display its cached copy. This reduces network bandwidth
usage dramatically.

When you install a custom content handler, the If-Modified-Since mechanism no longer works unless you implement it. In fact, you can generally ignore If-Modified-Since
because content handlers usually generate dynamic documents that change
from access to access. However, in some cases the content you provide
is sufficiently static that it pays to cache the documents. The
navigation bar is one such case because even though the bar is
generated dynamically, it rarely changes from day to day.

In order to handle If-Modified-Since
caching, you have to settle on a definition for the document's most
recent modification date. In the case of a static document, this is
simply the modification time of the file. In the case of composite
documents that consist equally of static file content and a dynamically
generated navigation bar, the modification date is either the time that
the HTML file was last changed or the time that the navigation bar
configuration file was changed, whichever happens to be more recent.
Fortunately for us, we're already storing the configuration file's
modification date in the NavBar object, so finding this aggregate modification time is relatively simple.

To use these routines, simply add the following just before the call to $r->send_http_header in the handler( ) subroutine:
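Based on the description that follows, the added lines would look something like this sketch:

```perl
# Fold the navigation bar's timestamp into the request's mtime,
# then copy the result into the outgoing Last-Modified header.
$r->update_mtime($bar->modified);
$r->set_last_modified;
```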

We first call the update_mtime( )
function with the navigation bar's modification date. This function
will compare the specified date with the modification date of the
request document and update the request's internal mtime field to the most recent of the two. We then call set_last_modified( ) to copy the mtime field into the outgoing Last-Modified header. If a synthesized document depends on several configuration files, you should call update_mtime( ) once for each configuration file, followed by set_last_modified( ) at the very end.

The complete code for the new and improved Apache::NavBar, with the If-Modified-Since improvements, can be found at this book's companion web site.

If you think carefully about this module, you'll see
that it still isn't strictly correct. There's a third modification date
that we should take into account, that of the module source code
itself. Changes to the source code may affect the appearance of the
document without changing the modification date of either the
configuration file or the HTML file. We could add a new update_mtime( ) with the modification time of the Apache::NavBar module, but then we'd have to worry about modification times of libraries that Apache::NavBar depends on, such as Apache::File.
This gets hairy very quickly, which is why caching becomes a moot issue
for any dynamic document much more complicated than this one. See "The
Apache::File Class" in Chapter 9, Perl API Reference Guide, for a complete rundown of the methods that are available to you for controlling HTTP/1.1 caching.

Sending Static Files

If you want your content handler to send a file through
without modifying it, the easiest way is to let Apache do all the work
for you. Simply return DECLINED from your
handler (before you send the HTTP header or the body) and the request
will fall through to Apache's default handler. This is a lot easier,
not to mention faster, than opening up the file, reading it line by
line, and transmitting it unchanged. In addition, Apache will
automatically handle a lot of the details for you, first and foremost
of which is handling the If-Modified-Since header and other aspects of client-side caching.

If you have a compelling reason to send static files manually, see Using Apache::File to Send Static Files in Chapter 9 for a full description of the technique. Also see "Redirection,"
later in this chapter, for details on how to direct the browser to
request a different URI or to make Apache send the browser a different
document from the one that was specifically requested.

Virtual Documents

The previous sections of this chapter have been
concerned with transforming existing files. Now we turn our attention
to spinning documents out of thin air. Despite the fact that these two
operations seem very different, Apache content handlers are responsible
for them both. A content handler is free to ignore the translation of
the URI that is passed to it. Apache neither knows nor cares that the
document produced by the content handler has no correspondence to a
physical file.

We've already seen an Apache content handler that produces a virtual document. Chapter 2, A First Module, gave the code for Apache::Hello, an Apache Perl module that produces a short HTML document. For convenience, we show it again in Example 4-7.
This content handler is essentially identical to the previous content
handlers we've seen. The main difference is that the content handler
sets the MIME content type itself, calling the request object's content_type( ) method to set the MIME type to type text/html.
This is in contrast to the idiom we used earlier, where the handler
allowed Apache to choose the content type for it. After this, the
process of emitting the HTTP header and the document itself is the same
as we've seen before.

After setting the content type, the handler calls send_http_header( ) to send the HTTP header to the browser, and immediately exits with an OK status code if header_only( ) returns true (this is a slight improvement over the original Chapter 2 version of the program). We call get_remote_host( )
to get the DNS name of the remote host machine, and incorporate the
name into a short HTML document that we transmit using the request
object's print( ) method. At the end of the handler, we return OK.

There's no reason to be limited to producing virtual
HTML documents. You can just as easily produce images, sounds, and
other types of multimedia, provided of course that you know how to
produce the file format that goes along with the MIME type.

Redirection

Instead of synthesizing a document, a content handler
has the option of redirecting the browser to fetch a different URI
using the HTTP redirect mechanism. You can use this facility to
randomly select a page or picture to display in response to a URI
request (many banner ad generators work this way) or to implement a
custom navigation system.

Redirection is extremely simple with the Apache API. You need only add a Location field to the HTTP header containing the full or partial URI of the desired destination, and return a REDIRECT result code. A complete functional example using mod_perl is only a few lines (Example 4-8). This module, named Apache::GoHome, redirects users to the hardcoded URI http://www.ora.com/.
When the user selects a document or a portion of the document tree that
this content handler has been attached to, the browser will immediately
jump to that URI.

The module begins by importing the REDIRECT error code from Apache::Constants (REDIRECT isn't among the standard set of result codes imported with :common). The handler( ) method then adds the desired location to the outgoing headers by calling Apache::header_out( ). header_out( )
can take one or two arguments. Called with one argument, it returns the
current value of the indicated HTTP header field. Called with two
arguments, it sets the field indicated by the first argument to the
value indicated by the second argument. In this case, we use the
two-argument form to set the HTTP Location field to the desired URI.

The final step is to return the REDIRECT result code. There's no need to generate an HTML body, since most HTTP-compliant browsers will take you directly to the Location
URI. However, Apache adds an appropriate body automatically in order to
be HTTP-compliant. You can see the header and body message using
telnet:

You'll notice from this example that the REDIRECT
status causes a "Moved Temporarily" message to be issued. This is
appropriate in most cases because it makes no warrants to the browser
that it will be redirected to the same location the next time it tries
to fetch the desired URI. If you wish to redirect permanently, you
should use the MOVED status code instead,
which results in a "301 Moved Permanently" message. A smart browser
might remember the redirected URI and fetch it directly from its new
location the next time it's needed.

As a more substantial example of redirection in action, consider Apache::RandPicture (Example 4-9)
which randomly selects a different image file to display each time it's
called. It works by selecting an image file from among the contents of
a designated directory, then redirecting the browser to that file's
URI. In addition to demonstrating a useful application of redirection,
it again shows off the idiom for interconverting physical file names
and URIs.

The handler begins by fetching the name of a directory
to fetch the images from, which is specified in the server
configuration file by the Perl variable PictureDir.
Because the selected image has to be directly fetchable by the browser,
the image directory must be given as a URI rather than as a physical
path.

The next task is to convert the directory URI into a physical directory path. The subroutine adds a /
to the end of the URI if there isn't one there already (ensuring that
Apache treats the URI as a directory), then calls the request object's lookup_uri( ) and filename( ) methods in order to perform the URI translation steps. The code looks like this:

my $subr = $r->lookup_uri($dir_uri);
my $dir  = $subr->filename;

Now we need to obtain a listing of image files in the
directory. The simple way to do this would be to use the Perl glob
operator, for instance:

chdir $dir;
@files = <*.{jpg,gif}>;

However, this technique is flawed. First, on many
systems the glob operation launches a C shell subprocess, which sends
performance plummeting and won't work at all on systems without a C
shell (such as Win32 platforms). Second, it makes assumptions about the
extension types of image files. Your site may have defined an alternate
extension for image files (or may be using a completely different
system for keeping track of image types, such as the Apache MIME magic
module), in which case this operation will miss some images.

Instead, we create a DirHandle object using Perl's directory handle object wrapper. We call the directory handle's read( )
method repeatedly to iterate through the contents of the directory. For
each item we ask Apache what it thinks the file's MIME type should be,
by calling the lookup_uri( ) method to turn the filename into a subrequest and content_type( )
to fetch the MIME type information from the subrequest. We perform a
pattern match on the returned type and, if the file is one of the MIME
image types, add it to a growing list of image URIs. The subrequest
object's uri( ) method is called to return the absolute URI for the image. The whole process looks like this:
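A sketch of that loop, assuming the directory subrequest is already held in $subr (the variable names are ours):

```perl
use DirHandle ();

# Hypothetical sketch: collect the URIs of all image files in the
# directory by asking Apache for each entry's MIME type.
my $dh = DirHandle->new($dir) or return DECLINED;
my @files;
for my $entry ($dh->read) {
    my $rr = $subr->lookup_uri($entry);      # subrequest for this entry
    next unless $rr->content_type =~ m{^image/};
    push @files, $rr->uri;                   # absolute URI of the image
}
```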

Note that we look up the directory entry's filename by calling the subrequest object's lookup_uri( ) method rather than using the main request object stored in $r. This takes advantage of the fact that subrequests will look up relative paths relative to their own URI.

The next step is to select a member of this list randomly, which we do using this time-tested Perl idiom:

my $lucky_one = $files[rand @files];

The last step is to set the Location header to point at this file (being sure to express the location as a URI) and to return a REDIRECT
result code. If you install the module using the sample configuration
file and <IMG> tag shown at the bottom of the listing, a
different picture will be displayed every time you load the page.

Although elegant, this technique for selecting a random
image file suffers from a bad performance bottleneck. Instead of
requiring only a single network operation to get the picture from the
server to the browser, it needs two round-trips across the network: one
for the browser's initial request and redirect and one to fetch the
image itself.

You can eliminate this overhead in several different
ways. The more obvious technique is to get rid of the redirection
entirely and simply send the image file directly. After selecting the
random image and placing it in the variable $lucky_one, we replace the last two lines of the handler( ) subroutine with code like this:
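Such replacement code might look like the following sketch (error handling is kept minimal for illustration):

```perl
use Apache::File ();

# Serve the selected image directly instead of redirecting.
my $subr = $r->lookup_uri($lucky_one);
$r->content_type($subr->content_type);       # copy the image's MIME type
$r->send_http_header;
my $fh = Apache::File->new($subr->filename) or return DECLINED;
$r->send_fd($fh);                            # transmit the file contents
return OK;
```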

We create yet another subrequest, this one for the
selected image file, then use information from the subrequest to set
the outgoing content type. We then open up the file and send it with
the send_fd( ) method.

However, this is still a little wasteful because it
requires you to open up the file yourself. A more subtle solution would
be to let Apache do the work of sending the file by invoking the
subrequest's run( ) method. run( )
invokes the subrequest's content handler to send the body of the
document, just as if the browser had made the request itself. The code
now looks like this:
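Assembled from the description that follows, the replacement might be sketched as:

```perl
use Apache::Constants qw(:common DOCUMENT_FOLLOWS);

# Let the subrequest's own content handler transmit the image.
my $subr = $r->lookup_uri($lucky_one);
return DECLINED unless $subr->status == DOCUMENT_FOLLOWS;
$r->content_type($subr->content_type);
$r->send_http_header;
$subr->run;      # invoke the subrequest's content handler
return OK;
```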

We call lookup_uri( ) and check the value returned by its status( ) method in order to make sure that it is DOCUMENT_FOLLOWS (status code 200, the same as HTTP_OK). This constant is not exported by Apache::Constants
by default but has to be imported explicitly. We then set the main
request's content type to the same as that of the subrequest, and send
off the appropriate HTTP header. Finally, we call the subrequest's run( ) method to invoke its content handler and send the contents of the image to the browser.

Internal Redirection

The two Apache::RandPicture
optimizations that we showed in the previous section involve a lot of
typing, and the resulting code is a bit obscure. A far more elegant
solution is to let Apache do all the work for you with its internal
redirection mechanism. In this scheme, Apache handles the entire
redirection internally. It pretends that the web browser made the
request for the new URI and sends the contents of the file, without
letting the browser in on the secret. It is functionally equivalent to
the solution that we showed at the end of the preceding section.

To invoke the Apache internal redirection system, modify the last two lines of Apache::RandPicture 's handler( ) subroutine to read like this:

$r->internal_redirect($lucky_one);
return OK;

The request object's internal_redirect( ) method takes a single argument consisting of an absolute local URI (one starting with a / ).
The method does all the work of translating the URI, invoking its
content handler, and returning the file contents, if any. Unfortunately
internal_redirect( ) returns no result code,
so there's no way of knowing whether the redirect was successful (you
can't do this from a conventional redirect either). However, the call
will return in any case, allowing you to do whatever cleanup is needed.
You should exit the handler with a result code of OK.

In informal benchmarks, replacing the basic Apache::RandPicture
with a version that uses internal redirection increased the throughput
by a factor of two, exactly what we'd expect from halving the number of
trips through the network. In contrast, replacing all the MIME type
lookups with a simpler direct grep for image file extensions had
negligible effect on the speed of the module. Apache's subrequest
mechanism is very efficient.

If you have very many images in the random pictures
directory (more than a few hundred), iterating through the directory
listing each time you need to fetch an image will result in a
noticeable performance hit. In this case, you'll want to cache the
directory listing in a package variable the first time you generate it
and only rebuild the listing when the directory's modification time
changes (or just wait for a server restart, if the directory doesn't
change often). You could adapt the Apache::ESSI caching system for this purpose.

Internal redirection is a win for most cases when you
want to redirect the browser to a different URI on your own site. Be
careful not to use it for external URIs, however. For these, you must
either use standard redirection or invoke Apache's proxy API (Chapter
7).

When you use internal redirection to pass control from
one module to another, the second module in the chain can retrieve the
original query string, the document URI, and other information about
the original request by calling the request object's prev( ) method or, in Apache::Registry scripts only, by examining certain environment variables. There is also a way, using Apache::err_header_out( ),
for the original module to set various HTTP header fields, such as
cookies, that will be transferred to the second across the internal
redirect. Because internal redirects are most commonly used in error
handlers, these techniques are discussed in the section "Handling Errors" later in this chapter.

Processing Input

You can make the virtual documents generated by the
Apache API interactive in exactly the way that you would documents
generated by CGI scripts. Your module will generate an HTML form for
the user to fill out. When the user completes and submits the form,
your module will process the parameters and generate a new document,
which may contain another fill-out form that prompts the user for
additional information. In addition, you can store information inside
the URI itself by placing it in the additional path information part.

CGI Parameters

When a fill-out form is submitted, the contents of its
fields are turned into a series of name=value parameter pairs that are
available for your module's use. Unfortunately, correctly processing
these parameter pairs is annoying because, for a number of historical
reasons, there are a variety of formats that you must know about and
deal with. The first complication is that the form may be submitted
using either the HTTP GET or POST method. If the GET method is used,
the URI encoded parameter pairs can be found separated by ampersands in
the "query string," the part of the URI that follows the ? character:

http://your.site/uri/path?name1=val1&name2=val2&name3=val3...

To recover the parameters from a GET request, mod_perl users should use the request object's args( )
method. In a scalar context this method returns the entire query
string, ampersands and all. In an array context, this method returns
the parsed name=value pairs; however, you will still have to do further
processing in order to correctly handle multivalued parameters. This
feature is only found in the Perl API. Programmers who use the C API
must recover the query string from the request object's args field and do all the parsing manually.

If the client uses the POST method to submit the
fill-out form, the parameter pairs can be found in something called the
"client block." C API users must call three functions named setup_client_block( ), should_client_block( ), and get_client_block( ) in order to retrieve the information.

While these methods are also available in the Perl API, mod_perl users have an easier way: they need only call the request object's content( ) method to retrieve the preparsed list of name=value pairs. However, there's a catch: this only works for the older application/x-www-form-urlencoded style of parameter encoding. If the browser uses the newer multipart/form-data encoding (which is used for file uploads, among other things), then mod_perl users will have to read and parse the content information themselves. read( )
will fetch the unparsed content information by looping until the
requested number of bytes have been read (or a predetermined timeout
has occurred). Fortunately, there are a number of helpful modules that
allow mod_perl programmers to accept file uploads without parsing the data themselves, including CGI.pm and Apache::Request, both of which we describe later.

To show you the general technique for prompting and processing user input, Example 4-10 gives a new version of Apache::Hello. It looks for a parameter named user_name
and displays a customized welcome page, if present. Otherwise, it
creates a more generic message. In both cases, it also displays a
fill-out form that prompts the user to enter a new value for user_name.
When the user presses the submission button labeled "Set Name," the
information is POSTed to the module and the page is redisplayed (Figure 4-4).

Figure 4-4. The Apache::Hello2 module can process user input.

The code is very simple. On entry to handler( ) the module calls the request object's method( )
method to determine whether the handler was invoked using a POST
request, or by some other means (usually GET). If the POST method was
used, the handler calls the request object's content( ) method to retrieve the posted parameters. Otherwise, it attempts to retrieve the information from the query string by calling args( ). The parsed name=value pairs are now stuffed into a hash named %params for convenient access.

Having processed the user input, if any, the handler retrieves the value of the user_name parameter from the hash and stores it in a variable. If the parameter is empty, we default to "Unknown User."

The next step is to generate the document. We set the content type to text/html as before and emit the HTTP header. We again call the request object's header_only( ) to determine whether the client has requested the entire document or just the HTTP header information.

This is followed by a single long Apache::print( )
statement. We create the HTML header and body, along with a suitable
fill-out form. Notice that we use the current value of the user name
variable to initialize the appropriate text field. This is a frill that
we have always thought was kind of neat.

This method of processing user input is only one of
several equally valid alternatives. For example, you might want to work
with query string and POSTed parameters simultaneously, to accommodate
this type of fill-out form:
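For instance, a form along these lines submits POSTed fields to a URI that already carries a query-string parameter (the URI and field names here are invented for illustration):

```html
<FORM ACTION="/hello/world?day=today" METHOD="POST">
Enter your name: <INPUT TYPE="text" NAME="user_name">
<INPUT TYPE="submit" VALUE="Set Name">
</FORM>
```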

In this case, you could recover the values of both the day and user_name parameters using a code fragment like this one:

my %params = ($r->args, $r->content);

If the same parameter is present in both the query
string and the POSTed values, then the latter will override the former.
Depending on your application's logic, you might like this behavior.
Alternatively, you could store the two types of parameter in different
places or take different actions depending on whether the parameters
were submitted via GET or POST. For example, you might want to use
query string parameters to initialize the default values of the
fill-out form and enter the information into a database when a POST
request is received.

When you store the parsed parameters into a hash, you
lose information about parameters that are present more than once. This
can be bad if you are expecting multivalued parameters, such as those
generated by a selection list or a series of checkboxes linked by the
same name. To keep multivalued information, you need to do something
like this:
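Following the description below, one way to write this is (variable names match the discussion; the splice idiom is our own sketch):

```perl
# Aggregate GET and POST parameters, keeping every value of
# each parameter in an array reference.
my @args = ($r->args, $r->content);
my %params;
while (my ($name, $value) = splice @args, 0, 2) {
    push @{ $params{$name} }, $value;
}
```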

This bit of code aggregates the GET and POST parameters into a single array named @args.
It then loops through each name=value pair, building up a hash in which
the key is the parameter name and the value is an array reference
containing all the values for that parameter. This way, if you have a
selection list that generates query strings of the form:

vegetable=kale&vegetable=broccoli&vegetable=carrots

you can recover the complete vegetable list in this manner:

@vegetables = @{$params{'vegetable'}};

An alternative is to use a module that was still in development at the time this chapter was written. This module, named Apache::Request,
uses the CGI.pm-style method calls to process user input but does so
efficiently by going directly to the request object. With this module,
the user input parameters are retrieved by calling param( ). Call param( ) without any arguments to retrieve a list of all the parameter names. Call param( )
with a parameter name to return a list of the values for that parameter
in an array context, and the first member of the list in a scalar
context. Unlike the vanilla request object, input of type multipart/form-data is handled correctly, and uploaded files can be recovered too (using the same API as CGI.pm).

To take advantage of Apache::Request in our "Hello World" module, we modify the top part of the module to read as follows:
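The revised opening might be sketched like this (the fallback default is our own addition):

```perl
use Apache::Request ();

sub handler {
    # Wrap the plain Apache request object in an Apache::Request object,
    # which adds param( ) and inherits everything else from Apache.
    my $r = Apache::Request->new(shift);
    my $user_name = $r->param('user_name') || 'Unknown User';
    # ... the rest of the handler proceeds as before ...
}
```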

The main detail here is that instead of retrieving the request object directly, we wrap it inside an Apache::Request object. Apache::Request adds param( ) and a few other useful methods and inherits all other method calls from the Apache class. More information will be found in the Apache::Request manual page when that package is officially released.

Like CGI.pm, Apache::Request
allows you to handle browser file uploading, although it is somewhat
different in detail from the interface provided in CGI.pm versions 2.46
and lower (the two libraries have been brought into harmony in Version
2.47). As in ordinary CGI, you create a file upload field by defining
an <INPUT> element of type "file" within a <FORM> section
of type "multipart/form-data". After the form is POSTed, you retrieve
the file contents by reading from a filehandle returned by the Apache::Request upload( ) method. This code fragment illustrates the technique:
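A minimal sketch, assuming $r is an Apache::Request object and the upload field is named "uploaded_file" (the field name is an assumption):

```perl
# Read an uploaded file line by line; the field name is hypothetical.
if (my $fh = $r->upload('uploaded_file')) {
    while (my $line = <$fh>) {
        # process each line of the uploaded file here
    }
}
```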

Additional Path Information

Recall that after Apache parses an incoming URI to
figure out what module to invoke, there may be some extra bits left
over. This extra stuff becomes the "additional path information" and is
available for your module to use in any way it wishes. Because it is
hierarchical, the additional path information part of the URI follows
all the same relative path rules as the rest of the URI. For example, ..
means to move up one level. For this reason, additional path
information is often used to navigate through a virtual document tree
that is dynamically created and maintained by a CGI script or module.
However, you don't have to take advantage of the hierarchical nature of
path information. You can just use it as a handy place to store
variables. In the next chapter, we'll use additional path information
to stash a session identifier for a long-running web application.

Apache modules fetch additional path information by calling the request object's path_info( ) method. If desired, they can then turn the path information into a physical filename by calling lookup_uri( ).

An example of how additional path information can be used as a virtual document tree is shown in Example 4-11, which contains the code for Apache::TreeBrowser.
This module generates a series of documents that are organized in a
browseable tree hierarchy that is indistinguishable to the user from a
conventional HTML file hierarchy. However, there are no physical files.
Instead, the documents are generated from a large treelike Perl data
structure that specifies how each "document" should be displayed. Here
is an excerpt:
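The excerpt itself is not reproduced in this text. Here is a hypothetical reconstruction, based on the description that follows; the -title and -contents key names are inferred from the later note that branch keys are the ones that do not begin with a hyphen, and the "rough" node's values are invented placeholders:

```perl
# Hypothetical reconstruction of the excerpt. The "rough" node's
# title and contents are invented placeholders.
my $TREE = {
    'bark' => {
        -title    => 'The Wrong Tree',
        -contents => 'His bark was worse than his bite.',
        'smooth' => {
            -title    => 'Like Butter',
            -contents => 'As smooth as silk.',
        },
        'rough' => {
            -title    => 'Sandpaper City',      # placeholder
            -contents => 'Rough and ready.',    # placeholder
        },
    },
};
```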

In this bit of the tree, a document named "bark" has
the title "The Wrong Tree" and the contents "His bark was worse than
his bite." Beneath this document are two subdocuments named "smooth"
and "rough." The "smooth" document has the title "Like Butter" and the
contents "As smooth as silk." The "rough" document is similarly silly.
These subdocuments can be addressed with the additional path
information /bark/smooth and /bark/rough, respectively. The parent document, naturally enough, is addressed by /bark. Within the module, we call each chunk of this data structure a "node."

Using the information contained in the data structure, Apache::TreeBrowser
constructs the document and displays its information along with a
browseable set of links organized in hierarchical fashion (see Figure 4-5).
As the user moves from document to document, the currently displayed
document is highlighted--sort of a hierarchical navigation bar!

The module starts by importing the usual Apache constants and the REDIRECT result code. It then creates the browseable tree by calling an internal subroutine named make_tree( ) and stores the information in a package global named $TREE.
In a real-life application, this data structure would be created in
some interesting way, for example, using a query on a database, but in
this case make_tree( ) just returns the hardcoded data structure that follows the __DATA__ token at the end of the code.

Now's the time to process the additional path
information. The handler fetches the path information by calling the
request object's path_info( ) method and fetches the module's base URI by calling uri( ). Even though we won't be using it, we transform the additional path information into a physical pathname by calling lookup_uri( ) and filename( ). This is useful for seeing how Apache does URI translation.

For this module to work correctly, some additional path information has to be provided, even if it's only a /
character. If we find that the additional path information is empty, we
rectify the situation by redirecting the browser to our URI with an
additional / appended to the end. This is
similar to the way that Apache redirects browsers to directories when
the terminal slash is omitted.

At this point we begin to construct the document. We set the content type to text/html, send out the HTTP header, and exit if header_only( )
returns true. Otherwise, we split the path information into its
components and then traverse the tree, following each component name
until we either reach the last component on the list or come to a
component that doesn't have a corresponding entry in the tree (which
sometimes happens when users type in the URI themselves). By the time
we reach the end of the tree traversal, the variable $node
points to the part of the tree that is referred to by the additional
path information or, if the path information wasn't entirely correct,
to the part of the tree corresponding to the last valid path component.

We now call print( ) to print
out the HTML document. We first display the current document's title
and contents. We then print a hyperlink that points back to the "root"
(really the top level) of the tree. Notice how we construct this link
by creating a relative URI based on the number of components in the
additional path information. If the additional path information is
currently /bark/rough/cork, we construct a link whose HREF is ../../../. Through the magic of relative addressing, this will take us back to the root / document.
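The relative-link construction can be sketched as follows (the helper name root_link( ) is invented for illustration):

```perl
# Illustrative helper (the name root_link is invented): build a
# relative HREF with one "../" for each component of the additional
# path information.
sub root_link {
    my $path_info = shift;
    my @parts = grep { length } split m{/}, $path_info;
    return '../' x @parts;
}
```

For /bark/rough/cork this yields ../../../, which resolves back to the root document; for a bare / it yields the empty string.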

The next task is to construct the hierarchical navigation system shown in Figure 4-5. We do this by calling print_node( ), an internal function. This is followed by a link to the next-higher document, which is simply the relative path ../.

This subroutine is responsible for displaying a tree
node as a nested list. It starts by finding all the branches beneath
the requested node, which just happens to be all the hash keys that
don't begin with a hyphen. It then prints out the name of the node. If
the node being displayed corresponds to the current document, the name
is surrounded by <FONT> tags to display it in red. Otherwise, the
node name is turned into a hyperlink that points to the appropriate
document. Then, for each subdocument beneath the current node, it
invokes itself recursively to display the subdocument. The most obscure
part of this subroutine is the need to append a $prefix variable to each URI the routine generates. $prefix contains just the right number of ../ sequences to make the URIs point to the root of the virtual document tree. This simplifies the program logic.

The last function in this module is make_tree( ). It simply reads in the text following the __DATA__ token and eval( )s it, turning it into a Perl data structure:
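A minimal sketch of the make_tree( ) idea follows. In the real module the text comes after the __DATA__ token; here a here-doc string stands in for it, and the tree contents are abbreviated:

```perl
# Sketch of make_tree( ). A here-doc string stands in for the text
# that follows the __DATA__ token in the real module.
my $data = <<'END_OF_TREE';
{
    'bark' => {
        -title    => 'The Wrong Tree',
        -contents => 'His bark was worse than his bite.',
    },
}
END_OF_TREE

sub make_tree {
    my $tree = eval $data;          # turn the text into a data structure
    die "bad tree data: $@" if $@;
    return $tree;
}
```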

Apache::Registry

If you are using mod_perl to write Apache modules, then you probably want to take advantage of Apache::Registry. Apache::Registry
is a prewritten Apache Perl module that is a content handler for files
containing Perl code. In addition to making it unnecessary to restart
the server every time you revise a source file, Apache::Registry
sets up a simulated CGI environment, so that programs that expect to
get information about the transaction from environment variables can
continue to do so. This allows legacy CGI applications to run under the
Apache Perl API, and lets you use server-side code libraries (such as
the original CGI.pm) that assume the script is running in a CGI
environment.

Apache::Registry is similar
in concept to the content filters we created earlier in this chapter,
but instead of performing simple string substitutions on the contents
of the requested file, Apache::Registry compiles and executes the code contained within it. In order to avoid recompiling the script each time it's requested, Apache::Registry
caches the compiled code and checks the file modification time each
time it's requested in order to determine whether it can safely use the
cached code or whether it must recompile the file. Should you ever wish
to look at its source code, Apache::Registry is a good example of a well-written Apache content handler that exercises much of the Perl API.

We created a typical configuration file entry for Apache::Registry in Chapter 2. Let's examine it in more detail now.
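For reference, an entry of the kind described below might look like this; treat it as a sketch consistent with the surrounding description rather than a verbatim copy of the Chapter 2 example:

```apache
Alias /perl/ /usr/local/apache/perl/
<Location /perl>
    SetHandler     perl-script
    PerlHandler    Apache::Registry
    PerlSendHeader On
    Options        +ExecCGI
</Location>
```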

The Alias directive simply maps the physical directory /usr/local/apache/perl/ to a virtual directory named /perl. The <Location> section is more interesting. It uses SetHandler to make perl-script the content handler for this directory and sets Apache::Registry to be the module to handle requests for files within this part of the document tree.

The PerlSendHeader On line tells mod_perl to intercept anything that looks like a header line (such as Content-Type: text/html )
and to automatically turn it into a correctly formatted HTTP/1.0 header
the way that Apache does with CGI scripts. This allows you to write
scripts without bothering to call the request object's send_http_header( ) method. Like other Apache::Registry features, this option makes it easier to port CGI scripts to the Apache API. If you use CGI.pm's header( ) function to generate HTTP headers, you do not need to activate this directive because CGI.pm detects mod_perl and calls send_http_header( ) for you. However, it does not hurt to use this directive anyway.

Options +ExecCGI ordinarily tells Apache that it's all right for the directory to contain CGI scripts. In this case the flag is required by Apache::Registry to confirm that you really know what you're doing. In addition, all scripts located in directories handled by Apache::Registry must be executable--another check against accidentally leaving wayward nonscript files in the directory.

When you use Apache::Registry,
you can program in either of two distinct styles. You can choose to
ignore the Apache Perl API entirely and act as if your script were
executed within a CGI environment, or you can ignore the CGI
compatibility features and make Apache API calls. You can also combine
both programming styles in a single script, although you run the risk
of confusing yourself and anyone else who needs to maintain your code!

A typical example of the first style is the hello.pl script (Example 4-12),
which you also saw in Chapter 2. The interesting thing about this
script is that there's nothing Apache-specific about it. The same
script will run as a standard CGI script under Apache or any other web
server. Any library modules that rely on the CGI environment will work
as well.

#!/usr/local/bin/perl
# file: hello.pl
print "Content-Type: text/html\n\n";
print <<END;
<HTML>
<HEAD>
<TITLE>Hello There</TITLE>
</HEAD>
<BODY>
<H1>Hello $ENV{REMOTE_HOST}</H1>
Who would take this book seriously if the examples
didn't say "hello world" in at least four different ways?
</BODY>
</HTML>
END

Example 4-13 shows the same script rewritten more compactly by taking advantage of the various shortcuts provided by the CGI.pm module.

Example 4-13: An Apache::Registry Script That Uses CGI.pm

#!/usr/local/bin/perl
# file: hello2.pl
use CGI qw(:standard);
print header,
start_html('Hello There'),
h1('Hello',remote_host()),
'Who would take this book seriously if the examples',
'didn\'t say "hello world" in at least four different ways?',
end_html;

In contrast, Example 4-14 shows the script written in the Apache Perl API style. If you compare the script to Example 4-7,
which used the vanilla API to define its own content handler, you'll
see that the contents of this script (with the exception of the #! line at the top) are almost identical to the body of the handler( )
subroutine defined there. The main difference is that instead of
retrieving the Apache request object from the subroutine argument list,
we get it by calling Apache->request(). request( ) is a static (class) method in the Apache package where the current request object can always be found.

There are also some subtle differences between Apache::Registry scripts that make Apache API calls and plain content handlers. One thing to notice is that there is no return value from Apache::Registry scripts. Apache::Registry normally assumes an HTTP status code of 200 (OK). However, you can change the status code manually by calling the request object's status( ) method to change the status code before sending out the header:

$r->status(404); # not found

Strictly speaking, it isn't necessary to call send_http_header( ) if you have PerlSendHeader On. However, it is good practice to do so, and it won't lead to redundant headers being printed.

Alternatively, you can use the CGI compatibility mode to set the status by printing out an HTTP header that contains a Status: field:

print "Status: 404 Not Found\n\n";

Another subtle difference is that at least one of the command-line switches that may be found on the topmost #! line is significant. The -w switch, if present, will signal Apache::Registry to turn on Perl warnings by setting the $^W global to a true value. Another common switch used with CGI scripts is -T,
which turns on taint checking. Currently, taint checking can be
activated for the Perl interpreter as a whole only at server startup
time by setting the configuration directive PerlTaintCheck On. However, if Apache::Registry notices -T on the #! line and taint checks are not activated, it will print a warning in the server error log.

Since Apache::Registry scripts can do double duty as normal CGI scripts and as mod_perl
scripts, it's sometimes useful for them to check the environment and
behave differently in the two situations. They can do this by checking
for the existence of the environment variable MOD_PERL or for the value of GATEWAY_INTERFACE. When running under mod_perl, GATEWAY_INTERFACE will be equal to CGI-Perl/1.1. Under the normal CGI interface, it will be CGI/1.1.
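As a sketch, such a check might look like the following; the helper name running_under_mod_perl( ) is invented for illustration:

```perl
# Illustrative helper (the name is invented): decide which environment
# the script is running in by inspecting the CGI environment variables.
sub running_under_mod_perl {
    return 1 if exists $ENV{MOD_PERL};
    my $gateway = $ENV{GATEWAY_INTERFACE} || '';
    return $gateway =~ m{^CGI-Perl/} ? 1 : 0;
}
```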

A Useful Apache::Registry Application

All the Apache::Registry
examples that we've seen so far have been short and, frankly, silly.
Now let's look at an example of a real-world script that actually does
something useful. The guestbook script (Example 4-15),
as its name implies, manages a typical site guestbook, where visitors
can enter their names, email addresses, and comments. It works well as
both a standalone CGI script and a mod_perl Apache::Registry script, automatically detecting when it is running under the Apache Perl API in order to take advantage of mod_perl 's
features. In addition to showing you how to generate a series of
fill-out forms to handle a moderately complex user interaction, this
script demonstrates how to read and update a file without the risk of
several instances of the script trying to do so simultaneously.

Unlike some other guestbook programs, this one doesn't
append users' names to a growing HTML document. Instead, it maintains a
flat file in which each user's entry is represented as a single line in
the file. Tabs separate the five fields, which are the date of the
entry, the user's name, the user's email address, the user's location
(e.g., city of residence), and comments. Nonalphanumeric characters are
URL-escaped to prevent the format from getting messed up if the user
enters newlines or tabs in the fields, giving records that look like:

05/07/98 JR jr_ewing%40dallas.com Dallas,%20TX Like%20the%20hat

When the script is first called, it presents the user
with the option of signing the guestbook file or looking at previous
entries (Figure 4-6).

If the user presses the button labeled "Sign
Guestbook," a confirmation page appears, which echoes the entry and
prompts the user to edit or confirm it (Figure 4-7).

Figure 4-7. The confirmation page generated by guestbook

Pressing the "Change Entry" button takes the user back
to the previous page with the fields filled in and waiting for the
user's changes. Pressing "Confirm Entry" appends the user's entry to
the guestbook file and displays the whole file (Figure 4-8).

The script then defines some constants. @FIELDS
is an array of all the fields known to the guestbook. By changing the
contents of this array you can generate different fill-out forms. %REQUIRED is a hash that designates certain fields as required, in this case name and e-mail.
The script will refuse to add an entry to the guestbook until these
fields are filled out (however, no error checking on the contents of
the fields is done). %BIG is a hash containing the names of fields that are displayed as large text areas, in this case comments. Other fields are displayed as one-line text entries.
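A hypothetical rendering of these constants follows; the exact field lists in the real script may differ:

```perl
# Hypothetical constants mirroring the description in the text.
my @FIELDS   = qw(name e-mail location comments);
my %REQUIRED = ('name' => 1, 'e-mail' => 1);   # must be filled in
my %BIG      = ('comments' => 1);              # rendered as a text area
```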

Next the script checks if it is running under mod_perl by checking for the MOD_PERL environment variable. If the script finds that it is running under mod_perl, it fetches the Apache request object and queries the object for a per-directory configuration variable named GuestbookFile.
This contains the physical pathname of the file where the guestbook
entries are stored. If the script is a standalone CGI script, or if no GuestbookFile configuration variable is defined, the script defaults to a hardcoded file path. In the case of Apache::Registry scripts, the PerlSetVar directive used to set per-directory configuration variables must be located in a .htaccess file in the same directory as the script.

The script now begins to generate the document by
calling shortcut functions defined in the CGI module to generate the
HTTP header, the HTML header and title, and a level 1 heading of
"Guestbook."

CASE: {
    $_ = param('action');
    /^sign/i    and do { sign_guestbook();                       last CASE; };
    /^confirm/i and do { write_guestbook() and view_guestbook(); last CASE; };
    /^view/i    and do { view_guestbook(1);                      last CASE; };
    generate_form();
}

We now enter the variable part of the script. Depending
on what phase of the transaction the user is in, we either want to
prompt the user to fill out the guestbook form, confirm an entered
entry, or view the entire guestbook. We distinguish between the phases
by looking at the contents of a script parameter named action. If action equals sign,
we know that the user has just completed the fill-out form and pressed
the "Sign Guestbook" button, so we jump to the routine responsible for
this part of the transaction. Similarly, we look for action values of confirm and view, and jump to the appropriate routines for these actions. If action is missing, or if it has some value we don't expect, we take the default action of generating the fill-out form.

print end_html;
exit 0;

Having done its work, the script prints out the </HTML> tag and exits.

The subroutine responsible for generating the form is named, appropriately enough, generate_form( ). It iterates over @FIELDS
and dynamically generates a text label and a form element for each
field, modifying the format somewhat based on whether the field is
marked optional or big. Each label/field pair is pushed onto a list
named @rows. When the loop is finished, @rows
is turned into a nicely formatted table using CGI.pm's table-generation
shortcuts. The "View Guestbook" and "Sign Guestbook" buttons are added
to the form, and the routine finishes.

sign_guestbook( ) has a
slightly more complex job. Its first task is to check the submitted
form for missing required fields by calling the internal subroutine check_missing( ). If any are missing, it displays the missing fields by calling another internal subroutine, print_warning( ), and then invokes generate_form( )
to redisplay the form with its current values. No particular
hocus-pocus is required to display the partially completed form
correctly; this is just one of the beneficial side effects of CGI.pm's
"sticky forms" feature.

If all the required fields are filled in, sign_guestbook( )
generates an HTML table to display the user's entries. The technique
for generating the form is similar to that used in the previous
subroutine, except that no special cases are needed for different types
of fields. We do, however, have to be careful to call escapeHTML( )
(a function imported from CGI.pm) in order to prevent HTML entities and
other funny characters that the user might have entered from messing up
the page.

We end the routine by creating a short fill-out form.
This form contains the contents of the user's guestbook entry stashed
into a series of hidden fields, and push buttons labeled "Change Entry"
and "Confirm Entry." We hide the guestbook entry information in this
way in order to carry the information forward to the next set of pages.

The check_missing( ) and print_warning( ) subroutines are short and sweet. The first routine uses the Perl grep( )
function to check the list of provided fields against the list of
required fields and returns a list of the truants, if any. The second
routine accepts a list of missing fields and turns it into a warning of
the form, "Please fill in the following fields: e-mail." For emphasis, the message is rendered in a red font (under browsers that understand the <FONT> extension).

The write_guestbook( ) and view_guestbook( )
subroutines are the most complex of the bunch. The main complication is
that, on an active site, there's a pretty good chance that a second
instance of the script may be invoked by another user before the first
instance has completed opening and updating the guestbook file. If the
writes overlap, the file could be corrupted and a guestbook entry lost
or scrambled. For this reason, it's important for the script to lock
the file before working with it.

Both Unix and Windows systems offer a simple form of advisory file locking through the flock( ) system call. When a process opens a file and flock( )s it, no other process can flock( )
it until the first process either closes the file or manually
relinquishes the lock. There are actually two types of lock. A "shared"
lock can be held by many processes simultaneously. An "exclusive" lock
can only be held by one process at a time and prevents any other
program from locking the file. Typically, a program that wants to read
from a file will obtain a shared lock, while a program that wants to
write to the file asks the system for an exclusive lock. A shared lock
allows multiple programs to read from a file without worrying that some
other process will change the file while they are still reading it. A
program that wants to write to a file will call flock( )
to obtain an exclusive lock; the call will then block until all other
processes have released their locks. After an exclusive lock is
granted, no other program can lock the file until the writing process
has finished its work and released the lock.

It's important to realize that the flock( ) locking mechanism is advisory. Nothing prevents a program from ignoring the flock( )
call and reading from or writing to a file without seeking to obtain a
lock first. However, as long as only the programs you've written
yourself attempt to access the file and you're always careful to call flock( ) before working with it, the system works just fine.

To make life a little simpler, the guestbook script defines a utility function named lock( ) that takes care of opening and locking the guestbook file (you'll find the definition at the bottom of the source listing). lock( )
takes two arguments: the name of the file to open and a flag indicating
whether the file should be opened for writing. If the write flag is
true, the function opens the file in append mode and then attempts to
obtain an exclusive lock. Otherwise, it opens the file read only and
tries to obtain a shared lock. If successful, the opened filehandle is
returned to the caller.

The flock( ) function is used
to obtain both types of lock. The first argument is the opened
filehandle; the second is a constant indicating the type of lock to
obtain. The constants for exclusive and shared locks are LOCK_EX and LOCK_SH, respectively. Both constants are imported from the Fcntl module using the :flock tag. We combine these constants with the LOCK_NB (nonblocking) constant, also obtained from Fcntl, in order to tell flock( ) to return immediately if a lock cannot be obtained. Otherwise, flock( )
will block indefinitely until the file is available. In order to avoid
a long wait in which the script appears to be hung, we call flock( )
in a polling loop. If a lock cannot immediately be obtained, we print a
warning message to the browser screen and sleep for 1 second. After 10
consecutive failed tries, we give up and exit the script. If the lock
is successful, we return the filehandle.
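Putting these pieces together, a sketch of the lock( ) utility might look like this. It is renamed lock_file( ) here to sidestep Perl's built-in lock keyword, and the 10-try polling loop follows the description above:

```perl
use Fcntl qw(:flock);

# Sketch of the lock( ) utility described in the text (renamed
# lock_file to avoid Perl's built-in lock keyword). Opens the file,
# then polls for the lock up to 10 times before giving up.
sub lock_file {
    my ($path, $for_writing) = @_;
    my $fh;
    if ($for_writing) {
        open $fh, '>>', $path or return;    # append mode for writers
    } else {
        open $fh, '<', $path or return;     # read-only for readers
    }
    my $flags = ($for_writing ? LOCK_EX : LOCK_SH) | LOCK_NB;
    my $tries = 0;
    until (flock $fh, $flags) {
        print "Waiting for lock on guestbook file...\n";
        return if ++$tries >= 10;           # give up after 10 attempts
        sleep 1;
    }
    return $fh;
}
```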

To write a new entry into the guestbook, the write_guestbook( ) function calls lock( )
with the path to the guestbook file and a flag indicating we want write
access. If the call fails, we display an appropriate error message and
return. Otherwise, we seek to the end of the file, just in case someone
else wrote to the file while we were waiting for the lock. We then join
together the current date (obtained from the POSIX strftime( )
function) with the current values of the guestbook fields and write
them out to the guestbook filehandle. To avoid the possibility of the
user messing up our tab-delimited field scheme by entering tabs or
newlines in the fill-out form, we're careful to escape the fields
before writing them to the file. To do this, we use the map operator to pass the fields through CGI.pm's escape( ) function. This function is ordinarily used to make text safe for use in URIs, but it works just as well here.
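To illustrate the idea without pulling in CGI.pm, here is a simplified stand-in for its escape( ) function applied to a set of guestbook fields; the character set escaped here is an approximation, not CGI.pm's exact rule:

```perl
# Simplified stand-in for CGI.pm's escape( ) function; the set of
# characters escaped here approximates, but may not match, the real one.
sub uri_escape_ish {
    my $str = shift;
    $str =~ s/([^a-zA-Z0-9_.-])/sprintf('%%%02X', ord $1)/ge;
    return $str;
}

# Escape every field, then join with tabs to form one guestbook record.
my @values = ('JR', 'jr_ewing@dallas.com', 'Dallas, TX', "Like the hat\n");
my $record = join("\t", map { uri_escape_ish($_) } @values) . "\n";
```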

After writing to the file, we're careful to close the
filehandle. This releases the lock on the file and gives other
processes access to it.

The view_guestbook( )
subroutine looks a lot like the one we just looked at but in reverse.
It starts by creating a tiny fill-out form containing a single button
labeled "Sign Guestbook." This button is only displayed when someone
views the guestbook without signing it first and is controlled by the $show_sign_button flag. Next we obtain a read-only filehandle on the guestbook file by calling lock( ) with a false second argument. If lock( )
returns an undefined result, we print an error message and exit.
Otherwise, we read the contents of the guestbook file line by line and
split out the fields.

The fields are then processed through map( ) twice: once to unescape the URL escape characters using the CGI.pm unescape( ) function and once again to make them safe to display on an HTML page using CGI.pm's escapeHTML( ) function. The second round of escaping is to avoid problems with values that contain the <, >, and & symbols. The processed lines are turned into HTML table cells, and unshifted onto a list named @rows. The purpose of the unshift
is to reverse the order of the lines, so that more recent guestbook
entries appear at the top of the list. We add the headings for the
table and turn the whole thing into an HTML table using the appropriate
CGI.pm shortcuts. We close the filehandle and exit.

If we were not interested in running this script under
standard CGI, we could increase performance slightly and reduce memory
consumption substantially by replacing a few functions with their Apache:: equivalents.

See the reference listings in Chapter 9 for the proper
syntax for these replacements. You'll also find a version of the
guestbook script that uses these lightweight replacements on this
book's companion web site, http://www.modperl.com.

Apache::Registry Traps

There are a number of traps and pitfalls that you can fall into when using Apache::Registry. This section warns you about them.

It helps to know how Apache::Registry works in order to understand why the traps are there. When the server is asked to return a file that is handled by the Apache::Registry content handler (in other words, a script!), Apache::Registry
first looks in an internal cache of compiled subroutines that it
maintains. If it doesn't find a subroutine that corresponds to the
script file, it reads the contents of the file and repackages it into a
block of code that looks something like this:
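The following is a sketch of such a wrapper, using the mangled package name discussed next; the handler body is a placeholder, and the real wrapper's details vary between mod_perl versions:

```perl
# Sketch only: the real wrapper generated by Apache::Registry differs
# in detail between mod_perl versions. The body is a placeholder.
package Apache::ROOT::perl::guestbook_2ecgi;

sub handler {
    # ... the original text of guestbook.cgi is inserted here ...
    return 0;    # standing in for the OK constant
}

package main;
```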

$mangled_package_name is a
version of the script's URI which has been modified in such a way as to
turn it into a legal Perl package name while keeping it distinct from
all other compiled Apache::Registry scripts. For example, the guestbook.cgi script shown in the last section would be turned into a cached subroutine in the package Apache::ROOT::perl::guestbook_2ecgi. The compiled code is then cached for later use.

Before Apache::Registry even comes into play, mod_perl fiddles with the environment to make it appear as if the script were being called under the CGI protocol. For example, the $ENV{QUERY_STRING} environment variable is initialized with the contents of Apache::args( ), and $ENV{SERVER_NAME} is filled in from the value returned by Apache::server_hostname( ). This behavior is controlled by the PerlSetupEnv directive, which is On by default. If your scripts do not need to use CGI %ENV variables, turning this directive Off will reduce memory overhead slightly.

In addition to caching the compiled script, Apache::Registry
also stores the script's last modification time. It checks the stored
time against the current modification time before executing the cached
code. If it detects that the script has been modified more recently
than the last time it was compiled, it discards the cached code and
recompiles the script.

The first and most common pitfall when using Apache::Registry
is to forget that the code will be persistent across many sessions.
Perl CGI programmers commonly make profligate use of globals, allocate
mammoth memory structures without disposing of them, and open
filehandles and never close them. They get away with this because CGI
scripts are short-lived. When the CGI transaction is done, the script
exits, and everything is cleaned up automatically.

Not so with Apache::Registry
scripts (or any other Apache Perl module, for that matter). Globals
persist from invocation to invocation, big data structures will remain
in memory, and open files will remain open until the Apache child
process has exited or the server itself is shut down.

Therefore, it is vital to code cleanly. You should
never depend on a global variable being uninitialized in order to
determine when a subroutine is being called for the first time. In
fact, you should reduce your dependency on globals in general. Close
filehandles when you are finished with them, and make sure to kill (or
at least wait on) any child processes you may have launched.

Perl provides two useful tools for writing clean code. use strict turns on checks that make it harder to use global variables unintentionally. Variables must either be lexically scoped (with my
) or qualified with their complete package names. The only way around
these restrictions is to declare variables you intend to use as globals
at the top of the script with use vars. This code snippet shows how:

use strict;
use vars qw{$INIT $DEBUG @NAMES %HANDLES};

We have used strict in many of the examples in the preceding sections, and we strongly recommend it for any Perl script you write.

The other tool is Perl runtime warnings, which can be turned on in Apache::Registry scripts by including a -w switch on the #! line, or within other modules by setting the magic $^W variable to true. You can even enable warnings globally by setting $^W to true inside the server's Perl startup script, if there is one (see Chapter 2).

-w will catch a variety of
errors, dubious programming constructs, typos, and other sins. Among
other things, it will warn when a bareword (a string without
surrounding quotation marks) conflicts with a subroutine name, when a
variable is used only once, and when a lexical variable is
inappropriately shared between an outer and an inner scope (a horrible
problem which we expose in all its gory details a few paragraphs
later).

-w may also generate hundreds
of "Use of uninitialized value" messages at runtime, which will fill up
your server error log. Many of these warnings can be hard to track
down. If there is no line number reported with the warning, or if the
reported line number is incorrect,[2] try using Perl's #line token described in the perlsyn manual page and in Chapter 9 under "Special Global Variables, Subroutines, and Literals."

It may also be helpful to see a full stack trace of the code which triggered the warning. The cluck( ) function found in the standard Carp module will give you this functionality. Here is an example:

use Carp ();
local $SIG{__WARN__} = \&Carp::cluck;

Note that -w checks are done
at runtime, which may slow down script execution time. In production
mode, you may wish to turn warnings off altogether or localize warnings
using the $^W global variable described in the perlvar manpage.

Another subtle mod_perl trap
that lies in wait for even experienced programmers involves the sharing
of lexical variables between outer and inner named subroutines. To
understand this problem, consider the following innocent-looking code:
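
The script itself is not reproduced in this excerpt; the following is a minimal sketch reconstructed to be consistent with the output shown below (the file name test.pl and the loop count of four are assumptions):

```perl
#!/usr/local/bin/perl -w
# test.pl -- sketch of the shared-lexical trap
use strict;

bump_and_print() for 1 .. 4;

sub bump_and_print {
    my $a = 1;                       # a fresh lexical on every call...
    print "In the outer scope, \$a is $a\n";
    bump();
    sub bump {                       # ...but the named inner sub captures
        $a++;                        # only the *first* instance of $a
        print "In the inner scope, \$a is $a\n";
    }
}
```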

When you run this script, it generates the following inexplicable output:

Variable "$a" will not stay shared at ./test.pl line 12.
In the outer scope, $a is 1
In the inner scope, $a is 2
In the outer scope, $a is 1
In the inner scope, $a is 3
In the outer scope, $a is 1
In the inner scope, $a is 4
In the outer scope, $a is 1
In the inner scope, $a is 5

For some reason the variable $a has become "unstuck" from its my( ) declaration in bump_and_print( ) and has taken on a life of its own in the inner subroutine bump( ). Because of the -w
switch, Perl complains about this problem during the compilation phase,
with the terse warning that the variable "will not stay shared." This
behavior does not happen if the inner subroutine is made into an
anonymous subroutine. It only affects named inner subroutines.

The rationale for the peculiar behavior of lexical variables and ways to avoid it in conventional scripts are explained in the perldiag manual page. When using Apache::Registry this bug can bite you when you least expect it. Because Apache::Registry works by wrapping the contents of a script inside a handler( ) function, inner named subroutines are created whether you want them or not. Hence, this piece of code will not do what you expect:
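
The script in question is missing from this excerpt; a representative Apache::Registry script with this problem (the CGI.pm calls and the wording of the page are assumptions) would look something like this:

```perl
#!/usr/local/bin/perl
# Registry script sketch: a lexical shared with a named subroutine
use strict;
use CGI qw(:standard);

my $name = param('name');
print header(), start_html('Hello');
print_body();
print end_html();

sub print_body {
    # Under Apache::Registry this becomes a named *inner* subroutine,
    # so it can keep seeing a stale copy of $name on later requests.
    print p("The name parameter is: ", $name);
}
```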

The first time you run it, it will run correctly, printing the value of the name
CGI parameter. However, on subsequent invocations the script will
appear to get "stuck" and remember the values of previous invocations.
This is because the lexically scoped $name variable is being referenced from within print_body( ), which, when running under Apache::Registry, is a named inner subroutine. Because multiple Apache processes are running, each process will remember a different value of $name, resulting in bizarre and arbitrary behavior.

Perl may be fixed someday to do the right thing with
inner subroutines. In the meantime, there are several ways to avoid
this problem. Instead of making the outer variable lexically scoped,
you can declare it to be a package global, as this snippet shows:

use strict;
use vars '$name';
$name = param('name');

Because globals are global, they aren't subject to weird scoping rules.

Alternatively, you can pass the variable to the
subroutine as an argument and avoid sharing variables between scopes
altogether. This example shows that variant:
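
The example itself is absent from this excerpt; a sketch of the argument-passing variant (the CGI.pm calls are assumptions) might read:

```perl
my $name = param('name');
print_body($name);            # pass the value explicitly...

sub print_body {
    my $name = shift;         # ...and receive it as an argument,
    print p("The name parameter is: ", $name);
}
```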

Finally, you can put the guts of your application into a library and use or require it. The Apache::Registry script then becomes only a hook that invokes the library:

#!/usr/local/bin/perl
require "my_application_guts";
do_everything();

The shared lexical variable problem is a good reason to use the -w switch during Apache::Registry
script development and debugging. If you see warnings about a variable
not remaining shared, you have a problem, even if the ill effects don't
immediately manifest themselves.

Another problem that you will certainly run into involves the use of custom libraries by Apache::Registry scripts. When you make an editing change to a script, Apache::Registry
notices the recent modification time and reloads the script. However,
the same isn't true of any library file that you load into the script
with use or require. If you make a change to a required
file, the script will continue to run the old version of the file
until the script itself is recompiled for some reason. This can lead to
confusion and much hair-tearing during development!

You can avoid going bald by using Apache::StatINC, a standard part of the mod_perl distribution. It watches over the contents of the internal Perl %INC hash and reloads any files that have changed since the last time it was invoked. Installing Apache::StatINC is easy. Simply install it as the PerlInitHandler for any directory that is managed by Apache::Registry. For example, here is an access.conf entry that installs both Apache::Registry and Apache::StatINC:
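
The entry itself is missing from this excerpt; a typical configuration (the /perl location is an assumption) might read:

```apache
<Location /perl>
  SetHandler      perl-script
  PerlHandler     Apache::Registry
  PerlInitHandler Apache::StatINC
  Options         ExecCGI
</Location>
```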

Because Apache::StatINC operates at a level above the level of individual scripts, any nonstandard library locations added by the script with use lib or by directly manipulating the contents of @INC will be ignored. If you want these locations to be monitored by Apache::StatINC,
you should make sure that they are added to the library search path
before invoking the script. You can do this either by setting the PERL5LIB environment variable before starting up the Apache server (for instance, in the server startup script), or by placing a use lib line in your Perl startup file, as described in Chapter 2.

When you use Apache::StatINC, there is a slight overhead for performing a stat
on each included file every time a script is run. This overhead is
usually immeasurable, but it will become noticeable on a heavily loaded
server. In this case, you may want to forego it and instead manually
force the embedded Perl interpreter to reload all its compiled scripts by restarting the server with apachectl. In order for this to work, the PerlFreshRestart directive must be turned on in the Apache configuration file. If you haven't done so already, add this line to perl.conf or one of the other configuration files:

PerlFreshRestart On

You can try reloading compiled scripts in this way
whenever things seem to have gotten themselves into a weird state. This
will reset all scripts to known initial settings and allow you to
investigate problems systematically. You might also want to stop the
server completely and restart it using the -X
switch. This forces the server to run as a single process in the
foreground. Interacting with a single process rather than multiple ones
makes it easier to debug misbehaving scripts. In a production
environment, you'll want to do this on a test server in order to avoid
disrupting web services.

Handling Errors

Errors in Apache modules do occur, and tracking them
down is significantly trickier than in standalone Perl or C programs.
Some errors are due to bugs in your code, while others are due to the
unavoidable hazards of running in a networked environment. The remote
user might cancel a form submission before it is entirely done, the
connection might drop while you're updating a database, or a file that
you're trying to access might not exist.

A virtuous Apache module must let at least two people
know when a problem has occurred: you, the module's author, and the
remote user. You can communicate errors and other exception conditions
to yourself by writing out entries to the server log. For alerting the
user when a problem has occurred, you can take advantage of the simple
but flexible Apache ErrorDocument system, use CGI::Carp, or roll your own error handler.

Error Logging

We talked about tracking down code bugs in Chapter 2
and will talk more about C-language specific debugging in Chapter 10.
This section focuses on defensive coding techniques for intercepting
and handling other types of runtime errors.

The most important rule is to log everything.
Log anything unexpected, whether it is a fatal error or a condition
that you can work around. Log expected but unusual conditions too, and
generate routine logging messages that can help you trace the execution
of your module under normal conditions.

Apache versions 1.3 and higher offer syslog-like log levels ranging in severity from debug, for low-priority messages, through warn, for noncritical errors, to emerg, for fatal errors that make the module unusable. By setting the LogLevel
directive in the server configuration file, you can adjust the level of
messages that are written to the server error log. For example, by
setting LogLevel to warn, messages with a priority level of warn and higher are displayed in the log; lower-priority messages are ignored.

To use this adjustable logging API, you must load the standard Apache::Log module. This adds a log( ) method to the Apache request object, which will return an Apache::Log
object. You can then invoke this object's methods in order to write
nicely formatted log entries to the server's error log at the priority
level you desire. Here's a short example:
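
The example is omitted from this excerpt; a fragment in the spirit of the description that follows (the GUESTBOOK filehandle is an assumption, and the code is presumed to run inside a handler with the request object in $r) might read:

```perl
use Apache::Log ();                  # adds the log( ) method to $r
use Apache::Constants qw(OK SERVER_ERROR);
use Fcntl qw(:flock);

my $log = $r->log;                   # an Apache::Log object
$log->debug("Attempting to lock guestbook file");
unless (flock GUESTBOOK, LOCK_EX) {  # GUESTBOOK: an assumed open filehandle
    $log->emerg("Cannot get exclusive lock on guestbook!");
    return SERVER_ERROR;
}
$log->debug("Got the lock");
```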

In this example, we first obtain a log object by calling the request object's log( ) method. We call the log object's debug( )
method to send a debug message to the error log and then try to perform
a locking operation. If the operation fails, we log an error message at
the emerg priority level using the log object's emerg( ) method and exit. Otherwise, we log another debugging message.

You'll find the full list of method calls made available by Apache::Log in Chapter 9, in the subsection "Logging Methods" under "The Apache Request Object."
In addition, the Apache Perl API offers three simpler methods for
entering messages into the log file. You don't have to import the Apache::Log module to use these methods, and they're appropriate for smaller projects (such as most of the examples in this book).

$r->log_error($message)

log_error( ) writes out a time-stamped message into the server error log with a severity level of error.
Use it for critical errors that make further normal execution of the
module impossible. This method predates the 1.3 LogLevel API but still
exists for backward compatibility and as a shortcut to $r->log->error.

$r->warn($message)

warn( ) will log an error message with a severity level of warn.
You can use this for noncritical errors or unexpected conditions that
you can work around. This method predates the 1.3 LogLevel API but
still exists for backward compatibility and as a shortcut to $r->log->warn.

$r->log_reason($message,$file)

This is a special-purpose log message used for
errors that occur when a content handler tries to process a file. It
results in a message that looks something like this:
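
The sample message is absent from this excerpt; log_reason( ) output generally takes a form like the following, where the timestamp, host, path, and reason shown here are invented for illustration:

```
[Tue Jul 21 16:30:47 1998] [error] access to /usr/local/apache/htdocs/index.html
failed for ppp12.example.net, reason: file permissions deny server access
```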

You might also choose to include a $DEBUG
global in your modules, either hard-coding it directly into the source,
or by pulling its value out of the configuration file with Apache::dir_config( ).
Your module can then check this global every time it does something
significant. If set to a true value, your script should send verbose
informational messages to the Apache error log (or to an alternative
log file of your choice).

The ErrorDocument System

Apache provides a handy ErrorDocument
directive that can be used to display a custom page when a handler
returns a non-OK status code. The custom page can be any URI, including
a remote web page, a local static page, a local server-side include
document, or a CGI script or module. In the last three cases, the
server generates an internal redirect, making the redirection very
efficient.

For example, the configuration file for Lincoln's laboratory site contains this directive:

ErrorDocument 404 /perl/missing.cgi

When the server encounters a 404 "Not Found" status
code, whether generated by a custom module or by the default content
handler, it will generate an internal redirect to a mod_perl script named missing.cgi. Before calling the script, Apache sets some useful environment variables including the following:

REDIRECT_URL

The URL of the document that the user was originally trying to fetch.

REDIRECT_STATUS

The status code that caused the redirection to occur.

REDIRECT_REQUEST_METHOD

The method (GET or POST) that caused the redirection.

REDIRECT_QUERY_STRING

The original query string, if any.

REDIRECT_ERROR_NOTES

The logged error message, if any.

A slightly simplified version of missing.cgi that works with Apache::Registry (as well as a standalone CGI script) is shown in Example 4-16. For a screenshot of what the user gets when requesting a nonexistent URI, see Figure 4-9.

Figure 4-9. The missing.cgi script generates a custom page to display when a URI is not found.

If you want to implement the ErrorDocument handler as a vanilla Apache Perl API script, the various REDIRECT_ environment variables will not be available to you. However, you can get the same information by calling the request object's prev( )
method. This returns the request object from the original request. You
can then query this object to recover the requested URI, the request
method, and so forth.

Example 4-17 shows a rewritten version of missing.cgi that uses prev( ) to recover the URI of the missing document. The feature to note in this code is the call to $r->prev on the fifth line of the handler( )
subroutine. If the handler was invoked as the result of an internal
redirection, this call will return the original request object, which
we then query for the requested document by calling its uri( )
method. If the handler was invoked directly (perhaps by the user
requesting its URI), the original request will be undefined and we use
an empty string for the document URI.
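
Example 4-17 itself is not included in this excerpt; a minimal sketch of the prev( ) technique it describes (the package name and the page text are assumptions) looks like this:

```perl
package Apache::Missing;
use strict;
use Apache::Constants qw(OK);

sub handler {
    my $r = shift;
    # If we got here via an internal redirect, prev( ) returns the
    # original request object; otherwise it is undefined.
    my $original = $r->prev;
    my $document = $original ? $original->uri : '';
    $r->content_type('text/html');
    $r->send_http_header;
    $r->print(<<END);
<HTML><BODY>
<H1>Not Found</H1>
<P>Sorry, the document $document could not be found on this server.</P>
</BODY></HTML>
END
    return OK;
}
1;
```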

If the static nature of the Apache ErrorDocument
directive is inadequate for your needs, you can set the error document
dynamically from within a handler by calling the request object's custom_response( )
method. This method takes two arguments: the status code of the
response you want to handle and the URI of the document or module that
you want to pass control to. This error document setting will persist
for the lifetime of the current request only. After the handler exits,
the setting returns to its default.

For example, the following code snippet sets up a custom error handler for the SERVER_ERROR error code (a generic error that covers a variety of sins). If the things_are_ok( )
subroutine (not implemented here) returns a true value, we do our work
and return an OK status. Otherwise, we set the error document to point
to a URI named /Carp and return a SERVER_ERROR status.
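
The snippet itself is omitted from this excerpt; a sketch matching the description (things_are_ok( ) is the unimplemented test named in the text) might read:

```perl
use Apache::Constants qw(OK SERVER_ERROR);

sub handler {
    my $r = shift;
    if (things_are_ok($r)) {
        # ... do the real work here ...
        return OK;
    }
    # Point the SERVER_ERROR error document at /Carp for this
    # request only, then signal the error.
    $r->custom_response(SERVER_ERROR, "/Carp");
    return SERVER_ERROR;
}
```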

HTTP Headers and Error Handling

You already know about using header_out( ) to set HTTP header fields. A properly formatted HTTP header is sent to the browser when your module explicitly calls send_http_header( ), or it is sent for you automatically if you are using Apache::Registry, the PerlSendHeader directive is set to On, and your script prints some text that looks like an HTTP header.

You have to be careful, however, if your module ever
returns non-OK status codes. Apache wants to assume control over the
header generation process in the case of errors; if your module has
already sent the header, then Apache will send a redundant set of
headers with unattractive results. This applies both to real HTTP
errors, like BAD_REQUEST and NOT_FOUND, as well as to nonfatal conditions like REDIRECT and AUTH_REQUIRED.
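
The module under discussion is not reproduced in this excerpt; a minimal sketch of such a handler (the package name Apache::Crash and the CRASH constant appear in the surrounding text; the body text is an assumption) might be:

```perl
package Apache::Crash;
# Demonstrates the problem: the header is sent *before* error checking
use strict;
use Apache::Constants qw(OK SERVER_ERROR);
use constant CRASH => 1;

sub handler {
    my $r = shift;
    $r->content_type('text/plain');
    $r->send_http_header;            # too early!
    return SERVER_ERROR if CRASH;    # Apache now emits a second header
    $r->print('Half empty, or half full?');
    return OK;
}
1;
```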

After setting the document MIME type, this module sends off the HTTP header. It then checks a constant named CRASH and if true, which it always is, returns a status code of SERVER_ERROR.
Apache would ordinarily send a custom HTTP header in response to this
status code, but because the module has already emitted a header, it's
too late. Confusion results. If we map this module to the URI /Crash, we can telnet directly to the server to demonstrate the problem:
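
The transcript is omitted from this excerpt; schematically, the exchange looks something like this (hostname and header details are placeholders):

```
% telnet localhost 80
Connected to localhost.
GET /Crash HTTP/1.0

HTTP/1.1 200 OK
Content-Type: text/plain
...

HTTP/1.1 200 OK
Content-Type: text/html
...
```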

Not only are there two HTTP headers here, but both of them indicate a status code of 200 OK,
which is definitely not right. When displayed in the browser, the page
will be marred by extraneous header lines at the top of the screen.

The cardinal rule is that you should never call Apache::send_http_header( ) until your module has completed all its error checking and has decided to return an OK status code. Here's a better version of Apache::Crash that avoids the problem:
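
The corrected listing is absent from this excerpt; a sketch of the repaired handler, following the cardinal rule just stated, might be:

```perl
package Apache::Crash;
use strict;
use Apache::Constants qw(OK SERVER_ERROR);
use constant CRASH => 1;

sub handler {
    my $r = shift;
    return SERVER_ERROR if CRASH;    # all error checking first
    $r->content_type('text/plain');
    $r->send_http_header;            # header only on the OK path
    $r->print('Half empty, or half full?');
    return OK;
}
1;
```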

Another important detail about error handling is that Apache ignores the fields that you set with header_out( )
when your module generates an error status or invokes an internal
redirect. This is usually not a problem, but there are some cases in
which this restriction can be problematic. The most typical case is the
one in which you want a module to give the browser a cookie and
immediately redirect to a different URI. Or you might want to assign an
error document to the UNAUTHORIZED status
code so that a custom login screen appears when the user tries to
access a restricted page. In both cases you need to manipulate the HTTP
header fields prior to the redirect.

For these cases, call the request object's err_header_out( ) method. It has identical syntax to header_out( ),
but the fields that you set with it are sent to the browser only when
an error has occurred. Unlike ordinary headers, the fields set with err_header_out( ) persist across internal redirections, and so they are passed to Apache ErrorDocument handlers and other local URIs.

This provides you with a simple way to pass information
between modules across internal redirects. Combining the example from
this section with the example from the previous section gives the
modules shown in Example 4-18. Apache::GoFish generates a SERVER_ERROR, which is intercepted and handled by the custom ErrorDocument handler named Apache::Carp (Example 4-19). Before relinquishing control, however, Apache::GoFish creates a custom HTTP field named X-Odor which gives the error handler something substantial to complain about. The end result is shown in Figure 4-10.

The code should be fairly self-explanatory. The main point to notice is Apache::GoFish 's use of err_header_out( ) to set the value of the X-Odor field, and Apache::Carp 's use of the same function to retrieve it. Like header_out( ), when you call err_header_out( )
with a single argument, it returns the current value of the field and
does not otherwise alter the header. When you call it with two
arguments, it sets the indicated field.

An interesting side effect of this technique is that the X-Odor
field is also returned to the browser in the HTTP header. This could be
construed as a feature. If you wished to pass information between the
content handler and the error handler without leaving tracks in the
HTTP header, you could instead use the request object's "notes" table
to pass messages from one module to another. Chapter 9 covers how to
use this facility (see the description of the notes( ) method under "Server Core Functions").

Chaining Content Handlers

The C-language Apache API only allows a single content
handler to completely process a request. Several handlers may be given
a shot at it, but the first one to return an OK status will terminate
the content handling phase of the transaction.

There are times when it would be nice to chain handlers
into a pipeline. For example, one handler could add canned headers and
footers to the page, another could correct spelling errors, while a
third could add trademark symbols to all proprietary names. Although
the native C API can't do this yet,[[3]] the Perl API can, using a technique called "stacked handlers."

It is actually quite simple to stack handlers. Instead of declaring a single module or subroutine in the PerlHandler
directive, you declare several. Each handler will be called in turn in
the order in which it was declared. The exception to this rule is if
one of the handlers in the series returns an error code (anything other
than OK, DECLINED, or DONE). Handlers can adjust the stacking order themselves, or even arrange to process each other's output.

Simple Case of Stacked Handlers

Example 4-20 gives a very simple example of a stack of three content handlers. It's adapted slightly from the mod_perl manual page. For simplicity, all three handlers are defined in the same file, and are subroutines named header( ), body( ), and footer( ).
As the names imply, the first handler is responsible for the top of the
page (including the HTTP header), the second is responsible for the
middle, and the third for the bottom.
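
Example 4-20 is not reproduced in this excerpt; a sketch matching the description (the exact strings printed are assumptions) is:

```perl
package My;
use strict;
use Apache::Constants 'OK';

sub header {
    my $r = shift;
    $r->content_type('text/plain');
    $r->send_http_header;
    print "header text\n";
    OK;
}
sub body   { print "body text\n";   OK; }
sub footer { print "footer text\n"; OK; }
1;
__END__
```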

We first load the whole module into memory using the PerlModule directive. We then declare a URI location /My and assign the perl-script handler to it. Perl in turn is configured to run the My::header, My::body, and My::footer subroutines by passing them as arguments to a PerlHandler directive. In this case, the /My location has no corresponding physical directory, but there's no reason that it couldn't.
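
The configuration section itself is missing from this excerpt; an entry matching the description above would read something like:

```apache
PerlModule My
<Location /My>
  SetHandler  perl-script
  PerlHandler My::header My::body My::footer
</Location>
```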

After bringing in the OK constant from Apache::Constants, we define the subroutines header( ), body( ), and footer( ). header( ) sets the document's content type to plain text, sends the HTTP header, and prints out a line at the top of the document. body( ) and footer( ) both print out a line of text to identify themselves. The resulting page looks like this:
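
The page itself is not reproduced in this excerpt; if each handler prints one identifying line, the assembled document is simply those lines in order, for example:

```
header text
body text
footer text
```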

Coordinating Stacked Handlers

Stacked handlers often have to coordinate their activities. In the example of the previous section, the header( )
handler must be run before either of the other two in order for the
HTTP header to come out correctly. Sometimes it's useful to make the
first handler responsible for coordinating the other routines rather
than relying on the configuration file. The request object's push_handlers( ) method will help you do this.

push_handlers( ) takes two
arguments: a string representing the phase to handle, and a reference
to a subroutine to handle that phase. For example, this code fragment
will arrange for the footer( ) subroutine to be the next content handler invoked:

$r->push_handlers(PerlHandler => \&footer);

With this technique, we can rewrite the previous example along the lines shown in Example 4-21. In the revised module, we declare a subroutine named handler( ) that calls push_handlers( ) three times, once each for the header, body, and footer of the document. It then exits. The other routines are unchanged.
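
Example 4-21 is not reproduced here; the revised handler( ) might look like this sketch:

```perl
package My;
use strict;
use Apache::Constants 'OK';

sub handler {
    my $r = shift;
    # Queue the three content handlers in the order they must run
    $r->push_handlers(PerlHandler => \&header);
    $r->push_handlers(PerlHandler => \&body);
    $r->push_handlers(PerlHandler => \&footer);
    OK;
}
# header(), body(), and footer() are unchanged from the previous example
1;
```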

Stacked Handler Pipelining

The stacked handlers we looked at in the previous
example didn't interact. When one was finished processing, the next
took over. A more sophisticated set of handlers might want to pipeline
their results in such a way that the output of one handler becomes the
input to the next. This would allow the handlers to modify each other's
output in classic Unix filter fashion. This sounds difficult, but in
fact it's pretty simple. This section will show you how to set up a
filter pipeline. As an aside, it will also introduce you to the concept
of Apache Perl API method handlers.

The trick to achieving a handler pipeline is to use
"tied" filehandles to connect the neighbors together. In the event that
you've never worked with a tied filehandle before, it's a way of giving
a filehandle seemingly magic behavior. When you print( )
to a tied filehandle, the data is redirected to a method in a
user-defined class rather than going through the usual filesystem
routines. To create a tied filehandle, you simply declare a class that
defines a method named TIEHANDLE( ) and various methods to handle the sorts of things one does with a filehandle, such as PRINT( ) and READ( ).

Here's a concrete example of a tied filehandle class that interfaces to an antique daisywheel printer of some sort:
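
The class itself is omitted from this excerpt; the following sketch matches the description that follows (the internal driver routines are assumed and, as the text notes, not shown):

```perl
package DaisyWheel;
use strict;

sub TIEHANDLE {
    my ($class, $printer) = @_;
    open_daisywheel($printer) or die "open_daisywheel($printer): $!";
    return bless { printer => $printer }, $class;
}

sub PRINT {
    my $self = shift;
    # Recover the printer name and hand off the items to print
    send_to_daisywheel($self->{printer}, @_);
}

sub DESTROY {
    my $self = shift;
    close_daisywheel($self->{printer});
}

# open_daisywheel(), send_to_daisywheel(), and close_daisywheel()
# are internal driver routines, not shown here.
1;
```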

The TIEHANDLE( ) method gets
called first. It is responsible for opening the daisywheel printer
driver (routine not shown here!) and returning a blessed object
containing its instance variables. The PRINT( )
method is called whenever the main program prints to the tied
filehandle. Its arguments are the blessed object and a list containing
the arguments to print( ). It recovers the
printer name from its instance variables and then passes it, and the
items to print, to an internal routine that does the actual work. DESTROY( ) is called when the filehandle is untie( )d or closed. It calls an internal routine that closes the printer driver.

To use this class, a program just has to call tie( ) with the name of an appropriate printer:
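
The usage snippet is absent from this excerpt; a sketch (the printer name is an assumption) would look like:

```perl
tie *DAISY, 'DaisyWheel', 'dw7700';
print DAISY "Dear John,\n\n";
print DAISY "It's over between us.\n";
close DAISY;    # unties the handle, invoking DESTROY( )
```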

A more complete tied filehandle class might include a PRINTF( ) method, a READ( ) method, a READLINE( ) method, and a GETC( ) method, but for output-only filehandles PRINT( ) is usually enough.

Now back to Apache. The strategy will be for each
filter in the pipeline, including the very first and last ones, to
print to STDOUT, rather than directly invoking the Apache::print( ) method via the request object. We will arrange for STDOUT to be tied( ) in each case to a PRINT( ) method defined in the next filter down the chain. The whole scheme looks something like this:
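
The diagram is not reproduced in this excerpt; schematically, the pipeline looks something like this:

```
handler print() --> tied STDOUT --> FilterA::PRINT()
                                         |
                                         v
                                    FilterB::PRINT()
                                         |
                                         v
                                    Apache::PRINT() --> client
```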

Interestingly enough, the last filter in the chain
doesn't have to get special treatment. Internally, the Apache request
ties STDOUT to Apache::PRINT( ), which in turn calls Apache::print( ). This is why handlers can use $r->print('something') and print('something') interchangeably.

To simplify setting up these pipelines, we'll define a utility class called Apache::Forward.[[4]] Apache::Forward
is a null filter that passes its input through to the next filter in
the chain unmodified. Modules that inherit from this class override its
PRINT( ) method to do something interesting with the data.

Example 4-22 gives the source code for Apache::Forward. We'll discuss the code one section at a time.

Most of the work is done in the handler( ) subroutine, which is responsible for correctly tying the STDOUT filehandle. Notice that the function prototype for handler( ) is ($$), or two scalar arguments. This is a special signal to Apache to activate its method handler behavior. Instead of calling handler( ) like an ordinary subroutine, Apache calls handler( ) like this:

Apache::Forward->handler($r);

The result is that the handler( )
receives the class name as its first argument, and the request object
as the second argument. This object-oriented calling style allows Apache::Forward to be subclassed.

The handler( ) subroutine begins by recovering the identity of the next handler in the pipeline. It does this by calling tied( ) on the STDOUT filehandle. tied( )
returns a reference to whatever object a filehandle is tied to. It will
always return a valid object, even when the current package is the last
filter in the pipeline. This is because Apache ties STDOUT to itself,
so the last filter will get a reference to the Apache object.
Nevertheless, we do check that tied( ) returns an object and error out if not--just in case.

Next the subroutine reties STDOUT to itself, passing tie( )
the request object and the reference to the next filter in the
pipeline. This call shouldn't fail, but if it does, we return a server
error at this point.

Before finishing up, the handler( )
method needs to ensure that the filehandle will be untied before the
transaction terminates. We do this by registering a handler for the
cleanup phase. This is the last handler to be called before a
transaction terminates and is traditionally reserved for this kind of
garbage collection. We use register_cleanup( )
to push an anonymous subroutine that unties STDOUT. When the time
comes, the filehandle will be untied, automatically invoking the
class's DESTROY( ) method. This gives the
object a chance to clean up, if it needs to. Note that the client
connection will be closed before registered cleanups are run, so class DESTROY( ) methods should not attempt to send any data to the client.

The next routine to consider is TIEHANDLE( ), whose job is to return a new blessed object. It creates a blessed hash containing the keys r and next. r points to the request object, and next points to the next filter in the pipeline. Both of these arguments were passed to us by handler( ).

The PRINT( ) method is
invoked whenever the caller wants to print something to the tied
filehandle. The arguments consist of the blessed object and a list of
data items to be processed. Subclasses will want to modify the data
items in some way, but we just forward them unmodified to the next
filter in line by calling an internal routine named forward( ).

#sub DESTROY {
#    my $self = shift;
#    # maybe clean up here
#}

DESTROY( ) is normally
responsible for cleaning up. There's nothing to do in the general case,
so we comment out the definition to avoid being called, saving a bit of
overhead.

sub forward { shift()->{'next'}->PRINT(@_); }

forward( ) is called by PRINT( )
to forward the modified data items to the next filter in line. We shift
the blessed object off the argument stack, find the next filter in
line, and invoke its PRINT( ) method.

Having defined the filter base class, we can now define
filters that actually do something. We'll show a couple of simple ones
to give you the idea first, then create a larger module that does
something useful.

Apache::Upcase (Example 4-23) transforms everything it receives into uppercase letters. It inherits from Apache::Forward and then overrides the PRINT( ) method. PRINT( ) loops through the list of data items, calling uc( ) on each. It then forwards the modified data to the next filter in line by calling its forward( ) method (which we do not need to override).

Along the same lines, Apache::Censor (Example 4-24)
filters its input data to replace four-letter words with starred
versions. It takes the definition of "four-letter word" a little
liberally, transforming "sent" into "s**t." It is identical in every
way to Apache::Upcase, except that PRINT( )
performs a global regular expression substitution on the input data.
The transformed data is then forwarded to the next filter as before.

To watch these filters in action, we need a data
source. Here's a very simple content handler that emits a constant
string. It is very important that the content be sent with a regular print( ) statement rather than the specialized $r->print( ) method. If you call Apache::print( ) directly, rather than through the tied STDOUT filehandle, you short-circuit the whole chain!

package Apache::TestFilter;
use strict;
use Apache::Constants 'OK';

sub handler {
    my $r = shift;
    $r->content_type('text/plain');
    $r->send_http_header;
    print(<<END);
This is some text that is being sent out with a
print() statement to STDOUT.  We do not know whether
STDOUT is tied to Apache or to some other source,
and in fact it does not really matter.  We are just
the content source.  The filters come later.
END
    OK;
}

1;
__END__

The last step is to provide a suitable entry in the configuration file. The PerlHandler directive should declare the components of the pipeline in reverse
order. As Apache works its way forward from the last handler in the
pipeline to the first, each of the handlers unties and reties STDOUT.
The last handler in the series is the one that creates the actual
content. It emits its data using print( ) and the chained handlers do all the rest. Here's a sample entry:
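
The entry itself is absent from this excerpt; a configuration following this rule (the location name and filter ordering here are illustrative, with the outermost filter first and the content generator last) might read:

```apache
<Location /filtered>
  SetHandler  perl-script
  PerlHandler Apache::Censor Apache::Upcase Apache::TestFilter
</Location>
```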

The last filter we'll show you is actually useful in
its own right. When inserted into a filter pipeline, it compresses the
data stream using the GZip protocol, and flags the browser that the
data has been GZip-encoded by adding a Content-Encoding
field to the HTTP header. Browsers that support on-the-fly
decompression of GZip data will display the original document without
any user intervention.[[5]]

This filter requires the zlib compression library and its Perl interface, Paul Marquess' Compress::Zlib. zlib, along with instructions on installing it, can be found at ftp://ftp.uu.net/pub/archiving/zip/zlib*. As usual, you can find Compress::Zlib
at CPAN. Together these libraries provide both stream-based and
in-memory compression/decompression services, as well as a high-level
interface for creating and reading gzip files.

The filter is a little more complicated than the
previous ones because GZip works best when the entire document is
compressed in a single large segment. However, the filter will be
processing a series of print( ) statements
on data that is often as short as a single line. Although we could
compress each line as a single segment, compression efficiency suffers
dramatically. So instead we buffer the output, using zlib 's stream-oriented compression routines to emit the encoded data whenever zlib
thinks enough data has been received to compress efficiently. We also
have to take care of the details of creating a valid GZip header and
footer. The header consists of the current date, information about the
operating system, and some flags. The footer contains a CRC redundancy
check and the size of the uncompressed file.
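As a sketch of what those bookends look like, the field layout below follows the GZip file format (the subroutine names are ours, not necessarily those used in Example 4-25):

```perl
use strict;

# 10-byte GZip header: the two magic bytes 0x1f 0x8b, compression
# method 8 (deflate), no flags, the modification time, no extra
# flags, and an operating-system code (3 means Unix).
sub gzip_header {
    return pack("C4 V C2", 0x1f, 0x8b, 8, 0, time(), 0, 3);
}

# 8-byte GZip footer: the CRC-32 of the uncompressed data followed
# by its length, both as 32-bit little-endian values.
sub gzip_footer {
    my($crc, $length) = @_;
    return pack("V V", $crc, $length & 0xffffffff);
}
```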

Example 4-25 gives the complete code for Apache::GZip. Although it inherits its core functionality from Apache::Forward, each subroutine has to be tweaked a bit to support the unique requirements of GZip compression.

After the usual preamble, we import the compression routines from Compress::Zlib, and bring in the Apache::Forward
class. We then define a couple of constants needed for the GZip header
(in case you're wondering, we got these constants by looking at the zlib C code).

In order for the browser to automatically decompress the data, it needs to see a Content-Encoding field with the value gzip in the HTTP header. In order to insert this field, we override the parent class's handler( ) subroutine and set the field using the request object's content_encoding( ) method. We then call our superclass's handler( ) method to do the rest of the work.
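A sketch of that override, assuming the Apache::Forward base class described earlier:

```perl
package Apache::GZip;
use strict;
use Apache::Constants qw(OK DECLINED);
our @ISA = qw(Apache::Forward);

sub handler ($$) {
    my($class, $r) = @_;
    # Flag the browser that the body will be gzip-encoded, then let
    # the superclass do the real work of tying STDOUT and chaining.
    $r->content_encoding('gzip');
    return $class->SUPER::handler($r);
}
```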

The commented line that comes before the call to content_encoding( ) is an attempt to "do the right thing." Browsers are supposed to send a header named Accept-Encoding
if they can accept compressed or otherwise encoded data formats. This
line tests whether the browser can accept the GZip format and declines
the transaction if it can't. Unfortunately, it turns out that many
Netscape browsers don't transmit this essential header, so we skip the
test.[[6]]
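The commented-out test looks something like this (shown here as a sketch; the exact pattern is a judgment call):

```perl
# Decline the transaction for browsers that do not advertise
# gzip support -- disabled because many browsers omit the header:
# return DECLINED
#     unless ($r->header_in('Accept-Encoding') || '') =~ /\bgzip\b/;
```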

All the compression work is done in TIEHANDLE( ), PRINT( ), and flush( ). TIEHANDLE( ) begins by invoking the superclass's TIEHANDLE( ) method to create an object blessed into the current class. The method then creates a new Compress::Zlib deflation object by calling deflateInit( ) with a -WindowBits argument appropriate for GZip files (again, we got this by reading the zlib C source code). We then add a few new instance variables to the object and return it to the caller. The instance variables include crc, for the cyclic redundancy check; d, for the deflation object; l, for the total length of the uncompressed data; and h, a flag that indicates whether the header has been printed.[[7]] Finally, TIEHANDLE( ) calls the push_handlers( ) method, installing our flush( ) method at the end of the output chain.
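A sketch of that constructor, under the assumption that the superclass provides the basic TIEHANDLE( ) behavior:

```perl
use Compress::Zlib;

sub TIEHANDLE {
    my $class = shift;
    my $self  = $class->SUPER::TIEHANDLE(@_);
    # A negative -WindowBits value suppresses zlib's own header and
    # trailer, leaving the raw deflate stream that GZip files contain.
    my $d = deflateInit(-WindowBits => -MAX_WBITS()) or return;
    $self->{crc} = crc32('');   # seed the cyclic redundancy check
    $self->{d}   = $d;          # the deflation object
    $self->{l}   = 0;           # running uncompressed byte count
    $self->{h}   = 0;           # header-sent flag
    return $self;
}
```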

The PRINT( ) method is called once each time the previous filter in the pipeline calls print( ). It first checks whether the GZip header has already been sent, and sends it if not. The GZip header is created by the gzheader( )
routine and consists of a number of constants packed into a 10-byte
string. It then passes each of its arguments to the deflation object's deflate( )
method to compress the information, then forwards whatever compressed
data is returned to the next filter in the chain (or Apache, if this is
the last filter). The subroutine also updates the running total of
bytes compressed and calculates the CRC, using Compress::Zlib's crc32( ) subroutine.
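In outline, PRINT( ) looks something like this (forward( ), which passes data to the next filter or to Apache, and gzip_header( ) are stand-ins for whatever the module actually calls):

```perl
sub PRINT {
    my $self = shift;
    unless ($self->{h}) {                 # emit the GZip header once
        $self->forward($self->gzip_header);
        $self->{h}++;
    }
    foreach my $chunk (@_) {
        # Compress the chunk; deflate() may return an empty string
        # while zlib accumulates enough data to compress efficiently.
        my($compressed, $status) = $self->{d}->deflate($chunk);
        $self->forward($compressed) if defined $compressed
                                       && length $compressed;
        $self->{l}   += length $chunk;                 # total raw bytes
        $self->{crc}  = crc32($chunk, $self->{crc});   # update the CRC
    }
}
```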

The flush( ) routine is called when the last of our chained handlers is run. Because zlib
buffers its compressed data, there is usually some data left in its
internal buffers that hasn't yet been printed. We call the deflation
object's flush( ) method to obtain whatever
is left and forward it onward. Lastly we forward the CRC and the total
length of the uncompressed file, creating the obligatory GZip footer.
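A sketch of flush( ), again using forward( ) as a stand-in for passing data down the chain:

```perl
sub flush {
    my $self = shift;
    # Drain whatever compressed data zlib is still buffering.
    my $remainder = $self->{d}->flush;
    $self->forward($remainder) if defined $remainder;
    # The obligatory GZip footer: CRC-32 and uncompressed length,
    # both as 32-bit little-endian values.
    $self->forward(pack("V V", $self->{crc}, $self->{l}));
}
```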

You can use Apache::GZip with any content handler that prints directly to STDOUT. Most of the modules given in this chapter send data via $r->print( ). Simply delete the $r-> part to make them compatible with Apache::GZip and other chained content handlers.
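The difference is whether the output passes through the tied STDOUT:

```perl
$r->print("<H1>Hello</H1>");   # talks to the request object directly;
                               # chained filters never see this data
print "<H1>Hello</H1>";        # goes through STDOUT, where the next
                               # filter in the chain can intercept it
```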

Readers who are interested in content handler pipelines should be aware of Jan Pazdziora's Apache::OutputChain module. It accomplishes the same thing as Apache::Forward but uses an object model that is less transparent than this one (among other things, the Apache::OutputChain module must always appear first on the PerlHandler list). You should also have a look at Andreas Koenig's Apache::PassFile and Apache::GzipChain
modules. The former injects a file into an OutputChain and is an
excellent way of providing the input to a set of filters. The latter
implements compression just as Apache::GZip does but doesn't buffer the compression stream, losing efficiency when print( ) is called for multiple small data segments.

Just as this book was going to press, Ken Williams announced Apache::Filter,
a chained content handler system that uses a more devious scheme than
that described here. Among the advantages of this system is that you do
not have to list the components of the pipeline in reverse order.

Other Types of Stacked Handlers

Content handlers aren't the only type of Apache Perl
API handler that can be stacked. Translation handlers, type handlers,
authorization handlers, and in fact all types of handlers can be
chained using exactly the same techniques we used for the content
phase.

A particularly useful phase for stacking is the cleanup
handler. Your code can use this to register any subroutines that should
be called at the very end of the transaction. You can deallocate
resources, unlock files, decrement reference counts, or clear globals.
For example, the CGI.pm module maintains a number of package globals
controlling various programmer preferences. In order to continue to
work correctly in the persistent environment of mod_perl, CGI.pm has to clear these globals after each transaction. It does this by arranging for an internal routine named _reset_globals( ) to be called at the end of each transaction using this line of code:

$r->push_handlers('PerlCleanupHandler',\&CGI::_reset_globals);

Your program can push as many handlers as it likes, but
you should remember that despite its name, the handler stack doesn't
act like the classic LIFO (last-in/first-out) stack. Instead it acts
like a FIFO (first-in/first-out) queue. Also remember that if the same
handler is pushed twice, it will be invoked twice.
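For example (both subroutine names here are hypothetical), the first handler pushed is the first one run:

```perl
# FIFO order: release_lock() runs before close_log() at cleanup time.
$r->push_handlers(PerlCleanupHandler => \&release_lock);
$r->push_handlers(PerlCleanupHandler => \&close_log);
```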

Method Handlers

It should come as no surprise that between the Apache
distribution and third-party modules, there exist dozens of
authentication modules, several directory indexing modules, and a
couple of extended server-side include modules. All of these modules
contain code that was copied and pasted from each other. In some cases
all but a minuscule portion of the module consists of duplicated code.

Code duplication is not bad in and of itself, but it is
wasteful of memory resources and, more important, of developers' time.
It would be much better if code could be reused
rather than duplicated, by using a form of object-oriented subclassing.
For the C-language API there's not much hope of this. Vanilla C doesn't
provide object-oriented features, while C++ would require both the
Apache core and every extension module to adopt the same class
hierarchy--and it's a little late in the game for this to happen.

Fortunately, the Perl language does support a simple
object-oriented model that doesn't require that everyone buy into the
same class hierarchy. This section describes how these object-oriented
features can be used by Perl API modules to reuse code instead of
duplicating it.

We've already looked at piecing together documents in
various ways. Here we will explore an implementation using method
handlers. There are two classes involved with this example: My::PageBase and My::Page.

Example 4-26 shows the My::PageBase class, which provides the base functionality for the family of documents derived from this class. My::PageBase stitches together a document by calling four methods: the header( ) method sends the HTTP headers, the top( ) method emits the beginning of an HTML document, including the title, the body( ) method emits the main contents of the page, and the bottom( ) method adds a common footer. My::PageBase includes generic definitions for header( ), top( ), body( ), and bottom( ), each of which can be overridden by its subclasses. These are all very simple methods. See Example 4-26 for the definitions.
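A skeleton of the class along these lines (the method bodies are abbreviated sketches; see Example 4-26 for the real definitions) might look like:

```perl
package My::PageBase;
use strict;

# Each method receives the page object and the Apache request object.
sub header {
    my($self, $r) = @_;
    $r->content_type('text/html');
    $r->send_http_header;
}

sub top {
    my($self, $r) = @_;
    my $title = $self->{title} || "untitled document";
    $r->print("<HTML><HEAD><TITLE>$title</TITLE></HEAD><BODY>\n");
}

sub body {
    my($self, $r) = @_;
    $r->print("<P>This is the body of the page.</P>\n");
}

sub bottom {
    my($self, $r) = @_;
    $r->print("<HR>generated by My::PageBase</BODY></HTML>\n");
}

1;
```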

The key to using My::PageBase in an object-oriented way is the handler( ) subroutine's use of the ($$) function prototype. This tells mod_perl
that the handler wants two arguments: the static class name or object,
followed by the Apache request object that is normally passed to
handlers. When the handler is called, it retrieves its class name or
object reference and stores it in the lexical variable $self. It checks whether $self is an object reference, and if not, it calls its own new( ) method to create a new object. It then invokes the header( ), top( ), body( ), and bottom( ) methods in turn.

The My::PageBase new( ) method turns the arguments passed to it into a blessed hash in the My::PageBase package. Each key in the hash is an attribute that can be used to construct the page. We do not define any default attributes:

sub new {
    my $class = shift;
    bless {@_}, $class;
}

We will see later why this method is useful.

As we saw in the section on the Apache::Forward module, method handlers are configured just like any other:
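For instance (the /my location mirrors the one used below):

```apache
<Location /my>
  SetHandler  perl-script
  PerlHandler My::PageBase
</Location>
```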

However, for clarity's sake, or if you use a handler method named something other than handler( ), you can use Perl's standard -> method-calling notation. You will have to load the module first with the PerlModule directive:

When My::PageBase is installed in this way and you request URI /my, you will see the exciting screen shown in Figure 4-12.

Figure 4-12. The generic document produced by My::PageBase

Naturally, we'll want to add a bit more spice to this
page. Because the page is modularized, we can do so one step at a time
by subclassing My::PageBase's methods. The My::Page class does so by inheriting from the My::PageBase class and simply overriding the body( ) method.
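A sketch of the subclass (the replacement body text is ours):

```perl
package My::Page;
use strict;
use My::PageBase;
our @ISA = qw(My::PageBase);

# Override only body(); header(), top(), and bottom() are
# inherited unchanged from My::PageBase.
sub body {
    my($self, $r) = @_;
    $r->print("<P>This page has been subclassed to say something livelier.</P>\n");
}

1;
```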

Now we need a better title for our document. We could override the top( ) method as we did for body( ), but that would involve cutting and pasting a significant amount of HTML (see Example 4-26). Instead, we can make use of the object's title attribute, which is used by the top( ) method in this way:

my $title = $self->{title} || "untitled document";

So how do we set the title attribute? This is where the My::PageBase new( ) method comes in. When it is called with a set of attribute=value
pairs, it blesses them into a hash reference and returns the new
object. To set the title attribute, we just have to call the new( ) method like this:
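For example, the constructor call can go right into the configuration file (the title string here is hypothetical):

```apache
PerlModule My::Page
<Location /my>
  SetHandler  perl-script
  PerlHandler "My::Page->new(title => 'My Snazzy Home Page')"
</Location>
```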

This wraps up our discussion of the basic techniques
for generating page content, filtering files, and processing user
input. The next chapter ventures into the perilous domain of imposing
state on the stateless HTTP protocol. You'll learn techniques for
setting up user sessions, interacting with databases, and managing
long-term relationships with users.

1.
At least in theory, you can divine what MIME types a browser prefers by examining the contents of the Accept header with $r->header_in('Accept').
According to the HTTP protocol, this should return a list of MIME types
that the browser can handle along with a numeric preference score. The
CGI.pm module even has an accept( ) function
that leverages this information to choose the best format for a given
document type. Unfortunately, this part of the HTTP protocol has
atrophied, and neither Netscape's nor Microsoft's browsers give enough
information in the Accept header to make it useful for content negotiation.

2.
Certain uses of the eval operator and "here" documents are known to throw off Perl's line numbering.

3.
At the time this was written, the Apache developers were discussing a
layered I/O system which will be part of the Apache 2.0 API.

4.
The more obvious name, Apache::Filter, is already taken by a third-party module that does output chaining in a slightly different manner.

5.
For historical reasons this facility is limited to Unix versions of
Netscape Navigator, to PowerPC versions of Navigator on the Macintosh,
and to some other Unix-based browsers such as W3-Emacs. However, now
that Navigator's source code has been released to the developer
community, we hope to see a more widespread implementation of this
useful feature.

6.
Andreas Koenig's Apache::GzipChain module, which does much the same thing as this one, contains a hardcoded pattern match for the browser type contained in the User-Agent field. You can add this sort of test yourself if you wish, or wait for the browser developers to implement Accept-Encoding correctly.

7.
At the time this chapter was being prepared, the author of Compress::Zlib, Paul Marquess, was enhancing his library to make this manual manipulation of the compressed output stream unnecessary.