Perl Practicum: Fun With Formats

by Hal Pomeranz

Before Perl became a general purpose programming language, it was
PERL: the Practical Extraction and Report Language. You can find the
evolutionary remains of Perl's humble beginnings hidden away in dark
corners of the language. Formats, for example, are a Perl language
construct with a syntax unlike that of any other Perl construct, and
their functionality can generally be emulated with other routines
(notably printf()). For these and other reasons, most people first
learning Perl seem to skip over information about formats, but if you
write any reasonable number of scripts to produce reports from long
files of data, formats can be a valuable tool.

Simple Reporting

One of the first useful Perl applications I wrote was a little program
to balance my checkbook: the application reads in a file of data
containing all of the transactions I have made to date, and prints a
nicely formatted statement with a running balance. I originally wrote
the output portion using printf() statements, but when I
gave the code to Tom Limoncelli, he sent it back to me with all of the
printf() statements replaced with format code. Darn it,
his version was nicer (but my checkbook was balanced first).

I wanted to make the data file as easy to type as possible, so the
format is very simple. The first line of the input file is the
starting balance, in pennies (no need to type a decimal point and no
floating point arithmetic). Each of the following lines represents a
transaction: four tab separated fields giving the check number or
transaction code, the date, a description, and the amount (again in
pennies). Deposits and other credits to the account are represented as
negative values (I seem to put money into my accounts much less
frequently than I take it out). Here is a simple program to read this
input file and generate a statement of the account:
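
A minimal sketch of such a program, reconstructed from the description
here rather than taken from the original listing, might look like the
following. The sample records, the variable names, and the column
widths beyond the first two fields are assumptions, and an in-memory
list stands in for the data file. The header format is spelled
STDOUT_TOP, the handle-specific name of the top format.

```perl
format STDOUT =
@<<<<< @<<<< @<<<<<<<<<<<<<<<<<<<<<<< $@####.## $@#####.##
$check, $date, $desc,                 $amount,  $balance
.

use strict;
use warnings;

# Package variables: the format above refers to them by name, and it
# appears before any lexical declaration could be in scope.
our ($check, $date, $desc, $amount, $balance);

# Hypothetical sample data standing in for the transaction file: the
# starting balance first, then tab-separated transactions, in pennies.
my @records = (
    "150000",
    "101\t01/03\tRent\t65000",
    "DEP\t01/05\tPaycheck\t-120000",
    "102\t01/09\tGroceries\t4321",
);

my $pennies = shift @records;      # running balance, kept in pennies

for my $record (@records) {
    my $cost;
    ($check, $date, $desc, $cost) = split /\t/, $record;
    $pennies -= $cost;             # credits are negative, so they add
    $amount  = $cost / 100;        # dollars, for display only
    $balance = $pennies / 100;
    write;                         # uses the format named STDOUT
}

format STDOUT_TOP =
Check  Date  Description                 Amount     Balance
------ ----- ------------------------ --------- -----------
.
```

Keeping the arithmetic in pennies and dividing by 100 only when
filling the display variables preserves the integer bookkeeping the
data file was designed for.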

The first four lines in the example are a format declaration. The
first line defines the format's name. When the write() function is
called to print a line of formatted data, it uses the format named for
the currently selected file handle. In our example, the program is
sending the report to the standard output. Note that if no format name
is specified, STDOUT is assumed, but it is always better to name
formats explicitly, even when you are using STDOUT.

The second line is a picture of how each output line will look. Each
group of characters beginning with an @ is an output field specifier -
everything else is a literal (e.g., the $ signs at the beginning of
the two money fields). Less-than (<) signs mean left justified and
greater-than (>) signs mean right justified;
the pipe symbol (|) specifies centered fields. Numeric fields are
indicated with hash marks (#) and an optional decimal point. The field
width is the number of special characters, INCLUDING the @ sign (in
the example below, the first field is six characters wide, the second
is five, etc.). This enables the picture to resemble a somewhat
abstract but perfectly aligned example of the output.
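
To see the field types side by side, here is a small sketch that is
separate from the checkbook program. The variable names and values are
made up for illustration, and the square brackets are literals added
only to make each field's width visible.

```perl
use strict;
use warnings;

# Hypothetical sample values; "our" makes them visible to the format.
our ($word, $num) = ("perl", 3.14159);

format STDOUT =
[@<<<<<] [@>>>>] [@|||||||] [@##.##]
$word,   $word,  $word,     $num
.

write;    # prints: [perl  ] [ perl] [  perl  ] [  3.14]
```

The first field is six characters wide and left justified, the second
is five wide and right justified, the third is eight wide and centered,
and the numeric field rounds its value to two decimal places.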

The third line of the declaration associates a variable with each
field. When the write() function is called, the current value of each
of the named variables is printed using the specified format. The
declaration is easier to read if you line up the variable
specifications with their associated field specifications on the line
above.

The last line of a format declaration is always a dot on a line by
itself. This terminates the format declaration.

Format declarations can appear anywhere in the program. The example above
contains two format declarations: one before the code and one
after. This was done to make the point; in your own code, I recommend
you group all formats together near the top of the script. If there
are multiple formats with the same name in the program, the one
defined last will be the one that gets used.

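If a format with the special name top is defined in the program, this
format will be printed at the beginning of each page of formatted
output. (For a particular file handle, Perl looks first for a format
named after the handle with _TOP appended, such as STDOUT_TOP.) The
special variable $= defines the number of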
lines per page; 60 is the default, but you can assign a smaller number
if you like (for example, when printing to a terminal or small
window). The special variable $- gives the number of
lines left on the current page. You can force a new page by setting
$- to 0. However: DO NOT mix print() and
printf() statements with write() or else the
$- variable will not be decremented correctly.
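
The page variables are easiest to understand on a deliberately tiny
page. The following sketch uses a made-up one-column report and a
six-line page, both assumptions chosen only to keep the output short.

```perl
use strict;
use warnings;

our $n;    # package variable so the formats can see it

format STDOUT_TOP =
Item
----
.

format STDOUT =
@>>>
$n
.

$= = 6;                  # six lines per page, header included

for my $i (1 .. 10) {
    $n = $i;
    write;               # a fresh page and header start every 4 items
}

$- = 0;                  # pretend the current page is full ...
$n = 11;
write;                   # ... so this record opens a fourth page
```

With a two-line header, each page holds four records, so the ten
records fill pages one through three and the forced break puts the
eleventh record on page four.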

Dirty Tricks

While you can define a special top format for page headers, there is
no way to define a format for page footers. There is, however, a trick
for dealing with this situation. While write() usually uses the format
named for the file handle that the output is going to, you can use a
different format by assigning the alternate format's name to the
special $~ variable. The trick, then, is to keep track of the number of
lines left on the page and emit a special footer format at the bottom
of the page. Here is the program logic for doing this:
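
A minimal sketch of that logic, using a simplified one-column report
in place of the checkbook statement, might look like this. The page
length, the sample items, and the layout of the footer are assumptions;
the names $footer_depth and footer follow the description in the text.

```perl
use strict;
use warnings;

our $n;                    # package variable so the formats can see it
my $footer_depth = 2;      # number of lines the footer format occupies

format STDOUT_TOP =
Item
----
.

format STDOUT =
@>>>
$n
.

format footer =

Page @<<
$%
.

$= = 8;                    # short page so the footer shows up quickly

my @items = (1 .. 10);     # hypothetical stand-in for the data file
while (defined(my $item = shift @items)) {
    $n = $item;
    write;
    if ($- == $footer_depth) {
        $~ = "footer";     # switch formats for a single write() ...
        write;
        $~ = "STDOUT";     # ... then go back to the body format
    }
}
```

Each eight-line page carries a two-line header, four records, and the
two-line footer. The last two records land on a third page that ends
without a footer.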

First we introduce a new footer format and a new global constant,
$footer_depth, which is the number of lines that the
footer occupies on the page. The footer format in our example uses yet
another special variable, $%, which gives the current
page number (numbered starting with 1).

Each time we emit a line with write(), we check
$- for the number of lines remaining on the page. When we
have exactly $footer_depth lines left, it is time to
write the page footer. To write the footer, we simply set
$~ to the name of the footer format (footer, in
this example), issue a write(), and then reset $~
to the usual format (STDOUT) before getting the
next line from the transaction file. This line will appear on the next
page after the header in the usual fashion.

While this method works very cleanly when each write() statement
outputs only a single line, anticipating the end of the page when using
multi-line formats can get tricky. Also notice that no footer will be
output on the last page. Additional code would have to be added after
the while() loop to output additional blank lines and the footer. This
is left as an exercise to the reader.

If you ever want to change header formats for any reason - for example
if you wanted a large header on the first page, but only minimal
headers on the other pages - you can use the special $^
variable. This variable behaves like $~, but selects the
header format instead. Never set $^ (or $~
for that matter) to a non-existent format because this will cause your
program to exit with a fatal error at run time. If you want a null
header, never define the top format at all, or set $^ to
an empty format.
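
As an illustration of switching headers, the sketch below prints a
large header on the first page and a one-line header on every later
page. The format names BIG_TOP and SMALL_TOP, the header text, and the
six-line page are all invented for this example.

```perl
use strict;
use warnings;

our $n;    # package variable so the formats can see it

format BIG_TOP =
*********************
  Quarterly Report
*********************
.

format SMALL_TOP =
(continued)
.

format STDOUT =
@>>>
$n
.

$= = 6;                    # short page to force a second page
$^ = "BIG_TOP";            # header for the first page

for my $i (1 .. 6) {
    $n = $i;
    write;
    $^ = "SMALL_TOP";      # every later page gets the short header
}
```

Both header formats exist before $^ is ever assigned, which avoids the
run-time error described above.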

Multi-Line Formats

Consider a couple of important facts about the top format in the two
examples. First, there are no field definitions anywhere in the format
declaration. It is perfectly legal to have a format with no field
declarations, though in practice you will probably only do this for
header formats.

Second, the format declaration defines multiple lines of output. This
also is perfectly legal and each line can have zero, one, or more
field declarations in it. The general pattern for multi-line format
declarations is one line of field descriptions, followed by a line
containing the variables associated with those fields, followed by
another line of field descriptions, etc.
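
That alternating pattern can be sketched with a small address-card
format. The field names and the sample record are hypothetical; the
point is that picture lines and variable lines alternate, and that a
line with no fields (the row of dashes) needs no variable line.

```perl
use strict;
use warnings;

our ($name, $phone, $city);

format STDOUT =
Name: @<<<<<<<<<<<<<<<  Phone: @<<<<<<<<<<<
$name,                  $phone
City: @<<<<<<<<<<<<<<<
$city
--------------------------------------------
.

# Hypothetical sample record.
($name, $phone, $city) = ("Ada Lovelace", "555-0100", "London");
write;    # one call emits all three lines of the record
```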

The next example shows an interesting use of multi-line formats. For
purposes of this example program, we are assuming a function called
mailparse() which processes email messages one at a time
from the standard input. For each message, mailparse()
puts all header information in a global associative array,
%header, indexed by the header tag (e.g.,
From, To) and all of the body lines in a
global scalar variable called $body. The output is shown
below the example. By the way, my editor never sent me that message: I
made it up. Like all writers, I am always early for all
deadlines. Well, that last part was a lie, but I really did make up
the email message.
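
A sketch of such a program follows. It is a reconstruction from the
description, not the original listing: the mailparse() shown here is a
hypothetical stand-in that takes its messages from an in-memory list
rather than the standard input, and the message text and field widths
are invented for this sketch.

```perl
use strict;
use warnings;

our (%header, $body);

# Invented sample message standing in for mail on the standard input.
my @messages = (
    join("\n",
        'From: editor@example.com',
        'To: columnist@example.com',
        'Subject: Reminder that your column on Perl formats is due this Friday',
        '',
        'Please send the finished text by the end of the week so that '
            . 'there is time to review it before the issue goes to press.',
    ),
);

# Hypothetical stand-in for mailparse(): fills %header and $body from
# the next message and returns false when no messages remain.
sub mailparse {
    my $message = shift @messages;
    return 0 unless defined $message;
    my ($head, $text) = split /\n\n/, $message, 2;
    %header = ();
    for my $line (split /\n/, $head) {
        my ($tag, $value) = split /:\s*/, $line, 2;
        $header{$tag} = $value;
    }
    $body = $text;
    return 1;
}

format STDOUT =
From: @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$header{From}
To:   @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$header{To}
Subj: ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$header{Subject}
~~    ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$header{Subject}

~~  ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$body
.

while (mailparse()) {
    write;
}
```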

There are a number of new constructs in the message format in this
example. First are the fields that begin with ^ instead of @. For these
fields, Perl outputs as much text as will fit in the field and then
removes that text from the string variable. By stacking several ^
fields together using the same long string, you can output that string
as a block of text with a ragged right margin, as shown in the output,
with both the body of the message and the Subj: line. The special $:
variable (last special variable in this column, I promise) is the set
of characters on which Perl can legally break the line; the
default value for $: is " \n-" (space, newline, or hyphen).

The special ~~ marker on the last line means "keep
outputting lines until all variables ($body and
$header{Subject} in this case) are exhausted." This is
useful for situations where you are not sure how long your text may
run, but you want to be able to output all of the information. You can
put the ~~ anywhere on the line, but it is best to put it
in a very visible location (the beginning of the line is almost always
best).
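
One practical consequence of the ^ fields is worth a small sketch:
because write() removes text from the variable as it goes, the
variable is empty afterwards. If the string is needed again, hand the
format a copy. The variable names and the sample sentence here are
made up for illustration.

```perl
use strict;
use warnings;

our $text;    # the format consumes this variable

format STDOUT =
~~ ^<<<<<<<<<<<<<<<<<<<
$text
.

my $original = "Formats wrap long strings onto as many lines as they need.";

$text = $original;    # write() eats $text, so give it a copy
write;

# $original is untouched; $text has been emptied by the ^ field.
print "kept: $original\n";
```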

Conclusion

I have run across many Perl programs with complex printf() blocks that
would have been much easier to write and much more readable if the
developer had used formats instead. If you need to quickly produce
reports, or output large amounts of tabulated data, formats are an
extremely effective tool.