Automated Mail Purging for SMTP Mail

Mr. Keller gives us three scripts for cleaning out old mail files automatically.

If your Linux system actively handles
electronic mail, especially for several users, you may discover
that mail consumes a lot of space where it resides, usually in one
of /var/mail (System V-compliant systems), /usr/mail (BSD),
/usr/spool/mail or /var/spool/mail (where my Debian GNU/Linux
system stores it). The collection of Bash scripts presented here
provides a way to reduce the space used by mail files by purging
old messages.

Background

For several months I worked as a contractor at a company that
used a Unix system to process all electronic mail for its
employees. Several hundred users retrieved their mail from a single
host that served as both Simple Mail Transfer Protocol (SMTP)
server (for general mail transport) and Post Office Protocol (POP3)
server (for clients to retrieve mail without logging into the
host). While it had a home-grown mail purge system, users could
inadvertently defeat the system, allowing files to grow without
end.

The existing mail purge system ran as a nightly cron job. It would
determine the date 60 days earlier and construct a string in SMTP
mail header format. It would then loop through all the user mail
files in /var/mail, using the grep command to find that string and
note the lines on which it appeared. For each file that contained
the string, it would use the tail command to discard everything
prior to the line containing the target date string.
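
The old approach can be sketched roughly as follows. This is a
reconstruction from the description above, not the company's actual
script: the function name old_purge, the use of the first matching
line and the fixed-string match are all assumptions.

```shell
#!/bin/bash
# Sketch of the old grep/tail purge (a reconstruction, not the original).
# old_purge FILE TARGET prints FILE from the first line containing TARGET
# onward. Choosing the first match is an assumption.
old_purge() {
    local file=$1 target=$2 line
    # Files that do not contain the target string are left as they are.
    if ! grep -q -F -- "$target" "$file"; then
        cat "$file"
        return 0
    fi
    # grep -n prefixes each matching line with "NUMBER:"; keep the first.
    line=$(grep -n -F -- "$target" "$file" | head -n 1 | cut -d: -f1)
    # tail -n +N prints from line N onward, dropping everything before it.
    tail -n "+$line" "$file"
}
```

Run against a file in which a March message sits above a January
one, with the January date as the target, this sketch prints only the
January message: the newer message is discarded along with
everything else above the match.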

If messages arrived undamaged, stayed intact, always appeared
in date sequence and cron called this job daily, this method would
usually work. However, sometimes cron doesn't run a job, messages
do not always arrive in date order and other software run against
mail files might reorder messages. Because of the existing
solution's all-or-nothing keep-or-discard method, it had the
following problems:

It was possible to lose newer messages if an older
date appeared after a newer one.

If a message body had an un-escaped string that
matched the search string, part of a message could be lost.

If older messages appeared out of order, they might
remain longer than desired, since the keep/discard decision was
based on simply finding a string, rather than examining each
message's date.

It depended on finding the exact date string
instead of making a numeric comparison that could discard messages
older than the target date. A message containing invalid headers
and lying at the end of the message file might remain, even if it
had aged beyond the retention period.

A further effect was that the purge could leave damaged
messages behind, which some mail readers might not handle gracefully.

A Solution

To find a more reliable solution, I sought an existing free
program that would address my needs. I asked my peers, including
those on Internet mailing lists, but got no useful response. A
search of Internet Unix archives revealed nothing, either. Thus, I
was left to create a solution.

To ensure acceptable handling of messages, I settled on these
requirements:

Each message must undergo individual examination to
keep or discard it based on its creation date.

A reliable way to determine where messages began
and ended was needed.

Because I discovered that date formats vary a bit
(mail-handling programs only loosely follow the rules), I needed to
convert each message date to a simple number and use that number
for the keep/discard choice.

I discovered that formail, part of the procmail mail-handling
package, can split an SMTP mail file into individual messages and
repair damaged headers in one pass. This ability makes it possible
to examine each message individually and decide whether to keep or
discard it. Since each message would undergo separate examination,
date order would not be important.
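
A minimal sketch of that splitting step, assuming formail is
installed and on the PATH, might look like this. The function name
filter_mailbox is hypothetical; the -s option is formail's own and
starts the given command once per message.

```shell
#!/bin/bash
# Sketch: hand every message in a mail file to a filter, one at a time.
# FORMAIL names the formail program; defaulting to "formail" on the PATH
# is an assumption.
FORMAIL=${FORMAIL:-formail}

# filter_mailbox FILE COMMAND [ARGS...]
# formail -s splits its input into messages and runs COMMAND once for
# each, feeding that message on standard input. Whatever COMMAND prints
# ends up, in order, on this function's standard output.
filter_mailbox() {
    local mailfile=$1
    shift
    "$FORMAIL" -s "$@" < "$mailfile"
}
```

For example, filter_mailbox /var/spool/mail/alice cat would pass
every message through unchanged, while a filter that prints nothing
for some messages would drop them from the output.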

How It Works

The first script, mailrm.sh (see Listing 1), requires one
command-line argument: the number of days of messages to preserve.
When run, it sets the variables it needs and then checks each mail
file in $MAILDIR. It uses formail, located in $FORMAIL, to verify,
repair and split each mail file into individual messages.

Each message is then examined by the mailage.sh script (see
Listing 2) to determine whether to keep it. First, it reads the
date from the message header's “From” line, shifting fields as
necessary. (If formail has to repair a message date, the resulting
date doesn't have a time zone in it.) Then it compares the message
date with the value computed for today minus the number of days to
retain messages. If the message is newer, the script copies it to
STDOUT, where it is collected in a temporary file. If the message
is older, the script exits with no output.
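
The keep-or-discard decision can be sketched as below. This is not
the author's mailage.sh: the function names are hypothetical, a
YYYYMMDD number stands in for the day count the real scripts
compute, and keeping a message whose date cannot be read, or whose
date equals the cutoff, is a cautious assumption.

```shell
#!/bin/bash
# Sketch of a keep-or-discard filter for one message on stdin, which is
# expected to start with a line shaped like one of these:
#   From user Tue Mar  4 09:15:00 1997        (no time zone)
#   From user Tue Mar  4 09:15:00 EST 1997    (time zone present)

# month_number MMM: print 1..12, or fail for an unknown name.
month_number() {
    case $1 in
        Jan) echo 1 ;; Feb) echo 2 ;; Mar) echo 3 ;; Apr) echo 4 ;;
        May) echo 5 ;; Jun) echo 6 ;; Jul) echo 7 ;; Aug) echo 8 ;;
        Sep) echo 9 ;; Oct) echo 10 ;; Nov) echo 11 ;; Dec) echo 12 ;;
        *) return 1 ;;
    esac
}

# keep_if_newer CUTOFF: CUTOFF is a date written as YYYYMMDD. The message
# is copied to stdout unless its "From" line is dated before CUTOFF.
keep_if_newer() {
    local cutoff=$1 first mon day rest year mm stamp
    IFS= read -r first || [ -n "$first" ] || return 0   # no input at all
    # Fields: From, sender, weekday, month, day, time, then [zone] year.
    read -r _ _ _ mon day _ rest <<< "$first"
    year=${rest##* }            # last word, which skips any time zone
    if mm=$(month_number "$mon") &&
       [[ $day =~ ^[0-9]{1,2}$ && $year =~ ^[0-9]{4}$ ]]; then
        # 10# forces base 10 so that a day such as "08" is not read as octal.
        stamp=$(( 10#$year * 10000 + mm * 100 + 10#$day ))
        if (( stamp < cutoff )); then
            return 0            # too old: exit with no output
        fi
    fi
    # Newer, or the date could not be read: pass the message through.
    printf '%s\n' "$first"
    cat
}
```

With GNU date, a cutoff for a 60-day retention period could be
produced with date -d '60 days ago' +%Y%m%d. To use the function
with formail -s, it would need to live in a small script of its own
that calls keep_if_newer "$1".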

Afterward, if the new output file has a different number of
lines from the original, it is moved into the original file's
place, its ownership is reset and its permissions are restricted.
If the original mail file is now zero length, mailrm.sh removes it.
As a result, if a user has been removed from a system but his mail
was left behind, this script deletes his mail file once all the
messages in it have expired.
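
The replace-or-remove step described above might be sketched like
this. The function name install_purged, its three parameters and the
600 permission mode are assumptions made for illustration; the
article does not state which mode the real script applies.

```shell
#!/bin/bash
# Sketch of the replace-or-remove step (not the author's mailrm.sh).
# install_purged ORIGINAL NEW OWNER
#   ORIGINAL  the user's mail file
#   NEW       the temporary file holding the messages that were kept
#   OWNER     the account that should own the mail file
install_purged() {
    local orig=$1 new=$2 owner=$3
    if [ "$(wc -l < "$orig")" -ne "$(wc -l < "$new")" ]; then
        mv -f -- "$new" "$orig"     # the purged copy replaces the original
        chown "$owner" "$orig"      # the copy was created by this script
        chmod 600 "$orig"           # owner read/write only (assumed mode)
    else
        rm -f -- "$new"             # nothing was purged; drop the copy
    fi
    # -s is true only for a file with content, so this catches empty ones.
    if [ ! -s "$orig" ]; then
        rm -f -- "$orig"
    fi
}
```

Comparing line counts is a cheap way to detect that at least one
message was dropped, since the kept messages are copied through
otherwise unchanged.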

The third script, maildate.sh (see Listing 3), returns the integer
number of days since 1900 of an input date in the form
“MMM DD YYYY”. The returned integer is useful for calculating the
difference between two dates.
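
One way to perform such a conversion in plain Bash arithmetic is
sketched below. It is not the author's maildate.sh; the function
name days_since_1900 is hypothetical, and counting 1 January 1900
as day 0 is an assumption, which does not matter when only the
difference between two dates is used.

```shell
#!/bin/bash
# Sketch of a date-to-day-number conversion (not the author's maildate.sh).
# days_since_1900 MMM DD YYYY prints the number of days from 1 Jan 1900,
# counting that day as 0 (the starting value is an assumption).
days_since_1900() {
    local mon=$1 day=$((10#$2)) year=$((10#$3)) m y leap=0 leapdays total
    # Days that precede the first of each month in a non-leap year.
    local -a before=(0 31 59 90 120 151 181 212 243 273 304 334)
    case $mon in
        Jan) m=0 ;; Feb) m=1 ;; Mar) m=2 ;; Apr) m=3 ;;
        May) m=4 ;; Jun) m=5 ;; Jul) m=6 ;; Aug) m=7 ;;
        Sep) m=8 ;; Oct) m=9 ;; Nov) m=10 ;; Dec) m=11 ;;
        *) echo "unknown month: $mon" >&2; return 1 ;;
    esac
    # Gregorian rule: every 4th year, except centuries not divisible by 400.
    if (( year % 4 == 0 && (year % 100 != 0 || year % 400 == 0) )); then
        leap=1
    fi
    # Leap days from 1900 through year-1; 460 is the same count up to 1899.
    y=$(( year - 1 ))
    leapdays=$(( y / 4 - y / 100 + y / 400 - 460 ))
    total=$(( (year - 1900) * 365 + leapdays + before[m] + day - 1 ))
    # Add this year's 29 February once the date is past it.
    if (( leap && m > 1 )); then
        total=$(( total + 1 ))
    fi
    echo "$total"
}
```

With this sketch, days_since_1900 Mar 4 1997 prints 35491 and
days_since_1900 Jan 1 1997 prints 35429, so the two dates are 62
days apart.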