Data Mining Email

I once did a contract administrating the network of a small startup company.
The owner's total existence revolved around the email generated from his
various business relationships and he never made a decision unless the issues were
first detailed in an email. My boss once confided in me of his hatred of his
email client because it limited him in the way he could organize his
information.

I was so impressed by his desperation that I found myself constantly
thinking about his dilemma. In time the answer came to me. He needed a data
mine.

What is data mining anyway? Data mining is a
class of database applications that look for hidden patterns in a group of
data.

In writing this article it quickly became apparent that you might need
several pieces of unrelated technical know-how to create a data mine so I ended
up asking myself "who am I writing this article for?". Do I write for the Perl
programmer or the DBA? Programming and managing data don't use the same set of
skills but they are nevertheless both required. So where does this article
go?

This article is for those who need a guide to generating information from
existing data and are looking for ideas on how to do it. Thomas Edison once
said that genius was 1 percent inspiration and 99 percent perspiration. Here's your 1 percent.

The Scenario

The objective is to data mine the email that resides in a Mozilla mail box.
We are especially interested in MS Word .doc attachments. We'll
parse the email and upload it into a relational database where we can then
analyze it.

The Developer's Skillset

Perl-wise, you'll need intermediate programming experience. You'll need to know how to execute command-line utilties within a Perl script. You should be able to install and use Perl modules from the CPAN, and you'll need to be familiar with using PostgreSQL from Perl.

Regarding PostgreSQL, you should know how to install PostgreSQL on Linux
either by using precompiled binaries such as RPMs, Debian packages, and tar balls or by compiling it from source code. If someone else has installed this
already, great! You should be familiar with the SQL 92 and SQL 99 standards in
that you can create tables, views, constraints, and user-defined functions.

On the command line, you need to know how to pipe standard input and output
from one process to the next.

The Assumptions

I assume you have Perl 5.8.x installed with access to the CPAN configured
correctly. You also need a standard installation of Postgres with minimal
security options configured. The example Perl script will connect to the
database via a socket connection, so uncomment the line tcpip_socket =
true in postgreql.conf.

Furthermore, the script assumes that the connection requires a password:

I'll assume that we'll use the PostgreSQL superuser account,
postgres. As well, the mbox file we'll process is
Sent and is in the same directory as the parser script.

Part 1. Searching for Mailboxes

The Mozilla mail client stores its mailboxes under the user's home
directory, usually in ~/.mozilla. Figure 1 shows the different
mailboxes of an existing profile. The mboxes are those files without any
extensions. Another hint is to look at the file size; mailboxes are normally
the largest files present.

Figure 1. Screenshot of a Mozilla mailbox.

The following command lists all the mailboxes in my Mozilla profile account:

The Tables

The most important column in the tables is the
messageid, which is a unique alphanumeric
code that is part of the email message. The tables have been
defined such that no data population can occur unless the messageid
already exists in the table mailid.

The message body and attachments are to be stored as
large objects. Many of the tables have a column of type oid, which refers to the actual data that is located in
the system catalog pg_largeobject.

Indexes are created by default in Postgres for each
primary key that is defined in a table.

The create tables script will clean the
database email and create the tables. Invoke it by executing it from the
console:

$ psql -U postgres -f createTables.sql email

Part 3. Developing the Email Parser

Install the following modules from the CPAN, along with their dependencies:

Mail::MboxParser, a simple interface that provides read-only access to Unix-mailboxes

DBD::Pg, the database interface between Perl and the PostgreSQL database.

The mbox parser script first defines some global
variables and then establishes a database connection with the PostgreSQL
server:

The main body and attachments of a message are referenced in the
main and attachment tables with an oid. Prior to
version 7.1, PostgreSQL cannot insert data into a row that exceeds the default
data page size of 8192 bytes. To get around this, we'll instead insert this
information as large objects. Upload a large object by using
psql:

decode.pl decodes the MIME Base64 message back into its
original binary encoding. There is also a PostgreSQL function for this. The
program then pipes the encoded output into the file utility to
determine if it really is a MS Word document. At the same time, it saves the
decoded attachment as the temporary file /tmp/attachment.

Any attachment that is identified as a MS Word document with a MIME type of
application/MS Word runs through the antiword utility
to turn it into an ASCII text document. The program then uploads this as a
large object.

All other attachments are saved in the database as large objects in their
original MIME encodings.

Part 4. Querying the Database

Now that the information is in the database, the next trick is to retrieve
it in a way that makes sense. I've created the following queries, based on my
own database that I used to develop this article.

First, let's summarize all the email messages that have both a main body and
an attachment. (Refer to the table creation create
tables script for table details).

SELECT m.from_email AS From,
m.subject AS Subject,
m.mailbody AS Message,
a.attachment AS Attachment
FROM main m,
attachment a
WHERE m.messageid=a.message

This will produce output similar to:

From

Subject

Message

Attachment

robert.bernier5@sympatico.ca

message 1

60227

60230

robert.bernier5@sympatico.ca

message 2

60233

60236

casestudy@postgresql.org

message 3

60239

60242

casestudy@postgresql.org

message 4

60245

60248

The Message and Attachments are oids, or reference numbers, to the large
objects now stored in the pg_largeobject catalog.

Now, let's list all messages with attachments and identify those attachments
that were MS Word documents:

SELECT m.messageid AS "Message ID",
a.attachment AS "Attachment oid",
CASE
WHEN a.mime_type='application/MS Word'
THEN 'true'
ELSE 'false'
END
AS "MS Word attachment"
FROM main m, attachment a
WHERE m.messageid=a.messageid;

This will produce output similar to:

Message ID

Attachment oid

MS Word attachment

3FF79F08.6060208@sympatico.ca

60230

true

3FF79F29.2090101@sympatico.ca

60236

false

3FF7A511.2030104@postgresql.org

60242

false

3FF7A52A.8040203@postgresql.org

60248

true

You can return the ASCII text conversion of any MS Word attachment from a
query similar to:

Regular Expressions

This query will count the number of emails with attachments and whose from
address begins with the letters ca and end in
org.

SELECT count(m.from_email)
FROM main m,
attachment a
WHERE m.messageid = a.messageid
AND m.from_email ~ '^ca'
AND m.from_email ~ 'org$';

Caveats and Suggestions

In the interests of keeping the Perl script as simple as possible, I've
assumed that all email attachments arrive encoded in Base64. As well, since I
wrote the script to follow the flow of this article, the code is a bit clumsy.

The database definitions here only demonstrate where you can make a start.
Making a database that flies like a jet is an article in itself.

Save yourself time and grief by saving SQL queries and definitions in a file
and then invoking it through the psql client. If you're paranoid, you can put these files under version control

Antiword is a bit quirky in that it will not parse files that
it considers too small.

Viewing the Mozilla mbox through its client can be deceiving. It never
really removes email from the box even if you've deleted it, so you may well
see email in your database that you thought you destroyed months before.

You can export large objects from the database by using the
lo_export function in the psql client. For example
\lo_export 123 temp.txt will save the large object with the oid
number of 132 to a file named temp.txt.

Conclusion

There are two parts to a successful data mine; the mechanism that prepares
the document and the algorithm that looks for patterns. We've looked at one way
getting email into a database but frankly, I haven't given that much coverage on
the power of regular expressions in a database environment, although some of
you Perl fiends have a pretty good idea of the possibilities.