Every once in a while I get to write a neat piece of code that I can share. This is one of those times. I realize it is not large and, by PerlMonks standards, not very elegant. The real question is how maintainable it will be over the next few years. Regardless, I like what I wrote and would like to share it.

At the Circus we suspected that we had some data leakage. Nothing like people taking off with everything needed to get home loans and rip off customers, just people not thinking about what they send through email. We didn’t know the extent of the problem, or even for certain that we had one. Our C-level executives didn’t believe that employees would be so careless with customer data. We decided to find out.

I must say that the results were actually quite positive. A couple of people had emailed work-related data home so they could work over the weekend, and there were a few emails regarding employment, but those were originated by the prospective employee.

To find out, I wrote a few scripts that hook into our email system. The one I am particularly proud of recurses through a directory of email messages and attachments, scanning each file for relevant data.

Please note that by the time these scripts touch the data, it has already been scrubbed by the antivirus and the other checks we have in place. I am only looking for keywords or regular expressions that would indicate customer-related data loss.
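To give a feel for what that kind of check can look like, here is a minimal sketch and not the production script. The pattern names and the patterns themselves are made up for illustration, as is the scan_text helper.

```perl
use strict;
use warnings;

# Hypothetical patterns, for illustration only; the real keyword
# list is not shown in this post.
my %patterns = (
    ssn_like     => qr/\b\d{3}-\d{2}-\d{4}\b/,
    account_word => qr/\baccount\s+number\b/i,
);

# Return the sorted names of every pattern that matches the text.
sub scan_text {
    my ($text) = @_;
    my @hits = sort(grep { $text =~ $patterns{$_} } keys %patterns);
    return @hits;
}

my @hits = scan_text("Please update account number 42 for SSN 123-45-6789.");
print join(', ', @hits), "\n";    # prints "account_word, ssn_like"
```

Keeping the patterns in a named hash means a hit can be reported by name without copying the matched customer data into a log.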

Let me explain the directory structure. The email system keeps a directory, /var/spool/filter, that contains every email message sent in the last 30 minutes. A cleanup process erases all the files in that directory, and that cleanup process is where I wrote the hook. Here is a sample listing of the directory.

The subroutine I am most pleased with is the one that recurses through the directory structure. The slurp command returns a hash describing the directory, and any entry that is a subdirectory is itself a hash. I look for those with the following line of code.

if (ref $structure->{$key} eq 'HASH')

That is how I find subdirectories to push onto the stack of recursive calls. As it traverses each directory, the subroutine looks at each file's extension and decides what to do with the file.
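The shape of that recursion can be sketched roughly as follows. This is a simplified illustration and not the actual subroutine. The sample $structure hash stands in for whatever the slurp step returns, and the walk subroutine, the %handler_for table, and the handler descriptions are all hypothetical names chosen for this example.

```perl
use strict;
use warnings;

# Stand-in for the slurped directory: file names map to contents,
# subdirectory names map to nested hash references (an assumption
# about the layout, based on the ref check above).
my $structure = {
    'message.txt' => 'body text',
    'attachments' => {
        'report.docx' => 'binary data',
        'photo.jpg'   => 'binary data',
    },
};

# Hypothetical handlers keyed by lowercase file extension.
my %handler_for = (
    txt  => sub { 'scan as plain text' },
    docx => sub { 'unpack, then scan' },
);

sub walk {
    my ($structure, $path, $results) = @_;
    for my $key (sort keys %$structure) {
        my $entry = $structure->{$key};
        if (ref($entry) eq 'HASH') {
            # A nested hash is a subdirectory, so recurse into it.
            walk($entry, "$path/$key", $results);
            next;
        }
        # Anything else is a file: choose an action by extension.
        my ($ext) = $key =~ /\.([^.]+)\z/;
        my $handler = defined $ext ? $handler_for{lc $ext} : undef;
        push @$results, "$path/$key: " . ($handler ? $handler->() : 'skip');
    }
    return $results;
}

print "$_\n" for @{ walk($structure, '.', []) };
```

A dispatch table keyed by extension keeps the decision in one place, so supporting another file type means adding one entry and leaving the recursion alone.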

I realize most system administrators are asking why I didn’t use the file command to make sure the script was acting appropriately for each file type, but that does not work with the new Microsoft document types.