a year of spam

Yesterday was the last day of my data collection: saving all of the spam I received in one year. It started March 1, 2006 and ended February 28, 2007. In the end, I received 7052 spam messages. The first was from Stewart Duane (blazqzathry [at] artisitcallyyours [dot] com). This guy seemed very concerned that I might be “Tortured with health problems” and assured me I was “one click away from healthy life”. The last arrived last night. In an email entitled simply “throughout”, Moyer Stephen (ykb [at] ukvending [dot] co [dot] uk) offered a hot stock tip as well as a lengthy note that was completely incomprehensible.

My guess is that 7052 messages over the year (or roughly 19 to 20 messages per day) is about average, perhaps even a little low. My email id appears on only a handful of pages on Penn State servers. Harvesting my email address is possible, but the right set of pages must be traversed. On the other hand, our campus’ webmaster had his id (or perhaps a webmaster alias – I can’t recall) appearing on the 1000+ pages for our campus. Where I get 20 spam messages a day, he gets between 100 and 200.

Reaching an individual at Penn State is not that difficult. Penn State uses a simple scheme to create user ids for its employees and students – three letters to capture initials, with an appended counter to distinguish one person from another – and assigns each an email address of the form id@psu.edu. For example, xxx123@psu.edu. The ids are easy to generate – it doesn’t require much thought. Even if you took a conservative approach and opted not to generate identifiers with q, z or x (as they may have a low frequency of occurrence) and decided to use only 1 through 20 as the appended numbers, you would still generate over 243,000 email ids. Use numerical suffixes up to 100 and you generate over 1.2 million.
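The counts above are easy to check. Here is a minimal Python sketch of the arithmetic, assuming ids are exactly three lowercase letters (skipping q, z and x) followed by a number from 1 up to some maximum. The name `count_ids` is just something I made up for illustration; it is not anything Penn State or a spam engine actually uses.

```python
import string

# Letters assumed to be skipped because they are rare in initials.
EXCLUDED = set("qzx")
LETTERS = [c for c in string.ascii_lowercase if c not in EXCLUDED]


def count_ids(max_suffix):
    """Count ids made of three letters plus a number from 1 to max_suffix."""
    return len(LETTERS) ** 3 * max_suffix


print(count_ids(20))   # 243340
print(count_ids(100))  # 1216700
```

With 23 usable letters there are 23 × 23 × 23 = 12,167 letter combinations, so every extra suffix value adds another 12,167 candidate addresses.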

I have used my Penn State email account very sparingly. I have actively avoided providing it to any list-serv I didn’t consider safe and I never use it for forums. Still, there was a good deal of work I had to do to pick through the spam messages and pull out what I would call “self-inflicted spam”. These include emails from retailers, or from third parties related to a vendor or retailer I had to work with. They amounted to only a few. I did, however, count as spam any unsolicited email whose topic was related to my job (e.g., spam targeted at educators), though those were rather rare.

What was I hoping to accomplish? First, I was interested in knowing how many I would accumulate over the course of a year. I knew I received spam but had no handle on how many messages I actually received. If you take the number of spam messages I received and multiply that by the number of Penn State ids that might be generated by a spam engine, the numbers are huge. Given our conservative approach to generating ids (roughly 1.2 million of them, using suffixes up to 100), Penn State would be handling at least 8.4 billion spam messages a year – about 23.2 million per day. Since I know of ids that have a suffix number over 5000, and considering my spam rate is at best average, this estimate is probably incredibly low.
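For anyone who wants to redo the estimate, here is a rough Python sketch of the multiplication. It assumes every generated id receives the same 7052 messages a year that I did, and it uses the rounded figure of 1.2 million ids; both are assumptions for the sake of a back-of-the-envelope number, and the constant and function names are my own.

```python
SPAM_PER_YEAR = 7052   # messages I received over the year
ID_COUNT = 1_200_000   # rounded count of generated ids (assumption)
DAYS_PER_YEAR = 365


def yearly_estimate(per_user=SPAM_PER_YEAR, ids=ID_COUNT):
    """Total messages per year if every id gets the same amount of spam."""
    return per_user * ids


def daily_estimate(per_user=SPAM_PER_YEAR, ids=ID_COUNT):
    """The same total spread evenly across the days of the year."""
    return yearly_estimate(per_user, ids) / DAYS_PER_YEAR


print(f"{yearly_estimate():,} messages per year")
print(f"{daily_estimate():,.0f} messages per day")
```

Changing `ID_COUNT` to cover the larger suffixes that actually exist pushes the total up in direct proportion.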

Of course, this is only the email that arrives and is of no use to anyone. The typical user probably receives considerably more valid email than spam.

At some point in the near future, I am hoping to parse through the messages, categorize them and look for patterns. Spam is definitely interesting. As spam filters evolve to be more effective, spam techniques evolve to avoid being detected. I am sure we have all seen every possible way to spell Viagra via a mix of letters and special characters.

In any event, the exploration will be interesting. Though I have to wonder if my self-image will survive after thousands of messages that question the effectiveness of my manhood…