Category Archives: Commentary

When web-based email services like Hotmail were still quite new, the barrage of spam that they received was a source of constant complaint for their users. As the internet and spam filtering techniques have evolved, I would argue that spam has become less of a problem in email as the classifiers have become more sophisticated.

In the last couple of months, either the Windows Live Hotmail spam filtering algorithms have decided to take a nap or the spammers have found a really clever way of slipping through undetected with what a human would consider as easily identifiable spam.

How Do Email Spam Filters Work?

Early versions of email spam filters were very primitive and the mere presence of a word in an email was enough to push it into the spam queue. Fortunately, those simple measures have been replaced by the more robust Bayesian spam filtering, which is supported by a lot of customisation.

Bayesian spam filtering decomposes an email into tokens, normally words but sometimes other markers, and then compares how often each token appears in that email against a known sample of spam and legitimate email to determine whether it is spam or not. That strategy allows a user to use a trigger word in their email and not have it marked as spam, because the other content in the email doesn’t push the classification of the email high enough.
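To make the idea concrete, here is a minimal sketch of a naive Bayes classifier in Python. It is not how Hotmail or any real filter is implemented; the class name, the tokeniser and the tiny training set are all made up for illustration. It simply shows how a single trigger word such as "buy" can be outweighed by the rest of the message.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Toy classifier; assumes at least one spam and one ham message are trained."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(tokenise(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_messages = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_tokens = sum(self.counts[label].values())
            # Start from the prior: how common this class was in training.
            score = math.log(self.messages[label] / total_messages)
            for token in tokenise(text):
                # Add-one (Laplace) smoothing so unseen tokens never give log(0).
                likelihood = (self.counts[label][token] + 1) / (total_tokens + len(vocab))
                score += math.log(likelihood)
            log_score[label] = score
        # Turn the two log scores back into a probability of spam.
        return 1 / (1 + math.exp(log_score["ham"] - log_score["spam"]))


# Made-up training sample, purely for illustration.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("meeting notes attached for monday", "ham")
spam_filter.train("can you buy milk on the way home", "ham")
spam_filter.train("lunch on monday with the team", "ham")

spammy = spam_filter.spam_probability("buy cheap pills now")
friendly = spam_filter.spam_probability("can you buy lunch on monday")
```

With this sample, `spammy` comes out well above 0.5 and `friendly` well below it, even though both messages contain the word "buy".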

Manual Email Spam Classification

Below I’ll break down the first highlighted email in the inset image above into its individual components and show you why it is easy for a human to identify but slightly more complicated for a spam filter.

Sender Name

Harry Long

The surname might be a play on words to do with the size of the sender’s penis or it could just be a randomly chosen surname. If you scan across the other senders’ names in the above image, you’ll notice they are all quite straightforward, like Walter Carlson or Paula Santos. What you aren’t seeing are very obviously spammy sender names like Free Porn or Barbara Big Boobs, so there has definitely been some learning taking place.

Sender Email

harry-long7125@hotmail.com

There is nothing unusual about the sender’s email address; it is commonplace these days for Hotmail email addresses to contain numbers as they have so many users. Checking across the other spam emails, all of the sender email addresses seem quite reasonable and have some amount of correlation to the sender’s name, and none of them contain clearly identifiable spam trigger words.

Does the Hotmail spam filtering treat email from other Hotmail email addresses as less likely to be spam than an email originating from outside of their service? Could they be placing too much emphasis on the security of their signup process to weed out spammers? I know at some point Gmail had their signup process brute forced and the spammers were able to systematically overcome the CAPTCHA.

Recipient Email

amakye@hotmail.de

To an English speaker, the recipient’s email address might look like more gibberish, however you’d be wrong. Amakye is actually a name – for a real world example, consider Amakye Dede, one of Ghana’s premier musicians.

As a trait of high quality messages, do people that forward an email on to someone else regularly forward it on to multiple recipients? Given that all of the spam emails seen in the image above have a single additional recipient, I believe that must be quite common.

Following on from the possibility of Hotmail spam filtering providing brownie points for the sender coming from a Hotmail email address, it surely isn’t a coincidence that each of the emails above also has another Hotmail user as the additional recipient.

Subject

FW: MikaPMizunoMTakesGOneRHardACockFAfterTAnotherZInBThisLGangbang

Working from left to right, you can see the message has supposedly been forwarded on to me, signified by the classic FW: prefix on the subject. Scanning through the other highlighted emails, they all contain the same FW: prefix; maybe a forwarded email gains brownie points with the spam classifiers as being less likely to be spam.

Next up is the subject itself. At a quick glance it looks like gibberish, until your eye catches actual words in the subject and then rescans it, seeking out other real words amongst the nonsense. Suddenly you’re left with a subject more like:

FW: Mika Mizuno Takes One Hard Cock After Another In This Gangbang

I can only assume that the word cock isn’t being picked up because it doesn’t have a space or recognised word separator on either side of it. I also think that using a different capital letter each time as the word separator must have something to do with it as well – as an example, if the subject used a full stop, hyphen or pipe character it’d be crystal clear.
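As a rough illustration of why that trick works, here is a small Python sketch comparing a whitespace tokeniser with one that looks for capitalised words. The subject line is a harmless made-up one that follows the same pattern, and both function names are hypothetical; I have no idea what tokeniser Hotmail really uses.

```python
import re


def whitespace_tokens(subject):
    """What a simple filter might see: tokens split on whitespace only."""
    return subject.lower().split()


def capitalised_words(subject):
    """Pull out runs of one capital followed by lowercase letters.

    A lone capital used as a separator is always followed by another
    capital, so it never starts a match and silently drops out.
    """
    return [word.lower() for word in re.findall(r"[A-Z][a-z]+", subject)]


# Hypothetical subject built the same way as the spam subject above.
subject = "FW: CheapPWatchesMOnGSaleRToday"

naive = whitespace_tokens(subject)
recovered = capitalised_words(subject)
```

Here `naive` is `['fw:', 'cheappwatchesmongsalertoday']`, which contains no recognisable trigger word, while `recovered` is `['cheap', 'watches', 'on', 'sale', 'today']`.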

Message

VCery SWexy YounUg whiCte ChicVk wiGth a NiQce AsRs BoQoty gPets F.Lu.c.ked bLy
http://younghotasia55198146n.com
by now."Corey clicked at him in disgust. The others of their pod had long gone

The message content itself is less obfuscated than the subject and far easier to read in a single glance. I think the important thing about the message content is that no trigger words such as ass, arse, booty, young or fucked are present and spelled correctly.
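The obfuscation in the message body looks like it follows two simple rules: a stray capital letter is dropped into the middle of each word, and the occasional word is broken up with full stops. A sketch of undoing that in Python might look like the following. The function names and the sample text are made up, and a real filter would need to be far more careful, as this would happily mangle legitimate words like iPhone or acronyms like USA.

```python
def normalise_word(word):
    """Strip full stops, then drop any capital that is not the first letter."""
    stripped = word.replace(".", "")
    if not stripped:
        return stripped
    rest = "".join(ch for ch in stripped[1:] if not ch.isupper())
    return stripped[0] + rest


def normalise(text):
    """Apply normalise_word to every whitespace-separated word."""
    return " ".join(normalise_word(word) for word in text.split())


# Hypothetical text obfuscated the same way as the spam message.
cleaned = normalise("ChWeap WatcQhes fRor S.Ra.l.e")
```

After this step `cleaned` is `Cheap Watches for Sale`, which a word based classifier has a much better chance of scoring properly.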

As for the domain name, the spammers are no longer using domain names with words that’d be clearly identified as spam. In this particular instance, I am surprised that the spam classifier isn’t smart enough to identify that the domain itself isn’t in a common format; when was the last time that you visited a domain that had 8 numbers in it?
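A check along those lines is easy enough to sketch in Python. The threshold of four digits is an arbitrary assumption of mine, the function names are hypothetical and the example domain is made up; it would also flag links that use a raw IP address, which may or may not be what you want.

```python
from urllib.parse import urlparse


def digit_count(url):
    """Count the digits that appear in the host name of a URL."""
    host = urlparse(url).hostname or ""
    return sum(ch.isdigit() for ch in host)


def looks_machine_generated(url, max_digits=4):
    """Flag host names with more digits than the (assumed) threshold."""
    return digit_count(url) > max_digits


# Hypothetical domain in the same style as the one in the spam message.
suspicious = looks_machine_generated("http://example12345678x.com")
ordinary = looks_machine_generated("http://www.w3.org/")
```

On its own this would be a weak signal, but as one more input to a classifier it could help push emails like these over the line.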

I suspect that the last sentence in the message is nothing but a token sentence so the entire email doesn’t look completely bogus. I find it interesting that they’ve mixed in a few words which would seldom be seen together such as clicked, disgust and pod – all of which would typically never be seen in a spam email.

As you can see from the above, to a human it is just so simple to spot spam – however it just isn’t that simple when you need to write software to identify spam while making sure not to falsely identify real email.

Every day that these spam email messages keep landing in my inbox, I keep group selecting them and reporting them to Hotmail. I wait patiently hoping that someone will answer my calls to try and tackle whatever issues their spam classifiers are having identifying those spam emails.

In the last few days, a homeless man in Ohio named Ted Williams has become a viral internet sensation because he has a golden radio voice. Ted, a panhandler, was standing at an intersection when someone took a short video of him and uploaded it to the internet:

The simple handwritten cardboard sign Ted was holding reads:

I have a God given gift of voice. I’m an ex-radio announcer who has fallen on hard time. Please! any help will be greatfully appreciated :) Thank you and God bless you, happy holidays.

Since the video was taken yesterday, God has clearly shined a light on Ted as his fortunes are turning around. While being interviewed on local radio station WNCI, Ted said he has been offered a full time job with the Cleveland Cavaliers and a house! If a source of income wasn’t the most amazing gift for a homeless person, a house to live in must seem like all his Christmases have come at once.

Now onto the dirty, filthy, grubby part of the story and it relates to unscrupulous individuals capitalising on the viral nature of the Ted Williams sensation and trying to cash in for some quick, easy dollars.

YouTube provides related or suggested videos to the right of the video you’re watching on their site. When watching the embedded video above on YouTube, the image to the right is what I’m seeing at the moment. Highlighted are five videos belonging to four different YouTube accounts, each beginning with ‘usnews<number>’ with a similar but different copy of the video.

If you click the related video, the user has essentially made a still video with an overlay stating that they can’t show you the video and you’ll need to click the link in the video description to watch the video. As soon as the overlay popped up and the video wasn’t available, it was clear it was just a scammer trying to make some money.

Digging a little deeper, you can see that each of the ‘usnews<number>’ accounts has uploaded the video between six and a dozen times. Each video has a different name, a slightly different length and links to a page on the same domain for the user to watch the video with ads plastered all over it.

They are employing fairly standard video search engine optimisation techniques, by providing a good video name, categories and description. By uploading dozens of copies of the same video but changing it each time, they are maximising their opportunity of having one of their videos rank when a YouTube user goes looking for Ted Williams. At the time of writing this, usnews57 has 10 different versions of the fake video which have been viewed just under 2500 times.

It disappoints me no end that people will go to these lengths to make a dollar. I know it is easy and no one is necessarily being hurt by their actions, but the fact that they would take advantage, even indirectly, of a homeless man and his amazing good fortune is sickening.

A number of the standards that web development relies on today were forged through the World Wide Web Consortium, or W3C. Over the years they’ve published countless documents on their web site outlining in great detail the various standards, such as what HTML tags are allowed to be nested within a <table> element in XHTML 1.0 Strict.

At some point they published a great document stating that cool URIs don’t change. The general thrust of the document is that once a document is published on the internet, its URI should be permanently available as you never know who might link to it or consume it. As every user of the internet can attest, there is nothing more frustrating than following a link from one web site to another and being greeted with the infamous 404 Page Not Found error.

While randomly clicking some links on my blog today, I noticed that I had two links in the footer to the HTML and CSS validation services. I haven’t clicked on those links for a long time, but for some reason today I did and was greeted with a 404 error. It would appear that over the course of time, the W3C has very subtly updated the URI that the CSS validation service exists at:

Old: http://jigsaw.w3.org/css-validator/referer

New: http://jigsaw.w3.org/css-validator/check/referer

I figure maybe someone at W3C will see this pop up, notice that one of their older, heavily referenced URIs no longer works properly and put in the appropriate redirect.
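For what it is worth, the redirect itself is about as simple as server logic gets. The Python sketch below is purely illustrative: I don’t know how the W3C servers are configured, the `MOVED_PERMANENTLY` table and `resolve` function are hypothetical, and a real server would also return 404 for paths that don’t exist rather than assuming everything else is fine.

```python
# Hypothetical lookup table of moved paths: old path -> new path.
MOVED_PERMANENTLY = {
    "/css-validator/referer": "/css-validator/check/referer",
}


def resolve(path):
    """Return an (HTTP status, path) pair for a requested path.

    Moved paths get a 301 pointing at their new home; every other
    path is simply served as-is in this simplified sketch.
    """
    if path in MOVED_PERMANENTLY:
        return 301, MOVED_PERMANENTLY[path]
    return 200, path


old_link = resolve("/css-validator/referer")
new_link = resolve("/css-validator/check/referer")
```

A 301 (Moved Permanently) tells browsers and search engines alike that the old URI still means something, which is really the whole point of the cool URIs document.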

In the last week or so there have been a number of publishing errors on Search Engine Land, which I consider to be major mistakes that should have been caught by the authors, writers, editors and a good quality workflow.

I would have expected that the original authors of those two respective articles would have previewed their work before pushing the publish button. Of course, it is quite possible that the people that wrote the articles don’t have accounts to log in to Search Engine Land, in which case I would have expected them to check their work once one of the editors had published it and provide feedback if necessary.

It appears that hasn’t happened and the workflow that Search Engine Land have implemented isn’t solid enough to catch even the most glaring of oversights. It is a shame really, because they produce a fantastic web site with valuable content throughout, but the simplest of things like the above tarnish their work in my eyes – it kind of suggests that they don’t care about it as much as I would have hoped.

I came across a clever web site named Wikirank, which provides visualisation tools to explore and compare the usage data from wikipedia.org.

If you’re wondering how Wikirank could manage that, wikipedia.org provide access to their web server traffic logs as a service to the community for free. Wikirank consumes that public data, analyses it and provides a convenient way to see what topics on wikipedia.org are popular at the moment.

Wikirank isn’t just a tool to find out what is popular at the moment though, it also lets you view the usage data on a nominated page over time, up to the last 90 days. That sort of functionality is great, as it lets you see how a particular topic is being received among the community. Not wanting to stop there though, Wikirank also lets you compare different topics as well. The example on the Wikirank home page at the moment is who is more popular out of John, Paul, George or Ringo from The Beatles and according to Wikirank, John Lennon is nearly twice as popular as Paul McCartney.
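I have no idea how Wikirank is built, but the core of totalling up page views and comparing two topics can be sketched in a few lines of Python. The log format below (date, page title, view count) is a simplified one I have assumed for illustration rather than the real Wikipedia log format, and the sample figures are made up.

```python
from collections import defaultdict


def total_views(log_lines):
    """Sum the view counts per page title from 'date title views' lines."""
    totals = defaultdict(int)
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _date, title, views = parts
        totals[title] += int(views)
    return dict(totals)


# Made-up sample in the assumed format, not real Wikipedia data.
sample = [
    "2009-01-01 Page_A 120",
    "2009-01-01 Page_B 60",
    "2009-01-02 Page_A 80",
    "2009-01-02 Page_B 40",
]

totals = total_views(sample)
ratio = totals["Page_A"] / totals["Page_B"]
```

In this sample Page_A ends up with twice the views of Page_B, which is the same sort of comparison as the Beatles example above.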

I think Wikirank is going to be a fantastic companion to the primary wikipedia.org web site. It’d be fascinating if they spun off a wikianalytics.com, broke down the usage data from wikipedia.org and allowed people to explore that data in a similar but cut-down fashion to what Google Analytics provides.

Everyone makes mistakes; it is unavoidable. However, when you’re paying the sort of money it takes to advertise on a high visibility web site like news.com.au, you’d think that someone would have gone through and checked everything was in place before approving the creative to go live on the site.

I figure the air conditioning isn’t working in the CommBank offices and they are just burning money to keep warm.

Today the automatic update kicked in for Java on my notebook, which it does quite regularly. I love the fact that different products implement a relatively unobtrusive upgrade to keep their software up to date; I know that if they didn’t, all of my non-critical software would quietly go out of date.

During this particular update, I happened to notice (not sure if it was there before) that Sun are now bundling, optionally of course, the Google Toolbar with the Java installer. I’m all for providing the automatic update, however I don’t believe they should be bundling additional software, optional or otherwise, with an automatic update.

I have no issue if you just installed Java for the first time and you have chosen to install the additional software, however adding it into an update and having it enabled by default is just a little too slimy for my liking.

I’m an advocate for sensible usability on web sites and fully support the usability guidelines that recommend descriptive link text. There are measurable improvements to a user’s browsing experience when a webmaster makes a conscious decision to use useful link text, instead of an uninformative ‘click here’.

One particular aspect of useful link text that I try to abide by at all times, is that the link text should be descriptive and should reflect the resource that it is linking to. As an example, if you’re linking to a web page about the Porsche 911 GT3 RS, then a useful link might be Porsche 911 GT3 RS.
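If you wanted to audit your own pages for this, a rough Python sketch using the standard library HTML parser could look like the following. The list of vague phrases is just my own assumption, the class name and the URLs in the example are hypothetical, and it only handles plain text links, not image links.

```python
from html.parser import HTMLParser

# Assumed list of uninformative phrases; extend to suit your own site.
VAGUE_LINK_TEXT = {"click here", "here", "read more", "more", "link"}


class LinkTextChecker(HTMLParser):
    """Collect (text, href) pairs for links whose text is uninformative."""

    def __init__(self):
        super().__init__()
        self._href = None  # href of the link currently being read, if any
        self._text = []
        self.vague_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace before comparing.
            text = " ".join("".join(self._text).split())
            if text.lower() in VAGUE_LINK_TEXT:
                self.vague_links.append((text, self._href))
            self._href = None


checker = LinkTextChecker()
checker.feed(
    '<p>Read about the <a href="/cars/911">Porsche 911 GT3 RS</a>, '
    'or <a href="/cars/911">click here</a>.</p>'
)
```

After feeding it that snippet, `checker.vague_links` holds only the ‘click here’ link, while the descriptive link passes untouched.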

TechCrunch, a popular technology site, has various web real estates that it promotes at every opportunity – however I think of late they are going a little too far with their frivolous, slap happy linking. Recently, the Governor of California, Arnold Schwarzenegger, announced that California has secured the manufacturing plant from Tesla, bringing it back from New Mexico.

In the article on TechCrunch, they provide a number of links (link text and URI below):

Tesla Motors, http://www.teslamotors.com/

the Roadster model, http://www.crunchgear.com/tag/tesla/

“Come with me if you want to live”, http://www.youtube.com/watch?v=hHV6OzHjWV8

“Do it, do it now”, http://www.youtube.com/watch?v=u6ALySsPXt0

and my beef is with the second in the above list. When viewing that article, I expected that link to take me to the Roadster vehicle home page within the Tesla Motors site; instead it took me off to a completely useless page regarding Tesla Motors (the company) within their business information site CrunchGear.

I’m all for TechCrunch promoting their other web assets, however I’m confident that their readers would enjoy their site that much more if they’d find a more appropriate manner in which to promote CrunchGear instead of deceptively linking into that site.

In February, I wrote about receiving comment spam from a guy by the name of Jim Mirkalami. Since that time, there have been a lot of different people writing about the spam that they’ve received from our friend Jim; however it appears that he isn’t liking the newfound attention that he is receiving.

This week, I received what would otherwise be considered a cease and desist type of comment. It surprises me that Jim would now be spamming more people telling them to stop writing about him and using his name, when it was clear that was his intention in the first place.

In any case, Jim is just going to have to suck it up like everyone else online as it isn’t going to get removed from anyone’s site in a hurry.