Sam Saffron

Using rather simple techniques I have practically eliminated all of the spam on this blog; here is a technical explanation.

Anyone who has a blog knows about the dirty little spammers, who toil hard to make the Internet a far worse place.

I knew about this issue when I first launched my blog, and quickly wired up akismet as my only line of defence. Over the years I got a steady stream of rejected spam comments with the occasional false-positive and false-negative.

Once a week I would go to the spam tab and comb through the mountains of spam to see if anything was incorrectly flagged, approve it, then nuke the rest.

Such a waste of time.

## Akismet should never be your only line of protection

Akismet is a web service that prides itself on the huge amount of blog spam it traps.

It uses all sorts of heuristics, machine learning algorithms, Bayesian inference and so on to detect spam.

Every day, people around the world ship it well over 31 million bits of spam for it to promptly reject. My experience is that the vast majority of comments on my blog were spam. I think this number is so high because we programmers have dropped the ball.

Automated methods of spam prevention can solve a large amount of your spam pain.

##Anatomy of a spammer

Currently, the state-of-the-art for the sleaze-ball spammers on the Internet is very similar to what it was 10 years ago.

The motivation is totally unclear: how could an indecipherable message advertising nonsense help anyone make money?

The technique however is crystal clear.

A bunch of Perl/Python/Ruby scripts are running amok, posting as many messages as possible on as many blogs as possible.

These scripts have been customised to work around the various protection mechanisms that WordPress and phpBB implement. Captcha solvers are wired in, known JavaScript traps are worked around, and so on.

However, these primitive programs are yet to run full headless web browsers. This means they have no access to the DOM, and they can not run JavaScript.

## The existence of a full web browser should be your first line of defence

I eliminated virtually all the spam on this blog by adding a trivial bit of protection:

I expect the client to reverse a random string I give it. If it fails to do so, it gets a reCAPTCHA. This is devilishly hard for a bot to cheat without a full JavaScript interpreter and DOM.

Of course if WordPress were to implement this, even as a plugin, it would be worked around using the monstrous evil-spammer-script and added to the list of 7000 hardcoded workarounds in the mega script of ugliness and doom.

My point here is not my trivial spam prevention scheme, which rivals FizzBuzz in its delicate complexity.

There are an infinite number of ways you can ensure your users are using a modern web browser. You can ask them to reverse, sort, transpose, truncate, duplicate a string and so on … and so on.

In fact you could generate JavaScript on the server side that runs a random transformation on a string and confirm that happens on the client.
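A rough sketch of what that server-side generation could look like, assuming a small pool of transformations (the pool, names and wire format here are my own invention):

```javascript
// Each entry pairs a client-side JavaScript expression with the equivalent
// server-side function, so the server knows the expected answer.
const transforms = {
  reverse:   { js: "s.split('').reverse().join('')", fn: s => s.split("").reverse().join("") },
  sort:      { js: "s.split('').sort().join('')",    fn: s => s.split("").sort().join("") },
  duplicate: { js: "s + s",                          fn: s => s + s },
};

function makeChallenge(input) {
  // Pick a random transformation for this page load.
  const names = Object.keys(transforms);
  const t = transforms[names[Math.floor(Math.random() * names.length)]];
  return {
    // The browser evaluates this expression and posts the result back.
    clientScript: `(function (s) { return ${t.js}; })(${JSON.stringify(input)})`,
    // The server keeps this to compare against the posted answer.
    expected: t.fn(input),
  };
}

const { clientScript, expected } = makeChallenge("xk3f9a");
```

Because the transformation changes per request, a hardcoded workaround in a spam script stops paying off; the bot has to actually run the JavaScript.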

Possibly this could be outsourced. You could force clients to make a JSONP call to a third party that shuffles and changes its algorithms on an hourly basis, then make a call on the server to confirm.

## reCAPTCHA should be your second line of defence

Notice how I said reCAPTCHA, not CAPTCHA. The beauty of the reCAPTCHA system is that it helps make the world a better place by digitising content that our existing OCR systems failed at. This improves the OCR software Google builds, it helps preserve old content, and provides general good. Another huge advantage is that it adapts to the latest advances in OCR and gets harder for the spammers to automatically crack.

Though sometimes it can be a bit too hard for us humans.

CAPTCHA systems, on the other hand, are a total waste of human effort. Not only are many of the static CAPTCHA systems broken and already hooked up in the uber-spammer script, your poor users are doing no good solving them.

There is a tiny fraction of users who seem obsessed with running JavaScript-less web browsers, using addons such as NoScript to provide a much “safer” Internet experience. I totally understand the reasoning; however, these users can deal with some extra work. The general population has fully functioning web browsers and never needs to hit this line of defence.

## Throttles, IP bans and so on should be your last line of defence

No matter what you do, at a big enough scale some bots will attack you and attempt to post the same comment over and over on every post. If the same IP address is going crazy all over your website, the best protection is to ban it.
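A minimal in-memory throttle along these lines could look like the sketch below; the limits are made up for illustration (at most five comments per IP per minute, with an outright ban for an IP that blows past the limit):

```javascript
const WINDOW_MS = 60 * 1000; // one-minute sliding window
const MAX_PER_WINDOW = 5;    // comments allowed per IP per window

const hits = new Map();   // ip -> timestamps of recent comment attempts
const banned = new Set(); // ips that went crazy

function allowComment(ip, now = Date.now()) {
  if (banned.has(ip)) return false;
  // Keep only attempts inside the window, then record this one.
  const recent = (hits.get(ip) || []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  hits.set(ip, recent);
  if (recent.length > MAX_PER_WINDOW) {
    banned.add(ip); // same IP hammering every post: ban it
    return false;
  }
  return true;
}
```

In production you would back this with something like memcached or Redis rather than process memory, and expire bans eventually, but the shape is the same.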

### I am not sure where Akismet fits in

For my tiny blog, it seems, Akismet is not really helping out anymore. I still send it all the comments for validation, mainly because that is the way it has always been. It now has a secondary, optional status.

My advice would be, get your other lines of defence up first, then think of possibly wiring up Akismet.

## What happens when the filthy spammers catch up?

Someday, perhaps, the spammers will catch up, get a bunch of sophisticated developers and hack up Chromium for the purpose of spamming. I don’t know. If and when this happens, we still have another line of defence that is implementable today.

### Headless web browsers can be thwarted

I guess some day a bunch of “headless” web browsers will be busy ruining the Internet. A huge advantage of the new canvas APIs is that we can now confirm pixels are rendered to the screen with the getImageData API. Render a few colours to the screen, read them out and make sure they rendered properly.

Sure, this will trigger a reCAPTCHA for the less modern browsers, but we are probably talking a few years before the attack of the headless web browsers.

And what do we do when this fails?

### Enter “proof of work” algorithms

We could require a second of computation from people who post comments on a blog. It is called a “proof of work” algorithm. Bitcoin uses such an algorithm. The concept is quite simple.

1. The server hands the client a random string, eg: ABC123
2. The client appends a nonce to the string, eg: ABC123!1
3. The client computes a hash of the combined string.
4. If the hash starts with 000, or matches some other predefined rule, the client stops and submits the nonce.
5. Otherwise, it increases the nonce and repeats step 3, eg: ABC123!2

This means you are forcing the client to do a certain amount of computation prior to allowing it to post a comment. This can heavily impact any automated processes busy destroying the Internet: they need to run more computers on their quest of doom, which costs them more money.

### There is no substitute for you

Sure, a bunch of people can always run sophisticated attacks that force you to disable comments on your blog. It’s the sad reality. If you abandon your blog for long enough it will fill up with spam.

That said, if we required that all people leaving comments have a fully working web browser, we would drastically reduce the amount of blog spam.

I have seen this done before in a few spots. The thing is that submitting a simple number is always going to be trivial for a bot; performing a computation on a string is orders of magnitude harder for these bots.

I have almost the exact same experience from running my own tiny blog and I've been planning to implement something very much like this for some time now. Thanks to you I will finally do it. Going through spam and looking for false positives is absolutely a complete waste of time!

“we can now confirm pixels are rendered to the screen with the getImageData API”

I didn't know this – that's awesome and definitely something I'll be taking advantage of with the rewrite of MotionCAPTCHA, where the visitor needs to draw a shape to submit the form (currently all client side, but will be rewritten so that even headless browsers can't too easily solve it…)

Nice approach. One idea I had was not to simply try decide if a comment was spam or not-spam, but have a middle ground, a “we're not sure” state (after some basic checks to weed out blatantly good or bad comments).

If you get a comment that you're unsure of, perhaps ask the user to fill in a CAPTCHA (or some other client-side challenge). Or follow their links to see if those are spammy, or send it to Akismet, etc.

Could also be an idea to look at user interaction, see how long they spend on your site, whether they act “normally” when posting a comment (e.g. using JS to monitor mouse movement/key presses etc).

I've had issues with manual spammers, these idiots who actually manually type in spam; tricky to beat those guys 100% of the time! I blogged a bit about this recently (see my profile link)…

Fighting manual spam is really damn hard, you could “greylist” an IP range for such a case and always manually approve comments from that range.

Luckily I have not been attacked by that here yet.


Another common way to stop this is to have hidden trap fields. Name your url input field with some random characters like “afggj” and have a css-hidden field called “url” next to it. A spambot will know to fill in the “url” field but does not know it is a trap.

Just a thought.
You know the way PuttyGen uses mouse movement to create real randomness?
Is it possible to tell the difference between real randomness and computer generated randomness? If so could that also be used against a headless browser?

I doubt you would need to go that far, the trick with reading pixels from the screen would pretty much kill every bot out there. My current primitive trick has kept my blog spam free for quite a few months now. Only issue I had was last Friday when Akismet decided to mark a few comments as spam that were not.

Last night I added a JavaScript to my blog that randomly shows two images of two single digit numbers. It then asks the commenter to add those numbers. Three hours later I got five spam comments. My conclusion is that it's human entered spam because otherwise it's the most advanced spam bot I've ever heard of!

I've had problems with spam for as long as I can remember and Akismet helped… just a bit. Now I'm working on adding a CAPTCHA, hoping this will drastically reduce the amount of breakfast spam filtering through my website! I'm looking forward to developers coming up with advanced ways of combating the dreaded trash.

Great blog post. It has creative ideas and interesting techniques.
I recently blogged about this subject and then got into a discussion with a friend about headless browsers. I just want to say that even though all your solutions are nice they are still easily broken, for example:
1. The getImageData function only returns what the browser tells it to return. Change chromium to return a computed value or just let them run headfull (is that a word?) and it is broken.
2. Computations are a nice way to waste a lot of cpu cycles but spammers don't use their own machines and the amount of hackable hardware is just growing and growing.

I'll take it even further: if people take your advice and switch to these checks that make sure there is a full web browser, spammers will adapt almost immediately, thus making your prediction that they are years away ironically wrong. The only real, perfect solution to spam will always be on the server.

As mentioned above, whatever becomes popular becomes a target. For that reason I would avoid reCAPTCHA. There are solving services for it that turn it into a real visitor annoyance, especially for those of us who don't see/see well, and a minor speed bump for the spammer. Some of the puzzle CAPTCHAs are at least fun and, as far as I can tell, cannot be gamed to the degree that reCAPTCHA is.
