We have a problem. The problem is so old and so commonplace that we’ve all gotten used to it. But it’s still a problem. The problem is that the WordPress moderation filters are comically primitive. They’re not even up to 1997 email-filtering standards. In fact, I’m starting to suspect that the spam filter is just a random number generator that marks every 20th comment as spam.

Observe:

On the top we have Henson, who has posted a small comment that contains no common spam keywords. This was posted to the most recent episode of Spoiler Warning. It contains no links. Moreover, Henson has successfully left 64 comments in the past without being flagged as a spammer.

On the bottom we have “residential steam showers”. It’s also worth noting that:

This “person” has never commented before.

This comment was left on a post that is half a decade old.

It is loaded with spam phrases that I have marked as spam again and again and again. (What is with you spammers selling showers and bathroom fixtures? Even if I let every single comment stand, your spam would NEVER build up enough search engine credibility to end up anywhere NEAR the top of the search results. It will never happen. Give up.)

It features a long gibberish URL, which is a common trait among spammers.

But Henson was inexplicably marked as spam, and not residential steam showers. Then we have this:

ps238principal has successfully left ONE THOUSAND SIX HUNDRED AND THIRTY-EIGHT non-spam comments. Yet the spam filter felt the need to flag this reasonable, inoffensive comment as spam.

On the top, “real estate” is leaving a word-for-word reproduction of a comment I’ve marked as spam a hundred times in the past. On the bottom, “hack les simpson” is leaving a comment with goofy manual line breaks that are common to 80% of all spam and are never used by any human ever. They’re also loaded with phrases that are very common to spammers. (Seriously, spammers love to tell me how nice my site looks. Also they love to use the word “fastidious”. Incorrectly. As in, “this post is in fact a fastidious one it helps new \n net visitors, who are wishing for blogging”.)

I guess it flagged ET because the comment had two links? But they’re to youtube, just like the spam below it. And ET has nearly 700 valid comments. “Doctor Oz” has zero, plus goofy line breaks and spammy content. And for the record, the “1 comment approved” means THIS comment. It doesn’t mean I’ve approved a comment from them in the past.

Also, I am reminded how nice it is to be rid of the Google Adbot. I can write all this without worrying about pissing it off or being paranoid about what ads it will choose based on my content.

OBAMA DOESN’T WANT YOU TO KNOW THIS TRICK FOR ONE CLICK UNDERAGE PAYDAY LOANS FOR FAST WEIGHT LOSS SHOWER HEADS, NO PRESCRIPTION REQUIRED!

(Diabolical laugh.)

Tell me again about the great strides we’re making in artificial intelligence. </facepalm>

This is beyond pathetic. If you can’t recognize these three flagrantly obvious spam comments as spam, then you have not written a spam filter. I don’t know what your software is doing, but it sure as hell isn’t looking for spam. Once again: Steam showers and sex toys with goofy line breaks and sketchy URLs on ancient posts.

This is the tyranny we live under. Our spam filter is like an airport security checkpoint that waves through men in sunglasses with ticking briefcases that have giant nuclear symbols beside digital countdown timers. But then the guards body tackle and strip search little old ladies[1]. It would be one thing if it looked like a slightly buggy system that missed every once in a while, but this is so bad I can’t even tell what it’s using as criteria for spam.

Even more embarrassing: This circus of failure is actually the result of three spam filters: Akismet, GROWMAP, and Bad Behavior.

And to be completely fair: Yes, they do catch more than they let through. My comments would be 90% spam without them. Also, Growmap doesn’t do filtering based on content. It just puts the “Confirm you are not a spammer” checkbox in there. So it really just cuts down on the volume of crap the other two have to cope with[2].

I could tolerate the occasional spam getting through. But what I can’t fathom are these false positives. There is no pattern or reason to them.

So if you’re curious why sometimes your harmless comment was put into moderation, now you know: NO REASON WHATSOEVER.

Footnotes:

[1] So, not all that different from real airport security, really.

[2] When I first installed it, Growmap worked like magic. No spam for weeks. But spammers always adapt.

What about a system, where it randomly assembles a logical problem for a human to solve? Then have a multiple-choice type area, with four subtly different answers. Obviously, check-mark boxes, in case the bots are still dumb enough to check them all. ;)

Is ReCaptcha any good? I know it was working as intended for a while, but I hear rumors every once in a while, that it’s either beatable by bots, or by click-farm type things in like [name of country here] where peoples’ wages are cheap enough to buy for pennies.

ReCaptcha is slightly disturbing now that the project to OCR just about every book out of copyright (and then all the ones in it thanks to Google’s excellent lawyers) has finished and now we’re all being used to help identify house numbers on photos of streets so that future generations can be more accurately targeted by the first strike of Skynet (why aim a missile at the postcode GPS tag when you’ve got the house identified precisely in a photo?).

Of course, I have to assume that the hackers, crackers, and pirates realised what ReCaptcha meant as a collective work to defeat OCR errors and protect content from automated machines. Every time someone enters a Captcha to download a questionable link or sign up for a forum to discuss cracking and piracy then I’ll assume that website is talking to an evil(er? See above about house numbers) service like ReCaptcha, only one that is fed images from ReCaptcha and wants to turn them into the right response so it can piggyback that to submit some spam or try entering in a new bank account login without human intervention. Use the humans who are proving they are human on piracy sites to remove the need to be a human to defeat a Captcha to prove you’re not a bot submitting spam on a site that thinks it is protected. Paying pennies for this done via Human Turks? I can find you thousands of people who just want to crack an executable who will be happy to do the Captcha work for free!

The really funny part, is that my comment was flagged before I edited it, when it only had one link to YouTube. Plus, I can’t even see a pattern sometimes when I’m flagged. Like, about 2/3 of the time, it’s when I use some keyword or too many links, but the rest…as you said – totally no reason. :)

I have an idea, although it might be a bit drastic – disable in-post comments, and then just auto-generate a post in the forums. It’s how the Ghost blogging software does it.* It’s the new sexiness! Although I imagine you’d just want better filters. :P

* Technically, Ghost has no commenting features whatsoever, but there’s a few plugins which auto-gen the forum posts, for a couple popular forum softwares. :)

The common spam patterns are flagged by now, so they try to defeat that by running everything through an auto-substituting thesaurus filter. Which leads to utter absurdities, because synonyms are really freakin’ hard to automate.

As for successfully identifying spammers, I’m not sure there’s currently a solution beyond having a meatbag look at the incoming posts. I’ve kinda-sorta figured out when a post of mine is going to get flagged for review (mentioning the metaphysical, recreational chemicals, too many links, etc.), but sometimes I do wonder if the TSA is in charge of randomly checking my virtual shoes. :)

What I find hysterically funny is that the Garfield strips just read like randomly generated Garfield strips, but the Big Lebowski and particularly the X-Files ones actually feel right, but perhaps that’s because I’ve been reading too much of Shaenon Garrity’s Monster of the Week http://www.shaenon.com/monsteroftheweek/

Spam filtering is one of those “never-ending” tasks that isn’t very fun, once the initial framework has been laid down. Because at that point, it all boils down to “why that and not that?” and tedious reverse-engineering of mountains of numbers. And anytime you tweak anything, you upset the whole spamcart and need to pick up all the pieces and put them back together into a semblance of a filter, again.

Admittedly, I’ve only been peripherally involved in the writing of spam filtering (but, I ran a usenet server, back when the Canter & Siegel green card spam started making the rounds).

What about Bayesian?
(Isn’t that what they call those things where every time you flag something as spam, or not spam, it learns to do it better? What that has to do with Bayesian statistics I’ve never understood, but they seem to call it that)
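
For what it's worth, the name comes from Bayes' rule: the filter counts how often each word has shown up in comments flagged as spam versus not-spam, then uses those counts to estimate the probability that a new comment is spam. Here is a minimal sketch in Python, assuming a naive whitespace tokenizer and add-one smoothing. The class name and method names are made up for illustration, and nothing here is claimed to be how Akismet or the other plugins work.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy naive Bayes spam filter; names and smoothing are illustrative."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Each manual flag ("spam" or "ham") updates the word counts.
        self.docs[label] += 1
        self.words[label].update(text.lower().split())

    def spam_probability(self, text):
        vocab = len(set(self.words["spam"]) | set(self.words["ham"])) or 1
        total_docs = sum(self.docs.values())
        scores = {}
        for label in ("spam", "ham"):
            # Prior: how often this label has been seen (add-one smoothed).
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            total_words = sum(self.words[label].values())
            for word in text.lower().split():
                # Likelihood of the word under this label (add-one smoothed).
                count = self.words[label][word]
                score += math.log((count + 1) / (total_words + vocab))
            scores[label] = score
        # Bayes' rule, normalised over the two labels; clamp to avoid overflow.
        diff = max(-700.0, min(700.0, scores["ham"] - scores["spam"]))
        return 1 / (1 + math.exp(diff))
```

Every call to `train` is the "learning" part: the more comments get flagged either way, the better the word counts reflect what spam on that particular site looks like.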

I now feel slightly less bad about giving up on trying to decode the spam triggers for this comment system. “NO REASON WHATSOEVER” seems to be about where I’d arrived at when I gave up, realising that there seemed to be very little linking which comments were held for moderation and which weren’t. My occasional propensity for linking to external sources (I blame 15+ years of exposure to a master of anchor use) often fall foul of other filter systems but this one seems rather more random than correlatory on factors like links.

So, while I’m relatively certain that any comment from me that has the word copyright in it gets flagged as spam, I’m willing to bet that now that I’m actually saying this openly, this comment will go through just fine.

I think if an AI came from the internet, it’d have bigger problems of 4chan, reddit, tvtropes and the likes clouding up its mind than exterminating the fleshy humans. Being an internet troll is about the best it could hope for…

People keep bringing up the idea that in the future there will be an AI that can effectively troll, without realizing that the only way to know for certain that it isn’t already happening is to manually travel to every troll’s house and make sure they are there.

Memes are repeated ideas that survive based on how funny/entertaining they are judged to be, and re-posted accordingly. Spam is usually built from repeated phrases that survive based on how the bots tending the database notice which ones are the most successful at beating spam filters. If one were to create an AI that joins in the process of mutating memes and filtering its own continued attempts by what people seem to re-post the most… are you 100% sure that hasn’t happened already?

Are we really, truly, 100% certain that “Anonymous” is not mostly AI already?

New release is coming soon, this time with adventure mode focus. So we might just get on that.

And I personally find DF Forums to be even more golden than the game. Did you read the spam thread on the Forums, where people were discussing the various spambots? Because that’s what the comment below was inspired by, but I couldn’t find the link.

The WordPress robots saw your website background of dice and decided to use the same selection method.

So, related story: I use Yahoo for email, and they’ve changed their spam systems within the last year. I get hardly any spam in my inbox (good), but I also can’t send email in any suspicious manner; for instance, if I send email from a new location, or if I send a lot of emails in a short time, or if I send emails to a lot of addresses at once. I don’t exactly know how it works, but if Yahoo finds something they think is suspicious, the message I’m writing doesn’t get sent, and I have to go through some authentication to be able to send mail again. This process can take up to a full day. It’s become infuriating that I can lose my ability to send email – my primary mode of communication – just because I hit ‘Reply All’ at the library. I’m battling fascist algorithms and I don’t even have a weapon.

Also: hey, I got marked as spam! I’ve finally arrived! I’ll be sobbing in a corner if you need me.

Edit: And my comment’s in moderation. This seems to happen to me a lot around here.

For various reasons I have email accounts with all the major webmail providers, and in my experience Google actually has the worst spam filters. Yahoo and Hotmail are better. Which is the reason I don’t use Gmail as my primary email.

Although, Yahoo has gone downhill recently in the quality of their spam filtering, becoming almost as bad as Gmail. And let’s not mention their webmail interface in polite company.

I actually haven’t had too many problems with Yahoo…until this last year. It serves its purpose, it’s organized just fine. And I’m reluctant to change my email address for the first time in twelve years – especially since the only alternative I know about is Gmail, and I feel uncomfortable about Google having so much ubiquity in electronic communication. But there may come a point where this headache is simply too, too much. It will become a day of reckoning.

It has unlimited space (not that the limits are low enough to be relevant in Gmail either) and it actually allows you to organize the emails into folders, which is leagues ahead of the un-organized tagging system Gmail uses, at least for my purposes.

You can force Gmail’s labels to work as folders, but it’s kludgy. In fact, it might only be possible to make them work like folders, in the case where you’re making filters. I’ve never actually tried manually labeling stuff.

Gmail used to be really good. Now not so much. I’m seeing a lot more false positives with Gmail in situations where it doesn’t make any sense.

For example I have an email account that forwards to my main email account. The first account will let a legitimate email go through as not-spam. But the 2nd account flags it as spam! That makes no sense. It’s the same filter running twice! Except the 2nd time it’s going from a trusted and authorized email address to a 2nd trusted and authorized email address. Spam filters really are random.

I also discovered that there’s no way to turn off Gmail’s spam filters or even to create a whitelist. It boggles my mind.

Admittedly I have FAR fewer viewers on my site than you do, but I have a plugin that has a very simple, random-format math problem spambot checker, which commenters can circumvent altogether if they register. Admittedly, I do get some junk registrations on my site, but these bot-users mysteriously never leave any comments.

Of course, I do get some spam in the filter anyway. They’re all of the same style that you have here. What’s the deal, anyway? I wonder if they’re bots at all, but some sort of underpaid Chinese spam-sweatshop workers. It helps explain how they can figure out HOW to comment, but have English so horrible that it basically means nothing.

This kind of situation is what I was dealing with when I was selling my laptop on Kijiji. They show juuust enough intelligence so you know they’re humans, but the texts I was getting were poorly formatted enough that I know they’re not native English speakers. “I’ll pay you X + 300 for shipping it to my sister-in-law. What’s your paypal?” Yeah…like I’m dumb enough to wait for a cheque that’s never going to come. :P

First one (Henson) uses all caps and special symbols for almost an entire line. Sometimes this indicates flame posts, but obviously not here.

Second one (principal) contains both a well known author’s name and a book store name.

Third one (ET) has got to be the videos along with unconventional punctuation. Nothing wrong with stream of consciousness sentences in my opinion, but they do tend to flag as incorrect English.

But I have no idea how the others got through. Maybe institute a policy that automatically flags everyone’s first ~10 posts? More chores for you, obviously, and that’s the last thing any of us’d want for you.

But how hard is it to impersonate some other user? You have to guess at the email address, and/or the name (…but no, not the name, as there are like five different Bryans leaving comments here, only three of which are me — two are older email addresses, so that’s not *quite* as bad as it sounds), but it’s not like there’s any kind of authentication of that address.

And it’s not like we all want to give Shamus a pgp public key, or an ssh public key, or some such actually-cryptographically-secure option. (Not to mention that browsers can’t do that. And users generally hate it. In fact, forget I mentioned it altogether. :-P ) Or even a password.

Dunno, but I think the idea of requiring some math has some possible merit. The question would be how you decide which math problem to give any given page view, and how to tie the POST data back to the original question you asked. It’s not like you want to provide the numbers — or worse, the answer you need to see — in a hidden input field, as that’s pretty spoofable too.
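
One way to tie the POST data back to the question without exposing the answer is to put a keyed signature of the answer in the hidden field instead of the answer itself. The sketch below is in Python and makes several assumptions: `SECRET_KEY`, `make_challenge`, and `check_answer` are hypothetical names, the key lives only on the server, and the question is a simple sum.

```python
import hashlib
import hmac
import secrets

# Assumption: a private key that never appears in the page source.
SECRET_KEY = b"replace-with-a-private-server-side-key"


def _sign(nonce: str, answer: int) -> str:
    # The signature covers the nonce and the answer, so it can't be reused
    # for a different question and reveals nothing without the key.
    message = f"{nonce}:{answer}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_challenge():
    """Return (question, nonce, token); nonce and token go in hidden fields."""
    a = secrets.randbelow(9) + 1
    b = secrets.randbelow(9) + 1
    nonce = secrets.token_hex(8)
    return f"What is {a} + {b}?", nonce, _sign(nonce, a + b)


def check_answer(nonce: str, token: str, submitted: str) -> bool:
    """Recompute the signature from the visitor's answer and compare."""
    try:
        answer = int(submitted.strip())
    except ValueError:
        return False
    return hmac.compare_digest(_sign(nonce, answer), token)
```

This only addresses the hidden-field problem. With so few possible sums a determined bot could still guess, and a solved nonce/token pair could be replayed unless the server also remembers which nonces it has already accepted.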

Large parts of the world (including mine) use variable IPs as standard. Seriously, American systems which try to identify my location to ask if such and such is me run face first into the fact that Everyone in the Country using Telecom (was the main phone company, was a sub unit of the state post office. Has now been split into a bunch of sub bits… long story) as their ISP shows up as being in Auckland, because the fact that the ISP’s systems are there is the last bit of geography based IP data there is.

So… yeah, IP addresses don’t help much unless you want to ban significant sections of what may well be entire small countries.

Speaking of wordpress being silly, I know it’s super late but I still want to apologize for spamming twentysided a couple times in the past. For whatever reason when I still had my blog, mentioning anyone else’s in any form of link would have wordpress go back and automatically comment on whoever’s blog I linked to. And then post a link back to my blog. God that was mortifying to find out, since I’d linked to more than a couple pages and people.

I was writing my own RSS reader a few months ago and ran afoul of your Bad Behavior plugin. After trying a dozen permutations of settings, I eventually found that if I spoofed the User-Agent and specified what encodings I was accepting, the plugin would graciously allow me to get the RSS feed so that I could put links to your posts along with all the other links to posts on blogs I follow.

I’d been meaning to mail you about it, since the Bad Behavior rules seem completely arbitrary. However, since this was just a silly little script to aggregate links for my own personal use, I felt I hadn’t really put enough diligence into figuring out what was wrong on my end and exactly what innocent behaviors were frowned upon by the plugin, so I figured I’d just stick with what works and put a more thorough investigation onto the backburner.
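
For anyone writing a similar script, the workaround described above (a browser-like User-Agent plus an explicit list of accepted encodings) might look roughly like this with Python's standard library. The header values and the `build_feed_request` helper are assumptions for illustration; which headers Bad Behavior actually inspects is not established here.

```python
import urllib.request


def build_feed_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying the headers a browser would normally send."""
    headers = {
        # Hypothetical agent string; a bare script default may be rejected.
        "User-Agent": "Mozilla/5.0 (compatible; MyFeedReader/0.1)",
        "Accept": "application/rss+xml, application/xml;q=0.9, */*;q=0.8",
        # "identity" asks for an uncompressed body, so no gunzip step is needed.
        "Accept-Encoding": "identity",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_feed(url: str, timeout: float = 10.0) -> bytes:
    """Download the feed body; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(build_feed_request(url), timeout=timeout) as resp:
        return resp.read()
```

Something like `fetch_feed("https://example.com/feed")` would then return the raw XML, with the URL being a placeholder for the real feed address.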

As I did my Honours year in AI (Machine Learning), I’d offer to write you your very own neural-net enhanced filter, but ewww, PHP. I’m more of a Perl, Java and C++ guy. The one email classification system I did make was nice and accurate, but it was academic software designed to rigorously come up with accuracy metrics for different algorithms and not really usable as a day-to-day system.

After my first ten posts or so that got that and somehow still got responded to, I just guessed the site automatically body-checks anything sufficiently verbose and lets it through as soon as a human has taken a look at it. Nowadays I hardly even notice when it shows up.

Shamus, maybe it’s time to use a login for the comments and then white list people that have made maybe 10 comments without pissing off people too much. (and you can always un-whitelist somebody later).

Provided the login stuff can be done “in” the comment box/area somehow then I would personally not mind that.
The “regulars” are usually the biggest posters, and while the post count may go down a little, the false positives should drop dramatically (and make moderating a little easier hopefully with less posts to check).

I thought of this as well, but after some thought I figured that it’s a bad idea. It has happened to me a few times that I forgot the correct “login” for myself when writing posts from different computers, and I don’t change them that much. So I can imagine why this would be a hassle for people that often switch them.

If our email addresses aren’t easily visible, then even that method isn’t a concern, I mean it’s obvious both to us and the computer that you aren’t 4th D. The chances of a spammer correctly guessing the correct username and email address are pretty low.

If not, there are other methods for keeping track of people which are more effective.

A lot of the regulars here probably have an email address that is not that hard to dig up.

Basing whitelisting on a login-less solution is just asking for trouble as spammers would then directly target long time posters and spam on even old posts on the site. If whitelisted then Shamus would be unaware of this occurring.

You might say that some spam checking should be done still on whitelisted ones, but what is the point of doing that, the purpose of whitelisting is that you do not have to do that.

Personally, I think shamus should just keep changing the check box logic. First, change the checkbox to “Confirm you ARE a spammer”. This may not help – now that I think about it – since it is probably holding at bay hundreds of comments per post.

But next, add another. Make it so we have to check ONE but not the other. After a while, reverse them. Then after a while, make it so you have to check BOTH. Get creative. Try double negatives: “Confirm you aren’t NOT a spammer.”

Edit: And… “Please check the box to confirm you are NOT a spammer” got me again.

That will just put it into DRM territory of punishing legit users while doing little to remove spam (granted, it will reduce spam, but will the reduction be more than the annoyance of legit posters? I doubt it).

Not sure that this helps, but when we had an ungodly number of spammers on our phpbb forum, we added another field into the PHP of the signup sheet with just ‘Type student’, with an appropriate text box next to it.

Then the script that the form submits to was edited to die if the field didn’t have the word student put into it.

It seems that many spambots will autotick boxes in a form but they can’t fill out a field that just tells you to type a word.

Obviously very hacky and might not be editable in wordpress, but yeah that’s how we got around it

“All right I confess, I’m a spammer. This entire post is crammed full of links to irrelevant useless sales pitches. I’ve purposefully been trying to deceive WordPress’ spam filter. I’ve been a bloody fool.”

You should add a question to your comment form, which would be simple to answer for a human and impossible to guess for a robot. For example: “What is the first name of this blog owner?” and reject any comment in which the answer does not match the PCRE regular expression “/shamus/i”.
And when spammers adapt – just change this question.
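
As a rough sketch of that check, here is the same case-insensitive match written in Python rather than PHP. `ANSWER_PATTERN` and `passes_question` are hypothetical names, and the pattern simply mirrors the “/shamus/i” example.

```python
import re

# Mirrors the "/shamus/i" example; swap the pattern when the question changes.
ANSWER_PATTERN = re.compile(r"shamus", re.IGNORECASE)


def passes_question(answer: str) -> bool:
    """Accept the comment only if the answer contains the expected word."""
    return ANSWER_PATTERN.search(answer) is not None
```

Because the pattern is a substring match, answers such as “Shamus Young” or “shamus!” also pass, which keeps the check forgiving for humans.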

Include some form fields called Name, Email, Website, URL, Favourite Colour, whatever.
Rename the existing fields to something nonsensical.
Update the form submission code to use the nonsensical fields, and flag anything filling in the “normal” fields as being spam.
Use some CSS to hide the honeypot fields and rename the nonsensical fields to what they should be.

Now this probably raises a few accessibility issues, but it might cut down on a lot of the automated comment posts, because the bot will see the trapped fields and fill them in.
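
The server-side half of that honeypot idea can be sketched as follows. This is Python with made-up field names (`f_a91` and friends stand in for the “nonsensical” names), and it assumes the submitted form arrives as a plain dictionary of strings.

```python
# Hypothetical names: the familiar ones are the trap, the odd ones are real.
HONEYPOT_FIELDS = ("name", "email", "website", "url")
REAL_FIELDS = {"f_a91": "name", "f_c27": "email", "f_e03": "comment"}


def classify_submission(form: dict) -> tuple:
    """Return (is_spam, cleaned_values) for a dict of submitted form values."""
    # A human never sees the CSS-hidden trap fields, so any text in them
    # suggests the form was filled in by a script.
    if any(form.get(field, "").strip() for field in HONEYPOT_FIELDS):
        return True, {}
    # Map the nonsensical field names back to their real meaning.
    cleaned = {real: form.get(odd, "") for odd, real in REAL_FIELDS.items()}
    return False, cleaned
```

A browser that autofills hidden fields could trip the trap by accident, so flagging for moderation is probably safer than silently discarding.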

The next day Shamus will put up a Diecast in which Chris talks for twenty minutes about Steam Shower Simulator and Rutskarn has a roleplaying anecdote about real estate investment, following which Mumbles and Josh have a long and complex argument about sex toys. The moderation queue will be epic.

Here’s a guess why they mention those showers. (I don’t want to use the phrase, because I don’t want to be labeled as spam.) Steam showers are a real thing, if you look it up on google. But because you mention Steam from time-to-time on here, usually as an example of the right way to do electronic stores (as opposed to say… Origin *shudders*) the spambots see the word steam, and decide you must mean those showers. Those posts may have more comments than most, or the word may come up in comments more often.

So they come on mentioning showers.

Have I asked if you’ve ever considered the advantages of owning a really fine set of encyclopedias? They would let you right to information bedazzling with many fastidious facts.

(That might have got me flagged as spam. Sorry for the extra work. I couldn’t resist.)

Does your website send my email (or apparently a hash of my email) to gravatar each time I write a post?

I’d rather have the option not to.
Heck if you added a “Use Gravatar” checkbox that defaulted to checked then I’d happily uncheck it every time. (that would preserve current behavior but allow people to opt out).
It might speed up making posts too?

BTW! There is a way to somewhat anonymize Wavatars, by simply doing md5(email + sitename).
Down side is that the Wavatars would be unique to this site only.

I guess a new way to pass the hash to Gravatar could be devised.
Maybe by passing md5(email + sitename and/or salt) and then pass the sitename, Gravatar could then look up the right site in the database and locate the gravatar hash for the user,
and those with a Gravatar account could then tie each site specific gravatar hash to their account.
This would reduce privacy leaking to a minimum. (you would only be able to track someone within the same site rather than across the web).

I guess if you wanted to be extra clever you could do md5(email + sitename + salt)
and register the site at Gravatar, that way the salt would become a sort of shared secret between the site and gravatar.
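
The md5(email + sitename + salt) idea is easy to sketch. The Python below is illustrative only: `site_avatar_hash` is a hypothetical helper, and Gravatar itself is not known to accept site-specific hashes like this, so it would only work for locally generated avatars unless the service added support.

```python
import hashlib


def site_avatar_hash(email: str, sitename: str, salt: str = "") -> str:
    """Hash the email together with the site name (and optional salt)."""
    # Normalise the address first so "Bob@Example.com " and "bob@example.com"
    # map to the same avatar.
    normalized = email.strip().lower()
    material = (normalized + sitename + salt).encode("utf-8")
    return hashlib.md5(material).hexdigest()
```

The same address then produces a different hash on every site, so the hash can no longer be used to follow one commenter across the web. MD5 is kept only because that is what the comment proposes; a stronger hash would be a reasonable swap.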

Greetings! Thy webhold is a vista of pleasantry, and thy article a trove of wisdom! Related to your efforts is this electronic bazaar, from where your disciples may trade their lucre for devices that use the power of steam to cleanse their bodies, or shoes and pouches from famous artisans!

That was way too intelligible to be real spam, but kudos for not including a url though, I sometimes get email spam that advertises something and no urls to anything, makes you wonder if some spammers are just a Turing test gone wrong.

As a point of interest – my Dad does random manual page breaks.
He came to PC use at his work later in life (late 60s) and his first word processing and computer experience was an Outlook client with no word-wrapping.
Now he can’t break the habit of \n every time he feels like the line will run on.

As a more depressing point of interest, at the start of the article I spent about 30-40 seconds trying to make actual sense of the steam-shower cubicle comment before realising it was an example of spam. I thought he was referring to something from a Spoiler Warning episode.
So…score 1 for spammers?

They definitely need more steam showers in Spoiler Warning. I don’t really know if there are actually any games with them but it’d surely be very interesting.

Maybe some Hitman game might have them? Hide the damn spambot’s corpse in its own Steam shower!

However, I still don’t understand why anyone would go to the extent of spamming to promote their brand/product. To be honest stuff like that just puts me off instead of raising interest. I can’t imagine things like a slightly better search rank actually doing anything for them. It is even weirder when you look at it on an international basis:
Why would I ever buy “canadian pharmacy” products? My healthcare covers everything I could possibly buy from those, not even assuming they’d send you placebos at best and poison at worst.

I wonder if my programming had any influence on my propensity to use semi-colons in my typing or if I’ve just successfully trained myself to avoid the rampant comma-splicing practiced by most of the internet.

Thankfully we’re about to change the back-end, so hopefully the new one will be less brain-dead when it comes to spammers – hint, if the customer has never posted before and the post has got a link in it, it’s either spam or they are asking for help on integrating a specific 3rd party product.

And no real user has ever actually included that necessary info in their first post when asking for help!
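
That hint amounts to a two-condition rule. A minimal sketch in Python, assuming the back-end can supply a count of the author's previously approved posts; `needs_review` and the link pattern are illustrative, not any particular product's logic.

```python
import re

# Rough link detector: explicit http(s) URLs or bare "www." hosts.
LINK_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)


def needs_review(previous_approved_posts: int, body: str) -> bool:
    """Hold a post for moderation if it is a first post that contains a link."""
    is_first_post = previous_approved_posts == 0
    has_link = LINK_PATTERN.search(body) is not None
    return is_first_post and has_link
```

Holding the post for a human rather than deleting it covers the rare first-time poster who really is asking about a third-party product.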

Hm, Shamus, may I suggest replacing all three with just Antispam Bee? It has done pretty well for me so far (together with “new commenters must be approved first”). Maybe it’ll do so for you as well. :)
(And it’s not like enabling/disabling plugins takes long with WordPress anyway.)

Anyways, can’t YOU write yourself some rudimentary code to make a third (well, fourth) pass on these comments? I am almost certain it’d be trivial to at least write code that’d block every “A bunch of great guidance on this great site. need a steam shower unit in my bathroom”.

Though it’d be less trivial to write a tracer that’d find the source of these posts, find their personal e-mail, and flood it with EXABYTES OF PRON.

If you haven’t done it yet, add a tiny checkbox (“Do not confirm this”) next to the Spammer-Confirm, and throw out any comment that marks it. We normal users won’t click it (you could also hide it behind something else), and bots will probably either mark both boxes or none.

I’ve been reading your blog and lurking in the comments for years. I’ve only rarely made a comment myself, but I think I’ve seen the “awaiting moderation” message every time I did. I’ve always wondered what the criteria was (or I guess what it’s supposed to be).

Confirmed. In order to make your profile unavailable for browsing no matter what you need to use a guid or a hash as your username when you sign up for gravatar. You can use your email to login but using hash disables redirect to your profile since it seems that they decode your md5 hash to get your username.