Mismanagement, I can understand, but imperfect data and sabotage? How so?

> A Bayesian filter should, in theory, be
> both accurate and less reliant on the
> continued vigilance of others

Comment spam is unlike email spam. They aren’t trying to sell you something with big huge colorful text and offers for 100% improvement in whatever!!!!!!! Most often, the text of the comment looks fairly normal but it contains links that are the problem. How can a Bayesian filter catch that?

Oh, and you still have to train it. And Lord if you train it wrong, you are in trouble…

Re: sabotage, follow the “just like with email blacklists” linked text in my post. Essentially, when the Osirusoft DNS-based email blacklist was subjected to a denial-of-service attack, the owner decided to screw the ‘net by changing it to report all mail hosts as spam havens. It was weeks before most mail systems were modified to deal with his sabotage.

Re: imperfect data, there are two problems. First, the presence of a machine in a shared blacklist is generally based solely on one person’s decision to put it there. Why did he put it there? Nobody knows; it could have been real, but it also could have been spite, intolerance of a poster, or just a misunderstanding of what someone was saying in a comment. Yes, most comment spam is obvious, but just as it happens with email, I can imagine people heaping comments into the spambox just because they don’t want to deal with them. (For example, would people trust that every machine that would appear in Dave Winer’s blacklist would really be a spammer, or would there be a good chance that a good chunk of them were just the machines of people who rubbed him the wrong way?)

Second, the presence of a machine (or, more accurately, an IP address) in a blacklist implies that that IP address belongs solely to one person. But in this day and age of shared IP addresses, whether via multiuser machines or dynamically assigned addresses, that’s not always true, and as such, the blacklist has the real potential of sweeping in people who don’t belong. That’s imperfect data.

Only if someone uses regular expressions incorrectly as I did in one of my recent releases. Other than that, there are no false positives. There’s no acceptable comment except maybe this one, that contains the string kinky-granny.pornwww.com. If the user is careless, then there are false positives, but again, then it only affects that user’s website.
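
To illustrate the kind of regular-expression slip that can cause a false positive, here is a minimal Python sketch. It is a hypothetical example, not the actual bug from any MT-Blacklist release, and the `innocent` comment text is made up. An unescaped dot matches any character, so a blacklist entry stored as a raw pattern can match more than the literal string it was meant to catch:

```python
import re

# Hypothetical blacklist entry: a domain stored as a raw regex pattern.
entry = "pornwww.com"

# Unescaped, the "." matches ANY character, so unrelated text can match.
loose = re.compile(entry)
# Escaped, the pattern matches only the literal string.
strict = re.compile(re.escape(entry))

spam = "Visit kinky-granny.pornwww.com today"
innocent = "I typed pornwwwXcom by accident"  # made-up non-spam text

loose_hits = [bool(loose.search(t)) for t in (spam, innocent)]
strict_hits = [bool(strict.search(t)) for t in (spam, innocent)]
print(loose_hits)   # [True, True]
print(strict_hits)  # [True, False]
```

With the escaped pattern, only the comment that really contains the blacklisted string is caught.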

Re: sabotage. Jason, someone can only sabotage their OWN blacklist, causing their own site to be hurt. That’s not really a problem…

> For example, would people trust that every machine
> that would appear in Dave Winer’s blacklist would
> really be a spammer, or would there be a good
> chance that a good chunk of them were just the
> machines of people who rubbed him the wrong way?

Jason, just like in real life, if you’re going to trust someone, you open yourself up to their goodness and maliciousness. In any case, a site owner trusting someone untrustworthy is the site owner’s problem, not mine. MT-Blacklist doesn’t solve stupidity. In fact, no software does.

And as far as IP blacklists, I’ve screamed it from every mountain. They are useless, ridiculous, imperfect, whack-a-mole solutions. Whoever is using IP blacklists needs to learn a little bit more about the internet.

While I will agree that many blacklist implementations and models are flawed, I still haven’t heard a valid criticism specifically of MT-Blacklist’s implementation (other than a couple of bugs which will be ironed out in the next version), but would be happy to hear some and adapt the program as necessary to best serve the needs of the community.

And James, don’t get me wrong. I love that you created MT-Bayesian, and I hope to one day soon be able to open up MT-Blacklist as a general engine for other people’s filters including yours. Users should have at their disposal as many tools as possible and be able to use them all easily and seamlessly. I look forward to trying out MT-Bayesian. I am skeptical that it would be very successful, but I hope that it is. Regardless, I and many others appreciate your efforts.

Effectively, so long as a comment has a blacklisted word (even as a substring), it will be banned. Suppose I have “porn” as a blacklisted word: almost every comment on my entry “Should we ban porn?” would be banned.
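
The effect is easy to reproduce. This is a minimal Python sketch of plain substring matching, assuming a one-word blacklist; the function name, the comments, and the `pornwww.example` host are all made up for illustration, and it is not how any particular plugin is implemented:

```python
# Hypothetical one-word content blacklist.
blacklist = ["porn"]

def banned_by_substring(comment, words):
    """Ban a comment if any blacklisted word appears anywhere in it."""
    text = comment.lower()
    return any(word in text for word in words)

comments = [
    "I think banning porn outright is unworkable.",  # legitimate reply
    "Buy cheap pills at pornwww.example now!",       # spam
]
results = [banned_by_substring(c, blacklist) for c in comments]
print(results)  # [True, True]
```

Both comments are banned, including the legitimate one, because the check has no notion of context.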

Hence, the simple-mindedness of blacklist logic is the problem, whether it is an IP blacklist or a content blacklist. Bayesian filtering, at least, is a fuzzy logic that analyzes the content before giving you a probability of spam.
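
For a rough idea of what “a probability of spam” means here, below is a toy naive Bayes classifier in Python. It is a sketch under stated assumptions, not the code of MT-Bayesian or SpamAssassin: the class name, the tokenizer, and the four training comments are invented for illustration, and a real filter would need far more training data.

```python
import math
import re
from collections import Counter

def tokens(text):
    # Lowercase word-like tokens; a URL falls apart into its host pieces.
    return re.findall(r"[a-z0-9'-]+", text.lower())

class TinyBayes:
    """Toy naive Bayes spam scorer (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_docs = sum(self.docs.values())
        log_score = {}
        for label in ("spam", "ham"):
            n = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)  # class prior
            for tok in tokens(text):
                # Laplace (add-one) smoothing so unseen words are not fatal.
                score += math.log((self.counts[label][tok] + 1) / (n + len(vocab)))
            log_score[label] = score
        # Turn the two log scores into P(spam | text).
        return 1 / (1 + math.exp(log_score["ham"] - log_score["spam"]))

f = TinyBayes()
# Made-up training comments.
f.train("cheap pills online casino porn http://pills.example", "spam")
f.train("casino porn cheap cheap pills", "spam")
f.train("Should we ban porn? I think the debate matters", "ham")
f.train("I disagree, a ban on porn would not work", "ham")

p_legit = f.spam_probability("I think we should not ban porn")
p_spam = f.spam_probability("cheap casino pills")
print(p_legit < 0.5 < p_spam)  # True
```

The word “porn” appears in both training sets, so on its own it carries little weight; the surrounding words decide the score. That is also why the training caveat matters: feed it the wrong labels and the probabilities go wrong with it.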

Jay, your implementation is based solely on strings, not IP addresses? That changes things a little, and I agree that it’s for the good. But I also don’t see this as a competition — I, like you, would love to find whatever it takes to just deal with this problem before it really gets going. For me, Bayesian filters have been a godsend on the email front… as part of SpamAssassin, which includes both functionalities. Maybe a combined attempt will ultimately be what works here, too!

Who am I?

I'm Jason Levine, and have been keeping this site since the waning days of 1999. I'm a physician, a husband, a father, a scientist, an uncle, a photographer, and an unapologetic geek. I currently live in Washington, DC, and wear the two hats of a bioinformatics researcher and a clinical pediatric hematologist and oncologist.