I see Twitter getting beaten up a lot for not deleting the spammers faster. Etsy gets beaten up for not deleting the “resellers” faster. Flickr used to get yelled at for not catching the photo stealers or porn spammers faster.

“It’s so fucking easy, they’re right over there, here, let me show them to you, what’s your problem?”

This comes from not understanding the cost benefit ratio of false positives in identifying abuse of a social site at scale.

Imagine you’ve got a near perfect model for detecting spammers on Twitter. Say, Joe’s perfectly reasonable model of “20+ tweets that matched ‘^@[\w]+ http://’”. Joe is (presumably hyperbolically) claiming 99% accuracy for his model. And for the moment we’ll imagine he is right. Even at 99% accuracy, this algorithm is going to incorrectly flag roughly 2 million perfectly legitimate tweets per day as spam.
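To make the arithmetic concrete, here is the back-of-the-envelope version. The daily tweet volume below is just the round number implied by the 2 million figure, an assumption for illustration, not an official Twitter stat:

```python
# Back-of-the-envelope false-positive math for a "99% accurate" spam filter.
# The volume below is an illustrative assumption, not Twitter's real number.

tweets_per_day = 200_000_000   # assumed daily tweet volume
accuracy = 0.99                # Joe's claimed accuracy
false_positive_rate = 1 - accuracy

# Even if every single tweet were legitimate, a 1% error rate still flags:
wrongly_flagged = tweets_per_day * false_positive_rate
print(f"{wrongly_flagged:,.0f} legitimate tweets flagged per day")
# -> 2,000,000 legitimate tweets flagged per day
```

The point isn't the exact volume; it's that a tiny error rate multiplied by a huge denominator is still a huge number.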

If you’ve never run a social software site (which Joe of course has, but for the folks who haven’t) let me tell you: these kinds of false positives are expensive.

They’re really expensive. They burn your most precious resources when running a startup: good will, and time. Your support staff has to address the issues (while people are yelling at them), your engineers are in the database mucking about with columns, until they finally break down and build an unbanning tool, which inevitably doesn’t scale to really massive attacks or new, interesting attack vectors. Which means you’re either back monkeying with the live databases, or you’ve now got a team of engineers dedicated just to building tools to remediate false positives. And now you’re burning engineer cycles, engineering motivation (cleaning up mistakes sucks), staff satisfaction AND community good will. That’s the definition of expensive.

And this is all a TON of work.

And while this is all going down you’ve got another part of your company dedicated to making the creation of new accounts AS EASY AS HUMANLY POSSIBLE. Which means when you do find and nuke a real spammer, they’re back in minutes. So now you’re waging asymmetric warfare AGAINST YOURSELF.

Fine, fine, fine, whatever. You’ll build a better model. You know, this is a social site, we’ll use social signals. People can click and say “This is spam” and then when, I don’t know, 10 people say a tweet is spam, we’ll delete it and ban that account. But you know, people are fuckwits, and people are confused, and people are unpredictable, and the scope of human activity at scale is amazingly wide and vast and deep, so a simple additive, easy to explain, fundamentally fair model isn’t going to work. (protip: if your site is growing quickly, make sure to use variables for those threshold numbers, otherwise you might DoS yourself)
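That protip can be sketched in a few lines. The policy and function names here are made up for illustration; the point is just that a hard-coded “10 reports = ban” rule that was sane at launch becomes a weapon once the site is a hundred times bigger:

```python
# Hypothetical sketch: scale report thresholds with site size instead of
# hard-coding them. A constant "10 reports = flag" rule that made sense at
# 100k users lets any small mob nuke accounts once you have 50M users.

def spam_report_threshold(daily_active_users: int) -> int:
    # Assumed policy: roughly 1 report per 10,000 DAU, with a floor of 10.
    return max(10, daily_active_users // 10_000)

def should_flag_for_review(report_count: int, daily_active_users: int) -> bool:
    return report_count >= spam_report_threshold(daily_active_users)

print(spam_report_threshold(100_000))     # small site -> 10
print(spam_report_threshold(50_000_000))  # grown site -> 5000
```

Whether the threshold scales linearly, logarithmically, or per-community is a design choice; the only hard rule is that it can’t be a constant.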

But you’re smart, so now you’ve got a machine learning model, that’s feeding social signals into a real time engine, that’s bubbling up the top 0.01% of suspicious cases (and btw if you’ve gotten this far, you’re really really good, and you’re probably wasting your time on whatever silly sheep poking/nerd herding site you’re working on, so call me, I’ve got something more meaningful for you to do), and in at least Twitter’s case we’re now talking about a mere 200,000 potential spam tweets to be manually reviewed daily.

How many people do you need to review 200k spam tweets per day? How many desks do they need? Are you going to do that in house, or are you going to outsource it? And if you outsource it, how are you going to explain the cultural peculiarities of your community? Because while your product might have gone global, you’re still your own funky nation of behavior, and some things that look strange (say, retweeting every mention of your own name) are actually part of your community norms.

And if you don’t explain those peculiarities, how long do you think it is until this small army you’ve assembled to review 200k tweets a day gets tired, makes a mistake, and accidentally deletes one of your social network hub early adopter types (because the sad truth is early adopters are outliers in the data, and they look funny)?

And what do you think the operational cost of making that mistake is? (see also: fakesters)

Also, what does your data recovery strategy look like on a per-account basis?

There are solutions. Some of them are straightforward. Many of them aren’t. None of them are as easy as you think they are unless you’ve been there. And I’m happy to talk to you about them over a beer, but just posting them on a blog, well that would be telling other people’s secrets. And they already have a really hard job.

A much more cogent blog post by Bruce Schneier from 2006, Data Mining for Terrorists, really drills into this problem from a theoretical angle. (where “for Terrorists” is to be taken in the “finding Terrorists” sense and not in the “for Dummies” sense) (update: via rafe, a good BBC article on the base rate fallacy)
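The base rate fallacy Schneier describes falls straight out of Bayes’ theorem. Here’s a toy calculation, with every number an assumption chosen for round results:

```python
# Base rate fallacy: when the thing you're hunting is rare, even an accurate
# test is mostly wrong when it fires. All numbers below are illustrative.

spam_rate = 0.01        # assume 1% of tweets are actually spam (base rate)
sensitivity = 0.99      # P(flagged | spam)
false_pos_rate = 0.01   # P(flagged | legitimate)

# Bayes' theorem: P(spam | flagged)
p_flagged = sensitivity * spam_rate + false_pos_rate * (1 - spam_rate)
p_spam_given_flag = sensitivity * spam_rate / p_flagged

print(f"P(spam | flagged) = {p_spam_given_flag:.1%}")
# -> 50.0%: half of what the "99% accurate" filter flags is legitimate.
```

Crank the base rate down to terrorist-level rarity and the flagged pile becomes almost entirely innocent people, which was Schneier’s point.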

12 Responses to “Cost of false positives”

Great post (one I think a few Flickr help forum trolls ought to see). Here are some disconnected thoughts.

Clarification: how are you arriving at the 2 million number? total # of tweets / day * .01? I’m not sure that makes sense, though I suppose the accuracy of that number is not central to your argument.

I think there’s another angle to look at this, though: let’s say that there are N spam tweets a day – 2 million might be a start, and that is a lot. How many accounts are making them, though? I bet it’s several orders of magnitude fewer. If you gear your tools towards accounts instead of individual pieces of content, things become easier. Of course, for twitter that would still suck – you’d have to read at least some of the tweets, as opposed to just looking at the thumbnails, which was usually sufficient in Flickr’s case.

With the right tools, the answer to your “how many desks” question is fewer than you’re implying. We had as few as 1 person working part time on responding to spammers on Flickr, and that was sufficient, though building out the tools took a few weeks of my time (so 1 engineer) and a few hours here and there for adding additional functionality. It’s also important to consider the cost of the noise to your services. The resources you don’t spend on curtailing the spam up front, you’ll spend on figuring out why your emails are suddenly being blocked by various filters and how to fix it. Turns out it’s a very Kafkaesque experience.

You also point out yourself that no data was lost in either of the false positive cases – probably because they learned from previous mistakes and now have the ability to disable accounts without actually deleting any data. Having that capability tremendously reduces the cost of false positives.

I think Twitter now has the torch as the premier place to spam, so the volume of garbage they see is going to be incredibly high. While I agree with your sentiment (it’s a hard problem), there are now half a thousand people working there. I would expect them to do a bit better. You hint at the real problem in your inlined job offer: working on spam can be hard, boring, and only occasionally rewarding. Nobody really wants to do it.

This all comes down to Bayes’ theorem
I agree the cost of false positives is high, but the cost of spam is growing and getting higher. Not a day goes by that I (and I’m certainly no big user) don’t get at least one spam @reply. I try to report them, but I feel it’s wasting my time as well…

Actually to both Mike and Julien’s point I think working on the spam/abuse problem is a hard and interesting problem (the best type), the comment about something better to do was to folks who had actually made a decent start solving it but were stuck in some disposable bubble era silly startup and might want to take a stab at solving risk/abuse issues in an interesting context. (say, with $$$$ on the line)

Mike, your couple of weeks and 1 person was to catch the tail end of spammers — there was actually a fairly extensive suite of techniques before that kicked in (and a large tier 1 staff) which of course meant the last bit that got through were that much more clever.

So you don’t seem to want to count the cost of doing nothing, or worse, making it easier for the spammers, which seems to be your case. You suffer from Google syndrome, which is that your site so far has been so wildly successful that your vanity tells you it is because you’re such a hot shot, when in reality, you just won the lottery. I don’t use your glue-stuff-together-and-call-it-art marketplace, but I know people who do, and believe me, the cost of doing nothing will be your ruination in the long run.

I’d disagree with your assessment. Spam wasn’t Tier I’s only purpose – surely everyone has a general low-level support staff; am I being naive? I don’t know what state things are in now, but when I was leaving, the tools I built (it was much more than two weeks) were the primary way spam accounts were getting deleted en masse. There were even tools to automatically close support cases that were related to a spam account when said account was deleted. As a result, Tier I didn’t actually deal much with spam beyond the very very obvious, precisely for the reasons you outline. And I didn’t even get around to building the REALLY fancy things I wanted to build – because what was already there was good enough and I was assigned to something else. I’m not trying to inflate my contribution here, those tools were still being worked on last I heard (again, by one guy); my point is merely to indicate that dealing with the problems you describe isn’t an infinite resource sink. There’s the usual 80/99/99.9% problem, but you can get pretty damn close. When you look at it in terms of account numbers and not content numbers, the numbers aren’t as daunting.

I agree with you that separating the outliers/early adopters/power users from spammers is the hardest part of automated spam detection. I just read this as an apology of sorts for why spam is so bad on Twitter. If $$ is on the line, resources need to be dedicated to the problem. I don’t believe it requires an insurmountable amount of resources.

There’s also a bit of a reverse incentive — spammers make all your top-line numbers look really good.

At WP.com we now mark around 6k blogs as spam every day and try to back-adjust stats that the spammers’ activity impacted. It’s a bummer to go back and see that 14% growth week was actually 4%. You have to be careful which numbers you trumpet publicly, because few companies have the will to say their numbers are going down because they missed X spammers.

However I think it’s totally worth it, and it’s a problem that deserves devoting at least 5% of your engineering organization to because you’ll get at least that much in resource savings.

As a tip for avoiding the outlier problem: whitelists are as important as automated spam detection. Apply the same approaches for trust that you do for finding spam.

BTW you forgot the good news: if you’re getting spammed, it means you’ve made it.

You’re quite right. I once founded and sold a company which had an anti-spam tool which didn’t require constant maintenance, which is pretty unheard of in an industry where teams of people update spam signatures every day. When the buyer abandoned it, I continued to use the service for three years without updates. It was fantastic at catching spam…but it had a high (maybe .5%) false positive rate. You can’t flag one in every 200 messages as spam; users are not going to check their spam folders often enough (and worse, it wasn’t by message, it was particular senders who tended to get flagged). And while we had ways a user could train it to be better over time, that just wasn’t good enough to make it work as a standalone product.

[…] with the argument the “false positives are expensive” argument (Yes, I’ve read @kellan’s excellent write-up, and have firsthand experience with this as well) let me call out that this is an entirely […]