Wednesday, June 2, 2010

Anyone who has tried to use Mechanical Turk for any large scale task knows what is the biggest problem: Spammers!

Many people try initially to post small tasks on Turk and are amazed to get back high-quality results of high quality, very quickly, and with very small cost. Then, they post similar, but much larger, tasks and they realize that now the results are pretty much crap. Spammers start giving random answers, and since they spend no effort, they can easily be the most prolific workers! For example, in the Snow et al. 2008 paper, almost 40% of the submissions came from spam workers, according to Bob Carpenter's analysis.

Unfortunately, this problem is widespread, and tends to discourage new requesters. People just want to get things done on Mechanical Turk, they do not want to spend their time fighting spammers.

Spam Protection Measures: Qualification tests and "best of 3" policy

Qualification tests: Some people, to avoid spammers, use qualification tests. While some measures, such as qualification tests help, they are not a panacea. Determined spammers simply pass the qualification test, and proceed with spamming. Nevertheless, it is a good idea to have a qualification test, as it also tends to filter out workers that come and complete only a couple of HITs. The main disadvantage is that you need programming or use of the command-line tools to create and require a qualification test (i.e., the web interface is not enough).

Best of Three: Other requesters use a "best of 3" (or "2+1") policy: if two workers agree on their answer, the answer is accepted. This technique is relatively widespread across seasoned requesters (and drives some legitimate workers crazy).

Why do legitimate workers hate this verification mechanism? Because every time a worker does not agree with the other two, the submission gets rejected. This drops the acceptance ratio, an important reputation characteristics on Mechanical Turk.

Interestingly, the "best of 3" strategy led to an interesting counter-strategy from spammers: they come in pairs and vote concurrently at the same task, giving the same votes. This defeats the simple "best of 3" approach. (Update: Josh in the comments suggests a good strategy to avoid this: Randomize the HITs by picking examples from a pool, instead of having "static" HITs. This is a generally good suggestion, even if someone uses the more sophisticated tools described below.)

While the "best of 3" approach ensures some basic quality of the data, it does not work well in the long run. Legitimate workers tend to avoid such HITs, as their approval rating can plummet if there are a couple of spammers lurking around. Also, legitimate workers get penalized too severely for honest mistakes.

A better approach for detecting the quality of the workers is to perform the analysis over a larger set of HITs, rather than operating on single HITs, as "best of 3" does.

Using gold standard: One approach, followed often, is to slip into the stream of HITs some tasks with known correct answers. By comparing the answer given by the worker with the correct answer, we can compute the "confusion matrix" for each worker. In other words, we can see how often a worker gives answer B when the correct answer is A, how often workers says B when correct answer is B, and so on.

Initialize the error-rates of the workers (e.g., assume all workers are perfect)

Initialize the most likely class for each example (e.g., using majority vote)

Compute the confusion matrix (aka error rates) for each worker: Compare the answers given by the worker with the estimated correct answers.

Compute the most likely class for each example: Use the answers given by the workers, and weight the answers using the confusion matrix for each worker.

Go to step 3, until convergence

Interestingly enough, the supervised and unsupervised techniques can be naturally combined: In step 4, for all the examples for which we know the correct answer, we simply do not update the class label. Trivially easy. (In a dual fashion, we can also incorporate trusted workers by not updating the error rates in step 3.) By providing some known data, the algorithm can actually estimate with less data the quality of the workers and the correct labels for the unknown examples.

where $Pr[A \rightarrow B]$ is the probability that the worker will classify an object that belongs to class A into class B. Of course, a perfect worker will have a confusion matrix that is perfectly diagonal