Tuesday, July 27, 2010

At HCOMP this year, one of the most memorable and most discussed presentations (although highly unconventional) was by M. Six Silberman, who discussed the "Sellers' problems in human computation markets". The basic question: can we protect the workers there from exploitation and from sweatshop salaries?

Luis von Ahn posted a similar post on his blog. In the comments of the blog post, someone suggested that the low wages on Mechanical Turk are simply the result of a high supply of workers and low demand for their work. As there is more supply, the salaries drop. And having minimum wages would interfere with the free market.

I actually disagree with this interpretation. First of all, there is no oversupply of labor on Mechanical Turk. The distribution of completion times, which follows a power law, suggests that the market operates at maximum capacity. My gut instinct actually tells me that there are not enough workers available for the posted work, not vice versa.

I can hear the protests: If there is not enough supply of workers, why don't requesters simply increase the offered prices?

My explanation: The requesters already pay minimum wages for work that is worth minimum wage. How is that possible given the effective hourly rate of \$2/hour?

The basic problem: Spammers. Given that many large tasks attract spammers, most requesters rely on redundancy to ensure quality. So instead of having a single worker to do a task, they get 5 workers to work on it. This increases the effective rate from \$2/hr to \$10/hr.
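The arithmetic behind this claim can be sketched in a few lines. The function name is made up for illustration, and the numbers are simply the ones from the example above (a \$2/hr rate per worker and 5 redundant workers), not measurements.

```python
def effective_hourly_cost(rate_per_worker, redundancy):
    """Cost of one hour of usable work when every task is done `redundancy` times."""
    return rate_per_worker * redundancy

# Numbers from the example above: $2/hr per worker, 5 workers per task.
cost = effective_hourly_cost(2.0, 5)
print(cost)  # 10.0
```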

Effectively, what Amazon Mechanical Turk is today is a market for lemons, following the terminology of Akerlof's famous paper, for which he got the 2001 Nobel prize.

A market for lemons is a market where the buyers cannot evaluate beforehand the quality of the goods that they are buying. So, if you have two types of products (say good workers and low quality workers) and cannot tell which is which, the price that the buyer is willing to pay will be proportional to the average quality of the worker. So the offered price will be between the price of a good worker and a low quality worker. What would a good worker do? Given that good workers will not get enough payment for their true quality, they leave the market. This leads the buyer to lower the price even more towards the price for low quality workers. At the end, we only have low quality workers in the market (or workers willing to work for similar wages) and the offered price reflects that.
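The unraveling described above can be illustrated with a toy model. This is only a sketch under strong assumptions that are mine, not Akerlof's full model: each worker's reservation wage equals the true hourly value of their work, the buyer always offers the average value of the workers still in the market, and the qualities are made-up numbers.

```python
def unravel(qualities):
    """Toy lemons dynamics: the buyer offers the average quality of the
    remaining workers, and workers worth more than the offer leave."""
    remaining = sorted(qualities)
    offers = []
    while remaining:
        offer = sum(remaining) / len(remaining)
        offers.append(offer)
        stayers = [q for q in remaining if q <= offer]
        if stayers == remaining:  # nobody left this round: the market is stable
            break
        remaining = stayers
    return offers, remaining

# Hypothetical hourly values of five workers.
offers, remaining = unravel([2, 4, 6, 8, 10])
print(offers)     # [6.0, 4.0, 3.0, 2.0]
print(remaining)  # [2]
```

In this toy run the offered price falls round after round until only the lowest quality worker is left, which is the pattern the paragraph above describes.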

This is exactly what is happening on Mechanical Turk today. Requesters pay everyone as if they are low quality workers, assuming that extra quality assurance techniques will be required on top of Mechanical Turk.

So, how can someone resolve such issues? The basic solution is the concept of signalling. Good workers need a method to signal to the buyer their higher quality. In this way, they can differentiate themselves from low quality workers. Unfortunately, Amazon has not implemented a good reputation mechanism. The "number of HITs worked" and the "acceptance percentage" are simply not sufficient signalling mechanisms.

Here are some ideas:

Allowing workers to get endorsements from reputable requesters (to avoid scam rings like on eBay)

Allowing requesters to post machine readable feedback on the performance of the workers, disconnecting evaluation from the approval rate.

Publishing the reputation history of the workers, so that requesters can evaluate the quality of the worker.

Of course, similar measures can be adopted for requesters! There is a symmetric market for lemons on that side! Scam requesters post HITs, behave badly, and cause good workers to avoid any newcomer. New requesters then get only low quality workers, get disappointed with the quality of the results, and leave the market.

In other words, Amazon can only gain by taking the time to build a more robust reputation system on top of Mechanical Turk. Trust is at the very core of marketplaces. If Mechanical Turk wants to "grow up", then a good reputation system for both sides of the market is long overdue.

Everyone who has attended a conference knows that the quality of the talks is very uneven. There are talks that are highly engaging, entertaining, and describe nicely the research challenges and solutions. And there are talks that are a waste of time. Either the presenter cannot present clearly, or the presented content is impossible to digest within the time frame of the presentation.

We already have reviewing for the written part. The program committee examines the quality of the written paper and vouches for its technical content. However, by looking at a paper it is impossible to know how nicely it can be presented. Perhaps the seemingly solid but boring paper can be a very entertaining presentation. Or an excellent paper may be written by a horrible presenter.

Why not have a second round of reviewing, where the authors of accepted papers submit their presentations (slides and a YouTube video) to be considered for presentation at the conference? The paper will be accepted and included in the proceedings anyway, but having a paper does not mean that the author gets a slot for an oral presentation.

Under an oral presentation peer review, a committee looks at the presentation, votes on accept/reject and potentially provides feedback to the presenter. The best presentations get a slot on the conference program. This also allows the conference to accept more papers that are worthy of inclusion in the proceedings, without worrying about capacity constraints.

Some other side benefits of this scheme:

Presentations are accessible in an archival format

Authors have hard incentives to be better presenters

The time of the attendees in conferences is not wasted in clearly sub-par presentations.

And if someone says that this system is biased towards good and slick presenters, I would argue that the system is already biased towards good authors. A well-written paper will eventually have a higher impact than one that is badly written. Same thing for presentations.

Learning to communicate properly the results of our research should be a goal, not an afterthought.

Monday, July 26, 2010

Following last year's practice, I am blogging about the workshop this year as well. (If I do it well for a few more years, I hope to be a tenured blogger for HCOMP.)

The workshop was very well-attended, despite the strong competition for attention (there are 14 workshops at KDD this year).

8.40am, Invited Talk

The workshop started with an invited talk by Ross Smith from Microsoft, who talked about "Using Games to Improve Productivity in Software Engineering". He described how Microsoft used games internally to improve the quality of its products.

He particularly emphasized the attempts to use engineers to help with the localization/internationalization of Microsoft products. One of the interesting things was the pride that engineers had when translating the messages into their own native language. The "my-work-is-in-the-product-that-you-are-using-mom"-factor was a strong motivation for engineers to contribute their time and effort to such volunteer efforts.

Ross also covered a variety of behavioral aspects of the process. For example, leaderboards have different effects, depending on how competitive the workers are in a particular country. When people compete, leaderboards are a positive factor. However, if someone has a big lead over the rest, this is typically a demotivating factor, as many workers know that they cannot reach the top in any case.

Another interesting factor is that games should not be similar to the duties of the worker. If a programmer writes C++, then playing a game that requires the worker to write C++ is bad. First, the programmer may spend time in the game that he would have otherwise spent programming. Second, if a worker is way too good at a task and spends a lot of time there, other workers cannot compete easily and get demotivated.

An interesting aspect going forward is that games are increasingly being used to tap into the "discretionary" time of the workers, so there is now competition to make the games more interesting, more attractive, more meaningful etc. For example, currently workers accumulate points in the games, and their reward is that these points are translated into dollars that can be donated to disaster relief efforts.

Finally, games should not come with a mandate from the management. (Luis von Ahn mentioned that when you say "play the game" people do not.) The counterexample was Japan, in which the linguistic pride to get the applications translated into Japanese worked (despite? due to?) the clear mandate from the management to play the game that aids in the localization of Microsoft products.

The paper described how workers tend to search for tasks on Amazon Mechanical Turk. The analysis indicated that "Most HITs Available" and "Most Recently Posted" are the rankings most commonly used by workers to find tasks. By monitoring HITs and scraping the website every 30 seconds, the authors figured out how quickly different tasks are being done.

They also ran a survey, trying to target rankings that are NOT frequently chosen by workers, and compared a "best case" scenario with a "worst case" scenario. Interestingly enough, there is almost a 30x difference in the rate of completion. This explains all the gaming that is going on today, where major requesters keep posting HITs within their HITgroups to keep their HITs on the first page.

This paper really highlights the negative aspects of the prioritization schemes currently used on Mechanical Turk. By allowing workers to find tasks to work on more easily, and by employing some randomization in the presentation, Mechanical Turk could really contribute to more predictable completion times for the tasks.

This paper described Rabj, the platform used by Metaweb to improve the quality of FreeBase by having humans look at the ambiguous cases that cannot be handled well by automatic techniques. The basic goal is to take a system that is 99% accurate and improve precision well above 99%.

Metaweb did not use Mechanical Turk for this task. Instead, they hired people through oDesk, training them for a day first, so that they can do their tasks properly, and then letting them work. By building a long term relationship, they were able to improve the quality of the results, without employing overly complicated solutions for the worker quality problem. They use the oDesk API as well, and pay an hourly wage that varies from \$5 to \$15 per hour, depending on the complexity of the task.

One thing that was interesting is that they are paying per hour, and not per piece. This is a conscious choice. The distribution of completion times for various tasks follows a lognormal distribution. At the very tail, we have the hard tasks that need a lot of time. These are actually the tasks that Metaweb cares a lot to get right. Paying by piece means that workers have the incentive to do these tasks quickly and move to the next. Paying for time means that workers can spend more time on such hard tasks. The quality control process of Metaweb includes testing workers for throughput (a worker who is significantly slower than the others gets warned and then dismissed).
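The incentive difference can be made concrete with a small calculation. This is only an illustrative sketch: the piece rate, the hourly wage, and the task durations below are hypothetical numbers of my own, not figures reported by Metaweb.

```python
PIECE_RATE = 0.25    # hypothetical payment per completed task, in dollars
HOURLY_WAGE = 10.0   # hypothetical hourly wage, inside the $5-$15 range above

def implied_hourly_rate(piece_rate, minutes_per_task):
    """What a worker earns per hour under piece pay if every task takes this long."""
    return piece_rate * 60.0 / minutes_per_task

# Hypothetical durations: two easy tasks and one hard task from the long tail.
task_minutes = [1, 2, 30]
implied = [implied_hourly_rate(PIECE_RATE, m) for m in task_minutes]

for minutes, rate in zip(task_minutes, implied):
    print(f"{minutes:>2} min/task: piece pay = ${rate:.2f}/hr, hourly pay = ${HOURLY_WAGE:.2f}/hr")
```

Under piece pay, the 30-minute task is worth a small fraction of what the easy tasks pay per hour, so the worker is pushed to rush or skip exactly the tasks in the tail. Under hourly pay the rate is the same regardless of difficulty.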

One interesting aspect of this presentation was its slideless nature. The speaker just read the conclusions from his notes. Although I found the mode of presentation difficult to follow, I think the message was clear: Do we care about the workers? Do we pay them fairly? There was significant discussion afterwards, and I bet this is the only place in KDD (or in any other CS conference) where people engaged in discussion about the fairness of minimum wage laws, issues of immigration and labor, and so on :-)

Markus had a very interesting game for discovering synonyms and antonyms. You control a spaceship, and you try to shoot down the antonyms of the word given to you and collect the synonyms. This was a real arcade game, with graphics, collision detection, and so on. Markus mentioned that he wrote it in Flash, because it is fast, and because there are websites where you post your game for people to procrastinate, and then you effortlessly get users. He routinely gets 3 million players a month. Even very simple games (e.g., click the boxes) get 5000 users to play them.

This is a case study of two attempts to crowdsource writing a novel. The first attempt, by Penguin Books and De Montfort University, used a wiki to crowdsource a novel. The result was a failure: no organization, disconnected elements, and an incoherent result. When the BBC attempted the same a couple of years later, the result was a success. The difference? The BBC assigned a curator, who oversaw the process. Lesson? Any attempt to harness the wisdom of the crowds needs a reliable aggregator that will kick out the junky contributors and their contributions, keeping only the good contributions from the crowd.

Interesting real-life game: The goal is to cover and create a 3D reconstruction of a city. Players get points when they go out, take a photo, and cover a part of a city/building that was not covered before. Using the images, they can reconstruct in 3D the buildings without gaping holes.

A game in which you listen to a song and try to guess the tempo and sentiment of the song, and agree with a co-listener. There is a continuous, intermittent feedback about the choice of the other player. The player that moves first to the agreed location gets extra points as influencer. I make it sound more complicated than it is. I played it and it was very intuitive and easy to play.

A very interesting study about how the design of a HIT can influence participation. They change HIT parameters (price/design/etc) and examine for how long users will keep doing HITs. Reminded me a little bit of Dan Ariely's work on how motivation affects desire to work on a task.

A demo that showed how two monolingual humans can collaborate to translate a document. They start with a machine translation, and the human examines which parts of the machine translation do not make sense. After rephrasing and sending back (again through machine translation), the other human checks if the translation makes sense and whether it corresponds to the original sentence that was translated. What I missed was how users can get motivated to participate in this system.

This game worked as follows: The user looks at a sentence, then the sentence disappears, and the user has to type the sentence again. Typically people cannot retype the exact sentence but type something similar. The main outcome is that through this game we can discover paraphrases and (especially when played by non-natives) typical mistakes in specific language constructs.

The goal of this game is to disambiguate words (e.g., think of the different meanings of "bass" in "I can hear bass sounds" and "I like grilled bass"). The idea follows the ESP game, and asks users to type alternate words for the given underlined word in a phrase. If two people agree, they move on. Taboo words appear when their usage does not allow the disambiguation of a word (e.g., the word is associated with two senses). The experimental results clearly showed that users learn over time and perform better.

For many tasks on Mechanical Turk, there are spammers submitting wrong results. Using repeated labeling and an algorithm like Dawid and Skene, we can estimate the error rates of the workers. The question is, can we infer from the confusion matrices who is a spammer? Error rate alone is not enough: Spammers that put everything in the majority class have lower error rates than honest but imperfect workers. Also, biased workers who are systematically off (e.g., more conservative or more liberal than other workers) end up having very high error rates. The solution is to compensate for the errors and see what the assigned class looks like after compensating for the errors. If the corrected labels are concentrated in one class, the worker is good. If they are spread across all classes, the worker is bad.

The TurKit toolkit introduced the idea of iterative tasks, such as iterative elimination voting and iterative tasks in which workers build on each other's results. This paper examines the outcomes of different task designs. Basic question: Does it make sense to run tasks in parallel, or does it make sense to let workers build on each other's results? For description of images, iterative tends to be better, as people really build on each other's results. Similarly for transcriptions of highly noisy results. However, for tasks with shorter answers (e.g., coming up with company names) there is an interesting tradeoff: Iterative process tends to have higher average, but parallel has higher variance. If you are interested in the max and not in the average rating of the responses, then parallel is better. Iterative will find the consensus, but it will not be great. Parallel will generate some disasters, but also some gems. So if the goal is to find the "best", then parallel processes (i.e., independence) should work best. However, if you are afraid of disastrous outcomes, then workers should interact to eliminate outliers.
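The average-versus-max tradeoff is easy to see in a small simulation. This is a sketch under assumptions of my own, not the experiment from the paper: ratings are drawn from normal distributions, the "iterative" process has a higher mean and low spread, the "parallel" process has a lower mean and high spread, and all parameters are made up.

```python
import random
import statistics

def simulate(mean, sd, n_answers, n_trials, rng):
    """Return (average rating, average best rating) over many simulated tasks."""
    averages, bests = [], []
    for _ in range(n_trials):
        ratings = [rng.gauss(mean, sd) for _ in range(n_answers)]
        averages.append(statistics.fmean(ratings))
        bests.append(max(ratings))
    return statistics.fmean(averages), statistics.fmean(bests)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical parameters: iterative = higher mean, lower variance;
# parallel = lower mean, higher variance.
iter_avg, iter_best = simulate(mean=6.0, sd=0.5, n_answers=6, n_trials=5000, rng=rng)
par_avg, par_best = simulate(mean=5.5, sd=1.5, n_answers=6, n_trials=5000, rng=rng)

print(f"iterative: average {iter_avg:.2f}, best {iter_best:.2f}")
print(f"parallel:  average {par_avg:.2f}, best {par_best:.2f}")
```

With these assumed parameters, the iterative process wins on the average rating while the parallel process wins on the best answer per task, which matches the tradeoff described above.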

The final talk of the workshop focused on optimizing task design, an area that I see as having significant potential for follow-up work. The basic question asked is: How should we optimally design a task for crowdsourcing, given a set of constraints? What will generate the best quality? What design aspects will improve speed? In a sense, how can we start moving crowdsourcing from an ad-hoc execution into a mode in which we specify the task, and a black box optimizer selects all the appropriate aspects of the design for us? The paper gave some first results on predicting the quality and quantity of tags assigned to an image and showed that designs that are predicted to be optimal before execution indeed perform much better than designs that are suboptimal.

12:00noon, Concluding Remarks

Yours truly, at the end, was assigned the task of coming up with conclusions and describing the overall themes. I think the keyword in this workshop was "Design": design of individual tasks (either games or MTurk HITs), and design of processes for handling such crowdsourced tasks in marketplaces. One theme that I would have liked to see more of is incentive design to motivate people to participate and contribute. But I was very happy overall.

After the concluding remarks there was some discussion of quality control and examining the robustness of crowdsourcing systems to manipulation attacks. While we have no definite answers on how to guarantee protection from coordinated attacks in the absence of ground truth, in the current settings we rarely see extensive collusion and coordination across attackers. Most of the current spammers are there to make an easy buck, and not to spend extensive amounts of time trying to scam pennies. (There are better targets for that.)

Of course, having ground truth for verification of answers and for worker evaluation can help significantly in that respect: Luis von Ahn mentioned the attack on reCAPTCHA from the 4chan clan, which randomly entered the word "penis" as one of the two words, hoping to fill in the digitizations of books with the word "penis". (Given that they were failing in 50% of the attempts, it was easy to isolate them and remove their entries.)
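The kind of isolation described above can be sketched as a simple failure-rate check on the word whose answer is already known. This is a hypothetical illustration, not how reCAPTCHA actually does it: the function name, the thresholds, the user ids, and the attempt log are all made up.

```python
from collections import Counter

def flag_suspects(attempts, max_failure_rate=0.3, min_attempts=5):
    """Return users whose failure rate on the known (control) word is too high.

    attempts: list of (user_id, passed) pairs, where `passed` says whether
    the user typed the control word correctly.
    """
    totals, failures = Counter(), Counter()
    for user, passed in attempts:
        totals[user] += 1
        if not passed:
            failures[user] += 1
    return {
        user
        for user, n in totals.items()
        if n >= min_attempts and failures[user] / n > max_failure_rate
    }

# Made-up log: an honest user who slips once in ten attempts, and an
# attacker who replaces a random word and so fails the control half the time.
attempts = (
    [("honest", True)] * 9 + [("honest", False)]
    + [("attacker", True)] * 5 + [("attacker", False)] * 5
)
suspects = flag_suspects(attempts)
print(suspects)  # {'attacker'}
```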

The problem of handling a completely anonymous crowd, without any ground truth knowledge, and getting good results is hard to solve. Perhaps some security people will need to take a look and examine the theoretical guarantees.

Sunday, July 4, 2010

In my previous post, I gave a brief overview of different techniques used to improve the quality of the results on Amazon Mechanical Turk. The main outcome of these techniques is a matrix that describes the error rate of each worker.

For example, consider the task of categorizing webpages as porn or not. We have three target categories:

G-rated: Pages appropriate for a general audience, children included.

PG13-rated: Pages with adult themes but without any explicit sexual scenes, appropriate only for children above 13.

R-rated: Pages with content appropriate for adults only.

In this case, the confusion matrix of a worker, inferred using the techniques described in my previous post, would look like:
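Written out symbolically, with one row per correct category and one column per answer given by the worker (the numeric values are left unspecified here), such a matrix has the form:

```latex
\[
\begin{pmatrix}
Pr[G \rightarrow G] & Pr[G \rightarrow PG13] & Pr[G \rightarrow R] \\
Pr[PG13 \rightarrow G] & Pr[PG13 \rightarrow PG13] & Pr[PG13 \rightarrow R] \\
Pr[R \rightarrow G] & Pr[R \rightarrow PG13] & Pr[R \rightarrow R]
\end{pmatrix}
\]
```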

where $Pr[X \rightarrow Y]$ is the probability that the worker will give the answer $Y$ when given a question where the correct answer is $X$. (The elements in each row should sum to 1.)

And here is the question that seems trivially easy: Given the confusion matrix, how can we detect the spammers?

Computing the Error Rate

The simple answer is: Just sum up the elements off the diagonal! Since every non-diagonal element corresponds to an error, if the sum is high, the worker is a spammer. Of course, this ignores the fact that class priors will often differ. So, instead of giving equal weights to each category, we weight the errors according to the class priors (i.e., how often we expect to see each correct answer).

Notice that the error rates for the first line, which correspond to category $G$, got weighted more heavily.
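A minimal sketch of the two ways of computing the error rate, assuming a confusion matrix stored as a nested dictionary. The worker's matrix and the class priors (80% G, 10% PG13, 10% R) are made-up numbers for illustration, not data from our experiments.

```python
# confusion[x][y] = Pr[x -> y]; each row sums to 1. Hypothetical worker.
confusion = {
    "G":    {"G": 0.9, "PG13": 0.1, "R": 0.0},
    "PG13": {"G": 0.2, "PG13": 0.7, "R": 0.1},
    "R":    {"G": 0.0, "PG13": 0.2, "R": 0.8},
}
priors = {"G": 0.8, "PG13": 0.1, "R": 0.1}  # assumed class priors

def unweighted_error_rate(cm):
    """Average off-diagonal mass, giving every category the same weight."""
    return sum(1.0 - cm[c][c] for c in cm) / len(cm)

def weighted_error_rate(cm, class_priors):
    """Off-diagonal mass of each row, weighted by how often that class occurs."""
    return sum(class_priors[c] * (1.0 - cm[c][c]) for c in cm)

print(round(unweighted_error_rate(confusion), 3))
print(round(weighted_error_rate(confusion, priors), 3))
```

With these assumed numbers, the errors on G pages dominate the weighted version because G has by far the largest prior.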

Unfortunately, this method does not work very well.

When we started using this technique, we ended up marking legitimate workers as spammers (false positives), and classifying spammers as legitimate workers (false negatives). Needless to say, both mistakes were hurting us. Legitimate workers were complaining and (understandably) badmouthing us, and spammers kept polluting the results.

Let me give some more details on how such errors appear.

False Negatives: Strategic Spammers and Uneven Class Priors

Spammers on Mechanical Turk are often smart and lazy. They will try to submit answers that seem legitimate but without spending too much time. (Otherwise, they may as well do the work :-)

In our case, we were categorizing sites as porn or not. Most of the time the sites were not porn, and only 10%-20% of the time we had sites that were falling into one of the porn categories. Some workers noticed this fact, and realized that they could keep their error rate low by simply classifying everything as not-porn.

Following the standard way of computing an error rate, these workers were faring much better than legitimate workers that were misclassifying some of the not-porn sites.

Here is an illustration. With three categories (G-, PG13-, and R-rated), the confusion matrix for a spammer looks like this:
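For a worker who answers "G" for every page, each row of the confusion matrix is simply (1, 0, 0). The sketch below compares the prior-weighted error rate of such a spammer with that of an honest but imperfect worker. The priors (80% G, 10% PG13, 10% R) and the honest worker's matrix are illustrative assumptions, not data from our experiments.

```python
priors = {"G": 0.8, "PG13": 0.1, "R": 0.1}  # assumed: most pages are not porn

# Spammer: answers "G" no matter what the page is.
spammer = {
    "G":    {"G": 1.0, "PG13": 0.0, "R": 0.0},
    "PG13": {"G": 1.0, "PG13": 0.0, "R": 0.0},
    "R":    {"G": 1.0, "PG13": 0.0, "R": 0.0},
}

# Hypothetical honest worker: perfect on PG13 and R, but marks 30% of G pages as PG13.
honest = {
    "G":    {"G": 0.7, "PG13": 0.3, "R": 0.0},
    "PG13": {"G": 0.0, "PG13": 1.0, "R": 0.0},
    "R":    {"G": 0.0, "PG13": 0.0, "R": 1.0},
}

def weighted_error_rate(cm, class_priors):
    """Sum of the off-diagonal mass of each row, weighted by the class prior."""
    return sum(class_priors[c] * (1.0 - cm[c][c]) for c in cm)

spammer_error = weighted_error_rate(spammer, priors)
honest_error = weighted_error_rate(honest, priors)
print(round(spammer_error, 2), round(honest_error, 2))
```

With these assumed numbers the spammer comes out at roughly 0.20 and the honest worker at roughly 0.24, so the spammer looks better even though the spammer's answers carry no information at all.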

The second type of error is when we classify honest workers as spammers. Interestingly enough, when we started evaluating workers, the top "spammers" ended up being members of the internal team. Take a look at the error rate of this worker:

The error rate would imply that this worker is essentially random. A clear case of a worker that should be banned.

After a careful inspection though, you can see that this is not the case. This is the confusion matrix of a worker that tends to be much more conservative than others and classifies 65% of the "G" pages as "PG13". Similarly, all the pages that are in reality "PG13" are classified as "R". (This worker was a parent with young children and was much more strict on what content would pass as "G" vs "PG13".)

In a sense, this is a pretty careful worker! Even though this worker does mix up R and PG13 pages, there is a very clear separation between G and PG13/R pages. Still, the error rate alone would put this worker very clearly in the spam category.

Solution: Examine Ambiguity not Errors

You will notice that one thing that separates spammers from legitimate workers is the information provided by their answers. A spammer that gives the same reply all the time does not give us any information. In contrast, when the biased worker gives the answer "PG13", we know that this corresponds to a page that in reality belongs to the "G" class. Even if the answer is wrong, we can always guess the correct answer!

So, by "reversing" the errors, we can see how ambiguous the answers of a worker are, and use this information to decide whether to reject a worker or not.
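One possible way to quantify this idea is sketched below; it is not necessarily the exact computation used in our code. Each answer is "reversed" with Bayes' rule into a probability distribution over the true classes, and the ambiguity of that distribution is scored as the chance that two independent draws from it disagree (0 means the answer pins down the true class). The priors are assumed, and the "R" row of the biased worker is my own guess, since the text above only specifies the "G" and "PG13" rows.

```python
CLASSES = ["G", "PG13", "R"]
priors = {"G": 0.8, "PG13": 0.1, "R": 0.1}  # assumed class priors

spammer = {c: {"G": 1.0, "PG13": 0.0, "R": 0.0} for c in CLASSES}
biased = {
    "G":    {"G": 0.35, "PG13": 0.65, "R": 0.0},
    "PG13": {"G": 0.0,  "PG13": 0.0,  "R": 1.0},
    "R":    {"G": 0.0,  "PG13": 0.0,  "R": 1.0},  # assumed row
}

def posterior(cm, class_priors, answer):
    """Pr[true class | worker gave `answer`], or None if the answer never occurs."""
    joint = {c: class_priors[c] * cm[c][answer] for c in CLASSES}
    total = sum(joint.values())
    return None if total == 0 else {c: v / total for c, v in joint.items()}

def ambiguity(soft_label):
    """Chance that two independent draws from the soft label disagree."""
    return 1.0 - sum(p * p for p in soft_label.values())

def expected_ambiguity(cm, class_priors):
    """Ambiguity of the corrected answer, averaged over the answers the worker gives."""
    total = 0.0
    for answer in CLASSES:
        p_answer = sum(class_priors[c] * cm[c][answer] for c in CLASSES)
        if p_answer > 0:
            total += p_answer * ambiguity(posterior(cm, class_priors, answer))
    return total

spammer_score = expected_ambiguity(spammer, priors)
biased_score = expected_ambiguity(biased, priors)
print(round(spammer_score, 2), round(biased_score, 2))
```

Under these assumptions the spammer's corrected answers just reproduce the priors (a score of about 0.34), while the biased worker scores about 0.10: the "G" and "PG13" answers both identify a "G" page with certainty, and only the "R" answer remains ambiguous between "PG13" and "R".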

You can also find a demo of the algorithm at http://qmturk.appspot.com/ and you can plug in your own data to see how it works. The code takes as input the responses of the workers, the misclassification costs, and the "gold" data points, if you have any. The demo returns the confusion matrix for each worker and the estimated "cost" of each worker. The output is plain text and kind of ugly, but you can find what you need.

About Me

Panos Ipeirotis is an Associate Professor and George A. Kellner Faculty Fellow at the Department of Information, Operations, and Management Sciences at Leonard N. Stern School of Business of New York University.