Latest Publications

Motivation: MTurk is a labor market, and for a labor market to be efficient, workers have to sort through the available jobs and pick the ones they want to do, while employers have to find suitable workers among all those available. Unlike canonical market examples such as the corn market or the stock market, every worker and every job is unique, so the matching problem requires special tools. Efficient search is especially important on MTurk: for full-time employment, workers are willing to spend six months on a job search, but for a 20-second task, workers probably aren’t willing to spend even an hour searching. The future of online human computation depends on efficiently matching workers with employers.

Results: We published a paper studying how workers on MTurk currently search for tasks. MTurk gives workers seven ways to sort tasks:

1. Title (e.g., “Choose the best category for this product”)

2. Requester (e.g., “Dolores Labs”)

3. HIT Expiration Date (e.g., “Jan 23, 2011 (38 weeks)”)

4. Time Allotted (e.g., “60 minutes”)

5. Reward (e.g., “$0.02”)

6. HITs Available (e.g., 17110)

7. Required qualifications (if any)

(Turkers can also search by keyword, but we didn’t study this).

We found strong evidence that Turkers sort by the largest number of HITs available (so they can find one task and then do 100 instances of it in a row) and by the most recently posted HITs (so they get the latest and greatest HITs).

This second result is interesting. It indicates that Turkers care about the newness of HITs. New HITs are constantly coming in, so Turkers can almost always find something interesting to do on the first N pages of the most recently posted HITs. It also seems that Turkers enjoy novelty – doing the newest tasks is more interesting than labeling images all day long.

Notably, Turkers don’t sort by the highest reward. We hypothesize that this may be because:

1. A high reward usually means a long task, and since Turkers presumably want to maximize their wage rate, they either avoid long tasks or find it difficult to estimate how long they will take.

2. Many high-reward HITs are undesirable, and since you can’t delete items you’re not interested in from MTurk’s interface, it becomes very hard to revisit the “highest reward” page to find new things, because you also see all the old things you didn’t want.

We feel that adding better search functionality to MTurk would substantially improve the service for both workers and requesters.

Employers! How to get your tasks done: If you’ve had trouble getting your tasks done, one strategy we saw some of the biggest requesters using is to use a script that constantly updates your HITs so that they stay on the first couple pages of most recently posted HITs. You don’t actually have to change the HIT very much to be considered new again.
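The update trick above is easy to script. Below is a minimal sketch using the modern boto3 MTurk client (an assumption on my part; requesters at the time would have used the older SOAP/REST API, but the idea is the same). The title, reward, lifetime, and question payload are placeholders, not any actual requester’s script: the point is simply to repost the HIT on a schedule so it keeps sorting as newly created.

```python
import time

def repost_forever(client, question_xml, interval_s=3600, max_rounds=None):
    """Post a fresh copy of the HIT every interval_s seconds, so the task
    keeps appearing near the top of the 'most recently posted' sort.
    `client` is assumed to expose boto3's MTurk create_hit call.
    Returns the list of HIT ids created."""
    hit_ids = []
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        hit = client.create_hit(
            Title="Choose the best category for this product",  # placeholder
            Description="A short categorization task",          # placeholder
            Reward="0.02",
            MaxAssignments=100,
            AssignmentDurationInSeconds=600,
            LifetimeInSeconds=interval_s * 2,  # old copy expires on its own
            Question=question_xml,
        )
        hit_ids.append(hit["HIT"]["HITId"])
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_s)
    return hit_ids
```

In practice you would also want to expire or delete the previous copy so workers don’t see duplicates, but even this bare loop reproduces the “always on page one” effect the big requesters were getting.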

Abstract: In order to understand how a labor market for human computation functions, it is important to know how workers search for tasks. This paper uses two complementary methods to gain insight into how workers search for tasks on Mechanical Turk. First, we perform a high frequency scrape of 36 pages of search results and analyze it by looking at the rate of disappearance of tasks across key ways Mechanical Turk allows workers to sort tasks. Second, we present the results of a survey in which we paid workers for self-reported information about how they search for tasks. Our main findings are that on a large scale, workers sort by which tasks are most recently posted and which have the largest number of tasks available. Furthermore, we find that workers look mostly at the first page of the most recently posted tasks and the first two pages of the tasks with the most available instances, but in both categories the position on the result page is unimportant to workers. We observe that at least some employers try to manipulate the position of their task in the search results to exploit the tendency to search for recently posted tasks. On an individual level, we observed workers searching by almost all the possible categories and looking more than 10 pages deep. For a task we posted to Mechanical Turk, we confirmed that a favorable position in the search results does matter: our task with favorable positioning was completed 30 times faster and for less money than when its position was unfavorable.

Can we eke creativity out of Mechanical Turk? Clearly, lots of Turkers try to get away with as little thought as possible. But if we frame the task correctly, we can get some very clever outcomes.

Here’s another stake in the ground: reinterpreting content “in the style of” an artist or writer. Let’s take as input some news stories from this past summer. Whose style should we reinterpret them into? Only one of the most famous children’s book writers this side of Cloudy with a Chance of Meatballs, of course: Dr. Seuss.

We gathered headlines and bylines from the news, then asked Turkers to rewrite each headline in the style of Dr. Seuss. To prompt them with examples, we listed several Dr. Seuss book titles. Then, we gathered all the results and asked a second set of Turkers to rate how amusing the transformation was, on a 1-7 Likert scale. Ten Turkers rated each headline.

Here, then, are the most amusing headlines and their Dr. Seussifications:

Ten World Cup headlines yet to be written → How Many Ounces Will The World Cup Hold? (rated 4)

Wounded taxi driver ‘watched friend shot in face’ → Bird Shoots Man (rated 3)

BP tries again to cap well; protests set to start → BP and the Oilbleck (rated 3)

Ten World Cup headlines yet to be written → The (Cat)Rabbit’s Still in the Hat(Bag) (rated 2.5)

Russia to launch 520-day mock mission to Mars → I Can’t See Red With No Windows (rated 2)

BP tries again to cap well; protests set to start → BP Plugs Ears At Protestors (rated 2)

Russia to launch 520-day mock mission to Mars → The 520 Days of the Mission of Mars

There is definitely a lot of unfunny content, but some pretty clever stuff in there as well. For the most part, Turkers are successful at recognizing creativity, but less eager to create it. (It took a week or so to get just these suggestions.)

The Sorites Paradox goes something like this: Is this tile red? Sure. What about this tile? No, it looks orange. Would you say that two sufficiently similar tiles are the same color? I suppose so, if they were so similar that I couldn’t tell them apart (if you can tell these particular tiles apart, kudos, but imagine two even more similar tiles). So, if we had a long line of tiles that slowly progressed from red to orange, and each pair of adjacent tiles was so similar that you couldn’t tell them apart, where would the red tiles stop and the orange tiles begin?

Some philosophers puzzle over this even today. The problem is that logic appears to contradict intuition. Classical logic concludes that there must be a red tile next to a non-red tile. Intuition concludes that this is pretty silly when we can’t tell any two adjacent tiles apart.

Now that you’re in a philosophical mood, you might ask, what does it mean for a tile to be red? Good question. Some philosophers say that the meaning of a word is defined by how people use it. So let’s ask Mechanical Turk.

We took 8 tiles on the gradient from red to orange. We showed each tile to 10 different turkers, and asked them whether the tile was red or not. This plot shows how many people said that each tile was red:

This chart suggests that there is no clear boundary between red and orange. It also suggests that it doesn’t really make sense to say that all tiles are either red or not-red, since people disagree about tiles in the middle. It might make more sense to say that a tile is 70% red, or 20% red. This view is called “Group consensus” on the Wikipedia page for Sorites Paradox.

We could probably run this experiment with more people over more tiles to get an even smoother curve from red to orange. If we did this, then we could develop a better intuition for what happens with adjacent tiles. No individual can distinguish them, but can the crowd distinguish them?
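Scaling the experiment up is mostly a matter of generating more tiles and aggregating more votes. Here is a minimal sketch; the endpoint colors (pure red as RGB (255, 0, 0), orange as (255, 165, 0)) are assumptions, since the post doesn’t specify the actual tile colors used.

```python
def gradient(start, end, n):
    """n RGB tiles linearly interpolated from start to end (inclusive).
    Assumes n >= 2."""
    return [
        tuple(round(s + (e - s) * i / (n - 1)) for s, e in zip(start, end))
        for i in range(n)
    ]

def red_fraction(votes_per_tile):
    """votes_per_tile: one list of booleans per tile, each answering
    'is this tile red?'. Returns the crowd's 'percent red' per tile,
    i.e. the curve plotted above."""
    return [sum(v) / len(v) for v in votes_per_tile]
```

With enough tiles and voters, the interesting question becomes whether `red_fraction` differs measurably between adjacent tiles that no individual can tell apart.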

It might also be fun to capture the gradient of other terms, like “tall” or “rich”.

TurKit Online

We have been busy this past month or so developing an online version of TurKit, which was used to run this experiment. This version allows people to execute long-running experiments “in the cloud”, without leaving their personal computer on all night. TurKit Online runs on Google App Engine and uses your Google ID. The entire web app is open source.

Where’s the code and data for this experiment?

The online version of TurKit doesn’t export projects yet, but it will soon. We’ll post the code and data soon.

This is a guest post by Michael Bernstein, a computer science PhD student in our research group.

In all the discussion galvanizing Mechanical Turk workers to pursue worthwhile activities, we’ve overlooked one of the basic human social needs: humor. In a moment of fleeting fancy, I decided to ask 50 Mechanical Turk workers, “What would you do for a Klondike Bar?”

Turns out, they would do a lot — the responses were pretty funny. But I needed to know more. Who was the funniest turker of all? I resubmitted the Turkers’ answers to other Turkers, asking them to rate how funny the responses were on a Likert (1 – 7) scale. Five Turkers rated each response, and I used the median rating to determine final scores.
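The scoring step is easy to reproduce. Here is a minimal sketch of the median-based ranking described above (function and variable names are mine, not from the original code):

```python
from statistics import median

def rank_by_median(ratings):
    """ratings: dict mapping each response to its list of 1-7 Likert
    scores (five per response in the experiment above). Returns
    (response, median score) pairs, funniest first."""
    scored = [(resp, median(rs)) for resp, rs in ratings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The median is a sensible choice here: with only five raters, one joker awarding a 7 (or a 1) can’t move a response very far.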

Ladies and gentlemen, the results. The funniest responses are at top:

Median funniness – What would you do for a Klondike Bar?

7 – I would literally punch a rabid bear in the face, knocking it unconscious, then use its body as a snowboard down an icy mountain until I reached the entrance to the caves of the Elder Deep Dwarves. I would sacrifice the bear in an blazing bonfire to the god Kord, lord of brawling, strength, and courage, and be granted the strength to kick down the dwarven stone gate. I would fight my way through their finest warriors and take from them their legendary fire repelling Mithral armor, and the sword Soulcutter, forged from star metal. With my prizes in hand, I would travel to the domain of Snarv Coalbreath, legendary dragon. A ten hour battle would ensue between us. I would slay the dragon, but my left arm would be torn off in the fight. Weary and wounded, I would brush aside the gold and diamonds of the dragon horde and take the single, perfect Klondike bar.

6 – I would wrestle a polar bear for a Klondike bar

6 – I would eat three meals of only tofu.

5 – I’d walk to the store and it’s like 3 miles from here.

5 – Hit the stupid Klondike bear in the head for one.

5 – I would go shopping all day wih my mother-in-law who drives me nuts.

5 – I would make arrangement for all the enjoyment available in the town

5 – wait till summer – it’s too cold to think about ice cream

5 – Nothing. I’m lactose intolerant

5 – I would do anything I judged worthy of the current value of Klondike bar. If a Klondike bar was $.75, I would do anything I judged to be worth $.75. Look, you wanna see me in person, and hash out some deals for ice cream, we can talk. But here’s some starters. I’ll recite poetry, solve basic math problems, teach you how to tango, or give you some quick dating advice. If you give me a lot of Klondike bars, I am willing to do some higher end stuff, including car repairs, building houses, dam maintenance, and contract killings.

5 – cluck like a chicken

5 – Dance with a baby monkey !

5 – I would do anything that is legal.

5 – Jump off a cliff.

5 – I would spend an evening listening to my mother in-law telling me how inferior I am as a person.

5 – For a Klondike bar, I would jump rope with 12 kindergarteners.

5 – I’d swim through a sea of chocolate syrup!!!

5 – I would wear a trout suit and swim up a waterfall then fight off the bear for a klondike bar.

5 – tell you about the latest sale at Jos. A Banks

5 – I would accept this HIT.

5 – I would dress up like a polar bear and wade into a public fountain in the middle of winter!

5 – I’d get out of my warm bed when I’m watching TV and go down a flight of stairs and back again at 2 in the morning. Wouldn’t do that for much else.

I said before that I had difficulty getting 500 people to do a task, but I had forgotten the details of whatever experiment that was, so I ran a fresh experiment. Turkers were offered 1 cent to pick a number from 1 to 10. I wanted 1000 people to do the task.

This chart shows the cumulative number of workers over time, starting at about 1pm on the Sunday after New Years.

We got 100 turkers in the first 3 hours, but it took a little over 4.5 days to get 500. The blue “update” marker indicates the time when I updated the task on Mechanical Turk, causing it to become the most recently added task again. We see a spike of workers after this update (though not quite as big as the initial spike). Interestingly, the rate of arrivals seems fairly constant over the rest of the time. I stopped the experiment after a week (with ~850 turkers), thinking it would take a while to reach 1000, but looking at the graph, we probably only needed to wait another two or three days.

So, this experiment was more about how many turkers we could get, but since we have the data, what number did people choose?

Whoever keeps telling me that people like to choose 7 is right, it would seem. Note that all the odd numbers 3, 5, 7 and 9 are picked more often than their neighbors, though this is not significant except between 4 and 5, and between 7 and its neighbors.
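One simple way to check a claim like “7 beats its neighbors” is a sign test: condition on the workers who picked either of the two adjacent numbers, and ask how surprising the split would be if both numbers were equally popular. This is a sketch of that idea, not necessarily the test used for the figures above (the post doesn’t say which test was run), and the counts in the usage example are illustrative, not the experiment’s actual data.

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of a split at
    least this lopsided under the null hypothesis."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def neighbor_test(count_a, count_b):
    """One-sided sign-test p-value comparing the pick counts of two
    adjacent numbers, restricted to workers who picked either one."""
    hi = max(count_a, count_b)
    return binom_tail(hi, count_a + count_b)
```

For example, `neighbor_test(80, 55)` asks: among 135 workers who chose one of the two numbers, how surprising is an 80/55 split under a fair coin?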

Inspired by previous posts about proofreading research papers (1, 2), I used a proofreading task to explore how iterative tasks on MTurk converge or diverge. There was an interesting effect caused by a small change in instructions.

The HITs used a paragraph drawn from one of our research papers. Each iteration introduced a single random error in it — an inserted character, a deleted character, or a transposition of two adjacent characters. The paragraph was presented to the turker in a textbox, and the turker was asked to proofread it, correct any errors found, and submit it. For example, here is the paragraph with a random error (highlighted in red):

Automatic clustering generally helps separate different kinds of records that need to be edited differently, but it isn't perfect. Sometimes it creates more clusters than needed, because the differences in structure aren't important to the user's particular editing task. For example, if the user only needs to edit near the end of each line, then differences at the start of the line are largely irrelevant, and it isn't necessary to split based on those differences. Conversely, sometimes the clustering isn't fine enough, leaving heterogeneous clusters that must be edited one line at a time. One solution to this problem would be to let the user rearrange the clustering manually, perhaps using drag-and-drop to merge and split clusters. Clustering and selection generalizaxtion would also be improved by recognizing common text structure like URLs, filenames, email addresses, dates, times, etc.

The turker didn’t see the error highlighted like this, but the turker’s web browser may have highlighted it anyway. Firefox, for example, puts a red underline under suspiciously-spelled words in a textbox. So “generalizaxtion” would in fact be underlined above. So, incidentally, would be “filenames,” because Firefox prefers “file names.”

After one turker edited the paragraph (hopefully fixing the introduced error), a new error would be introduced in the submitted version, and another turker would edit the paragraph. The structure was iterative, so if one turker made radical edits to the paragraph, those changes would persist. No voting or other validation process was used to approve the edits, so it would be possible for the paragraph to significantly diverge from the original if a turker thought it would be better written differently. That was the goal of this little exploration — to see what might encourage or discourage this kind of divergence.
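The error-introduction step can be sketched in a few lines. This is a minimal version written for this post, not the actual code used (which isn’t shown); the alphabet for inserted characters is an assumption.

```python
import random

def introduce_error(text, rng=random):
    """Apply one random typo of the three kinds used in the iterations
    above: insert a character, delete a character, or transpose two
    adjacent characters."""
    i = rng.randrange(len(text) - 1)
    kind = rng.choice(["insert", "delete", "transpose"])
    if kind == "insert":
        c = rng.choice("abcdefghijklmnopqrstuvwxyz")
        return text[:i] + c + text[i:]
    if kind == "delete":
        return text[:i] + text[i + 1:]
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]
```

Each iteration then becomes: take the last submitted paragraph, run it through `introduce_error`, and post the result as a new proofreading HIT.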

The first time I tried this task, the instructions were:

Please proofread and make corrections to the text below.

Each HIT paid $0.01, and the paragraph went through five iterations of editing before I terminated the process. The full results for trial 1 show exactly the kind of divergence I was hoping for — turkers not only fixed the introduced error, but made other changes as well. Here is the final version of the paragraph (with all changes from the original paragraph highlighted in yellow):

Automatic clustering generally helps separate different kinds of records that need to be edited differently, but it isn't perfect.

Sometimes it creates more clusters than needed because the differences in structure aren't important to the user's particular editing task.

For example, if the user only needs to edit near the end of each line, then differences at the start of the line are largely irrelevant, and it isn't necessary to split based on those differences.

Conversely, sometimes the clustering isn't fine enough which leaves heterogeneous clusters that must be edited one line at a time.

One solution to this problem would be to let the user rearrange the clustering manually, perhaps using drag-and-drop to merge and split clusters.

Clustering and selection generalization would also be improved by recognizing common text structure such as URLs, file names, e-mail addresses, dates, times, and so on.

In fact, all five turkers made at least two edits, even though there was only one glaring typo in each iteration. (One might argue that “filenames” was also a glaring typo, because Firefox’s spellchecker pointed it out.) One turker changed “email” to “e-mail”; another preferred “such as” instead of “like”; another changed “etc.” to “and so on.” The most radical change was made by a turker who split the text into one-sentence paragraphs — possibly a newspaper copy editor?

The second trial of this task started with the same paragraph, but slightly different instructions:

Please proofread and correct the text below.

and a larger payout: $0.04 per HIT, instead of just a cent. This process ran for 10 iterations, but no divergence occurred. The full results for trial 2 end with this final version:

Automatic clustering generally helps separate different kinds of records that need to be edited differently, but it isn't perfect. Sometimes it creates more clusters than needed, because the differences in structure aren't important to the user's particular editing task. For example, if the user only needs to edit near the end of each line, then differences at the start of the line are largely irrelevant, and it isn't necessary to split based on those differences. Conversely, sometimes the clustering isn't fine enough, leaving heterogeneous clusters that must be edited one line at a time. One solution to this problem would be to let the user rearrange the clustering manually, perhaps using drag-and-drop to merge and split clusters. Clustering and selection generalization would also be improved by recognizing common text structure like URLs, file names, email addresses, dates, times, etc.

which differs from the original only in the place where Firefox’s spellchecker suggested a typo, “filenames.” The filenames->file names edit was made in the first iteration of trial 2, just as it was in trial 1, which strongly suggests that Firefox is to blame.

All 15 turkers who worked on this task in trials 1 and 2 were different people. TurKit allowed me to enforce that, even though the iterations were posted as separate HITs.

Discussion

It’s interesting to speculate why divergence occurred in trial 1 but not in trial 2. Note that trial 1 involved only half as many iterations as trial 2, but it diverged much more, and divergent editing happened on every iteration of trial 1, and no iterations of trial 2. Something must be up.

In fact, trial 1 turkers actually did more work (more edits each) for less money (only one cent instead of four cents). The consequence was that a lot of their edits were unnecessary and unhelpful, at least in the opinion of one author of the paragraph (me).

My guess is that the wording of the trial 1 instructions (“…make corrections…”) biased them to do more than one edit, lest their work not be accepted. So trial 1 turkers were actually hunting for something to change.

Trial 2 turkers, on the other hand, merely had to correct the text. So it was sufficient to make the obvious corrections that Firefox suggested, and not introduce arbitrary changes for the sake of earning their pay.

One trial 2 turker actually made no changes at all, leaving the introduced error unfixed. The next turker fixed both errors, however, so trial 2 successfully converged, even in the absence of voting or other verification.

One idea that this experiment suggests is that turkers will work hard to find something to do in a proofreading task, even if there isn’t anything useful for them to do. So a proofreading application that uses MTurk may be more effective if it intentionally introduces at least one error in each work piece — not only to catch lazy turkers, but also to reduce the risk of divergence due to unnecessary changes by honest turkers who are just trying to prove they’re really working.

In part 2, we had turkers vote on the best topic ideas. The top two ideas were:

How effective is recycling in saving the environment?

How important is a college education in todays job market?

I originally intended for the essay to be persuasive, so I wanted the topics to be statements rather than questions. Here is a statement version of each question:

Recycling is effective in saving the environment.

A college education is important in today’s job market.

We took all four of these topics — both the question and statement versions of the top two topics — and asked turkers to think of points in favor of them, or possibly against them if the topic was a question. We used the brainstorming algorithm from part 1, and paid each turker 1 cent to think of 3 new points. Each turker saw all the ideas suggested so far for the given topic.

Results:

Here are the points supplied from the first turker for each topic:

A college education is important in today’s job market.

many of the highest paying professions like doctor require a degree

employers see a degree as proof of education because high schools don’t teach well

employers seek specialized knowledge that high school does not teach, so a college degree is required

…

Recycling is effective at saving the environment.

Recycling saves energy it would take to make a new product from scratch.

Recycling cuts down on waste in the trash dumps, freeing up more land to be used more productively.

Recycling makes one stop and think about the environment each time you look for a bin – it makes you more conscientious and could encourage you to think of other ways to save the environment.

…

How effective is recycling in saving the environment?

False if the process of recycling involves using more chemicals to make the item reusable.

False – if “recycling fervor” doesn’t encourage people to reduce their consumerism.

If the first turker uses a clear style for their points, subsequent turkers tend to follow it. One turker began all their points with “Recycling X”, and this pattern was followed by all subsequent turkers. The person who began all their points with “False if X” or “True if X” influenced most subsequent contributors. I’m not sure yet whether this has any effect on quality, pro or con. I suspect it has a positive influence, since it gives people a framework within which to express their ideas (one less thing to think about when coming up with an idea). This may be something to test.

A couple of forms of cheating: at least two users copied and pasted blurbs of text from the internet. Another user copied and pasted points directly from the list of previous points. Both seem easy to catch (if the text is abnormally long, flag it as likely copied from the internet, and look for inputs identical to existing inputs), but I’m not sure it is worth worrying about. Both forms of cheating will probably get weeded out anyway in the next step, when we decide which points to keep in our essay; we’ll see.
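Both heuristics fit in a few lines of code. A minimal sketch follows; the 300-character threshold is an arbitrary assumption, and the normalization step means exact-duplicate detection also catches copies that differ only in whitespace or case.

```python
def flag_suspicious(points, seen, max_len=300):
    """Split submitted points into (kept, flagged) using the two
    heuristics above: flag a point that is abnormally long (likely
    pasted from the web) or identical to a previously seen point.
    `seen` is a set of normalized points, updated in place."""
    kept, flagged = [], []
    for p in points:
        norm = " ".join(p.split()).lower()
        if len(p) > max_len or norm in seen:
            flagged.append(p)
        else:
            kept.append(p)
            seen.add(norm)
    return kept, flagged
```

In the iterative setup, `seen` would carry over between HITs, so a turker who pastes from the visible list of previous points is caught immediately.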

Next Step:

The next step is to find the best supporting points for each topic (or for whichever one we decide to concentrate on). Part of this task may involve clustering the points into similar ideas, so that our top points aren’t all essentially the same.

In part 1, turkers brainstormed some essay topic ideas. Now we want to choose the best topic. We’ve had turkers rate and compare items in the past. For this experiment, we tried something a little different: each turker saw all 18 essay topics, and selected the best 3.

We solicited 100 responses for 2 cents each. Due to negligence on my part, the HIT only worked in Google Chrome, so we only got 50 responses over the course of a week.

Results:

Here we show the essay topics sorted by the number of votes each received:

28 – How effective is recycling in saving the environment?

25 – How important is a college education in todays job market?

18 – Smoking should not be allowed in restaurants and bars.

13 – Should the drinking age be lowered?

12 – Has the invention of the internet and the expansion of new-age media been good or bad news for the music industry?

10 – How has the number of hours a kid is on his computer or watching tv affected his grades in school?

6 – What ar the pros and cons of the bailouts our government has given?

6 – Should illegal immigrants be deserving of health benefits?

6 – Do you think sports stars get special treatment from the judicial system?

6 – Would it be good for the United States to put a cap on the number of children you can have while on Federal Welfare?

5 – What country has the most gender equality?

5 – Should Physician-Assisted-Suicide, or euthanasia be legalized?

3 – why do so many older cats die of cancer?

3 – What new direction will hip hop music take?

2 – Should instant-replay be used for more than home runs in baseball?

2 – Is a salary cap necessary to restore competitive parity in Major League Baseball?

0 – Is Hugo Chavez and his anti-United States of America sentiment good for Latin America?

Each user saw a randomized ordering of the essay topics. One might worry that users are likely to choose whichever essay topic appears at the top. This chart shows a histogram of votes given to essay topics appearing at different places in the list:
The chart is more-or-less uniform, which is good.
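Since each worker saw a randomized ordering, checking for position bias amounts to counting votes by display position rather than by topic. A minimal sketch (the data structures are my assumptions; the post’s actual analysis code isn’t shown):

```python
def position_histogram(ballots):
    """ballots: list of (ordering, chosen) pairs, where ordering is the
    randomized list of topics one worker saw and chosen is the set of
    topics they picked. Returns votes per display position; a roughly
    uniform result means position had little effect on voting."""
    counts = {}
    for ordering, chosen in ballots:
        for pos, topic in enumerate(ordering):
            if topic in chosen:
                counts[pos] = counts.get(pos, 0) + 1
    return counts
```

A flat histogram from this function is exactly the reassuring pattern the chart above shows.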

Next step:

I’m considering going with the turkers’ second choice of essay topic, since I think people will have more first-hand experience with the role of education in the job market. I fear that the recycling topic would be too difficult.

The next step will be to have turkers brainstorm topics for the 3 supporting paragraphs of our 5 paragraph essay.

We have run a number of experiments that involve asking turkers to rate items on a scale of 1 to 10. This post explores data obtained from three prior experiments, including ratings of image descriptions, company name ideas, and photos.

Here is a histogram of all 7350 ratings obtained, from 314 different turkers:

The distribution looks somewhat like a bell curve with a peak at 8.

Here is a similar histogram for each type of item:

descriptions (N = 3600), company names (N = 3450), photos (N = 300)

These look fairly similar, which is interesting, since these ratings are being supplied for different types of items. Of course, about half of these ratings (49%) are made by people who supplied ratings to more than one experiment, so that may account for some of the similarity in rating distributions.

A prior post noted that about 20% of turkers do 80% of the work. In this case, 20% of turkers did 63% of the work. Here are histograms of the ratings supplied by the most prolific 20% of turkers (64 in all), and the remaining 80% of turkers (250 in all).

most prolific 20% (N = 4639), remaining 80% (N = 2711)

Again, both rating distributions have a peak at 8, but the variance of ratings supplied by the most prolific turkers appears smaller. Of course, 16% of these ratings come from the top 4 contributors, who have the following distributions:

1st turker (N = 313), 2nd turker (N = 175), 3rd turker (N = 142), 4th turker (N = 125)

This suggests that individual raters may have quite distinctive rating styles, though there does appear to be a preference among these “power turkers” for a small variance in their ratings; the 3rd turker’s variance is so low that I suspect cheating. The notable exception is turker 4, who almost looks like they are drawing their ratings from two separate distributions centered at 4 and 7.
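A simple way to flag raters like the 3rd turker is to summarize each prolific worker’s rating style as a mean and standard deviation: a near-zero spread over hundreds of ratings is a red flag. A minimal sketch (the 30-rating cutoff is an assumption, not a value from the post):

```python
from statistics import mean, pstdev

def rating_profiles(ratings_by_worker, min_ratings=30):
    """Map each sufficiently prolific worker to (mean, stdev) of their
    ratings. Workers with stdev near zero may be clicking the same
    value without reading the items."""
    return {
        w: (mean(rs), pstdev(rs))
        for w, rs in ratings_by_worker.items()
        if len(rs) >= min_ratings
    }
```

A bimodal rater like turker 4 would need a richer summary (e.g., a full histogram per worker), but mean and spread already separate the suspicious flat-liners from ordinary raters.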

I thought that all turkers might start with similar rating distributions, and then reduce their variance over time as they become more familiar/bored with the task. However, I haven’t found much evidence for this. Here are the ratings supplied by the 1st and 4th turkers over time:

1st turker

4th turker

Note that the rating styles appear fairly consistent from beginning to end. If anything, the 4th turker actually increases in variance over time.

Discussion

It would be nice if we could model how turkers rate and compare items. Then we could make predictions about which algorithms were the most efficient at, say, sorting a bunch of items on MTurk.

We’ll need to do more work to achieve this goal, and we’ll probably need to collect more data with specific hypotheses in mind. At this point, it may be worth spending more time looking at the current data and trying to generate some hypotheses.

The Turker proofreaders were helpful in places, e.g.:
line 28 – spelling mistake – should be “transcribing” not transcribe
line 33 – grammatical error – word “trivially” is not needed
line 38 – spelling mistake – “behvioral” should be “behavioral”
line 38 – spelling mistake – “idiosyncracies” should be “idiosyncrasies”
line 41 – spelling mistake – “posses” should be “possess”

But there were also some annoying “corrections” from a Brit with too much time on their hands:
Line 46 – spelling mistake – should be “labour” not “labor”
Line 47 – spelling mistake – should be “labour” not “labor”
Line 48 – spelling mistake – should be “labour” not “labor”
Line 49 – spelling mistake – should be “labour” not “labor”
Line 52 – spelling mistake – should be “labour” not “labor”
Line 52 – spelling mistake – should be “labour” not “labor”
Line 56 – spelling mistake – should be “labour” not “labor”
Line 57 – spelling mistake – should be “behaviour” not “behavior”
Line 63 – spelling mistake – should be “labour” not “labor”
Line 66 – spelling mistake – should be “labour” not “labor”
Line 66 – spelling mistake – should be “labour” not “labor”
Line 68 – spelling mistake – should be “labour” not “labor”
Line 70 – spelling mistake – should be “labour” not “labor”
Line 70 – spelling mistake – should be “labour” not “labor”

Taken together, I spent less than $2 and fixed some pretty embarrassing typos. I’ll probably try this again for real when I’m a little farther along with the paper.