This experiment is larger, using 30 images instead of 10 (all taken from publicdomainpictures.net). We also made a number of changes to the process. First, we tried to make the instructions for HITs in each process as similar as possible. For instance, the title for HITs in the previous iterative process was “Improve Image Description”, while the title for HITs in the previous parallel process was “Describe Image”. In this experiment, the title in both cases is “Describe Image Factually”. The title points to another change: we added the instruction “Please describe the image factually”. This was intended to discourage turkers from thinking they needed to advertise these images, and to make the descriptive styles more consistent. Here is an example HIT:

This image shows a HIT in the iterative process. It contains the instruction “You may use the provided text as a starting point, or delete it and start over.” This instruction deliberately avoids suggesting that the turker only needs to improve the existing description. We wanted each process to be as similar as possible, and it didn’t seem fair for turkers in one condition to think they only needed to make a small improvement while turkers in the other condition thought they needed to write an entire final draft. Note that the mere presence of text in the box may alert turkers to the possibility that other turkers will see their work and be asked to use it as a starting point, but we did not test this hypothesis.

This instruction is omitted from the parallel HITs. It is the only difference between the two, except of course that all of the parallel HITs start with a blank textarea, whereas all of the iterative HITs except the first show prior work.

Finally, in order to compare the output of the two processes, we needed a way to select a single description from the parallel process as its output. We do this by voting between descriptions and keeping the best one, in exactly the same way as the iterative process. (One difference: the iterative process highlights differences between the descriptions being voted on, whereas the parallel process does not. Since descriptions in the parallel process are not based on each other, they are likely to be completely different, making the highlighting a distraction.)
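The selection step above amounts to a simple majority vote between a current best description and a challenger. Here is a minimal sketch; the function name, the boolean vote encoding, and the tie-breaking rule (ties keep the current best) are all my assumptions, not details from the experiment:

```python
def select_description(current_best, challenger, votes):
    """Keep whichever description wins a majority of votes.

    votes: list of booleans, True = this rater prefers the challenger.
    Ties keep the current best (a hypothetical tie-breaking rule;
    the post does not specify how ties were actually handled).
    """
    prefer_challenger = sum(votes)
    prefer_current = len(votes) - prefer_challenger
    return challenger if prefer_challenger > prefer_current else current_best
```

For example, `select_description("draft A", "draft B", [True, True, False])` returns `"draft B"`, while a 1-1 split keeps `"draft A"`.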

This graph shows the average rating of descriptions generated in each iteration of the iterative processes (blue), along with the average rating of all descriptions generated in the parallel processes (red). Error bars show standard error.
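The standard error in the error bars is the sample standard deviation of the ratings divided by the square root of the number of ratings. A minimal sketch using made-up ratings (not the experiment's data):

```python
import math
import statistics

def standard_error(ratings):
    # sample standard deviation / sqrt(n)
    return statistics.stdev(ratings) / math.sqrt(len(ratings))

# hypothetical ratings on a 1-10 scale
print(standard_error([6, 8, 7, 9, 10]))  # ~0.707
```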

Discussion:

The final description from each iterative process averaged a rating of 7.86, which is statistically significantly higher than the 7.4 average rating for the winning description from each parallel process (paired t-test, t(29) = 2.12, p = 0.043). We can also see from the graph that ratings appear to improve with each iteration.
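The paired t-test above compares the two ratings for each of the 30 images: the statistic is the mean of the per-image differences divided by the standard error of those differences, with n - 1 degrees of freedom. A minimal sketch on made-up ratings (not the actual data from the experiment):

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic: mean of per-pair differences divided by
    the standard error of those differences; df = n - 1."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1

# hypothetical ratings for five images under each process
iterative = [8, 7, 9, 8, 7]
parallel = [7, 6, 7, 7, 6]
t, df = paired_t(iterative, parallel)
print(t, df)  # t = 6.0, df = 4
```

In practice one would get the p-value from a t distribution with df degrees of freedom (e.g. via `scipy.stats.ttest_rel`); the sketch stops at the statistic to stay dependency-free.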

This suggests there may be a positive correlation between the quality of the prior work shown to a turker and the quality of their resulting description. Of course, there is a confounding factor: turkers who invest very little effort do less harm to the average rating when they start from an already good description, since any small change to it will probably still be good, whereas a very curt description written from scratch is likely to be rated much worse. This factor alone could explain an increase in the average rating of all descriptions in the iterative process, but it would not explain an increase in the average rating of the best description from the iterative process. For that, some people must be writing better descriptions than they would have written without seeing prior work.

So why are we seeing a difference now when we didn’t before? We changed a number of factors in this experiment, but my guess is that the most important change was altering the instructions in the iterative process. I think the instructions in the old version encouraged turkers not to try as hard, since they merely needed to improve the description, rather than write a final-quality description. In this experiment, all turkers were asked to write a final description, but some were given something to start with.

I think this same idea explains the results in the previous blog post about brainstorming company names. Turkers in the iterative process of that experiment were required to generate the same number of names as turkers in the parallel process. In the experiment before that, the instructions suggested that turkers in the iterative process just needed to add names to a growing list, and those turkers generated fewer names.
