Utility data annotation with Amazon Mechanical Turk


Alexander Sorokin, David Forsyth
University of Illinois at Urbana-Champaign
N. Goodwin, Urbana, IL

Abstract

We show how to outsource data annotation to Amazon Mechanical Turk. Doing so has produced annotations in quite large numbers relatively cheaply. The quality is good, and can be checked and controlled. Annotations are produced quickly. We describe results for several different annotation problems. We describe some strategies for determining when the task is well specified and properly priced.

1. Introduction

Big annotated image datasets now play an important role in Computer Vision research. Many of them were built in-house ([ ] and many others). This consumes significant amounts of highly skilled labor, requires much management work, is expensive and creates a perception that annotation is difficult. Another successful strategy is to make the annotation process completely public ([ ]) and even entertaining ([ ]), at the cost of diminished control over what annotations are produced and the centralization necessary to achieve a high volume of participation. Finally, dedicated annotation services ([ ]) can produce high-volume, quality annotations, but at a high price. We show that image annotation work can be efficiently outsourced to an online worker community (currently Amazon Mechanical Turk [ ]) (sec. 2). The resulting annotations are good (sec. 2.5), cheap (sec. 2.4) and can be aimed at specific research issues.

2. How to do it

Each annotation task is converted into a Human Intelligence Task (HIT). The tasks are submitted to Amazon Mechanical Turk (MT). Online workers choose to work on the submitted tasks. Every worker opens our web page with a HIT and does what we ask them to do. They submit the result to Amazon. We then fetch all results from Amazon MT and convert them into annotations. The core tasks for a researcher are: (1) define an annotation protocol and (2) determine what data needs to be annotated. The annotation protocol should be implemented within an IFRAME of a web browser. We call the implementation of a protocol an annotation module. The most common implementation choices will be an HTML/JS interface or a Java or Flash applet. An annotation module must be developed for every radically new annotation protocol. We have already built different annotation modules (in Flash) for labeling images of people. As the design process is quite straightforward, we aim to accommodate requests to build annotation modules for various research projects. Our architecture requires very few resources administered by the researcher (bash, python, Matlab and a web server or Amazon S3).

Table . Collected data (columns: Exp, Task, #img, #labels, cost, time, effective US pay/hr). In our five experiments we have collected labels for distinct images for only US $ . In experiments  and  the throughput exceeds  annotations per hour even at a low ($ /hour) hourly rate. We expect a further increase in throughput as we increase the pay to the effective market rate. The label count includes around % of poor annotations.

2.1. Quality assurance

There are three distinct aspects of quality assurance: (a) ensuring that the workers understand the requested task and try to perform it well; (b) cleaning up occasional errors; (c) detecting and preventing cheating in the system. We discuss three viable strategies for QA: multiple annotations, grading

and gold standard evaluation (with immediate feedback). The basic strategy is to collect multiple annotations for every image. This accounts for the natural variability of human performance, reduces the influence of occasional errors and allows us to catch malicious users. However, it increases the cost of annotation. The second strategy is to perform a separate grading task. A worker looks at several annotated images and scores every annotation. We get explicit quality assessments at a fraction of the cost, because grading is easy. The third strategy is to build a gold standard: a collection of images with trusted annotations. Images from the gold standard are injected into the annotation process. The worker doesn't know whether an image comes from the new data or from the gold standard. If the annotations provided by the worker deviate significantly from the gold standard, we suspect that the worker is not doing what we asked for. We reveal the gold standard annotation to the worker after they submit their own annotation. This immediate feedback clarifies what we expect and encourages workers to follow the protocol. This strategy is again cheap, as only a fraction of the images comes from the gold standard. It is most important to ensure that contributors with high impact understand the task and follow the requested protocol. As can be seen in fig. , the bulk of the annotation is produced by a few contributors. In our experiments we collected multiple annotations to study consistency. In only one experiment did we have a significant contributor providing poor annotations (fig. , see the low times among the first contributors; see also fig. , yellow curve).

2.2. Annotation protocols

We implemented four annotation protocols (fig. ): two coarse object segmentation protocols, polygonal labeling and 14-point human landmark labeling. Object segmentation protocols show the worker an image and a small image of the query (person).
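The gold-standard strategy described above lends itself to a simple automatic check. Below is a minimal sketch for the landmark-labeling case: flag any worker whose submissions on injected gold images deviate too far from the trusted annotations. The function names, data layout and pixel threshold are our own illustrative assumptions, not code from the paper.

```python
import math

# Sketch of the gold-standard QA strategy: compare worker clicks on
# injected gold-standard images against trusted annotations.
# All names and the threshold are illustrative assumptions.

def mean_landmark_distance(a, b):
    """Average Euclidean distance between corresponding landmark
    clicks of two annotations (lists of (x, y) points)."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def flag_suspicious_workers(submissions, gold, max_px=30.0):
    """submissions: {worker_id: {image_id: [(x, y), ...]}}
    gold: {image_id: trusted [(x, y), ...]}
    Flag workers whose mean deviation on gold images exceeds
    max_px pixels."""
    flagged = []
    for worker, by_image in submissions.items():
        d = [mean_landmark_distance(pts, gold[img])
             for img, pts in by_image.items() if img in gold]
        if d and sum(d) / len(d) > max_px:
            flagged.append(worker)
    return flagged
```

In practice the threshold would have to be tuned per protocol, since an acceptable deviation for a boundary trace differs from that for a landmark click.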
We ask the worker to click on every circle (site) overlapping with the query (person). Protocol one places sites on a regular grid, whereas protocol two places sites at the centers of superpixels (computed with [ ]). The third protocol, polygonal labeling, is very similar to the one adopted in LabelMe [ ]. We ask the worker to trace the boundary of the person in the image. The fourth protocol labels the landmarks of the human body used for pose annotation in [ ]. We ask the worker to click on the locations of 14 points in the specified order: right ankle, right knee, right hip, left hip, left knee, left ankle, right wrist, right elbow, right shoulder, left shoulder, left elbow, left wrist, neck and head. The worker is always reminded what the next landmark is.

2.3. Annotation results

So far we have run five annotation experiments using data collected from YouTube, the dataset of people from [ ] and a small sample of data from LabelMe [ ], Weizmann [ ] and our own dataset. In all experiments we are interested in people. As shown in table , we have annotations for distinct images collected for a total cost of US $ . This is very cheap, as discussed in section 2.4. We describe the quality of the annotations in section 2.5. We present sample annotation results (figs. ) to show representative annotations and highlight the most prominent failures. We are extremely satisfied with the quality of the annotations, taking into account that workers receive no feedback from us. We are currently implementing the QA strategies described above to provide feedback to workers, so that we can stop using the multiple-duplicate-annotations strategy.

2.4. Pricing

The work throughput is elastic and depends on the price of the task. If the price is too low, workers will participate out of curiosity and for entertainment, but may feel underpaid and will lose motivation. If the price is too high, we could be wasting resources and possibly attracting inefficient workers.
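The 14-landmark protocol above depends on a fixed click order, with the worker always reminded of the next landmark. A minimal sketch of that bookkeeping follows; the function names are ours and hypothetical, not the paper's Flash module.

```python
# The fixed landmark order of the 14-point protocol described above.
LANDMARKS = [
    "right ankle", "right knee", "right hip",
    "left hip", "left knee", "left ankle",
    "right wrist", "right elbow", "right shoulder",
    "left shoulder", "left elbow", "left wrist",
    "neck", "head",
]

def next_landmark(clicks):
    """Given the (x, y) clicks so far, name the landmark to prompt
    for next, or None when the annotation is complete."""
    return LANDMARKS[len(clicks)] if len(clicks) < len(LANDMARKS) else None

def is_complete(clicks):
    """A finished annotation is exactly one click per landmark."""
    return len(clicks) == len(LANDMARKS)
```

Keeping the order fixed is what makes the pairwise landmark-distance scoring later in the paper meaningful, since corresponding clicks can be compared position by position.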
As table  shows, the hourly pay in experiments  and  was roughly $ /hour. In these experiments we had a comments field, and some comments suggested that the pay should be increased. From this we conclude that the perceived fair pricing is about US $ /hour. The fact that our experiments finished completely shows the elasticity of the workforce. We note that even at US $ /hour we had a high throughput of annotations per hour.

2.5. Annotation quality

To understand the quality of annotations we use three simple consistency scores for a pair of annotations (a1 and a2) of the same type. For protocols 1, 2 and 3 we divide the area where the annotations disagree by the area marked by either of the two annotations. We can think of this as XOR(a1,a2)/OR(a1,a2). For protocols 1 and 2, XOR counts the sites with different annotations and OR counts the sites marked by either of the two annotations a1 and a2. For protocol 3, XOR is the area of the symmetric difference and OR is the area of the union. For protocol 4 we measure the average distance between the selected landmark locations. Ideally, the locations coincide and the score is 0. We then select the two best annotations for every image by simply taking the pair with the lowest score, i.e. we take the most consistent pair of annotations. For protocol 3 we

further assume that the polygon with more vertices is the better annotation, and we put it first in the pair. The distribution of scores and a detailed analysis appear in figs. . We show all scores ordered from the best (lowest) on the left to the worst (highest) on the right. We select percentiles of quality at a fixed step and show the respective annotations. Looking at the images, we see that the workers mostly try to accomplish the task. Some of the errors come from sloppy annotations (especially in the heavily underpaid polygonal labeling experiment). Most of the disagreements come from difficult cases, where the question we ask is difficult to answer. Consider the leftmost circle in fig. : one annotator decided to mark the bat, while the other decided not to. This is not the fault of the annotators, but rather a sign that we should give better instructions. The situation is even more difficult in the landmark experiment, where we ask workers to label landmarks that are not immediately visible. In fig.  we show the consistency of the annotations of each landmark between the th and the th percentile of fig. . It is obvious from this figure that hips are much more difficult to localize than shoulders, knees, elbows, wrists, ankles, the head and the neck.

Figure . Example results obtained from the annotation experiments, one row per protocol. The first column shows the implementation of the protocol, the second column shows obtained results, and the third column shows some poor annotations we observed. The user interfaces are similar, simple and easy to implement. The total cost of annotating the images shown in this figure was US $ .

3. Related work

A crisp understanding of the purpose of annotated data is crucial. When it is clear what annotations should be made, quite large annotated datasets appear [ ]. Such datasets last a long time and allow significant advances in methods and theories. For object recognition there isn't really a consensus on what should be annotated and what annotations are required, so we have a large number of competing datasets. To build large-scale datasets, researchers have made people label images for free. LabelMe [ ] is a public online image annotation tool. LabelMe has over  images and video frames with at least one object labeled [ ]. The current web site counter displays  labeled objects. The annotation process is simple and intuitive; users can browse existing annotations to get an idea of what kind of annotations are required. The dataset is freely available for download and comes with a handy Matlab toolbox to browse and search the dataset. The dataset is semi-centralized: MIT maintains a publicly-accessible repository, accepts images to be added to the dataset and distributes the source code to allow interested parties to set up a similar repository. To our knowledge this is the most open project. On the other hand, LabelMe has no explicit annotation tasks and annotation batches.
The progress can only be measured in the number of images annotated. In contrast, we aim at annotating project-specific data in well-defined batches. We also minimized the need for maintenance of a centralized database. An annotation project can run with only a researcher's laptop and computing utility services easily accessible online. The ESP game [ ] and Peekaboom [ ] are interactive games that collect image annotations by entertaining people. The players cooperate by providing textual and location information that is likely to describe the content of the image to the partner. The games are a great success. They are known to have produced over  million [ ] and  million [ ] annotations respectively. The Peekaboom project recently released a collection of images annotated through gameplay. The game-based approach has two inconveniences. The first is centralization. To achieve proper scale, it is necessary to have a well-attended game service that features the game. This constrains publishing a new game to obtain project-specific annotations. The second is the game itself. To achieve reasonable scale, one has to design a game that is entertaining, or else nobody will play it. This requires creativity and experimentation to create an appropriate annotation interface. In contrast, our model serves as a drop-in, minimum-effort, utility annotation. Building in-house datasets was another common strategy. The most prominent examples include the Berkeley segmentation dataset [ ], Caltech 101/256 [ ], the Pascal VOC datasets [ ], the UIUC car dataset [ ], the MIT [ ] and INRIA [ ] pedestrian datasets, the Yale face dataset [ ], FERET [ ], CMU PIE [ ] and (Labeled [ ]) Faces in the Wild [ ]. Every dataset above is a focused data collection targeted at a specific research problem: segmentation, car detection, pedestrian detection, face detection and recognition, object category recognition. The datasets are relatively small compared to those produced by large-scale annotation projects.
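Returning to the annotation-quality procedure described earlier (score a pair of annotations by their XOR/OR disagreement, then keep the most consistent pair per image), a minimal sketch for the site-clicking protocols follows. The names are illustrative; annotations are represented simply as sets of marked site indices.

```python
from itertools import combinations

# Sketch of the XOR/OR consistency score and the most-consistent-pair
# selection for the site-clicking protocols. Names are illustrative.

def xor_or_score(a, b):
    """Disagreement between two site annotations (sets of marked
    sites): sites marked by exactly one annotator, divided by sites
    marked by at least one. 0 means perfect agreement."""
    if not a and not b:
        return 0.0
    return len(a ^ b) / len(a | b)

def best_pair(annotations):
    """Return the most consistent pair among all annotations of one
    image, i.e. the pair with the lowest XOR/OR score."""
    return min(combinations(annotations, 2),
               key=lambda p: xor_or_score(p[0], p[1]))
```

Sorting the best-pair scores of all images then yields exactly the cumulative score distributions plotted in the quality figures.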
Finally, dedicated annotation services can provide quality and scale, but at a high price. ImageParsing.com has built one of the world's largest annotated datasets [ ]. With over  images, video frames and annotated physical objects [ ], this is a truly invaluable resource for vision scientists. At the same time, the cost of entry is steep: obtaining standard data would require an investment of at least US $ , and custom annotations would require at least US $  [ ]. In contrast, our model will produce images with custom annotations for under US $ . ImageParsing.com provides high-quality annotations and has a large number of images available for free. It is important to note that [ ] presents probably the most rigorous and the most varied definition of the image labeling task. Their definitions might not fit every single research project, but we argue that this degree of rigor should be embraced and adopted by all researchers.

4. Discussion

We presented a data annotation framework for obtaining project-specific annotations very quickly and on a large scale. It is important to turn the annotation process into a utility, because doing so makes researchers confront the important research questions: what data should be annotated, and what type of annotations should be used? As annotation happens quickly, cheaply and with minimum participation of the researchers, we can allow for multiple runs of annotation to iteratively refine the precise definition of the annotation protocols. Finally, we shall ask: what happens when we get millions of annotated images? We plan to implement more annotation protocols ([ ]; other suggestions are welcome) and the quality assurance strategies we discussed. We will make all the code and data available online.

Figure . Contributions. The first five graphs plot the contribution (number of submitted HITs) and the time spent in minutes (gross = submit time minus load time; net = last click minus first click) against the rank of the worker, for the five experiments (click on sites, click on sites, trace boundary, click on 14 pts, click on 14 pts). The rank is determined by the total contribution of a particular worker: the lower the rank, the higher the contribution. Note that the scales differ from experiment to experiment because of the different complexity of the tasks. The sixth graph plots the percentage of contribution against the percentage of top workers. It is really astonishing how closely the curves follow each other. These graphs give insight into the job distribution among the workers: (1) single top contributors produce very significant amounts, spending hours on the task; (2) top contributors are very effective in performing the tasks; and (3) the top % of annotators produce % of the data.

Figure . Temporal structure of annotations. We show a scatterplot of all submitted annotations for each experiment. The horizontal axis is the time in minutes at which we received the annotation; the vertical axis is the rank of the worker who produced it. The bottom lines have many dots, as they show when the most significant contributors participated in the annotation process. Note the different scales of the scatterplots. The horizontal scale reflects the total time of the annotation, while the vertical scale reflects the total number of people who participated. The plots show how interesting the tasks are to the workers. In experiments  and  the workers start early and participate until the available tasks are exhausted: the dots all end at the same time, when no more tasks are left. In experiments  and  it takes much longer for significant annotators to arrive. This is a direct consequence of the task pricing (sec. 2.4): experiments  and  pay % less than experiments  and , while experiment  pays % less.

References

[ ] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. PAMI, November.

Figure . Quality details for the two site-clicking experiments (click on the sites overlapping with the person; click on the sites (superpixels) overlapping with the person). The score is #disagree / (#disagree + #agree): the lower, the better. For every image the best-fitting pair of annotations is selected, and the score of the best pair is shown in the figure. We count the number of sites where the two annotators disagree and divide by all sites labeled by at least one of the two annotators. The scores are ordered low (best) to high (worst); this is effectively a cumulative distribution function of the annotation scores. For clarity we render annotations at fixed percentiles of the score. Blue and red dots show the annotations provided by each annotator; yellow circles show the disagreements. Not surprisingly, superpixels make annotations more consistent than a regular grid.

Figure . Quality details for the boundary-tracing and landmark experiments. For the boundary-tracing experiment ("trace the boundary of the person") we score a pair of annotations by the area of their symmetric difference (XOR) divided by the area of their union (OR); for the landmark experiment ("click on landmarks") we compute the mean error in pixels between the marked points. In both cases, the lower the score, the better. For every image the best-fitting pair of annotations is selected, and the score of the best pair is shown in the figure. The scores are ordered low (best) to high (worst). For clarity we render annotations at fixed percentiles of the score. The blue curve and dots show one annotation of the pair; the yellow curve and dots show the other. For the boundary-tracing experiment we additionally assume that the polygon with more vertices is the better annotation, so the blue annotation always has more vertices.

[ ] Amazon Mechanical Turk.
[ ] K. Barnard, Q. Fan, R. Swaminathan, A. Hoogs, R. Collins, P. Rondot, and J. Kaufhold. Evaluation of localized semantics: Data, methodology, and experiments. IJCV.
[ ] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. PAMI. Special Issue on Face Recognition.
[ ] T. L. Berg, A. C. Berg, J. Edwards, and D. Forsyth. Who's in the picture? In Proc. NIPS.
[ ] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV.
[ ] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR.
