Conducting behavioral research on Amazon's Mechanical Turk

Winter Mason & Siddharth Suri
© Psychonomic Society, Inc. 2011
Behav Res, DOI 10.3758/s13428-011-0124-6

Abstract  Amazon's Mechanical Turk is an online labor market where requesters post jobs and workers choose which jobs to do for pay. The central purpose of this article is to demonstrate how to use this Web site for conducting behavioral research and to lower the barrier to entry for researchers who could benefit from this platform. We describe general techniques that apply to a variety of types of research and experiments across disciplines. We begin by discussing some of the advantages of doing experiments on Mechanical Turk, such as easy access to a large, stable, and diverse subject pool, the low cost of doing experiments, and faster iteration between developing theory and executing experiments. While other methods of conducting behavioral research may be comparable to or even better than Mechanical Turk on one or more of the axes outlined above, we will show that when taken as a whole Mechanical Turk can be a useful tool for many researchers. We will discuss how the behavior of workers compares with that of experts and laboratory subjects. Then we will illustrate the mechanics of putting a task on Mechanical Turk, including recruiting subjects, executing the task, and reviewing the work that was submitted. We also provide solutions to common problems that a researcher might face when executing their research on this platform, including techniques for conducting synchronous experiments, methods for ensuring high-quality work, how to keep data private, and how to maintain code security.

Keywords  Crowdsourcing · Online research · Mechanical Turk

Introduction

The creation of the Internet and its subsequent widespread adoption has provided behavioral researchers with an additional medium for conducting studies. In fact, researchers from a variety of fields, such as economics (Hossain & Morgan, 2006; Reiley, 1999), sociology (Centola, 2010; Salganik, Dodds, & Watts, 2006), and psychology (Birnbaum, 2000; Nosek, 2007), have used the Internet to conduct behavioral experiments.[1] The advantages and disadvantages of online behavioral research, relative to laboratory-based research, have been explored in depth (see, e.g., Kraut et al., 2004; Reips, 2000). Moreover, many methods for conducting online behavioral research have been developed (e.g., Birnbaum, 2004; Gosling & Johnson, 2010; Reips, 2002; Reips & Birnbaum, 2011). In this article, we describe a tool that has emerged in the last 5 years for conducting online behavioral research: crowdsourcing platforms. The term crowdsourcing has its origin in an article by Howe (2006), who defined it as a job outsourced to an undefined group of people in the form of an open call. The key benefit of these platforms to behavioral researchers is that they provide access to a persistently available, large set of people who are willing to do tasks—including participating in research studies—for relatively low pay. The crowdsourcing site with one of the largest subject pools is Amazon's Mechanical Turk[2] (AMT), so it is the focus of this article.

[1] This is clearly not an exhaustive review of every study done on the Internet in these fields. We aim only to provide some salient examples.
[2] The name "Mechanical Turk" comes from a mechanical chess-playing automaton from the turn of the 18th century, designed to look like a Turkish "sorcerer," which was able to move pieces and beat many opponents. While it was a technological marvel at the time, the real genius lay in a diminutive chess master hidden in the workings of the machine (see http://en.wikipedia.org/wiki/The_Turk). Amazon's Mechanical Turk was designed to hide human workers in an automatic process; hence, the name of the platform.

W. Mason (*) · S. Suri
Yahoo! Research, New York, USA
e-mail: m@winteram.com
S. Suri
e-mail: suri@yahoo-inc.com

Originally, Amazon built Mechanical Turk specifically for human computation tasks. The idea behind its design was to build a platform for humans to do tasks that are very difficult or impossible for computers, such as extracting data from images, audio transcription, and filtering adult content. In its essence, however, what Amazon created was a labor market for microtasks (Huang, Zhang, Parkes, Gajos, & Chen, 2010). Today, Amazon claims hundreds of thousands of workers and roughly ten thousand employers, with AMT serving as the meeting place and market (Ipeirotis, 2010a; Pontin, 2007). For this reason, it also serves as an ideal platform for recruiting and compensating subjects in online experiments. Since Mechanical Turk was initially invented for human computation tasks, which are generally quite different than behavioral experiments, it is not a priori clear how to conduct certain types of behavioral research, such as synchronous experiments, on this platform. One of the goals of this work is to exhibit how to achieve this.

Mechanical Turk has already been used in a small number of online studies, which fall into three broad categories. First, there is a burgeoning literature on how to combine the output of a small number of cheaply paid workers in a way that rivals the quality of work by highly paid, domain-specific experts. For example, the output of multiple workers was combined for a variety of tasks related to natural language processing (Snow, O'Connor, Jurafsky, & Ng, 2008) and audio transcription (Marge, Banerjee, & Rudnicky, 2010) to be used as input to other research, such as machine-learning tasks. Second, there have been at least two studies showing that the behavior of subjects on Mechanical Turk is comparable to the behavior of laboratory subjects (Horton, Rand, & Zeckhauser, in press; Paolacci, Chandler, & Ipeirotis, 2010). Finally, there are a few studies that have used Mechanical Turk for behavioral experiments, including Eriksson and Simpson (2010), who studied gender, culture, and risk preferences; Mason and Watts (2009), who used it to study the effects of pay rate on output quantity and quality; and Suri and Watts (2011), who used it to study social dilemmas over networks. All of these examples suggest that Mechanical Turk is a valid research environment that scientists are using to conduct experiments.

Mechanical Turk is a powerful tool for researchers that has only begun to be tapped, and in this article, we offer insights, instructions, and best practices for using this tool. In contrast to previous work that has demonstrated the validity of research on Mechanical Turk (Buhrmester, Kwang, & Gosling, in press; Paolacci et al., 2010), the purpose of this article is to show how Mechanical Turk can be used for behavioral research and to demonstrate best practices that ensure that researchers quickly get high-quality data from their studies.

There are two classes of researchers who may benefit from this article. First, there are many researchers who are not aware of Mechanical Turk and what is possible to do with it. In this guide, we exhibit the capabilities of Mechanical Turk and several possible use cases, so researchers can decide whether this platform will aid their research agenda. Second, there are researchers who are already interested in Mechanical Turk as a tool for conducting research but may not be aware of the particulars involved with and/or the best practices for conducting research on Mechanical Turk. The relevant information on the Mechanical Turk site can be difficult to find and is directed toward human computation tasks, as opposed to behavioral research, so here we offer a detailed "how-to" guide for conducting research on Mechanical Turk.

Why Mechanical Turk?

There are numerous advantages to online experimentation, many of which have been detailed in prior work (Reips, 2000, 2002). Naturally, Mechanical Turk shares many of these advantages, but also has some additional benefits. We highlight three unique benefits of using Mechanical Turk as a platform for running online experiments: (1) subject pool access, (2) subject pool diversity, and (3) low cost. We then discuss one of the key advantages of online experimentation that Mechanical Turk shares: faster iteration between theory development and experimentation.

Subject pool access  Like other online recruitment methods, Mechanical Turk offers access to subjects for researchers who would not otherwise have access, such as researchers at smaller colleges and universities with limited subject pools (Smith & Leigh, 1997) or nonacademic researchers, with whom recruitment is generally limited to ads posted online (e.g., study lists, e-mail lists, social media, etc.) and flyers posted in public areas. While some research necessarily requires subjects to actually come into the lab, there are many kinds of research that can be done online.

Mechanical Turk offers the unique benefit of having an existing pool of potential subjects that remains relatively stable over time. For instance, many academic researchers experience the drought/flood cycle of undergraduate subject pools, with the supply of subjects exceeding demand at the beginning and end of a semester and then dropping to almost nothing at all other times. In addition, standard methods of online experimentation, such as building a Web site containing an experiment, often have "cold-start" problems, where it takes time to recruit a panel of reliable subjects. Aside from some daily and weekly seasonalities, the subject availability on Mechanical Turk is fairly stable
(Ipeirotis, 2010a), with fluctuations in supply largely due to variability in the number of jobs available in the market. The single most important feature that Mechanical Turk provides is access to a large, stable pool of people willing to participate in experiments for relatively low pay.

Subject pool diversity  Another advantage of Mechanical Turk is that the workers tend to be from a very diverse background, spanning a wide range of age, ethnicity, socioeconomic status, language, and country of origin. As with most subject pools, the population of workers on AMT is not representative of any one country or region. However, the diversity on Mechanical Turk facilitates cross-cultural and international research (Eriksson & Simpson, 2010) at a very low cost and can broaden the validity of studies beyond the undergraduate population. We give detailed demographics of the subject pool in the Workers section.

Low cost and built-in payment mechanism  One distinct advantage of Mechanical Turk is the low cost at which studies can be conducted, which clearly compares favorably with paid laboratory subjects and comparably to other online recruitment methods. For example, Paolacci et al. (2010) replicated classic studies from the judgment and decision-making literature at a cost of approximately $1.71 per hour per subject and obtained results that paralleled the same studies conducted with undergraduates in a laboratory setting. Göritz, Wolff, and Goldstein (2008) showed that the hassle of using a third-party payment mechanism, such as PayPal, can lower initial response rates in online experiments. Mechanical Turk skirts this issue by offering a built-in mechanism to pay workers (both flat rate and bonuses) that greatly reduces the difficulties of compensating individuals for their participation in studies.

Faster theory/experiment cycle  One implicit goal in research is to maximize the efficiency with which one can go from generating hypotheses to testing them, analyzing the results, and updating the theory. Ideally, the limiting factor in this process is the time it takes to do careful science, but all too often, research is delayed because of the time it takes to recruit subjects and recover from errors in the methodology. With access to a large pool of subjects online, recruitment is vastly simplified. Moreover, experiments can be built and put on Mechanical Turk easily and rapidly, which further reduces the time to iterate the cycle of theory development and experimental execution.

Finally, we note that other methods of conducting behavioral research may be comparable to or even better than Mechanical Turk on one or more of the axes outlined above, but taken as a whole, it is clear that Mechanical Turk can be a useful tool for many researchers.

Validity of worker behavior

Given the novel nature of Mechanical Turk, most of the initial studies focused on evaluating whether it could effectively be used as a means of collecting valid data. At first, these studies focused on whether workers on Mechanical Turk could be used as substitutes for domain-specific experts. For instance, Snow et al. (2008) showed that for a variety of natural language processing tasks, such as affect recognition and word similarity, combining the output of just a few workers can equal the accuracy of expert labelers. Similarly, Marge et al. (2010) compared workers' audio transcriptions with domain experts and found that after a small bias correction, the combined outputs of the workers were of a quality comparable to that of the experts. Urbano, Morato, Marrero, and Martin (2010) crowdsourced similarity judgments on pieces of music for the purposes of music information retrieval. Using their techniques, they obtained a partially ordered list of similarity judgments at a far cheaper cost than hiring experts, while maintaining high agreement between the workers and the experts. Alonso and Mizzaro (2009) conducted a study in which workers were asked to rate the relevance of pairs of documents and topics and compared this with a gold standard given by experts. The output of the Turkers was similar in quality to that of the experts.

Of greater interest to behavioral researchers is whether the results of studies conducted on Mechanical Turk are comparable to results obtained in other online domains, as well as offline settings. To this end, Buhrmester et al. (in press) compared Mechanical Turk subjects with a large Internet sample with respect to several psychometric scales and found no meaningful differences between the populations, as well as high test–retest reliability in the Mechanical Turk population. Additionally, Paolacci et al. (2010) conducted replications of standard judgment and decision-making experiments on Mechanical Turk, as well as with subjects recruited through online discussion boards and subjects recruited from the subject pool at a large Midwestern university. The studies they replicated were the "Asian disease" problem to test framing effects (Tversky & Kahneman, 1981), the "Linda" problem to test the conjunction fallacy (Tversky & Kahneman, 1983), and the "physician" problem to test outcome bias (Baron & Hershey, 1988). Quantitatively, there were only very slight differences between the results from Mechanical Turk and subjects recruited using the other methods, and qualitatively, the results were identical. This is similar to the results of Birnbaum (2000), who found that Internet users were more logically consistent in their decisions than were laboratory subjects.

There have also been a few studies that have compared Mechanical Turk behavior with laboratory behavior. For
example, the "Asian disease" problem (Tversky & Kahneman, 1981) was also replicated by Horton et al. (in press), who also obtained qualitatively similar results. In the same study, the authors found that workers "irrationally" cooperated in the one-shot Prisoner's Dilemma game, replicating previous laboratory studies (e.g., Cooper, DeJong, Forsythe, & Ross, 1996). They also found, in a replication of another, more recent laboratory study (Shariff & Norenzayan, 2007), that providing a religious prime before the game increased the level of cooperation. Suri and Watts (2011) replicated a public goods experiment that was conducted in the classroom (Fehr & Gachter, 2000), and despite the difference in context and the relatively lower pay on Mechanical Turk, there were no significant differences from a prior study conducted in the classroom (Fehr & Gachter, 2000).

In summary, there are numerous studies that show correspondence between the behavior of workers on Mechanical Turk and behavior offline or in other online contexts. While there are clearly differences between Mechanical Turk and offline contexts, evidence that Mechanical Turk is a valid means of collecting data is consistent and continues to accumulate.

Organization of this guide

In the following sections, we begin with a high-level overview of Mechanical Turk, followed by an exposition of methods for conducting different types of studies on Mechanical Turk. In the first half, we describe the basics of Mechanical Turk, including who uses it and why, and the general terminology associated with the platform. In the second half, we describe, at a conceptual level, how to conduct experiments on Mechanical Turk. We will focus on new concepts that come up in this environment that may not arise in the laboratory or in other online settings around the issues of ethics, privacy, and security. In this section, we also discuss the online community that has sprung up around Mechanical Turk. We conclude by outlining some interesting open questions regarding research on Mechanical Turk. We also include an appendix with engineering details required for building and conducting experiments on Mechanical Turk, for researchers and programmers who are building their experiments.

Mechanical Turk basics

There are two types of players on Mechanical Turk: requesters and workers. Requesters are the "employers," and the workers (also known as Turkers or Providers) are the "employees"—or more accurately, the "independent contractors." The jobs offered on Mechanical Turk are referred to as Human Intelligence Tasks (HITs). In this section, we discuss each of these concepts in turn.

Workers

In March of 2007, the New York Times reported that there were more than 100,000 workers on Mechanical Turk in over 100 countries (Pontin, 2007). Although this international diversity has been confirmed in many subsequent studies (Mason & Watts, 2009; Paolacci et al., 2010; Ross, Irani, Silberman, Zaldivar, & Tomlinson, 2010), as of this writing the majority of workers come from the United States and India, because Amazon allows cash payment only in U.S. dollars and Indian Rupees—although workers from any country can spend their earnings on Amazon.com.

Over the past 3 years, we have collected demographics for nearly 3,000 unique workers from five different studies (Mason & Watts, 2009; Suri & Watts, 2011). We compiled these studies, and of the 2,896 workers, 12.5% chose not to give their gender, and of the remainder, 55% reported being female and 45% reported being male. These demographics agree with other studies that have reported that the majority of U.S. workers on Mechanical Turk are female (Ipeirotis, 2010b; Ross et al., 2010). The median reported age of workers in our sample is 30 years old, and the average age is roughly 32 years old, as can be seen in Fig. 1; the overall shape of the distribution resembles reported ages in other Internet-based research (Reips, 2001). The different studies we compiled used different ranges when collecting information about income, so to summarize we classify workers by the top of their declared income range, which can be seen in Fig. 2. This shows that the majority of workers earn roughly U.S. $30 k per annum, although some respondents reported earning over $100 k per year.

Having multiple studies also allows us to check the internal consistency of these self-reported demographics. Of the 2,896 workers, 207 (7.1%) participated in exactly two studies, and of these 207, only 1 worker (0.4%) changed the answer on gender, age, education, or income. Thus, we conclude that the internal consistency of self-reported demographics on Mechanical Turk is high. This agrees with Rand (in press), who also found consistency in self-reported demographics on Mechanical Turk, and with Voracek, Stieger, and Gindl (2001), who compared the gender reported in an online survey (not on Mechanical Turk) conducted at the University of Vienna with that in the school's records and found a false response rate below 3%.

Given the low wages and relatively high income, one may wonder why people choose to work on Mechanical Turk at all. Two independent studies asked workers to indicate their reasons for doing work on Mechanical Turk. Ross et al. (2010) reported that 5% of U.S. workers and 13% of Indian workers said "MTurk money is always
necessary to make basic ends meet." Ipeirotis (2010b) asked a similar question but delved deeper into the motivations of the workers. He found that 12% of U.S. workers and 27% of Indian workers reported that "Mechanical Turk is my primary source of income." Ipeirotis (2010b) also reported that roughly 30% of both U.S. and Indian workers indicated that they were currently unemployed or held only a part-time job. At the other end of the spectrum, Ross and colleagues asked how important money earned on Mechanical Turk was to them: Only 12% of U.S. workers and 10% of Indian workers indicated that "MTurk money is irrelevant," implying that the money made through Mechanical Turk is at least relevant to the vast majority of workers. The modal response for both U.S. and Indian workers was that the money was simply nice and might be a way to pay for "extras." Perhaps the best summary statement of why workers do tasks on Mechanical Turk is the 59% of Indian workers and 69% of U.S. workers who agreed that "Mechanical Turk is a fruitful way to spend free time and get some cash" (Ipeirotis, 2010b). What all of this suggests is that most workers are not trying to scrape together a living using Mechanical Turk (fewer than 8% reported earning more than $50/week on the site).

[Fig. 1  Histogram (gray) and density plot (black) of reported ages of workers on Mechanical Turk]
[Fig. 2  Distribution of the maximum of the income (in U.S. dollars) interval self-reported by workers]

The number of workers available at any given time is not directly measurable. However, Ipeirotis (2010a) has tracked the number of HITs created and available every hour (and recently, every minute) over the past year and has used these statistics to infer the number of HITs being completed. With this information, he has determined that there are slight seasonalities with respect to time of day and day of week. Workers tend to be more abundant between Tuesday and Saturday, and Huang et al. (2010) found faster completion times between 6 a.m. and 3 p.m. GMT (which resulted in a higher proportion of Indian workers). Ipeirotis (2010a) also found that over half of the HIT groups are completed in 12 hours or less, suggesting a large active worker pool.

To become a worker, one must create a worker account on Mechanical Turk and an Amazon Payments account into which earnings can be deposited. Both of these accounts merely require an e-mail address and a mailing address. Any worker, from anywhere in the world, can spend the money he or she earns on Mechanical Turk on the Amazon.com Web site. As was mentioned before, to be able to withdraw their earnings as cash, workers must take the additional step of linking their Payments account to a verifiable U.S. or Indian bank account. In addition, workers can transfer money between Amazon's Payment accounts. While having more than one account is against Amazon's Terms of Service, it is possible, although somewhat tedious, for workers to earn money using multiple accounts and transfer the earnings to one account to either be spent on Amazon.com or withdrawn. Requesters who use external HITs (see The Anatomy of a HIT section) can guard against multiple submissions by the same worker by using browser cookies and tracking IP addresses, as Birnbaum (2004) suggested in the context of general online experiments.

Another important policy forbids workers from using programs ("bots") to automatically do work for them. Although infringements of this policy appear to be rare (but
see McCreadie, Macdonald, & Ounis, 2010), there are also legitimate workers who could best be described as spammers. These are individuals who attempt to make as much money completing HITs as they can, without regard to the instructions or intentions of the requester. These individuals might also be hard to discriminate from bots. Surveys are favorite targets for these spammers, since they can be completed easily and are plentiful on Mechanical Turk. Fortunately, Mechanical Turk has a built-in reputation system for workers: Every time a requester rejects a worker's submission, it goes on their record. Subsequent requesters can then refuse workers whose rejection rate exceeds some specified threshold or can block specific workers who previously submitted bad work. We will revisit this point when we describe methods for ensuring data quality.

Requesters

The requesters who put up the most HITs and groups of HITs on Mechanical Turk are predominantly companies automating portions of their business or intermediary companies that post HITs on Mechanical Turk on the behalf of other companies (Ipeirotis, 2010a). For example, search companies have used Mechanical Turk to verify the relevance of search results, online stores have used it to identify similar or identical products from different sellers, and online directories have used it to check the accuracy and "freshness" of listings. In addition, since businesses may not want to or be able to interact directly with Mechanical Turk, intermediary companies have arisen, such as Crowdflower (previously called Dolores Labs) and Smartsheet.com, to help with the process and guarantee results. As has been mentioned, Mechanical Turk is also used by those interested in machine learning, since it provides a fast and cheap way to get labeled data such as tagged images and spam classifications (for more market-wide statistics of Mechanical Turk, see Ipeirotis, 2010a).

In order to run studies on Mechanical Turk, one must sign up as a requester. There are two or three accounts required to register as a requester, depending on how one plans to interface with Mechanical Turk: a requester account, an Amazon Payments account, and (optionally) an Amazon Web Services (AWS) account.

One can sign up for a requester account at https://requester.mturk.com/mturk/beginsignin. (The Mechanical Turk Web site can be difficult to search and navigate, so we will provide URLs whenever possible.) It is advisable to use a unique e-mail address for running experiments, preferably one that is associated with the researcher or the research group, because workers will interact with the researcher through this account and this e-mail address. Moreover, the workers will come to learn a reputation and possibly develop a relationship with this account on the basis of the jobs being offered, the money being paid, and, on occasion, direct correspondence. Similarly, we recommend using a name that clearly identifies the researcher. This does not have to be the researcher's actual name (although it could be) but also should be sufficiently distinctive that the workers know who they are working for. For example, the requester name "University of Copenhagen" could refer to many research groups, and workers might be unclear about who is actually doing the research; the name "Perception Lab at U. Copenhagen" would be better.

To register as a requester, one must also create an Amazon Payments account (https://payments.amazon.com/sdui/sdui/getstarted) with the same account details as those provided for the requester account. At this point, a funding source is required, which can be either a U.S. credit card or a U.S. bank account. Finally, if one intends to interact with Mechanical Turk programmatically, one must also create an AWS account at https://aws-portal.amazon.com/gp/aws/developer/registration/index.html. This provides one with the unique digital keys necessary to interact with the Mechanical Turk Application Programming Interface (API), which is discussed in detail in the Programming interfaces section of the Appendix.

Although Amazon provides a built-in mechanism for tracking the reputation of the workers, there is no corresponding mechanism for the requesters. As a result, one might imagine that unscrupulous requesters could refuse to pay their workers, irrespective of the quality of their work. In such a case, there are two recourses for the aggrieved workers. One recourse is to report this to Amazon. If repeated offenses have occurred, the requester will be banned. Second, there are Web sites where workers share experiences and rate requesters (see the Turker community section for more details). Requesters that exploit workers would have an increasingly difficult time getting work done because of these external reputation mechanisms.
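For researchers who take the programmatic route, the AWS access key and secret key mentioned above are all that is needed to talk to the API. The following is a minimal sketch, assuming the boto 2.x Python bindings for Mechanical Turk (one of several client libraries available at the time of writing); module and parameter names may differ in other versions, and the sandbox host shown lets you test HITs without paying real workers.

    # A minimal sketch (boto 2.x assumed) of connecting to the Mechanical Turk API.
    # Point `host` at the sandbox while testing; switch to the production host
    # to post live HITs.
    from boto.mturk.connection import MTurkConnection

    SANDBOX_HOST = "mechanicalturk.sandbox.amazonaws.com"   # test environment
    PRODUCTION_HOST = "mechanicalturk.amazonaws.com"        # live marketplace

    conn = MTurkConnection(
        aws_access_key_id="YOUR_AWS_ACCESS_KEY",       # keys from the AWS account above
        aws_secret_access_key="YOUR_AWS_SECRET_KEY",
        host=SANDBOX_HOST,
    )

    # A quick sanity check: print the available balance of the Payments account.
    print(conn.get_account_balance())

The same connection object is reused in the sketches that follow.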

The Anatomy of a HIT

All of the tasks available on Mechanical Turk are listed together on the site in a standardized format that allows the workers to easily browse, search, and choose between the jobs being offered. An example of this is shown in Fig. 3. Each job posted consists of many HITs of the same "HIT type," meaning that they all have the same characteristics. Each HIT is displayed with the following information: the title of the HIT, the requester who created the HIT, the wage being offered, the number of HITs of this type available to be worked on, how much time the requester has allotted for completing the HIT, and when the HIT expires. By clicking on a link for more information, the worker can also see a longer description of the HIT, keywords associated with the HIT, and what qualifications are required to accept the HIT. We elaborate on these qualifications later, which restrict who can work on a HIT and, sometimes, who can preview it. If the worker is qualified to preview the HIT, he or she can click on a link and see the preview, which typically shows what the HIT will look like when he or she works on the task (see Fig. 4 for an example HIT).

[Fig. 3  Screenshot of the Mechanical Turk marketplace]
[Fig. 4  Screenshot of an example image classification HIT]

All of this information is determined by the requester when creating the HIT, including the qualifications needed to preview or accept the HIT. A very common qualification requires that over 90% of the assignments a worker has completed have been accepted by the requesters. Another common type of requirement is to specify that workers must reside in a specific country. Requesters can also design their own qualifications. For example, a requester could require the workers to complete some practice items and correctly answer questions about the task as a prerequisite to working on the actual assignments. More than one of these qualifications can be combined for a given HIT, and workers always see what qualifications are required and their own value for that qualification (e.g., their own acceptance rate).

Another parameter the requester can set when creating a HIT is how many "assignments" each HIT has. A single HIT can be made up of one or more assignments, and a worker can do only one assignment of a HIT. For example, if the HIT were a survey and the requester only wanted each worker to do the survey once, he or she would make one HIT with many assignments. As another example, if the task was labeling images and the requester wanted three different workers to label every image (say, for data quality purposes), the requester would make as many HITs as there are images to be labeled, and each HIT would have three assignments.

When browsing for tasks, there are several criteria the workers can use to sort the available jobs: how recently the HIT was created, the wage offered per HIT, the total number of available HITs, how much time the requester allotted to complete each HIT, the title (alphabetical), and how soon the HIT expires. Chilton, Horton, Miller, and Azenkot (2010) showed that the criterion most frequently used to find HITs is the "recency" of the HIT (when it was created), and this has led some to periodically add available HITs to the job in order to make it appear as though the HIT is always fresh. While this undoubtedly works in some cases, Chilton and colleagues also found an outlier group of recent HITs that were rarely worked on—presumably, these are the jobs that are being continually refreshed but are unappealing to the workers.

The offered wage is not often used for finding HITs, and Chilton et al. (2010) found a slight negative relationship at the highest wages between the probability of a HIT being worked on and the wage offered. This finding is reasonably explained by unscrupulous requesters using high wages as bait for naive workers—which is corroborated by the finding that higher paying HITs are more likely to be worked on, once the top 60 highest paying HITs have been excluded.
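To make the HIT parameters above concrete, the sketch below posts an external HIT with the qualification and assignment settings just described. It again assumes the boto 2.x bindings and the `conn` object from the earlier sketch; the URL, title, pay, and qualification thresholds are illustrative values only, and parameter names may differ across library versions.

    # A hedged sketch (boto 2.x assumed) of posting an external HIT with a
    # 90%-approval qualification, a U.S.-locale restriction, and 100 assignments.
    from boto.mturk.question import ExternalQuestion
    from boto.mturk.qualification import (Qualifications,
                                          PercentAssignmentsApprovedRequirement,
                                          LocaleRequirement)

    quals = Qualifications()
    quals.add(PercentAssignmentsApprovedRequirement("GreaterThanOrEqualTo", 90))
    quals.add(LocaleRequirement("EqualTo", "US"))

    question = ExternalQuestion(
        external_url="https://example.org/my-study",  # hypothetical study URL
        frame_height=600,                             # height of the MTurk frame
    )

    hit = conn.create_hit(
        question=question,
        title="Answer a short decision-making survey",
        description="A 5-minute academic survey about everyday choices.",
        keywords="survey, research, psychology",
        reward=0.50,                # base pay in U.S. dollars (some versions expect a Price object)
        max_assignments=100,        # each worker can complete at most one assignment
        duration=30 * 60,           # time allotted per assignment, in seconds
        lifetime=7 * 24 * 60 * 60,  # how long the HIT stays on the market, in seconds
        qualifications=quals,
    )
    # The response contains the newly created HIT; record its HITId so the
    # submitted assignments can be retrieved and reviewed later.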

Internal or external HITs  Requesters can create HITs in two different ways, as internal or external HITs. An internal HIT uses templates offered by Amazon, in which the task and all of the data collection are done on Amazon's servers. The advantage of these types of HITs is that they can be generated very quickly, and the most one needs to know to build them is HTML programming. The drawback is that they are limited to be single-page HTML forms. In an external HIT, the task and data are kept on the requester's server and are provided to the workers through a frame on the Mechanical Turk site, which has the benefit that the requester can design the HIT to do anything he or she is capable of programming. The drawback is that one needs access to an external server and, possibly, more advanced programming skills. In either case, there is no explicit cue that the workers can use to differentiate between internal and external HITs, so there is no difference from the workers' perspective.

Lifecycle of a HIT  The standard process for HITs on Amazon's Mechanical Turk begins with the creation of the HIT, designed and set up with the required information. Once the requester has created the HIT and is ready to have it worked on, the requester posts the HIT to Mechanical Turk. A requester can post as many HITs and as many assignments as he or she wants, as long as the total amount owed to the workers (plus fees to Amazon) can be covered by the balance of the requester's Amazon Payments account.

Once the HIT has been created and posted to Mechanical Turk, workers can see it in the listings of HITs and choose to accept the task. Each worker then does the work and submits the assignment. After the assignment is complete, requesters review the work submitted and can accept or reject any or all of the assignments. When the work is accepted, the base pay is taken from the requester's account and put into the worker's account. At this point requesters can also grant bonuses to workers. Amazon charges the requesters 10% of the total pay granted (base pay plus bonus) as a service fee, with a minimum of $0.005 per HIT.
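The review-and-pay step of this lifecycle can also be done through the API. The sketch below, again assuming the boto 2.x bindings and the `conn` object from the earlier sketches, retrieves the submitted assignments for a HIT, approves them, and grants a bonus; the HIT ID placeholder, bonus amount, and feedback strings are illustrative only, and large HITs would require paging through the results.

    # A hedged sketch (boto 2.x assumed) of reviewing submitted work.  Approving an
    # assignment releases the base pay; grant_bonus adds an extra payment on top.
    from boto.mturk.price import Price

    HIT_ID = "HIT_ID_RECORDED_AT_CREATION"  # placeholder for the HITId noted earlier

    for assignment in conn.get_assignments(HIT_ID, status="Submitted"):
        # Accept the work (base pay plus Amazon's fee is charged at this point)...
        conn.approve_assignment(assignment.AssignmentId,
                                feedback="Thank you for participating.")
        # ...and optionally grant a bonus on top of the base pay.
        conn.grant_bonus(assignment.WorkerId,
                         assignment.AssignmentId,
                         Price(0.25),
                         reason="Bonus for completing the study carefully.")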

If there are more HITs of the same type to work on after the workers complete an assignment, they are offered the opportunity to work on another HIT of the same type. There is even an option to automatically accept HITs of the same type after completing one HIT. Most HITs have some kind of initial time cost for learning how to do the task correctly, and so it is to the advantage of workers to look for tasks with many HITs available. In fact, Chilton et al. (2010) found that the second most frequently used criterion for sorting is the number of HITs offered, since workers look for tasks where the investment in the initial overhead will pay off with lots of work to be done. As was mentioned, the requester can prevent this behavior by creating a single HIT with multiple assignments, so that workers cannot have multiple submissions.

The HIT will be completed and will disappear from the list on Mechanical Turk when either of two things occurs: All of the assignments for the HIT have been submitted, or the HIT expires. As a reminder, both the number of assignments that make up the HIT and the expiration time are defined by the requester when the HIT is created. Also, both of these values can be increased by the requester while the HIT is still running.

Reviewing work  Requesters should try to be as fair as possible when judging which work to accept and reject. If a requester is viewed as unfair by the worker population, that requester will likely have a difficult time recruiting workers in the future. Many HITs require the workers to have an approval rating above a specified threshold, so unfairly rejecting work can result in workers being prevented from doing other work. Most importantly, whenever possible requesters should be clear in the instructions of the HIT about the criteria on which work will be accepted or rejected.

One typical criterion for rejecting a HIT is if it disagrees with the majority response or is a significant outlier (Dixon, 1953). For example, consider a task where workers classify a post from Twitter as spam or not spam. If four workers rate the post as spam and one rates it as not spam, this may be considered valid grounds for rejecting the minority opinion. In the case of surveys and other tasks, a requester may reject work that is done faster than a human could have possibly done the task. Requesters also have the option of blocking workers from doing their HIT. This extreme measure should be taken only if a worker has repeatedly submitted poor work or has otherwise tried to illicitly get money from the requester.
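The two rejection criteria just mentioned, disagreement with the majority and implausibly fast completion, are easy to automate once the results are downloaded. The sketch below is a plain-Python illustration of that logic; the field names and the 30-second threshold are hypothetical and would need to be adapted to the actual task.

    # A sketch of flagging assignments for review: compute the majority label per
    # item and flag workers who disagree with it or who finish implausibly fast.
    from collections import Counter

    MIN_SECONDS = 30  # hypothetical lower bound on plausible completion time

    def review(assignments):
        """assignments: list of dicts with 'item', 'worker', 'label', 'seconds'."""
        flagged = []
        # Group responses by item and take the most common label as the majority.
        by_item = {}
        for a in assignments:
            by_item.setdefault(a["item"], []).append(a)
        majority = {item: Counter(x["label"] for x in rows).most_common(1)[0][0]
                    for item, rows in by_item.items()}
        for a in assignments:
            if a["label"] != majority[a["item"]]:
                flagged.append((a["worker"], a["item"], "disagrees with majority"))
            if a["seconds"] < MIN_SECONDS:
                flagged.append((a["worker"], a["item"], "faster than plausible"))
        return flagged

Flagged assignments are then inspected by hand rather than rejected automatically, in keeping with the fairness considerations above.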
Improving HIT efficiency

How much to pay  One of the first questions asked by new requesters on Mechanical Turk is how much to pay for a task. Often, rather than anchoring on the costs for online studies, researchers come with the prior expectation based on laboratory subjects, who typically cost somewhat more than the current minimum wage. However, recent research on the behavior of workers (Chilton et al., 2010) demonstrated that workers had a reservation wage (the least amount of pay for which they would do the task) of only $1.38 per hour, with an average effective hourly wage of $4.80 for workers (Ipeirotis, 2010a).

There are very good reasons for paying more in lab experiments than on Mechanical Turk. Participating in a lab-based experiment requires aligning schedules with the experimenter, travel to and from the lab, and the effort required to participate. On Mechanical Turk, the effort to participate is much lower since there are no travel costs, and it is always on the worker's schedule. Moreover, because so many workers are using AMT as a source of extra income using free time, many are willing to accept lower wages than they might otherwise. Others have argued that because of the necessity for redundancy in collecting data (to avoid spammers and bad workers), the wage that might otherwise go to a single worker is split among the redundant workers (see http://behind-the-enemy-lines.blogspot.com/2010/07/mechanical-turk-low-wages-and-market.html). We discuss some of the ethical arguments around the wages on Mechanical Turk in the Ethics and privacy section.

A concern that is often raised is that lower pay leads to lower quality work. However, there is evidence that for at least some kinds of tasks, there seems to be little to no effect of wage on the quality of work obtained (Marge et al., 2010; Mason & Watts, 2009). Mason and Watts used two tasks in which they manipulated the wage earned on Mechanical Turk, while simultaneously measuring the quantity and quality of work done. In the first study, they found that the number of tasks completed increased with greater wages (from $0.01 to $0.10) but that there was no difference in the quality of work. In the second study, they found that subjects did more tasks when they received pay than when they received no pay per task but saw no effect of actual wage on quantity or quality of the work.

These results are consistent with the findings from the survey paper of Camerer and Hogarth (1999), which showed that for most economically motivated experiments, varying the size of the incentives has little to no effect. This survey article does, however, indicate that there are classes of experiments, such as those based on judgments and decisions (e.g., problem solving, item recognition/recall, and clerical tasks) where the incentive scheme has an effect on performance. In these cases, however, there is usually a change in behavior going from paying zero to some low amount and little to no change in going from a low amount to a higher amount. Thus, the norm on Mechanical Turk of paying less than one would typically pay laboratory subjects should not impact large classes of experiments.

Consequently, it is often advisable to start by paying less than the expected reservation wage, and then increasing the wage if the rate of completed work is too low. Also, one way to increase the incentive to subjects without drastically increasing the cost to the requester is to offer a lottery to subjects. This has been done in other online contexts (Göritz, 2008). It is worth noting that requesters can post HITs that pay nothing, although these are rare and unlikely to be worked on unless there is some additional motivation
(e.g., benefiting a charity). In fact, previous work has shown that offering subjects financial incentives increases both the response and retention rates of online surveys, relative to not offering any financial incentive (Frick, Bächtiger, & Reips, 2001; Göritz, 2006).

Time to completion  The second most often asked question is how quickly work is completed. Of course, the answer to the question depends greatly on many different factors: how much the HIT pays, how long each HIT takes, how many HITs are posted, how enjoyable the task is, the reputation of the requester, and so forth. To illustrate the effect of one of these variables, the wage of the HIT, we posted three different six-question multiple-choice surveys. Each survey was one HIT with 500 assignments. We posted the surveys on different days so that we would not have two surveys on the site at the same time. But we did post them on the same day of the week (Friday) and at the same time of day (12:45 p.m. EST). The $0.05 version was posted on August 13, 2010; the $0.03 version was posted on August 27, 2010; and the $0.01 version was posted on September 17, 2010. We held the time and day of week constant because, as was mentioned earlier, both have been shown to have seasonality trends (Ipeirotis, 2010a). Figure 5 shows the results of this experiment. The response rate for the $0.01 survey was much slower than those for the $0.03 and $0.05 versions, which had very similar response rates. While this is not a completely controlled study and is just meant for illustrative purposes, Buhrmester et al. (in press) and Huang et al. (2010) found similar increases in completion time with greater wages. Looking across these studies, one could conclude that the relationship between wage and completion time is positive but nonlinear.

[Fig. 5  Response rate for three different six-question multiple-choice surveys conducted with different pay rates]

Attrition  Attrition is a bigger concern in online experiments than in laboratory experiments. While it is possible for subjects in the lab to simply walk out of an experiment, this happens relatively rarely, presumably because of the social pressure the subjects might feel to participate. In the online setting, however, user attrition can come from a variety of sources. A worker could simply open up a new browser window and stop paying attention to the experiment at hand, he or she could walk away from their computer in the middle of an experiment, a user's Web browser or entire machine could crash, or his or her Internet connectivity could cut out.

One technique for reducing attrition in online experiments involves asking subjects how serious they are about completing the experiment and dropping the data from those whose seriousness is below a threshold (Musch & Klauer, 2002). Other techniques involve putting anything that might cause attrition, such as legal text and demographic questions, at the beginning of the experiment. Thus, subjects are more likely to drop out during this phase than during the data-gathering phase (see Reips, 2002, and follow-up work by Göritz & Stieger, 2008). Reips (2002) also suggested using the most basic and widely available technology in an online experiment to avoid attrition due to software incompatibility.

Conducting studies on Mechanical Turk

In the following sections, we show how to conduct research on Mechanical Turk for three broad classes of studies. Depending on the specifics of the study being conducted, experiments on Mechanical Turk can fall anywhere on the spectrum between laboratory experiments and field experiments. We will see examples of experiments that could have been done in the lab but were put on Mechanical Turk. We will also see examples of what amount to online field experiments. We outline the general concepts that are unique to doing experiments on Mechanical Turk throughout this section and elaborate on the technical details in the Appendix.

Surveys

Surveys conducted on Mechanical Turk share the same advantages and disadvantages as any online survey (Andrews, Nonnecke, & Preece, 2003; Couper, 2000). The issues surrounding online survey methodologies have been studied extensively, including a special issue of Public Opinion Quarterly devoted exclusively to the topic (Couper & Miller, 2008). The biggest disadvantage to conducting surveys online is that the population is not representative of any geographic area or segment of population, and Mechanical Turk is not even particularly representative of the online population.

Methods have been suggested for correcting these selection biases in surveys generally (Berk, 1983; Heckman, 1979), and the appropriate way to do this on Mechanical Turk is an open question. Thus, as with any sample, whether it be online or offline, researchers must decide for themselves whether the subject pool on Mechanical Turk is appropriate for their work.

However, as a tool for conducting pilot surveys or for surveys that do not depend on generalizability, Mechanical Turk can be a convenient platform for constructing surveys and collecting responses. As was mentioned in the Introduction, relative to other methodologies, Mechanical Turk is very fast and inexpensive. However, this benefit comes with a cost: the need to validate the responses to filter out bots and workers who are not attending to the purpose of the survey. Fortunately, validating responses can be managed in several relatively time- and cost-effective ways, as outlined in the Quality assurance section. Moreover, because workers on Mechanical Turk are typically paid after completing the survey, they are more likely to finish it once they start (Göritz, 2006).

Amazon provides a HIT template to aid in the construction of surveys (Amazon also provides other templates, which we discuss in the HIT templates section of the Appendix). Using a template means that the HIT will run on an Amazon machine. Amazon will store the data from the HIT, and the requester can retrieve the data at any point in the HIT's lifecycle. The HIT template gives the requester a simple Web form where he or she defines all the values for the various properties of the HIT, such as the number of assignments, pay rate, title, and description (see the Appendix for a description of all of the parameters of a HIT). After specifying the properties for the HIT, the requester then creates the HTML for the HIT. In the HTML, the requester specifies the type of input and content for each input type (e.g., survey question), and for multiple-choice questions, the value for each choice. The results are given back to the requester in a comma-separated values file (.csv). There is one row for each worker and one column for each question, where the worker's response is in the corresponding cell. Requesters are allowed to preview the modified template to ensure that there are no problems with the layout.

Aside from standard HTML, HIT templates can also include variables that can have different values for each HIT, which Mechanical Turk fills in when a worker previews the HIT. For example, suppose one did a simple survey template that asked one question: What is your favorite ${object}? Here, ${object} is a variable. When designing the HIT, a requester could instantiate this variable with a variety of values by uploading a .csv file with ${object} as the first column and all the values in the rows below. For example, a requester could put in values of color, restaurant, and song. If done this way, three HITs would be created, one for each of these values. Each one of these three HITs would have ${object} replaced with color, restaurant, and song, respectively. Each of these HITs would have the same number of assignments as specified in the HIT template.

Another way to build a survey on Mechanical Turk is to use an external HIT, which requires you to host the survey on your own server or use an outside service. This has the benefit of increased control over the content and aesthetics of the survey, as well as allowing one to have multiple pages in a survey and, generally, more control over the form of the survey.
This also means the data is secure because it is never stored on Amazon's servers. We will discuss external HITs more in the next few sections.

It is also possible to integrate online survey tools such as SurveyMonkey and Zoomerang with Mechanical Turk. One may want to do this instead of simply creating the survey within Mechanical Turk if one has already created a long survey using one of these tools and would simply like to recruit subjects through Mechanical Turk. To integrate with a premade survey on another site, one would create a HIT that provides the worker with a unique identifier, a link to the survey, and a submit button. In the survey, one would include a text field for the worker to enter their unique identifier. One could also direct the worker to the "dashboard" page (https://www.mturk.com/mturk/dashboard) that includes their unique worker ID, and have them use that as their identifier on the survey site. The requester would then know to approve only the HITs that have a survey with a matching unique identifier.
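One lightweight way to do this matching is to generate a random completion code for each worker, show it in the HIT, ask for it again in the external survey, and then approve only the assignments whose code appears in the survey export. The sketch below illustrates that bookkeeping in plain Python; the function names and data formats are hypothetical and are not part of Mechanical Turk or any survey tool.

    # A sketch of completion-code bookkeeping for an externally hosted survey:
    # issue a hard-to-guess code per worker, then approve only matching assignments.
    import secrets

    def issue_code(worker_id, code_book):
        """Create (or reuse) a completion code for a worker and record it."""
        if worker_id not in code_book:
            code_book[worker_id] = secrets.token_hex(4)  # e.g. '9f3b2c1a'
        return code_book[worker_id]

    def matching_assignments(mturk_codes, survey_codes):
        """mturk_codes: {worker_id: code the worker entered in the HIT};
        survey_codes: set of codes found in the survey tool's export.
        Returns the worker IDs whose assignments should be approved."""
        return [w for w, code in mturk_codes.items() if code in survey_codes]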

Random assignment

The cornerstone of most experimental designs is random assignment of subjects to different conditions. The key to random assignment on Mechanical Turk is ensuring that every time the study is done, it is done by a new worker. Although it is possible to have multiple accounts (see the Workers section), it is against Amazon's policy, so random assignment to unique Worker IDs is a close approximation to uniquely assigning individuals to conditions. Additionally, tracking worker IP addresses and using browser cookies can help ensure unique workers (Reips, 2000).

One way to do random assignment on Mechanical Turk is to create external HITs, which allows one to host any Web-based content within a frame on Amazon's Mechanical Turk. This means that any functionality one can have with Web-based experiments—including setups based on JavaScript, PHP, Adobe Flash, and so forth—can be done on Mechanical Turk. There are three vital components to random assignment with external HITs. First, the URL of the landing page of the study must be included in the parameters for the external HIT so Mechanical Turk will know where the code for the experiment resides. Second, the code for the experiment must capture three variables passed to it from Amazon when a worker accepts the HIT: the "HITId," "WorkerId," and "AssignmentId." Finally, the experiment must provide a "submit" button that sends the Assignment ID (along with any other data) back to Amazon (using the externalSubmit URL, as described in the Appendix).

For a Web-based study that is being hosted on an external server but delivered on Mechanical Turk, there are a few ways to ensure that subjects are being assigned to only one condition. The first way is to post a single HIT with multiple assignments. In this way, Mechanical Turk ensures that each assignment is completed by a different worker: each worker will see only one HIT available. Because every run through the study is done by a different person, random assignment can be accomplished by ensuring that the study chooses a condition randomly every time a worker accepts a HIT.

While this method is relatively easy to accomplish, it can run into problems. The first arises when one has to rerun an experiment. There is no built-in way to ensure that a worker who has already completed a HIT will not be able to return the next time a HIT is posted and complete it again, receiving a different condition assignment the second time around. Partially, this can be dealt with by careful planning and testing, but some experimental designs may need to be repeated multiple times while ensuring that subjects are receiving the same condition each time. A simple but more expensive way to deal with repeat workers is to allow all workers to complete the HIT multiple times and disregard subsequent submissions. A more cost-effective way is to store the mapping between a Worker ID (passed to the site when the worker accepts the HIT) and that worker's assigned condition. If the study is built so that this mapping is checked when a worker accepts the HIT, the experimenter can be sure that each worker experiences only a single condition. Another option is to simply refuse entry to workers who have already done the experiment. In this case, requesters must clearly indicate in the instructions that workers will be allowed to do the experiment only once.

Mapping the Worker ID to the condition assignment does not, of course, rule out the possibility that the workers will discuss their condition assignments. As we discuss in the Turker community section, workers are most likely to communicate about the HITs on which they worked in the online forums focused on Mechanical Turk. It is possible that these conversations will include information about their condition assignments, and there is no way to prevent subjects from communicating. This can also be an issue in general online experiments and in multisession offline experiments. Mechanical Turk has the benefit that these conversations on the forums can be monitored by the experimenter.

When these methods are used, the preview page must be designed to be consistent with all possible condition assignments. For instance, Mason and Watts (2009) randomized the pay the subjects received. Because the wage offered per HIT is visible before the worker even previews the HIT, the different wage conditions had to be done through bonuses and could not be revealed until after the subject had accepted the HIT.
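The sketch below pulls these pieces together for an external HIT: it reads the workerId, assignmentId, and hitId query parameters that Mechanical Turk appends to the landing-page URL, assigns a condition once per Worker ID, and posts the assignment ID back to the externalSubmit endpoint. It is a minimal illustration assuming the Flask web framework and an in-memory condition map (a real study would persist the map in a database); the route name and condition names are hypothetical. During a preview, Mechanical Turk sends the sentinel assignment ID ASSIGNMENT_ID_NOT_AVAILABLE, which the sketch uses to show a neutral page consistent with every condition.

    # A minimal sketch (assuming Flask) of an external HIT that does random
    # assignment keyed on the Worker ID and submits back to Mechanical Turk.
    import random
    from flask import Flask, request

    app = Flask(__name__)
    conditions = ["control", "treatment"]   # hypothetical condition names
    assigned = {}                           # WorkerId -> condition (use a database in practice)

    @app.route("/my-study")
    def landing_page():
        assignment_id = request.args.get("assignmentId", "")
        worker_id = request.args.get("workerId", "")
        hit_id = request.args.get("hitId", "")
        submit_to = request.args.get("turkSubmitTo", "https://www.mturk.com")

        # During a preview Mechanical Turk sends this sentinel (and no workerId),
        # so show a neutral page that is consistent with all conditions.
        if assignment_id == "ASSIGNMENT_ID_NOT_AVAILABLE":
            return "<p>Preview: accept the HIT to begin the study.</p>"

        # Assign each Worker ID to a condition exactly once, even across reruns.
        condition = assigned.setdefault(worker_id, random.choice(conditions))

        # The form must post the assignmentId (plus any data) back to externalSubmit.
        return f"""
          <p>Thank you for accepting the HIT. Please complete the task below.</p>
          <form method="POST" action="{submit_to}/mturk/externalSubmit">
            <input type="hidden" name="assignmentId" value="{assignment_id}">
            <input type="hidden" name="hitId" value="{hit_id}">
            <input type="hidden" name="condition" value="{condition}">
            <input type="text" name="response" placeholder="Your answer">
            <input type="submit" value="Submit HIT">
          </form>
        """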
Finally, for many studies, it is important to calculate and report intent-to-treat effects. Imagine a laboratory study that measures the effect of blaring noises on reading comprehension and finds the counterintuitive result that the noises improve comprehension. This result could be explained by the fact that there was a higher dropout rate in the “noises” condition and the remainder either had superior concentration or were deaf and, therefore, unaffected. In the context of Mechanical Turk, one should be sure to keep records of how many people accepted and how many completed the HIT in each condition.

Synchronous experiments

Many experimental designs have the property that one subject’s actions can affect the experience and, possibly, the payment of another subject. Mechanical Turk was designed for tasks that are asynchronous in nature, in which the work can be split up and worked on in parallel. Thus, it is not a priori clear how one could conduct these types of experiments on Mechanical Turk. In this section, we describe one way synchronous participation can be achieved: by building a subject panel, notifying the panel of upcoming experiments, providing a “waiting room” for queuing subjects, and handling attrition during the experiment. The methods discussed here have been used successfully by Suri and Watts (2011) in over 100 experimental sessions, as well as by Mao, Parkes, Procaccia, and Zhang (2011).

Building the panel

An important part of running synchronous experiments on Mechanical Turk is building a panel of subjects to notify about upcoming experiments. We recommend building the panel by either running several small, preliminary experiments or running a different study on Mechanical Turk and asking subjects whether they would like to be notified of future studies. In these preliminary experiments, the requester should require that all workers who take part in the experiment be first-time players, indicate this clearly in the instructions, and build it into the design of the HIT. Since the default order in which workers view HITs is by time of creation, with the newest HITs first, a new HIT is seen by quite a few workers right after it has been created (Chilton et al., 2010). Thus, we found that requiring only 4 to 8 subjects works well, since this ensures that the first worker to accept the HIT will not have to wait too long before the last worker accepts the HIT and the session can begin.

At the end of the experiment, perhaps during an exit survey, the requester can ask the workers whether they would like to be notified of future runs of this or other experiments. When subjects are asked whether they would like to be notified of future studies, we recommend making the default option to not be notified and asking the workers to opt in. Since most tasks on Mechanical Turk are rather tedious, even a moderately interesting experiment will have a very high opt-in rate. For example, the opt-in rate was 85% for Suri and Watts (2011). In addition, since the workers are required to be fresh (i.e., never having done the experiment before), this method can be used to grow the panel fairly rapidly. Figure 6 shows the growth of one panel using this method, and we have seen even faster growth in subsequent studies. It should be clear to the subjects joining the panel whether they are being asked to do more studies of the same type or studies of a different type from the same requester. If they agree to the latter, the panels can be reused from experiment to experiment. Göritz et al. (2008) showed that paying individuals between trials of an experiment can increase response and retention rates, although their results were attenuated by the fact that their subjects had to take the time to sign up for a PayPal account, which is unnecessary on Mechanical Turk.

Fig. 6 Rate of growth of panel from Suri and Watts (2011). Periods without growth indicate times between experimental runs.

In our experience, small preliminary experiments have a benefit beyond growing the panel: they serve to expose bugs in the experimental system. Systems where users concurrently interact can be difficult to test and debug, since it can be challenging for a single person to get the entire system in a state where the bug reveals itself. Also, it is better for problems to reveal themselves with a small number of workers in the experiment than with a large number.

Notifying workers

Now that we have shown how to construct a panel, we next show how to take advantage of it. Doing so involves a method that Mechanical Turk provides for sending messages to workers. Before the experiment is to run, a requester can use the NotifyWorkers API call to send workers a message indicating the times at which the next experiment(s) will be run (see the Appendix for more details, including how to ensure that the e-mails are delivered and properly formatted). We found that sending a notification the evening before an experiment was sufficient warning for most workers. We also found that conducting experiments between 11 a.m. and 5 p.m. EST resulted in the experiment filling quickly and proceeding with relatively few dropouts. Also, if one wants to conduct experiments with n subjects simultaneously, experience has shown us that one needs a panel with 3n subjects in it. Using this rule of thumb, we have managed to run as many as 45 subjects simultaneously. If the panel has substantially more than 3n subjects, many workers might get shut out of the experiment, which can be frustrating to them. In this case, one could either alter the experiment to allow more subjects or sample 3n subjects from the panel.
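The Appendix describes the NotifyWorkers call itself; as an illustration, the sketch below issues the same call from Python using the boto3 library, which is an assumption on our part rather than the tooling described in the Appendix. The operation accepts only a limited number of Worker IDs per call (commonly documented as 100), so the panel is sent in batches; check the current API documentation for the exact limit and message-length restrictions.

```python
# Sketch: notify panel members of an upcoming session via the NotifyWorkers call.
# Assumes boto3 is installed and requester AWS credentials are configured.
import boto3


def notify_panel(worker_ids, subject, message, batch_size=100):
    client = boto3.client("mturk", region_name="us-east-1")
    # NotifyWorkers caps the number of Worker IDs per request, so send in batches.
    for start in range(0, len(worker_ids), batch_size):
        client.notify_workers(
            Subject=subject,
            MessageText=message,
            WorkerIds=worker_ids[start:start + batch_size],
        )


# Example usage (panel_worker_ids is a hypothetical list built from the exit survey):
# notify_panel(panel_worker_ids,
#              subject="Experiment session tomorrow",
#              message="We will run the next session at 1 p.m. EST tomorrow. "
#                      "Search for our requester name to find the HIT.")
```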
Waiting room

Since the experiment is synchronous, all of the workers must begin the experiment at the same time. However, there will inevitably be differences in the time that workers accept the HIT. One way to resolve this issue is to create an online “waiting room” for the workers. As more workers accept the HIT, the waiting room will fill up until the requisite number of workers have arrived and the experiment can begin. We have found that indicating to the workers how many people have joined and how many are required provides valuable feedback on how much time they can expect to wait. Once one instance of the experiment has filled up and begun, the waiting room can then either inform additional prospective workers that the experiment is full and they should return the HIT or funnel them into another instance of the experiment. The waiting room and the message that the experiment is full are good opportunities to recruit more subjects into the study and/or advertise future runs of the experiment.
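One simple way to implement such a waiting room is to have each worker’s browser poll a small endpoint on the requester’s server every few seconds; the server counts the workers who have arrived and tells everyone to start once the required number is reached. The sketch below is an illustrative, self-contained Flask example under those assumptions, not a prescribed design; a production version would also need to handle workers who leave the queue and should funnel overflow workers into another session or ask them to return the HIT.

```python
# Sketch of a waiting-room endpoint: workers poll it until enough have arrived.
from flask import Flask, jsonify, request

app = Flask(__name__)
REQUIRED = 8     # number of subjects needed before the session starts
waiting = set()  # worker IDs currently in the waiting room


@app.route("/waiting_room")
def waiting_room():
    worker_id = request.args.get("workerId", "")
    if len(waiting) < REQUIRED or worker_id in waiting:
        waiting.add(worker_id)
        started = len(waiting) >= REQUIRED
        return jsonify(joined=len(waiting), required=REQUIRED,
                       start=started, full=False)
    # The session is already full; tell extra workers to return the HIT
    # (or funnel them into another instance of the experiment).
    return jsonify(joined=len(waiting), required=REQUIRED,
                   start=False, full=True)
```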

Attrition

In the synchronous setting, it is of paramount importance to have a time-out after which, if a subject has not chosen an action, the system chooses one for him or her. Including this time-out and automated action avoids having an experiment stall, with all of the subjects waiting for a missing subject to take an action. Because experiments on Mechanical Turk are inexpensive, an experimenter can simply throw out trials with too much attrition. Alternatively, the experimenter can use the dropouts as an opportunity to have a (dummy) confederate player act in a prescribed way to observe the effect on the subjects. In the work of Suri and Watts (2011), the authors discarded experiments where fewer than 90% of the actions were done by humans (as opposed to default actions chosen by the experimental system). Out of 94 experiments run with 20–24 players, 21 had to be discarded using this criterion.

Quality assurance

The downside to fast and cheap data is the potential for low quality. From the workers’ perspective, they will earn the most money by finding the fastest and easiest way to complete HITs. As was mentioned earlier, most workers are not motivated primarily by the financial returns and genuinely care about the quality of their work, but nearly all of them also care, at least a little, about how efficiently they are spending their time. However, there are a few workers who do not care about the quality of the work they put out as long as they earn money (they are typically characterized as spammers). Moreover, there are reports of programs (bots) designed to automatically complete HITs (McCreadie et al., 2010), and these are essentially guaranteed to provide bad data.

To ensure that the instructions for the HIT are clear, requesters can add a text box to their HIT asking whether any part of it was confusing. In addition, a significant amount of research has gone into methods for improving and assuring data quality. The simplest and probably most commonly used method is obtaining multiple responses. For many of the common tasks on Mechanical Turk, this is a very effective and cost-efficient strategy. For instance, Snow and colleagues compared workers on Mechanical Turk with expert labelers for natural language tasks and determined how many Mechanical Turk worker responses were required to reach expert-level accuracy (Snow et al., 2008); the answer ranged from two to nine with a simple majority rule, and one or two with more sophisticated learning algorithms. Sheng, Provost, and Ipeirotis (2008) used labels acquired through Mechanical Turk as input to a machine-learning classifier and showed, across 12 data sets, that using the “majority vote” label obtained from multiple labels improved classification accuracy in all cases. In follow-up work, Ipeirotis, Provost, and Wang (2010) developed an algorithm that factors in both per-item classification error and per-worker biases to reduce error with even fewer workers and labels.
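For labeling-style tasks, the redundancy-plus-majority-vote approach described above takes only a few lines of code. The sketch below aggregates multiple workers’ labels for each item with a simple majority rule; it is a generic illustration rather than the specific procedures used by Snow et al. (2008) or Sheng et al. (2008), and ties are broken arbitrarily.

```python
# Aggregate redundant labels with a simple majority vote, one label per item.
from collections import Counter, defaultdict


def majority_vote(labels):
    """labels: iterable of (item_id, worker_id, label) tuples.
    Returns {item_id: winning label}; ties are broken arbitrarily."""
    by_item = defaultdict(list)
    for item_id, _worker_id, label in labels:
        by_item[item_id].append(label)
    return {item: Counter(votes).most_common(1)[0][0]
            for item, votes in by_item.items()}


# Example with three workers labeling two items:
labels = [("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
          ("img2", "w1", "dog"), ("img2", "w2", "dog"), ("img2", "w3", "dog")]
print(majority_vote(labels))  # {'img1': 'cat', 'img2': 'dog'}
```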
However, for most survey and experimental data, where individual variability is an important part of the data obtained, receiving multiple responses may not be an option for determining “correct” responses. For surveys and some experimental designs, one option is to include a question designed to discourage spammers and bots, something that requires human knowledge and the same amount of effort as the other questions in the survey but has a verifiable answer that can be used to vet the submitted work. Kittur, Chi, and Suh (2008) had Mechanical Turk workers rate the quality of Wikipedia articles and compared them with experts. They found a significant increase in the quality of the data obtained when they included additional questions that had verifiable answers: The proportion of invalid responses went from 48.6% to 2.5%, and the correlation of responses to expert ratings became statistically significant.

If you include these “captcha” or “reverse Turing test” questions, it is advisable to make it clear that workers will not be paid if the verifiable questions are answered incorrectly. Also, if the questions are very incongruent with the rest of the study, it should be made clear that they are included to verify the legitimacy of the other answers. Two examples of such questions are “Who is the president of the United States?” and “What is 2 + 2?” We asked the former as a captcha question in one of the surveys described in Fig. 5. Out of 500 responses, only six people got the question wrong, and three people did not answer the question.

In some cases, it may be possible to have the workers check their own work. If responses in a study do not have correct answers but do have unreasonable answers, it may be possible to use Mechanical Turk workers to vet the responses of others. For instance, if a study requires a free-text response, one could create another HIT for the purpose of validating the responses. It would be a very fast and easy task for workers (and, therefore, inexpensive for requesters) to read these responses and verify that they are coherent and reasonable answers to the question asked. Little, Chilton, Goldman, and Miller (2010) found that this sort of self-correction can be a very efficient way of obtaining good data.

Finally, another effective way of filtering bad responses is to look at the patterns of responses. Zhu and Carterette (2010) looked at the pattern of responses on surveys and found that low-quality responses had very low-entropy patterns of response: always choosing one option (e.g., the first response to every question) or alternating between a small number of options in a regular pattern (e.g., switching between the first and the last responses). The time spent completing individual tasks can also be a quick and easy means of identifying poor or low-effort responses, so much so that filtering work by time spent is built into the Mechanical Turk site for reviewing output. When Kittur et al. (2008) included verifiable answers in their study, they found that the time spent completing each survey went up from 1.5 min to over 4 min. It is usually possible to determine a lower bound on the amount of time required to actually participate in the study and to filter responses that fall below this threshold.
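Both of these screens, flagging low-entropy response patterns and discarding submissions completed implausibly fast, are easy to apply after the data are downloaded. The sketch below is one illustrative way to do so in Python; the entropy cutoff and minimum completion time are placeholders that should be tuned to the particular survey rather than treated as recommended values.

```python
# Flag suspicious survey submissions: near-constant response patterns or
# completion times below a plausible minimum. Thresholds are placeholders.
import math
from collections import Counter


def response_entropy(answers):
    """Shannon entropy (in bits) of a worker's multiple-choice answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_suspicious(answers, seconds_spent, min_entropy=0.5, min_seconds=90):
    """True if the pattern of answers is nearly constant or the task was too fast."""
    return response_entropy(answers) < min_entropy or seconds_spent < min_seconds


# Example: a worker who answered "A" to every question in 40 seconds is flagged.
print(is_suspicious(["A"] * 20, seconds_spent=40))                        # True
print(is_suspicious(["A", "C", "B", "D", "B", "A"], seconds_spent=240))   # False
```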

Security

As was stated above, the code for an external HIT typically resides on the requester’s server. The code for the HIT is susceptible to attacks from the general Internet population, because it must be executable by any machine on the Internet to work on Mechanical Turk. Here, we provide a general overview of some security issues that could affect a study being run as an external HIT and ways to mitigate those issues. In general, it is advisable to consult an expert in computer security when hosting a public Web site.

To begin with, we advocate that requesters make an automated nightly backup of the work submitted by the workers. In order to ensure the integrity of the data gathered, a variety of security precautions are necessary for external HITs. Two of the most common attacks on Web-based applications are database (most commonly SQL) injection attacks and cross-site scripting (XSS) attacks. A database injection attack can occur on any system that uses a database to store user input and experiment parameters, which is a common way to design Web-based software. A database injection attack can occur at any place where the code takes user input: a malicious user can craft input that tricks the database underlying the requester’s software into treating it as a command. Such an attack could result in the database executing an arbitrary command specified by the malicious user, and some commands could compromise the data that have been stored. Preventing this type of attack is a relatively straightforward matter of scrubbing user input for database commands, for instance, by removing characters recognized by the database as a command. There are a variety of software libraries, available for free online in many programming languages and specific to particular database implementations, that will aid in this endeavor.

Cross-site scripting (XSS) attacks are another type of code injection attack. Here, a malicious user would try to inject arbitrary scripting code, such as malicious JavaScript, into the input in an attempt to get the requester’s server to run the code. Here again, one of the main methods for preventing this type of attack is input validation. For example, if the input must be a number, the requester’s code should ensure that the only characters in the input are numbers, a plus or minus sign, or a decimal point. Another preventative measure is to “HTML escape” the user input, which ensures that any code placed in the input by a malicious user will not be executed. We caution prospective requesters who use external HITs to take these measures seriously.
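In Python, for example, both defenses amount to a few idioms: let the database driver bind user input as parameters rather than pasting it into a query string, and escape anything that will be echoed back into a page. The sketch below uses the standard-library sqlite3 module and html.escape as stand-ins for whatever database and templating system the experiment actually uses.

```python
# Sketch: store a worker's free-text answer safely and echo it back without
# letting it run as SQL or as script in the page. sqlite3 and html.escape are
# stand-ins for the requester's actual database and templating system.
import html
import sqlite3

conn = sqlite3.connect("experiment.db")
conn.execute("CREATE TABLE IF NOT EXISTS responses (worker_id TEXT, answer TEXT)")


def save_response(worker_id, answer):
    # Parameter binding ("?" placeholders) prevents SQL injection: the driver
    # treats the values as data, never as part of the SQL command.
    conn.execute("INSERT INTO responses (worker_id, answer) VALUES (?, ?)",
                 (worker_id, answer))
    conn.commit()


def render_response(answer):
    # HTML-escaping prevents cross-site scripting when the text is shown again:
    # any <script> tags become inert text instead of executable code.
    return "<p>" + html.escape(answer) + "</p>"


save_response("A123", "Robert'); DROP TABLE responses;--")
print(render_response("<script>alert('xss')</script>"))
```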
Code security is not the only type of security necessary for experiments on Mechanical Turk. The protocol that the requester uses to run the experiment must also be secure. We demonstrate this with an example. The second author of this article attempted a synchronous experiment that was made up of many HITs. The first part of the HIT was to take a quiz to ensure understanding of the experiment. If a worker passed the quiz, he or she would enter the waiting room and then eventually go into the experiment. Workers were paid $0.50 for passing the quiz, along with a bonus depending on their actions in the experiment. Two malicious workers then accepted as many HITs as they could at one time. Meanwhile, the benevolent workers accepted one HIT each, passed the quiz, went into the waiting room, and eventually began the experiment. After accepting as many HITs as possible, the malicious workers filled out the quiz correctly for each HIT, submitting them after the experiment had begun. Thus, the malicious workers were paid for their quizzes but never entered the experiment. The second author got bilked out of roughly $200. The fix was simply to make the experiment one HIT with many assignments, so that each Turker could accept only one HIT at a time.

Ethics and privacy

As with any research involving human subjects, care must be taken to ensure that subjects are treated in an ethical manner and that the research follows standard guidelines such as the Belmont Report (Ryan et al., 1979). While oversight of human research is typically managed by the funding or home institution of the researcher, it is the researcher’s responsibility to ensure that appropriate steps are taken to conduct ethical research.

Mechanical Turk and other crowdsourcing sites define a relatively new ethical and legal territory, and therefore the policies surrounding them are open to debate. Felstiner (2010) reviews many of the legal grounds and ethical issues related to crowdsourcing and is an excellent starting point for the discussion. There are also many ethical issues that apply to online experimentation in general. While these have been covered extensively elsewhere (Barchard & Williams, 2008), we felt that it would be helpful to the reader to highlight them here. In the following section, we touch on issues relevant to Institutional Review Boards (IRBs) when proposing research on Mechanical Turk.

Informed consent

Informed consent of subjects is nearly always a requirement for human subject research. One way to obtain consent on