Thursday, February 17, 2011

I can think of three things that may have allowed Watson to win at Jeopardy!:

Hardware: From a comment at Shtetl-Optimized: "The hardware Watson was running on is said to be capable of 80 teraflops. According to the TOP500 list for November 2000, the fastest supercomputer (ASCI White) was capable of 4.9 teraflops." So, the available hardware became roughly 16x faster over those 10 years. Is this the winning factor?

Software: A couple of months back Noam Nisan reported: "while improvements in hardware accounted for an approximate 1,000 fold increase in calculation speed over a 15-year time-span, improvements in algorithms accounted for an over 43,000 fold increase." So, maybe it is just the better NLP and machine learning algorithms that played the crucial role in this success.

Data: 10 years back we did not have Wikipedia and its derivatives, such as Wiktionary, Wikiquote, Wikispecies, DBpedia, etc. Such resources add tremendous value for finding connections between concepts.

My gut feeling says that the crucial factor is the development of the data resources that allowed Watson to answer such trivia questions. Without discounting the importance of hardware and software advances, without such tremendously rich and well-organized data sources it would simply not have been possible for Watson to answer these questions. IBM WebFountain was around for a while, but trying to structure the unstructured data of the web, and extract meaning from it, is much harder than analyzing the nicely organized data in DBpedia.

I asked Dahn if he had collected more extensive data. He runs tasks using a very large number of workers, so his sample should be representative.

I was not disappointed: Dahn had data from approximately 19,000 workers, based on 75,000 worker requests over the last 6 months, recording the "user-agent" part of the HTTP request. For each workerid, we counted how many different user-agents we had seen. The maximum was 19, with an average of 1.3.
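The per-worker counting step can be sketched as follows (the field names and the tiny request log are made up; the post does not show the actual log format):

```python
from collections import defaultdict

# Hypothetical log: one (workerid, user_agent) pair per HTTP request.
requests = [
    ("W1", "Mozilla/5.0 ... Chrome/9.0"),
    ("W1", "Mozilla/5.0 ... Chrome/9.0"),
    ("W2", "Mozilla/4.0 ... MSIE 8.0"),
    ("W2", "Mozilla/4.0 ... MSIE 9.0"),
]

# Collect the set of distinct user-agent strings seen per workerid.
agents_per_worker = defaultdict(set)
for worker_id, user_agent in requests:
    agents_per_worker[worker_id].add(user_agent)

counts = [len(agents) for agents in agents_per_worker.values()]
max_agents = max(counts)                # cf. the maximum of 19 in the real data
avg_agents = sum(counts) / len(counts)  # cf. the average of 1.3
```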

Then, we measured the number of workerids per user-agent string. If a worker had registered multiple user-agent strings, we split the credit across browsers. For example, if the same workerid had one session with IE8 and one with IE9, we gave 0.5 credit to IE8 and 0.5 credit to IE9.
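The credit-splitting scheme just described is a one-liner per worker; a minimal sketch (browser labels and workers are again hypothetical):

```python
from collections import defaultdict

# Distinct user agents observed per workerid (made-up data).
agents_per_worker = {
    "W1": {"IE8", "IE9"},       # two browsers -> 0.5 credit each
    "W2": {"Chrome 9"},         # one browser  -> full credit
    "W3": {"Firefox 3.6"},
}

credit = defaultdict(float)
for worker_id, agents in agents_per_worker.items():
    # Split one unit of credit evenly across this worker's browsers.
    for agent in agents:
        credit[agent] += 1.0 / len(agents)

# credit now holds fractional "worker counts" per browser,
# e.g. IE8 -> 0.5, IE9 -> 0.5, Chrome 9 -> 1.0
```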

Yep, most (~85%-90%) MTurk workers use Windows. I found the prevalence of Windows XP very surprising!

Browser Usage

I found the relatively low percentage of IE users quite interesting.

For reference, here are the most common versions for each of the most-used browsers by Mechanical Turk workers:

One piece of good news is that most MTurk workers tend to use recent browser versions with good support for the latest web technologies (CSS, JavaScript, etc.). Interestingly, though, a very significant fraction still uses Windows XP!

In the future, we may repeat the measurements, also keeping statistics about JavaScript versions, plugins, Flash support, etc.

Until then, enjoy and code safely, knowing that you do not have to support IE6 and IE7 in your MTurk HITs.

Monday, February 14, 2011

I am just posting this here, to build some awareness about HCOMP 2011, the 3rd Human Computation Workshop, which will be organized together with AAAI in San Francisco, on August 7 or 8. You can also find more detailed information about the workshop at http://www.humancomputation.com. The submission deadline is April 29th (extended from April 22).

In this 3rd Human Computation Workshop (HCOMP 2011), we hope to draw together participants across disciplines -- machine learning, HCI, mechanism and market design, information retrieval, decision-theoretic planning, optimization, computer vision -- for a stimulating full-day workshop at AAAI in beautiful San Francisco this summer. There will be presentations of new work, lively discussions, poster and demo sessions, and invited talks by Eric Horvitz, Jennifer Wortman, and others. There will also be a 4-hour tutorial, "Human Computation: Core Research Questions and State of the Art," at AAAI on August 7, which will give newcomers and current researchers a bird's-eye view of the research landscape of human computation.

Human computation is a relatively new research area that studies how to build intelligent systems involving human computers, each performing computation (e.g., image classification, translation, and protein folding) that leverages human intelligence but challenges even the most sophisticated AI algorithms that exist today. With the immense growth of the Web, human computation systems can now leverage the abilities of an unprecedented number of Internet users to perform complex computation. Various genres of human computation applications are available today, including games with a purpose (e.g., the ESP Game) that generate useful data through gameplay, crowdsourcing marketplaces (e.g., Amazon Mechanical Turk) that coordinate workers to perform tasks for monetary rewards, and identity verification systems (e.g., reCAPTCHA) that generate useful data as users perform computation in exchange for access to online content.

Despite the variety of human computation applications, many core research issues are common across them. How can we design mechanisms for querying human computers that incentivize or encourage truthful responses? What are the techniques for aggregating noisy outputs from multiple human computers? How do we effectively assign tasks to human computers to match their particular expertise and interests? What are some programming paradigms for designing algorithms that effectively leverage humans in the loop? How do we build human computation systems that involve the joint efforts of both machines and humans, trading off their particular strengths and weaknesses? Significant advances on such questions will likely need to draw on many disciplines, including machine learning, mechanism and market design, information retrieval, decision-theoretic planning, optimization, human-computer interaction, etc.

The workshop recognizes the growing opportunity for AI to function as an enabling technology in human computation systems. At the same time, AI can leverage technical advances and data collected from human computation systems for its own advancement. The goal of HCOMP 2011 is to bring together academic and industry researchers from diverse subfields in a stimulating discussion of existing human computation applications and future directions of this relatively new subject area. The workshop also aims to broaden the scope of human computation beyond data collection, toward studying systems where humans perform a major part of the computation or are an integral part of the overall computational system.

Topics

Topics of interest include, but are not limited to:

Programming languages, tools and platforms to support human computation

Domain-specific challenges in human computation

Methods for estimating the cost, reliability, and skill of labelers

Methods for designing and controlling workflows for human computation tasks

Empirical and formal models of incentives in human computation systems

Benefits of one-time versus repeated labeling

Design of manipulation-resistance mechanisms in human computation

Concerns regarding the protection of labeler identities

Active learning from imperfect human labelers

Techniques for inferring expertise and routing tasks

Theoretical limitations of human computation

Format

The workshop will consist of several invited talks from prominent researchers in different areas related to human computation, selected presentations of technical and position papers, as well as poster and demo sessions, organized by theme.

Submission

Technical papers and position papers may be up to 6 pages in length, and should follow AAAI formatting guidelines. For demos and poster presentations, authors should submit a short paper or extended abstract (up to 2 pages). We welcome early work, and particularly encourage submission of visionary position papers that are more forward looking. Papers must be submitted electronically via CMT. The submission deadline is April 22, 2011.

Invited talk - Winter Mason, presenting his work on "Individual vs. Group Success in Social Networks".

Winter started by presenting a summary of his HCOMP 2009 paper on how payment on Mechanical Turk affects quality and speed of completion (hint: it affects speed, not quality). Then he moved on to how agents can collaborate to solve complex problems. How does the structure of the communication network affect the results? How does an individual's position in the network affect their performance? The participants played a game in which they attempted to discover oil on a map. The users could see where their neighbors had searched for oil in the field, so they were, in a sense, guiding each other in discovering oil. As part of the experiment, the underlying, invisible graph connecting the players was varied to observe the effects. Typically, players copied each other more as the clustering coefficient of the graph increased (i.e., less exploration). However, this did not have a statistically significant effect on "finding the peak of oil production": all graph structures performed similarly in terms of overall success, although there was some non-significant decrease in performance. This work points the way to a lot of interesting future research: When should we let people talk to each other vs. work independently? What structure of the solution space indicates that people should collaborate vs. explore independently? (E.g., if there is a "hard to find" peak of oil production and many easier-to-find areas of moderate production, you want people to explore independently; if there is a single peak, collaboration helps.)

The main question is how to capture the behavior of search engine users when you do not have a search engine available (à la Google, Microsoft, Yahoo, etc.). The alternative is to run a lab study, but there the population is homogeneous and it is expensive to create a big data set.

The authors described how they built a task-oriented system on top of MTurk, intermediating on top of Bing, and examined how users used the search engine to identify answers to the questions posed (taken from the IIR track of TREC).

The authors describe their experiences crowdsourcing relevance evaluations: "Find interesting stories on day d, for category c." Basic setup: display the results to the user, and measure speed, cost, and level of agreement (quality of assessments). They observe everything on the server, and they conclude that MTurk is good, cheap, and fast. To ensure quality they need at least three assessors per document. Workers seemed to just skim through the documents (~15 seconds per document), but the results were pretty consistent with the overall TREC results.

Carsten Eickhoff and Arjen de Vries. How Crowdsourcable is Your Task?

The authors recount how they fell in love with crowdsourcing, but then realized (surprise!) that there are cheaters out to get you and submit junk. Their major conclusions: the built-in reputation metrics on MTurk are pretty much useless, and gold testing tends to work well only in specific cases (well-defined, unambiguous answers). They suggest making HITs more "creative" and less "mechanical" to be less susceptible to spam. Novelty helps, as it refreshes the mind of the worker.

How can we structure a resume-screening HIT to achieve both good precision and recall? HR screening is mainly a recall task (you do not want to lose good candidates). The initial worker screening included an English skills test, plus an "attention to detail" task that verifies workers are paying attention. The author presented an experiment with different treatment conditions in terms of incentives (no incentive, bonus always, bonus only on performance framed as negative). The basic result: incentives for performance increase completion time, and positive incentives increase performance.

An analysis of the MTurk market, identifying how long it takes for a task to be completed. Effect of price: a 10x price gives a 40% speedup. Effect of grouping HITs: 1,000 HITs grouped into one batch get done 7x faster than 1,000 HITs posted sequentially. The slides are available online.

Some ideas on how to use graphical models to model users and their expertise, in order to match them with appropriate items to work on. Basic idea: user modeling helps in understanding the workers and can improve active learning.

Omar Alonso. Perspectives on Infrastructure for Crowdsourcing.

Description of techniques that facilitate the development of advanced crowdsourcing systems (e.g., MapReduce, reputation systems, workflows, etc.). Can we devise a general computation system for the HPU (human processing unit), building on existing paradigms for computational systems? What are the fundamental building blocks that we need?

I described our experiences in building systems for managing quality when dealing with imperfect human annotators (from our HCOMP 2010 paper), and in efficiently allocating labeling resources when using the data to build machine learning models (KDD 2008 and a working paper). I used plenty of examples from our AdSafe experience, and gave a brief glimpse into our latest explorations of using psychology and biology to influence worker behavior.

Conclusions and Best Paper

The day ended with some overall discussion about the problems that we face with crowdsourcing. The discussion focused significantly on how to separate inherent uncertainty in the signal from uninformative noise, and on how to let the "informed minority" get the truth out without being drowned by the "tyranny of the majority." (A good example is the question "Is Obama a Grammy winner?", where most people will intuitively say "no" but the correct answer is "yes"; in redundancy-based approaches it is likely that the noise will bury the signal.) People also expressed concern that everyone is building their own little system from scratch, reinventing the wheel, instead of coordinating to share experiences and infrastructure. Finally, the paper "How Crowdsourcable is Your Task?" got the most-innovative-paper award, and the discussion continued over beers and other distilled beverages...

Sunday, February 6, 2011

There are a few techniques that are extremely easy to understand and implement. At the same time, they appear so basic that they should be very easy to beat with more advanced techniques. However, this is often not the case. Consider the following examples:

Majority voting and aggregating discrete votes.

Let's say that we are trying to label objects using discrete labels (e.g., is this blog comment "spam" or "not spam"?). For each object, we get multiple people to look at it and label it; this is the current practice on Mechanical Turk.

The simplest aggregation technique: Use the majority vote as the correct answer.
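In code, the whole "model" is a few lines (the comments and labels below are made up for illustration):

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties are broken arbitrarily."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical votes from three workers per blog comment.
votes = {
    "comment-1": ["spam", "spam", "not spam"],
    "comment-2": ["not spam", "not spam", "spam"],
}
aggregated = {obj: majority_vote(labels) for obj, labels in votes.items()}
# aggregated == {"comment-1": "spam", "comment-2": "not spam"}
```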

This seems a ridiculously easy baseline to beat. We can model the quality of the workers. We can control for the varying difficulty of the examples that need to be rated. We can control for the different types of expertise of the workers, and match them with the examples that are best for them. Plenty of papers were published around this topic.

What is the improvement? Modest at best, and non-existent most of the time. The only real improvement, in most cases, comes from kicking out the spammers and taking the majority vote across the good workers. Do we need any advanced technique for that? No. A few gold cases here and there (à la CrowdFlower), or a simple check of how often each worker agrees with the majority, is typically enough.

Why is that? Because majority voting is a simple model with no parameters to estimate. Anything more advanced needs a lot of data to generate robust parameter estimates, and the errors introduced by incorrect parameter estimates typically wipe out the advantages of the more complex modeling.
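For completeness, the agreement-with-majority filter mentioned above fits in a few lines too (workers, labels, and the 50% threshold below are all hypothetical):

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

# votes[obj] is a list of (worker, label) pairs (made-up data;
# "eve" plays the spammer who always disagrees).
votes = {
    "o1": [("alice", "spam"), ("bob", "spam"), ("eve", "not spam")],
    "o2": [("alice", "not spam"), ("bob", "not spam"), ("eve", "spam")],
    "o3": [("alice", "spam"), ("bob", "spam"), ("eve", "not spam")],
}

# How often does each worker agree with the majority vote?
agree, total = Counter(), Counter()
for obj, pairs in votes.items():
    m = majority([label for _, label in pairs])
    for worker, label in pairs:
        total[worker] += 1
        agree[worker] += (label == m)

agreement = {w: agree[w] / total[w] for w in total}

# Kick out workers below a (hypothetical) 50% agreement threshold, re-vote.
good = {w for w, a in agreement.items() if a >= 0.5}
cleaned = {obj: majority([label for w, label in pairs if w in good])
           for obj, pairs in votes.items()}
```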

Averages and aggregating continuous probability estimates

Now consider the case of combining forecasts from multiple sources. For example, we want to predict the weather and have multiple sources, each with its own forecast. Or we have many stock market analysts covering the same stock and making predictions about its future performance.

Consider the simplest way to aggregate: average across all estimates $p_i$:

$\hat p = \frac{1}{N} \sum_{i=1}^N p_i$

Very straightforward. As with aggregating discrete labels, it is trivial to improve upon, in theory.

This topic has a long history in the literature, and there are even meta-studies that examine the effectiveness of the various approaches. See for example the survey-style studies:

Both studies reach similar conclusions: you can definitely improve on simple averages, but most of the time the improvement is marginal, and you lose robustness. From Clemen and Winkler: "...simple combination rules (e.g., simple average) tend to perform quite well. More complex rules sometimes outperform the simple rules, but they can be somewhat sensitive, leading to poor performance in some instances."
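The robustness point can be made concrete with a toy simulation (entirely synthetic: equally accurate forecasters with normal errors, and inverse-variance weights estimated from a small, hypothetical calibration sample). When the forecasters are comparably skilled, the estimated weights are pure estimation noise, and the unweighted average is at least as accurate with nothing to fit.

```python
import random

random.seed(0)

TRUE_VALUE = 0.0
N_FORECASTERS = 5
SIGMA = 1.0          # all forecasters equally (in)accurate -- a toy assumption
N_EVENTS = 10_000

def forecasts():
    """One set of independent forecasts for a single event."""
    return [TRUE_VALUE + random.gauss(0, SIGMA) for _ in range(N_FORECASTERS)]

# "Complex" rule: estimate inverse-variance weights from 10 calibration events.
calib = [forecasts() for _ in range(10)]
est_var = [sum(f[i] ** 2 for f in calib) / len(calib)
           for i in range(N_FORECASTERS)]
weights = [1.0 / v for v in est_var]
total = sum(weights)
weights = [w / total for w in weights]

# Evaluate both rules on fresh events.
sq_simple = sq_weighted = 0.0
for _ in range(N_EVENTS):
    f = forecasts()
    simple = sum(f) / N_FORECASTERS
    weighted = sum(w * x for w, x in zip(weights, f))
    sq_simple += (simple - TRUE_VALUE) ** 2
    sq_weighted += (weighted - TRUE_VALUE) ** 2

rmse_simple = (sq_simple / N_EVENTS) ** 0.5
rmse_weighted = (sq_weighted / N_EVENTS) ** 0.5
# The estimated weights can only add noise here: any deviation from equal
# weights increases the variance of the combined forecast.
```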

Sometimes a simple baseline algorithm is so good for practical purposes that any improvement is only of academic interest. I know we have papers to write, careers to advance, and grants to get, but sometimes it is good to stop and think: "What did I gain compared to a simpler alternative? Does it make sense to introduce so much additional complexity for the sake of a minor improvement?"