The use of crowdsourcing may be relatively new to the technology, business, and humanitarian sectors, but in statistics, crowdsourcing is a well-known and established sampling method. Crowdsourcing is just non-probability sampling. The crowdsourcing of crisis information is simply an application of non-probability sampling.

Let’s first review probability sampling, in which every unit in the population being sampled has a known probability (greater than zero) of being selected. This approach makes it possible to “produce unbiased estimates of population totals, by weighting sampled units according to their probability of selection.”
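To make the weighting idea concrete, here is a minimal sketch of that estimator (the Horvitz-Thompson estimator) in Python. The population values and selection probabilities are invented for illustration; each sampled unit is weighted by the inverse of its known selection probability.

```python
import random

# Hypothetical population: each unit has a value and a known,
# non-zero probability of being selected into the sample.
population = [{"value": v, "p_select": p}
              for v, p in zip(range(1, 101), [0.1] * 50 + [0.3] * 50)]

random.seed(42)
sample = [u for u in population if random.random() < u["p_select"]]

# Horvitz-Thompson estimate of the population total: weight each
# sampled unit by the inverse of its selection probability.
ht_total = sum(u["value"] / u["p_select"] for u in sample)

true_total = sum(u["value"] for u in population)
print(true_total)            # 5050
print(round(ht_total, 1))    # an unbiased estimate of the total
```

The key point is that the weights are only computable because every unit’s selection probability is known in advance, which is precisely what non-probability sampling lacks.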

Non-probability sampling, on the other hand, describes an approach in which some units of the population have no chance of being selected or where the probability of selection cannot be accurately determined. An example is convenience sampling. The main drawback of non-probability sampling techniques is that “information about the relationship between sample and population is limited, making it difficult to extrapolate from the sample to the population.”

There are several advantages, however. First, non-probability sampling is a quick way to collect and analyze data in a range of settings with diverse populations. The approach is also a “cost-efficient means of greatly increasing the sample, thus enabling more frequent measurement.” In some cases, non-probability sampling may actually be the only approach available—a common constraint in much research, including many medical studies, not to mention Ushahidi Haiti. The method is also used in exploratory research, e.g., for hypothesis generation, especially when attempting to determine whether a problem exists or not.

The point is that non-probability sampling can save lives, many lives. Much of the data used for medical research is the product of convenience sampling. When you see your doctor, or you’re hospitalized, that is not a representative sample. Should the medical field throw away all this data because it constitutes non-probability sampling? Of course not; that would be ludicrous.

The notion of bounded crowdsourcing, which I blogged about here, is also a known sampling technique called purposive sampling. This approach involves targeting experts or key informants. Snowball sampling is another type of non-probability sampling, which may also be applied to the crowdsourcing of crisis information.

In snowball sampling, you begin by identifying someone who meets the criteria for inclusion in your study. You then ask them to recommend others they know who also meet the criteria. Although this method would hardly lead to representative samples, there are times when it may be the best method available. Snowball sampling is especially useful when you are trying to reach populations that are inaccessible or hard to find.
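The referral process above can be sketched as a traversal of a contact network. The names, network, and inclusion criterion below are entirely hypothetical; the point is only to show how a sample grows from one seed informant through recommendations.

```python
from collections import deque

# Hypothetical contact network: who can refer whom.
contacts = {
    "amina": ["jean", "rose"],
    "jean": ["amina", "paul"],
    "rose": ["amina", "marie"],
    "paul": ["jean"],
    "marie": ["rose", "luc"],
    "luc": ["marie"],
}

def snowball_sample(seeds, meets_criteria, max_size):
    """Grow a sample by asking each included person for referrals."""
    sample, queue = set(), deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        if person in sample or not meets_criteria(person):
            continue
        sample.add(person)
        queue.extend(contacts.get(person, []))  # follow referrals
    return sample

# Start from one known informant and snowball outward.
result = snowball_sample(["amina"], meets_criteria=lambda p: True, max_size=4)
print(sorted(result))  # ['amina', 'jean', 'paul', 'rose']
```

Note that who ends up in the sample depends entirely on which seed you start from and who knows whom, which is exactly why the method cannot yield a representative sample.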

Projects like Mission 4636 and Ushahidi-Haiti could take advantage of this approach by using two-way SMS communication to ask respondents to spread the word. Individuals who sent in text messages about persons trapped under the rubble could (later) be sent an SMS asking them to share the 4636 short code with people who may know of other trapped individuals. When the humanitarian response began to scale during the search and rescue operations, purposive sampling using UN personnel could also have been implemented.

In contrast to non-probability sampling techniques, probability sampling often requires considerable time and extensive resources. Furthermore, non-response effects can easily turn any probability design into non-probability sampling if the “characteristics of non-response are not well understood” since these modify each unit’s probability of being sampled.

This is not to suggest that one approach is better than the other since this depends entirely on the context and research question.

22 responses to “Demystifying Crowdsourcing: An Introduction to Non-Probability Sampling”

Patrick, I have to disagree with your initial premise “Crowdsourcing is just non-probability sampling.”

If you just put out a fixed form that asks pre-selected questions, then yes – this simple concept may be true.

However, crowdsourcing is more than just asking fixed questions. It’s providing a platform in which members of the crowd can express their perspectives and values in ways that may not be expected. When you provide someone general tools for contributing information, they will surprise you with what’s relevant and shareable, and with insights that are truly transformative in the way we gather information and communicate.

These are definitely new models, and we are still determining how they fit within response and analysis, but there is more value inherent in the crowd than fixed sampling values alone can capture :)

As a sampling method, crowdsourcing is a form of non-probability sampling, e.g., convenience sampling, which I think is largely synonymous with crowdsourcing (Mission 4636 being an example using a broad “pre-selected question” via SMS). I don’t think that’s debatable per se; it’s a matter of definitions when it comes to the field of statistics. The post was to make a point about methods and the fact that crowdsourcing is not new—the applications of technologies are. So you’re right that I did not address the value of crowdsourcing (that was not the point of the post). And of course I fully agree with you that crowdsourcing is a way to give members of the crowd a voice, which can be transformative.

Using 4636 as the example then: originally meant as a reporting mechanism for cyclones, I believe, it was transformed into an earthquake response tool, but it could also be a needs identification, OpenStreetMap road quality and IDP camp updating, or community coordination mechanism. The type of data being sampled varies with the individual providing the information.

Yes indeed, the type of data sampled varies based on project and purpose. What was measured with 4636 was people’s needs, which the Fletcher team would then triage into most urgent needs. It was a form of non-probability sampling since not everyone on the ground had a non-zero probability of being selected. Perhaps if Digicel had gone ahead and done a mass SMS broadcast (which at one point was the plan), then this may have changed the sampling somewhat.

@Patrick, I apologize for my absence from the conversation, I’ve been meaning to re-engage but I haven’t been able to find the necessary quiet time to meditate on any of these great topics until now.

This will be a short response… hopefully.

In physics we sit around discussing, often arguing (often over alcohol) about probabilities until we’re red in the face, but when it comes right down to it, I’ve never been privy to a conversation, dissertation or paper in which human lives or livelihoods were the data points. It’s nice. With few exceptions, we (physicists) can argue, write heady papers and drink until we pass out, secure in the knowledge that no one is truly suffering in relation to our quaint and curious meanderings about the nature of the Universe.

So it seems surreal to me, perhaps slightly Orwellian, to be scrutinizing “crowdsourcing crisis information” with the same metrics of assessment and description as we use for unfathomably small particles and musings about black holes. I don’t want to come across as being all gushy on the topic; I recognize that traditional academic rigor still carries a lot of weight in high places, the type of places where decisions about funding and focusing of resources get made, for example. But I just want to point out that it feels a little odd to me, the non-social scientist in the room.

… That said, I propose that the oddity for me, schism perhaps is a better word, which I sense regarding the ‘modern’ context this discussion is taking place in, extends deeper than just the differences between the social sciences and the natural sciences. I think I’m talking (rambling now) about the schism between the vastly different paradigms that create knowledge silos vs. knowledge flows, and the camps that presently place value on each type of knowledge paradigm.

I would suggest that part of what doesn’t feel whole about the broader academic conversation this post is taking place in is that, by its very framing, it sits squarely in the camp of knowledge silos as the only valid paradigm with which to evaluate crowdsourcing crisis information. I’m not suggesting anyone’s to blame for this; it happens in the natural sciences all the time, it’s the status quo there and I think we’re suffering for it, but it’s so much less critical to solve that problem in the natural sciences right now because people’s lives aren’t at stake, just a heated debate over beers.

Hopefully no one feels like I’m coming down too hard on their heads on this – I’m attempting to say this diplomatically – but the system is a little out of whack at the moment (out of synchrony if you will) and my opinion is that we’re doing ourselves a disservice as academics, and as people interested in seeing positive social impacts by allowing the conversation to continue to be framed in such a way.

The list of hypotheses you can test and distinguish is much more limited when you’re using any kind of non-probability sampling method.

As Andrew points out this doesn’t mean that there’s no value. I know I’m speaking to the converted, but a crowdsourcing platform does something quite different from a sampling or survey exercise. It offers something to the participants. I used to analyze the UNICEF MICS household cluster surveys (I’m sorry, those orphan indicators are mine), and used to wonder how we were really empowering the people who were taking part in the survey. Again, not the point of the exercise at the time, but it’s not a long stretch to start thinking about how new survey tools can be used for more concrete, and flexible, purposes.

And Sean, as someone who used to work in physics, I can say that I had a similar sense of “these are REAL people and not just P(X | Y).”