Tag Archives: Sampling

“Arguing that Big Data isn’t all it’s cracked up to be is a straw man, pure and simple—because no one should think it’s magic to begin with.” Since citing this point in my previous post on Big Data for Disaster Response: A List of Wrong Assumptions, I’ve come across more mischaracterizations of Big (Crisis) Data. Most of these fallacies originate from the Ivory Towers; from [a small number of] social scientists who have carried out one or two studies on the use of social media during disasters and repeat their findings ad nauseam as if their conclusions are the final word on a very new area of research.

The mischaracterization of “Big Data and Sample Bias”, for example, typically arises when academics point out that marginalized communities do not have access to social media. First things first: I highly recommend reading “Big Data and Its Exclusions,” published by Stanford Law Review. While the piece does not address Big Crisis Data, it is nevertheless instructive when thinking about social media for emergency management. Secondly, identifying who “speaks” (and who does not speak) on social media during humanitarian crises is of course imperative, but that’s exactly why the argument about sample bias is such a straw man—all of my humanitarian colleagues know full well that social media reports are not representative. They live in the real world where the vast majority of data they have access to is unrepresentative and imperfect—hence the importance of drawing on as many sources as possible, including social media. Random sampling during disasters is a Quixotic luxury, which explains why humanitarian colleagues seek “good enough” data and methods.

Some academics also seem to believe that disaster responders ignore all other traditional sources of crisis information in favor of social media. This means, to follow their argument, that marginalized communities have no access to other communication life lines if they are not active on social media. One popular observation is the “revelation” that some marginalized neighborhoods in New York posted very few tweets during Hurricane Sandy. Why some academics want us to be surprised by this, I know not. And why they seem to imply that emergency management centers will thus ignore these communities (since they apparently only respond to Twitter) is also a mystery. What I do know is that social capital and the use of traditional emergency communication channels do not disappear just because academics chose to study tweets. Social media is simply another node in the pre-existing ecosystem of crisis information.

Furthermore, the fact that very few tweets came out of the Rockaways during Hurricane Sandy can be valuable information for disaster responders, a point that academics often overlook. To be sure, monitoring social media footprints during disasters can help humanitarians get a better picture of the “negative space” and thus infer what they might be missing, especially when comparing these “negative footprints” with data from traditional sources. Indeed, knowing what you don’t know is a key component of situational awareness. No one wants blind spots, and knowing who is notspeaking on social media during disasters can help correct said blind spots. Moreover, the contours of a community’s social media footprint during a disaster can shed light on how neighboring areas (that are not documented on social media) may have been affected. When I spoke about this with humanitarian colleagues in Geneva this week, they fully agreed with my line of reasoning and even added that they already apply “good enough” methods of inference with traditional crisis data.

My PopTech colleague Andrew Zolli is fond of saying that we shape the world by the questions we ask. My UN colleague Andrej Verity recently reminded me that one of the most valuable aspects of social media for humanitarian response is that it helps us to ask important questions (that would not otherwise be posed) when coordinating disaster relief. So the next time you hear an academic go on about [a presentation on] issues of bias and exclusion, feel free to share the above along with this list of wrong assumptions.

Most importantly, tell them [say] this: “Arguing that Big Data isn’t all it’s cracked up to be is a straw man, pure and simple—because no one should think it’s magic to begin with.” It is high time we stop mischaracterizing Big Crisis Data. What we need instead is a can-do, problem-solving attitude. Otherwise we’ll all fall prey to the Smart-Talk trap.

“It might be provocative to call into question one of the hottest tech movements in generations, but it’s not really fair. That’s because how companies and people benefit from Big Data, Data Science or whatever else they choose to call the movement toward a data-centric world is directly related to what they expect going in. Arguing that big data isn’t all it’s cracked up to be is a strawman, pure and simple—because no one should think it’s magic to begin with.”

So here is a list of misplaced assumptions about the relevance of Big Data for disaster response and emergency management:

• “Big Data will improve decision-making for disaster response”

This recent groundbreaking study by the UN confirms that many decisions made by humanitarian professionals during disasters are not based on any kind of empirical data—regardless of how large or small a dataset may be and even when the data is fully trustworthy. In fact, humanitarians often use anecdotal information or mainstream news to inform their decision-making. So no, Big Data will not magically fix these decision-making deficiencies in humanitarian organizations, all of which pre-date the era of Big (Crisis) Data.

• “Big Data suffers from extreme sample bias.”

This is often true of any dataset collected using non-random sampling methods. The statement also seems to suggest that representative sampling methods can actually be carried out just as easily, quickly and cheaply. This is very rarely the case, hence the use of non-random sampling. In other words, sample bias is not some strange disease that only affects Big Data or social media. And even though Big Data is biased and not necessarily objective, Big Data such as social media still represents a “new, large, and arguably unfiltered insights into attitudes and behaviors that were previously difficult to track in the wild.”

Statistical correlations in Big Data do not imply causation; they simply suggest that there may be something worth exploring further. Moreover, data that is collected via non-random, non-representative sampling does not invalidate or devalue the data collected. Much of the data used for medical research, digital disease detection and police work is the product of convenience sampling. Should they dismiss or ignore the resulting data because it is not representative? Of course not.

While the 911 system was set up in 1968, the service and number were not widely known until the 1970s and some municipalities did not have the crowdsourcing service until the 1980s. So it was hardly a representative way to collect emergency calls. Does this mean that the millions of 911 calls made before the more widespread adoption of the service in the 1990s were all invalid or useless? Of course not, even despite the tens of millions of false 911 calls and hoaxes that are made ever year. Point is, there has never been a moment in history in which everyone has had access to the same communication technology at the same time. This is unlikely to change for a while even though mobile phones are by far the most rapidly distributed and widespread communication technology in the history of our species.

There were over 20 million tweets posted during Hurricane Sandy last year. While “only” 16% of Americans are on Twitter and while this demographic is younger, more urban and affluent than the norm, as Kate Crawford rightly notes, this does not render the informative and actionable tweets shared during the Hurricane useless to emergency managers. After Typhoon Pablo devastated the Philippines last year, the UN used images and videos shared on social media as a preliminary way to assess the disaster damage. According to one Senior UN Official I recently spoke with, their relief efforts would have overlooked certain disaster-affected areas had it not been for this map.

Was the data representative? No. Were the underlying images and videos objective? No, they captured the perspective of those taking the pictures. Note that “only” 3% of the world’s population are active Twitter users and fewer still post images and videos online. But the damage captured by this data was not virtual, it was real damage. And it only takes one person to take a picture of a washed-out bridge to reveal the infrastructure damage caused by a Typhoon, even if all other onlookers have never heard of social media. Moreover, this recent statistical study reveals that tweets are evenly geographically distributed according to the availability of electricity. This is striking given that Twitter has only been around for 7 years compared to the light bulb, which was invented 134 years ago.

I have yet to meet anyone who earnestly believes this. As Derrick writes, “social media shouldn’t usurp traditional customer service or market research data that’s still useful, nor should the Centers for Disease Control start relying on Google Flu Trends at the expense of traditional flu-tracking methodologies. Web and social data are just one more source of data to factor into decisions, albeit a potentially voluminous and high-velocity one.” In other words, the situation is not either/or, but rather a both/and. Big (Crisis) Data from social media can complement rather than replace traditional information sources and methods.

• “Big Data will make us forget the human faces behind the data.”

Big (Crisis) Data typically refers to user-generated content shared on social media, such as Twitter, Instagram, Youtube, etc. Anyone who follows social media during a disaster would be hard-pressed to forget where this data is coming from, in my opinion. Social media, after all, is social and increasingly visually social as witnessed by the tremendous popularity of Instagram and Youtube during disasters. These help us capture, connect and feel real emotions.

I coined the term “bounded crowdsourcing” a couple years back to distinguish the approach from other methodologies for information collection. As tends to happen, some Muggles (in the humanitarian community) ridiculed the term. They freaked out about the semantics instead of trying to understand the under-lying concept. It’s not their fault though, they’ve never been to Hogwarts and have never taken Crowdsourcery 101 (joke!).

Open crowdsourcing or “unbounded crowdsourcing” refers to the collection of information with no intentional constraints. Anyone who hears about an effort to crowdsource information can participate. This definition is inline with the original description put forward by Jeff Howe: outsourcing a task to a generally large group of people in the form of an open call.

In contrast, the point of “bounded crowdsourcing” is to start with a small number of trusted individuals and to have these individuals invite say 3 additional individuals to join the project–individuals who they fully trust and can vouch for. After joining and working on the project, these individuals in turn invite 3 additional people they fully trust. And so on and so forth at an exponential rate if desired. Just like crowdsourcing is nothing new in the field of statistics, neither is “bounded crowdsourcing”; it’s analog being snowball sampling.

In snowball sampling, a number of individuals are identified who meet certain criteria but unlike purposive sampling they are asked to recommend others who also meet this same criteria—thus expanding the network of participants. Although these “bounded” methods are unlikely to produce representative samples, they are more likely to produce trustworthy information. In addition, there are times when it may be the best—or indeed only—method available. Incidentally, a recent study that analyzed various field research methodologies for conflict environments concluded that snowball sampling was the most effective method (Cohen and Arieli 2011).

I introduced the concept of bounded crowdsourcing to the field of crisis mapping in response to concerns over the reliability of crowd sourced information. One excellent real world case study of bounded crowdsourcing for crisis response is this remarkable example from Kyrgyzstan. The “boundary” in bounded crowd-sourcing is dynamic and can grow exponentially very quickly. Participants may not all know each other (just like in open crowdsourcing) so in some ways they become a crowd but one bounded by an invite-only criteria.

I have since recommended this approach to several groups using the Ushahidi platform, like the #OWS movement. The statistical method known as snowball sampling is decades old. So I’m not introducing a new technique, simply applying a conventional approach from statistics to the field of crisis mapping and calling it bounded to distinguish the methodology from regular crowdsourcing efforts. What is different and exciting about combining snowball sampling with crowd-sourcing is that a far larger group can be sampled, a lot more quickly and also more cost-effectively given today’s real-time, free social networking platforms.

The use of crowdsourcing may be relatively new to the technology, business and humanitarian sectors but when it comes to statistics, crowdsourcing is a well known and established sampling method. Crowdsourcing is just non-probability sampling. The crowdsourcing of crisis information is simply an application of non-probability sampling.

Lets first review probability sampling in which every unit in the population being sampled has a known probability (greater than zero) of being selected. This approach makes it possible to “produce unbiased estimates of population totals, by weighting sampled units according to their probability selection.”

Non-probability sampling, on the other hand, describes an approach in which some units of the population have no chance of being selected or where the probability of selection cannot be accurately determined. An example is convenience sampling. The main drawback of non-probability sampling techniques is that “information about the relationship between sample and population is limited, making it difficult to extrapolate from the sample to the population.”

There are several advantages, however. First, non-probability sampling is a quick way to collect way to collect and analyze data in range of settings with diverse populations. The approach is also a “cost-efficient means of greatly increasing the sample, thus enabling more frequent measurement.” In some cases, the non-probability sampling may actually be the only approach available—a common constrain in a lot of research, including many medical studies, not to mention Ushahidi Haiti. The method is also used in exploratory research, e.g., for hypothesis generation, especially when attempting to determine whether a problem exists or not.

The point is that non-probability sampling can save lives, many lives. Much of the data used for medical research is the product of convenience sampling. When you see your doctor, or you’re hospitalized, that is not a representative sample. Should the medical field throw away all this data based on the fact that it constitutes non-probability sampling. Of course not, that would be ludicrous.

The notion of bounded crowdsourcing, which I blogged about here, is also a known sampling technique called purposive sampling. This approach involves targeting experts or key informants. Snowball sampling is another type of non-probability sampling, which may also be applied to crowdsource of crisis information.

In snowball sampling, you begin by identifying someone who meets the criteria for inclusion in your study. You then ask them to recommend others who they may know who also meet the criteria. Although this method would hardly lead to representative samples, there are times when it may be the best method available. Snowball sampling is especially useful when you are trying to reach populations that are inaccessible or hard to find.

A project like Mission 4636 and Ushahidi-Haiti could take advantage of this approach by using two-way SMS communication to ask respondents to spread the word. Individuals who sent in text messages about persons trapped under the rubble could (later) be sent an SMS asking them to share the 4636 short code with people who may know of other trapped individuals. When the humanitarian response began to scale during the search and rescue operations, purposive sampling using UN personnel could also have been implemented.

In contrast to non-probability sampling techniques, probability sampling often requires considerable time and extensive resources. Furthermore, non-response effects can easily turn any probability design into non-probability sampling if the “characteristics of non-response are not well understood” since these modify each unit’s probability of being sampled.

This is not to suggest that one approach is better than the other since this depends entirely on the context and research question.