As a research curiosity, dataset bias has been shown to affect model generalizability: a machine learning algorithm trained on one dataset, and passing with flying colors on a particular collection of test images, may have abysmal performance on a different dataset with different image statistics. You can think about this as the case of only ever seeing faces front and center in an image, and then being tested on off-center faces and realizing you are unexpectedly, but miserably, failing at detecting them. Some real-life examples: "HP investigates claims of racist computer", "Camera misses the mark on racial sensitivity".

A more relevant and pressing example concerns the self-driving car. Ultimately, if trained correctly, it will learn to avoid pedestrians, lamp posts, barriers, and whatever else was meticulously labeled and included in its training set. But will it know to avoid a kangaroo if the most it ever saw in its training data and prior experience is a deer or a cow? It is reasonable to expect an algorithm to be capable of this type of generalization. So even though the car might not be able to accurately determine what is happily hopping across the street, it will guess that it is a bit of a deer and a bit of a cow and since it is upright, maybe a bit of a pedestrian as well... but overall, and most importantly, whatever its identity, this happy creature should not be run over. Speaking of cows...

Let me make an aside here and say that the sole fact that our machine learning algorithms (the ones behind the self-driving cars, autonomous appliances and robots of the future) are increasingly relying on data (are "data-driven") and are increasingly likely to be neural networks (which happen to chug through and learn from large amounts of data very well) is not in itself a reason for concern. I do not buy the argument that we should fear our "black box algorithms" because they are parameter-bloated, connected and intertwined networks that are "hard to understand and harder to interpret". Until quite recently, when asked what computer vision was up to, I would sarcastically answer: "detecting cows on fields... but only patched ones, on green fields, in the center of the image, and only if awake". We wouldn't even be having some of these conversations (in the media) about dataset bias if neural networks weren't performing this well (there are bigger problems at stake if even the cows can't be detected). Having large amounts of data and the architectural machinery to deal with it is precisely what is helping us learn and generalize better.

With that aside, it is nevertheless crucial to start thinking more carefully about dataset bias and model generalization. This thinking should not pit us against data-driven algorithms in any way; rather, we should continue to remind ourselves that it is the great prediction potential of the algorithms that is granting us the opportunity to think about these questions in the first place.

The problem is that this kind of dataset bias is unavoidable - because people are biased and they are the source of the data that we're feeding to our algorithms (and if we can learn anything from the above, it's that people seem to be biased towards putting others down on social media, but then compensating by flooding the net with pictures of cute animals). This means that, unwillingly, we may be imbuing our algorithms with negative qualities, behavior, and biases. We may be perpetuating biases that we would otherwise like to remove from society (see "When algorithms discriminate").

And yet, when we try to actively interfere, we can make the problem worse. More frighteningly, if people know they can influence the data, they may use this for their benefit, and this can have either positive or negative consequences for other members of society (as in this TED Talk on the "Moral bias behind your search results"). We are all responsible for the data that we put on the net (what we upload and what we search), and need to recognize that the biases that are out there are our own.

But if the biases do get out there, should we get rid of them? What would "un-biasing" the data even mean? Who has the right to say that something should or should not appear in a search result? Would a top-down filtering of content, that would change the data everyone sees, even be appropriate? Most parents would disagree with a single, universal parental control for all of the world's children, and this is not much different. Different individuals, cultures, societies have different preferences and norms, different beliefs and taboos.

Which brings me to the importance of diversity in the data that we have. I would not argue for artificially nudging the numbers or tweaking the data to try to eliminate certain kinds of biases, as this can have all sorts of secondary and unintended consequences (as the Twitter example has shown). Instead, the more people who participate in the data that is being harnessed for training algorithms, the better. This naturally adjusts the data balance to more accurately reflect the population that is using it. One solution is just to put more people on the web, and I think we will get there. Another is to bring more humanities folks into the tech loop. It would be great to have the perspective of anthropologists, sociologists, historians, policy and law makers, and psychologists for insights on cultural sensitivities, historical trends, crowd mentality, virality, societal pressures, etc. so that we can have better expectations of what the data may bring before it's on our plates and we have to deal with the consequences. In this case, the suggested approach is to use this knowledge to adjust the data-collection procedures themselves rather than the data after the fact.

If you were conducting a survey, a change in wording but not meaning might drastically change who would respond to it. How different cultures look at concepts like success, individuality, and norms also differs crucially, and affects how and what they communicate about these topics. Take this simple example: say you collect perceptions about an exam from a group of schoolchildren. You get two answers: "it was easy! [secondary reason: I passed]", "it was ok [I think I got only 98%... where did the 2% go?]". Without knowing the context of the cultures, societies, or families from which these two responses came, you would have a very biased dataset (I'm reminded of Malcolm Gladwell's book Outliers; or this talk). And this extends beyond surveys. The behaviors you elicit (and end up collecting) from a group of users can depend crucially on how information, a task, or a UI is presented. Psychologists and sociologists know this very well. But they are also less likely (currently) to be the ones collecting the large datasets that modern-day computational algorithms are trained on.

Questions of labeling are key. What do you call that thing? If you give it one label over another, a different set of properties or attributes might be retrieved. Consider an example: the labeling of street scenes. Here's a pedestrian, and another one. Here's a bicyclist. What about that person in a wheelchair? Is that a pedestrian or transportation device? How many body parts must be visible and moving for a pedestrian to be labeled as such? This labeling might affect how an algorithm analyzing the scene might predict the future movements of the participants and objects in the scene. This, in turn, might affect the decisions the algorithm (read: autonomous vehicle) makes.

Every labeler is biased. Biased by their culture, their society, and their experience. Instead of attempting to unbias the labels, we should introduce even more biased labelers... to compensate (and please, let's throw something intellectual up on the net, at least once for every 100 cats...). We should increase the diversity of the bias until, on average, we get something reasonable. Ensembles work. That's the wisdom of crowds.
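To make the "diversity of bias" point concrete, here is a toy sketch of my own, not a real labeling pipeline. Everything in it is an assumption made up for illustration: the true value of 5.0, the fixed per-labeler biases, and the helper names `aggregate_scores` and `crowd_error`. It also shows the catch: averaging only helps when the biases point in different directions. A crowd that all leans the same way stays just as wrong, however large it gets.

```python
from statistics import mean

# Hypothetical ground truth that every labeler is trying to rate.
TRUE_VALUE = 5.0

# Invented per-labeler biases (rating = truth + bias).
homogeneous_biases = [1.0, 1.2, 0.8, 1.1]             # everyone leans the same way
diverse_biases = [1.0, -1.2, 0.8, -0.9, 1.1, -0.8]    # leanings point in different directions


def aggregate_scores(scores):
    """Combine individual ratings by simple averaging (the 'ensemble')."""
    return mean(scores)


def crowd_error(biases):
    """Absolute error of the averaged rating for a crowd with the given biases."""
    ratings = [TRUE_VALUE + b for b in biases]
    return abs(aggregate_scores(ratings) - TRUE_VALUE)


if __name__ == "__main__":
    print("homogeneous crowd error:", round(crowd_error(homogeneous_biases), 3))
    print("diverse crowd error:    ", round(crowd_error(diverse_biases), 3))
```

In this made-up setup the diverse crowd lands almost exactly on the true value, while the like-minded crowd keeps its shared lean of about one point.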

And what do we do in the meantime while we wait for the whole world's wisdom to accumulate on the net? We think harder about our data collection strategies and the tasks used; we spend more time debugging and visualizing the algorithms and the trends they pick up on; we consider how to present, display, and use the data; we brainstorm ways to annotate and make explicit whether certain labels, tags, or content are more likely to be controversial or subjective; and we treat predictions in this space with greater care, and importantly, less confidence. Just as we tend to dislike the individuals with the greatest bias but highest confidence, let's not fill our digital world with these personalities.
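As one possible way to "make explicit" which labels are contested, here is a minimal sketch that keeps the disagreement between annotators instead of throwing it away. The function names (`label_entropy`, `summarize`), the 0.9 entropy threshold, and the example labels are all my own assumptions for illustration; a real project would need to choose its own measure and cut-off.

```python
from collections import Counter
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of the labels given to a single item."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def summarize(labels, threshold=0.9):
    """Return the majority label plus how much the annotators agreed on it.

    'contested' is True when the entropy reaches the (arbitrary) threshold,
    a hint that downstream predictions deserve less confidence.
    """
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    return {
        "label": top_label,
        "agreement": top_count / len(labels),
        "contested": label_entropy(labels) >= threshold,
    }


if __name__ == "__main__":
    # Hypothetical annotations for two objects in a street scene.
    print(summarize(["pedestrian"] * 5))
    print(summarize(["pedestrian", "pedestrian", "vehicle", "vehicle", "pedestrian"]))
```

The point of the design is that the output carries the majority label together with its agreement rate, so whatever consumes it (a model, a UI, a person) can see that the person in a wheelchair was a 3-to-2 call and not a unanimous one.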