Over the last few years, the study of statistics has taken on new meaning and importance with the emergence of “data science,” an imprecisely defined discipline that merges aspects of statistics with computer science. Google, Amazon, and Facebook are among the most famous companies putting data science to use, looking for trends among their users.

How does Facebook know who your friends might be, or which advertisements you want to watch? Data science. How does Amazon know which books you’re likely to buy, or how much to charge you for various products? Data science. How does Google know which search results to show you? Data science. How does Uber know when to implement surge pricing, and how much to charge? Data science.

A key part of data science is machine learning, in which the computer is trained to identify the factors that might lead to problems. If you have ever tried to make a legitimate credit-card payment, but your card has been denied because it looked suspicious, you can be sure that it didn’t “look” bad to a human. Rather, a machine-learning system, having been trained on millions of previous transactions, did its best to put you into the “good” or “bad” category.

Today, machine learning affects everything from what advertisements we see to translation algorithms to automatic driving systems to the ways in which politicians contact voters. Indeed, the secret weapon of Barack Obama’s two presidential campaigns was apparently his finely tuned data science system, which provided a shockingly accurate picture of which voters might change their minds, and of the best way to persuade them. (A great book on the subject is The Victory Lab, by Sasha Issenberg.)

I’ve been getting both excited and optimistic about the ability of data science to improve our lives. Every week, I hear (often on the Partially Derivative podcast) and read amazing, new stories about how data science helped to solve problems that would otherwise be difficult, time-consuming, or impossible to deal with.

Even before I read her book “Weapons of Math Destruction,” Cathy O’Neil was someone I admired: Her blog, mathbabe.org, has useful insights about the use of math in everyday life, her book “Doing Data Science” is a great introduction to the subject, and she’s a panelist on the “Slate Money” podcast, which I thoroughly enjoy each week.

While O’Neil is easygoing and funny in her writing and speaking, her book is deadly serious. In it, she says that the widespread use of data science is leading to serious problems in society. She blames a number of different things for these failures. In particular, the opacity of many of the algorithms used makes them impossible to understand or evaluate. Their widespread use across huge populations for important decisions, and the frequent inability to find and appeal to a human who could insert some (pardon the term) common sense into the equation, mean that mistakes can have far-reaching effects. Even if the results are good for most of the people most of the time, they can be bad for some of the people (and sometimes even most of the people) quite a bit of the time.

In statistics, you’re always dealing with averages, generalities, and degrees of confidence. When you’re letting a computer make decisions about people’s jobs, health, education, and court cases, you need to err on the safe side. Otherwise, many people could end up having their lives destroyed because they were statistical outliers, or didn’t quite match the profile you intended.

O’Neil points out, early on in the book, that data science involves creating statistical models. Models represent a form of reality, and help us to understand reality, but they aren’t themselves reality. The designer of a model needs to decide which factors to include and exclude. This decision-making process is, as O’Neil points out, riddled with the potential for error. This is particularly true if the thing you’re trying to measure isn’t easily quantified; in such cases, it’s common to use a proxy value.

For example, let’s say that you want to know how happy people are. You can’t directly measure that, so you use a proxy value for it: say, how much money people spend on luxury items. Not only is this a lousy proxy because there are lots of other reasons to buy luxury goods, but it’s likely to show that poor people are never happy. By choosing a bad proxy, you have made the model worthless. Combine a few bad proxy values in one model, unleash that model on a large population, and you’re likely to do real harm.
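The proxy problem can be made concrete with a small simulation. The Python sketch below is only an illustration under invented assumptions, not anything taken from O’Neil’s book: happiness is drawn independently of income, luxury spending is driven by income alone, and the function names and every number are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_proxy(n_people=10_000, seed=0):
    """Return (corr(luxury, happiness), corr(luxury, income)) for a toy population."""
    rng = random.Random(seed)
    happiness, income, luxury = [], [], []
    for _ in range(n_people):
        h = rng.uniform(0, 10)   # ASSUMPTION: happiness is unrelated to income
        inc = rng.gauss(50, 15)  # ASSUMPTION: income, in arbitrary units
        # ASSUMPTION: luxury spending depends on income plus noise, never on happiness
        lux = max(0.0, 0.2 * inc - 5 + rng.gauss(0, 1))
        happiness.append(h)
        income.append(inc)
        luxury.append(lux)
    return pearson(luxury, happiness), pearson(luxury, income)


if __name__ == "__main__":
    r_happy, r_income = simulate_proxy()
    print(f"luxury vs. happiness: {r_happy:+.2f}")
    print(f"luxury vs. income:    {r_income:+.2f}")
```

Under these assumptions, the proxy tracks income closely and happiness hardly at all, so a model built on it would mostly be ranking people by wealth.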

Even if you choose your inputs (direct and proxies) correctly, your model will still likely have mistakes. That’s why it’s crucial to refine and improve the model over time, checking it against real-world data. As O’Neil points out in the book, this is why it makes sense for sports teams to model their players’ techniques; over time, they will analyze lots of players and games, and find out which factors are correlated with winning and losing. But in the case of a classroom teacher’s performance, how many inputs do you have? And how often does a fired teacher’s performance at other schools get factored into the model? Moreover, what if the inputs aren’t reliable? Put all three of these factors together, and you end up with a model that’s effectively random, but that still ends up getting good teachers fired while bad teachers remain.
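Here is a rough sketch of why a handful of noisy inputs can make such a model close to random. It is a toy simulation, not the scoring system O’Neil describes: each teacher is assumed to have a fixed “true effect,” each year’s score is that effect plus the noise of averaging one class of students, and the bottom 10 percent are flagged each year. The spreads, class sizes, and function name are all invented for illustration.

```python
import random


def bottom_decile_overlap(n_teachers=2000, class_size=25, seed=0):
    """Fraction of teachers flagged in year 1 who are flagged again in year 2."""
    rng = random.Random(seed)
    teacher_sd = 1.0   # ASSUMPTION: spread of true teacher effects
    student_sd = 10.0  # ASSUMPTION: spread of per-student noise
    # Noise in a class average shrinks with the square root of the class size,
    # so we draw the averaged noise directly instead of simulating each student.
    noise_sd = student_sd / class_size ** 0.5
    true_effect = [rng.gauss(0, teacher_sd) for _ in range(n_teachers)]

    def flagged():
        # One year of scores: the same true effect, fresh noise.
        scores = [t + rng.gauss(0, noise_sd) for t in true_effect]
        cutoff = sorted(scores)[n_teachers // 10]
        return {i for i, s in enumerate(scores) if s < cutoff}

    year1, year2 = flagged(), flagged()
    return len(year1 & year2) / len(year1)


if __name__ == "__main__":
    print("one class of 25:", bottom_decile_overlap(class_size=25))
    print("2,500 students: ", bottom_decile_overlap(class_size=2500))
```

With these made-up numbers, most of the teachers flagged in one year are not flagged the next, even though nothing about them has changed. Giving the same model far more students per teacher makes the two years agree much more closely, which is the sports-team situation in miniature.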

(I should point out that the software I developed for my PhD dissertation, the Modeling Commons, is a collaborative, Web-based system for modeling with NetLogo. I developed it with the hope and expectation that by sharing models and discussing them, quality and understanding will both improve over time.)

As O’Neil points out, updates to models based on empirical data are rare, often because it is hard or impossible to collect such information. But, she argues, that’s no excuse; if you don’t update a model, it’s basically useless. If you give it a tiny number of inputs, its training is useless. And if your input data has the potential of being fudged, then you’re truly in terrible trouble. Given the choice between no model and a bad model, you’re probably better off with no model.

The thing is, these sorts of poorly designed, never-updated algorithms are playing a larger and larger part in our lives. They’re being used to determine whether people are hired and fired, whether insurance companies accept or reject applications, and how people’s work schedules are determined.

Some of O’Neil’s most damning statements have to do with race, poverty, and discrimination in the United States. By using inappropriate proxies, police departments might reduce crime, but they do so by disproportionately arresting blacks. And indeed, O’Neil isn’t saying that these data science algorithms aren’t efficient. But their efficiency is leading to behavior and outcomes that are bad for many individuals, and also bad for society in the long term.

Sure, the “broken windows” form of policing might bring police to a neighborhood where they’re needed, but it will also result in more arrests in that neighborhood, leading to more residents being in trouble with the law simply because there are police officers in view of the perpetrators. Add to that the fact that many courts give longer sentences to those who are likely to return to a life of crime, and that they measure this likelihood based on the neighborhood in which you were raised, and you can easily see how good intentions lead to a disturbing outcome.
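This feedback loop can be sketched in a few lines of Python. The model below is deliberately crude and entirely hypothetical: two neighborhoods are assumed to have the same true offense rate, recorded arrests are assumed to be proportional to the number of patrols present, and each year a fixed share of patrols is moved toward whichever neighborhood recorded more arrests. None of the numbers or names come from the book.

```python
def simulate_patrols(years=10, initial=(55.0, 45.0), offense_rate=0.1, shift=0.1):
    """Return a list of (arrests_a, arrests_b) tuples, one per simulated year."""
    patrols = list(initial)  # ASSUMPTION: a small initial imbalance in patrols
    history = []
    for _ in range(years):
        # ASSUMPTION: the true offense rate is identical in both neighborhoods,
        # so recorded arrests differ only because of how many officers are watching.
        arrests = [p * offense_rate for p in patrols]
        history.append(tuple(arrests))
        # Hot-spot rule: move a share of patrols from the neighborhood with
        # fewer recorded arrests to the one with more.
        hot = 0 if arrests[0] >= arrests[1] else 1
        cold = 1 - hot
        moved = patrols[cold] * shift
        patrols[hot] += moved
        patrols[cold] -= moved
    return history


if __name__ == "__main__":
    for year, (a, b) in enumerate(simulate_patrols(), start=1):
        print(f"year {year:2d}: neighborhood A {a:5.2f} arrests, B {b:5.2f}")
```

In this toy setup the arrest gap widens every year, even though the underlying behavior in the two neighborhoods is assumed to be identical; the data ends up reflecting where the police were sent rather than where crime occurs.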

Moreover, we’ve gotten to the point at which no one knows or understands how many of these models work. This leads to the absurd situation in which everyone assumes the computer is doing a good job because it’s neutral. But it’s not neutral; it reflects the programmers’ understanding of its various inputs. The fact that no one knows what the models do, and that the public isn’t allowed to look at them, means that we’re being evaluated in ways we don’t even know. And these evaluations are affecting millions of people’s lives.

O’Neil suggests some ways of fixing this problem; conservatives will dislike her suggestions, which include government monitoring of data usage, and stopping organizations from sharing their demographic data. In Europe, for example, she points out that companies not only have to tell you what information they have about you, but are also prohibited from sharing such information with other companies. She also says that data scientists have the potential to do great harm, and even kill people, and that it’s thus high time for data scientists to have a “Hippocratic oath” for data, mirroring the famous oath that doctors take. And the idea that many more of these algorithms should be open to public scrutiny and criticism is a very wise one, even if I believe that it’s unrealistic.

Now, I think that some of O’Neil’s targets don’t deserve her scorn. For example, I continue to think that it’s fascinating and impressive that a modern political party can model a country’s citizens in such detail, and then use that data to decide whom to target, and how. But her point about how US elections now effectively include a handful of areas in a handful of states, because only those are likely to decide the election, did give me pause.

I read a lot, and I try to read things that will impress and inform me. But “Weapons of Math Destruction” is the first book in a while to really shake me up, forcing me to reassess my enthusiasm for the increasingly widespread use of data science. O’Neil convinced me that I fell into the same trap that has lured so many technologists before me: namely, believing that a technology that makes us more efficient, and that can do new things that help so many, doesn’t have a dark side. I’m not a luddite, and neither is O’Neil, but it is crucial that we consider the positive and negative influences of data science, and work to decrease the negative influences as much as possible.

The main takeaway from the book is that we shouldn’t get rid of data science or machine learning. Rather, we should think more seriously about where it can help, what sorts of models we’re building, what inputs and outcomes we’re measuring, whether those measures accurately reflect our goals, and whether we can easily check and improve our models. These are tools, and like all tools, they can be used for good and evil. Moreover, because of the mystique and opacity associated with computers and math, it’s easy for people to be lured into thinking that these models are doing things that they aren’t.

If you’re a programmer or data scientist, then you need to read this book, if only to think more deeply about what you’re doing. If you’re a manager planning to incorporate data science into your organization’s work, then you should read this book, to increase the chances that you’ll end up having a net positive effect. And if you’re a policymaker, then you should read this book, to consider ways in which data science is changing our society, and how you can (and should) ensure that it is a net positive.

In short, you should read this book. Even if you don’t agree with all of it, you’ll undoubtedly find it thought-provoking, and a welcome counterbalance to our all-too-frequent unchecked cheerleading of technological change.