When Not to Trust the Algorithm

WALTER FRICK: Welcome to the HBR IdeaCast from Harvard Business Review. I’m Walter Frick and I’m here in the studio with Cathy O’Neil, data scientist and blogger at mathbabe.org, co-host of the Slate Money podcast, and author of the new book Weapons of Math Destruction. Cathy, thanks for being here.

CATHY O’NEIL: So glad to be here.

WALTER FRICK: There’s a lot of things I like about the book, and I want to talk in a minute about some of the examples that you have that are more managerial. We’ll talk about hiring and employee management. But to start, I was hoping maybe you could just tell us a little bit about how you came to write this book. And specifically the book opens with you as a data scientist on Wall Street. Tell us a little bit about how you got there and then how you went from there to the author of this book.

CATHY O’NEIL: So I’m a mathematician, I’m a math nerd. I’ve been a math nerd ever since playing with spirographs when I was five and figuring out about periodicities, and prime numbers, and stuff like that.

And I’ve always thought of mathematics as a refuge away from the messiness of reality. I remember in high school we were learning about history, and manifest destiny, and slavery. And it always bothered me that you could– if you just had a different political agenda, you could reframe things and it seemed like a completely different thing actually was true or actually happened. And mathematics is nothing like that. Even if you disagree with somebody about every single possible political thing, you’d agree on math, because it was just true, because you had these very clear assumptions and you made very logical steps.

And moreover, it was this really refreshing thing where if in fact you were wrong about something, if you were making a step that you thought was logical but it had a flaw, and someone told you that you were making a mistake, you’re apt to thank them. You’re apt to say, oh, thanks for explaining my mistake, it’s saving me time. And that’s really rare in the world, where you’d actually thank someone for explaining to you why you’re wrong.

And I had this naive artistic perspective on this whole thing, thinking, well, math can really clarify and we can use mathematics as a tool to help the world see things with more clarity and more honestly. And after becoming a professor, I decided that academic– the temple of academia was a little bit too slow for me. And I wanted to go somewhere that was a little faster-paced and where I had more impact.

And living in New York City in 2006, I was like, well, what kind of job can I get if I want to be a businesswoman? And the most obvious job was in finance. So I ended up in finance.

I started in June 2007. And I almost immediately witnessed the breakdown of the entire economic system from the inside, working at D. E. Shaw, which was 20% owned by Lehman Brothers. And I worked with Larry Summers there.

And I was very disillusioned by two years in. I was particularly ashamed of the way that the AAA ratings, which were in some sense a mathematical promise of safety, had been actually just lies, mathematical lies. And they had been in large part one of the reasons that the mortgage-backed security industry had grown so much, because people trusted these, because they trusted mathematics. They had this image, which was absolutely heavily promoted, that mathematicians were busy doing their honest work in the corner of Moody’s, S&P’s, and Fitch and assuring us, the rest of the world, the investors, that this stuff was safe, when in fact there was no reason for them to believe it and they knew that. They were just selling it. So it’s like a weaponized mathematics, which I didn’t want to have any part of.

And I spent a couple years working in risk trying to make the weaponized mathematics better. I was thinking, OK, well, maybe we can fix our risk model. And I even worked on the credit default swap part of the Value at Risk model. And I, for example, noticed that the returns of credit instruments, like credit default swaps, were not normally distributed at all.

But what I realized soon after starting there was that actually people don’t want to know what their actual risk is. People aren’t interested in knowing that their risk is bigger than they have computed so far. That’s when I became ultra-disillusioned, when I realized that yet again mathematics in that situation was being used not to uncover the truth but as sort of a shield so that people could go on doing essentially corrupt things but claim that they had some kind of mathematical stamp of approval, like a rubber stamp.

WALTER FRICK: Yeah.

CATHY O’NEIL: And I was like, I have had enough of this. And I left finance altogether.

WALTER FRICK: But I think there’s so many nice examples in the book where you get at both the ways in which math done by mathematicians or statisticians attempting to do something very specific might fail in that regard. But even, I think, more powerful are all the examples of when these algorithms essentially more or less succeed at what they’re trying to do but that that has a really pernicious effect, because it’s embedded in all of these more problematic structures within organizations and society. So it just really feels like if you’ve been reading about big data, reading our stuff on big data, this seems like it takes some of the pushback from kind of that general level of, well, algorithms just aren’t going to be able to do that or they just shouldn’t do that sort of in principle to really, I think, a sophisticated critique of these are specifically why these systems can get out of whack.

So before we go to some of the hiring examples, can you talk a little bit about how you actually define this weapons of math destruction? You have a couple of key attributes.

CATHY O’NEIL: Yeah, so you’re exactly right. I left finance and entered data science thinking that maybe I could feel better about my life and my nerd contributions, and soon found that similar kinds of weaponizations were happening. And I noticed a pattern. And I wanted to tell the world, because I thought it was relatively invisible to most people, that we’d all drunk the Kool-Aid of big data and we trusted math so much, so deeply, that it was blinding us to the real problem.

So I defined weapons of math destruction to be a certain class of algorithms that I think are deeply troubling. And they’re characterized by three characteristics.

The first is that they’re really important. So I want to focus only on algorithms that affect a lot of people in important ways. So, for example, if they decide whether or not you’re going to get a job, or whether you’re going to get a loan, or how much you’re paying for insurance, or if you go to jail how long you go to jail or prison– something that matters to people, that’s important. So that’s the first characteristic.

The second is that it’s secret. So these are almost always scoring algorithms. But people are not told how their scores are actually created– they don’t know the formula. And they often don’t even know that they’re being scored. They’re often silent, secret scoring systems.

And when you have something that’s secret or silent, then it’s not accountable. These are almost always unaccountable systems where people can’t appeal their scores. They can’t complain. They don’t even know if it’s correct. And that’s a real problem. When you combine those two things, that it’s important and that it’s secret, it’s sometimes actually against the law, it’s unconstitutional in certain cases.

And then finally I care about it if it’s destructive. If we had secret important things that were helping the world, then OK, not perfect. But the problem is that these things aren’t helping, they’re destroying things. And they’re destroying individual lives, often unfairly. But they’re also engendering these larger negative feedback loops, by which I mean destructive positive feedback loops. So they’re making things worse and worse. They often set out to solve a problem, and they often set out to solve it with good intentions, but instead of solving it, they’re making it worse.

WALTER FRICK: So let’s talk about this in the context of hiring and, from the other side, getting a job. So the way that that might have been done prior to these algorithms is you have tons of resumes and applications come in and people sift through them. They use their own rules, or biases, or heuristics to winnow that pile down. There’s some sort of interview process. And now, essentially, algorithms can either cut that pile down in the beginning, or maybe tests or some other things can help actually with that final decision. What are the dangers in that?

CATHY O’NEIL: I want to– before getting into the dangers, I want to just acknowledge that this is a real problem, especially in the age of the internet. You’re going to get 10 times more applications than you used to. So you have to figure out a way to cut down your work. And the question is how do you do that. And people more and more have been replacing their HR with algorithms.

The problem is that there’s lots of really important rules and laws around fair hiring practices. And these algorithms have not been audited for fairness or legality.

So we have reason to believe that some of these algorithms discriminate based on mental health status. This is illegal under the Americans with Disabilities Act. You’re not allowed to give somebody a health exam, including a mental health exam, as part of hiring. But even so, we have a large number of– 66% of job applicants in the US actually have to take a personality test in order to get to the interview. And if some of those personality tests filter out people with mental health problems, then that’s a real problem. So not only does it destroy their chances of getting a job, but it is exactly creating that feedback loop that the ADA is meant to stop, which is actually isolating that group of people from normal society.

WALTER FRICK: Yeah.

CATHY O’NEIL: And moreover, these things are secret. Often people aren’t even told what their results are. They’re not told their score, they’re not told how they’re being scored. And they’re, as I said, very widespread.

WALTER FRICK: Yeah, we can talk later on a little bit about the disparate impact of some of these things. But in that case, we’ve actually published some research about– that I think our listeners would be interested– that essentially a lot of high-level jobs include these things now too. And even executives find themselves up against these personality tests and are on the other side of these somewhat-opaque algorithms.

CATHY O’NEIL: Yeah, so most of the personality tests are for minimum wage jobs. And to be honest, I think most of them are screening for how good a drone are you going to be. But there are lots of newfangled kinds of resume-sorting algorithms. The first generation was keyword searches, but we’ve gone well beyond that, to the point where we’re basically training machine learning algorithms on old application data and we’re training them to a definition of success.

So the way you build an algorithm is you have to have a data set where you’re looking for patterns. But then you have to define what success looks like so that you know what you’re looking for the pattern of.

So I do a thought experiment often with people where I’m imagining that Fox News has a machine learning algorithm to find anchors. And they define success as stay at Fox News for five years and get promoted twice. Now historically speaking, which is how you’re going to train this algorithm, we happen to know now that women were systematically prevented from succeeding at Fox News. So what will happen when you train a machine learning algorithm using that old data is that it’ll recognize this pattern. And when you give it a new set of applicants for a new job as an anchor, it will basically be asking the question, who among these new applicants looks like somebody who was successful in the past. And we have reason to suspect that it would filter out the women.
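The thought experiment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `history` records, the `train` and `score` helpers, the 0.25 cutoff, and all the numbers are invented, and the "model" is nothing more than a per-group historical success rate. A real resume-sorting system would be more complex and would usually pick up gender through correlated features rather than directly.

```python
from collections import defaultdict

# Hypothetical historical records: (gender, succeeded).
# "succeeded" stands in for "stayed five years and was promoted twice".
history = (
    [("M", True)] * 40 + [("M", False)] * 60
    + [("F", True)] * 5 + [("F", False)] * 95
)


def train(records):
    """Learn the historical success rate for each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [successes, total]
    for group, succeeded in records:
        counts[group][0] += int(succeeded)
        counts[group][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}


def score(model, applicant_group):
    """Score a new applicant by how often similar people succeeded before."""
    return model[applicant_group]


model = train(history)

# New applicants are ranked purely on resemblance to past "successes".
applicants = ["M", "F", "F", "M"]
shortlist = [a for a in applicants if score(model, a) >= 0.25]

print(model)
print(shortlist)
```

Nothing in this sketch asks whether the women could do the job. It only replays who was allowed to succeed in the past, so the shortlist inherits that history.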

WALTER FRICK: And there’s an example in the book about this that essentially– I think it’s a hospital– uses a ton of data on previously essentially what have hiring managers decided, thinking, well, we’ve got all this data, let’s put it to use. And in fact it’s just what you’re saying– there are all these biases that have been built in over decades of decisions and that ends up shaping, rather than looking at actually what’s actually a predictor of who’s going to do a good job.

CATHY O’NEIL: Right. And that example came from St George’s medical school in London. And what’s actually really fascinating about that is they were well ahead of their time. They tried to implement their streamlined application process in order to save time. They did that early, maybe in the late ’80s, a couple decades ago.

What’s interesting about that is that they then noticed that it was racist and sexist. I think the remarkable part about this story isn’t that they did it, it’s that they noticed it was flawed. What we have now in our very, very excited moment here of this new technology called big data and big data algorithms is we have the beginnings of all this stuff but no standards of safety set in place so that we can actually check whether it’s flawed or not.

I liken it to a car manufacturer that sends cars out onto the street without tracking whether the wheels fall off of the car and kill the passengers within the first four miles. We just don’t track this stuff.

And often it’s un-trackable. If you think about it, if you filter people out from even getting an interview, you never see them again. There’s no way to see, to learn that you made a mistake on that– that person would have been a good employee– because they’re gone.

WALTER FRICK: Yeah. And it does seem like in this context that there may not be a perfect solution, but there are maybe standards of bad. And so in one case you just are never really looking to follow up on who became a good employee on any measure.

And then I think if you look at some of the more advanced– let’s say Google’s whole people analytics thing– they’re probably still not cracking that problem of they can’t see what would have happened or what that person that they never looked at would do. But they’re at least trying to go back and say, well, did we do a good job relative to who we think is doing a good job in the job right now? And they’ve made some changes to their process, because they realized things that they thought would be good predictors– test scores and that kind of thing– turned out to have no correlation. Is it fair to say that that’s at least a midpoint between the algorithm never even looks at performance and the unattainable what you’d really want, which is actually knowing the outcome of everything?

CATHY O’NEIL: Yeah, it’s really good that they’re at least creating an ecosystem where they learn, their model learns, and they can update their model, which a lot of the examples in my book don’t even have.

And I also want to mention another example– actually also coming from Google, who I don’t think is perfect by any means, by the way– but an example of an improved design based on data. So they also had this example where they had this process by which people would self-promote– they would ask to be considered for promotion. And it sounded like a fair system just theoretically. But then they realized that women were much less likely to ask for this screening for promotion. And they actually changed the design of it.

So this is a great example of something that sounds good on the surface, sounds like an objective fair system, but ends up being actually not at all fair. And the fact that they actually found that out and addressed it is really great. It’s very rare.

WALTER FRICK: Yeah. So input data– if you have bad data in, you’re going to have biased projections out. If your algorithm never has any attempt to learn or you’re not actually looking for ways to improve it–

CATHY O’NEIL: And just to be clear, I’m not just saying improve it by the definition of success that you’re talking about or by the definition of accuracy you’re talking about. I want it to be monitored for fairness and for discrimination as well, which is slightly different.

WALTER FRICK: So that’s a really good point. I guess there is this tension in a lot of the examples– not all of them. So the worst, bad by every measure, would be an algorithm that essentially does badly on the thing it’s trying to optimize and has horrible fairness problems.

CATHY O’NEIL: Right.

WALTER FRICK: And what seems like the most interesting category– and that I think were some of the most interesting examples to me in the book– are ones where if you just look at it in terms of, well, does it seem to be optimizing well on this one metric that we’re looking at? It’s doing pretty well. And in some cases, people can be blown away by how well these algorithms work, which can trigger that idea that, well, it’s so amazing, I can’t– who am I to criticize it? But in fact they’re doing well on this one measure and there are these huge equity problems with it.

CATHY O’NEIL: It might pick out the very men that will be successful. But it’s still not fair.

The example I think about in my book along those lines is recidivism risk algorithms to judge the chance of someone coming back to prison after they leave it. And the problem there is that there’s plenty of reasons that people come back to prison not because they’re necessarily more criminal, but because they’re living in neighborhoods that are much more heavily policed than others. So they’re, in some sense, on the hook, taking blame for their demographics just as much as for their actual behavior. So that’s another thing that happens, is that once we assign scores to people, we often put them on the hook for bad scores and give them credit for good scores, even if one of the real reasons that they’re getting a bad score or a good score is just by dint of their demography.

WALTER FRICK: So let’s talk a little bit about what we can do. And if you’re either– let’s say in this case you’re a manager or a data scientist in a company and you want to actually think about these questions and see that you’re doing at least as good a job as you can. What can we do? And I would love to have you talk about the bit in your book– you talk about a data science ethics class and the idea of how would you do a credit score that didn’t have some of these problems. Can you tell us a bit about that?

CATHY O’NEIL: Yeah. So the example I give in my book is– and I’m hoping that every data science institute, and there are a lot of them springing up all over the country, takes ethics seriously and has a class on ethics. It’s just one example of a homework assignment where you’re saying how do you decide what their risk of default is, like how do you assign credit scores to people if you have a bunch of data. And you could give them data. And then you would ask the question, well, should you take race into account, if you have an attribute– race? And what would happen if you do or don’t use race?

Of course, this is just directly related to anti-discrimination law we have on the books called the Fair Credit Reporting Act that makes it illegal to use race for FICO scores and for other official credit scores. But I want the students to actually think through why that is something that we would use or not use. And then if they get to the point where they’re like, yeah, we shouldn’t use race because essentially it creates this negative feedback loop, then you say, OK, well, OK, let’s not use race, but should we use zip code, which of course is a proxy for race in our segregated society?

And so once they acknowledge that zip code is just as good as race, then you’re like, OK, so how do we choose our attributes? Because there are so many proxies to race. And it’s really actually very tricky. It’s tricky. And I’m not trying to claim that it’s easy. But I do think that as data scientists our job is to solve hard problems and we should take this on as one of them.
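One way to make the proxy question concrete is to check how well the protected attribute can be guessed from the supposedly neutral one. The Python sketch below is a hypothetical illustration, not a method from the book: the `people` records, the group labels "A" and "B", and both helper functions are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical applicants: (zip_code, group). Invented data for illustration.
people = [
    ("10001", "A"), ("10001", "A"), ("10001", "A"), ("10001", "B"),
    ("10002", "B"), ("10002", "B"), ("10002", "B"), ("10002", "A"),
]


def proxy_accuracy(rows):
    """Fraction of people whose group is guessed correctly from zip code alone.

    The guess for each zip code is the most common group living there.
    """
    by_zip = defaultdict(Counter)
    for zip_code, group in rows:
        by_zip[zip_code][group] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_zip.values())
    return correct / len(rows)


def baseline_accuracy(rows):
    """Accuracy of guessing the single most common group for everyone."""
    overall = Counter(group for _, group in rows)
    return overall.most_common(1)[0][1] / len(rows)


print(proxy_accuracy(people))     # 0.75
print(baseline_accuracy(people))  # 0.5
```

The wider the gap between the two numbers, the more a model that uses zip code is effectively using the protected attribute, even though that attribute never appears in its inputs.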

WALTER FRICK: So what else can we do? And what would you like to see happen? You talked a little bit about the idea of an audit or someone coming in, whether it’s a researcher. You can imagine even having to pay a company to come in and do an independent audit. Is that a way to go? Is this a regulatory question? Is it about the individual ethics of data scientists?

CATHY O’NEIL: I don’t think you can rely on the individual ethics of data scientists just simply because they work within a company where their boss basically tells them what to do and how to optimize their algorithm, which is usually on money. I just started a company to audit algorithms, so I do think that there is a possible future in that. And I think that once business leaders realize that they’re really putting themselves at risk for using discriminatory hiring practices via an algorithm, then they’re going to say to themselves, well instead of waiting for the EEOC to come after us, we should actually double-check that this algorithm is legal. Then I think there will be a market for that. And I’m hoping there will be.

I also think that the EEOC does actually have to make that threat. The regulations that already exist around anti-discrimination law, disparate impact, and fair hiring practices have to be enforced in the realm of big data. Right now they’re basically not. And I think it’s a matter of time before they are, but I don’t know how much time. Because honestly they don’t have the tools to do it yet.
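As a rough illustration of what a first-pass disparate impact check can look like, the Python sketch below applies the "four-fifths" rule of thumb from the US federal Uniform Guidelines on Employee Selection Procedures: a group whose selection rate is below 80% of the highest group's rate is flagged for a closer look. The `outcomes` numbers, group names, and function names are hypothetical, and a flag here is a screening signal, not a legal finding or a description of how any particular auditor works.

```python
def selection_rates(outcomes):
    """outcomes maps group -> (number selected, number of applicants)."""
    return {g: selected / total for g, (selected, total) in outcomes.items()}


def adverse_impact_ratios(outcomes):
    """Each group's selection rate divided by the highest group's rate."""
    rates = selection_rates(outcomes)
    top = max(rates.values())
    if top == 0:
        raise ValueError("no group had any applicants selected")
    return {g: rate / top for g, rate in rates.items()}


# Hypothetical screening results from a resume-sorting algorithm.
outcomes = {"group_a": (50, 100), "group_b": (30, 100)}

ratios = adverse_impact_ratios(outcomes)
flagged = [g for g, r in ratios.items() if r < 0.8]  # four-fifths threshold

print(ratios)
print(flagged)
```

A check like this needs only the algorithm's decisions and the applicants' group membership, so it can be run even when the scoring formula itself is secret.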

WALTER FRICK: So you have some examples in the book where you’re willing to say, look, this is an algorithm that just isn’t likely to work, at least in the very short term. So the example you give is rating teachers’ effectiveness. And just say let’s just admit that we’re probably not going to be able to do this in the next few years and let’s focus our data efforts on actually just maybe helping teachers do a better job.

Presumably, there are other cases where some sort of ethical– potentially regulated, potentially audited– set of data systems do a really good job. And you do mention some in the book. How do you leave us in terms of thinking about– if this is a bit of a welcome pushback against all this hype around big data, how much potential is there? I start with the premise that the systems that existed previously were pretty bad and were pretty biased in so many ways– hiring certainly as one of them. Are you broadly– I guess, how do you balance the optimism and pessimism in terms of where this is all going?

CATHY O’NEIL: I’m actually really optimistic. The great thing about algorithms, number one, is that they don’t lie. So when you audit them, if they’re being unfair, they will fess up. The other good news is once you have an algorithm that is fair– it might take a little more work to make that and you have to be much more deliberate about it, but it will actually be better than humans.

What we’re dealing with right now is an industry that just assumes that every product that it creates is automatically perfect. And that’s simply not true. And I wrote the book just to give many examples of truly imperfect and destructive algorithms.

That’s not to say that there’s no potential there. And I think there is a potential there. The recidivism risk algorithm was put in place because judges are racist and we know that. And it was an effort to make the system less racist. What we’re doing right now, which is insufficient, is throwing in these algorithms that are probably themselves racist. But we could make better algorithms, better risk scores, and actually improve the system. It’s up to us to make sure that’s what we’re doing.

WALTER FRICK: Cathy O’Neil, thank you so much for being here.

CATHY O’NEIL: My pleasure. Thanks for having me.

WALTER FRICK: The book is Weapons of Math Destruction. And thank you for listening to the HBR IdeaCast from Harvard Business Review. You can get more ideas on this and other topics at hbr.org.