Sendhil Mullainathan: What Big Data Means For Social Science (HeadCon '13 Part I)

I'm going to talk to you today about a project that I've started in the last year or two. This type of thinking, this type of work, is going to be one of the challenges social science faces in the coming three, four, five, ten years. It's work exclusively with Jon Kleinberg. For those of you who don't know him, Jon is a computer scientist, one of the preeminent computer scientists. He's probably the smart one of the two of us, but I'm the pretty one so it's better that I'm being taped.

This is work that starts with the following observation that lots of people have had, so it will be trite to start with but we just have to live with that. The observation is that data sets are getting bigger, and bigger, and bigger in a fundamental way. As the size of data grows, what does this imply for social science? For how we conduct the business of science?

We've known big data has had big impacts in business, and in lots of prediction tasks. I want to understand, what does big data mean for what we do for science? Specifically, I want to think about the following context: You have a scientist who has a hypothesis that they would like to test, and I want to think about how the testing of that hypothesis might change as data gets bigger and bigger. So that's going to be the rule of the game. Scientists start with a hypothesis and they want to test it; what's going to happen?

Now, you might have heard of the scientific method; that is a thing that some of us use. Some people would argue, as an economist I don't really use it, but, in fact, I do. The heart of the scientific method is what you might call "deduction." You start with this hypothesis, it says variable X should matter, you go to your data set, and you see if, in fact, X does matter. If it doesn't, that's a rejection, if it does, that's an acceptance, and we move forward in that way.

What I want to argue today is that that very basic approach changes. Deduction may not be the best thing to do once data sets get really big. To make that argument, instead of doing science for a bit, I want to do science fiction, which I'm sure, to make another economics joke, is what many of you think economics is. This science fiction story is going to be set in the nineteenth century. It follows a budding medical researcher who is interested in why people are dying in hospitals—in a biological theory of why people are dying.

Somewhat anachronistically, she stumbles first upon a theory—before germ theory, before any of this—a theory of a mind-body connection. This scientist is convinced that what drives people, what drives mortality, is optimism. When people are optimistic, they are more likely to survive. When people are pessimistic, they are more likely to die. That's her theory.

To test her theory, following a good deductive process, she says, "I'm going to turn this optimism theory into a testable hypothesis. What's the hypothesis? Well, if I see my neighbor die, that's got to make me kind of pessimistic." So her hypothesis, or her empirical hypothesis from a theory, is that when your neighbor is sick or when your neighbor dies, you're more likely to die. She would like to test that deduced hypothesis; in fact, she does it well. She finds a hospital where neighbors are randomly assigned. Terrific, now we have causality. And she goes ahead and says, "Okay, now let's see what's in this data." She goes through her data and she finds her theory is accepted. Terrific. In fact, when neighbors die, the patient is more likely to die. Good news for the optimism theory. Of course, hopefully by now, you know that it's good news for some other theory, but to this researcher with her theory, she's kind of accepted it.

What I want to do now is put a pause on that for a little bit and say now this is where the science fiction part comes in. Imagine we live in some steampunk world, where this researcher actually has access to a very large dataset and a computer to use that dataset (even though it's the nineteenth century). She's got many, many patients, and lots of detail about all the characteristics of the patients, all the characteristics of the treatments, everything that's going on. Could she have done something different, and what could she have done? That's the question I want to ask in my little steampunk science fiction story.

To understand what she could have done, I want to take a little pause here and give you a sense of the way in which I think big data has transformed the world of artificial intelligence, and how we can use that insight to transform the way we do social science. Some of this may be very familiar to you, so be a little patient with me. I want to think about the classic problem of artificial intelligence—natural language processing. I remember when I was a computer science undergrad, this just felt impossible, like how on earth are we going to get a computer to understand anything?

To understand just how impossible it is, I want to take, from this big world of natural language processing, a tiny little problem called "word sense disambiguation." What is word sense disambiguation? Take the word "bank." In a sentence, does the word "bank" refer to a riverbank? To the financial institutions, which seem to get in trouble all the time? Or does it refer to bank left, bank right? It can mean many things. How is an algorithm parsing a sentence to determine which one it means? People working in artificial intelligence tried what you and I might naturally try in this situation. They said, "Let's figure out what rules might help disambiguate." So people started writing down the rules, introspecting, thinking about what might work. Very smart people went at this problem, and they made (I think this is probably not a precise number) approximately zero progress. This was a problem. And this is true of all natural language processing. When I left computer science, this felt like an impossible task; maybe in 200 years we would one day have some brilliant insight and crack it. Now I have Siri right here on my phone. What happened?

When you go back and look, what happened was not some brilliant insight sitting deep in my phone; what happened was big data. What I mean by that is, you give an algorithm millions, billions of instances of the word "bank," in which you say, "In this case it means 'riverbank,' in this case it means…" You just give it tons of learning data. You don't think about exactly what rules it's going to use. You just code up lots of features, every feature you can imagine. Throw it in there. Just throw it in. Then you have this algorithm learn the associations that predict "riverbank," and the more data you give it, and the wider you make the dataset, the better this thing gets at predictive accuracy.
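The learn-from-examples approach he describes can be sketched in a few lines of Python. This is a toy illustration, not any production system: the training sentences are invented, the features are just the surrounding words, and real systems use millions of labeled examples and far richer features.

```python
from collections import Counter

# Toy labeled examples of the word "bank" (invented for illustration).
train = [
    ("deposit money in the bank account", "finance"),
    ("the bank raised interest rates", "finance"),
    ("we fished from the river bank", "river"),
    ("the muddy bank of the stream", "river"),
]

# "Throw every feature in": here the features are simply the surrounding words.
counts = {"finance": Counter(), "river": Counter()}
for sentence, sense in train:
    counts[sense].update(w for w in sentence.split() if w != "bank")

def disambiguate(sentence):
    """Pick the sense whose training contexts share the most words."""
    words = set(sentence.split())
    return max(counts, key=lambda s: sum(counts[s][w] for w in words))

print(disambiguate("he opened a bank account"))                # finance
print(disambiguate("reeds grew along the bank of the river"))  # river
```

The point is the one he makes: no hand-written disambiguation rule appears anywhere; more labeled data simply makes the counts, and so the predictions, better.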

For scientists, it's about as annoying as things can get. You're like, "How does this thing work?" "I don't know, but it works. Isn't that great?" So in some sense this use of big data is quite different from what we tend to do. We tend to form specific hypotheses, but what I want to argue is that this machinery can be used for that activity too. So now, to do that, let's go back to our steampunk world and think of what that medical researcher could have done.

She could have done the following thing: She could have said, "Okay, I've got all of this data, and I've got all of these variables. Well, let me go through all the variables and check off the ones that have anything to do with optimism. Okay, roommate health may have something to do with optimism." Let's suppose in this data there's nothing else, but there may be other variables in other datasets, like, how much is the patient smiling? Oh, well, that has to do with optimism. She checks off each of the variables that has something to do with optimism. "The rest," she says, "I don't know what this stuff is. It has nothing to do with me." Okay. Great.

Once she's done that, then what would we do? We'd say, okay, here's the algorithm, let it go to work. What it's going to do is come up with the best predictor it can of patient mortality, and we're going to ask the following question: Does it use the variable that you thought was important or not? Now, this is fundamentally distinct from the deductive test. The deductive test is: I take the variable I thought was important and ask whether it predicts. This test, which we're calling the inductive test (and I know half the people here know what the word "induction" means better than I do, so please don't break my heart and tell me this is not induction), is: instead of looking at whether this variable matters, you say, just figure out all the variables that matter, and ask whether yours is one of them.
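A rough sketch of that inductive test, on data I am inventing for the purpose: build a world where mortality is really driven by hand washing, let a crude algorithm rank every variable by predictive strength, and ask whether the theory's variable comes out on top. A real application would use proper machine learning in place of the simple correlation ranking here.

```python
import random

random.seed(0)

# Synthetic hospital records (invented): mortality is really driven by
# hand washing; roommate deaths correlate with it only indirectly.
n = 2000
washed = [random.random() < 0.5 for _ in range(n)]
roommate_died = [random.random() < (0.2 if w else 0.6) for w in washed]
noise = [random.random() < 0.3 for _ in range(n)]
died = [random.random() < (0.1 if w else 0.7) for w in washed]

def corr(xs, ys):
    """Pearson correlation between two 0/1 lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

features = {
    "roommate_died": roommate_died,    # the variable her theory picked
    "doctor_washed_hands": washed,     # a variable she dismissed as irrelevant
    "unrelated_noise": noise,
}

# The inductive test: rank all variables by predictive strength and see
# whether the hypothesized one survives.
ranked = sorted(features, key=lambda f: -abs(corr(features[f], died)))
print(ranked)
```

In this world roommate_died does predict mortality on its own, so the deductive test passes, but doctor_washed_hands dominates it in the ranking, which is exactly the pattern the story describes.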

Think of what would have happened if she had done this with her particular example of roommate health. She would have said, "Oh, look, roommate health matters by itself, but look at this other variable, which seems to be taking up a lot of the action. Did the doctor wash hands between patients? I don't know what that is, but look, it's starting to soak up the impact of roommate health. And look, this other variable starts coming in: shared scalpel." And so on, and so on, and so on. And here's another one that seems to come in: Was the disease tuberculosis, or was it a buggy accident? Because in this steampunk world they don't have cars yet, they just have buggies; it's a very bleak world.

As we go forward what she'd see is her original variable—roommate health—which she thought was proxying for optimism, is being killed off by these other variables that her theories have nothing to say about, which would make her uncomfortable that she actually had discovered evidence for optimism. She discovered an empirical relationship—roommate health mattered, but remember, theory testing is not empirical relationships, it is testing a theory. In this case, she would have discovered that actually the data is not confirming her theory, as she thought it was.

In fact, if you think of what's happening here, this is almost what the scientific method looks like as it's done by humans, as it unfolds over time and iterations, person by person. Absent the computer and all this data, she would have run her one test, then someone else would have come and said, "Oh, I reran your thing, but I'm noticing this fact." Someone else would have said, "I noticed this fact." And slowly, over time, someone else would have come up with another theory, which you might want to call, say, the germ theory, and then that theory would have overwhelmed hers. With these large datasets, we have the ability to supercharge that process, at least a little bit, by doing the inductive test. So that's the heart of it. The heart of what we're calling the inductive approach, the inductive scientific method, is, just like with deduction, starting with the hypothesis that you'd like to test, but instead of looking just for the hypothesis, letting an algorithm determine what the best predictor is, and then seeing whether your variable is part of it.

There are a few things to note about induction in this setting. The first is that induction is only as powerful as your dataset is rich. If you have very few variables besides the ones your theory points to, induction adds little over deduction. Now, why is that important? That's important because this is the reason this type of approach was never really practical until the last ten years. We collected data, and it is expensive to collect. What do you go out and collect? The stuff that you think matters. That's why deduction is so powerful. But once you collect all kinds of things, then you have the ability to look at all these variables and see what matters, much like in word sense disambiguation. We're no longer defining rules. We're just throwing everything in.

Second, I want you to observe that this is not, absolutely not, about causality. The original researcher's mistake wasn't that she suffered from omitted variable bias—she randomly assigned roommates. It's about the interpretation of the causal relationships we discover. It can also be about causality, but the core issue is not omitted variable bias; I'm trying to tell you about the testing of hypotheses. So that's the approach, and that's where I think there is some room to combine this movement in big data with what we tend to do.

Let me tell you about a practical application of this. We decided to try to test this with one of the old facts in behavioral finance. Behavioral finance is the application of the work Danny and others have done to financial markets. I would say that, historically, behavioral finance has been one of the big reasons why behavioral economics really took off. In a way, if you were thinking of attacking some foreign country: finance is not quite the capital of economics, but once you take that territory it becomes easy to take everything else, because people are like, "There's no way these psychological biases could matter in markets," and, well, they do. And so then you're like, "Oh, we can take the rest of the ground."

One of the early facts in this area was something called the disposition effect, very close to Danny's heart. The disposition effect states that because people dislike realizing losses, when somebody holds a stock that they bought at $10, they're much more likely to sell if that stock is at $11 than if it is at $9, because you just don't want to realize that loss. It's quite intuitive, and it's an interesting application of loss aversion combined with one other assumption.

In fact, one of the beautiful papers in this literature was Terry Odean's. He went and got a very large dataset of traders—about 100,000 traders from a brokerage house—and what he showed, using good deductive science, was: take this large dataset, look at people who are in the gain domain and people who are in the loss domain, and the difference in the proportion of gains realized versus losses realized is huge. Gains were realized at about a 60 percent higher rate than losses. Very good deductive science.

We thought, well, let's go and apply inductive science, because this is a large dataset with lots of features. When you apply machine learning techniques, and you let the algorithm go through this huge set of variables and pick out, one by one, the variables it thinks are important, in fact, there's good support for the disposition effect. This algorithm is rediscovering loss aversion, because it finds this gain variable and says, if I have to use one, this is one of the ones that I would use. It discovers a few others, but it's as if—I'm going to put you out of work soon, Danny—it discovers loss aversion, which is kind of interesting.

But then if you say to the algorithm, go ahead and use all the variables you can to come up with the best predictor, it doesn't care about the disposition effect. It is absolutely uninterested in that effect. The disposition effect, much like roommate assignment, appears to have been merely a proxy for some deeper avenue of behavior that really has nothing to do with disposition. Disposition has approximately zero predictive value when you add it. So we said: here's an algorithm; I hide from it the disposition effect variable, even the gain domain, even anything to do with the purchase price, so it can't possibly know anything about the disposition effect.

And here's another algorithm, to which I give the privileged advantage of my theory, which is: purchase price matters. It turns out the two algorithms do exactly as well. There is no benefit in knowing the purchase price. What we thought was the disposition effect appears to be a proxy for something else. It's a little unsatisfying, but lots of rejections are unsatisfying. One thing that's interesting about the inductive test is that, unlike deduction, where we are just told it doesn't work, induction actually gives us a little bit more; it gives us at least some sense of north, because it tells us: these are the variables I used to kill the variable you care about.
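The two-algorithm comparison can be sketched on synthetic data. Here I build a world, by assumption, in which selling is driven entirely by the price sitting in the top quartile of its recent range, while gain versus purchase price merely correlates with that; the "algorithms" are simple majority-vote predictors, standing in for the real machine learning.

```python
import random
from collections import defaultdict

random.seed(1)

# Synthetic trades (invented): selling depends on the price being in the
# top quartile of its recent range, not on gain vs. purchase price.
n = 5000
rows = []
for _ in range(n):
    top_quartile = random.random() < 0.25
    in_gain = random.random() < (0.7 if top_quartile else 0.3)  # gain proxies quartile
    sold = random.random() < (0.6 if top_quartile else 0.1)
    rows.append((top_quartile, in_gain, sold))

def accuracy(feature_idxs):
    """Majority-vote prediction of `sold` within each feature combination."""
    tally = defaultdict(lambda: [0, 0])
    for row in rows:
        key = tuple(row[i] for i in feature_idxs)
        tally[key][row[2]] += 1
    return sum(max(cell) for cell in tally.values()) / n

acc_without_gain = accuracy([0])   # purchase-price information hidden
acc_with_gain = accuracy([0, 1])   # gain variable added back in
print(acc_without_gain, acc_with_gain)
```

In this world the two accuracies come out essentially identical: knowing the gain adds nothing once the quartile is known, which is the shape of the result being described.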

In this particular case, what killed disposition? Well, two things appear to kill disposition. The algorithm discovers this variable, which we call quartile. What is quartile? Take the price that you're seeing right now: where does it fall in the distribution of prices over, say, the last 180 days? If it's in the top quartile, people are much, much more likely to sell than if it's anywhere else. Now, you can see how that's somewhat correlated with whether you're in the gain domain, but it has nothing to do with your purchase price. It's just where the price is sitting.

The other thing that, surprisingly, seems to matter is the pattern of the last three prices you saw; it makes a big difference. Up, up, up—people are very likely to sell. Interestingly, down, down, down—people are also very likely to sell. So the pattern that really matters, when we put it all together, is: you're in the top quartile and the price goes up, up, up, or you're in the top quartile and the price goes down, down, down. That's where a disproportionate amount of the sales happen. Of course, because gain is weakly correlated with being in that space, gain matters. But once you have that variable, which the algorithm discovers, it's not just that gain no longer matters; it's that this stuff is much, much more predictive.
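The two variables just described could be computed from a price series roughly like this. The exact definitions used in the actual study aren't given in the talk, so treat these as illustrative guesses:

```python
def trade_features(prices, window=180):
    """Two candidate features of a price series:
    - top_quartile: is the latest price in the top quartile of the recent window?
    - last_moves: direction of the last three price changes, e.g. "up,up,up".
    """
    recent = sorted(prices[-window:])
    cutoff = recent[int(0.75 * (len(recent) - 1))]  # 75th-percentile price
    top_quartile = prices[-1] >= cutoff
    moves = ["up" if b > a else "down" for a, b in zip(prices[-4:], prices[-3:])]
    return top_quartile, ",".join(moves)

print(trade_features([10, 11, 9, 12, 13, 14]))  # (True, 'up,up,up')
print(trade_features([14, 13, 12, 11, 10, 9]))  # (False, 'down,down,down')
```

The two high-sell patterns he names are then just the combinations (True, "up,up,up") and (True, "down,down,down").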

Just to close it out, that's the application. What we're thinking about now is that this is, hopefully, a way to combine the principled elements of science—hypothesis, test—with the exploratory, inductive, data-mining aspect of big data, the "stuff just comes up that is hard to interpret" aspect, that the datasets we now have allow.

PIZARRO: The big data approach: you're right about how this is increasingly a method by which we're going to reach discovery. I worry a little bit, though. For one, we've had data before, maybe not big data in the way that we speak of it now, so it's odd that we talk about the scientific method as first formulating a theory, then generating a hypothesis, and then testing it out, because that so often is not how it happens in real life. You can go back to looking at Tycho Brahe's data, right? He was a big data guy. All he did was collect data. Then Kepler comes along, finds patterns in the data, and says, "Hey, there are some rules to this." But it takes Newton to come and actually offer an explanation. This is what I want to ask you about. Explanation is the heart of the scientific method, and I fear that big data yields better predictions about the future, but we lose sight of getting to the more basic general principles that might actually yield predictions in completely different domains.

MULLAINATHAN: I could not agree more with you, and maybe I wasn't clear enough. I absolutely am in no way proposing that we use big data to induct a hypothesis, and so on, and so on. I was still maintaining, I think, the good features of science, where wherever your hypothesis came from, that's the starting point. The goal is very much to use big data to test that hypothesis. Does that make sense?

PIZARRO: Yes. And I wasn't saying that you were ignoring that feature. I really wanted to hear your thoughts on whether the focus on the success of predictions will possibly make us lose sight of the work that we have to do to try to yield more general explanations—that we might abandon the quest for the more basic because the prediction is so powerful that we just keep collecting more and more data, and saying, "Well, I don't care. This captures the most variance, so that's what matters." I feel like we lose some deeper understanding when we're so focused on that. Do you think that this is actually a danger, or do you think that this is not at all?

MULLAINATHAN: It's a great question. I don't think it's a danger, and here's why. I think right now the conversation is there, but having worked a lot with these types of techniques, very quickly, just pragmatically, you run up against it. That is to say, there are domains where it's really clear you can just put in your prediction method and plow forward. But in lots of domains, you very quickly have to get interpretations, for exactly the reason that you said, which is that we're trying to start with big data, but we're trying to move to somewhere where we don't have much data. I think the conversation happens just because this is like a new toy that's appealing in this way, but the issue is pragmatic. I don't even think it's going to require anyone to say this. Pragmatically, you start using it, and you're like, "Oh, wait, I can't," and then all of a sudden it breaks. And that's been my experience.

CHRISTAKIS: I had a narrower reflection, not on your broader point, although I have some thoughts on that, too. On the last issue of predicting when traders might trade, it reminds me a little bit of the challenge physicians face with prognostication. In a way, you're talking about the position, velocity, and acceleration of a particle. When you're trying to predict whether a patient will live or die, initially, people put a lot of credence in the patient's health status right now. So a patient who, on a ten-point scale, was an eight would be predicted to live longer, have better prospects for survival, than a patient whose position was now four.

But, of course, it matters a lot if I told you that the patient who's a four was yesterday a ten, or, vice versa, that the patient who's a four was yesterday a three, and the patient who's an eight was yesterday a ten. Now you might make a different prediction about what's likely to happen. And then, of course, you've got the third moment as well, so you can keep going downstream. So there's an analogy to what you're describing. Eventually, you could even imagine that the velocity is much more important than the position, and dispense with the position altogether.

The other thing that reminded me of what you were describing in that example is the asymmetric loss aversion that physicians have when they make prognoses. Think of the classic "how do you price a piece of real estate" problem—the classic Zellner example. The real estate agent has to pick a price to sell a house; you can't publish on the front of the house a density function for the price you would sell it at. You have to pick a single price. If you pick too high a price, you run the risk that the house doesn't sell, and if you pick too low a price, you run the risk that you left money on the table. So you calibrate, and you pick a point, and you have to balance these two losses: not selling versus leaving money on the table.

The physician faces a similar problem: "I have to make a prediction to you as to how long you're going to live; you're coming to me for treatment. If I overpredict and you die early, then I lose face and I'm embarrassed. If I underpredict how long you're going to live, well, then maybe I look fantastic, or, you know, treatment decisions get made differently." It turns out that one of the deep reasons physicians consistently overestimate survival and miscalibrate is that they feel very differently about the two errors, just as the agent feels differently about selling at a dollar loss and selling at a dollar gain. A one-month overestimate of survival means something very different to them, in terms of their loss, than a one-month underestimate. So, just two analogies to your discussion.

BROCKMAN: How do these ideas manifest in your work in government?

MULLAINATHAN: I've done various things; I've worked with the CFPB and worked at Treasury. But I think this is quite distinct, and it touches back on the central big data question. This really is me trying to struggle with the following fundamental issue: if we maintain the rigid rule that science is a hypothesis test, is there something different about science when data gets very large? To me that's interesting because I had always just presumed, until I started down this path, that when data gets very large, the only thing that changes about science is that we have more power. Great! It's almost like the focus gets sharper, and that's all. We continue what we're doing, but because of the sharper focus, maybe we can look for smaller effects.

In fact, this stuff has convinced me that it's possible that the qualitative nature of science itself changes. I've been talking about it in the realm of social science, because that's what we understand, but I suspect the same ideas can be used in other areas—we're not sure—other areas where we also have large datasets and we're testing scientific hypotheses.

DENNETT: I'm trying to put my finger on what I feel is missing from this new approach, and I haven't got it very well figured out, but it reminds me a bit of credit assignment problems in AI and in debugging, and also in connectionism, where you've got a connectionist model, you train it up, it works, and I'm thinking, why? How? And there are some techniques that can tease out pretty well what's doing the work. I think Sejnowski and Hinton have some, for instance. But I think we need something more here.

I guess what worries me is that we'll come to settle for a big data prediction and just abandon the search for understanding and say, "Well, come on, that's a nineteenth century idea, a twentieth century idea. Who needs formulae? Who needs understanding, when we can just push the button and the algorithm gives us the prediction?" That is, to me, a depressing prospect.

MULLAINATHAN: That's related to your question, so let me tackle it a bit more. Why don't I think we'll settle there? Several people have written about this; Donoho at Stanford has written some very good things about it. There's sort of a misnomer in the word "big" in big data. This may be familiar to all of you, but let me just talk it through. We could break the word "big" into two parts: long data and wide data. What do I mean by that? Long data is the number of data points you have. So if you picture the dataset as sort of like a matrix, written on a piece of paper, length is the length of that dataset. Width is the number of features that you have.

These two kinds of "big" work in exactly opposite directions. That is, long is really, really good. Wide, some of it is bad, and it poses a lot of problems. Why does wide pose a lot of problems? Picture the prediction function working as a search process. The search process finds the combinations of features that work well to predict Y. You can see, with just a little back-of-the-envelope calculation, that the mathematics are such that as the data gets even a little bit wider, this search space is growing exponentially, I mean, just crazy exponentially. As a result, when data gets wider, and wider, and wider, the problem gets harder, and harder, and harder, and algorithms do worse, and worse, and worse. As the data gets longer and longer, algorithms do better and better.
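The back-of-the-envelope calculation is easy to reproduce: an exhaustive search over subsets of features faces 2**width candidates, even before considering interactions, so width hurts exponentially while length only helps linearly.

```python
# Candidate feature subsets an exhaustive search would face: 2 ** width.
for width in (10, 30, 100, 300):
    print(width, 2 ** width)

# At width 300, the subset count already dwarfs the roughly 10**80 atoms
# in the observable universe.
print(2 ** 300 > 10 ** 80)
```

This is the sense in which no amount of computing power or extra data rescues a sufficiently wide search, which is the point made below about atoms in the universe.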

Why am I saying this? Because ultimately, as data gets bigger, it's not reducing the need for "curation." What is curation? Curation is the human element of going in and saying: sure, we've got lots of these variables, but let's pick these few that matter. And I think this is a part that's missing in what we're talking about. Something interesting is happening in the curation process. Deduction is one end of curation: I pick the one feature I thought mattered and I put that in. With big data we can put in more than the one feature we thought mattered, but in many cases we can't throw it all into the mix. In a few cases we can; the dataset is sufficiently narrow that you can throw it all in. But in most cases, you are left with a pretty massive curation problem. And in most machine learning prediction applications that's all swept under the rug. The truth is, some domain expert had to curate the thing; somebody had to decide, because there's just way too much. So there is some interesting and important interaction there.

Notice, my comment is a statement about the inherent mathematics of the problem. This is not something where more computing power is going to solve this. It's not something where more data is going to solve this because this is a double exponential situation where when things get sufficiently wide we're talking about more data than there are atoms in the universe. We're not even close to that.

So this width problem, in blowing things up, is at the heart of why fundamentally we have to have explanations. I'm not saying we have an answer to that, but I think that that's why there is the need for having a hypothesis, testing, and then continuing.

DENNETT: It's like search in chess, too. If you had 7500 first move possibilities…

MULLAINATHAN: That's actually a great example, and it's worth noting the difference between search in chess and this example. In chess, my input is not data; it's just computational power to walk down the tree, because the world is given to me. So the only constraint I have is steps down this tree. In this machine learning world, the length is the limit. I can't just say, "Get me more data!" Well, there are only 6 billion people in the world; I've given you everybody; I can't give you any more. So the length constraint is a very real constraint on the width that we can search through, which is why I'm optimistic that we're not fundamentally going to lose the need for explanation.

KAHNEMAN: Let's start at the other end. Let's stipulate for a moment that loss aversion is real, and that people really hate to realize a loss. Now, that would predict some disposition effect. Not predict the disposition effect. That is, there is a difference, and I think that's extremely interesting. It might not be the best predictor, but it might still be true. So when you say "kill the variable," you might have killed something that, in fact, is valid, and interesting, and important. It is just not the best predictor of when people sell or don't. We know that that can happen. I'm wondering about that. Because then you might actually be losing something through big data, because there is a consistent story. There is a broad story about loss aversion, and it seems trivial that when somebody has a choice between rewarding themselves by selling something that has gain, and punishing themselves by selling at a loss, they're more likely to reward themselves. If that isn't true, if big data can kill that hypothesis, then we're in real trouble.

So here is a hypothesis that somehow must be true, and what you have shown is that it's not the best predictor of people's choices. And because it is not the best predictor, it's not an independent predictor, you have come up with a conclusion, which is a strong conclusion—we've killed that variable. I am not sure you have killed it. I'm not sure you should kill it. That's the question I'm raising.

MULLAINATHAN: I think there are two ways in which we could be wrong in saying we've killed the variable. One is that we focused narrowly on stock sales, as if that were the only thing in the world, and we're trying to figure out whether we said something meaningful about the disposition effect there.

The second way is: look, we don't just care about stock sales, we care about a variety of things. Even if this variable is not the most important predictive variable here, it's possible that it's the third most important predictive variable, but across many, many, many domains, and so as a result, killing it would be foolish. So I think these are two separate elements of it, and let me take them in turn.

The second one, I think, is the easiest to talk about. What I find valuable in the inductive test, at some level, is worth comparing to the deductive test. In that sense, induction will lead to many more of these instances where variables look unimportant, or far less important, in a given context. But I think that's something social science has to come to terms with: our theories can never be so good that they do very, very well in every context. So we have to be amassing evidence across a variety of contexts, and so I would completely agree. If you said to me, "We are going to go and get data on housing," do I believe that this fact that disposition is less important here, or unimportant here, given the other variables, means that disposition wouldn't be important there? I don't know.

I think of this as having supercharged one part of the process, but by no means having in any way supercharged the other part, which is important for social science, which is to look context by context and start to understand. And I would totally believe a world where, in housing, we found that disposition was the second or third most important variable, or was important outright, we just don't know, especially because the quantile is not something you can look at in other places. I'm agreeing, though I will say that this case gave me a moment of pause: when we looked at the variables that matter, in this case the quantile and the price dynamics, that made me feel that when I go to the housing world I would also be curious about those variables. So that gives me guidance, but kill is too strong a word for the second one. I completely agree with that. It's more saying we've learned the signal that this thing wasn't an independent predictor here.

KAHNEMAN: Then I have a small technical question. I assume you did it the way I would have done it, but the disposition effect, as I understand it, is that you take an individual who has a choice: there's a portfolio, there are some winners and some losers in the portfolio, and the question is: Which is he more likely to sell? So is that the way it was set up? That is, if you don't set it up as a choice problem between selling a loser and selling a winner, if you just predict selling in general, you could get something entirely different, and loss aversion would be completely irrelevant.

MULLAINATHAN: No. It was very much set up as: you're given this string. Here's a person, and you're given the string of everything that's happened to them, as well as their purchase price. And so, therefore, you can say, okay, for this person, in this stock, here's everything you know about them, and now you'd like to make a prediction about whether that person is going to sell or not. Just to get a sense: there are many such strings, and so there will be ones where that person is in the loss domain on another stock. And that's the signal that I think the Odean paper correctly got at, because it collapsed all those strings down and compared all the times that you're in gain versus all the times you're in loss. But I don't know if that's what you mean.
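A minimal sketch of that prediction setup, on synthetic data, might look like the following. Everything here is assumed for illustration: the feature names, the data-generating process (selling driven by a quantile-style feature, not by gain versus loss), and the use of simple correlations in place of whatever model was actually fit. It only shows the mechanical sense in which a gain/loss flag can look uninformative next to a stronger predictor.

```python
# Toy version of the setup described above (synthetic data, illustrative names):
# each row is one person-stock observation with a purchase price and a current
# price; the label is whether the stock was sold. Selling is generated to depend
# on a "quantile" feature (where the price sits in its recent range), not on
# whether the position is at a gain, mimicking the result discussed in the talk.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
purchase_price = rng.uniform(10, 100, size=n)
current_price = purchase_price * rng.lognormal(0.0, 0.2, size=n)
in_gain = (current_price > purchase_price).astype(float)  # disposition-style feature
quantile = rng.uniform(size=n)                            # assumed price-dynamics feature
sold = (rng.uniform(size=n) < 0.1 + 0.4 * quantile).astype(float)

corr_gain = abs(np.corrcoef(in_gain, sold)[0, 1])
corr_quantile = abs(np.corrcoef(quantile, sold)[0, 1])
print(corr_gain, corr_quantile)  # quantile dominates in this simulation
```

In real data the interesting question is whether the gain flag stays uninformative once the quantile is in the model, which is the "independent predictor" issue raised above.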

KAHNEMAN: No. My question was whether you restricted your analysis and your prediction to cases in which the portfolio included a winner and a loser, because if you didn't, then your results get a completely different interpretation. The disposition effect is about a choice. The story you were telling us is about a prediction, and you could have a portfolio that is all winners or all losers, and if you allowed those portfolios inside your analysis, you could get your result, and that wouldn't even bear on the disposition effect.

CHRISTAKIS: And more particularly, does the outcome of the other stocks held by the same trader affect the likelihood of selling or buying the index stock? Right? Across stocks, not across individuals. Are you comparing across individuals, whether those at a gain are more likely to sell than those at a loss, or comparing within the individual?

MULLAINATHAN: That's right. We've done a little bit of work on the entire portfolio, though maybe not as much as we'd like. Is the entire portfolio in a gain? Where is the stock relative to the other stocks in the gain? Those never get off the ground, which I think is sort of a mental accounting thing: these things are being individually accounted for. But what we haven't done, and we can do this, it's a great idea, is literally compare within the person and say: here are two stocks that the person themselves held at this time, which one is more likely to be sold? That's a valid experiment.
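The within-person comparison suggested here can be sketched as a toy simulation (the probability and episode count below are made up, purely to show the design): restrict to episodes where the same person holds both a winner and a loser and sells exactly one of them, then measure how often the winner is the one sold. A share above one half is the within-person disposition effect.

```python
# Toy simulation of the paired within-person test: each episode is one trader
# holding exactly one winning and one losing stock and selling exactly one.
# We build in a disposition-style tendency (winner sold with probability 0.6)
# and check that the paired design recovers it.
import numpy as np

rng = np.random.default_rng(2)
n_episodes = 10_000
sold_winner = rng.uniform(size=n_episodes) < 0.6  # True when the winner is sold

share_winner_sold = sold_winner.mean()
# Under no disposition effect this share would hover around 0.5;
# here it sits near the built-in 0.6.
print(round(float(share_winner_sold), 3))
```

A fuller version would also compare this paired estimate to a pooled across-individual comparison, which can be confounded when traders differ in their overall propensity to sell, exactly the distinction Christakis is drawing.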

KAHNEMAN: I mean that's interesting, because of something it tells you about big data and the analysis, that is that when you construct it in the deductive way, when you construct a story about the disposition effect, you really very clearly have a choice in mind, and …

MULLAINATHAN: To be fair, the Odean experiment wasn't the one you had either.

KAHNEMAN: No.

MULLAINATHAN: When we did deduction with Odean, we also did just take all winners and compare them to all losers. So in some sense I take your comment, but I think that's almost a slippage, a mental slippage we all have had around what the prediction is, and maybe this is nice, because it's really forcing us to sharpen exactly what the prediction is.