A widespread misconception is that biases explain or even produce behavior. They don’t—they describe behavior. The endowment effect does not cause people to demand more for a mug they received than a mug-less counterpart is prepared to pay for one. It is not because of the sunk cost fallacy that we hang on to a course of action we’ve invested a lot in already. Biases, fallacies, and so on are no more than labels for a particular type of observed behavior, often in a peculiar context, that contradicts traditional economics’ simplified view of behavior.

[S]aying that the endowment effect is caused by Loss Aversion, as a function of Prospect Theory, is like saying that human sexual behavior is caused by Abstinence Aversion, as a function of Lust Theory. The latter provides no intellectual or analytic purchase, none, on why sexual behavior exists. Similarly, Prospect Theory and Loss Aversion – as valuable as they may be in describing the endowment effect phenomena and their interrelationship to one another – provide no intellectual or analytic purchase, none at all, on why the endowment effect exists. …

[Y]ou can’t provide a satisfying causal explanation for a behavior by merely positing that it is caused by some psychological force that operates to cause it. That’s like saying that the orbits of planets around the sun are caused by the “orbit-causing force.” …

[L]oss aversion rests on no theoretical foundation. Nothing in it explains why, when people behave irrationally with respect to exchanges, they would deviate in a pattern, rather than randomly. Nor does it explain why, if any pattern emerges, it should have been loss aversion rather than gain aversion. Were those two outcomes equally likely? If not, why not?

We used to go to the Catskill Mountains, a place where people from New York City would go in the summer. The fathers would all return to New York to work during the week, and come back only for the weekend. On weekends, my father would take me for walks in the woods and he’d tell me about interesting things that were going on in the woods. When the other mothers saw this, they thought it was wonderful and that the other fathers should take their sons for walks. They tried to work on them but they didn’t get anywhere at first. They wanted my father to take all the kids, but he didn’t want to because he had a special relationship with me. So it ended up that the other fathers had to take their children for walks the next weekend.

The next Monday, when the fathers were all back at work, we kids were playing in a field. One kid says to me, “See that bird? What kind of bird is that?”

I said, “I haven’t the slightest idea what kind of a bird it is.”

He says, “It’s a brown-throated thrush. Your father doesn’t teach you anything!”

But it was the opposite. He had already taught me: “See that bird?” he says. “It’s a Spencer’s warbler.” (I knew he didn’t know the real name.) “Well, in Italian, it’s a Chutto Lapittida. In Portuguese, it’s a Bom da Peida. In Chinese, it’s a Chung-long-tah, and in Japanese, it’s a Katano Tekeda. You can know the name of that bird in all the languages of the world, but when you’re finished, you’ll know absolutely nothing whatever about the bird. You’ll only know about humans in different places, and what they call the bird. So let’s look at the bird and see what it’s doing—that’s what counts.” (I learned very early the difference between knowing the name of something and knowing something.)

Knowing the name of a “bias” such as loss aversion isn’t zero knowledge – at least you know it exists. But knowing something exists is a very shallow understanding.

And back to Koen Smets:

Learning the names of musical notes and of the various signs on a staff doesn’t mean you’re capable of composing a symphony. Likewise, learning a concise definition of a selection of cognitive effects, or having a diagram that lists them on your wall, does not magically give you the ability to analyze and diagnose a particular behavioral issue or to formulate and implement an effective intervention.

Behavioral economics is not magic: it’s rare for a single, simple nudge to have the full desired effect. And being able to recite the definitions of cognitive effects does not magically turn a person into a competent behavioral practitioner either. When it comes to understanding and influencing human behavior, there is no substitute for experience and deep knowledge. Nor, perhaps even more importantly, is there a substitute for intellectual rigor, humility, and a healthy appreciation of complexity and nuance.

Michael Mauboussin’s Think Twice: Harnessing the Power of Counterintuition

https://jasoncollins.blog/2018/08/01/michael-mauboussins-think-twice-harnessing-the-power-of-counterintuition/

Wed, 01 Aug 2018 09:00:05 +0000

Michael Mauboussin’s Think Twice: Harnessing the Power of Counterintuition is a multi-disciplinary book on how to improve your decision making. Framed around eight common decision-making mistakes, Mauboussin draws on disciplines including psychology, complexity theory and statistics.

Given the scope of the book, it does not reach great depth for most of its subject areas. But the interdisciplinary nature of the book means that most people are likely to find something new. I gained pointers to a lot of interesting reading, plus some new ways of thinking about familiar material. Below are a few interesting parts.

One early chapter contrasts the inside and outside views when making a judgement or prediction, a perspective I have often found helpful. The inside view uses the specific information about the problem at hand. The outside view looks at whether there are similar situations – a reference class – that can provide a statistical basis for the judgement. The simplest statistical basis is the “base rate” for that event – the probability of it generally occurring. The outside view, even a simple base rate, is typically a better indicator of the outcome than an estimate derived from the inside view.

Mauboussin points out that ignorance of the outside view is not the sole obstacle to its use. People will often ignore base rate information even when it is right in front of them. Mauboussin discusses an experiment by Freymuth and Ronan where the experimental participants selected a treatment for a fictitious disease. When the participants were able to choose a treatment with a 90% success rate that was paired with a positive anecdote, they chose it 90% of the time (choosing a control treatment with 50% efficacy the remaining 10% of the time). But when paired with a negative anecdote, only 39% chose the 90% efficacy treatment. Similarly, a treatment with 30% efficacy paired with a negative anecdote was chosen only 7% of the time, but this increased to 78% when it was paired with a positive anecdote. The stories drowned out the base rate information.

To elicit an outside view, Mauboussin suggests the simple trick of pretending you are predicting for someone else. Think about how the event will turn out for others. This will abstract you from the distracting inside view information and bring you closer to the more reliable outside view.

Mauboussin is at his most interesting, and differs from most standard examinations of decision making, when he considers decision making in complex systems (which happens to be the environment of many of our decisions).

One of his themes is that it is nearly impossible to manage a complex system. Understanding any individual part may be of limited use in understanding the whole, and interfering with that part may have many unintended consequences. The century of bungling in Yellowstone National Park (via Alston Chase’s book Playing God in Yellowstone) provides an example. In an increasingly connected world, more of our decisions are going to be in these types of systems.

One barrier to understanding a complex system is that the agents in an apparently intelligent system may not be that intelligent themselves. Mauboussin quotes biologist Deborah Gordon:

If you watch an ant try to accomplish something, you’ll be impressed by how inept it is. Ants aren’t smart, ant colonies are.

Complex systems often perform well at a system level despite the dumb agents. No single ant understands what the colony is doing, yet the colony does well.

Mauboussin turns this point into a critique of behavioural finance, suggesting it is a mistake to look at individuals rather than the market:

Regrettably, this mistake also shows up in behavioral finance, a field that considers the role of psychology in economic decision making. Behavioral finance enthusiasts believe that since individuals are irrational—counter to classical economic theory—and markets are made up of individuals, then markets must be irrational. This is like saying, “We have studied ants and can show that they are bumbling and inept. Therefore, we can reason that ant colonies are bumbling and inept.” But that conclusion doesn’t hold if more is different—and it is. Market irrationality does not follow from individual irrationality. You and I both might be irrationally overconfident, for example, but if you are an overconfident buyer and I am an overconfident seller, our biases may cancel out. In dealing with systems, the collective behavior matters more. You must carefully consider the unit of analysis to make a proper decision.
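The "biases may cancel out" point can be made concrete with a toy simulation. This is only a sketch under assumptions of my own, not anything from Mauboussin: the function `market_errors`, the trader count, the true value of 100 and the size of the errors are all hypothetical. It compares how wrong a typical individual is with how wrong the average estimate is, first when errors are independent and then when everyone shares a bias in the same direction.

```python
import random
import statistics

def market_errors(n_traders=1000, true_value=100.0, bias_sd=20.0,
                  shared_bias=0.0, seed=0):
    """Return (typical individual error, error of the average estimate).

    Hypothetical setup: each trader misjudges true_value by a personal
    error drawn independently from a symmetric distribution, plus an
    optional shared_bias that pushes everyone the same way.
    """
    rng = random.Random(seed)
    estimates = [true_value + shared_bias + rng.gauss(0, bias_sd)
                 for _ in range(n_traders)]
    # Average absolute miss of a single trader.
    individual = statistics.fmean(abs(e - true_value) for e in estimates)
    # Miss of the group's average estimate.
    collective = abs(statistics.fmean(estimates) - true_value)
    return individual, collective

individual, collective = market_errors()
individual_shared, collective_shared = market_errors(shared_bias=10.0)
print(f"independent biases: individual {individual:.1f}, collective {collective:.1f}")
print(f"shared bias:        individual {individual_shared:.1f}, collective {collective_shared:.1f}")
```

With independent errors the collective miss is a small fraction of the typical individual miss. The shared-bias run is the caveat: cancellation depends on the biases pointing in different directions, which is why the quote says they "may" cancel.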

Mauboussin’s discussion of the often misunderstood concept of reversion (regression) to the mean is also useful. Here are some snippets:

“Mediocrity tends to prevail in the conduct of competitive business,” wrote Horace Secrist, an economist at Northwestern University, in his 1933 book, The Triumph of Mediocrity in Business. With that stroke of the pen, Secrist became a lasting example of the second mistake associated with reversion to the mean—a misinterpretation of what the data says. Secrist’s book is truly impressive. Its four hundred-plus pages show mean-reversion in series after series in an apparent affirmation of the tendency toward mediocrity.

…

In contrast to Secrist’s suggestion, there is no tendency for all companies to migrate toward the average or for the variance to shrink. Indeed, a different but equally valid presentation of the data shows a “movement away from mediocrity and [toward] increasing variation.” A more accurate view of the data is that over time, luck reshuffles the same companies and places them in different spots on the distribution. Naturally, companies that had enjoyed extreme good or bad luck will likely revert to the mean, but the overall system looks very similar through time. …

A counterintuitive implication of mean reversion is that you get the same result whether you run the data forward or backward. So the parents of tall children tend to be tall, but not as tall as their children. Companies with high returns today had high returns in the past, but not as high as the present. …

Here’s how to think about it. Say results are part persistent skill and part transitory luck. Extreme results in any given period, reflecting really good or bad luck, will tend to be less extreme either before or after that period as the contribution of luck is less significant. …

On this last point, a simple test of whether your activity involves skill is whether you can lose on purpose. For example, try to build a stock portfolio that will do worse than the benchmark.
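The skill-plus-luck picture in the snippets above is easy to check with a small simulation. This is a sketch under my own assumptions rather than Mauboussin's data: results are modelled as a persistent skill term plus fresh luck each period, both drawn from a standard normal, and the names `simulate` and `top_group_means` are made up for the illustration. It selects the top 10% in one period and looks at the same firms in the other period, running the comparison both forward and backward, and it also compares the overall spread of results in the two periods.

```python
import random
import statistics

def simulate(n_firms=20000, seed=0):
    """Two periods of results; each result = persistent skill + fresh luck."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_firms)]
    period1 = [s + rng.gauss(0, 1) for s in skill]
    period2 = [s + rng.gauss(0, 1) for s in skill]
    return period1, period2

def top_group_means(selected_on, other, top_fraction=0.1):
    """Pick the top performers in one period; average them in both periods."""
    k = int(len(selected_on) * top_fraction)
    order = sorted(range(len(selected_on)),
                   key=lambda i: selected_on[i], reverse=True)
    top = order[:k]
    return (statistics.fmean(selected_on[i] for i in top),
            statistics.fmean(other[i] for i in top))

p1, p2 = simulate()
fwd_now, fwd_later = top_group_means(p1, p2)    # select on period 1, look forward
bwd_now, bwd_before = top_group_means(p2, p1)   # select on period 2, look backward
spread1, spread2 = statistics.pstdev(p1), statistics.pstdev(p2)

print(f"forward:  top group {fwd_now:.2f} -> {fwd_later:.2f}")
print(f"backward: top group {bwd_now:.2f} <- {bwd_before:.2f}")
print(f"spread of all firms: {spread1:.2f} vs {spread2:.2f}")
```

The top group is less extreme in the other period whichever direction you run it, yet the spread across all firms is essentially unchanged. That is the distinction Secrist missed: extreme performers revert, but the distribution does not collapse toward mediocrity.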

Mauboussin links reversion to the mean to the “halo effect” (I recommend reading Phil Rosenzweig’s book of that name). The halo effect is the tendency of impressions from one area to influence impressions of another. In business, if people see a company with good profits, they will tend to assess the CEO’s management style, communications, organisational structure and strategic direction as all being positive.

When the company’s performance later reverts to the mean, people then interpret all of these things as going bad, when it is quite possible nothing has changed. The result is that great results tend to be followed by glowing stories in the media followed by the fall:

Tom Arnold, John Earl, and David North, finance professors at the University of Richmond, reviewed the cover stories that BusinessWeek, Forbes, and Fortune had published over a period of twenty years. They categorized the articles about companies from most bullish to most bearish. Their analysis revealed that in the two years before the cover stories were published, the stocks of the companies featured in the bullish articles had generated abnormal positive returns of more than 42 percentage points, while companies in the bearish articles underperformed by nearly 35 percentage points, consistent with what you would expect. But for the two years following the articles, the stocks of the companies that the magazines criticized outperformed the companies they praised by a margin of nearly three to one.

And to close, Mauboussin provides a great example of bureaucratic kludge preventing the use of a checklist in medical treatment:

Toward the end of 2007, a federal agency called the Office for Human Research Protections charged that the Michigan program violated federal regulations. Its baffling rationale was that the checklist represented an alteration in medical care similar to an experimental drug and should continue only with federal monitoring and the explicit written approval of the patient. While the agency eventually allowed the work to continue, concerns about federal regulations needlessly delayed the program’s progress elsewhere in the United States. Bureaucratic inertia triumphed over a better approach.

Robert Sapolsky’s Why Zebras Don’t Get Ulcers

https://jasoncollins.blog/2018/07/25/robert-sapolskys-why-zebras-dont-get-ulcers/

Wed, 25 Jul 2018 09:00:48 +0000

Before tackling Robert Sapolsky’s new book Behave: The Biology of Humans at Our Best and Worst, I decided to read Sapolsky’s earlier, well-regarded book Why Zebras Don’t Get Ulcers. I have been a fan of Sapolsky’s for some time, largely through his appearances on various podcasts. (This discussion with Sam Harris is excellent.)

Why Zebras Don’t Get Ulcers is a wonderful book. Sapolsky is a great writer, and the science is interesting. That Sapolsky did not sugarcoat the introduction to every chapter with a cute story, as seems to be a common formula today, made the book a pleasant contrast to a lot of my recent reading.

The core theme of the book is that chronic stress is bad for your health. It can lead to cardiovascular disease, destroy your sleep, age you faster, and so on. The one positive (relative to common beliefs) is that stress probably doesn’t cause cancer (with the possible exception of colon cancer).

The story linking stress with these health problems largely revolves around the hormones that trigger the stress response. I’ll give a quick synopsis of this story, as it helps give context to some of the snippets below.

When the stressor first arises, CRH (corticotropin releasing hormone) is released from the hypothalamus in the brain. CRH helps to turn on the sympathetic nervous system, with the nerve endings of the sympathetic nervous system releasing adrenaline (called epinephrine through the book). This all leads to increased heart rate, vigilance and arousal. It triggers the cessation of many bodily functions, such as digestion, repair and reproductive processes, and suppresses immunity, mobilising the body’s resources to solve the stressor at hand.

Fast forward 15 seconds, and the CRH has triggered the pituitary at the base of the brain to release ACTH (also known as corticotropin). A few minutes later the ACTH in turn triggers the release of glucocorticoids by the adrenal gland. The glucocorticoids increase the stress response, further arousing the sympathetic nervous system and raising circulating glucose. The glucocorticoids are also involved in recovery and the preparation for the next stressor. For instance, they stimulate appetite.

Many of the costs of stress arise through the actions of these hormones when the stress is intermittent or chronic. CRH is cleared from the body a couple of minutes after the end of the stressor. It can take hours for glucocorticoids to be cleared. Continued intermittent or chronic stressors result in permanently elevated glucocorticoid levels, subjecting the body to a stress response without pause. For instance, the stress response makes the heart work harder. If you are in chronic stress, this increased work effort is constant, leading to high blood pressure and wearing out your blood vessels.

There is a raft of other hormones and processes involved in the stress response, each with its own roles, costs and benefits, but this basic picture, particularly the cost of ongoing high levels of glucocorticoids, forms the book’s central thread.

Although this sounds like a somewhat mechanical process, an important theme in the book is that the cost of stress is not just a mechanical equation, whereby stress causes a bodily response with various costs. The book balances a reductive view of biology, in which you can trace everything back to physical factors such as bacteria, viruses, genes, hormones and so on, with another view that is more psychologically grounded. In that latter view, stress can be purely psychological, affected by someone’s sense of control and so on.

The one part of the book that I found mildly unsatisfying was the chapter on the link between stress, poverty and health. Naturally, poverty and poor health are closely linked, with poverty associated with greater stress. Sapolsky asks about the direction of causality: does poverty harm health, or does poor health lead to poverty? But, as in some other chapters, Sapolsky does not delve deeply into whether there might be other causal factors. I felt that the chapter deserves a book of its own.

More generally, I don’t have the subject expertise to critique the book, but I highlighted a lot of interesting passages. Below is a selection.

On sex differences in stress response:

Taylor argues convincingly that the physiology of the stress-response can be quite different in females, built around the fact that in most species, females are typically less aggressive than males, and that having dependent young often precludes the option of flight. Showing that she can match the good old boys at coming up with a snappy sound bite, Taylor suggests that rather than the female stress-response being about fight-or-flight, it’s about “tend and befriend”—taking care of her young and seeking social affiliation.

…

A few critics of Taylor’s influential work have pointed out that sometimes the stress-response in females can be about fight-or-flight, rather than affiliation. For example, females are certainly capable of being wildly aggressive (often in the context of protecting their young), and often sprint for their lives or for a meal (among lions, for example, females do most of the hunting). Moreover, sometimes the stress-response in males can be about affiliation rather than fight-or-flight. This can take the form of creating affiliative coalitions with other males or, in those rare monogamous species (in which males typically do a fair amount of the child care), some of the same tending and befriending behaviors as seen among females. Nevertheless, amid these criticisms, there is a widespread acceptance of the idea that the body does not respond to stress merely by preparing for aggression or escape, and that there are important gender differences in the physiology and psychology of stress.

On stress making us both eat more and less:

The official numbers are that stress makes about two-thirds of people hyperphagic (eating more) and the rest hypophagic. Weirdly, when you stress lab rats, you get the same confusing picture, where some become hyperphagic, others hypophagic. So we can conclude with scientific certainty that stress can alter appetite. Which doesn’t teach us a whole lot, since it doesn’t tell us whether there’s an increase or decrease. …

The confusing issue is that one of the critical hormones of the stress-response stimulates appetite, while another inhibits it. … CRH inhibits appetite, glucocorticoids do the opposite. Yet they are both hormones secreted during stress. Timing turns out to be critical. …

Suppose that something truly stressful occurs, and a maximal signal to secrete CRH, ACTH, and glucocorticoids is initiated. If the stressor ends after, say, ten minutes, there will cumulatively be perhaps a twelve-minute burst of CRH exposure (ten minutes during the stressor, plus the seconds it takes to clear the CRH afterward) and a two-hour burst of exposure to glucocorticoids (the roughly eight minutes of secretion during the stressor plus the much longer time to clear the glucocorticoids). So the period where glucocorticoid levels are high and those of CRH are low is much longer than the period of CRH levels being high. A situation that winds up stimulating appetite. In contrast, suppose the stressor lasts for days, nonstop. In other words, days of elevated CRH and glucocorticoids, followed by a few hours of high glucocorticoids and low CRH, as the system recovers. The sort of setting where the most likely outcome is suppression of appetite. The type of stressor is key to whether the net result is hyper- or hypophagia. …

Take some crazed, maze-running rat of a human. He sleeps through the alarm clock first thing in the morning, total panic. Calms down when it looks like the commute isn’t so bad today, maybe he won’t be late for work after all. Gets panicked all over again when the commute then turns awful. Calms down at work when it looks like the boss is away for the day and she didn’t notice he was late. Panics all over again when it becomes clear the boss is there and did notice. So it goes throughout the day. … What this first person is actually experiencing is frequent intermittent stressors. And what’s going on hormonally in that scenario? Frequent bursts of CRH release throughout the day. As a result of the slow speed at which glucocorticoids are cleared from the circulation, elevated glucocorticoid levels are close to nonstop. Guess who’s going to be scarfing up Krispy Kremes all day at work?

So a big reason why most of us become hyperphagic during stress is our westernized human capacity to have intermittent psychological stressors throughout the day.
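The timing argument in the passages above is really a piece of interval arithmetic, so here is a rough sketch of it. The clearance times are my own readings of the quoted figures, not physiological constants: CRH is assumed to clear 2 minutes after a stressor ends, and glucocorticoids are assumed to start 2 minutes in and linger 112 minutes afterwards, so that a ten-minute stressor gives about twelve minutes of CRH and two hours of glucocorticoids. The function names and the three example schedules are hypothetical.

```python
# Assumed timings in minutes, chosen to reproduce the figures in the quote.
CRH_CLEAR_MIN = 2     # CRH lingers briefly after the stressor ends
GC_LAG_MIN = 2        # glucocorticoid secretion starts a little after onset
GC_CLEAR_MIN = 112    # glucocorticoids take far longer to clear

def elevated_minutes(stressors, lag, clearance):
    """Set of whole minutes during which a hormone is elevated."""
    minutes = set()
    for start, end in stressors:
        minutes.update(range(start + lag, end + clearance))
    return minutes

def exposure(stressors):
    """Minutes of high CRH, and minutes of high glucocorticoids with low CRH."""
    crh = elevated_minutes(stressors, 0, CRH_CLEAR_MIN)
    gc = elevated_minutes(stressors, GC_LAG_MIN, GC_CLEAR_MIN)
    return {"crh_high": len(crh), "gc_high_crh_low": len(gc - crh)}

single = exposure([(0, 10)])                     # one ten-minute stressor
chronic = exposure([(0, 3 * 24 * 60)])           # three days, nonstop
intermittent = exposure([(t, t + 10) for t in range(0, 450, 90)])  # five bursts

for name, result in [("single", single), ("chronic", chronic),
                     ("intermittent", intermittent)]:
    print(name, result)
```

Under these assumptions the single stressor gives 12 minutes of high CRH against 110 minutes of glucocorticoids alone, the chronic stressor gives 4,322 minutes of high CRH against the same 110, and the five intermittent bursts give only 60 minutes of high CRH against 422 minutes of glucocorticoids alone. The last case is the maze-running human reaching for the Krispy Kremes.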

On the link between the brain and immunity:

The evidence for the brain’s influence on the immune system goes back at least a century, dating to the first demonstration that if you waved an artificial rose in front of someone who is highly allergic to roses (and who didn’t know it was a fake), they’d get an allergic response. … [T]he study that probably most solidified the link between the brain and the immune system used a paradigm called conditioned immunosuppression.

Give an animal a drug that suppresses the immune system. Along with it, provide, à la Pavlov’s experiments, a “conditioned stimulus”—for example, an artificially flavored drink, something that the animal will associate with the suppressive drug. A few days later, present the conditioned stimulus by itself—and down goes immune function. … The two researchers experimented with a strain of mice that spontaneously develop disease because of overactivity of their immune systems. Normally, the disease is controlled by treating the mice with an immunosuppressive drug. Ader and Cohen showed that by using their conditioning techniques, they could substitute the conditioned stimulus for the actual drug—and sufficiently alter immunity in these animals to extend their life spans.

Does acupuncture rely on a placebo effect?

[S]cientists noted that Chinese veterinarians used acupuncture to do surgery on animals, thereby refuting the argument that the painkilling characteristic of acupuncture was one big placebo effect ascribable to cultural conditioning (no cow on earth will go along with unanesthetized surgery just because it has a heavy investment in the cultural mores of the society in which it dwells).

On the anticipatory stress when you set an early alarm:

In the study, one group of volunteers was allowed to sleep for as long as they wanted, which turned out to be until around nine in the morning. As would be expected, their stress hormone levels began to rise around eight. How might you interpret that? These folks had enough sleep, happily restored and reenergized, and by about eight in the morning, their brains knew it. Start secreting those stress hormones to prepare to end the sleep. But the second group of volunteers went to sleep at the same time but were told that they would be woken up at six in the morning. And what happened with them? At five in the morning, their stress hormone levels began to rise. This is important. Did their stress hormone levels rise three hours earlier than the other group because they needed three hours less sleep? Obviously not. … Their brains were feeling that anticipatory stress while sleeping, demonstrating that a sleeping brain is still a working brain.

On the importance of having outlets for stress, even if that outlet is someone else:

An organism is subjected to a painful stimulus, and you are interested in how great a stress-response will be triggered. The bioengineers had been all over that one, mapping the relationship between the intensity and duration of the stimulus and the response. But this time, when the painful stimulus occurs, the organism under study can reach out for its mommy and cry in her arms. Under these circumstances, this organism shows less of a stress-response. …

Two identical stressors with the same extent of allostatic disruption can be perceived, can be appraised differently, and the whole show changes from there. …

The subject of one experiment is a rat that receives mild electric shocks (roughly equivalent to the static shock you might get from scuffing your foot on a carpet). Over a series of these, the rat develops a prolonged stress-response: its heart rate and glucocorticoid secretion rate go up, for example. For convenience, we can express the long-term consequences by how likely the rat is to get an ulcer, and in this situation, the probability soars. In the next room, a different rat gets the same series of shocks—identical pattern and intensity; its allostatic balance is challenged to exactly the same extent. But this time, whenever the rat gets a shock, it can run over to a bar of wood and gnaw on it. The rat in this situation is far less likely to get an ulcer. You have given it an outlet for frustration. Other types of outlets work as well—let the stressed rat eat something, drink water, or sprint on a running wheel, and it is less likely to develop an ulcer. …

A variant of Weiss’s experiment uncovers a special feature of the outlet-for-frustration reaction. This time, when the rat gets the identical series of electric shocks and is upset, it can run across the cage, sit next to another rat and… bite the hell out of it. Stress-induced displacement of aggression: the practice works wonders at minimizing the stressfulness of a stressor.

On how predictability can make stressors less stressful:

During the onset of the Nazi blitzkrieg bombings of England, London was hit every night like clockwork. Lots of stress. In the suburbs the bombings were far more sporadic, occurring perhaps once a week. Fewer stressors, but much less predictability. There was a significant increase in the incidence of ulcers during that time. Who developed more ulcers? The suburban population. (As another measure of the importance of unpredictability, by the third month of the bombing, ulcer rates in all the hospitals had dropped back to normal.)

On the link between low SES and poor health – it is more about someone’s beliefs than their actual level of poverty:

[T]he SES/health gradient is not really about a distribution that bottoms out at being poor. It’s not about being poor. It’s about feeling poor, which is to say, it’s about feeling poorer than others around you. …

Instead of just looking at the relationship between SES and health, Adler looks at what health has to do with what someone thinks and feels their SES is—their “subjective SES.” Show someone a ladder with ten rungs on it and ask them, “In society, where on this ladder would you rank yourself in terms of how well you’re doing?” Simple. First off, if people were purely accurate and rational, the answers across a group should average out to the middle of the ladder’s rungs. But cultural distortions come in—expansive, self-congratulatory European-Americans average out at higher than the middle rung (what Adler calls her Lake Wobegon Effect, where all the children are above average); in contrast, Chinese-Americans, from a culture with less chest-thumping individualism, average out to below the middle rung. …

Amazingly, it is at least as good a predictor of these health measures as is one’s actual SES, and, in some cases, it is even better.

Julia: There’s this ongoing debate in the heuristics and biases field and related fields. I’ll simplify here, but between, on the one hand, the traditional Kahneman and Tversky model of biases as the ways that human reasoning deviates from ideal reasoning, systematic mistakes that we make, and then on the other side of the debate are people, like for example Gigerenzer, who argue, “No, no, no, the human brain isn’t really biased. We’re not really irrational. These are actually optimal solutions to the problems that the brain evolved to face and to problems that we have limited time and processing power to deal with, so it’s not really appropriate to call the brain irrational, it’s just optimized for particular problems and under particular constraints.”

It sounds like your research is pointing towards the second of those positions, but I guess it’s not clear to me what the tension actually is with Kahneman and Tversky in what you’ve said so far.

Tom: Importantly, I think, we were using pieces of both of those ideas. I don’t think there’s necessarily a significant tension with the Kahneman and Tversky perspective.

Here’s one way of characterizing this. Gigerenzer’s argument has focused on one particular idea which comes from statistics, which is called the bias‐variance trade off. The basic idea of this principle is that you don’t necessarily want to use the most complex model when you’re trying to solve a problem. You don’t necessarily want to use the most complex algorithm.

If you’re trying to build a predictive model, including more predictors into the model can be something which makes the model actually worse, provided you are doing something like trying to minimize the errors that you’re making in accounting for the data that you’ve seen so far. The problem is that, as your model gets more complicated, it can overfit the data. It can end up producing predictions which are driven by noise that appears in the data that you’re seeing, because it’s got such a greater expressive capacity.

The idea is, by having a simpler model, you’re not going to get into that problem of ending up doing a good job of modeling the noise, and as a consequence you’re going to end up making better predictions and potentially doing a better job of solving those problems.

Gigerenzer’s argument is that some of these heuristics, which you can think about as strategies that end up being perhaps simpler than other kinds of cognitive strategies you can engage in, they’re going to work better than a more complex strategy ‐‐ precisely because of the bias‐variance trade off, precisely because they take us in that direction of minimizing the amount that we’re going to be overfitting the data.

The reason why it’s called the bias‐variance trade off is that, as you go in that direction, you add bias to your model. You’re going to be able to do a less good job of fitting data sets in general, but you’re reducing variance ‐‐ you’re reducing the amount which the answers you’re going to get are going to vary around depending on the particular data that you see. Those two things are things that are both bad for making predictions, and so the idea is you want to find the point which is the right trade off between those two kinds of errors.
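To make the trade-off concrete, here is a minimal Python sketch. It is an illustration rather than anything from the interview: the sine-shaped signal, the noise level, the sample size and the use of bin averages as the "model" are all assumptions chosen to keep it short. Complexity here is the number of bins. One bin ignores the shape of the data (high bias), thirty bins chase the noise in thirty data points (high variance), and a handful of bins sits in between.

```python
import math
import random


def true_signal(x):
    """The pattern we would like to predict (an arbitrary choice for this sketch)."""
    return math.sin(2 * math.pi * x)


def fit_bins(xs, ys, n_bins):
    """Fit a piecewise-constant model: one predicted value per bin on [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Bins that saw no data fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def prediction_error(n_bins, rng, n_train=30, noise_sd=0.5, n_trials=200):
    """Average squared error against the true signal over many training sets."""
    grid = [(i + 0.5) / 200 for i in range(200)]
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_signal(x) + rng.gauss(0, noise_sd) for x in xs]
        model = fit_bins(xs, ys, n_bins)
        total += sum(
            (model[min(int(x * n_bins), n_bins - 1)] - true_signal(x)) ** 2
            for x in grid
        ) / len(grid)
    return total / n_trials


rng = random.Random(0)
errors = {n_bins: prediction_error(n_bins, rng) for n_bins in (1, 4, 30)}
for n_bins, err in errors.items():
    print(f"{n_bins:>2} bins: mean squared prediction error = {err:.3f}")
```

Under these assumptions the four-bin model should come out with the lowest error of the three, which is the point of the trade-off: neither the simplest nor the most flexible model predicts best.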

…

What’s interesting about that is that you basically get this one explanatory dimension where it says making things simpler is going to be good, but it doesn’t necessarily explain why you get all the way to the very, very simple kinds of strategies that Gigerenzer tends to advocate. Because basically what the bias‐variance trade off tells you is that you don’t want to use the most complex thing, but you probably also don’t want to use the simplest thing. You actually want to use something which is somewhere in between, and that might end up being more complex than perhaps the simpler sorts of strategies that Gigerenzer has identified, things that, say, rely on just using a single predictor when you’re trying to make a decision.

Kahneman and Tversky, on the other hand, emphasized heuristics as basically a means of dealing with cognitive effort, or the way that I think about it is computational effort. Doing probabilistic reasoning is something which, as a computational problem, is really hard. It’s Bayesian inference… It falls into the categories of problems which are things that we don’t have efficient algorithms to get computers to do, so it’s no surprise that they’d be things that would be challenging for people as well. The idea is, maybe people can follow some simpler strategies that are reducing the cognitive effort they need to use to solve problems.

Gigerenzer argued against that. He argued against people being, I think the way he characterized it was being “lazy,” and said instead, “No, we’re doing a good job with solving these problems.”

I think the position that I have is that I think both of those perspectives are important and they’re going to be important for explaining different aspects of the heuristics that we end up using. If you add in this third factor of cognitive effort, that’s something which does maybe push you a little bit further in terms of going in the direction of simplicity, but it’s also something that we can use to explain other kinds of heuristics.

Griffiths later provides a great explanation of why the availability heuristic can be a good decision-making tool:

Tom: The basic idea behind availability is that if I ask you to judge the probability of something, to make a decision which depends on probabilities of outcomes, and then you do that by basically using those outcomes which come to mind most easily.

An example of this is, say, if you’re going to make a decision as to whether you should go snorkeling on holiday. You might end up thinking not just about the colorful fish you’re going to see, but also about the possibility of shark attacks. Or, if you’re going to go on a plane flight, you’ll probably end up thinking about terrorists more than you should. These are things which are very salient to us and jump out at us, and so as a consequence we end up overestimating their probabilities when we’re trying to make decisions.

What Falk did was look at this question from the perspective of trying to think about a computational solution to the problem of calculating an expected utility. If you’re acting rationally, what you should be doing when you’re trying to make a decision as to whether you want to do something or not, is to work out what’s the probabilities of all of the different outcomes that could happen? What’s the utility that you assign to those outcomes? And then average together those utilities weighted by their probabilities. Then that gives you the value of that particular option.

That’s obviously a really computationally demanding thing, particularly for the kinds of problems that we face as human beings where there could be many possible outcomes, and so on and so on.

A reasonable way that you could try and solve that problem instead is by sampling, by generating some sample of outcomes and then evaluating utilities of those outcomes and then adding those up.

Then you have this question, which is, well, what distribution should you be sampling those outcomes from? I think the immediate intuitive response is to say, “Well, you should just generate those outcomes with the probability that they occur in the world. You should just generate an unbiased sample.” Indeed, if you do that, you’ll get an unbiased estimate of the expected utility.

The problem with that is that if you are in a situation where there are some outcomes that are extreme outcomes ‐‐ that, say, occur with relatively lower probability, which is I think the sort of context that we often face in the sorts of decisions that we make as humans ‐‐ then that strategy is going to not work very well. Because there’s a chance that you don’t generate those extreme outcomes, because you’re sampling from this distribution, and those things might have relatively low chance of happening.

…

The answer is, in order to deal with that problem, you probably want to generate from a different distribution. And we can ask, what’s the best distribution to generate from, from the perspective of minimizing the variance in the estimates? Because in this case it’s the variance which really kills you, it’s the variability across those different samples. The answer is: Add a little bit of bias. It’s the bias‐variance trade off again. You generate from a biased distribution, that results in a biased estimate.

The optimal distribution to generate from, from the perspective of minimizing variance, is the distribution where the probability of generating an outcome is proportional to the probability of that outcome occurring in the world, multiplied by the absolute value of its utility.

Basically, the idea is that you want to generate from a distribution where those extreme events that are either extremely good or extremely bad are given greater weight ‐‐ and that’s exactly what we end up doing when we’re answering questions using those available examples. Because the things that we tend to focus on, and the things that we tend to store in our memory, are those things which really have extreme utilities.
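A minimal sketch of that argument in Python, with loud caveats. The three outcomes, their probabilities and their utilities are invented. The sketch also assumes the normalising constant of the utility-weighted distribution is known, which gives the textbook importance-sampling estimator. That version is unbiased, so it shows only the variance reduction and not the bit of bias Griffiths mentions.

```python
import random
import statistics

# Made-up outcomes for a snorkeling trip: (label, probability, utility).
OUTCOMES = [
    ("pleasant swim", 0.95, 1.0),
    ("spectacular reef", 0.04, 10.0),
    ("shark attack", 0.01, -100.0),
]
PROBS = [p for _, p, _ in OUTCOMES]
UTILS = [u for _, _, u in OUTCOMES]

# Exact expected utility: utilities averaged with their probabilities as weights.
EXACT_EU = sum(p * u for p, u in zip(PROBS, UTILS))

# Utility-weighted proposal: q(outcome) proportional to p(outcome) * |utility|.
# Assumption: the normaliser is known exactly, which a real mind would not have.
NORMALIZER = sum(p * abs(u) for p, u in zip(PROBS, UTILS))
PROPOSAL = [p * abs(u) / NORMALIZER for p, u in zip(PROBS, UTILS)]


def estimate_by_world_frequency(rng, n_samples):
    """Sample outcomes as often as they occur in the world; average their utilities."""
    draws = rng.choices(UTILS, weights=PROBS, k=n_samples)
    return sum(draws) / n_samples


def estimate_by_utility_weighting(rng, n_samples):
    """Sample extreme outcomes more often, then reweight each draw by p / q."""
    indices = rng.choices(range(len(UTILS)), weights=PROPOSAL, k=n_samples)
    return sum(UTILS[i] * PROBS[i] / PROPOSAL[i] for i in indices) / n_samples


rng = random.Random(1)
plain = [estimate_by_world_frequency(rng, 10) for _ in range(2000)]
weighted = [estimate_by_utility_weighting(rng, 10) for _ in range(2000)]

print(f"exact expected utility:           {EXACT_EU:.3f}")
print(f"variance, world-frequency draws:  {statistics.pvariance(plain):.3f}")
print(f"variance, utility-weighted draws: {statistics.pvariance(weighted):.3f}")
```

With ten samples per decision, the world-frequency estimate swings widely depending on whether a shark happens to turn up in the sample, while the utility-weighted estimate stays much closer to the exact value.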

Can we make the availability heuristic work better for us?

I think the other idea is that, to the extent that we’ve already adopted these algorithms and these end up being strategies that we end up using, you can also ask the question of how we might structure our environments in ways that we end up doing a better job of solving the problems we want to solve, because we’ve changed the nature of the inputs to those algorithms. If intervening on the algorithms themselves is difficult, intervening on our environments might be easier, and might be the kind of thing that makes us able to do a better job of making these sorts of inferences.

To return to your example of shark attacks and so on, I think you could expect that there’s even more bias than the optimal amount of bias in availability‐based decisions because what’s available to us has changed. One of the things that’s happened is you can hear about shark attacks on the news, and you can see plane crashes and you can see all of these different kinds of things. The statistics of the environment that we operate in are also just completely messed up with respect to what’s relevant for making our own decisions.

So a basic recommendation that would come out of that is, if this is the way that your mind tends to work, try and put yourself in an environment where you get exposed to the right kind of statistics. I think the way you were characterizing that was in terms of you find out what the facts are on shark attacks and so on.

Opposing biases
https://jasoncollins.blog/2018/07/11/opposing-biases/
Wed, 11 Jul 2018 09:00:27 +0000

From the preface of one print of Philip Tetlock’s Expert Political Judgment (hat tip to Robert Wiblin, who quoted this passage in the introduction to an 80,000 Hours podcast episode):

The experts surest of their big-picture grasp of the deep drivers of history, the Isaiah Berlin–style “hedgehogs,” performed worse than their more diffident colleagues, or “foxes,” who stuck closer to the data at hand and saw merit in clashing schools of thought. That differential was particularly pronounced for long-range forecasts inside experts’ domains of expertise.

…

Hedgehogs were not always the worst forecasters. Tempting though it is to mock their belief-system defenses for their often too-bold forecasts—like “off-on-timing” (the outcome I predicted hasn’t happened yet, but it will) or the close-call counterfactual (the outcome I predicted would have happened but for a fluky exogenous shock)—some of these defenses proved quite defensible. And, though less opinionated, foxes were not always the best forecasters. Some were so open to alternative scenarios (in chapter 7) that their probability estimates of exclusive and exhaustive sets of possible futures summed to well over 1.0. Good judgment requires balancing opposing biases. Over-confidence and belief perseverance may be the more common errors in human judgment but we set the stage for over-correction if we focus solely on these errors and ignore the mirror image mistakes, of under-confidence and excessive volatility.

I can see why this idea of opposing biases makes correction of “biases” difficult.

But before we get to the correction of biases, this concept of opposing biases points at a major difficulty with behavioural analyses of decision making. When you have, say, both loss aversion and overconfidence in your bag of explanations for poor decision making, you can explain almost anything after the fact. The gamble turned out poorly? Overconfidence. Didn’t take the gamble? Loss aversion.

Recently I’ve heard a lot of people talking of action bias. There is also a status quo bias. Again, a pair of biases with which we can explain anything.

Picture the following situation: You are taking a freshman-level philosophy class in college, and your professor has just asked you to imagine a runaway trolley barreling down a track toward a group of five people. The only way to save them from being killed, the professor says, is to hit a switch that will turn the trolley onto an alternate set of tracks where it will kill one person instead of five. Now you must decide: Would the mulling over of this dilemma enlighten you in any way?

I ask because the trolley-problem thought experiment described above—and its standard culminating question, Would it be morally permissible for you to hit the switch?—has in recent years become a mainstay of research in a subfield of psychology. …

For all this method’s enduring popularity, few have bothered to examine how it might relate to real-life moral judgments. Would your answers to a set of trolley hypotheticals correspond with what you’d do if, say, a deadly train were really coming down the tracks, and you really did have the means to change its course? In November 2016, though, Dries Bostyn, a graduate student in social psychology at the University of Ghent, ran what may have been the first-ever real-life version of a trolley-problem study in the lab. In place of railroad tracks and human victims, he used an electroshock machine and a colony of mice—and the question was no longer hypothetical: Would students press a button to zap a living, breathing mouse, so as to spare five other living, breathing mice from feeling pain?

“I think almost everyone within this field has considered running this experiment in real life, but for some reason no one ever got around to it,” Bostyn says. He published his own results last month: People’s thoughts about imaginary trolleys and other sacrificial hypotheticals did not predict their actions with the mice, he found.

On what this finding means for the trolley problem:

If people’s answers to a trolley-type dilemma don’t match up exactly with their behaviors in a real-life (or realistic) version of the same, does that mean trolleyology itself has been derailed? The answer to that question depends on how you understood the purpose of those hypotheticals to begin with. Sure, they might not predict real-world actions. But perhaps they’re still useful for understanding real-world reactions. After all, the laboratory game mirrors a common experience: one in which we hear or read about a thing that someone did—a policy that she enacted, perhaps, or a crime that she committed—and then decide whether her behavior was ethical. If trolley problems can illuminate the mental process behind reading a narrative and then making a moral judgment then perhaps we shouldn’t care so much about what happened when this guy in Belgium pretended to be electrocuting mice.

…

[Joshua Greene] says, Bostyn’s data aren’t grounds for saying that responses to trolley hypotheticals are useless or inane. After all, the mouse study did find that people’s answers to the hypotheticals predicted their actual levels of discomfort. Even if someone’s feeling of discomfort may not always translate to real-world behavior, that doesn’t mean that it’s irrelevant to moral judgment. “The more sensible conclusion,” Greene added over email, “is that we are looking at several weakly connected dots in a complex chain with multiple factors at work.”

…

Bostyn’s mice aside, there are other reasons to be wary of the trolley hypotheticals. For one thing, a recent international project to reproduce 40 major studies in the field of experimental philosophy included stabs at two of Greene’s highly cited trolley-problem studies. Both failed to replicate.

Explaining the hot-hand fallacy fallacy
https://jasoncollins.blog/2018/06/28/explaining-the-hot-hand-fallacy-fallacy/
Wed, 27 Jun 2018 18:00:31 +0000

Since first coming across Joshua Miller and Adam Sanjurjo’s great work demonstrating that the hot-hand fallacy was itself a fallacy, I’ve been looking for a good way to explain simply the logic behind their argument. I haven’t found something that completely hits the mark yet, but the following explanation from Miller and Sanjurjo in The Conversation might be useful to some:

In the landmark 1985 paper “The hot hand in basketball: On the misperception of random sequences,” psychologists Thomas Gilovich, Robert Vallone and Amos Tversky (GVT, for short) found that when studying basketball shooting data, the sequences of makes and misses are indistinguishable from the sequences of heads and tails one would expect to see from flipping a coin repeatedly.

Just as a gambler will get an occasional streak when flipping a coin, a basketball player will produce an occasional streak when shooting the ball. GVT concluded that the hot hand is a “cognitive illusion”; people’s tendency to detect patterns in randomness, to see perfectly typical streaks as atypical, led them to believe in an illusory hot hand.

…

In what turns out to be an ironic twist, we’ve recently found this consensus view rests on a subtle – but crucial – misconception regarding the behavior of random sequences. In GVT’s critical test of hot hand shooting conducted on the Cornell University basketball team, they examined whether players shot better when on a streak of hits than when on a streak of misses. In this intuitive test, players’ field goal percentages were not markedly greater after streaks of makes than after streaks of misses.

GVT made the implicit assumption that the pattern they observed from the Cornell shooters is what you would expect to see if each player’s sequence of 100 shot outcomes were determined by coin flips. That is, the percentage of heads should be similar for the flips that follow streaks of heads, and the flips that follow streaks of [tails].

Our surprising finding is that this appealing intuition is incorrect. For example, imagine flipping a coin 100 times and then collecting all the flips in which the preceding three flips are heads. While one would intuitively expect that the percentage of heads on these flips would be 50 percent, instead, it’s less.

Here’s why.

Suppose a researcher looks at the data from a sequence of 100 coin flips, collects all the flips for which the previous three flips are heads and inspects one of these flips. To visualize this, imagine the researcher taking these collected flips, putting them in a bucket and choosing one at random. The chance the chosen flip is a heads – equal to the percentage of heads in the bucket – we claim is less than 50 percent.

Caption: The percentage of heads on the flips that follow a streak of three heads can be viewed as the chance of choosing heads from a bucket consisting of all the flips that follow a streak of three heads. Miller and Sanjurjo, CC BY-ND

To see this, let’s say the researcher happens to choose flip 42 from the bucket. Now it’s true that if the researcher were to inspect flip 42 before examining the sequence, then the chance of it being heads would be exactly 50/50, as we intuitively expect. But the researcher looked at the sequence first, and collected flip 42 because it was one of the flips for which the previous three flips were heads. Why does this make it more likely that flip 42 would be tails rather than a heads?

Caption: Why tails is more likely when choosing a flip from the bucket. Miller and Sanjurjo, CC BY-ND

If flip 42 were heads, then flips 39, 40, 41 and 42 would be HHHH. This would mean that flip 43 would also follow three heads, and the researcher could have chosen flip 43 rather than flip 42 (but didn’t). If flip 42 were tails, then flips 39 through 42 would be HHHT, and the researcher would be restricted from choosing flip 43 (or 44, or 45). This implies that in the world in which flip 42 is tails (HHHT) flip 42 is more likely to be chosen as there are (on average) fewer eligible flips in the sequence from which to choose than in the world in which flip 42 is heads (HHHH).

This reasoning holds for any flip the researcher might choose from the bucket (unless it happens to be the final flip of the sequence). The world HHHT, in which the researcher has fewer eligible flips besides the chosen flip, restricts his choice more than world HHHH, and makes him more likely to choose the flip that he chose. This makes world HHHT more likely, and consequentially makes tails more likely than heads on the chosen flip.

In other words, selecting which part of the data to analyze based on information regarding where streaks are located within the data, restricts your choice, and changes the odds.
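The bucket argument can be checked directly. The Python sketch below (the function names are mine, not Miller and Sanjurjo’s) enumerates every sequence of three flips to get the exact expected share of heads after a single head, then simulates 100-flip sequences with streaks of three. Sequences with no eligible flips are dropped, on the assumption that an empty bucket gives the researcher nothing to draw from.

```python
import random
from fractions import Fraction
from itertools import product


def heads_rate_after_streak(flips, streak):
    """Share of heads among flips preceded by `streak` heads; None if no such flip."""
    hits = chances = 0
    for i in range(streak, len(flips)):
        if all(flips[i - streak:i]):
            chances += 1
            hits += flips[i]
    return Fraction(hits, chances) if chances else None


def exact_expected_rate(n_flips, streak):
    """Average the per-sequence rate over every equally likely sequence."""
    rates = [
        heads_rate_after_streak(seq, streak)
        for seq in product((0, 1), repeat=n_flips)
    ]
    rates = [r for r in rates if r is not None]
    return sum(rates) / len(rates)


def simulated_expected_rate(n_flips, streak, n_sequences, rng):
    """Monte Carlo version of the same average for longer sequences."""
    rates = []
    for _ in range(n_sequences):
        flips = [rng.randint(0, 1) for _ in range(n_flips)]
        rate = heads_rate_after_streak(flips, streak)
        if rate is not None:
            rates.append(float(rate))
    return sum(rates) / len(rates)


print(exact_expected_rate(3, 1))  # 5/12, not 1/2
print(simulated_expected_rate(100, 3, 4000, random.Random(2)))  # below 0.5
```

The three-flip case is small enough to check by hand: of the six sequences that contain a head in one of the first two positions, the shares of heads on the following flips are 1, 1/2, 0, 0, 1 and 0, which average to 5/12.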

There are a few other pieces in the article that make it worth reading, but here is an important punchline to the research:

Because of the surprising bias we discovered, their finding of only a negligibly higher field goal percentage for shots following a streak of makes (three percentage points), was, if you do the calculation, actually 11 percentage points higher than one would expect from a coin flip!

An 11 percentage point relative boost in shooting when on a hit-streak is not negligible. In fact, it is roughly equal to the difference in field goal percentage between the average and the very best 3-point shooter in the NBA. Thus, in contrast with what was originally found, GVT’s data reveal a substantial, and statistically significant, hot hand effect.

Wealth and genes
https://jasoncollins.blog/2018/06/21/wealth-and-genes/
Thu, 21 Jun 2018 09:00:08 +0000

Go back ten years, and most published attempts to link specific genetic variants to a trait were false. These candidate-gene studies were your classic, yet typically rubbish, “gene for X” paper.

The proliferation of poor papers was in part because the studies were too small to discover the effects they were looking for (see here for some good videos describing the problems). As has become increasingly evident, most human traits are affected by thousands of genes, each with tiny effects. With a small sample – many of the early candidate-gene studies involved hundreds of people – all you can discover is noise.

But there was some optimism that robust links would eventually be drawn. Get genetic samples from a large enough population (say, hundreds of thousands), and you can detect these weak genetic effects. You can also replicate the findings across multiple samples to ensure the results are robust.

In recent years that promise has started to be realised through genome-wide association studies (GWAS). Although more than 99% of the human genome is common across people, there are certain locations at which the DNA base pair can differ. These locations are known as single-nucleotide polymorphisms (SNPs). A GWAS involves looking across all of the sampled SNPs (typically one million or so SNPs for each person) and estimating the effect of each SNP against an outcome of interest. Those SNPs that meet certain statistical thresholds are treated as positive findings.

One innovation from this work is the use of “polygenic scores”. The effect of all measured SNPs from a GWAS is used to produce a single score for a person. That score is used to predict their trait or outcome. Polygenic scores are used regularly in animal breeding, and are now starting to be used to look at human outcomes, including those of interest to economists.
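As a rough sketch of the arithmetic, a polygenic score is commonly computed as a weighted sum: each SNP’s estimated effect multiplied by the number of copies of the effect allele a person carries. The SNP names, effect sizes and genotypes below are invented, and treating a missing SNP as zero copies is a simplification made for this example.

```python
# Hypothetical GWAS output: estimated effect per copy of the effect allele.
# The SNP names and effect sizes are invented for illustration.
EFFECT_SIZES = {"snp_a": 0.020, "snp_b": -0.010, "snp_c": 0.005}


def polygenic_score(allele_counts, effect_sizes=EFFECT_SIZES):
    """Weighted sum of a person's effect-allele counts (0, 1 or 2 per SNP)."""
    return sum(
        effect * allele_counts.get(snp, 0)  # missing SNPs count as zero here
        for snp, effect in effect_sizes.items()
    )


# Two made-up people, each described by how many effect alleles they carry.
person_1 = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
person_2 = {"snp_a": 0, "snp_b": 2, "snp_c": 1}

print(round(polygenic_score(person_1), 3))  # 0.03
print(round(polygenic_score(person_2), 3))  # -0.015
```

A real score sums over hundreds of thousands of SNPs in the same way, which is why the individually tiny effects can still add up to something with predictive power.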

The latest example of this is an examination of the link between wealth and a polygenic score for education. An extract from the abstract of the NBER working paper by Daniel Barth, Nicholas Papageorge and Kevin Thom states:

We show that genetic endowments linked to educational attainment strongly and robustly predict wealth at retirement. The estimated relationship is not fully explained by flexibly controlling for education and labor income. … The associations we report provide preliminary evidence that genetic endowments related to human capital accumulation are associated with wealth not only through educational attainment and labor income, but also through a facility with complex financial decision-making.

We first establish a robust relationship between household wealth in retirement and the average household polygenic score for educational attainment. A one-standard-deviation increase in the score is associated with a 33.1 percent increase in household wealth (approximately $144,000 in 2010 dollars). … Measures of educational attainment, including years of education and completed degrees, explain over half of this relationship. Using detailed income data from the Social Security Administration (SSA) as well as self-reported labor earnings from the HRS, we find that labor income can explain only a small part of the gene-wealth gradient that remains after controlling for education. These results indicate that while education and labor market earnings are important sources of variation in household wealth, they explain only a portion of the relationship between genetic endowments and wealth.

The finding that the genes that affect education also affect other outcomes – in this case wealth – is no surprise. Whether these genes relate to, say, cognitive ability or conscientiousness, it is easy to imagine that they affect all of education, workplace performance, savings behaviour and a host of other factors that would in turn influence wealth.

To tease this out, I would be interested in seeing studies that examine the predictive power of polygenic scores for more fundamental characteristics, such as IQ and the big five personality traits. These would likely capture a good deal of the variation in outcomes being attributed to education. You might also look at some fundamental economic traits, such as risk or time preferences (to the extent these are not just reflections of IQ and the big five). If you know these more fundamental traits, most other behaviours are simply combinations of them.

This was a lesson learnt from research on heritability, where you could find studies calculating the heritability of everything from opinions on gun control to leisure interests. Although this had some value in that it led to the first law of behavioural genetics, namely that all human behavioural traits are heritable, a lot of these studies were simply capturing manifestations of differences in IQ and the big five. (It also benefited academics by padding their CVs.)

Moving on, what does analysis using polygenic scores add to other work?

Our work contributes to an existing literature on endowments, economic traits, and household wealth. One strand of this work examines how various measures of “ability,” such as IQ or cognitive test scores, predict household wealth and similar outcomes … However, parental investments and other environmental factors can directly affect test performance, making it difficult to separate the effects of endowed traits from endogenous human capital investments. A second strand of this literature focuses on genetic endowments, and seeks to estimate their collective importance using twin studies. Twin studies have shown that genetics play a non-trivial role in explaining financial behavior such as savings and portfolio choices … However, while twin studies can decompose the variance of an outcome into genetic and non-genetic contributions, they do not identify which particular markers influence economic outcomes. Moreover, it is typically impossible to apply twin methods to large and nationally representative longitudinal studies, such as the HRS, which offer some of the richest data on household wealth and related behavioral traits.

Twin studies are fantastic at teasing out the role of genetics, but if you want to take genetic samples from a new population and use the genetic markers as controls in your analysis or to predict outcomes, you need something of the nature of these polygenic scores.

We note two important differences between the EA score and a measure like IQ that make it valuable to study polygenic scores. First, a polygenic score like the EA score can overcome some interpretational challenges related to IQ and other cognitive test scores. Environmental factors have been found to influence intelligence test results and to moderate genetic influences on IQ (Tucker-Drob and Bates, 2015). It is true that differences in the EA score may reflect differences in environments or investments because parents with high EA scores may also be more likely to invest in their children. However, the EA score is fixed at conception, which means that post-birth investments cannot causally change the value of the score. A measure like IQ suffers from both of these interpretational challenges.

The interpretational challenge with IQ doesn’t need to be viewed in isolation. Between twin and adoption studies and these studies, you can start to tease out how much a measure like IQ is practically (as opposed to theoretically) hampered by those challenges. An even better option might be an IQ polygenic score.

The paper ends with a warning that we know should have been attached to many papers for decades now, but this time with an increasingly tangible solution.

Economic research using information on genetic endowments is useful for understanding what has heretofore been a form of unobserved heterogeneity that persists across generations, since parents provide genetic material for their children. Studies that ignore this type of heterogeneity when studying the intergenerational persistence of economic outcomes, such as income or wealth, could place too much weight on other mechanisms such as attained education or direct monetary transfers between parents and children. The use of observed genetic information helps economists to develop a more accurate and complete understanding of inequality across generations.

Examining intergenerational outcomes while ignoring genetic effects is generally a waste of time.

Is the marshmallow test just a measure of affluence?
https://jasoncollins.blog/2018/06/13/is-the-marshmallow-test-just-a-measure-of-affluence/
Wed, 13 Jun 2018 09:00:59 +0000

I argued in a recent post that the conceptual replication of the marshmallow test was largely successful. A single data point – whether someone can wait for a larger reward – predicts future achievement.

That replication has generated a lot of commentary. Most concerns the extension to the original study, an examination of whether the marshmallow test retained its predictive power if they accounted for factors such as the parent and child’s background (including socioeconomic status), home environment, and measures of the child’s behavioural and cognitive development.

The result was that these “controls” eliminated the predictive power of the marshmallow test. If you know those other variables, the marshmallow test does not give you any further information.

As I said before, this is hardly surprising. They used around 30 controls – 14 for child and parent background, 9 for the quality of the home environment, 5 for childhood achievement and 2 for behavioural characteristics. It is likely that many of them capture the features that give the marshmallow test its predictive power.

So can we draw any conclusions from the inclusion of those particular controls? One of the most circulated interpretations is by Jessica Calarco in the Atlantic, titled Why Rich Kids Are So Good at the Marshmallow Test. The subtitle is “Affluence—not willpower—seems to be what’s behind some kids’ capacity to delay gratification”. Calarco writes:

Ultimately, the new study finds limited support for the idea that being able to delay gratification leads to better outcomes. Instead, it suggests that the capacity to hold out for a second marshmallow is shaped in large part by a child’s social and economic background—and, in turn, that that background, not the ability to delay gratification, is what’s behind kids’ long-term success.

This conclusion is a step too far. For a start, controlling for child background and home environment (slightly more than) halved the predictive power of the marshmallow test. It did not eliminate it. It was only on including additional behavioural and cognitive controls – characteristics of the child themselves – that the predictive power of the marshmallow test was eliminated.

But the more interesting question is one of causation. Are the social and economic characteristics themselves the cause of later achievement?

One story we could tell is that the social and economic characteristics are simply proxies for parental characteristics, which are genetically transmitted to the children. Heritability of traits such as IQ tends to increase with age, so parental characteristics would likely have predictive power in addition to that of the four-year-old’s cognitive and behavioural skills.

On the flipside, maybe the behavioural and cognitive characteristics of the child are simply reflections of the developmental environment that the child has been exposed to so far. This is effectively Calarco’s interpretation.

Which is the right interpretation? This study doesn’t help answer this question. It was never designed to. As lead study author Tyler Watts tweeted in response to the Atlantic article:

We found that in combination, all of those controls reduced the effect of the test on later achievement. So, it’s difficult to say which control was most important and and all of those factors are correlated with one another.

If you want to know whether social and economic background causes future success, you should look elsewhere. (I’d start with twin and adoption studies.)
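To see why a shrinking coefficient cannot separate the two stories, here is a rough simulation sketch. Everything in it is an assumption made for illustration: the variable names, the coefficients (0.6 and 0.5), and the two toy data-generating processes are invented and have nothing to do with the actual data in the Watts study. In the "environment" story, background directly drives both delay and achievement. In the "inherited" story, an unobserved trait drives everything and background is only a noisy proxy for it, with no causal effect of its own.

```python
import random


def cov(x, y):
    """Population covariance of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def raw_slope(y, d):
    """OLS slope of y on d alone."""
    return cov(y, d) / cov(d, d)


def controlled_slope(y, d, b):
    """OLS coefficient on d when y is regressed on both d and b."""
    num = cov(y, d) * cov(b, b) - cov(y, b) * cov(b, d)
    den = cov(d, d) * cov(b, b) - cov(b, d) ** 2
    return num / den


def simulate(story, n=20000, seed=1):
    """Return (raw, controlled) slopes of achievement on delay.

    All coefficients below are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    background, delay, achievement = [], [], []
    for _ in range(n):
        if story == "environment":
            # Background itself drives both delay and achievement.
            b = rng.gauss(0, 1)
            driver = b
        else:
            # "inherited": an unobserved trait drives everything;
            # background is only a noisy proxy for that trait.
            driver = rng.gauss(0, 1)
            b = 0.6 * driver + rng.gauss(0, 1)
        background.append(b)
        delay.append(0.6 * driver + rng.gauss(0, 1))
        achievement.append(0.5 * driver + rng.gauss(0, 1))
    return (raw_slope(achievement, delay),
            controlled_slope(achievement, delay, background))


results = {s: simulate(s) for s in ("environment", "inherited")}
for story, (raw, controlled) in results.items():
    print(f"{story}: raw={raw:.2f}, with background control={controlled:.2f}")
```

Under these made-up parameters, the coefficient on delay falls once background is controlled for in both stories, even though background has no causal effect on achievement in the "inherited" one. A reduction in predictive power after adding controls is what you would expect either way.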

That said, there were a couple of interesting elements to this new study. While the marshmallow test was predictive of future achievement at age 15, there was no association between the marshmallow test and two composite measures of behaviours at 15. The composite behaviour measures were for internalising behaviours (such as depression) and externalising behaviours (such as anti-social behaviours). This inability to predict future behavioural problems hints that the marshmallow test may obtain its predictive power through the cognitive rather than the behavioural channel.

This possibility is also suggested by the correlation between the marshmallow test and the Applied Problems test, which requires the children to count and solve simple addition problems.

[T]he marshmallow test had the strongest correlation with the Applied Problems subtest of the WJ-R, r(916) = .37, p < .001; and correlations with measures of attention, impulsivity, and self-control were lower in magnitude (rs = .22–.30, p < .001). Although these correlational results were far from conclusive, they suggest that the marshmallow test should not be thought of as a mere behavioral proxy for self-control, as the measure clearly relates strongly to basic measures of cognitive capacity.

[A] child’s ability to wait in the ‘marshmallow test’ situation reflects that child’s ability to engage various cognitive and emotion-regulation strategies and skills that make the waiting situation less frustrating. Therefore, it is expected and predictable, as the Watts paper shows, that once these cognitive and emotion-regulation skills, which are the skills that are essential for waiting, are statistically ‘controlled out,’ the correlation is indeed diminished.

Also from Mischel:

Unfortunately, our 1990 paper’s own cautions to resist sweeping over-generalizations, and the volume of research exploring the conditions and skills underlying the ability to wait, have been put aside for more exciting but very misleading headline stories over many years.

PPS: In another thread to her article, Calarco draws on the concept of scarcity:

There’s plenty of other research that sheds further light on the class dimension of the marshmallow test. The Harvard economist Sendhil Mullainathan and the Princeton behavioral scientist Eldar Shafir wrote a book in 2013, Scarcity: Why Having Too Little Means So Much, that detailed how poverty can lead people to opt for short-term rather than long-term rewards; the state of not having enough can change the way people think about what’s available now. In other words, a second marshmallow seems irrelevant when a child has reason to believe that the first one might vanish.

I’ve written about scarcity previously in my review of Mullainathan and Shafir’s book. I’m not sure the work on scarcity sheds light on the marshmallow test results. The concept behind scarcity is that poverty-related concerns consume mental bandwidth that isn’t then available for other tasks. A typical experiment to demonstrate scarcity involves priming the experimental subjects with a problem before testing their IQ. When the problem has a large financial cost (e.g. expensive car repairs), the performance of low-income people plunges. Focusing their attention on their lack of resources consumes mental bandwidth. Applying this to the marshmallow test, I haven’t seen much evidence that four-year-olds are struggling with this problem.

(As an aside, scarcity seems to be the catchall response to discussions of IQ and achievement, a bit like epigenetics is the response to any discussion of genetics.)

Given Calarco’s willingness to bundle the marshmallow test replication into the replication crisis (calling it a “failed replication”), it’s worth also thinking about scarcity in that light. If I had to predict which results would not survive a pre-registered replication, the experiments in the original scarcity paper are right up there. They involve priming, the poster-child for failed replications. The size of the effect, 13 IQ points from a simple prime, fails the “effect is too large” heuristic.

Then there is a study that looked at low-income households before and after payday, which found no change in cognitive function either side of that day (you could consider this a “conceptual replication”). In addition, for a while now I have been hearing rumours of file drawers containing failed attempts to elicit the scarcity mindset. I was able to find one pre-registered direct replication, but it doesn’t seem the result has been published. (Sitting in a file drawer somewhere?)

[P]articipants entered a room where they sat in chairs with small desks attached (the typical exam-style setup). Next, each participant received a sheet of paper containing a series of twenty different matrices … and were told that their task was to find in each of these matrices two numbers that added up to 10 …

We also told them that they had five minutes to solve as many of the twenty matrices as possible and that they would get paid 50 cents per correct answer (an amount that varied depending on the experiment). Once the experimenter said, “Begin!” the participants turned the page over and started solving these simple math problems as quickly as they could. …

Here’s an example matrix:

This was how the experiment started for all the participants, but what happened at the end of the five minutes was different depending on the particular condition.

Imagine that you are in the control condition… You walk up to the experimenter’s desk and hand her your solutions. After checking your answers, the experimenter smiles approvingly. “Four solved,” she says and then counts out your earnings. … (The scores in this control condition gave us the actual level of performance on this task.)

Now imagine you are in another setup, called the shredder condition, in which you have the opportunity to cheat. This condition is similar to the control condition, except that after the five minutes are up the experimenter tells you, “Now that you’ve finished, count the number of correct answers, put your worksheet through the shredder at the back of the room, and then come to the front of the room and tell me how many matrices you solved correctly.” …

If you were a participant in the shredder condition, what would you do? Would you cheat? And if so, by how much?

With the results for both of these conditions, we could compare the performance in the control condition, in which cheating was impossible, to the reported performance in the shredder condition, in which cheating was possible. If the scores were the same, we would conclude that no cheating had occurred. But if we saw that, statistically speaking, people performed “better” in the shredder condition, then we could conclude that our participants overreported their performance (cheated) when they had the opportunity to shred the evidence. …

Perhaps somewhat unsurprisingly, we found that given the opportunity, many people did fudge their score. In the control condition, participants solved on average four out of the twenty matrices. Participants in the shredder condition claimed to have solved an average of six—two more than in the control condition. And this overall increase did not result from a few individuals who claimed to solve a lot more matrices, but from lots of people who cheated by just a little bit.

The question then becomes how to reduce cheating. Ariely describes one idea:

[O]ur memory and awareness of moral codes (such as the Ten Commandments) might have an effect on how we view our own behavior.

… We took a group of 450 participants and split them into two groups. We asked half of them to try to recall the Ten Commandments and then tempted them to cheat on our matrix task. We asked the other half to try to recall ten books they had read in high school before setting them loose on the matrices and the opportunity to cheat. Among the group who recalled the ten books, we saw the typical widespread but moderate cheating. On the other hand, in the group that was asked to recall the Ten Commandments, we observed no cheating whatsoever. And that was despite the fact that no one in the group was able to recall all ten.

This result was very intriguing. It seemed that merely trying to recall moral standards was enough to improve moral behavior.

This experiment comes from a paper co-authored by Nina Mazar, On Amir and Ariely (pdf). (I’m not sure where the 450 students in the book comes from – the paper reports 229 students for this experiment. A later experiment in the paper uses 450. There were also a few differences in this experiment to the general cheating story above. People took their answers home for “recycling”, rather than shredding them, and payment was $10 per correct matrix to two randomly selected students.)

The self-concept maintenance theory holds that many people will cheat in order to maximize self-profit, but only to the extent that they can do so while maintaining a positive self-concept. Mazar, Amir, and Ariely (2008; Experiment 1) gave participants an opportunity and incentive to cheat on a problem-solving task. Prior to that task, participants either recalled the 10 Commandments (a moral reminder) or recalled 10 books they had read in high school (a neutral task). Consistent with the self-concept maintenance theory, when given the opportunity to cheat, participants given the moral reminder priming task reported solving 1.45 fewer matrices than those given a neutral prime (Cohen’s d = 0.48); moral reminders reduced cheating. The Mazar et al. (2008) paper is among the most cited papers in deception research, but it has not been replicated directly. This Registered Replication Report describes the aggregated result of 25 direct replications (total n = 5786), all of which followed the same pre-registered protocol. In the primary meta-analysis (19 replications, total n = 4674), participants who were given an opportunity to cheat reported solving 0.11 more matrices if they were given a moral reminder than if they were given a neutral reminder (95% CI: -0.09; 0.31). This small effect was numerically in the opposite direction of the original study (Cohen’s d = -0.04).

And here’s a chart demonstrating the result (Figure 2):

Multi-lab experiments like this are fantastic. There’s little ambiguity about the result.
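For a sense of how little ambiguity there is, here is a back-of-the-envelope sketch using only the numbers quoted in the abstract above. It assumes the reported 95% interval is symmetric and based on a normal approximation, which is my assumption rather than something the abstract states, and the function name `ci_summary` is just a label I made up.

```python
def ci_summary(estimate, lower, upper, z_crit=1.96):
    """Back out an approximate standard error from a symmetric 95% CI.

    Assumes a normal approximation (an assumption, not stated in the abstract).
    """
    se = (upper - lower) / (2 * z_crit)
    return {
        "se": se,
        "z": estimate / se,
        "contains_zero": lower <= 0 <= upper,
    }


# Replication: 0.11 more matrices with the moral reminder, 95% CI -0.09 to 0.31.
replication = ci_summary(0.11, -0.09, 0.31)

# Original study: 1.45 fewer matrices with the moral reminder,
# i.e. -1.45 on the same (moral minus neutral) scale.
original_effect = -1.45
original_inside_ci = -0.09 <= original_effect <= 0.31

print(f"approx. standard error: {replication['se']:.3f}")
print(f"interval contains zero: {replication['contains_zero']}")
print(f"original effect inside replication interval: {original_inside_ci}")
```

The replication interval comfortably includes zero and sits a long way from the original estimate of 1.45 fewer matrices.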