Deviation Leads to Aggravation

Archive for the ‘statistics’ Category

As another US election draws nigh, politics becomes the sporting talk of a certain American cross section. I’m much more inclined to be an observer than a participant, but inevitably I am drawn into an idle political chat or two. If nothing else, these conversations force me to confront the fact that my voting views are not as anodyne as I’d like to think, and that I’d better be ready to explain myself satisfactorily.

Here’s a short and–I hope–entertaining movie I made based on how these conversations run, with the main differences being that I’m not this articulate in person and that I usually fail to convince the person I’m not some “communist whack-a-doo.” If you’re having a hard time understanding the robo-speak, you can turn on closed captions:

The main points I try to get across in the movie:

There are many reasons to vote.

What many, if not most, voters use as their stated reason for voting (i.e. its instrumentality, or ability to decide who wins) is irrational in a dry, technical, uncontroversial way.

This is OK, because voters’ behavior reveals their voting to be for other valid reasons, such as for personal expression, group affiliation, the fulfilling of a civic duty, etc. In other words, they’re behaving like me, even if they don’t acknowledge it.

One thing I don’t mention in the movie is that I, along with plenty of others in the electorate, rarely bother to vote in small and/or local elections, where the instrumental value of a vote is orders of magnitude higher. You can try to explain this by pointing out the smaller stakes, but in my view it’s another bit of evidence that people vote expressively.
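The instrumental arithmetic can be sketched in a few lines of Python. Every number below is a made-up placeholder, not a figure from the paper or the movie: `p_decisive` is a hypothetical chance that a single vote decides the election, and `benefit` is a hypothetical dollar value the voter places on their side winning.

```python
def instrumental_value(p_decisive, benefit):
    """Expected payoff of a vote cast purely to change the outcome."""
    return p_decisive * benefit

# Hypothetical inputs for illustration only.
national = instrumental_value(p_decisive=1e-8, benefit=10_000.0)
local = instrumental_value(p_decisive=1e-3, benefit=1_000.0)

print(f"national election: ${national:.4f}")
print(f"local election:    ${local:.4f}")
print(f"local / national:  {local / national:,.0f}x")
```

Under these assumed inputs the local vote is worth far more instrumentally even though the stakes are ten times smaller, which is the pattern the paragraph above describes.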

(The paper referenced in the movie on voting probability in the 2008 election can be found here (.pdf), and the statistic about death from a non-poisonous arthropod is from the always fun to use Book of Odds.)

Even as I read The Black Swan for the first time, I’ve already read it. I’ve listened to several in-depth interviews with Nassim Nicholas Taleb since the book came out in 2007, and he’s had a recent resurgence in attention as the credit crisis fits his titular metaphor aptly. Despite my familiarity with the main thesis I’m still enjoying the book, just as one might still enjoy slurping down the spiced milk after finishing his Cinnamon Toast Crunch. Indeed, I’ve not come across another book that so completely elucidates (in a far more sophisticated and erudite manner, granted) how I’ve come to think about things generally.

I’m 2/3 of the way through the book and have come across many passages tempting me to blog, but the following will probably be the only one I excerpt (you, yes YOU, should really just read the book). In it, Taleb describes the limitation of making predictions in a complex system by using an example computed by a mathematician named Michael Berry:

If you know a set of basic parameters concerning [a billiard] ball at rest, can compute the resistance of the table (quite elementary), and can gauge the strength of the impact, then it is rather easy to predict what would happen at the first hit. The second impact becomes more complicated, but possible; you need to be more careful about your knowledge of the initial states, and more precision is called for. The problem is that to correctly compute the ninth impact, you need to take into account the gravitational pull of someone standing next to the table (modestly, Berry’s computations use a weight of less than 150 pounds). And to compute the fifty-sixth impact, every single elementary particle in the universe needs to be present in your assumptions! An electron at the edge of the universe, separated from us by 10 billion light-years, must figure in the calculations, since it exerts a meaningful effect on the outcome. Now, consider the additional burden of having to incorporate predictions about where these variables will be in the future. Forecasting the motion of a billiard ball on a pool table requires knowledge of the dynamics of the entire universe, down to every single atom!

(…)

In a dynamical system, where you are considering more than a ball on its own, where trajectories in a way depend on one another, the ability to project into the future is not just reduced, but is subjected to fundamental limitation. (p. 178)
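The flavor of this limitation can be seen in a toy model. The sketch below does not reproduce Berry’s billiard computation; it swaps in the logistic map, a standard textbook example of chaos, and tracks two trajectories whose starting points differ by an assumed one part in ten billion.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), returning every value visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two starting points that differ by a tiny, assumed measurement error.
a = logistic_trajectory(0.2, 60)
b = logistic_trajectory(0.2 + 1e-10, 60)
gaps = [abs(x - y) for x, y in zip(a, b)]

for step in (0, 5, 10, 20, 30, 40, 50, 60):
    print(f"step {step:2d}: gap = {gaps[step]:.3e}")
```

The gap stays negligible for the first few steps and then grows until the two trajectories have nothing to do with each other, much as the early billiard impacts are predictable and the later ones are not.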

Austrian economists like Hayek used similar reasoning in the early 20th century to critique Soviet-style central planning. One oft-forgotten miracle of prices is that they provide a basis of comparison for completely different things. If I decide to use my $100 for golf lessons, I know exactly what I’m giving up for them: $100 worth of Braeburn apples, Suzie’s babysitting, Tide laundry detergent, Clive Owen’s acting, the neighbor’s stash of dope, the additional interest I would earn in my Citibank savings account, a lecture by Al Gore, Hamburger Kunsthalle tickets, the copyright on Beatles sound recordings, taxi rides from JFK to Manhattan, common stock in a Mumbai start-up, etc. In other words, prices tell me about relative values. In the absence of a price system, the Austrians argued, it would be impossible to ration resources effectively, and even if prices were used, no central planner could ever hope to set them correctly because prices reflect an incomprehensible amount of dispersed knowledge particular to time and place. Just think about the task Mr. Planner would have to face:

1. Set the price of every resource (including, for example, the time of every person in the economy).

2. Make sure each price is correct relative to every other price both now and in the future.

3. Repeat steps 1-2 every second as conditions change.
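To get a rough sense of scale for the second step alone, here is a small sketch that counts how many pairwise price comparisons exist among a given number of goods. The counts of goods are arbitrary illustrations, not estimates of any real economy, and the sketch ignores the "now and in the future" part entirely.

```python
def relative_price_count(n_goods):
    """Number of distinct pairs among n_goods items: n * (n - 1) / 2."""
    return n_goods * (n_goods - 1) // 2

# Arbitrary economy sizes, chosen only to show how fast the count grows.
for n in (10, 1_000, 1_000_000):
    print(f"{n:>9,} goods -> {relative_price_count(n):,} pairwise comparisons")
```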

Could we, like Camus, imagine Mr. Planner happy in his Sisyphean task? And to extend it to Taleb’s point, do we really think anyone could make a certain and accurate forecast of where prices will be in a decade? A year? A day? For that matter, are my powers of clairvoyance to be trusted?

Happily I can report they are, for after reading the above passage and forming this post in my head I turned the page to find a brief section discussing Hayek; Roma Downey has my undying gratitude.

Ask anyone on the street these days about middle class incomes and, if they don’t threaten to mace you if you don’t please step away, they might tell you a sad story of stagnation. Adjusting for inflation, the typical household is earning but a pittance more than in the 1970s, while all the gains in wealth have gone to feed fancy feasts to the fat cats at the top of the distribution.

But Terry J. Fitzgerald, who may or may not be related to F. Scott Fitzgerald (or even F. Scott Key, for that matter), and who, though failing to write a Great American Novel or national anthem, has nonetheless written many a satisfactory research article for the Minneapolis Fed, recently penned a rejoinder to this dominant narrative that is gloriously free of turgid run-on sentences such as the one I’m writing at this very moment. Par exemple:

The U.S. Census Bureau reports that median household income stagnated from 1976 to 2006, growing by only 18 percent. In contrast, data from the Bureau of Economic Analysis indicate that income per person was up 80 percent.

The fact that an 18 percent gain in purchasing power is considered stagnation may in the eyes of some be rivaled only by the Turducken as a signal of how prosperous our society has become. Ignoring that consideration, however, leaves one to wonder about the apparent contradiction in the two statistics above. How can income per person have grown four times as much as the income for a typical household? Terry tells the story in pictures:

About 15-20 percent of the difference between the per person and household figures is caused by increasing inequality, but most of the discrepancy is explained by other factors such as household composition, definitional differences, and different methods of calculating inflation. Briefly:

Household composition – The household of 1976 looked different from the household of today. Today’s households are smaller, for example, and less likely to include a married couple, and this accounts for some of the slower growth in household income. Comparing apples to apples yields much higher growth rates.

Definitional differences – Unlike the Census, the BEA includes “employer contributions to employee pension and insurance funds and in-kind transfer payments such as Medicaid, food stamps and energy assistance” in its measure of income. Including these forms of compensation boosts income growth.

Different methods of calculating inflation – There is no standard way of calculating inflation, and income statistics over time are sensitive to which method is used. Using a different method that attempts to capture reality better results in higher growth.

Putting all this together yields a 44-62 percent increase in median household income over the past 30 years. The assumptions and the methodology are yet imperfect, but this is hardly a result at which to cluck, quack, or gobble.
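Thirty-year totals can be hard to compare at a glance, so here is a small sketch that converts the cumulative figures quoted above into compound annual rates. It assumes a flat 30-year span (1976 to 2006) and steady compounding, which is a simplification and not Fitzgerald’s method.

```python
def annualized(total_growth, years):
    """Convert cumulative growth (0.18 means 18 percent) to a yearly rate."""
    return (1.0 + total_growth) ** (1.0 / years) - 1.0

YEARS = 30  # 1976 to 2006, as quoted above

# Cumulative growth figures taken from the text of this post.
figures = {
    "Census median household": 0.18,
    "BEA income per person": 0.80,
    "Adjusted median, low end": 0.44,
    "Adjusted median, high end": 0.62,
}

for label, growth in figures.items():
    rate = annualized(growth, YEARS)
    print(f"{label:26s} {growth:4.0%} total, {rate:.2%} per year")
```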

It occurred to me as I stood sadly alone that when someone goes on the prowl they are implicitly conducting a sort of statistical test whereby a hypothesis is formed, data are gathered, and a conclusion to either accept or reject the hypothesis is formed. This technique is imperfect, so sometimes the hapless wooer will incorrectly reject the hypothesis when it is in fact true (Type I error), and sometimes will incorrectly accept the hypothesis when it is in fact false (Type II error).

To explain it more clearly, consider that my default hypothesis when I see a fair lass is “She’s not interested.” Having formed said hypothesis, I need to collect some data, so I saunter over and initiate a conversation. During the course of the conversation, I look for evidence in support of my hypothesis, such as:

She immediately vacates the area when I approach

She folds her arms and doesn’t look at me

She keeps fingering a wedding ring

She abruptly pulls out a Gloria Steinem book from her purse and commences to reading

But I also keep a wary eye for evidence to reject my hypothesis, such as:

She keeps eye contact for a prolonged period of time

She laughs readily and engages in gentle teasing

She plays with her hair or jewelry

She touches me lightly on the arm

She says something like “The attractiveness of your erudition and wit is exceeded only by your HOTNESS!”

Now that I’ve gathered the data, it’s time to base a conclusion on them. Essentially there are four outcomes:

I decide correctly to reject my hypothesis (i.e. I think she’s interested in me and this is true)

I decide correctly to accept my hypothesis (i.e. I think she’s not interested in me and this is true)

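I decide incorrectly to reject my hypothesis (i.e. I think she’s interested in me when she really isn’t, a Type I error)

I decide incorrectly to accept my hypothesis (i.e. I think she’s not interested in me when she really is, a Type II error).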

Personally speaking, I commit a Type I error rarely. This is not because I possess some amazing psychological insights, but rather because it takes a lot of evidence to convince me that a girl really is interested and hence cause me to reject my default hypothesis. Instead I’m predisposed to make a Type II error, which is perhaps the far more tragic kind. Because I require so much evidence to reject my default hypothesis, I’m far more likely to conclude incorrectly that the girl doesn’t like me when in fact she thinks she’s found her soul mate.

My painful tale of romantic woe illustrates nicely a trade-off that statisticians face routinely. If they set the standard of proof for rejecting the default hypothesis too high, they’ll be prone to cling to it when it is false (a Type II error). If they set the standard too low, however, they’ll be prone to reject it when it is true (a Type I error, which might get a guy slapped in the face in my example). The FDA behaves like I do when approving drugs, for instance. Because incorrectly approving a harmful drug would be so disastrous, the FDA sets a very high standard of proof for drug approval. This consequently makes the FDA more likely to reject beneficial drugs.
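The trade-off can be made concrete with a toy model. The sketch below assumes, purely for illustration, that the "signals" a wooer observes can be boiled down to a single score that follows one bell curve when she is not interested and another, shifted upward, when she is. The means, spreads, and thresholds are all invented.

```python
from statistics import NormalDist

# Hypothetical "signal" scores: higher means she seems more interested.
not_interested = NormalDist(mu=0.0, sigma=1.0)  # default hypothesis is true
interested = NormalDist(mu=2.0, sigma=1.0)      # default hypothesis is false

def error_rates(threshold):
    """Reject "she's not interested" only when the signal exceeds threshold."""
    type_1 = 1.0 - not_interested.cdf(threshold)  # rejected a true hypothesis
    type_2 = interested.cdf(threshold)            # accepted a false hypothesis
    return type_1, type_2

for threshold in (0.5, 1.0, 1.5, 2.0, 2.5):
    t1, t2 = error_rates(threshold)
    print(f"threshold {threshold:.1f}: Type I = {t1:.3f}, Type II = {t2:.3f}")
```

Raising the threshold, which is to say demanding more evidence, drives the Type I rate down and the Type II rate up. Under these assumptions there is no setting that shrinks both at once.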

Sometimes I wonder if maybe, just maybe, the FDA has committed a Type II error and rejected a beneficial drug that would have helped me with my Type II problems, but that’s probably just the Zoloft talking.

‘The forecaster is like an entrepreneur,’ says Roman Frydman. ‘He uses quantitative methods, but he also studies history, and relies on intuition and judgment. He is not a scientist.’

The quote refers specifically to economics forecasters, but it applies just as well to any forecaster of complex systems, such as those who forecast climate changes.

Predictions are not science. No way, no how. And unfortunately, bad predictions (which are more common than good ones) come with little cost. As a result, they’re dreadfully oversupplied. This doesn’t mean that predictions should be ignored, but nor should they be elevated to anything more than they are: guesses–and often very crude ones at that.

On a semi-related note, one way to increase the accuracy of a forecast is through the use of prediction markets. More on that later, perhaps.

I only have an elementary grasp of statistics, but I know enough to become frustrated when I read some article proclaiming that the results of some test were “statistically significant,” as if that statement was, well, significant. Not only is it often insignificant (as the word is commonly used), but it is also an incomplete statement.

Say we have a woman, Ms. T, who claims to have a palate so precise as to be able to discern whether her tea was made by pouring hot water over the tea bag or whether the tea bag was added to the hot water. If we wanted to test her claim, basic statistical methodology would be (roughly):

Form a hypothesis, such as “Ms. T cannot tell the difference in how her tea is prepared.” This hypothesis, called the null hypothesis, is formed for the express purpose of being disproved.

Form an alternative hypothesis, such as “Ms. T can tell the difference in how her tea is prepared,” in the event the null hypothesis is rejected.

Set up an experiment, such as one where Ms. T is blindfolded, given fifty cups of tea prepared either by adding the tea bag or the hot water first, and asked to identify which method was used for each cup.

Decide how improbable Ms. T’s results must be before we stop attributing them to lucky guessing. These thresholds are called significance levels: each is a cutoff for the probability that, given our null hypothesis (she can’t tell the difference), results at least as good as hers could have happened merely by chance. (Standard values are 5%, 1% and 0.1%, though there’s no particular reason for this.) If our results have a probability lower than the chosen level, we reject the null hypothesis and take that as support for the alternative hypothesis, because we believe the results are too improbable to be just coincidence. If the probability is higher than the chosen level, we fail to reject the null hypothesis.

Compare Ms. T’s results with the results of a group of “normal” people. Let’s say that we find that Ms. T was correct in so many of the taste tests (49 out of 50) that, compared to a normal population, the chance of her randomly guessing correctly that many times was only 3%.

What happens now? Well, remember, our significance levels were 5%, 1% and 0.1%. Since the actual probability was determined to be 3%, our results are “statistically significant” at the 5% level, but are not significant at the 1% and 0.1% levels. “Significant” in statistics does not mean “important” or “considerable”, but rather “indicative” or “expressive.” In plain English, all our results have shown is that, if Ms. T were merely guessing, a score as good as hers would turn up only 3% of the time; because that is below the 5% threshold we chose, we reject the null hypothesis at that level.
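For readers who want to see the comparison mechanically, here is a minimal sketch. It assumes a simpler null than the "normal population" comparison described above: that Ms. T guesses each cup like a fair coin flip (`p_guess=0.5`), so the probability of a score at least as good as hers comes from the binomial distribution. The 3% in this post is a hypothetical figure; under the coin-flip assumption a score near 32 of 50 lands at roughly that level, while 49 of 50 would be far rarer.

```python
from math import comb

def p_value(cups, correct, p_guess=0.5):
    """Chance of at least `correct` right answers out of `cups` by guessing."""
    return sum(
        comb(cups, k) * p_guess**k * (1.0 - p_guess) ** (cups - k)
        for k in range(correct, cups + 1)
    )

LEVELS = (0.05, 0.01, 0.001)  # the 5%, 1% and 0.1% levels from the text

def verdicts(p):
    """Map each significance level to whether p falls below it."""
    return {level: p < level for level in LEVELS}

for correct in (30, 32, 35, 49):
    p = p_value(50, correct)
    print(f"{correct} of 50: p = {p:.3g}, significant at {verdicts(p)}")
```

The same score can be "significant" at one level and not at another, which is why a bare claim of significance is incomplete without the level attached.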

But remember, the 5% significance level we selected is purely arbitrary; some others might think that 0.1% is a more appropriate threshold, that we should only reject our null hypothesis if there is less than one chance in a thousand Ms. T’s results could have happened by coincidence. Furthermore, some researchers have been found to determine the significance levels after they’ve conducted the experiment; that way, they can choose a level ex post that makes their results “significant.”

At best, most journalists’ claims about statistical significance are incomplete because they fail to include the benchmark that determines significance, and at worst, they grossly misinterpret “statistically significant” as meaning “statistically important.”

I find the philosophy behind this type of statistical method to be interesting. I might write a future post on how we subscribe to the same philosophy when determining the burden of proof necessary to convict a criminal. If nothing else, I trust this post has convinced readers that journalists have the third type of lie down pat, whether they realize it or not.