11 Ways To Lie With Statistics

Written by a non-statistician in hokey language and illustrated by humorous line drawings, How To Lie With Statistics is as relevant and enjoyable as when it first appeared in 1954.

Indeed, the book remains a best seller even though some examples are out of date, like the salaries of Yale graduates and the price of bananas.

But the tricks described by Darrell Huff, from misleading charts to misuse of averages, are still used today. “Many a statistic is false on its face. It gets by only because the magic of numbers brings about a suspension of common sense,” Huff says.

When numbers appear, the reader believes some truth is about to be imparted. Even a nonsensical statement such as this carries the air of authority until the meaning sinks in.

Yes, using statistics to lie is easy - as you will soon see. And yet statistics are a valid and useful tool. They allow us to describe and learn about the way we live in every sense. They offer perspective on the past and make the future predictable - or, at least, feel that way.

And, yes, statistics can be used to manipulate, obfuscate, sensationalize, and confuse. It will be clear to anyone who reads on just how simple it is for anyone to learn to do all of that and more.

We start with the statistical sample: the statistician's best friend for good or evil

Samples are, by definition, incomplete pictures of the whole. How much of the whole they capture is the question. When a sample is large enough and selected properly, it tells us something.

The basic sample is called 'random.' As its name suggests, it is formed by chance from the 'universe,' that is, the whole of which the sample is part. Everyone in the universe must have an equal chance of landing in the sample. Such a sample is expensive and difficult to obtain.

So, more often than not, we accept a 'stratified random sample.' Here, the universe is divided into groups, and the sample draws from each group in proportion to its share of the universe. Hard to do accurately when you want to; too easily biased if you want it to be.
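As a rough sketch of the difference, here is a toy universe of 1,000 people. The group labels and proportions are invented for illustration, not drawn from Huff's book:

```python
# Toy universe: 80% hourly workers, 20% salaried (made-up proportions)
import random

random.seed(0)
universe = ["hourly"] * 800 + ["salaried"] * 200

# Simple random sample: every member has an equal chance of selection,
# so the group mix varies from draw to draw
simple = random.sample(universe, 100)

# Stratified random sample: draw from each group in proportion to its
# share of the universe, so the mix is fixed by construction
strata = {"hourly": 800, "salaried": 200}
stratified = []
for group, size in strata.items():
    members = [p for p in universe if p == group]
    stratified += random.sample(members, size * 100 // len(universe))

print(simple.count("salaried"))      # varies around 20
print(stratified.count("salaried"))  # exactly 20, by construction
```

The stratified version guarantees the right proportions - which is exactly why it is easy to bias: whoever defines the strata decides what "proportional" means.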

Samples are based on responses, which reveal either the truth or the airbrushed version of who we wish we were.

When samples rely on people to tell the truth about themselves, we learn more about what they want to be than who they really are.

You may recall the study that showed some extraordinarily high number of Americans reported washing their hands after using the bathroom. Reporters staked out public restrooms far and wide and came away with a far lower percentage of actual post-washroom washing.

Why? People have always tended to respond with what will please the one asking the question (who wants to say they don't wash?), will offend the poll taker least (studies show the gender or race of the one asking the questions greatly affects the answers given), or will make them look best (self-reported income tends to be far higher than actual).

Also implicit in all statistics based on sampling are the probable error and the standard error, both of which express a measure of reliability -- without it, the number is meaningless

This means any sampled figure is really a range, though some either ignore this fact or try to use it to say something that isn't there.

Ignoring? Let's say your IQ is said to be 130 and your spouse's 128. As much as you would like to speak in slow, simple sentences at home, the range says there is no difference between the two.
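A toy sketch of the IQ example. The probable error of 3 points here is an assumption for illustration, not a figure from the book:

```python
# Two error bands overlap when the scores are within twice the
# probable error of each other (assumed error of 3 points)
def ranges_overlap(score_a, score_b, error):
    """True if the error bands around two scores overlap."""
    return abs(score_a - score_b) <= 2 * error

print(ranges_overlap(130, 128, 3))  # True: no real difference
print(ranges_overlap(130, 120, 3))  # False: a difference that makes a difference
```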

On the other side of that coin, let's say 10 companies are all found to use too much packaging material. A list is presented in which all of them are shown to use what environmentalists consider to be too much. Yet, the company at the bottom - by a hair - might still step up and herald itself as the Green Company of choice!

'A difference is a difference only if it makes a difference.'

When the sample is too small to speak to anything, it allows you to say what you want to say without pesky facts getting in the way

Flip a coin four times. Will you get the mythical 50%? Probably not, but maybe. That may suffice for a coin toss. For a medical decision or the validity of a scientific study, we demand more proof.
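A minimal simulation makes the point (seeded so the run is repeatable; the exact small-sample figure is luck of the draw):

```python
# Small samples wander; large samples settle near the true rate
import random

random.seed(42)

def heads_fraction(n_flips):
    """Fraction of heads in n_flips tosses of a fair coin."""
    return sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips

small = heads_fraction(4)        # could be 0.0, 0.25, ... anything
large = heads_fraction(100_000)  # reliably close to 0.5
print(small, large)
```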

If we don't know the degree of significance of a given number - how representative (or not) a sample is - we don't know how likely it is that the test or sample figure represents a real result rather than one produced by chance. And that is the very definition of inadequate.

An average is a single value meant to typify a list of values: there are three types - mean, median, and mode - and they 'typify' in very different ways

The mean average is the one you most commonly think of when you hear the word average, and advertisers and others rely on that familiarity. You arrive at the mean by adding a group of numbers, then dividing by the number of items you've just added together.

Huff suggests this way of misusing the mean: A real estate agent wants to be able to say a neighbourhood has a high average income. The neighbourhood in question is mostly farmers and hourly-wage workers. There are three families, though, who are millionaire weekenders. The mean will serve the broker's wish for a higher number because the wealthy few pull it considerably higher. Of course, it will not paint a particularly accurate portrait, but it gets the job done.
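The broker's trick can be sketched in a few lines of Python. The income figures are made up for the sketch, not Huff's:

```python
# Twenty modest households plus three millionaire weekenders
incomes = [30_000] * 20 + [1_000_000] * 3

mean = sum(incomes) / len(incomes)
median = sorted(incomes)[len(incomes) // 2]  # middle of 23 sorted values

print(round(mean))  # ~156,522: the broker's "average income"
print(median)       # 30,000: what a typical family actually earns
```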

If the mean average does not say what you wish, the median average - the numerical value separating the higher half of a sample from the lower - might better suit your purpose

The median average is most often used when there is a great range among the numbers being considered. It describes what is typical about a group of values.

The misuse of the median would go a little something like this:

You have seven dogs and want the co-op board to approve you anyway. Foolishly, they require no meeting of canines or photos. You have three Newfoundlands and four Yorkshire Terriers. You say to the board: the average weight of the dogs is 8 pounds! And I promise to always use the service elevator. The mean average (add the weights and divide by 7) would not work for you here. You are referring to the median: the point at which there are as many dog weights on one side as on the other. And the median is, yes, 8 pounds. The fact that three of the dogs tip the scales at 150 (not counting the drool) is glossed over.
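A quick sketch of the dog ruse, assuming 150 pounds per Newfoundland and 8 per Yorkie:

```python
# Three Newfoundlands at 150 lb, four Yorkshire Terriers at 8 lb
weights = [150, 150, 150, 8, 8, 8, 8]

mean = sum(weights) / len(weights)           # not co-op friendly
median = sorted(weights)[len(weights) // 2]  # the number you quote

print(round(mean, 1))  # ~68.9 lb
print(median)          # 8 lb
```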

The mode is the number which appears most frequently in a group of numbers - it says the least and is most ripe for abuse

A perfectly appropriate use of the mode would be if you were looking for the most common girl's name in NYC. The one that appears most frequently is the mode.

A less-than-kosher use would be if you went out and played 9 holes of golf and came home boasting an average score of 4. You know in your heart that your scores were 4, 4, 7, 10, 4, 11, 6, 8, 4. Only the mode is going to give you any kind of bragging rights. Whether you avail yourself of it is another matter entirely.
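Using the golf scores above, Python's statistics module shows how far the mode strays from the mean:

```python
from statistics import mean, mode

scores = [4, 4, 7, 10, 4, 11, 6, 8, 4]

print(mode(scores))            # 4: the bragging number
print(round(mean(scores), 1))  # ~6.4: closer to the truth
```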

Meet the Average Family -- Mean, Median, and Mode. If you want, they will lie to you just as soon as look at you.

Graphs can blur, exaggerate and hide because the eye doesn't 'understand' what isn't there

When information is plotted on a graph, it can be either a true and clear show of the facts or a narrow enough slice of them to make a different point entirely. It is all in what you choose to include and how you choose to show it.

For example, when there is a zero at the bottom of a graph (for comparison), and 10% looks like 10%, the trend (up, down, flat) is in proportion. A mere glance offers an easy-to-read and accurate sense of things.

Looking for a little more shock value? Chop off the bottom of that same graph, don't start from zero, and, voila, the exact same graph looks much more extreme.
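A crude text-mode rendering (with made-up sales figures) shows the effect of chopping off the baseline:

```python
# The same three numbers rendered against two baselines
sales = {"Jan": 50, "Feb": 51, "Mar": 52}

def bar(value, baseline):
    """One text-mode bar: its length is the distance above the baseline."""
    return "#" * (value - baseline)

# Honest version: baseline at zero, bars of 50, 51, 52 - a modest rise
for month, v in sales.items():
    print(month, bar(v, 0))

# "Shocking" version: start the axis at 48 and the same data shows
# March at twice the length of January
for month, v in sales.items():
    print(month, bar(v, 48))
```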

Or, throw caution to the wind and change the proportions between the abscissa (x) and the ordinate (y). There are no rules to these things, and the graph suddenly has a very different message indeed. Numbers that slope comfortably now thunder upward like thoroughbreds on Derby Day.

A terrific example Huff offers in the book: the claim that it is safer to drive at 7am than at 7pm, because four times more accidents occur at the latter time.

The problem? The relevant fact is not the time of day, it is the fact that there are more drivers out on the road at the later time. More drivers, more miles covered, more time and distance in which an accident might occur.

If A, then B, therefore C: correlation comes in many flavours and some play fast and loose with the truth

For example, chance - pure and simple - might give you the results you need to say what you want to say. Two utterly unrelated events occur by chance, the result works, and the undiscerning eye will never know otherwise.

Or, two related events happen, you just aren't sure which is the cause, and which the effect. A chicken-egg quandary that allows you to dub one the cause, the other effect, as suits your purpose.

Or, you would like to conclude something that goes beyond the scope of the data. It rains, crops grow. Rain is always good for crops. Well, except too much rain is not good for crops. If you'd like to stop with 'rain is good,' there is nothing stopping you.

The trickiest of them all, the one most often used to spurious effect: two events have zero effect on each other, yet there is a very real correlation between them. Huff offers this example: suicide rates are highest in June. Do suicides produce June brides? Or do June weddings produce suicidal urges? Or, unrelated and more likely the case: the very depressed soul who makes it through the winter on the promise of a better spring, finds that it hasn't been the case, and proceeds to the sad next step.