inference

which gets its data from an article that surveyed British army recruits 100 years ago versus now:

“In the study of British recruits, the average height of British men, who had an average age of 20, was about 5 feet 6 inches (168 centimeters) at the turn of the century, whereas now they stand on average at about 5 feet 10 inches (178 cm). The increase can be attributed, most likely, to improved nutrition, health services and hygiene, said the researchers from the University of Essex in Colchester.”

Well, the question is whether there is anything here that needs attributing, or whether it is just a misinterpretation of the data. The smell of a rat begins to emerge once you realize that recruits into the British army are not your average Joes. They tend to be young, strong, athletic and perhaps tall men.

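Let’s start with some data. The male population of England and Wales in 1911 was about 17,000,000 and currently it is about 29,000,000. The British armed forces, meanwhile, have dropped from about 500,000 in 1900 to about 180,000 now. In other words, the forces have gone from roughly 3% of the male population to well under 1%, so the army is now a much smaller slice of a much larger pool.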

Now for some simplifying assumptions, just to illustrate what can happen when you analyze this data. Suppose heights have the usual bell-shaped distribution with an average height of 162cm and a standard deviation (variability) of 10cm. So, roughly speaking, heights are spread from 120cm to 200cm (a decent approximation). Also, clearly, it’s not the small and weak who will try out and get recruited into the British armed forces, heavens no. Suppose it’s tall men that are likely to be in the armed forces.

Perform a little experiment. Generate 17,000,000 heights in a bell-shaped distribution with mean 162cm and standard deviation 10cm; this represents the population of men in Britain around 1900. Start with the tallest man and add him to the army with some probability, say 50%. Keep going down the list from tallest to shortest until you fill the recruitment quota of 500,000. This little experiment simulates building the British Army in 1900. By going from tallest to shortest you are making it more likely that taller men get into the army. It’s a simple experiment, do it. Now compute the average height of the men in the army. You will get approximately 168cm. Wow! So the men in the army in 1900 are about 6cm taller than the average man. (That is called sampling bias.)

Let’s perform the same experiment today, but without changing the population statistics. So generate 29,000,000 heights in a bell-shaped distribution with mean 162cm and standard deviation 10cm. Again, start with the tallest and pick him with the same probability, 50%, and keep going until you have today’s army of 180,000. What is the average height of today’s army? About 178cm. WOW!
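If you would rather not do the experiment by hand, here is a minimal sketch of it in plain Python. The function name `army_mean_height`, the `SCALE` factor and the seeds are my own choices for illustration. Scaling both the population and the quota down by 100 keeps the army the same fraction of the population while letting the script run in seconds. The exact averages it prints depend on the assumed mean, spread and pick probability, and need not match the figures quoted above. What to look for is the gap between the two armies when nothing about the population has changed.

```python
import random

def army_mean_height(population, quota, mean=162.0, sd=10.0, p_pick=0.5, seed=0):
    """Simulate tallest-first recruitment; return the army's mean height in cm."""
    rng = random.Random(seed)
    # One height per man, drawn from the assumed bell-shaped distribution.
    heights = [rng.gauss(mean, sd) for _ in range(population)]
    heights.sort(reverse=True)  # tallest man first
    army = []
    for h in heights:
        if len(army) == quota:
            break  # recruitment quota filled
        if rng.random() < p_pick:  # each man is picked with probability p_pick
            army.append(h)
    return sum(army) / len(army)

# Assumption: shrink population and quota by the same factor to keep this fast.
SCALE = 100
then_cm = army_mean_height(17_000_000 // SCALE, 500_000 // SCALE, seed=0)
now_cm = army_mean_height(29_000_000 // SCALE, 180_000 // SCALE, seed=1)

print(f"1900 army mean height:    {then_cm:.1f} cm")
print(f"today's army mean height: {now_cm:.1f} cm")
print(f"apparent 'growth':        {now_cm - then_cm:.1f} cm")
```

Set `SCALE = 1` to run the full-size experiment; it takes noticeably longer in pure Python but the averages barely move.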

Stop. Go back to the quote from the article and see what you think of it now. Is this some voodoo statistical fluke? No. If you repeat the experiment, you will get the same result again and again. Is it magic? No. It is the power of sampling bias. We didn’t change the population: same average height. We didn’t even change the way the army is ‘sampled’. The experiment is clearly simplistic, but that is not the point. It is just one way to obtain a sampling of taller men for the army, and it is a reasonable attempt. Of course, this is not exactly what is going on because we made some modeling assumptions, but you get the idea. The golden nugget here is:

Be careful when you reason about a population from a biased sample.

Recruits into the British army are, most likely, a sample biased toward taller men. Very strange things can happen when you take averages from a biased sample. The small bias in our simple experiment above explains the entire 10cm of the height difference without needing to resort to genetics, health, hygiene or wealth. Perhaps there is some truth to health and hygiene leading to bigger, better, taller people, but to get to this conclusion one must reason from a random unbiased sample. Unfortunately, the reality is that most such social experiments are done with conveniently available data (such as British recruits whose bio-data happen to be collected), not theoretically sound data. And often, conclusions are made without paying attention to the biases in the data.
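To see why repeating the experiment keeps giving the same answer, you can skip the random numbers altogether. Under the same assumptions (normal heights, mean 162cm, standard deviation 10cm, 50% pick probability), the army is a random half of the tallest slice of the population, so its average is just the mean of the upper tail of a normal distribution. The sketch below uses the standard formula for that mean; the helper name `mean_of_tallest_fraction` is mine. As before, the printed values depend on the assumed parameters and may not match the figures quoted in the text.

```python
from statistics import NormalDist

def mean_of_tallest_fraction(q, mean=162.0, sd=10.0):
    """Mean height (cm) of the tallest fraction q of a normal population."""
    std = NormalDist()         # standard normal: mean 0, sd 1
    z = std.inv_cdf(1.0 - q)   # cutoff, in standard deviations above the mean
    # Mean of a normal truncated from below at z: mean + sd * pdf(z) / q
    return mean + sd * std.pdf(z) / q

P_PICK = 0.5  # each man scanned is picked with probability 50%

# Fraction of the population scanned from the top before the quota fills.
q_then = (500_000 / P_PICK) / 17_000_000
q_now = (180_000 / P_PICK) / 29_000_000

print(f"1900:  tallest {q_then:.1%} scanned, army mean {mean_of_tallest_fraction(q_then):.1f} cm")
print(f"today: tallest {q_now:.1%} scanned, army mean {mean_of_tallest_fraction(q_now):.1f} cm")
```

The only thing that differs between the two lines is how deep into the population the army has to reach, and that alone moves the average.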

Effective learning from data should adhere to 3 basic principles (see Chapter 5 of Learning From Data), the second of which is, and I quote:

“If the data was sampled in a biased way, learning will produce a similarly biased outcome”

In short, if there are biases in the way the data is collected, then anything can happen. BEWARE.