George Box once said “You have a big approximation and a small approximation. The big approximation is your approximation to the problem you want to solve. The small approximation is involved in getting the solution to the approximate problem.” In a similar vein, John Tukey said “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

I’ve never been entirely convinced by these statements. They have the ring of nice soundbites (especially when polished up, for example to “An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question”) but it seems to me that the critical thing is the accuracy of both approximations.

Nonetheless, the underlying point, that people should think carefully about the problem they actually want to solve, holds good. Researchers should not expend energy answering the “wrong question” unless they are confident that it is near enough to the right one.

A particularly simple example of this is whether to use the mean or the median to summarise a set of data. Since these statistics are different, they naturally have different properties. Indeed, as all statisticians will know, it’s possible for one of two groups to have a higher mean but a lower median than the other group. Changes in the extreme values will affect the mean, but not the median. So, for example, one can make the sample mean as large as one likes by increasing the single largest value enough, while the median remains unchanged. If the world’s richest person’s wealth increases sufficiently while everyone else’s declines, the mean wealth goes up, while the median wealth decreases.
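This behaviour is easy to verify with a toy calculation. The data below are entirely hypothetical, chosen only to show the mean chasing a single extreme value while the median stays put:

```python
import statistics

# Hypothetical wealth data (arbitrary units): mostly modest values
# plus one large one.
wealth = [1, 2, 2, 3, 3, 4, 100]

print(statistics.mean(wealth))    # ≈ 16.43, pulled up by the outlier
print(statistics.median(wealth))  # 3

# Inflate the single largest value: the mean grows without bound,
# but the median does not move at all.
wealth[-1] = 1_000_000
print(statistics.mean(wealth))    # ≈ 142859.3
print(statistics.median(wealth))  # still 3
```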

For such reasons, it’s common to see statements to the effect that the choice should depend on how the data are distributed, and that the mean is a better measure of “location” than the median for symmetric distributions, while otherwise the median should be used.

But this is an oversimplification—and you will note that that description of which measure is appropriate made no reference at all to the question being asked.

If, as an employer, I choose the remuneration of new recruits randomly from a pronouncedly skewed distribution of salaries, then the average which will interest me is the mean of the distribution: those receiving large salaries will be compensated for by the larger number receiving small salaries, and my total wage bill is the product of the number of employees and their mean salary. In contrast, a potential new recruit considering joining the firm will be interested in the median salary. To her the mean is of little interest, since she is very likely to receive substantially less than that.
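The employer/recruit contrast can be made concrete with a small hypothetical salary sample (the figures below are invented for illustration). The total wage bill is exactly the number of employees times the mean, yet most employees earn less than the mean, so the median is the better guide to what a typical recruit should expect:

```python
import statistics

# Hypothetical right-skewed salaries (thousands): many modest
# salaries and a few large ones.
salaries = [20, 22, 25, 25, 28, 30, 35, 40, 90, 185]

mean_salary = statistics.mean(salaries)      # 50.0
median_salary = statistics.median(salaries)  # 29.0

# The employer's total wage bill is exactly n * mean.
assert sum(salaries) == len(salaries) * mean_salary

# But most employees earn less than the mean, so the median is the
# more informative "typical" salary for a prospective recruit.
below_mean = sum(s < mean_salary for s in salaries)
print(mean_salary, median_salary, below_mean)  # 50.0 29.0 8
```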

The distribution has the same shape in each case, but the appropriate average depends on what one wants to know.

I picked the mean/median example because it was the simplest example I could think of, but the principle is ubiquitous: the choice of statistical method depends on the question you want to answer.

Correlation coefficients are another very widely used basic summary statistic. The Pearson product-moment coefficient is known to be a measure of the strength of linear relationship. Often, however, one wants a weaker measure of relationship—perhaps merely a measure of strength of monotonic relationship. Correlation coefficients for this are sometimes called nonparametric measures of correlation, and they are invariant to monotonic increasing transformations of the two variables involved. The Spearman coefficient is an example. This works by transforming the observed values to ranks and calculating the Pearson coefficient of the ranked data. That’s equivalent to transforming the raw data to uniform scores, before applying the Pearson measure. But the choice of a uniform distribution here is arbitrary—or at least, in almost all the applications I have encountered, it’s arbitrary. No-one has told me why, for their problem, they believed uniform scores were appropriate, rather than, for example, normal scores, or scores derived from some other distribution. Unfortunately, the derived value of the Pearson coefficient will depend on the chosen distribution. What this means is that the value of the coefficient is using arbitrary “information” that the researcher has injected into the calculation, not merely the information in the data.
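The dependence on the chosen scoring distribution can be seen directly. The sketch below (with made-up data and a from-scratch Pearson computation, to keep it self-contained) computes the Spearman coefficient as the Pearson coefficient of the ranks, then repeats the recipe with normal scores, mapping rank $i$ to $\Phi^{-1}(i/(n+1))$; the two coefficients come out different:

```python
from statistics import NormalDist

def pearson(x, y):
    """Pearson product-moment correlation, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(x):
    """Ranks 1..n (assuming no ties, for simplicity)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

# Hypothetical monotonically related data.
x = [1, 2, 4, 8, 20, 50]
y = [3, 5, 4, 10, 12, 40]

rx, ry = ranks(x), ranks(y)
spearman = pearson(rx, ry)  # Pearson applied to uniform (rank) scores

# The same recipe with normal scores instead of uniform scores.
nd = NormalDist()
n = len(x)
nx = [nd.inv_cdf(r / (n + 1)) for r in rx]
ny = [nd.inv_cdf(r / (n + 1)) for r in ry]
normal_score_corr = pearson(nx, ny)

print(spearman, normal_score_corr)  # the two values differ
```

Both measures are invariant to monotonic increasing transformations of `x` and `y`, but they disagree with each other: the value reported depends on the scoring distribution the analyst injected.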

To overcome this, we need to step back and think more carefully about how the invariance to monotonic transformations of the two variables is achieved. The Spearman coefficient does it by mapping to a standard representation, but an alternative approach would be to base one’s measure solely on the ordinal properties of the data. A measure which does this is the Kendall coefficient. This measure thus sidesteps the intrinsic arbitrariness implicit in the Spearman measure. Again the two measures are different, with different properties, and which is appropriate depends on what you want to know.
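A minimal sketch of the Kendall coefficient (the tau-a version, assuming no ties, on the same invented data) makes the point: it counts concordant and discordant pairs, so it uses only the orderings, and any monotonic increasing transformation of either variable leaves it exactly unchanged:

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Assumes no ties."""
    conc = disc = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    n = len(x)
    return (conc - disc) / (n * (n - 1) / 2)

# Hypothetical monotonically related data.
x = [1, 2, 4, 8, 20, 50]
y = [3, 5, 4, 10, 12, 40]

tau = kendall_tau(x, y)

# Monotonic increasing transformations (log, cube) change the values
# but not their orderings, so tau is exactly unchanged.
tau2 = kendall_tau([math.log(v) for v in x], [v ** 3 for v in y])
assert tau == tau2
print(tau)
```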

As Box, Tukey, and other great statisticians have pointed out, it is critical in a statistical analysis to make sure you solve the right problem.

David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He serves on the Board of the UK Statistics Authority. He is a Fellow of the British Academy, and a recipient of the RSS Guy Medal. He was made OBE for services to research and innovation.
