It shouldn’t come as a surprise that psychological studies on “priming” may have overstated the effects. It sounds plausible, for example, that thinking about words associated with old age might make someone walk more slowly afterwards, but like many similar effects, this one has proved nearly impossible to replicate.

Now Ulrich Schimmack, Moritz Heene, and Kamini Kesavan have dug a bit deeper into this, in a post at Replicability-Index titled “Reconstruction of a Train Wreck: How Priming Research Went off the Rails”. They analysed all the studies cited in Chapter 4 of Daniel Kahneman’s book “Thinking, Fast and Slow”. I’m a big fan of the book myself, so this was interesting to read.

I’d recommend that everyone with even a passing interest in these things go and read the whole fascinating post. I’ll just note the authors’ conclusion: “…priming research is a train wreck and readers […] should not consider the presented studies as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.”

The irony is pointed out by Kahneman himself in his response: “there is a special irony in my mistake because the first paper that Amos Tversky and I published was about the belief in the “law of small numbers,” which allows researchers to trust the results of underpowered studies with unreasonably small samples.”

The American Statistical Association (ASA) has published its “statement” about p values. I have long held fairly strong views about p values, also known as “science’s dirtiest secret”, so this is exciting stuff for me. The process of drafting the ASA statement involved 20 experts, “many months” of emails, one two-day meeting, three months of draft statements, and was “lengthier and more controversial than anticipated”. The outcome is now out in The American Statistician, with no fewer than 21 discussion notes to accompany it (mostly by people involved from the start, as far as I can gather).

The statement is made up of six principles:

1. P-values can indicate how incompatible the data are with a specified statistical model.

2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

I don’t think many people would disagree with much of this. I was expecting something a bit more radical – the principles seem fairly self-evident to me, and don’t really address the bigger issue of what to do about statistical practice. That question is addressed in the 21 comments though.
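One of the principles – that a p-value does not measure the size or importance of an effect – is easy to demonstrate with a quick simulation (all numbers here are made up purely for illustration): a trivial effect in a huge sample can give an essentially zero p-value, while a large effect in a tiny sample may not reach significance at all.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def z_test_p(sample, mu0=0.0):
    """Two-sided z-test p-value for H0: mean == mu0 (normal approximation)."""
    n = len(sample)
    z = (sample.mean() - mu0) / (sample.std(ddof=1) / sqrt(n))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Trivial effect (true mean 0.02), enormous sample: p is essentially zero
trivial = rng.normal(0.02, 1.0, size=1_000_000)
# Large effect (true mean 0.8), small sample: p may well exceed 0.05
large = rng.normal(0.8, 1.0, size=8)

print(f"trivial effect, n = 10^6: p = {z_test_p(trivial):.2e}")
print(f"large effect,   n = 8:    p = {z_test_p(large):.3f}")
```

The p-value tracks sample size as much as it tracks anything about the effect itself, which is exactly why it cannot serve as a measure of importance.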

It probably says something about the topic that it needs 21 comments. And that’s also where the disagreements come in. Some note that the principles are unlikely to change anything. Some point out that the problem isn’t with p-values themselves, but the fact that they are misunderstood and abused. The Bayesians, predictably, advocate Bayes. About half say updating the teaching of statistics is the most urgent task now.

So a decent statement as far as it goes, in acknowledging the problems. But not much in the way of constructive ideas on where to go from here. Some journals have banned p-values altogether, which sounds like a knee-jerk reaction in the other extreme direction. I’d just like to see poor old p’s downgraded to one of the many statistical measures to consider when analysing data. Never the main one, and definitely not the deciding factor on whether something is important or not. I may have to wait a bit longer for that day.

The UK Department of Health released their long-awaited updated guidelines on safe levels of alcohol consumption last week. The headline news was a small reduction in what is recommended as “safe” drinking for men, and that the limits are now the same for women and men. The predictable consequence was an avalanche of moaning. Pick any newspaper from that week and you will find a column inviting “the nanny state government” to come and take the poor oppressed journalist’s last bottle of wine, but only from their cold, dead hands. Yes, there was also some intelligent commentary on the findings and methodology of the review behind the new guidelines (David Spiegelhalter, The Stats Guy), but mostly it was all predictably sad, confused, and badly informed.

The reasons are not at all mysterious. (1) Alcohol is a touchy subject; most people in the UK drink, and any talk about the harms is taken as an accusation of their personal choices. (2) Risks are difficult to understand (and in this case they are also very difficult to estimate – the medical evidence itself is still far from settled). And as a consequence, (3) Risk information is difficult to communicate well.

There is not much that can be done about the first two. Which means that the third issue becomes even more important. I’m not sure how much better the publicity for this particular announcement could have been, but surely it can’t have come as a surprise that it was so badly received.

And for every journalist and pub commentator saying that their drinking never did them any harm, why not take a moment to consider the bigger picture? A small change in cancer risk might be acceptable to you personally, but from a public health perspective, looking at 50 million people in the UK, even a small reduction in alcohol consumption, and therefore in the number of future cancer cases, will mean enormous savings for the health care system. That’s what the guidelines are all about.
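The arithmetic behind that point is simple enough to spell out. With entirely made-up illustrative numbers (not taken from the review):

```python
population = 50_000_000        # rough UK population, as in the text
risk_before = 0.010            # hypothetical lifetime risk of an alcohol-related cancer
risk_after = 0.009             # hypothetical risk after a small cut in drinking

# a 0.1 percentage point change, invisible to any one individual...
cases_avoided = population * (risk_before - risk_after)
print(f"{cases_avoided:,.0f} fewer cancer cases")  # ...is about 50,000 fewer cases
```

A difference no individual would ever notice adds up, across a whole country, to tens of thousands of cases.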

The following “bibliography” should be a complete listing of all my published academic writings from 1998 to 2004. I’m posting it here both for the world to see, and as a handy place where I can find it myself if I ever need it again! I’ve added brief comments for some of the items; sometimes just to remind myself what the papers were all about. I’ve also included links to everything that is available online (most of the old working papers now only via web.archive.org). Outliers vs. nonlinearity in time series econometrics is the main theme here, and there are also several papers on long memory in the form of fractional integration. My non-academic writings, including a rock gig review at Rumba, to follow some other day perhaps!

This is my economics PhD dissertation, which contains an introduction and four articles: three published ones (in Communications in Statistics, Finnish Economic Papers and Applied Financial Economics), and an unpublished one analysing the impact of outliers on ARFIMA model estimation, with a simple robust two-stage estimation method.

A simulation study, showing how even a single outlier in a time series of 500 observations can seriously distort some commonly used tests for nonlinearity (ARCH and bilinearity tests here). Previous work had only considered more frequent outliers – this paper shows that the number of outliers can be very small, and the adverse impact still significant.
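To give a flavour of the mechanics (this is a rough sketch, not the paper’s actual simulation design): the ARCH(1) LM test regresses squared residuals on their first lag, and under the null of no ARCH the statistic n·R² is asymptotically chi-squared with 1 degree of freedom. A single large additive outlier in 500 white-noise observations can move the statistic well away from what the clean series gives.

```python
import numpy as np

def arch_lm_stat(e):
    """ARCH(1) LM test: regress e_t^2 on e_{t-1}^2. Under the null of no
    ARCH, n * R^2 is asymptotically chi-squared with 1 degree of freedom."""
    s = e ** 2
    y, x = s[1:], s[:-1]
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - resid.var() / y.var()
    return len(y) * r2

rng = np.random.default_rng(42)
clean = rng.normal(size=500)   # white noise: no ARCH by construction
dirty = clean.copy()
dirty[250] += 10.0             # a single additive outlier

# the 5% critical value of chi-squared(1) is about 3.84
print(f"LM statistic, clean series:      {arch_lm_stat(clean):.2f}")
print(f"LM statistic, one outlier added: {arch_lm_stat(dirty):.2f}")
```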

Evaluating the impact of outliers on macroeconomic time series analysis. Conclusion: outliers can have a significant impact, and their treatment should always be carefully considered. I’m afraid I have yet to come to a completely satisfying conclusion about the best way of handling outliers in empirical work.

First, a simulation study showing that the presence of outliers will bias time series (fractional integration) long memory estimates towards zero. An empirical example then shows that long memory is detected in stock market data more often if outliers are first taken into account.
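As a sketch of the mechanism (again, not the paper’s actual setup): the Geweke–Porter-Hudak (GPH) log-periodogram regression estimates the long memory parameter d, and additive outliers act like extra white noise, which lifts the high-frequency end of the periodogram and tends to pull the estimate towards zero.

```python
import numpy as np

def frac_noise(n, d, rng, burn=500):
    """Fractionally integrated noise via a truncated MA(infinity) expansion."""
    m = n + burn
    psi = np.empty(m)
    psi[0] = 1.0
    for j in range(1, m):
        psi[j] = psi[j - 1] * (j - 1 + d) / j
    eps = rng.normal(size=m)
    return np.convolve(eps, psi)[:m][burn:]

def gph_d(x):
    """GPH log-periodogram estimate of d, using about sqrt(n) frequencies."""
    n = len(x)
    m = int(np.sqrt(n))
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    # periodogram at the first m Fourier frequencies
    I = np.abs(np.fft.fft(x - x.mean())[1 : m + 1]) ** 2 / (2 * np.pi * n)
    X = np.log(4 * np.sin(lam / 2) ** 2)
    slope = np.polyfit(X, np.log(I), 1)[0]
    return -slope   # log I(lam) = c - d * log(4 sin^2(lam/2)) + error

rng = np.random.default_rng(3)
x = frac_noise(2000, d=0.3, rng=rng)
y = x.copy()
y[rng.integers(0, 2000, size=10)] += 10.0   # ten large additive outliers

print(f"d estimate, clean series:  {gph_d(x):.2f}")
print(f"d estimate, with outliers: {gph_d(y):.2f}")
```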

Fractional integration long memory models are used to estimate a measure of unemployment persistence for different labour force groups. The results show that unemployment is less persistent for females and young people than for males and the entire labour force.

An empirical assessment of the presence of long memory in Finnish stock market data. Depending on the testing method used, statistically significant long memory is detected in 24% to 67% of the series, which is considerably more than what is usually found in data of this kind. This article is based on a working paper with some additional results (see below).

Possibly my best idea, and also the most cited thing I’ve published. Proposes a new method for simultaneous outlier detection and variable selection, which overcomes a number of problems in this kind of statistical analysis. I’ve also got an application of this method to economic growth data, which I’ll try to polish and share here soon.

My main contribution here was to propose a specific kind of hidden Markov model (HMM) for modelling financial data series. The estimated HMM had two components to model the majority of observations: one with low, one with high volatility, to mimic “normal” and turbulent periods. Additional HMM components were then added to model outliers, or very extreme observations. Sadly this was never published anywhere.

Showing that once you take outliers and level shifts into account, there is very clear evidence for the presence of long memory in macroeconomic data. I can’t remember why I wrote this one in Finnish, as the results could have been of interest outside of Finland as well. And this paper also does not seem to be available anywhere online any more?

Miten olla hyvä taloustieteilijä? Kansantaloudellinen aikakauskirja, vol 2/2001, pp 339-341 (2001) [How to be a good economist? A book review of McCloskey, D. N.: How to be human – though an economist, the Finnish Journal of Economics]

We live in exciting times for sure. “Big data” (enormous databases and methods of analysing them) is creating all kinds of new knowledge. So I’m not saying that it’s all hype, and I did for example enjoy reading Kenneth Cukier and Viktor Mayer-Schönberger’s book Big Data.

But there sure is a lot of hype around as well. One particular meme I’m not so keen on is the claim that we now live in a whole new “N = all” world, where statistics is no longer needed, since we can just check from the data exactly how many x are y (e.g. people who live in London and bought something online last month, or something else that in the past we would have had to estimate from a sample to find out). Yes, there is a lot of information like this that is now easily available, and the big data advocates have many cool anecdotes to tell. And Google probably knows more about us than we do ourselves.

One obvious situation where old-fashioned statistical inference will still be needed for some time is medical research. Say you’re developing a new drug. You will need to do your phase 1, 2, and 3 trials just as before, and convince people at each stage that it’s safe to carry on. Unless you can somehow feed your new prototype drug to everyone in the world, record the outcomes in your data lake, and do your data mining? And there are surely many other situations like it, outside of academia as well. One of my previous jobs was on bank stress testing, which requires econometric modelling using very limited data sets and, yes, plenty of statistical inference.

I suppose in terms of the hype cycle, we are still in the initial peak phase of great expectations. And eventually all of these new methods will find their place in the great toolbox of data analytics. Right next to the boring old regression models, and slightly less old and never boring decision trees and neural networks.

Statistical hypothesis testing has always been close to my heart. I’ve long been critical of the use of p values, especially as most people seem to misunderstand their interpretation. I may even have failed a job interview once due to my stance on this. I suspect my interviewers didn’t believe that I knew what I was talking about when the subject came up.

This week, I read another two papers on this theme, one in finance, one in psychology. The first, Evaluating Trading Strategies by Campbell Harvey and Yan Liu, looks at the empirical evaluation of stock trading strategies. It is a nice illustration of the usual pitfalls of data mining – the bad kind, where you run so many tests that some are bound to be statistically significant by chance alone – and has a useful discussion of ways of correcting for such multiple testing.
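The core problem is easy to reproduce with a toy simulation (not taken from the paper): test enough strategies with no true edge and some will look significant, while a multiple-testing correction such as Bonferroni – the simplest of the corrections discussed in this literature – removes most of them.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n_strategies, n_days = 200, 1000

# 200 "strategies" with zero true edge: daily returns are pure noise
returns = rng.normal(0.0, 0.01, size=(n_strategies, n_days))

# two-sided z-test of mean return == 0 for each strategy
z = returns.mean(axis=1) / (returns.std(axis=1, ddof=1) / sqrt(n_days))
p = 2 * (1 - 0.5 * (1 + np.vectorize(erf)(np.abs(z) / sqrt(2))))

print(f"'significant' at 5%, no correction: {(p < 0.05).sum()}")  # around 10, by chance alone
print(f"significant after Bonferroni:       {(p < 0.05 / n_strategies).sum()}")
```

Around 5% of the null strategies “work” at the conventional threshold, purely by luck; that is the data mining trap in miniature.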

These two papers aren’t anything revolutionarily new, as this stuff has been known and talked about for ages. And yet, lots of people still fall into these traps, especially when it comes to publishing their work, and evaluating previous findings (for example doing a meta-analysis). My first thought was that it must be due to institutional factors: academics must publish, so they will fish and mine and p hack until they find something statistically significant to publish, and journals need to fill their pages somehow so they stick to the old rules, even though everyone knows that it’s all a bit of a sham and not to be trusted.

But then I came across Hoekstra, Morey, Rouder, and Wagenmakers’s Robust misinterpretation of confidence intervals. This one surveyed 120 psychology researchers and 442 students about their understanding of confidence intervals for a simple hypothesis test. Both groups were equally misinformed about the interpretation of results such as confidence intervals and p values. And what’s more (quoting from the abstract): “Self-declared experience with statistics was not related to researchers’ performance, and, even more surprisingly, researchers hardly outperformed the students, even though the students had not received any education on statistical inference whatsoever.”
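For the record, the correct reading of a confidence interval is about the procedure, not any single interval: across repeated samples, roughly 95% of intervals constructed this way will cover the true value. A quick check, with illustrative numbers of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(7)
true_mean, sigma, n, reps = 5.0, 2.0, 50, 10_000

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sigma, size=n)
    half = 1.96 * sample.std(ddof=1) / np.sqrt(n)   # normal-approximation 95% CI
    covered += (sample.mean() - half <= true_mean <= sample.mean() + half)

print(f"coverage over {reps} repetitions: {covered / reps:.3f}")  # close to 0.95
```

No single interval has a “95% probability of containing the true value” – that probability statement belongs to the long-run procedure, which is precisely the distinction the surveyed researchers and students kept getting wrong.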