
In the past weeks, the following graphic poster has been shared frequently on social media:

Poster made by the J. Walter Thompson agency for the National Centre for Domestic Violence.

The small print is pretty small; it reads: “if England get beaten, so will she. Domestic violence increases 26% when England play, 38% if they lose”.

Tomas van Dijk, a journalist working for the Dutch newspaper De Volkskrant, contacted me this week. He’s writing a piece fact-checking these numbers and asked me for my opinion. As these numbers, 26% and 38%, are taken from a paper from 2013, he suggested looking at this year’s data as well. Furthermore, he noted that there is a known relation between temperature and domestic violence (more violence when it is too warm), so he asked if I could have a look at the role of temperature as well.

The reason I looked at this paper – being so outside of my own field of work – is because of one of the last paragraphs, on sample size. Here, the authors write:

For optogenetic activation experiments, cell-type-specific ablation experiments, and in vivo recordings (optrode recordings and calcium imaging), we continuously increased the number of animals until statistical significance was reached to support our conclusions.

I was extremely surprised to read this, for three reasons:

This is not a correct way to decide upon the sample size. To be more precise: this is a very wrong way of doing so, kind of invalidating all the results;

The authors were so open about this – usually questionable research practices are more hidden;

None of the three reviewers, nor the editor, has spotted this blatant statistical mistake – even though it’s a textbook example of a QRP and the journal has an astronomical impact factor.

The second reason is reassuring to some extent: it is clear that there’s no ill intent from the authors. Without proper and thorough statistical training, it actually sounds like a good idea. Rather than collecting a sample of, say, size n = 50, let’s see step by step if we can work with a smaller sample. Especially when you are conducting animal studies (like these authors are), it’s your ethical obligation to select the sample size as efficiently as possible.

My tweet about this yesterday received quite some attention: clearly I’m not the only one who was surprised to read this. Andrew Gelman wrote a blog post after seeing the tweet, in which he indicates that this type of sequential analysis doesn’t have to be problematic, if you steer away from null hypothesis significance testing (NHST). He makes some valid points but I think that in practice researchers often want to use NHST anyway. Below, I will outline (i) what the problem is with sequential analyses with unadjusted testing; (ii) what you could do to avoid this issue.

Unadjusted sequential testing

The story here holds true for all kinds of tests, but let’s stick to a straightforward independent t-test. You begin with 2 mice in each group (with 1 mouse per group, you cannot compute the within-group-variance, thus cannot conduct a t-test). You put some electrodes in their brains, or whatever it is you have to do for your experiment, take your measurements and conduct your t-test. It gives a p-value above 0.05. It must be because of the small sample, let’s add another mouse per group. Again, non-significant. You go on, and on, and on, until you reach significance.

If there is no effect, a single statistical test will yield a false positive, so p < 0.05, 5% of the time. This 5% is what we consider an acceptable false positive (Type I error) rate (although you can make a motivated choice for another rate – but that’s another discussion). If you did two independent tests (and there is no effect), the probability of at least one significant result would be $1 - (1 - 0.05)^2 = 0.0975$, i.e. 9.75%, and with k tests this is $1 - (1 - 0.05)^k$, which goes towards 1 pretty fast as k goes up. This is the reasoning behind the Bonferroni correction.
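To see how quickly this probability grows, here is a minimal sketch in Python. The function name `familywise_error_rate` is mine and purely illustrative; it assumes k independent tests, each at level 0.05, and no true effect.

```python
def familywise_error_rate(k, alpha=0.05):
    """Probability of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k

# Tabulate the rate for a growing number of independent tests
for k in (1, 2, 5, 10, 20, 50, 100):
    print(f"k = {k:3d}: {familywise_error_rate(k):.4f}")
```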

Here, the situation is slightly different: you’re not performing independent tests. The p-value for a t-test with 30 measurements will not be too dissimilar from the p-value for a t-test with those 30 measurements and 1 more. Still, the multiple testing issue remains – albeit not as severe as with independent tests. You can prove mathematically (don’t worry, I won’t do that here) that with this sequential approach it actually is guaranteed (i.e. probability of 1) that you will reach significance at some point. Even if there is no effect! This approach will give a guaranteed false positive rate of 1 – and that is as bad as it sounds…

Example

We can use a computer simulation to see what happens. This is a situation in which H0 is true: there is no effect, i.e. the two groups do not differ. Rejecting H0 in this situation is an error (Type I error). In the picture below, I did just what I described: starting with n = 2, I kept on increasing n by 1. As you can see, the p-value ‘converged to significance’ at n = 42. But it also moved away from it! At n = 150, we’re kind of back where we started, with a very non-significant p-value.

Sequential p-values: at n = 42, we ‘dive under’ the 5% threshold.

Simulation

So, in this instance it happened at n = 42. With a new simulation it might happen at some other point, but two things are for sure: you will reach significance and you will reach non-significance after that…

Let’s now study how bad the problem is. I simulated 1000 of these sequential strategies, and recorded at what value of n significance was reached for the first time. Sometimes you’re “lucky” and have it with a small n, sometimes you have to wait for ages. The simulation results are as follows:

False positive rate for the sequential strategy. The blue curve indicates independent tests, the red one dependent tests.

Same picture as above, now zoomed in to the area with n < 25.

As you can see, the problem is huge. Even if you would apply some rule where you stop the strategy once n = 25, your false positive rate exceeds 25%, more than five times what you want.
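If you want to play with this yourself, the following is a rough sketch of such a simulation, not the code behind the figures. To stay within Python's standard library it swaps the t-test for a z-test with known unit variance, so the exact percentages will differ somewhat from mine. The function names and the default of 2000 replications are my own choices for illustration.

```python
import random
from statistics import NormalDist, fmean

def first_significant_n(rng, n_start=2, n_max=25, alpha=0.05):
    """Add one observation per group at a time; return the first n with p < alpha, else None."""
    std_normal = NormalDist()
    # H0 is true: both groups come from the same N(0, 1) distribution
    group_a = [rng.gauss(0, 1) for _ in range(n_start)]
    group_b = [rng.gauss(0, 1) for _ in range(n_start)]
    while True:
        n = len(group_a)
        # z-test with known unit variance: the difference in means has SD sqrt(2 / n)
        z = (fmean(group_a) - fmean(group_b)) / (2 / n) ** 0.5
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            return n
        if n >= n_max:
            return None
        group_a.append(rng.gauss(0, 1))
        group_b.append(rng.gauss(0, 1))

def false_positive_rate(n_sims=2000, seed=1, **kwargs):
    """Fraction of simulated studies that reach significance at some point."""
    rng = random.Random(seed)
    hits = sum(first_significant_n(rng, **kwargs) is not None for _ in range(n_sims))
    return hits / n_sims

print("single test at n = 25:   ", false_positive_rate(n_start=25, n_max=25))
print("peeking from n = 2 to 25:", false_positive_rate(n_start=2, n_max=25))
```

The first line tests once at a fixed sample size, the second tests after every added pair of observations; comparing the two shows the inflation caused by peeking.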

Note that this problem not only affects the p-values, but also the estimates. Using this strategy, the distance between the means of both groups will sometimes increase, sometimes decrease – just as a consequence of coincidence. If we continue sampling until the means of the experimental and control group are sufficiently far apart in order to call it significant, it means we overestimate the effects. Not only is the significance biased, so is the effect size.

So, in an attempt to ‘use’ as few animals as possible – something that should be applauded – the authors actually and accidentally invalidated their study, leading to more test animals that are used unnecessarily…

So, what can we do?

Hopefully, I’ve managed to explain that unadjusted sequential analysis is problematic. It is, however, possible to still apply this approach – increasing your sample size in small bits until you meet some threshold. The main difference is that the threshold should not be taken fixed at 5%, but should take the issue of multiple testing into account. The mathematical backbone to this approach was developed in the 1940s by Abraham Wald, with a pivotal paper in 1945. Around the same time, and independently of Wald, British war hero and polymath Alan Turing derived a similar approach based on Bayesian reasoning. This sequential approach helped Turing to crack the German Enigma machines and thus saved millions of lives.

These sequential approaches are more technical than the standard t-test, and they are usually not included in easy-to-use software packages. Recently, several people have written accessible tutorial papers on how to perform such a sequential analysis. A good starting point is this paper by Daniël Lakens.
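To give a flavour of Wald's idea, here is a minimal sketch of a sequential probability ratio test. It is a deliberately simplified setting and not a recipe for a real study: one sample, normal data with known variance 1, H0: mean = 0 against H1: mean = delta. The values delta = 0.5, alpha = 0.05 and beta = 0.20, and all function names, are assumptions chosen for illustration.

```python
import math
import random

def sprt(rng, true_mean, delta=0.5, alpha=0.05, beta=0.20, max_n=10_000):
    """Wald's SPRT for H0: mean = 0 vs H1: mean = delta, observations N(mean, 1).

    Returns (decision, n), where decision is "reject H0", "accept H0" or "undecided".
    """
    upper = math.log((1 - beta) / alpha)   # crossing this counts as evidence for H1
    lower = math.log(beta / (1 - alpha))   # crossing this counts as evidence for H0
    log_lr = 0.0
    for n in range(1, max_n + 1):
        x = rng.gauss(true_mean, 1)
        # log-likelihood ratio of N(delta, 1) versus N(0, 1) for one observation
        log_lr += delta * x - delta ** 2 / 2
        if log_lr >= upper:
            return "reject H0", n
        if log_lr <= lower:
            return "accept H0", n
    return "undecided", max_n

def rejection_rate(true_mean, n_sims=2000, seed=1):
    """Fraction of simulated sequential studies that end by rejecting H0."""
    rng = random.Random(seed)
    rejections = sum(sprt(rng, true_mean)[0] == "reject H0" for _ in range(n_sims))
    return rejections / n_sims

print("no effect:  ", rejection_rate(0.0))  # false positive rate
print("true effect:", rejection_rate(0.5))  # power
```

Unlike the unadjusted strategy, the stopping boundaries here are derived from the desired error rates, so looking after every observation no longer inflates the false positive rate.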

Conclusion

In their paper, Franz Weber and colleagues used an incorrect method to decide upon the sample size. As a consequence, all test results in this paper are invalid. How this passed peer review in a top journal is difficult to understand, but these things happen. It’d be interesting to see how Nature Communications deals with the aftermath of this paper…

It’s January. Exam time. And for the unfortunate ones, next month is resit-exam time. Each year I get the same complaints after the resit exam, so I’ve decided to write a blog post about it. From now on, complaining students will receive no more than the hyperlink to this blog.

Each year, I end my course with an exam. For those that have to miss the exam due to force majeure, a resit opportunity exists. Fortunately, it’s a relatively small number of students that have to miss the exam due to illness, bereavement, etc. The resit opportunity is also open to students that did participate in the first exam, but failed.

Most of the time, the pass rate of the resit exam is considerably lower than the pass rate on the first exam. Students – and especially those that failed the resit exam – see this as proof of some evil plan of mine: I made the resit much more difficult than the first opportunity. Why else would the pass rate be lower? Statistics don’t lie!

True, they don’t. But people – including students that fail a statistics course two times in a row – are prone to misinterpreting statistics. And they are misinterpreting the numbers here.

The grade a student receives at an exam depends on three aspects: (1) his/her proficiency (usually a combination of motivation and intelligence); (2) the difficulty level of the exam; (3) coincidence (being (un)lucky with guessing multiple choice questions, ‘not having a good day’, etc.).

Let’s do a thought experiment. Suppose that all students would participate in both the first opportunity and the resit exam. That way, you end up with two grades for everyone. Let’s keep things easy and only look at whether a student passes or fails the exam opportunity.

Let’s make up some numbers now (homework exercise: re-do this example with different numbers and observe that the conclusion still holds true). Suppose we’ve got 100 students (because 100 is easy to compute with). Here they are:

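|                     | Opportunity 2: Pass | Opportunity 2: Fail |
|---------------------|---------------------|---------------------|
| Opportunity 1: Pass | 48                  | 10                  |
| Opportunity 1: Fail | 12                  | 30                  |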

As you can see, the pass rates of Opportunity 1 (48% + 10% = 58%) and Opportunity 2 (48% + 12% = 60%) are comparable; the performance on the resit is actually even slightly better. Nearly half the students are able to pass twice, nearly a quarter of students passes once, and 30 students won’t pass this year.

This, however, is not data that you would observe in reality. The students that pass the first opportunity need not take the second one. (Even more so: if you pass the first and earn the course credits, you will lose these again by failing in the resit). Thus, we only observe the resit-results for those that failed the first opportunity:

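|                     | Opportunity 2: Pass | Opportunity 2: Fail |
|---------------------|---------------------|---------------------|
| Opportunity 1: Fail | 12                  | 30                  |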

Now the pass rate suddenly is 12/42 = 28.6%, much lower than the 58% pass rate of the first opportunity. Thus, the fact that the resit exam has a much lower pass rate than the first opportunity exam does not imply that the resit exam is more difficult. If you still believe it does, you don’t receive a passing grade on your statistics exam.
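For those who want to redo the homework exercise with other numbers, the arithmetic can be sketched in a few lines of Python. The counts are the made-up ones from the tables above; the variable names are mine.

```python
# Made-up counts from the thought experiment: (result on exam 1, result on resit) -> students
counts = {
    ("pass", "pass"): 48,
    ("pass", "fail"): 10,
    ("fail", "pass"): 12,
    ("fail", "fail"): 30,
}
total = sum(counts.values())

# Pass rates if every student sat both exams
pass_rate_first = sum(n for (first, _), n in counts.items() if first == "pass") / total
pass_rate_resit_all = sum(n for (_, resit), n in counts.items() if resit == "pass") / total

# In reality only students who failed the first exam sit the resit
resit_takers = counts[("fail", "pass")] + counts[("fail", "fail")]
pass_rate_resit_observed = counts[("fail", "pass")] / resit_takers

print(f"first exam:                 {pass_rate_first:.1%}")
print(f"resit, if everyone took it: {pass_rate_resit_all:.1%}")
print(f"resit, as observed:         {pass_rate_resit_observed:.1%}")
```

Change the four counts and the observed resit pass rate will stay below the first pass rate whenever students who failed once are more likely than average to fail again.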

On September 21st, Romy van der Lee and Naomi Ellemers published a paper in PNAS in which they claim to have found compelling evidence of gender bias against women in the allocation of NWO Veni-grants in the period 2010-2012.

The day after, I posted a blog post in Dutch criticising this study (and the day after that an abridged version in English). In these posts, I explained how the significance of the result is due to Simpson’s paradox – thus a statistical artefact rather than true evidence for gender bias. This blog post sparked an amount of public interest which was new to me. I normally publish on linear algebra, (minor) improvements to statistical procedures and other topics that are generally regarded as boring. This time, I’ve been interviewed by Nature, Science and various Dutch academic newspapers. (Great evidence on how post-publication peer review and blog posts are Science 2.0 – but that’s another topic).

Last week, an abridged and updated version of my blog post appeared as a peer-reviewed letter in PNAS.

Independently, Beate Volker and Wouter Steenbeek had their letter published in PNAS a few days later.

Van der Lee and Ellemers responded to both letters (response 1 and 2). In their response they misinterpret the consequences of Simpson’s paradox. I wasn’t planning on responding again – my time is limited – but since they repeat this incorrect interpretation in multiple responses as well as in the newspaper, I find it important to outline why their statistical reasoning is flawed.

In this blog post I will outline that a correct interpretation of Simpson’s paradox results in insignificance of many p-values and not just the one I focussed on in my criticism. In their response to my letter, Van der Lee and Ellemers wrote:

“Further, Simpson’s paradox cannot explain that fewer women than men are selected for the next phase in each step of the review procedure”.

In their response to Volker and Steenbeek, they phrased this as:

“Simpson’s paradox also cannot account for the observation that in every step of the review procedure women are less likely than men to be prioritized.”

It is clear from this figure that the gender bias seems to increase in each step of the process. It is true that I, in my letter, focussed on gender bias in the final step – the number of awarded grants. This, however, was due to the word count limit that PNAS imposes and not because the other steps cannot be explained by Simpson’s paradox as well: they can.

It is easier to show this through a constructed example, rather than the true NWO data. Suppose that the setting is as follows. The funding agency has two research disciplines, A and B. Both receive 100 applications and through three stages (pre-selection, interviews, awards) it is decided who gets funded. Gender bias is present in neither field A nor field B: gender is no issue in this example. However, the percentage of applications by women differs per field, and so does the number of applications that receive funding.

Field A receives 100 applications: 75 by men and 25 by women. Finally, 40 applications will be funded. So 60 applicants receive bad news, which is equally distributed over the three steps: in each step, 20 scientists will be disappointed. In the case of total absence of gender bias (and coincidence), this leads to the following table:

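| Field A               | # M | # F | % M | % F |
|-----------------------|-----|-----|-----|-----|
| Step 0: Applications  | 75  | 25  | 75% | 25% |
| Step 1: Pre-selection | 60  | 20  | 75% | 25% |
| Step 2: Interviews    | 45  | 15  | 75% | 25% |
| Step 3: Funding       | 30  | 10  | 75% | 25% |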

As you can see, in each step the gender ratio is 75%-25%. No gender bias at all.

Field B also receives 100 applications: 50 by men and 50 by women. Out of these 100, only 10 will be funded: in each step 30 applications lose out. This leads to the following table:

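| Field B               | # M | # F | % M | % F |
|-----------------------|-----|-----|-----|-----|
| Step 0: Applications  | 50  | 50  | 50% | 50% |
| Step 1: Pre-selection | 35  | 35  | 50% | 50% |
| Step 2: Interviews    | 20  | 20  | 50% | 50% |
| Step 3: Funding       | 5   | 5   | 50% | 50% |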

Thus also no gender bias in Field B. If we combine the tables for fields A and B (by simply adding up the frequencies for each cell), we obtain:

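| Field A + B combined  | # M | # F | % M   | % F   |
|-----------------------|-----|-----|-------|-------|
| Step 0: Applications  | 125 | 75  | 62.5% | 37.5% |
| Step 1: Pre-selection | 95  | 55  | 63.3% | 36.7% |
| Step 2: Interviews    | 65  | 35  | 65.0% | 35.0% |
| Step 3: Funding       | 35  | 15  | 70.0% | 30.0% |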

Converting these percentages into a graph similar to Van der Lee and Ellemers’ Figure 1 provides:

The pattern from the table and figure is very clear: in each step of the process men seem to be favoured at the cost of women. Although the percentages for this example are obviously different from those in the NWO data, the type of pattern is the same. Since in my example there is no gender bias whatsoever, Van der Lee and Ellemers’ claim that “Simpson’s paradox also cannot account for the observation that in every step of the review procedure women are less likely than men to be prioritized” evidently is false. The power of paradoxes should not be underestimated.
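The constructed example is easy to reproduce. Below is a small Python sketch (variable and function names are mine) that takes the per-field counts from the tables above, adds them up, and prints the share of women at each step.

```python
# Constructed example: (men, women) still in the running after each step, per field
steps = ["Applications", "Pre-selection", "Interviews", "Funding"]
field_a = [(75, 25), (60, 20), (45, 15), (30, 10)]
field_b = [(50, 50), (35, 35), (20, 20), (5, 5)]

def share_women(men, women):
    """Proportion of women among the remaining applicants."""
    return women / (men + women)

# Combine the fields by simply adding up the frequencies for each cell
combined = [(ma + mb, wa + wb) for (ma, wa), (mb, wb) in zip(field_a, field_b)]

for step, a, b, c in zip(steps, field_a, field_b, combined):
    print(f"{step:14s} A: {share_women(*a):.1%}  "
          f"B: {share_women(*b):.1%}  A+B: {share_women(*c):.1%}")
```

Within each field the share of women is constant, yet in the combined column it drops at every step, purely because the field with more women is also the more selective one.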

As a final note: as outlined above, the significant results claimed by Van der Lee and Ellemers are lost once correct statistical reasoning is applied. It is important though to realise that the absence of significant gender bias does not imply that there is no gender bias. There could be and it is important to find out whether – and where! – this is the case or not. To conclude, I quote Volker and Steenbeek, who write:

More in-depth analyses with statistical techniques that overcome the above-mentioned issues are needed before jumping to conclusions about gender inequality in grant awards.

In the early seventies, the University of California, Berkeley received serious negative attention due to supposed gender bias in graduate admissions. The data for fall 1973 clearly seemed to point in this direction:

|        | Nr. of applications | Admitted |
|--------|---------------------|----------|
| Male   | 8442                | 44%      |
| Female | 4321                | 35%      |

Out of 8442 male applicants, 44% were admitted, whereas out of the 4321 female applicants, only 35% were admitted. The χ²-test on the 2×2 frequency table (or any other sensible test for 2×2 tables) will give a very significant result, with a p-value smaller than one in a billion. A scrutiny of the data in Science by Bickel, Hammel and O’Connell (1975) revealed that there was no evidence for gender bias. This apparently counterintuitive result was due to the interaction with an external variable. Not all departments at the university had the same admission rate, and there was a relation between the proportion of female applications and the admission rate.
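For readers who want to check the size of that p-value, here is a minimal sketch of the χ²-test on the aggregated table, using only Python's standard library. Note the assumption: the admitted counts are reconstructed from the rounded percentages above, so the statistic is approximate. The helper name `chi_square_2x2` is mine.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Counts reconstructed from the rounded percentages, so only approximate
men_total, women_total = 8442, 4321
men_admitted = round(0.44 * men_total)
women_admitted = round(0.35 * women_total)

statistic, p_value = chi_square_2x2(
    men_admitted, men_total - men_admitted,
    women_admitted, women_total - women_admitted,
)
print(f"chi-square = {statistic:.1f}, p = {p_value:.1e}")
```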

Competitive departments such as English received relatively many female applications, whereas departments such as chemistry, with a surplus of male applications, were much less selective. When studying the male/female admissions on a departmental level, the supposed gender bias disappeared. (For the fall 1973 data, there even was evidence of bias in favour of women.) This paradox is termed spurious correlation or Simpson’s paradox, after the British statistician Edward Simpson. (For a recent open access paper on Simpson’s paradox in psychological science, see Kievit, Frankenhuis, Waldorp and Borsboom, 2013.)

The authors, correctly, point at another pitfall: although there seemed to be evidence of bias (in favour of women) for fall 1973, there is no such evidence for other years. A significant result once in a number of years, could just be coincidence.

In the analysis by Van der Lee and Ellemers the same two flaws occur in a setting not too dissimilar from the one discussed above. Based on the results of n = 2,823 grant applications to the “VENI programme” of the Netherlands Organisation for Scientific Research, NWO, in the years 2010, 2011 and 2012, the authors conclude that the data “provide compelling evidence of gender bias in personal grant applications to obtain research funding”. One of the main results this claim is based upon is the following table:

|        | Applications | Successful |
|--------|--------------|------------|
| Male   | 1635         | 17.7%      |
| Female | 1188         | 14.9%      |

When applying a standard χ²-test to the data, the authors find a just significant p-value of .045. Not only is it questionable to denote a p-value this close to 0.05 as “compelling evidence”; due to Simpson’s paradox, this p-value simply is wrong.

In the supplementary table S1 (Van der Lee and Ellemers, 2015), available online without paywall, a breakdown of the 2,823 grant applications per discipline is presented. The proportion of female applicants varies from 11.8% (physics) to 51.4% (health sciences), and the total success rate varies from 13.4% (social sciences) to 26.3% (chemical sciences).

Proportion of applications by female scientists vs total success rate. Size of the markers is proportional to number of applications within the discipline.

The figure above visualises these data and immediately shows a clear negative relation between the proportion of female applicants and the total success rate (i.e. the rate for men and women combined). In four out of the nine disciplines, women have a higher success rate than men, and in five out of nine, men have a higher success rate than women. When taking into account that multiple comparisons are performed, for none of the disciplines the gender bias – either in favour of women or in favour of men – is significant (at the α = .05 level). Thus, when taking into account the spurious correlation, the “compelling evidence” is lost.

Bickel et al. (1975) pointed at a second pitfall, concerning focussing on the year(s) where the difference was significant and ignoring the other year(s) where it was not. Again, a similar situation occurs here. NWO publishes the results of all VENI rounds since its establishment in 2002 until 2015 (except for 2012) on its website. In some years, such as 2011, men received relatively more grants than women; and in other years, such as 2010 and 2015, the reverse was true. The z-test for log-odds ratio only provides a significant sign of gender bias in favour of men for the years 2010 (z = 2.002, p = .023) and 2011 (z = 1.752, p = .040) and a significant gender bias in favour of women for 2002 (z = 2.005, p = .022). When applying the Bonferroni correction for multiple comparisons none of these gender biases are significant.
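The Bonferroni step can be sketched as follows. The z-values are the ones quoted above; the number of comparisons is an assumption on my part here (one test per round from 2002 to 2015, excluding 2012, so 13 in total), and the variable names are illustrative.

```python
from statistics import NormalDist

# z statistics quoted in the text, keyed by year
z_by_year = {2002: 2.005, 2010: 2.002, 2011: 1.752}
alpha = 0.05

# One-sided p-values from the standard normal distribution
p_by_year = {year: 1 - NormalDist().cdf(z) for year, z in z_by_year.items()}

# Assumption: one test per round from 2002 to 2015, excluding 2012
n_tests = len([year for year in range(2002, 2016) if year != 2012])
threshold = alpha / n_tests

for year, p in p_by_year.items():
    verdict = "significant" if p < threshold else "not significant"
    print(f"{year}: p = {p:.3f} vs {threshold:.4f} -> {verdict}")
```

Even if you would only correct for the three tests listed, the threshold becomes 0.05 / 3 ≈ 0.017, which is still below the smallest of these p-values.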

Conclusion. Van der Lee and Ellemers failed to recognise the dependence of the results on the different NWO disciplines. Furthermore, they focused on results during a three-year period, whereas the results of the other periods in which VENI-grants were provided did not confirm the just significant results for 2010-2012. As a consequence, the conclusion of “compelling evidence of gender bias” is inappropriate. In the data, there is no evidence for gender bias (which does not have to mean that there is no gender bias). In discussions on institutional sexual discrimination, it is important to stay factual.

Furthermore, I find it worrying that this analysis gets published. Simpson’s paradox is one of statistics’ most well-known paradoxes (I teach it yearly to a new batch of psychology students in Groningen) and PNAS is a high-ranking journal with an impact factor of nearly ten. This paper – where conclusions are drawn on the basis of flawed methodology – is not an exception. Apparently, the current peer-review system is inadequate in filtering out methodological flaws in papers. If a system doesn’t work, it should be changed.

Final note. The paper by Van der Lee and Ellemers focusses on more tests than just the one criticised by me here. However, these other tests make use of related data (e.g. the number of applicants that go through to the interview-stage) and it is not unlikely that Simpson’s paradox plays a role there too. (The data provided in the paper was insufficient for me to check this.) And even if it does not: the authors are providing interpretations to effects with tiny effect sizes (partial eta-squareds of 0.006(!))… Furthermore, the paper contains a section on “language use” in NWO documents. My comments do not apply to this section.

I’m currently updating my course materials, aimed at undergraduate students in psychology, for next academic year. Since the textbook is lacking a (thorough) description of how to do inference (hypothesis testing and confidence interval construction) for the product-moment correlation coefficient, I’ve written something myself.

It might be useful for someone else who feels that the textbooks aimed at social sciences students are lacking this information (and the textbooks aimed at mathematics students are too technical for other students), so I’ve put a copy here. Feel free to use it.

In a new blogpost, Daniël Lakens explains why using ω² is better than using η². Based on literature review and his own simulations, he shows convincingly that the bias of η² is much larger than that of ε² and ω². Or, in Daniël’s words, “Here’s how bad it is: If η² was a flight from New York to Amsterdam, you would end up in Berlin”.

I agree with Daniël that the flight doesn’t take you to Amsterdam, but things are less severe than he claims, as I will outline below. My post is a follow-up to his, so please read his post before you read mine.

Daniël clearly shows that η² disqualifies itself as an estimator in terms of bias. However, bias is only part of the story. Obviously you do want the bias to be small (or, ideally, 0, i.e. an unbiased estimator). But wishes are not unidimensional. You also want a stable estimator, i.e. an estimator with small variance. And in that category, η² performs the best out of the three estimators that Daniël studied.

I ran Daniël’s R-code (available at the bottom of his post; I’ve set nsim = 10000 for practical purposes, I’ve got to finish work before the kids get out of school) and the variance of ε² is about 1.5% (when n=100) to 17% (when n=10) larger than that of η². For ω² the variance is 1.1% up to 13.4% larger.
(You can check it yourself by re-running Daniël’s code and then running “SDmat[,2]^2/SDmat[,1]^2” and “SDmat[,3]^2/SDmat[,1]^2”).

There is always a trade-off between bias and variance. It’s easy to make an estimator with zero variance. Let’s make one now: casper² is defined as always being equal to 0.2. Always. Clearly, casper² has zero variance, but it will usually have a large bias (unless the true effect size actually is 0.2, but we don’t know that value (otherwise we wouldn’t have to estimate it)). (It might not have been a smart move to name this poor estimator after myself, which is why I’ll redefine it as TimHunt². That’ll teach him!)

The conventional way to deal with the trade-off is to compute the Mean Squared Error. The MSE is defined as the expected (mean) squared difference between the estimate and the true value, and can be computed as MSE = variance + bias². Because the MSE is on a squared scale, we often use its square root, conveniently called the root mean squared error (RMSE).
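The decomposition is easy to verify numerically. The sketch below is in Python rather than R and is not Daniël’s code. The true effect size of 0.05 and the noise level of the unbiased estimator are made-up values for illustration; the constant estimator is casper² from above.

```python
import random
from statistics import fmean, pvariance

def mse_parts(estimates, true_value):
    """Return (mse, variance, bias) of a list of simulated estimates."""
    mse = fmean((e - true_value) ** 2 for e in estimates)
    bias = fmean(estimates) - true_value
    return mse, pvariance(estimates), bias

rng = random.Random(1)
true_value = 0.05  # hypothetical true effect size

# A noisy but unbiased estimator versus the constant estimator casper² = 0.2
noisy = [true_value + rng.gauss(0, 0.1) for _ in range(10_000)]
constant = [0.2] * 10_000

for name, estimates in (("noisy", noisy), ("casper²", constant)):
    mse, variance, bias = mse_parts(estimates, true_value)
    print(f"{name}: MSE = {mse:.4f}, variance + bias² = {variance + bias ** 2:.4f}, "
          f"RMSE = {mse ** 0.5:.4f}")
```

For casper² the variance is zero, so its whole MSE comes from the squared bias; for the noisy estimator it is the other way around.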

If you look at the RMSE (which is easy; Daniël’s code already computes it for you (in the variable RMSEmat)), you see that ε² and ω² both do have lower RMSEs than η², but that the difference is close to negligible. (Credits for the visualisation go to Daniël; I’ve used his code and simply replaced “BIASmat” by “RMSEmat”).

When n = 10, for instance, RMSE(η²) = 0.122, RMSE(ε²) = 0.112 and RMSE(ω²) = 0.110. When n = 100, the values are respectively 0.0316, 0.0311 and 0.0310 (with some uncertainty due to the fairly low number of replications). To take it back to the New York to Amsterdam flight comparison: now you don’t land in Berlin anymore, but at Groningen International Airport, which is, according to the airport’s website, “conveniently close”.

To summarise: η² does indeed perform worse than ω² and ε², but the difference in performance is not as extreme as Daniël suggests. The poor behaviour of η² in terms of bias is almost completely compensated by good behaviour of η² in terms of variance. This especially holds when n is larger than, say, 25.

Another often-mentioned advantage of η² is that it is easier to compute than ω². However, we are not living in the era where we do our computations manually. Decent software (such as R or JASP) computes ω² for you with a press of a button. Furthermore, ease of computation can never be an argument: if you want to do easy things, don’t do science…