Beware the Friedman test!

In section 10.4.4 of Serious stats (Baguley, 2012) I discuss the rank transformation and suggest that it often makes sense to rank transform data prior to application of conventional ‘parametric’ least squares procedures such as t tests or one-way ANOVA. There are several advantages to this approach over the usual approach (which involves learning and applying a new test such as Mann-Whitney U, Wilcoxon T or Kruskal-Wallis for almost every situation). One is pedagogic. It is much easier to teach or learn the rank transformation approach (especially if you also cover other transformations in your course). Another reason is that there are situations where widely used rank-randomization tests perform very badly, yet the rank transformation approach does rather well. In contrast, Conover and Iman (1981) show that rank transformation versions of parametric tests mimic the properties of the best known rank randomization tests (e.g., Spearman’s rho, Mann-Whitney U or Wilcoxon T) rather closely with moderate to large sample sizes. The better rank randomization tests tend to have the edge on rank transformation approaches only when sample sizes are small (and that advantage may not hold if there are many ties).

The potential pitfalls of rank randomization tests is nicely illustrated with the case of the Friedman test (and related tests such as Page’s L). I’ll try and explain the problem here.

Why the Friedman test is an impostor …

I’ve always thought there was something odd about the way the Friedman test worked. Like most psychology students I first learned the Wilcoxon signed ranks (T) test. This is a rank randomization analog of the paired t test. It involves computing the absolute difference between paired observations, ranking them and then adding the original sign back in. Imagine that the raw data consist of the following paired measurements (A and B) from four people (P1 to P4):

A

B

P1

13

4

P2

6

9

P3

11

9

P4

12

6

This results in the following ranks being assigned:

A – B

Rank

P1

+9

+4

P2

-3

-2

P3

+2

+1

P4

+6

+3

The signed ranks are then used as input to a randomization (i.e., permutation) test that, if there are no ties, gives the exact probability of the observed sum of the ranks (or a sum more extreme) being obtained if the paired observations had fallen into the categories A or B at random (in which case the expected sum is zero). The basic principle here is similar to the paired t test (which is a one sample t test on the raw differences).

The Friedman test is (incorrectly) generally considered to be a rank randomization equivalent of one-way repeated measures (within-subjects) ANOVA in the same way that the Wilcoxon test is a a rank randomization equivalent of paired t. It isn’t. To see why, consider three repeated measures (A, B and C) for two participants. Here are the raw scores:

A

B

C

P1

6

7

12

P2

8

5

11

Here are the corresponding ranks:

A

B

C

P1

1

2

3

P2

2

1

3

The ranks for the Friedman test depend only on the order of scores within each participant – they completely ignore the differences between participants. This differs dramatically from the Wilcoxon test where information about the relative size of differences between participants is preserved. Zimmerman and Zumbo (1993) discuss this difference in procedures and explain that the Friedman test (devised by the noted economist and champion of the ‘free market’ Milton Friedman) is not really a form of ANOVA but an extension of the sign test. It is an impostor.

This is bad news because the sign test tends to have low power relative to the paired t test or Wilcoxon sign rank test. Indeed, the asymptotic relative efficiency relative to ANOVA of the Friedman test is .955 J/(J+1) where J is the number of repeated measures (see Zimmerman & Zumbo, 1993). Thus it is about .72 for J = 3 and .76 for J = 4, implying quite a big hit in power relative to ANOVA when the assumptions are met. This is a large sample limit, but small samples should also have considerably less power because the sign test and the Friedman test, in effect, throw information away. The additional robustness of the sign test may sometimes justify its application (as it may outperform Wilcoxon for heavy-tailed distributions), but this does not appear to be the case for the Friedman test. Thus, where one-way repeated measures ANOVA is not appropriate, rank transformation followed by ANOVA will provide a more robust test with greater statistical power than the Friedman test.

The rank transformation version of the ANOVA is relatively easy to set up. The main obstacle is that the ranks need to be derived by treating all nJ scores as a single sample (where n is the number of observations per J repeated measures conditions – usually the number of participants). If your software arranges repeated measures data in broad format (e.g., as in SPSS) this can involve some messing about cutting and pasting columns and then putting them back (for which I would use Excel). For this sort of analysis I would in case prefer R – in which case the data would tend to be in a single column of a data frame or in a single vector anyway.

The following R code using demo data from the excellent UCLA R resources runs first a friedman test, then a one-way repeated measures ANOVA and then the rank transformation version ANOVA. For these data pulse is the DV, time is the repeated measures factor and id is the subjects identifier.

It may be helpful to point out a couple of features of the R code. The Friedman test is built into R and can take formula or matrix input. Here I used formula input and specified a data frame that contains the demo data. The vertical bar notation indicates that the time factor varies within participants. The repeated measures ANOVA can be run in many different ways (see Chapter 16 of Serious stats ). Here I chose ran it as a multilevel model using the nlme package (which should still work even if the design is unbalanced). As you can see, the only difference between the code for the conventional ANOVA and the rank transformation version is that the DV is rank transformed prior to analysis.

Although this example uses R, you could almost as easily use any other software for repeated measures ANOVA (though as noted it is simplest with software that take data structured in long form – with the DV in a single column or vector).

Other advantages of the approach

The rank transformation is, as a rule, more versatile than using rank randomization tests. For instance, ANOVA software often has options for testing contrasts or correcting for multiple comparisons. Although designed for analyses of raw data some procedures are very general and can be straightforwardly applied to the rank transformation approach – notably powerful modified Bonferroni procedures such as the Hochberg or Westfall procedures. A linear contrast can also be used to run the equivalent of a rank randomization trend test such as the Jonckheere test (independent measures) or Page’s L (repeated measures). A rank transformation version of the Welch-Satterthwaite t test is also superior to the more commonly applied Mann-Whitney U test (being robust to homogeneity of variance when sample sizes are unequal which the Mann-Whitney U test is not).

3 Comments

V.Fidler

We have published a paper on this subject years ago (Fidler V and Nagelkerke NJD (1986) Generalized signed rank test for comparison of p treatments. Statistica Neerlandica, 40, 145-155.). Our test sign-ranks jointly all J(J-1)/2 within-subject pairwise differences. It is equivalent to Wilcoxon-signed-ranks test for J=2.

Bastien

Thank you for these explanations! It is very interesting. I was wondering if you do not have to check for normality of the residuals after having run the mixed model on the ranks. If the normality is not respected, I guess the test can not be run. Thank you for your answer!

I don’t that normality of residuals would be an issue. The normality assumption is about the least important assumption of a regression model and what tends to cause problems is extreme values – which can’t really occur in after a rank transformation. In theory the sphericity assumption could be violated so it would make sense to keen an eye on that. For example, in independent measures designs the analogous homogeneity of variance assumption can be violated even with rank transformed data (though you can apply corrections such as the Welch-Satterthwaite correction).