
Comparing χ² tests for separability

Researchers often wish to compare the results of their experiments with those of others. Alternatively, they may wish to compare permutations of an experiment to see whether a modification in the experimental design obtains a significantly different result.

This question concerns an empirical analysis of the effect of modifying an experimental design on reported results, rather than a deductive argument concerning the optimum design.

Many researchers attempt this type of evaluation by employing statements about their results (citing t, F or χ² values, error levels or “p values”, etc.) as benchmarks for the strength of those results, implying a comparison that is frequently misunderstood (Goldacre 2011).

Alternatively, descriptive statistics of effect size such as percentage difference, log odds ratios, or Cramér’s φ may be used for comparison. These measures adjust for the volume of data and measure the pattern of change observed. However, effect size comparisons are discussed in the literature in surprisingly crude terms, e.g. ‘strong’, ‘medium’ and ‘weak’ effects (cf. Sheskin 1997: 244). In this paper we explain how to evaluate differences in effect size statistically.

In summary:

The fact that one χ² value or error level exceeds another merely means that the reported indicators differ. It does not mean that the results are statistically separable, i.e. that the results are significantly different from each other at a given likelihood of error.

However, if we wish to claim a difference in experimental outcomes between experimental ‘runs’, this is precisely what we must establish.

In this paper we attempt to address how this question of separability may be evaluated.

We begin by focusing on comparing the results of two paired contingency tests:

two 2 × 2 tests for homogeneity (independence) and

two 2 × 1 goodness of fit tests.

The idea is that both dependent and independent variables are matched but not necessarily identical, i.e. in both subtests we attempt to measure the same quantities by different definitions, methods or samples. The new test then compares these subtest results for separability, and tells us whether the change in experimental design obtains a significantly different result.

Consider the example below, from Aarts, Close and Wallis (2013). The two tables summarise contingency tests for two different sets of data. The results appear to be different, especially if we consider effect size measures φ and d%. The question is whether we can test if they are significantly different from each other.

(spoken)        shall   will    Total   χ²(shall)   χ²(will)   summary
LLC (1960s)     124     501     625     15.28       2.49       d% = -60.70% ±19.67%
ICE-GB (1990s)  46      544     590     16.18       2.63       φ = 0.17
TOTAL           170     1,045   1,215   31.46       5.12       χ² = 36.58 s

(written)       shall+  will+’ll  Total   χ²(shall+)  χ²(will+’ll)  summary
LOB (1960s)     355     2,798     3,153   15.58       1.57          d% = -39.23% ±12.88%
FLOB (1990s)    200     2,723     2,923   16.81       1.69          φ = 0.08
TOTAL           555     5,521     6,076   32.40       3.26          χ² = 35.65 s

A pair of 2 × 2 tables for shall/will alternation, after Aarts et al. (2013): upper: spoken, lower: written, with other differences in the experimental design. Note that the χ² values are almost identical but Cramér’s φ and percentage swing d% are different.
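The summary statistics in the tables above can be recomputed directly from the cell frequencies. The following sketch (plain Python, no libraries) computes Pearson’s χ², Cramér’s φ = √(χ²/N) and the percentage swing d% = (p₂ − p₁)/p₁ for each 2 × 2 table; function names are ours, not from the paper.

```python
def chi_square(table):
    """Pearson chi-square for a 2x2 contingency table [[a, b], [c, d]]."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n          # expected frequency
            x2 += (table[i][j] - e) ** 2 / e   # cell chi-square contribution
    return x2

def cramers_phi(table):
    """phi = sqrt(chi-square / N); for a 2x2 table this is Cramer's phi."""
    n = sum(sum(row) for row in table)
    return (chi_square(table) / n) ** 0.5

def percentage_swing(table):
    """d% = (p2 - p1) / p1: relative change in the first-column proportion."""
    (a, b), (c, d) = table
    p1, p2 = a / (a + b), c / (c + d)
    return (p2 - p1) / p1

spoken  = [[124, 501], [46, 544]]     # LLC vs ICE-GB: shall vs will
written = [[355, 2798], [200, 2723]]  # LOB vs FLOB: shall+ vs will+'ll

for name, t in [("spoken", spoken), ("written", written)]:
    print(name, round(chi_square(t), 2), round(cramers_phi(t), 2),
          f"{percentage_swing(t):+.2%}")
```

Running this reproduces the table values: χ² = 36.58 and 35.65, φ = 0.17 and 0.08, d% = −60.70% and −39.23%.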

The idea is summarised by the figure below. There are two broad classes of test: those that compare the results of goodness of fit tests (“separability of fit”) and those that compare the results of tests of homogeneity (“separability of independence”).

Visualising separability tests.

In this paper we concentrate on 2 × 2 and 2 × 1 tests because they have one degree of freedom, so significant results can be explained by a single factor.
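For intuition, the kind of question a separability test answers can be sketched with a naive Gaussian comparison of the two percentage swings in the tables above. This is not the paper’s actual test (which must deal with the asymmetry of score-based intervals, among other issues); it simply assumes the quoted ± figures are symmetric 95% intervals, converts them to standard errors and z-tests the difference of the two swings.

```python
Z95 = 1.95996  # two-tailed 5% critical value of the Normal distribution

def z_difference(d1, half_width1, d2, half_width2):
    """z score for the difference between two independent estimates,
    each quoted as estimate +/- 95% interval half-width (Gaussian assumption)."""
    se1, se2 = half_width1 / Z95, half_width2 / Z95
    return (d1 - d2) / (se1 ** 2 + se2 ** 2) ** 0.5

# spoken: d% = -60.70% +/- 19.67%; written: d% = -39.23% +/- 12.88%
z = z_difference(-0.6070, 0.1967, -0.3923, 0.1288)
print(round(z, 2), "significant" if abs(z) > Z95 else "not significant")
```

Under these simplifying assumptions |z| falls short of 1.96: despite the visibly different swings, the two results are not separable at the 5% level, which is exactly why eyeballing effect sizes is not enough.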

It is possible to employ a similar approach for evaluating pairs of larger “r × c” or “r × 1” tables (see section 4 in the paper). However, we argue elsewhere (Wallis 2013) that it is good practice to analyse such tables, which have many degrees of freedom (and therefore contain multiple potential areas of significant variation), by subdividing them into tables with one degree of freedom, so as to identify areas of significant difference. The simplest tests we describe here may therefore have the greatest utility.

The tests we describe here represent a kind of meta-analysis: they provide a method for comparing and summarising experimental results. Other tests for comparing contingency test results include McNemar and Cochran Q tests (Sheskin 1997) which compare distributions, but not differences, and are known to be weak tests.

Zar’s (1999: 471, 500) chi-square heterogeneity analysis is the most similar class of tests in the literature to ours. Section 5 reviews these tests and compares them with our approach. The key difference is that Zar’s method requires that the data have (approximately) the same prior distribution (i.e. the same starting point), whereas our tests do not.

Finally, note that in this paper we discuss contingency tests. There is a comparable procedure for comparing multiple runs of t tests (or ANOVAs) but it is rarely recognised as such. This is the test for interaction in a factorial analysis of variance (Sheskin 1997: 489) where one of the factors represents the repeated run.
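The ANOVA analogue can be sketched as follows: in a balanced two-way design where one factor is the repeated experimental ‘run’, a significant interaction means the effect of the other factor differs between runs. This minimal illustration (invented data, pure Python) computes the interaction F ratio by hand; a real analysis would use a statistics package.

```python
def interaction_F(cells):
    """Interaction F ratio for a balanced two-way ANOVA.
    cells[i][j] is the list of observations for run i, condition j;
    assumes equal cell sizes."""
    a, b = len(cells), len(cells[0])
    n = len(cells[0][0])
    mean = lambda xs: sum(xs) / len(xs)
    cell_m = [[mean(c) for c in row] for row in cells]
    row_m = [mean(row) for row in cell_m]                             # run means
    col_m = [mean([cell_m[i][j] for i in range(a)]) for j in range(b)]  # condition means
    grand = mean(row_m)
    # interaction sum of squares: cell means minus additive (row + column) model
    ss_inter = n * sum((cell_m[i][j] - row_m[i] - col_m[j] + grand) ** 2
                       for i in range(a) for j in range(b))
    # within-cell (error) sum of squares
    ss_within = sum((x - cell_m[i][j]) ** 2
                    for i in range(a) for j in range(b) for x in cells[i][j])
    df_inter, df_within = (a - 1) * (b - 1), a * b * (n - 1)
    return (ss_inter / df_inter) / (ss_within / df_within)

# two runs x two conditions, three observations per cell (invented data)
runs = [[[1, 2, 3], [2, 3, 4]],   # run 1: condition effect of +1
        [[1, 2, 3], [5, 6, 7]]]   # run 2: condition effect of +4
F = interaction_F(runs)
print(round(F, 2))  # -> 6.75; compare with F(1, 8), roughly 5.32 at the 5% level
```

Here the condition effect changes between runs, so the interaction F exceeds the critical value: the two runs are distinguishable in exactly the sense this paper calls separability.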