Friday, February 5, 2016

Effect stability: (1) Two-group, three-group, and interaction designs

When planning the sample size to estimate a population parameter, most psychology researchers choose the size that could allow an inference that the parameter is non-zero -- in other words, researchers attempt to maximize statistical significance. However, both practical and scientific interest often centers around whether the estimate is good or stable -- that is, close to its population parameter.

These two criteria, significance and stability, are not the same. Indeed, with a sample size of 20, an observed correlation of $r = .58$, which has a $p$-value of .007, is compatible with population values anywhere between roughly .18 and .81 (its approximate 95% confidence interval).
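The range quoted above can be checked with the Fisher z-transformation. The following is a minimal sketch, assuming a 95% interval and the usual large-sample standard error $1/\sqrt{n - 3}$; the function name `fisher_ci` is illustrative and does not come from the original analysis.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate confidence interval for a correlation via Fisher's z.

    z_crit defaults to the two-sided 95% normal critical value (an assumption).
    """
    z = math.atanh(r)               # move r onto the z scale
    se = 1.0 / math.sqrt(n - 3)     # large-sample standard error of z
    lo = math.tanh(z - z_crit * se) # back-transform the lower limit
    hi = math.tanh(z + z_crit * se) # back-transform the upper limit
    return lo, hi

lo, hi = fisher_ci(0.58, 20)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```

The limits should land close to the .18 and .81 given in the text, which shows how wide the interval remains even for a clearly significant result.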

The stability of the correlation coefficient

Effect stability is the issue taken up in a wonderful paper by Schönbrodt and Perugini (2013). Schönbrodt and Perugini define a stable estimate as one that is likely to remain close to its population value even after adding additional data.

To be more concrete, let's imagine that we have a population in which two quantitative variables are correlated at $\rho = .3$. If we draw a sample of size 20 from this population, the estimated correlation in that sample may be close to .3, but it could plausibly be quite far away.

Let's further imagine that we add cases to this sample and recalculate the correlation after each case, creating a trajectory of correlations. As we add cases along this trajectory, the correlation will fluctuate around the population value of .3, sometimes getting closer and sometimes moving further away. After a certain point, the correlation will tend toward .3 and will be unlikely to stray far from this value. If we follow this process of creating trajectories many times, we should be able to estimate the sample size cutoff after which, for most trajectories, adding more data does not cause the correlation estimate to stray far from .3.
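To make the idea of a trajectory concrete, here is a rough sketch of a single one. It is not the code used for the simulations in this post: it assumes bivariate normal data drawn directly from a theoretical population, and the names `correlation_trajectory`, `n_min`, and `n_max` are illustrative.

```python
import math
import random

def correlation_trajectory(rho, n_min, n_max, rng):
    """Return [(n, r_n)] for one sample that grows one case at a time."""
    sx = sy = sxx = syy = sxy = 0.0
    trajectory = []
    for n in range(1, n_max + 1):
        # Draw one case from a bivariate normal population with correlation rho.
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Update running sums so the correlation can be recomputed cheaply.
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
        if n >= n_min:
            cov = sxy - sx * sy / n
            var_x = sxx - sx * sx / n
            var_y = syy - sy * sy / n
            trajectory.append((n, cov / math.sqrt(var_x * var_y)))
    return trajectory

rng = random.Random(2016)  # arbitrary seed, for reproducibility only
traj = correlation_trajectory(rho=0.3, n_min=20, n_max=1000, rng=rng)
for n, r in traj[::196]:   # print a handful of points along the trajectory
    print(n, round(r, 3))
```

Early values typically wander widely, while later values settle near .3; repeating this many times gives the distribution of trajectories described above.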

This is exactly the procedure developed by Schönbrodt and Perugini. They define the sample size cutoff as the "point of stability", or POS; the desired closeness of the estimate to its population value as the "corridor of stability", or COS (with half-width $w$); and the percentage of trajectories that should be within the COS after the point of stability as the "confidence" of the point of stability. The relationship between the different parameters, and the general method for estimating the POS, are shown in Schönbrodt's and Perugini's Figure 1, below.
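For a single trajectory, these definitions can be sketched in a few lines. This is one reading of the definitions rather than Schönbrodt and Perugini's code: the function name is hypothetical, the trajectory is assumed to have one entry per consecutive sample size, and reporting the first sample size after the last break (rather than the break itself) is an assumption.

```python
def point_of_stability(trajectory, rho, w):
    """Smallest n after which the trajectory stays inside [rho - w, rho + w].

    `trajectory` is a list of (n, r_n) pairs ordered by consecutive n.
    Returns None if the final estimate is itself outside the corridor.
    """
    last_break = None
    for n, r in trajectory:
        if abs(r - rho) > w:
            last_break = n           # most recent case that left the corridor
    if last_break is None:
        return trajectory[0][0]      # never left: stable from the first n
    if last_break == trajectory[-1][0]:
        return None                  # still outside the corridor at the end
    return last_break + 1

# Toy trajectory (made-up numbers) around rho = .3 with half-width w = .1.
toy = [(20, 0.55), (21, 0.38), (22, 0.25), (23, 0.43),
       (24, 0.36), (25, 0.31), (26, 0.28)]
pos = point_of_stability(toy, rho=0.3, w=0.1)
print(pos)
```

In the toy data the estimate leaves the corridor at $n = 20$ and again at $n = 23$, so the trajectory is stable from $n = 24$ onward.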

Figure 1 from Schönbrodt and Perugini (2013)

Schönbrodt and Perugini applied their method to the correlation between two quantitative variables for several values of $\rho$, $w$, and the confidence of the POS by simulating 10,000 trajectories for each value of $\rho$. The major results are shown below in a reproduction of Schönbrodt's and Perugini's Table 1. Given that the average published effect has a value of $r = .21$ (Richard, Bond, & Stokes-Zoota, 2003), they interpret these results to mean that a good estimate of an effect requires a sample size between 150 and 250.

Estimating stability in other designs

In my opinion, the paper by Schönbrodt and Perugini is extremely interesting and provocative. However, most of my research, and, I believe, much of the research of other psychologists, involves estimating quantities other than the correlation between two quantitative variables. Indeed, psychologists use a huge variety of designs, including two-group designs, three-group designs, interaction designs, simple mediation designs, and many others. There are many differences between these designs and one involving two quantitative variables, any of which could plausibly have some impact on stability. For example, in many studies involving categorical variables (especially experimental studies), the researcher has some control over the level of the categorical variable that is sampled. Most researchers will balance their sampling across categories for maximum statistical efficiency, perhaps affecting stability.

With these differences in mind, I applied Schönbrodt's and Perugini's method to investigate stability in two-group, three-group, and interaction designs. All these designs involve categorical variables, and I balanced sampling across categories under the assumption that most researchers would sample for maximum efficiency. I also investigated mediation designs, which I will describe in a separate post. I was greatly aided in these efforts by the fact that Schönbrodt and Perugini made their source code freely available. Across all the cases I investigated, the changes I made to their code are minimal.

You can find more details about my methods in the technical note at the end of this post. My source code is freely available here.

Two-group designs

In two-group designs (i.e., designs involving one dichotomous and one quantitative variable), the emphasis is usually on estimating the mean difference (or standardized mean difference) of the quantitative variable across the two groups of the dichotomous variable. Fortunately, the standardized mean difference is simply a transformation of the correlation coefficient, as shown below.

$$d = \frac{2r}{\sqrt{1 - r^2}}$$

This means that the procedure developed by Schönbrodt and Perugini for the correlation between two quantitative variables should be easily applicable to two-group designs involving one dichotomous and one quantitative variable.
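The conversion is easy to check numerically. The sketch below assumes two equally sized groups, which is the condition under which the formula above holds and matches the balanced sampling used in this post; the function names are illustrative.

```python
import math

def r_to_d(r):
    """Standardized mean difference implied by a correlation r (equal group sizes)."""
    return 2.0 * r / math.sqrt(1.0 - r ** 2)

def d_to_r(d):
    """Inverse conversion, again assuming two equally sized groups."""
    return d / math.sqrt(d ** 2 + 4.0)

for r in (0.1, 0.2, 0.3, 0.5):
    print(f"r = {r:.1f}  ->  d = {r_to_d(r):.2f}")
```

For small effects $d$ is almost exactly twice $r$, which is why the $\rho = .1$ to $.2$ range corresponds to roughly $d = .2$ to $.4$.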

The results of this simulation are below. I have included a column showing the value of the standardized mean difference $d$ along with the correlation $\rho$.

For easy comparison to the case of two quantitative variables, you can find a table of the differences between the points of stability for the two-group case and the two-quantitative case ($POS_{\text{two group}} - POS_{\text{two quantitative}}$) below.

At the smallest values of $\rho$, the points of stability are large and similar to, or even slightly larger than, those in a design with two quantitative variables. At larger values of $\rho$, the points of stability in the two-group design decrease, and decrease faster than in the design with two quantitative variables.

Three-group designs

Many researchers test for mean differences in designs with three or more groups using an omnibus ANOVA. However, if we assume that most researchers are interested in specific differences between groups that can be represented using a one degree of freedom contrast (for example, the linear contrast $[-1, 0, 1]$), we can examine the stability of the correlation between this one degree of freedom contrast and the outcome variable just as we did in the two-group design.

This is exactly what I did. The results below represent the points of stability for the correlation $\rho$ (or standardized mean differences $d$) between a linear contrast and a quantitative variable for different values of $w$ and different levels of confidence. [I also investigated whether the results are sensitive to the form of the one degree of freedom contrast by running this same simulation with a Helmert contrast $[1, 1, -2]$; they are not].
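As an illustration of what "the correlation between a contrast and a quantitative variable" means in a balanced three-group design, here is a small sketch. It is not the simulation code behind the results: it generates normal outcomes directly instead of using the Ruscio and Kaczetow program, and the helper names and sample sizes are made up for the example.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_three_groups(rho, n_per_group, contrast, rng):
    """Balanced three-group sample whose contrast-outcome correlation is rho."""
    mean_c = sum(contrast) / 3
    # With equal group sizes, the contrast's variance is its mean squared deviation.
    var_c = sum((c - mean_c) ** 2 for c in contrast) / 3
    # Slope chosen so that, with unit error variance, the correlation equals rho.
    slope = rho / math.sqrt(var_c * (1.0 - rho ** 2))
    codes, ys = [], []
    for c in contrast:
        for _ in range(n_per_group):
            codes.append(c)
            ys.append(slope * (c - mean_c) + rng.gauss(0.0, 1.0))
    return codes, ys

rng = random.Random(49)  # arbitrary seed
codes_lin, y_lin = simulate_three_groups(0.3, 2000, [-1, 0, 1], rng)
codes_hel, y_hel = simulate_three_groups(0.3, 2000, [1, 1, -2], rng)
r_lin = pearson(codes_lin, y_lin)
r_hel = pearson(codes_hel, y_hel)
print(round(r_lin, 3), round(r_hel, 3))
```

Both contrasts should recover a correlation near .3, which is the quantity whose stability is tracked along each trajectory.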

The points of stability for a three-group design are largely similar to the points of stability for a two-group design; when $\rho$ or $d$ is small, large sample sizes are required to obtain stability. Compared to a two-group design, the points of stability do not decrease quite as quickly with increasing $\rho$; indeed, the points of stability are marginally larger than in a two-group design (1.20 cases, averaged across values of $\rho$, $w$, and confidence values).

Crossover interactions

Designs in which the interaction is the parameter of interest are different from two-group and three-group designs in that there are two independent variables in the design rather than one. Moreover, an interaction is typically tested in a model that includes the conditional, lower-order effects of these two variables in the model.

The estimator for the interaction was the semi-partial correlation of the interaction, conditional on the lower-order effects in the model. I also only tested stability in the presence of a crossover interaction -- in other words, in a case in which the main effects of both independent variables were 0. I tested stability in two cases -- one in which both independent variables were dichotomous, and one in which one independent variable was dichotomous and one was quantitative.
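A semi-partial correlation of this kind can be sketched by residualizing the product term on the two lower-order terms and correlating the outcome with what is left. The example below is an illustration under assumptions, not the code behind the results: it uses a balanced dichotomous-by-dichotomous design with effect codes of -1 and 1, normal errors, and hypothetical helper names.

```python
import math
import random

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def semipartial_r(y, target, cov1, cov2):
    """Correlation between y and the part of `target` not explained by cov1 and cov2."""
    y, t, a, b = center(y), center(target), center(cov1), center(cov2)
    # Solve the 2x2 normal equations for regressing target on the two covariates.
    saa, sbb, sab = dot(a, a), dot(b, b), dot(a, b)
    sat, sbt = dot(a, t), dot(b, t)
    det = saa * sbb - sab ** 2
    coef_a = (sat * sbb - sbt * sab) / det
    coef_b = (sbt * saa - sat * sab) / det
    resid = [ti - coef_a * ai - coef_b * bi for ti, ai, bi in zip(t, a, b)]
    return dot(y, resid) / math.sqrt(dot(y, y) * dot(resid, resid))

rng = random.Random(57)  # arbitrary seed
rho = 0.3
beta = rho / math.sqrt(1.0 - rho ** 2)  # pure crossover: no main effects
A, B, AB, Y = [], [], [], []
for a in (-1, 1):
    for b in (-1, 1):
        for _ in range(500):  # 500 cases per cell, balanced
            A.append(a)
            B.append(b)
            AB.append(a * b)
            Y.append(beta * a * b + rng.gauss(0.0, 1.0))

sr = semipartial_r(Y, AB, A, B)
print(round(sr, 3))
```

Because the design is balanced and effect-coded, the product term is orthogonal to both main effects, so the semi-partial correlation here coincides with the simple correlation between the product term and the outcome.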

Of all the designs I investigated, this design has points of stability most similar to those in a two-quantitative design. In fact, this is the only design for which the points of stability are consistently higher than those in the two-quantitative design -- by 2.17 cases, on average.

Conclusions

Across all the designs that I investigated, the points of stability were high when the population effect sizes were low, and quite similar to those found in the two-quantitative design investigated by Schönbrodt and Perugini. In two-group, three-group, and dichotomous by dichotomous interaction designs, the points of stability dropped more rapidly with increasing population effect sizes than did the points of stability in the two-quantitative case.

However, this relative decrease may be due to the fact that I used a balanced sampling scheme for all categorical variables across all the simulations I conducted. Studies with random sampling schemes may not enjoy this stability benefit. If the stability benefit is indeed due to the balanced sampling scheme, it is noteworthy that the points of stability in the categorical by quantitative interaction design were nonetheless higher on average than the points of stability in the two-quantitative design, despite the fact that sampling was balanced across levels of the categorical variable in the interaction design.

Overall, I echo the sentiments of Schönbrodt and Perugini by enjoining psychologists to increase their sample sizes if they wish for accurate estimates of their effects. Most effects in psychology are in the $\rho = .1$ to $.2$ ($d = .2$ to $.4$) range (Richard, Bond, & Stokes-Zoota, 2003), suggesting that sample sizes of ~150-250 or more are necessary for accurate effect size estimates.

Technical notes

For each design, I induced a correlation matrix that represented the population effect size of interest into a population of 100,000 cases using a program developed by Ruscio and Kaczetow (2009). In the case of interactions, I induced one correlation into 50,000 cases and a correlation of the opposite sign into a second population, then stacked the two populations on top of each other along with an indicator variable tracking the population the case belonged to.
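The stacking step for the interaction populations can be illustrated as follows. This sketch does not use the Ruscio and Kaczetow program; it simply generates bivariate normal data, uses much smaller populations than the ones described above to keep it fast, and all names are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def make_population(r, size, rng):
    """Bivariate normal population with correlation r (a stand-in generator)."""
    xs, ys = [], []
    for _ in range(size):
        x = rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(r * x + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0))
    return xs, ys

rng = random.Random(70)  # arbitrary seed
half = 5000              # illustration only; far smaller than the real populations
x_pos, y_pos = make_population(0.3, half, rng)
x_neg, y_neg = make_population(-0.3, half, rng)

# Stack the two halves and keep an indicator of which half each case came from.
x = x_pos + x_neg
y = y_pos + y_neg
group = [0] * half + [1] * half

print(round(pearson(x_pos, y_pos), 3),
      round(pearson(x_neg, y_neg), 3),
      round(pearson(x, y), 3))
```

Within each half the correlation keeps its sign, while in the stacked population the overall correlation between the two quantitative variables is near zero, which is what makes the interaction a crossover.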

For each population, I simulated 10,000 trajectories. Each trajectory started with a sample of 10 cases per category in the design, and ended with a sample of 1000. Because simulating these trajectories is computationally intensive, I conducted all simulations with a cluster managed by the University of Wisconsin-Madison's Center for High Throughput Computing.

After the trajectories were simulated, I found, for each trajectory and each desired half-width of the COS ($w = .10$, $w = .15$, $w = .20$), the point at which adding successive cases did not cause a break from the COS. I defined "break" as a single case for which the statistic of interest left the COS (Schönbrodt and Perugini found that defining a "break" in other ways, such as two consecutive cases that leave the COS, has a minimal impact on the estimated points of stability). I calculated the 80%, 90%, and 95% quantiles of these breaks, resulting in the points of stability at 80%, 90%, and 95% levels of confidence.
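The final quantile step might look like the sketch below. The break points are made-up numbers, the function name is hypothetical, and the nearest-rank quantile rule is an assumption chosen for simplicity; other quantile definitions interpolate between values and can give slightly different results.

```python
import math

def pos_at_confidence(break_points, confidence):
    """Sample size by which `confidence` of trajectories have had their last break."""
    ordered = sorted(break_points)
    # Nearest-rank quantile: the smallest value with at least `confidence` of
    # the data at or below it. round() guards against floating-point noise.
    rank = max(1, math.ceil(round(confidence * len(ordered), 9)))
    return ordered[rank - 1]

# One (invented) last-break sample size per simulated trajectory.
toy_breaks = [31, 45, 52, 58, 64, 70, 83, 97, 120, 161]

pos_80 = pos_at_confidence(toy_breaks, 0.80)
pos_90 = pos_at_confidence(toy_breaks, 0.90)
pos_95 = pos_at_confidence(toy_breaks, 0.95)
print(pos_80, pos_90, pos_95)
```

Higher confidence levels pick values further into the right tail of the break-point distribution, which is why the 95% points of stability are always at least as large as the 80% ones.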