General Information

For generating the simulated populations included in the manuscript,
we assumed a uniform distribution over a set of parameters, reflecting
our lack of information about the nature of clustering in real
datasets. To gain some preliminary information about real datasets,
we revisited some of our own data using linear mixed-effect modeling.

Datasets

Description of datasets used in the analysis. The "Design" column describes the design of the study, indicating the number of levels of each factor (2) as well as whether it was within (W) or between (B) subjects / items.

| ID | Article | Expt. | Subjs/Items | Design | Task | Manipulation | Dependent variables |
|----|---------|-------|-------------|--------|------|--------------|---------------------|
| 1–4 | Matsuki et al. (2011) | 3 | 32/48 | 2W / 2W | Silent reading (eyetracking) | High/low event prototypicality of patient noun | First fixation, gaze duration, "go past" times, total time |
| 5 | Yao & Scheepers (2011) | 1 | 20/24 | 2Wx2W / 2Wx2W | Oral reading | Context (fast/slow), quotation style (direct/indirect) | Syllables per second |
| 6 | Yao & Scheepers (2011) | 2 | 48/24 | 2Wx2W / 2Wx2W | Silent reading | Context (fast/slow), quotation style (direct/indirect) | "Go past" times (ms) |
| 7–8 | Levy et al. (2011) | 1 | 41/24 | 2W / 2W | Self-paced reading | TODO | Reading times (same DV considered separately over two manipulations) |
| 9 | Rohde et al. (2011) | 2 | 55/20 | 2W / 2W | Self-paced reading | TODO | Reading time |
| 10 | Keysar et al. (2000) | 1 | 18/12 | 2W / 2W | Visual-world eyetracking (spoken language comprehension) | Competitor present/absent | Latency of target gaze |
| 11 | Kronmüller & Barr (2007) | 2 | 56/32 | 2Wx2Wx2W / 2Wx2Wx2W | Spoken language comprehension | Speaker, precedent, cognitive load | Response time |
| 12 | Barr & Seyfeddinipur (2011) | 1 | 92/12 | 2Wx2W / 2Wx2W | Spoken language comprehension | Speaker, filled/unfilled pause | Distance of mouse cursor from target |
| 13 | Gann & Barr (in press) | 1 | 64/16 | 2Bx2Wx2W / 2Wx2Wx2B | Referential communication | Listener, new/old referent, feedback | Speech onset latency |

Parameter space used in the simulations

For convenience, the distributions of population parameters as given
in the original manuscript are reproduced below.

Ranges for the population parameters; \( ∼ U(min, max) \) means the parameter was sampled from a uniform distribution with range \([min, max]\).

| Parameter | Description | Value |
|-----------|-------------|-------|
| \( \beta_{0} \) | grand-average intercept | \( \sim U(-3, 3) \) |
| \( \beta_{1} \) | grand-average slope | 0 (H0 true) or .8 (H1 true) |
| \( {\tau_{00}}^2 \) | by-subject variance of \( S_{0s} \) | \( \sim U(0, 3) \) |
| \( {\tau_{11}}^2 \) | by-subject variance of \( S_{1s} \) | \( \sim U(0, 3) \) |
| \( \rho_S \) | correlation between \( (S_{0s},S_{1s}) \) pairs | \( \sim U(-.8, .8) \) |
| \( {\omega_{00}}^2 \) | by-item variance of \( I_{0i} \) | \( \sim U(0, 3) \) |
| \( {\omega_{11}}^2 \) | by-item variance of \( I_{1i} \) | \( \sim U(0, 3) \) |
| \( \rho_I \) | correlation between \( (I_{0i},I_{1i}) \) pairs | \( \sim U(-.8, .8) \) |
| \( \sigma^2 \) | residual error | \( \sim U(0, 3) \) |
| \( p_{missing} \) | proportion of missing observations | \( \sim U(.00, .05) \) |
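Sampling from this parameter space can be sketched as follows. This is an illustrative Python sketch only (the original simulations were not run with this code), and the dictionary keys are our own names for the parameters in the table above:

```python
import random

def sample_population_parameters(h1_true=False):
    """Draw one set of population parameters from the uniform
    ranges given in the table above (illustrative sketch)."""
    return {
        "beta0":     random.uniform(-3, 3),       # grand-average intercept
        "beta1":     0.8 if h1_true else 0.0,     # grand-average slope
        "tau00":     random.uniform(0, 3),        # by-subject intercept variance
        "tau11":     random.uniform(0, 3),        # by-subject slope variance
        "rho_S":     random.uniform(-0.8, 0.8),   # by-subject intercept/slope corr.
        "omega00":   random.uniform(0, 3),        # by-item intercept variance
        "omega11":   random.uniform(0, 3),        # by-item slope variance
        "rho_I":     random.uniform(-0.8, 0.8),   # by-item intercept/slope corr.
        "sigma2":    random.uniform(0, 3),        # residual variance
        "p_missing": random.uniform(0.0, 0.05),   # proportion of missing data
    }
```

Each simulated population in a run would be generated from one such draw, with \( \beta_1 \) fixed at 0 or .8 depending on whether the null hypothesis is true.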

Analyses

Slope variance relative to intercept variance

The first analysis considered how much of the total variance associated with a given sampling unit (subject or item) was attributable to the random slope versus the random intercept. For subjects, we used the following formula (and analogously, with \( \omega \) in place of \( \tau \), for items):

\(\frac{{\tau_{11}}^2}{{\tau_{00}}^2+{\tau_{11}}^2}\)
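This proportion can be computed directly from estimated variance components; a minimal sketch (function and argument names are ours):

```python
def slope_variance_proportion(tau00_sq, tau11_sq):
    """Slope variance as a proportion of the total random-effect
    variance for one sampling unit: tau11^2 / (tau00^2 + tau11^2)."""
    return tau11_sq / (tau00_sq + tau11_sq)

# e.g., an intercept variance of 3.0 and a slope variance of 1.0
# give a slope proportion of 1.0 / (3.0 + 1.0) = 0.25
```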

Slope variance as a proportion of total variance for a given sampling unit

| ID | Subject | Item |
|----|---------|------|
| 1 | .00003 | .73085 |
| 2 | .00564 | .37926 |
| 3 | .00362 | .29886 |
| 4 | .00035 | .04059 |
| 5 | .17032 | .25304 |
| 6 | .00463 | .03488 |
| 7 | .35727 | .51755 |
| 8 | .01663 | .39627 |
| 9 | .04805 | .44358 |
| 10 | .64403 | .04626 |
| 11 | .49218 | .49753 |
| 12 | .77840 | n/a |
| 13 | .40245 | .57898 |
| MIN | .00003 | .03488 |
| MEAN | .22489 | .35147 |
| MED | .04805 | .38777 |
| MAX | .77840 | .73085 |

There is a broad range across experiments, with slope variance
accounting for anywhere from less than 1% to 78% of the total subject
variance. The by-item proportions also show broad dispersion, with
slope variance accounting for 3% to 73% of the total item variance.
The by-subject proportions appear bimodally distributed, with
observations clumping toward either end of the range.

Random effects in relation to residual variance

One thing that became apparent in our analysis of the real datasets is
that our simulations assumed by-subject and by-item random-effect
variance roughly proportionate to the residual variance. This
assumption is unlikely to hold in actual datasets, where random-effect
variance is typically much smaller than residual variance. In other
words, actual datasets tend to be much noisier than our simulated
datasets.

Below are the results for each dataset, showing the residual variance
and the by-subject/by-item random-effect variances expressed as a
proportion of that residual variance. For datasets containing multiple
factors (e.g., 2x2 designs), we report the average by-subject and
by-item slope variances.

| ID | Residual (\(\sigma^2\)) | \({\tau_{00}}^2/\sigma^2\) | \({\tau_{11}}^2/\sigma^2\) | \({\omega_{00}}^2/\sigma^2\) | \({\omega_{11}}^2/\sigma^2\) |
|----|------------|--------|--------|--------|--------|
| 1 | 3572 | 0.2163 | 0.0000 | 0.0143 | 0.0389 |
| 2 | 8438.8531 | 0.1404 | 0.0008 | 0.0282 | 0.0172 |
| 3 | 24387.6356 | 0.1046 | 0.0004 | 0.1339 | 0.0571 |
| 4 | 29933.581 | 0.3207 | 0.0001 | 0.1473 | 0.0062 |
| 5 | 0.493532 | 1.8765 | 0.3852 | 1.0238 | 0.3468 |
| 6 | 275362.526 | 0.4492 | 0.0021 | 1.0269 | 0.0371 |
| 7 | 230191 | 0.1058 | 0.0588 | 0.0910 | 0.0976 |
| 8 | 231824.16 | 0.1721 | 0.0029 | 0.0864 | 0.0567 |
| 9 | 51371.6 | 0.4117 | 0.0208 | 0.1076 | 0.0858 |
| 10 | 7536625 | 0.0363 | 0.0656 | 0.1198 | 0.0058 |
| 11 | 406043 | 0.2286 | 0.2216 | 0.3269 | 0.3237 |
| 12 | 0.128353 | 0.1042 | 0.3661 | 0.0000 | 0.0000 |
| 13 | 242830 | 0.4258 | 0.2867 | 0.0820 | 0.1127 |
| MEAN | | 0.3532 | 0.1085 | 0.2452 | 0.0912 |
| MED | | 0.2163 | 0.0208 | 0.1076 | 0.0567 |
| MIN | | 0.0363 | 0.0000 | 0.0000 | 0.0000 |
| MAX | | 1.8765 | 0.3852 | 1.0269 | 0.3468 |

One thing that is apparent is that the by-subject and by-item random
effects, expressed as a proportion of residual variance, vary widely
across studies (from 0% to 187% of the residual variance), though they
are typically only about 10–40% of it. Generally, we also see more
variance on the intercept than on the slope. Note that slope variance
does not appear to be uniformly distributed over the range; rather, it
clumps at the top and bottom of the range. It should be kept in mind
that whereas intercept variances index differences in overall level,
slope variances index differences in sensitivity to the manipulations.
It is possible that participants (or items) were simply insensitive to
some of the manipulations in these studies, yielding neither slope
variance nor any overall effect.

Subsampling from the observed ranges

The next analysis addresses how unrepresentative the main results from
our simulations might be. Specifically, did the parameter space we
used lead us to be too pessimistic about random-intercepts-only models
and model-selection approaches and too optimistic about maximal
models?

To address this, from the values reported in the previous section we
derived the following plausible ranges from which to subsample
our simulation data:

| Parameter | Min | Max |
|-----------|-----|-----|
| \({\tau_{00}}^2/\sigma^2\) | 0.00 | 0.45 |
| \({\tau_{11}}^2/\sigma^2\) | 0.00 | 0.40 |
| \({\omega_{00}}^2/\sigma^2\) | 0.00 | 0.35 |
| \({\omega_{11}}^2/\sigma^2\) | 0.00 | 0.35 |
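The subsetting step can be sketched in Python. This is a hypothetical illustration; the field names are ours, not those of the original simulation code:

```python
# Plausible ranges (variance / residual variance) derived from the
# real datasets, as in the table above.
RANGES = {
    "tau00_ratio":   (0.00, 0.45),
    "tau11_ratio":   (0.00, 0.40),
    "omega00_ratio": (0.00, 0.35),
    "omega11_ratio": (0.00, 0.35),
}

def in_subspace(run):
    """True if every variance/residual ratio for this simulation run
    falls inside its plausible range."""
    return all(lo <= run[k] <= hi for k, (lo, hi) in RANGES.items())

# Toy example: the first run is retained, the second is excluded
# because its by-subject intercept ratio (0.90) exceeds 0.45.
runs = [
    {"tau00_ratio": 0.20, "tau11_ratio": 0.05,
     "omega00_ratio": 0.10, "omega11_ratio": 0.02},
    {"tau00_ratio": 0.90, "tau11_ratio": 0.05,
     "omega00_ratio": 0.10, "omega11_ratio": 0.02},
]
subsample = [r for r in runs if in_subspace(r)]
```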

This resulted in the selection of 3154 runs (about 3% of the total)
for further analysis. On this subsample, we compared the power of
maximal LMEMs to min-\(F'\), \(F_1 \times F_2\), RI-only LMEMs, and
LMEMs using model selection for the random effects. From the various
possible model-selection techniques for within-items designs, we chose
the best-performing one (the "backward best path" model, with
\(\alpha\) for inclusion set to .05) to see whether it would improve
power in this region of the space relative to the maximal model. The
results are in the tables below.

Type I error rate for original simulations and for the parameter subspace

| Analysis | wsbi.12 Subspace | wsbi.12 Original | wsbi.24 Subspace | wsbi.24 Original | wswi.12 Subspace | wswi.12 Original | wswi.24 Subspace | wswi.24 Original |
|----------|------|------|------|------|------|------|------|------|
| min-\(F'\) | .0384 | .0445 | .0387 | .0446 | .0216 | .0271 | .0263 | .0307 |
| \(F_1 \times F_2\) | .0653 | .0628 | .0770 | .0772 | .0549 | .0574 | .0656 | .0724 |
| LMEM, Maximal | .0758 | .0703 | .0596 | .0575 | .0611 | .0589 | .0592 | .0559 |
| LMEM, Selection | .0796 | .0702 | .0612 | .0575 | .1053 | .0683 | .0726 | .0579 |
| LMEM, RI-only | .1055 | .1023 | .1027 | .1105 | .2483 | .4398 | .3167 | .4980 |

For between-items (wsbi) designs, the Type I error rates do not
differ much from the original simulations for any of the analyses.
For within-items (wswi) designs, ANOVA-based approaches and maximal
LMEMs perform similarly on the subsample and on the original sample.
However, model-selection approaches become slightly more
anticonservative, while RI-only LMEMs become substantially less
anticonservative on the subsample. Even so, the Type I error rates of
RI-only LMEMs remain intolerably high (.25 and .32).

Power (and corrected power, CP) for original simulations and for the parameter subspace

| Analysis | wsbi.12 Subspace | wsbi.12 Original | wsbi.24 Subspace | wsbi.24 Original | wswi.12 Subspace | wswi.12 Original | wswi.24 Subspace | wswi.24 Original |
|----------|------|------|------|------|------|------|------|------|
| min-\(F'\) | .3003 | .2099 | .4984 | .3281 | .4471 | .3268 | .6826 | .5116 |
| \(F_1 \times F_2\) | .3675 | .2518 | .5961 | .4034 | .5961 | .4400 | .8098 | .6432 |
| LMEM, Maximal | .3965 | .2672 | .5643 | .3636 | .6215 | .4603 | .7921 | .6104 |
| LMEM, Selection | .4017 | .2689 | .5685 | .3636 | .6715 | .4730 | .8025 | .6120 |
| LMEM, RI-only | .4543 | .3185 | .6368 | .4492 | .8708 | .8534 | .9610 | .9351 |
| \(F_1 \times F_2\) (CP) | .3291 | .2236 | .5187 | .3375 | .5748 | .4158 | .7695 | .5780 |
| LMEM, Maximal (CP) | .3242 | .2225 | .5322 | .3418 | .5830 | .4325 | .7685 | .5914 |
| LMEM, Selection (CP) | .3266 | .2229 | .5301 | .3424 | .5200 | .4144 | .7495 | .5880 |
| LMEM, RI-only (CP) | .3231 | .2156 | .5040 | .3140 | .6180 | .3791 | .7961 | .5313 |

It is notable that all approaches (including maximal LMEMs) are more
powerful on the subspace than on the original dataset. When power is
corrected for anticonservativity (rows labeled "CP" in the table), one
interesting outcome is that within the parameter subspace, maximal
LMEMs are nearly always just as powerful as, and occasionally even
more powerful than, approaches using model selection. Finally, for
within-items designs, RI-only LMEMs, once corrected for
anticonservativity, showed only a minor advantage over maximal LMEMs
(6% and 4% increases in power for 12- and 24-item datasets,
respectively). In contrast, for within-items designs, model-selection
approaches showed a disadvantage in corrected power relative to
maximal LMEMs (11% and 2.5% drops for 12- and 24-item datasets,
respectively).

Summary

In closing, the analyses of actual datasets show that our simulations
assumed by-subject and by-item random variance to be a larger portion
of the total variance than turned out to be the case. Yet it was clear
that even within the subregion of the parameter space spanning the
range of the observed datasets, maximal models offer the best
compromise between Type I error control and power. Unfortunately, we
have no way of knowing whether our datasets are representative of the
kinds of experimental datasets analyzed in experimental psychology.
Nonetheless, these findings lend further confidence to our contention
that maximal LMEMs provide the best approach for confirmatory
hypothesis testing.