Patrick Warren’s questioning of the evidence for the value of Bedtime Math in helping kids with math (see last Thursday’s post) motivates me to state more clearly what the evidence is. The result the authors highlight most is certainly vulnerable to serious criticism: the result that among those students using the math app, math performance has a substantial correlation with how often the app is used. As Patrick points out, there is a serious problem here of omitted variable bias: how much a child likes math is likely to have a positive effect on how often the child uses the app and to have a positive effect on math performance regardless of whether the app does anything to augment performance or not. That is, kids who like math will tend to be good at math whether or not they have a math app because they will find some way to do math and thereby get better at it. (See my column “How to Turn Every Child into a ‘Math Person’”.) So that result offers very little proof of the efficacy of the app.
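To see how stark omitted variable bias can be, here is a minimal simulation with made-up numbers: a latent "likes math" trait drives both app usage and math achievement, the app itself has zero true effect, and yet usage and achievement end up strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 420  # same order of magnitude as the study's sample (illustrative)

# Hypothetical data-generating process: the app has NO true effect,
# but a latent "likes math" trait raises both app usage and math scores.
likes_math = rng.normal(size=n)
usage = 2.0 + 1.5 * likes_math + rng.normal(size=n)    # e.g., sessions per week
score = 50.0 + 5.0 * likes_math + rng.normal(size=n)   # math achievement

r = np.corrcoef(usage, score)[0, 1]
print(f"usage-score correlation despite zero app effect: {r:.2f}")
```

All the numbers here are invented for illustration; the point is only that a substantial usage–performance correlation is exactly what one would expect even from a completely inert app.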

However, the article has another piece of evidence that the authors should have led with. Parents were categorized into those who were anxious about math and those who weren’t. Among the children of math-anxious parents, the difference in math performance between kids who received the Bedtime Math app and kids who received a reading app was substantial: extra math achievement equivalent to what students get from 3 months of school. This was enough to be quite unlikely to be due to chance. The authors report a 4.8% probability of this result being due to chance, but that is for a two-tailed test, which is only appropriate if one would have taken seriously a seeming finding that the kids did worse in math because of having a math app. With the more appropriate one-tailed test, there would be only a 2.4% probability of this result being due to chance.

In other words, of the 4.8% chance of declaring a fluke a real result that the authors report, only 2.4% is the chance of declaring a fluke in the positive direction a real result. The other 2.4% is the chance of declaring a fluke in the negative direction a real result. If one is bound and determined from the beginning not to declare a seeming result in the negative direction a real result, then the overall probability of declaring a fluke a real result is only the 2.4% chance coming from a seeming result in the positive direction. (See the Wikipedia article on “One- and two-tailed tests.”)

Of course, this 2.4% p-value is truly correct only if this hypothesis of an effect on math anxious parents had been the one and only central hypothesis spelled out in advance–as could be true in a replication of the experiment.
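The arithmetic behind halving the p-value can be checked directly. A sketch using a normal approximation to the reported t statistic (the exact degrees of freedom are not reported in the passage, and the halving relationship does not depend on them):

```python
from statistics import NormalDist

t_stat = 1.99  # t statistic reported in the paper

# Normal approximation to the t distribution (reasonable at this sample size):
p_one = 1 - NormalDist().cdf(t_stat)  # chance of a fluke in the positive direction
p_two = 2 * p_one                     # chance of a fluke in either direction

print(f"one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
```

The normal approximation gives values slightly below the paper's exact t-based 4.8% and 2.4%, but the key identity holds in general: the two-tailed p-value is exactly twice the one-tailed p-value for a symmetric test statistic.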

Why is this better evidence that the app does the job it was intended to do than the relationship between how much the app is used and math performance? The difference is that assignment to the math app group as opposed to the reading app group was random, and so has no reasonable causal path to affect math performance other than through the effects of the app itself. But the amount of time spent with the app is nonrandom, and can easily reflect characteristics of a child or a child’s parents that could affect math performance in ways that don’t depend on the app at all.

The authors genuinely don’t seem to realize that the effect of assignment to the math app or the reading app–in interaction with math anxiety on the part of the parents–represents their only piece of solid evidence for the efficacy of the app. Not only is this result not clearly described in the abstract, it is not featured in a figure. The figures are reserved for the result about usage that, as discussed above, provides very little proof of the efficacy of the app. Their language instead suggests that they are going the extra mile by doing this intent-to-treat analysis. It wasn’t the extra mile. It was the first mile! But it was a good mile.

Here is the key passage:

We expected the math achievement of children with high-math-anxious parents to be more affected by use of the math (versus reading) app because these children would not generally be provided with high-quality math input at home (6). Therefore, we first separated parents on the basis of whether they were lower or higher in math anxiety (median split). We then performed an “intent-to-treat” analysis in which we looked at the effect of group (math versus reading app) on children’s end-of-year math achievement (controlling for beginning-of-year math achievement) independent of actual app usage. For children of high-math anxious parents, we found a significant effect of group, with children in the math group outperforming those in the reading group by almost 3 months in math achievement by school year’s end [beta-hat_21 = 5.25; t=1.99; P=0.048]. We did not find this same pattern for children of low-math anxious parents [beta-hat_31 = −0.61; t = −0.27; P = 0.79] (Model S3). An intent-to-treat analysis allows us to rule out factors possibly related to app usage—such as motivation or interest—as explaining our findings.

This is the heart of the paper, unbeknownst to the authors. You don’t need to read anything else.
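The intent-to-treat logic in the quoted passage can be sketched on synthetic data: regress end-of-year achievement on the random assignment indicator, controlling for beginning-of-year achievement, ignoring actual usage entirely. Everything below is invented for illustration except the 5.25-point effect size, which is borrowed from the paper as the "true" effect planted in the simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 210  # roughly the math-anxious half of 420 families (illustrative)

# Random assignment: math app (1) versus reading app (0).
group = rng.integers(0, 2, size=n)
baseline = rng.normal(50, 10, size=n)  # beginning-of-year math achievement
true_effect = 5.25                     # planted effect, matching the paper's estimate
endline = baseline + true_effect * group + rng.normal(0, 10, size=n)

# Intent-to-treat regression: endline on intercept, assignment, and baseline.
X = np.column_stack([np.ones(n), group, baseline])
coef, *_ = np.linalg.lstsq(X, endline, rcond=None)
print(f"estimated ITT effect of assignment: {coef[1]:.2f}")
```

Because assignment is random, the coefficient on `group` recovers the planted effect (up to sampling noise) even though usage, motivation, and interest are nowhere in the regression–which is exactly why the intent-to-treat estimate escapes the omitted variable bias that plagues the usage result.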

The one thing that would make one suspicious of this result is the possibility that the authors tinkered in various ways–including the split by math anxiety–to get the results they wanted. But that is easily remedied simply by having another group of researchers replicate the experiment. Although the substantive size of the effect is large, there is enough random variation that the statistical precision in the Berkowitz-Schaeffer-Maloney-Peterson-Gregor-Levine-Beilock experiment with 420 families is none too high. So it would be wise for someone undertaking a replication to involve at least 1000 families. The importance of the scientific question amply deserves that kind of care.
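A rough power calculation backs up the call for a larger replication. The sketch below assumes the true effect equals the paper's point estimate (5.25), backs the standard error out of the reported t statistic (1.99), uses a normal approximation, and scales the standard error by the square root of the sample-size ratio–all simplifying assumptions, not the paper's own analysis.

```python
from statistics import NormalDist

z = NormalDist()
effect = 5.25            # point estimate from the paper (assumed to be the truth)
se_420 = effect / 1.99   # standard error implied by the reported t statistic

def power(se, alpha=0.05):
    """One-tailed power to detect `effect` at significance level alpha."""
    z_crit = z.inv_cdf(1 - alpha)
    return 1 - z.cdf(z_crit - effect / se)

# The standard error shrinks with the square root of the sample size.
se_1000 = se_420 * (420 / 1000) ** 0.5

print(f"approximate power with 420 families:  {power(se_420):.2f}")
print(f"approximate power with 1000 families: {power(se_1000):.2f}")
```

Under these assumptions, the original design had only middling power–a true effect of this size would reach one-tailed significance well under two-thirds of the time–while 1000 families would push power above 90%, which is why a replication at that scale is worth the trouble.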