What Counts as Success in Vocabulary Instruction?

I’ve discussed in previous posts (here and here) the inefficiency of academic vocabulary teaching programs such as Word Generation. In one evaluation of the program, Snow, Lawrence, and White (2009) found that after 30 hours of instruction, Word Generation students learned fewer than three extra words compared to a control group.

That’s a whopping one new word for every 10 hours of instruction.

Yet the researchers in these studies usually consider their vocabulary instruction a success. In Snow et al., after claiming (wrongly) that their Word Generation teaching program produced two years’ worth of gains in just 22 weeks, the researchers added this justification as proof that the massive effort to teach three new words was worth it:

It is of interest to compare the effect size obtained with the Word Generation curriculum to that obtained in other vocabulary interventions. A similarly structured intervention, the Vocabulary Improvement Program (Carlo et al., 2004), obtained an effect size of .50. The Stahl and Fairbanks (1986) meta-analysis of vocabulary curricula reviewed studies with effect sizes ranging as high as 2 under short-term laboratory-teaching conditions and as low as 0 under more authentic educational conditions. Thus, although Word Generation is not just a vocabulary intervention, and by design did not try to teach large numbers of words, its impact on students compares well with that of other successful programs. (p. 341, emphasis added)

To understand Snow et al.’s logic, it is necessary to know a little about effect sizes (see also this post).

Effect sizes are very useful in comparing the results of different studies or even results within the same study. The most common effect size used in educational experiments is sometimes called the standardized mean difference (Bloom, Hill, Black, & Lipsey, 2008 (PDF)). This is computed by taking the average (mean) score of the treatment group minus the average score of the control group, and dividing that result by the common or pooled standard deviation. There are other ways of computing effect sizes, but that’s a good enough method for our purposes.
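The computation described above can be sketched in a few lines of Python. The scores below are made-up illustrative numbers, not data from any study discussed in this post.

```python
# A minimal sketch of the standardized mean difference, computed as
# (treatment mean - control mean) / pooled standard deviation.
# The score lists passed in below are invented for illustration only.
from statistics import mean, stdev

def pooled_sd(a, b):
    """Pooled (common) standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return var ** 0.5

def effect_size(treatment, control):
    """Standardized mean difference between a treatment and a control group."""
    return (mean(treatment) - mean(control)) / pooled_sd(treatment, control)

print(round(effect_size([8, 9, 10, 9], [4, 5, 6, 5]), 2))  # 4.9
```

Note that a huge standardized difference like the 4.9 here says nothing by itself about how many actual words (or points, or anything else) were gained; that is exactly the point taken up below.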

Effect sizes tell us the magnitude, in statistical terms, of the difference that a treatment made compared to an alternative. But you still need some context to make sense of what those numbers mean in practical terms (for a good but at times technical discussion, see Lipsey et al. 2012 (PDF)).

It is true that Snow et al.’s effect size is in the ballpark of other studies of this type, but what if all of these other studies also yielded terrible practical results?

Let’s take as an example Stahl (1983), the published study with one of the largest effect sizes in a relatively recent meta-analysis of vocabulary instruction studies that aimed to improve reading comprehension (Elleman, Lindo, Morphy, & Compton, 2008). Elleman et al. list the effect size of Stahl’s experiment as 2.02, which most researchers would certainly judge a very “large effect.”

Stahl studied the impact of two different methods of teaching vocabulary to 5th graders: having students study definitions versus having them study definitions plus looking at the word in context. Stahl also included a control group, students who never saw the taught words at all.

(Side note: “No treatment” control groups such as this can be useful in research, but I suspect that a lot of teachers who sit through in-services listening to a presenter or an administrator drone on about “what the research says” are unaware that Method X being advocated was probably compared to doing nothing at all. This is very common in educational research, so caveat lector.)

After teaching the kids for a total of about 70 minutes, Stahl found that the best-performing treatment group (the Definitions + Context condition) averaged 8.32 on a 10-question multiple-choice vocabulary test, while the control group averaged 4.47. When you divide the difference between the scores (3.85) by the standard deviation, you get the large effect size of 2.02.
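As a quick sanity check on those figures: the reported mean difference and effect size together imply a standard deviation of roughly 1.9 on the 10-item test.

```python
# Backing out the standard deviation implied by Stahl's reported numbers.
treatment_mean = 8.32        # Definitions + Context condition
control_mean = 4.47          # no-treatment control
reported_effect_size = 2.02  # as listed in Elleman et al.

diff = treatment_mean - control_mean        # 3.85
implied_sd = diff / reported_effect_size    # ~1.91

print(round(diff, 2), round(implied_sd, 2))  # 3.85 1.91
```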

After an hour or so of training, then, the treatment group knew about four more words than the kids in the control group. (All of the students were pretested before the experiment to make sure they did not already know a “significant portion” of the taught words; p. 37.)

Should we be impressed with this “success”?

Granted, it’s leaps and bounds better than what Snow et al. achieved (0.08 extra words per hour!), but as Gerry Coles has noted, we always need to consider the question, “Compared to what?”

In this case, it was compared to doing literally nothing.

But the kids in Stahl’s study didn’t have to do “nothing.” They could have read instead.

How many words would the students have acquired if they were just given the opportunity to read?

Nagy and his colleagues (Nagy, Herman, & Anderson, 1985; Herman, Nagy, & Anderson, 1987) estimated that the average middle school reader picks up the meaning of around 5-10% of the unknown words in a text by “just” reading. Several other researchers have replicated Nagy and colleagues’ findings with both first- and second-language readers, children and adults alike (see Swanborn and de Glopper, 1999 for a partial review, but also here).*

Assuming, as Nagy et al. do, that about 2% of the words will be unknown to the reader, and that the average reading rate for a middle-schooler is about 166 wpm, we can calculate the number of words likely to be acquired without any special instruction:

166 wpm × .02 × .05 = 0.166 words per minute ≈ 9.96 words per hour.

Stahl’s “large effect” now seems rather less impressive. Just reading is nearly three times as efficient as drilling students in vocabulary (roughly 10 words per hour versus about 3.4). Stahl’s results, I should add, were unusually good in terms of the number of words learned per hour; most other studies do far worse.
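The per-hour rates discussed in this post can be lined up side by side. The inputs are the figures quoted above (166 wpm, 2% unknown words, the low-end 5% pickup rate, Stahl’s roughly 4 words in about 70 minutes, and Snow et al.’s roughly 0.08 words per hour); all are rough estimates, not precise measurements.

```python
# Words acquired per hour under each condition, using the rough figures
# quoted in this post.
reading = 166 * 0.02 * 0.05 * 60   # incidental acquisition from reading
stahl = 4 / (70 / 60)              # Stahl (1983): ~4 words in ~70 minutes
snow = 0.08                        # Snow et al. (2009), as noted above

print(round(reading, 2))           # 9.96
print(round(stahl, 2))             # 3.43
print(round(reading / stahl, 1))   # 2.9 -- reading is ~3x as efficient
```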

A large effect size, then, doesn’t always mean a large practical effect, and it is definitely not a measure of relative instructional efficiency. In the case of vocabulary instruction, it may only mean that your study’s results are just as lousy as previous studies’.

*I recently submitted a paper to a major reading journal on the topic of vocabulary instruction. Nearly all of the reviewers and the editors doubted that you could, in fact, acquire much vocabulary from reading, either ignoring or dismissing 30 years of research on the topic. So there’s that.