Congress may be getting dumber, but grade levels don’t prove it

A news story making the rounds this week claims that the members of the U.S. Congress have stopped talking at an 11th-grade level and started talking at a 10th-grade level. This fits very neatly into the overall feeling that America is becoming ever more anti-intellectual, and that Congress has become a group of petty and immature cliques who exist primarily to prevent each other from accomplishing anything, which is why the story has picked up steam. And perhaps these feelings are accurate, but this story doesn’t provide any evidence for them.

In short, the Flesch-Kincaid readability test that’s used in this analysis is completely inappropriate for the task.

How do we deal with speech errors? Speech has something that writing doesn’t have: disfluencies. Whether it’s a filled pause (uh, um, you know), a correction (We have — I mean, don’t have), or an aborted phrase (I am a man with– I have goals), lots of words come through in speech that wouldn’t appear in edited writing. Here’s an example from the 2008 debate, where Gwen Ifill said:

“The House of Representatives this week passed a bill, a big bailout bill — or didn’t pass it, I should say.”

That’s a sentence supposedly at the eighth-grade level. If we remove the mistakes and repetitions, we get a sentence that drops a full grade level. That’s the same drop that Congress has supposedly undergone. Maybe they just started editing the Congressional Record more tightly?
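As a toy illustration of how much editorial choice is involved, here’s a sketch of a filler-word stripper in Python. The filler inventory and regex are my own crude assumptions, not anyone’s published method; real disfluency detection (corrections, aborted phrases) is much harder than this.

```python
import re

# Hypothetical, minimal filler inventory; real transcripts have many more.
FILLERS = r"(?:uh|um|er|you know)"

def strip_fillers(text):
    # Remove a filler word plus any comma-and-space padding around it.
    cleaned = re.sub(rf",?\s*\b{FILLERS}\b,?", "", text, flags=re.IGNORECASE)
    # Collapse any doubled spaces the removal left behind.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Even this trivial preprocessing step changes the word and sentence counts the grade-level formula depends on, so two transcribers who edit differently will get different grades for the same speech.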

Grade levels aren’t based on content or ideas. The Flesch-Kincaid grade level calculation uses two statistics: syllables per word and words per sentence. These are imprecise stand-ins for what we really want, which is presumably the difficulty of the individual words and the complexity of the sentence structure. A word’s difficulty is tied to its predictability in context, its frequency in the language, its morphological complexity, and other factors, all of which are only loosely correlated with its number of syllables. Longer words will in general be more difficult, but there is a lot of noise in the correlation. Because we’re only using an estimate of word difficulty, our estimate of the grade level is inherently imprecise.
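To make the formula concrete, here’s a minimal sketch of the Flesch-Kincaid grade level calculation in Python. The syllable counter is a naive vowel-group heuristic of my own (real syllabification is harder), which is exactly the kind of rough estimate that makes these scores noisy in practice.

```python
import re

def count_syllables(word):
    # Naive heuristic: count runs of vowel letters as syllables.
    # Real syllabification is harder, adding more noise to the estimate.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    # Crude regex splits for sentences and words.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # The published Flesch-Kincaid grade-level formula:
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

Note that everything here is a surface count: nothing in the function looks at what the words mean, only how long they are and how they’re punctuated.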

There is no punctuation in speech. There are lots of different ways to punctuate a speech. Is a given pause supposed to indicate a comma, a semicolon, or a period? The difference between these can be substantial; Nilep’s post shows how punctuating the speech errors as sentences of their own drops a sentence from grade level 28(!) to 10.

The rhetorical style of a speaker also comes into play here. Suppose Senator X and Senator Y deliver the same speech. Senator X uses a staccato style, where each clause becomes its own sentence. Senator Y uses a more relaxed and naturalistic style, combining some clauses with semicolon-ish pauses. Because the reading level calculation is based largely on number of words per sentence, Senator Y is going to get a much higher grade level, even though the only difference is in the delivery, not any of the content.

What does the grade level measure? The idea of grade-level estimation for writing was to give a quick estimate of how difficult a passage is to understand. The main readability scores were calibrated by asking people with known reading proficiency (as determined by a comprehension test or the grade level they were in) to read passages of various difficulty and to answer comprehension questions. The goal of the calibration was to get it so that if a piece of writing had a grade level of X, then people who read at the X level would be able to get some given percent of the comprehension questions right. Crucially, the grade level does not measure the content of the text, or the intelligence of the ideas it contains. In fact, for readability — the purpose the tests were developed for — a lower score is always better, assuming the same information is conveyed.

As I mentioned above, there’s a world of difference between reading and writing, so this calibration is probably invalid for speech. But even if it were valid, we’d probably want to see the level go down.

The designers knew grade levels were imprecise measures. In a 1963 paper, George Klare wrote:

“Formulas appear to give scores accurate to, or even within, one grade-level. Yet actually they are seldom this accurate.”

“Typical readability formulas are statistical regression equations, not mathematical identities, and do not reach that level of precision.”

I mention these two quotes because they span 40 years of readability research, and the point remains the same: grade-level assessment is somewhat informative, but it’s not very precise. You can be reasonably certain that a child will understand a third-grade-level story better than a twelfth-grade-level one. It is far less certain that a tenth-grade-level and an eleventh-grade-level story will be distinguishable. In fact, the Kincaid et al. paper from 1975 that debuted the Flesch-Kincaid reading level calculation acknowledges its imprecision:

“Actually, readability formulas are only accurate to within one grade level, so an error of .1 grade level is trivial.”

Conclusions. So what we have here is a difference of one grade level (the very edge of meaningfulness even under ideal circumstances), produced by applying a reading level calculation to speech, for which it is uncalibrated, and without any clear way to account for the vagaries of punctuation or the issue of speech errors. We also have no data on the cause of the decrease, whether it’s dumbing down, a push for clarity, or just new punctuation guidelines at the Congressional Record.

Which is to say, we have no reason to believe in this effect, nor to draw conclusions about its source, other than the unfortunate fact that we have a belief crying out to be validated.


Ridger: You’re right; I’m extrapolating a bit here. In the papers I scanned through (e.g., Kincaid et al), they calibrated the scores against a reading comprehension test. But the comprehension test in turn has to have been calibrated against something, and that’s where the grade they were in comes in. Of course, as I think you’re noting, the very idea of a grade level for reading is somewhat unintuitive, given the variance between members of a grade.

The only way to do this calibration is to test readers who are actually in a grade to see how well they comprehend a particular passage. Then you measure words/sentence and syllables/word and find a correlation. Not a prediction but a correlation. Now you’ve got a rough measure of grade level readability for a piece of text.
What you do not have is any measure of how sophisticated a text is. That’s quite a different question. The Flesch-Kincaid does not measure it.
Similar issues arise with all tests; for example, IQ tests and personality tests. They do measure something, but you have to be very careful about interpreting what they measure and what it means.
The MMPI (Minnesota Multiphasic Personality Inventory) measures differential responses to test items between people who have been hospitalized for psychiatric problems and people who have not. The MMPI has diagnostic value, and it gives interesting results for personality theories and theories of abnormal psychology.
However, personality tests are not necessarily a good tool for making hiring decisions, for example. That is a different question.

Thank you for this post. I heard the “story” on NPR, and when they described the Flesch-Kincaid’s measurements, my radar went up. It was exactly as if they were readily admitting it measured only the syllable count of words and the word count of sentences, and was therefore not a valid way to measure the grade level of written language. Furthermore, written language (i.e., what the Flesch-Kincaid measures) is NOT the same as spoken language. Can you even measure the “grade level” of spoken language? And further yet, it completely discounts the sociolinguistic code-switching that many congressmen employ when speaking to each other vs. speaking to constituents vs. speaking officially vs. speaking extemporaneously.

And as a follow up to the comments, I am an SLP working with kids with dyslexia. Though we work heavily on spoken and written language comprehension, reading comprehension is VERY hard to measure, even in a research-based environment. What types of questions? What length passages? Details? Main ideas? Inferences? Etc.