Analyzing Complexity of Text

Storing text in a database is common. What isn’t common is needing to know the reading ease and grade level of that text, but I was presented with such a requirement (actually it was more of a wish list item) this week. There are ways of solving this problem. In the conclusion to this post, I outline the steps for implementing T-SQL code to estimate reading complexity. I think the topic of reading ease and grade level ratings is potentially of greater general interest than you might at first think. For example, you could have data-driven web pages accessed by the general public, where choices entered by the users deliver custom content. Perhaps you need to deliver product-specific operating or safety instructions.

For purposes of this post, I’ve assumed that you have a large body of unanalyzed text in VARCHAR and NVARCHAR columns. Text inside Word documents held in SQL Server FILESTREAM is out of scope for this post.

There are several well-known and relatively simple algorithms for estimating the grade level and reading ease of text. Microsoft Office includes two of them: the Flesch-Kincaid algorithm, which estimates grade level, and the Flesch Reading Ease algorithm. To do readability analysis in Word, you’ll need to enable it. See http://blogs.office.com/b/microsoft-word/archive/2007/06/26/can-word-improve-your-writing.aspx and follow the easy instructions for doing this. Notice that the page shows an analysis of something written by Dr. Seuss, which has a grade level of zero and a reading ease of 100. For comparison purposes, I analyzed the United States Internal Revenue Service instructions for completing a Form 1040 income tax return. Notice that income tax instructions have a much lower ease of reading than Dr. Seuss, but somehow I think you already knew that.
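For reference, the two scores that Word reports are commonly published in the form below. The symbols W, S, and Y (total words, total sentences, and total syllables) are notation chosen for this note, not something taken from Word or from the linked page.

```latex
\begin{align*}
\text{Flesch Reading Ease} &= 206.835 - 1.015\,\frac{W}{S} - 84.6\,\frac{Y}{W} \\
\text{Flesch-Kincaid Grade Level} &= 0.39\,\frac{W}{S} + 11.8\,\frac{Y}{W} - 15.59
\end{align*}
```

Longer sentences (a larger W/S) and longer words (a larger Y/W) both lower the reading ease and raise the grade level.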

Before you can write code to calculate reading difficulty, you need to pick an algorithm. The Flesch and Flesch-Kincaid algorithms require that you know the total syllables, total words, and total sentences in the body of text to be analyzed. The Simple Measure of Gobbledygook (SMOG) and the Gunning fog index likewise depend on syllable counts; the Coleman-Liau index counts letters instead. If you want to implement something simple using T-SQL, finding the number of syllables is too difficult. The Dale-Chall (Edgar Dale and Jean Chall, 1948) and Spache (George Spache, 1953) algorithms require that you use a list of words considered to be common so that you can find the percentage of complex words. Finding a copy of one of these word lists in a single-column format is a bit of a challenge. I found the updated and expanded list of Dale-Chall words at http://lindacarlton.net/thoughts/2010/02/dalechall_list.php if you need to implement something possibly more accurate than the Automated Readability Index described below.
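The word-list step can be sketched in a few lines. This is Python rather than T-SQL, purely to show the logic. The function name `percent_unfamiliar_words` and the tiny `FAMILIAR` set are hypothetical stand-ins; a real implementation would load the full word list into a table. The published methods also have their own rules about which word forms count as being on the list, and this sketch ignores them.

```python
import re


def percent_unfamiliar_words(text: str, familiar_words: set[str]) -> float:
    """Return the percentage of words in text that are not in familiar_words."""
    # Lowercase the text and pull out alphabetic words, allowing one
    # internal apostrophe so that "isn't" stays a single word.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return 0.0
    unfamiliar = sum(1 for word in words if word not in familiar_words)
    return 100.0 * unfamiliar / len(words)


# Hypothetical six-word list, standing in for a real list of common words.
FAMILIAR = {"the", "cat", "sat", "on", "a", "mat"}

if __name__ == "__main__":
    print(percent_unfamiliar_words("The cat sat on a hypotenuse.", FAMILIAR))
```

In a database, the same idea becomes a split of the text into words followed by an anti-join against the word-list table.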

The Automated Readability Index is sufficiently easy to code because it needs only the number of characters, words, and sentences. The greatest difficulty you will likely encounter is in determining the number of sentences. Since computing readability isn’t an exact science, you could count the number of periods in a block of text to estimate the number of sentences. You could improve the accuracy by leaving the periods that belong to ellipses (...) out of the count. As the linked document shows, the Automated Readability Index was developed for the United States Air Force in 1967. The document shows both a multiple regression version of the algorithm to estimate grade level as well as a simplified equation to compute the Automated Readability Index:
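The simplified equation is commonly published as 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. Here is a minimal sketch of it in Python rather than T-SQL, meant only to show the logic. The function names are hypothetical. It assumes that words are whitespace-separated tokens, that characters are letters and digits only, and that sentences are estimated by counting periods after dropping ellipses, as described above.

```python
import re


def estimate_sentences(text: str) -> int:
    """Estimate the sentence count as the number of periods, ignoring ellipses."""
    # Drop ellipses first: runs of two or more periods, or the single
    # ellipsis character.
    without_ellipses = re.sub(r"\.{2,}|…", " ", text)
    # Assumption: text with no period at all counts as one sentence,
    # which also avoids dividing by zero later.
    return max(1, without_ellipses.count("."))


def automated_readability_index(text: str) -> float:
    """Compute the simplified Automated Readability Index for text."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    # Count letters and digits only; spaces and punctuation are excluded.
    characters = sum(ch.isalnum() for ch in text)
    sentences = estimate_sentences(text)
    return 4.71 * characters / len(words) + 0.5 * len(words) / sentences - 21.43


if __name__ == "__main__":
    sample = "The cat sat. The dog ran... away."
    print(estimate_sentences(sample), automated_readability_index(sample))
```

Period counting will overestimate sentences in text full of abbreviations or decimal numbers, and it misses sentences that end in a question mark or exclamation point, so treat the result as a rough estimate.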

At this point, all that has been asked of me is to explain what it would take to analyze existing textual data for readability. I’ve presented algorithms that can be implemented. Do you remember how your advanced math textbooks would say the proof is obvious and is left as an exercise for the reader? I won’t provide the code for your stored procedure or function today – I leave the implementation details as an exercise for the reader.

This document has a Flesch Reading Ease of 44.7 and a Flesch-Kincaid Grade Level of 11.5. If reading something this complex gives you a headache, at least I provided a link to information about an analgesic!

About John Paul Cook

John Paul Cook is a database and Azure specialist in Houston. He previously worked as a Data Platform Solution Architect in Microsoft's Houston office. Prior to joining Microsoft, he was a SQL Server MVP. He is experienced in SQL Server and Oracle database application design, development, and implementation. He has spoken at many conferences including Microsoft TechEd and the SQL PASS Summit. He has worked in oil and gas, financial, manufacturing, and healthcare industries. John is also a Registered Nurse currently studying to be a psychiatric nurse practitioner. He is a contributing author to SQL Server MVP Deep Dives and SQL Server MVP Deep Dives Volume 2.