Stupid Software for Clever People

Around two and a half years ago I wrote a blog post looking at the use of words like computing, code, programming etc. in primary school Ofsted reports. I thought it might be a good way to track whether computing was on the rise in this age group.

I’ve repeated the exercise with a sample of recent reports and a sample from 2011/12. The disappointing news is that there is no significant difference in the frequency of computing-related words.

Clearly it would have been a better story if there had been a difference… but given that I’d gone to the effort of scraping the text of almost 6000 reports from Ofsted’s website, I thought I should at least look for what the differences actually are. How has the language changed over the last couple of years?

Here are the top 75 most significant differences. The first chart shows words that appear more often in recent reports. The second shows words that appear more often in older reports. There’s some technical stuff below about the method, but what is graphed here is something called the log-likelihood of a difference: basically, the longer the line, the more significant the difference.

Firstly, I was at least encouraged by the fact that “Mathematics” is mentioned a lot more now (at least that’s computational). The other thing that jumped out at me was the change from the use of the word “disabilities” to “disabled”… a reflection of a general shift in language in this area?

As for the more educational aspects, I’m not really qualified to comment on how meaningful this analysis is, but would be interested to hear views from teachers.

The technical bits

The idea to use a log-likelihood score came from this paper by Paul Rayson and Roger Garside. However, as always with these things, there was a lot of munging required before being able to use the method.
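
For anyone who wants to try the same thing, here is a minimal sketch of the score in Python. It follows the usual form of the log-likelihood calculation for comparing two corpora (observed counts against the counts you would expect if the word were equally common in both). It is not the code behind the charts: the function names log_likelihood and rank_differences, and the idea of passing in two Counter objects, are assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood score for one word.

    a, b: how often the word appears in corpus 1 and corpus 2
    c, d: total number of words in corpus 1 and corpus 2
    """
    # Expected counts if the word were equally frequent in both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    # A zero count contributes nothing (and log(0) is undefined)
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def rank_differences(recent, older, top_n=75):
    """Rank words by score; recent and older are Counters of word frequencies.

    Returns (word, score, more_common_in_recent) tuples, highest score first.
    """
    c, d = sum(recent.values()), sum(older.values())
    scored = []
    for word in set(recent) | set(older):
        a, b = recent[word], older[word]  # Counter gives 0 for missing words
        scored.append((word, log_likelihood(a, b, c, d), a / c > b / d))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]
```

The score only says how big a difference is, not which way it goes, which is why the sketch also records whether the word is relatively more common in the recent corpus. That flag is what would split the results into two charts.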

Ofsted publish reports as PDFs. These were scraped (painfully) from their site and converted to text using Python, with pyPdf for the conversion. The reports contain a lot of boilerplate text, and this has evolved over time. To prevent that from influencing the final results, I wrote some code to remove the 1000 most often repeated lines from each corpus. Not perfect, but it seemed pretty effective. NLTK was used to tokenise the text, remove stop words and do the basic frequency counts, and then I coded up the log-likelihood scores directly from the paper referenced above. Graphs were done in R.
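
The boilerplate step is the least obvious part, so here is a rough sketch of how it could be done in Python. This is a reconstruction under assumptions, not the original code: the function name strip_boilerplate is invented, lines are compared after trimming whitespace, and a line only counts as boilerplate if it appears more than once in the corpus.

```python
from collections import Counter


def strip_boilerplate(reports, n_lines=1000):
    """Drop the n_lines most repeated lines from a corpus.

    reports: list of strings, one per report, all from the same corpus.
    Returns a new list of report texts with the boilerplate lines removed.
    """
    # Count every non-blank line across the whole corpus
    line_counts = Counter(
        line.strip()
        for text in reports
        for line in text.splitlines()
        if line.strip()
    )
    # Assumption: a line must occur more than once to be treated as boilerplate
    boilerplate = {
        line for line, count in line_counts.most_common(n_lines) if count > 1
    }
    return [
        "\n".join(
            line for line in text.splitlines() if line.strip() not in boilerplate
        )
        for text in reports
    ]
```

Running this separately on each corpus matters, because the boilerplate in the older reports is not the same as in the recent ones.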