I'm a language grad and a student language teacher, and in my spare time I learn languages. I have a special interest in minority languages and as a former IT professional I am particularly interested in where human and computer meet.

29 May 2012

Authentic: long v short pt 5

Today was a wee bit frustrating. I spent a solid chunk of time trying to get the nltk data module installed, and with it the file english.pickle that would have allowed me to do part-of-speech (POS) tagging. This would have made it almost trivially easy to eliminate the proper nouns and get a genuine look at the real "words" that are of interest to the learner.
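For the record, here's roughly what that filter would have looked like. Since the tagger data wouldn't install, the tagged list below is hard-coded in the shape that nltk.pos_tag() returns (the words are a made-up snippet, not actual text from the book); NNP and NNPS are the Penn Treebank tags for proper nouns.

```python
# Stand-in for nltk.pos_tag() output -- (word, tag) pairs in the
# Penn Treebank tagset, which is what english.pickle would produce.
tagged = [("Richard", "NNP"), ("Hannay", "NNP"), ("ran", "VBD"),
          ("up", "RP"), ("the", "DT"), ("stairs", "NNS")]

# Drop anything tagged as a proper noun (NNP singular, NNPS plural).
common_words = [word for word, tag in tagged if tag not in ("NNP", "NNPS")]
```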

Ah well, it looks like it wasn't meant to be.

So I started working towards custom code to eliminate the proper nouns manually, something that would be handy in the future anyway. The first step was to identify some candidates for further inspection, and seeing as I'm working with English, that's pretty easy: if it's not all in lower case, there's something funny about it. I wrote the code to identify all the tokens (words) that contained capitals. Yes, at this point I could have checked whether each one was at the start of a sentence or not, but that wouldn't really have helped: proper nouns occur at the start of sentences too, so I'd still need to check them.
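The candidate-spotting step boils down to a one-liner. This is a sketch rather than my actual script, run over an invented handful of tokens:

```python
# Flag every token containing at least one capital letter as a
# proper-noun candidate; checking each character catches mid-word
# capitals (e.g. "McCunn") as well as initial ones.
tokens = ["the", "Thirty-Nine", "steps", "I", "saw", "Scudder", "the"]
candidates = sorted({t for t in tokens if any(c.isupper() for c in t)})
```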

When I generated my set of candidates, though, it was a little long. For The 39 Steps, I was looking at 919 tokens to check manually, and that's a fairly short book. As I'm doing this for fun, it seemed like checking that many would be a little bit boring, particularly in longer books. (I later checked the candidate set for the 3 books in total, and it turned out to be over 3000 words, which is more than my time's worth.)

My first quick test, then, was to see how big a difference this crude adjustment makes to the figures before properly addressing the proper nouns. Eliminating every single item with any capitals in it drops the type:token ratio in The 39 Steps from 14.48% to 13.14% -- almost a 10% drop (it's 1.35 percentage points, but it's 9.27 per cent). That seemed just a little too high to realistically be led by proper nouns alone. But can that be? I mean, how many words are likely to occur only at the start of sentences?
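The percentage-points versus per-cent distinction is worth spelling out, since it's easy to trip over. Working from the rounded figures quoted above (so the last decimal comes out fractionally different from the in-text values, which were computed from the unrounded ratios):

```python
before, after = 14.48, 13.14          # type:token ratios, in percent
point_drop = before - after           # difference in percentage points
relative_drop = point_drop / before * 100  # drop as a percent of the original
# point_drop is about 1.34 points; relative_drop is about 9.25 per cent.
```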

So on I went, hoping that the data I could generate at this stage would start to shed some light on this figure.

The first graph I produced showed me the running type:token ratios and introduction rates for both the full token set, and the token set with non-lowercase words eliminated:

The two pairs of lines follow each other pretty closely, getting closer together as they progress. But in order to start getting a clear idea of what was going on, as last time, I had to go to another level of abstraction and measure some useful differences. So here is the difference between the running ratios for all words and lowercase only, and the corresponding difference in introduction rates:
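For anyone following along, a running type:token ratio of the kind plotted here can be computed with a few lines of Python. This is a minimal sketch over an invented token list, not my actual analysis code; types are counted as distinct token strings:

```python
def running_ttr(tokens):
    """Cumulative type:token ratio (%) after each successive token."""
    seen, series = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        series.append(100 * len(seen) / i)
    return series

tokens = ["I", "saw", "Scudder", "on", "the", "stair",
          "and", "the", "stair", "creaked"]
all_words = running_ttr(tokens)
# The "lowercase only" line simply reruns the same calculation on the
# tokens that survive the capital-letter filter.
lower_only = running_ttr([t for t in tokens if t.islower()])
```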

Now you'd be forgiven for thinking that the difference is diminishing here -- I was fooled into thinking the same thing, but then I realised I was dealing with raw numbers rather than proper stats, so I redid the analysis as a percentage difference:

The overall running type:token ratio does indeed decrease, but it halves (20% down to 10%) and then stabilises. The introduction rate, on the other hand, is all over the place -- there's no identifiable trend at all. Even subsampling my data didn't give any clear and understandable trends. (Since I'm using a desktop office package for my analysis, it's a bit of a faff to do the resampling automatically -- further proof that I need to get myself familiar with the statistical analysis tools for Python, e.g. numpy, but my head's full of the NLTK stuff for now, so I'll leave the improved statistics for another time.) Here are the same graphs, but with 2000-word samples instead of 500-word samples:
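The resampling itself is simple enough in principle -- the introduction rate is just the number of new types appearing in each fixed-size chunk, and coarser sampling means a bigger chunk. A sketch, with a hypothetical helper name and toy data:

```python
def introduction_rate(tokens, chunk_size):
    """Count the new types introduced in each chunk_size-token chunk."""
    seen, rates = set(), []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        rates.append(len(set(chunk) - seen))  # types not seen before this chunk
        seen |= set(chunk)
    return rates

# Chunk 1 ("a","b","a","c") introduces 3 new types; chunk 2 only "d".
rates = introduction_rate(["a", "b", "a", "c", "b", "d", "a", "a"], 4)
```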

So not promising, really. Still no stable, identifiable trends.

Books as a series
But I had all the infrastructure in place now, so I figured I might as well rerun the analysis on the 3 books as a single body and see what came out. Let's just go straight to the relative difference between the lines for all words and eliminating all words not entirely in lower case:

Oooh... now where did I leave those figures on where the individual books started...? 44625 and 152034, and there's a notable period of high difference (20-30%) from about 45000 words, and that massive spike you see on the graph -- which is actually a 63.64% difference -- occurs from 152000-152500.

Bingo: we've got decent support for Thrissel's suggestion that a lot of proper nouns are introduced early on in... at least some novels.

Not the sort of information I was originally looking for, but actually quite interesting. It's kind of turning the project in a slightly different direction than I had planned. I'll just have to go with the flow.

What I did wrong today
One of the minor irritations of the day came when I started writing up my results: after doing the coding, data generation and analysis, I realised there was a fairly simple refinement I could have made. It was a real *palmface* moment: I could have simply taken my first list of candidate proper nouns and eliminated any candidates that also appeared completely in lower case. That would have left me with a much shorter list of candidates, and it may well have been worth my time manually checking the results.
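The missed refinement is only a couple of lines on top of the candidate set. Again a sketch over invented tokens rather than my real script:

```python
# A capitalised token that also occurs entirely in lower case elsewhere
# is almost certainly just a sentence-initial common word, so only the
# remainder needs manual checking.
tokens = ["The", "station", "the", "Scots", "express", "There", "I", "there"]
lower_types = {t for t in tokens if t.islower()}
candidates = {t for t in tokens if any(c.isupper() for c in t)}
to_check = sorted(t for t in candidates if t.lower() not in lower_types)
# Note that "I" never occurs in lower case, so it survives this filter
# and still needs special-casing.
```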

>sigh<

But of course, that's as much the point of the exercise as anything: to work through the process and the problems and to start thinking about what can be done better.

It also occurs to me now that I managed to eliminate every single occurrence of the word "I" from the books! Quite a fundamental error, even if it only made a minute difference to the final ratios.

Perhaps I'm being a little too "hacky" in all this. I'll have to pick up my game a bit soon....