I'm a language grad and a student language teacher, and in my spare time I learn languages. I have a special interest in minority languages and as a former IT professional I am particularly interested in where human and computer meet.

25 May 2012

Authentics: long vs short part 3

Excuse the slight change of title -- I figured the original long title was probably getting truncated in people's feeds, so I wanted to abbreviate it. If you've been following my blog recently, you should have already seen my previous two posts on my little project: I am trying to investigate whether my usual advice, that long fiction (novels or TV serials) is better for the learner than short fiction (short stories and feature films), actually holds up.

Sample sizes
Anyway, as I said last time, I wanted to start comparing a fixed length of text, rather than variable-length chapters, as my benchmark. I was looking for a sampling length that would give a clear picture of the overall progression without too much interference from little local fluctuations. My first set of results suggests that this is a fool's errand. The following set of images shows the graphs for the novel Greenmantle by John Buchan, with samples taken every 1000, 2500, 5000 and 10000 words.

While using larger samples gives a much smoother line, it also unfortunately obliterates some of the most important detail in the graph, in that we start to lose the steep drop at the start -- that's information that's really crucial to my investigation, so I'll have to put up with various humps and wiggles in the line for now. However, that's not to say that the other graphs aren't interesting in and of themselves -- the little hump at around 50000-60000 words in the 5000-word sample version suggests that something important may be happening at this point in the story, causing a batch of new vocabulary to be introduced, or perhaps the introduction of a new character with a different style of speech. Anyway, as interesting as that may be, it would be a diversion from the matter at hand.

Alternatively, I could move away from linear sampling/projections and start charting on a logarithmic or exponential scale. Now would be a good time to start refreshing my memory on that sort of statistical analysis, but it too risks diverting me from the task at hand, and since I'm currently following the Coursera.org machine learning course, I should be able to get the computer to do the work itself in a few weeks anyway. Besides, I've still not got myself a high-frequency word list, and the pattern might be completely different once I've eliminated common English words from the equation.

So for now I'll stick to working with multiple sample sizes. I'll admit to being a bit simplistic in my approach so far: I ran my little Python program once for every sample size, rather than running it once with the smallest sample size and then resampling the data.

This takes an NLTK token list (it would work with any simple list of strings too, though) and the size of samples to be taken, then builds up a list of lists [[a1,b1],[a2,b2],...] where each a is the number of the last word included in the sample, and each b is the number of unique tokens from the beginning of the text to the ath word.

The number_of_types function just returns len(set(w.lower() for w in token_list)).
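Putting the two descriptions above together, the program might look something like the following sketch. The original code isn't shown in this post, so the function and variable names here are my own; the behaviour follows the prose description (a running set of types, sampled every sample_size words):

```python
def number_of_types(token_list):
    """Number of unique case-folded tokens (types) in the list."""
    return len(set(w.lower() for w in token_list))

def cumulative_types(tokens, sample_size):
    """Build [[a1, b1], [a2, b2], ...] from a list of word strings
    (e.g. an NLTK token list), where each a is the number of the last
    word included in the sample and each b is the number of unique
    types from the beginning of the text to the a-th word."""
    seen = set()      # running set of types seen so far
    results = []
    for end in range(sample_size, len(tokens) + 1, sample_size):
        # Fold the new slice's words into the running set.
        for w in tokens[end - sample_size:end]:
            seen.add(w.lower())
        results.append([end, len(seen)])
    return results
```

Keeping a running set rather than recounting from the start each time means a full novel only gets scanned once per run.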

This means that at every stage I have a running total of unique tokens, and it's only when I want to produce a graph that I calculate the number of new tokens in a given slice (= b(n) - b(n-1)); there's therefore no reason why I can't skip several slices to decrease my sampling rate (e.g. b(n) - b(n-3)).
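That resampling step could be as simple as the sketch below (again, my own names, not the original script's): it walks the cumulative [[a, b], ...] pairs and differences them, taking every step-th point to coarsen the sampling rate.

```python
def new_types_per_slice(cumulative, step=1):
    """Given cumulative [[a, b], ...] pairs, return [[a, d], ...]
    where d is the number of new types in each slice. step=1 gives
    b(n) - b(n-1); step=3 gives b(n) - b(n-3), i.e. a sampling rate
    three times coarser, with no need to re-read the text."""
    points = []
    prev = 0
    for i in range(step - 1, len(cumulative), step):
        a, b = cumulative[i]
        points.append([a, b - prev])  # new types since the last kept point
        prev = b
    return points
```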

Next up
I've taken three books from the same series -- The 39 Steps, Greenmantle and Mr Standfast -- and run them through as one text, so I'll look at the output of that next. I don't think it'll be much use, though, until I've got something to compare it with: either or both of a selection of novels by one author that aren't a series, and a selection of novels by different authors.

3 comments:

Btw, when you're eliminating common words - did you consider eliminating proper nouns as well? Gut feeling tells me these would also mostly appear for the first time during the first few chapters of a book (although admittedly their numbers are perhaps usually so small that any effect on the results would be virtually imperceptible).

I considered it, but it's too big a job. You can get lists of common words on the internet, but you won't get a list of the proper nouns in a book that easily. If I was carrying out genuine full-scale research, I'd probably do it -- or at the very least I'd investigate the statistical significance of proper nouns.

I might give it a go at the end if it looks interesting, but for now it's just a bit of fun and I'll try to keep things as simple as possible.

Actually, thinking about it, it's the sort of task I need to be working on: a semi-automatic process that identifies potential proper nouns then asks me to verify which ones are genuinely proper nouns.
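A first pass at that semi-automatic process might just flag words that only ever appear capitalised, leaving the final call to a human. This is a rough heuristic of my own, not anything from the original post; the verification step takes a callback so it stays testable, but in practice it would be an interactive prompt:

```python
def proper_noun_candidates(tokens):
    """Flag words that never occur in lower case as likely proper
    nouns. A capitalised word that also appears lower-cased elsewhere
    is probably just sentence-initial, so it's excluded."""
    lowercase_forms = {w for w in tokens if w.islower()}
    candidates = {w for w in tokens
                  if w.istitle() and w.lower() not in lowercase_forms}
    return sorted(candidates)

def verify(candidates, is_proper):
    # is_proper stands in for asking the user about each candidate.
    return [w for w in candidates if is_proper(w)]
```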

There are several other things that I could build out of that sort of code, so I might have to give it a wee go...