I write about (or at least allude to) the statistical behavior of language quite a bit in this blog. After all, it takes its name from Zipf’s Law, which was originally an observation about the behavior of language. We’ve also talked a fair bit about the Poisson distribution and how we can use it to understand why some days a second language learner is going to feel like they have made no progress whatsoever.

In these discussions, I’ve pretty much always ignored a basic fact about language. It’s not always a bad thing to ignore this fact, and we can actually get computers to do quite a few fun and useful things with language even when we ignore it. However, it is a fact nonetheless. Most statistical models (not just of language, but of anything) assume that samples are independent: that is, that having made any one particular observation tells you nothing about any other observation. However, when you’re talking about words, this is probably never true. (I tend to avoid the words never and always when talking about language, but in this case, “never” might actually be appropriate.) Rather, the probability of any one word is typically related to the presence of other words. Let’s take an example: bark and dog. Both of these words are relatively infrequent in English, as compared to, say, the or and. From a statistical point of view, they’re actually pretty rare. Intuitively, however, if you’ve already run into the word dog, then it’s not quite as surprising if you run into the word bark, and vice versa. The same doesn’t hold for avocado: running into the word dog (or bark) wouldn’t make it seem much more likely that you’ll run into avocado than if you hadn’t just seen the word dog or the word bark.

We can quantify this statistical relationship without too much difficulty. I’m going to take some liberties here, so my apologies to my language peeps out there.

Let’s look at the frequencies of some words in a big sample of the English language: a collection of just over 96,000,000 words of English called the British National Corpus.

The frequency of the word bark in this collection of English as a whole is 11.8/million words. That is to say: for every 1 million words in the collection as a whole, 11.8 of them are bark.

Now let’s take just the sentences that contain dog. (There are about 12,000 of them.) If we look at just those sentences, the frequency of bark changes quite a bit: it is now 1,073 per million words. The frequency of bark is not independent of dog: if we don’t know anything about the surrounding words, the frequency of bark is 11.8 per million words, but if we know that the word dog is nearby, it is about 91 times higher, at 1,073 per million words. We say that bark and dog are not conditionally independent. Rather, the frequency of bark is conditionally dependent on the presence (and probably the absence, but I haven’t demonstrated that) of dog.
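If you want to play with the arithmetic yourself, here’s a minimal Python sketch. The per-million rates are the ones reported above; the raw counts for the dog-sentence subset are made up, chosen only so that the numbers come out roughly as stated.

```python
def per_million(count, total_words):
    """Frequency normalized to occurrences per million words."""
    return count / total_words * 1_000_000

# Whole corpus: bark at 11.8 per million words, as reported above.
bark_overall = 11.8

# The "dog" sentences: hypothetical counts (about 300 tokens of bark in
# about 280,000 words) chosen to yield roughly 1,073 per million.
bark_near_dog = per_million(300, 279_590)

ratio = bark_near_dog / bark_overall
print(round(bark_near_dog), round(ratio))  # prints: 1073 91
```

The point of the ratio is just that it doesn’t depend on the corpus size: dividing the two per-million rates gives you how much more likely bark is when dog is around.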

The thing is this: there is pretty much never conditional independence when you’re talking about words. Rather, the probability of seeing any particular word is related to the words that occur around it. This is true on the level of sentences, and it’s also true on the level of situations: it’s not an accident that when I run into a new French word, I tend to run into other new French words that are related to it by, say, subject matter.

All of this came up today when I found myself repeatedly looking up words in order to be able to read the morning’s emails, and found that many of them contained the French word merde, or “shit.” Somebody did something that they shouldn’t have, it pissed somebody else off, and soon the emails were flying fast and furious. Here are the shit-related words from my day’s email, plus some related words that I came across while looking them up. All examples are taken from my correspondence:

There’s an implication here for how to study a language: the “structured vocabulary” approach that textbooks take, where you are introduced to a variety of words related to the same theme, works. Once you get beyond the point where there are textbooks for the level that you’ve achieved in a language, other resources that bring together words on the same subject can be really useful. I like Mastering French Vocabulary: A Thematic Approach (Mastering Vocabulary Series), 2nd Edition, by Wolfgang Fischer and Anne-Marie Plouhinec. It separates the vocabulary of a domain into more central and more peripheral vocabulary, and it gives example sentences. However, there are many others. I am also a big fan of the Oxford-Duden pictorial dictionaries, and there’s a French-English bilingual one. They’re not quite as user-friendly as something like the Fischer and Plouhinec book (no verbs, no examples of usage, and no indication of when words are ambiguous), but they are excellent for technical and obscure vocabulary.

I said above that words are probably never conditionally independent. I can think of one particular kind of language in which you might see something like conditional independence. This is the phenomenon of word salad.

Wikipedia defines word salad as a “confused or unintelligible mixture of seemingly random words and phrases.” Random is the key word for us here: if the words really are random, then there is no conditional dependence. That is, knowing that any particular word shows up tells you nothing about the probability of some other word showing up, just as picking up, say, a piece of lettuce when you’re eating a (tossed) salad doesn’t let you predict anything about what you’re going to pick up next. Here are some word salad examples from schizophrenics:

Back to the drama: the emails flew fast and furious for a while. Ultimately, the issue was decided by appeal to logic and the basic principle of égalité–equality. In this case, that meant that identical standards would be applied to everyone, which might sound obvious, but in a similar situation in the US, that would not necessarily be assumed to be the case, at all. (I say that after having seen similar situations in the US many, many, many times.) In France: you can get pretty far here by arguing for logic and consistency, as far as I can tell. Seems pretty sane to me…
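To make the word-salad point concrete: here’s a toy Python simulation (the vocabulary and frequencies are invented) in which each word is drawn independently from a fixed distribution. Under that regime, knowing that dog just occurred tells you nothing about the chances of bark, which is exactly the opposite of what we saw in the British National Corpus.

```python
import random

random.seed(0)

# Invented toy vocabulary with invented frequencies (out of 100). Each word
# is drawn independently, which is the statistical caricature of word salad.
vocab = ["the", "dog", "bark", "avocado", "and"]
weights = [50, 5, 2, 1, 42]

words = random.choices(vocab, weights=weights, k=200_000)

# Overall rate of "bark" in the sample
p_bark = words.count("bark") / len(words)

# Rate of "bark" among words that immediately follow "dog"
after_dog = [words[i + 1] for i in range(len(words) - 1) if words[i] == "dog"]
p_bark_after_dog = after_dog.count("bark") / len(after_dog)

# Both rates hover around 0.02: no conditional dependence here.
print(p_bark, p_bark_after_dog)
```

In real text, the second number would be dramatically larger than the first; in this simulated salad, the two stay statistically indistinguishable.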

What you get when you search for the lemma “dog” as a noun in the British National Corpus. “Lemma” means that it includes both the singular “dog” and the plural “dogs.” Picture source: screen shot by me.

Technical note: I got the initial frequencies for dog and bark through Sketch Engine. I saved all of the sentences containing dog from a Sketch Engine search as a text file. Then I counted the total number of words. I counted the number of lines containing bark, making the simplifying assumption of one token of bark per line. I then normalized the count to occurrences per million words.