Wednesday, July 25, 2012

Don't worry, I'm a physicist.

Today, I came across a science news item from ABC (the Australian Broadcasting Corporation) with the title "Study opens book on English evolution." Oh goodness. Here are the opening paragraphs:

A study of 500 years of the English language has confirmed that 'the', 'of' and 'and' are the most frequently printed words in the modern era.

The study, by Slovenian physicist Matjaz Perc, also found the top dozen phrases most-printed in books include "at the end of the", "as a result of the" or "on the part of the".

That sound you hear is the stunned silence of linguists everywhere over the fact that you can get into the science news with the primary result that "'the' is the most common English word."

But to be fair, what the author was trying to argue is that the Zipfian distribution of word frequencies is a result of "preferential attachment," where frequent words get more frequent. He tried to demonstrate this by showing that the frequency of a word in a given year is predictive of its frequency in the future, specifically that relatively high frequency words will be even more frequent in the future. The key result is shown in Figure 4 of the paper, available here.
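To make "preferential attachment" concrete, here's a toy sketch of a Simon-style rich-get-richer process. This is my own illustration, not Perc's actual model or code: with some small probability a brand-new word enters the stream, and otherwise a token is repeated by sampling uniformly from everything said so far, so already-frequent words are more likely to recur.

```python
import random
from collections import Counter

def rich_get_richer(n_tokens=20000, p_new=0.05, seed=42):
    """Simon-style rich-get-richer process: with probability p_new emit a
    brand-new word; otherwise repeat a token drawn uniformly from the
    stream so far, so already-frequent words are more likely to recur."""
    rng = random.Random(seed)
    stream = ["w0"]
    next_id = 1
    for _ in range(n_tokens - 1):
        if rng.random() < p_new:
            stream.append(f"w{next_id}")
            next_id += 1
        else:
            stream.append(rng.choice(stream))
    return Counter(stream)

counts = rich_get_richer()
freqs = sorted(counts.values(), reverse=True)
print(freqs[:10])  # a few words dominate; the tail is long and thin
```

Processes like this are known to produce heavy-tailed, Zipf-like frequency distributions, which is exactly why the "rich get richer" story is a tempting explanation.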

Say what?

While that quantitative result may stand, the fact that Perc is a physicist probably contributed to some really bananas statements about language. In the first paragraph, he almost completely conflates human language with written language, and erases the validity and richness of cultures with unwritten languages.

Were it not for books, periodicals and other publications, we would hardly be able to continuously elaborate over what is handed over by previous generations, and, consequently, the diversity and efficiency of our products would be much lower than it is today. Indeed, it seems like the importance of the written word for where we stand today as a species cannot be overstated.

He also presents some results of English "coming of age" and reaching "greater maturity" around 1800 AD (Figure 3). Finally! It only took us like, what, a thousand years or so?

The discussion section kicks off with the statement

The question ‘Which are the most common words and phrases of the English language?’ alone has a certain appeal [...]

That may be true for physicists, but for people who are dedicated to studying language (what are they called again?) not so much. Fortunately, his ignorance of linguistics is actually a positive quality of this research!

On the other hand, writing about the evolution of a language without considering grammar or syntax, or even without being sure that all the considered words and phrases actually have a meaning, may appear prohibitive to many outside the physics community. Yet, it is precisely this detachment from detail and the sheer scale of the analysis that enables the observation of universal laws that govern the large-scale organization of the written word.

See, linguists are just too caught up in the details to see the big picture! Fire a linguist and your productivity goes up, amirite?

For real though?

But back to the substantive claim of the paper. Is the Zipfian distribution of words due to the rich getting richer? That is, are words like snowballs rolling down a hill? The larger they are, the more snow they pick up, and the larger still they get. Maybe, but maybe not.

Here's a little experiment that I was told about by Charles Yang, who read about it in a paper by Chomsky that I don't know the reference to. Right now, we're defining "words" as being all the characters between white spaces. But what if we redefined "words" as being all the characters between some other kind of delimiter? The example Charles used was "e". If we treat the character "e" as the delimiter between words, and we apply this to a large corpus, we'll get back "words" like " " and " th" and, less frequently, "d and was not paralyz". What kind of distribution do these kinds of "words" have?
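The re-segmentation idea fits in a few lines of Python. This is a hypothetical `segment` helper for illustration, not anyone's production code; it just treats an arbitrary character as the word boundary the way whitespace normally is.

```python
def segment(text, delimiter):
    """Treat `delimiter` as the word boundary and split on it, dropping
    any empty chunks (e.g. from two adjacent delimiters)."""
    return [chunk for chunk in text.split(delimiter) if chunk != ""]

sentence = "the dog was tired and was not paralyzed"
print(segment(sentence, " "))  # ordinary whitespace-delimited words
print(segment(sentence, "e"))  # ['th', ' dog was tir', 'd and was not paralyz', 'd']
```

Note that the "e"-delimited "words" freely span ordinary word boundaries, spaces and all.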

Well, I coded up this experiment (available here: https://github.com/JoFrhwld/zipf_by_vowels), where I compare the ordinary segmentation of the Brown corpus into words using white spaces to segmentations using "a", "e", "i", "o" and "u". Here's the resulting log-log plot of the frequencies and ranks of the segmentations.
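The rank-frequency table behind a plot like this can be computed as follows. This is a minimal sketch, not the repository's actual code: a toy sentence stands in for the Brown corpus, and the plotting itself is left out.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Map a token list to (rank, frequency) pairs, with rank 1 = most
    frequent. Plotting log(rank) against log(frequency) gives the
    classic Zipf plot: a power law shows up as a straight line."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

text = "the cat sat on the mat and the dog sat on the log"
for delim in (" ", "e", "a", "o"):
    chunks = [c for c in text.split(delim) if c]
    points = [(math.log(r), math.log(f)) for r, f in rank_frequency(chunks)]
    print(repr(delim), rank_frequency(chunks)[:3])
```

The same counting-and-ranking step applies whether the "words" come from white spaces or from vowels; only the segmentation changes.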

It all looks quite Zipfian. So are not only the characters between spaces, but also the characters between any arbitrary delimiter, subject to a rich-get-richer process? Keep in mind that the definition of "word" as characters between spaces is relatable to representations in human cognition; the definition of "word" as characters between arbitrary delimiters is not, especially given English's occasionally idiosyncratic orthography.

Maybe it's possible for the results of my little experiment to be parasitic on a larger rich-get-richer process operating over normal words, but for now I'm dubious.

2 comments:

Writing about molecules without considering the properties of atoms may appear prohibitive to many outside the linguistics community. Yet, it is precisely this detachment from detail and the sheer scale of the analysis that enables the observation of universal laws that govern the large-scale organization of physical matter.

Hence I propose a practice theoretic interpretation of atomic bonds. Hydrogen atoms have the social practice of bonding with one other atom while carbon atoms can be socialized into a variety of social networks. Ethane (C2H6) and methane (CH4) are two Communities of Practice with homologous habitus, while ethane and ethylene (C2H4) represent incommensurable Discourses.