Behold the Corpus

March 5, 2008

Ben Zimmer, like most lexicographers we meet, has a fascinating background: A self-described "dictionary hound" as a kid, he volunteered in college as a "reader" for the Oxford English Dictionary, scanning music magazines for new terminology. He then worked as a linguistic anthropologist researching the languages of Indonesia before returning to his lexicographic roots. Long discussions with the OED editors about emerging technology led ultimately to his current job, as Editor for American Dictionaries at Oxford. It's a job where he's intimately involved with the Oxford English Corpus, a high-tech infrastructure for writing dictionaries. Ben graciously spoke to us about his work:

VT: First, what is the difference between the Oxford English Dictionary and the "current" dictionaries you work on, such as the New Oxford American Dictionary and the Oxford College Dictionary?

Ben: The OED has an enormous database of citations that readers have collected over the past 150 years. The OED has been soliciting these quotations to help illustrate the different senses, words and phrases in the dictionary. This database is quite useful in many ways, certainly for understanding the historical progression of words and how those words developed over time. The OED also has their own sub-set of research tools involving electronic databases to try to find, for instance, the earliest known citation for a word or phrase, and then track its history from the earliest thing you can find to the present, and whether it's still being used or it has died out. So that's the historical lexicography side of things.

On the current dictionary side, the Corpus represents a different kind of database. It's not something that's hand-selected in the same way as the OED database of citations is. If you're an OED reader, you're looking for something that seems particularly interesting. In the olden days, that's when you would write it out on a slip, and the slip would go into the OED database - which is now all being done electronically, of course. What we developed for the current dictionaries, on the other hand, is a database that's much more all-encompassing, comprising not just things that might be interesting to readers, but a wide selection of texts that can be analyzed in terms of usage and the recent developments in English.

That's what we call the Oxford English Corpus. It's an enormous collection of texts, currently more than two billion words. And it's all "tagged" as well -- the data isn't just lying there. First of all, there's a kind of automatic tagging to identify parts of speech. So you know whether the word appearing in text is being used as a noun or a verb or an adjective. This gives us a lot of information. For instance, we can see how words interact with other words in a sentence, and their grammatical relations. What adjectives modify what nouns? What verbs serve as the predicate for a noun? This is really useful for understanding the way a word operates in the language, beyond just its appearance in a given text.
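[Editor's sketch: to make the idea of tagged text concrete, here is a minimal Python illustration of asking "what adjectives modify what nouns?" once each word carries a part-of-speech tag. The sentence, the tag names and the function `adjective_noun_pairs` are all invented for illustration, and treating "adjective directly before noun" as modification is a simplifying assumption. It is not a description of the Oxford tools.]

```python
from collections import Counter

# Hypothetical hand-tagged tokens: (word, part-of-speech) pairs.
# A real corpus would be tagged automatically; these are made up.
TAGGED = [
    ("the", "DET"), ("eccentric", "ADJ"), ("uncle", "NOUN"),
    ("wrote", "VERB"), ("a", "DET"), ("quirky", "ADJ"), ("song", "NOUN"),
    ("about", "ADP"), ("an", "DET"), ("eccentric", "ADJ"),
    ("millionaire", "NOUN"),
]


def adjective_noun_pairs(tagged):
    """Count (adjective, noun) pairs where the adjective directly precedes the noun."""
    pairs = Counter()
    # Walk over adjacent token pairs and keep ADJ followed by NOUN.
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if tag == "ADJ" and next_tag == "NOUN":
            pairs[(word, next_word)] += 1
    return pairs


PAIRS = adjective_noun_pairs(TAGGED)
```

Without the tags, the same text would only support counting which words sit next to each other. With them, the counts can be restricted to a specific grammatical relation.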

The other important thing about tagging is that it provides "metadata" for each bit of text in the Corpus. This tells us what type of text it is and the kind of domain where it's coming from, whether, say, it's a work of romance fiction or a news article. It also identifies whether it's formal or informal language. And if we can identify the gender of the writer as male or female, we add that to the metadata, too.

All these variables are available to us when we're analyzing the way a word or a phrase is used. This gives us amazing insight into not only the way that a word might be used in a sentence and its relationship to other words, as we were discussing, but also its geographical distribution. Is this word more common in American English, British English or another world English? Is it something that appears in certain types of text more often than others? All this metadata has been gathered since the year 2000, so we now have a dynamic snapshot of 21st Century English.

VT: Wow. So is this kind of corpus a whole new approach to dictionary development?

Ben: Corpus lexicography itself is not new. Corpora have been used beginning with the Brown Corpus back in the 1960s, and in the 1990s we used the British National Corpus. But a corpus of this size is really unprecedented in lexicography. It's the leading edge of this type of research -- and it's just been in the last few years that we've had the tools at our disposal to do this kind of work.

VT: You must have a team of computer programmers working alongside the lexicographers.

Ben: We do have a bunch of data gurus. I'm actually sitting near the data development editor, Orion Montoya, who is responsible for, as he puts it, "the care and feeding of the corpus." What we shoot for in the Corpus is to have 40% American English, 40% British English, and the remaining 20% the various international English varieties. We're also concerned with representation of the types of sources we have. We don't want to have too much romance fiction or too much of a particular style of writing or a particular subject matter. These are the kinds of concerns that the data gurus and my fellow lexicographers have to balance to ensure that the corpus is actually useful for American English.
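[Editor's sketch: the 40/40/20 target can be checked with simple arithmetic over per-variety word counts. The Python below is a minimal illustration under stated assumptions: the counts in `TOY_COUNTS` are invented, the variety labels are ours, and the function names are hypothetical rather than anything used at Oxford.]

```python
# Target shares as described in the interview: 40% American,
# 40% British, 20% other international varieties.
TARGET_SHARES = {"american": 0.40, "british": 0.40, "other": 0.20}

# Invented word counts per variety, purely for illustration.
TOY_COUNTS = {"american": 900, "british": 700, "other": 400}


def variety_shares(word_counts):
    """Return each variety's fraction of the total word count."""
    total = sum(word_counts.values())
    return {variety: count / total for variety, count in word_counts.items()}


def deviation_from_target(word_counts, targets=TARGET_SHARES):
    """Return actual share minus target share for each target variety."""
    shares = variety_shares(word_counts)
    return {
        variety: shares.get(variety, 0.0) - target
        for variety, target in targets.items()
    }


DEVIATIONS = deviation_from_target(TOY_COUNTS)
```

A positive deviation means a variety is over-represented relative to the target, and a negative one means more text of that variety would be needed. The same calculation could be keyed on text type or subject matter instead of variety.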

VT: Can you give us an example of how you use the Corpus in your work?

Ben: We're working right now on the second edition of the Oxford American Writers' Thesaurus, and we're using the insights we find in the Corpus to show nuances among closely related words. The Visual Thesaurus does a great job of graphically displaying those types of relations in a way that gives you something different from just a regular thesaurus where synonyms are grouped together on a page. We're looking at ways of presenting information where we can use the findings from the Corpus to say, okay, here are two words that would normally just be grouped together as synonyms. But what can we say to show how they're used differently in English? If they're two adjectives, for example, what types of nouns do they modify?

For instance, eccentric and quirky are two words that you would find next to each other in the thesaurus. The Corpus creates what's called a "word sketch," which tells us not just what words appear close to those examples but their collocations ["A sequence of words or terms which co-occur more often than would be expected by chance" - Ed]. But which are the most salient ones? It's not so interesting to just say eccentric man, for instance, because you can attach so many different adjectives to the word man.

Our corpus tools can show us what words eccentric modifies more so than other similar adjectives. We find, for instance, that the people who are generally called eccentric are very often rich people, millionaires and billionaires. Very often they're recluses and loners. And certain kinds of relatives are also called eccentric. Uncle, by far, is the most common type of eccentric relative you can have. Aunt comes a distant second. We can find the types of nouns that eccentric typically modifies in the actual way people typically use that word.
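[Editor's sketch: the interview does not say which statistic the Oxford tools use to decide that a collocate is salient. One common association measure is pointwise mutual information (PMI), which compares how often an adjective and noun occur together with how often they would if they were independent. The Python below applies it to invented counts. `PAIR_COUNTS`, `pmi_scores` and `rank_collocates` are hypothetical names, and the numbers are not drawn from the Corpus.]

```python
import math
from collections import Counter

# Invented adjective-noun pair counts, purely for illustration.
PAIR_COUNTS = {
    ("eccentric", "man"): 30,
    ("eccentric", "uncle"): 20,
    ("eccentric", "millionaire"): 15,
    ("old", "man"): 900,
    ("old", "uncle"): 25,
    ("old", "millionaire"): 10,
}


def pmi_scores(pair_counts):
    """Return PMI in bits for every (adjective, noun) pair in the counts."""
    total = sum(pair_counts.values())
    adj_totals = Counter()
    noun_totals = Counter()
    for (adj, noun), count in pair_counts.items():
        adj_totals[adj] += count
        noun_totals[noun] += count
    # PMI = log2( P(adj, noun) / (P(adj) * P(noun)) )
    return {
        (adj, noun): math.log2(
            count * total / (adj_totals[adj] * noun_totals[noun])
        )
        for (adj, noun), count in pair_counts.items()
    }


def rank_collocates(adjective, pair_counts):
    """Return (noun, PMI) pairs for one adjective, most salient first."""
    scores = pmi_scores(pair_counts)
    ranked = [
        (noun, score)
        for (adj, noun), score in scores.items()
        if adj == adjective
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


RANKED = rank_collocates("eccentric", PAIR_COUNTS)
```

In this toy data, man is the noun that most often follows eccentric, yet it ranks last, because man is just as readily modified by other adjectives. Millionaire and uncle rise to the top, which mirrors the point about eccentric man being uninteresting.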

If we look at the nouns that quirky modifies, we find very different types of words. We find that it's not people like with eccentric, but behavior, style, humor or different aspects of performance or music that get identified as quirky. These types of comparisons of words that seem similar are the kinds of things that we can look at very quickly with our tools.

Also, we very often come up with insights that we wouldn't have been able to figure out without these tools. In the past, dictionaries and thesauruses would say, "This word is used this way; that word is used that way." Now we can actually check, is that really right? Is that actually the way people are using them? Sometimes we're quite surprised by what we find. It just goes to show that you need lots and lots of data to create a reliable dictionary or thesaurus.

VT: We spoke about the difference between current and historical dictionaries. How does the Corpus help you structure your "current" dictionaries?

Ben: Dictionaries and thesauruses are generally based on earlier works, so earlier dictionaries from the same publisher are revised and become new ones. But in that process of revision you're not necessarily thinking about things in a fresh way. What the Corpus allows us to do is to take a step back and say, okay, regardless of what dictionaries have said about a particular word in the past, let's think about this in a new way. This will tell us how, for instance, we might want to think about ordering the senses, and determine what is most important now and what is subsidiary.

Even very common words might be changing. For instance, we've been thinking recently about how to properly create a definition for browse. If you check a dictionary that's organized historically... well, let me just pull a dictionary off the shelf here to see what it says. I'll check the Shorter OED. The Shorter OED has the historical order of senses. So the first thing you get for browse is an animal feeding on the leaves and shoots of trees and bushes, from late Middle English. There are various senses relating to how animals feed on leaves and shoots. Eventually, you get to "look through a book casually" and from there you get "read or survey data files, Internet sites, etc., typically via a network." These days, of course, that last sense is the one that's really booming and developing in new ways, even though historically this word related to animals grazing on leaves or whatever.

We need to investigate how this new sense of "surveying files or web pages" is developing and how people are using browse today. Are they just browsing or are they using browse with a construction like through or over or on or about? Or are they using it as a transitive verb, as in "to browse the Internet" and that sort of thing? A lot of this development is very new because it's related to the way people interact with the Web, and what they're doing when they say they're browsing. This is much different from browsing through newspapers and magazines at the newsstand, let alone the way that an animal eats leaves.

VT: Fascinating.

Ben: That's just one small example of a word that's been around since late Middle English, but one that has had some very important developments in its meaning in just the past five years or so.

VT: Where does the Corpus go from here?

Ben: The future of corpus lexicography is full of possibilities, since we're only really scratching the surface with the use of massive corpora like the Oxford English Corpus. We'll be looking to fine-tune our own tools for corpus-crunching, and at the same time trying to keep up with the increasingly huge amounts of data available for analysis. One thing we know for sure is that there will never be a shortage of texts, or types of texts, at our disposal. Now it's simply a matter of learning how best to harness all that data so that dictionaries and thesauruses can really reflect how people are using language.

Comments from our users:

A most inspiring article, particularly after getting up extra early to look up 'brain drain' versus 'brain gain', a thought 'given' prior to waking up. I have some great information which could be of interest to the people behind Visual Thesaurus, which by the way I really appreciate.
Raymond