Hunting for lexical blends – the computational way

Our guest blogger today is Paul Cook, who works in the Department of Computing and Information Systems at the University of Melbourne. His cross-disciplinary research in computational linguistics considers new ways in which computational methods can be used to study language and identify new words and meanings.

How do lexicographers find new words to consider adding to dictionaries? Traditionally they’ve read. A lot. But computers and the Internet have opened up exciting new possibilities for automating parts of this job. This post is about some recent research on using Twitter to find lexical blends.

We’ve talked about blends before in the blog. A blend is a portmanteau word, like brunch, fantabulous, or mockumentary, that typically combines the beginning of one word with the ending of another. (In case you haven’t figured it out, those words combine breakfast and lunch, fantastic and fabulous, and mock and documentary.) But how do we find words like these to add to a dictionary?

Writers (and especially journalists) often give us clues when they’re using a word that’s new to them or that might be new to their readers. In the following sentences, phrases such as “is slang for” and “have coined a term for it” are pretty good indicators that the words they introduce (mongo and pump head) are new.

“Mongo” is slang for garbage salvaged from streets and trash heaps.

The syndrome is so pervasive that heart surgeons and cardiologists have coined a term for it: pump head.

Grant Barrett, an American lexicographer, observed patterns like these and used them in making the Official Dictionary of Unofficial English, a dictionary focusing on otherwise undocumented words, mostly slang and jargon. (In fact, the examples above are taken from that dictionary. And if you’d like to check it out, Barrett has made a PDF version freely available online.)

Writers also often explain the meaning of blends. The following sentences, taken from documents on the Web, explain the source words of a blend.

Avoision: The practice of borderline acts that fall between legal avoidance and illegal evasion of laws, especially tax laws.

You can indeed cross a horse with a zebra (both are equines) and you can get a zorse.

By searching for both of these patterns at the same time we can find cases where writers are telling us about a new blend. Although these patterns are rare, if we analyse enough data we’ll still get plenty of hits. But where should we go looking for these patterns?

Twitter claims that around 340 million tweets are sent every day. That’s a lot of text! Because tweets tend to be rather informal, and we see a lot of other types of creative usages on Twitter, there might be quite a few blends in there too. And crucially, Twitter provides APIs (application programming interfaces) that make it easy for us to get access to lots of tweets.

We wrote a simple computer program to search tweets for those patterns that indicate that a new blend is being explained. (We weren’t going to read 340 million tweets every day!) We’ve been running this program for almost a year and we get roughly 45 unique hits each day. Our analysis shows that well over 50% of these candidates are blends, and at that level of precision it’s worth our time to sift through the hits.
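To give a flavour of what such a program might look like, here is a minimal Python sketch. It is not the code from our study: the cue phrases, the function names and the length thresholds are all assumptions made up for illustration, and the example text is invented rather than a real tweet. The idea is simply to keep a piece of text only if it contains a "new word" cue and also contains a word that could be built from the beginning of one nearby word and the ending of another.

```python
import re

# Hypothetical cue phrases hinting that a writer is introducing a new word.
# These are illustrative guesses, not the patterns used in the study.
CUE_PHRASES = ("slang for", "term for", "word for", "coined")

WORD = re.compile(r"[a-z]+")


def is_blend_of(candidate, first, second):
    """True if candidate is a beginning of `first` plus an ending of `second`."""
    # Try every split that leaves at least 1 leading and 2 trailing letters.
    for i in range(1, len(candidate) - 1):
        if first.startswith(candidate[:i]) and second.endswith(candidate[i:]):
            return True
    return False


def find_blend_candidates(text, min_len=4):
    """Return (blend, first, second) triples found in one piece of text."""
    lowered = text.lower()
    # Pattern 1: the text must contain a cue that a new word is being introduced.
    if not any(cue in lowered for cue in CUE_PHRASES):
        return []
    # Unique words of a reasonable length, in order of first appearance.
    words = [w for w in dict.fromkeys(WORD.findall(lowered)) if len(w) >= min_len]
    hits = []
    # Pattern 2: some word must decompose into two other words in the text.
    for candidate in words:
        for first in words:
            for second in words:
                if len({candidate, first, second}) < 3:
                    continue  # the three words must all be different
                if is_blend_of(candidate, first, second):
                    hits.append((candidate, first, second))
    return hits


# Invented example text, not a real tweet.
text = "Someone coined a term for a zebra crossed with a horse: zorse"
print(find_blend_candidates(text))  # [('zorse', 'zebra', 'horse')]
```

A filter this crude will also pick up accidental overlaps between ordinary words, which is one reason the hits still need to be sifted by a person.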

Many of the words we find this way, like misdevious, are madeupical, or “nonce formations” – words which are coined by one person, but which never get more widely used. These items wouldn’t typically be recorded in dictionaries. But some of the words we find, like mocial, might become more common. Once we’ve identified a word that looks promising, we can keep an eye on it to see how its usage changes over time. Moreover, as dictionaries move online, the physical space restrictions that used to affect our decisions about which words to include are becoming less of an issue. Why shouldn’t sockos be included in dictionaries? There’s plenty of evidence for its usage! In the future, computational methods like the ones we’re working on are likely to play a role in helping to automatically generate entries that could possibly later be revised by (human) lexicographers.

If you’d like to learn more about this work, check out our paper*.

*Paul Cook. 2012. Using social media to find English lexical blends. In Proceedings of the 15th EURALEX International Congress (EURALEX 2012), pages 846–854, Oslo, Norway: available here. There’s also some sample output, but be warned that this includes some obscenities.


Hi, thank you for sharing this article and providing your paper about lexical blends. As a neologist (I name things), it’s always great to find new sources of ideas, especially those which are highly productive. I use Sketch Engine as a computational corpus to help me find names for products and companies based on the contexts in which relevant words appear. I wrote an article about that here: http://operativewords.blogspot.com/2011/08/how-to-create-names-using-worlds-most.html