how ‘koo’ is rtweet?

This semester, I am teaching corpus-based sociolinguistics to third-year students. One challenge is to compile corpora that are suitable for the study of variation in language, especially when variation is correlated with geographic origin. Another challenge is to plot elements of linguistic variation on a map.

Sociolinguists know well that corpora are easy to compile from Twitter. R enthusiasts are familiar with Jeff Gentry’s twitteR package, which allows the user to extract tweets based on a keyword, a hashtag, or a specific account. The package works great, but its setup is somewhat tedious.

Recently, I discovered rtweet, developed by Michael W. Kearney. Although not yet fully stable, the package is well documented, and its API authorization procedure is fast and simple. All you need is an active Twitter account. Once you have downloaded and installed the rtweet package, you are prompted to log in to your Twitter account and authorize the rtweet application when you launch your first query. That’s it.

On November 25, 2017, Scotland smashed Australia to conclude their Autumn rugby test series (congrats Scotland!). The outcome being very unusual, thousands of tweets were sent during the game. I launched the following query 60 minutes into the game.

As the map shows, the game sparked interest beyond the borders of Scotland (and the neat Glasgow-Edinburgh continuum).

The question remains whether mapping tweets based on hashtags can serve sociolinguistic purposes. On the one hand, Jack Grieve’s swear maps of the USA or Word Mapper app suggest that Twitter-based maps stand halfway between traditional isoglosses, as found in the Atlas of North American English for instance, and maps based on web surveys, as found in the Cambridge Online Survey. On the other hand, you can only map what Twitter gives you, i.e. what Twitter users feel like tweeting about. If you are interested in a specific word, chances are you will not find enough tweets featuring this word if it is rare.

The good news is that, with its 280-character limit, its high reliance on elliptical syntax, shortcuts, acronyms, emojis, and external media, Twitter is a highly constrained linguistic environment, just like any kind of computer-mediated communication. Linguists like constraints. More precisely, they like measuring the effect of language-external constraints on language or the impact of language-internal constraints on the environment. By setting up your query cleverly, you can therefore tap into Twitter in a sociolinguistically relevant manner and observe sociolinguistic variables either directly or indirectly.

After reading this review by US lexicographer Ben Zimmer, I compared the geographic distribution of tweets featuring coo and koo, which are alternative spellings of cool. Coo is thought to originate from Southern California, and koo from Northern California. Here is the code that I used for koo.

« Koo. https://t.co/twqeY9mIc8 »
« that shit not koo tho »
« If you have to continue to ask someone if they’re koo or to calm down then just know that’s not the person for you. »
« This is koo respect https://t.co/9JwIepNfIN »
« It’s still 0-0 Koo »
« Together we KOO me N her can’t Loseee »
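A keyword search can return tweets where the string is only part of a longer word, so it may be worth checking that each tweet contains the variant as a standalone word before mapping it. The following is a minimal sketch in Python, not the R code used for this post. The function name `variants_in` and the two distractor tweets are made up for illustration.

```python
import re

# Whole-word patterns for the two spellings; \b keeps "cool", "kool" or "cook" out.
VARIANTS = {
    "coo": re.compile(r"\bcoo\b", re.IGNORECASE),
    "koo": re.compile(r"\bkoo\b", re.IGNORECASE),
}


def variants_in(text):
    """Return the set of spelling variants that occur as standalone words in text."""
    return {name for name, pattern in VARIANTS.items() if pattern.search(text)}


# Three of the sample tweets quoted above, plus two made-up distractors.
tweets = [
    "that shit not koo tho",
    "It’s still 0-0 Koo",
    "Together we KOO me N her can’t Loseee",
    "this is so cool",      # hypothetical: should match nothing
    "kool and the gang",    # hypothetical: should match nothing
]

for tweet in tweets:
    print(sorted(variants_in(tweet)), "|", tweet)
```

The match is case-insensitive, so « Koo » and « KOO » both count. Lengthened spellings such as « kooo » would be missed and would need a looser pattern.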

Geolocation of tweets featuring « coo » in the USA on Nov 27, 2017

Geolocation of tweets featuring « koo » in the USA on Nov 27, 2017

The trend is, indeed, well represented in California. But the neat theoretical divide between Northern California koo and Southern California coo is far from obvious. Among many other reasons, this may be because (a) people travel, and (b) online innovations spread fast. Also, I am not comfortable taking these maps at face value when, no matter what you are looking for, the most densely populated areas are always over-represented. Some further statistical processing is needed to account for this size effect.
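One simple way to correct for the size effect is to turn raw tweet counts into rates per inhabitant. The following is a minimal sketch in Python, not the R code used for the maps. The function name `rate_per_million`, the region names, and all the numbers are made up for illustration.

```python
def rate_per_million(tweet_counts, populations):
    """Tweets per million inhabitants, for regions with a known, non-zero population."""
    return {
        region: 1_000_000 * tweet_counts[region] / populations[region]
        for region in tweet_counts
        if populations.get(region, 0) > 0
    }


# Made-up numbers for two hypothetical regions, for illustration only.
tweet_counts = {"big_city": 500, "small_town": 10}
populations = {"big_city": 10_000_000, "small_town": 100_000}

rates = rate_per_million(tweet_counts, populations)

# List regions from the highest rate to the lowest.
for region, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{region}: {rate:.1f} tweets per million inhabitants")
```

With these invented figures, the small town comes out ahead of the big city even though it produced fifty times fewer tweets. Dividing by the total number of tweets sent from each region, rather than by population, would be an alternative that also accounts for how active Twitter users are in each place.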

The package has minor glitches. The automatic language recognition needs fine-tuning (I found some tweets in French and Spanish in my dataset) and sometimes you will need to wait fifteen minutes before you launch a new query (but that is down to Twitter’s rate limits, not the package). Still, rtweet is koo, and I will definitely add some more posts featuring this promising package in the future. Twitter has a lot to say on the pain au chocolat vs. chocolatine debate in France, and I really want to map this!