How tweets reveal where you're from

On the Internet, nobody knows you're a dog, but on Twitter, your tweets likely reveal where you are. Computer scientists report that the microblogging service reflects regional dialects and slang.

In northern California, for example, when something is cool, it's tweeted as "koo," while in southern California, it's "coo," post-doctoral fellow Jacob Eisenstein and his colleagues at Carnegie Mellon University found. The word "something" is tweeted as "sumthin" in most parts of the country, but New Yorkers favor the term "suttin" instead. LOL, the acronym for "laughing out loud," is common on Twitter almost everywhere but Washington, D.C., where the cruder "LLS" takes precedence.

How they did it

For the study, Eisenstein and his co-authors collected a week's worth of Twitter messages in March 2010 and selected geotagged messages from users who wrote at least 20 tweets. That gave them a database of 9,500 users and 380,000 messages.
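For readers who want to try a similar selection on their own data, the filtering step might look roughly like the sketch below. It is an illustration, not the researchers' code: the `user`, `lat` and `lon` field names are hypothetical, and it assumes the 20-tweet minimum is counted over geotagged messages, which the study description does not spell out.

```python
from collections import defaultdict

def filter_active_geotagged(tweets, min_tweets=20):
    """Keep geotagged tweets from users with at least `min_tweets` of them.

    `tweets` is an iterable of dicts with hypothetical keys
    "user", "text", "lat" and "lon" (lat/lon are None when not geotagged).
    Returns a dict mapping each qualifying user to their geotagged tweets.
    """
    by_user = defaultdict(list)
    for tweet in tweets:
        # Skip messages without a usable geotag.
        if tweet.get("lat") is None or tweet.get("lon") is None:
            continue
        by_user[tweet["user"]].append(tweet)

    # Assumption: the threshold applies to geotagged messages only.
    return {
        user: items
        for user, items in by_user.items()
        if len(items) >= min_tweets
    }
```

Counting the threshold over all of a user's messages instead would only require tallying tweets before the geotag check.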

They then analyzed the raw text in those messages with a model trained to pick out regional differences such as favored Twitter slang terms ("hella" in Northern California, "wasssup" in New York) as well as sport-team preferences (for example, the Celtics in Boston, the Knicks in New York, the Cavs in Cleveland).
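The team's model is a latent variable model, which is considerably more involved than anything shown here. As a much simpler stand-in, the sketch below scores each word by how much more often it appears in one region than in the data as a whole, using a smoothed frequency ratio. The function name, the whitespace tokenization and the smoothing constant are all assumptions made for illustration.

```python
from collections import Counter

def distinctive_terms(tweets_by_region, region, top_n=5, smoothing=1.0):
    """Rank words used in `region` by a smoothed frequency ratio.

    `tweets_by_region` maps a region name to a list of tweet strings.
    This is a simple ratio heuristic, not the latent variable model
    described in the study.
    """
    region_counts = Counter()
    all_counts = Counter()
    for name, texts in tweets_by_region.items():
        for text in texts:
            tokens = text.lower().split()  # naive whitespace tokenization
            all_counts.update(tokens)
            if name == region:
                region_counts.update(tokens)

    region_total = sum(region_counts.values())
    all_total = sum(all_counts.values())
    vocab_size = len(all_counts)

    scores = {}
    for term, count in region_counts.items():
        # Add-one style smoothing keeps rare words from dominating.
        p_region = (count + smoothing) / (region_total + smoothing * vocab_size)
        p_all = (all_counts[term] + smoothing) / (all_total + smoothing * vocab_size)
        scores[term] = p_region / p_all

    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Words shared across regions score lower than words confined to one region, which is the intuition behind terms like "hella" standing out for Northern California.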

The researchers found that Twitter postings also reflect well-known regionalisms from spoken language, such as Southerners' "y'all" vs. Pittsburghers' "yinz," and the regional preferences for soda vs. pop vs. Coke.

The model, verified with the geotag information, could predict the location of a microblogger in the U.S. to within 300 miles.
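To get a feel for what "within 300 miles" means, the distance between a predicted location and a geotag can be computed with the standard haversine formula. The sketch below is one way to do that check, not a description of how the researchers scored their model. The function names and the `(lat, lon)` tuple format are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_miles(predicted, actual, limit=300.0):
    """True if the predicted (lat, lon) is within `limit` miles of the actual one."""
    return haversine_miles(*predicted, *actual) <= limit
```

By this measure, a guess of New York for a user who is actually in Boston would fall inside the 300-mile window, while a guess of Los Angeles would not.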

Eisenstein et al. / CMU

Researchers clustered Twitter users based on the regional terms they included in their tweets. This map shows how tweets were clustered to reflect different characteristic regions, including Northern and Southern California, Chicago, the Lake Erie region, Boston, New York, Washington, Northern vs. Southern states, and Florida.

Evolving language

"The study shows that people continue to develop new ways of using language, regardless of whether they're talking over lunch or exchanging messages on Twitter," Eisenstein told me via e-mail today.

"But we don't know whether the geographical specificity of these new forms are simply the result of random variation propagating through social networks that are geographically local, or whether it represents an inherent need to express our regional and community affiliations using language."

Written language is traditionally more homogenized than spoken language, but Eisenstein theorizes that Twitter is more reflective of regional dialects because tweets are more informal and conversational. "It will be interesting to see what happens. Will 'suttin' remain a word we see primarily in New York City, or will it spread?" Eisenstein mused in a news release sent out today.

In addition to Eisenstein, the authors of "A Latent Variable Model for Geographic Lexical Variation" include Brendan O'Connor, Noah A. Smith and Eric P. Xing, all from Carnegie Mellon University. The research was supported in part by funding from Google, the Air Force Office of Scientific Research, the Office of Naval Research, the National Science Foundation and the Alfred P. Sloan Foundation.