If you’re not a linguist, don’t do linguistic research

A paper recently published in PNAS claims that human language tends to be positive. This was news enough to make the New York Times. But there are a few fundamental problems with the paper.

Linguistics – Now with less linguists!

The first thing you might notice about the paper is that it was written by mathematicians and computer scientists. I can understand the temptation to research and report on language. We all use it and we feel like masters of it. But that’s what makes language a tricky thing. You never hear people complain about math when they only have a high-school-level education in the subject. The “authorities” on language, however, are legion. My body has, like, a bunch of cells in it, but you don’t see me writing papers on biology. So it’s not surprising that the authors of this paper make some pretty basic errors in doing linguistic research. They should have been caught by the reviewers, but they weren’t. And the editor is a professor of demography and statistics, so that doesn’t help.

Too many claims and not enough data

The article is titled “Human language reveals a universal positivity bias” but what the authors really mean is “10 varieties of languages might reveal something about the human condition if we had more data”. That’s because the authors studied data in 10 different languages and they are making claims about ALL human languages. You can’t do that. There are some 6,000 languages in the world. If you’re going to make a claim about how every language works, you’re going to have to do a lot more than look at only 10 of them. Linguists know this; mathematicians apparently do not.

On top of that, the authors don’t even look at that much linguistic data. They extracted 5,000–10,000 of the most common words from larger corpora. Their combined corpora contain the 100,000 most common words in each of their sub-corpora. That is woefully inadequate. The Brown corpus contains 1 million words and it was made in the 1960s. In this paper, the authors claim that 20,000 words are representative of English. That is, not 20,000 different words, but the 5,000 most common words in each of their English sub-corpora. So 5,000 words each from Twitter, the New York Times, music lyrics, and the Google Books Project are supposed to represent the entire English language. This is shocking… to a linguist. Not so much to mathematicians, who don’t do linguistic research. It’s pretty frustrating, but this paper is a whole lotta ¯\_(ツ)_/¯.
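To make the difference between counting words and counting different words concrete, here is a minimal Python sketch (mine, not the authors’ code) run on an invented toy sentence. It counts tokens (running words) and types (distinct words), then works out what share of the tokens is accounted for by the N most frequent types. The function name `coverage` and the sample text are assumptions made up for illustration.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Share of all tokens accounted for by the top_n most frequent types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Invented toy "corpus"; a real corpus would have millions of tokens.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

n_tokens = len(tokens)      # running words
n_types = len(set(tokens))  # distinct words
share = coverage(tokens, 3) # share of tokens covered by the 3 most frequent types

print(n_tokens, n_types, round(share, 2))
```

A frequency list with high token coverage tells you how much of that particular corpus the list accounts for. It says nothing about whether the corpus itself is representative of the language.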

To complete the trifecta of missing linguistic data, take a look at the sources for the English corpora:

Corpus                           Word count
English: Twitter                 5,000
English: Google Books Project    5,000
English: The New York Times      5,000
English: Music lyrics            5,000

If you want to make a general claim about a language, you need to have data that is representative of that language. 5,000 words from Twitter, the New York Times, some books and music lyrics does not cut it. There are hundreds of other ways that language is used, such as recipes, academic writing, blogging, magazines, advertising, student essays, and stereo instructions. Linguists use the terms register and genre to refer to these and they know that you need more than four if you want your data to be representative of the language as a whole. I’m not even going to ask why the authors didn’t make use of publicly available corpora (such as COCA for English). Maybe they didn’t know about them. ¯\_(ツ)_/¯

Say what?

Speaking of registers, the overwhelmingly most common way that language is used is speech. Humans talking to other humans. No matter how many written texts you have, your analysis of ALL HUMAN LANGUAGE is not going to be complete until you address spoken language. But studying speech is difficult, especially if you’re not a linguist, so… ¯\_(ツ)_/¯

The fact of the matter is that you simply cannot make a sweeping claim about human language without studying human speech. It’s like doing math without the numeral 0. It doesn’t work. There are various ways to go about analyzing human speech, and there are ways of including spoken data into your materials in order to make claims about a language. But to not perform any kind of analysis of spoken data in an article about Language is incredibly disingenuous.

Same same but different

The authors claim their data set includes “global coverage of linguistically and culturally diverse languages” but that isn’t really true. Of the 10 languages that they analyze, 6 are Indo-European (English, Portuguese, Russian, German, Spanish, and French). Besides, what does “diverse” mean? We’re not told. And how are the cultures diverse? Because they speak different languages and/or because they live in different parts of the world? ¯\_(ツ)_/¯

The authors also had native speakers judge how positive, negative or neutral each word in their data set was. A word like “happy” would presumably be given the most positive rating, while a word like “frown” would be on the negative end of the scale, and a word like “the” would be rated neutral (neither positive nor negative). The people ranking the words, however, were “restricted to certain regions or countries”. So, not only are 14,000 words supposed to represent the entire Portuguese language, but residents of Brazil are rating them and therefore supposed to be representative of all Portuguese speakers. Or, perhaps that should be residents of Brazil with internet access.

[Update 2, March 2: In the following paragraph, I made some mistakes. I should not have said that ALL linguists believe that rating language is a notoriously poor way of doing an analysis. Obviously I can’t speak for all the linguists everywhere. That would be overgeneralizing, which is kind of what I’m criticizing the original paper for. Oops! :O I also shouldn’t have tied the rating used in the paper to grammaticality judgments. Grammaticality judgments have been shown to be very, very consistent for English sentences. I am not aware of whether people tend to be as consistent when rating words for how positive, negative, or neutral they are (but if you are, feel free to post in the comments). So I think the criticism still stands. Some say that 384 English-speaking participants are more than enough to rate a word’s positivity. If people rate words as consistently as they do sentences, then this is true. I’m not as convinced that people do that (until I see some research on it), but I’ll revoke my claim anyway. Either way, the point still stands – the positivity of language does not lie in the relative positive or negative nature of the words in a text (the next point I make below). Thanks to u/rusoved, u/EvM and u/noahpoah on reddit for pointing this out to me.]

There are a couple of problems with this, but the main one is that having people rate language is a notoriously poor way of analyzing language (notorious to linguists, that is). If you ask ten people to rate the grammaticality of a sentence on a scale from 1 to 10, you will get ten different answers. I understand that the authors are taking averages of the answers their participants gave, but they only had 384 participants rating the English words. I wouldn’t call that representative of the language. The number of participants for the other languages goes down from there.

A loss for words

A further complication with this article is in how it rates the relative positive nature of words rather than sentences. Obviously words have meaning, but they are not really how humans communicate. Consider the sentence Happiness is a warm gun. Two of the words in that sentence are positive (happiness and warm), while only one is negative (gun). This does not mean it’s a positive sentence. That depends on your view of guns (and possibly Beatles songs). So it is potentially problematic to look at how positive or negative the words in a text are and then say that the text as a whole (or the corpus) presents a positive view of things.
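Here is a rough Python sketch of what word-level scoring does to that sentence. The ratings below are numbers I invented on an assumed 1 to 9 scale (5 being neutral); they are not taken from the paper’s data, and `mean_word_score` is a hypothetical helper, not the authors’ instrument.

```python
# Hypothetical word ratings on an assumed 1-9 scale (5 = neutral).
# These values are invented for illustration, not taken from the paper.
RATINGS = {"happiness": 8.4, "is": 5.0, "a": 5.0, "warm": 7.0, "gun": 2.5}

def mean_word_score(text, ratings, neutral=5.0):
    """Average the ratings of the words in text, ignoring word order and context."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scored = [ratings[w] for w in words if w in ratings]
    return sum(scored) / len(scored) if scored else neutral

score = mean_word_score("Happiness is a warm gun.", RATINGS)
print(score > 5.0)  # the sentence lands on the "positive" side of neutral
```

With these made-up numbers the sentence comes out as mildly positive, simply because two pleasant words outweigh one unpleasant one. Nothing in the average knows what the sentence actually says.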

Lost in Google’s Translation

The last problem I’ll mention concerns the authors’ use of Google Translate. They write

We now examine how individual words themselves vary in their average happiness score between languages. Owing to the scale of our corpora, we were compelled to use an online service, choosing Google Translate. For each of the 45 language pairs, we translated isolated words from one language to the other and then back. We then found all word pairs that (i) were translationally stable, meaning the forward and back translation returns the original word, and (ii) appeared in our corpora in each language.

This is ridiculous. As good as Google Translate may be in helping you understand a menu in another country, it is not a good translator. Asya Pereltsvaig writes that “Google Translate/Conversation do not translate. They match. More specifically, they match (bits of) the original text with best translations, where ‘best’ means most frequently found in a large corpus such as the World Wide Web.” And she has caught Google Translate using English as an intermediate language when translating from one language to another. That means that when going between two languages that are not English (say French and Russian), Google Translate will first translate the word into English and then into the target language. This represents a methodological problem for the article in that using the online Google Translate actually makes their analysis untrustworthy.
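For readers wondering what “translationally stable” amounts to in practice, here is a small Python sketch of the round-trip check the authors describe. The two dictionaries are toy lookup tables I made up to stand in for a translation service; no real translation API is called, and `stable_pairs` is a hypothetical name. The 45 in the quote is just the number of ways to pick 2 languages out of 10.

```python
from math import comb

# 10 languages, compared two at a time.
n_language_pairs = comb(10, 2)

def stable_pairs(forward, backward):
    """Return (source, target) pairs whose round trip gives back the source word."""
    return [(src, tgt) for src, tgt in forward.items() if backward.get(tgt) == src]

# Toy French -> English and English -> French tables (invented for illustration).
forward = {"heureux": "happy", "content": "happy", "triste": "sad"}
backward = {"happy": "heureux", "sad": "triste"}

stable = stable_pairs(forward, backward)
print(n_language_pairs, stable)
```

In this toy example “content” drops out, because it goes to “happy” and comes back as “heureux”. So the check only keeps words with a tidy one-to-one match, and it takes whatever the translator returns at face value, including anything that was quietly routed through English.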

It’s unfortunate that this paper made it through to publication and it’s a shame that it was (positively) reported on by the New York Times. The paper should either be heavily edited or withdrawn. I’m doubtful that will happen.

Update: In the fourth paragraph of this post (the one which starts “On top of that…”), there was some type/token confusion concerning the corpora analyzed. I’ve made some minor edits to it to clear things up. Hat tip to Ben Zimmer on Twitter for pointing this out to me.

Update (March 17, 2015): I wrote a more detailed post (more references, less emoticons) on my problems with the article in question. You can find that here.

Published by Joe McVeigh

I'm a linguist who researches email marketing. I also teach at the University of Jyväskylä in Finland. I write about language and linguistics on my blog, ...And Read All Over, and I write about language and marketing on my other blog, Email and Linguistics.

Your claim seems to be that the data are likely to be highly unreliable, but are you claiming that there’s a systematic bias in the method which leads to a false conclusion? Otherwise there’s always the possibility that the original claim holds water, despite the substantial noise in the data. Then again I’m only half a linguist so I might only be half entitled to comment 😛

No worries, Martin, everyone is fully entitled to comment 🙂 Yes, the original claim might be true, but I’m saying that we wouldn’t know it from the data that they analyze. Basically, there’s not enough of it and it’s too noisy. That and they make some pretty poor choices in how to evaluate it.

I think what would have been perfectly reasonable – and an interesting piece of research – would be if the authors had made claims about the varieties of language of Twitter, as opposed to Language in general. Or about the language in the New York Times, as opposed to the language of journalism or all human language. It’s tempting to use Big Data to make some sweeping claims, especially since you can appear to back them up with very large numbers – millions of Tweets, thousands of participants, etc. But claims like this about language look ridiculous to linguists.

I’m Peter Dodds, one of the lead authors, and I feel I should say something
here as there are several misconceptions about our study.

1. As noted in the comments, we did have a linguist on our team (from
MITRE) and people from many other backgrounds (see below).

2. We processed very large corpora to get to these word lists of 5000 to 7000 of the most
frequently used words in each one. We derived the Twitter word lists, for example,
from multiple hundreds of billions of words (we’re now up
to a trillion). There are 1.8 million NYT articles in the 20 years sample,
music lyrics covers 50+ years, and the Google Books project is enormous as well.

3. The 5000–7000 words are the most commonly used ones
and these then account for typically 90 per cent of the complete
corpora. We stopped at these limits for two reasons: the coverage was
very good (again, around 90 per cent), and the cost of these studies
was a barrier.

4. Re coverage of the world’s languages: We did cover 24 diverse kinds
of corpora for 10 languages spread around the world (literature, the
news, music lyrics, web crawls, and Twitter). We needed
languages that are written and have extensive corpora available.
We also showed that the positivity of translatable words between all 10 languages (45
comparisons) is highly robust (Fig. 2).

5. Our aim from the start has been to build instruments for measuring
emotion in large-scale texts. Sentiment analysis is a big field but
we’ve generated a transparent, scalable instrument built on
simple evaluations of words that works well (it comports with
the Gallup Well-being Index for example). You can find more
at hedeonometer.org.

6. Previous work (including an initial one of ours) had failed to deal with the usage frequency properly
(by, for example, using LIWC or ANEW, and also the Brown Corpus). Many papers using these data
sets are problematic.

7. Finding a positivity bias was not what we set out to do but
it became such a clear signal that we looked into the literature
and found the Pollyanna Principle.

8. We are not simply mathematicians. This always happens in the
press, and within academia (“Explain how this is computer science.”).
And we’ve heard this before in other fields: “If you aren’t a(n) X don’t do
X* research”. We know the pitfalls, we know the jokes about
physicists (http://xkcd.com/793/).

We have distributed training across mathematics, theoretical
physics, electrical engineering, Earth
sciences, biology, and computational science in general,
but also the computational social sciences, psychology, linguistics, and sociology.

Personally, on top of a PhD in math/physics working on problems in biology,
Earth sciences, and ecology, I have what is effectively a second PhD in sociology
(I was a Postdoc and Research Scientist for 6 years at Columbia in the
social sciences). I would rather just be called a scientist but the
press doesn’t like this.

Hi Peter,
Thanks for commenting. I’ll respond to some of your points and use the same numbers so they’re easy to reference.

2. While the sizes of your corpora are impressive, maybe I wasn’t clear that size isn’t what I have a problem with. Basically, I’m worried that your corpora are not representative of the languages you analyze and that they are not balanced. 1.8 million (or 1.8 trillion) articles from the New York Times do not even represent English journalism. In order to do that, you need to have a sampling from major news publications from around the English speaking world. And then you need to balance the news articles based on the subgenres of journalism that are typical in newspapers – financial, sports, arts, etc. – or based on how prevalent each subgenre is typically. This is what general corpora do and the representativeness has always been more important than the size. Basically, if it’s not representative of the language (no matter how big it is), you can’t make general claims about that language.

4. Let’s say that the 10 languages are diverse enough. I agree that you need to have written languages with corpora available. And let’s say the corpora are representative (not extensive). That still doesn’t explain why you are comparing different genres or registers – such as song lyrics and tweets.
As for the translations, Google Translate is highly flawed. You can barely trust it going from English to another language, but as I pointed out in my post, I don’t think you can trust it at all going to/from languages that are not English.

5. I think your aims are noble ones. I hope you’re able to make these tools even better as you add to and refine them. I mentioned in the edit that I withdrew my comment about there not being enough ratings.

6. I’m not sure what you’re referring to when you say that previous research “failed to deal with the usage frequency properly”. But was there a reason you didn’t choose the available general corpora, such as COCA, the BNC, etc. for English?

8. Sorry to pigeonhole you and your fellow authors. I’m not against interdisciplinary work at all and I know that all of us in academia wear a few hats, even though we’re supposed to specialize. And thanks for the xkcd. Hadn’t seen that one yet 🙂

I think the biggest problem is the grandiose headline. Your paper certainly supports the Pollyanna hypothesis, but something like that is difficult to “prove”. McVeigh’s most valid objection, imho, was regarding the breadth of genres, rather than the size. I’m willing to believe the Pollyanna hypothesis for books and twitter (of the specified languages) because the coverage is good and likely unbiased enough.

Also, all of these things have one major thing in common (besides being text and not speech): they are intended for public consumption. Email or chat correspondence would be interesting to see. Obviously hard to get in practice. Maybe the NSA could take your software and run the study :-).

Point of pedantry: “Twitter API” and “The MITRE Corporation” are not citations (Table S1). I’m assuming the tweets were downloaded using the streaming API, but I can’t know that for sure. Nor can I know when they were retrieved, over what time period, whether the computer downloading them could keep up with the stream, etc.

Overall, I thought the paper was interesting and well executed. The sentiment parser looks great, and in particular the list of rated words is the best I’ve ever seen. Kudos to you!

Hi Jacob, sorry about the grandiose headline. It was kind of meant to show my frustration with what I saw as a deeply flawed paper. As a linguist, why should I have to follow best practices and accepted methodology when scholars from other fields are flouting them? And getting published on top of that. And getting written up in the New York Times…

You’re fine for believing that the Pollyanna hypothesis holds, but be clear that you believe it holds for the language varieties mentioned in the paper (when they are given). For example, the New York Times is a subgenre of American journalism. It’s not the be all and end all so it can’t be said to represent the whole genre of American journalism. We’re not told where the Tweets come from – are they just written in English or is the Twitter subcorpus balanced so that it represents the world English-speaking population? You can see where the problem comes from – if we can’t make claims about one part of one language (American English journalism), then how can we make claims about all human language?

I have to mention that even though I’m no mathematician, I can think of several branches of mathematics off the top of my head that can make do without the concept of zero. 🙂

Great article, though, and I think Peter Dodds’ response reveals that he did not quite grasp (or at least fails to acknowledge) the crucial objections to the conclusions his team drew in their research.

Yeah, the zero analogy was a bad one (my specialty). It could have maybe worked on a very abstract level – linguist finds problems with mathematicians doing language research and then has problems with math. But if you have to explain it…

I thank Peter for replying. It may be that he missed my objections, but I’ll admit that they could have been grounded in some objectionable research. I’m going to address that soon.

I have never seen so much arrogance in one blog (not only this article). It’s always differentiated between linguists, who are sooo smart, and non-linguists, who know nothing. I have the impression that the author just wants to provoke, maybe because he himself has no success… “In fact, I really hope you’ll get in touch because I’ve tried again and again to get email marketers to work with me and come up with bupkis.”

Great comment. In your view, it is “arrogant” to ask that people who do research in an area actually be trained in the area. It is saying people are “sooo smart” because they’ve decided to devote part of their lives to studying something before writing about it.

I look forward to hearing about your next bout of surgery, when you tell the medical staff how arrogant they are being for wanting a licensed and degreed surgeon to operate on you. Soooo arrogant.

I actually meant the headline of Dodds paper. It applies to this post too, though. There seems to be a paradox: make an accurate headline and nobody reads your work, make it really over-the-top and a bunch of people criticize the work for being over-the-top.

>why should I have to follow best practices and accepted methodology when scholars from other fields are flouting them

What exactly are “best practices” and “accepted methodology” in this case? I’m genuinely asking, as I’m a non-linguist interested in doing linguistic research. Sentiment analysis and tokenization are both open problems in linguistics, as far as I know. The authors took a simple approach which will be mostly right, most of the time. So statistical aggregates of that data (which is what they are presenting) should be correct, particularly with large amounts of data.

>We’re not told where the Tweets come from

Yeah this bugged me too. They reference papers for some of their corpora but not all of them. That’s not good practice.

I think the reason why the authors constructed their own corpus rather than rely upon a pre-existing one was because they wanted the data to be gathered in much the same way for all the languages they investigated (I doubt there are corpora for all the languages modeled on COCA). There are always going to be trade-offs in work like this (COCA certainly made some) and whether those trade-offs skew their results is certainly a fair question to ask. It’s also difficult (well, impossible) to attain a representative sample of something as broadly defined as “language.” I think in practice the criteria vary from study to study, depending upon what aspect of language is being investigated. I would be much more critical of the corpus if they were trying to describe change or variation. For their purposes, it might be fine, since they really are looking at a massive number of words and are not making rankings within their corpus.

Ultimately, this is an empirical question, and one that can be answered because the authors made the datasets available (something which is unfortunately not the norm in linguistic research). So it is possible for someone to investigate whether the wordlists they are using are unrepresentative or not simply by doing some linguistic research.

I’m sure there are other corpora, but I also don’t know if they are constructed in the same way as COCA. And as you said, COCA has its limits. I think everyone would agree that we can’t get a representative sample of “language”, but that could be an argument for not making claims about “language”. I think that’s why a lot of papers try to get a representative sample of some genre or register of language and then make claims about that. It is nice that they made their data available (and would be nicer if more linguists did that, I agree) and I’m going to dig into it to address the issue I had with representativeness. I’m still unclear about a few things, but I’m going to try to address them in a future post.

Oops, my bad. I kind of agree with you on the headline writing. That’s an interesting research topic – Buzzfeed through the years: A diachronic analysis of clickbait.

I’ll have to get back to you on the best practices part. I’m writing another post where I spell out my two main issues with the paper and test their data. I’ll be sure to explain the problems in that one. Joe F below links to a paper where they discuss the Twitter corpus. I’ll include that if it’s relevant, but feel free to have a look.

Haha, yeah it would normally be fewer. I was trying to make a play on the old advertising slogan “Now with less fat/carbohydrates/whatever!”. I don’t even know if that’s a real thing or just something that exists in my brain 😀

Interesting that you mention Asya Pereltsvaig in this context. The book she co-authored with Lewis on the Indo-European Controversy is terrific (I got it for Christmas and read it in two days), and shows the dangers of non-linguists trying to claim that their computer models are somehow free of bias and are going to reveal truths about language that intelligent analysis of the facts won’t.