
Last week I wrote a post called “If you’re not a linguist, don’t do linguistics”. This got shared around Twitter quite a bit and made it to the front page of r/linguistics, so a lot of people saw it. Pretty much everyone had good insight on the topic and it generated some great discussion. I thought it would be good to write a follow-up to flesh out my main concerns in a more serious manner (this time sans emoticons!) and to address the concerns some people had with my reasoning.

The paper in question is by Dodds et al. (2015) and it is called “Human language reveals a universal positivity bias”. The certainty of that title is important since I’m going to try to show in this post that the authors make too many assumptions to reliably make any claims about all human language. I’m going to focus on the English data because that is what I am familiar with. But if anyone who is familiar with the data in other languages would like to weigh in, please do so in the comments.

The first assumption made by the authors is that it is possible to make universal claims about language using only written data. This is an assumption that no scholar should ever make, and it is not a minor issue. The differences between spoken and written language are many and major (Linell 2005). But dealing with spoken data is difficult – it takes much more time and effort to collect and analyze than written data. We can argue, however, that even in highly literate societies, the majority of language use is spoken – and spoken language does not work like written language. So any research which makes claims about all human language will have to include some form of spoken data. But the data set that the authors draw from (called their corpus) is made from tweets, song lyrics, New York Times articles and the Google Books project. Tweets and song lyrics, let alone news articles or books, do not mimic spoken language in an accurate way. For example, these registers may include the same words as human speech, but certainly not in the same proportion. Written language does not include false starts, nor does it include repetition or elision in anywhere near the same way that spoken language does. Anyone who has done any transcription work will tell you this.

The next assumption made by the authors is that their data is representative of all human language. Representativeness is a major issue in corpus linguistics. When linguists want to investigate a register or variety of language, they build a corpus which is representative of that register or variety by taking a large enough and balanced sample of texts from that register. What is important here, however, is that most linguists do not have a problem with a set of data representing a larger register – so long as that larger register isn’t all human language. For example, if we wanted to research modern English journalism (quite a large register), we would build a corpus of journalism texts from English-speaking countries and we would be careful to include various kinds of journalism – op-eds, sports reporting, financial news, etc. We would not build a corpus of articles from the Podunk Free Press and make claims about all English journalism. But representativeness is a tricky issue. The larger the language variety you are trying to investigate, the more data from that variety you will need in your corpus. Baker (2010: 7) notes that a corpus analysis of one novel is “unlikely to be representative of all language use, or all novels, or even the general writing style of that author”. The English sub-corpora in Dodds et al. exist somewhere in between a fully non-representative corpus of English (one novel) and a fully representative corpus of English (all human speech and writing in English). In fact, in another paper (Dodds et al. 2011), the representativeness of the Twitter corpus is explained as follows: “First, in terms of basic sampling, tweets allocated to data feeds by Twitter were effectively chosen at random from all tweets. Our observation of this apparent absence of bias in no way dismisses the far stronger issue that the full collection of tweets is a non-uniform subsampling of all utterances made by a non-representative subpopulation of all people. While the demographic profile of individual Twitter users does not match that of, say, the United States, where the majority of users currently reside, our interest is in finding suggestions of universal patterns.” What I think that doozy of a sentence in the middle is saying is that the tweets come from an unrepresentative sample of the population but that the language in them may be suggestive of universal English usage. Does that mean we can assume that the English sub-corpora (specifically the Twitter data) in Dodds et al. are representative of all human communication in English?

Another assumption the authors make is that they have sampled their data correctly. The decisions on what texts will be sampled, as Tognini-Bonelli (2001: 59) points out, “will have a direct effect on the insights yielded by the corpus”. Following Biber (see Tognini-Bonelli 2001: 59), linguists can classify texts into various channels in order to ensure that their sample texts will be representative of a certain population of people and/or variety of language. They can start with general “channels” of the language (written texts, spoken data, scripted data, electronic communication) and move on to whether the language is private or published. Linguists can then sample language based on what type of person created it (their age, sex, gender, socio-economic situation, etc.). For example, if we made a corpus of the English articles on Wikipedia, we would have a massive amount of linguistic data. Literally billions of words. But 87% of it will have been written by men and 59% of it will have been written by people under the age of 40. Would you feel comfortable making claims about all human language based on that data? How about just all English language encyclopedias?

The next assumption made by the authors is that the relative positive or negative nature of the words in a text is indicative of how positive that text is. But words can have various and sometimes even opposing meanings. Texts are also likely to contain words that are written the same but have different meanings. For example, the word fine in the Dodds et al. corpus, like the rest of the words in the corpus, is just a four-letter word – free of context and naked as a jaybird. Is it an adjective that means “good, acceptable, or satisfactory”, which Merriam-Webster says is sometimes “used in an ironic way to refer to things that are not good or acceptable”? Or does it refer to that little piece of paper that the Philadelphia Parking Authority is so (in)famous for? We don’t know. All we know is that it has been rated 6.74 on the positivity scale by the respondents in Dodds et al. Can we assume that all the uses of fine in the New York Times are that positive? Can we assume that the use of fine on Twitter is always or even mostly non-ironic? On top of that, some of the most common words in English also tend to have the most meanings. There are 15 entries for get in the Macmillan Dictionary, including “kill/attack/punish” and “annoy”. Get in Dodds et al. is ranked on the positive side of things at 5.92. Can we assume that this rating carries across all the uses of get in the corpus? The authors found approximately 230 million unique “words” in their Twitter corpus (they counted all forms of a word separately, so banana, bananas, b-a-n-a-n-a-s! would be separate “words”; and they counted URLs as words). So they used the 50,000 most frequent ones to estimate the information content of texts. Can we assume that it is possible to make an accurate claim about how positive or negative a text is based on nothing but the words taken out of context?
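
To make the problem concrete, here is a minimal Python sketch of context-free word scoring. It is my own illustration and not the authors' code: the function name mean_word_score and the three example sentences are made up, and the only ratings in the lexicon are the two quoted above (fine at 6.74 and get at 5.92). Words without a rating are simply skipped, which is an assumption of the sketch.

```python
# Only the two ratings quoted in this post; everything else is unrated.
LEXICON = {"fine": 6.74, "get": 5.92}


def mean_word_score(text, lexicon=LEXICON):
    """Average the ratings of the rated words in a text, ignoring context."""
    # Naive tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    scores = [lexicon[t] for t in tokens if t in lexicon]
    # Return None when no word in the text has a rating.
    return sum(scores) / len(scores) if scores else None


praise = "The weather is fine today."           # fine = good
penalty = "I had to pay a parking fine today."  # fine = penalty
threat = "I will get you for this."             # get = attack/punish

print(mean_word_score(praise))   # same score as the parking fine
print(mean_word_score(penalty))
print(mean_word_score(threat))   # scored above the midpoint of a 1-9 scale
```

The lookup cannot tell the adjective from the parking ticket, so both sentences come out equally "positive", and the threat is scored by the single rating for get.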

Another assumption that the authors make is that the respondents in their survey can speak for the entire population. The authors used Amazon’s Mechanical Turk to crowdsource evaluations for the words in their sub-corpus. 60% of the American people on Mechanical Turk are women and 83.5% of them are white. The authors used respondents located in the United States and India. Can we assume that these respondents have opinions about the words in the corpus that are representative of the entire population of English speakers? Here are the ratings for the various ways of writing laughter in the authors’ corpus:

Can we assume that the textual representation of laughter is always as positive as the respondents rated it? Can we assume that everyone or most people on Twitter use the various textual representations of laughter in a positive way – that they are laughing with someone and not at someone?

Finally, let’s compare some data. The good people at the Corpus of Contemporary American English (COCA) have created a word list based on their 450-million-word corpus. The COCA corpus is specifically designed to be large and balanced (although the problem of dealing with spoken language might still remain). In addition, each word in their corpus is annotated for its part of speech, so they can recognize when a word like state is either a verb or a noun. This last point is something that Dodds et al. did not do – all forms of words that are spelled the same are collapsed into being one word. The compilers of the COCA list note that “there are more than 140 words that occur both as a noun and as a verb at least 10,000 times in COCA”. This is the type/token issue that came up in my previous post. A corpus that tags each word for its part of speech can tell the difference between different types of the “same” word (state as a verb vs. state as a noun), while an untagged corpus treats all occurrences of state as the same token. If we compare the 10,000 most common words in Dodds et al. to a sample of the 10,000 most common words in COCA, we see that there are 121 words on the COCA list but not the Dodds et al. list (Here is the spreadsheet from the Dodds et al. paper with the COCA data – pnas.1411678112.sd01 – Dodds et al corpus with COCA). And that’s just a sample of the COCA list. How many more differences would there be if we compared the Dodds et al. list to the whole COCA list?
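
For readers who like to see this sort of thing spelled out, here is a rough Python sketch of both ideas: counting with and without part-of-speech tags, and comparing two word lists. Everything in it is hypothetical. The tagged tokens and the two tiny word lists are hand-made toy data, not drawn from COCA or from the Dodds et al. spreadsheet.

```python
from collections import Counter

# Hypothetical hand-tagged tokens (word, part of speech); not real corpus data.
tagged_tokens = [
    ("state", "NOUN"), ("state", "VERB"), ("state", "NOUN"),
    ("fine", "ADJ"), ("fine", "NOUN"),
]

# An untagged count collapses everything spelled the same into one entry.
untagged_counts = Counter(word for word, _ in tagged_tokens)

# A tagged count keeps state/NOUN and state/VERB apart.
tagged_counts = Counter(tagged_tokens)

print(len(untagged_counts))  # distinct entries without tags
print(len(tagged_counts))    # distinct entries with tags

# Comparing two word lists is a set difference. Both lists are toy examples.
list_a = {"state", "fine", "get"}
list_b = {"state", "fine", "haha"}
only_in_a = sorted(list_a - list_b)
print(only_in_a)
```

With real data, list_a and list_b would be the two 10,000-word lists, and the length of the difference is the kind of number reported above.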

To sum up, the authors use their corpus of tweets, New York Times articles, song lyrics and books and ask us to assume (1) that they can make universal claims about language despite using only written data; (2) that their data is representative of all human language despite including only four registers; (3) that they have sampled their data correctly despite not knowing what types of people created the linguistic data and only including certain channels of published language; (4) that the relative positive or negative nature of the words in a text is indicative of how positive that text is despite the obvious fact that words can be spelled the same and still have wildly different meanings; (5) that the respondents in their survey can speak for the entire population despite the English-speaking respondents being from only two subsets of two English-speaking populations (USA and India); and (6) that their list of the 10,000 most common words in their corpus (which they used to rate all human language) is representative despite being uncomfortably dissimilar to a well-balanced list that can differentiate between different types of words.

I don’t mean to sound like a Negative Nancy and I don’t want to trivialize the work of the authors in this paper. The corpus that they have built is nothing short of amazing. The amount of feedback they got from human respondents on language is also impressive (to say the least). I am merely trying to point out what we can and cannot say based on the data. It would be nice to make universal claims about all human language, but the fact is that even with millions and billions of data points, we still are not able to do so unless the data is representative and sampled correctly. That means it has to include spoken data (preferably a lot of it) and it has to be sampled from all socio-economic human backgrounds.

Hat tip to the commenters on the last post and the redditors over at r/linguistics.

A paper recently published in PNAS claims that human language tends to be positive. This was news enough to make the New York Times. But there are a few fundamental problems with the paper.

Linguistics – Now with less linguists!

The first thing you might notice about the paper is that it was written by mathematicians and computer scientists. I can understand the temptation to research and report on language. We all use it and we feel like masters of it. But that’s what makes language a tricky thing. You never hear people complain about math when they only have a high-school-level education in the subject. The “authorities” on language, however, are legion. My body has, like, a bunch of cells in it, but you don’t see me writing papers on biology. So it’s not surprising that the authors of this paper make some pretty basic errors in doing linguistic research. They should have been caught by the reviewers, but they weren’t. And the editor is a professor of demography and statistics, so that doesn’t help.

Too many claims and not enough data

The article is titled “Human language reveals a universal positivity bias” but what the authors really mean is “10 varieties of languages might reveal something about the human condition if we had more data”. That’s because the authors studied data in 10 different languages and they are making claims about ALL human languages. You can’t do that. There are some 6,000 languages in the world. If you’re going to make a claim about how every language works, you’re going to have to do a lot more than look at only 10 of them. Linguists know this; mathematicians apparently do not.

On top of that, the authors don’t even look at that much linguistic data. They extracted 5,000–10,000 of the most common words from larger corpora. Their combined corpora contain the 100,000 most common words in each of their sub-corpora. That is woefully inadequate. The Brown corpus contains 1 million words and it was made in the 1960s. In this paper, the authors claim that 20,000 words are representative of English. That is, not 20,000 different words, but the 5,000 most common words in each of their English sub-corpora. So 5,000 words each from Twitter, the New York Times, music lyrics, and the Google Books Project are supposed to represent the entire English language. This is shocking… to a linguist. Not so much to mathematicians, who don’t do linguistic research. It’s pretty frustrating, but this paper is a whole lotta ¯\_(ツ)_/¯.

To complete the trifecta of missing linguistic data, take a look at the sources for the English corpora:

Corpus                          Word count
English: Twitter                5,000
English: Google Books Project   5,000
English: The New York Times     5,000
English: Music lyrics           5,000

If you want to make a general claim about a language, you need to have data that is representative of that language. 5,000 words each from Twitter, the New York Times, some books and music lyrics do not cut it. There are hundreds of other ways that language is used, such as recipes, academic writing, blogging, magazines, advertising, student essays, and stereo instructions. Linguists use the terms register and genre to refer to these and they know that you need more than four if you want your data to be representative of the language as a whole. I’m not even going to ask why the authors didn’t make use of publicly available corpora (such as COCA for English). Maybe they didn’t know about them. ¯\_(ツ)_/¯

Say what?

Speaking of registers, the overwhelmingly most common way that language is used is speech. Humans talking to other humans. No matter how many written texts you have, your analysis of ALL HUMAN LANGUAGE is not going to be complete until you address spoken language. But studying speech is difficult, especially if you’re not a linguist, so… ¯\_(ツ)_/¯

The fact of the matter is that you simply cannot make a sweeping claim about human language without studying human speech. It’s like doing math without the numeral 0. It doesn’t work. There are various ways to go about analyzing human speech, and there are ways of including spoken data in your materials in order to make claims about a language. But to not perform any kind of analysis of spoken data in an article about Language is incredibly disingenuous.

Same same but different

The authors claim their data set includes “global coverage of linguistically and culturally diverse languages” but that isn’t really true. Of the 10 languages that they analyze, 6 are Indo-European (English, Portuguese, Russian, German, Spanish, and French). Besides, what does “diverse” mean? We’re not told. And how are the cultures diverse? Because they speak different languages and/or because they live in different parts of the world? ¯\_(ツ)_/¯

The authors also had native speakers judge how positive, negative or neutral each word in their data set was. A word like “happy” would presumably be given the most positive rating, while a word like “frown” would be on the negative end of the scale, and a word like “the” would be rated neutral (neither positive nor negative). The people ranking the words, however, were “restricted to certain regions or countries”. So, not only are 14,000 words supposed to represent the entire Portuguese language, but residents of Brazil are rating them and therefore supposed to be representative of all Portuguese speakers. Or, perhaps that should be residents of Brazil with internet access.

[Update 2, March 2: In the following paragraph, I made some mistakes. I should not have said that ALL linguists believe that rating language is a notoriously poor way of doing an analysis. Obviously I can’t speak for all the linguists everywhere. That would be overgeneralizing, which is kind of what I’m criticizing the original paper for. Oops! :O I also shouldn’t have tied the rating used in the paper to grammaticality judgments. Grammaticality judgments have been shown to be very, very consistent for English sentences. I am not aware of whether people tend to be as consistent when rating words for how positive, negative, or neutral they are (but if you are, feel free to post in the comments). So I think the criticism still stands. Some say that the 384 English-speaking participants are more than enough to rate a word’s positivity. If people rate words as consistently as they do sentences, then this is true. I’m not as convinced that people do that (until I see some research on it), but I’ll revoke my claim anyway. Either way, the point still stands – the positivity of language does not lie in the relative positive or negative nature of the words in a text (the next point I make below). Thanks to u/rusoved, u/EvM and u/noahpoah on reddit for pointing this out to me.]

There are a couple of problems with this, but the main one is that having people rate language is a notoriously poor way of analyzing language (notorious to linguists, that is). If you ask ten people to rate the grammaticality of a sentence on a scale from 1 to 10, you will get ten different answers. I understand that the authors are taking averages of the answers their participants gave, but they only had 384 participants rating the English words. I wouldn’t call that representative of the language. The number of participants for the other languages goes down from there.

A loss for words

A further complication with this article is in how it rates the relative positive nature of words rather than sentences. Obviously words have meaning, but they are not really how humans communicate. Consider the sentence Happiness is a warm gun. Two of the words in that sentence are positive (happiness and warm), while only one is negative (gun). This does not mean it’s a positive sentence. That depends on your view of guns (and possibly Beatles songs). So it is potentially problematic to look at how positive or negative the words in a text are and then say that the text as a whole (or the corpus) presents a positive view of things.

Lost in Google’s Translation

The last problem I’ll mention concerns the authors’ use of Google Translate. They write

We now examine how individual words themselves vary in their average happiness score between languages. Owing to the scale of our corpora, we were compelled to use an online service, choosing Google Translate. For each of the 45 language pairs, we translated isolated words from one language to the other and then back. We then found all word pairs that (i) were translationally stable, meaning the forward and back translation returns the original word, and (ii) appeared in our corpora in each language.

This is ridiculous. As good as Google Translate may be in helping you understand a menu in another country, it is not a good translator. Asya Pereltsvaig writes that “Google Translate/Conversation do not translate. They match. More specifically, they match (bits of) the original text with best translations, where ‘best’ means most frequently found in a large corpus such as the World Wide Web.” And she has caught Google Translate using English as an intermediate language when translating from one language to another. That means that when going between two languages that are not English (say French and Russian), Google Translate will first translate the word into English and then into the target language. This represents a methodological problem for the article in that using the online Google Translate actually makes their analysis untrustworthy.
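
The round-trip filter described in the quote can be sketched roughly as follows. This is only my reading of "translationally stable", not the authors' code. The function name is_translationally_stable is hypothetical, and the two small dictionaries are hand-written stand-ins for a translation service, not actual Google Translate output.

```python
def is_translationally_stable(word, forward, backward):
    """True if translating forward and then back returns the original word."""
    translated = forward.get(word)
    if translated is None:
        # No forward translation available, so the round trip fails.
        return False
    return backward.get(translated) == word


# Toy one-word-to-one-word dictionaries, written by hand for illustration.
EN_TO_FR = {"fine": "amende", "get": "obtenir", "happy": "heureux"}
FR_TO_EN = {"amende": "fine", "obtenir": "obtain", "heureux": "happy"}

stable = [
    word for word in EN_TO_FR
    if is_translationally_stable(word, EN_TO_FR, FR_TO_EN)
]
print(stable)
```

Note that a lookup like this maps each isolated word to exactly one translation, so whatever single match the service returns decides whether a word survives the filter.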

It’s unfortunate that this paper made it through to publication and it’s a shame that it was (positively) reported on by the New York Times. The paper should either be heavily edited or withdrawn. I’m doubtful that will happen.

Update: In the fourth paragraph of this post (the one which starts “On top of that…”), there was some type/token confusion concerning the corpora analyzed. I’ve made some minor edits to it to clear things up. Hat tip to Ben Zimmer on Twitter for pointing this out to me.

Update (March 17, 2015): I wrote a more detailed post (more references, less emoticons) on my problems with the article in question. You can find that here.