Did you consider Twitter’s (lack of) representativeness before doing that predictive study?

Twitter data have many qualities that appeal to researchers, but are probably not suitable for research where representativeness is important. Image: Bernard Goldbach (Flickr).

Twitter data have many qualities that appeal to researchers. They are extraordinarily easy to collect. They are available in very large quantities. And with a simple 140-character text limit they are easy to analyze. As a result of these attractive qualities, over 1,400 papers have been published using Twitter data, including many attempts to predict disease outbreaks, election results, film box office gross, and stock market movements solely from the content of tweets.

Easy availability of Twitter data links nicely to a key goal of computational social science. If researchers can find ways to impute user characteristics from social media, then the capabilities of computational social science would be greatly extended. However, few papers consider the digital divide among Twitter users, even though the question of who uses Twitter has major implications for research that attempts to use the content of tweets for inference about population behaviour. Do Twitter users share identical characteristics with the population of interest? For what populations are Twitter data actually appropriate?

A new article by Grant Blank published in Social Science Computer Review provides a multivariate empirical analysis of the digital divide among Twitter users, comparing Twitter users and nonusers with respect to their characteristic patterns of Internet activity and to certain key attitudes. It thereby fills a gap in our knowledge about an important social media platform, and it joins a surprisingly small number of studies that describe the population that uses social media.

Comparing British (OxIS survey) and US (Pew) data, Grant finds that generally, British Twitter users are younger, wealthier, and better educated than other Internet users, who in turn are younger, wealthier, and better educated than the offline British population. American Twitter users are also younger and wealthier than the rest of the population, but they are not better educated. Twitter users are disproportionately members of elites in both countries. Twitter users also differ from other groups in their online activities and their attitudes.

Under these circumstances, any collection of tweets will be biased, and inferences based on analysis of such tweets will not match the population characteristics. A biased sample can’t be corrected by collecting more data. These biases have important implications for research based on Twitter data: they suggest that Twitter data are not suitable for research where representativeness is important, such as forecasting elections or gaining insight into attitudes, sentiments, or activities of large populations.
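To see why more data does not help, here is a minimal simulation sketch. All of the numbers in it are hypothetical and chosen only for illustration (the share of Twitter users and both support rates are assumptions, not figures from the article). It draws ever larger samples from Twitter users only and compares the estimate with the true population value: the random noise shrinks as the sample grows, but the gap caused by sampling the wrong population stays put.

```python
import random

# Hypothetical numbers chosen for illustration only; none come from the article.
SHARE_TWITTER = 0.20    # assumed share of the population that uses Twitter
SUPPORT_TWITTER = 0.60  # assumed support for some candidate among Twitter users
SUPPORT_OTHERS = 0.45   # assumed support among everyone else

# True support in the whole population (weighted average of the two groups).
TRUE_SUPPORT = (SHARE_TWITTER * SUPPORT_TWITTER
                + (1 - SHARE_TWITTER) * SUPPORT_OTHERS)


def twitter_only_estimate(n, rng):
    """Estimate support from n people drawn only from Twitter users."""
    hits = sum(rng.random() < SUPPORT_TWITTER for _ in range(n))
    return hits / n


def run(sizes=(100, 10_000, 200_000), seed=1):
    """Return a dict mapping each sample size to its Twitter-only estimate."""
    rng = random.Random(seed)
    return {n: twitter_only_estimate(n, rng) for n in sizes}


if __name__ == "__main__":
    print(f"true population support: {TRUE_SUPPORT:.3f}")
    for n, est in run().items():
        print(f"n={n:>7}: estimate={est:.3f}, error={est - TRUE_SUPPORT:+.3f}")
```

Under these assumptions the estimates settle near the Twitter-user rate rather than the population rate, however large the sample becomes.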

Ed.: Despite your cautions about lack of representativeness, you mention that the bias in Twitter could actually make it useful to study (for example) elite behaviours: for example in political communication?

Grant: Yes. If you want to study elites and channels of elite influence then Twitter is a good candidate. Twitter data could be used as one channel of elite influence, along with other online channels like social media or blog posts, and offline channels like mass media or lobbying. There is an ecology of media and Twitter is one part.

Grant: Right. Some commercial products are disproportionately used by wealthier or younger people. That certainly would include certain forms of mass entertainment like cinema. It also probably includes a number of digital products like smartphones, especially more expensive phones, and wearable devices like a Fitbit. If a product is disproportionately bought by the same population groups that use Twitter then it may be possible to forecast sales using Twitter data. Conversely, products disproportionately used by poorer or older people are unlikely to be predictable using Twitter.

Ed.: Is there a general trend towards abandoning expensive, time-consuming, multi-year surveys and polling? And do you see any long-term danger in that? i.e. governments and media (and academics?) thinking “Oh, we can just get it off social media now”.

Grant: Yes and no. There are certainly people who are thinking about it and trying to make it work. The ease and low cost of social media is very seductive. However, that has to be balanced against major weaknesses. First, the population using Twitter (and other social media) is unclear, but it is certainly not a random sample. It is just a population of Twitter users, which is not a population of interest to many.

Second, tweets are even less representative. As I point out in the article, over 40% of people with a Twitter account have never sent a tweet, and the top 15% of users account for 85% of tweets. So tweets are even less representative of any real-world population than Twitter users. What these issues mean is that you can’t calculate measures of error or confidence intervals from Twitter data. This is crippling for many academic and government uses.

Third, Twitter’s limited message length and simple interface tend to give it advantages on devices with restricted input capability, like phones. It is well-suited for short, rapid messages. These characteristics tend to encourage Twitter use for political demonstrations, disasters, sports events, and other live events where reports from an on-the-spot observer are valuable. This suggests that Twitter usage is not like other social media or like email or blogs.

Fourth, researchers attempting to extract the meaning of words have only 140 characters to analyze, and those characters are littered with abbreviations, slang, non-standard English, misspellings and links to other documents. The measurement issues are immense. Measurement is hard enough in surveys, where researchers have control over question wording and can do cognitive interviews to understand how people interpret words.

With Twitter (and other social media) researchers have no control over the process that generated the data, and no theory of the data generating process. Unlike surveys, social media analysis is not a general-purpose tool for research. Except in limited areas where these issues are less important, social media is not a promising tool.

Grant: That is an interesting possibility. The problem is matching Facebook data with other data, like voting records. Facebook doesn’t know where people live. Finding their location would not be an easy problem. It is simpler because Facebook would not need an actual address; it would only need to locate the correct voting district or the state (for the Electoral College in US Presidential elections). Still, there would be error of unknown magnitude, probably impossible to calculate. It would be a very interesting research project. Whether it would be more accurate than a poll is hard to say.

Grant: Surveys are such versatile, general-purpose tools. They can be used to elicit many kinds of information on all kinds of subjects from almost any population. These are not characteristics of social media. There is no real danger that surveys will be replaced in general.

However, I can see certain specific areas where analysis of social media will be useful. Most of these are commercial areas, like consumer sentiments. If you want to know what people are saying about your product, then going to social media is a good, cheap source of information. This is especially true if you sell a mass market product that many people use and talk about; think: films, cars, fast food, breakfast cereal, etc.

These are important topics to some people, but they are a subset of things that surveys are used for. Too many things are not talked about, and some are very important. For example, there is the famous British reluctance to talk about money. Things like income, pensions, and real estate or financial assets are not likely to be common topics. If you are a government department or a researcher interested in poverty, the effect of government assistance, or the distribution of income and wealth, you have to depend on a survey.

There are a lot of other situations where surveys are indispensable. For example, if the OII wanted to know what kind of jobs OII alumni had found, it would probably have to survey them.

Ed.: Finally .. 1400 Twitter articles in .. do we actually know enough now to say anything particularly useful or concrete about it? Are we creeping towards a Twitter revelation or consensus, or is it basically 1400 articles saying “it’s all very complicated”?

Grant: Mostly researchers have accepted Twitter data at face value. Whatever people write in a tweet, it means whatever the researcher thinks it means. This is very easy and it avoids a whole collection of complex issues. All the hard work of understanding how meaning is constructed in Twitter and how it can be measured is yet to be done. We are a long way from understanding Twitter.