“Twitter says…” – Can big social data tell us about public opinion?

“Like Noah’s ark, (there was) every kind of creature in every walk of life. They included a town wit, a grave citizen, a worthy lawyer, a worship justice, a reverend nonconformist, and a voluble sailor.”

The above description comes from a history of English coffee houses in the seventeenth century¹, but might just as well apply to the twenty-first century’s sites of caffeinated conversation: online social networks. With the rapid uptake of the Internet and the more recent rise to prominence of social network sites like Facebook and Twitter, hundreds of millions of ordinary people – the witty, the worthy, and the decidedly neither – are now connected not only to the web, a source of news, but also to social networks, a source of views.

Whether and how researchers should mine these views for valuable insights is another matter. From one perspective, social networks offer an unprecedented gold mine of public sentiment, expressed more freely and obtained more cheaply than traditional research methods would allow. From another, this gold mine can seem to exist only at the end of a rainbow, riddled with seemingly intractable issues over the reliability of the data. A forthcoming paper will address these issues comprehensively, but this article looks at only one: the claims to representativeness of social media data.

Typically, measuring public opinion is a time-consuming and financially draining affair. The standard approach taken by social scientists investigating public beliefs is to use a survey such as an opinion poll. For this method, a finite number of respondents are mailed or called and asked a finite number of questions about, say, opinions of a parliamentary candidate or attitudes towards the economy. The limited number of questions and respondents is usually a financial necessity, but such narrow surveys can nonetheless be regarded as a legitimate gauge of public opinion due to a mathematical quirk: the statistical concept of sampling error allows researchers to generalise from a small sample to a population at large, assuming this sample is random. Indeed, for complicated mathematical reasons, generalising to a population in the hundreds of millions is perfectly possible from a sample as small as one or two thousand, assuming the margin of error that this process generates (usually two or three percentage points either way) is tolerable.
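The arithmetic behind that claim can be sketched in a few lines. This is only an illustration under stated assumptions: a simple random sample, a 95% confidence level (z of roughly 1.96) and the worst case of an evenly split opinion (p = 0.5). The function name `margin_of_error` is made up for this sketch and is not taken from any polling toolkit.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumptions (not from the article): 95% confidence (z = 1.96) and the
    worst-case split p = 0.5. The population is treated as very large, so
    its size does not enter the formula at all.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Compare a few sample sizes, from a typical poll to a "big data" scale.
for n in (1000, 2000, 10000, 100000):
    print(f"n = {n:>6}: +/- {100 * margin_of_error(n):.1f} percentage points")
```

Under these assumptions the margin is about three points at 1,000 respondents and about two at 2,000. Because the error shrinks only with the square root of the sample size, halving it again requires four times as many respondents.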

Therefore, in taking the temperature of a nation, the returns from surveying any more than a couple of thousand people are seriously diminished. For decades, this has allowed a methodological consensus to emerge, a standard, well-established basis on which to conduct and report on surveys of public opinion: construct a random sample; report margin of error; rinse, wash, repeat. But the rise of online social networks offers a disruptively different approach. Through analysis of social network sites, researchers can paint a new picture of public opinion, one with many millions of pixels. This ‘pointillist’ image stands in stark contrast, conceptually, to the finite nature of surveys; social media data belongs instead to the big data philosophy of ‘more is more’.

It is of course hard to ignore the benefits for our understanding of public opinion that the analysis of the millions of sentiments expressed freely online can bring. In particular, the predictive power of social sources of data is a persuasive reason to draw upon them. Yet studies which draw on social media data to explain public opinion, whether predictive or descriptive, must for the moment couch their findings within certain key limits – in part because the core assumption of representativeness, intrinsic to the random sample survey model, does not carry across automatically to social media data.

Why should this be so? Surely more data – more citizens expressing more opinions on more subjects – offers a more holistic view of public opinion? The issue though is not so much the size of a sample but the randomness of it. No matter how small their pool of respondents, the best survey researchers go out of their way to construct a demographically random group. There would be no point trying to gauge American attitudes towards Barack Obama just by consulting with a group of inner-city Chicagoans, for example: this wouldn’t constitute a representative sample of the overall electorate, and would in all likelihood be skewed in the president’s favour. This much is fairly obvious, but social networks pose an extra layer of complexity: not only are they likely demographically skewed, but we don’t have especially reliable indicators of in what ways or to what extent.
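A small simulation makes the point about size versus randomness concrete. Everything in it is invented for illustration: the two groups, their sizes, their approval rates and the degree of skew are hypothetical numbers, not estimates of any real electorate or social network.

```python
import random

# Hypothetical two-group population, invented purely for illustration:
# group A is 40% of people and 65% of them approve; group B is 60% of
# people and 40% approve, so true approval is 0.4*0.65 + 0.6*0.40 = 0.50.
APPROVAL = {"A": 0.65, "B": 0.40}
TRUE_APPROVAL = 0.50

def estimate_approval(sample_size, share_of_a, rng):
    """Estimate approval from a sample in which group A makes up share_of_a."""
    approvals = 0
    for _ in range(sample_size):
        group = "A" if rng.random() < share_of_a else "B"
        if rng.random() < APPROVAL[group]:
            approvals += 1
    return approvals / sample_size

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# A small sample whose make-up matches the population (40% group A).
small_random = estimate_approval(1_000, share_of_a=0.4, rng=rng)
# A sample one hundred times larger that over-represents group A (80%).
large_skewed = estimate_approval(100_000, share_of_a=0.8, rng=rng)

print(f"true approval:            {TRUE_APPROVAL:.3f}")
print(f"small random sample:      {small_random:.3f}")
print(f"large but skewed sample:  {large_skewed:.3f}")
```

With these made-up figures the skewed sample settles near 0.60 (0.8 × 0.65 + 0.2 × 0.40), roughly ten points away from the truth however many respondents are added, while the small random sample stays within ordinary sampling error of 0.50. In this toy case the skew is known and could be corrected by reweighting; the difficulty described above is that for social networks it generally is not.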

Data from the most recent Oxford Internet Survey suggests that the average UK internet user is younger, more qualified and wealthier than the average Briton in general. Yet the non-representativeness of the Internet as a whole is compounded when the question turns from who uses the Internet to what they use it for. Another recent survey from Pew Research, for example, shows that the racial background of users of popular Internet services like Twitter and Instagram differs from that of US Internet users in general. A service like Facebook, which per Pew now appears broadly representative of the Internet population, nonetheless started life as a service pitched exclusively at elite US university students – an unrepresentative bunch almost by definition.

But even if researchers could fully account for these demographic discrepancies, they would be left with an even more entrenched black box affecting the generalisability of any findings. A complex communications system, the Internet facilitates myriad patterns of communication, from a private one-to-one email exchange to a popstar posting a message to a legion of followers. In many cases, multiple functions are contained on one platform, as with Twitter, which allows asymmetrical ‘follower’ relationships, filtering tools like @replies and #hashtags, and both human and organisational account ownership, resulting in a very complexly connected network.

Inevitably, different users take advantage of these communicative facilities in different ways. Some may enjoy the increased ability to speak, whilst others may appreciate merely the wider array of content available to consume. In other words, analysing only tweet data may not offer a fair overview of Twitter usership per se, let alone a national population more generally. And when researchers use proxies such as hashtags and @replies to measure sentiment regarding a certain issue, the results may be even more skewed towards the talkative. In their analysis of the #auspol hashtag, Axel Bruns and Jean Burgess identified just such a phenomenon, finding that “of the more than 26,000 users who participated from February to December 2011, the most active one percent of users accounted for nearly two thirds of all tweets” using the hashtag.² Tweeters who used this hashtag begin to seem like the smallest in a series of Russian dolls, raising serious questions about the generalisability of findings to the wider world.
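The kind of concentration measure quoted above is straightforward to compute from per-user tweet counts. The sketch below is not Bruns and Burgess’s method or data; the function name `top_share` and the toy counts are invented to show the calculation.

```python
def top_share(tweet_counts, fraction=0.01):
    """Share of all tweets posted by the most active `fraction` of users."""
    ranked = sorted(tweet_counts, reverse=True)
    # Always include at least one user, even for very small groups.
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)

# Invented data, not the #auspol dataset: 100 users, one of whom posts
# 200 tweets while the other 99 post once each.
counts = [200] + [1] * 99
print(f"top 1% of users: {top_share(counts):.0%} of tweets")
```

In this toy example a single user out of a hundred accounts for about two thirds of the tweets, so any sentiment measure built on the raw tweet stream would mostly reflect that one voice.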

In sum, we should not simply assume that the millions of opinions expressed on popular Internet services, however diverse or interesting, really resemble the Noah’s Ark of the seventeenth-century coffee house. (Or alternatively, we could update the other side of the historical analogy, noting that for all the diversity of a coffee house clientele, it would be centuries before many of its denizens could actually vote – not to mention the servants, often female, who were seen and not heard in the background.) But none of this should rule out any use of social media data in developing our understanding of public opinion; on the contrary, there are many examples of studies which have done just that. What it does point to is the importance of understanding the nature of the data being used – or perhaps simply acknowledging some of the complexities and limitations involved with it. The forthcoming paper will flesh these issues out in more detail, and with the use of interview data will explore how researchers at the cutting edge are responding to these limitations, in theory and practice.

Read the second in my series of blog posts about big data for public opinion – on the reliability of big social data – here.