Why social data isn’t always a reliable indicator

“What social data can tell you: pretty much everything,” proclaimed Azeem Azhar, founder of PeerIndex, in a popular post on LinkedIn earlier this week. We can perhaps forgive Azhar the hyperbolic lead-in, but his article as a whole indulges in untrammeled evangelism for social data, which obscures much of the nuance and uncertainty regarding what exactly this new source of data can actually tell us about society.

I wrote earlier on the issue of representativeness of public opinion expressed on social media, but in this post I focus on a second ‘r’, the reliability of the data. How can we be sure that data harvested from social media is a reliable indicator of what people are thinking, saying or doing?

The first piece of evidence Azhar points to is a recent study from Facebook’s data science team, which showed that a couple’s real-world relationship can be characterised by how many wall posts are exchanged between them over the various stages before and after the start of a relationship. The Atlantic post which summarises the finding describes how “the number of wall posts climbs and climbs—until it tumbles when [the relationship] become[s] official.” “Climbs and climbs” and “tumbles” suggests some drastic behavioural change, but in reality, the average number of posts ranges from “a peak of 1.67 posts per day 12 days before the relationship begins, and a lowest point of 1.53 posts per day 85 days into the relationship” according to Facebook: hardly earth-shattering.
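To put that range in perspective, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the two figures quoted above; the function name `relative_drop` is my own and does not come from Facebook's study.

```python
def relative_drop(peak: float, trough: float) -> float:
    """Return the fall from peak to trough as a fraction of the peak."""
    return (peak - trough) / peak


# Figures quoted from Facebook above, in wall posts per day.
peak_posts = 1.67    # 12 days before the relationship begins
trough_posts = 1.53  # 85 days into the relationship

drop = relative_drop(peak_posts, trough_posts)
print(f"Absolute change: {peak_posts - trough_posts:.2f} posts per day")
print(f"Relative change: {drop:.1%}")
```

The "tumble", in other words, is a fall of about 0.14 posts per day, or roughly 8% from the peak.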

This is not to say that this particular finding isn’t at all interesting, or couldn’t indeed be used to predict future behaviour in powerful (and lucrative) ways. But it’s a starting point for understanding how hype over social data can cause us to misunderstand exactly what the data is and isn’t telling us.

Facebook itself knows this only too well. A paper released in January by a pair of Princeton University researchers combined an epidemiological model of social network popularity with Google Trends data to argue that Facebook faces a sharp decline, with the network expected to lose 80% of its user base between 2015 and 2017. The study certainly hoodwinked one or two journalists, but its conclusions rested on hugely flawed assumptions regarding the meaning of the data collected. The offending section is worth quoting at length:

The use of search query data for the study of OSN [online social network] adoption is advantageous compared to using registration or membership data in that search query data provides a measure of the level of web traffic for a given OSN. Web traffic is arguably the best metric for OSN health in that it represents a measure of user activity or interest within the network. For example, inactive members do not contribute to a social network and would not be counted using search data, but would be counted using other metrics such as registered account data. Thus OSN usership as measured by search query data can be thought of as representing a less tangible, albeit more meaningful, metric of user activity or interest.

Search query data is certainly “less tangible” – but also fairly meaningless in this context. It’s true that, in contrast with measuring user registration, search query data excludes inactive users, but it also excludes a whole raft of active users: in a world of iPhone apps, autocompleted URLs and browser bookmarks, many people reach Facebook without ever typing its name into a search engine, so taking search queries for the term “Facebook” as a direct gauge of the overall popularity of a social network is more than a little absurd. Facebook fired back in good humour, claiming to have detected the demise of both Princeton and the planet’s air supply by similarly relying on Google Trends data.

Making predictions based on social data, then, requires an understanding of the nature of the data harvested. Nowhere is this clearer – or are the stakes higher – than in the realm of election polling. Barack Obama wasn’t the only winner of the 2012 presidential election: statistician Nate Silver emerged triumphant after the success of his model in predicting the outcome of the election in each of the fifty states. Silver didn’t touch data from social media, relying instead on traditional opinion poll results while adjusting for their historically variable accuracy. Yet others have seen the value in analysing sentiments expressed on social networks like Twitter and Facebook to offer insight into electoral outcomes. This is another area, however, which is fraught with difficulty. A meta-analysis by Gayo-Avello looked at 16 attempts to predict elections based on Twitter data, which either counted references to different candidates or analysed sentiments expressed about candidates by users. It found that only half of them were successful in predicting electoral outcomes, an overall performance which Gayo-Avello described as “close to mere chance”. Demographic bias, spam, bots, sarcasm and other negative sentiment are all troublesome issues which can affect the efficacy of electoral prediction.

Yet this is not to suggest that any and all attempts to harvest useful information are doomed to failure. Rather, it is important simply to realise that research in this area is at an early stage. Care needs to be taken in establishing the validity of findings that emerge from datasets that are often misunderstood, and platitudes declaring that social data can tell us “pretty much everything” might more helpfully be avoided.

A good example of a more sensitive approach to interpreting data from social media recently came from Shelton et al., who harvested geotagged tweets discussing the destructive impact and after-effects of Hurricane Sandy, which struck the American eastern seaboard in October 2012. The authors found that while at a national level the intensity of tweeting about the storm matched up well with the areas of physical impact, “the relatively strong correlation between tweet density and territories most affected by Sandy breaks down at finer scales of analysis”, that is, at the level of individual parts of New York City. Staten Island, for example, accounted for half of the deaths from the storm across the whole city, but showed low levels of storm-related tweeting compared to other parts of the city. At this finer-grained level, the authors thus suggest an analytical approach more sensitive to the actual experience of people living through the storm, utilising tweets referring to specific after-effects – such as a precariously damaged crane on 47th St – in contrast to generic terms like “Sandy” and “storm”.

It’s of the utmost importance, therefore, to take seriously the social nature of social data – whether this is detecting sarcasm in references to a presidential candidate or applying sociospatial understanding to an analysis of geotagged tweets. Accounting for, or at the very least acknowledging, the limits on the reliability of social data is a vital first step on the path towards generating valid findings. Even if it makes for a less enticing headline.