The Challenge of Big Data: A Day at the British Library

On Friday I attended the British Sociological Association’s Presidential Event at the British Library, titled ‘The Challenge of Big Data’. The venue couldn’t have been more symbolically appropriate: the British Library collection holds more than 150 million books, newspapers, maps and manuscripts, which would seem to constitute big data by almost any yardstick. Moreover, the Library is in the process of digitising parts of this collection, rendering it useful to innovative research.

Yet alongside myriad possibilities, big data also presents considerable challenges to social science researchers, and it was on these challenges that the bulk of the day’s discussion dwelled. Perhaps understandably, given that most of the hundred-or-so attendees were practising sociologists, the central debate was over the changing demands on researchers in capturing and manipulating big data sets.

The first panel inaugurated this debate, presenting powerful alternative perspectives. Evelyn Ruppert, a Senior Lecturer at Goldsmiths, University of London, concerned with the sociology of data, suggested that the lack of a precise definition of big data encouraged creative if critical engagement with the phenomenon. She highlighted the increasingly sophisticated reflexivity with which users deal with the data deluge they help to create, as with the self-awareness displayed on social networks and the self-tracking practices of the Quantified Self movement. Even the recent infamous revelations regarding the National Security Agency’s mass surveillance had, she said, increased awareness of the big data phenomenon. Researchers must lead the way, then, in promoting a climate of healthy scepticism by engaging with a ‘public sociology of data’, challenging assumptions such as the notion that all members of a network are human or that all web searches reflect a desire to discover more about a subject. Social science researchers are well placed to lead such a critical engagement given decades of experience asking questions of the properties of data and information. Taken together, Dr Ruppert’s remarks presented less a challenge of big data and more a challenge to engage with it.

The second speaker on this panel was Ken Benoit. A political scientist by trade, Professor Benoit gave a talk that was very much a dispatch from the front line of big data research. The central logistical question for social science researchers is whether to hire someone with the necessary expertise or to acquire it themselves, and for those who choose the second option, Benoit provided a broad overview of the languages and packages needed to deal with the capture, storage and manipulation of big data sets. Whilst to the untrained eye the list of competencies required – along with screenshots of code – seemed daunting, Benoit insisted such a skill set was not out of the reach of researchers. Although big data was not presented as a panacea, Benoit did make a compelling case for engagement with it, and noted with regret the absence of big data methods training from most social science curricula.

The third speaker, Emma Uprichard of the University of Warwick, presented a sharper critique of big data than her fellow panellists. Professor Uprichard suggested that with big data the “sociological footprint” is more visible, requiring a greater reflexive sensibility and engagement with the how, who and why questions surrounding big data research. Uprichard proceeded to present five core difficulties in using big data: the (mis)understanding of time within the data, the scale of big data, the inability of big data analysis to deal with non-linear dynamics, the question of causality and the challenges of interdisciplinarity. In the discussion that followed, Ken Benoit suggested that information technology was the key to understanding society today, and cautioned that academia is already at least a decade behind commercial research. Evelyn Ruppert discussed the ‘neoliberalism of data’, highlighting the reconfiguration of relations between researchers and the ‘owners’ of data. Overall, the makeshift conclusion seemed to be that the development of a public sociology or political economy of big data was required, facilitating increased understanding of data in society and research.

The second panel concerned the mixing of data and methods. Professor Alan Warde began by drawing a key distinction between data collected purposely and data for which analysis is a secondary, passive activity, such as web searches or tweets. Professor Warde suggested that purposefully collected data may give researchers more and better answers, given its greater flexibility. The central test for big data, Warde suggested, is whether it changes the questions researchers can and do ask, especially given that the data is indeterminate and doesn’t speak for itself. Warde also noted that single types of data are seldom sufficient to generate explanation or theory, suggesting the need for a pluralistic approach to research. A prominent example of such an approach was then provided by Dr Abby Day, who described the use of qualitative research in understanding the findings of the UK census. In her research Dr Day sought to contextualise the surprisingly high number of Britons who identified as Christians and to consider questions over ethnicity and religion in the case of Muslim respondents; her insights were used in the design of the next census in 2011. The combined effect of this panel was to reinforce the importance of mixed methods research in the era of big data, and particularly the insights that qualitative approaches can bring to large data sets.

The third panel of the day concerned access to big data in the context of public policy. Another report from the coalface of big data came from Emer Coleman, the founder of technology consultancy dsrptn, who has a background working with open data in government. Coleman contrasted government’s focus on rationalist suppositions about the population with the reality that most of us are “more Homer Simpson than Dr Spock” in how we live our lives. Thus, technologies like social media can and should be used more as a listening and learning exercise in policy making, and more generally, open data can, if used well, inform the policy process at a much more granular, grounded level. Following Coleman, Professor Peter Elias of the University of Warwick presented an informative overview of key reports concerning the use of data by government, as well as a typology of data that government has access to, ranging from registration records to satellite imagery. Professor Elias echoed earlier concerns over secondary use data, emphasising difficulties over access and flexibility. Whilst subsequent questions noted other concerns – over privacy and the ideological nature of the questions asked of data – overall the panel served to demonstrate the potential for a revolution in how governments engage with citizens and deliver public services.

The final panel posed the question “Who counts?” and each presentation focussed on an aspect of public health provision. Professor Paul Martin sketched the transformation of the NHS, showing how in recent years it has been reimagined as a direct source of economic value. Following the centralisation of patient data and the new infrastructure around gene sequencing, for example, data has become a primary source of that value. This raises plenty of concerns, including civil liberty questions around consent and privacy, and governance questions around accountability and the commercial use of data. These concerns were expanded upon by Dr Paul Taylor, who noted the deep unpopularity of, and public discomfort with, the commercial use of data, and by Professor Andrew Goffey, who noted the shift of technical expertise from government to business.

The final discussion drew together many of the strands of the day’s debate. The emergence of big data has brought into sharper relief both the commercialisation and the ‘technicalisation’ of academic research, and many of the audience responses were framed in opposition to these trends. The big data revolution is both a symptom and now a cause of these tendencies, and it is not surprising that the majority of attendees, who have spent their academic lives mastering methods designed for the world of small data, should take a sceptical view or develop a critical perspective. Nonetheless, as was pointed out, refusing to engage with an analytical approach or technique because of its difficulty or complexity is hardly a defensible position, and the opportunities of big data, many of which were on display during the day, behove researchers of all stripes to contemplate big data critically but also with an open mind. This, then, was the central challenge laid down by the event.

As we were reminded throughout the day, establishing causality is a thorny issue, but if some social researchers come to adapt their substantive and methodological toolkit to incorporate big data, an event like this might fairly claim some responsibility. For the many others present who sounded less convinced, though, my suspicion is that big data approaches will remain at arm’s length.