Reflections on Making ‘Big Data’ Human

If there was one thing that the Making Big Data Human conference made clear, it was that ‘Big Data’, and indeed digital methodologies in general, provide some very exciting opportunities to advance historical research. From the ambitious and wide-ranging National Archives’ Traces Through Time project, which seeks to create a generic method for tracing historical individuals across enormous datasets, through to the more specific but equally exciting Casebooks Project, the conference participants were treated to a feast of ideas about how historical methods are adapting to the changing nature of data in a digital age.

But what exactly is ‘big data’, and what did the Doing History in Public team have in mind when we decided to explore how we could make it ‘human’? The basic definition of ‘big data’ is ‘extremely large data sets that may be analysed computationally’.[1] For historians this might, as Jane Winters demonstrated in her keynote lecture, be a case of using the archived web as an historical source, or of exploring parliamentary proceedings from three different countries over a period of more than 200 years.

Such overwhelmingly large datasets present us with opportunities that were simply not available before the digital age. Even the most patient and committed of historians could not find enough time in a physical archive to read through and, crucially, derive some sort of meaning from the entirety of Hansard, let alone compare it with its equivalents in other countries. However, ‘big data’ does present some challenges for historians, and this is where the task of making it ‘human’ becomes so important. In her keynote lecture, Jane Winters highlighted the ‘black box problem’. We are collecting and analysing a lot of data through computational methods, but do we really know what the tools are showing us? Are we in danger of losing our discipline to the machine?

The papers given at the conference suggested probably not. The discussions after each panel, during the coffee breaks, and in the roundtable session at the end of the day, demonstrated just how aware those interested in ‘big data’ are of the dangers of privileging whatever evidence happens to be available, of creating false categories to cope with the binary nature of digital systems, and of the need to remember that ‘big data’ is not sterile or apolitical. But, on a more positive note, it was clear that there are ways in which historians can handle ‘big data’ in a manner suitable to the discipline.

The first panel of the day emphasised that a critical approach to ‘big data’, as with most historical enterprise, is key. This can be achieved through reassessing terminology, as Alexander von Lünen demonstrated with his challenge to the very term ‘digital history’ itself, or through a critical approach to methodology. Pim Huijnen critiqued the idea of ‘embracing messiness’ and ‘letting the data speak’ advocated by Mayer-Schönberger. In the humanities, it is necessary to adopt an approach in which the scholar is in charge and uses data flexibly and transparently. Yet, it is also important for historians to maintain realistic expectations, as Richard Deswarte’s paper highlighted. The history of UK companies on the web presented by Marta Musso provided a working example of the opportunities and challenges brought by automated research on web archives. Web archives are a good reminder of the radical changes brought about by Google in our approach to information retrieval. The discussion of the pre-Google era of the web led into a number of questions about ethics which continued to crop up over the rest of the day, culminating in the roundtable session, where Alan Blackwell called for a more ‘humane computing’, in which human authorship is given credit and the global north does more than simply profit from the ‘free data’ of the global south.

That the humanities are becoming better at representing data visually using digital tools was clearly demonstrated in the second panel session, which also highlighted the collaborative nature of ‘big data’. Catherine Porter introduced the work being done on the Spatial Humanities project, which combines corpus linguistics and GIS to investigate ideas of place. This was followed by Cheng Yang’s presentation of findings on population density in Imperial China, again demonstrating that the use of a GIS-based method should be driven by the research, not by the data. Across the whole conference, it was emphasised that collaboration is integral if historians are to fully understand and make good use of ‘big data’. However, the form that collaboration should take had been further unpicked by the end of the day. Collaboration cannot simply be a one-way process in which historians benefit from the expertise of others. Instead, and rather importantly, what is needed is a multi-way process of collaboration in which the skillset of the historian is highly useful.

A further lesson which emerged from the day was that historians can, and should, build their datasets in ways which give users, both researchers and others, as much scope as possible to use the data as they see fit. The presentation given by Lauren Kassell and Michael Hawkins on the Casebooks Project showed how the creation of such a dataset could be challenging, but that the potential was huge, especially when combined with public engagement and outreach. The variety and range of datasets that can be created to suit specific case studies was seen in the third panel session, where Federico Nanni’s digital approach to assessing the interdisciplinarity of PhD dissertations contrasted with John Herson’s prosopographical study of Irish immigrant family history. Adam Crymble’s appeal in the roundtable for the conference attendees (and anyone following these discussions afterwards) not to build a “Tuesday hammer”, where data can only be used in a specific way, was perhaps the rallying cry of the conference.

Many important questions were raised during the conference which, although explored in vibrant discussions, were by no means answered. How do we ensure non-English languages are included in discussions on ‘big data’ in the humanities, alongside the more ‘Anglophone’ web? What can be done to manage and curate data in the most ethical and valuable way? How can we ensure that ‘big data’ does not actually contribute to creating more data silences? And, is all data equal? The aim of the conference, however, was never to reach ‘definite answers’ on topics which are daily revealing more conundrums and queries. Instead, what we hope that the conference itself, the live-tweeting on #makingbigdatahuman which allowed those who could not attend to follow and contribute, and the creation of a Storify of the event achieved, was to gather scholars together from a range of backgrounds and levels of research to discuss these topics and to spark interest in the graduate history community about engaging with ‘big data’. Perhaps the most human aspect of ‘big data’ is its ability to generate such active and lively debate.