Trust in digital tools is crucial for the success of Big Data research in the Social Sciences and the Humanities

Readers of this blog series are familiar with both the philosophy of the Seshat project and the tremendous potential of the Seshat Databank. What may be less clear is how we actually put the databank together. Inevitably, we had to make some important conceptual and practical choices to ensure that the Seshat data both offer as much value as possible to a wide range of users and represent a realistic output that we could deliver within our budget and timeframe.

As 2017 will see the release of a large amount of Seshat data, along with papers based on the analysis of these data, the Seshat team wants to communicate as openly as possible about these conceptual and practical choices and their impact on the quality of the Seshat data. Both the credibility of our papers and the use of our data by the wider scholarly community ultimately depend on the extent to which our data gathering and curating workflows are solid and trusted, so we want to provide our users with full and transparent information. What sets the successful large-scale digital humanities projects of the past decade apart from similar projects with more lacklustre involvement from the user community is the trust and openness that the successful projects instill in their users. Documenting our workflows and their implications for data quality is a first step in generating this trust.

To that end, two papers (which will be posted soon as part of our Working Papers series) will describe different aspects of our methodology. The first paper will provide a detailed overview of the entire Seshat workflow. It will explain each stage of the workflow, from the initial selection of a specific theory to test with Seshat data, through the data gathering process, to the write-up of the analysis. For each step, it will address the relevant data quality issues. The paper will also introduce the metrics used to gauge productivity across the databank, as well as the indices of data quality and the levels of system agility that we are tracking throughout the lifespan of the Seshat project. In addition, the paper will serve as a catalogue of ‘lessons learned’, which will be of direct use to project managers of other large-scale digital humanities projects.

The second paper focuses specifically on the ‘human’ element of the Seshat Databank, including an assessment of how the academic background of research assistants and the level of training they receive in the Seshat data gathering workflows affect data quality. Our data gathering procedure involves not just entering information, but also complex decision-making processes that require knowledge, skill, and appropriate working conditions to ensure that the information entered is correct, valid and appropriately referenced. Research assistants must therefore not only be highly trained and skilled but also operate in an environment that enables them to work to the highest standards.

A transparent and open discussion that weighs issues of data quality against the delivery of practical ‘value for money’ is vital for the uptake of a project and its associated tools by the wider community. This level of transparency also carries risks, but we are committed to showing openly all the choices we have made, because we want people to critique our methods and help us improve upon them. Seshat is growing continuously, and we are keen to include a healthy crowdsourcing and even citizen science element in the project. With this in mind, we hope that possible disagreements about important conceptual and practical considerations and their impact on data quality will turn into specific and actionable suggestions that can improve the project. To this end, we have included a ‘comment/augment’ button on the web page for the release of our first batch of Seshat data. We hope this will generate many responses and ultimately lead to even higher levels of data quality!

Data from the Seshat Databank (data.seshat.info) are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) licence (https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode).