Dacura creates high-quality social science datasets for researchers

Innovations in computer science open up new possibilities for archaeologists and support ambitious projects like the Seshat: Global History Databank. ALIGNED’s Dacura platform is an exciting new development in the field of computational archaeology. The Seshat project’s board of directors, along with consultant and archaeologist Peter Peregrine and former director Rob Brennan, have published a new working paper with the Santa Fe Institute on the Dacura program and its connections to the Seshat project.

The Internet is vast and disorganized. The average academic researcher deals with thousands of unrelated search results on Google, Google Scholar, or repositories of academic articles like JSTOR. Last week I was looking for information on specific Han dynasty uprisings for the Seshat project. A Google search for more information on the 197 BCE uprising led by the Marquis of Dai, Chen Xi, gives a few relevant results but also brings up modern business owners with the same name and stories about a 2011 activist with the same name. The relevant sources also lack quality control: have these articles been confirmed by experts? The Dacura platform hopes to help researchers by creating an automated process to weed out unrelated information, leaving only the relevant and useful.

Researchers will be able to define the parameters of the data they’re looking for, and the Dacura system will support them by searching the Internet to compile high-quality information for their dataset. Dacura improves on simple ‘keyword’-type searches by combining computer readability with human input. The Dacura workflow involves data harvesting from high-quality Internet sources, dataset curation with human input, and expert analysis of the data. The result is a high-quality dataset.

A graphic describing the Dacura workflow. The full working paper is available from the Santa Fe Institute.

Dacura was also designed around current RDF and Linked Data standards. This means not only that it can help researchers find data on a topic, but also that once a dataset is compiled, Dacura publishes it as Linked Data so that other interested people can get their hands on it without having to spend the time and energy of starting from scratch. Using RDF means that all the information is structured and layered. For instance, data in Seshat are ‘tagged’ with information about location (where did the event being described happen) and time (when did it happen). This helps relate the different data points to one another and, crucially, can be understood easily by both humans and computers.
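To make the idea of ‘tagging’ concrete, here is a minimal sketch of how one data point could be written down as RDF-style subject/predicate/object triples. Everything in it is an assumption made for illustration: the `http://example.org/seshat/` namespace, the property names `label`, `location`, and `year`, and the convention of writing BCE years as negative numbers are all hypothetical and are not the actual Seshat or Dacura vocabulary.

```python
# Hypothetical namespace; NOT the real Seshat/Dacura vocabulary.
EX = "http://example.org/seshat/"


def triple(subject, predicate, obj):
    """Build one (subject, predicate, object) triple in the example namespace."""
    return (EX + subject, EX + predicate, obj)


def describe_event(event_id, label, place, year):
    """Return the triples that tag one data point with a label, place, and time."""
    return [
        triple(event_id, "label", label),
        triple(event_id, "location", EX + place),  # object is another resource
        triple(event_id, "year", year),  # sketch convention: negative = BCE
    ]


def to_ntriples(triples):
    """Serialise triples in an N-Triples-like form, one statement per line."""
    lines = []
    for s, p, o in triples:
        is_resource = isinstance(o, str) and o.startswith("http")
        obj = f"<{o}>" if is_resource else f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


revolt = describe_event("revolt_197bce", "Revolt of Chen Xi", "HanChina", -197)
print(to_ntriples(revolt))
```

Because each statement is explicit about what is being described, which property is being recorded, and what its value is, a person can read the output line by line and a program can parse it without guessing.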

So, if I study a revolt in 197 BCE and I want to find more information, Dacura tags all of the information I input about the rebellion with the time 197 BCE and its location in Han-period China. This makes sense to me as a human, but the computer also knows that Han-period China is a subset of the territory of modern-day China, which is a subset of the region known as East Asia, and so on. This is important, because other online databases also tag their data with similar temporal and geographic information. This is where Linked Data comes in: because Seshat data is well-structured and expressed in RDF, Dacura will help me gather all of the other structured data that exists on the internet tagged with the location Han-period China and a time containing 197 BCE (from DBpedia or Wikidata, or historical datasets like the ChinaHGIS project or Pelagios, or an archaeological database like OpenContext). This dramatically increases the amount of information I have at my fingertips to answer key historical questions, with minimal effort. Then, when I’ve generated a lot of new data about the rebellion, Dacura pushes it all back onto the web as structured data for the next researcher to come along and find. Together, we’re building an ‘internet of information’, each effort helping the next in our quest to understand our collective past.
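The reasoning in this example, where a place sits inside larger regions and a record matches when its time span contains the year of interest, can be sketched in a few lines. This is only an illustration under stated assumptions: the `PARENT` table, the sample `RECORDS`, and the function names are all made up here, and BCE years are again written as negative numbers. It is not how Dacura itself is implemented.

```python
# Hypothetical containment hierarchy: each place maps to the region containing it.
PARENT = {
    "Han-period China": "China",
    "China": "East Asia",
}

# Made-up records standing in for entries from external structured datasets.
RECORDS = [
    {"id": "a", "place": "Han-period China", "start": -206, "end": -195},
    {"id": "b", "place": "Han-period China", "start": -150, "end": -140},
    {"id": "c", "place": "Roman Italy", "start": -200, "end": -190},
]


def regions_containing(place):
    """Return the place itself followed by every larger region that contains it."""
    chain = [place]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def is_within(place, region):
    """True if `place` is `region` or lies somewhere inside it."""
    return region in regions_containing(place)


def find_records(records, region, year):
    """Ids of records located within `region` whose time span contains `year`."""
    return [
        r["id"]
        for r in records
        if is_within(r["place"], region) and r["start"] <= year <= r["end"]
    ]


print(regions_containing("Han-period China"))
# ['Han-period China', 'China', 'East Asia']
print(find_records(RECORDS, "East Asia", -197))
# ['a']
```

A search for East Asia in 197 BCE finds record `a` even though that record is only tagged with Han-period China, because the hierarchy supplies the missing link. Record `b` is in the right place but the wrong period, and record `c` is in the right period but the wrong place.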

The Seshat project is built upon and supported by the Dacura platform. Seshat data has undergone Dacura’s data harvesting and review process as shown in the workflow graphic. It is how we’ve been able to gather so much data (almost 200,000 data points covering over 400 historical polities) over the past few years, and it helps us curate this massive store of historical information every day. Dacura’s system is also what allows us to publish our data on our public website, as it harvests and collates information stored in the Databank in real time and shows it on the web. While we of course want scholars from all fields to use our data in their research, the real benefit of large, well-structured datasets like Seshat is that they provide ready-to-access data that can be combined with data from other sites, as well as all the new information that archaeologists, historians, and others are digging up every day. As the authors of this working paper note:

By providing a semi-automated means of harvesting, evaluating, and exporting archaeological data that has been evaluated for accuracy, Dacura provides both a means and a model for economists, political scientists, ecologists, geographers, and others to access and explore the rich and valuable record of the human past.

Data from the Seshat Databank (data.seshat.info) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike license (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode).