Lessons in Data Reuse, Integration, and Publication

On April 17, members of the Central and Western Anatolian Neolithic Working Group met at Kiel University to participate in the International Open Workshop: Socio-Environmental Dynamics over the Last 12,000 Years: The Creation of Landscapes III. Working group participants presented their hot-off-the-press analyses of various aspects of integrated faunal datasets from over one dozen Anatolian archaeological sites spanning the Epipaleolithic through the Chalcolithic (a range of 10,000+ years). Several more sites will add data to the project in the coming months to ensure that the resulting collaborative publications are as comprehensive as possible.

These presentations took place in the session Into New Landscapes: Subsistence Adaptation and Social Change during the Neolithic Expansion in Central and Western Anatolia. The session, which was chaired by Benjamin Arbuckle (Department of Anthropology, Baylor University) and Cheryl Makarewicz (Institute of Pre- and Protohistoric Archaeology, CAU Kiel), included a panel of presentations followed by an open discussion.

A bit of background: Over the past five months, with enabling funding from the Encyclopedia of Life (EOL), we have worked with participants in this project to prepare their datasets for publication. Each participant contributed a dataset to be edited, published in Open Context, and integrated with the other datasets. Rather than ask all participants to analyze the entire corpus of datasets, we asked each participant to address a specific topic. Each topic (for example, “sheep and goat age data” or “cattle biometrics”) required access to a smaller set of relevant data, and the participants presented their analyses of those data at the Kiel conference.

The research community has very little experience with this kind of collaborative data integration. Archaeology rarely sees projects that go beyond conventional publication outcomes to also emphasize the publication of high-quality, reusable structured data. After months of preparing datasets for shared analysis and publication, I was really looking forward to seeing the research outcomes unfold.

As an added bonus, our colleagues from the DIPIR project joined us there to document the data publishing and collaborative data reuse processes. We felt very fortunate that the DIPIR team members could apply highly rigorous methods to observing and studying how researchers grappled with integrating multiple datasets. We’re looking forward to learning from the DIPIR team as they synthesize their observations on how researchers collaborate with shared data.

In the meantime, we’d like to share some initial impressions and lessons on data reuse that emerged from this work:

Full data access can improve practice. We can learn a lot by looking at how others record data. Some may see sharing our databases and spreadsheets as opening ourselves up to criticism. However, such sharing can greatly improve the consistency of our data recording, and therefore facilitate meaningful data integration. In this one-day workshop alone, we identified a few key areas where zooarchaeologists can improve their consistency in data recording.

An example of this from the workshop: Although all zooarchaeologists record age data based on the fusion stage of skeletal elements, some elaborate on their notations where others don’t. For example, an unfused calcaneus of a sheep might come from a newborn lamb or from a sheep up to about two years of age (when the calcaneus fuses). One researcher might put a note in a “Comments” field indicating that the bone is from a neonate. Another researcher, dealing with the same specimen, might leave the notation simply as “unfused.” Thus, two recording systems can lead to very different results: one recognizes the newborn lambs in the assemblage, and the other lumps them with the other “sub-adult” sheep. In aggregate, such differences can lead to vastly different interpretations of an assemblage. These recording discrepancies become apparent when data authors begin looking “under the hood” at each other’s datasets. Recognizing these discrepancies and their possible effects on interpretation can inform better practice in data recording, and thus work toward improving future comparability and integration of published datasets.
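To make the calcaneus example concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the records, the field names ("fusion", "comments"), and the three age labels are invented for illustration and are not drawn from any of the workshop datasets. It simply shows how the same four specimens produce different age profiles depending on whether the analyst noted neonates in a comments field.

```python
from collections import Counter

# Hypothetical records: the same four sheep calcanei as entered by two analysts.
# Recorder A notes neonates in a free-text comments field.
RECORDER_A = [
    {"element": "calcaneus", "fusion": "unfused", "comments": "neonate"},
    {"element": "calcaneus", "fusion": "unfused", "comments": "neonate"},
    {"element": "calcaneus", "fusion": "unfused", "comments": ""},
    {"element": "calcaneus", "fusion": "fused", "comments": ""},
]

# Recorder B records only the fusion stage and leaves comments empty.
RECORDER_B = [dict(record, comments="") for record in RECORDER_A]


def age_class(record):
    """Assign a coarse age label from the fusion stage and any comment."""
    if "neonate" in record["comments"].lower():
        return "neonate"
    if record["fusion"] == "unfused":
        # Without further notes, an unfused bone can only be called sub-adult.
        return "sub-adult"
    return "fused"


def age_profile(records):
    """Count specimens per age label."""
    return Counter(age_class(record) for record in records)


profile_a = age_profile(RECORDER_A)
profile_b = age_profile(RECORDER_B)
print("Recorder A:", dict(profile_a))
print("Recorder B:", dict(profile_b))
```

With these invented records, Recorder A's profile keeps two neonates apart from one sub-adult, while Recorder B's profile reports three sub-adults and no neonates, even though both describe the same bones.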

While data preservation is a good motivation for better data management, we think a professional expectation for data will help motivate researchers to create better data in the first place. The discussions provoked by this study help us to better understand what “better data” may mean in zooarchaeology.

Documenting data in anticipation of reuse. I think we can all agree that datasets must contain certain critical information or they will not be useful to future researchers. But here’s the catch: Information deemed “critical” for one project is not the same for another project. Sure, there may be a baseline of key information that applies to all projects (location, date, author, etc.), but there is a much larger amount of discipline-specific or even project-specific information that needs to be documented to enable reuse. To complicate things, the absence of this documentation may only be noticed upon reuse. That is, the project may appear well-documented until an expert attempts to reuse the dataset.

An example: Some datasets in this study contained a large number of mollusks. From the perspective of a data re-user wanting to integrate multiple datasets, this poses a big question: Does an absence of mollusks at the other sites mean that the ancient inhabitants did not exploit marine resources? Or is their absence simply a result of the mollusks not having been included in the analysis (either not collected or perhaps set aside for analysis by another specialist)? Understanding this absence of data is critical for any reuse of the dataset.
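One way to picture the difference between "none found" and "not analyzed" is the short Python sketch below. It is only an illustration under stated assumptions: the site names, counts, and the "analyzed_groups" documentation field are all hypothetical, and we are not describing how any dataset in this study, or Open Context itself, actually records scope. The idea is that a dataset which documents what was in scope lets a re-user tell a true zero from a gap.

```python
from typing import Optional

# Hypothetical datasets. "analyzed_groups" is an invented documentation field
# stating which taxonomic groups the analyst actually examined.
DATASETS = {
    "Site A": {
        "analyzed_groups": {"mammals", "mollusks"},
        "counts": {"mammals": 1200, "mollusks": 340},
    },
    "Site B": {
        # Mollusks were set aside for another specialist, so they are out of scope.
        "analyzed_groups": {"mammals"},
        "counts": {"mammals": 800},
    },
    "Site C": {
        # Mollusks were in scope, but none were found.
        "analyzed_groups": {"mammals", "mollusks"},
        "counts": {"mammals": 950},
    },
}


def count_for(dataset: dict, group: str) -> Optional[int]:
    """Return the specimen count, or None if the group was never analyzed."""
    if group not in dataset["analyzed_groups"]:
        return None  # unknown: absence of data, not evidence of absence
    return dataset["counts"].get(group, 0)  # in scope, so a missing count is a true zero


mollusk_counts = {site: count_for(data, "mollusks") for site, data in DATASETS.items()}
print(mollusk_counts)
```

In this sketch, Site B yields None (no basis for any claim about marine resources) while Site C yields 0 (mollusks were looked for and not found). Without the scope field, both sites would look identical to a re-user.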

This highlights the important role of data editors and reviewers, who can work with data authors to identify and gather this key information at the time the dataset is disseminated (rather than having questions come up years later upon reuse). Furthermore, not just anybody can review the dataset. Knowing if a dataset is documented sufficiently requires in-depth knowledge of the subject matter, and the ability to project potential applications of the data to anticipate questions that might arise with future use.

The benefits of peer review via data reuse. Data publication is still in its infancy. There is a lot of exploration taking place as to what “data publication” means and how it should be carried out. If it mimics conventional publication, peer review of datasets would occur before their publication. However, our data reuse studies are showing that, in fact, the most comprehensive peer review of data occurs upon its reuse. It is only at the time of reuse that a dataset is tested and scrutinized to the point where key data documentation questions emerge. This may only be an issue in today’s data-sharing world. Perhaps future data authors, accustomed to full and expected data dissemination, will practice exhaustive documentation from the get-go. But what do we do now? How does post-publication peer review, which appears to be so critical to documenting datasets properly, fit with models of data publication?

In our experience, many questions came up in data reuse that could have been answered with more extensive data documentation. However, a data creator will never be able to anticipate all the possible questions. Data documentation can be enriched and improved with a kind of “peer review through reuse.” Adding post-publication information to a data publication would not only help enrich and improve the documentation. It would also provide information about a dataset’s reuse and impact, feedback that many data authors would really like to see.