Aggregating High Energy Physics (HEP) literature in INSPIRE

Abstract

The leading information platform for High Energy Physics (HEP) literature, INSPIRE [1], provides users with high-quality, curated content covering the entire corpus of HEP literature, along with the fulltext of all articles that are Open Access [2]. INSPIRE is a collaboration between four major particle physics labs worldwide: CERN, DESY, Fermilab and SLAC. Built upon open source technologies, the platform is the natural successor to the more than 40-year-old SPIRES database [3].

INSPIRE serves the global HEP community with content from hundreds of scientific journals and online repositories. New and updated articles and metadata from these sources are ingested daily, then indexed and made accessible through a series of automatic and semi-automatic processes. These processes consist of many layers in which text and data are mined to extract information on authors, references and scientific images alongside the metadata, often in a way tailored specifically to the data source. Incoming articles are also matched against pre-existing articles in the database. Highly valued metrics, such as the number of citations, are derived by analyzing references in metadata and fulltext documents, providing increased scientific value for our users.
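As a minimal illustration of how citation counts can be derived from extracted references, the sketch below resolves each reference to a known record and counts distinct citing records. It is not INSPIRE's actual implementation: the record layout (`id`, `identifiers`, `references`), the function names and the sample identifiers are all hypothetical, and real reference matching is considerably fuzzier than an exact identifier lookup.

```python
from collections import Counter


def build_identifier_index(records):
    """Map every known identifier (case-insensitive) to its record id."""
    index = {}
    for record in records:
        for identifier in record["identifiers"]:
            index[identifier.lower()] = record["id"]
    return index


def count_citations(records):
    """Count, for each record, how many other records cite it."""
    index = build_identifier_index(records)
    counts = Counter({record["id"]: 0 for record in records})
    for record in records:
        # Resolve references to known records; a set avoids counting
        # the same cited record twice for one citing article.
        cited = {
            index[ref.lower()]
            for ref in record["references"]
            if ref.lower() in index
        }
        cited.discard(record["id"])  # ignore self-citations
        for cited_id in cited:
            counts[cited_id] += 1
    return dict(counts)


# Hypothetical sample data, for illustration only.
records = [
    {"id": 1, "identifiers": ["arXiv:0001.0001", "10.1000/a"], "references": []},
    {"id": 2, "identifiers": ["arXiv:0002.0002"],
     "references": ["arXiv:0001.0001", "10.1000/A"]},
    {"id": 3, "identifiers": ["10.1000/c"],
     "references": ["10.1000/a", "arXiv:0002.0002", "10.9999/unknown"]},
]
citation_counts = count_citations(records)
print(citation_counts)
```

References that do not resolve to any known record are simply skipped here; in a curated system such cases would be candidates for later matching or manual review.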

During these processes, human intervention is often needed in order to properly classify, validate or edit the extracted data. In some cases, community-facing crowdsourcing even helps to enrich the data, for example in author disambiguation and user submissions.
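One common way to combine automated processing with human validation is to route only uncertain cases to a curator. The sketch below assumes that the matching step produces a score between 0 and 1; the thresholds, labels and function name are hypothetical and are not taken from INSPIRE.

```python
def triage(match_score, accept_at=0.9, reject_below=0.5):
    """Decide how to handle an incoming article given a match score.

    The thresholds are illustrative assumptions, not tuned values.
    """
    if not 0.0 <= match_score <= 1.0:
        raise ValueError("match_score must be between 0 and 1")
    if match_score >= accept_at:
        return "auto_match"      # confident match with an existing record
    if match_score < reject_below:
        return "new_record"      # confident that no existing record matches
    return "curator_review"      # uncertain: hand over to a human curator


decisions = [triage(score) for score in (0.97, 0.72, 0.10)]
print(decisions)
```

Widening the gap between the two thresholds sends more items to curators and fewer to automatic handling, which is the basic trade-off between curation effort and data quality.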

Distributing curation work in this way, and further augmenting the data with automated workflows, can simplify curation procedures while maintaining the desired high level of quality. This presentation briefly describes these techniques, walks through sample user experiences of both curators and physicists, and looks at the challenges of dealing with a wide variety of ingestion sources. It shows that human supervision remains a vital part of aggregating content for a subject repository, as fully machine-driven curation is still far from perfect.