THE ROLE OF LIBRARIANS IN WIKIDATA AND WIKICITE

June 6, 2017

The other week I participated in WikiCite 2017, a conference, summit, and hackathon event organized for members of the Wikimedia community to discuss ideas and projects surrounding the concept of adding structured bibliographic metadata to Wikidata to improve the quality of references in the Wikimedia universe. As a Wikidata editor and a librarian, I was pumped to be included in the functional and organizational conversations for WikiCite and learn more about how librarians and GLAMs can contribute.

The Basics (briefly and criminally simplified)

Galleries, Libraries, Archives, and Museums are institutions that collect, preserve, and make available information artefacts and cultural heritage items for use by the public. Before databases librarians managed card catalogs to facilitate access, which were translated into MAchine Readable Cataloging (MARC) data format digital records to create online catalogs (ca. 1970s-2000s). As items in collections are being digitized, librarians et al. add descriptive, administrative, and technical/structural metadata to records and provide access to digital surrogates via digital library or repository, depending on copyright. Metadata, however, is generally not subject to copyright and is often published by GLAMs for analysis and use via direct download, APIs, and in more and more cases, as Linked Open Data. As a field, we’re still at the beginning of this transformation to Linked Open Data and have significant questions still to answer and thorny issues still to resolve.

Wikidata is a source of machine-readable, multilingual, structured data collected to support Wikimedia projects and CC0 licensed under the theory that simple statements of fact are not subject to copyright. Wikidata items are comprised of statements that have properties and values. In the Linked Open Data world these items are graphs with statements expressed in triples. As Wikimedians and Wikidata editors add more of this supporting structured data to Wikipedia, the idea of adding bibliographic metadata to Wikidata started coming up. Essentially – “Here are some great structured data that are incredibly important to the functionality of Wikipedia; how can we add them to this repository that we’re creating in a useable way?” As many librarians (and really anyone that’s written a substantial research paper) are aware, citations are complicated.

Theory

The success of Wikipedia is built on the community’s requirement of verifiability. For Wikipedia this means that added material must have been previously published or is clearly supported by a reliable source. Currently, references are generated using a system of templates that are powerful but unsophisticated and bury citation data in the bodies of articles.

Wikipedia had built the largest curated bibliography in the world and it is largely not discoverable, very difficult to maintain, very hard to analyze, and unusable. The first WikiCite meeting in 2016 set goals and laid foundations for figuring out how to make citations meaningful in Wikimedia. There have been several open source projects created in the last year within this community looking to understand what can be done with citation data. Wikidata is driving the machine-readable, queryable aspect of data within the Wikimedia community and as more people understand what it is and what it’s capable of, more ideas pop up.

Sources are the foundation of Wikipedia’s claim to authority, which depends on the ability of Wikipedia to deliver source information, a.k.a. metadata, a.k.a. citations. Heather Ford’s talk “The Social Life of (Wikipedia) Sources” on Day 1 discussed the role of citations in Wikipedia and the paradigm of digital references for knowledge representation in order to familiarize WikiCite-ers with the theoretical foundations of information evaluation. Ford described her attempts to answer the question “What is a reliable source?” in a meaningful way by investigating Wikimedia policy and community recommendations and analyzing citations in Wikipedia to identify patterns. Because of information’s role as a mediated representation of knowledge (not knowledge itself), the selection and reference of sources is inherently biased. Therefore, it is the responsibility of designers and data modelers within the WikiCite and Wikidata community to “map the patterns of inevitable systemic bias and constantly calibrate the system.”

Librarians can offer some good guidelines on source selection and citation in the Wikimedia community. ACRL has a wonderful information literacy framework that identifies core concepts for understanding disseminated knowledge. FRBR was designed by librarians to create an entity-relationship model for databases (catalogs) to reflect the conceptual structure of information resources. Another Day 1 talk from Andrea Zanni, “Wikidata and Booooks”, opened with the slide “books are complex”. Indeed! Last year’s WikiCite 2016 included an attempt to build a data model for books on the presumption that a book is a knowable, simple item for a library or repository. Zanni described the group’s reliance on FRBR as a model through which to understand the structure and biases of a book, and how or whether to represent that in Wikidata. It is far more complex than simply determining a set of properties to describe a book because the metadata in a data model is insufficient to accurately represent the relationship between the item and the work. One way to manage this is to create multiple entities (Wikidata items) and place them within a hierarchy or graph of the work. But there are hundreds of millions of works – is it reasonable to expect WikiCite to add double or quadruple the amount of items for works? Alternatively, a 1:1 ratio inevitably omits a lot of significant information.

As a librarian I could spend several more pages writing about the theory of references, citations, bibliographies and the representation of knowledge, but I thought I would dive right in with some BHL metadata to see what is currently possible for your average Wikidata editor.

Practice

Creating a central repository for structured, separable, and open bibliographic metadata is essential for connecting the sum of all human knowledge. In this, librarians can be useful allies. WikiCite draws from the academic community in which citation data (generally from peer-reviewed articles) is crucial for creating and linking knowledge. This graph from Joe Wass’s slide deck for his talk “Crossref Event Data: Transparency First” shows Wikipedia as a significant entry point to scholarly literature.

Indeed, the focus thus far on adding journal articles to Wikidata has been driven by the scientific community and has implicated publishers more than libraries. GLAMs, however, have *tons* of bibliographic and collection metadata for heritage literature, historical texts, special collections, and institutional archives. These data can fill in gaps in citation graphs, create context for other items, and add valuable structured historical content to WikiCite and Wikidata.

In order to see exactly where we were with WikiCite I pulled some bib records from BHL and went ahead and added them to Wikidata. I created several manually and then tried the Source MetaData tool to import via DOI. I also downloaded the database tables as .tsv files, and hoped to add those to Wikidata. Unfortunately I can’t find a tool or method to support this. I’m also not so sure how well the tables’ structure will work with Wikidata. The big positive take away is that there is a “BHL Page ID” property in Wikidata that I can add a statement for. So in addition to adding a item I can link that item directly to BHL’s book viewer! This seems like a great way to open up collections!

On the other hand, adding bib records to Wikidata is not without work. All of the work to resolve author names has to be done manually. When creating a new item by hand, I had to research Wikipedia and Wikidata to locate the correct forms of author names to list in the P50 – Author (Wikidata item) property. Source MetaData adds authors as values for P2093 – Author String, not P50. Consider how “Alexander Agassiz” is the name for the Wikipedia article about the scientist, “Alexander Emanuel Agassiz” is the label for the Wikidata item, and BHL reconciles the author name as Agassiz, Alexander, 1835-1910 from the Library of Congress Name Authority File. Wikidata is also insufficient to accurately represent many information resources. For example: it’s difficult to connect books that are part of monographic series (not serials). Bibliographically it’s important to maintain the relationship between series and book, but I can’t figure out how to. There are also no data models to add and subsequently cite primary source material. One can add a field notebook as a book, but context for expeditions, other scientists, inclusive date ranges instead of a single publication date, the museum or special collections that holds the item, and other organizational relationships are lost.

To conclude then, after learning more about the theory behind WikiCite and where the community is at in terms of adding citation data to Wikidata, I think there are two important functions that Wikidata needs to develop in order to take advantage of GLAM metadata.

Batch importing – a way to generate Wikidata items (and from that, WikiCite citations in other Wikimedia projects) from CSV, TSV, XML, MARC files at scale.

Also, the opposite – as users manipulate and enrich Wikidata items, GLAMs need a way to harvest that new data/metadata and add it our repositories and catalogs.

As WikiCite continues to grow, it will be interesting to follow how (hopefully not whether!) the community integrates GLAM metadata. Generating Linked Open Data citations has the potential to connect objects and concepts with information resources to create context for more accurate and interesting digital representations of knowledge and cultural heritage. And more complete and complex graphs are better at supporting deeper investigations, queries, and visualizations of these data across repositories, collections, and knowledge bases.