Linked Data part 2: Where Is the Data?

Linked Data is a relatively new phenomenon on the World Wide Web, providing access to structured data. What is structured data? The World Wide Web is now a universal vehicle for human-readable information - websites, articles and apps all give us information that we can read and interpret, for example an answer to the question “when is the next bus coming to this bus stop?” Such information is not easy for a computer to read - it does not know what “this stop” means, whether you are waiting for a specific line or for any bus, etc. Computers require information with a structure, which can, for example, take the form of label:value pairs (“bus stop number:4398, bus line:Q11, distance from the stop:2.5 miles”, etc.).

Information is commonly stored in databases, which have evolved to be very efficient at data storage and retrieval, but terrible at information sharing. Each database has lots of columns, each named differently, and only the local computer system knows how to retrieve the data. This is where the new concept, Linked Data, comes to the rescue. Linked Data is a system that makes computers understand each other by labeling databases with metadata. Its metadata scheme, RDF (Resource Description Framework), requires that data come not in provincial tables, but in universally readable RDF sentences, each consisting of a subject, a predicate and an object. Instead of invented column names we use standard names arranged in ontologies, and instead of a textual description of the subject of the RDF sentence we use its identifier, a URI (Uniform Resource Identifier). Thus the title of this blog, which is trivial for a human reader (after all, we can read it above, right?), becomes a structured sentence, or “triple” in RDF lingo: [http://www.pilsudski.org/en/news/blog/832 - dc:title - “Linked Data part 2: Where Is the Data?”]. The first part is the URI or unique “address” of this article, the second means “title” in a specific metadata standard (Dublin Core), and the third part is the actual title.
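As a minimal sketch, the triple above can be modeled in Python and written out in N-Triples, one common plain-text RDF serialization. The `Triple` class and the `to_ntriples` helper are illustrative names of my own, not part of any library, and the predicate is spelled out with the full Dublin Core URI that the `dc:title` shorthand stands for.

```python
from typing import NamedTuple

# Full URI behind the "dc:" prefix (Dublin Core Metadata Element Set 1.1).
DC = "http://purl.org/dc/elements/1.1/"


class Triple(NamedTuple):
    """One RDF sentence; here the object is always a plain string literal."""
    subject: str    # URI of the thing being described
    predicate: str  # URI of the property
    obj: str        # literal value


def to_ntriples(t: Triple) -> str:
    """Serialize a triple with a plain string literal as one N-Triples line."""
    # Only backslashes and double quotes are escaped in this sketch.
    literal = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{literal}" .'


blog_title = Triple(
    subject="http://www.pilsudski.org/en/news/blog/832",
    predicate=DC + "title",
    obj="Linked Data part 2: Where Is the Data?",
)

print(to_ntriples(blog_title))
```

A real project would more likely rely on an RDF library than on hand-written serialization; the point here is only that a triple is a small, regular structure that any program can produce and consume.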

You can find more details about Linked Data and RDF in the first part of this series, “Introduction to Linked Data.” In this blog I would like to focus on the actual data sources available today on the WWW, how to find them and what they contain. As Linked Data, those sources are for the first time universally accessible not only to readers, but also to computers.

In digitizing archives we often seek a reference to names, places, organizations or events that would be stable (i.e., stay the same for a long time) and network accessible. If the name Karol Anders is mentioned, can we find an authoritative source that will unequivocally point to the record of this person? For obvious reasons we will explore only those data sources that are publicly accessible, since a link that we publish on a website open to every reader cannot lead to a resource that this reader cannot access. We will seek not just Linked Data, but specifically Linked Open Data. The illustration above shows a small fragment of the vast Linked Open Data network. Below are only selected sources, with the emphasis on those useful for an archivist or librarian.

Linked Open Data Sources

Data sites

DBpedia is the central hub of Linked Open Data. Its source of data is Wikipedia, but the information in DBpedia is structured, as in a typical database. The work on building DBpedia is ongoing, and it relies largely on volunteers who add infoboxes and other structured data to Wikipedia entries; the DBpedia software then automatically retrieves and organizes these data. DBpedia can be thought of as another way to view the information in Wikipedia, one that makes complex queries possible, and it has all the power and comprehensive coverage of Wikipedia.
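To give a flavor of such queries, here is a minimal sketch that prepares a SPARQL request for DBpedia's public endpoint without sending it. The helper names are hypothetical, and I am assuming that the person of interest carries an English `rdfs:label` spelled exactly as given; whether a particular label matches would have to be checked against the live data.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint; this sketch only builds the URL, it does not contact it.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def build_label_query(label: str, lang: str = "en") -> str:
    """Return a SPARQL query listing resources whose rdfs:label matches exactly."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?resource WHERE {\n"
        f'  ?resource rdfs:label "{escaped}"@{lang} .\n'
        "} LIMIT 10"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET request that asks for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_SPARQL}?{urlencode(params)}"


url = build_request_url(build_label_query("Władysław Anders"))
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching resource URIs, if there are any.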

VIAF (Virtual International Authority File) is based on records from libraries in multiple countries, from the Library of Congress (USA) to the Library of Alexandria (Egypt). Libraries have collected those records for centuries, and they are a natural source of authoritative information about people, organizations and places. “Authority control” is a way of organizing library catalogs so that each entry for an author, place etc. has a single, unique name - its “authority record”. The word “authority” is derived from its initial use in identifying authors.

GeoNames is a geographical database accessible directly and through various web services, under a Creative Commons attribution license. The data come from official public sources and from crowdsourced contributions. It contains some 10 million names in many languages pointing to defined geographical locations. After finding a place, one can retrieve its full RDF record, see the map, review the administrative hierarchy of the place and more.

YAGO, developed at the Max Planck Institute for Informatics in Saarbrücken, is an integrator of information collected from Wikipedia, WordNet (a lexical database of the English language maintained by Princeton University) and GeoNames. It provides a consistent interface and many useful tools for accessing the information, e.g., a visualization tool showing RDF data in graph form. It is also linked to DBpedia.

MusicBrainz is an open content music database. Started as an open alternative to the restricted, proprietary Compact Disc Database (CDDB), it has grown to become a universal compendium of information about artists, their recorded works, and the relationships between them.

Other Sources

There is a large number of other, more specialized data sources - it is enough to look at the whole diagram of which but a small fragment is depicted above. I will mention only a few:

UniProt is one of many scientific databases that contain a growing amount of raw scientific data. UniProt contains protein sequences, a basic tool for biologists.

KEGG is a database resource for understanding biological systems, such as the cell, the organism and the ecosystem, from molecular-level information, including datasets generated by genome sequencing and other high-throughput experimental technologies.

Data.gov is a central point of access to datasets from many branches of the US government and on many topics (not all of them are open).

The European Union Open Data Portal is the European equivalent of data.gov, a single point of access to a range of data from the institutions and other bodies of the European Union.

The New York Times has opened its collection of subjects and the underlying structured information: people, organizations, locations etc.

Open Science Data Cloud is a hosting service containing a number of scientific datasets, from whole human genome sequence data sets to data from the Space Weather Prediction Center.

Meta Sites

Meta sites list the sites that provide actual data, show their access points, and provide an overview and statistics.

LinkedData.org is a site that collects and organizes data on the Linked Data datasets, provides information and collects statistics. Data for today: 2122 datasets, 62 billion RDF triples from 928 datasets (over 50% of the datasets are not yet of sufficient quality).

DataHub is a database of links that collects information on freely available datasets. It claims an impressive 9 thousand sets of data, including sets from the World Bank and the Federal Reserve Board. Unfortunately it contains a lot of spam, suggesting that the site is not regularly maintained. All the datasets discussed here are also registered in DataHub.

Case study: MusicDB

While organizing my collection of mp3 files that I ripped from CDs, I was in a quandary. The original ID3 metadata field allowed me to record the “artist”. Should I file the Etude op.10 no.12 under Frederic Chopin or Vladimir Ashkenazy? The newer ID3v2 allows tags for composer and performer, but a song may have an author of the lyrics, which may be based on a poem that has its own author, a composer of the music, a singer who performs the song, musicians who play the music etc.

Where to find and how to organize such complex data was the topic of a presentation at the METRO 2014 conference by Kimmy Szeto and Christy Crowl entitled “Building Authorities with Crowdsourced and Linked Open Data in ProMusicDB”. The authors discussed the issues involved in building an authoritative ProMusicDB database (under construction). The project requires a complex metadata schema to record all the necessary data and identifiers, which include names and name variations of the people involved, their roles in creation and performance, recording studio details, classification, rights and more. The information is scattered among many different sources. Some of the data were in the already mentioned DBpedia/Wikipedia and MusicBrainz/Discogs. The personal websites of the performers were tapped for information, as were performers’ unions. Other sources used were EIDR (Entertainment Identifier Registry), HFA, RIAA, MediaNet, data from music schools and libraries, and streaming services. Combining data from those resources was followed by an authentication and verification process to finally arrive at an authoritative music database.

This example shows that although detailed data can often be found in electronic form, they are scattered among different sources and not easy to integrate. In addition, simple metadata schemas such as Dublin Core are insufficient to describe more complex relations. MADS/RDF, a relatively new metadata standard that deserves a separate discussion, is used by the creators of ProMusicDB as an ontology schema for authority records. We should look forward to the opening of the ProMusicDB website to see how its creators coped with such complex data gathering.

People, places and institutions

While indexing the archival resources in the Pilsudski Institute, we select the dates, people, places and institutions that deserve special attention. There is obviously more information in archival documents, and we hope researchers will eventually find and use it, but as aids to discoverability those categories have special appeal. Dates are the simplest: as long as they are presented in a standard form, they will be easily searchable. Names are more ambiguous, and it would be useful to link them to authoritative records.
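A small sketch shows why a standard date form makes searching easy. I am assuming here that the standard form is ISO 8601 (YYYY-MM-DD), which the text above does not name, and the index entries are hypothetical placeholders rather than actual documents from the archive.

```python
from datetime import date

# Hypothetical index entries: (document label, date mentioned in the document).
entries = [
    ("document A", date(1939, 9, 1)),
    ("document B", date(1920, 8, 15)),
    ("document C", date(1935, 5, 12)),
]

# ISO 8601 strings (YYYY-MM-DD) sort chronologically as plain text.
iso_index = sorted((d.isoformat(), label) for label, d in entries)


def in_range(index, start: str, end: str):
    """Return labels whose ISO date falls within [start, end], inclusive."""
    return [label for iso, label in index if start <= iso <= end]


print(in_range(iso_index, "1930-01-01", "1939-12-31"))
# → ['document C', 'document A']
```

Because the strings compare correctly as text, the same range search works in any system that can sort or compare strings, with no date-specific logic needed.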

People

Let us take the example of the Polish general Władysław Anders. When a person is well known and has written a book in his or her lifetime, VIAF has the appropriate record. So does YAGO, once you overcome its quirky interface, which has problems with Polish diacritics. However, search for his brother, Colonel Karol Anders, who did not write a book, and both VIAF and YAGO draw a blank. It is not surprising, since both VIAF and YAGO use the same data sources related to book catalogs. On the other hand, Wikipedia has entries for Władysław Anders, and also for both his brothers Karol and Tadeusz. Predictably, the Pilsudski Institute founder and social activist (but not a writer) Stefan Łodzieski has an entry in Wikipedia but not in VIAF or YAGO.

Places

Places are usually well covered in VIAF and YAGO, especially larger entities. Therefore Łódź and Wolbórz will be found in VIAF (YAGO possibly has the records as well, but the interface does not recognize the names), but not Borowa, where we spent summer vacations. Here the GeoNames database shines. The entry for Borowa shows not only the map, satellite image, administrative hierarchy (Łódź Voivodeship, Łódź East County etc.) and geographical coordinates, but also a link to a Wikipedia article, a complete RDF record and more. This helps especially when faced with historical names, which have often changed over time. Yuzovka, Hughesovka, Stalino and Donetsk have the same record in GeoNames, while Wikipedia has a pointer to John Hughes, a Welsh engineer who founded the city.
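The lookup described above can be sketched as two URLs: one for the GeoNames search web service and one for the RDF record of a place once its identifier is known. The helper names are mine, the user name is a placeholder for a registered GeoNames account, and the identifier 12345 is purely illustrative, not the actual GeoNames ID of any place mentioned here.

```python
from urllib.parse import urlencode

# GeoNames JSON search service and the URL pattern of a place's RDF record.
GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"
GEONAMES_RDF = "https://sws.geonames.org/{geoname_id}/about.rdf"


def search_url(name: str, username: str, max_rows: int = 5) -> str:
    """Build a search request for a place name (a registered user name is required)."""
    params = {"q": name, "maxRows": max_rows, "username": username}
    return f"{GEONAMES_SEARCH}?{urlencode(params)}"


def rdf_url(geoname_id: int) -> str:
    """Build the URL of the RDF record for a known GeoNames identifier."""
    return GEONAMES_RDF.format(geoname_id=geoname_id)


# "your_username" is a placeholder; 12345 is an illustrative identifier only.
print(search_url("Borowa", "your_username"))
print(rdf_url(12345))
```

The search response lists candidate places with their identifiers; since several places can share one name, the administrative hierarchy helps in picking the right one before fetching its RDF record.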

Institutions

The VIAF database has records of institutions that have been listed as publishers of books, periodicals etc. In fact, the Pilsudski Institute appears in VIAF several times, which makes it difficult to locate the ‘correct’ record. However, it appears that there is an effort to improve the database. In February 2013, when I was writing a blog on unique identifiers, VIAF had 4 record identifiers describing the Institute: 278200980, 277221969, 262858213 and 151002901. Today three of them point to a single record, and only one duplicate remains. Wikipedia is less likely to have such problems, since it is continuously updated, and indeed it has only one entry for the Institute (in each language version that covers it). Wikipedia also has disambiguation pages that help locate the appropriate entry among synonyms or similar names.
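Checking for such duplicates can be sketched as grouping identifiers by the record each one currently resolves to. In the sketch below the `resolve` function is supplied by the caller (in practice it might follow HTTP redirects from each VIAF URI), and the demonstration uses an invented stand-in mapping with placeholder identifiers, because the text does not say which of the four real identifiers were merged.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

VIAF_URI = "http://viaf.org/viaf/{viaf_id}"  # URI pattern of a VIAF record


def viaf_uri(viaf_id: str) -> str:
    """Build the Linked Data URI for a VIAF identifier."""
    return VIAF_URI.format(viaf_id=viaf_id)


def group_by_record(ids: Iterable[str], resolve: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group identifiers by the record that each one currently resolves to."""
    groups = defaultdict(list)
    for viaf_id in ids:
        groups[resolve(viaf_id)].append(viaf_id)
    return dict(groups)


# The four identifiers quoted in the text, as URIs.
for viaf_id in ["278200980", "277221969", "262858213", "151002901"]:
    print(viaf_uri(viaf_id))

# INVENTED stand-in for a real resolver: placeholder ids, not actual VIAF data.
fake_redirects = {"id-1": "rec-1", "id-2": "rec-1", "id-3": "rec-1", "id-4": "rec-2"}
groups = group_by_record(fake_redirects, fake_redirects.__getitem__)
print(groups)
```

Any group with more than one member marks identifiers that have been merged, and more than one group means that a duplicate record remains.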

Conclusion

The field of Linked Open Data is full of initiatives and data models, and has a growing number of sources of actual, useful data. It needs more work to become a Semantic Web, though. The Semantic Web is a collaborative effort to convert or expand the current WWW (“web for humans”) to also include structured data (“web of data”).

There are several sources of good quality data that can be used by an archivist. For identifying people and institutions my first recommendation is Wikipedia (and its sister project DBpedia). It is continuously improved and updated, and errors are corrected quickly. VIAF is another source that could be referenced, because it has data collected over a long period of time. Wikipedia and VIAF are now being cross-referenced, which should help in locating the correct record. For geographical locations, GeoNames is the resource of choice, being both complete and of high quality.

Linked Open Data and the Semantic Web have a lot of enthusiasts who continuously work on new, better ways to access the data. There are also significant efforts to open access to data that have been closed in government and company vaults. This effort can be especially fruitful in the sciences, some of which routinely generate terabytes of data. It is worthwhile to explore some of the resources mentioned above to experience the power of big data.