Linking data sources with Spinque

An important part of the COMSODE methodology is to make datasets available in an integrated fashion. One aspect of this integration is the alignment of datasets by linking entities from one dataset with the same entities from another dataset. In this blog post, we describe how we applied Spinque’s Search by Strategy approach to linking vocabularies in the cultural heritage domain.

Many heritage institutions describe their objects using terms from predefined vocabularies. Linking these vocabularies aids the integration of the collections. To this end, the Netherlands Institute for Sound & Vision and Spinque teamed up to develop CultuurLINK. With CultuurLINK, owners of a digital heritage collection can link their (internal) vocabularies to the Dutch cultural heritage hub. The hub contains several large vocabularies that are core to the community, such as the audiovisual thesaurus provided by the Netherlands Institute for Sound & Vision. Links to the vocabularies in the hub enable richer descriptions of the collections, for example in the form of additional background information and labels in multiple languages. In addition, links between vocabularies connect formerly isolated collections, enabling richer and more contextualized access.

CultuurLINK: interface with strategy editor (top) and result table (bottom).

With CultuurLINK, the user builds a link strategy step by step out of basic building blocks, much like constructing a strategy for a Spinque search engine. The interface is shown in the figure above. The top part of the application contains the strategy editor and the bottom part shows the result table. The editor contains the building block library on the left and the canvas on the right. The user builds the link strategy by dragging building blocks onto the canvas and connecting them to other blocks. CultuurLINK contains blocks to filter data sources, match their items by comparing attributes, filter matches using structural properties, and partition the result sets for analysis.

When the user selects a building block, the output is computed and presented in the result table. Inspecting the intermediate results helps the user to decide which step to take next, enabling an interactive approach to vocabulary alignment. When the link strategy is completed, the links are exported as SKOS triples. The definition of the link strategy provides the provenance of the links and can be exported as well.
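To give an idea of what exported links can look like, here is a minimal sketch that writes links as SKOS mapping triples in N-Triples syntax. It is our own illustration and not the CultuurLINK exporter: the function name to_ntriples and the example.org concept URIs are hypothetical; only the SKOS namespace and the exactMatch and closeMatch properties come from the SKOS standard.

```python
# Namespace of the SKOS vocabulary (defined by the W3C SKOS standard).
SKOS = "http://www.w3.org/2004/02/skos/core#"


def to_ntriples(links):
    """Serialize (source URI, SKOS relation, target URI) tuples as N-Triples."""
    lines = []
    for source, relation, target in links:
        lines.append(f"<{source}> <{SKOS}{relation}> <{target}> .")
    return "\n".join(lines)


# Hypothetical links; the URIs are placeholders, not real NIOD or GTAA identifiers.
links = [
    ("http://example.org/niod/101", "exactMatch", "http://example.org/gtaa/2001"),
    ("http://example.org/niod/102", "closeMatch", "http://example.org/gtaa/2002"),
]

print(to_ntriples(links))
```

Each line of the output is one triple: the source concept, the mapping relation chosen during evaluation, and the target concept.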

A Multimedia Heritage Example

Let us illustrate how a Spinque search engine may now easily exploit the links created in CultuurLINK, and thereby improve the access to open cultural heritage data. We link the collections of two Dutch institutions: the Open Images collection of the Netherlands Institute for Sound and Vision and the photograph and library collection of the NIOD Institute for War, Holocaust and Genocide Studies. Today, we focus on the linking process itself; next week we will use these links to define an advanced search engine that ties the two collections together.

The collections are already described with subject terms from controlled vocabularies, but each institute uses its own controlled vocabulary: GTAA and the NIOD term list, respectively. Today’s task is to find the corresponding concepts between these two vocabularies. The screencast at http://www.youtube.com/watch?v=55Wvo1-DZpY illustrates the process in detail, but let us first discuss the main steps in the remainder of this post.

The GTAA and NIOD vocabularies have been modeled using the RDF vocabulary for Simple Knowledge Organization Systems (SKOS). The data sources are represented by the two blocks at the top of the strategy (see also the figure above). When a data source block is selected, its content is shown in the table at the bottom of the UI. In the figure, the GTAA is selected and the table shows its SKOS Concepts, 177,987 in total. Columns show the typical SKOS attributes of the concepts, such as the preferred and alternative labels. Below each column header, a chart indicates how many of the concepts have a value for that attribute. This helps the user judge how useful each attribute is for finding links.
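The idea behind the per-column chart can be sketched in a few lines: for each attribute, count the share of concepts that have a value. This is only an illustration under our own assumptions; the dictionary representation of a concept, the function name attribute_coverage, and the sample records are hypothetical.

```python
def attribute_coverage(concepts, attributes):
    """Return, per attribute, the fraction of concepts with a non-empty value."""
    total = len(concepts)
    coverage = {}
    for attribute in attributes:
        filled = sum(1 for concept in concepts if concept.get(attribute))
        coverage[attribute] = filled / total if total else 0.0
    return coverage


# Made-up sample records, not taken from the GTAA.
sample = [
    {"uri": "ex:1", "prefLabel": "verzet", "altLabel": ["illegaliteit"]},
    {"uri": "ex:2", "prefLabel": "bevrijding", "altLabel": []},
    {"uri": "ex:3", "prefLabel": "landbouw"},
    {"uri": "ex:4", "prefLabel": "scheepvaart"},
]

coverage = attribute_coverage(sample, ["prefLabel", "altLabel"])
print(coverage)
```

An attribute with low coverage, such as the alternative labels in this sample, can only ever produce links for a small part of the vocabulary.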

By inspecting the result table, the user also finds out that the GTAA vocabulary contains different types of concepts organized in different concept schemes, e.g. person names, geographical names, subject terms, etc. The NIOD term list only contains subject terms. Therefore, we first filter the GTAA down to subject terms, as matching on the other types of concepts could introduce noise in the results. The filter is added by dragging a filter block onto the strategy canvas, connecting it to the data source block, and configuring it appropriately.
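Conceptually, the filter block keeps only the concepts that belong to one concept scheme. A minimal sketch, assuming each concept is a dictionary with a hypothetical "scheme" key (the scheme names and URIs below are invented for illustration and are not the actual GTAA scheme identifiers):

```python
def filter_by_scheme(concepts, scheme):
    """Keep only the concepts that belong to the given concept scheme."""
    return [concept for concept in concepts if concept.get("scheme") == scheme]


# Made-up records standing in for GTAA concepts of different types.
gtaa_sample = [
    {"uri": "gtaa:1", "prefLabel": "verzet", "scheme": "subjects"},
    {"uri": "gtaa:2", "prefLabel": "Amsterdam", "scheme": "locations"},
    {"uri": "gtaa:3", "prefLabel": "Wilhelmina", "scheme": "persons"},
    {"uri": "gtaa:4", "prefLabel": "bevrijding", "scheme": "subjects"},
]

subject_terms = filter_by_scheme(gtaa_sample, "subjects")
print([concept["prefLabel"] for concept in subject_terms])
```

Restricting the target side this way prevents, for example, a subject term from being linked to a person or place that happens to share its label.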

Blocks are connected by dragging a line from the output connector of one block (shown at the bottom of a block) to the input connector of another block (at the top of a block). The configuration panel of a block is opened by clicking the config icon at the top right of the block. The figure below shows the configuration of the string match block that we discuss next.

CultuurLINK: configuration panel of the string match block.

The next step is to match the concepts from the two vocabularies by comparing their attributes. The user can try different string matching techniques and apply them to different attributes. At each step, the instant feedback on the results allows the user to determine whether the operation is suitable, the configuration should be improved, the strategy should be extended, or the block should be discarded in favor of a different approach.

We first find the concepts that have exactly the same preferred labels, by adding a string match block and configuring it to match on preferred labels. This results in 557 links for 1343 of the concepts of the NIOD term list. Manually evaluating a small subset makes clear that these matches can be accepted as good links. The figure below shows the result table with the links found by the string match block. For each link, the first row shows the concept from the NIOD vocabulary and the second row shows the matching concept from the GTAA vocabulary. The links are manually evaluated by selecting a relation from the dropdown menu.

CultuurLINK: result table showing the links between concepts found by an exact string match on the preferred labels.
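An exact match on preferred labels boils down to a lookup: index one vocabulary by label and probe it with the labels of the other. The sketch below is our own illustration, not CultuurLINK's implementation. The function exact_label_match, the dictionary representation, and the sample concepts are hypothetical. It returns the links and, separately, the source concepts for which nothing was found.

```python
from collections import defaultdict


def exact_label_match(source, target, attribute="prefLabel"):
    """Link source concepts to target concepts with an identical attribute value.

    Returns (links, unmatched): links is a list of (source URI, target URI)
    pairs, unmatched lists the source concepts without any link.
    """
    # Index the target vocabulary by label for constant-time lookups.
    index = defaultdict(list)
    for concept in target:
        index[concept[attribute]].append(concept["uri"])

    links, unmatched = [], []
    for concept in source:
        hits = index.get(concept[attribute], [])
        if hits:
            links.extend((concept["uri"], uri) for uri in hits)
        else:
            unmatched.append(concept)
    return links, unmatched


# Made-up sample concepts, not actual NIOD or GTAA records.
niod_sample = [
    {"uri": "niod:1", "prefLabel": "verzet"},
    {"uri": "niod:2", "prefLabel": "razzia"},
]
gtaa_sample = [
    {"uri": "gtaa:7", "prefLabel": "verzet"},
    {"uri": "gtaa:8", "prefLabel": "razzia's"},
]

links, unmatched = exact_label_match(niod_sample, gtaa_sample)
print(links)
print([concept["prefLabel"] for concept in unmatched])
```

The comparison here is strict string equality; whether labels should first be normalized (case, whitespace, diacritics) is a design choice we leave open.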

Because not all the concepts are mapped, we continue with the remaining ones and try different string matching techniques and different attributes: for example, matching on the alternative labels, applying stemming, and using fuzzy string matching (given a maximum edit distance). By considering labels that are not exactly the same, we managed to find many additional links. Of course, this comes at a cost: more of these links contain errors. How these errors are best detected depends on the use case. For example, hierarchical relations among the concepts can be used to keep only the links for which the parents of the concepts are also linked. In this example, the number of hierarchical relations contained in the vocabularies is too limited to apply such a filter effectively. Furthermore, the number of links found with the additional blocks is in this case small enough to check by hand. The figure below shows the strategy after two additional string match blocks were added. Both blocks continue with the concepts for which no links were found in the previous block. This is done by connecting to the NOT output of the block.

CultuurLINK: strategy with three string match blocks.
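The fuzzy step and the chaining through the NOT output might be sketched as follows. Again, this is our own illustration rather than the CultuurLINK implementation: edit_distance is a plain Levenshtein distance, and fuzzy_match, the sample concepts, and the chosen maximum distance of 2 are hypothetical. The input stands for the concepts left over from a previous block, and the second return value plays the role of the NOT output for a next block.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(source, target, max_distance, attribute="prefLabel"):
    """Link concepts whose labels differ by at most max_distance edits.

    Returns (links, unmatched); unmatched can feed a following block.
    """
    links, unmatched = [], []
    for concept in source:
        hits = [
            candidate["uri"]
            for candidate in target
            if edit_distance(concept[attribute], candidate[attribute]) <= max_distance
        ]
        if hits:
            links.extend((concept["uri"], uri) for uri in hits)
        else:
            unmatched.append(concept)
    return links, unmatched


# Made-up concepts left unmatched by a previous (exact) block.
leftover = [
    {"uri": "niod:2", "prefLabel": "razzia"},
    {"uri": "niod:3", "prefLabel": "onderduikers"},
]
gtaa_sample = [
    {"uri": "gtaa:8", "prefLabel": "razzia's"},
    {"uri": "gtaa:9", "prefLabel": "landbouw"},
]

fuzzy_links, still_unmatched = fuzzy_match(leftover, gtaa_sample, max_distance=2)
print(fuzzy_links)
print([concept["prefLabel"] for concept in still_unmatched])
```

This sketch compares every source label with every target label, which is fine for a few hundred leftover concepts but would need an index or blocking step for large vocabularies. A larger maximum distance finds more links but also more false ones, which is the trade-off described above.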

At this stage, more than half of the concepts from NIOD are linked to a concept from GTAA. Manually inspecting the concepts that are not linked makes clear that many of these do not occur in the GTAA, as they are topics very specific to the NIOD Institute for War, Holocaust and Genocide Studies.

Take a look at the screencast for more details about the individual steps. Next week we discuss how to use the links between the vocabularies in a cross-collection search application.