Although Linked Open Data has grown enormously in volume over recent years, there is still no single point of access for querying the more than 200 public SPARQL repositories. The Balloon project aims to create a Meta Web of Data focusing on structural information by crawling co-reference relationships in all registered and reachable Linked Data SPARQL endpoints. The current Linked Open Data cloud, although huge in size, offers poor quality of service and is inadequately maintained, which complicates access via SPARQL endpoints. This issue must be resolved before the Linked Open Data cloud can achieve its full potential.

Today’s vision of a common Web of Data is largely attributable to the Linked Open Data movement. The first wave of the movement transformed silo-based portions of data into a plethora of openly accessible and interlinked data sets. The community itself provided guidelines (e.g., 5-star Open Data) as well as open source tools to foster interaction with the Web of Data. Harmonization between those data sets has been established at the modelling level, with unified description schemes characterizing a formal syntax and common data semantics.

Without doubt, Linked Open Data is the de facto standard for publishing and interlinking distributed data sets on the Web, commonly exposed via SPARQL endpoints. However, a query spanning the globally described data set is only possible with strong limitations:

The distributed nature of the Linked Open Data cloud, in combination with the large number of reachable endpoints, hinders novice users from interacting with the data.

Following the Linked Data principles, specific URIs are used to identify specific entities in the endpoints and can be resolved to obtain further information on a given entity. The problem arises because each endpoint uses its own URIs to describe the same semantic entities, leading to semantic ambiguities.

One outcome of the EU FP7 CODE project is the Balloon framework. It tackles exactly this situation and aims to create a Meta Web of Data focusing on structural information. The foundation for this is a crawled subset of the Linked Data cloud, resulting in a co-reference index as well as structural information. The main idea behind this index is to resolve the aforementioned semantic ambiguities by creating sets of semantically equivalent URIs, easing the consumption of Linked Open Data. This is enabled by crawling information that expresses the links between the endpoints. For this purpose, we consider a specific set of predicates, e.g., owl:sameAs or skos:exactMatch, to be relevant. The complete crawling process relies on SPARQL queries and considers each LOD endpoint registered at the CKAN platform; RDF dumps are explicitly excluded. During crawling, a clustering approach creates the co-reference clusters, leading to a bi-directional view on the co-reference relationships as the result of a continuous indexing process of SPARQL endpoints. In addition to properties defining the equality of URIs, the indexing service also takes into account properties that enable structural analysis of the data corpus, e.g., rdfs:subClassOf. On the basis of this data corpus, interesting modules and application scenarios can be defined. For instance, ongoing research focuses on the creation of the following two modules as a starting point:

Intelligent, on-the-fly query rewriting by utilizing co-reference clusters and SPARQL 1.1 Federated Query.

Data analysis, e.g., retrieving common properties or supertypes for a given set of entities.
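The clustering step described above can be sketched as a union-find over the crawled equivalence triples: every triple whose predicate expresses equality merges two URIs into the same synonym group. The predicate list and sample triples below are illustrative assumptions, not Balloon's actual crawler configuration:

```python
# Sketch of the co-reference clustering step: a union-find over equivalence
# triples. Predicate URIs and sample data are illustrative only.

EQUIVALENCE_PREDICATES = {
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://www.w3.org/2004/02/skos/core#exactMatch",
}

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def build_clusters(triples):
    """Group URIs linked by equivalence predicates into synonym groups."""
    uf = UnionFind()
    for s, p, o in triples:
        if p in EQUIVALENCE_PREDICATES:
            uf.union(s, o)
    clusters = {}
    for uri in list(uf.parent):
        clusters.setdefault(uf.find(uri), set()).add(uri)
    return list(clusters.values())

# Example: three URIs from different endpoints describing the same entity.
triples = [
    ("http://dbpedia.org/resource/Berlin",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://sws.geonames.org/2950159/"),
    ("http://sws.geonames.org/2950159/",
     "http://www.w3.org/2004/02/skos/core#exactMatch",
     "http://example.org/city/berlin"),
]
print(build_clusters(triples))
```

Each resulting set is one co-reference cluster, i.e., one group of URIs that different endpoints use for the same semantic entity.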

These modules are integrated in the overall Balloon platform and serve as a starting point for further applications. To foster community uptake and to increase the number of available modules in the platform, the Balloon project, along with the data corpus, will soon be made available as an open source project.
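The rewriting idea behind the first module can be illustrated as follows: a concrete URI in a query is replaced by a variable constrained with a VALUES clause enumerating its co-reference cluster, so the query matches whichever local URI an endpoint uses. This is a simplified sketch (string substitution only, subject position only); a full rewriter would parse the query and additionally distribute patterns across endpoints via SPARQL 1.1 SERVICE clauses:

```python
def rewrite_with_cluster(query, var, uri, cluster):
    """Replace a fixed URI with a VALUES clause over its co-reference
    cluster. Simplified sketch: assumes the URI occurs once and that the
    query has a single 'WHERE {' group."""
    values = " ".join(f"<{u}>" for u in sorted(cluster))
    # Substitute the concrete URI with a variable, then constrain it.
    rewritten = query.replace(f"<{uri}>", f"?{var}")
    return rewritten.replace("WHERE {",
                             f"WHERE {{ VALUES ?{var} {{ {values} }}", 1)

# Hypothetical example query and cluster.
query = "SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Berlin> ?p ?o }"
cluster = {
    "http://dbpedia.org/resource/Berlin",
    "http://sws.geonames.org/2950159/",
}
print(rewrite_with_cluster(query, "s",
                           "http://dbpedia.org/resource/Berlin", cluster))
```

The rewritten query retrieves statements made about any member of the synonym group, not just the one URI the user happened to know.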

The idea of leveraging co-reference information is not new: the Silk framework [1], SchemEX [2], and the well-known sameAs.org project have proposed similar techniques. Nevertheless, the Balloon co-reference approach additionally considers consistent data provenance chains and the possibility of cluster manipulations to enhance overall quality and correctness. Further, the explicit limitation to LOD endpoints sets a clear focus on data that is (in principle) retrievable, in contrast to RDF dumps, which are not searchable out of the box.

While creating the co-reference index, we encountered several issues in the current Linked Open Data cloud. A lack of maintenance of endpoints over the years, as well as poor quality of service, hinders the Linked Open Data cloud from reaching its potential. Our findings from the crawling process are in keeping with the current statistics provided by the LOD2 project on the Linked Open Data cloud: of a total of 700 official data sets, only approximately 210 are exposed via a SPARQL endpoint and registered at the CKAN platform. Further, more than half of the available endpoints had to be excluded due to insufficient SPARQL support or unreachability. Finally, only 112 endpoints were actively crawled for co-reference information, yielding a total of 22.4M distinct URIs (approx. 8.4M synonym groups). During the crawling phase we also identified the need for a SPARQL feature lookup service; its main intention is to describe the actually supported retrieval capabilities of an endpoint in a standardized way. Discussions on this topic are currently ongoing on community mailing lists.
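A minimal endpoint liveness check of the kind used to exclude unreachable endpoints can be sketched as a trivial SPARQL ASK probe over the SPARQL 1.1 Protocol (query passed as an HTTP GET parameter). The endpoint URL and timeout below are assumptions, and this sketch only tests reachability, not feature support:

```python
import urllib.parse
import urllib.request

# A trivial ASK query: does the endpoint hold any triple at all?
PROBE_QUERY = "ASK { ?s ?p ?o }"

def probe_url(endpoint):
    """Build the GET URL for a minimal SPARQL ASK probe."""
    params = urllib.parse.urlencode({"query": PROBE_QUERY})
    return f"{endpoint}?{params}"

def is_reachable(endpoint, timeout=10):
    """Return True if the endpoint answers the probe with HTTP 200."""
    req = urllib.request.Request(
        probe_url(endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False
```

A standardized feature lookup service, as discussed above, would go beyond such a binary check and describe which SPARQL constructs an endpoint actually supports.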