SANSA – The Distributed Semantic Analytics Toolkit in SLIPO

Within the SLIPO project, we have to deal with large volumes of structured Point of Interest (POI) data. One of the frameworks we use and extend for this purpose is SANSA. SANSA combines distributed computing frameworks (specifically Spark and Flink) with the semantic technology stack. Here is an illustration:

Figure 1: The SANSA vision combines distributed analytics (left) and semantic technologies (right) into a scalable semantic analytics stack (top). The colours encode which parts of the two original stacks influence which part of the SANSA stack. The main objective of SANSA is to investigate whether the characteristics of each technology stack (bottom) can be combined to retain the respective advantages.

The combination of distributed computing and semantic technologies allows the exploitation of solutions and advantages from both fields. In particular, SANSA inherits the following advantages from the semantic technology stack:

a) Powerful Data Integration: Current analytics pipelines need to be able to handle increasing data variety and complexity. Moving from common short-term, ad hoc solutions that require a lot of engineering effort to standardised and well-understood semantic technology approaches has had, and will continue to have, a significant impact.

b) Expressive Modelling: The vast majority of machine learning algorithms rely on simple input formats, such as feature vectors, rather than being able to use expressive modelling via the Resource Description Framework (RDF) and the Web Ontology Language (OWL). While this has been researched in fields such as Statistical Relational Learning and Inductive Logic Programming, these methods usually do not scale horizontally. Initial work on horizontally scalable machine learning on structured data has been performed, particularly in terms of adding graph processing capabilities to distributed computing frameworks, but those are not aimed at semantic technologies and currently provide limited capabilities.

c) Standards: The use of W3C standards can generally reduce pre-processing time in cases where data sources are used for more than one analytics task. This is the case for knowledge graphs, which are often combined with several applications including search, information retrieval, advanced querying and filtering of information, as well as visualisation. Beyond this, standardisation allows us to draw on generic approaches, e.g. for querying and merging data, rather than developing ad hoc solutions, which are less reusable and often less efficient and effective. The use of standards will also enable a clearer separation of the data pre-processing step, i.e. RDF modelling, and the actual analytics step. This allows experts in either step to focus their efforts on their actual job and hence increases the overall efficiency.

SANSA inherits the following advantages from machine learning research and distributed computing:

d) Measurable Benefits: A key driver for the success of machine learning is that its benefits are often directly measurable, e.g. an accuracy improvement can often be directly translated into a financial benefit. This is less true of semantic technologies, where the benefits gained through the effort of modelling, editing and extracting knowledge are often not easily measurable. A seamless integration of semantic technologies and machine learning, as envisioned in SANSA, will also help to make the benefits of semantic technologies more visible, as they will translate to machine learning results which are more accurate and easier to understand.

e) Horizontal Scalability: Distributed in-memory computing can provide the horizontal scalability required by the high computation and storage demands of large-scale semantic knowledge graph analytics. However, it does not magically result in higher scalability and requires a deep understanding of the underlying structures and models. For instance, distributed machine learning for expressive logics and the inclusion of inference in knowledge graph embedding models are challenging problems with many open research questions.

Recently, we released SANSA 0.2 in preparation for analysis tasks in SLIPO. This release supports the following features:

Reading and writing RDF files in N-Triples format

Reading OWL files in various standard formats

Querying and partitioning based on Sparqlify

RDFS/RDFS Simple/OWL-Horst forward chaining inference

RDF graph clustering with different algorithms

Rule mining from RDF graphs
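
To make the forward chaining inference feature more concrete, here is a minimal single-machine sketch in plain Python. It is not SANSA code and does not use the SANSA API; it only illustrates the idea of applying two standard RDFS entailment rules (rdfs9 and rdfs11) repeatedly until no new triples can be derived. The function name forward_chain, the prefixed identifiers and the POI example data are all assumptions made up for illustration. A distributed implementation would have to partition the triples and perform the rule joins across a cluster, which this sketch ignores.

```python
# Minimal forward-chaining sketch over RDF-style triples.
# Plain Python for illustration only; this is NOT the SANSA API.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def forward_chain(triples):
    """Apply two RDFS rules until a fixpoint is reached.

    rdfs11: (A subClassOf B), (B subClassOf C)  =>  (A subClassOf C)
    rdfs9:  (x type A),       (A subClassOf B)  =>  (x type B)
    """
    inferred = set(triples)
    while True:
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        new = set()
        for sub, sup in subclass_pairs:
            for s, p, o in inferred:
                if p == SUBCLASS_OF and s == sup:
                    # rdfs11: subclass relation is transitive
                    new.add((sub, SUBCLASS_OF, o))
                elif p == RDF_TYPE and o == sub:
                    # rdfs9: instances of a class belong to its superclass
                    new.add((s, RDF_TYPE, sup))
        new -= inferred
        if not new:
            return inferred  # nothing new was derived, so we are done
        inferred |= new


# Hypothetical POI data, invented for this example.
poi_triples = {
    ("ex:Cafe", SUBCLASS_OF, "ex:FoodVenue"),
    ("ex:FoodVenue", SUBCLASS_OF, "ex:PointOfInterest"),
    ("ex:poi42", RDF_TYPE, "ex:Cafe"),
}

closure = forward_chain(poi_triples)
for triple in sorted(closure):
    print(triple)
```

From the three input triples, the sketch derives that ex:poi42 is also an ex:FoodVenue and an ex:PointOfInterest, and that ex:Cafe is a subclass of ex:PointOfInterest. The naive loop re-scans all triples on every pass, which is fine for a toy example but is exactly the kind of cost that a distributed engine needs to avoid.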

Deployment and getting started:

Template projects for SBT and Maven are available for both Apache Spark and Apache Flink to help you get started.

The SANSA jar files are available on Maven Central, i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.