MOCHA Open Challenge 2017-2018 – Tasks

Task 1: RDF Data Ingestion

Summary Description

The constant growth of the Linked Data Web in velocity and volume has increased the need for triple stores to ingest streams of data and to query this data efficiently. The aim of this task is to measure the performance of SPARQL query processing systems faced with streams of data from industrial machinery, in terms of efficiency and completeness. The experimental setup is hence as follows: we will increase the size and velocity of the RDF data used in our benchmarks to evaluate how well a system can store streaming RDF data obtained from industry. The data will be generated from one or multiple resources in parallel and will be inserted using SPARQL INSERT queries. This facet of triple stores has (to the best of our knowledge) never been benchmarked before. SPARQL SELECT queries will be used to test the system’s ingestion performance and storage abilities. The components of the benchmark for this task are implemented in Java.

Testing and Training Data

The input data for this task consists of data derived from mimicking algorithms trained on real industrial datasets (see Use Cases for details). Each training dataset will include RDF triples generated within a period of production (e.g., a production cycle). Each event (e.g., each sensor measurement or tweet) will have a timestamp that indicates when it was generated. The datasets will differ in size with regard to the number of triples per second. During the test, the data to be ingested will be generated by data agents (in the form of distributed threads). An agent is a data generator that is responsible for inserting its assigned set of triples into a triple store using SPARQL INSERT queries. Each agent will emulate a dataset that covers the duration of the benchmark. All agents will operate in parallel and will be independent of each other. As a result, the benchmarked storage solution will have to support concurrent inserts. The insertion of a triple is based on its generation timestamp. To emulate the ingestion of streaming RDF triples produced over large time periods within a shorter time frame, we will apply a time dilatation factor to the timestamps of the triples. Our benchmark allows for testing ingestion performance in terms of precision and recall by deploying datasets that vary in volume (number of triples and timestamps) and by using different dilatation values, different numbers of agents and different sizes of update queries. The testing and training data are derived from public transport datasets and are available here:

Use Cases

This task aims to reflect real loads on triple stores used in real applications. We will hence use the public transport dataset. However, the benchmark itself can be used for other real-time applications including (but not limited to):

Requirements

For task 1, participants must:

provide a SystemAdapter class in their preferred programming language. The SystemAdapter is the main component that establishes the communication between the other benchmark components and the participant’s system. The functionality of a SystemAdapter is divided into the following steps:

Initialization of the storage system

Retrieval of triples in the form of an INSERT SPARQL query and insertion of the aforementioned triples into the storage.

Retrieval of string representation of the graph name of each data generator.

Retrieval and execution of SELECT SPARQL queries against the storage system, and sending of the results to the EvaluationStorage component.

provide a storage system that processes SPARQL INSERT queries. The data insertion will be performed via INSERT queries generated by different data generators. The triple store must be able to handle multiple INSERT queries at the same time, since each data generator runs independently of the others. First, an INSERT query will be created by a data generator using an Apache Jena RDF model (dataModel) that includes a set of triples generated at a particular point in time. The INSERT query will be created, saved into a file and then transformed into a byte array using the following Java commands. Please note that each data generator will perform INSERT queries against its own graph, so the INSERT query will include the GRAPH clause with the name of the corresponding graph.

The fileContent includes a UTF-8 String representation of the INSERT query. The function RabbitMQUtils.writeString(String str) creates a byte array representation of the given String str using UTF-8 encoding. The function RabbitMQUtils.writeByteArrays(byte[][] array) returns a byte array containing all given input arrays, with their lengths placed in front of them. Then, the insertData will be sent to the SystemAdapter from the data generator using the function sendDataToSystemAdapter(byte[] insertData). The SystemAdapter must be able to receive the INSERT query as a byte array and transform it into a UTF-8 encoded String. If a participant chooses Java as their programming language, the received byte array can be wrapped into a buffer using ByteBuffer buffer = ByteBuffer.wrap(insertData), and the command RabbitMQUtils.readString(buffer) can then be used to obtain a UTF-8 encoded String for the INSERT query. Then, the query must be performed against the storage system.
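To make the byte layout concrete, the following minimal sketch mimics the length-prefixed encoding described above using only the JDK. The real benchmark uses RabbitMQUtils from the HOBBIT core library; the helper methods here are stand-ins that only illustrate the assumed behaviour (UTF-8 strings, 4-byte length prefixes).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch of the length-prefixed byte layout described above.
// The actual benchmark uses RabbitMQUtils from the HOBBIT core library;
// these helpers only mimic that behaviour with the JDK's ByteBuffer.
public class InsertQueryCodec {

    // Mimics RabbitMQUtils.writeString: the UTF-8 bytes of the given String.
    static byte[] writeString(String str) {
        return str.getBytes(StandardCharsets.UTF_8);
    }

    // Mimics RabbitMQUtils.writeByteArrays: concatenates the arrays,
    // placing each array's length (as a 4-byte int) in front of it.
    static byte[] writeByteArrays(byte[][] arrays) {
        int total = 0;
        for (byte[] a : arrays) {
            total += 4 + a.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(total);
        for (byte[] a : arrays) {
            buffer.putInt(a.length);
            buffer.put(a);
        }
        return buffer.array();
    }

    // Mimics RabbitMQUtils.readString: reads one length-prefixed,
    // UTF-8 encoded String from the buffer's current position.
    static String readString(ByteBuffer buffer) {
        byte[] data = new byte[buffer.getInt()];
        buffer.get(data);
        return new String(data, StandardCharsets.UTF_8);
    }
}
```

On the SystemAdapter side, the received insertData would then be decoded with readString(ByteBuffer.wrap(insertData)) and executed against the store.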

provide a storage solution that can process the UTF-8 string representation of the name of a graph, in which the SELECT and INSERT queries will be performed. A graph name will be converted by each data generator using the command byte[] insertData = RabbitMQUtils.writeByteArrays(new byte[][] { RabbitMQUtils.writeString(graphName) }); and then sent to the SystemAdapter using sendDataToSystemAdapter(byte[] insertData).

provide a storage solution that can process SELECT SPARQL queries. The performance of a storage system will be explored using SELECT SPARQL queries. Each SELECT query will be executed after a predefined set of INSERT queries has been performed against the system (a configurable platform parameter). A SELECT query will be created by the corresponding data generator and then sent to a task generator, along with the expected results, combined in the form of a byte array. The task generator will read both the SELECT query and the expected results, and will send the SELECT query as a byte array to the SystemAdapter along with the task’s unique identifier, which is a UTF-8 encoded String. The expected results will also be read by the task generator and sent to the evaluation storage component as a byte array.

The SystemAdapter must be able to receive the SELECT query as a byte array and then transform it into a UTF-8 encoded String (as described above for the case of Java). Then, the SystemAdapter performs the SELECT query against the storage system, and the retrieved results must be serialised into JSON format (abiding by the W3C standard https://www.w3.org/TR/rdf-sparql-json-res/). The JSON strings should be UTF-8 encoded. Finally, the results in JSON will be transformed into a byte array and sent to the evaluation storage along with the task’s unique identifier. Please note that each SELECT query will include the GRAPH clause with the URI of the corresponding graph into which the data was inserted.
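For orientation, the sketch below builds the expected W3C SPARQL results JSON shape for a single-variable SELECT result by plain string formatting, and encodes it as a UTF-8 byte array. This is only an illustration of the target format; in practice a participant would let their RDF library (for example Apache Jena) perform the serialisation.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Minimal sketch of the W3C SPARQL query results JSON format for a
// SELECT query with a single variable bound to URIs. A real adapter
// would delegate this serialisation to its RDF library.
public class JsonResults {

    static byte[] serialize(String var, List<String> uris) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"head\":{\"vars\":[\"").append(var).append("\"]},");
        sb.append("\"results\":{\"bindings\":[");
        for (int i = 0; i < uris.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"").append(var).append("\":{\"type\":\"uri\",\"value\":\"")
              .append(uris.get(i)).append("\"}}");
        }
        sb.append("]}}");
        // The evaluation storage expects the JSON as a UTF-8 byte array.
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```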

provide any necessary parameters to their systems that grant access for inserting triples into their storage system.

The following example is a description of HOBBIT’s API for participants that use Java as their programming language.

Evaluation

Our evaluation consists of three KPIs:

Recall, Precision and F-measure: The INSERT queries created by each data generator will be sent to the triple store by bulk load. Note that the insertion of triples via INSERT queries will not be done at equal time intervals but based on their real generation timestamps, emulating a realistic scenario. After a stream of INSERT queries is performed against the triple store, a SELECT query will be issued by the corresponding data generator. The SELECT query will be sent to the task generator along with the expected answers. Then, the task generator will send the SELECT query to the SystemAdapter and the expected results to the evaluation storage. As explained above, once the SystemAdapter performs the SELECT query against the triple store, it receives the retrieved results and sends them to the evaluation storage as well. At the end of each experiment, we will compute the recall, precision and F-measure of each SELECT query by comparing the expected and retrieved results, as well as the micro and macro average recall, precision and F-measure of the whole benchmark. The expected results for each SELECT query will be computed prior to the system evaluation by inserting the data into, and querying, an instance of the Jena TDB storage solution.
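As a sketch (with hypothetical class and method names, not part of the benchmark API), the per-query scores and their macro average could be computed from expected versus retrieved result sets as follows; micro averages would instead pool the true/false positive counts over all queries before computing the scores.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the precision/recall/F-measure computation over SELECT
// queries, comparing expected against retrieved result sets (set
// semantics assumed). Macro averages take the mean of the per-query
// scores; micro averages would pool the counts first.
public class PrfEvaluation {

    static double precision(Set<String> expected, Set<String> retrieved) {
        if (retrieved.isEmpty()) return 0.0;
        Set<String> tp = new HashSet<>(retrieved);
        tp.retainAll(expected);               // true positives
        return (double) tp.size() / retrieved.size();
    }

    static double recall(Set<String> expected, Set<String> retrieved) {
        if (expected.isEmpty()) return 0.0;
        Set<String> tp = new HashSet<>(retrieved);
        tp.retainAll(expected);
        return (double) tp.size() / expected.size();
    }

    static double fMeasure(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    // Macro-averaged F-measure over parallel lists of expected and
    // retrieved result sets (one entry per SELECT query).
    static double macroF(List<Set<String>> expected, List<Set<String>> retrieved) {
        double sum = 0.0;
        for (int i = 0; i < expected.size(); i++) {
            double p = precision(expected.get(i), retrieved.get(i));
            double r = recall(expected.get(i), retrieved.get(i));
            sum += fMeasure(p, r);
        }
        return sum / expected.size();
    }
}
```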

Triples per second: at the end of each stream, and once the corresponding SELECT query has been performed against the system, we will measure the triples per second as the total number of triples inserted during that stream divided by the total time needed for those triples to be inserted (start time of the SELECT query minus start time of the first INSERT query of the stream).
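Under this definition the KPI is a simple rate; the sketch below (class and parameter names are illustrative) assumes millisecond timestamps.

```java
// Sketch of the triples-per-second KPI: the triples inserted during one
// stream divided by the stream's duration, taken as the interval from
// the first INSERT of the stream to the start of the SELECT query.
public class StreamThroughput {

    static double triplesPerSecond(long triplesInserted,
                                   long firstInsertStartMillis,
                                   long selectStartMillis) {
        double seconds = (selectStartMillis - firstInsertStartMillis) / 1000.0;
        return triplesInserted / seconds;
    }
}
```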

Average answer time: we will report the average delay between the timestamp at which the SELECT query was executed and the timestamp at which the results were sent to the evaluation storage. No additional effort is needed from the participants to calculate these timestamps, since the first is generated by the task generator and the second by the evaluation storage.

Transparency will be assured by releasing the dataset generators as well as the configurations.

Task 2: Data Storage Benchmark

Summary Description

This task consists of an RDF benchmark that measures how datastores perform with interactive, simple, read SPARQL queries. Running the queries is accompanied by a high insert rate (via SPARQL INSERT queries) in order to mimic real use cases where READ and WRITE operations are bundled together. Typical bulk loading scenarios are also supported. The queries and query mixes are designed to stress the system under test in different choke-point areas, while being credible and realistic.

Testing and Training Data

The LDBC Social Network Benchmark is used as a starting point for this benchmark. The dataset generator developed for that benchmark is modified to produce synthetic RDF datasets available in different sizes that are more realistic and more RDF-like. The structuredness of the dataset is in line with real-world RDF datasets, unlike the LDBC Social Network Benchmark dataset, which is designed to be more generic and very well structured. The output of the dataset is split into three parts: the dataset that should be loaded by the system under test, a set of update streams containing update queries, and a set of files containing the different parameter bindings that will be used by the driver to generate the read queries of the workloads.

Use Cases

The use case of this task is an online social network since it is the most representative and relevant use case of modern graph-like applications. A social network site represents a relevant use case for the following reasons:

It is simple to understand for a large audience, as it is arguably present in our every-day life in different shapes and forms.

It allows testing a complete range of interesting challenges, by means of different workloads targeting systems of different nature and characteristics.

A social network can be scaled, allowing the design of a scalable benchmark targeting systems of different sizes and budgets.

Requirements

For task 2, participants must:

provide his/her solution as a docker image (same as Task 1)

provide a SystemAdapter class that:

Receives generated data that come from Data Generator
The implemented SystemAdapter has to retrieve the triples representing the dataset, which have to be bulk loaded, in RDF files (in standard Turtle format).

Receives tasks that come from Task Generators:
There are two different types of tasks:

SPARQL SELECT query: There are 21 different query types, with many query parameters, that should be executed in a specific order, with a specified query mix

SPARQL UPDATE query: There are 8 different types of updates, inserting different types of entities into the triple store

All the queries are written in the SPARQL 1.1 standard.

Sends the results to the evaluation storage as a byte array

provide a storage solution that can handle SELECT and UPDATE SPARQL queries, as well as the bulk loading of the dataset files

Evaluation

After generating a dataset of the desired size, the whole dataset will be bulk loaded and the loading time will be measured. Afterwards, the queries (SPARQL INSERT and SPARQL SELECT) will be executed against the system under test, and their results will be sent to the evaluation storage, which is responsible for measuring their execution times and for comparing the actual answers to the expected ones used as a gold standard.
The KPIs that will be relevant are:

Throughput (queries per second): The execution rate per second for all queries. The execution time is measured for every single query, and for each query type the average query execution time will be calculated as well.
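A sketch of this KPI (with illustrative names and millisecond timings, not the benchmark's actual API) could look as follows: an overall queries-per-second rate, plus the average execution time per query type.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the throughput KPI: overall queries per second, plus the
// average execution time per query type, computed from (type, millis)
// samples. Query type names and timings are illustrative.
public class ThroughputKpi {

    static double queriesPerSecond(int queryCount, long totalMillis) {
        return queryCount / (totalMillis / 1000.0);
    }

    // Groups the samples by query type and averages each group.
    static Map<String, Double> averagePerType(List<Map.Entry<String, Long>> samples) {
        Map<String, List<Long>> byType = new HashMap<>();
        for (Map.Entry<String, Long> s : samples) {
            byType.computeIfAbsent(s.getKey(), k -> new ArrayList<>()).add(s.getValue());
        }
        Map<String, Double> averages = new HashMap<>();
        byType.forEach((type, times) -> averages.put(type,
                times.stream().mapToLong(Long::longValue).average().orElse(0.0)));
        return averages;
    }
}
```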

Query failures: The number of queries whose returned results are not as expected.

Task 3: Versioning RDF Data

Summary Description

The evolution of datasets would often require storing different versions of the same dataset, so that interlinked datasets can refer to older versions of an evolving dataset and upgrade at their own pace, if at all. Supporting the functionality of accessing and querying past versions of an evolving dataset is the main challenge for archiving/versioning systems. In this sub-challenge we will propose a benchmark that will be used to test the ability of versioning systems to efficiently manage evolving datasets and queries evaluated across the multiple versions of said datasets.

Testing and Training Data

The Semantic Publishing Benchmark (SPB) generator will be used to produce datasets and versions thereof. SPB was developed in the context of the Linked Data Benchmark Council (LDBC) and is inspired by the Media/Publishing industry, in particular by the BBC’s “Dynamic Semantic Publishing” (DSP) concept. We will use the SPB generator, which uses ontologies and reference datasets provided by the BBC, to produce sets of creative works. Creative works are metadata, represented in RDF, about real-world events (e.g., sport events, elections). The data generator supports the creation of arbitrarily large RDF datasets in the order of billions of triples that mimic the characteristics of the real BBC datasets. Data generation follows three principles:

data clustering in which the number of creative works produced diminishes as time goes by

correlation of entities, where two or three entities are used to tag creative works for a fixed period of time

random tagging of entities where random data distributions are defined with a bias towards popular entities created when the tagging is performed.

The data generator follows distributions obtained from real-world datasets, thereby producing data that bears similar characteristics to real data. The versioning benchmark used in this sub-challenge includes datasets and versions thereof (each constructed by adding a set of triples to the previous version) that respect the aforementioned principles.

The training data are available here. Please first read the documentation in the README.txt file.

Use Cases

The use cases that are considered by this benchmark are those that address versioning problems. Such use cases span different domains and applications of interest, such as the energy domain, semantic publishing, biology, etc. For this task we will employ data from the semantic publishing domain.

Requirements

For task 3, participants must:

provide his/her solution as a docker image. First install docker using the instructions found here and then follow the guide on how to create your own docker image found here.

upload their systems to the HOBBIT platform using the instructions found here.

provide a SystemAdapter class in their preferred programming language. The SystemAdapter is the main component that establishes the communication between the other benchmark components and the participant’s system. The functionality of a SystemAdapter is divided into the following steps:

Initialization of the storage system

Retrieval of the UTF-8 string representation of the graph name that determines the version into which the received data have to be loaded.

Retrieval of generated data in the form of RDF files.

Retrieval and execution of SELECT SPARQL queries against the storage system, and sending of the results to the EvaluationStorage component.

Shut down of the storage system.

More details on some of these steps can be found below.

For more information on how to create a SystemAdapter please follow the instructions found here.

Here you can find an example SystemAdapter for the Virtuoso system implemented in Java. To let Virtuoso manage evolving data, we considered that each version is stored in its own named graph with name http://graph.version.{version_num}, so the Full Materialization archiving strategy was followed.

Retrieval of string representation of the graph name:

Graph names may have one of the following forms:

http://datagen.version.0.{data_file_name}, where data_file_name is the name of the sent data file. The data files of version 0 correspond to the required ontologies and the generated creative works that have to be loaded into the initial version of the dataset.

http://datagen.added.set.{version_num}.{data_file_name}, where version_num determines the version that we will end up with after the addition of the current data. Note that in the current version of the benchmark only additions, with respect to the previous version, are supported. data_file_name is the name of the sent data file.

E.g., assuming a system that holds independent copies of different versions, and that the generated data is divided into 3 versions, in order to have all received data loaded into the correct versions it has to:

Load to V0 received data with graph names:

http://datagen.version.0.{data_file_name}

Load to V1 received data with graph names:

http://datagen.version.0.{data_file_name}

http://datagen.added.set.1.{data_file_name}

Load to V2 received data with graph names:

http://datagen.version.0.{data_file_name}

http://datagen.added.set.1.{data_file_name}

http://datagen.added.set.2.{data_file_name}
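As an illustration, a SystemAdapter holding independent copies of the versions could map graph names to target versions as in the hypothetical sketch below; only the graph-name patterns are taken from the description above, the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of mapping the received graph names to the version a file
// belongs to: version 0 data matches http://datagen.version.0.{file},
// while additions match http://datagen.added.set.{version_num}.{file}.
// Under Full Materialization, version v contains the version 0 data
// plus all added sets 1..v.
public class GraphNameParser {

    private static final Pattern INITIAL =
            Pattern.compile("http://datagen\\.version\\.0\\..+");
    private static final Pattern ADDED =
            Pattern.compile("http://datagen\\.added\\.set\\.(\\d+)\\..+");

    // Returns the version introduced by this graph name, or -1 if unknown.
    static int versionOf(String graphName) {
        if (INITIAL.matcher(graphName).matches()) return 0;
        Matcher m = ADDED.matcher(graphName);
        if (m.matches()) return Integer.parseInt(m.group(1));
        return -1;
    }

    // A file belongs to materialized version v iff it was introduced
    // in version v or earlier.
    static boolean belongsTo(String graphName, int version) {
        int introduced = versionOf(graphName);
        return introduced >= 0 && introduced <= version;
    }
}
```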

Retrieval and execution of SELECT SPARQL queries.

The performance of a storage system will be explored using 8 different types of SELECT SPARQL queries. The SystemAdapter must be able to receive the SELECT query as a byte array and then transform it into a UTF-8 encoded String. The received queries are written in SPARQL 1.1, assuming that each version is stored in its own named graph. Systems that follow a different storage implementation or use their own enhanced versions of SPARQL to query versions have to rewrite the queries accordingly. After the appropriate adjustments, the SystemAdapter performs the SELECT query against the storage system, and the retrieved results must be serialised into JSON format (abiding by the W3C standards). The JSON strings should be UTF-8 encoded. Finally, the results in JSON will be transformed into a byte array and sent to the evaluation storage along with the task’s unique identifier.

Evaluation

To test the ability of a versioning system to store multiple versions of datasets, our versioning benchmark produces versions of an initial dataset using as parameters (a) the number of required versions and (b) the total number of triples (including triples of all versions). The number of versions is specified by the user of the benchmark, who is also able to specify the starting time as well as the duration of the generated data. In this manner, the user can check how well the storage system addresses the requirements raised by the nature of versioned data. The benchmark tests the ability of the system to answer eight different types of versioning queries, as described in Section 2.2 of D5.2.1_First_Version_Versioning_Benchmark. These queries are specified in terms of the ontology of the Semantic Publishing Benchmark and written in SPARQL 1.1.

In our evaluation we will focus on the following KPIs:

Query failures: The number of queries that failed to be executed. By failure we mean that the returned results are not those expected.

Throughput (in queries per second): The execution rate per second for all queries.

Initial version ingestion speed (in triples per second): The total number of triples that can be loaded per second for the dataset’s initial version. We distinguish this from the ingestion speed of the other versions because the loading of the initial version differs greatly from the loading of the following ones, where different underlying procedures, such as computing deltas, reconstructing versions, or storing duplicated information between versions, may take place.

Applied changes speed (in triples per second): The average number of changes that can be stored by the benchmarked system per second after the loading of all new versions. This KPI tries to quantify the overhead of the underlying procedures, mentioned in the Initial version ingestion speed KPI, that take place when a set of changes is applied to a previous version.

Average query execution time (in ms): The average execution time, in milliseconds, for each one of the eight versioning query types (e.g. version materialisation, single-version queries, cross-version queries, etc.).

Task 4: Faceted Browsing

Summary description

Faceted browsing stands for a session-based (state-dependent) interactive method for query formulation over a multi-dimensional information space. It provides a user with an effective way to explore a search space. After having defined the initial search space, i.e., the set of resources of interest to the user, a browsing scenario consists of applying (or removing) filter restrictions on object-valued properties or of changing the range of number-valued properties. Using such operations, aimed at selecting resources with desired properties, the user browses from state to state, where a state consists of the currently chosen facets and facet values and the current set of instances satisfying all chosen constraints.

The task on faceted browsing checks existing solutions for their capabilities of enabling faceted browsing through large-scale RDF datasets, that is, it analyses their efficiency in navigating through large datasets, where the navigation is driven by intelligent iterative restrictions. In several browsing scenarios we measure the performance relative to the following choke points:

1. Property based transition (Find all instances which realize a certain property with any property value)
2. Property value based transition (Find all instances which have a certain property value)
3. Property path value based transition (Find all instances which have a certain value at the end of a property path)
4. Property class value based transition (Find all instances which have a property value lying in a certain class)
5. Transition of a selected property value class to one of its subclasses (For a selected class that a property value should belong to, select a subclass)
6. Change of bounds of directly related numerical data (Find all instances that have numerical data lying within a certain interval behind a directly related property)
7. Change of numerical data related via a property path of length strictly greater than one edge (Similar to 6, but now the numerical data is indirectly related to the instances via a property path)
8. Restrictions of numerical data where multiple dimensions are involved (Choke points 7 and 8 under the assumption that bounds have been chosen for more than one dimension of numerical data)
9. Unbounded intervals involved in numerical data (Choke points 7, 8, 9 when intervals are unbounded and only an upper or lower bound is chosen)
10. Undoing former restrictions to previous state (Go back to instances of a previous step)
11. Entity-type switch changing the solution space (Change of solution space while keeping the current filter selections)
12. Complicated property paths or circles (Choke points 3 and 4 with advanced property paths involved)
13. Inverse direction of an edge involved in property path based transition (Property path value and property value based transitions where the property path involves traversing edges in the inverse direction)
14. Numerical restriction over a property path involving the inverse direction of an edge (Numerical data restrictions at the end of a property path where the property path involves traversing edges in the inverse direction)

Testing and training data

For this task, the transport dataset of linked connections will be used. The transport dataset is provided by a data generator and consists of train connections modelled using the transport ontology following GTFS (General Transit Feed Specification) standards – see here for more details. The datasets may be generated in different sizes, while the underlying ontology remains the same – see here for a visualization of the ontology relevant to the task.

A participating system is required to answer a sequence of SPARQL queries, which simulate browsing scenarios through the underlying dataset. The browsing scenarios are motivated by the natural navigation behaviour of a user (such as a data scientist) through the data, and are also designed to check participating systems on certain choke points. The queries involve temporal (time slices), spatial (different map views) and structural (ontology related) aspects.

For training, we provide a dataset of triples in Turtle format coming from our generator, as well as a list of SPARQL queries for sample browsing scenarios. Two scenarios are similar to the ones used in the testing phase, while a third is meant to illustrate all the possible choke points that we aim to test on. The training data are available at: http://hobbitdata.informatik.uni-leipzig.de/MOCHA_OC/Task4/

Use Cases

Intelligent browsing by humans aims to find specific information under certain assumptions along temporal, spatial or other dimensions of statistical data. “Since plain web browsers support sessional browsing in a very primitive way (just back and forth), there is a need for more effective and flexible methods that allow users to progressively reach a state that satisfies them”, as Tzitzikas et al. point out in their recent survey on faceted browsing (DOI: 10.1007/s10844-016-0413-8). The ability to efficiently perform such faceted browsing is therefore important for the exploration of most datasets, for example in human-controlled information retrieval from topic-oriented datasets. We will include a use case in which a data analyst wants to explore the characteristics of a train network (e.g. delays in a particular region at certain times of day) based on the Linked Connections dataset (see here for details).

Requirements

For task 4, participants must:

provide his/her solution as a docker image (as Task 1)

provide a SystemAdapter receiving and executing SELECT and SELECT COUNT SPARQL queries and subsequently sending the results to the EvaluationStorage in the formats defined below. The platform sends a new query to the system adapter only after a reply to the previous query has been recorded.

the incoming SPARQL queries have to be read from incoming byte arrays as defined by the task queue, where the ‘data’ part of the byte array contains the SPARQL query as a UTF-8 String.

for instance retrieval (SELECT) queries, the result list should be returned as a byte array following the result queue standard with the ‘data’ part of the byte array containing a UTF-8 String with the results as a comma separated list of URIs. The byte array needs to be sent to the evaluation storage.

for facet count (SELECT COUNT) queries, the result should be returned as a byte array following the result queue standard with the ‘data’ part consisting of a String that contains the count (integer) value as UTF-8 encoded String.
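A minimal sketch of the two ‘data’ payloads described above, assuming plain UTF-8 encoding as stated (the class and method names are illustrative, not part of the platform API):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch of the two result payloads: a comma-separated URI list for
// instance retrieval (SELECT) queries and the count as a decimal
// string for facet count (SELECT COUNT) queries, both UTF-8 encoded.
public class ResultPayloads {

    static byte[] instanceRetrievalData(List<String> uris) {
        return String.join(",", uris).getBytes(StandardCharsets.UTF_8);
    }

    static byte[] facetCountData(long count) {
        return Long.toString(count).getBytes(StandardCharsets.UTF_8);
    }
}
```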

A participating system needs to answer SPARQL SELECT and SELECT COUNT queries. To answer the queries, the system in particular needs to support the following:

Systems have to correctly interpret the notation rdfs:subClassOf*, denoting a path of zero or more occurrences of rdfs:subClassOf.
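For illustration, a query using this property path might look as in the sketch below; the class URI and the query shape are hypothetical, not taken from the benchmark.

```java
// Illustrative (hypothetical) SPARQL query using the rdfs:subClassOf*
// property path: it retrieves instances of a class or of any of its
// subclasses, at any depth of the subclass hierarchy.
public class SubClassOfStar {

    static String instancesOf(String classUri) {
        return "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
             + "SELECT ?s WHERE {\n"
             + "  ?s a ?c .\n"
             + "  ?c rdfs:subClassOf* <" + classUri + "> .\n"
             + "}";
    }
}
```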

Evaluation

During the simulated browsing scenario through the dataset, two types of queries are to be answered correctly:

Facet counts (in form of SPARQL SELECT COUNT queries):
For a specific facet, we ask for the number of instances that remain relevant after restriction over this facet. To increase efficiency, approximate counts (e.g. obtained by different indexing techniques) may be returned by a participating system.

Instance retrieval (in form of SPARQL SELECT queries):
After selecting a certain facet as a further filter on the solution space, the actual remaining instances are required to be returned.

One browsing scenario consists of 8 to 11 changes of the solution space (instance retrievals), where each step may be the selection of a certain facet, a change in the range value of a literal property (which may be indirectly related through a complex property path), or the action of undoing a previously chosen facet or range restriction.

The evaluation is based on the following performance measures:

Time: The time required by the system is measured separately for the two tasks, facet counts and instance retrievals. The results are returned via a score function computing the number of answered queries per second. For the instance retrieval queries, we additionally compute the queries-per-second score for several choke points separately.

Accuracy of counts: The facet counts are checked for correctness. For each facet count, we record the distance of the returned count from the correct count in absolute terms, and we record the error relative to the size of the solution space (relative error). We both sum and average over all steps of the browsing scenario, resulting in four overall error terms:

overall absolute error (sum of all errors)

average absolute error

overall relative error (sum of all errors over sum of all counts)

average relative error (average of the relative errors over all count queries)
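The four error terms can be sketched as follows (hypothetical helper names; the correct counts are the gold-standard solution-space sizes):

```java
import java.util.List;

// Sketch of the four facet-count error terms: absolute errors compare
// returned to correct counts directly; relative errors scale each
// error by the correct count (the size of the solution space).
public class CountErrors {

    static double overallAbsoluteError(List<Long> correct, List<Long> returned) {
        double sum = 0.0;
        for (int i = 0; i < correct.size(); i++) {
            sum += Math.abs(returned.get(i) - correct.get(i));
        }
        return sum;
    }

    static double averageAbsoluteError(List<Long> correct, List<Long> returned) {
        return overallAbsoluteError(correct, returned) / correct.size();
    }

    // Sum of all errors over the sum of all correct counts.
    static double overallRelativeError(List<Long> correct, List<Long> returned) {
        double total = correct.stream().mapToLong(Long::longValue).sum();
        return overallAbsoluteError(correct, returned) / total;
    }

    // Mean of the per-query relative errors.
    static double averageRelativeError(List<Long> correct, List<Long> returned) {
        double sum = 0.0;
        for (int i = 0; i < correct.size(); i++) {
            sum += Math.abs(returned.get(i) - correct.get(i)) / (double) correct.get(i);
        }
        return sum / correct.size();
    }
}
```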

Accuracy of instance retrievals: For each instance retrieval, we collect the true positives, false positives and false negatives to compute an overall precision, recall and F1-score. Additionally, we compute precision, recall and F1-score for several choke points separately.