Objective

Linked Data has gained significant momentum in recent years. It is now used at industrial scale in many sectors in which an increasingly large amount of rapidly changing data needs to be processed. HOBBIT is an ambitious project that aims to push the development of Big Linked Data (BLD) processing solutions by providing a family of industry-relevant benchmarks for the BLD value chain through a generic evaluation platform. We aim to make open, deterministic benchmarks available to test the performance of existing systems and push the development of innovative industry-relevant solutions. The underlying data will mimic real industrial data assembled during the course of the project. At the beginning of the project, HOBBIT will work on roughly 1 PB of real industry-relevant data from four different domains. The data will be extended through collaborations during the project. To promote the use of the benchmarks, we will organize or join challenges that aim to measure the performance of technologies for the different steps of the BLD lifecycle. In contrast to existing benchmarks, we will provide modular and easily extensible benchmarks for all industry-relevant BLD processing steps, allowing the assessment of whole suites of software that cover more than one step. The infrastructure necessary to run the evaluation campaigns will be made available. Our architecture will rely on web interfaces and cloud infrastructures to ensure scalability. The open HOBBIT platform will make public, periodic reports available in both human- and machine-readable form. As an exit strategy, the project will create an association after the second project year that will be sustained by subscriptions from industry and academia and associated with existing benchmarking associations. A clear portfolio of added value for the members will be defined in the early project stages and disseminated throughout the evaluation campaigns.


Big Linked Data benchmarking gains ground in industry

Making ‘Big Linked Data’ a bankable solution for industry requires appropriate benchmarking tools to ensure that the developed solutions meet use cases’ requirements. Such tools are now available thanks to work conducted under the HOBBIT project.

Ever heard of Linked Data? If not, you probably should have or will have soon enough. Just as Big Data is an evolution of data mining, Linked Data is an evolution of the Semantic Web which is itself the cornerstone of the Web 3.0 – an Internet where all information is categorised in such a way that computers and humans are made equal in their capacity to understand it. In a nutshell, Linked Data consists of using the web to connect related data that wasn't previously connected.
Industry already uses Linked Data, but its integration with Big Data has so far been hindered by the cost and difficulty of using the latter in a value chain. ‘Big Linked Data’ is facing obstacles related to the lack of standardised implementations of performance indicators – making it difficult to decide which tool to use and when to use it – and the fact that some of the dimensions of Big Data (velocity, volume, variety, veracity, value) are poorly supported by existing tools.
“For example, managing billions of RDF triples (ed. note: a set of three entities that codifies a statement about semantic data in the form of subject–predicate–object expressions, such as ‘John Doe loves CORDIS’) is still a major problem, volume-wise,” explains Prof. Dr Axel Ngonga of Paderborn University and the Institute for Applied Informatics in Leipzig. “Besides, the different streaming semantics and the lack of scalability of existing solutions make semantic stream processing at scale rather challenging (velocity issue). Finally, current learning approaches for structured data often don’t scale to large knowledge bases, making the detection of insights difficult (value).”
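The subject–predicate–object shape of an RDF triple mentioned above can be illustrated with a minimal sketch in plain Python (illustrative only; production systems use dedicated triple stores or libraries such as rdflib, and the `ex:` names here are made up):

```python
# Toy illustration of the RDF data model: each fact is a
# (subject, predicate, object) triple. A real store indexes billions
# of these; a plain set suffices to show the idea.

triples = {
    ("ex:JohnDoe", "ex:loves", "ex:CORDIS"),
    ("ex:JohnDoe", "rdf:type", "foaf:Person"),
    ("ex:CORDIS", "rdf:type", "ex:Portal"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard,
    much like a variable in a SPARQL triple pattern."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

# Everything stated about ex:JohnDoe:
about_john = match(s="ex:JohnDoe")
```

The wildcard-based `match` mirrors, in miniature, how graph pattern queries over triples work; the volume problem Prof. Ngonga describes is doing exactly this over billions of triples.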
Prof. Dr Ngonga has been leading a nine-strong consortium under the HOBBIT (Holistic Benchmarking of Big Linked Data) project to address these problems. Focusing on Industry 4.0 geo-spatial data management, smart cities and IT management, the team carried out surveys with over 100 participants before and during the project to determine key areas for benchmarking Linked Data. “Our surveys suggest that the benchmark families we created address some of the key domains of interest for European companies and researchers,” he explains.
HOBBIT created a total of five benchmarking families to evaluate current software: knowledge extraction, storage, versioning, linking, and machine learning and question answering. On storage, they found that some of the solutions that performed best actually did so because the result sets they returned were incomplete. This alone proves that HOBBIT’s benchmarking covers previously unconsidered aspects and that there is a need for benchmarks all around Linked Data.
Other findings include the fact that easily distributable solutions for knowledge extraction are still needed; that versioning is poorly supported and requires a standard; that open question-answering platforms still perform poorly in the wild; and that machine learning algorithms specific to Linked Data don’t scale too well.
In this context, HOBBIT provides the first open, scalable and FAIR (findable, accessible, interoperable and reusable results) benchmarking for Linked Data: “The HOBBIT platform is the first generic scalable benchmark for Big Linked Data. Its most innovative aspects include: distributed benchmarking of distributed systems; its portable nature for benchmarking both locally and in distributed environments; a one-command installation both locally and on Amazon Web Services; the reuse of standards for maximal interoperability and flexibility; and clearly defined interfaces for easy adaptation to other data types and use cases,” says Dr Ngonga.
The platform has been well received by industry, with around 40 clones created each month and some industrial partners willing to bring benchmarking services in-house to improve the quality of their tools.
The HOBBIT project will only end in November, as a second round of benchmarks is currently being run. The association created under the project will then take over, serving as a hub for benchmarking in Europe, supporting the further development of the HOBBIT platform and similar benchmarking frameworks, and providing benchmarking services to European stakeholders.

The availability of diverse solutions for processing Big Data is a mixed blessing. While the available solutions were designed to cater for a variety of needs, it is often unclear to users whether, or to which degree, their needs were taken into consideration. Hence, users need to evaluate the performance of these solutions objectively so as to select, integrate and use Big Data software solutions in an informed manner. The major problem addressed by HOBBIT is the lack of uniform solutions for benchmarking Big Linked Data across its lifecycle (see Figure 1). HOBBIT addresses the need for better solutions for benchmarking Big Linked Data through the following objectives:
- An open task-driven benchmarking platform to evaluate the performance of Big Linked Data processing systems. The platform is designed as a scalable distributed solution for benchmarking large-scale systems. This benchmarking solution needs to be highly portable and run both on single machines and computer clusters to ensure that it supports benchmarking at any scale. Its main features must also include the generation of open, human- and machine-readable reports on the evaluation campaign results. The published data should include configuration data, experimental results, and fine-grained results for the different KPIs. In addition, the platform is to provide diagnostics mechanisms to support both developers and users in their quest for better solutions and tools.
- Benchmarks of industrial relevance in Europe. Data are one of the key assets of an increasing number of European companies. Making industrial data public is hence a difficult and partly counterproductive endeavor. To ensure that our platform still returns results of industrial relevance, we need to circumvent the hurdle of making real industrial data public by deploying mimicking algorithms. These will allow configuring synthetic data generators so as to compute data streams that display the same characteristics as industry data while being open and available for evaluation without restrictions.
- Reference implementations for industry-relevant key performance indicators (KPIs). Open-source implementations of widely accepted measures are to be provided to ensure that the results generated within the project can be understood and checked by any organisation.
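The idea behind such mimicking algorithms can be sketched as follows: only aggregate statistics are extracted from the confidential industrial data, and those statistics parameterise an open synthetic generator. This is a deliberately simplified sketch under assumed Gaussian data; the project's actual algorithms model far richer features such as temporal and spatial distributions:

```python
import random
import statistics

def fit_profile(real_values):
    """Extract only summary statistics from the confidential data;
    the raw values themselves never leave the data owner."""
    return {
        "mean": statistics.mean(real_values),
        "stdev": statistics.stdev(real_values),
        "size": len(real_values),
    }

def mimic(profile, seed=42):
    """Generate an open, shareable stream that displays the same
    statistical characteristics as the original data."""
    rng = random.Random(seed)  # fixed seed keeps the generator deterministic
    return [rng.gauss(profile["mean"], profile["stdev"])
            for _ in range(profile["size"])]

# Hypothetical confidential sensor readings stay in-house;
# only the profile is used to configure the open generator.
private_readings = [20.1, 19.8, 21.3, 20.7, 19.9, 20.4]
profile = fit_profile(private_readings)
synthetic = mimic(profile)
```

The fixed seed matters for benchmarking: a deterministic generator means every system under test receives the same data, keeping results comparable without ever publishing the industrial originals.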

Over the last 36 months, the HOBBIT consortium has aimed to fulfil the aforementioned objectives by implementing the plan of action illustrated in Figure 2.
- Data and measure collection: We gathered input on relevant datasets and quality measures from members of the European industry landscape through surveys. To this end, we (1) joined the EU project DataBench in the creation of the HOBBIT association and (2) co-organized and participated in meetups around Europe (including, e.g., EBDVF 2018). During these meetups, we presented the idea behind HOBBIT and engaged with the participants to gather their requirements for a Big Linked Data benchmarking platform. The main results of HOBBIT’s dissemination and engagement were (1) the creation of a HOBBIT association as Special Group 7 of Task Force 6 of the Big Data Value Association, (2) surveys to gather information from European companies and academia pertaining to their use and evaluation of Big Linked Data and corresponding platforms and (3) datasets for the HOBBIT data repository. Overall, HOBBIT compiled a contact list with more than 300 members. The 25 datasets and dataset generators available through the HOBBIT CKAN repository at https://hobbit.ilabt.imec.be/ encompass industrially relevant datasets partly provided by HOBBIT.
- Benchmark creation: The measures and datasets collected formed the basis for the 8 HOBBIT benchmarks, which were made available in 2 versions over the course of the project. Each benchmark comprises three components: a deterministic data source, a number of tasks and a set of KPIs. In addition, 5 scalable mimicking algorithms (which generate data of industrial relevance) were created in the project to ensure that the benchmarks reflect realistic use cases and to circumvent the problem of not being given access to real datasets from industry. A number of evaluations showed that the mimicking algorithms provided by the project generate synthetic data close to real data w.r.t. features such as temporal and spatial distribution.
- The HOBBIT evaluation platform (see Figure 3) is the third core result of HOBBIT. It is built to support the benchmarking of Big Linked Data solutions at both small and large scale. The platform is developed as an open-source solution (see https://github.com/hobbit-project/platform) and supported 14 challenges over the project runtime. Extensions to remote computation facilities such as AWS and an SDK complete the package. A mix of contributions from HOBBIT and from external users has now led to the platform containing 52 benchmarks and more than 300 Docker images. The more than 200 users and 12,600 experiments run over the runtime of the project suggest that the HOBBIT platform is turning into a crystallization point for benchmarking Big Linked Data.
- Evaluation campaigns: HOBBIT ran evaluation campaigns for all benchmarks within 14 challenges (including the Mighty Storage Challenge (MOCHA), the Question Answering on Linked Data Challenge (QALD) and the Open Knowledge Extraction Challenge (OKE) at ESWC 2017 as well as the DEBS Grand Challenges 2017 and 2018). The results show that the HOBBIT benchmarking platform scales to the requirements of large-scale benchmarking. Limitations of existing solutions at scale (e.g. completeness for storage, recall for question answering, F-measure for machine learning) could be unveiled through the scalable benchmarking provided by HOBBIT. Moreover, the lack of scalability of a large number of Linked Data solutions was made evident.
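The three-component structure of a HOBBIT benchmark (deterministic data source, tasks, KPIs) can be sketched as a simple data model. The names and the `recall` KPI here are illustrative assumptions; in the actual platform these components are Docker-based services communicating over its messaging layer:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Benchmark:
    """Sketch of the three parts every HOBBIT benchmark comprises."""
    # 1) a deterministic data source: same seed -> same data, so runs are comparable
    data_source: Callable[[int], List[str]]
    # 2) a number of tasks posed to the system under test
    tasks: List[str]
    # 3) a set of KPIs computed from the system's answers
    kpis: Dict[str, Callable[[list, list], float]]

def recall(expected, actual):
    """Example KPI: fraction of expected answers the system returned."""
    return len(set(expected) & set(actual)) / len(expected) if expected else 1.0

bench = Benchmark(
    data_source=lambda seed: [f"triple-{seed}-{i}" for i in range(100)],
    tasks=["ingest", "query"],
    kpis={"recall": recall},
)

# A deterministic source yields identical data for identical seeds:
deterministic = bench.data_source(7) == bench.data_source(7)
score = bench.kpis["recall"](["a", "b"], ["a", "c"])  # 1 of 2 expected -> 0.5
```

Determinism in the data source is what makes experiments on the platform repeatable and results between systems comparable.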

HOBBIT produced a large number of innovations, as witnessed by the more than 80 HOBBIT-related papers published by the project partners during the course of the project (see https://www.bibsonomy.org/search/projecthobbit). The core innovation of HOBBIT is the HOBBIT benchmarking platform, a FAIR and open-source solution for the comparable benchmarking of Big Linked Data solutions across the Big Linked Data lifecycle. In addition, HOBBIT has developed 8 novel benchmarks for the evaluation of Linked Data solutions at large scale (see http://project-hobbit.eu/outcomes). These include 1) benchmarks for RDF data backends, including benchmarks for data ingestion, data storage and querying, which all measure how fast and correctly systems deal with streams of data at industrial scales; 2) benchmarks for knowledge extraction; and 3) entity matching and linking benchmarks. The innovation behind the solutions generated by this project is underpinned by the 28 HOBBIT publications which have already been accepted at high-ranking conferences. HOBBIT has collaborated and is collaborating with a large number of related research projects (e.g. BigDataEurope, BigDataOcean, SLIPO, SAKE, GEISER). The project aims to support the benchmarking efforts carried out in these projects through the HOBBIT platform. The expected societal impact of the project lies mainly in supporting the development of more efficient Big Data processing platforms to address societal challenges such as energy and transport.
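The "how fast and correctly" framing of the ingestion benchmarks can be illustrated with a toy harness that reports both a speed KPI (throughput) and a correctness KPI (completeness). This is purely illustrative; the real benchmarks stream RDF at industrial scale through the platform's infrastructure, and a plain Python set stands in for the system under test:

```python
import time

def benchmark_ingestion(system_store, stream):
    """Toy harness: push a stream into a store, then report throughput
    (how fast) and completeness (how correctly), the two KPI families
    the ingestion benchmarks measure."""
    start = time.perf_counter()
    for item in stream:
        system_store.add(item)
    elapsed = time.perf_counter() - start
    # Completeness: fraction of streamed items actually retrievable afterwards.
    completeness = len(system_store & set(stream)) / len(stream)
    return {
        "triples_per_second": len(stream) / elapsed if elapsed > 0 else float("inf"),
        "completeness": completeness,
    }

# A set stands in for the storage solution being benchmarked.
store = set()
stream = [f"triple-{i}" for i in range(10_000)]
report = benchmark_ingestion(store, stream)
```

Measuring both dimensions together is the point: as the storage results above showed, a system can look fast precisely because it silently drops part of the data, which only a completeness KPI reveals.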

Flow diagram of the HOBBIT platform

Mapping of Linked Data Lifecycle steps (bullet points) to the four steps of the Big Data value chain

Deliverables

This document details the preliminary state of the community, listing which parties are involved and which parties have expressed interest. It categorizes possible use cases, mentioning the interested parties. Use cases are linked to a list of datasets, which describes their contents, purpose, size, and expected growth. This document serves as a basis for further discussion and growth of the community, and as a precursor of the intermediary document D1.1.2.

Report on the project’s progress, targeting the general public. The report will focus on the impact of the conducted work as well as exemplifying the contribution and importance of the HOBBIT project towards facing these challenges during the first year.

After the first half of the project, we will summarize the dissemination work and evaluate our efforts against the defined KPIs. In case we do not achieve our goals, we will strive to develop suitable methods to reach our objectives by the end of the project.

When the formation of the association is coming to a conclusion, a final version of the mission statement is to be released, detailing a) how the association will continue after the end of the project and b) how the association intends to be self-sustaining. In order to substantiate point b), 3 to 5 business scenarios will be analyzed with a concrete implementation plan using the benchmarking platform.

This deliverable will be a report describing the results of T2.1. The architecture of the HOBBIT platform as well as the technologies used to ensure variable and scalable deployment will be described in detail in this report. Moreover, the structure of the code repository for the platform will be explained. This will ensure that third-party contributions can be made early on in the project.

Setting up the association in T1.3 requires finding the right members, which will be contacted early on in the project. This document will describe concrete strategies the partners will use to attract members for the association. In particular, we consider two types of strategies. On the one hand, we want to participate in and promote existing challenges, in which medium and large companies are involved. On the other hand, we want to integrate challenge datasets into the HOBBIT platform, and launch open calls towards industry and academia to stimulate the usage of these datasets. Potential users can be convinced by providing temporary free access to the platform, as well as technical support.

Report on the project’s progress, targeting the general public. The report will focus on the impact of the conducted work as well as exemplifying the contribution and importance of the HOBBIT project towards facing these challenges during the second year.

At the end of T1.1, the community will have matured sufficiently to generate an updated version of D1.1.2 with the final list of members, use cases, and datasets. Based on the feedback of the first review, we will compile the finalized version of the document. A dedicated section will specify differences from D1.1.2, in particular whether parties who expressed interest have joined and what their motivations were for (not) doing so. Any new use cases will satisfy the same requirements as in D1.1.2.

The second of two maintenance reports will summarize the execution of the second series of challenges from the perspective of the deployed platform. The focus will be on a careful analysis of the measures for the platform and on suggestions for future extensions of the platform.

One of the initial tasks of the association is to decide to what extent the requirements from D1.2.1 are relevant for the next phase. Furthermore, the existence of a prototype will make it possible to verify how the requirements of D1.2.1 have been translated into practice, allowing a) adjustments where necessary and b) the identification of additional requirements that were not considered at first by the community.

Based on the discussion of the preliminary version of the document (D1.1.1), we incorporate feedback from the different partners to produce an intermediary community list that will serve as direct input for the first review. Each of the use cases will be validated and supported by at least two members besides the use case submitter, in order to identify those use cases with maximum impact. All considered datasets will be validated in at least two use cases, and it will be clear how they can be obtained and what the considerations are (privacy, licensing, practical usage aspects).

This deliverable presents the state of the preparations for the organization of the second series of evaluation campaigns and the second series of workshops (i.e., provide the number of interested parties for the challenge, baseline results etc.) including the preparations for the proceedings of the second series of workshops. The report will also cover an overview of second evaluations of the platform for its use in the challenges.

The community documents the desired specifications for the benchmarking platform, based on joint discussions, and in particular the F2F meeting around M4. This serves as a direct input for T2.2 in which the first iteration of the platform is built.

The core management handbook and structure that will provide guidelines for the project manager and the project members to follow during the full cycle of the project. The handbook will set out the governance, project calendar, cluster and work package description and reporting roles and responsibilities for all partners. Moreover, it will define the communication mechanisms for the project’s partners and the templates for reporting and formal communication of the project’s activities and outcomes.

Before the association starts in M25, an initial mission statement is to be released, outlining a strategy for how the association intends to continue after the end of the project. At least 3 business scenarios will be analyzed, and a direction will be indicated for how the association can handle them. Each scenario will be analyzed pertaining to strengths, weaknesses, opportunities, and threats. The scenarios are to be supported by datasets in the HOBBIT platform (D1.1.2).

Report on the project’s progress, targeting the general public. The report will focus on the impact of the conducted work as well as exemplifying the contribution and importance of the HOBBIT project towards facing these challenges during the third year.

It will incorporate details on the quality assurance processes adopted within the HOBBIT project. It will define all processes and instruments to be used for the regular quality monitoring and risk assessment in the form of a handbook for project partners.

This deliverable presents the state of the preparations for the organization of the first series of evaluation campaigns and the first series of workshops (i.e., provide the number of interested parties for the challenge, baseline results etc.) including the preparations for the proceedings of the first series of workshops. The report will also cover an overview of first evaluations of the platform for its use in the challenges.

The first of two maintenance reports will summarize the execution of the first series of challenges from the perspective of the deployed platform. Bottlenecks (if any) will be identified and mitigation solutions will be presented. Analysis results (number of tasks carried out, throughput for different benchmarks, etc.) will be presented and summarize the performance of the platform.

This deliverable will describe the first version of the faceted browsing benchmark for combinations of geospatial, temporal and structured data streams. Moreover, baseline implementations provided by HOBBIT will be presented.

This report will be the second of two software reports for the HOBBIT platform. As in the first report, we will make the software and its code available online and provide a text report presenting the updated structure of the platform. The user manual for the integration of novel benchmarks will be extended with an updated evaluation. The report will be used as a manifesto by the HOBBIT association (see WP1) to invite more parties into the association.

This deliverable will provide an extension of the first benchmark in which more complex interdependencies between the generated data items will be modelled (e.g., second-order collocations between entities). This benchmark will also allow running typical data analytics queries as well as bulk loading and querying scenarios on a variety of storage solutions.

This deliverable will provide a first version of the storage benchmark that will allow running typical data analytics queries on a variety of data stores (incl. NoSQL databases and native triple stores). Typical bulk loading scenarios will also be supported.

This deliverable will describe the second version of the question answering benchmark for structured and statistical data streams. Moreover, improved baseline implementations provided by HOBBIT will be presented.

This deliverable will describe the second version of the faceted browsing benchmark for combinations of geospatial, temporal and structured data streams. Moreover, improved baseline implementations provided by HOBBIT will be presented.

Based on the feedback on the initial plan, more concrete descriptions are added for specific types of incoming and outgoing data. The plan will be enriched with examples of datasets already in the system. We will additionally regulate reuse and exploitation, and specify how data curation and preservation will happen, and detail which data is shared with what external parties, and how such a process takes place.

As part of the pilot for open access to research data, we will provide a Data Management Plan in accordance with the H2020 Guidelines on Data Management, which will detail how we deal with data throughout the project. In particular, we need to explain what (types of) datasets will be incorporated into the HOBBIT platform, and how we will handle anonymization and licensing issues. An overview of the different types of data that will be generated by the platform will be included, as well as the data formats, standards, and conventions to store and exchange this data. This initial plan will highlight the important categories and challenges in a generic way.

The intermediary data management plan will be validated against the needs and practices that have become clear during the project. Any updates to the data management plan of D8.5.2 will be clearly indicated and motivated. Particular emphasis will be given to data generated by the system, as the amount of this data will have increased significantly since the first version of this document. Finally, we will consider whether a deletion strategy is appropriate for parts of the incoming or outgoing data.

This deliverable comprises the initial promotion material, e.g., the fact sheet, the updated website and the first social media channels that could be used for the marketing of HOBBIT. A second press release (PR) will be prepared and translated into the partners’ national languages. The PR text will inform about the current status and milestones and point to the online channels. By M3 the full functionality of the website will be ready and connected with the other social media channels. A short report will summarize the initial efforts and document the dissemination KPIs.

This report will be the first of two reports for the HOBBIT platform. In addition to the software and its code being made available online, we will provide a text report presenting (1) the structure of the platform as well as a user manual for the integration of novel benchmarks and (2) an evaluation of the platform itself. The report will be used to request contributions from third parties.

Joint Proceedings of BLINK2017: 2nd International Workshop on Benchmarking Linked Data and NLIWoD3: Natural Language Interfaces for the Web of Data co-located with 16th International Semantic Web Conference