
Squerall: Virtual Ontology-Based Access to Heterogeneous and Large Data Sources

Submitted by Mohamed Nadjib Mami on 07/21/2018 - 05:28

Tracking #: 1957-3170

Authors:

Mohamed Nadjib Mami

Damien Graux

Simon Scerri

Hajira Jabeen

Sören Auer

Responsible editor:

Guest Editors Knowledge Graphs 2018

Submission type:

Full Paper

Abstract:

During the last two decades, a huge leap in terms of data formats, data modalities, and storage capabilities has been made. As a consequence, dozens of storage techniques have been studied and developed. Today, it is possible to store cluster-wide data easily while choosing a storage technique that suits our application needs, rather than the opposite. If different data stores are interlinked and queried together, their data can generate valuable knowledge and insights. In this study, we present a unified architecture, which uses Semantic Web standards to query heterogeneous Big Data stored in a Data Lake in a unified manner. In a nutshell, our approach consists of equipping original heterogeneous data with mappings and offering a middleware able to aggregate the intermediate results in a distributed manner. Additionally, we devise an implementation, named Squerall, that uses both Apache Spark and Presto as underlying query engines. Finally, we conduct experiments to demonstrate the feasibility, efficiency and solubility of Squerall in querying five popular data sources.

Review #1

The submission presents an architecture for querying data stored in a Semantic Data Lake, which is similar to the virtual ontology-based data access. The second contribution is a description of an implementation (Squerall) and its evaluation on a modified BSBM benchmark (why does the abstract mention 5 popular data sources?). These contributions fit the Knowledge Graphs 2018 special issue call for papers.

First of all, the notion of the Semantic Data Lake is a bit unclear - how is it different from ontology-based data integration (OBDI)? Note that although the focus in OBDI has mostly been on relational data sources, the OBDI framework does not restrict the type of data sources in any way. Also, the novelty of the Semantic Data Lake architecture is unclear - the general architecture is presented in Section 2.2, but the authors claim that they introduced the term in [10]. So, what is new in the submission?

Second, the solution presented in the main Section 3 for "enabling data joinability" can hardly be considered satisfactory: every time a pair of variables is deemed to contain URIs from different datasources but referring to the same objects, the user has to add a special TRANSFORM clause to the SPARQL query that describes the modifications of the URIs before they are "matched". This means, in particular, that if the same objects come from, for example, 3 different sources, then the user has a choice of specifying transformations for any two of the three pairs of datasources (that is, s1-s2 & s2-s3, or s1-s2 & s1-s3, or s1-s3 & s2-s3) or transformations for all three pairs (that is, s1-s2 & s2-s3 & s1-s3). In the latter case, the user can only pray that the transformations are compositional. What makes the whole approach truly cumbersome and error-prone is that those transformations would have to be repeated in *each* query to the datasources. Would this "matching" not be better placed at the level of mappings? Something like canonical URIs might also be an option.
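The compositionality worry can be made concrete with a small sketch. The three transformation functions below are invented for illustration (they are not from the paper); they show that composing the pairwise transformations s1-s2 and s2-s3 need not agree with the direct transformation s1-s3:

```python
# Hypothetical pairwise URI transformations a user would have to spell
# out in every query; all three are invented for illustration.
def t12(u): return u.replace("s1:", "s2:")   # map s1 URIs to s2 URIs
def t23(u): return u.upper()                 # map s2 URIs to s3 URIs (arbitrary)
def t13(u): return u.replace("s1:", "s3:")   # map s1 URIs to s3 URIs

u = "s1:book42"
print(t23(t12(u)))  # composed pairwise route: "S2:BOOK42"
print(t13(u))       # direct route: "s3:book42" -- the two disagree
```

Nothing in the TRANSFORM mechanism forces these per-query declarations to be mutually consistent, which is exactly why declaring the correspondence once, in the mappings, seems preferable.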

Third, the submission is quite poorly written, with lots of omissions and hidden assumptions.

\\\\\ DETAILED COMMENTS /////

page 1, line 21: It is not clear what the "opposite" is - did the authors mean rather than "choosing application needs that suit the storage technique"? If so, then it does not sound plausible.

page 1, right column, line 49: The authors mention "local-as-view", but I strongly suspect they actually mean "global-as-view", when the terms of the global schema are defined as views (that is, queries) over local schemas. Would the authors check the definitions?

page 2, left column, line 26: It is not clear who "we" in "we have previously introduced in [10]" are - the list of authors is not exactly the same (not even a subset).

page 2, right column, line 3: The emphasis on "declaratively" is unclear - this aspect has not been properly explained.

page 2, right column, line 16: Is the ontology really a taxonomy? Or is it just a vocabulary? In fact, the ontology is not described in the submission (apart from a short paragraph in Section 5.1).

page 2, right column, line 40: Cassandra appears out of nowhere - does the reader need to know what that is? Is it the most canonical example of a non-relational DB?

page 2, right column: ParSet is too much of a jargon term and POZ is not properly explained - the meaning of "ParSets live and evolve" is unclear. Also, "data source is any source of data" is obvious and thus redundant.

page 3, left column, line 39: What are "mapping ontologies"?

page 3, left column, lines 46-50: It is not clear why a Query Catalyst would decompose BGPs into stars - what is the purpose? Also, the notion of stars is not defined. And why is it called Query Catalyst? What is so catalytic about decomposing BGPs?

page 4, right column, lines 30-43: The meaning of the paragraph escapes me - the example has different variable names, and the snippet of code should be explained in more detail (what is 12? what is the effect of toInt?).

page 5, left column, lines 37-51: Why not use the same example as in Fig 2?

page 5, Figure 4: The meaning of the diagram is unclear (and the explanation is not satisfactory). What are a and b? Are they values or sets?

page 5, right column, lines 26-29: I do not understand the formula - on the one hand, s_1 and s_2 are parameters of Join and so, one assumes they are given along with pred, for example; on the other hand, s_1 and s_2 also occur under the existential quantifier in the braces. So, what are they? And why are there braces at all? Is the result a set of joins of stars?

page 6, right column, line 18: "interface to the outside" is unclear - what is "outside"?

page 6, right column, line 30: What is the meaning of "We do not intend to generate RDF triples, neither physically nor *virtually*"? Virtually generating means not generating, does it not?

page 7, left column, line 18: "prefix nosql" makes no sense - the user may prefer to use a different shortcut for http://purl.org/db/nosql# (the name of the shortcut is irrelevant)

page 7, left column, lines 49-51: The sentence is too long and complex, and it takes a couple of attempts to link the two sides of the "the compromise" (and no : is needed).

page 8, left column, Table 1: Is it really 2.6M persons for 5M tuples? It looks like a 3-fold increase from 26K to 77K, but a 30-fold increase from 77K to 2.6M.

page 8, left column, line 42: Why is ACID important here? Did the authors run updates in parallel?

page 10, left column, line 5: The authors mention ontology-based data access, but I could not find anything related to ontologies in the submission - the only exception is the vocabulary used to describe mappings (which can hardly be called ontology-based data access).

page 10, right column, line 24: The authors claim that the source code of the Optique Platform is not publicly available. However, Ontop, the query transformation system of the Optique Platform, is publicly available (including the source code).

pages 12-14: Appendix A should really be available only online (most of it is standard).

\\\\\ TYPOS /////

page 1, line 26: "solubility of Squerall" reads as though Squerall is a problem and the authors are finding a solution for it

page 9, left column, lines 26-30: it looks like a cut-and-paste gone wrong

page 11, left column: is the URL in [1] official?

pages 11-12: are the URLs in [8], [15], [19], [20] and others useful at all?

pages 11-12: check the name spelling in [12], [15], [27]

page 11, right column: [16] is in capitals for no good reason

page 12: the list of authors in [34] is shortened, yet a similarly long list in [10] is given in full

Review #2

By Oscar Corcho submitted on 08/Oct/2018

Suggestion: Reject

Review Comment:

This paper presents relevant work towards making data from a data lake available using an ontology-based data access approach, so as to produce what the authors call a semantic data lake. This is one of the first approaches in the state of the art in this direction, since so far only individual access mechanisms for different types of data sources had been proposed in the literature (e.g., for NoSQL databases).

While relevant, and while there is an early implementation available for the architecture and approach that is presented, the paper (and work presented) is not sufficiently strong, in my opinion, for a journal publication. I would encourage the authors to continue working on it until a more robust, well-designed and well-tested approach is available. I really appreciate the effort on trying to make a wider range of data sources and formats available through OBDA, and providing an implementation that tries to take advantage of existing Big Data architectures (I have myself worked on this as well, with Flink), but there are relevant shortcomings in the current version of the work that suggest that more work should be done before the paper can be accepted for a journal publication.

I will try to summarise next the main aspects that I consider that are still a bit weak in the current contribution, with the aim of helping the authors improve further versions of this paper (and of the underlying implementation).
- There is no clear explanation of why the authors have selected such a simple SPARQL fragment, and of the implications of adding other primitives from SPARQL. It is well known that the OPTIONAL clause is very relevant in distributed data querying scenarios, and its usage would have important implications on the design of the join tables, for instance. So I have the feeling that this is needed for a journal paper in this area. The same happens for UNION. Besides, in relation to the SPARQL fragment used, it is unclear why you have LIMIT but do not allow for result paging. It is also unclear why you do not allow variables in the property part of the triples, or why you only allow for very simple filters, when this may be probably already well implemented by the underlying technologies.
- It is unclear how data source selection is being performed in this architecture. There is no clear mention of the algorithm used for this purpose, although the RML mappings could clearly be used for this task.
- The paper is not clear about why we need ad-hoc transformations to be specified in the queries. The idea of using declarative mappings in OBDA approaches is precisely to ease maintenance and to spare the users who write queries from worrying about the underlying data sources. By adding the transformations as part of the SPARQL queries you are tying the queries to specific data sources, and hence your queries stop being general enough. For instance, it would not be possible to easily add more data sources without adding more queries, and if several data sources can provide results, then combinatorial combinations of those would need to be specified in the queries, which seems wrong and a bit against the usual mediator and mapping-based architectures that have been traditionally used in OBDA approaches. Furthermore, some of these transformations may be specified in RML extensions that have been presented in the state of the art.
- The choice of nested-loop joins and left-deep evaluation is not sufficiently discussed. Why not other options that may better exploit the potential parallelisation of the architectures that you are using? I would have expected more discussion of optimisations that take into account the underlying infrastructure.
- There is a discussion in section 5.1 about the usage of RML for physical vs virtual data access. I cannot see the point that the authors are trying to make there, since RML can be used as well for virtual data access, and not only for RDF generation. Other approaches also do the same (e.g., the morph suite or ontop). Indeed, in order to justify this comment even more, clear algorithms should be presented in the paper on how query translation is being performed.
- Finally, it is not clear at all how the system works, after reading carefully section 5. The authors seem to suggest that users should take care of loading data into the Spark DataFrame structure. How is this to be done? Why is this step needed? Isn’t it handled by the system? What about those data structures that are not easily translatable to relational data models?
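To make the join-strategy concern above concrete, here is a minimal Python sketch (not the paper's code) contrasting a nested-loop join with a hash join; the latter is the kind of strategy that partitions naturally across workers in engines like Spark or Presto:

```python
def nested_loop_join(left, right, key):
    """O(|left| * |right|): compares every pair of rows."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    """O(|left| + |right|): builds a hash index on one side first.
    Each partition of `left` can probe the index independently,
    which is what makes this strategy parallelise well."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [(l, r) for l in left for r in index.get(l[key], [])]
```

Both produce the same result on the same inputs; the point is that a journal paper should justify why the quadratic strategy was chosen over alternatives like this.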

There are also several aspects of the experimentation that would need more work, in my opinion:
- Dataset distribution across different types of underlying systems. I think that there is a need for more heterogeneity in how data is represented. Having one technology per type of entity may generate biases on the evaluation that you have not considered. This is mostly showing the feasibility of your approach, but no further conclusions should be extracted from this type of distribution, I think.
- On the query side, you are discarding some important aspects of BSBM queries because of the limitations of the SPARQL query fragment that you make use of. I think that this should be improved.
- Regarding accuracy, you make some claims based on the comparison of results between two of your implementations when the results are large. I do not think that this is enough to claim accuracy of your results, even though it seems that your design and implementation are correct.
- It is not clear what you are trying to evaluate with your experiments on performance. I understand that you cannot compare easily with other approaches, but you may have wanted to test your approach against a native implementation of the same queries, as BSBM does with SQL, given the fact that you are doing several transformations that may include penalties in query evaluation. You may also evaluate size of intermediate results being generated, throughput, etc. The part of the evaluation of TRANSFORM is also very unclear to me.

Finally, there is a point that may summarise clearly why there is a need to work more in this area for a journal article. It is in section 8, where the authors say that as future work they will explore how the semantics of the query languages of each underlying data source will impact the SPARQL query fragment that can be supported. In any OBDA approach, this is extremely relevant in order to understand whether the query translation semantics are correct, and it should normally come before an actual implementation (this has not always been the case in the literature, but it is extremely important).

Editorial comments and minor comments:
- In page 1 you claim that semantic technologies have been used for two decades for providing local-as-view support. This is not entirely correct, since many approaches have focused on providing a global-as-view approach.
- There are a few pieces of work that are presented as part of the contributions in this paper, but seem to be a bit out of scope. Namely the NoSQL ontology and the query generator. These are not evaluated either, nor their quality discussed appropriately in the paper, so I suggest that they should not be included and only added in a discussion section.
- When you discuss the data mapping part in section 4, I would suggest introducing upfront the fact that you use RML, since for some time I was making annotations myself in the paper about why you were not using this or R2RML, and I only realised that you were using them when I reached the implementation section.
- It would be good to add a bibliographic reference to RML when you first name it, instead of only providing a URL.
- In section 6.2 you talk about user interfaces but do not evaluate them. I would suggest removing that section.
- In the related work there is a lot of discussion on how to access NoSQL data sources, but too little on approaches for ontology-based data access. From Optique, you claim that the source code is not available, although the main OBDA implementation (ontop) is actually available as open source.

Review #3

The submitted manuscript describes the system Squerall for query answering over heterogeneous data sources. The system adopts the virtual ontology-based data access approach, which rewrites the user's SPARQL queries into queries over the underlying data sources. Since there are heterogeneous sources involved, a federation engine, e.g., Presto or Spark, is needed for the query execution. The system has been evaluated using a modified version of the BSBM benchmark.

The topic discussed in this paper is timely and important. The system also shows its potential in practice. However, many places in the paper are a bit sloppy, and further details need to be clarified. In particular, in OBDA there are an ontology language, a mapping language, and a query language to be chosen, as well as the query languages of the underlying data sources. These are not sufficiently discussed in the paper.

1. Ontology language. The paper stays silent about which ontology language it supports. Only in Section 4, the paper starts to mention "hierarchies between classes". The ontology language must be clarified. Then, the paper should have a corresponding part about how to handle this ontology language.

2. Mapping language. The mapping language is also unclear. In the introduction, the authors mentioned the "local-as-view" (LAV) approach, and it seems that the paper adopted the LAV approach. However, this is not true: the W3C R2RML mapping language and also RML are basically "global-as-view" mapping languages. Only in Section 5 does the paper say that it uses RML. However, it still does not say which fragment of RML is supported.

3. Query language. In Section 3, the paper introduces the TRANSFORM operator in SPARQL. Unfortunately, this extension is not very convincing for several reasons:
a) Why can we not simply use the FILTER operator together with SPARQL functions to achieve the same effect? For the given example TRANSFORM(?book?author.l.replc("id","1").toInt.skip(12), can we not write
FILTER ( skip(toInt(replc( ?book, "id", "1")), 12) = ?author ) ? What is the fundamental difference between TRANSFORM and FILTER in this case?
b) Also, the expressivity of TRANSFORM seems to be limited. The fragment ([leftJoinVar][rightJoinVar].[l|r] suggests that only one variable can be modified. What should we do if we need to modify the values of both variables, e.g., when they have different prefixes?
c) The dot-operator style of applying functions is not common in the SPARQL language.
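The equivalence suggested in (a) can be sketched in Python. This is an illustrative reading, not the paper's definition; `skip(12)` is left out because its semantics are unclear (see the detailed comment on page 4 above):

```python
# One plausible reading of TRANSFORM(?book?author.l.replc("id","1").toInt...):
# apply the chained functions to the left join variable, then test equality
# with the right one -- which is exactly what a FILTER does.
def transform_chain(value):
    return int(value.replace("id", "1"))

def join_condition(book, author):
    return transform_chain(book) == author

print(join_condition("id42", 142))   # True: "id42" -> "142" -> 142
```

Under this reading, TRANSFORM adds no expressive power over FILTER with function calls; the paper should state the fundamental difference, if there is one.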

4. Databases. The paper is dealing with heterogeneous data sources. Different data sources have different expressiveness in their query languages. For instance, MongoDB stores JSON documents, which have a nested structure. Are we using, or can we use, these features in Squerall? Are there dedicated optimization techniques for each kind of supported database?

Further detailed comments:

- Section 2.1. Preliminaries. This part is not sufficient. The authors should explain the OBDA framework, and discuss the ontology/mapping/query language, and the features of databases.

- ParSet should be better explained. What are the operators for ParSets?

- Query catalyst. Query catalyst is critical in the system, but not sufficiently discussed. In particular, what is the outcome of the query catalyst? Is it a SPARQL query, or a SQL query?

- Either use 'rdf:type' or 'a'. Please stick to one.

- In Section 4, subsection titles like '4.0.1. I. Data Mapping' do not look good. Remove either '4.0.1' or 'I.'.

- p5. Joining ParSets. The join is ill-defined. What is the result? What are s1 and s2? Why do we have pred on the left, but pred1 and pred2 on the right?

- p5. "ParSet pairs of each identified join, with their join variables: ((parset1,join_variable1) -> (parset2,join_variable2))." This needs to be better explained.

- p6. In Sec 5.1 Data Mapping. The authors should make it clear which fragment of RML is supported.

- "We do not intend to generate RDF triples, neither physically nor virtually. We rather use them to map entities, and use these mappings to query relevant data given a SPARQL query. " This sentence is particularly confusing. By definition, the semantics of mapping language is declared the relationship between the datasource and triples. Whether generating triples or not is an implementation detail. The authors are mixing these two aspects.

- How is the NoSQL ontology used in the implementation?

- The availability of the source code (Github) and license (Apache 2) should be made explicit in Sec 5.

- I believe Sec 6.2 User interface should not be part of the evaluation. Instead, it should be in Section 5.

- Evaluation. Is it possible to compare the performance with existing systems? For instance, can we compare Squerall with SPARQL federation engines?

- Related work.
- For OBDA, in addition to [22], a recent survey [a] should be mentioned.
- There are a set of existing OBDA systems, e.g., Mastro, Ontop, Ultrawrap, Morph, which should also be mentioned. See the citations in [a].
- For Ontology-based data integration, [b] may be mentioned as well.
- For benchmarking and data generation, the work of VIG [c] is also relevant.
- For OBDA over NoSQL datasources, check [d] as well.
- "For ontology-based access, the Optique Platform [34] emphasizes the velocity aspect of Big Data by supporting streams in addition to dynamic data." This is not correct: velocity is only one aspect of the Optique platform.

- Citations:
- there is no venue in citation [21]
- Linking to dblp in the references is not needed.