Summary

The Semantic Web is an extension of the World Wide Web in which structured data and its meaning are represented in a form that can be readily accessed and exploited by machines. The foundation of this representation is a graph-based data model defined by the Resource Description Framework (RDF). This framework allows for data management approaches that focus on manipulating and using data in terms of its meaning. We refer to this type of data management as semantic data management.

In addition to centralized access to RDF datasets, Web-based protocols such as the SPARQL protocol enable software clients to access or to query RDF datasets made available by remote servers. By integrating such remote data sources as members of a federated system, software clients may answer cross-dataset queries without having to retrieve various datasets into a single repository. Given such a federation, the complexity of problems of query processing and semantic data management increases due to additional parameters such as variable data transfer delays, a changing availability of federation members, the size of the federation, and distribution criteria followed to place and semantically link data in different datasets of the federation. Moreover, whenever data is replicated across federations, synchronization is required to ensure that all changes are propagated and the semantics of data is preserved. Despite a large number of technologies developed by the Semantic Web and Database communities to address problems of semantic data management, we still observe a significant lack of efficient and effective solutions to the problems of federated semantic data management (FSDM), which prevents the development of real-world applications on top of Semantic Web technologies. Additionally, existing proposals to evaluate such solutions do not sufficiently cover the large number of parameters that affect FSDM and the complexity of tradeoffs. More specifically, variables and configurations that considerably affect the federated semantic data management problems are not sufficiently defined or even considered in state-of-the-art testbeds (e.g., network latency, data fragmentation and replication, query properties, or frequency of updates).
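The remote access described above can be illustrated with a minimal sketch. It assumes two hypothetical endpoints (the `example.org` URLs are placeholders, not real services) and builds a SPARQL 1.1 query whose `SERVICE` clause delegates one triple pattern to a remote federation member. The helper names `build_get_request` and `extract_query` are introduced here for illustration only. The sketch only constructs the HTTP GET URL, with the query passed in the `query` parameter as the SPARQL protocol prescribes, and does not contact any server.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoints, used for illustration only.
LOCAL_ENDPOINT = "http://example.org/sparql"
REMOTE_ENDPOINT = "http://remote.example.org/sparql"

# The SERVICE clause asks the local endpoint to evaluate the inner
# pattern against the remote federation member.
FEDERATED_QUERY = f"""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name WHERE {{
  ?person foaf:knows ?friend .
  SERVICE <{REMOTE_ENDPOINT}> {{
    ?friend foaf:name ?name .
  }}
}}
"""


def build_get_request(endpoint: str, query: str) -> str:
    """Return the GET URL that carries the query in the 'query' parameter.

    Assumes the endpoint URL has no query string of its own.
    """
    return f"{endpoint}?{urlencode({'query': query})}"


def extract_query(url: str) -> str:
    """Decode the 'query' parameter back out of a request URL."""
    return parse_qs(urlsplit(url).query)["query"][0]


if __name__ == "__main__":
    print(build_get_request(LOCAL_ENDPOINT, FEDERATED_QUERY))
```

In this sketch the client still addresses a single endpoint; the parameters named above (transfer delays, member availability, replication) come into play once that endpoint, or the client itself, has to decide which members to contact and in what order.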

The aim of the Dagstuhl seminar was to gather experts from the Semantic Web and Database communities, together with experts from application areas, to discuss in depth the open issues that have prevented FSDM approaches from being used on a large scale.

The following crucial questions were posed as a basis for the discussions during the seminar:

Q1: Can traditional techniques developed for federations of relational databases be enriched with RDF semantics, and thus provide effective and efficient solutions to problems of FSDM?

Q2: What problems of FSDM present new research challenges that require the definition of novel techniques?

Q3: What is the role of RDF semantics in the definition of the problems of FSDM?

To discuss these questions the participants of the seminar were grouped according to their areas of expertise and interests. In particular, the seminar focused on four main topic areas (see below). The results of the group discussions were presented in plenary sessions and will be compiled into manuscripts with which the seminar outcomes will be disseminated. As a basis of the group work, and to establish a common understanding of key concepts and terminology, the seminar included a few short, survey-style talks on a number of related topics. In particular, these talks covered:

In addition to these survey talks, every participant was given the chance to briefly highlight their research as relevant for the seminar. Moreover, in a demo session, some of the participants showcased their FSDM-related systems and tools, which gave interested attendees of this session an opportunity to play with and better understand these systems and tools. The systems and tools demonstrated in this session were the following:

Triple Pattern Fragments client that runs in a browser and executes queries over a federation of Triple Pattern Fragment (TPF) interfaces (demonstrated by Joachim Van Herwegen),

As mentioned before, besides the short survey talks, the demos, and the participants’ presentations, the major focus of the seminar was on discussions in four working groups, where each of these groups addressed a different topic area. The remainder of this section provides a brief overview of the four topic areas covered by the groups and the respective results. More detailed summaries provided by each of the four groups can be found in a separate section of this report.

Graph Data Models. Graph data models such as the RDF data model allow for a representation of both data and metadata using graphs of nodes that represent entities, and edges that model connections between entities. Graph data management encompasses techniques for managing, querying, and analyzing graph data by utilizing graph-oriented operations. SQL-like query languages have been defined for evaluating declarative queries over graph data; additionally, well-known algorithms are utilized for computing graph invariants (e.g., triangle counting or degree centrality) and for solving typical graph problems (e.g., finding shortest paths, traversals, or dense subgraphs). Furthermore, several real-world applications have been built on top of existing graph-based tools (e.g., community detection, centrality analysis, and link prediction). Graphs naturally represent a wide variety of domains (e.g., social networks, biological networks) in which data, interconnectivity, and data topology all are first-class citizens, with RDF data being one example of graph data.
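As a small illustration of the graph invariants mentioned above, the following sketch computes degree centrality and a triangle count over a handful of RDF-style triples. It is a minimal example under stated assumptions: the triples and the function names (`build_adjacency`, `degree_centrality`, `count_triangles`) are hypothetical, edge labels are ignored, and the graph is treated as undirected, which discards exactly the semantics that an RDF-aware tool would need to retain.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_adjacency(triples: Iterable[Triple]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map, ignoring predicates and self-loops."""
    adjacency: Dict[str, Set[str]] = {}
    for subject, _predicate, obj in triples:
        if subject == obj:
            continue
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)
    return adjacency


def degree_centrality(adjacency: Dict[str, Set[str]]) -> Dict[str, float]:
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n <= 1:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def count_triangles(adjacency: Dict[str, Set[str]]) -> int:
    """Count triangles; each one is seen once per corner, hence the // 3."""
    seen = 0
    for neighbours in adjacency.values():
        for a, b in combinations(neighbours, 2):
            if b in adjacency[a]:
                seen += 1
    return seen // 3


if __name__ == "__main__":
    # Hypothetical example data.
    triples = [
        ("a", "knows", "b"),
        ("b", "knows", "c"),
        ("c", "knows", "a"),
        ("c", "knows", "d"),
    ]
    graph = build_adjacency(triples)
    print(degree_centrality(graph))
    print(count_triangles(graph))
```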

During the Dagstuhl seminar, a working group was formed to discuss whether tools for graph data management are sufficient to model and to manage the semantics in RDF data, taking into account that characteristics of the RDF data model (e.g., blank nodes and SPARQL operators) may affect tractability of the graph-based tasks in a federation of RDF graphs. As a first result of this discussion, the working group made the following observation. In contrast to other graph data models and query languages, the RDF data model is a "universal" data model in the sense that it is designed for sharing data and knowledge in an unbounded space such as the Web. To continue the discussion, the group introduced a definition of the notion of FSDM and identified five principles that characterize FSDM: universality, unboundedness, dynamicity, network protocols, and semantics. Based on further discussion that took these principles into account, the group made two conjectures that they plan to elaborate on in a future publication and that can be summarized as follows. First, it is impossible to build an FSDM system that fully achieves universality, unboundedness, and dynamicity, all at the same time. Second, the concepts of federation and semantics are interdependent and must be tackled together to develop effective and efficient solutions for building FSDM systems.

Federated Query Processing. A vast number of approaches have been developed to provide a unified interface for querying federations of data sources. In the context of federations of RDF datasets, existing approaches focus on two problems: the problem of selecting the RDF datasets required to execute a federated query, and the problem of executing the resulting sub-queries efficiently against the selected data sources. Although federated query processing has been studied extensively, a number of important problems are still open, and more challenges are likely to come up as the complexity of federations increases (e.g., by increasing numbers of federation members, by replication and fragmentation of RDF data, and by federation members that update their RDF data autonomously).
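The two problems named above, source selection and sub-query formation, can be sketched as follows. This is a simplified illustration and not the method of any particular system. It assumes a hypothetical, precomputed index that maps each predicate to the endpoints known to hold triples with that predicate; the endpoint URLs and function names are placeholders.

```python
from typing import Dict, List, Set, Tuple

# (subject, predicate, object); terms starting with "?" are variables.
TriplePattern = Tuple[str, str, str]


def select_sources(
    patterns: List[TriplePattern],
    predicate_index: Dict[str, Set[str]],
) -> Dict[TriplePattern, Set[str]]:
    """Map each triple pattern to the endpoints that may contribute answers."""
    all_sources: Set[str] = set()
    for sources in predicate_index.values():
        all_sources |= sources
    selection: Dict[TriplePattern, Set[str]] = {}
    for pattern in patterns:
        predicate = pattern[1]
        if predicate.startswith("?"):
            # An unbound predicate could match at any federation member.
            selection[pattern] = set(all_sources)
        else:
            selection[pattern] = set(predicate_index.get(predicate, set()))
    return selection


def decompose(
    selection: Dict[TriplePattern, Set[str]],
) -> List[Tuple[List[str], List[TriplePattern]]]:
    """Group patterns whose only relevant source is the same endpoint.

    Every other pattern becomes its own sub-query, to be sent to each of
    its relevant sources. An empty source list means no member can answer.
    """
    by_single_source: Dict[str, List[TriplePattern]] = {}
    subqueries: List[Tuple[List[str], List[TriplePattern]]] = []
    for pattern, sources in selection.items():
        if len(sources) == 1:
            (source,) = sources
            by_single_source.setdefault(source, []).append(pattern)
        else:
            subqueries.append((sorted(sources), [pattern]))
    for source, group in by_single_source.items():
        subqueries.append(([source], group))
    return subqueries


if __name__ == "__main__":
    # Hypothetical index and query.
    index = {
        "foaf:name": {"http://a.example/sparql"},
        "foaf:knows": {"http://a.example/sparql"},
        "dbo:birthPlace": {"http://a.example/sparql", "http://b.example/sparql"},
    }
    query = [
        ("?p", "foaf:name", "?n"),
        ("?p", "foaf:knows", "?q"),
        ("?q", "dbo:birthPlace", "?c"),
    ]
    for sources, group in decompose(select_sources(query, index)):
        print(sources, group)
```

A static index like this one is exactly what the complications listed above undermine: replication makes several sources equally relevant for one pattern, and autonomous updates can make the index stale between queries.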

During the Dagstuhl seminar, a working group was formed to discuss the problem of federated query processing over RDF data sources. Challenges imposed by the semi-structured nature of RDF, the unpredictable behavior and dynamicity of Web-accessible RDF sources, and the role of the entailment regimes guided the group discussions and allowed for enumerating the main differences with the problem of federated query processing against relational databases. The group focused on the formal definition of the problem, as well as on the formalization of the subproblems of source selection, query decomposition, and query execution. As a first result, the group identified that the entailment regimes to be applied over a federation of RDF sources, as well as data replication and dynamicity, access control policies, and SPARQL query capabilities, play a crucial role in source selection, query decomposition, and query execution. State-of-the-art techniques implemented by existing approaches (e.g., FedX, ANAPSID, or Linked Data Fragments) were discussed and compared based on this formalization; the group concluded that none of the existing approaches takes all these characteristics of RDF data sources into account, and that further analysis and work are required to enable them to solve the formalized problems. Finally, the impact of these characteristics on the performance of SPARQL operators (e.g., join, union, or optional) was discussed. The group concluded that although physical operators implemented by existing approaches are capable of adjusting query execution schedulers to RDF source availability, they are unable to adapt their execution to other RDF source characteristics, e.g., supported entailment regimes or data evolution. These issues remain open as well, and require further study from the semantic data management community.

Access Control and Privacy. Solutions to the problem of modeling access control policies for Web resources have benefited from Semantic Web technologies. Existing rule-based logic languages rely on ontology-based reasoning tasks to represent reactive policies for access control, and to enforce and to propagate trusted and policy-compliant interactions across resources in RDF datasets. For instance, the Open Digital Rights Language (ODRL) is a rule-based approach that allows for a description of policies to access and to exchange Web resources. Nevertheless, as per the Linked Data publishing principles, RDF properties associated with any resource can be accessed by dereferencing the corresponding URL. In FSDM application domains such as personalized medicine or finance, only authorized and privacy-respecting access is allowed. Thus, novel approaches are required to bridge the gap between access-control models and unrestricted access to RDF resources.

A working group with a focus on access control and privacy discussed the following open issues: a) formalisms to specify access-control and privacy policies of federation resources and to reason over the meaning of these resources; and b) techniques that enable systems to enforce privacy-aware and security-aware policies whenever a resource is accessed. After concluding that there are too many open challenges to solve all at once, the group agreed to focus on access control. Next, the group discussed conceptual access control models and achieved a better understanding of the requirements of a conceptual framework to analyze policy-aware federated Semantic Web architectures. Finally, the group defined such a framework and made plans for a publication about it.

Use Cases and Applications. In addition to the first three working groups, which focused on the more technical aspects of building FSDM systems, a fourth working group looked into applications of FSDM and use cases in which adopting FSDM would be beneficial. Specifying such use cases, as well as documenting the usage of FSDM systems in existing applications, is important to better understand the requirements and the challenges of FSDM and to derive realistic testbeds for approaches to build FSDM systems.

A key observation of the working group was that approaches to apply FSDM can be categorized into two classes depending on whether they focus i) on explorative, open-domain querying or ii) on controlled, closed-domain querying. Then, the working group identified a broad set of general use cases of FSDM. Thereafter, the group defined a framework for developing specific use cases. This framework introduces a set of requirements for the specification of a use case. Finally, the group applied their framework to develop a number of example use cases.