UnifiedViews: An ETL Framework for Sustainable RDF Data Processing

The advent of Linked Data [1] accelerates the evolution of the Web into an exponentially growing information space, where the unprecedented volume of RDF data offers consumers a level of information integration that has not been possible until now. Consider a Linked Data/RDF consumer whose data processing task is to build a data mart integrating information from various RDF and non-RDF sources. Many tools for RDF data processing have emerged in the last few years, so the consumer may use these tools to carry out the task.

Unfortunately, the consumer cannot focus merely on the proper configuration of these tools. He also has to, for example, write a script that executes the tools in the required order, forward the logs produced by the tools to a single location, and decide where the tools' configurations are stored. Further, the consumer has no support for debugging the intermediate RDF data created by the tools as the task is executed. The consumer cannot reuse configurations created by other consumers; what is more, as the number of his configurations grows, maintaining them may easily become a nightmare.

To address the problem of sustainable RDF data processing that a typical Linked Data/RDF consumer is facing, we propose UnifiedViews, an Extract-Transform-Load (ETL) framework built around two central concepts: the data processing task and native support for the RDF data format and ontologies. A data processing task (or simply task) consists of one or more data processing units. A data processing unit (DPU) encapsulates certain business logic needed when processing data (e.g., one DPU may extract data from an RDF database, while another may apply a SPARQL query [2,3]). Every DPU has its inputs, outputs, business logic, and configuration.
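To make the anatomy of a DPU more concrete, here is a minimal Python sketch. It is not the UnifiedViews API: the `DPU` class, the `filter_by_predicate` function, the port names `input` and `output`, and the representation of RDF triples as plain tuples are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple

# Assumption: a triple is a (subject, predicate, object) tuple of strings,
# and a graph is a set of such triples. A real DPU would use an RDF library.
Triple = Tuple[str, str, str]
Graph = Set[Triple]
Ports = Dict[str, Graph]


@dataclass
class DPU:
    """Hypothetical data processing unit: named logic plus its configuration."""
    name: str
    logic: Callable[[Ports, Dict[str, str]], Ports]
    config: Dict[str, str] = field(default_factory=dict)

    def run(self, inputs: Ports) -> Ports:
        # The business logic maps named inputs to named outputs,
        # parameterised by the DPU's configuration.
        return self.logic(inputs, self.config)


def filter_by_predicate(inputs: Ports, config: Dict[str, str]) -> Ports:
    """Example business logic: keep only triples with the configured predicate."""
    wanted = config["predicate"]
    return {"output": {t for t in inputs["input"] if t[1] == wanted}}


source: Graph = {
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
}

name_filter = DPU("name-filter", filter_by_predicate, {"predicate": "foaf:name"})
result = name_filter.run({"input": source})
```

Here `result["output"]` holds only the two `foaf:name` triples. Keeping the configuration separate from the logic is what would let the same DPU be reused across tasks with different settings.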

Because UnifiedViews is a framework, consumers may create custom DPUs; any tool used by the RDF/Linked Data community can be easily wrapped as a DPU. UnifiedViews allows consumers to define and adjust data processing tasks using a graphical user interface (an excerpt is depicted in Figure 1).

UnifiedViews takes care of task scheduling. A consumer may configure UnifiedViews to send notifications about errors in task executions; the consumer may also receive daily summaries of the tasks being executed. UnifiedViews ensures that DPUs are executed in the proper order, so that every DPU has its required inputs available when it is launched. UnifiedViews also provides consumers with debugging capabilities: a consumer may browse and query (using the SPARQL query language) the RDF inputs to and RDF outputs from any DPU.
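One common way to obtain such an execution order is a topological sort of the DPU dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The DPU names and the `execution_order` helper are hypothetical, and this is not necessarily how UnifiedViews orders DPUs internally.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical task: each DPU name maps to the set of DPUs
# whose outputs it consumes as inputs.
task: Dict[str, Set[str]] = {
    "extract_rdf": set(),
    "extract_csv": set(),
    "convert_csv_to_rdf": {"extract_csv"},
    "merge": {"extract_rdf", "convert_csv_to_rdf"},
    "load": {"merge"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every DPU comes after all DPUs it depends on.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(task)

# Position of each DPU in the computed order, useful for checking the result.
position = {name: index for index, name in enumerate(order)}
```

Several valid orders may exist (the two extractors are independent of each other), but in every one of them `load` runs last and each DPU runs only after the DPUs it reads from.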

UnifiedViews is used in the COMSODE project as a core component of Open Data Node, where it ensures the extraction, transformation, and publishing of (Linked) Open Data. As part of the COMSODE project, we are also preparing new DPUs needed to process and publish 150 datasets as (Linked) Open Data. We will describe the concept of DPUs and introduce examples of new DPUs in one of the next blog posts.

Tomas Knap received his Ph.D. from the Faculty of Mathematics and Physics, Charles University, Czech Republic, for his research on trustworthy Linked Data integration and consumption. In 2013, he co-founded Semantica.cz s.r.o., an SME focused on consulting on Linked Data and semantic web solutions for data integration and publishing.