ED3: Enabling analytics over Diverse Distributed Datasources

Abstract

Enterprises and government entities have a growing need for systems that provide decision support based on descriptive and predictive analytics over large volumes of data. Examples include supporting decisions on pricing and promotions based on analyses of revenue and demand data; supporting decisions on the operation of complex equipment based on analyses of sensor data; and supporting decisions on website content based on analyses of user behaviour. Such support may be critical for safety and regulatory compliance as well as for competitiveness.

Current data analytics technology and workflows are well-suited to settings where the data has a uniform structure and is easy to access. Problems can arise, however, when performing data analytics in real-world settings, where datasources are not only large but often also distributed, heterogeneous, and dynamic.

Consider, for example, the case of Siemens Energy Services, which runs over 50 service centres, each of which provides remote monitoring and diagnostics for thousands of gas/steam turbines and ancillary equipment located in hundreds of power plants. Effective monitoring and diagnosis is essential for maintaining high availability of equipment and avoiding costly failures. A typical descriptive analytics procedure might be: "based on sensor data from an SGT-400 gas turbine, detect abnormal vibration patterns during the period prior to the shutdown and compare them with data on similar patterns in similar turbines over the last 5 years".

Such diagnostic tasks employ sophisticated data analytics tools, and operate on many TBs of current and historical data. In order to perform the analysis it is first necessary to identify, acquire and transform the relevant data. This data may be stored on-site (at a power-plant), at the local service centre or at other service centres; it comes in a wide range of different formats, ranging from flat files to XML and relational stores; access may be via a range of different interfaces, and incur a range of different costs; and it is constantly being augmented, with new data arriving at a rate of more than 30 GB per centre per day.

Acquiring the relevant data is thus very challenging, and is typically achieved via a combination of complex queries and bespoke data processing code, with numerous variants being required in order to deal with distribution and heterogeneity of the data. Given the large number of different analytics tasks that service centres need to perform, the development and maintenance of such procedures becomes a critical bottleneck.

In ED3 we will address this problem by developing an abstraction layer that mediates between analytics tools and datasources. This abstraction layer will adapt Ontology Based Data Access (OBDA) techniques, using an ontology to provide a uniform conceptual schema, declarative mappings to establish connections between ontological terms and data sources, and logic-based rewriting techniques to transform ontological queries into queries over the data sources. For OBDA to be effective in this new setting, however, it will need to be extended in several different directions. Firstly, it needs to provide greatly extended support for basic arithmetic and aggregation operations. Secondly, it needs to deal more effectively with heterogeneous and distributed data sources. Thirdly, it will be necessary to support the development, maintenance and evolution of suitable ontologies and mappings.
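The three OBDA ingredients named above (a conceptual vocabulary, declarative mappings, and query rewriting) can be illustrated with a minimal sketch. It assumes two hypothetical SQLite tables standing in for service centres with different schemas; every table name, column name, ontology term and value is invented for illustration, and the rewriting shown is a simple one-level unfolding, not the logic-based algorithms that ED3 proposes to develop.

```python
import sqlite3

# Hypothetical sources: two service centres storing readings in different schemas.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE centre_a_readings (turbine_id TEXT, vib_mm_s REAL);
    CREATE TABLE centre_b_sensor_log (unit TEXT, kind TEXT, value REAL);
    INSERT INTO centre_a_readings VALUES ('T1', 2.5), ('T1', 7.5);
    INSERT INTO centre_b_sensor_log VALUES
        ('T2', 'vibration', 4.0), ('T2', 'temperature', 510.0);
""")

# Toy ontology (assumption): hasMeasurement has two sub-properties.
SUBPROPERTIES = {"hasMeasurement": ["hasVibration", "hasTemperature"]}

# Declarative mappings (assumption): each ontological term is populated by
# one or more source queries that all return the columns (turbine, value).
MAPPINGS = {
    "hasVibration": [
        "SELECT turbine_id AS turbine, vib_mm_s AS value FROM centre_a_readings",
        "SELECT unit AS turbine, value AS value FROM centre_b_sensor_log "
        "WHERE kind = 'vibration'",
    ],
    "hasTemperature": [
        "SELECT unit AS turbine, value AS value FROM centre_b_sensor_log "
        "WHERE kind = 'temperature'",
    ],
}


def rewrite(term):
    """Expand a term with its sub-properties, then unfold the mappings into SQL."""
    terms = [term] + SUBPROPERTIES.get(term, [])
    branches = [sql for t in terms for sql in MAPPINGS.get(t, [])]
    return " UNION ALL ".join(branches)


def max_vibration_per_turbine():
    """Aggregate query posed over the ontology term, evaluated over the sources."""
    sql = (
        f"SELECT turbine, MAX(value) FROM ({rewrite('hasVibration')}) "
        "GROUP BY turbine ORDER BY turbine"
    )
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(max_vibration_per_turbine())
```

The analyst's query mentions only the ontological term `hasVibration`; where the data lives, and in which format, is confined to the mappings, so adding a third source would mean adding a mapping rather than rewriting the analysis.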

In ED3 we will address all of these issues, laying the foundations for a new generation of data access middleware with the conceptual modelling, query processing, and rapid-development infrastructure necessary to support analytic tasks. Moreover, we will develop a prototypical implementation of a suitable abstraction layer, and will evaluate our prototype in real-life deployments with our industrial partners.

Planned Impact

We foresee two classes of non-academic beneficiaries: data owners struggling to "make sense of their data", and a growing subset of the information technology industry for which data analytics represents an important component of their products and/or services.

Regarding data owners, we have already described the difficulties facing energy services companies such as Siemens and EDF. Similar challenges can be found in domains ranging from government and healthcare to the aerospace, energy and finance industries, and it is our belief that ED3 has the potential to have wide impact in all these sectors of the economy.

Regarding the technology industry, the needs of data owners have created great interest in developing more flexible information management layers. We are already working with several of the major players in this area, including IBM and Oracle, and also with LogicBlox, a new and rapidly growing company whose customers include retailers such as Home Depot, Walgreens, and Toys R Us in the US, Harrods in the UK, and M-Video in Russia.

ENGAGEMENT, DISSEMINATION AND EXPLOITATION

Engagement with non-academic beneficiaries is an integral part of ED3, with industry partners making a significant contribution to the project. This engagement will provide a direct pathway to impact via dissemination and possible exploitation.

Regarding dissemination, we will make regular visits to Siemens and EDF, during which we will give presentations and demonstrations, not only to those parts of each company that are directly involved in the project, but also to other divisions for which the proposed technology could be of interest. LogicBlox will provide another set of opportunities for dissemination to their customer base in the retail domain.

We will also exploit our wider network of non-academic collaborators, including the partners in our DBOnto platform grant, for dissemination and exploitation activities. The platform grant can support visits and exploratory collaborations, which will provide an ideal mechanism for exploring applications of ED3 technology.

Regarding exploitation, we will actively pursue opportunities arising from all of the above engagements, and explore a range of mechanisms, including both licensing and spin-offs. Exploitation of IP resulting from the project will be managed by Isis Innovation, a wholly-owned subsidiary of Oxford University, founded to exploit know-how arising out of Oxford's research activities.

We will additionally undertake a range of more broadly focussed activities in order to ensure the widest possible dissemination of our results and engagement with potential beneficiaries.

Firstly, we will showcase the achievements of the project to industry and research leaders via dedicated workshops; these will include both events specific to ED3, and broader showcase events organised as part of DBOnto.

Secondly, we will continue our established pattern of publishing the results of our research in leading conferences and journals. In order to maximise the impact on non-academic partners, we will target "in-use" and "industry" tracks at conferences such as ISWC, SIGMOD, VLDB and WWW, wherever possible co-authoring papers with industry partners.

Thirdly, we will participate in relevant international coordination and standardisation efforts within groups and organisations such as the World Wide Web Consortium (W3C) and the OWL Experiences and Directions Group (OWLED). Through these activities we can help to foster awareness of our work and ensure that it has the maximum possible impact on any future standards.

Finally, we will continue to make all research outputs freely available from our web site, including papers, presentations, tutorials and software.

TRACK RECORD

Our research has already been highly influential outside academia, and has been the basis for international standards, widely used and/or commercialised software systems, and spin-off companies.

Motivated by the need for OBDA systems supporting database-style aggregate queries, we have proposed a bag semantics for OBDA, where duplicate tuples in the views defined by the mappings are taken into account. We have shown, however, that bag semantics makes query answering coNP-hard in data complexity. To regain tractability, we have proposed the rather general class of anchored queries and have shown that such queries are first-order rewritable under bag semantics over DL-Lite_core ontologies.
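Why duplicates matter for aggregate queries can be seen in a small sketch. It is only an informal illustration of the difference between set and bag views, not the formal semantics referred to above; the alarm data, the view and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical source rows (turbine, alarm_code); T1 raised the same alarm twice.
ROWS = [("T1", "overheat"), ("T1", "overheat"), ("T2", "vibration")]


def alarm_view(rows):
    """Mapping view (assumption): project each source row to its turbine."""
    return [turbine for turbine, _code in rows]


def count_under_set_semantics(rows):
    """Duplicates in the view collapse, so each turbine is counted once."""
    distinct = set(alarm_view(rows))
    return {turbine: 1 for turbine in sorted(distinct)}


def count_under_bag_semantics(rows):
    """Duplicates in the view are kept, so counts match the source rows."""
    return dict(Counter(alarm_view(rows)))


if __name__ == "__main__":
    print(count_under_set_semantics(ROWS))
    print(count_under_bag_semantics(ROWS))
```

Under the set view a query such as "number of alarms per turbine" reports one alarm for T1, whereas the bag view reports two, which is the answer a database-style COUNT over the source would give.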