Smart Compilation Techniques for Synthesizing Optimized Big Data Code

Context

Modern query languages provide high-level constructs for facilitating information extraction and big data analytics. Their development is undergoing two revolutions.

The first revolution is the coexistence of rich, heterogeneous data models (graphs, trees, tuples, key-value pairs, etc.). The richness of these data models far exceeds what we learned from decades of research on, e.g., relational databases, and it has triggered the development of new foundations and tools adapted to, e.g., labeled structures with ordering, types and values. In recent years, these foundations have matured, enabling more robust and efficient programming techniques with, e.g., (XML) trees and (RDF) graphs. For example, query type inference and type-based projection aim at optimizing query evaluation by pruning unnecessary data through a joint static analysis of queries and type constraints. Advances in logical reasoning have proved instrumental in developing broader analyses of queries and types [1, 2].
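To give an intuition of type-based projection, the sketch below prunes a tree down to the nodes a query could possibly touch, before evaluation. It is a deliberately minimal illustration: the one-label-per-node tree model, the `project` function, and the example labels are all hypothetical simplifications, not the team's actual formalism, and the set of relevant labels is assumed to come from a prior static analysis of the query against a schema.

```scala
// Illustrative sketch of type-based projection: drop subtrees that the
// query can never reach. The model and names are hypothetical.
case class Node(label: String, children: List[Node])

// `relevant` is assumed to be inferred statically from the query and a
// schema. Keep a node if its label is relevant or if it has a kept child
// (an over-approximation is fine: projection must never lose answers).
def project(tree: Node, relevant: Set[String]): Option[Node] = {
  val kept = tree.children.flatMap(c => project(c, relevant))
  if (relevant.contains(tree.label) || kept.nonEmpty)
    Some(Node(tree.label, kept))
  else None
}

val doc = Node("library", List(
  Node("book", List(Node("title", Nil), Node("price", Nil))),
  Node("dvd", List(Node("title", Nil)))))

// A query touching only book titles needs only these labels,
// so the "price" subtree is pruned before evaluation:
val pruned = project(doc, Set("book", "title"))
```

The point of the technique is that `pruned` can be much smaller than `doc`, so the (possibly distributed) evaluation loads and shuffles less data.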

The second revolution is the emergence of novel data-centric paradigms for programming distributed architectures. The MapReduce programming model is an example of ideas taken from functional programming that have directly influenced the way distributed big data applications are written. Architectures for large-scale data-intensive computing have been developing at a very fast pace ever since. Yet a huge gap remains between high-level query language constructs (as in, e.g., SPARQL for querying RDF graphs) and the low-level primitives for distributed data processing found in state-of-the-art generic big data frameworks (such as Spark [3]). The former are rather mature: they have reached a high level of standardisation, with well-understood semantics and well-characterized properties in the non-distributed setting. The goal of this research topic is to investigate their optimized compilation into the latter.
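To make the gap concrete, the sketch below shows how a tiny SPARQL basic graph pattern might compile, roughly, into the filter/map/join primitives such frameworks expose. Plain Scala collections stand in for Spark RDDs so the example runs anywhere; the data and the compilation scheme are illustrative, not the team's actual compiler.

```scala
// A two-pattern SPARQL query compiled, by hand, into data-parallel
// primitives. Seq stands in for a Spark RDD; data is illustrative.
type Triple = (String, String, String) // (subject, predicate, object)

val triples: Seq[Triple] = Seq(
  ("alice", "knows", "bob"),
  ("bob", "knows", "carol"),
  ("alice", "age", "30"))

// SPARQL: SELECT ?x ?z WHERE { ?x knows ?y . ?y knows ?z }
// compiles, roughly, to: filter each pattern, key both by ?y, join.
val p1 = triples.filter(_._2 == "knows").map { case (s, _, o) => (o, s) } // keyed by ?y
val p2 = triples.filter(_._2 == "knows").map { case (s, _, o) => (s, o) } // keyed by ?y
val answers = for {
  (y1, x) <- p1
  (y2, z) <- p2
  if y1 == y2
} yield (x, z)
// answers == Seq(("alice", "carol"))
```

In a real Spark program the `Seq` would be an `RDD` and the nested loop a `join`; the research question is how to choose such compilations (join order, keying, placement) automatically and well.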

Objectives

This research topic explores how query analysis can be leveraged to optimize the compilation of queries into code executable on big data frameworks. Particular emphasis will be put on logical methods for query analysis in the presence of additional information such as constraints. The overall objective is to improve the performance of distributed query evaluation by leveraging logical reasoning at compile time. This includes:

- proposing techniques for optimizing the placement of data and computations based on additional information available at compile time (such as constraints, schemas, keys, statistics, etc.);

- implementing and validating the results (depending on the advances made).
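As a small illustration of compile-time-driven placement: if static analysis shows that two patterns join on the subject, partitioning both inputs by subject lets the join run locally on each worker, with no shuffle. The sketch below is hypothetical — the partitioning scheme, partition count and data are illustrative, and plain collections stand in for distributed ones.

```scala
// Sketch: co-partition triples by subject so a subject-subject join
// runs per partition, avoiding data movement. All names illustrative.
type Triple = (String, String, String)

val numPartitions = 4
def part(key: String): Int = math.abs(key.hashCode) % numPartitions

val triples: Seq[Triple] = Seq(
  ("alice", "knows", "bob"), ("alice", "age", "30"),
  ("bob", "knows", "carol"), ("bob", "age", "25"))

// Placement decision: hash-partition by subject. All triples sharing a
// subject land in the same partition.
val bySubject = triples.groupBy(t => part(t._1))

// Per-partition (local) join for: ?x knows ?y . ?x age ?a
val local = bySubject.values.flatMap { chunk =>
  val knows = chunk.collect { case (s, "knows", o) => (s, o) }
  val ages  = chunk.collect { case (s, "age", a)   => (s, a) }
  for ((s1, y) <- knows; (s2, a) <- ages if s1 == s2) yield (s1, y, a)
}.toSet
// local == Set(("alice", "bob", "30"), ("bob", "carol", "25"))
```

The interesting research question is when such a placement is provably safe and beneficial, e.g. deduced from keys or schema constraints rather than hard-coded.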

This is an opportunity to explore theories in programming languages, logic and compilation, with the possibility of experimenting in a concrete setting: code running on a cluster, within an existing architecture built with various languages and tools (OCaml, Scala, bash, Spark, Hadoop, etc.).

Environment and starting blocks

The Tyrex team (http://tyrex.inria.fr) is a joint research team (CNRS, Inria, UGA, Grenoble INP) located in the Inria Grenoble Rhône-Alpes research center. The team has developed logical analysis methods for queries (see, e.g., [1]) and a system for the distributed evaluation of queries by compilation into code executed on Apache Spark [4]. The candidate will benefit from the developments made by the team, will be encouraged to develop their own ideas, and will be provided with the means to experiment if they wish. This is a topic in which the balance between theoretical and applied research can easily be adjusted.