20/20: Human-in-the-Loop Data Exploration

Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Overview

We propose to build a new class of database systems designed for Human-In-the-Loop (HIL) operation. We target an ever-growing set of data-centric applications in which data scientists of varying skill levels manipulate, analyze, and explore large data sets, often using complex analytics and machine learning techniques. Enabling these applications with ease of use and at “human speeds” is key to democratizing data science and maximizing human productivity.
Traditional database technologies are ill-suited to serve this purpose. Historically, databases assumed (1) text-based input (e.g., SQL) and output, (2) a point (i.e., stateless) query-response paradigm, (3) batch results, and (4) simple analytics. We will drop these fundamental assumptions and build a system that instead supports visual input and output, “conversational” interaction, early and progressive results, and complex analytics. Building a system that integrates these features requires a complete rethinking of the full database stack, from the interface to the “guts”, as well as incorporating pertinent algorithms.

The proposed work will make three novel technical contributions:

Visual interactive data manipulation: Interactive visualizations are an effective and user-friendly means of accessing and manipulating data. In our model, users interact with visualizations to express data operations (input) and also consume visualizations as results (output), which requires profound changes to traditional data system design and optimization. On the input side, the PIs propose a visual interaction language that operates over visualizations, taking a visualization as input and producing another as output. On the output side, the PIs will study the notion of visual approximate query answering, which brings approximate query processing to the world of visualization to enable real-time analysis of large data sets. The PIs will also develop techniques for optimizing visualization-specific operations that are not well supported by existing databases.

Conversational query processing: In exploration, users interact with the system using a sequence of queries (i.e., a query session), each building on the previous one. This departs radically from the point interaction model of traditional query processing, in which no relationship between queries is assumed. The PIs will introduce a session-aware processing model where the system expects that users will engage in a long-running “conversation” and optimizes its execution accordingly.

Interactive query steering: A fluid, engaging conversation requires that users have the ability to get representative results early and progressively, and as they learn from these results, to exercise control by rapidly interrupting queries or changing their behavior (e.g., query parameters). The PIs propose novel online query processing and steering techniques that are designed for visual data manipulation involving complex analytics and machine learning.

Below, we discuss different initial projects towards achieving these goals.

Vizdom: Interactive Analytics through Pen and Touch

Machine learning (ML) and advanced statistics are important tools for drawing insights from large datasets. However, these techniques often require human intervention to steer computation towards meaningful results. In this project, we build a new system for interactive analytics through pen and touch called Vizdom. Vizdom’s front end allows users to visually compose complex workflows of ML and statistics operators on an interactive whiteboard, and the back end leverages recent advances in workflow compilation techniques to run these computations at interactive speeds. Additionally, we are exploring approximation techniques for quickly visualizing partial results that are incrementally refined over time. Unlike existing approximate query processing techniques, Vizdom takes the perception of the user into account to avoid unnecessary computation when the resulting changes would not be perceivable by the user.
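
The idea of stopping refinement once further changes would no longer be visible can be illustrated with a small sketch. This is not Vizdom's implementation: the function name progressive_mean, the one-pixel stopping rule, and the parameters axis_range, chart_height_px, and batch_size are all assumptions made for illustration. The sketch also assumes the rows arrive in random order, so that each prefix behaves like a random sample.

```python
import math
import random


def progressive_mean(values, axis_range, chart_height_px, batch_size=1_000, z=1.96):
    """Yield (rows_seen, estimate, half_width) after every batch of rows.

    Hypothetical stopping rule: refinement ends once the confidence
    interval half-width is smaller than the data range drawn by one pixel.
    """
    axis_lo, axis_hi = axis_range
    units_per_pixel = (axis_hi - axis_lo) / chart_height_px
    n, mean, m2 = 0, 0.0, 0.0  # Welford's running mean and squared deviations
    for value in values:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        if n % batch_size == 0 and n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            yield n, mean, half_width  # partial result the front end can draw
            if half_width < units_per_pixel:
                return  # further refinement would not move the bar by a pixel


random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(200_000)]
updates = list(progressive_mean(data, axis_range=(0.0, 100.0), chart_height_px=200))
rows_seen, estimate, half_width = updates[-1]
print(f"stopped after {rows_seen} of {len(data)} rows, estimate {estimate:.2f}")
```

With a 200-pixel axis spanning 0 to 100, one pixel covers 0.5 units, so the loop can stop long before the full data set has been scanned. A real system would need a more careful perceptual model than a single pixel threshold.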

Publications

Yue Guo, Carsten Binnig, Tim Kraska: What you see is not what you get!: Detecting Simpson’s Paradoxes during Data Exploration. HILDA@SIGMOD 2017: 2:1-2:5

EchoQuery: Chatting with Your Relational Database

Recent advances in automatic speech recognition and natural language processing have led to a new generation of robust voice-based interfaces. Yet, there is very little work on using voice-based interfaces to query database systems. In fact, one might even wonder who in her right mind would want to query a database system using voice commands!
With this project, we make the case for querying database systems using a voice-based interface, a new querying and interaction paradigm we call Query-by-Voice (QbV). The aim of this project is to demonstrate the practicality and utility of QbV for relational DBMSs using a proof-of-concept system called EchoQuery. To achieve a smooth and intuitive interaction, the query interface of EchoQuery is inspired by casual human-to-human conversations.

The main features of the voice-based interface of EchoQuery are:

Hands-free Access: EchoQuery does not require the user to press a button or start an application using a gesture or a mouse-click. Instead, users can interact with the database by solely using their voice at any time.

Dialogue-based Querying: While traditional database systems provide a one-shot (i.e., stateless) query interface, natural language conversations are incremental (i.e., stateful) in nature. To that end, EchoQuery provides a stateful dialogue-based query interface between the user and the database where (1) users can start a conversation with an initial query and refine that query incrementally over time, and (2) EchoQuery can ask for clarification if the query is incomplete or contains ambiguities that need to be resolved.

Personalizable Vocabulary: Domain experts often use their own terms to formulate queries, which might be different from the schema elements (i.e., table and column names) of a database. Learning the terminology of a user and its translation to the underlying schema is similar to the problem of constructing a schema mapping in data integration. EchoQuery constructs these mappings incrementally on a per-user basis by issuing clarification questions using its dialogue-based query interface.
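
The interplay of dialogue-based querying and a personalizable vocabulary can be sketched as a small stateful session object. This is an illustrative sketch only, not EchoQuery's interface: the class DialogueSession, its methods, and the patients schema are hypothetical, and speech recognition and language parsing are left out entirely. SQL strings are assembled by hand for readability; a real system would use parameterized queries.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DialogueSession:
    """Hypothetical per-user session that keeps query state between utterances."""
    columns: set                                     # column names in the schema
    vocabulary: dict = field(default_factory=dict)   # learned user term -> column
    table: Optional[str] = None
    filters: dict = field(default_factory=dict)
    pending: Optional[tuple] = None                  # (term, value) awaiting clarification

    def start(self, table):
        """Begin a conversation with an initial query."""
        self.table, self.filters = table, {}
        return self.to_sql()

    def refine(self, term, value):
        """Add a filter to the current query, asking back if the term is unknown."""
        column = term if term in self.columns else self.vocabulary.get(term)
        if column is None:
            self.pending = (term, value)
            return f"Which column do you mean by '{term}'?"
        self.filters[column] = value
        return self.to_sql()

    def clarify(self, column):
        """Record the user's answer as a term-to-column mapping, then retry."""
        if self.pending is None:
            raise ValueError("no clarification is pending")
        term, value = self.pending
        self.vocabulary[term] = column
        self.pending = None
        return self.refine(term, value)

    def to_sql(self):
        sql = f"SELECT * FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(
                f"{col} = {val!r}" for col, val in self.filters.items()
            )
        return sql


session = DialogueSession(columns={"age", "diagnosis", "gender"})
first = session.start("patients")
question = session.refine("illness", "flu")      # unknown term triggers a question
second = session.clarify("diagnosis")            # the mapping is learned here
third = session.refine("age", 40)                # builds on the previous query
fourth = session.refine("illness", "asthma")     # learned term needs no question
```

The session carries both the evolving query and the per-user vocabulary, so a term that needed a clarification question once can be resolved silently on later turns.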

HashStash – Reuse for Interactive Data Exploration

Modern database workloads present ample opportunities for reusing intermediate results. For example, exploration-oriented applications such as Vizdom typically generate workloads where each query serves as a jumping-off point for the next, which is obtained through incremental modifications (e.g., refining filters, adding joins, or drilling down). Various techniques have been developed to profitably reuse intermediates in DBMSs. These solutions typically require that the intermediate results of individual operators be materialized into temporary tables so that they can be considered for reuse later. However, these approaches are fundamentally ill-suited for modern main memory databases. Modern main memory DBMSs are typically limited by the bandwidth of the memory bus, and query execution is thus heavily optimized to keep tuples in the CPU caches and registers. Adding operations to a query plan that materialize intermediates into a temporary data structure in memory not only adds traffic to the memory bus but, more importantly, destroys this cache and register locality, which results in high performance penalties.

To that end, the goal of this project is to revisit “reuse” in the context of modern main memory databases. The main idea is to leverage internal data structures that are materialized anyway by pipeline breakers during query execution. This way, reuse is possible without any additional materialization costs. The focus of this work is on the most common data structure, hash tables (HTs), as found in hash-join and hash-aggregate operations. We leave other operators and data structures (e.g., trees for sorting) for future work. Our experiments show speed-ups of up to 50x compared to execution strategies without reuse and up to 10x compared to traditional materialization-based reuse approaches, without adding materialization overhead.
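
A minimal sketch of the core idea, keeping the hash table that a hash join builds anyway so that a follow-up query can probe it again, might look as follows. It is deliberately simplified and is not the HashStash implementation: the names HashTableCache, get_or_build, and hash_join are hypothetical, and the cache only handles exact matches on the build table and join key. How the actual system selects reuse candidates is not described in this summary.

```python
class HashTableCache:
    """Hypothetical cache of hash tables built by earlier hash joins."""

    def __init__(self):
        self._tables = {}
        self.builds = 0  # counts how often a hash table had to be built

    def get_or_build(self, name, rows, key):
        cache_key = (name, key)
        if cache_key not in self._tables:
            table = {}
            for row in rows:
                table.setdefault(row[key], []).append(row)
            self._tables[cache_key] = table
            self.builds += 1
        return self._tables[cache_key]


def hash_join(cache, build_name, build_rows, probe_rows, key,
              probe_filter=lambda row: True):
    """Join probe_rows against the (possibly cached) hash table on `key`."""
    table = cache.get_or_build(build_name, build_rows, key)
    return [
        {**match, **probe}
        for probe in probe_rows
        if probe_filter(probe)
        for match in table.get(probe[key], [])
    ]


customers = [{"cid": 1, "name": "Ada"}, {"cid": 2, "name": "Bob"}]
orders = [
    {"cid": 1, "amount": 30},
    {"cid": 2, "amount": 5},
    {"cid": 1, "amount": 70},
]

cache = HashTableCache()
# First query: the join builds the hash table on customers as a side effect.
q1 = hash_join(cache, "customers", customers, orders, "cid")
# Refined query: a filter is added on the probe side, so the same table is reused.
q2 = hash_join(cache, "customers", customers, orders, "cid",
               probe_filter=lambda order: order["amount"] > 20)
print(cache.builds, len(q1), len(q2))
```

The second query adds no build work and no extra materialization step, because the hash table it probes was a by-product of the first query's execution.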

Publications

Kayhan Dursun, Carsten Binnig, Ugur Cetintemel, Tim Kraska: Revisiting Reuse in Modern Main Memory Databases. Research paper, to appear in SIGMOD 2017.