The keynote address, “The DataHub: A Collaborative Data Analytics and Visualization Platform,” will introduce DataHub, a hosted interactive data processing, sharing, and visualization system for large-scale data analytics that is now being built at MIT. Key features include: flexible ingest and data cleaning tools; a scalable, parallel, SQL-based analytic data processing engine optimized for extremely low-latency operation on large data sets; and an interactive visualization system that is tightly coupled to the data processing and lineage engine. Finally, DataHub is a hosted data platform, designed to eliminate the need for users to manage their own database.

The authors analyze Hadoop workloads from three different research clusters from an application-level perspective, with two goals: (1) to explore new issues in application patterns and user behavior, and (2) to understand key performance challenges related to IO and load balance. The authors’ analysis suggests that Hadoop usage is still in its adolescence. Overall, they find significant opportunity for simplifying the use and optimization of Hadoop, and they make recommendations for future research. (For a more detailed summary of this paper, see this excellent blog post by Magda Balazinska.)

The authors address the problem of selectivity estimation in a crowdsourced database. Specifically, they develop several techniques for using workers on a crowdsourcing platform like Amazon’s Mechanical Turk to estimate the fraction of items in a dataset (e.g., a collection of photos) that satisfy some property or predicate (e.g., photos of trees). The authors find that for images, counting can reduce the amount of work necessary to arrive at an estimate that is within 1% of the true fraction by up to an order of magnitude, with lower worker latency.

Monomi securely executes analytical workloads over sensitive data on an untrusted database server. It works by encrypting the entire database and running queries over the encrypted data. Monomi introduces split client/server query execution, which can execute arbitrarily complex queries over encrypted data, as well as several techniques that improve performance for such workloads, a designer for choosing an efficient physical design at the server for a given workload, and a planner to choose an efficient execution plan for a given query at runtime.
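The split client/server idea can be sketched in miniature: the server stores only ciphertexts and executes the parts of a query it can run over encrypted data (here, an equality filter over a deterministic encoding), while the client decrypts the matching values and finishes the parts the server cannot (here, the AVG). The "encryption" functions below are toy stand-ins, not Monomi's actual schemes.

```python
import hashlib

KEY = b"client-secret"  # held only by the client

def det(value):
    """Deterministic encoding: equal plaintexts map to equal
    ciphertexts, so the server can evaluate equality predicates.
    (A keyed hash stands in for a real DET scheme.)"""
    return hashlib.sha256(KEY + value.encode()).hexdigest()

def enc(n):
    """Toy stand-in for a randomized value encryption (e.g., a
    scheme like Paillier in the real system). NOT real crypto."""
    return n + 12345

def dec(c):
    return c - 12345

# Server-side table: the server only ever sees ciphertexts.
sales = [(det("books"), enc(10)), (det("books"), enc(30)),
         (det("toys"), enc(7))]

# Query: SELECT AVG(price) FROM sales WHERE category = 'books'
# Server part: the client sends det("books") as the query constant,
# and the filter runs entirely over encrypted data.
matched = [price for cat, price in sales if cat == det("books")]

# Client part: decrypt the filtered values and compute the aggregate
# the server cannot evaluate over ciphertexts.
avg = sum(dec(c) for c in matched) / len(matched)
print(avg)  # 20.0
```

Monomi's designer and planner automate exactly this kind of decision: which columns to store under which encryption schemes, and where to cut each query plan between server and client.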

Scorpion is a system that takes user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples used to compute the selected outlier results. This explanation identifies predicates that, when applied to the input data, cause the outliers to disappear from the output. To find such predicates, the authors design algorithms that efficiently search for maximum influence predicates over the input data. The authors show that these algorithms can run orders of magnitude faster than a naive search algorithm while providing comparable quality on a synthetic data set.
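The core search can be sketched as follows: for each candidate predicate over the input attributes, measure how much holding out the matching tuples moves the aggregate, and keep the predicate with the greatest effect. This is a simplified influence measure over a toy dataset, not Scorpion's actual algorithm, and all names are illustrative.

```python
def influence(tuples, pred):
    """How much removing the tuples matching `pred` changes an
    AVG aggregate over the remaining input (larger = the predicate
    better explains the outlier)."""
    all_vals = [t["val"] for t in tuples]
    kept = [t["val"] for t in tuples if not pred(t)]
    if not kept:
        return 0.0
    return abs(sum(all_vals) / len(all_vals) - sum(kept) / len(kept))

# Toy input: readings from sensor "c" misreport, inflating the AVG
# that the user flagged as an outlier.
tuples = [
    {"sensor": "a", "val": 20}, {"sensor": "a", "val": 22},
    {"sensor": "b", "val": 21}, {"sensor": "c", "val": 95},
    {"sensor": "c", "val": 99},
]

# Candidate predicates: one equality predicate per attribute value.
candidates = {f"sensor = {s}": (lambda t, s=s: t["sensor"] == s)
              for s in {t["sensor"] for t in tuples}}

best = max(candidates, key=lambda name: influence(tuples, candidates[name]))
print(best)  # sensor = c
```

This brute-force loop is the "naive search" baseline; Scorpion's contribution is pruning and partitioning strategies that find high-influence predicates orders of magnitude faster.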

The authors introduce AscotDB, a new tool for the analysis of telescope image data. AscotDB results from the integration of Ascot, a web-based tool for the collaborative analysis of images and metadata from astronomical telescopes, and SciDB, a parallel array processing engine. The authors demonstrate the novel data exploration supported by this integrated tool on a 1-TB dataset comprising scientifically accurate, simulated telescope images.

About VLDB

VLDB is a premier annual international forum for data management and database researchers, vendors, practitioners, application developers, and users. Data management and databases remain among the main technological cornerstones of emerging applications of the 21st century.