Data Science Theory, Methods and Tools

Researchers in this cluster work on the theoretical foundations of Data Science, design machine learning algorithms with provable guarantees, and develop broadly useful methods and tools that help practitioners combat the “deluge” of data produced by ever-growing data sources. Researchers with core expertise in algorithms, mathematics, and statistics work with domain experts in areas where there is a perceived benefit to collecting large amounts of data. The constant interplay between the particulars of a domain and the generality of methods is essential to the advances we seek in algorithmic data science.

Analysis of high-dimensional data is a foundational pillar of modern data science and its applications, and lies behind many of the recent advances in “Artificial Intelligence”. This group of researchers has expertise in a range of methods, including multivariate analysis (and its recent growth as unsupervised learning), clustering, dimensionality reduction, and reconstruction. Our focus is on analytic methods and tools that enable data professionals to efficiently navigate and analyze real-world data influenced by a large number of parameters. Many approaches to high-dimensional data require additional assumptions on the data in order to succeed: from sparsity in some representation, to manifold-type behavior, to conditions on the tail distribution of the data. These assumptions are motivated by physical and scientific properties inherent to the application area.
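To make the low-dimensional-structure assumption concrete, the following minimal sketch (synthetic data, NumPy; an illustration, not any specific project of this cluster) recovers a hidden five-dimensional subspace from 100-dimensional observations via principal component analysis:

    # Sketch: dimensionality reduction via PCA on data with hidden low-rank structure.
    # All data here is synthetic; in practice X would come from the application domain.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 1000, 100, 5                  # samples, ambient dimension, intrinsic dimension
    basis = rng.normal(size=(k, d))         # hidden k-dimensional subspace
    X = rng.normal(size=(n, k)) @ basis     # data living near that subspace
    X += 0.01 * rng.normal(size=(n, d))     # small ambient noise

    Xc = X - X.mean(axis=0)                 # center before PCA
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s**2) / (s**2).sum()
    print("variance captured by top 5 components:", explained[:k].sum())
    Z = Xc @ Vt[:k].T                       # k-dimensional representation of each sample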

A subgroup of researchers focuses on the computational challenges of optimization and tensor computation. Algorithms for optimization and tensor computation have widespread applications in signal processing (blind source separation, phase retrieval, low-rank matrix completion), machine learning (latent variable analysis, clustering), hypergraph theory, and higher-order statistics. These data-driven applications rely on the formulation and analysis of efficient methods for a range of problems, including nonconvex and global optimization, tensor decomposition, low-rank approximation, and the estimation of tensor eigenvalues.
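As one concrete instance of low-rank approximation, the Eckart–Young theorem says that the best rank-r approximation of a matrix in the Frobenius norm is given by its truncated SVD; a minimal NumPy sketch on synthetic data:

    # Sketch: best rank-r approximation via truncated SVD (Eckart-Young).
    # The matrix here is synthetic; tensor analogues require iterative methods.
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 150))  # exactly rank 10
    A += 0.05 * rng.normal(size=A.shape)                          # plus noise

    r = 10
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]                      # rank-r approximation
    rel_err = np.linalg.norm(A - A_r) / np.linalg.norm(A)
    print(f"relative Frobenius error of rank-{r} approximation: {rel_err:.4f}")

Unlike this matrix case, many of the analogous tensor problems (such as computing tensor rank) are NP-hard in general, which is one reason efficient nonconvex methods for tensor decomposition remain an active research topic.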

Computer-Intensive and Non-parametric Statistical Methods

With the advent of the personal computer in the latter part of the 20th century, statisticians have been gradually moving away from parametric models, which often rely on restrictive and/or unreliable assumptions, toward more flexible nonparametric models. These include resampling/bootstrap, subsampling/jackknife, and cross-validation, which provide practitioners with general ways to conduct statistical inference (e.g., hypothesis tests, confidence intervals, and prediction) in a nonparametric setting. Short-term goals of this cluster include: a) bootstrap prediction intervals for the volatility of financial data; b) permutation tests applied to modern detection problems; c) improved estimation of conditional distributions in regression; d) model-free bootstrap for nonparametric regression; and e) multiple hypothesis testing and control of the false discovery rate via subsampling.
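As a minimal sketch of the resampling idea, the following computes a percentile-bootstrap confidence interval for a mean on synthetic data, without any parametric distributional assumption (illustrative only, not one of the cluster goals listed above):

    # Sketch: percentile-bootstrap confidence interval for the mean.
    # Synthetic skewed data; no parametric distributional assumption is used.
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=2.0, size=80)       # observed sample (synthetic)

    B = 5000
    boot_means = np.array([
        rng.choice(x, size=x.size, replace=True).mean()  # resample with replacement
        for _ in range(B)
    ])
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    print(f"95% bootstrap CI for the mean: ({lo:.3f}, {hi:.3f})")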

Accelerated Learning Methods: Hardware and Software

As learning methods continue to find new applications and enable new system-level capabilities such as automated driving, efficient implementation of these methods in customized hardware/software solutions becomes essential for their continued proliferation to new platforms. This group of researchers explores algorithmic, architectural, and hardware accelerator designs, along with co-design methods, to provide orders-of-magnitude increases in the performance and energy efficiency of machine learning systems.
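One widely used algorithmic lever in such hardware/software co-design is reduced-precision arithmetic; the following sketch shows symmetric int8 quantization of a weight matrix (a generic technique shown for illustration, not a description of this group's designs):

    # Sketch: symmetric int8 quantization of a weight matrix -- one common
    # technique for trading precision for speed/energy on accelerators.
    # (Illustrative only; not tied to any specific hardware design.)
    import numpy as np

    rng = np.random.default_rng(5)
    W = rng.normal(scale=0.1, size=(64, 64)).astype(np.float32)

    scale = np.abs(W).max() / 127.0                # map [-max, max] -> [-127, 127]
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    W_deq = W_q.astype(np.float32) * scale         # dequantize to check error
    print("max abs quantization error:", np.abs(W - W_deq).max())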

Experimental Design and Hypothesis Testing

Even when backed by large data sets and sophisticated reasoning tools, poorly designed experiments can easily lead researchers to false conclusions, only now held with more confidence. To reduce false discoveries, automated exploration of large data sets to establish a scientific fact, or to prove or disprove an assertion, requires careful design of data experiments and statistical analyses, especially in online settings. We explore mathematical foundations, formal methods, and tools that help Data Science practitioners design sound experiments and draw deductions at specified confidence levels.
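As a minimal sketch of testing at a specified confidence level, the following runs a two-sample permutation test on synthetic data; the test makes no normality assumption (illustrative only, not a specific tool of this group):

    # Sketch: two-sample permutation test for a difference in means.
    # Groups are synthetic; the test makes no normality assumption.
    import numpy as np

    rng = np.random.default_rng(3)
    a = rng.normal(loc=0.0, scale=1.0, size=50)   # control (synthetic)
    b = rng.normal(loc=0.4, scale=1.0, size=50)   # treatment (synthetic)

    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])
    count = 0
    n_perm = 10000
    for _ in range(n_perm):
        rng.shuffle(pooled)                        # relabel under the null
        diff = pooled[a.size:].mean() - pooled[:a.size].mean()
        if abs(diff) >= abs(observed):
            count += 1
    print(f"permutation p-value: {(count + 1) / (n_perm + 1):.4f}")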

Causality and Inference

The inference of causality from empirical data is one of the deepest and most central goals of science. Virtually all aspects of scholarly inquiry involve, in some form, the search for the causal forces that shape physical, social, and mental phenomena, a question that has vexed and challenged scholars across a wide range of disciplines. A variety of approaches have been proposed, each with strengths as well as limitations. This group will bring together faculty who deal with diverse sets of data and phenomena across different disciplines, but who are joined by a common interest in exploring and applying existing methods for inferring causality, as well as in developing new approaches.
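As a minimal sketch of one classical approach among the many this group studies, the following estimates an average treatment effect by regression adjustment for a measured confounder, on synthetic data where the true effect is 2.0 (illustrative only; it assumes the linear model is correctly specified and all confounders are observed):

    # Sketch: estimating an average treatment effect by adjusting for a
    # measured confounder (synthetic data with a known true effect of 2.0).
    import numpy as np

    rng = np.random.default_rng(4)
    n = 5000
    z = rng.normal(size=n)                         # confounder
    t = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(float)  # treatment depends on z
    y = 2.0 * t + 3.0 * z + rng.normal(size=n)     # outcome depends on t and z

    naive = y[t == 1].mean() - y[t == 0].mean()    # biased: ignores z
    X = np.column_stack([np.ones(n), t, z])        # adjust for z via regression
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    print(f"naive difference: {naive:.2f}, adjusted estimate: {beta[1]:.2f}")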

Streaming and Sub-linear Learning Algorithms

Traditional algorithms need to read and manipulate the entire input given to them in order to compute a solution to the problem they are designed to solve. The amounts of data encountered in machine learning make many of these traditional algorithms too costly to use in practice. One approach to resolving this is the design of sub-linear algorithms, which use sampling techniques to consider only a small portion of the input, with the guarantee that, with high probability, the sample is representative of the full input. Examples include algorithms on large sparse graphs, such as social networks or networks arising in biology. Another approach is to use streaming algorithms, which process the entire input but at any given point remember only a concise representation of the important information seen so far. This is also critical when the data is observed on the fly and cannot be stored in memory due to its volume.
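As a minimal sketch of the streaming idea, reservoir sampling maintains a uniform random sample of a stream of unknown length using memory proportional only to the sample size (a standard textbook algorithm, shown for illustration):

    # Sketch: reservoir sampling -- a uniform k-sample from a stream of
    # unknown length, using O(k) memory (Vitter's Algorithm R).
    import random

    def reservoir_sample(stream, k, seed=0):
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)             # fill the reservoir first
            else:
                j = rng.randrange(i + 1)           # keep item with prob k/(i+1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(10**6), k=5))     # 5 uniform picks from a million items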

Models and Analysis of Multimedia Data

Data Security and Privacy

We will explore system designs, new programming languages and paradigms, and ML techniques that can provide strong security and data privacy guarantees. At the same time, we will design new scalable program analysis and ML techniques to find bugs and vulnerabilities in large systems (e.g., browsers and operating systems).
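As a minimal sketch of one formal privacy notion, the Laplace mechanism of differential privacy releases a count with calibrated noise (a standard textbook technique, shown for illustration; not necessarily this group's specific approach):

    # Sketch: the Laplace mechanism -- a standard way to release a count with
    # a differential-privacy guarantee (one example of a formal privacy notion;
    # illustrative, not this group's specific approach).
    import numpy as np

    rng = np.random.default_rng(6)
    true_count = 1234                              # sensitive statistic (synthetic)
    epsilon = 0.5                                  # privacy budget
    sensitivity = 1.0                              # a count changes by at most 1 per person
    noisy = true_count + rng.laplace(scale=sensitivity / epsilon)
    print(f"released count: {noisy:.1f} (epsilon = {epsilon})")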