Search form

Search

Graduate Summer School 2016

The Mathematics of Data

The Graduate Summer School bridges the gap between a general graduate education in mathematics and the specific preparation necessary to do research on problems of current interest. In general, these students will have completed their first year, and in some cases, may already be working on a thesis. While a majority of the participants will be graduate students, some postdoctoral scholars and researchers may also be interested in attending.

The main activity of the Graduate Summer School will be a set of intensive short lectures offered by leaders in the field, designed to introduce students to exciting, current research in mathematics. These lectures will not duplicate standard courses available elsewhere. Each course will consist of lectures with problem sessions. Course assistants will be available for each lecture series. The participants of the Graduate Summer School meet three times each day for lectures, with one or two problem sessions scheduled each day as well.

In order to derive insight from data, one needs to perform computations, and one needs to perform statistical inference. Both of these tasks raise important and fundamental mathematical questions, especially when one considers realistic sparsity and noise properties, realistic size scales, realistic temporal properties, etc. These questions are often considered outside traditional mathematics departments, and they present challenges to the theoretical foundations of related methodological areas such as computer science and statistics. This requires revisiting traditional and novel areas of applied mathematics to determine which subset of those areas can be used to help establish the theoretical foundations of modern large-scale data analysis. Topics will include Randomized Linear Algebra, Topological Data Analysis, Theoretical Computer Science, Theoretical Statistics, Functional Analysis, Scientific computing, and Optimization. The goal of the program is to present these ideas in perspective, covering the necessary background and leading up to the recent progress and open problems.

Student preparation: We seek motivated students interested in the mathematical, e.g., algorithmic and statistical, aspects of modern large-scale data analysis, including theoretically-inclined students from computer science, statistics, applied mathematics, and related areas. Though familiarity with some of the topics listed above would be helpful, the formal prerequisites are limited to the content of standard introductory courses in linear algebra, probability, and optimization.

The 26th Annual PCMI Summer Session will be held June 30 – July 20, 2016.

The introduction of randomization in the design and analysis of algorithms for matrix computations (such as matrix multiplication, least-squares regression, the Singular Value Decomposition (SVD), etc.) over the past 15 years provided a new paradigm and a complementary perspective to traditional numerical linear algebra approaches. These novel approaches were motivated by technological developments in many areas of scientific research that permit the automatic generation of large data sets, which are often modeled as matrices.

We will outline how such approaches can be used to approximately solve problems ranging from matrix multiplication and the Singular Value Decomposition (SVD) of matrices to the Column Subset Selection Problem and the CX decomposition. Application of the proposed algorithms to data analysis tasks (with a particular focus in population genetics) will also be discussed

We study several problems of online learning and stochastic optimization, which have become more important in the face of large scale data analyses. In particular, these lectures will give an introduction to a variety of online, stochastic, and bandit optimization techniques, making connections with a variety of statistical and inferential tasks, such as prediction problems and discovery of causal mechanisms. Additionally, while the current "deluge" of data suggests that with large enough samples, we can investigate anything, this is often not the case and false discoveries are common. Consequently, the lectures will also develop information-theoretic tools that allow us to understand the limitations of different data analyses, and which allow us to certify the optimality of a number of the procedures we develop.

To realize the full potential of big data for societal benefit, we must also find solutions to the privacy problems raised by the collection, analysis, and sharing of vast amounts of data about people. To fully enable an industry fueled by personal data, we must understand privacy and its limits. The traditional approach of anonymizing data by removing identifiers does not provide adequate privacy protection, since it is often possible to re-identify individuals by linking the seemingly innocuous data that remains in the dataset to auxiliary information known to an attacker and/or present in publicly available datasets. Simply put, “de-Identified data isn’t” – either because it is not anonymous or it no longer functions as data. Differential privacy, a definition of privacy tailored to analysis of large data sets, offers the possibility of avoiding such vulnerabilities. It provides a satisfying resolution to the problem of disentangling learning about a data set as a whole from compromising the privacy of individual data records, crystalized in a mathematically rigorous “stability” requirement that the distribution on outputs produced by a (randomized) analysis should be essentially unchanged by the addition or deletion of the data of a single individual. There is a rich literature on differential privacy, and recent results have shown the stability of differentially private algorithms to provide benefits when privacy is not a concern in and of itself. A notable example is the inoculation against overfitting in adaptive data analysis.

This program combines a short course on differential privacy, emphasizing the construction of differentially private algorithms, with a sequence of more advanced lectures on important new directions.

This course will cover the background, techniques, and applications of Topological Data Analysis. Beginning with an introduction to the classical tools of algebraic topology, we will progress through applications to point clouds, persistence, networks, and more, with far-ranging applications. No background in topology will be assumed.

The discrete Fourier transform (DFT) is a fundamental component of numerous computational techniques in signal processing and scientific computing. The most popular means of computing the DFT is the fast Fourier transform (FFT). However, with the emergence of big data, the “fast” in FFT is often no longer fast enough. In addition, in many applications it is hard to acquire a sufficient amount of data to compute the desired Fourier transform in the first place.

The Sparse Fourier Transform (SFT) is based on the insight that many real-world signals are sparse –i.e., most of the frequencies have negligible contribution to the overall signal. SFT exploits this insight by computing a compressed Fourier transform in time proportional to the data sparsity, not the data size. Furthermore, it uses only a subset of the signal.

The goal of this talk is to survey recent developments in this area and explain the basic techniques with examples and applications. Further resources are available at: http://groups.csail.mit.edu/netmit/sFFT/.

We discuss a variety of techniques for working with data that while being high-dimensional has intrinsically low-dimensional structure. We provide motivations for this assumption, ask how to measure such intrinsic dimension, in a robust fashion, and then discuss both linear and nonlinear dimension-reduction techniques, as well as a host of techniques that work directly in high-dimensional space, while have both computational cost and statistical accuracy dependent on the intrinsic dimension, and not the ambient one. These techniques are the basis for novel dictionary learning methods, as well as for attacking classical machine learning tasks, such as classification and regression. We will also discuss the basic statistical problem of estimating the probability measure generating data, when such measure is high-dimensional but concentrates around low-dimensional sets. Finally, we present connections with the emergent field of signal processing on graphs, and its connections with spectral graph theory. In particular we will discuss multiscale analysis of functions on graphs. Many open problems and research directions will be discussed. Computational techniques used will be discussed; experimentation will be made easy by software toolboxes.

Randomized algorithms for matrix computations and analysis of high dimensional data

The lectures will describe powerful new techniques based on randomized projections, with applications in linear algebra and data analysis. We will investigate both the mathematical properties of random projections and how they can be exploited to construct computationally efficient algorithms. The lectures will focus on techniques for computing matrix factorizations of very large matrices, which directly leads to fast algorithms for computing approximate eigenvalues and eigenvectors, performing Principal Component Analysis, solving linear regression problems, finding spanning vectors, etc. We will also touch on related questions in data mining such as, e.g, performing a nearest neighbor search on a cloud of points in a high dimensional space.

These lectures will explore some classical and recently developed concentration techniques for random matrices and their applications in data science. The course will be focused on two
application domains: (a) analysis of networks using stochastic models and (b) signal recovery from few observations, with ramifications in compressed sensing and statistical regression.

Optimization formulations and algorithms are essential tools in data analysis and machine learning. A comprehensive knowledge of the optimization "toolbox" --- its "contents" and how to deploy it ---- is of great value to any researcher or practitioner working with data. In these sessions, we survey some of the most important and relevant topics in this space. The precise selection of topics will be coordinated with other speakers, but may include: first-order methods, acceleration, regularized optimization, forward-backward methods, stochastic gradient methods, coordinate descent methods, conditional gradient / Frank-Wolfe methods, asynchronous parallel implementations, matrix optimization (including matrix completion), and Augmented Lagrangian methods / ADMM. The aim is to discuss the applicability of these methods to data analysis and to describe their essential theoretical properties.