TOC Seminar '13-'14

A unified approach to dimensionality reduction with subgaussian matrices

The Johnson-Lindenstrauss lemma says that one can embed a set of n points in a (high-dimensional) Euclidean space into a lower, say m-dimensional, space using a subgaussian matrix, while approximately preserving the Euclidean distances between the points in the set. The embedding dimension m does not depend on the original dimension of the data, but scales logarithmically with the number of points. This result has found many applications, e.g. in nearest-neighbor search.
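A minimal pure-Python sketch of such an embedding, using a Gaussian matrix as the subgaussian matrix (function names are illustrative, not from the talk):

```python
import math
import random

def jl_embed(points, m, seed=0):
    """Project d-dimensional points to m dimensions using a random
    Gaussian matrix (one instance of a subgaussian matrix), scaled by
    1/sqrt(m) so squared norms are preserved in expectation."""
    rng = random.Random(seed)
    d = len(points[0])
    A = [[rng.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(d)]
         for _ in range(m)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in A]
            for p in points]

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For n points, taking m on the order of log(n)/eps^2 suffices to preserve all pairwise distances up to a factor 1+eps, regardless of the original dimension d.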

Over the years J-L type embeddings have appeared for various data sets of infinite size, such as low-dimensional subspaces (motivated by numerical linear algebra), sparse vectors (used in compressed sensing) and smooth manifolds (manifold learning). In my talk I will present a J-L type embedding with subgaussian matrices valid for an arbitrary data set. In particular this result implies all of the aforementioned results for specific data structures. If time permits I will sketch how to prove this embedding using ideas from generic chaining.

Shannon's monumental 1948 work laid the foundations for the rich fields of information and coding theory. The quest for *efficient* coding schemes to approach Shannon capacity has occupied researchers ever since, with spectacular progress enabling the widespread use of error-correcting codes in practice. Yet the theoretical problem of approaching capacity arbitrarily closely with polynomial complexity remained open except in the special case of erasure channels.

In 2008, Arikan proposed an insightful new method for constructing capacity-achieving codes based on channel polarization. In this talk, I will begin by surveying Arikan's celebrated construction of polar codes, and then discuss our proof (with Patrick Xia) that, for all binary-input symmetric memoryless channels, polar codes enable reliable communication at rates within epsilon > 0 of the Shannon capacity with block length (delay), construction complexity, and decoding complexity all bounded by a *polynomial* in the gap to capacity, i.e., by poly(1/epsilon). Polar coding gives the *first explicit construction* with rigorous proofs of all these properties; previous constructions were not known to achieve capacity with less than exp(1/epsilon) decoding complexity.

We establish the capacity-achieving property of polar codes via a direct elementary analysis of the underlying martingale of conditional entropies. This yields effective bounds on the speed of polarization, implying that polar codes can operate at rates within epsilon of capacity at a block length bounded by poly(1/epsilon). The generator matrix of such polar codes can also be constructed in deterministic polynomial time by algorithmically computing an adequate approximation of the polarization process.
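As a concrete illustration (mine, not the authors'): on the binary erasure channel BEC(z), Arikan's 2x2 transform acts exactly on erasure probabilities, sending a channel with erasure probability z to a "minus" channel with 2z - z^2 and a "plus" channel with z^2. A few lines of Python make the polarization effect visible:

```python
def polarize_bec(z0, levels):
    """Track the erasure probabilities of the synthesized channels
    obtained by recursively applying Arikan's transform to BEC(z0).
    Each level maps erasure probability z to the pair (2z - z^2, z^2)."""
    zs = [z0]
    for _ in range(levels):
        zs = [w for z in zs for w in (2 * z - z * z, z * z)]
    return zs
```

After a few levels on BEC(1/2), most synthesized channels are either nearly noiseless or nearly useless, while the average erasure probability stays exactly 1/2 (the transform is capacity-preserving); the speed of this convergence is what the effective bounds in the talk quantify.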

The Entropy Soon-To-Be-Method

In the last couple of decades, several combinatorial bounds and algorithmic results were obtained using techniques that are derived from Information Theory. In this talk we will survey a few of them, demonstrating just how widely applicable such techniques are. A tentative list of topics:
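To give the flavor of such arguments, here is one classic entropy-method fact (my example, not necessarily on the speaker's list): the number of k-subsets of an n-set is at most 2^{nH(k/n)}, where H is the binary entropy function.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2(p) - (1-p) log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_bound(n, k):
    """The entropy upper bound on the binomial coefficient:
    C(n, k) <= 2^(n * H(k/n))."""
    return 2 ** (n * binary_entropy(k / n))
```

The proof is a one-line entropy computation: a uniformly random k-subset of [n] has entropy log2 C(n,k), which by subadditivity is at most the sum of the entropies of its n membership indicator bits, each at most H(k/n).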

Causal Erasure Channels

We consider the communication problem over binary causal adversarial erasure channels. Such a channel maps n input bits to n output symbols in {0, 1, e}, where e denotes an erasure. The channel is causal if, for every i, the channel adversarially decides whether to erase the i-th bit of its input based on input bits 1, ..., i, and before it observes bits i+1 to n. Such a channel is p-bounded if it can erase at most a p fraction of the input bits over the whole transmission duration. Causal channels provide a natural model for channels that obey basic physical restrictions but are otherwise unpredictable or highly variable. For a given erasure rate p, our goal is to study the optimal rate (the capacity) at which a randomized (stochastic) encoder/decoder can transmit reliably across all causal p-bounded erasure channels.
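As a toy illustration of the model (my sketch; the adversary and function names are hypothetical, not from the paper), a causal p-bounded adversary can be simulated as a callback that sees only the prefix transmitted so far:

```python
def causal_erasure(bits, p, decide):
    """Apply a causal p-bounded erasure adversary to the input bits.
    `decide(prefix, budget)` sees only bits[0..i] (causality) and
    returns True to erase bit i; at most floor(p*n) erasures in total
    (p-boundedness)."""
    n = len(bits)
    budget = int(p * n)
    out = []
    for i in range(n):
        if budget > 0 and decide(bits[:i + 1], budget):
            out.append('e')   # erasure symbol
            budget -= 1
        else:
            out.append(bits[i])
    return out

# A simple (hypothetical) greedy adversary: erase every 1 it sees.
greedy = lambda prefix, budget: prefix[-1] == 1
```

The fully adversarial model would instead let `decide` see all n bits in advance, and the random-erasure model replaces `decide` with independent coin flips; the talk's results separate the capacities of these three models.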

In this talk, I will present the causal erasure model and give new upper and lower bounds on the capacity of such channels. Our bounds separate the causal erasure channels from two related models: random erasure channels (strictly weaker) and fully adversarial erasure channels (strictly stronger). Specifically, we show:

- A strict separation between random and causal erasures for all constant erasure rates p ∈ (0, 1). In particular, we show that the capacity of causal erasure channels is 0 for p ≥ 1/2 (while it is nonzero for random erasures).

- For p ∈ [φ, 1/2), we construct codes for causal erasures that achieve strictly higher rates than the best known codes for fully adversarial channels.

Our results contrast with existing results on correcting causal bit-flip errors (as opposed to erasures) [Dey et al. '08, '09], [Haviv-Langberg '11]. For the separations we provide, the analogous separations for bit-flip models are either not known at all or much weaker.

This talk is based on joint work with Adam Smith (Pennsylvania State University). It has appeared in the proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA) 2014.

Approximating Hereditary Discrepancy via Small Width Ellipsoids

The Discrepancy of a hypergraph is the minimum attainable value, over two-colorings of its vertices, of the maximum absolute imbalance of any hyperedge. The Hereditary Discrepancy of a hypergraph, defined as the maximum discrepancy of a restriction of the hypergraph to a subset of its vertices, is a measure of its complexity. Lovasz, Spencer and Vesztergombi (1986) related the natural extension of this quantity to matrices to rounding algorithms for linear programs, and gave a determinant-based lower bound on the hereditary discrepancy. Matousek (2011) showed that this bound is tight up to a polylogarithmic factor, leaving open the question of actually computing the bound. Recent work by Nikolov, Talwar and Zhang (2013) showed a polynomial time O(log^3 n)-approximation to hereditary discrepancy, as a by-product of their work in differential privacy. In this paper, we give a direct and simple O(log^{3/2} n)-approximation algorithm for this problem. We show that up to this approximation factor, the hereditary discrepancy of a matrix A is characterized by the optimal value of a simple geometric convex program that seeks to minimize the largest infinity norm of any point in an ellipsoid containing the columns of A.
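To make the definitions concrete, here is an exponential-time brute-force computation of discrepancy and hereditary discrepancy for tiny matrices (my illustration only; the paper's contribution is the polynomial-time approximation, not this):

```python
from itertools import combinations, product

def disc(rows):
    """Discrepancy of a matrix given as a list of rows: minimize, over
    +/-1 colorings of the columns, the maximum absolute row imbalance."""
    n = len(rows[0])
    return min(
        max(abs(sum(r[j] * x[j] for j in range(n))) for r in rows)
        for x in product((-1, 1), repeat=n)
    )

def herdisc(rows):
    """Hereditary discrepancy: the maximum of disc over all restrictions
    to a nonempty subset of the columns."""
    n = len(rows[0])
    best = 0
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            sub = [[r[j] for j in S] for r in rows]
            best = max(best, disc(sub))
    return best
```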

Joint work with Kunal Talwar.

Hitting Sets for Multilinear Read-Once Algebraic Branching Programs, in any Order

It is an important open question whether we can derandomize small space computation, that is, whether RL equals L. One version of this question is to construct pseudorandom generators for read-once oblivious branching programs. There are well-known results in this area (due to Nisan, and Impagliazzo-Nisan-Wigderson), but they fail to achieve optimal seed-length. Further, it has been observed that these pseudorandom generators depend strongly on the "order" of the "reads" of the branching program. When this order is allowed to vary, only much weaker results are known.

In this work, we consider an "algebraic" version of this question. That is, we seek to fool read-once algebraic branching programs, regardless of the variable order. By rephrasing and improving the techniques of Agrawal-Saha-Saxena, we are able to construct hitting sets for multilinear polynomials in this unknown-order model that have polylogarithmic "seed-length". This constitutes the first quasipolynomial-time, deterministic, black-box polynomial identity testing (PIT) algorithm for this model.

In this work, we show how to use indistinguishability obfuscation (iO) to build multiparty key exchange, efficient broadcast encryption, and efficient traitor tracing. Our schemes enjoy several interesting properties that have not been achievable before:

-Our multiparty non-interactive key exchange protocol does not require a trusted setup. Moreover, the size of the published value from each user is independent of the total number of users.

-Our broadcast encryption schemes support distributed setup, where users choose their own secret keys rather than being given secret keys by a trusted entity. The broadcast ciphertext size is independent of the number of users.

-Our traitor tracing system is fully collusion resistant with short ciphertexts, secret keys, and public key. Ciphertext size is logarithmic in the number of users and secret key size is independent of the number of users. Our public key size is polylogarithmic in the number of users. The recent functional encryption system of Garg, Gentry, Halevi, Raykova, Sahai, and Waters also leads to a traitor tracing scheme with similar ciphertext and secret key size, but the construction in this paper is simpler and more direct. These constructions resolve an open problem relating to differential privacy.

Our proof of security for private broadcast encryption and traitor tracing introduces a new tool for iO proofs: the construction makes use of a key-homomorphic symmetric cipher which plays a crucial role in the proof of security.

On Interactivity in Arthur-Merlin Communication and Stream Computation

We introduce online interactive proofs (OIP), which are a hierarchy of communication complexity models that involve both randomness and nondeterminism (thus, they belong to the Arthur-Merlin family), but are *online* in the sense that the basic communication flows from Alice to Bob alone. The complexity classes defined by these OIP models form a natural hierarchy based on the number of rounds of interaction between verifier and prover. We give upper and lower bounds that (1) characterize every finite level of the OIP hierarchy in terms of previously-studied communication complexity classes, and (2) separate the first four levels of the hierarchy. These results show marked contrasts and some parallels with the classic Turing Machine theory of interactive proofs.

Our motivation for studying OIP is to address computational complexity questions arising from the growing body of work on data stream computation aided by a powerful but untrusted helper. By carefully defining our complexity classes, we identify implicit assumptions in earlier lower bound proofs. This in turn indicates how we can break the mold of existing protocols, thereby achieving dramatic improvements. In particular, we present two-round stream protocols with logarithmic complexity for several query problems, including the fundamental INDEX problem. This was thought to be impossible based on previous work.

Approximating Large Frequency Moments with O(n^{1-2/k}) Bits

We consider the problem of approximating frequency moments in the streaming model. Given a stream D = {p_1,p_2,...,p_m} of numbers from {1,...,n}, the frequency of i is defined as f_i = |{j: p_j = i}|. The k-th frequency moment of D is defined as F_k = \sum_{i=1}^n f_i^k.
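For reference, the quantity being approximated is trivial to compute offline with memory linear in n; the streaming challenge is to approximate it in sublinear space (a quick illustrative sketch, function name mine):

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact offline computation of F_k = sum over items i of f_i^k,
    where f_i is the number of occurrences of i in the stream.
    Uses memory linear in the number of distinct items; the streaming
    problem asks for a (1 +/- eps)-approximation in sublinear space."""
    freqs = Counter(stream)
    return sum(f ** k for f in freqs.values())
```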

In their celebrated paper, Alon, Matias, and Szegedy (STOC 1996) introduced the problem of computing a (1 +/- epsilon)-approximation of F_k with sublinear memory. We give an upper bound of O(n^{1-2/k}) bits that matches, up to a constant factor, the lower bound of Woodruff and Zhang (STOC 2012) for constant epsilon and k > 3.

Joint work with Jonathan Katzman, Charles Seidell and Gregory Vorsanger.

Smoothed Analysis and Uniqueness of Tensor Decompositions

Low-rank tensor decompositions, the high-dimensional analog of matrix decompositions, are a powerful tool arising in statistics, signal processing, data mining and machine learning. However, tensors pose significant algorithmic challenges, and tensor analogs of much of the matrix algebra toolkit are unlikely to exist because of hardness results. For instance, efficient tensor decompositions in the overcomplete case (where rank exceeds dimension) are particularly challenging.

In this talk, I will address this by describing two recent results about tensor decompositions, with applications to learning generative models:

1. I will present a robust version of a classic theorem of Kruskal which shows uniqueness of tensor decompositions under mild rank conditions, i.e., the decomposition is unique even if the entries of the tensor have inverse polynomial error. This powerful uniqueness property of tensor decompositions gives a significant advantage over matrix decomposition methods in learning parameters of many latent variable models.

2. I will introduce a smoothed analysis model for studying tensor decompositions and give efficient algorithms for decomposing tensors, even in the highly overcomplete case (rank polynomial in the dimension). This gives new efficient algorithms for learning probabilistic models like mixtures of Gaussians and multi-view models, in the natural smoothed analysis setting. We believe this is an appealing way to analyze realistic instances of learning problems, since this framework allows us to overcome many of the usual limitations of using tensor methods.

Based on joint works with Aditya Bhaskara, Moses Charikar and Ankur Moitra.

Fourier Principal Component Analysis

Fourier PCA is Principal Component Analysis of the covariance matrix obtained after reweighting a distribution with a random Fourier weighting. It can also be viewed as PCA applied to the Hessian matrix of functions of the characteristic function of the underlying distribution. Extending this technique to higher derivative tensors and developing a general tensor decomposition method, we derive the following results: (1) a polynomial-time algorithm for general independent component analysis (ICA), not requiring the component distributions to be discrete or distinguishable from Gaussian in their fourth moment (unlike in the previous work); (2) the first polynomial-time algorithm for underdetermined ICA, where the number of components can be arbitrarily higher than the dimension; (3) an alternative algorithm for learning mixtures of spherical Gaussians with linearly independent means. These results also hold in the presence of Gaussian noise.

Testing Properties under Lp Distances

We present sublinear algorithms for approximately testing properties of real-valued data under Lp distance measures (for p = 1, 2). Our algorithms allow one to distinguish datasets which have a certain property from datasets which are far from having it with respect to Lp distance. While the classical property testing framework developed under the Hamming distance has been studied extensively, testing under Lp distances has received little attention.

For applications involving noisy real-valued data, using Lp distances is natural because, unlike Hamming distance, they allow one to suppress distortion introduced by the noise. Moreover, we show that they also allow one to design simple and fast algorithms for classic problems, such as testing monotonicity, convexity and Lipschitz conditions (also known as “sensitivity”). Our algorithms require minimal assumptions on the choice of the sampled data (either uniform or easily samplable random points suffice). We also show connections between our Lp-testing model and the standard framework of property testing under the Hamming distance. In particular, some of our results improve existing bounds for Hamming distance.

Joint work with Piotr Berman and Sofya Raskhodnikova.

Privacy, Stability and High-dimensional Sparse Regression

I will discuss recent results on how different notions of stability of learning algorithms can be used to design differentially private algorithms. We focus on designing algorithms for statistical model selection. Given a data set and a discrete collection of models, each of which is a family of probability distributions, the goal is to determine the model that best "fits" the data. This is a basic problem in many areas of statistics and machine learning.

We give two classes of results: generic ones, that apply to any function with discrete output set; and specific algorithms for the problem of sparse linear regression. The algorithms we describe are efficient and in some cases match the optimal nonprivate asymptotic sample complexity.
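For intuition about private selection over a discrete output set, here is a sketch of the exponential mechanism of McSherry and Talwar, a standard generic tool for this setting (it illustrates the problem, and is not claimed to be the authors' algorithm):

```python
import math
import random

def exponential_mechanism(scores, eps, sensitivity, rng=random):
    """Pick an index with probability proportional to
    exp(eps * score / (2 * sensitivity)). This is eps-differentially
    private when changing one data point changes each score by at
    most `sensitivity` (McSherry-Talwar)."""
    mx = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(eps * (s - mx) / (2 * sensitivity))
               for s in scores]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(scores) - 1
```

For large eps the mechanism concentrates on the best-scoring model; for small eps it is close to uniform, which is the privacy/utility trade-off that stability arguments help to sharpen.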

Our algorithms for sparse linear regression require analyzing the stability properties of the popular LASSO estimator. We give sufficient conditions for the LASSO estimator to be robust to small changes in the data set, and show that these conditions hold with high probability under essentially the same assumptions that are used in the literature to analyze convergence of the LASSO.

Adaptive Seeding in Social Networks

The challenge of identifying individuals who can efficiently disseminate information through social networks has been heavily studied throughout the past decade. Despite considerable progress and an impressive arsenal of techniques developed for this problem, state-of-the-art algorithms often perform poorly in practice due to a combination of various restrictions on the input and the structure of social networks.

In this talk we will introduce a new framework which we call Adaptive Seeding. The framework is a stochastic two-stage (combinatorial) optimization model designed to leverage a key structural property of social networks. The main result we will discuss is a constant factor approximation algorithm for all standard models of information spreading in social networks. The result follows from new techniques and concepts that may be of independent interest for those curious about submodular maximization, stochastic optimization, and machine learning.

Based on joint work with Lior Seeman.

Approximate Near Neighbor Search: Beyond Locality Sensitive Hashing

The c-approximate near neighbor problem (c-ANN) is defined as follows: given a set P of n points in a d-dimensional space, build a data structure such that, given a query point q, if there exists a point in P within distance r from q, then it reports a point in P within distance cr from q. Here c is the approximation factor of the algorithm.
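For context, the classical baseline this work moves beyond is locality-sensitive hashing; here is a minimal pure-Python sketch of bit-sampling LSH for the Hamming cube, in the spirit of Indyk-Motwani (function names are mine, illustrative only):

```python
import random

def build_lsh(points, num_tables, bits_per_table, dim, seed=0):
    """Bit-sampling LSH for Hamming space: each table hashes a binary
    point to its values on a random subset of coordinates, so near
    points collide in some table with good probability."""
    rng = random.Random(seed)
    tables = []
    for _ in range(num_tables):
        coords = [rng.randrange(dim) for _ in range(bits_per_table)]
        buckets = {}
        for idx, p in enumerate(points):
            key = tuple(p[c] for c in coords)
            buckets.setdefault(key, []).append(idx)
        tables.append((coords, buckets))
    return tables

def query_lsh(tables, q):
    """Return indices of all database points colliding with q."""
    found = set()
    for coords, buckets in tables:
        key = tuple(q[c] for c in coords)
        found.update(buckets.get(key, []))
    return found
```

Tuning the number of tables and bits per table yields the familiar n^ρ query time; the point of the talk is a data structure whose exponent ρ beats what any such hashing scheme can achieve.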

We present a new data structure for c-ANN in the Euclidean space. For n points in R^d, our algorithm achieves O(dn^ρ) query time and O(n^{1+ρ} + nd) space, where ρ ≤ 7/(8c^2) + O(1/c^3) + o(1). This is the first improvement over the result by Andoni and Indyk (FOCS 2006) and the first data structure that bypasses a locality-sensitive hashing lower bound proved by O'Donnell, Wu and Zhou (ITCS 2011). By a standard reduction we obtain a data structure for the Hamming space and ℓ1 norm with ρ ≤ 7/(8c) + O(1/c^{3/2}) + o(1), which is the first improvement over the result of Indyk and Motwani (STOC 1998).

Joint work with Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn.

Approximate Constraint Satisfaction Requires Large LP Relaxations

We prove super-polynomial lower bounds on the size of linear programming relaxations for approximation versions of constraint satisfaction problems. We show that for these problems, polynomial-sized linear programs are exactly as powerful as programs arising from a constant number of rounds of the Sherali-Adams hierarchy.

In particular, any polynomial-sized linear program for Max Cut has an integrality gap of 1/2 and any such linear program for Max 3-Sat has an integrality gap of 7/8.

Let E be a d-dimensional linear subspace of R^n. A subspace embedding for E is a linear map that preserves the l_2 norms of all vectors in E up to a factor of 1+eps. We improve a recent result of [Clarkson-Woodruff, STOC'13], with a much simpler proof: we show that the Thorup-Zhang sketch (a random sign matrix with one non-zero per column) is a subspace embedding with good probability when the number of rows is O(d^2/eps^2). Once the theorem is formulated the right way, the proof is a simple second moment computation.

We then show our main result: when one has m rows and s non-zeroes per column (the sparse Johnson-Lindenstrauss matrices of [Kane-Nelson, SODA'12]), one can take m = O(d^{1.01}/eps^2), s = O(1/eps). These are the first subspace embeddings to have m = o(d^2) with s = o(d). Our bounds imply improved running times for several numerical linear algebra problems, including approximating leverage scores, least squares regression, l_p regression for p in [1, infty), and low-rank approximation.
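To make the Thorup-Zhang sketch concrete: it can be stored as one row index and one sign per column, so applying it costs O(1) per nonzero of the input vector (a minimal sketch, with function names of my choosing):

```python
import random

def thorup_zhang_sketch(m, n, seed=0):
    """The Thorup-Zhang (CountSketch) matrix S, m x n: each column has
    exactly one nonzero entry, a random sign in a random row. Stored
    implicitly as one row index and one sign per column."""
    rng = random.Random(seed)
    rows = [rng.randrange(m) for _ in range(n)]     # nonzero position
    signs = [rng.choice((-1, 1)) for _ in range(n)]  # nonzero value
    return rows, signs

def apply_sketch(rows, signs, m, x):
    """Compute S @ x in time proportional to the support of x."""
    y = [0.0] * m
    for j, xj in enumerate(x):
        if xj:
            y[rows[j]] += signs[j] * xj
    return y
```

Because each column has a single +/-1 entry, S @ x preserves the norm of each standard basis vector exactly, and the second-moment computation mentioned above shows it approximately preserves the norm of every vector in the subspace simultaneously once m = O(d^2/eps^2).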