Monthly Archive

Next up in the mini-course on data streams (first two lectures here and here) were lower bounds and communication complexity. The slides are here:

The outline was:

Basic Framework: If you have a small-space algorithm for stream problem , you can turn this into a low-communication protocol for a related communication problem . Hence, a lower bound on the communication required to solve implies a lower bound on the space required to solve . Using this framework, we first proved lower bounds for classic stream problems such as selection, frequency moments, and distinct elements via the communication complexity of indexing, disjointness, and Hamming approximation.

Information Statistics: So how do we prove communication lower bounds? One powerful method is to analyze the information that is revealed about a player’s input by the messages they send. We first demonstrated this approach via the simple problem of indexing (a neat pedagogic idea courtesy of Amit Chakrabarti) before outlining how the approach would extend to the disjointness problem.

Hamming Distance: Lastly, we presented a lower bound on the Hamming approximation problem using the ingenious but simple proof [Jayram et al.]

The goal of the project is to develop novel compressive sensing algorithms for geoscience problems in the area of hydrocarbon field exploration and surveillance. The appointment is initially for one-year, with the possibility of renewal for up to 3 years. The appointment should start either during the summer (the preferred option) or the fall of 2012.

Duties and Responsibilities:

Provide expertise on and contribute to the development of compressive sensing and sparse recovery algorithms for geoscience applications

Help MIT faculty in coordination of research projects, including periodic reporting and documentation as required by the program timeline

In the second lecture, we discussed algorithms for graph streams. Here we observe a sequence of edges and the goal is to estimate properties of the underlying graph without storing all the edges. The slides are here:

We covered:

Spanners: If you judiciously store only a subset of the edges, you can ensure that the distance between any two nodes in the subgraph is only a constant factor larger than the distance in the entire graph.

Sparsifiers: Here you also remember a subset of edges but also apply weights to these edges such that the capacity of every cut is preserved up to a small factor. We show how to do this incrementally.

Sketching: All this talk of remembering a subset of the edges is all very well but what if we’re dealing with a fully dynamic graph where edges are both added and deleted. Can we still compute interesting graph properties if we can’t be sure the edges we remember wouldn’t subsequently be deleted? You’ll have to see the slides to find out. Or see this SODA paper.

Lunch Break! If you want to fully recreate the experience of a French workshop, I suggest you now find yourself a plate of oysters and break for lunch. Lectures will resume shortly.

Just back from EPIT 2012 where I gave four lectures on data streams and a bit of communication complexity. Other lecturers discussed semi-definite programming, inapproximability, and quantum computing. Thanks to Iordanis Kerenidis, Claire Mathieu, and Frédéric Magniez for a great week.

Anyhow, I wanted to post my slides along with some editorial comments that may or may not be of interest. Here’s the first lecture:

The goal was to cover some of the basic algorithmic techniques such as different forms of sampling and random linear projections. The basic narrative was:

Sampling: Sure, you can uniformly sample from the stream and try to make some inferences from the sample. But there are better ways to use your limited memory.

Sketching: You can compute random linear projections of the data. Hash-based sketches like Count-Min and Count-Sketch are versatile and solve many problems including quantile estimation, range queries, and heavy hitters.

Sampling Again: A useful primitives enabled by sketching, is non-uniform sampling such as sampling where the probability of returning a stream element of type is proportional to the -th power of the frequency of .

I bookended the lecture with algorithms for frequency moments, first a suboptimal result via AMS sampling (from Alon et al. STOC 1996) and then a near-optimal result via sampling (from Andoni et al. FOCS 2011 and Jowhari et al. PODS 2011). I was tempted to use another example but that seemed churlish given the importance of frequency moments in the development of data streams theory. To keep things simple I assumed algorithms had an unlimited source of random bits. Other than that I was quite happy that I managed to present most of the details without drowning in a sea of definitions and notation.