Tuesday, January 06, 2009

Preparing for Class

It's that time of the year where it's time to get my next class ready. I could use your help.

For various odd reasons, in the coming semester I'll be teaching both my undergraduate Algorithms (and Data Structures) class [it's not parenthesized like that in the course catalog, but it probably should be], and my graduate "big data" class, dubbed Algorithms at the End of the Wire. For the graduate class, I cover a variety of topics -- last time, the main areas were search engine algorithms, compression, stream-based algorithms and data structures, and coding (network coding and otherwise), and essentially each class the students are supposed to come in having read a couple of papers. One of the goals of the class is to present algorithmic theory and practice, so I emphasize trying to find practical papers that really make use of underlying theory, or theory papers that really try to impact how a practical problem is thought about. Another goal of the class is to introduce students to the ways of research, by having them read both ancient classics (like, say, Kleinberg's HITS paper) and more recent (e.g. the last year or two) papers.

I'd be interested in any suggestions for good papers for students to read -- particularly recent papers. Don't feel tied to the topics above -- I should be trying to keep up with good papers bridging the algorithm/theory divide in all areas!

19 comments:

How about the seminal paper on DNA computation. It demonstrates that one can use DNA to handle incredible amounts of computation in parallel, and solve NP-complete problems. This isn't exactly "big data" but rather "big computation that creates big intermittent data."

The paper is available from Sciencehttp://www.sciencemag.org/cgi/reprint/266/5187/1021.pdf

...but if institutional access is lacking, you can also get a copy herehttp://www.cs.duke.edu/courses/cps296.4/spring04/papers/Adleman94.pdf

If this is something of interest to you, I can also recommend some of the more recent papers on the subject, including how the nature's process of self-assembly can be used to solve NP-complete problems. Some of these papers are more theory oriented, with theorems, proofs, etc.

I suggest the recent Kronecker Graph model for large networks, proposed by Leskovec et al. It includes theory and experimentation, and seems to suggest/generate good models for many types of large networks.

That's a good suggestion. I'm having the class read my power law survey article for "background", and although the class isn't specifically about network models, that would be a good paper to add if there's time (or make optional reading in any case).

The largest real-world problem solved on a DNA-computer, to my knowledge is a 20-variable 3-SAT problem [1]. Also, some people have taken the general-purpose approach with DNA computation and created some 2-input logic gates using DNA, though they've had trouble computing anything with more than just a couple connected gates [2].

Most DNA-computation researchers would agree that DNA computation will likely not scale to problems, say, larger than 20-variable 3-SAT instances because of inherent error rates that tend to add up to destroy information in large computations. However, DNA computation has given rise to the field of self-assembly, that allows control over error rates. In self-assembly using DNA, people have built (computed) fractals [3] and some really nice designs [4] (pictures in [4] are worth a look).

On the systems side, people (including myself) have used ideas from self-assembly to distribute computation onto Internet nodes to solve NP-complete problems. The largest real-world distribution to date is 186 nodes [5].

How about the Earlybird paper? The paper lays out the challenges in doing worm detection at wire speed, then it shows how to use the incremental update property of Rabin's polynomial fingerprinting to meet these challenges. The technology was commercialized by NetSift, then later acquired by Cisco, which is about as practical as it gets.

This in turn led to a paper I've always found pretty. The authors use reductions from set disjointness to show that certain kinds of intrusion detection problems cannot be solved by a streaming algorithm with constant space. Then they show how relaxations of those problems admit streaming algorithms while achieving many of the same intrusion detection goals.

In the past I've stayed away from worm detection type stuff -- since I've been focusing on Algorithms at the End of the Wire, and it seems to me that worm detection is pretty much on the wire. (Similarly, I haven't done algorithms related to routing, etc.) However, it's a bit of a blurry line, and the topic does intersect nicely with streaming algorithms and such. (And of course I am a huge fan of the work of George Varghese.) Maybe it's time to throw some of that in.

I've read a bit of the self-assembly literature, and the macroscale self-assembly coming out of robotics research seems more consistent with the spirit of the original post: big data and network theory connecting to practical work -- or, at least, connecting to work that appears to have direct practical applications soon.

In Klavins, Programmable Self Assembly, IEEE Control Systems Magazine 2007, the author builds self-assembling robots, and programs them with graph grammars he calls "distributed algorithms." He also explicitly compares his own impossibility results to the famous FLP impossibility result in distributed computing.

Klavins et al, Graph Grammars for Self-Assembling Robotic Systems, ICRA 2004, is a more technical, "mathy" treatment of the subject. It also connects the theory of self-assembly to problems similar to "geese flocking," which is significant in control systems theory.

Neither of those papers makes extravagant claims about the potential power of the field.

I'm sure you've already included or considered these, but they come to mind since Michael Nielsen has been blogging about them lately: PageRank and MapReduce. I don't know how interesting those are from a purely theoretical point of view, but you certainly don't get much bigger or more practical than that.

I've seen that you've used "On the Implementation of Minimum Redundancy Prefix Codes" in your Algorithms at The Ends of the Wire class, which is good. Other papers on similar topics include "An Attribute Grammar Based Framework For Machine-Dependent Computational Optimization Of Media Processing Algorithms," which is a slight extension (probably not important enough to include, but worth a skim); the 1976 "On the construction of Huffman trees," about the linear-time "two-queue" Huffman coding algorithm, which is often ignored since unsorted data has the same complexity using priority queue; and the 1993 "Is Huffman coding dead?" which empirically compares Huffman coding with "better" methods such as arithmetic coding. Other papers that might be worth a look have to do with quickly finding an optimal code for huge alphabets (e.g., "Efficient construction of minimum-redundancy codes for large alphabets" or "Distribution-Sensitive Construction of Minimum-Redundancy Prefix Codes") or are a part of the vast literature on alphabetic codes (Hu-Tucker, Hu-Kleitman-Tamaki, Hu-Morgenthaler, Hu-Larmore-Morgenthaler), which bridges coding and search.

I suppose that's probably way more into the Huffman realm than your students would normally go, but, hey, it's my area. The take-home messages of the "two-queue" and "dead?" papers are probably the best ones to impart: finding the Huffman code can be done in linear time with a simple algorithm, and Huffman coding is often preferable to newer algorithms.

PageRank (and HITS) are always my starting point for the class. Interesting, relevant, and with plenty of mathematical foundations behind them. (Gives me an excuse to make sure people know Markov chains.)

Maybe I'll have the read MapReduce. I don't think it's very theoretical in and of itself, although I guess if that's your paradigm it does affect how you think of algorithm design.

Anonymous -- State of the art for shortest paths doesn't quite fit in to any of the themes, but is certainly a potential worthwhile subject. Any recommendations for the current state of the art in papers? (I could just look up whatever Andrew Goldberg has done lately, but perhaps someone has a more concrete suggestion.)

I don't want to spend too much time on Huffman coding -- it's just one compression scheme I have them learn -- but I do put in the implementation paper to make sure they know:

1) There are faster ways to implement Huffman coding than the "walk down a binary tree" approach they probably learned. (I wish undergraduate algorithms textbooks would make clear that while that's a great way to THINK about Huffman coding, it's a terrible way to IMPLEMENT Huffman coding.)2) Huffman coding is far from dead. :)