Tuesday, July 31, 2007

The state of journal reviewing in CS and EE journals is generally pretty dismal. It often takes an unforgivably long time to get reviews back, and the reviews often offer less in the way of feedback and constructive criticism than one might hope.

Here I'll focus on aspects related to timing. (I'm unconvinced that reviews that take longer to get back are any better in terms of feedback. Often, just the opposite -- the reviewer realizes the paper has been on their desk too long and just writes a quick paragraph to be done with it.) Let me suggest a set of basic pledges -- rules that everyone should follow. Feel free to suggest your own additions....

When asked to review a paper, I will respond as quickly as possible, even if the answer is no.

I will review at least 1 journal paper for every journal paper I submit. (This does not mean "I will give my graduate students 1 paper to review for every journal paper I submit." Though graduate students should also do reviewing...)

I will skim through a paper I plan to review within one month of receiving it, so that if it is a clear reject for any reason, or there is an important question that needs to be raised, I can promptly write a short note to the editor (and thereby hopefully avoid a more detailed review).

I pledge to review papers within six months of receiving them. I acknowledge that six months is an upper bound, not a target deadline.

I accept that there is reviewing karma, and I should not be surprised if editors pass my papers on to slow reviewers if I am a slow reviewer.

Monday, July 30, 2007

This post is dedicated to a researcher most CS people might not know about (although most information theorists probably do), David MacKay. I encourage you to go take a look at his web page.

David and I got to know of each other's work because we were both working on Low Density Parity Check codes around the same time. Since then, he's written one of the great textbooks on information theory. He's one of the key people behind the Dasher project, which provides text entry methods for the handicapped based on information theoretic principles. (Really. It's amazing what a well-thought out system based on arithmetic coding and a clever UI can do.)

And now he's finishing up a book on sustainable energy. The book is for laypeople, showing how to think in rough quantitative terms about how much energy is available, how much is being used, and what it all means. No matter what you think about the energy crisis, the book serves as an interesting example of how to try to encourage people to think quantitatively about policy-related issues. He's made a draft freely available. I encourage you to have a look.

Thursday, July 26, 2007

Over at the geomblog, Suresh and Piotr have taken my initial data-gathering on citation counts for the theory conferences to the extreme. They point out issues in gathering data (e.g., how does one handle summing conference/journal versions; sometimes the journal version has a different title!), consider the impact of short/long SODA papers, and call for help checking their results. Excellent work!

The implications, whatever they are, remain a point of future discussion.

Wednesday, July 25, 2007

In my last post, I described a problem that's bothered me for some time: why do simple hash functions work so well -- that is, as well as the analysis for truly random hash functions? The natural answer is that there is some sort of randomness in the data interacting with the randomness in how the hash function is chosen that combines to give these results. In our paper, Salil Vadhan and I try to give a solid foundation for this approach.

To prove things about this interaction, we need models. For hash functions, we naturally look primarily at 2-universal (or pairwise independent, or sometimes 4-wise independent) hash functions. Such families are well-studied in theory and have practical implementations.

To model the data, we assume that it comes from a block source. This means that each new data item, conditioned on the values of the previous items, still has some entropy to it. In other words, each new piece of data is at least somewhat unpredictable. It can depend on the previous data in essentially arbitrary ways; for example, it might be possible to narrow the possibilities for the next data item to a small range based on the past, or the distribution of what the next item will be could be highly skewed. But as long as there is still enough unpredictability, we are fine. This seems like a perfectly reasonable model for network or streaming data to me, although as far as I know the block source model has not been used in this context previously.
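As a toy illustration of the block-source model (entirely my own sketch, not a construction from the paper), here is one way to generate such data: each item depends on the previous one in some arbitrary way, yet conditioned on the past it still has a few bits of entropy.

```python
import random

def block_source(n, entropy_bits=4, universe_bits=32):
    """Toy block source: each item depends on the previous item in an
    arbitrary (here, multiplicative) way, but conditioned on the past it
    is uniform over 2**entropy_bits candidates -- so every item carries
    at least entropy_bits bits of entropy, as the model requires."""
    mask = (1 << universe_bits) - 1
    items, prev = [], 0
    for _ in range(n):
        base = (prev * 2654435761) & mask  # arbitrary dependence on history
        item = (base + random.randrange(1 << entropy_bits)) & mask
        items.append(item)
        prev = item
    return items

random.seed(0)
data = block_source(1000)
print(len(data))
```

The particular dependence on the previous item is an arbitrary placeholder; the model only cares that the conditional entropy per item stays bounded below.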

Once we have the model in place, in some sense, we're done. The Leftover Hash Lemma of Impagliazzo, Levin, and Luby essentially says that, if there's enough entropy in your block stream, you get a near-uniform output over the entire stream when using even just a 2-universal hash function. Our paper improves on and specializes this flavor of result for the setting we're looking at -- hashing items into tables -- and examines the implications for many applications.

Finally, I feel I have a clear and clean justification for why truly random analysis is suitable even if only simple hash functions will be used. I expect there's still more that one can do to improve this framework, and it raises a number of interesting open questions that lie at the intersection of algorithms, information theory, and networking. For more details, I refer you to the paper.

I'll leave off with an interesting open question. In the "power of two choices" (or Balanced Allocations) scenario, a sequence of n items is hashed into a table with n bins. Each item is hashed twice, with 2 different hash functions (more generally, hashed d times, with d different hash functions), with each hash giving a possible location for the item. The item is actually placed in the least loaded of the 2 choices (breaking ties arbitrarily). Now on a lookup one has to check 2 places for the item, but the load distribution is much more even. Specifically, the maximum number of items in a bucket with d choices is only log log n / log d + O(1), instead of the (1 + o(1)) log n / log log n one gets from one choice. (You might check out this somewhat outdated survey if you're not familiar with the problem.)
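The two-choice effect is easy to see in simulation. Here is a minimal sketch (my own, using truly random choices rather than any particular hash family) comparing the maximum load with one and two choices:

```python
import random

def max_load(n, d, seed=None):
    """Throw n items into n bins; each item picks d bins uniformly at
    random and goes into the currently least-loaded of them.
    Returns the maximum load over all bins."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        choices = [rng.randrange(n) for _ in range(d)]
        best = min(choices, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

n = 100000
print("one choice :", max_load(n, 1, seed=1))  # grows like log n / log log n
print("two choices:", max_load(n, 2, seed=1))  # grows like log log n / log 2
```

Even for modest n, the one-choice maximum load is noticeably larger than the two-choice maximum load, matching the log n / log log n versus log log n gap.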

As far as I know, we don't have a worst-case analysis (that is, unlike our paper, an analysis with no assumptions on the input data) for, say, pairwise-independent or even k-wise independent (for some constant k) hash functions, with some constant number of hash functions d. This has been open for some time.

Monday, July 23, 2007

On my publications page there is a recently finished new paper I'm quite excited about. Part of that excitement is because it's (finally!) my first paper with my remarkable colleague Salil Vadhan. The rest is because I think it answers a question that's been bothering me for nearly a decade.

A fair bit of my research has been on hashing, and often I end up doing simulations or experiments to check or demonstrate my results. For many analyses of hashing processes -- including Bloom filters and the power of two choices -- we (or at least I) assume the hash function is truly random; that is, each element is independently mapped to a uniform location by the hash function. This greatly helps in the analysis. Such an assumption is downright painful to strict theorists (like Salil, who has questioned me about this for years), who rightly point out that the hash functions being used are not anything close to truly random, and in fact truly random functions would be far too expensive in terms of space to even write down in practice.

Of course theorists have developed ways to deal with this -- with the seminal work being Carter and Wegman's introduction of universal hashing. The key idea is that you choose a hash function at random from a small family of hash functions that are easy to compute and require little space to express, and this choice guarantees some random property, even if it is not as strong as choosing a truly random hash function. For example, the basic universal hash function families guarantee that for any elements x and y in the universe, if you choose a hash function uniformly from the family, then the probability that x and y land in the same bucket is 1/(# of buckets). That is, the collision probability of any specific pair is what it should be. Once you consider 3 elements at the same time, however, the probability that all 3 land in the same bucket need not be (1/(# of buckets))^2, as it would be for a truly random hash function.
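To make this concrete, here is a minimal sketch of the standard Carter-Wegman construction h(x) = ((a*x + b) mod p) mod m (my own illustration; the choice of prime and the test keys are arbitrary), along with a quick empirical check of the pairwise collision probability:

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any key we hash

def random_universal_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m from the Carter-Wegman family;
    for any fixed x != y, Pr[h(x) == h(y)] is roughly 1/m."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

rng = random.Random(0)
m, trials = 100, 20000
collisions = 0
for _ in range(trials):
    h = random_universal_hash(m, rng)  # fresh random function each trial
    if h(12345) == h(67890):
        collisions += 1
print(collisions / trials)  # empirically close to 1/m = 0.01
```

Note the family guarantees nothing about three keys at once; that is exactly the gap between 2-universal and truly random.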

The most used variation is probably k-wise independent hash families, which have the property that when a hash function is chosen from that family, any collection of k elements are independently and uniformly distributed.

A negative with using these weakened hash families is that usually, naturally, you get weaker results from the analysis than you would using truly random hash functions. A fine example of this is the recent work by Pagh, Pagh, and Ruzic on Linear Probing with Constant Independence. With truly random hash functions, linear probing takes expected search time O(1/(1-alpha)), where alpha is the load on the hash table. They show that using pairwise (2-wise) independence linear probing can take time O(log n) on average -- it no longer depends on the load. Using 5-wise independence they get expected time polynomial in 1/(1-alpha). That's a great result, although there's still a gap from truly random.
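Here is the kind of small linear probing experiment that's easy to run (my own sketch; it simulates a truly random hash function by drawing a fresh uniform slot for each inserted item, so it illustrates the truly random baseline rather than any of the weaker families):

```python
import random

def lp_insert(table, pos):
    """Linear probing insert: walk forward from pos to the first empty
    slot; return the number of probes used (displacement + 1)."""
    m, probes = len(table), 1
    while table[pos] is not None:
        pos = (pos + 1) % m
        probes += 1
    table[pos] = True
    return probes

rng = random.Random(4)
m, alpha = 1 << 14, 0.9
table = [None] * m
n = int(alpha * m)  # fill the table to load alpha
total = sum(lp_insert(table, rng.randrange(m)) for _ in range(n))
# Averaged over the build, insert cost comes out near (1 + 1/(1-alpha))/2.
print(total / n)
</```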

The real problem, though, is the following: pretty much whenever you do an experiment on real data, even if you use a weak hash function (like a pairwise independent hash function, or even just some bits from MD5), you get results that match the truly random analysis. I've found this time and time again in my experiments, and it's been noted by others for decades. That's part of the reason why when I do analysis, I don't mind doing the truly random analysis. However, it leaves a nagging question. What's going on?

The obvious answer is that there is also randomness in the data. For example, if your data items were chosen uniformly from the universe, then all you would need from your hash function is that it partitioned the universe (roughly) equally into buckets. Such a broad assumption allows us to return to the truly random analysis. Unfortunately, it's also amazingly unrealistic. Data is more complicated than that. If theorists don't like the idea of assuming truly random hash functions, you can imagine practitioners don't like the idea of assuming truly random data!

For years I've wanted an explanation, especially when I get a paper review back that questions the truly random analysis on these grounds. Now I think I have one. Using the framework of randomness extraction, we can analyze the "combination" of the randomness from choosing the hash function and the randomness in the data. It turns out that even with just pairwise independent hash functions, you only need a small amount of randomness in your data, and the results will appear uniform. The resulting general framework applies to all of the above applications -- Bloom filters, multiple-choice hashing, linear probing -- and many others.

Friday, July 20, 2007

In the world of extremes, one could imagine tenure decisions being based solely on letters without really looking at citation data. A motivation for this approach would be that letters give you a richer picture of how a person is viewed by their peers in the research community, what their work has been about, and what the potential impact of this work will be in the future. On the other hand, the system can be gamed, by making sure positively inclined people get chosen to write letters. (The person up for tenure might not exactly be able to game the system themselves, but certainly a friendly department chair could...) I have to admit, the letter-based approach feels rather "old-boy network" to me, which leaves me a bit uncomfortable.

As another extreme, one could imagine tenure decisions being based solely on numerical data gathered from Google scholar or other sources. A motivation for this approach would be that the numbers supposedly give you an unbiased picture of the impact of a researcher's work, allowing comparisons with other comparable researchers. The data could also be used to gauge the derivative -- how one's work is changing and growing in impact over time. On the other hand, the system can be gamed, by developing groups who purposefully cite each other whenever possible or by working on projects that give better numbers without really giving high impact. I have to admit, I like the numbers, and in some respects I trust them more than letters, but I still don't entirely trust them, either.

My limited experience with promotion decisions is that it makes sense to gather both types of data and make sure they are consistent. When they are not consistent, then the departmental arguments can begin. When asked to write letters, I know I look at the citation data, and would include it in the letter if I felt it was appropriate.

Tuesday, July 17, 2007

With the SODA deadline just past, and FOCS coming up nearby at Brown, I got to thinking about bringing up the old SODA vs. FOCS/STOC debate. For those new to the subject, FOCS (Symposium on Foundations of Computer Science) and STOC (Symposium on the Theory of Computing) are generally considered the top two conferences each year in theoretical computer science. SODA (Symposium on Discrete Algorithms) is considered by many (including myself) to be better suited for algorithmic results, and many have argued that the big two theory conferences should really be the big three.

Now, being a quantitative sort of person, I decided the best way to bring something new to the discussion would be to gather some data. So I picked a non-random year -- 2000 -- and decided to look at the citation counts on Google Scholar for all of the papers from the conferences and compare. I figured 2000 was far enough back that we'd get meaningful citation counts for comparison. Yes, I know, citation counts are not the end-all and be-all and all-that, but that's an argument for a different day (or for the comment section). In aggregate, they must tell us something important.

You can imagine my surprise when I looked at the numbers and found SODA completely dominated. Here's a little chart:

         Papers   Median Cites   Max Cites   Total Cites
FOCS       66          38           318         3551
STOC       85          21           459         3393
SODA      122          14           187         2578

FOCS seems to have had an absurdly good year, but the data speaks volumes. At least as of 2000, the best results went overwhelmingly to FOCS and STOC. Overall, I found the results of this experiment rather disappointing for SODA.

It would be nice to check to see what the current trend is, but it was annoyingly time-consuming to do this by hand, and I didn't feel like writing a script. (There seem to be many exceptional cases; if you try to do an exact match on the title, you'll often not find the paper because the title has changed somewhere along the line or a typo or something. But with a regular query you often have to do some searching to find the paper. Perhaps someone more talented will be inspired to write a script?)

I learned something interesting and useful by gathering this data. As a community, should we be paying more attention to numbers like these, so we can make our conferences better? For example, I can't recall anyone ever trying to systematically answer whether a PC made good decisions or not after the fact, but apparently we now have the data available to try to answer such questions in an empirical fashion. (We might check if rejected papers were later accepted and highly cited, for example.) What possible lessons can we learn using these citation tools?

Monday, July 16, 2007

A (binary) Poisson-repeat channel works as follows: the sender sends a sequence of n bits, and the channel independently replaces each bit with a number of copies (or repeats) that has a discrete Poisson distribution with mean 1. That is, for each bit, the probability that it is replaced by k copies is e^{-1}/k!. Here k can be 0, in which case the bit is deleted. The receiver gets the resulting string after replacements. I'd like to find an efficient code for this channel that has a non-trivial constant rate. Any rate over 0.01 , for example, would be just fine. Of course, I'd like the bound on the rate to be provable, rather than just experimental, and I really would like the code to be practical, not just polynomial-time encodable/decodable.
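The channel itself is easy to simulate, which may help in experimenting with candidate codes. Here is a sketch (my own; the Poisson sampling uses Knuth's multiplication method):

```python
import math
import random

def poisson1(rng):
    """Sample a Poisson variate with mean 1 via Knuth's multiplication
    method: multiply uniforms until the product drops below e^{-1}."""
    L = math.exp(-1.0)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def poisson_repeat_channel(bits, rng):
    """Replace each bit with an independent Poisson(1) number of copies;
    zero copies means the bit is deleted."""
    out = []
    for b in bits:
        out.extend([b] * poisson1(rng))
    return out

rng = random.Random(5)
sent = [rng.randrange(2) for _ in range(10000)]
received = poisson_repeat_channel(sent, rng)
print(len(received) / len(sent))  # close to 1: one copy per bit on average
```

Since the mean number of copies is 1, the received string has about the same length as the sent string, but the receiver has no markers telling it which runs were stretched or deleted -- that loss of synchronization is what makes coding here hard.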

What's the motivation? It turns out that the Poisson-repeat channel is closely tied to the deletion channel, where each bit is independently deleted with probability p. A code for the Poisson-repeat channel would immediately yield a good code for the deletion channel for values of p close to 1; we showed the reduction in this paper.

Codes for insertion/deletion channels are hard; very little is known. Because a code for this specific channel would yield codes for a larger family of channels, I think it's an appropriate and intriguing target.

Friday, July 13, 2007

While at ISIT, I missed a FIND (Future Internet Directions) meeting. (You may have to register -- give your name -- to get access.) FIND is, as it says on their overview page, "a major new long-term initiative of the NSF NeTS research program. FIND invites the research community to consider what the requirements should be for a global network of 15 years from now, and how we could build such a network if we are not constrained by the current Internet -- if we could design it from scratch."

The page on the meeting is interesting in itself if you want to see what's going on with the FIND program (and maybe get ideas for future proposals). What is also interesting to me is this requirement of the program that PIs are supposed to get together at meetings roughly three times a year as part of the grant. This has not historically been how the NSF operates, though I hear it's common for DoD funding. I doubt the approach will necessarily spread to theory; it seems to me that it makes sense for FIND because it's such a highly targeted initiative, while theory grants cover a much more diffuse range of activities. But it may be that in the future we can expect more money being funneled into similar targeted initiatives -- with similar requirements.

Tuesday, July 10, 2007

Continuing my last question, what other support for graduate students (on the Ph.D. track) do you have? The main two I can think of are the following:

Does the department provide funding for incoming graduate students for some period? For example, do all first years get funded by the department automatically? Are there requirements for this funding (such as they have to teach one semester)?

What department support do students get if they take on a Teaching Assistant role during the semester? Does it cover tuition/stipend for the semester? If not, what fraction?

And there's still time to answer the previous question -- what's the annual cost of a graduate student (tuition + stipend + overhead)?

Sunday, July 08, 2007

What's a graduate student cost these days? That is, as a line item on an NSF grant, for a 12-month graduate student, tuition+stipend+overhead, what's a graduate student cost? (Please feel free to answer anonymously, as I'm sure this is the sort of information people are interested in but don't want to reveal by institution. And use whatever additional qualifiers are necessary. For consistency, let's say a 3rd year student.)

This question was motivated by a change made this last week in Harvard's policy regarding how graduate students were charged to grants. Previously, Harvard would pay the student the stipend and tuition money directly, and then pull tuition back out of the paycheck. With this setup, Harvard charged overhead on tuition, since it was treated the same as the student stipend. Apparently, a new federal guideline just came out prohibiting this sort of thing; tuition will now be pulled out of the grant directly, free of overhead.

I was quite pleased to hear this. I had been complaining about this practice for a number of years; as far as I know, overhead on tuition is not standard for most schools. It made Harvard graduate students non-trivially more expensive. This change makes graduate students funded by NSF grants much cheaper for me. Moreover, it also just seems to me to be the right way to do it.

Now that the change is happening, though, I'm somewhat concerned. All of a sudden the School of Engineering and Applied Sciences is probably out millions in overhead that it had already budgeted. Somewhere down the line, I'm sure I'll be paying for that out of my grants somehow.

Saturday, July 07, 2007

Can someone tell the SODA-powers-that-be that it's really annoying to have a paper deadline on July 5th or 6th? (That's what it's been the last four years...) July 4th is a major US holiday, which if done properly involves a late night with fireworks.

I realize it's pointless to complain, but that's one reason why I started a blog -- so I could complain publicly. And I realize there are all sorts of arguments that it shouldn't matter what day they pick. But, of course, in practice, it does.

You might wonder why this is something that I (and many others) are pushing so hard. I think it's because we've seen the writing on the wall. When you look at the theory funding available from the Computing and Communication Foundations Division (and the Theoretical Foundations Cluster within), it's pretty small. This is odd and frustrating, as theory seems to be thriving quite healthily, and to continue building on our successes, we need adequate funding.

The good news is that theoretical research often does well with these cross-cutting programs. When the SIGACT committee looked at the numbers, we found that back in the days of ITR, theorists were getting substantially more funding (something like 2 times as much) from ITR grants as from Theoretical Foundations grants! We've also had a lot of success with Cybertrust. These other pools of money are funding a lot of theory research, because we've been diligent in applying.

Long-term, this is something that we as a community need to try to fix. The baseline funding for theory needs improvement, and we have to keep letting the NSF know why. But for the foreseeable future, we have to take advantage of these kinds of funding opportunities as they arise. And the best way to take advantage of the CDI opportunity is for lots of theorists to send in lots of great grants.

1. Algorithms as a lens on the sciences: Beginning in the mid-1990s, several individual theorists became concerned about the field, where it was going, and how it was funded. There was considerable dissension, with the claims that the field was too inward-looking and too hung up on mathematical elegance as opposed to relevance. Events overtook discussion as theory became highly relevant to web-based applications, protocols, and other areas. Simultaneously, theory funding was dwindling so SIGACT set up a committee to look at these issues. It concluded that new directions that connected theory to other intellectually challenging areas would take funding pressure off the core (since folks have more sources to go to, leaving the core for folks who were uninterested in application areas). A workshop series on network computation led to NSF’s SING program, but SING had no money of its own and actually resulted in a decrease in theory funding. The SIGACT committee went back to work and developed the idea of algorithms as a lens on science. This idea went forward as a White Paper to appropriate folks within NSF. Eventually, after working its way through the internal NSF budget process, it resulted in a large new FY08 request—Cyber-Enabled Discovery and Innovation (a foundation-wide program beginning at $52 million in FY08 and intended to grow to $250 million in FY12).

This is an example of an idea begun by a few individuals, nurtured within a professional guild, supported by a federal agency, and turned into a major funded program—but not without a few bumps.

As part of that SIGACT committee, I think this is a good rough description of what we've been doing. Now that there's money going into the CDI program, theory people need to be sure to apply to get their share. Start thinking now, and there will be more on CDI in this blog coming up.

Also in the May issue is the Taulbee Survey. As far as I know, the primary use of the survey is for faculty to look up how much they should be paid. But there's other interesting data too!

The total Ph.D. production between July 2005 and June 2006 of 1,499 represents a phenomenal 26% increase.

And another nugget:

Actual Bachelor’s degree production in departments reporting this year was only 3.1% lower than the projection from last year’s reporting departments. From this year’s estimates, it would appear that another 16% decline is looming. If this holds true, it would represent a drop of more than 40% over a three-year period.

The news is much better when looking at new Bachelor’s degree students. For the first time in four years, the number of new undergraduate majors is slightly higher than the corresponding number last year.

I'm hoping they're right and that the number of new majors will start going up again...