Wednesday, October 28, 2009

I'm curious if various readers out there would be willing to offer their ranking of networking conferences. The issue has come up in some conversations recently, and I was wondering what other possibly more informed sources think.

Besides your ranking, of course, I'm interested in the reasons behind the rankings. Is it just acceptance rate? Do certain networking conferences specialize in subareas where they are remarkably strong? How does/did such a ranking get built and maintained; does it ever get lost?

Friday, October 23, 2009

I'm thrilled to announce that my colleague Gu-Yeon Wei, in EE here at Harvard, received tenure.

I feel this is worth a mention because:

1) Strangely, people sometimes seem to forget we have EE here are Harvard. They shouldn't. (Gu, for example, is one of the leaders of the RoboBee project, and with David Brooks in CS, has been writing a slew of papers that spans the circuits and architecture divide.)2) Strangely, people sometimes seem to have kept the notion that Harvard EE+CS do not tenure their junior people. That's an outdated impression. Like most other universities, we aim to hire people who will eventually earn tenure.

That's the major happening of the day. The minor happening, from yesterday, was that I visited Yale and gave my talk on Open Questions in Cuckoo Hashing. I had a great day, but will pass on just one great insight : if Joan Feigenbaum recommends a restaurant, it's worth listening to. (Dinner after the talk was a true treat.)

Tuesday, October 20, 2009

One interesting aspect of our WSDM paper is that we have multiple references from the 1930's and 40's. It turns out our problem is related to some of the problems from the early (earliest?) days of experiment design.

This was actually a stumbling block for us for a while. In one sense, we had a very positive starting point, in that I knew there was something out there related to our problem. As a youth (literally, back in high school) I had seen some stuff on the theory of combinatorial design, and while it was too abstract for me to find a direct connection, I knew there must be some stuff out there we better be aware of. We eventually found what we really needed by random searching of keywords; some variation of "experiment design" led us to the Design of experiments Wikipedia page, which used Hotelling's problem as an example. Once we had this (our magic keyword!), we could forward track to other relevant references and information.

In many cases, the problem is not only that you don't know what you should be referencing -- you may not even know you should be referencing something at all. This happens a lot in problems at the boundaries -- econ/CS problems, for example. Most notably, this was a big problem in the early work on power laws, as I pointed out in my survey on power laws -- that's the most egregious example I know, where a lot was "re-invented" without people realizing it for quite some time.

I still get the feeling that, despite the great tools we now have available to us, people don't do enough searching for related work. I can understand why. First, it's not easy. If you don't know what the right keywords are, you have to use trial and error (possibly helped by asking others who might have a better idea). For multiple papers I have written, I have spent multiple hours typing semi-random things into Google and Google scholar, looking around for related work. (In the old days, as a graduate student, I actually pulled out lots of physical books from the library shelves on anything that seemed related -- I like this new system better.) It can seem like a waste of time -- but I really, really encourage authors to do this before submitting a paper. Second, in many cases there's a negative payoff. Who wants to find out (some of) what they did was already done? (In fact, I think everyone who expects to have a long research career would actually prefer to find this out as soon as possible -- but it still can be hard to actively seek such news out.)

On the positive side, I can say that good things can come out of it. Reading all the original work and related problems really helped us (or at least me) better understand the right framework for our variation of the problem. It also, I think, can help get your paper accepted. I feel we tried hard to clearly explain the historical context of our problem -- I think it makes our paper richer than it would be without it, exposing some interesting connections -- and I think it paid off; one reviewer specifically mentioned our strong discussion of related work.

Monday, October 19, 2009

I'm happy to announce our paper "Adaptive Weighing Designs for Keyword Value Computation" -- by me, John Byers, and Georgios Zervas -- was accepted to WSDM 2010 -- The Third ACM Int'l Conference on Web Search and Data Mining. (The submission version is available as a technical report here.) The abstract is at the bottom the post for those who are interested.

The paper's acceptance gives me an excuse to discuss some issues on paper writing, research, conferences, and so on, which I'll do this week. To start, I found it interesting that WSDM had 290 submissions, a 70% increase in submissions over 2009. Apparently, Web Search and Data Mining is a healthy research area in terms of the quantity of papers and researchers. They accepted 45, or just about 15.5%. This turns out not to be too far off from the first two years, where acceptance rates were also in the 16-17% range. I'm glad I didn't know that ahead of time, or I might not have submitted!

I'm curious -- why would a new conference, trying to establish itself and gain a viable, long-term group of researchers who will attend, limit itself to such small acceptance rates when starting out? Apparently they thought the key to success would be a high quality bar, but I find the low acceptance rate quite surprising. I can imagine that the rate is low because there are a number of very poor submissions -- even the very top conferences, I've found, get a non-trivial percentage of junk submitted, and although I have no inside knowledge I could see how a conference with the words "International" and "Web" in the title might receive a number of obviously subpar submissions. But even if I assume that a third of the submissions were immediate rejects, the acceptance rate on the remaining papers is a not particularly large 23.3%.

Attributing a dollar value to a keyword is an essential part of running any profitable search engine advertising campaign. When an advertiser has complete control over the interaction with and monetization of each user arriving on a given keyword, the value of that term can be accurately tracked. However, in many instances, the advertiser may monetize arrivals indirectly through one or more third parties. In such cases, it is typical for the third party to provide only coarse-grained reporting: rather than report each monetization event, users are aggregated into larger channels and the third party reports aggregate information such as total daily revenue for each channel. Examples of third parties that use channels include Amazon and Google AdSense.

In such scenarios, the number of channels is generally much smaller than the number of keywords whose value per click (VPC) we wish to learn. However, the advertiser has flexibility as to how to assign keywords to channels over time. We introduce the channelization problem: how do we adaptively assign keywords to channels over the course of multiple days to quickly obtain accurate VPC estimates of all keywords? We relate this problem to classical results in weighing design, devise new adaptive algorithms for this problem, and quantify the performance of these algorithms experimentally. Our results demonstrate that adaptive weighing designs that exploit statistics of term frequency, variability in VPCs across keywords, and flexible channel assignments over time provide the best estimators of keyword VPCs.

Sunday, October 18, 2009

For those who are interested in such things, Harvard's latest financial report appears to be available. Rumors have it that the report was made (widely) public in part because of a Boston Globe article, showing that Harvard was doing some unwise things with its "cash" accounts. (Our own Harry Lewis gets a quote.) That on top of previously reported news (in the financial report) that Harvard had to pay about $500 million to get out of some bad hedges on interest rates. I'm sure many will get a kick out of it.

I hope, at least, that people will make use of the information properly when discussing things Harvard. A couple of weeks ago I pointed to a truly fact-impaired Kevin Carey opinion in the Chronicle of Higher Education (that I still can't bring myself to link to). One mistake he made, which is common, is to refer to Harvard's $37 billion endowment (now about $26 billion) as though it was all for undergraduate education. In fact, the Faculty of Arts and Sciences (FAS. the "home" for undergraduates) "owns" about $11 billion of the $26 billion; the med school, business school, law school, and various other sub-organizations within Harvard all have their pieces. Also, of this $11 billion, only a fraction is in money that can be used for "general purposes"; much of it is tied to specific purposes (chairs for faculty, financial aid, libraries, etc.). Anyhow, when someone comes along and spouts off about how Harvard should spend its money, I'll have a new pointer for where to start an informed discussion.

Also, on Friday Cherry Murray, Dean of Harvard's School of Engineering and Applied Sciences, had an all-hands meeting, where naturally the topic of SEAS finances was part of what was addressed. (The budget for SEAS is independent of FAS, approximately.) While we're not in great shape, we appear to be somewhat better off, as less of our budget comes from the endowment distribution, and we've had a bit of a buildup in our reserves the last few years that will help us through the next few. This should mean that SEAS will be (slowly) hiring again soon; I'm hoping that computer science and/or applied mathematics will be areas where we'll be advertising for new faculty.

Friday, October 16, 2009

I'd like to welcome myself to the Blogroll for the Communications of the ACM! My colleague Greg Morrisett suggested I get my blog into the CACM Blogroll, so a few e-mail messages later, and apparently I'm in. Just goes to show, they must have a pretty low bar. Actually, since I'm a regular reader of most of the blogs on their Blogroll, it's a pleasure to join the list. It's not clear how this will affect the tone and style of my blog posts -- probably not at all -- but perhaps it will encourage me to branch out into yet more topics of more general interest.

While poking around the CACM I was pleased to see some press on the Harvard RoboBee project, one of the 3 NSF Expeditions awards from this year. While I'm not on the RoboBee team, it's already getting some of my attention; I'm co-advising a senior who wants to do her undergrad thesis on some algorithmic problems related to RoboBees. I'm imagining I'll be drawn into other related sub-projects, as there seems to be lots of possible algorithms questions one might want to tackle in developing artificial insects. Perhaps that's the power of these large-scale, Expeditions style projects: by setting seemingly very distant, almost impossible goals, they push people to think and do new things.

I've said it before but it bears repeating: it's amazing how CACM has changed to become, in my mind, a really relevant resource for computer science and computer scientists. And I'm not just saying that to welcome my new blog overlords.

Wednesday, October 14, 2009

I spent an hour or more today perusing the book Concentration of Measure for the Analysis of Randomized Algorithms, by Devdatt Dubhashi and Alessandro Panconesi (that Alessandro was kind enough to send me). It's a very nice book covering a variety of tail bound arguments and applications, with a number of exercises. I'd recommend it for use in a graduate-level seminar, or as a general reference for people working in probabilistic analysis of algorithms. Theory graduate students should have a copy nearby if not on their shelf.

It treats very well the standard approaches -- Chernoff-Hoeffding bounds, martingales, isoperimetric inequalities, and so on, but I think what particularly stands out in this book's treatment is the consideration of what to do when the random variables are not quite so nice. Tail bounds tends to be "easy" to apply when all the random variables are independent, or when your martingale satisfies a nice simple Lipschitz condition; it's when the variables are dependent or there's some special side case that wrecks your otherwise pleasant martingale that you need to pull out some heavier hammers. This book makes those hammers seem not quite so heavy. Chapter 3 is all about Chernoff-Hoeffding bounds in dependent settings; another chapter has a subsection on martingale bounds for handling rare bad events. I've had these things come up in the past, so it will be nice now to have a compact resource to call on with the appropriate bounds at hand.

I don't think this book is for "beginners"; I'd recommend, for instance, my book, which covers all the basic Chernoff-Hoeffding bounds and martingale bounds for people who just need the basics. But if you really need something with a little more power in your analysis, look here. While it's a bit steep at $56.00 at Amazon for a book that comes in under 200 pages (including bibliography and index), I'm sure it will be showing up in the references of some of my papers down the line.

Tuesday, October 13, 2009

One conference I've never had a paper in -- though I'd like to someday -- is SOSP, the Symposium on Operating Systems Principles, one of the flagship conferences in systems. A friend pinged me from there today, so I went to look at the program. Besides learning that Microsoft Research is dominating big systems work, I found a paper co-authored by a number of well known theorists:

Friday, October 09, 2009

This week I got my batches of papers to review for NSDI and LATIN. If I'm quiet for a while, I'm busy reading (and writing reviews).

Needless to say, I didn't quite realize I'd get the papers for the two within a couple of days of each other. But it actually seems fine. They're just so different from each other, it's almost refreshing to go from one type of paper to the other.

This will also give me a chance to experience HotCRP and EasyChair "head-to-head". It looks like EasyChair has improved since the last time I used it but HotCRP still seems easier to use so far.

Wednesday, October 07, 2009

Thanks to the Harvard Extension School, the lectures for several more Harvard courses have been put online. My understanding is that these are classes taught at Harvard that are also given through the extension school. I suspect my course may end up here too next time it is offered.

The list of courses available right now includes:

Concepts of the Hero in Greek Civilization, by Gregory Nagy and Kevin McGrathBits, by Harry LewisIntensive Introduction to Computer Science Using C, PHP, and JavaScript, by David J. MalanShakespeare After All: The Later Plays, by Marjorie GarberOrganizational Change Management for Sustainability, by John Spengler and Leith SharpChina : Traditions and Transformations, by Peter Bol and William KirbyWorld War and Society in the Twentieth Century : World War II, by Charles S. MaierSets, Counting, and Probability, by Paul BambergAbstract Algebra, by Benedict Gross

Monday, October 05, 2009

Stefan Savage made an insightful comment related to the issue of jobs:

I've long felt that its a fallacy that there exists a fine-grained Platonic ideal of "goodness" for researchers (so too for papers), but its an even bigger fallacy is to expect that decision makers would abide by such a scale even if it existed. In my experience, job offers are job offers, just as paper acceptances are paper acceptances. Trying to analyze such results at a finer or deeper scale is unlikely to reveal many useful truths.

There seems to be in the previous comments (mostly from anonymous commenters) the idea that getting a job is like those contests many of us did back in high school -- you get more points than the next person, you get the prize. This idea, in my mind, requires some underlying assumptions. First, that merit can be precisely measured -- if you get a high enough score, you get the corresponding job, and anything else is a failure of the system. Second, merit [for a position at a top research university] corresponds explicitly to quality of research, and again, using other considerations is a failure of the system. (I should point out these ideas are in no way novel; indeed, this argument seems to arise constantly in debates on undergraduate admissions, regarding admission of underrepresented minorities/legacies/athletes and so on.)

I think both assumptions are invalid in the setting of faculty hires. First, even if you think research quality is the sole criterion on which to base a hire, how do you measure it? Number of papers? Number of citations? Practical impact/number of actual users? Convene a panel of experts to assign a score? There can be, and will be, disagreements; in some cases, only the test of time will tell. Of course it's often easy to separate "the top" as a rough equivalence class, but going beyond that to a rank ordering is often difficult, especially when comparing people in even slightly different research areas.

Second, I don't think research output alone is the sole measure for a faculty position. Obviously, there's teaching, advising, and administration to consider, but there are other less tangible issues as well. Joining a faculty is like joining a team, and the question is what person can best help the team -- the quality of a team is not merely the sum of the quality of the individual members. Will the potential hire collaborate with others, fill in an area where the department needs someone, or offer useful leadership? Can they fit into, and enhance, the department culture? And yes, the question of is this someone everyone can get along with for a couple of decades also comes to mind. Certainly research quality is a primary consideration -- really the primary consideration -- but most or all of the people brought in for interviews have passed a very high bar for research already, and the other issues can come into sharp focus in the late hiring stages. People might skip such considerations for a suitably good researcher -- I imagine many departments, for instance, would take a Turing award winner, even if the person had a destructive personality, assuming the benefits would outweigh the costs. (I don't actually know of a case like that, but the issue has come up, as a purely theoretical issue, in discussions on hiring in the past.)

This may not be the way some people wish things would work, but it's counterproductive to not recognize that this is the way it generally works -- as Stefan suggests. Further, I strongly suspect that the idea that a pure "merit-based" system, whatever that means in this context, is the universally right approach to faculty hiring is based on assumptions that are faulty in both theory and practice.

[Interestingly enough, I recall a similar topic comes up in the Justice classI posted about before; I'll have to review those lectures!]

Saturday, October 03, 2009

One issue that arose is what a PhD in theoretical computer science should know -- what's the "core" of theoretical computer science? The issue arose as some anonymous commenter claimed to have a PhD in TCS but not know of Voronoi diagrams, range counting, etc. after some other commenter claimed that these were topics one learned as an undergraduate. For the record, I think it's the rare undergraduate that learns about these data structures/algorithms; I learned about them in graduate school, and only because I took the computational geometry course (from the fantastic Raimund Seidel, with the occasional guest lecture from fellow-graduate-student Ernie Jeff Erickson).

As TCS expands, so that it's harder and harder to have exposure to everything as a graduate student, the question of what is the TCS core will become more challenging. The same problem, of course, happens at "all levels of the tree" -- what is the core knowledge that a PhD (or undergraduate) in CS should learn across all areas of CS, or the core for an undergraduate in a liberal arts college? Anyone who has served on a faculty committee to deal with this issue knows that this is a challenge -- what is "core" is usually defined as the area the person on the committee is working on. (From my point of view, how could a PhD in TCS not know Azuma's inequality, the fundamentals of entropy, and Bloom filters?... But I am sure there are many that don't.) Arguably, TCS has been small enough until fairly recently that one could describe a core that most everyone knew most of, but I think that's increasingly less true. (I imagine it's been true of mathematics for some time.)

In any case, I think the people who expressed disbelief and dismay that a PhD in theory might not know a fair number of this list of things ("Range counting, predecessor, voronoi diagrams, dynamic optimality, planar point location, nearest neighbor, partial sums, connectivity.") should consider that they might have overreacted -- I believe most PhDs in TCS learn them, but I don't think it's by any means close to mandatory.

This leaves two several questions for comments:

1) What should be the "core" of TCS that (almost) all PhDs should be expected to know? This is going to be a moving target, of course, but what topics would you place on it now? [It's not clear to me whether one would want to put specific examples or thematic concepts in the list -- for example, would you put "Azuma's inequality" or simply "probability tail bounds" -- feel free to suggest both.]

2) How do we enforce that this core gets learned? I find increasingly PhDs expect to get right to research, and view classes as a hindrance rather than an opportunity. I have long found this problematic. I personally find that courses are a great way to inculcate core material. After all, it's because of that course in graduate school that I learned about Voronoi diagrams, and they've proven useful enough that they appeared in a paper I co-authored.

Friday, October 02, 2009

Madhu Sudan gave a colloquium at Harvard yesterday on his work on Universal Semantic Communication and Goal-Oriented Communication (both with Brendan Juba, the latter also with Oded Goldreich). The papers are available here, and here are the slides (pdf). One of the motivating examples for the work is the following : you're visiting another department for the day, and need to print something out. You have access to a printer, but your machine doesn't speak the same language as it does. So you have to get a driver, install it, and so on, and all of a sudden it takes 1/2 an hour to do a simple print job. Why can't the machines figure out how to get the simple job done -- printing -- without this additional work?

More abstractly, one can pose the high-level idea in the following way: Shannon's theory was about the reliable communication of bits, and we've solved a great deal about those types of communication problems. But even if we assume that bits are transmitting correctly over a channel, how can we ensure the meaning of those bits is interpreted properly, particularly in the sense of if those bits represent a task of the form, "Please do this computation for me," how do we ensure the other side performs the computation we want done if we don't have a prior agreed-upon semantic framework?

I've seen in various settings criticism of this line of work, which is quite abstract and certainly a bit unusual for CS theory. The original paper is often referred to as "the aliens paper" because it set the question in terms of communicating with aliens (where there may naturally be no shared semantic framework), and my impression is that several people felt it is too far removed from, well, everything, to be of interest. It was, I understand, rejected multiple times before being accepted.

I have to say the impression that this paper is "too far removed" is incorrect based on the reaction at the talk Madhu gave. Part of it may be a change in message -- no talk of aliens, and more talk of novel devices connecting to the Internet makes the problem more tangible. But it was remarkable how many people were interested -- our computational linguist and programming languages faculty seemed really intrigued, and there was also great interest from some of our other systems people. (It helps to be in a department where faculty outside of theory are generally perfectly comfortable seeing PSPACE-completeness and reductions show up on slides -- is that usual, or are we spoiled here? Multiple questions on the details of the definitions were asked by non-theorists...) Many people shared the impression that this was a new way to think about some very challenging problems, and while the work so far is too theoretical to be of practical use -- indeed, it's arguably still at the stage where perhaps the right frameworks or questions aren't entirely clear -- they seemed to view it as a start worth pursuing further.

I think this sort of paper is a rare beast, but perhaps it does serve as an example that a new conference like ICS is needed as an outlet for this kind of work. In particular, it's not clear to me that FOCS/STOC always has a good handle on where theory could be of large interests and have a significant impact on other communities. My complaint generally takes the form that FOCS/STOC (and even SODA) weighs mathematical "difficulty" far, far greater than practical utility when judging work in algorithms and data structures, but this seems to be a related issue.

Thursday, October 01, 2009

Since I've now co-authored a paper on GPUs, I'm now "in-the-loop" (thanks to my co-author John Owens) on the news of NVIDIA's announcement of its "next generation" of GPUs, code-named Fermi. (To be out in 2010? Competitors are Intel's Larrabee and AMD's Evergreen.) Some articles on it are : Ars Technica, the Tech Report, PC Perspective. I'm still trying to figure out what it all means, myself, but it seems like there's a future in figuring out how to do high-performance computing (algorithms, data structures) on GPU-style chips. Expect more workshops of this type (Uzi Vishkin's workshop on theory + mulit-core from earlier this year).