Monday, May 31, 2010

During the summer theory seminar for my research group, I like to cover a topic that is mathematically challenging, but not something that any of us would normally learn about in the course of our day-to-day research.

This is a great idea! Usually, during the semester, I'm in market-driven mode, choosing topics that are more accessible and likely to draw a larger audience. But summer is a good time for harder material, since you have a smaller, self-selected group of motivated students.

Last summer I ran a "why can't we solve P vs NP" seminar with three students - we went through the standard obstacles, and along the way learnt a fair amount of complexity theory - it was a lot of fun. This summer we're doing lattice theory - I have selfish reasons for choosing this topic (some of my recent work needs it), and it's a very accessible (and relevant!) area, while still being nontrivial enough to get students to think.

We considered a number of other topics as well before settling on this one - we might even return to some of them later. They were, in no particular order.

Thursday, May 20, 2010

What's newsworthy is the shifted submission deadline. Last year, the ICS submit-accept cycle was highly compressed, to make sure it didn't clash with either SODA or STOC. This year, it's in direct conflict with SODA (submission deadline Aug 2), which should make things interesting for SODA submission numbers.

Since the conference is still in some flux (I don't know where it will be next year), it's probably too soon to comment on the timing/deadlines, but I wonder whether it will continue to be a good idea to have SODA and ICS be in direct conflict.

Friday, May 14, 2010

We've seen two generic methods for determining the number of clusters in a data set thus far: the elbow method, and the ROC-curve/diminishing returns method. Now I want to talk about a more "dynamic" approach to identifying the number of clusters in a data set.

These ideas center around a 'simulated annealing' view of the clustering process. Imagine, if you will, a set of highly excited atoms all bouncing around in a state in which all are indistinguishable. If you now start cooling the atoms, they will start falling into local energy traps over time. If the cooling is done on the right schedule, the atoms will find a minimum energy configuration at each temperature, all the way down to absolute zero, where each atom is in its "own state", so to speak.

What does this have to do with clustering?

The setup works something like this. Imagine a "soft" clustering assignment of points to clusters (where each point assigns a fraction of its membership to each cluster). We can write down the distortion D associated with this clustering by computing the weighted distance of each point from the various cluster centers. We'd also like the clustering to be more or less deterministic, so we introduce a penalty term for the level of disorder in the clustering (usually captured by the conditional entropy H associated with the assignments of points to clusters).

Now we want to minimize the distortion subject to some constraint on the entropy, so in good old Lagrangian form, we write down a function of the form

F = D - TH

where T is a Lagrange parameter and F is the overall function to be minimized.

But here's the kicker. You can think of F as the free energy of a statistical ensemble, where D represents its energy, H its entropy, and T the "temperature" of the system. The T=0 limit captures the idea that distortion is all that matters, and so each point is assigned a cluster of its own. The "T large" limit prioritizes the entropy over the distortion, encouraging us to place all points in a single cluster.

So the annealing process corresponds to minimizing F while decreasing T steadily. It turns out, using magic from the realm of statistical physics, that for any temperature T, the probability of assigning point x to cluster y is proportional to exp(-d(x,y)/T), and as T decreases, the probability of assigning a point to any cluster except the very nearest one decreases dramatically.
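A minimal numpy sketch of one annealing step may make this concrete. I'm assuming squared Euclidean distance as the distortion (the usual k-means choice); the function names and the particular update schedule are mine, not part of any canonical implementation:

```python
import numpy as np

def soft_assignments(X, centers, T):
    """Gibbs assignment: p(cluster j | point i) proportional to exp(-d(x_i, c_j)^2 / T)."""
    # squared Euclidean distances, shape (n_points, n_centers)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # subtract the row-wise minimum before exponentiating, for numerical stability
    logits = -(d2 - d2.min(axis=1, keepdims=True)) / T
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def anneal_step(X, centers, T):
    """One fixed-point iteration minimizing F = D - T*H at temperature T:
    soft-assign all points, then move each center to its weighted mean."""
    p = soft_assignments(X, centers, T)
    w = p / p.sum(axis=0, keepdims=True)   # normalize each cluster's weights
    return w.T @ X, p
```

Running `anneal_step` repeatedly while lowering T is the annealing schedule: at high T the assignment matrix is nearly uniform, and as T drops each row of `p` concentrates on the nearest center, exactly the exp(-d/T) behavior described above.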

All of this is very fascinating, and provides a "smooth landscape" in which to understand how points get assigned to (soft) clusters. Much of the work I'm exploring right now is in trying to understand the space of soft clusterings of data. But that's a digression.

What's really interesting is that if you look at the evolution of the probabilistic assignments, you start seeing phase transitions. As T starts off high, all the points are in the same cluster. At a specific temperature T, a phase transition occurs, and the data starts splitting (based on the assignments) into more than one cluster.

How can you tell when this happens ? Here's a very elegant technique, first proposed by Kenneth Rose in his work on deterministic annealing. In the high-T regime, where every point is in the same cluster, the cluster center location can be computed by solving a simple convex optimization. As the process evolves, the positive definite matrix defining the optimization starts losing its positive definiteness, till some point when one of its eigenvalues goes to zero. This point can be computed analytically, and yields a specific temperature value at which the first clusters start to emerge.
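If I'm remembering Rose's analysis correctly, for squared-Euclidean distortion this first critical temperature works out to twice the largest eigenvalue of the data covariance matrix, which makes it a one-liner to compute (the function name is mine):

```python
import numpy as np

def first_critical_temperature(X):
    """First splitting temperature for squared-Euclidean distortion:
    T_c = 2 * (largest eigenvalue of the data covariance matrix),
    per Rose's deterministic-annealing analysis."""
    cov = np.cov(X, rowvar=False, bias=True)  # covariance about the single high-T center
    return 2.0 * np.linalg.eigvalsh(np.atleast_2d(cov)).max()
```

Above T_c the single-cluster solution is a stable minimum of F; at T_c the Hessian of the optimization loses positive definiteness along the top eigenvector of the covariance, and the cluster splits along that direction.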

Kenneth Rose has some nice examples illustrating this behaviour. As time goes on, more and more phase transitions appear, as more and more clusters emerge. What you end up with is a hierarchy of clusterings that ends with the trivial clustering in which all data points lie in separate clusters.

This idea has been developed further with the information bottleneck method, which replaces both the distortion and entropy terms by terms involving the mutual information of the assignments. The free energy paradigm works the same way, although now the phase transition points can't be computed analytically (I'll have more to say about the information bottleneck method).

What's entirely weird (and I want to stress this point) is that the phase transitions still happen: data sets will split up into different numbers of clusters at the same transition point! We wrote a paper a few years ago that proposes a particular "temperature" at which to look for phase transitions, and lo and behold, we were able to recover the "correct" number of clusters from planted data sets by watching the data split at this point. I'm still not sure why this happens, but it was quite interesting, and yielded one way of identifying "natural clusters" in the data.

Phase transitions and the like are the coin of the realm in bifurcation theory. While a proper understanding of this area is well beyond my skill level, there has been recent work on trying to analyze the information bottleneck from a bifurcation theory perspective, specifically by trying to isolate and classify the different kinds of critical points that emerge in a bifurcation diagram of the clusters as the temperature changes. Albert Parker of the University of Montana wrote a Ph.D thesis on this topic, and I'd love to understand his work better.

Coming up next: Clustering as compression - the information bottleneck and Kolmogorov complexity.

Thursday, May 13, 2010

Shape analysis is a topic that is almost a killer app for computational geometry. Where the word 'almost' comes in is an interesting story about the tension between computational power and mathematical elegance.

Shape analysis is defined by the generic problem:

Given two (or more) shapes, determine whether they are similar.

Simple, no? I don't need to list the applications for this problem. Or maybe I should: computer vision, computational biology, graphics, imaging, statistics, computer-aided design, and so many more.

It would not be an exaggeration to say that in many ways, shape analysis is as fundamental a problem as clustering. It's a problem that people have been studying for a very long time (I heard a rumor that Riemann once speculated on the manifold structure of shapes). Especially in the realm of biology, shape analysis is not merely a way to organize biological structures like proteins: it's a critical part of thinking about their very function.

Like any problem of such richness and depth, shape analysis has spawned its own ecology of concepts. You have the distances, the transformation groups, the problem frameworks, the feature representations, the algorithms, and the databases. You now even have the large data sets of very large shapes, and this brings nontrivial computational issues to the forefront.

Shape analysis (or shape matching) has been a core part of the computational geometry problem base for a long time. I've seen papers on point pattern matching from the late 70s. There's been a steady stream of work all through the past 30 years, introducing new concepts, distances and algorithm design principles.

In parallel, and mostly within the computer vision community, there have been other efforts, focused mainly on designing more and more elegant distance measures. Computer graphics got in on the action more recently as well, with a focus on different kinds of measures and problems.

What I think is unfortunate (and this is entirely my own opinion) is that there's a strong disconnect between the developments happening in the computational geometry community, and the parallel efforts in the more 'applied' communities.

I'm not merely talking about lack of awareness of results and ideas. I believe there are fundamentally different ways in which people go about attacking the problems of shape analysis, and I think all sides could benefit greatly from understanding the strengths of the others.

Specifically, I think that within our community, we've focused for far too long on measures that are easy to define and relatively easy to compute (while not being trivial to work with). On the more 'math/vision' side, researchers have focused much more on measures that have the 'right' kind of mathematical structures, but have bailed miserably on computational aspects.

Over the next post or two, I want to develop this argument further, and hopefully end with a set of problems that I think are worthy of study both from their intrinsic appeal and the unsolved computational issues they present.

Stay tuned....

p.s. The clustering series will also continue shortly. Oh do I love summer :)

Wednesday, May 12, 2010

There's about a month left to go for SoCG, and I just returned from a walk-through at Snowbird. Surprisingly, it was still snowing, though the ski season has wound down. Most of the snow will disappear by early June - we're seeing the last confused weather oscillations right now before the steady increase in temperature.

We went up there to check out the layout of the room(s) for the conference - the main conference room is nice and large, and the parallel session room is pretty big as well, there'll be wireless access throughout (with a code), and there's a coffee place right next to the rooms. There are nice balconies all around, and of course you can wander around outside as well.

Registrations have been trickling in, a little slower than my (increasingly) gray hair would like. If you haven't yet registered, consider this a gentle reminder :). It helps to have accurate numbers when estimating food quantities, proceedings copies, and the like.

Saturday, May 08, 2010

Every now and then, we get called upon to project our work into the future. Usually, it's in a grant proposal (especially in a CAREER proposal). Sometimes it might be part of strategic planning at a faculty retreat. It even shows up in solicitations for position papers at various venues (for example this recent one that was circulating on a faculty list). It definitely comes up at faculty interviews, although I usually view it as a hazing ritual or the equivalent of "Nice weather we're having, aren't we?"

I understand the short-term imperative for such things: it's good to know that there are timelines in which your work has some kind of measurable impact, and even better to know that there's more than one (BPP vs NP, anyone?).

But I get the sense (and maybe I'm just off base here) that this kind of future prediction business is more common in non-theoryCS areas. My archetypical story for what happens if you ask theoreticians about future directions is Jeff Erickson's hilarious tale about his interview at MIT.

Of course the most famous example of future projection is in mathematics! So maybe my premise is doomed? But somehow I don't think so. I don't think mathematicians since Hilbert go around proposing future directions for entire areas (although there might be general consensus on key open problems), and I think theoryCS has absorbed much of this ethos (although I don't think that's true in theoretical physics).

I ask because I always feel awkward when asked questions like "where is the field going in the next X years?" or even worse, "where SHOULD the field be going in the next X years?". Maybe the more reasonable question is "where's all the activity and ferment happening right now?"

Tuesday, May 04, 2010

The US news rankings came out a while back (Jon Katz had two posts on this). As usual, this will prompt a round of either back-slapping or back-stabbing, depending on whether your department ranking went up or down (ours didn't change at all, which could also be a bad thing).

What I'd like to propose is a completely different way of doing rankings.

It's generally accepted that the place where rankings make the most difference is in graduate admissions, and there's a secondary effect in faculty hiring (since faculty want to get good students to work with). The general belief is that students will tiebreak between universities based on ranking, in the absence of more contextual information.

But it's also insanely silly to obsess about the relative rankings of (say) the top 5 schools, or to exult in your movement from 53 to 47 in the rankings. What I believe is generally true is that there are rough strata (antichains in a partial order, if you will) in which departments are generally of equivalent rank. Spending time and energy trying to optimize within such a stratum is a useless waste (which doesn't mean that people don't LOVE to do it, because any activity is positive activity, right? ... right? ...)

What we do keep track of, and is interesting, is which universities our admits reject us for, or accept us in place of. If I'm not Stanford or MIT, but students are rejecting me only to go there, then I'm not happy, but I feel minor relief that at least they're not rejecting me for the University of Obscurity in Scarceville, Podunkistan.

But of course we know what this is! It's a topological order! So I propose the following tiering scheme:

A department is at tier k if "all" departments it is rejected for are at tier k-1 or less.

Note 1: We have to define "all" carefully - there's always someone who's (say) following a boyfriend or girlfriend, or really wants to live in some town, etc etc. My preferred definition of "all" would be "at least 80%" or some large figure like that.

Note 2: If in fact people did select universities based on the "current" ranking scheme, this order would reflect that. Of course, I don't believe this will happen.

Note 3: This might even allow for more fine grained analysis based on subject area. Depending on the areas of the admitted students, one could create stratified orders by area.

Note 4: No I have no clue how to get this data, but many departments informally maintain this information (I know we try to get this info when we can), and it's not like the current approach is dripping with rigor anyway.

Note 5: If you're an administrator, you'll hate this when you're trying to move to a higher level, and you'll love it when you actually make the move. The problem with the lack of granularity might annoy some people though.
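The tiering rule is just level assignment in a DAG: a department's tier is one more than the highest tier among the departments it loses admits to. A toy sketch (the data structure and names are mine, and I'm assuming the filtered "rejected-for" relation is acyclic, per Note 1):

```python
from functools import lru_cache

def tiers(loses_to):
    """loses_to[d] = set of departments that admitted students reject d in favor of
    (after filtering out the boyfriend/girlfriend noise).
    Tier of d = 1 + max tier over those departments; tier 1 loses to nobody."""
    @lru_cache(maxsize=None)
    def tier(d):
        better = loses_to.get(d, set())
        return 1 + max((tier(b) for b in better), default=0)
    return {d: tier(d) for d in loses_to}
```

For example, a department rejected only in favor of tier-1 schools lands at tier 2, regardless of how it compares to the other tier-2 schools, which is exactly the coarseness the strata argument above calls for.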

Semester is over, which means I hope to get back into blogging form again, picking up my clustering series where it left off, and also starting some new rants on problems in shape matching (which increasingly look to me like problems in clustering).

But for today, just a little something to ponder. The following situation often occurs in data analysis. You have some data that inhabits a metric space - usually Euclidean space, but that doesn't really matter. You also have distributions over the data, by which I mean some kind of weight vector with one "component" for each data point, and components summing to one. The goal now is to compare these weight vectors in a way that takes into account the structure of the space.

The standard construction that one uses here is the earthmover distance, also known as the Wasserstein distance, or the Monge-Kantorovich distance, or the Mallows distance, or the transportation metric (pick your favorite). It's very intuitive (which is probably why it's been invented so many times) and works like this. Imagine piles of earth at each data point, with each pile having mass equal to the weight at that point. We have a "starting configuration" consisting of the first weight vector, and an "ending configuration" consisting of the second weight vector. The goal is to figure out how to "transport" the earth with minimum effort (effort = weight × distance moved) so that the first configuration becomes the second. Formally, this amounts to a generalized matching problem that can be solved via the Hungarian algorithm. The earthmover distance is very popular in computer vision (see the Wikipedia article for details).
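In the special case of two equal-size point sets with uniform weights, the transportation problem degenerates to a min-cost perfect matching, which the Hungarian algorithm solves directly. A sketch using scipy's implementation (the general weighted case needs the full transportation LP, which this omits):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_uniform(X, Y):
    """Earthmover distance between two uniformly-weighted point sets of
    equal size n: each point carries mass 1/n, so the optimal transport
    plan is a perfect matching minimizing total Euclidean distance."""
    n = len(X)
    assert len(Y) == n, "uniform-weight case needs equal-size sets"
    # cost[i, j] = Euclidean distance from X[i] to Y[j]
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return cost[rows, cols].sum() / n
```

Even this special case is superlinear in n, which is the computational pain point the rest of the post circles back to.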

Another metric over distributions of the kind above is the Levy-Prokhorov metric, which for two distributions u and v is defined as the smallest e such that for any neighborhood A, the measure of u on A is within e of the measure of v on A inflated by e (i.e., by constructing a ball of size e around A), and vice versa. I haven't seen this metric used much in practice, and it seems hard to compute.

Another approach that I realized recently uses a method that thus far has been used only in shape analysis. I've been working with a shape matching measure called the current distance of late (more on this in another post - it's a fascinating measure). Roughly speaking, it works like this. It starts with a "shape" defined any way you like (clouds of points, a curve, a surface, whatever), and a similarity function (a positive definite kernel, actually) defined on this space. It then allows you to compare these shapes by using the kernel to create a "global signature" that lifts each shape to a point in a Hilbert space, where the induced distance captures the shape distance.

It also works with weighted point sets, which is the relevant point here. Suppose I give you a space with distance defined indirectly via a kernel similarity function (rather than via a distance function). The current distance then gives me a way of comparing distributions over this space just like the above measures, and the kicker is that this approach is WAY more efficient than any of the above methods, taking near-linear time instead of needing the rather expensive Hungarian algorithm. Moreover, the current distance has a built-in isometric embedding into a Hilbert space, something the earthmover distance cannot have.
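For weighted point sets, the Hilbert-space distance unwinds into a closed form in the kernel: the norm of the difference of the two weighted kernel sums. A naive quadratic-time sketch with a Gaussian kernel (the near-linear running time mentioned above needs kernel approximation machinery, which this omits; names and the kernel choice are mine):

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between two point sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def current_distance(A, wA, B, wB, sigma=1.0):
    """Distance in the RKHS between the lifted signatures of two weighted
    point sets: || sum_i wA_i phi(a_i) - sum_j wB_j phi(b_j) ||."""
    d2 = (wA @ gauss_kernel(A, A, sigma) @ wA
          + wB @ gauss_kernel(B, B, sigma) @ wB
          - 2 * wA @ gauss_kernel(A, B, sigma) @ wB)
    return np.sqrt(max(d2, 0.0))   # clamp tiny negative values from roundoff
```

Note that nothing here solves an assignment problem: the signature of each point set is computed independently, which is where the isometric embedding into a Hilbert space comes from.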

If you're curious for more details, wait for my post on the current distance - in the meantime, there are two papers we've written (one online, the other you should email me for) that explore the theory and practice behind the current distance. I'm curious now as to whether the current distance can be used as an efficient replacement for the earthmover distance in applications that rely on the EMD, but don't have a natural relationship to shape analysis.

p.s. Shape matching in particular, and data analysis in general, is a rich source of ever more exotic metric spaces. I've been working with a number of these, especially in non-Euclidean spaces, and there are lots of interesting algorithmic questions here.