02 January 2007

I spent the second day of workshops at NIPS (while not skiing) attending the Learning when test and train inputs have different distributions workshop. This is closely related to (or really just a different name for) the domain adaptation problem I've been interested in for quite some time. Unfortunately, I can't easily find the list of papers (there were some good ones!), which means that my memory may be lacking in some parts. Here are some points I took away.

Statisticians have worked on this problem for a long time. If you provide insurance, you're going to want a predictor to say whether to give a new person a policy or not. You have lots of data on people and whether they made any claims. Unfortunately, the training data (people you have information on) is limited to those to whom you gave policies. So the test distribution (the entire population) differs from the training distribution (people to whom you gave policies). Some guy (can't remember his name right now) actually won a Nobel prize in Economics for a partial solution to this problem, which was termed "covariate shift" (because statisticians call our "inputs" "covariates," and they are changing).

There seems to be a strong desire to specify models which (though see the comment below) can be characterized as "train p(y|x) and test p(y|x) are the same, but train p(x) and test p(x) differ." In other words, the "labeling function" is the same, but the distribution over inputs is different. This is probably the clearest way to differentiate domain adaptation from multitask learning (for the latter, we typically assume p(x) stays the same but p(y|x) changes). I'm not sure that this is really a hugely important distinction. It may be for obtaining interesting theoretical results, but my sense is that in the problems I encounter, both p(x) and p(y|x) are changing, but hopefully not by "too much." An interesting point along these lines, made by Shai Ben-David (and one I spent a bunch of time thinking about several years ago), is that from a theoretical perspective, assuming p(y|x) is the same is a vacuous assumption: you can always take two radically different p(x) and q(x), add a feature that indicates which (p vs. q) the data point came from, and call this the "global p(x)". In fact, in some sense, this is all you need to do to solve multitask learning, or domain adaptation: just add a feature saying which distribution the input is from, and learn a single model. I've been doing some experiments recently and, while you can do better than this in practice with standard learning models, it's not such a bad approach.
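To make the "just add a feature" trick concrete, here is a minimal sketch. The helper name and the toy data are my own inventions for illustration; the idea is simply to pool the two datasets and append a binary indicator recording which input distribution each point came from, then train any single model on the result.

```python
def add_domain_feature(source_xs, target_xs):
    """Pool two datasets, appending one extra feature that records
    which input distribution each point came from.  (Hypothetical
    helper for illustration -- not from any particular library.)"""
    return ([x + [0.0] for x in source_xs] +   # 0.0 marks the source domain
            [x + [1.0] for x in target_xs])    # 1.0 marks the target domain

# Two points from one domain, one from the other, each with 2 features:
pooled = add_domain_feature([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]])
# pooled == [[1.0, 2.0, 0.0], [3.0, 4.0, 0.0], [5.0, 6.0, 1.0]]
```

A linear model trained on the pooled data can then use the indicator (and, with feature crosses, its interactions) to shift its predictions per domain, which is why this baseline is harder to beat than it looks.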

There were several other talks I liked. It seemed that the theoretical results were of the form "if p(y|x) is the same, and train p(x) and test p(x) differ only by a 'little,' then doing naive things for learning can do nicely." My favorite formalization of the two input distributions differing only a little was the Shai Ben-David/John Blitzer approach of saying that they differ slightly if there is a single hyperplane that does well (has low error) on both problems. The restriction to hyperplanes is convenient for what they do later, but in general it seems that having a single hypothesis from some class that does well on both problems is the general sense of what "p(y|x) is the same" is really supposed to mean. I also enjoyed a talk by Alex Smola on essentially learning the differences between the input distributions and using this to your advantage.

In general, my sense was that people are really starting to understand this problem theoretically, but I really didn't see any practical results that convinced me at all. Most practical results (modulo the Ben-David/Blitzer work, which essentially cites John's old work) were very NIPSish, in the sense that they were on unrealistic datasets (sorry, but it's true). I wholeheartedly acknowledge that it's somewhat difficult to get your hands on good data for this problem, but there is data out there. And it's plentiful enough that it should no longer be necessary to make artificial or semi-artificial data for this problem. (After all, if there weren't real data out there, we wouldn't be working on this problem...or at least we shouldn't :P.)