29 April 2010

I get many of emails of the form "I'm looking for a postdoc...." I'm sure that other, more senior, more famous people get lots of these. My internal reaction is always "Great: I wish I could afford that!" NSF has a solution: let them pay for it! This is the second year of the CI fellows program, and I know of two people who did it last year (one in NLP, one in theory). I think it's a great program, both for faculty and for graduates (especially since the job market is so sucky this year).

If you're graduating, you should apply (unless you have other, better, job options already). But beware, the deadline is May 23!!! Here's more info directly from NSF's mouth:

The CIFellows Project is an opportunity for recent Ph.D. graduates in computer science and closely related fields to obtain one- to two-year postdoctoral positions at universities, industrial research laboratories, and other organizations that advance the field of computing and its positive impact on society. The goals of the CIFellows project are to retain new Ph.D.s in research and teaching and to support intellectual renewal and diversity in the computing fields at U.S. organizations......Every CIFellow application must identify 1-3 host mentors. Click here to visit a website where prospective mentors have posted their information. In addition, openings that have been posted over the past year (and may be a source of viable mentors/host organizations for the CIFellowships) are available here.

20 April 2010

I remember back in middle school or high school -- added uncertainty so as not to date myself too much -- I first learned of the existence of file compression software. We used pkzip, mostly. I remember one of the first things I did was: compress myfile.exe to myfile.exe.zip. Then compress that to myfile.exe.zip.zip. Then myfile.exe.zip.zip.zip. I cannot tell you how disappointed I was when, at one stage, the file size went up after "compression."

We read papers all the time that demonstrate something roughly of the form "if A then B." The happens most obviously the closer you get to theory (when such statements can be made precise), but happens also in non-theoretical work. The point of this post is: if you believe A => B, then you have to ask yourself: which do I believe more? A, or not B?

The simplest case is the compression case. Let's say a weak compressor is one that always reduces a (non-empty) file's size by one bit. A strong compressor is one that cuts the file down to one bit. I can easily prove to you that if you give me a weak compressor, I can turn it into a strong compressor by running it N-1 times on files of size N. Trivial, right? But what do you conclude from this? You're certainly not happy, I don't think For what I've proved really is that weak compressors don't exist, not that strong compressors do. That is, you believe so strongly that a strong compressor is impossible, that you must conclude from (weak) => (strong) that (weak) cannot possibly exist.

Let's take the most obvious example from machine learning: boosting. The basic result of boosting is that I can turn a "weak classifier" into a "strong classifier." A strong classifier is one that (almost always) does a really good job at classification. A weak classifier is one that (always always) does a not completely crappy job at classification. That is, a strong classifier will get you 99.9% accuracy (most of the time) while I weak classifier will only get you 50.1% accuracy (most of the time). Boosting works by repeatedly applying the weak classifier to modified data sets in order ot get a strong classifier out.

In all the boosting papers, the results are presented as positive. That is: look, obviously we want strong classifiers. But weak classifiers look much more attainable. And voila, by boosting magic, we can turn the weak into the strong. This is an A => B setting: (existence of weak classifiers) => (existence of strong classifiers).

Sounds great, right? But let me ask you: do you believe more strongly that (weak classifiers exist) or more strongly that (strong classifiers do not exist)? For me, it's the latter. To some degree, no free lunch theorems apply here. This yields a totally different read of boosting results, namely: weak classifiers don't even exist!!!

More on the language side, one of the more classic experimentally results we have is that if you give me a really good semantic representation, I can do an awesome job doing natural language generation from those semantics. In order to do translation, for instance, I just have to generate the semantics of a French sentence and then I can generate a near-perfect English translation. (French semantics) => (Perfect translations). But do I believe mose strongly that we can get perfect French semantics or that we can not get perfect translation? Right now, almost certainly the latter.

My point is really one about critical reading. When you read a paper, things will be presented in one light. But that's often not the only light in which they can be interpreted. Apply your own interpretation.

12 April 2010

I've had discussions about this with tons of people, and it seems like my approach is fairly odd. So I thought I'd blog about it because I've put a lot of thought into it over the past four offerings of the machine learning course here at Utah.

At a high level, if there is one thing I want them to remember after the semester is over it's the idea of generalization and how it relates to function complexity. That's it. Now, more operationally, I'd like them to learn SVMs (and kernels) and EM for generative models.

In my opinion, the whole tenor of the class is set by how it starts. Here's how I start.

Decision trees. No entropy. No mutual information. Just decision trees based on classification accuracy. Why? Because the point isn't to teach them decision trees. The point is to get as quickly as possible to the point where we can talk about things like generalization and function complexity. Why decision trees? Because EVERYONE gets them. They're so intuitive. And analogies to 20 questions abound. We also talk about the who notion of data being drawn from a distribution and what it means to predict well in the future.

Nearest neighbor classifiers. No radial basis functions, no locally weighted methods, etc. Why? Because I want to introduce the idea of thinking of data as points in high dimensional space. This is a big step for a lot of people, and one that takes some getting used to. We then do k-nearest neighbor and relate it to generalization, overfitting, etc. The punch line of this section is the idea of a decision boundary and the complexity of decision boundaries.

Linear algebra and calculus review. At this point, they're ready to see why these things matter. We've already hinted at learning as some sort of optimization (via decision trees) and data in high dimensions, hence calculus and linear algebra. Note: no real probability here.

Linear classifiers as methods for directly optimizing a decision boundary. We start with 0-1 loss and then move to perceptron. Students love perceptron because it's so procedural.

The rest follows mostly as almost every other machine learning course out there. But IMO these first four days are crucial. I've tried (in the past) starting with linear regression or linear classification and it's just a disaster. You spend too much time talking about unimportant stuff. The intro with error-based decision trees moving to kNN is amazingly useful.

The sad thing is that there are basically no books that follow any order even remotely like this. Except...drum roll... it's actually not far from what Mitchell's book does. Except he does kNN much later. It's really depressing how bad most machine learning books are from a pedagogical perspective... you'd think that in 12 years someone would have written something that works better.

On top of that, the most recent time I taught ML, I structured everything around recommender systems. You can actually make it all work, and it's a lot of fun. We actually did recommender systems for classes here at the U (I had about 90-odd students from AI the previous semester fill out ratings on classes they'd taken in the past). The data was a bit sparse, but I think it was a lot of fun.

The other thing I change most recently that I'm very happy with is that I have a full project on feature engineering. (It ties in to the course recommender system idea.) Why? Because most people who take ML, if they ever use it at all, will need to do this. It's maybe one of the most important things that they'll have to learn. We should try to teach it. Again, something that no one ever talks about in books.

Anyway, that's my set of tricks. If you have some that you particularly like, feel free to share!

07 April 2010

One of the most fun AHA moments in ML or stats is when you see that for multinomial distributions, your naive idea of relative frequency corresponds (through a bunch of calculus) to maximum likelihood estimation. Ditto means of Gaussians, though that's never quite as amazing because Gaussians seems sort of odd beasts anyway. It's nice when our intuitions match what stats tells us.

I'll often watch a DVD, and then leave it in the DVD player until I want to watch something else. Usually it will return to the menu screen and the timer display will go through 0:00, 0:01, 0:02, ..., until it hits the length of the title screen loop, and then go back to 0:00. What I always wonder when I glance up at it is "what is the actual length of the title screen loop?" That is, what is the highest value it'll count up to.

Being a good statistician, I set up a model. First I play the frequentist game. Since I glance up and pretty arbitrary times, my observations can be modeled as a uniform random variable from 0 to some upper bound U, which is the quantity I want to infer.

Supposing that I only make one observation, say 0:27, I want a maximum likelihood estimate of U. It's easy to see that U=27 is the maximum likelihood solution. Anything less than 27 will render my observation impossible. For any U>=27, my observation has probability 1/(U+1), so the ML solution is precisely U=27. Note that even if I observe five observations a <= b <= c <= d <= e, the ML solution is still U=e. Huh. The math works. The model is as correct as I could really imagine it. But this answer doesn't really seem reasonable to me. Somehow I expect a number greater than my observation to come about.

Perhaps I need a prior. That'll solve it. I'll add a prior p(U) and then do maximum a posteriori. Well, it's easy to see that if p(U) is unimodal, and its mode is less than my (highest) observation, then the MAP solution will still be the (highest) observation. If it's mode is more than my (highest) observation, then the MAP solution will be the mode of p(U). If I think about this a bit, it's hard for me to justify a multi-modal p(U), and also hard for me to be happy with a system in which my prior essentially completely determines my solution.

Or I could be fully Bayesian and look at a posterior distribution p(U | data). This will just be a left-truncated version of my prior, again, not very satisfying.

(Note that the "Maximum Likelihood Sets" idea, which I also like quite a bit, doesn't fix this problem either.)

It also really bugs me that only my largest observation really affects the answer. If I get one hundred observations, most of them centered around 0:10, and then one at 0:30, then I'd guess that 0:30 or 0:31 or 0:32 is probably the right answer. But if I observe a bunch of stuff spread roughly uniformly between 0:00 and 0:30, then I'd be more willing to believe something larger.

I don't really have a solution to this dilemma. Perhaps you could argue that my difficulties arise from the use of a Uniform distribution -- namely, that were I to use another distribution, these problems wouldn't really arise. I don't think this is quite true, as described below:

We actually run in to this problem in NLP fairly often. I observe a billion documents and see, in this 1b documents, 100k unique words, many of which are singletons. But I'm not willing to believe that if I see that document 1,000,000,001, that there won't be a new unseen word. Sure it's less likely that I see a new unique word than in document 1001, where it is almost guaranteed, but there's still a relatively large probability that some new word will show up in that next document.

The whole point is that somehow we expect random samples to be representatives of the underlying distribution. We don't expect them to define the underlying distribution. I don't actually expect to ever see the corner cases, unless I've seen everything else.

UPDATE: The Bayesian approach actually does do something reasonable. Here is some Matlab code for computing posteriors with a uniform prior, and results with an upper bound of 100 and various observations are plotted below:

01 April 2010

In the context of some work on multitask learning, we came to realize that classification is kind of weird. Or at least linear classification. It's not that it's weird in a way that we didn't already know: it's just sort of a law of unexpected consequences.

If we're doing linear (binary) classification, we all know that changing the magnitude of the weight vector doesn't change the predictions. A standard exercise in a machine learning class might be to show that if your data is linearly separable, then for some models (for instance, unregularized models), the best solution is usually an infinite norm weight vector that's pointing in the right direction.

This is definitely not true of (linear) regression. Taking a good (or even perfect) linear regressor and blowing up the weights by some constant will kill your performance. By adding a regularizer, what you're basically doing is just saying how big you want that norm to be.

Of course, by regression I simply mean minimizing something like squared error and by classification I mean something like 0/1 loss or hinge loss or logistic loss or whatever.

I think this is stuff that we all know.

Where this can bite you in unexpected ways is the following. In lots of problems, like domain adaptation and multitask learning, you end up making assumptions roughly of the form "my weight vector for domain A should look like my weight vector for domain B" where "look like" is really the place where you get to be creative and define things how you feel best.

This is all well and good in the regression setting. A magnitude 5 weight in regression means a magnitude 5 weight in regression. But not so in classification. Since you can arbitrarily scale your weight vectors and still get the same decision boundaries, a magnitude 5 weight kind of means nothing. Or at least it means something that has to do more with the difficulty of the problem and how you chose to set your regularization parameter, rather than something to do with the task itself.

Perhaps we should be looking for definitions of "look like" that are insensitive to things like magnitude. Sure you can always normalize all your weight vectors to unit norm before you co-regularize them, but that loses information as well.

Perhaps this is a partial explanation of some negative transfer. One thing that you see, when looking at the literature in DA and MTL, is that all the tasks are typically of about the same difficulty. My expectation is that if you have two tasks that are highly related, but one is way harder than the other, is going to lead to negative transfer. Why? Because the easy task will get low norm weights, and the hard task will get high norm weights. The high norm weights will pull the low norm weights toward them too much, leading to worse performance on the "easy" task. In a sense, we actually want the opposite to happen: if you have a really hard task, it shouldn't screw up everyone else that's easy! (Yes, I know that being Bayesian might help here since you'd get a lot of uncertainty around those high norm weight vectors!)