31 January 2009

The following was pointed out to me recently. Maybe other people think it's obvious, and it's obvious once you hear it, but I'd never seen it that way. Suppose we're doing boosting and the weak learner that we choose is a thresholded linear model (eg., perceptron, SVM, logistic regression). Then what we get out of boosting is basically a neural network with a single hidden layer. The hidden units themselves are the weak learners, and the "alphas" that arise as part of the boosting procedure (eg., AdaBoost) are the weights connecting the hidden units to the output. (Thanks to Antonio Paiva for pointing this out to me.) So, in a sense, boosting a linear model can be seen as a kind-of-unusual method for training a two layer network. Essentially the number of boosting iterations that you run determines the number of hidden units (i.e., the network architecture).
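To make the correspondence concrete, here is a minimal sketch in Python with NumPy. Everything in it is my own assumption for illustration: the function names, the choice of weighted logistic regression (fit by plain gradient descent) as the weak learner, and the step size and iteration counts. The point is only that the learned (w, b) pairs form a hidden layer of sign units and the AdaBoost alphas form the output weights.

```python
import numpy as np

def train_weak_linear(X, y, d, steps=200, lr=1.0):
    """Weighted logistic regression by gradient descent (an assumed weak learner).
    X: (n, p) inputs, y: labels in {-1, +1}, d: example weights summing to 1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        margin = np.clip(y * (X @ w + b), -30, 30)
        g = -d * y / (1.0 + np.exp(margin))  # d(loss)/d(score) per example
        w -= lr * (X.T @ g)
        b -= lr * g.sum()
    return w, b

def adaboost_linear(X, y, rounds=10):
    """Standard AdaBoost; returns hidden-layer weights W, biases B, output weights alphas."""
    n = len(y)
    d = np.full(n, 1.0 / n)
    W, B, alphas = [], [], []
    for _ in range(rounds):
        w, b = train_weak_linear(X, y, d)
        h = np.where(X @ w + b >= 0, 1.0, -1.0)   # thresholded linear model
        err = np.clip(d[h != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        d = d * np.exp(-alpha * y * h)            # upweight the mistakes
        d = d / d.sum()
        W.append(w); B.append(b); alphas.append(alpha)
    return np.array(W), np.array(B), np.array(alphas)

def network_score(X, W, B, alphas):
    """Read the boosted ensemble as a two layer network."""
    hidden = np.where(X @ W.T + B >= 0, 1.0, -1.0)  # one hidden unit per boosting round
    return hidden @ alphas                           # alphas are the output-layer weights

def predict_as_network(X, W, B, alphas):
    return np.where(network_score(X, W, B, alphas) >= 0, 1.0, -1.0)
```

Running `adaboost_linear(X, y, rounds=k)` yields a `W` with k rows, which is the sense in which the number of boosting iterations fixes the number of hidden units.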

Okay, that's great. But let's say we don't care about two layer networks, we care about deep learning. For a 3-layer network you could simply boost boosted-linear-models. For a 4-layer network you could boost boosted-boosted-linear-models. And so on.
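The nesting can be written down almost literally. The sketch below is my own illustration, not a tested recipe: `fit_linear` and `boost` are hypothetical names, the base learner is a weighted least-squares linear classifier (an arbitrary choice), and I assume the inner booster simply starts from whatever example weights the outer booster hands it.

```python
import numpy as np

def fit_linear(X, y, d):
    """Weighted least-squares linear classifier; returns a predict function."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # append a bias column
    s = np.sqrt(d)
    w, *_ = np.linalg.lstsq(Xb * s[:, None], y * s, rcond=None)
    def predict(Z):
        Zb = np.hstack([Z, np.ones((len(Z), 1))])
        return np.where(Zb @ w >= 0, 1.0, -1.0)
    return predict

def boost(fit_weak, rounds):
    """Wrap a weak learner in AdaBoost. The result has the same
    (X, y, d) -> predictor interface, so it can itself be boosted."""
    def fit(X, y, d):
        d = d / d.sum()
        models, alphas = [], []
        for _ in range(rounds):
            h = fit_weak(X, y, d)
            pred = h(X)
            err = np.clip(d[pred != y].sum(), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)
            d = d * np.exp(-alpha * y * pred)
            d = d / d.sum()
            models.append(h)
            alphas.append(alpha)
        a = np.array(alphas)
        def predict(Z):
            votes = np.stack([m(Z) for m in models], axis=1)
            return np.where(votes @ a >= 0, 1.0, -1.0)
        return predict
    return fit

two_layer = boost(fit_linear, rounds=5)     # boosted linear models
three_layer = boost(two_layer, rounds=5)    # boosted boosted-linear-models
four_layer = boost(three_layer, rounds=5)   # and so on
```

Each of these is called the same way, for example `three_layer(X, y, np.full(len(y), 1.0 / len(y)))`, and returns a function that maps inputs to labels in {-1, +1}. Note the cost: with 5 rounds per level, the 3-layer version trains 25 linear models.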

Is there an alternative?

Thinking purely procedurally, let's say my weak learner is a linear model. I start boosting and I've got 5 linear models trained in the standard AdaBoost manner. Now I have a choice. Should I train a 6th linear model to throw in to the standard boosting set? Or should I treat the 5 boosted linear models as a new base classifier and boost against the combination? If I choose the latter, I've now gone from two layers to three layers.

Why might it be a good idea to boost against the 5 collectively? Well, if you believe the whole deep learning propaganda, then it's a good idea because deep = good. From a more theoretical perspective, you might hope that the extra level of recursion would get you an increased rate of improvement in the error rate. I.e., the recursion could potentially lead to stronger boosting results than the standard linear boosting. Of course, this is just a hunch: I haven't at all looked to try to figure out if it would actually work in theory. But it seems plausible. For instance, in neural networks theory, we know that a 2 layer network can approximate any (reasonable) function, but you might need an exponential number of hidden units; the number of required hidden units goes down if you make deeper networks (under assumptions).

21 January 2009

Featuritis (term from John Langford) is (in my mind) the process of throwing in a ton of features to a learning system without thinking about it.

Long gone are the days when one had to select a small number of useful features for supervised learning problems. Now that maxent has replaced naive Bayes and CRFs have replaced HMMs, we are free to throw a gigantic number of features into our learning problems with little to no repercussions (beyond a constant computation factor). The one key exception to this is MT, where the number of features is typically kept small because the algorithms that we currently use for feature weight tuning (eg., MERT) scale badly in the number of features. I know there is lots of work to get around this, but I think it's fair to say that this is still not de facto.

I think this is related to the fact that we cherish linear models.

That is, I think that a significant reason that featuritis has flourished is because linear models are pretty good at coping with it; and a reason that linear models have flourished is because they are computationally cheap and can always be "fixed up" by taking a featuritis approach.

I think a great point of contrast is the work that's been done in the machine learning community on using neural networks for solving NLP tasks. This work basically shows that if you're willing to give your machine learning algorithm much more power, you can kind of forget about representation. That is, just give yourself a feature for every word in your vocabulary (as you might, for instance, in language modeling), throw these through a convolution, then through a multilayer neural network and train it in a multitask fashion, making use of (essentially) Ando and Zhang-style auxiliary problems (from, eg., Wikipedia text) to do semi-supervised learning. And you do as well as a featuritis approach.

I suppose this is the standard "prior knowledge versus data" issue that comes up over and over again. Either I can put more prior knowledge into my system (cf., adding more features that I think are going to be useful) or I can put more data into my system. The nuance seems to be that I cannot make this trade-off in isolation. When I add more data to my system, I also have to change my learning model: a simple linear approach no longer cuts it. The linear model on a simple feature space just doesn't have the representational power to learn what I would like it to learn. So I have to go to a more complex function class and therefore need more data to reliably estimate parameters.

So why isn't everyone using neural nets? Well, to some degree we've been conditioned to not like them. Seeing cool positive results makes it a bit enticing to forget why we were conditioned not to like them in the first place. To me, there are basically three advantages that linear models have over multilayer neural nets. The first is that there is very little model selection to do: in a neural net, since I have little experience, I have no idea how to choose the network architecture. The second is training efficiency. Linear models are just ridiculously fast to train, and neural nets (despite all the progress over the past 20 years) are still darn slow. (Although, at least neural nets are fast to predict with; unlike, say, kernel machines.) The third is non-convexity. This means that we probably have to do lots of random restarts.

I doubt the third issue (non-convexity) carries much weight in the NLP community. We're such fans of algorithms like EM (also non-convex) and Gibbs sampling (atrociously not even comparable to notions of convexity) that I can't imagine that this is the thing that's stopping us.

The first issue (choosing network structure) is roughly analogous to choosing a feature representation. I think the difference is that when I say "I add a feature that pulls out a two-character suffix of the word," I can see exactly how this might affect learning and why it might be useful. When I say that I add a new node in a hidden layer of a network, I have no idea really what's going to happen.

The second issue (speed) is actually probably non-trivial. When I'm training a relatively simple classifier or sequence labeler, I kind of expect it to be able to train in a matter of minutes or hours, not days or weeks. The primary issue here doesn't seem to be the representation that's making things so much slower to train, but the fact that it seems (from experimental results) that you really have to do the multitask learning (with tons of auxiliary problems) to make this work. This suggests that maybe what should be done is just to fix an input representation (eg., the word identities) and then have someone train some giant multitask network on this (perhaps a few of varying sizes) and then just share them in a common format. Then, when I want to learn my specific task, I don't have to do the whole multitask thing and can just use that learned network structure and weights as an initial configuration for my network.
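As a rough sketch of what "share them in a common format" could look like: the snippet below is entirely hypothetical. The parameter names (`embed`, `W_hidden`, `W_out`), the `.npz` file format, and the tiny network itself are my assumptions, and averaging word vectors stands in for the convolution used in the actual work. It only shows the mechanics of loading someone else's shared layers as the initial configuration and adding a fresh task-specific output layer.

```python
import numpy as np

def save_shared(path, params):
    """Publish the shared layers (path should end in .npz)."""
    np.savez(path, **params)

def init_from_shared(path, n_outputs, seed=0):
    """Start a task-specific network from the shared weights.
    Only the output layer is initialized from scratch."""
    with np.load(path) as shared:
        params = {name: shared[name].copy() for name in shared.files}
    hidden_dim = params["W_hidden"].shape[1]
    rng = np.random.default_rng(seed)
    params["W_out"] = rng.normal(scale=0.01, size=(hidden_dim, n_outputs))
    return params

def forward(params, word_ids):
    """Score one sentence given as an array of vocabulary indices."""
    x = params["embed"][word_ids].mean(axis=0)  # crude pooling, an assumption
    h = np.tanh(x @ params["W_hidden"])
    return h @ params["W_out"]
```

From here one would fine-tune all of `params` (or just `W_out`) on the specific task, skipping the expensive multitask phase.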

At the end of the day, you're still going to have to futz with something. You'll either stick with your friendly linear model and futz with features, or you'll switch over to the neural networks side and futz with network structure and/or auxiliary problem representation. It seems that at least as of now, futzing is unavoidable. At least network structure futzing is somewhat automatable (see lots of work in the 80s and 90s), but this isn't the whole package.

(p.s., I don't mean to imply that there isn't other modern work in NLP that uses neural networks; see, for instance, Titov and Henderson, ACL 2007.)