16 May 2006

A structured prediction problem is a distribution D over pairs (x,y), with x in X and y in Y, together with a loss function l(y,y'). Typically Y is large, and elements y have complex interdependent structure, which can be interpreted in various ways. The goal is a function f : X -> Y that minimizes the expected loss: E_{x,y ~ D}[l(y,f(x))].

A multitask learning problem is a distribution D over tuples (x,y,y1,...,yK), with x in X and (for simplicity) y,y1,...,yK all in 2={0,1}. The goal is a function f : X -> 2 that minimizes the expected loss: E_{x,y,y1,...,yK ~ D}[1(y != f(x))]. Importantly, the loss doesn't care about y1,...,yK. However, we typically assume that y1,...,yK are related to the x -> y mapping.

Without assumptions on the structured prediction (SP) loss or output space Y, SP subsumes multitask learning (MTL). For instance, in sequence labeling, we can think of a multitask learning problem as a sequence labeling problem where we only care about the label of the first word. This means that we could try to solve an MTL problem by applying, say, a CRF. It also reminds me of some results that suggest that it's really not any easier to predict the first word in a translation than to solve the entire translation problem.
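
To make the reduction concrete, here is a rough sketch of the two losses side by side. The function names (`hamming_loss`, `first_position_loss`, `empirical_loss`) and the toy data are my own, purely for illustration. An MTL example (y,y1,...,yK) is packed into a length K+1 label sequence, and the "MTL as SP" loss simply ignores everything past position 0.

```python
def hamming_loss(y_true, y_pred):
    """A typical SP loss: number of positions where the two sequences disagree."""
    return sum(a != b for a, b in zip(y_true, y_pred))


def first_position_loss(y_true, y_pred):
    """The MTL 0/1 loss viewed as an SP loss: only position 0 (the main task) counts."""
    return int(y_true[0] != y_pred[0])


def empirical_loss(loss, examples, f):
    """Average loss of predictor f over (x, y) pairs, a sample estimate of E[l(y, f(x))]."""
    return sum(loss(y, f(x)) for x, y in examples) / len(examples)


# Toy data (made up): each y is (main label, auxiliary label 1, auxiliary label 2).
examples = [
    ((0.2,), (1, 0, 1)),
    ((0.9,), (0, 1, 1)),
]


def f(x):
    """Hypothetical predictor that always outputs all ones."""
    return (1, 1, 1)


sp_loss = empirical_loss(hamming_loss, examples, f)
mtl_loss = empirical_loss(first_position_loss, examples, f)
```

The same predictor and the same data give different numbers under the two losses, which is the whole point: the output space is shared, and only the loss tells you which problem you are solving.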

One key difference that seems quite important is that results from MTL have suggested that it is easy to get an MTL model that performs better on the first task, but it's hard to get a single MTL model that performs well on all tasks simultaneously. For instance, in neural networks, one might want to run 1000 training epochs to do well on task 1, but only 10 or as many as 10000 to do well on one of the other tasks. This implies that it's probably not a good idea to naively apply something like a CRF to an MTL problem. For this reason (and the nature of the loss functions), I feel that SP and MTL are complementary problems.

The application of MTL to time-series data has reportedly been very successful. Time-series data is nice because it naturally provides additional tasks: in addition to predicting the value at time t, also predict the values at times t+5, t+10 and t+20. During training we have access to these future values, but we don't during testing, so we don't have the option of using them as features.
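
As a minimal sketch of how such training examples might be built: the helper below (the name `make_multitask_examples` and the `window` parameter are my own assumptions, not taken from any particular paper or library) pairs a short history window with the main target at time t and auxiliary targets at t+5, t+10 and t+20.

```python
def make_multitask_examples(series, window=3, aux_offsets=(5, 10, 20)):
    """Build (features, main target, auxiliary targets) triples from a time series.

    features:   the `window` values immediately before time t
    main:       the value at time t
    auxiliary:  the values at t + k for each k in aux_offsets
    Only times t for which every auxiliary target exists are kept.
    """
    examples = []
    last_t = len(series) - max(aux_offsets)  # exclusive upper bound on t
    for t in range(window, last_t):
        x = series[t - window:t]
        y_main = series[t]
        y_aux = [series[t + k] for k in aux_offsets]
        examples.append((x, y_main, y_aux))
    return examples


# Toy series (made up): the value at time t is just t.
series = list(range(30))
examples = make_multitask_examples(series)
```

At test time only `x` is available, so the auxiliary targets can shape the learned representation during training but never show up as inputs.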

One especially cool thing about this observation is that it applies directly to search-based structured prediction! That is, when attempting to make a prediction about some search step, we can consider future search steps to be related tasks and apply MTL techniques. For instance, when trying to predict the first word in a translation, we use the second and third words as related tasks. I, personally, think this would be quite a cool research project.
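
Here is one way the idea might look in code. This is only a sketch under my own assumptions: the function name `search_step_tasks`, the `PAD` symbol and the choice of two future steps are all hypothetical. For each search step i, the main task is the output at step i, and the outputs at the next few steps become related tasks.

```python
PAD = "<pad>"  # hypothetical filler for steps past the end of the output


def search_step_tasks(target, n_future=2):
    """Turn one output sequence into per-step multitask training examples.

    For step i this yields (prefix, main, related), where
    prefix  = the outputs already produced (available to the search step),
    main    = the output at step i (the main task),
    related = the next n_future outputs (auxiliary tasks, training only).
    """
    target = list(target)
    tasks = []
    for i, token in enumerate(target):
        prefix = target[:i]
        future = target[i + 1:i + 1 + n_future]
        future = future + [PAD] * (n_future - len(future))
        tasks.append((prefix, token, future))
    return tasks


tasks = search_step_tasks(["the", "cat", "sat"])
```

For the first word "the", the related tasks are the second and third words "cat" and "sat", matching the translation example above. As in the time-series case, the future words are known during training but not at prediction time.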

So my tentative conclusions are: (1) SP can solve MTL, but this is probably not a good idea. (2) MTL can be applied to search-based SP, and this probably is a good idea. (3) SP and MTL should not be considered the same problem: the primary difference is the loss function, which is all-important to SP, but not so important to MTL. That said, SP algorithms probably do get a bit of gain from the "good internal representation" learned in an MTL setting, but by trying to learn this representation directly, we could probably do much better.