One thing we wanted to understand better is how this approach differs from MIRA. One obvious difference, which the authors emphasize, is that CW captures the variance of individual features as well as the covariance between features, which yields stronger performance. Those are all valid points. But if we strip the feature covariance out of the picture, how does the update optimization problem differ? The answer, I think, is that the two are essentially equivalent, modulo one subtle difference which is probably important. This is probably obvious to machine learning gurus, but it took me a few minutes to work out. I’m sure this observation is even spelled out in one of the CW papers.

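MIRA: Here’s the variant of MIRA I’m working with. You have a current weight vector $\mu'$ and you want to update to a new weight vector $\mu$ based on a new example pair $(x,y^*)$:

$$\min_{\mu} \; \|\mu - \mu'\|^2 \quad \text{s.t.} \quad \mu^T \Delta f \geq \gamma, \qquad \Delta f = f(x,y^*) - f(x,\hat{y})$$

Here $\hat{y}$ is the prediction under $\mu'$, $\gamma > 0$ is typically a fixed constant, and the above update is done only when an error is made ($\hat{y} \neq y^*$).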
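This kind of update has a simple closed form, since it is just a projection of the old weights onto a half-space. Here’s a minimal Python sketch, assuming dense feature vectors stored as plain lists. The names `dot` and `mira_update` are mine, purely for illustration, and the caller is assumed to compute the feature difference and to call this only after a mistake.

```python
def dot(a, b):
    """Inner product of two equal-length dense vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mira_update(mu_prev, delta_f, gamma=1.0):
    """Return the vector closest to mu_prev (in L2) whose margin
    dot(mu, delta_f) is at least gamma.

    delta_f is assumed to be f(x, y*) - f(x, y_hat), computed by the caller.
    """
    margin = dot(mu_prev, delta_f)
    sq_norm = dot(delta_f, delta_f)
    # Constraint already satisfied (or no direction to move in): keep mu_prev.
    if margin >= gamma or sq_norm == 0.0:
        return list(mu_prev)
    # Smallest step along delta_f that lifts the margin to exactly gamma.
    tau = (gamma - margin) / sq_norm
    return [m + tau * d for m, d in zip(mu_prev, delta_f)]


mu = mira_update([0.0, 0.0, 0.0], [1.0, -1.0, 0.0], gamma=1.0)
print(mu)  # [0.5, -0.5, 0.0]
```

The step size `tau` is exactly the amount needed to bring the margin up to `gamma`, so the new weights sit on the boundary of the constraint.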

CW: In contrast, CW doesn’t have a single weight vector; it has a distribution over weight vectors $w \sim \mathcal{N}(\mu,\Sigma)$. In normal CW, the covariance matrix $\Sigma$ is also a parameter. Here, I’m considering a variant where the covariance matrix is fixed to be the identity, so the only parameter is the mean weight vector $\mu$. The update optimization for CW in this context is given by:

$$\min_{\mu} \; \mathrm{KL}\big(\mathcal{N}(\mu, I) \,\|\, \mathcal{N}(\mu', I)\big) \quad \text{s.t.} \quad P_{w \sim \mathcal{N}(\mu, I)}\big(w^T \Delta f \geq 0\big) \geq \eta$$

Here $\mathrm{KL}$ is the Kullback-Leibler divergence and $\eta$ is the required confidence level. If you take a look at the expression for the KL divergence between two Gaussians, it’s pretty straightforward to see that if the covariance matrices are the identity, the KL divergence is a constant multiple of $\| \mu - \mu' \|^2$ (namely $\tfrac{1}{2}\| \mu - \mu' \|^2$).
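For reference, here is that calculation worked out, using the standard closed form for the KL divergence between two $d$-dimensional Gaussians. The subscripted symbols $\mu_0, \Sigma_0, \mu_1, \Sigma_1$ are generic placeholders introduced only for this formula.

$$
\begin{aligned}
\mathrm{KL}\big(\mathcal{N}(\mu_0,\Sigma_0) \,\|\, \mathcal{N}(\mu_1,\Sigma_1)\big)
  &= \tfrac{1}{2}\Big[\operatorname{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + (\mu_1-\mu_0)^T \Sigma_1^{-1} (\mu_1-\mu_0) - d + \ln\frac{\det \Sigma_1}{\det \Sigma_0}\Big] \\
\mathrm{KL}\big(\mathcal{N}(\mu, I) \,\|\, \mathcal{N}(\mu', I)\big)
  &= \tfrac{1}{2}\Big[d + \|\mu - \mu'\|^2 - d + 0\Big] = \tfrac{1}{2}\,\|\mu - \mu'\|^2
\end{aligned}
$$

With identity covariances the trace term cancels the $-d$ and the log-determinant term vanishes, so only the squared distance between the means is left.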

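Now for the constraint. The first thing to notice is that, since $w \sim \mathcal{N}(\mu, I)$, the score gap is itself Gaussian:

$$w^T \Delta f \sim \mathcal{N}\big(\mu^T \Delta f, \; \|\Delta f\|^2\big)$$

So if $Z$ is a zero-mean unit-variance Gaussian, we want

$$P\big(\mu^T \Delta f + \|\Delta f\| \, Z \geq 0\big) \geq \eta, \quad \text{i.e.} \quad P\left(Z \geq -\frac{\mu^T \Delta f}{\|\Delta f\|}\right) \geq \eta$$

If $\Phi$ is the cumulative distribution function for the unit-normal, we want:

$$\Phi\left(\frac{\mu^T \Delta f}{\|\Delta f\|}\right) \geq \eta, \quad \text{or equivalently} \quad \mathrm{erf}\left(\frac{\mu^T \Delta f}{\sqrt{2} \, \|\Delta f\|}\right) \geq 2\eta - 1$$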

Here’s the subtlety: if we assume that our feature vectors $f(x,y)$ are normalized and that for any two $y,y'$ the vectors $f(x,y)$ and $f(x,y')$ don’t overlap in non-zero features (which is common in NLP, since weight vectors are partitioned for different $y$s), then $\| \Delta f \|$ is a constant independent of the particular update ($\sqrt{2}$ for unit-norm features). In that case, ensuring $\mathrm{erf}(c \, \mu^{T} \Delta f) \geq 2\eta - 1$, where $c = 1/(\sqrt{2} \, \| \Delta f \|)$ and assuming $\eta > 0.5$, just amounts to making sure $\mu^{T} \Delta f$ exceeds some constant independent of the particular update, which is equivalent to selecting that choice of $\gamma$ in MIRA. So the two optimizations are essentially the same.
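To make the constraint concrete, here is a small sketch using only the Python standard library. It assumes the identity covariance used throughout this post; `prob_correct` and `required_margin` are hypothetical helper names of my own, not anything from the CW papers.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def prob_correct(mu, delta_f):
    """P(w . delta_f >= 0) for w ~ N(mu, I), i.e. Phi(mu . delta_f / ||delta_f||)."""
    margin = sum(m * d for m, d in zip(mu, delta_f))
    return STD_NORMAL.cdf(margin / l2_norm(delta_f))


def required_margin(delta_f, eta):
    """Smallest value of mu . delta_f for which prob_correct(...) reaches eta."""
    if not 0.5 < eta < 1.0:
        raise ValueError("eta is assumed to lie in (0.5, 1)")
    return l2_norm(delta_f) * STD_NORMAL.inv_cdf(eta)


eta = 0.9
two_active = [1.0, -1.0]                    # one active feature per label
eight_active = [1.0] * 4 + [-1.0] * 4       # four active binary features per label
print(required_margin(two_active, eta))
print(required_margin(eight_active, eta))   # twice the value above, up to rounding
```

When every feature difference has the same norm, `required_margin` returns the same number for every update, which is the sense in which it plays the role of a fixed $\gamma$.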

However, if feature vectors are not normalized, then the two aren’t equivalent. Essentially, the larger the feature vector norm, the larger the “gap” term $\mu^T \Delta f$ needs to be. If you have exclusively binary features, which many NLP applications do, this means the more features active in a datum, the larger the “gap” ($\mu^T \Delta f$) we require. This makes a lot of sense. We can get this in MIRA pretty straightforwardly: if we change the constraint to

$$\mu^T \Delta f \geq \gamma \, \|\Delta f\|$$

then it’s always equivalent modulo constant choices. I don’t actually know if the $\| \Delta f \|$ scaling improves accuracy, but I wouldn’t be surprised if it did.
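As a sanity check on the equivalence, here is a hedged sketch of a norm-scaled MIRA update next to the fixed-covariance CW constraint. It assumes the closed-form half-space projection for the update and takes $\gamma = \Phi^{-1}(\eta)$ as the “constant choice”; `scaled_mira_update` and `cw_confidence` are illustrative names of my own.

```python
import math
from statistics import NormalDist


def scaled_mira_update(mu_prev, delta_f, gamma):
    """Project mu_prev onto {mu : mu . delta_f >= gamma * ||delta_f||}."""
    sq_norm = sum(d * d for d in delta_f)
    if sq_norm == 0.0:
        return list(mu_prev)
    margin = sum(m * d for m, d in zip(mu_prev, delta_f))
    target = gamma * math.sqrt(sq_norm)  # required gap grows with the norm
    if margin >= target:
        return list(mu_prev)
    tau = (target - margin) / sq_norm
    return [m + tau * d for m, d in zip(mu_prev, delta_f)]


def cw_confidence(mu, delta_f):
    """P(w . delta_f >= 0) for w ~ N(mu, I)."""
    margin = sum(m * d for m, d in zip(mu, delta_f))
    norm = math.sqrt(sum(d * d for d in delta_f))
    return NormalDist().cdf(margin / norm)


eta = 0.9
gamma = NormalDist().inv_cdf(eta)  # assumed mapping between the two constants
for delta_f in ([1.0, -1.0], [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]):
    mu = scaled_mira_update([0.0] * len(delta_f), delta_f, gamma)
    print(len(delta_f), cw_confidence(mu, delta_f))
```

Starting from zero weights, both updates land on a confidence of $\eta$ (up to floating-point error) even though the two feature differences have different norms, which is what the scaled constraint is meant to achieve.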

This algorithm is actually called the Passive Aggressive algorithm, but I’ve always known it as MIRA.