The Long Road to CRFs

CRFs are Done

The first bit of good news is that LingPipe 3.9 is within days of release. CRFs are coded, documented, unit tested, and I’ve even written a long-ish tutorial with hello-world examples for tagging and chunking, and a longer example of chunking with complex features evaluated over:

And They’re Speedy

The second bit of good news is that it looks like we have near state-of-the-art performance in terms of speed. It’s always hard to compare systems without exactly recreating the feature extractors, requirements for convergence, hardware setup and load, and so on. I was looking at

for comparison. Okazaki also evaluated first-order chain CRFs, though on the CoNLL 2000 English phrase chunking data, which has fewer tags than the CoNLL 2003 English named entity data.

While my estimator may be a tad slower (it took about 10s/epoch during stochastic gradient), I’m applying priors, and I think the tag set is a bit bigger. (Of course, if you use IO encoding rather than BIO encoding, like the Stanford named entity effort, then there’d be even fewer tags; on the other hands, if I followed Turian et al. (ref below), or the way we handle HMM encoding, there’d be more.)

It also looks like our run time is faster than the other systems benchmarked if you don’t consider feature extraction time (and I don’t think they did in the eval, but I may be wrong). It’s running at 70K tokens/second for full forward-backward decoding; first-best would be faster.

All the Interfaces, Please

Like for HMMs, I implemented first-best, n-best with conditional probabilities, and a full forward-backward confidence evaluation. For taggers, confidence is per tag per token; for chunkers, it’s per chunk.

Final Improvements

Yesterday, I was despairing a bit over how slow my approach was. Then I looked at my code, instrumented the time spent in each component, and had my D’oh! moment(s).

Blocked Prior Updates

The first problem was that I was doing dense, stochastic prior updates. That is, for every instance, I walked over the entire set of dense coefficient vectors, calculated the gradient, and applied it. This was dominating estimation time.

I can vouch for the fact that result vectors didn’t change much, and speed was hugely improved to the point where the priors weren’t taking much of the estimation time at all.

Caching Features

The other shortcoming of my initial implementation was that I was extracting features online rather than extracting them all into a cache. After cleaning up the prior, feature extraction was taking orders of magnitude longer than everything else. So I built a cache, and added yet another parameter to control whether to use it or not. With the cache and rich feature extractors, the estimator needs 2GB; with online feature extraction, it’s about 20 times slower, but only requires around 300MB of memory or less.

Bug Fixes

There were several subtle and not-so-subtle bugs that needed to be fixed along the way.

One problem was that I forgot to scale the priors based on the number of training instances during estimation. This led to huge over-regularization. When I fixed this problem, the results started looking way better.

Structural Zeros

Another bug-like problem I had is that when decoding CRFs for chunkers, I needed to rule out certain illegal tag sequences. The codec I abstracted to handle the encoding of chunkers and taggers and subsequent decoding could compute legal tag sequences and consistency with tokenizers, but the CRF itself couldn’t. So I was getting illegal tag sequences out that caused the codec to crash.

So I built in structural zeros. The simplest way to do it seemed to be to add a flag that only allowed tag transitions seen in the training data. This way, I could enforce legal start tags, legal end tags, and legal transitions. This had the nice side benefit of speeding things up, because I could skip calculations for structural zeros. (This is one of the reasons Thorsten Brants’ TnT is so fast — it also applies this strategy to tags, only allowing tags seen in training data for given tokens.)

Feature Extraction Encapsulation

I was almost ready to go a couple of days ago. But then I tried to build a richer feature extractor for the CoNLL entity data that used part-of-speech tags, token shape features, contextual features, prefixes and suffixes, etc. Basically the “baseline” features suggested in Turian, Ratinov, Bengio and Roth’s survey of cluster features (more to come on that paper).

So I raised the abstraction level so that the features for a whole input were encapsulated in a features object that was called lazily by the decoders and/or estimator. This allowed things like part-of-speech taggings to be computed once and then cached.

This will also allow online document features (like previous tagging decisions) to be rolled into the feature extractor, though it’ll take some work.

And a Whole Lotta’ Interfaces and Retrofitting

I added a whole new package, com.aliasi.tag, to characterize the output of a first-best, n-best, and marginal tag probability tagger. I then implemented these with CRFs and retrofitted them for HMMs. I also pulled out the evaluator and generalized it.

Along the way, I deprecated a few interfaces, like corpus.ChunkHandler, which is no longer needed given corpus.ObjectHandler<Chunking>.

Still No Templated Feature Extraction

Looking at other CRF implementations, and talking to others who’d used them, I see that designing a language to specify feature extractions is popular. Like other decisions in LingPipe, I’ve stuck to code-based solutions. The problem with this is that it limits our users to Java developers.

3 Responses to “The Long Road to CRFs”

If I understand your “blocked prior” comment correctly, you can actually make this update exactly in the case of L2. In the case of L1, you cannot exactly achieve sparsity at every instance, but you can still, with bookkeeping, efficiently incorporate the gradient of the regularizer at each instance.

The L2 (Gaussian) version is discussed in the Shalev-Shwartz and Singer Pegasos paper (ICML 2007), Since the portion of the gradient corresponding to the regularizer is just a scaling, you can keep track of the updates using 2 extra scalars. Duchi & Singer’s JMLR paper “Online and Batch Learning using Forward Backward Splitting” give details for L1 & L\infty, L1 is essentially the truncation algorithm of Langford et al. (2008).

Thanks for the pointers — I hadn’t seen Duchi and Singer, but I’ll check it out. I’m still a little sketchy on the theoretical guarantees in terms of what’ll converge to the correct coefficient estimates.

What I’m doing is a bit different. I’m updating the likelihood gradient fully stochastically (one item at a time). I’m updating the prior every N items. For the likelihood, there’s a multiplier of learning rate. For the prior, a multiplier of learning rate * N / NumberOfItems.

A standard blocking approach would compute the gradient for likelihood for all N items at once, then update. (Thus making it easier to parallelize than what I’m doing.) And it lets you do ordinary (conjugate) gradient by setting N to the number of items.

I only implemented the prior the way I did because I was doing fully item-at-a-time updates and found the full dense prior update was dominating computation time by an order of magnitude.

When I coded logistic regression classifiers, I applied the prior in a lazy fashion, which seems to be pretty standard. I just kept an array storing how many items had been processed before I’d last updated a prior. Then I only updated a prior on a dimension before seeing an item with a non-zero value for a dimension. That didn’t seem like it’d be nearly as efficient for CRFs, because of the lattice of feature vectors all needing to update the last-seen array.

The lazy approach doesn’t quite apply the same penalty as L2, because the gradient is proportional to the value. Is that what Shalev-Schwartz and Singer discuss? I always figured someone good enough at math could figure out the “compound interest”. I’ll have to go back and look that up. (And could you do the Cauchy prior, too, while you’re at it, so I’ll have the complete set?)

The lazy approach does apply the same penalty as item-at-a-time L1, because the gradient’s just a (truncated) constant.

>> The lazy approach doesn’t quite apply the same penalty as L2,
>> because the gradient is proportional to the value. Is that what
>> Shalev-Schwartz and Singer discuss?
You’re right. Effectively what happens is that “compound interest” builds up. This is exactly what Shalev-Shwartz and Singer discuss. I’m not sure about Cauchy priors. I’ve never worked with them before. There might be some extra bookkeeping there.

I haven’t done any real hardcore benchmarking, but when I’ve coded CRFs and used L2 priors, the sparse updates suggested by Pegasos work equally well there. It depends on how sparse your lattice of feature vectors is. But at least for NER and some word alignment stuff, it still was helpful. Of course, you’re right in noting that computing the gradient for the regularizer is trivial compared to forward-backward.