16 August 2016

I wrote my first (and only) coreference paper back in 2005. At the time, my goals were to (a) do well on coref, (b) integrate background knowledge (like "Bush" is "president") using simple techniques, and (c) try to figure out how important different (types of) features were for making coreference decisions.

For the last, there is a reasonably extensive feature-type ablation experiment using backward selection (which I trust far more than forward selection). After writing the paper, I had many internal dialogues about why experiments like that are interesting. I have had, over the years, a couple of answers:

The obvious answer is "it tells us something interesting about language." It would be nice if this were true, but I'm not totally sure it is, and it's definitely not true if one doesn't put a bunch more effort into it than I put into that 2005 paper. What can we say? Yeah, spelling is important. Knowledge is important. Syntax is hard to actualize. I don't know that we didn't already know these things before.

Engineering. Suppose someone wanted to build a similar system. They want to put their effort where it's most valuable, and so feature ablation experiments tell you where you're likely to get the most bang for the buck. In a sense, you can see these as a type of negative result. Which features actually aren't that important. In the 2005 paper, you could remove syntactic, semantic, and class-based features with zero performance degradation; and also get rid of pattern-based features with minor performance degradation. This saves a lot of effort because some of these are actually quite a pain to implement and/or are slow and/or require lots of external resources.

Today, I mostly lean toward the engineering answer, or at least that's what I want to use as a jumping off point here.

Now that we're partially allergic to feature engineering and prefer to replace it with architecture engineering, I think the charge is stronger, not weaker, to do ablation experiments. Does that thing really need to be a biLSTM? Would an RNN suffice? What about just averaged bag of word embeddings? Do you need two layers of attention there or would one suffice? Do you need attention at all? Does that layer really need to be that wide?

These are all easy questions to ablate and answer.

There's never going to be a crisp answer like "yes, if I cut my hidden state from 493 units to 492 units performance goes down the drain." Many things will be gradual, but not all.

Why do I think this is important? Precisely for reason #2 above, but about a bajillion times more so. Training these really complicated models with wide hidden units, bidirectional stuff, etc., is really slow. Really really slow. If you tell me I can be within 1% accuracy but can train 100 times faster, I'm going to do it. Sure, for a final test run I might crank everything up again (and then report that!) but for development, it's super useful to have a system you can train and evaluate efficiently.

Does this tell us anything interesting about language? Almost certainly not (or at least not without a huge amount of extra work). But it does make everyone's life better.

4 comments:

Getting a handle on how difficult the problem is sometimes far more important than simply beating the best existing approach. Looking at learning curves and how they interact with data/architectural features often gives you that insight. That is occasionally of linguistic interest (sometimes the answer is that there's no data like more data, but not always). The engineering costs are, as you mention, also good to put the problem in perspective.

Yeah that's a good point. Everything is a function of dataset size, and I think we're learned repeatedly that tradeoffs at small data don't always extend to large (eg Banko+Brill, or phrase-based MT versus neural MT).