20 November 2006

I gave a talk today at our small group meeting about feature engineering for coreference resolution. I think people (myself occasionally included) who spend a lot of time working on fancy machine learning algorithms often underappreciate that the difference between success and failure frequently hinges more on clever features than on fancy learning. For those who want to look at the slides, you can grab an OpenOffice or PDF version for your perusal. I forewarn that they may be a bit hard to follow without the soundtrack, but they're sort of an "untold story" version of the coref paper we had at HLT/EMNLP 2005.

So what did we do that's clever (or, at least, as clever as I knew at the time and know now)? Well, a few things. First, I tried extending the notion of Hobbs' distance to discourse trees. This actually works surprisingly well. The true referent is the first element according to syntactic Hobbs' distance in about 70-80% of cases, and we see (almost) the same thing at the discourse level. The only oddity is that you have to flip the nuclearity of the attribution relation. Then you get about 80% holding, assuming perfect discourse trees. Of course, we never have perfect discourse trees, and the noise introduced by poor parsing is enough to make this feature more or less useless.

The second thing we tried was to use the name/instance data that Mike Fleischman gathered at ISI before going off to MIT. This data was gathered in the context of Q/A, to answer questions like "Who is the president of the U.S.?" using lookup tables. But it's great data for coref! It includes about 2 million examples of names paired with their occupations. You may find this data interesting too; Mike describes it more in his ACL 2003 paper. These pairs make a huge difference in coref performance.
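To make the idea concrete, here is a hedged sketch of how a name/occupation table can feed a coreference feature: fire when a descriptive antecedent's head noun is a known occupation of the anaphoric name. The table entries and function name below are made up for illustration.

```python
# Hypothetical miniature of the name/occupation lookup table; the real
# data contains about 2 million name/occupation pairs.
OCCUPATIONS = {
    "bill clinton": {"president"},
    "alan greenspan": {"chairman"},
}

def occupation_match(name, head_noun):
    """Fire a positive coref feature if head_noun (e.g. from 'the
    president') is a known occupation of the named mention."""
    return head_noun.lower() in OCCUPATIONS.get(name.lower(), set())

print(occupation_match("Bill Clinton", "president"))  # True
print(occupation_match("Bill Clinton", "senator"))    # False
```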

The next thing that helped a lot was a clever, non-standard use of gazetteers. What we do is take the ID of the gazetteer to which the anaphor belongs and pair it with the lexical item of the antecedent. Why? Well, suppose we have a gazetteer that contains words like "California," "Utah," and "New York." By pairing this list ID (e.g., "list-5") with antecedents, we get features like "list-5 *AND* state" or "list-5 *AND* country," etc. We'll probably learn that the former is a good feature and the latter is not. We also have a feature that checks whether the two referents are on the same gazetteer but are not the same word. Why? This is a negative feature: "California" and "Utah" appear on the same list (states) but are not identical (string-wise), so they're probably not coreferent.
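A minimal sketch of these two gazetteer features, assuming lowercased mention strings; the gazetteer IDs, feature encoding, and helper name are all hypothetical.

```python
def gazetteer_features(anaphor, antecedent, gazetteers):
    """Pair the ID of each gazetteer containing the anaphor with the
    antecedent's lexical item, and fire a separate negative feature
    when both mentions are on the same list but are different strings."""
    feats = []
    for list_id, words in gazetteers.items():
        if anaphor in words:
            # e.g. ("list-5", "state") -- hopefully learned as positive
            feats.append((list_id, antecedent))
            if antecedent in words and anaphor != antecedent:
                # e.g. California/Utah: same list, not string-identical
                feats.append((list_id, "same-list-diff-word"))
    return feats

gazetteers = {"list-5": {"california", "utah", "new york"}}
print(gazetteer_features("california", "state", gazetteers))
# [('list-5', 'state')]
```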

We tried pretty hard to get WordNet to work, but it just kept not helping. WordNet distance is a terrible measure (e.g., "Japan" and "Russia" have distance 2 -- they are both under "country"). On the other hand, as in the gazetteer case, "nearby in WordNet but not the same word" is a reasonably good feature, though it turned out not to help much. Hypernym and hyponym were the most promising, but we never actually got any benefit from them.
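The Japan/Russia problem is easy to see on a toy hypernym hierarchy standing in for WordNet (the entries below are an assumption, chosen to reproduce the example above): two barely related countries end up at distance 2 through their shared hypernym.

```python
# Toy hypernym map: each synset points to its (single) hypernym.
HYPERNYM = {"japan": "country", "russia": "country", "country": "region"}

def wn_distance(a, b):
    """Length of the shortest path from a to b going up the hypernym
    hierarchy to a common ancestor and back down (None if unconnected)."""
    def depths(x):
        depth, d = {x: 0}, 0
        while x in HYPERNYM:
            x = HYPERNYM[x]
            d += 1
            depth[x] = d
        return depth
    da, db = depths(a), depths(b)
    common = set(da) & set(db)
    return min(da[c] + db[c] for c in common) if common else None

print(wn_distance("japan", "russia"))  # 2, despite low semantic relatedness
```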

The rest of the stuff in the talk made it fairly close to the front of the paper. The most surprising result was that count-based features help a lot! These look at things like the entity-to-mention ratio, the number of entities detected so far, etc. They can't be included in a simple coref model that makes pairwise decisions, but for us it's easy.
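A hedged sketch of what such global count-based features might look like (the exact feature set in the paper may differ, and the names here are made up); note they depend on the whole partial clustering, which is exactly why a pairwise model can't use them.

```python
def count_features(entities, n_mentions_seen):
    """Global features over the coreference state so far: `entities` is
    the current partition (a list of mention clusters), n_mentions_seen
    the number of mentions processed left-to-right."""
    n_entities = len(entities)
    return {
        "num-entities": n_entities,
        "entity-mention-ratio":
            n_entities / n_mentions_seen if n_mentions_seen else 0.0,
    }

# After 4 mentions grouped into 2 entities:
print(count_features([["Bush", "he"], ["Clinton", "she"]], 4))
# {'num-entities': 2, 'entity-mention-ratio': 0.5}
```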

The end result is that using just string-based features (standard "substring" and edit distance stuff), lexical features (word pairs), count-based features, and Mike's data, we can get an ACE score of 87.6 using LaSO for the learning (this goes up about 0.8 in most cases if you use Searn, but I don't have all the numbers). Adding in some inference to predict things like number and gender gets you up to 88.3. Gazetteers get you up to 88.7. After that, the rest of the features (regular-expression patterns, word clusters, WordNet, discourse, and syntax) don't help much: all together, they get you up to 89.1, but that's not much bang for the buck.

My conclusion is basically that while learning makes a difference (Searn is a bit better than LaSO), the feature engineering makes a bigger one[*]. Many of the features we found useful were a priori not really that obvious. I spent a month or two of my life trying to come up with ways of being clever to help improve performance (playing on dev data so as to be fair). It's definitely a non-trivial enterprise, but it often pays off.

[*] This is assuming that your learning is robust enough to handle the features you want: for instance, the standard pairwise models could not handle the count-based features that helped us a lot.

11 comments:

The biggest contrast is whether you're doing resolution against a database. It seems to be a whole different field of research, known as record linkage or database deduplication, depending on whether the merging is inter- or intra-DB.

For instance, WestLaw resolves mentions of expert witnesses and lawyers in case law against their professional registration/licensing.

We do this using context to map mentions of genes in biomedical research articles to their citations in EntrezGene. A typical request from customers is to link product mentions to their respective pages (e.g. ShopWiki or Froogle).

What Hal's talking about is the more knowledge-free clustering problem (I love the plug-and-play architecture in his linked ppt). The issue of feature selection is rather different within documents and across documents. Within documents, syntactic features are key, and the one-sense-per-document approximation is usually a reasonable heuristic.

Across documents, context is hugely helpful. In fact, it's the only way to handle token-by-token identical mentions, as in the umpteen John Smiths on the web.

"My conclusion is basically that while learning makes a difference ... the feature engineering makes a bigger one[*]. Many of the features we found useful were a priori not really that obvious. I spent a month or two of my life trying to come up with ways of being clever to help improve performance (playing on dev data so as to be fair). It's definitely a non-trivial enterprise, but it often pays off."

Agreed.

The interesting thing for me, as an ML researcher, is that we have a pretty good understanding of learning methods, inference algorithms, etc. However, automatic methods for feature engineering are little studied.

That's a shame because the potential payoff of good automatic feature engineering is huge. It means you can tackle new problems with relatively little time investment and without a human expert in the loop. I expect this research direction would have a big impact on the field.

For the past two years, I've been thinking about methods for automatically inducing features over structured prediction problems, and it's become one of my favorite "if I only had the time" questions. I've spent enough idle time wrestling with this problem that I've refined my ideas down to a very simple approach that would be easy to implement for any sequence labelling or tree prediction task, yet powerful enough to capture all commonly used features I know about. If only I had the time...

I think the automatic feature engineering thing is really, really interesting, but I have to admit I'm a bit skeptical. I'm fully on board with Andrew McCallum's CRF-style feature induction, where you start out with a bunch of base features and then start adding conjunctions (or disjunctions, or implications, or ...). This is a hard search problem (but we could combine this with learning to search...), but doable.
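For concreteness, here is a sketch of just the proposal step in that style of feature induction: generate pairwise conjunctions of currently active base features. The scoring and selection machinery (which is where the real work happens) is omitted, and the feature names are hypothetical.

```python
from itertools import combinations

def propose_conjunctions(base_feats):
    """Propose all pairwise conjunctions of active base features; a
    real induction loop would score each candidate (e.g. by gain in
    log-likelihood) and keep only the best."""
    return {f"{a}&{b}" for a, b in combinations(sorted(base_feats), 2)}

print(sorted(propose_conjunctions({"cap", "suffix=ing", "prev=the"})))
# ['cap&prev=the', 'cap&suffix=ing', 'prev=the&suffix=ing']
```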

But I see a big difference between this and a truly automatic feature engineering technique. I feel that the set of "base" features is pretty much the most important thing, and that conjunctions and such on top of that are a bit of icing. But coming up with good base features is really hard, and I really don't see how it could be done automatically. It seems that, to be fair, one would have to do everything at the character level, but then you just lose so much information...

Example: in the coref stuff, there are a lot of words that are syntactically singular but semantically plural (e.g., "group", "company", etc.). If it weren't for these, a "number" feature would be very useful, but with them, the number feature often hurts rather than helps. So I grepped a ton of data for the string "members of the ____", pulled out all syntactically singular nouns that appeared in the "____" slot, and then adjusted the number feature to consider these as plural. Voila! Performance went up.
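The mining step above can be sketched as a small corpus scan (the pattern and names are hypothetical, and a real pipeline would additionally filter the matches to syntactically singular nouns):

```python
import re

# Semantically plural frame: a singular noun filling the slot in
# "members of the ___" is probably a collective noun.
PATTERN = re.compile(r"\bmembers of the (\w+)")

def semantically_plural_nouns(corpus_lines):
    """Collect nouns appearing in the 'members of the ___' slot, to be
    treated as plural by the number-agreement feature."""
    found = set()
    for line in corpus_lines:
        found.update(m.lower() for m in PATTERN.findall(line))
    return found

corpus = ["Several members of the committee resigned today.",
          "Two members of the group disagreed."]
print(sorted(semantically_plural_nouns(corpus)))  # ['committee', 'group']
```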

It's hard to imagine an automatic system coming up with things like this, unless you put in as one of its abilities "come up with regular expressions to grep through a corpus with." But then how do you come up with those expressions, and how do you use the resulting information?

I'm not sure McCallum's feature induction technique is powerful enough. He can find features of the form "there exists an item I1 over which predicate P1 holds and there exists an item I2 over which predicate P2 holds", but not features of the form "there exists an item I over which predicates P1 and P2 hold". If I'm not mistaken, his technique cannot induce complex item predicates, and all item predicates must be specified in advance.

I feel that the set of "base" features is pretty much the most important thing, and that conjunctions and such on top of that is a bit of icing.

I'm not sure I agree. The "base" features (what I call atomic item predicates) are important, but the relations between the items may be important too.

For example, "there is item I1 with label L1 and item I2 with label L2 and I1 is the head child of I2". The head-childness relation may be more important than whether we examine the label or the headtag of the individual items.

Sorry if my terminology is confusing, I haven't really refined it yet. When I say "predicate", I mean a boolean function over items with arity one, i.e. P: I -> {+1, -1}. When I say "relation", I mean a boolean function over items with arity two, i.e. R: I X I -> {+1, -1}. I assume relations are non-decomposable into predicates. I also believe that, for problems like sequence labelling and tree prediction, arity one and two boolean item functions are sufficiently powerful to represent most common features.

I've been thinking primarily about induction of binary item relations. In sequence labelling, we can give an item as a discrete label and a location in the sequence. The label contains all predicate information, e.g. headtag, headlabel, etc., i.e. every bit of information about the item that is independent of any other item in the structure. The location is the information about the item that is only useful with respect to other items. We can then begin with two atomic binary item relations: "i1 is left of i2" and "i1 is at the same location as i2". From these atomic relations, we can---in principle---induce more complex relations like "i1 is one to the left of i2" ("i1 is left of i2 and there does not exist an item i3 such that (i1 is left of i3 and i3 is left of i2)").

Similarly, for tree prediction problems, an item can be given as the discrete label and the left and right location of the span. Using the atomic relations above (for either the left location or the right location), we have sufficient power to induce more complex relations like "i1 is the leftmost child of i2".
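The items-and-relations setup described above can be sketched directly, with relations as boolean functions of two items and the induced relation built from the atomic ones (all names here are hypothetical):

```python
# Items carry a label and a position; relations are arity-two boolean
# functions over items.
def left_of(i1, i2):
    return i1["pos"] < i2["pos"]

def same_loc(i1, i2):
    return i1["pos"] == i2["pos"]

def one_left_of(i1, i2, items):
    """Induced relation: i1 is left of i2 and no item lies between
    them (the negated-existential construction from the text)."""
    return left_of(i1, i2) and not any(
        left_of(i1, i3) and left_of(i3, i2) for i3 in items)

items = [{"label": "DT", "pos": 0},
         {"label": "JJ", "pos": 1},
         {"label": "NN", "pos": 2}]
print(one_left_of(items[0], items[1], items))  # True
print(one_left_of(items[0], items[2], items))  # False (JJ lies between)
```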

What I've been focusing on is automatic methods for inducing relations between items, because this is a closed domain. I simplify the item predicate problem by reducing all items to a small label set (e.g. the constituent label) and have designed some techniques for inducing useful relations. These relations could then be used with more complex predicates.

You are right that the item predicates may be very important too. I haven't given much thought to this problem, but it seems like we could make some progress just by examining the item predicate problem in isolation (i.e. "what predicate is needed to discriminate successfully on this task?"). Predicate creation may still require significant human feature-engineering effort.