Optimizing Feature Extraction

A widely quoted optimization tip is that the fastest code is code that isn’t executed. The same goes for memory; if you want to use less, don’t allocate. I’m currently applying this optimization technique to feature extraction. As I’ve discussed here before, feature extraction is the slowest aspect of linear classification, which is otherwise just a sequence of multiply and add operations.

As things stand in LingPipe 3.6, features are extracted from generic objects and passed to classifiers as maps from names represented as strings to numbers. This is mediated by the feature extractor interface:
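The interface itself isn't reproduced here, but a sketch along the lines of LingPipe's `util.FeatureExtractor`, with a toy bag-of-words implementation for illustration, would look like:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the feature-extractor interface; the name follows
// LingPipe's util.FeatureExtractor, but treat this as an illustration.
interface FeatureExtractor<E> {
    Map<String, ? extends Number> features(E in);
}

// Toy implementation: whitespace-tokenized bag of words.
class TokenCountExtractor implements FeatureExtractor<String> {
    public Map<String, ? extends Number> features(String in) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : in.split("\\s+"))
            counts.merge(tok, 1, Integer::sum);
        return counts;
    }
}
```

Note that every call to `features` allocates a fresh map plus a string key and boxed number per feature, which is exactly the garbage the rest of the post is about eliminating.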

Even worse, the map gets converted to an instance of matrix.Vector, using a symbol.SymbolTable to store the feature-to-dimension mapping. For binary logistic regression, the vector is then multiplied by a parameter vector stored in the classifier, with inner products greater than zero assigned category 1.
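As a rough sketch of that pipeline (with simplified stand-ins for matrix.Vector and symbol.SymbolTable, not LingPipe's actual classes):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for symbol.SymbolTable: maps feature names
// to integer dimensions, assigning new dimensions on demand.
class SimpleSymbolTable {
    private final Map<String, Integer> symbolToId = new HashMap<>();
    int getOrAddSymbol(String symbol) {
        Integer id = symbolToId.get(symbol);
        if (id == null) {
            id = symbolToId.size();
            symbolToId.put(symbol, id);
        }
        return id;
    }
}

// Binary logistic regression over a feature map: convert each feature
// name to a dimension, take the inner product with the parameter
// vector, and assign category 1 if the result is positive.
class BinaryLogisticClassifier {
    private final double[] parameters;
    private final SimpleSymbolTable table;
    BinaryLogisticClassifier(double[] parameters, SimpleSymbolTable table) {
        this.parameters = parameters;
        this.table = table;
    }
    int classify(Map<String, ? extends Number> featureMap) {
        double sum = 0.0;
        for (Map.Entry<String, ? extends Number> e : featureMap.entrySet()) {
            int dim = table.getOrAddSymbol(e.getKey());
            if (dim < parameters.length)
                sum += parameters[dim] * e.getValue().doubleValue();
        }
        return sum > 0.0 ? 1 : 0;
    }
}
```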

Using callbacks for inversion of control, we can get rid of the map and vector altogether. At the cost of some noise from hash collisions, we can also get rid of the symbol table.

Implementations of the handler interface receive callbacks to process vector entries. For binary logistic regression, the handler is simple: it increments an accumulator by the product of the feature value and the parameter vector's value for that dimension:
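The original snippet isn't shown, but a minimal sketch of the handler interface and the logistic-regression accumulator (names here are illustrative, not LingPipe's exact API) might be:

```java
// Callback interface: receives one (dimension, value) pair per call.
interface VectorHandler {
    void handle(int dimension, double value);
}

// Binary logistic regression handler: accumulates the inner product
// of the streamed feature vector with the parameter vector.
class InnerProductHandler implements VectorHandler {
    private final double[] parameters;
    private double accum = 0.0;
    InnerProductHandler(double[] parameters) {
        this.parameters = parameters;
    }
    public void handle(int dimension, double value) {
        // Calling the same dimension twice simply sums the values.
        accum += value * parameters[dimension];
    }
    double innerProduct() {
        return accum;
    }
}
```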

A handler is used for each classification, getting the dimensions and values fed to it through callbacks. Calling the same dimension twice is equivalent to summing the feature values, making this especially convenient for structured classification settings.

For instance, the type E could be an XML document represented as a DOM tree, or it might be a sequence of tokens tagged with part-of-speech tags. The parser extracts features and sends them to the handler. In code, the classifier is simple:
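The classifier code isn't reproduced here, but a sketch of the callback-based arrangement described above (interface and class names are illustrative, not LingPipe's actual API) could be:

```java
// Parser that extracts features from an input of type E and streams
// them to a handler as (dimension, value) pairs.
interface FeatureVectorParser<E> {
    void parse(E in, VectorHandler handler);
}

// Callback interface: receives one (dimension, value) pair per call.
interface VectorHandler {
    void handle(int dimension, double value);
}

// Linear classifier: no map, no vector, just an accumulator fed
// through callbacks.
class LinearClassifier<E> {
    private final double[] parameters;
    private final FeatureVectorParser<E> parser;
    LinearClassifier(double[] parameters, FeatureVectorParser<E> parser) {
        this.parameters = parameters;
        this.parser = parser;
    }
    int classify(E in) {
        AccumulatorHandler handler = new AccumulatorHandler();
        parser.parse(in, handler);
        return handler.sum > 0.0 ? 1 : 0;
    }
    // The handler is an inner class, as described in the post.
    class AccumulatorHandler implements VectorHandler {
        double sum = 0.0;
        public void handle(int dimension, double value) {
            sum += value * parameters[dimension];
        }
    }
}
```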

The handler defined above is then an inner class within LinearClassifier.

With this arrangement, there is no need to create feature names at all if the dimensions can be computed directly. We can use hash codes for them. With n-grams, we can even use the Karp-Rabin algorithm to compute the hash codes incrementally.
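A sketch of what hashed dimensions with a Karp-Rabin-style rolling hash over character n-grams might look like (base, modulus, and names are illustrative):

```java
// Karp-Rabin rolling hash over character n-grams: each n-gram's hash
// is computed in O(1) from the previous one, and the hash modulo the
// number of dimensions serves directly as the feature dimension.
class RollingNGramHasher {
    static final int BASE = 31;  // illustrative polynomial base

    static int[] ngramDims(String s, int n, int numDims) {
        if (s.length() < n)
            return new int[0];
        int[] dims = new int[s.length() - n + 1];
        int highPow = 1;  // BASE^(n-1), weight of the outgoing character
        for (int i = 1; i < n; ++i)
            highPow *= BASE;
        int h = 0;  // hash of the first n-gram, computed directly
        for (int i = 0; i < n; ++i)
            h = h * BASE + s.charAt(i);
        dims[0] = Math.floorMod(h, numDims);
        for (int i = n; i < s.length(); ++i) {
            // Roll: drop the outgoing character, shift, add the new one.
            h = (h - s.charAt(i - n) * highPow) * BASE + s.charAt(i);
            dims[i - n + 1] = Math.floorMod(h, numDims);
        }
        return dims;
    }
}
```

Identical n-grams hash to identical dimensions, so repeated features still sum correctly through the handler; distinct n-grams may collide, which is the noise mentioned above.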

For convenience, I’ve implemented adapters based on feature parsers that are analogous to vector entry parsers, only with string-based features rather than integer vector dimensions.

4 Responses to “Optimizing Feature Extraction”

We had to do a similar optimization when we built our classification server. In C++, each STL map entry has an overhead of 28 bytes if I recall correctly, and strings take 4 bytes in the best case and 20 bytes in the worst case (strings longer than 16 characters). Using maps and strings for features was not possible for large frequency distributions — using an intermediate map for training, which is “compacted” into raw memory (where we can do binary search) every now and then, lowered memory consumption a lot. Do you know how much overhead maps have in Java?

In Java, Map is an interface, so we have to talk about specific implementations, like TreeMap and HashMap.

Objects are the real killer, though. Every object stores a reference (4 byte integer) as well as a pointer to the class itself (4 byte integer, I think), so you’re in for 8 bytes/object minimum. Strings store an underlying string (array of characters, meaning an object with associated pointer), as well as integer start/length, so you’re in for at least 24 bytes with an empty string.

Then if you look at HashMap, it’s structured as an array of Map.Entry implementations. HashMap implements Map.Entry as linked lists, with a generic key, generic value, next-entry reference, and integer hash, so 16 bytes plus object overhead, or about 24 bytes per entry.

I think this can actually be worse in that some implementations store a 4-byte synchronization key on every object.

And it’s not just size — all the hash lookups are extremely slow compared to the basic multiply/add operation of a linear classify.

It’s possible to make smaller map implementations, as I do implicitly with tries in the lm package and with the util.SmallObjectToDouble map class in LingPipe. (SmallSet is easier to understand and more general if anyone’s looking for examples.)