Thursday, November 06, 2008

Further performance gains

One thing I've learned over the last few months is that minor performance improvements, accumulated over time, matter just as much as big architectural changes that yield large one-time gains. With the new code generator in place, I spent some time on a few tweaks I'd been thinking about for a while.

Tuple dispatch overview

First up, I sped up method dispatch on tuples by eliminating some unnecessary conditional branches and memory accesses. In Factor, the object system is implemented in the language itself, so all dispatch is done by generated Factor code, produced at compile time. For example, we can define a tuple with a single slot, breed, and look at the code generated for its slot accessor.

The generated code uses various low-level primitives to accelerate dispatch; usually you will never work with these primitives directly. The words named class=>generic are not real words in the dictionary; they are word objects corresponding to methods. So methods are essentially words, except they're not part of any vocabulary. Again, this is an implementation detail, and you don't need to be aware of how methods are implemented behind the scenes. In the image where I defined this tuple, I had no other tuples with a slot named breed, so the generic accessor only has one method, and the code generated for a generic with one method is very simple. For generics with many methods, more complex code is generated; with enough methods, a hashtable dispatch is emitted.

The general philosophy of method dispatch on tuples is as follows. To each tuple, we associate an integer which is the length of the superclass chain from the tuple up to the root class of tuples, tuple. I call this the tuple's "echelon". Every tuple instance in the heap holds a pointer to a tuple layout object in its first slot. The tuple layout object holds the echelon, the sequence of superclasses, and the tuple size.

The tuple class methods of a generic word are sorted into echelons, and dispatch proceeds by starting at the highest echelon and going down until a suitable method is found (or an error is thrown). At each echelon, the generic word looks at the corresponding entry in the superclass list of the tuple on the stack, and performs a hashtable lookup. So dispatch runs in O(n) time, where n is the maximum echelon number of the tuple classes participating in the generic word.

For example, suppose we have the following inheritance hierarchy -- it's contrived, since in practice you probably would not have square inherit from rectangle, but it demonstrates the point:

If the tuple on the stack has echelon >= 4, we get the 4th element in its superclass chain, and check if it's rectangle or parallelogram. If so, we dispatch to that method. Note that the 4th element of the superclass chain of both rectangles and squares is rectangle.

If the tuple on the stack has echelon >= 3, we get the 3rd element and check if it's a triangle. If so, we dispatch to that method.

If the tuple on the stack has echelon >= 2, we get the 2nd element and check if it's a circle. If it is, we dispatch to that method.

If the tuple on the stack has echelon >= 1, we get the 1st element and check if it's text. If so, we dispatch to that method.

For this generic word, a maximum of four tests must be performed because the inheritance hierarchy is very tall. This situation is rare, since for the most part inheritance is very flat.
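The layout-object scheme and the walkthrough above can be sketched in Python. This is a model for illustration only: the class names come from the example, but the hierarchy, the helper names, and the table-driven loop are my own reconstruction -- Factor compiles this dispatch down to machine code rather than interpreting tables at runtime.

```python
# Toy model of echelon-based tuple dispatch (illustrative names throughout).

class TupleLayout:
    """Mirrors the tuple layout object: the superclass chain up from the
    root class `tuple`, plus the echelon (the chain's length above the root)."""
    def __init__(self, superclasses):
        self.superclasses = superclasses       # e.g. ["tuple", "text", ...]
        self.echelon = len(superclasses) - 1   # root `tuple` sits at echelon 0

# A linear hierarchy consistent with the walkthrough:
# text < tuple, circle < text, triangle < circle,
# rectangle < triangle, parallelogram < triangle, square < rectangle.
LAYOUTS = {
    "text":          TupleLayout(["tuple", "text"]),
    "circle":        TupleLayout(["tuple", "text", "circle"]),
    "triangle":      TupleLayout(["tuple", "text", "circle", "triangle"]),
    "rectangle":     TupleLayout(["tuple", "text", "circle", "triangle", "rectangle"]),
    "parallelogram": TupleLayout(["tuple", "text", "circle", "triangle", "parallelogram"]),
    "square":        TupleLayout(["tuple", "text", "circle", "triangle", "rectangle", "square"]),
}

# The generic word's methods, sorted into echelons: {echelon: {class: method}}.
METHODS = {
    4: {"rectangle": "M: rectangle", "parallelogram": "M: parallelogram"},
    3: {"triangle": "M: triangle"},
    2: {"circle": "M: circle"},
    1: {"text": "M: text"},
}

def dispatch(layout):
    """Start at the highest echelon and walk down; at each level, index the
    superclass chain and look that class up in the level's method table."""
    for echelon in sorted(METHODS, reverse=True):
        if layout.echelon >= echelon:
            method = METHODS[echelon].get(layout.superclasses[echelon])
            if method is not None:
                return method
    raise TypeError("no applicable method")
```

For instance, `dispatch(LAYOUTS["square"])` finds `M: rectangle`, because the 4th entry in a square's superclass chain is rectangle, exactly as in the walkthrough.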

Method dispatch: removing conditionals and indirection

The first thing I realized is that every tuple has an echelon of at least 1, since its superclass chain always contains at least itself and tuple. So the conditional test on the final step is unnecessary. Furthermore, if every method of a generic word is defined at echelon 1, there is no need to even load the echelon of the tuple on the stack into a register.

Next, I decided to put the superclass list inside the tuple layout object itself, instead of having the tuple layout object reference an array. This removes one level of indirection, which reduces CPU cache pollution.

To illustrate these improvements, look at the code generated for the call word before these improvements:

Method dispatch: faster hashcode lookup

If a generic word has more than 4 methods at the same echelon level, a hashtable dispatch is generated instead of a series of linear tests. Formerly, the generated code would indirect through the class word to look up the hashcode. Again, this is bad for locality, because it pulls the entire word object's cache line in for no good reason. So what I did was intersperse the hashcodes with the superclass chain in the tuple layout object. The tuple layout object will already be in the cache, since we just read one of the superclass chain entries, so reading the hashcode is essentially free.
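The access pattern can be sketched as follows. The helper names and the flat-array encoding are invented for illustration (Factor generates this lookup inline in machine code); the point is just that the hashcode sits adjacent to the superclass entry we already read, instead of behind a pointer into the class's word object.

```python
import zlib

# Toy model: a tuple layout as one flat array [class0, hash0, class1, hash1, ...],
# with each class's hashcode interleaved next to its superclass-chain entry.

def make_layout(chain):
    """chain: superclass names, root first. zlib.crc32 stands in for
    Factor's class hashcodes; any deterministic hash would do here."""
    flat = []
    for cls in chain:
        flat.append(cls)
        flat.append(zlib.crc32(cls.encode()) & 0xffffffff)
    return flat

def class_at(layout, echelon):
    return layout[2 * echelon]

def hashcode_at(layout, echelon):
    # Adjacent to the class entry we just read -- in the real layout object,
    # this means the same cache line, so the read is essentially free.
    return layout[2 * echelon + 1]

def hashtable_dispatch(layout, echelon, buckets):
    """buckets: list of {class: method} dicts. Pick a bucket by hashcode,
    then confirm the class matches (a simplified open-hashing scheme)."""
    cls = class_at(layout, echelon)
    bucket = buckets[hashcode_at(layout, echelon) % len(buckets)]
    return bucket.get(cls)
```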

Method dispatch: re-ordering tests

I noticed that the >fixnum generic word had conditionals in an essentially random order. The cut-off between a series of conditionals and a jump table is 4 methods, and >fixnum, which defines methods on every subtype of the reals, is the borderline case:
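The original listing isn't reproduced here, and the excerpt doesn't say which ordering was finally chosen, but the reason ordering matters is easy to show. Assuming, hypothetically, that most values reaching >fixnum are already fixnums, putting the fixnum test first answers the common case in one comparison instead of several:

```python
# fixnum, bignum, ratio and float are Factor's real-number types; the two
# orderings below are hypothetical, purely to count conditionals executed.

def count_tests(order, value_class):
    """Number of linear type tests executed before the matching branch."""
    return order.index(value_class) + 1

arbitrary_order = ["bignum", "ratio", "float", "fixnum"]  # fixnum tested last
reordered       = ["fixnum", "float", "bignum", "ratio"]  # common case first

# A call on a fixnum runs 4 tests in the first ordering, 1 in the second.
assert count_tests(arbitrary_order, "fixnum") == 4
assert count_tests(reordered, "fixnum") == 1
```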

I/O system tweaks and string-nth intrinsic

I made some changes to io.ports and io.buffers, mostly adding inline declarations. I also added a hint to the push word for the sbuf class; stream-readln would push every character onto an sbuf, and adding this fast path eliminated some method dispatch. Finally, I made string-nth a compiler intrinsic rather than a call into the VM, which means that words where the compiler infers you're operating on strings will inline the intrinsic and perform fewer memory loads/stores and subroutine calls than before.

Eliminating useless branches with value numbering

Take a look at what the high-level optimizer does to the following quotation:

A lot of dispatch is eliminated and methods are inlined, but there is some pretty crappy control flow redundancy there. I extended the low-level optimizer to eliminate it when building the low-level IR.

One of the worst offenders is the = word, defined as follows:

: = ( obj1 obj2 -- ? ) 2dup eq? [ 2drop t ] [ equal? ] if ; inline

If obj2 is a word, then equal? expands into 2drop f, so we get

2dup eq? [ 2drop t ] [ 2drop f ] if

Then, dead code elimination converts this to

eq? [ t ] [ f ] if

Ideally, we'd get rid of [ t ] [ f ] if altogether. Instead of doing this in the high-level optimizer, I decided to detect this code pattern, along with its negation [ f ] [ t ] if (which is the inlined definition of not), and emit a branchless comparison instead:

Now, suppose you have a code sequence like [ t ] [ f ] if [ 1 ] [ 2 ] if after inlining. Value numbering builds up an expression chain where one comparison depends on the result of another, and it is able to collapse them down to a single comparison.

Finally, the empty conditional, [ ] [ ] if, manifests in value numbering as a conditional branch instruction where both successors point to the same node. This is replaced with an unconditional branch.
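The three rewrites described above can be modeled on a toy IR. This is a Python sketch with an invented representation -- the real optimizer works on SSA values and basic blocks during value numbering, not on nested tuples -- but the pattern-matching logic is the same:

```python
# Toy IR: ("cmp", a, b) is a comparison producing a boolean;
# ("if", cond, then, else) is a conditional.

def simplify(node):
    """Bottom-up simplification of the branch patterns from the text."""
    if not isinstance(node, tuple) or node[0] != "if":
        return node
    _, cond, then, els = node
    cond, then, els = simplify(cond), simplify(then), simplify(els)
    # [ t ] [ f ] if: the conditional just materializes the flag, so
    # replace it with the (branchless) comparison result itself.
    if then is True and els is False:
        return cond
    # [ f ] [ t ] if, the inlined definition of not: negate the comparison.
    if then is False and els is True:
        return ("not", cond)
    # [ ] [ ] if: both successors are identical, so the conditional
    # branch becomes an unconditional one.
    if then == els:
        return then
    return ("if", cond, then, els)
```

Running it on the chained case from the text, `[ t ] [ f ] if [ 1 ] [ 2 ] if`, the inner conditional collapses to the bare comparison, leaving a single-comparison conditional -- the same collapse value numbering achieves by following the expression chain.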

Results

Here are the timings, in milliseconds, for 10 runs of the reverse-complement benchmark: