Lameness Disclaimer: All this is written to the best of my knowledge. Corrections, additions etc. are certainly welcome.

The faster objc_msgSend

You can download my test case. This one comes ready-compiled, with all sources. It requires emacs and make this time :).

This article is a byproduct of the original, as yet unpublished, article IMP Caching Deluxe. At the end of that article I knew that my NSAutoreleasePool class was much faster than the original, but I couldn't quite explain why. So I had to dig a little deeper, and I came up with some explanations, which immediately led to two follow-up articles: this one and its predecessor, Giving [] a boost.

objc_msgSend analyzed

Sending a message to an object in Objective-C works like this: the object's class (reached through the isa pointer) and the selector _cmd (the method name) are used to find the implementation in the class, and then execution jumps to that implementation.

For better performance the implementation and its selector are noted in a private cache that is part of every class. This cache is very simple and very fast.

In this example, only a few methods have been memorized yet in this cache of a class Foo. Therefore the cache is small; it holds only 8 entries. (NSTextView has 256 slots right from the start, just initialized from a NIB and displaying.) Five slots are filled. Each slot shows the preferred slot for the indexed method. Our target method's optimal slot is slot #3, but since that is occupied, it resides in slot #4.

A part of the selector is cookie-cut with the cache mask to quickly index into the cache. This avoids a linear search. In the given example the selector indexes into slot #3, which is occupied by a different method. After one cache miss the matching selector and its implementation are found.

The graphic also tries to visualize that the cache is arranged in a circular fashion: if the last entry doesn't match, the search continues at the start of the cache. (Notice that slot #0 is occupied by a method that would preferably be stored in slot #7.)

For those who can read C more easily than PPC assembler, here's the objc_msgSend entry part in almost functional, but not compilable or usable, C:
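As a compilable stand-in, here is a minimal sketch of that cached lookup. All names here (bucket_t, objc_cache, cache_lookup, cache_insert, dummy_imp) are illustrative, not the runtime's actual definitions; selectors are modeled as interned C strings compared by pointer:

```c
#include <stddef.h>

typedef const char *SEL;            /* selectors: interned, compared by pointer */
typedef void *(*IMP)(void *, SEL);  /* a method implementation */

typedef struct { SEL sel; IMP imp; } bucket_t;

typedef struct {
    unsigned mask;        /* capacity - 1; capacity is a power of two */
    bucket_t buckets[8];  /* really variable-length; 8 for this example */
} objc_cache;

typedef struct objc_class {
    struct objc_class *isa;
    objc_cache *cache;
} objc_class;

/* The dispatcher's cache probe: hash the selector with the mask,
   then scan circularly until a hit or an empty slot. */
static IMP cache_lookup(objc_class *cls, SEL sel)
{
    objc_cache *cache = cls->cache;
    unsigned index = ((unsigned)(size_t)sel >> 2) & cache->mask;

    for (;;) {
        bucket_t *b = &cache->buckets[index];
        if (b->sel == sel)
            return b->imp;                  /* cache hit */
        if (b->sel == NULL)
            return NULL;                    /* miss: fall back to slow path */
        index = (index + 1) & cache->mask;  /* wrap around circularly */
    }
}

/* Fill helper: place the method in its preferred slot, or the next
   free one. The real runtime would grow the cache instead of failing. */
static int cache_insert(objc_cache *cache, SEL sel, IMP imp)
{
    unsigned start = ((unsigned)(size_t)sel >> 2) & cache->mask;
    unsigned i;
    for (i = 0; i <= cache->mask; i++) {
        bucket_t *b = &cache->buckets[(start + i) & cache->mask];
        if (b->sel == NULL || b->sel == sel) {
            b->sel = sel;
            b->imp = imp;
            return 1;
        }
    }
    return 0;
}

/* Dummy implementation, for demonstration only. */
static void *dummy_imp(void *self, SEL _cmd) { (void)_cmd; return self; }
```

The circular wrap in cache_lookup is exactly what the graphic shows: a probe that falls off the end of the bucket array continues at slot #0.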

Some more background information: the cache, together with its mask, is grown dynamically. The cache is never more than 75% full. Methods that are recorded later have a statistical speed advantage over those entered earlier.
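The 75% rule can be sketched as a simple growth check; the next_capacity helper below is hypothetical (the real fill code also rehashes all entries into the larger bucket array):

```c
/* Sketch of the fill policy: grow (doubling the power-of-two capacity)
   whenever adding one more method would push the cache past 75% full. */
static unsigned next_capacity(unsigned capacity, unsigned occupied)
{
    if ((occupied + 1) * 4 > capacity * 3)  /* would exceed 75% */
        return capacity * 2;
    return capacity;
}
```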

objc_msgSend rewritten

The following code is the entry part of Apple's objc_msgSend; it contains all the code executed when there is a cached implementation. I have noted the instruction cycles on the left, as I believe they are spent. An L means latency: there is a stall because the next instruction depends on the previous load. (This is for a G3; latencies only increase with later processors, due to their longer pipelines.)

Change the branch prediction

The first idea is simple yet non-obvious. At cycle 18 the assumption of bne+ is that the cache has failed and that another loop iteration must be taken. This is the standard way to write loops; it optimizes for the loop. Let's see how this pans out:

Method Cache Misses    Predictions Detail    Predictions Total

0 Cache misses         -                     (-1)
1 Cache miss           + -                   (0)
2 Cache misses         + + -                 (+1)
3 Cache misses         + + + -               (+2)

It penalizes the 0 cache misses path and favors the paths with 2 or more cache misses. Let's assume we code it bne- aloop instead; then the table turns:

Method Cache Misses    Predictions Detail    Predictions Total

0 Cache misses         +                     (+1)
1 Cache miss           - +                   (0)
2 Cache misses         - - +                 (-1)
3 Cache misses         - - - +               (-2)

Now it's a matter of the likelihood of cache misses. Looking at actual cache contents, the spread of cache hits and misses will not be even. In fact, 0 cache misses is a much more likely event than 2 cache misses or more.

So, depending on the branch prediction architecture and the instruction cache contents, this could potentially pay off.
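The two tables boil down to a linear score. Assuming one prediction per loop pass, correct (+1) or wrong (-1), the totals in parentheses follow from these two one-liners (function names are mine):

```c
/* Net branch-prediction score after n method-cache misses.
   bne+ aloop: every pass but the last is predicted correctly. */
static int score_bne_plus(int misses)  { return misses - 1; }

/* bne- aloop: only the final, exiting pass is predicted correctly. */
static int score_bne_minus(int misses) { return 1 - misses; }
```

The crossover is at exactly 1 cache miss; below that bne- wins, above it bne+ wins, which is why the measured distribution of misses decides the question.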

The superfluous rotation

09 rlwinm r11,r11,2,0,29 ; shift mask up by 2

This is done every time the mask is fetched. It would be easy to pre-compute the shift in the cache-filling routine instead of having the dispatcher do the work; this definitely saves a cycle.
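In C the idea looks like this; the shifted_mask field and both names below are illustrative, standing in for whatever the cache-fill code would actually store:

```c
/* Hypothetical cache header: the fill code stores the mask both in its
   original form and pre-shifted by 2, so the dispatcher can skip the
   rlwinm and mask the selector bits straight into a bucket offset. */
struct cache_header {
    unsigned mask;          /* original: capacity - 1 */
    unsigned shifted_mask;  /* precomputed: mask << 2 */
};

static unsigned bucket_offset(const struct cache_header *c,
                              unsigned sel_bits)
{
    return sel_bits & c->shifted_mask;  /* no shift at dispatch time */
}
```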

A different cache algorithm

Currently the buckets are organized in circular-buffer fashion: if the search hits the end of the bucket array, it continues at the beginning. This is slightly more complicated than simply extending the end of the array to accommodate enough methods.

Let's assume the cache-fill algorithm has been rewritten so that the buckets are not in a circular buffer but extend until a sentinel (NULL) value is reached. At first the algorithm changes only slightly, with the above two optimizations already coded in:
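A C sketch of the sentinel variant, again with illustrative names: the loop carries no wrap-around logic at all, just a linear scan that a NULL entry terminates.

```c
#include <stddef.h>

typedef const char *SEL;            /* selectors, compared by pointer */
typedef void *(*IMP)(void *, SEL);

typedef struct { SEL sel; IMP imp; } bucket_t;

/* The fill code is assumed to keep a NULL bucket after the last
   reachable entry, so the probe never needs to wrap. */
static IMP cache_lookup_sentinel(bucket_t *buckets, unsigned start, SEL sel)
{
    bucket_t *b = &buckets[start];
    for (; b->sel != NULL; b++)     /* NULL sentinel ends the scan */
        if (b->sel == sel)
            return b->imp;          /* cache hit */
    return NULL;                    /* miss: slow path */
}

/* Dummy implementations, for demonstration only. */
static void *imp_a(void *self, SEL _cmd) { (void)_cmd; return self; }
static void *imp_b(void *self, SEL _cmd) { (void)_cmd; return self; }
```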

Hmm, so far that didn't gain anything, because the cleared-up slots are incurring latencies.
The latency between 5 and 6 can be avoided by shadowing the mask of the objc_cache structure in the objc_class structure. Then the CPU no longer has to wait for the cache pointer to be fetched. This is also kind of neat, because now the original mask format can be kept in the cache structure, while the pre-shifted mask is available in the class structure. Here's how it could look, with a reshuffling of the registers:

Of course this doesn't work immediately because we are wasting r12, which we need in the loop. But that can be fixed.

Also, there are now potentially extra memory accesses in the loop that may punish us in cases with more than 0 cache misses. But how much will they punish us?
How likely is the CPU data cache to miss, one wonders... Given that a cache line is 16 bytes on older processors, if 0(r2) is fetched there is a 50:50 chance that 8(r2) is already in the cache. On a G5, with its larger cache lines, the chance is even better.
So given the likelihood of

a data cache hit

a method cache hit

I think it's a worthwhile endeavour.
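The 50:50 figure follows from a simple model. Assuming the two fetched fields sit 8 bytes apart and the structure is 8-byte aligned at a uniformly random position within a line, the chance that both fields land in the same L-byte cache line is (L - 8) / L:

```c
/* Probability that a field at offset 0 and a field at offset 8 of an
   8-byte-aligned structure fall into the same cache line of the given
   size. Model assumption: alignment within the line is uniform. */
static double same_line_probability(double line_bytes)
{
    return (line_bytes - 8.0) / line_bytes;
}
```

For a 16-byte line this gives 0.5, matching the 50:50 estimate above, and it only improves as lines get longer.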

Now r12 needs to be freed up somehow. This can be done by another slight tweaking of the code:

objc_msgSend optimized

That is 19 cycles vs. 22 cycles, plus an unknown statistical gain from the improved branch prediction. Around 14% improvement on paper... This is offset by the potential loss from the extra memory hit, if the method cache incurs lots of misses.
The actual advantage has to be measured. (Try the test case!)

So that's what I did, on my old Lombard PowerBook 400 MHz.
Here are the results of a typical run: