Why Premature Optimizations Are Rarely A Waste Of Time

13 Jan 2014

Yesterday I set out to try and improve go's strings.IndexByte implementation when an offset could be provided. After a few hours, I was finally able to put my implementation to the test. Overall speed improvement? None. Why do I consider this a success? Because the journey to nowhere was awesome.

The strings package has an IndexByte function which returns the position of a byte within a string, or -1 if the byte isn't found. For example, the following will print 4:

It isn't uncommon to want to find the index of a byte starting from a position other than 0. For example, if we wanted to break a string into words, we might look for a space starting from the position following the previous space. The idiomatic way to do this in go is to use a slice:

// we start at the previous space (4) + 1 so that we skip what we've already found
println(strings.IndexByte(haystack[5:], needle))

In my mind, given a large haystack, that would result in a lot of slices being created. That's when I thought to myself Wouldn't it be better to have a 3rd parameter which specifies an offset? Something with a signature like:

println(strings.IndexByteFrom(haystack, needle, 5))

The first thing I did was create a baseline benchmark: one for the entire string, and 3 for various slice lengths:

Since LongSlice runs slower (takes more time per iteration), I thought I was onto something: there's clearly overhead in creating a slice. Of course, that overhead is small and is quickly made up for by having a smaller string to scan. Still, why not have your cake and eat it to?!

Ok....since this is in strings_decl.go, I assume this is some type of header required to link assembly code. I followed the comment and opened up /pkg/runtime/asm_amd64.s. If you search, you'll find the IndexByte function (one for strings and one for bytes).

Now, I spent a fair amount of time in this code, looking and [trying] to learn. The brunt of the work happens in runtime.indexbytebody. I have no idea how this actually works. However, it's well enough documented that we can tell that it uses SSE instructions to look at 16-byte chunks of data at a time, taking care to read any unaligned head and tail chunks. (the asm_386 version is much simpler.)

I was a little deflated, but I still wanted to try. I knew that I'd have to take advantage of runtime.indexbytebody. All I needed to do was setup my data with the right offset. This took me a bit of time to figure out, and I'm not positive it's right, but I essentially ended turning:

Essentially, the added 3rd parameter (n) which is our offset, is subtracted from the total length (BX) and added to our string (SI). I also added some extra bound checking, but am leaving it out of here for the sake of simplicity. One thing that I struggled with was the offset of my new parameter. Our function takes: a string, a byte and an integer. A string is represented as a pointer and a length. Since we're on 64bit code, each of those is 8 bytes. We can see that in the above:

The offset of a value is the sum of the size of all the values before it. If you look the bytes.IndexByte code, you'll see that c is at offset 24, not 16. Why? Because array slices have a 3rd 8 byte value, the capacity. In other words, the offset of a value after a string will always be equal to the offset of the string + 16. The offset of a value after an array slice will always be equal to the offset of the array + 24.

My confusion though came from the offset for n and the return value. Since c is a single byte, I would have assumed n to be at 17 and the return value to be at 25 (since n is 8 bytes). But in the original code, the return value was a 24. Turns out that the spec is quite clear on this: All parameters less than 8 bytes in length have 8 byte slots reserved on the stack to preserve alignment and simplify variable-length argument list access.

Back to our benchmark, we add three tests that make use of our new assembly entry point:

Confused, I decided to look at the runtime.MemStats of each version. I had expected to find that the slice implementation allocated more objects than the baseline string version. Instead, I found that Mallocs, which tracks the number of mallocs, and HeapObjects, which tracks the total number of allocated objects, were the exact same with or without the creation of slices.

Maybe, I thought, there's a pool of slice objects? But I couldn't find one, so I posed this conundrum to the golang-nuts google group. The answer was perfect:

A slice of a string is a pointer and a length.
No allocations required. Just two words on the stack.

In other words, my assumption that a string slice would result in a heap allocation (and subsequent GC), was wrong. My 'tweaked' version and the slice version are effectively the same (a couple values pushed onto the stack).

What I'm saying is that this is a pattern I've gone through many times before. The performance gains from optimizing tend to be small compared to what the process can teach you. In this case, what I learned (more or less in order of importance):

I wrongly associate stack with primitives and the heap for everything else. I still think that's true for Java, and, structs aside, true for C#. But, if I'm honest, I have no idea how Ruby or Node approach this. Go's specs, I've since learned, never mention the heap or the stack, so it's implementation specific. I know how the current implementation behaves, in this case at least.

Go has highly optimized Index functions for both strings and bytes. I'll never worry about that again.

Calling assembly from Go is easier than I would have guessed. It's probably safe to say that calling C from Go is equally easy. In the future, this knowledge will help me quickly rank the appropriateness of various solutions.

I'll probably never push my own assembly code into production. Still, I refuse to believe that playing around with it was a loss. Just seeing the impact of different instruction sets (SEE vs not) was obvious but neat. It's also neat to see the control flow through various CMP and JMP operations (which I remember pretty well from school, still, the aligned, sse, condition looping is awesome). I also can't figure out how you pick your registers (something from school tells me some flags are tied to specific registers?). Why R11 vs R10 vs X0 vs BX? The documentation is pretty brutal.

Honestly, it doesn't matter why you're exploring stuff outside of what you already know. As long as you're doing it.