The add instruction is unconditional to avoid a dependency on the comparison. Unrolling the loop would mask the latency of the ldr instruction as well, but that is outside the scope of this experiment.

Now compile this code with gcc -march=armv5te -O3 and watch in horror:

A complicated mask/shift/or operation is used where a simple shifted add would suffice.

Write-back addressing is not used.

The loop control counts up and compares instead of counting down.

Useless mov in the prologue; swapping the roles or r2 and r4 would avoid this.

Using lr in place of r4 would allow the return to be done with pop {pc}, saving one instruction (ignoring for the moment that no callee-saved registers are needed at all).

Even for this trivial function the gcc-generated code is more than twice the optimal size and slower by approximately the same factor.

The main issue I wanted to illustrate is the poor handling of bit-fields by gcc. When accessing bitfields from memory, gcc issues a separate load for each field even when they are contained in the same aligned memory word. Although each load after the first will most likely hit L1 cache, this is still bad for several reasons:

Loads have typically two or three cycles result latency compared to one cycle for data processing instructions. Any bit-field can be extracted from a register with two shifts, and on ARM the second of these can generally be achieved using a shifted second operand to a following instruction. The ARMv6T2 instruction set also adds the SBFX and UBFX instructions for extracting any signed or unsigned bit-field in one cycle.

Most CPUs have more data processing units than load/store units. It is thus more likely for an ALU instruction than a load/store to issue without delay on a superscalar processor.

Redundant memory accesses can trigger early flushing of store buffers rendering these less efficient.

No gcc bashing is complete without a comparison with another compiler, so without further ado, here is the ARM RVCT output (armcc --cpu 5te -O3):

This is much better, the core loop using only one instruction more than my version. The loop control is counting up, but at least this register is reused as offset for the memory accesses. More remarkable is the push/pop of two registers that are never used. I had not expected to see this from RVCT.

The ARM ABI specifies that a bit-field shall be contained within one naturally aligned instance of its declared type. If the bit-field is wider than the declared type special rules apply. In our test case, the alignment is clearly 4 bytes.

I beg to differ. Any programme can be broken down into small pieces on each of which, according to yourself, a human will beat the compiler.

Compilers are useful mainly for two reasons:

Compilers are good enough that the effort saved by writing in a high-level language are usually worth a sacrifice in performance.

Writing in an HLL allows the code to be compiled for any target machine without modifications.

The above notwithstanding, until something radical happens to compilers, there will remain areas where the gains from hand-optimisation make the extra effort and loss of portability well worthwhile. A good example is the FFT used in FFmpeg. It is probably the fastest C FFT available, yet the same algorithm hand-written in NEON assembler runs 12 times faster on a Cortex-A8. With the C implementation, the FFT completely dominates the execution time of many popular audio decoders, while the assembler version is barely noticed in profiling.

As a reference point, FFmpeg contains roughly 7000 lines of ARM assembler (excluding comments and whitespace). In H.264 decoding, FFmpeg on Cortex-A8 spends 30-40% of the total execution time in hand-written assembler, the rest is C code. I am convinced writing the C parts in assembler would double the speed of that code, equivalent to a total speedup of approximately 40%. Much of this code is, of course, very complex, and writing it in assembler would be very difficult. As with all engineering, there is a trade-off between perfection and practicality.

FFmpeg is probably an extreme example, both in the demands for processing power and in the returns from hand-optimisation. A more mundane application, such as an email client, spends most of the time waiting for the network or user input, and optimising it beyond the basics makes little sense.

The point I want to get across is that anything can be optimised, the trick lying in identifying the parts which should be. The better the compiler is, the less code will require hand-optimising, and it is precisely small, trivial bits of code like the function above where compilers can realistically be improved. Code like this is also what makes up the bulk of the applications we use.

I suppose that I expected the reader to do a little thinking on his own. For programs of tens or hundreds of thousands of lines in length, human programmers cannot reasonably devote the resources that it would take to optimize the entire program even if it were modularized into reasonably sized components. To optimize it, that is, so that is better than what a compiler could produce.

While your code is probably faster a few comments: Does it really matter (As in are the gains in speed worth the loss in readability)? Additionally, gcc is FAR from one of the “best compilers”; moreso when you’re not generating x86{-64} assembly.

For one piece of code that size it probably doesn’t matter. For a large application built from many such pieces I would argue that it does matter. Consider a web browser on a mobile device. Would you like the browser to use less of precious flash space? Would you like web pages to render quicker? Would you like the battery to last longer? I’d be surprised if you didn’t answer yes to at least one of those questions. A better compiler could give you all three.

When I said “best compiler”, I was referring to RVCT, which is to my knowledge the best ARM compiler around. If you know of a better one, I’m all ears.

My most recent experience with a Green Hills compiler involved a week-long debugging session culminating in the discovery that the compiler would occasionally access memory below the bottom of the stack. An interrupt at an unfortunate time would result in the registers being dumped on the stack, and the wrong value would be loaded. This only happened when compiling without debugging symbols.

OK, I’ll bite. RVCT’s loop optimizer doesn’t like structures. Blame it
(the optimizer) for not recognizing the fact that it is looking at an
integer operation in this case. Notice that the loop gets transformed
into a down counting loop.

This is hardly more portable than assembler (depends on bit-field
layout) and took about ten times as long to come up with :-) but this
is the compiler generated assembler code (RVCT, Build 697, -O3 -Otime):

I think a fairer thing to say is “even the best compilers are still no match for the best humans, given infinite time and ingenuity.”

I frequently am pleasantly surprised at how clever a compiler is being on my behalf. For example, gcc knows how to turn “n / 1000000UL;” into “(n * 1125899907) >> 50″. Maybe if you locked me in a room for an hour and demanded that I optimize that expression I would have come up with that. But the compiler thought of it for me in a fraction of a second.

After all, everything a compiler does is something a human thought of first, so naturally a compiler isn’t going to outdo the smartest human with an expertise for the architecture. But it’s definitely going to outdo a lot of people. It’s also much less error-prone.

Also, picking on GCC-on-ARM is like bullying the pipsqueak on the playground. :)

The reason for the “unnecessary” saving of registers is probably because you are not using -fomit-frame-pointer, which means that all functions must have a stack frame (so they can be debugged and unwound, although DWARF2 debug/unwind information makes that unnecessary)

Nice! You almost got me there. However, the compiler I am using (4.0, build 697) does not like your code. I mean it does compile it of course, but the instruction sequence is significantly worse than that of the pure integer based loop. And yes, I did fix the loop in my example.

“hardwarebug_20100130.c”, line 58: #1636-D: Could not optimize: Complicated use of variable (q)

Stack alignment is only necessary in non-leaf functions as you stated. So every once in a while you will see single registers pushed on the stack. Why the compiler thought this was needed in your example, I don’t know.

I love replying to myself. It is perhaps interesting to know that RVCT’s loop optimizer seems to interfere with the structure access in this example. Switching back to -O2 -Otime (the loop optimizer is only enabled at -O3, AFAIK) gives me the same efficient code that David posted.