Thursday, July 20, 2017

A load/store performance corner case

I have recently seen a number of “is X faster than Y?” discussions where micro benchmarks are used to determine the truth. But performance measuring is hard and may depend on seemingly irrelevant details...

The performance “should not” depend on the distribution of the values in the buffer p, but running this on a buffer with all bytes set to 0 and one buffer with random values gives me the result (using the Google benchmark library and this code)

That is, running on random data is 3.5x faster compared to running on all-zero data! The reason for this is that loads and stores are slow, and the CPU tries to improve performance by executing later instructions out of order. But it cannot proceed with a load before the previous store to that address has been done,1 which slows down progress when all loop iterations read and write the same memory address histogram[0] .

This is usually not much of a problem for normal programs as they have more instructions that can be executed out of order, but it is easy to trigger this kind of CPU corner cases when trying to measure the performance of small code fragments, which results in the benchmark measuring something else than intended. Do not trust benchmark results unless you can explain the performance and know how it applies to your use case...

1. The CPU does “store to load forwarding” that saves cycles by enabling the load to obtain the data directly from the store operation instead of through memory, but it still comes with a cost of a few cycles.