Saturday, September 27, 2014

Counting bytes fast - little trick from FSE

An apparently trivial and uninteresting task nonetheless received some special optimization care within FSE : counting the bytes (or 2-bytes shorts when using the U16 variant).

It seems a trivial task, and could indeed be realized by a single-line function, such as this one (assuming table is properly allocated and reset to zero) :

while (ptr<end) count[*ptr++]++;

And it works. So what's the performance of such a small loop ?
Well, when counting some random data, the loop performs at 1560 MB/s on the test system. Not bad.(Edit : Performance numbers are measured on a Core i5-3340M @2.7GHz configuration. Benchmark program is also freely available within the FSE project)
But wait, data is typically not random, otherwise it wouldn't be compressible. Let's use a more typical compression scenario for FSE, with a distribution ratio of 20%. With this distribution, the counting algorithm works at 1470 MB/s. Not bad, but why does it run slower ? We are starting to notice a trend here.
So let's go to the other end of the spectrum, featuring highly compressible data with a distribution ratio of 90%. How fast does the counting algorithm run on such data ? As could be guessed, speed plummets, reaching a miserable 580 MB/s.

This is a 3x performance hit, and more importantly, it makes counting now a sizable portion of the overall time to compress a block of data (let's remind FSE targets speeds of 400 MB/s overall, so if just counting costs that much, it drags the entire compression process down).

What does happen ? This is where it becomes interesting. This is an example of CPU write commit delay.

Because the algorithm writes into a table, this data can't be cached within registers. Writing to a table cell means the result must necessarily be written to memory.
Of course, this data is probably cached into L1, and a clever CPU will not suffer any delay for this first write.
The situation becomes tricky for the following byte. In the 90% distribution example, it means we have a high probability to count the same byte twice. So, when the CPU wants to add +1 to the appropriate table cell, write commit delay gets into the way. +1 means CPU has to perform both a read and then a write at this memory address. If the previous +1 is still not entirely committed, cache will make the CPU wait a bit more before delivering the data. And the impact is sensible, as measured by the benchmark.

So, how to escape this side-effect ?
A first idea is, don't read&write to the same memory address twice in a row. A proposed solution can be observed in the FSE_count() function. The core loop is (once cleaned) as follows :

The burden of counting bytes is now distributed over 4 tables. This way, when counting 2 identical consecutive bytes, they get added into 2 different memory cells, escaping write commit delay. Of course, if we have to deal with 5 or more identical consecutive bytes, write commit delay will still be there, but at least, the latency has been used counting 3 other bytes, instead of wasted.

The function is overall more complex : more tables, hence more memory to init, special casing non-multiple-of-4 input sizes, regroup all results at the end, so intuitively there is a bit more work involved in this strategy. How does it compare with the naive implementation ?

When compressing random data, FSE_count() gets 1780 MB/s, which is already slightly better than the naive strategy. But obviously, that's not the target. This is when distribution gets squeezed that it makes the most difference, with 90% distribution being counted at 1700 MB/s. Indeed, it's still being hit, but much less, and prove overall much more stable.

With an average speed > 1700MB/s, it may seem that counting is a fast enough operation. But it is still nonetheless the second contributor to overall compression time, gobbling on its own approximately 15% of budget. That's perceptible, and still a tad too much if you ask me for such a simple task. But save another great find, it's the fastest solution I could come up with.

Edit 2 :
Thanks to recent comments, notably from gpd, Nathan, and Henry Wong, a new and better reason has been provided to explain the observed delay. Its name is store-to-load forwarding. I would like to suggest here the read of the detailed explanation from Nathan Kurz, backed by his cycle-exact Likwid analysis, and the excellent article from Henry on CPU microarchitecture.
In a nutshell, while write commit delay used to be a problem, it should now be properly handled by store-cache on modern CPU. However, it introduces some new issues, related to pipeline, serial dependency and prefetching, with remarkably similar consequences, save the number of lost cycles at stake, which is quite reduced.

Edit 3 :Nathan Kurz provided an entry which beats the best speed so far, achieving 2010 MB/s on a Core i5-3340M @ 2.7 GHz. Its entry is provided within the fullbench program (as algorithm 202), alongside a simplified version which achieves the same speed but is shorter (algorithm 201).
It's more than 10% better than the initial entry suggested in this blog, and so is definitely measurable.
Unfortunately, these functions use SSE 4.1 intrinsic functions, and therefore offer limited portability perspectives.

Counting 16 bits values would make read speed twice faster. That sounds like a good plan.

But unfortunately, the underlined limiting factor seems to be the counter write speed, and counting 16 bits values doesn't help much on this front.

There are additional problems which can be expected with such a scheme : If recording the 16 bits values directly, with the objective to find back individual byte values later, it makes a table with 65536 cells, which is likely larger than L1 cache, and will therefore likely be a problem for performances.

It also doesn't help when there are 4 or more identical bytes, since it would use the same cell number for 2 consecutive recording. Note that it can be mitigated by using 2 or more tables, but with considerable side effects on CPU cache load and init time.

Finally, calculating at the end the number of occurrence of each byte value will be quite CPU intensive. It can be guessed there will be a lot of final addition loops to finish the job.

So, all in all, my guess is that it's not going to improve the counting algorithm. But of course, I would love to be proved wrong.

I expect that, as soon as you have multi-core CPU, you can't completely evade the problem, since a memory address must be "protected" against parallel access while it's not yet fully committed.

The situation would probably be different if there was a (hardware) guarantee that such data can only be accessed by the currently running core. Alas, I guess only single-core systems could make such assumption...

Yann, I do not think that multi-core has anything to do to it. The store buffer on x86 is explicitly outside of the coherence domain and no protection aganist parallel read/writes is present: an application need to issue an explict barrier to flush it to publish local stores and see remote stores.

The store buffer should effectively hide the store-to-load latency (making it very close to an L1 hit), so I do not think that the commit delay is the issue here.

My guess is what you are seeing is the effect of speculation. Your i3 is an agressively OoO machine and will attempt to reorder newer loads on the count table before older stores to the same table. Obviously this can only be done if the store and the load do not depend on each other (i.e. they do not alias). Because the load and store addresses are not known untill late in the execution, the cpu has to guess (i.e. rely on a predictor) and speculate around the RAW hazard.

In theory in both the 90% don't alias and the 90% alias the predictor should be doing its job, but because of the catastorphic performance decrease either the predictor is predicting never-alias all the time or the cost of a speculation failure in your use case is extremely high.

Finally note that, according to http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ on intel cpus since Sandy Bridge, store-load forwarding that trigger speculation has a (3x) performance penality even in the never-alias case: if that's true, even in the best case you are leaving performance under the table.

Thanks for the advise. And thanks for the link too, which was very interesting to read. Indeed, your explanation makes perfect sense, and get reinforced by similar comments from Nathan Kurz. I will update the blog post to reflect this excellent explanation.

Divide the input data in L1 cache sized subsets. Then do 256/K passes over the data. Each pass counts the appearance of a K subset of all possible values incrmenting a local R_kcounter; you want to keep K small so you can fit all the counters in registers. At the end of the pass you increment your table with the content of each counter.

You can load and test 16 bytes at a time by using SSE compare aganist the Kth value (which is known statically), copy the resulting mask to a general purpose register and then do a popcnt.

Another possible optimization is to set the subset size to half of the L1 cache and prefetch the other half while performing computation on the first half.

Keep in mind that current solution already works at 1.7 GB/s on a 2.7 GHz CPU. That's approximately 1.6 cycles / bytes. So any multiple passes solution will have to fight against that. This seems only viable if each pass can handle more than one byte at a time.

Vectorization, with large registers, looks like the way to go. But unfortunately, it also hurts portability which is one of FSE primary goals. It's definitely an interesting research area though.

I tried that, and it was slow. The best result I could get using SSE was 244Mbit/sec, with a block size of 16384 (L1 cache size is 32K bytes) and K=32 (I used eight 64-bit variables, each containing four sums, which imposed 64K limit on the input size). The standard solution achieved 1.4Gbit/sec on my machine.

Yann is right: 1.6 cycles/sec is already quite small; it won't be easy for a multi-pass solution to fight against it.

By the way, I didn't observe original problem (slow execution on unbalanced data). I was running on Xeon E5-2670 chip.

Nice article, and interesting to ponder. I think your conclusions are right, and your solution good, but the that the problem isn't actually due to any write commit delay. These used to be an issue, but with current store-to-load forwarding, they aren't a factor any more.

Instead, as 'gpd' suggests, the issue is with speculative loads. The processor tries to get ahead by reading ahead in the input stream and pre-loading the appropriate counter. It even speculatively executes the addition and issues the write. But before the write is retired, the counter preload is invalidated by a previous store. Usually no reload of the counter is necessary, since the updated value is forwarded directly to the appropriate register from the load, but the work done for the increment and store is tossed out (not retired) and the micro-ops are re-executed with the new value.

I think I can show this fairly directly using the CPU's performance counters. But when I tried to post the details here, your blog engine didn't want to accept them. And then it got into a loop where it kept remembering the original version, and would throw away my edits. So I sent long message to you via your contact form, and am trying to post this shorter version from an anonymous browser window in the hope that I could at least get it to accept something.

Ok, I have plaeyd with this, and this is the best I could come up with using SSE4.1:

http://pastebin.com/CFQmxBPg

Compile it with "gcc -O2 -msse4.1 test.c". For random data the naive solution runs for 615 ms vs the vectorized for 347 ms, for zero filled data it is 1795 vs 352 ms.

I have also tried using 256 bit AVX registers where each bit would be used to count the number of occurrences of a specific byte, and use several AVX registers each of which would store the 0th, 1st, 2nd, etc bit of the count and use AND and XOR to update them all simultaneously. It was not faster.

This is an excellent work Miklos.I tested your code within fullbench.cIt reaches 2.7 GB/s, which is the best speed so far. It is also extremely stable, whatever the byte value distribution.

Now, it nonetheless raise 2 questions :- Final count : why are you doing i+=8 per loop ? It looks to me that i++ would be more appropriate- Inner loop : it seems like you are extracting 8 bytes out of 16 possible within each SSE register, potentially missing half of them. This seem confirmed by extraction of largest value stat (~6500 at default setting, while it should be ~13K)

(Do you know if your clock speed of 2.7 GHz includes the effect of turbo boost? From your 1780MB/s number, I'd guess it was actually running at ~3.2 GHz).

I ran your code on an Ivy Bridge (i5-3570K, 4200 MHz), and got ~1.8 clocks per byte (2366 MB/s, -P0, 64-bit, gcc 4.4.3). Ignoring the ALU ops (2 loads + 1 store per byte processed), I would expect at best 1.5 clocks per byte, so you're already quite close (Ivy Bridge can do two memory ops up to one store per cycle).

I think some of the confusion with the comments about load-store forwarding is due to having called it a write "commit" delay. Load-store forwarding allows this delay to be ~5 clocks instead of waiting for commit (often multiple tens of cycles). But repeatedly incrementing the same location in memory still requires serializing all of the read-modify-write operations even with store-load forwarding, while accessing independent locations allows the increments to be pipelined.

I mostly disagree with the comments about store dependence speculation. While memory dependence speculation may enable the issue to be visible (by allowing independent accesses to proceed. Note your trivial code gets a load address from memory too.), and misspeculation may play a role, the effect you're seeing will exist even with perfect speculation. Increments to independent memory locations can happen at rate of one every 1.5 clocks, while dependent increments can't happen any faster than once every 6 cycles on Ivy Bridge. (This also explains why keeping groups of 4 independent increments should be enough to get back almost all of the lost performance: 6/1.5 = 4)

My guess is that memory dependence misspeculation is not playing a big role (at least for -P90), because the (second) load is "dependent" so often it will predict "dependent" all the time, leading to no flushes. Note also that the (first) load for address generation is never dependent, and that can also be correctly predicted.

I attempted software pipelining the address generation, and reducing the number of loads (in exchange for more ALU ops by loading two or 4 bytes then computing the addresses for the increments using shifts), and managed to squeeze a ~5% improvement on Ivy Bridge, but it got an even bigger decrease (~8%?) on AMD Bulldozer. :P

I've read Nathan's performance counter results, and I don't quite trust them. I'll comment on that in a separate comment.

For memory dependence misspeculations, there are performance counters specifically measuring those: MACHINE_CLEARS.MEMORY_ORDERING gives a count, and INT_MISC.RECOVERY_CYCLES gives number of cycles caused by machine clears in general. This counter says there are very few memory dependence misspeculations (Using 10k iterations as Nathan did):

For P90, I get ~2k out of 6.6B instructions retired,For P20, I get 800k out of 6.6B instructions retired. (I used VTune. It should be the same performance counters that likwid uses)

I did reproduce Nathan's observation that store data counts (port 4) got much higher with P90 on an Ivy Bridge. However, I don't really trust that particular counter (even though I do not know why it seems to be off).

- The number of store-data (Port 4) ops with P90 is too high. Every store-data needs a corresponding load, unless there is value prediction or some form of speculative memory bypassing (rather than store-load forwarding), neither of which are implemented, to my knowledge. You can't know what data to store until you've loaded and incremented the number. Yet for P90, there are more store-data ops executed than loads executed.

- For P0, I've also noticed the number of store-data ops to be slightly *lower* than the number of stores committed (by ~8%). You can squash executed ops, but you can't have a committed store without having executed it.

Ivy bridge doesn't have a separate store AGU, but Haswell does. Maybe the store AGU (Port 7) might show something different than the store-data port (Port 4)? In any case, I think I trust the "memory ordering machine clears" counter more than the increase in executed ops.

Nice insights. While I'm certainly often wrong, I think my numbers are correct this time. The big discrepancy you note between the number of loads and stores is true, but a result of my poor explanation rather than a problem with the counters. The counter I used for PORTS_23 is the number of cycles that a load happens on _either_ Port 2 or Port 3. I did it this way because there were only 4 PMC counters available, and I wanted to show everything in a single run. If you use separate counters, you get very close to double the number of loads as you would expect. I'm pretty sure that the Port 4 numbers are correct, as it eventually agrees with MEM_UOP_RETIRED_STORES as I increase the number of tables.

I do not think that MACHINE_CLEARS.MEMORY_ORDERING or INT_MISC.RECOVERY_CYCLES are the correct counters to pay attention to here. These are heavier weight exceptions, versus the light weight recovery of simply reissuing a store. You might look at Fabian's explanation of the first here: http://fgiesen.wordpress.com/2013/01/31/cores-dont-like-to-share/

You are right to suspect that there are some strange Port 7 issues happening. Particularly, while Haswell can calculate store addresses there, it can be difficult to make it happen. It turns out that Port 7 only supports fixed offset addressing (offset + base), so if the address is using (base + index * scale) addressing it needs to execute on Port 2 or 3. I was able to determine conclusively that this is a real bottleneck on Haswell: when I changed the code so IACA said it would execute on Port 7, I got a significant boost in speed. The bottleneck then became the 4 uop per cycle limit.

I ended up with two improved versions of Yann's code, one based on spreading things out to Port 7 on Haswell, and one based on reducing the number of loads by using _mm_loadu_si256() and then splitting into bytes. The first approach got to me 2200 MB/s on 3.4 GHz Haswell, and 2050 on 3.6 GHz Sandy Bridge. The second XMM approach got me up to 2450 MB/s on 3.4 GHz Haswell, and 2150 on 3.6 GHz Sandy Bridge.

I sent my code to Yann, and he'll probably post something after he looks at my code and figures out if I've made some silly mistake (certainly possible). There are still a number of details I don't understand, but the PMC numbers for these seem to make sense as well. I'm hoping that when Yann publishes (presuming no terrible logic problems with my code) someone with a deeper understanding (or knowledge of some assembly tricks) might be able to push things a bit faster.

Interesting observation about Port 7's addressing mode constraints. It makes sense since simple addressing modes are more common, though I wonder what the cost of having more AGUs feed the store queue is.

No, I don't think it's just your explanation. I ran the same test cases under VTune on an Ivy Bridge and got similarly nonsensical numbers for port 4 (store data), and I was using separate counters per port. At P0, I saw Port 4 numbers lower than stores committed (by 8%), and at P90, Port 4 was greater than the sum of (AGU) ports 2 and 3 (by 30%).

When there is a memory ordering violation, it is not a simple matter of reissuing a store. You also have to replay all dependent loads, and any instructions dependent on any of those loads, and so on. As far as I know, there is no practical hardware mechanism that can track all of these dependencies, so there is no "light weight" selective re-execution mechanism. A memory ordering violation should result in a "machine clear".

To summarize, I believe memory ordering misspeculation doesn't play a role here for three reasons (mostly restating my earlier post):- A machine clear is the only mechanism for memory ordering misspeculation recovery, and vtune says these are near zero.- The memory dependences are easily predictable: First load is never dependent, second is almost always dependent.- Even when there are no misspeculations, performance is still expected to slow by 4x.

i.e., I think misspeculations do not happen, they shouldn't be happening, and they're not necessary to explain the performance difference.

Admittedly, I'm guessing at a lot of undocumented Intel internal behaviour, paralleling the argument that John McCalpin makes here: https://software.intel.com/en-us/forums/topic/506515#comment-1780994

Your interpretation (I think) says that the counter values are simply wrong and should be ignored. From my point of view, I have yet to view any non-sensical measurements: so far, the counter numbers have been explained by my working theory. Perhaps this is because of a fix between Ivy Bridge and Haswell?

Still, as I change the ordering of instructions to produce higher throughput, the numbers all converge as would be expected: MEM_UOP_RETIRED_STORES == UOPS_EXECUTED_PORT_PORT_4 == expected number of stores (1 per byte processed); MEM_UOP_RETIRED_LOADS == expected number of loads (2x stores); and UOPS_EXECUTED_PORT_PORT_2 + UOPS_EXECUTED_PORT_PORT_3 + UOPS_EXECUTED_PORT_PORT_7 == stores + loads.

While searching, I came across this useful older presentation on memory disambiguation mechanisms in the Intel Core: http://www.hotchips.org/wp-content/uploads/hc_archives/hc18/3_Tues/HC18.S9/HC18.S9T4.pdfIt doesn't touch much on the recovery mechanism (and supports you the extent that it does), but it has some nice graphics illustrating the concepts.

Do you know by chance if hist_4_32 gets automatically vectorized by GCC ? I find it suspicious that the sudden speed boost happens exactly when there are precisely 4 32-bits registers loads within the main loop.

No, when compiled with gcc 4.8.1 -O3, the main body of the hist_4_32 loop does not involve any vectorized operations. The summation at the end does, but does not take any significant portion of the time. It's fast because it's doing a good job of using the high and low bytes of the legacy registers to avoid some of the moving, shifting and masking: "mov %ch -> %r9" instead of "mov %rcx -> %r9; shr $8, %r9; and $ff, %r9".

I was looking for this because I'd earlier made my own version on this basis. I was able to get further speed up by interleaving two 64-bit variables together to avoid touching the high and low bytes of the same register during the same cycle. I've also come up with some significantly faster vector approaches.

Here's where I'm currently at for Haswell at 3.9GHz (instead of 3.4 as in my previous comments), along results for my ported versions of powturbo's code.

Number 2 "count2x64" is the non-SSE interleaved high/low byte approach, which at 3050 MB/s at 3.9GHz is running at 1.27 cycles/byte.

The fastest one, Number 20 "port7vec" at the bottom, is at 1.21 cycles/byte for 3200 MB/s. It's a ridiculous Haswell only AVX2 approach that converts the 16 input bytes at a time to a 16 x 16-bit vector, does a 2-bit shift left on the entire vector, then writes the vector to a fixed buffer. It's then reread as 16 separate scalar words, but since the shift has already been done the store can execute on Port 7.

For fairness, I should note that although nominally written in C, all of my fastest approaches involve considerable amounts of fragile inline assembly. They are proofs-of-concept to as I try to explore the limits of the hardware, and not necessarily advances that can be easily incorporated into multi-compiler cross-platform software.

I finally took the time to integrate and benchmark your new histogram functions. They are indeed quite fast, and as suggested, count2x64 becomes the new king of the hill, at 2050 MB/s !

The gain seems probably less impressive than the one you experienced on your own platform, but then the CPU models are different (Ivy Bridge vs Haswell), so it probably explains the benefit difference.

It looks like you're ultimately looking for the maximum value, right? In that case, do you need the other values?

Interestingly, I tried reading a machine word (32-bit or 64-bit integer) at a time, and 'slicing' it into bytes, and it had a slightly negative affect on my Intel I7-920 -- but a 20% performance boost on an AMD Phenom II.

Congratulation and thanks to Nathan!TurboHist improved and tested on Sandy Bridge. "hist_8_32" w/o SIMD and w/o inline assembly is now nearly as fast as the fastest "count2x64". see: https://github.com/powturbo/TurboHist

Thanks for testing. It would be nice, if you can also report the timing for "hist_8_32". It is now faster than "hist_4_32" on my system.More speed increase is theoretically possible using scatter/gather instructions, if available.

Unfortunately, it has not worked as well on my test system. hist_8_32 gets approximately 1950 MB/s, which is good, but still (slightly) less impressive than your updated version of hist_4_32. One can note however that hist_8_32 speed is very stable, and does not slow down when distribution becomes highly squeezed (P=95%), while hist_4_32 does slow down a little bit. But it's not significant enough to change the ranking.

FWIW, I remember encountering precisely this sort of problem several (about 17) years ago when I wrote the histogram kernel that's eventually evolved into the one found in C6x DSP's IMGLIB. I didn't work alone on this kernel—a few of us looked at how to solve the recurrence, and eventually we converged on this clever approach.

On that device, loads have 4 exposed delay slots (total latency 5), and there is no hardware forwarding between loads and stores. But, a load immediately following a store sees the preceding store's results. To write a correct count[idx]++ sequence in assembly, you end up with 7 cycle recurrence: Load, 4 dead cycles, Add, Store.

In the DSP's histogram kernel, we were able to maximize the memory bandwidth utilization (just shy of 1 bin update per cycle--8 updates in 9 cycles, IIRC) with just 4 separate histograms, and one level of "increment forwarding." That is, if there are two consecutive updates to the same bin in the same histogram, add an extra '1' to the second update, so you can the next update's load above the previous update's store. (That may be possible on an x86; I've never tried.) The C6x makes it easy to forward an extra increment, as its compare instructions deposit 0 or 1 in a general-purpose register.

The resulting sequence to update a single histogram looks something like this in steady state, in pseudo-code:

Notice the stores for idx0 happen _after_ the loads for idx1, and vice versa. There's obviously some code before/after the loop to get the initial values of everything into a proper state, and to fix things up at the end.

The architecture has a second curveball it throws at you. The C6x CPUs CPU can process two memory accesses per cycle (which is why it can average one bin update per cycle), but most family members use single-ported memory. To allow two simultaneous accesses in the same cycle, it divides the memory into multiple banks, selected by the LS-bits of the address. If two memory accesses go to the same bank, you get a bank hit. (1 cycle stall.) On the C64x and later, it divides memory into 8 x 32-bit banks.

So, the four histograms are also interleaved. That is, histogram 0 goes to banks 0 and 4 (assuming a C64x or later device), histogram 1 goes to banks 1 and 5, histogram 2 goes to banks 2 and 6, and histogram 3 goes to banks 3 and 7. This ensures 100% bank-conflict free computation.

Scheduling that assembly was a chore, but we made it work.

Getting back to x86, which most people are concerned with...

As I recall, the original Pentium had an LS-banked L1D also (8 x 16-bit banks if memory serves), to allow the U and V pipe to access memory in parallel. I don't know if bank conflicts are a big factor in the modern L1Ds, but if you have a strongly correlated input set and the L1D is single-port LS-banked, then maybe it does matter.