
How fast can you count lines?

Inspired by earlier work by Llogiq, I decided to look at how fast I could count the number of lines in a string. Assuming the string uses ASCII, UTF-8, or another ASCII-superset character encoding, the bulk of the work consists of counting the linefeed ('\n') characters.

Most programmers would proceed as follows…

cnt = 0; for (i = 0; i < size; i++) if (buffer[i] == '\n') cnt++; return cnt;

It runs at about 0.66 cycles per input character on a recent x64 processor.

Your compiler achieves this speed by automatically vectorizing the problem.

Can we do better?

Let us try a vectorized approach using AVX intrinsics. The idea is simple. We use a vectorized counter made of 32 individual counters, initialized at zero. Within a loop, we load 32 bytes, we compare them (using a single instruction) with the linefeed character. We add the result to the running counter.

So per block of 32 input bytes, we use 3 instructions (one load, one comparison, and one addition), plus some overhead. Could we do better? Maybe, but I suspect it would be very difficult to use far fewer than 3 instructions per 32 input bytes on current processors.
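Here is a minimal AVX2 sketch of the loop just described (my reconstruction, not the author's benchmarked code; the function name and the 255-block flush interval are my choices). Note that one subtlety does not appear in the prose: the 32 byte-sized counters overflow after 255 iterations, so they must be flushed into a wide accumulator periodically.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch of the vectorized counter described above. Each of the 32
 * byte-sized lanes of 'counter' gains one per matching input byte; a
 * byte lane overflows after 255 increments, so we flush the lanes into
 * 'total' at least every 255 blocks of 32 bytes. */
__attribute__((target("avx2")))
size_t count_lines_avx2(const char *buf, size_t size) {
  const __m256i newline = _mm256_set1_epi8('\n');
  size_t total = 0;
  size_t i = 0;
  while (i + 32 <= size) {
    __m256i counter = _mm256_setzero_si256();
    size_t blocks = (size - i) / 32;
    if (blocks > 255) blocks = 255; /* avoid 8-bit lane overflow */
    for (size_t b = 0; b < blocks; b++, i += 32) {
      __m256i in = _mm256_loadu_si256((const __m256i *)(buf + i));
      /* cmpeq sets each matching lane to 0xFF (i.e., -1), so
       * subtracting the mask adds 1 to each matching counter lane:
       * one load, one comparison, one addition per 32 bytes */
      counter = _mm256_sub_epi8(counter, _mm256_cmpeq_epi8(in, newline));
    }
    /* horizontal sum of the 32 lanes via sum-of-absolute-differences */
    __m256i sums = _mm256_sad_epu8(counter, _mm256_setzero_si256());
    total += (size_t)_mm256_extract_epi64(sums, 0) +
             (size_t)_mm256_extract_epi64(sums, 1) +
             (size_t)_mm256_extract_epi64(sums, 2) +
             (size_t)_mm256_extract_epi64(sums, 3);
  }
  for (; i < size; i++) /* scalar tail for the last < 32 bytes */
    total += (buf[i] == '\n');
  return total;
}
```

The "addition" is expressed as a subtraction because the comparison produces -1 in matching lanes; subtracting -1 is how SIMD code conventionally counts matches without an extra masking step.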

With appropriate loop unrolling, this vectorized approach runs about ten times faster than the naive approach (0.04 cycles per input byte) or about 1.3 cycles per 32 input bytes. Because x64 processors have a hard time sustaining more than 4 instructions per cycle, we can estimate that our vectorized implementation is probably within a factor of two of optimality.
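To make the factor-of-two estimate concrete, here is the back-of-the-envelope arithmetic (my own working, using only the figures stated above):

```latex
\underbrace{\frac{3\ \text{instructions}}{32\ \text{bytes}}}_{\text{per-block cost}}
\times
\underbrace{\frac{1\ \text{cycle}}{4\ \text{instructions}}}_{\text{peak issue rate}}
\approx 0.023\ \text{cycles per byte}
```

The measured 0.04 cycles per byte is within a factor of two of this lower bound, which is where the "within a factor of two of optimality" claim comes from.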

How does it compare with Llogiq’s best approach? I’d be interested in knowing, but his code is written in Rust, so this makes a direct comparison with my C code somewhat more difficult.

It seems that icc and clang also vectorize this problem, which is not surprising given how common such code is likely to be. What is slightly disappointing is that they all seem to get it wrong, performance-wise.

According to https://github.com/vks/bytecounttest, llogiq’s solution runs in about 29ms against your 40ms in the `avxu` version (if I read that benchmark correctly). I’m a little worried about differences around inlining (since the bench code is written in Rust, linking with your C code), but I’d say that the perf is very much in the same ballpark.

In case it helps: I have personally met vks at multiple meetups and trust that his results are accurate.

It is not a matter of questioning the individual's integrity. Rather, it is a matter of having numbers that say what you think they say.

I report processing input bytes at a rate of 0.04 cycles per byte.

What should I infer from the 29 ns vs. 40 ns number? That he tested a faster approach that can process input bytes at a rate of 0.03 cycles per byte?

That’s not an impossible number… but is that what is being claimed?

You see my problem?

When I write that I do not trust these numbers, I don't mean that the author made up numbers or cheated in some way… I mean that I am not confident people know exactly what they mean.

Something else bothers me. He reports a gain of two orders of magnitude over the basic code; I report a single order of magnitude. What explains the difference? Possibly inefficiency in the Rust compiler, but once we entertain the possibility that the Rust compiler is deficient, all sorts of considerations come into play.

To put it bluntly, we need a lot more analysis to understand what is going on.

At this point, we do not have an analysis (e.g., how many CPU cycles are used per input byte), and it is not straightforward to reproduce the benchmark.
