When I generate my random array and run my search for 100 randomly generated values of x, the searches complete in about four seconds. Knowing what wonders sorting does for searching, however, I decided to sort my data (first by Item1, then by Item2, and finally by Item3) before running my 100 searches. I expected the sorted version to perform a little faster because of branch prediction: my thinking was that once we get to the point where Item1 == x, all further checks of t.Item1 <= x would be predicted correctly as "not taken", speeding up the tail portion of the search. Much to my surprise, the searches took twice as long on a sorted array!

I tried switching around the order in which I ran my experiments, and used different seeds for the random number generator, but the effect was the same: searches in an unsorted array ran nearly twice as fast as searches in the same array after sorting!

Does anyone have a good explanation of this strange effect? The source code of my tests follows; I am using .NET 4.0.

Populated in 00:00:01.3176257
Found 15614281 matches in 00:00:04.2463478 (Unsorted)
Populated in 00:00:01.3345087
Found 15614281 matches in 00:00:08.5393730 (Sorted)
Populated in 00:00:01.3665681
Found 15614281 matches in 00:00:04.1796578 (Unsorted)
Populated in 00:00:01.3326378
Found 15614281 matches in 00:00:08.6027886 (Sorted)

@jalf I expected the sorted version to perform a little faster because of branch prediction. My thinking was that once we get to the point where Item1 == x, all further checks of t.Item1 <= x would be predicted correctly as "not taken", speeding up the tail portion of the search. Obviously, that line of thinking has been proven wrong by harsh reality :)
– dasblinkenlight, Dec 24 '12 at 17:20

This question is NOT a duplicate of an existing question here. Do not vote to close it as one.
– ThiefMaster♦, Dec 25 '12 at 20:56

@Sar009 Not at all! The two questions consider two very different scenarios, quite naturally arriving at different results.
– dasblinkenlight, Dec 27 '12 at 10:58

Not related to your question, but you create a class TupleComparer but that is entirely unnecessary since Comparer<Tuple<long, long, string>>.Default already has this behavior (from the IComparable implementation of Tuple<,,>). So you can just use data.Sort() with no arguments.
– Jeppe Stig Nielsen, Aug 9 '13 at 20:49

2 Answers

When you are using the unsorted list all tuples are accessed in memory-order. They have been allocated consecutively in RAM. CPUs love accessing memory sequentially because they can speculatively request the next cache line so it will always be present when needed.

Sorting the list reorders the references, but the tuples themselves stay where they were allocated. Because your sort keys are randomly generated, walking the sorted list visits those tuples in an essentially random memory order. The memory accesses to tuple members become unpredictable, the CPU cannot prefetch effectively, and almost every access to a tuple is a cache miss.

This is a nice example of a specific advantage of GC memory management: data structures which have been allocated together and are used together perform very nicely. They have great locality of reference.

The penalty from cache misses outweighs the saved branch prediction penalty in this case.
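A minimal sketch of how the effect could be measured, not the asker's original test. The names `LocalityDemo`, `Generate`, `Reallocate`, `CountMatches` and `Report`, the element count, and the probe values are all assumptions made for illustration. It times the same counting loop over an unsorted list, a sorted list, and a sorted list whose tuples were freshly re-allocated in list order. The last case assumes that consecutive allocations land roughly next to each other on the heap, which is typical but not guaranteed.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using Row = System.Tuple<long, long, string>;

public static class LocalityDemo
{
    // Hypothetical generator: tuples are allocated one after another,
    // so list order initially matches allocation order.
    public static List<Row> Generate(int count, int seed)
    {
        var rnd = new Random(seed);
        var data = new List<Row>(count);
        for (int i = 0; i < count; i++)
        {
            long a = rnd.Next(1000);
            long b = rnd.Next(1000);
            data.Add(Tuple.Create(a, b, a.ToString()));
        }
        return data;
    }

    // Allocates a fresh tuple for every element, in list order, so that
    // allocation order and traversal order line up again (assumption: the
    // allocator hands out consecutive addresses for consecutive requests).
    public static List<Row> Reallocate(List<Row> source)
    {
        var copy = new List<Row>(source.Count);
        foreach (var t in source)
            copy.Add(Tuple.Create(t.Item1, t.Item2, t.Item3));
        return copy;
    }

    // Linear scan per probe; every tuple is dereferenced to read Item1.
    public static long CountMatches(List<Row> data, long[] probes)
    {
        long total = 0;
        foreach (long x in probes)
            total += data.Count(t => t.Item1 <= x);
        return total;
    }

    public static void Report(string label, List<Row> data, long[] probes)
    {
        var sw = Stopwatch.StartNew();
        long found = CountMatches(data, probes);
        sw.Stop();
        Console.WriteLine("Found {0} matches in {1} ({2})", found, sw.Elapsed, label);
    }

    public static void Main()
    {
        var probes = new long[] { 100, 250, 500, 750, 900 };
        var unsorted = Generate(2000000, 1);
        Report("Unsorted", unsorted, probes);

        var sorted = new List<Row>(unsorted); // copies references only
        sorted.Sort();                        // default Tuple ordering
        Report("Sorted", sorted, probes);
        Report("Sorted, re-allocated", Reallocate(sorted), probes);
    }
}
```

All three runs should report the same match count; only the timings are expected to differ, and the exact ratios will depend on the machine and the runtime.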

Try switching to a struct-tuple. This will restore performance because no pointer-dereference needs to occur at runtime to access tuple members.
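As a rough sketch of what such a struct-tuple might look like (the type name `Triple` and the helper class `StructDemo` are hypothetical, and the ordering simply mirrors the Item1, Item2, Item3 sort described in the question): the elements live inline in the list's backing array, so sorting moves the values themselves and a scan over Item1 stays sequential.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical value-type stand-in for Tuple<long, long, string>.
public struct Triple : IComparable<Triple>
{
    public readonly long Item1;
    public readonly long Item2;
    public readonly string Item3;

    public Triple(long item1, long item2, string item3)
    {
        Item1 = item1;
        Item2 = item2;
        Item3 = item3;
    }

    // Order by Item1, then Item2, then Item3.
    public int CompareTo(Triple other)
    {
        int c = Item1.CompareTo(other.Item1);
        if (c != 0) return c;
        c = Item2.CompareTo(other.Item2);
        if (c != 0) return c;
        return string.CompareOrdinal(Item3, other.Item3);
    }
}

public static class StructDemo
{
    public static List<Triple> Generate(int count, int seed)
    {
        var rnd = new Random(seed);
        var data = new List<Triple>(count);
        for (int i = 0; i < count; i++)
        {
            long a = rnd.Next(1000);
            long b = rnd.Next(1000);
            data.Add(new Triple(a, b, a.ToString()));
        }
        return data;
    }

    // Reads Item1 straight out of the array; no per-element dereference.
    public static int CountAtMost(List<Triple> data, long x)
    {
        return data.Count(t => t.Item1 <= x);
    }

    public static void Main()
    {
        var data = Generate(1000000, 1);
        int before = CountAtMost(data, 500);
        data.Sort(); // uses Triple.CompareTo via IComparable<Triple>
        int after = CountAtMost(data, 500);
        Console.WriteLine("{0} matches before sorting, {1} after", before, after);
    }
}
```

Item3 is still a reference to a string, but the Item1 comparison never touches it, so the scan itself does not chase pointers.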

Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache. The memory accesses might be unpredictable but the target is always in cache. I believe there is still a small penalty because even a load from cache takes some cycles. But that seems not to be a problem because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU hits a wait for memory it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.

This kind of behavior shows how hard it is to predict performance on modern CPUs. The fact that we are only 2x slower when going from sequential to random memory access tells me how much is going on under the covers to hide memory latency. A memory access can stall the CPU for 50-200 cycles. Given that number, one could expect the program to become >10x slower when introducing random memory accesses.

Good reason why everything you learn in C/C++ doesn't apply verbatim to a language like C#!
– Mehrdad, Dec 24 '12 at 17:48

You can confirm this behavior by manually copying the sorted data into a new List<Tuple<long,long,string>>(500000) one-by-one before testing that new list. In this scenario, the sorted test is just as fast as the unsorted one, which matches the reasoning in this answer.
– Bobson, Dec 24 '12 at 17:52

Excellent, thank you very much! I made an equivalent Tuple struct, and the program started behaving the way I predicted: the sorted version was a little faster. Moreover, the unsorted version became twice as fast! So the numbers with struct are 2s unsorted vs. 1.9s sorted.
– dasblinkenlight, Dec 24 '12 at 21:31

So can we deduce from this that a cache miss hurts more than a branch misprediction? I think so, and always thought so. In C++, std::vector almost always performs better than std::list.
– Nawaz, Dec 25 '12 at 5:57

@Mehrdad: No. This is true for C++ also. Even in C++, compact data structures are fast. Avoiding cache misses is as important in C++ as in any other language. std::vector vs std::list is a good example.
– Nawaz, Dec 25 '12 at 6:00

Since Count with a predicate parameter is an extension method for all IEnumerables, I think it doesn't even know whether it is running over a collection with efficient random access. So it simply checks every element, and Usr has explained why performance got worse.

To exploit the performance benefits of a sorted array (such as binary search), you'll have to do a little bit more coding.
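For illustration, a minimal sketch of that extra coding, assuming the list is already sorted by Item1 and the query is "how many elements have Item1 <= x" (the class `SortedCount` and the method `CountAtMost` are hypothetical names). It uses a hand-written upper-bound binary search, so only about log2(n) elements are inspected instead of all of them.

```csharp
using System;
using System.Collections.Generic;
using Row = System.Tuple<long, long, string>;

public static class SortedCount
{
    // Returns the number of elements with Item1 <= x.
    // Precondition (assumed, not checked): 'sorted' is ordered by Item1.
    public static int CountAtMost(List<Row> sorted, long x)
    {
        int lo = 0;
        int hi = sorted.Count;
        // Invariant: everything before lo has Item1 <= x,
        // everything from hi onwards has Item1 > x.
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid].Item1 <= x)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    public static void Main()
    {
        var data = new List<Row>
        {
            Tuple.Create(1L, 0L, "a"),
            Tuple.Create(2L, 0L, "b"),
            Tuple.Create(2L, 1L, "c"),
            Tuple.Create(5L, 0L, "d"),
        };
        Console.WriteLine(CountAtMost(data, 2)); // prints 3
    }
}
```

This answers the counting query without relying on Count at all, which is a different approach from the linear scan in the question.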

I think you misunderstood the question: of course I wasn't hoping that Count or Where would "somehow" pick up on the idea that my data is sorted, and run a binary search instead of a plain "check everything" search. All I was hoping for was some improvement due to the better branch prediction (see the link inside my question), but as it turns out, locality of reference trumps branch prediction big time.
– dasblinkenlight, Dec 25 '12 at 16:12