I just stumbled upon this blog post. The author shows two code samples that loop through a rectangle and compute something (my guess is that the computation is just a placeholder). In one example he scans the rectangle vertically, and in the other horizontally. He then says the second is faster, and that every programmer should know why. Now I must not be a programmer, because to me they look exactly the same. Can anyone explain that one to me?

6 Answers
6

Cache coherence. When you scan horizontally, your data will be closer together in memory, so you will have fewer cache misses and thus better performance. For a small enough rectangle, this won't matter.
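To make the two scans concrete, here is a minimal sketch in C. It assumes the rectangle is stored row by row in one flat buffer, with element (x, y) at index y*width + x; the function names are made up for illustration and are not taken from the blog post.

```c
#include <stddef.h>

/* Sum a width x height grid stored row by row (element (x, y) at y*width + x).
   The inner loop walks x, so consecutive reads touch adjacent addresses. */
long sum_horizontal(const int *grid, size_t width, size_t height)
{
    long total = 0;
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            total += grid[y * width + x];
    return total;
}

/* Same sum, but the inner loop walks y, so consecutive reads are
   width elements apart in memory. */
long sum_vertical(const int *grid, size_t width, size_t height)
{
    long total = 0;
    for (size_t x = 0; x < width; x++)
        for (size_t y = 0; y < height; y++)
            total += grid[y * width + x];
    return total;
}
```

Both functions return the same value and execute the same number of additions and multiplications; only the order in which memory is touched differs.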

It's faster to scan horizontally if you store your 2d array as array[y*width+x] (as in the original example). If you store your array as array[x*height+y] then scanning vertically will be faster.
–
Kai Jun 15 '09 at 17:15

1

To further clarify this excellent answer: arrays in C/C++ are actually stored as a single block of storage, so accessing a[x][y] is the same as accessing a[x*width + y]. When you iterate height first, you are skipping over large blocks of data, which can cause cache misses if your array is large enough.
–
z - Jun 15 '09 at 17:16
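The flat-storage point in the comment above can be checked directly. This is only a sketch: ROWS, COLS, flat_index and same_address are hypothetical names chosen for the illustration, and the array is assumed to hold ints.

```c
#include <stddef.h>

enum { ROWS = 3, COLS = 4 };

/* Flat offset (in elements) of a[row][col] in a row-major ROWS x COLS array. */
size_t flat_index(size_t row, size_t col)
{
    return row * COLS + col;
}

/* Returns 1 if a[row][col] sits exactly flat_index(row, col) ints
   past the start of the array, measured in bytes. */
int same_address(int a[ROWS][COLS], size_t row, size_t col)
{
    size_t byte_offset =
        (size_t)((const char *)&a[row][col] - (const char *)a);
    return byte_offset == flat_index(row, col) * sizeof(int);
}
```

Stepping to the next column moves one element forward; stepping to the next row moves COLS elements forward, which is the "skipping over large blocks" the comment describes.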

@Kai that's true, if they wanted to store an array like that, then it would be faster to go vertically first.
–
z - Jun 15 '09 at 17:18

2

Well, you're right but that isn't called "cache coherence." That's just optimizing for cache hits. Cache coherence is an entirely different animal.
–
BobbyShaftoe Oct 22 '09 at 22:40

Yes, cache is a big part of the reason. All those elements have to be stored in memory in some order. If you index through them in the order they are stored, you are likely to have fewer cache misses. Likely.

The other issue (also mentioned by a lot of answers) is that pretty much every processor has a very fast integer increment instruction. They do not typically have a very fast "increment by some amount multiplied by this second arbitrary amount". That's what you are asking for when you index "against the grain".

A third issue is optimization. A lot of effort and research has been put into optimizing loops of this kind, and your compiler will be much more likely to be able to put one of those optimizations into effect if you index through it in some reasonable kind of order.

Usually, as programmers, we can think of our programs' addressable memory as a flat array of bytes, from 0x00000000 to 0xFFFFFFFF. The operating system will reserve some of those addresses (all the ones lower than 0x80000000, say) for its own use, but we can do what we like with the others. All those memory locations live in the computer's RAM, and when we want to read from them or write to them we issue the appropriate instructions.

But this isn't true! There are a bunch of complications tainting that simple model of process memory: virtual memory, swapping, and the cache.

Talking to RAM takes a fairly long time. It's much faster than going to the hard disk, as there aren't any spinning plates or magnets involved, but it's still pretty slow by the standards of a modern CPU. So, when you try to read from a particular location in memory, your CPU doesn't just read that one location into a register and call it good. Instead, it reads that location, /and a bunch of nearby locations/, into a processor cache that lives on the CPU and can be accessed much more quickly than main memory.

Now we have a more complicated, but more correct, view of the computer's behavior. When we try to read a location in memory, first we look in the processor cache to see if the value at that location is already stored there. If it is, we use the value in the cache. If it isn't, we take a longer trip into main memory, retrieve the value as well as several of its neighbors and stick them in the cache, kicking out some of what used to be there to make room.

Now we can see why the second code snippet is faster than the first. In the second example, we first access a[0], b[0], and c[0]. Each of those values is cached, along with their neighbors, say a[1..7], b[1..7], and c[1..7]. Then when we access a[1], b[1], and c[1], they're already in the cache and we can read them quickly. Eventually we get to a[8], and have to go to RAM again, but seven times out of eight we're using nice fast cache memory instead of clunky slow RAM memory.

(So why don't accesses to a, b, and c kick each other out of the cache? It's a bit complicated, but essentially the processor decides where to store a given value in the cache by its address, so three objects that aren't near each other spatially are unlikely to be cached into the same location.)

By contrast, consider the first snippet from lbrandy's post. We first read a[0], b[0], and c[0], caching a[1..7], b[1..7], and c[1..7]. Then we access a[width], b[width], and c[width]. Assuming width is >= 8 (which it probably is, or else we wouldn't care about this sort of low-level optimization), we have to go to RAM again, caching a new set of values. By the time we get to a[1], it will probably have been kicked out of the cache to make room for something else. In the not-at-all-uncommon case of a trio of arrays that are larger than the processor cache, it's likely that /every single read/ will miss the cache, degrading performance enormously.
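A toy model makes the difference countable. The sketch below is not how any real processor is specified: it assumes a direct-mapped cache with 16 slots and 8 elements per line (LINE_ELEMS, NUM_LINES, ToyCache and the function names are all invented for the illustration), and it simulates a single array rather than the three in the post.

```c
#include <stddef.h>

enum { LINE_ELEMS = 8, NUM_LINES = 16 };

/* Toy direct-mapped cache: each slot remembers which memory line it holds. */
typedef struct {
    long tag[NUM_LINES];   /* line number held in each slot, -1 if empty */
    size_t misses;
} ToyCache;

void cache_reset(ToyCache *c)
{
    for (int i = 0; i < NUM_LINES; i++)
        c->tag[i] = -1;
    c->misses = 0;
}

/* Simulate reading element `index`; on a miss, load its whole line. */
void cache_read(ToyCache *c, size_t index)
{
    long line = (long)(index / LINE_ELEMS);
    size_t slot = (size_t)line % NUM_LINES;
    if (c->tag[slot] != line) {
        c->tag[slot] = line;   /* evict whatever the slot held before */
        c->misses++;
    }
}

/* Count misses when x varies fastest (row order). */
size_t misses_horizontal(size_t width, size_t height)
{
    ToyCache c;
    cache_reset(&c);
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            cache_read(&c, y * width + x);
    return c.misses;
}

/* Count misses when y varies fastest (column order). */
size_t misses_vertical(size_t width, size_t height)
{
    ToyCache c;
    cache_reset(&c);
    for (size_t x = 0; x < width; x++)
        for (size_t y = 0; y < height; y++)
            cache_read(&c, y * width + x);
    return c.misses;
}
```

In this model a 64 x 64 array misses 512 times when scanned horizontally (once per 8-element line, so seven reads out of eight hit) and 4096 times when scanned vertically (every read misses). A 4 x 2 array fits in a single line, so both orders miss exactly once.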

This has been a very high-level discussion of modern caching behavior. For something more in-depth and technical, this looks like a thorough yet readable treatment of the subject.

Yeah, 'cache coherence'...of course it depends, you could optimize memory allocation for vertical scans. Traditionally video memory is allocated left-to-right, top-to-bottom, going back I'm sure to the days of CRT screens which drew scanlines the same way. In theory you could change this though--all this to say there isn't anything intrinsic about the horizontal method.

The reason is that there is really no such thing as a 2 dimensional array when you get down to the hardware level of how memory is laid out. So when scanning 'vertically', to get to the next cell you need to visit, you're doing an operation along these lines:

For a 2D array indexed as (row, column) this needs to be translated into a single dimension array of array[index] because memory in a computer is linear.

So if you're scanning vertically, the next index is calculated as:

index = row * numColumns + col;

however, if you're scanning horizontally then the next index is just as follows:

index++;

A single addition is going to be fewer op codes for the CPU than a multiplication AND addition, and thus horizontal scanning is faster because of the architecture of computer memory.

Cache is not the answer because if this is the first time you're loading this data, every data access will be a cache miss. For the very first execution, horizontal is faster because there are fewer operations. Subsequent loops through the rectangle will be made faster by cache, and vertical could be slower because of cache misses if the rectangle is sufficiently large, but will always be slower than horizontal scanning because of the increased number of operations needed to access the next element.

Get a picture for your profile. There are too many Jims on this site, it's hard to tell you apart.
–
JaredPar Jun 15 '09 at 19:39

2

This isn't true. Both versions of the algorithm provided require the same number of increments, multiplies, reads, and stores. There's no difference whatsoever in how the indexes are calculated.
–
David Seiler Jun 15 '09 at 20:09

You're saying that accessing an array defined in C as a[10][10] in the order a[0][0], a[1][0], a[2][0] is no different from a[0][1], a[0][2], a[0][3], because the exact same translation, a[row * numCols + col], happens even when we tell it to access horizontally. The inefficiency here is letting the compiler decide how you want to access the data. If the compiler is not smart enough to notice that you're accessing the data completely sequentially and optimize for the addition case, then you're better off writing this in ASM, in which case my answer holds.
–
Jim Wallace Jun 15 '09 at 20:26

1

Well, sorta. The arrays in the example aren't multidimensional, but that doesn't matter; a reasonably capable C compiler can convert the inner loop to pointer arithmetic. But it can do that regardless of access order, so the cost in cycles between the two cases is the difference between adding y to a pointer, and adding 4 (since ints are 4-aligned). This is not a large number of cycles, and may even be zero. One cache miss costs as much as hundreds of register to register adds, so cache coherence is the right answer for any working set large enough to be worth this degree of focus.
–
David Seiler Jun 15 '09 at 21:59
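The pointer-arithmetic point in the last comment can be sketched by hand. This is roughly what a compiler might reduce the loops to, not a claim about what any particular compiler emits; the function names are hypothetical and the grid is assumed to be row-major.

```c
#include <stddef.h>

/* Row-order walk: the pointer advances by one element per step. */
long sum_rows_stepped(const int *grid, size_t width, size_t height)
{
    long total = 0;
    const int *p = grid;
    const int *end = grid + width * height;
    while (p != end)
        total += *p++;
    return total;
}

/* Column-order walk: the offset advances by width per step.
   Still a single addition per element, with no multiply in the inner loop. */
long sum_cols_stepped(const int *grid, size_t width, size_t height)
{
    long total = 0;
    for (size_t x = 0; x < width; x++) {
        size_t offset = x;
        for (size_t y = 0; y < height; y++) {
            total += grid[offset];
            offset += width;
        }
    }
    return total;
}
```

Either way the inner loop costs one add per element, so the arithmetic is not where the two orders differ; the stride between consecutive reads (1 versus width) is.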