Sunday, December 9, 2007

Memory is not free (more on Vista performance)

A while ago I was running the Windows CE part of the Microsoft MN-700 project, which was a joint project between Windows CE and Microsoft Hardware (and it also happened to be my thesis for an entrepreneurship class at TM MBA). I always had a soft spot in my heart for cheap pervasive smart devices...

Anyway, this particular board had a version of Linux that was ported to it by Broadcom, so a direct comparison between Linux and Windows CE routing performance was possible... and it was not to Windows CE's advantage.

Far from it. Linux was routing packets, WAN-LAN, at ~30Mbps, and wireless at ~20. Windows CE was crawling at barely 12Mbps wired and 6Mbps wireless. Now, this was CE's debut in the router space, so, obviously, no perf tuning was done previously - this was our first attempt.

Be that as it may, we were past the date where the product could have been moved by ship to guarantee the shelf space for Christmas, and dangerously close to the threshold for airlift (turns out you need to have it in stores around mid to end of August, or you kiss your Christmas sales goodbye, because the shelf space will be given to other products).

After much profiling during many sleepless nights, we found that no particular part of our system was slow, but it was slow as a whole. After eliminating the last data copy (so the packet essentially went from DMA of one network controller directly to the DMA buffer of the other), we were still slower than Linux, which was completely beyond us - you cannot copy data less than zero!

Then we started looking at CPU performance registers (as it turned out, we could read the Linux CPU state with a hardware probe). What we found was that Windows CE had a LOT more instruction cache misses than Linux. The CPU (a MIPS variant) had an 8K I-cache, and the entire Linux routing code path fit into the cache. The Windows CE routing path had 12K of instructions, so 50% of the cache was constantly evicted.
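To see where a figure like "50% of the cache" can come from, here is a toy simulation. It is only a sketch under assumptions the story above does not confirm: a direct-mapped cache, a 16-byte line, a code path that runs straight through on every pass, and both the 8K and 12K figures being in the same unit. The function name and parameters are mine.

```python
CACHE_BYTES = 8 * 1024   # I-cache size from the story
CODE_BYTES = 12 * 1024   # hot code path size, assumed to be in the same unit
LINE_BYTES = 16          # assumed cache line size (not given in the story)


def simulate(code_bytes, cache_bytes=CACHE_BYTES, line_bytes=LINE_BYTES, passes=10):
    """Run a straight-line code path repeatedly through a direct-mapped cache.

    Returns (miss rate, fraction of cache slots evicted), both measured on
    the last pass, i.e. after the cold-start misses are out of the picture.
    """
    n_slots = cache_bytes // line_bytes
    n_code_lines = code_bytes // line_bytes
    slots = [None] * n_slots          # which code line each cache slot holds
    misses = 0
    evicted = set()
    for _ in range(passes):
        misses = 0
        evicted = set()
        for line in range(n_code_lines):
            slot = line % n_slots     # direct-mapped: one possible slot per line
            if slots[slot] != line:
                misses += 1
                if slots[slot] is not None:
                    evicted.add(slot)
                slots[slot] = line
    return misses / n_code_lines, len(evicted) / n_slots


miss_rate, evicted_fraction = simulate(CODE_BYTES)
print(f"12K path: {miss_rate:.0%} of line fetches miss, "
      f"{evicted_fraction:.0%} of the cache is evicted every pass")
print("8K path: ", simulate(8 * 1024))
```

Under these assumptions the 4K that does not fit keeps fighting with the first 4K of the path for the same half of the cache, so half of the cache is thrown away on every pass, while a path that fits in 8K stops missing entirely after the first pass. A set-associative cache with a different replacement policy would give different numbers.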

After we changed the routing algorithm to be more cache-local (by routing batches of packets, rather than individual packets), we started doing 35Mbps WAN-LAN, and 25Mbps (which is a theoretical maximum for "g") wireless - roughly 20% better than Linux.
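The shape of that change can be sketched in a few lines. This is not the actual CE code: the three stages (parse, lookup, forward), the packet layout and the batch size of 32 are all made up for illustration, and Python will not show the cache effect itself. What it does show is how often the processor has to switch from one stage's code to another's, which is a rough stand-in for how often the instruction working set changes.

```python
def parse(pkt):
    """Hypothetical stage: mark the packet header as parsed."""
    return dict(pkt, parsed=True)


def lookup(pkt):
    """Hypothetical stage: pick an outgoing interface from the destination."""
    return dict(pkt, out_if=pkt["dst"] % 2)


def forward(pkt):
    """Hypothetical stage: decrement the TTL before sending."""
    return dict(pkt, ttl=pkt["ttl"] - 1)


STAGES = (parse, lookup, forward)


def route_per_packet(packets, stages=STAGES):
    """Push each packet through every stage before touching the next packet."""
    out, switches, last = [], 0, None
    for pkt in packets:
        for stage in stages:
            if stage is not last:      # a different stage's code is needed now
                switches += 1
                last = stage
            pkt = stage(pkt)
        out.append(pkt)
    return out, switches


def route_batched(packets, stages=STAGES, batch_size=32):
    """Run one stage over a whole batch before moving on to the next stage."""
    out, switches, last = [], 0, None
    for start in range(0, len(packets), batch_size):
        batch = packets[start:start + batch_size]
        for stage in stages:
            if stage is not last:
                switches += 1
                last = stage
            batch = [stage(p) for p in batch]
        out.extend(batch)
    return out, switches


packets = [{"dst": i, "ttl": 64} for i in range(128)]
one_by_one, switches_one = route_per_packet(packets)
in_batches, switches_batch = route_batched(packets)
print(one_by_one == in_batches, switches_one, switches_batch)
```

Both versions produce the same routed packets, but for 128 packets the per-packet loop changes stage 384 times while the batched loop changes stage 12 times. Each stage's code stays hot for a whole batch instead of being pushed out by the next stage after a single packet.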

So what does all this ancient history mean for modern computing? We have come to believe that memory is basically free - an inexhaustible resource. It is so plentiful that running out of memory has been deemed grounds for a program abort, instead of graceful handling - most programs die outright if "new" or "malloc" fails. So plentiful that it is pretty much impossible to say how much memory a program uses in Windows.

However, even though 4GB of RAM is now firmly under $100, and it is hard or impossible to buy a computer with less than 1GB, DRAM is very, very, glacially slow, and caches are still very limited.

That's right - the moment your memory footprint spills out of 32KB (L1 cache), data accesses become 3 times slower. The moment you run out of 4MB (L2 cache), it's 15 times slower (compared to L2), or 60 (compared to L1).

If you're writing commercial software today, it is probably not complicated math computations - the vast majority of it is exactly this: data accesses. Sorting stuff. Putting it in hashes. Computing hash codes. Retrieving. Enumerating, looking for matches. All of it either goes into memory, or comes from it.

And as you can see, it only takes 1 miss out of 15 accesses to double the amount of time it takes your program to run. And this is a data cache miss rate of under 7%... At a fifty percent cache miss rate, your program is slower by a decimal order of magnitude.
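The arithmetic behind those two claims is easy to check. The sketch below assumes the simplest possible model: every hit costs 1 unit, every miss costs 15 units (the L2-to-DRAM ratio quoted above), and nothing else in the program takes time. The function name and the list of miss rates are mine.

```python
def slowdown(miss_rate, miss_penalty=15.0):
    """Average cost per access relative to a run where every access hits.

    A hit costs 1 unit and a miss costs `miss_penalty` units.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return (1.0 - miss_rate) * 1.0 + miss_rate * miss_penalty


for rate in (0.0, 1 / 15, 0.25, 0.5):
    print(f"miss rate {rate:6.1%}: {slowdown(rate):.2f}x the all-hits time")
```

With 1 miss in 15 the average access costs 14/15 + 1, or about 1.9 times the all-hits cost, and at a 50% miss rate it costs 0.5 + 7.5 = 8 times as much. Real programs also spend time on things other than memory accesses, so treat this as the shape of the curve rather than a prediction.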

I am willing to bet the bank that Vista's overall sluggishness is directly traceable to memory abuse - code paths that spill out of cache all too often, poor cache locality on memory accesses, etc. You can actually see it - I have seen laptops with relatively slow CPUs and monster gaming desktops with extremely fast CPUs that have a very similar "feel" with Vista. The one factor that is common to this otherwise very different hardware is memory latency. So a high rate of cache misses seems to be the only rational explanation.

Here's an interesting comparison of a Mac Plus (1986) vs an AMD desktop (2007) on a few office application tasks.

Turns out that the Mac Plus is faster on some tasks (a little bit), and slower on others (but not by much). So in 20 years we have not seen a lot of improvement in software responsiveness, while CPU power grew by many orders of magnitude. As did memory bandwidth - but not memory latency!

So if you think that the times when software had to be efficient have long passed, you're wrong. Think about it this way - your code has 1MB to run in, tops (other programs and the OS have a claim on the L2 cache as well). And yes, this includes the code itself. What would be different in the way you design your code?