"It’s all a numbers game – the dirty little secret of scalable systems"

Martin Thompson is a High Performance Computing Specialist with a real mission to teach programmers how to understand the innards of modern computing systems. He has many talks and classes (listed below) on caches, buffers, memory controllers, processor architectures, cache lines, etc.

His view is that programmers do not put a proper value on understanding how the underpinnings of our systems work; we gravitate to the shiny and trendy. His approach is not to teach specific programming strategies, but to teach programmers to fish so they can feed themselves. Without a real understanding, strategies are easy to apply wrongly. It's strange how programmers will put a lot of effort into understanding complicated frameworks like Hibernate, yet so little into understanding the hardware their programs actually run on.

A major tenet of Martin's approach is to "lead by experimental observation rather than by what folks blindly say," so it's no surprise he chose a MythBusters theme for his talk Mythbusting Modern Hardware to Gain "Mechanical Sympathy." Mechanical Sympathy is a term coined by Jackie Stewart, the race car driver: you get the best out of a racing car when you have a good understanding of how the car works. A driver must work in harmony with the machine to get the most out of it. Martin extends the notion to computing: we need to know how the hardware works to get the most out of our machines. And he thinks ordinary developers can understand the hardware they are using. If you can understand Hibernate, you can understand just about anything.

The structure of the talk is to take a few commonly held myths and go all MythBusters on them by seeing if they are really true. Along the way there's incredible detail on how different systems work, far too much to cover here, but it's an absolutely fascinating talk. Martin really knows his subject, and he's a good teacher as well.

The most surprising part of the talk is the counterintuitive idea that many of the devices we think of as random access, like RAM, HDDs, and SSDs, effectively become serial devices in certain circumstances. A disk, for example, is really just a big tape that's fast. It's not true random access.

Myth 1: CPUs are Not Getting Faster

The fundamental issue is not that CPUs can't get faster, it's that they can't get hotter. As we run CPUs at higher clock speeds they get hotter and hotter, and heat dissipation at these small scales is incredibly difficult.

Sandy Bridge and Ivy Bridge go parallel inside instead of simply clocking faster. There are 3 ALUs and 6 ports for loading and storing data; the extra ports are needed to keep the ALUs fed, so up to 6 instructions can execute in parallel per cycle.

There's only one divide unit and one jump unit, so highly branched code, or code with a lot of division, doesn't go as fast as straightforward code using +, -, and *.

CPUs have hardware performance counters, so they are easy to profile. On Linux you can access the counters using perf stat.

Running perf stat on a 2.8GHz Nehalem in the Alice in Wonderland test shows the processor is idle about a third of the time, so a faster processor wouldn't help.

On a later 2.4GHz Sandy Bridge the CPU is idle only about 25% of the time. The reason CPUs are getting faster is not higher clock speed; it's that instructions are being fed into the CPU faster.

Myth 2: Memory Provides Random Access

At the end of the day it's a cost equation. Getting to main memory is fairly expensive, and we want to feed the CPUs really fast. How do we do that? The data needs to be close to the CPU, on something that's very, very quick. Modern register-to-register copies cost essentially nothing, because what actually happens is a register remapping; nothing even moves.

There are layers of caches. Caches get bigger and slower as you move outward, so it's speed versus cost, but power matters too: the power needed to access a disk is vastly more than the power needed to access L1 cache. Modern processors are even starting to transfer data from network cards straight into cache, skipping main memory, to keep the thermals down.

There's a lot of detail on memory ordering, cache structures, and coherence. The gist is that there's immensely complicated circuitry between the different layers of memory and the CPU. If you can't make the memory subsystems, caches, and buses fast enough, then you can't feed the CPU fast enough, and there's no point in making CPUs faster.

This means software must access memory in a friendly manner or it starves the CPU. Measured on Sandy Bridge:

Sequential walk: L1D 3 clocks, L2 11 clocks, L3 14 clocks, main memory 6 ns.
In-page random: L1D 3 clocks, L2 11 clocks, L3 18 clocks, main memory 22 ns.
Full random: L1D 3 clocks, L2 11 clocks, L3 38 clocks, main memory 65.8 ns.
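
A minimal sketch of the two extreme access patterns, sequential and fully random, over the same data. Python's interpreter overhead hides most of the raw cache effect, so treat this as an illustration of the patterns rather than a faithful benchmark; the array size is an assumption chosen to be larger than a typical L3 cache.

```python
import random
import time

N = 1 << 21  # ~2M ints: big enough to spill out of the CPU caches
data = list(range(N))

def walk(indices):
    """Sum the elements of data in the order given by indices."""
    total = 0
    for i in indices:
        total += data[i]
    return total

sequential = range(N)        # stride-1 walk: prefetcher-friendly
shuffled = list(range(N))
random.shuffle(shuffled)     # full random walk: defeats prefetch and locality

for name, order in (("sequential", sequential), ("random", shuffled)):
    start = time.perf_counter()
    total = walk(order)
    elapsed = time.perf_counter() - start
    print(f"{name:10s} sum={total} time={elapsed:.3f}s")
```

Both walks compute the same sum; only the order of the memory accesses differs, which is exactly the variable the table above is measuring.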

Since you want to walk memory sequentially, reduce coupling and increase cohesion: keep things that are used together stored together. Low coupling and good cohesion make this all just work. If your code branches everywhere and chases pointers all over the heap, it will be slow.
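
A hypothetical example of "keep things together": the same particle data laid out as scattered heap objects versus packed parallel arrays. The Particle class and field names are invented for illustration; the point is the memory layout, not the physics.

```python
from array import array

# Layout 1: a list of objects. Each particle is a separate heap
# allocation, so iterating over them chases pointers all over the heap.
class Particle:
    __slots__ = ("x", "v")
    def __init__(self, x, v):
        self.x = x
        self.v = v

particles = [Particle(float(i), 1.0) for i in range(1000)]

# Layout 2: parallel arrays ("structure of arrays"). Positions are
# packed contiguously, so a walk touches memory sequentially.
xs = array("d", (float(i) for i in range(1000)))
vs = array("d", (1.0 for _ in range(1000)))

def step_objects(ps, dt):
    for p in ps:
        p.x += p.v * dt

def step_arrays(xs, vs, dt):
    for i in range(len(xs)):
        xs[i] += vs[i] * dt

step_objects(particles, 0.5)
step_arrays(xs, vs, 0.5)
```

Both versions produce identical results; the array version is the cache-friendly one because each step reads the next position from the adjacent memory location.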

Myth 3: HDDs Provide Random Access

Zone bit recording: there's a big difference between writing on the inner and outer parts of the disk. More sectors are put on the outer tracks, giving greater density; for one revolution of the disk the head passes over more sectors, so you get greater throughput.

On a 10K RPM disk, sequentially reading the outer tracks gives about 220 MB/s, while the inner tracks give about 140 MB/s.

The fastest disks spin at 15K RPM, and they haven't gotten any faster in many, many years.

Hardware will prefetch and reorder queued requests as the head moves over the sectors. A sector is now 4K, to get more data onto a disk; if you read or write a single byte, the minimum transferred is 4K.
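
A toy sketch of what that minimum transfer implies: changing one logical byte still costs a full 4K read-modify-write at the device. The "disk" here is just a bytearray and the counters are invented for illustration.

```python
SECTOR = 4096  # modern drives transfer whole 4K sectors

# A toy "disk" of 4 sectors, plus counters for bytes actually transferred.
disk = bytearray(SECTOR * 4)
bytes_read = bytes_written = 0

def write_byte(offset, value):
    """Writing a single byte still costs a full sector read and write."""
    global bytes_read, bytes_written
    start = (offset // SECTOR) * SECTOR             # align down to the sector boundary
    sector = bytearray(disk[start:start + SECTOR])  # read the whole sector
    bytes_read += SECTOR
    sector[offset - start] = value                  # modify one byte
    disk[start:start + SECTOR] = sector             # write the whole sector back
    bytes_written += SECTOR

write_byte(5000, 0xFF)  # one logical byte, 8192 bytes of actual transfer
```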

What makes up an operation?

Command processing: sub-second.

Seek time: 0-6 ms for a server drive, 0-15 ms for a laptop drive.

Rotational latency: a 10K RPM disk takes 6 ms per rotation, for an average latency of 3 ms.

Data transfer: 100-200 MB/s.

For random access of a 4K block, latency is around 10 ms, which works out to roughly 100 IOPS. Random throughput is less than 1 MB/s, maybe 2 MB/s with really clever hardware. So randomly accessing a disk isn't practical. If you see fantastic transaction numbers, the data isn't actually going to disk.
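
The arithmetic behind those numbers, using the per-operation costs listed above. The 0.5 ms command overhead is an assumption to round out the budget; the conclusion barely depends on it.

```python
# Back-of-the-envelope cost of a random 4K read on a 10K RPM drive.
seek_ms = 6.0        # worst-case seek on a server drive
rotation_ms = 3.0    # average rotational latency (half of 6 ms per rotation)
command_ms = 0.5     # command processing overhead (assumed)

latency_ms = seek_ms + rotation_ms + command_ms   # ~10 ms per operation
iops = 1000.0 / latency_ms                        # ~100 random ops per second
throughput_mb_s = iops * 4096 / 1e6               # each op moves one 4K block

print(f"latency {latency_ms:.1f} ms -> {iops:.0f} IOPS -> {throughput_mb_s:.2f} MB/s")
```

Mechanical latency dominates so completely that random throughput lands well under 1 MB/s even though the same drive streams at 100-200 MB/s sequentially.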

A disk is really a big tape that's fast. It's not true random access.

Myth 4: SSDs Provide Random Access

SSDs generally have 2MB blocks arranged in an array of cells. SLC stores a single bit per cell: it either holds a voltage or it doesn't. MLC stores multiple voltage levels per cell, so you can store 2 or 3 bits.

It's expensive to address individual cells, so you address a row at a time, which is called a page; pages are usually 4K or 8K. Reading or writing a random page is really fast because there are no moving parts.

Deleting is different: you can only erase a whole block at a time. The way SSDs work is that erasing sets every cell in the block to one; when you write data you turn off the bits you don't want. Turning off a bit is easy, because it just drains the cell. Turning a bit back on by putting voltage into a cell tends to disturb the cells around it, so you can't accurately set a single bit, which is why erasure happens a whole block at a time.

There's a limited number of times a block can be erased and rewritten, and you don't want the disk to wear out, so pages aren't erased immediately: they are marked as deleted and the new data is copied to a fresh block. This has a cost, and over time the disk ends up fragmented, so eventually the drive must garbage collect, compacting live pages into fewer blocks.
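
A toy model of that write/mark-deleted/erase cycle. Real flash translation layers are far more elaborate, and the sizes here are shrunk so the mechanics are visible; every name is invented for illustration.

```python
PAGES_PER_BLOCK = 4  # real blocks hold hundreds of pages

class ToySSD:
    def __init__(self, blocks):
        # Each page is None (erased), ("live", data), or "dead" (marked deleted).
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(blocks)]
        self.erase_count = 0
        self.pages_copied = 0

    def _find_erased(self):
        for b, block in enumerate(self.blocks):
            for p, page in enumerate(block):
                if page is None:
                    return b, p
        raise RuntimeError("no erased pages left: needs garbage collection")

    def write(self, data):
        """Writes always go to a fresh page; returns its (block, page) address."""
        b, p = self._find_erased()
        self.blocks[b][p] = ("live", data)
        return b, p

    def delete(self, b, p):
        """Deletion only marks the page; the block is erased later."""
        self.blocks[b][p] = "dead"

    def gc_block(self, b):
        """Copy live pages elsewhere, then erase the whole block at once."""
        for page in self.blocks[b]:
            if isinstance(page, tuple):
                self.write(page[1])   # live data must move: write amplification
                self.pages_copied += 1
        self.blocks[b] = [None] * PAGES_PER_BLOCK
        self.erase_count += 1

ssd = ToySSD(blocks=2)
addrs = [ssd.write(f"rec{i}") for i in range(4)]  # fill block 0
ssd.delete(*addrs[0])
ssd.delete(*addrs[1])
ssd.gc_block(0)  # two live pages get copied to block 1, then block 0 is erased
```

Deleting two records and reclaiming the block forced two extra page writes, which is exactly the write amplification the text describes.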

An example SSD can read and write at 200 MB/s. Once you start deleting, read performance still looks good, but writes slow down because of the garbage collection process. For some disks write performance falls off a cliff and you need to reformat. There's also write amplification, where small writes end up triggering a lot of copying.

Reads have great random and sequential performance. If you only do append-only writes, then write performance can be quite good too.

At 40K IOPS with 4K random reads and writes, average operation times are 100-300 microseconds, with pauses of up to half a second during garbage collection.

Reader Comments (6)

I was a bit astonished when I read Myths No. 1 and 2 coming from this fine magazine. These topics are in every university architecture course, and good programmers should know architecture. Thing is, these details have been hidden from us by the media, and we now know, as we should have known before, that more things matter besides raw speed, as in the case of the CPU. Nevertheless, most programmers, who don't need to work these numbers, won't care about these innards, because it's still cheaper to make a program easier to write than efficient.

Alfonso says this stuff doesn't matter for code that isn't at the bleeding edge of performance, but I disagree. I have read that a significant fraction of the CO2 pumped into the atmosphere comes from data centres. If my code is 10 times more efficient than yours, and it will be at least that if I follow Martin's approach, that is 10x less hardware needed to operate it and 10x less CO2 pumped into the atmosphere. So we save money, have cleaner, more efficient code, and we help save the planet!

Code written to follow Martin's advice is not harder to program; fast code is, pretty much by definition, code that provides the most function for the least work. Locality of reference and separation of concerns are hallmarks of good code, and it so happens that code written like this is also cache-friendly and fast to run.

This is a win-win-win if only we stop being lazy and start learning the basics of how our hardware works. I can think of no other field of human endeavour that tolerates the levels of inefficiency we accept as normal in software. I think we should start trying to redress that, and save our employers some money at the same time.

Dave Farley: That doesn't really follow. Amdahl's law says that if we improve a fraction f of a process by a factor of s, the overall speedup is

1 / ((1 - f) + f/s)

The vast majority of the cost of most computations comes from a relatively small part of them. For example, a web app will usually be largely dependent on the database for its performance, whilst a program to compute pi will be dependent on its pi-computation code (and not its logger). So if we say that (as a rough estimate) a general piece of large-scale software (a database, a web app) sees 95% of its computation costs in one 'fast' codebase and the other 5% in a 'slow' codebase, then improving the 'slow' codebase by a factor of 10 improves overall performance by only about 4.7% (substituting into the formula above) - the effort would not lead to a good result.
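
The 4.7% figure is easy to check by plugging the numbers straight into the formula:

```python
def overall_speedup(f, s):
    """Amdahl's law: speed up fraction f of the work by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# A 10x improvement to the 5% 'slow' codebase barely moves the needle...
print(overall_speedup(0.05, 10))    # ~1.047, about a 4.7% overall gain
# ...while a mere 5% improvement to the 95% hot path gains about the same.
print(overall_speedup(0.95, 1.05))  # ~1.047 as well
```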

This is why we see people using Ruby for web apps as a genuine alternative to Java - yes, Java is typically an order of magnitude faster than Ruby, for sure, but that's not usually the problem - it's the database latency etc.

This is the guy's point - if we're limited by something (CPU, hard drive, etc.), in most cases only a few things are causing that bottleneck. You optimise them, and not everything else, because those things will yield the actual benefit. As a case in point, in the example above, improving the 'fast' codebase by about 5% would yield the same speedup as optimising the 'slow' codebase by 1000%. Many things will never become a bottleneck, so optimising them is merely time wasted.

It doesn't make sense to access things randomly when they are stored sequentially, which most data on an HDD are, at least in large chunks. As far as I know, the computer doesn't instruct the disk controller to fetch data that are actually adjacent by specifying each location individually - it just reads on.

Just a few things to clarify. At least with NAND flash, a read isn't actually reading the contents (voltage) of the cell as with DRAM or SRAM; it's testing each cell, one threshold at a time, for the gate voltage that will let the cell turn on. With SLC there's only one test. With MLC a cell needs up to 3 threshold tests; the last level doesn't need its own test because it can be inferred when none of the thresholds turn the cell on. With TLC it's up to 7. And it's all fine when the parity check passes, but when it doesn't, error correction kicks in, which slows down reads.
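
The counts in the comment above follow from simple arithmetic: n bits per cell means 2^n voltage levels separated by 2^n - 1 thresholds, and the final level needs no test of its own because it's implied when every threshold test fails.

```python
# Threshold tests needed per cell type, derived from bits per cell.
for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    levels = 2 ** bits
    thresholds = levels - 1  # one fewer boundary than there are levels
    print(f"{name}: {levels} levels, up to {thresholds} threshold tests")
```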

My understanding of Intel/Micron's new 3D XPoint is that it actually is like reading the contents of an individual cell.

I believe that theoretically SLC could be rewritten without erasing. However, everything is erased to 1, and it's only possible to write 0s. Rewriting without a block erase used to be a trick when programming EEPROMs for specific applications. MLC and TLC are trickier because the voltage thresholds are set to intermediate values.