In the past year, I’ve given the Scaling Apache talk more than a few times, and essentially I think there’s one genuinely useful thing in it: using dd as a window into scheduler and VM performance. This came about by accident more than design, but it’s proved to be a reasonably good, quick way of getting a handle on how well a system will perform. It’s not intended to be as useful as something like lmbench, aimbench or tiobench, but it will certainly tell you whether a system has improved, and it will allow you to compare two systems with each other. And since the book I reviewed in my previous post dealt with analytical system administration, I thought I’d add a small contribution to the blogosphere.

About three and a half years ago, when zero-copy I/O was a hot topic in performance tuning, we set about determining the optimum buffer size for such I/O. If you’ve never encountered zero-copy before, it’s a pretty obvious idea really. When data is transferred from one place to another, the various pieces of hardware and software involved each have their own buffer size: the size of the chunk of data they read at a time. So a disk drive might read 64 KBytes at a time, the kernel 128 KBytes at a time, and the application only 4 KBytes at a time. This means that memory somewhere in the system has to buffer all of the gaps, and it also means we perform far more I/O operations than we need to. It becomes even worse when you’re faced with buffers that differ by only a small amount, a bit like this:

Since we were going to go to the trouble of ensuring that buffer sizes matched up everywhere from the socket layer through to the application layer, it made a lot of sense to pick the most efficient buffer size. We reasoned that even the complex mash of disk drives, RAID memory, kernel subsystems and the virtual filesystem layer must have some preferred data size. The simplest way to put a figure on this is to read with lots and lots of different buffer sizes and measure the speed of those reads. So we wrote dder.sh:
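The original script isn’t reproduced here, but the idea can be sketched in a few lines of shell: read the same file over and over with different buffer sizes, and record a “&lt;buffer size&gt; &lt;bytes/second&gt;” pair for each run. The file name, size range and step below are all illustrative, and instead of the patch mentioned later, this sketch parses the throughput line that modern GNU dd prints on stderr:

```shell
#!/bin/sh
# Sketch of the dder.sh idea: read a file with many different buffer
# sizes and record "<buffer size> <bytes/second>" for each run.
# File name, size range and step are illustrative choices.
FILE=/tmp/testfile
OUT=dder.out

# A file big enough to give stable timings; a modest 64 MB here.
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs=1M count=64 2>/dev/null
size=$(wc -c < "$FILE")

: > "$OUT"
bs=512
while [ "$bs" -le 65536 ]; do
    # Modern GNU dd prints "... copied, X s, ..." on stderr, which
    # plays the role the dd-performance-counter.patch once did; pick
    # out the elapsed-seconds field (exact format varies by version).
    secs=$(LC_ALL=C dd if="$FILE" of=/dev/null bs="$bs" 2>&1 |
           awk '{ for (i = 1; i < NF; i++) if ($(i+1) == "s,") print $i }')
    awk -v bs="$bs" -v sz="$size" -v t="$secs" \
        'BEGIN { if (t > 0) printf "%d %.0f\n", bs, sz / t }' >> "$OUT"
    bs=$(( bs + 1024 ))
done
```

In practice you would repeat the sweep many times (and use a much finer step) to accumulate the tens of thousands of points the graphs below are built from.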

This rather hacky script relies on the pretty common dd-performance-counter.patch, which makes dd output a “transferred in X seconds” message when done. When run, the script produces a nice simple file, formatted “&lt;buffer size&gt; &lt;bytes/second&gt;”, which, plotted with gnuplot, comes out looking something like this:
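For reference, producing such a plot takes only a couple of gnuplot commands driven from the shell. The file names here are illustrative, and a synthetic data file is generated first so the example is self-contained; in practice the data comes from the measurement runs:

```shell
# Generate a synthetic "<buffer size> <bytes/second>" file so the
# example is self-contained; real data comes from the dd runs.
awk 'BEGIN { for (bs = 512; bs <= 65536; bs += 512)
                 print bs, int(1.1e9 * exp(-((bs - 30000) / 40000) ^ 2)) }' > dder.out

# Plot the pairs with gnuplot; dots suit a dense, noisy cloud of points.
command -v gnuplot >/dev/null || { echo "gnuplot not installed" >&2; exit 0; }
gnuplot <<'EOF'
set terminal png size 800,600
set output "dder.png"
set xlabel "buffer size (bytes)"
set ylabel "throughput (bytes/second)"
plot "dder.out" using 1:2 with dots notitle
EOF
```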

It’s a bit of a mess, isn’t it? Well, no, this graph is actually a wealth of information. There are over 100,000 measurements in there, which is a good deal more than underpins our knowledge of the existence of most sub-atomic particles. We just need to learn how to read it.

This messy graph, which is typical of a memory-constrained x86 system, conveys a lot. First off, it’s pretty clear that there is a general trend, and that the peak is somewhere around the 30,000-byte mark. This is surprising in and of itself, since nearly all Unix software uses 4096 or 1024 bytes as its default buffer size. On modern systems, where 30 KB of memory really is not a whole lot, this may be sub-optimal, and there could be a strong case for changing those defaults to something much larger.

Looking more closely at the peak, we can see that it corresponds to over 1.1 x 10^9 bytes/sec, which is 8.8 Gigabit/sec (something not a lot of people realise is that, when dealing with bandwidth, the SI prefixes are interpreted exclusively as powers of 10, never powers of 2), which is close to saturation for a 10 Gigabit interface.
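The conversion is worth spelling out, since mixing up powers of 2 and powers of 10 here is a common trap: bytes/second becomes Gigabit/second by multiplying by 8 and dividing by 10^9, nothing else.

```shell
# Convert bytes/second to Gigabit/second: multiply by 8 bits/byte,
# divide by 10^9 (SI Giga), with no powers of 2 anywhere.
awk -v bps=1100000000 'BEGIN { printf "%.1f Gbit/s\n", bps * 8 / 1e9 }'
# prints "8.8 Gbit/s"
```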

This particular data set was actually obtained from some RAID5 7200 RPM IDE disks, and there was a lot more I/O going on at the time too. It can be surprising just how little difference fast disks make to this kind of sequential-read load. Remember that next time you’re staring at iostat or sar output, trying to get better performance out of your load or debug an I/O problem: even 7200 RPM disks can push 10 Gigabit if the load is organised correctly, so replacing the disks may not be the answer (though of course loads with a lot of seeks and writes, like a database, really do need fast disks, for reasons unrelated to raw throughput).

The next thing we can see on the graph should be pretty clear: there is obvious banding. Although it’s a bit spread out, there are bands of white between the more densely packed red regions. Anyone familiar with steady-state physical processes will recognise this as a tell-tale sign of an underlying resonant frequency and its harmonics, and repeating the experiment always reproduces similar bands. Experimenting on different systems and operating systems with differing underlying hardware seems to indicate that these frequencies coincide with optimal and highly sub-optimal buffer sizes for the physical hard disks.

What is really useful about the banding, though, is that it allows us to discern the next bundle of information from the graph: there are some clear, sudden shifts up and down. The whole plot seems to move instantaneously upwards and downwards by about 3 x 10^8 bytes/sec, bands and all. The graph looks like the superposition of an exponential curve and some step functions, or a square wave of varying periodicity.

Now, since we’ve gone to the trouble of determining that the bands are due to a physical component, and physical components don’t change instantaneously, it’s pretty unlikely that those sudden shifts are due to physical wear or anything else at the hardware level. Instead, it turns out the shifts are a reflection of how well the system scheduler and virtual-memory manager are co-operating to provide the dd processes with the resources they need. Sudden deprioritisation of the dd process, or kswapd becoming active, shows up very clearly. The jerkiness of the graph reflects how uniformly (or non-uniformly, as the case may be) the system affords the dd processes CPU time.

To illustrate this, here’s a graph from the same system with 4 times the amount of memory:

And here again with 12 times the original amount of memory:

We can now see that the dramatic shifts have almost entirely disappeared, which shows just how big an effect memory has, even on processes which are barely using any. The virtual-memory manager can eat a lot of CPU time when you’re short on memory, and processes become a lot more predictable when that doesn’t have to happen. Another thing which is clear from the graphs is that the peak hasn’t moved, it’s still around the 30 KB mark, and the banding is still present, though now we can count four clear bands. This lends more credibility to our theory that those features depend on underlying physical factors.

Now, let’s look at a graph from a totally different kind of system:

Note the difference in scale: the peak in the previous graphs wouldn’t even reach a quarter of the way up this one. This system looks like it could fill a 40 Gigabit/sec interface, and the graph approaches a gentle asymptote rather than peaking and then falling away. The bands are crystal clear, with some really neatly clustered measurements around the harmonics, and there is almost no shifting. This graph is from a dual Itanium with 32GB of RAM. It turns out that all of Intel and HP’s work optimising the I/O paths for the chipset wasn’t entirely a waste of time, especially when you consider that the bus frequency is less than half that of the systems the prior graphs came from. Despite Itanium’s terrible reputation, it seems it’s actually quite a good platform if raw throughput is what you want.

Now, I’m not really certain of the general-purpose usefulness of all of the above information. As it turns out, the optimal buffer sizes aren’t much use for true zero-copy when dealing with the internet, because for the most part it’s only possible to send a mere 1500 bytes at a time over internet links, and as the graphs show, 1500 bytes is way down in the sub-optimal range of any system.

We use large buffer sizes anyway and buy enough memory to cope. Where I have found the graphs useful is as a visual aid for demonstrating to audiences the differences between certain systems and the efficacy of some tunings, as it’s always easy to see the differences. Tune the scheduler/VM better and the graph gets smoother; get faster disks and the plot gets higher; that sort of thing.

But what I am sure about is that approaching these things analytically (designing experiments, taking measurements, interpreting results and refining theories) is a much more productive way to get the most out of systems. The four graphs in this post convey a huge amount of information, as long as you know how to read them, and the means of measurement could hardly be simpler. This was all based on just one variable, and it still taught us a whole lot.

There are subtle tweaks that can be made too. For example, instead of using the bytes/second number from dd itself, we could record the time before calling dd and after its exit within the shellscript. That number would measure not only the I/O efficiency but also incorporate the fork() and exec() speed, which can often be a useful thing to know.
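That tweak might look something like the sketch below, which times each whole dd invocation from the shell so the figure includes process start-up cost. The file name and buffer sizes are illustrative, and the nanosecond timestamps assume GNU date:

```shell
#!/bin/sh
# Time the whole dd invocation from the shell, so the resulting
# bytes/second figure also includes fork()/exec() overhead.
# File name and buffer sizes are illustrative.
FILE=/tmp/testfile
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs=1M count=64 2>/dev/null
size=$(wc -c < "$FILE")

for bs in 4096 16384 32768; do
    start=$(date +%s%N)                  # nanoseconds (GNU date %N)
    dd if="$FILE" of=/dev/null bs="$bs" 2>/dev/null
    end=$(date +%s%N)
    ns=$(( end - start ))                # wall time, incl. fork()/exec()
    echo "$bs $(awk -v sz="$size" -v ns="$ns" \
        'BEGIN { printf "%.0f", sz * 1e9 / ns }')"
done > timed.out
```

Comparing these numbers against dd’s own throughput figure separates raw I/O speed from process start-up overhead.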

At this point many system administrators will probably be tearing their hair out at my over-analysis and the sheer obviousness of much of the information in the graphs, but the real point lies in going to the trouble of making them. Once you get into the habit of doing these things analytically, and get really used to having the data available in such a readily accessible form, you start to see entirely new ways in which you can optimise your system.

Just look at the amount of incredibly useful information condensed into Justin’s latest graphs on SpamAssassin rulesets, for another example. Or have a browse around HEAnet’s MRTGs and see how much information you can parse quickly. So why on earth are we still staring at iostat, top and other ncurses and plain-text apps?

Oh, and buy this book. Now hopefully I can finally put that talk to bed, having forced myself into writing a new one.

8 Replies to "Duff’s Device and Scheduler Benchmarking"

Man, my posts are getting too long. Also, I’ve got the wrong dd: although dd can be implemented using a Duff’s device, the name actually comes from “dataset definition”. I’ve been assuming it was the other way around for years. Stupid me!

[...] In order to make sure that the real benchmarks are as efficient as they can be, we’ve repeated our usual procedure using dd to determine the most efficient buffer size. More details about that procedure can be found in my earlier mis-titled blog post on scheduler benchmarking. [...]