Call it 4 minutes: 240 seconds to create a 123 GB file. That is a little north of 500 MB/s write, and a very cache-unfriendly write at that.
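For the record, the arithmetic behind that number, as a tiny sketch. I am assuming "GB" means 10^9 bytes and "MB" means 10^6 bytes here; the function name is just for illustration.

```python
def throughput_mb_s(total_bytes: float, seconds: float) -> float:
    """Average throughput in decimal megabytes (10**6 bytes) per second."""
    return total_bytes / seconds / 1e6


# Figures from the run above, assuming "GB" means 10**9 bytes.
file_bytes = 123e9
elapsed_s = 240
print(f"{throughput_mb_s(file_bytes, elapsed_s):.1f} MB/s")  # prints "512.5 MB/s"
```

If the file were really 123 GiB (2^30-byte units), the same calculation would land somewhat higher, so the "a little north of 500" reading holds either way.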

My goal was to time md5sum. Watching it now, it looks like it is limited by computation, to about 350 MB/s, so this may not be a good test, as the test maxes out before the hardware does. Then again, looking at the CPUs, I see 14% user space usage, with another 8% in system usage. So my guess is it is doing a character read, a small vector buffer at a time. Ugh. Should look at this code at some point.
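One way to tell whether the hashing or the reading is the bottleneck is to time the two separately while reading in big blocks. The following is a rough sketch of that idea, not what md5sum does internally; the function name and the 8 MiB default block size are my own choices for illustration.

```python
import hashlib
import time


def md5_file(path, block_size=8 * 1024 * 1024):
    """Hash a file in large blocks.

    Returns (hex_digest, bytes_read, seconds_reading, seconds_hashing).
    """
    digest = hashlib.md5()
    buf = bytearray(block_size)
    view = memoryview(buf)
    total = 0
    read_s = 0.0
    hash_s = 0.0
    # buffering=0 gives a raw file object, so each readinto() is one read call.
    with open(path, "rb", buffering=0) as f:
        while True:
            t0 = time.perf_counter()
            n = f.readinto(buf)
            t1 = time.perf_counter()
            if not n:
                break
            digest.update(view[:n])
            t2 = time.perf_counter()
            read_s += t1 - t0
            hash_s += t2 - t1
            total += n
    return digest.hexdigest(), total, read_s, hash_s
```

If `seconds_hashing` dominates, the test is compute bound and says little about the disks.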

Ok, need a better reader. That was … slow, and it didn’t look like it was slow due to the disks …

Ok, so I wrote a quick reader: Read As Fast As Possible (RAFAP).
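The idea of such a reader is simple enough to sketch. This is not the actual RAFAP code, just a minimal Python illustration of the same idea: pull the file through one large reusable buffer, do nothing with the data, and time it. The function name and the 16 MiB block size are assumptions.

```python
import time


def read_as_fast_as_possible(path, block_size=16 * 1024 * 1024):
    """Read a file start to finish, discarding the data.

    Returns (bytes_read, elapsed_seconds).
    """
    buf = bytearray(block_size)  # reused for every read, so no per-block allocation
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)
            if not n:  # 0 bytes means end of file
                break
            total += n
    elapsed = time.perf_counter() - start
    return total, elapsed
```

Dividing `bytes_read` by `elapsed_seconds` gives a figure to hold up against what the monitoring tools report.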

Running it. Looks like it’s pegged at 650 MB/s according to dstat. Hmmm… vmstat and dstat are reporting numbers that differ by a factor of 2. Going to have to investigate that at some point (dstat has been quite reliable in the past).
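When two tools disagree, a third opinion straight from the kernel counters can help. On Linux, `/proc/diskstats` reports sectors read per device in 512-byte units, so two snapshots a known interval apart give a read rate. A minimal sketch follows; the helper names are made up, and the device name you pass in is whatever your array shows up as.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sectors_read(diskstats_text, device):
    """Return the cumulative 'sectors read' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, reads completed, reads merged, sectors read, ...
        if len(fields) >= 6 and fields[2] == device:
            return int(fields[5])
    raise KeyError(device)


def read_mb_s(before, after, device, interval_s):
    """Read rate in MB/s (10**6 bytes) between two /proc/diskstats snapshots."""
    delta = sectors_read(after, device) - sectors_read(before, device)
    return delta * SECTOR_BYTES / interval_s / 1e6
```

Usage would be: read `/proc/diskstats` into a string, sleep for a few seconds, read it again, and pass both strings in along with the sleep time. This does not explain the factor of 2, but it shows which of the two tools agrees with the raw counters.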

So it looks like using fread/fopen, we are limited by the operating system. The user load was 1-2 %, while the system load was around 20%. IOzone and others are still pushing quite a bit higher than this.

I tried some O_DIRECT bits to turn off caching. No impact. I wonder if I am getting zonked with kernel memory/user memory copying effects. Alas I am running rPath OpenFiler, and it is somewhat short of tools, so I am trying to build them in an rPath VMware session and copy them over. Could also be file system journaling issues. I tried creating the journal on a different device, but it crashed the mount command when I tried mounting it. Gaak.
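For reference, here is roughly what an O_DIRECT read looks like, as a hedged Python sketch rather than the code I actually ran. O_DIRECT wants the buffer, offset, and length suitably aligned, so this uses an anonymous `mmap` as the buffer (those are page aligned) and stops at the file size instead of issuing one more read at EOF. It falls back to an ordinary buffered read where O_DIRECT is missing or refused; the function names are illustrative.

```python
import mmap
import os


def _read_all(fd, buf):
    """Read from fd into buf until the file size is reached; return bytes read."""
    size = os.fstat(fd).st_size
    total = 0
    while total < size:
        n = os.readv(fd, [buf])
        if n == 0:
            break
        total += n
    return total


def read_direct(path, block_size=1 << 20):
    """Read a whole file, bypassing the page cache when O_DIRECT is usable.

    Returns (bytes_read, used_direct).
    """
    attempts = []
    if hasattr(os, "O_DIRECT"):
        attempts.append((os.O_RDONLY | os.O_DIRECT, True))
    attempts.append((os.O_RDONLY, False))  # buffered fallback

    buf = mmap.mmap(-1, block_size)  # anonymous mapping: page-aligned memory
    try:
        for flags, is_direct in attempts:
            try:
                fd = os.open(path, flags)
            except OSError:
                if not is_direct:
                    raise
                continue  # O_DIRECT refused at open; try buffered
            try:
                return _read_all(fd, buf), is_direct
            except OSError:
                if not is_direct:
                    raise
                # O_DIRECT read refused; fall through to the buffered attempt
            finally:
                os.close(fd)
    finally:
        buf.close()
```

Checking `used_direct` matters: if it comes back `False`, the run measured the page cache path, not the direct path.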

Hmmm… Maybe alignment issues; I didn’t take any pains to align the buffers. Will look at this.
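Checking alignment is cheap. As a small sketch (helper names are mine): take the address of the buffer, test it against the page size, and get a page-aligned buffer from an anonymous mapping when the ordinary allocation is not aligned.

```python
import ctypes
import mmap


def buffer_address(buf):
    """Memory address of the first byte of a writable buffer."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))


def is_aligned(buf, alignment=mmap.PAGESIZE):
    """True if the buffer starts on a multiple of `alignment` bytes."""
    return buffer_address(buf) % alignment == 0


def aligned_buffer(size):
    """Anonymous mapping; these start on a page boundary."""
    return mmap.mmap(-1, size)


plain = bytearray(1 << 20)
paged = aligned_buffer(1 << 20)
print("bytearray page aligned:", is_aligned(plain))  # may be either, allocator dependent
print("mmap page aligned:     ", is_aligned(paged))
```

In C the equivalent would be an aligned allocation such as `posix_memalign` instead of a plain `malloc`.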

FWIW: other simple tests seem to place many of the large block “random” reads at north of 500 MB/s. Would like to see this do better. Looking at block size effects. Will see if block size reduction helps or hurts random IO. I think it will actually hurt it, as the controllers will thrash. Will also try larger block sizes, to see if we can un-thrash it. I would rather be limited by larger block reads than by smaller ones, as I can hide some more latency in there.
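A block size sweep along these lines is easy to script. This is a minimal sketch, not the tests referred to above: it issues random reads at block-aligned offsets with `os.pread` and reports MB/s per block size. The function names, the read count, and the fixed seed are assumptions. It also goes through the page cache, so for real numbers the file needs to be much larger than RAM, or the reads need to be direct.

```python
import os
import random
import time


def random_read_mb_s(path, block_size, reads=256, seed=0):
    """Throughput in MB/s for `reads` random block-aligned reads of `block_size`."""
    blocks = os.path.getsize(path) // block_size
    if blocks == 0:
        raise ValueError("file is smaller than one block")
    rng = random.Random(seed)  # fixed seed so every run visits the same offsets
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        start = time.perf_counter()
        for _ in range(reads):
            offset = rng.randrange(blocks) * block_size
            total += len(os.pread(fd, block_size, offset))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total / max(elapsed, 1e-9) / 1e6


def sweep(path, block_sizes, reads=256):
    """Map each block size to its measured random-read throughput in MB/s."""
    return {bs: random_read_mb_s(path, bs, reads) for bs in block_sizes}
```

Something like `sweep(path, [64 * 1024, 1024 * 1024, 16 * 1024 * 1024])` would show whether the rate climbs or falls as the blocks grow.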

Update: Ok, so I am thinking about this more, and wondering if the limitation I am running into is actually the processor on the RAID card. Basically it might be rigged to do the RAID calculations really fast, but isn’t clocked fast enough for high-throughput, non-calculation-intensive IO. Will need to look into this. I could always simply export all the drives as a big old JBOD and build a RAID in software … but then if the RAID CPU can’t handle the IO now, it really won’t like having the system CPUs shoving bits down its throat at 4 GB/s per RAID card.

Going to have to think about this one, and look up this processor. If I am hitting its limits, I need to see how I can make more effective use out of it, even if this means the corner cases remain corner cases.