
Ironically enough, compressed reiser4 would blow everything else out of the water in these benchmarks.

Well, I don't know about that. I've been doing a Java port of my company's IDE and language, working with a 4.9 GB database (real customer shipping data), and here, anyway, btrfs seems to be edging out r4 no matter whether it's on an SSD or a regular HD.

[...] compressed reiser4 would blow everything else out of the water in these benchmarks.

None of these safer* filesystems "blow everything else out of the water" ... ext2 would probably come the closest, but who wants to race boats without at least a lifejacket, so that when one gets tossed into the water there's at least a chance of survival?

*safer as in journaled or CoW (or something similar), so the data can be recovered when errors knock the fs out of whack.

Well, I don't know about that. I've been doing a Java port of my company's IDE and language, working with a 4.9 GB database (real customer shipping data), and here, anyway, btrfs seems to be edging out r4 no matter whether it's on an SSD or a regular HD.

I'm not talking about real-world benchmarks, I'm talking about these synthetic benchmarks that Phoronix used in this article, which only write a bunch of zeroes to the disk. They just aren't adequate for benchmarking compressed file systems (neither reiser4 nor btrfs).

Trust me, the one thing reiser4 is really good at is compressing zeroes.

Originally Posted by fhj52

None of these safer* fs "blow everything else out of the water" ... ext2 would prbly come the closest

(Even though you completely missed my point.) You would think so, but most of the time it's not actually true. Modern journaling file systems are much better tuned than old unsafe file systems (ext2, UFS).

In fact, for a random-write workload, CoW is pretty much the ideal file system layout, because it turns random writes into sequential ones.

I, and I think anyone, will agree that using zeros is not 'real-world', of course, but it is nevertheless a baseline, which I think is what the author/tester was aiming for (despite the tests being run on an unstable kernel, unstable btrfs, and ext4, whose stability is, IMO, dubious).

Maybe the compression test(s), at least, could be better. It would, I think, be more constructive to suggest how to achieve something closer to end-user (desktop & server) usage rather than waste bandwidth discussing effectively dead or old filesystems that have neither journal nor CoW safety nets.

iozone write parameters

I'm Chris Mason, one of the btrfs developers. Thanks for taking the time to benchmark these filesystems!

Someone forwarded me the iozone parameters used, and it looks like they have iozone doing 1K writes, which is less than the Linux page size (4K on x86 and x86-64 systems).

One way that btrfs is different from most other filesystems is that we never change pages while data is being written to the disk. When the application is doing 1k writes, each page is modified 4 times.
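The arithmetic behind Chris's point is easy to check on any Linux box; a quick sketch (getconf is standard POSIX, so nothing here is btrfs-specific):

```shell
# Print the kernel page size and how many 1 KB iozone records fit in one page,
# i.e. how many times each page gets modified before it can be written out.
page=$(getconf PAGESIZE)        # 4096 on x86/x86-64
echo "page size: $page bytes"
echo "1K writes per page: $((page / 1024))"
```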

If the kernel decides to write the page somewhere in the middle of those four writes, ext4 will just change the page while it is being written. This happens often as the kernel tries to find free pages by writing dirty pages.

Btrfs will wait for the write to complete, and then because btrfs does copy on write, it will allocate a new block for the new write and write to the new location. This means that we are slow because we're waiting for writes and we're slow because we fragment the file more.
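To see the effect Chris describes outside of iozone, one could compare sub-page and page-sized writes directly; a rough sketch using dd (the paths and sizes are arbitrary, and the gap between the two cases should be most visible on a copy-on-write filesystem):

```shell
# Write the same 4 MiB twice: once as 1 KiB records (four modifications per
# page before writeback) and once as 4 KiB records (one modification per page).
dd if=/dev/zero of=/tmp/t1k bs=1024 count=4096 conv=fsync 2>/dev/null
dd if=/dev/zero of=/tmp/t4k bs=4096 count=1024 conv=fsync 2>/dev/null
ls -l /tmp/t1k /tmp/t4k
```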

Originally Posted by sektion31

Oh, thanks for the clarification. I read that reiser4 and btrfs are more similar to each other than to ext3/4, so I assumed they have a similar design idea.

Just to clarify, the big thing that I took from reiserfs (actually reiserv3, which was the one I worked on) was the idea of key/item storage. The btrfs btree uses a very similar key structure to order the items in the btree.

This is different from ext* which tend to have specialized block formats for different types of metadata. Btrfs just tosses things into a btree and lets it index it for searching.

Hi Chris,
I'm Ric, one of the users excited about the btrfs filesystem (as geeky as that is). Thank you for taking the time to develop it!

I ran IOzone while setting up a new SAS2 RAID adapter (LSI 9211) and disks (Hitachi C10K300) on a dual-socket-940 Opteron (285s) system running openSUSE Linux with a 2.6.32 kernel. The initial purpose was to use md, so an md RAID 0 was created and then stress-tested. IOzone was one of the tools used. An 8 GB file is used to overcome the effects of the installed 4 GB of RAM.
:

The others (ext3, ext4, & JFS) fared about the same, but for them READ was faster and, more importantly, faster than WRITE, as would be expected.

I was a bit short on time then, so just now I ran it with the same IOzone parameters but using the 9211's "Integrated RAID" RAID-0 on a different kernel.
:
File size set to 8388608 KB
Record Size 64 KB
Machine = Linux sm.linuXwindows.hom 2.6.31.6-desktop-1mnb #1 SMP Tue Dec 8 15:
Excel chart generation enabled
Command line used: iozone -L64 -S1024 -a -+u -i0 -i1 -s8G -r64 -M -f /mnt/CLONE/SAS600/iozoneTESTFILE -Rb /tmp/iozone_[Mandriva_2.6.31]_[btrfs]_[9211-8i_RAID-0].xls
Output is in Kbytes/sec :

"Writer report"
"64"
"8388608" 412,433

"Re-writer report"
"64"
"8388608" 417,586

"Reader report"
"64"
"8388608" 391,542

"Re-Reader report"
"64"
"8388608" 393,962

As you can see, same thing: WRITE is faster than READ even on the IR RAID.
Something weird is going on ... Perhaps it is an IOzone & btrfs issue? If so, the IOzone tests are skewed (the wrong way). I'd blame it on this test, but none of the other filesystems had faster WRITEs than READs in the results.

I have not tried it on an Intel Nehalem platform yet, but I thought maybe you should know something odd was occurring (that is not exhibited by the other filesystems).

I don't need an explanation or anything like that, but it would be good to know you got the post, if you have the time. I do have the Excel files if needed.

-Ric

PS: This is not the first time I have found md to be faster than an HBA or RAID card's RAID. Distressing, but also very good for us Linux geeks. ...wish it (md) were cross-platform.

Hi Chris,
I'm Ric, one of the users excited about the btrfs filesystem (as geeky as that is). Thank you for taking the time to develop it!

I ran IOzone while setting up a new SAS2 RAID adapter (LSI 9211) and disks (Hitachi C10K300) on a dual-socket-940 Opteron (285s) system running openSUSE Linux with a 2.6.32 kernel. The initial purpose was to use md, so an md RAID 0 was created and then stress-tested. IOzone was one of the tools used. An 8 GB file is used to overcome the effects of the installed 4 GB of RAM.
:

The others (ext3, ext4, & JFS) fared about the same, but for them READ was faster and, more importantly, faster than WRITE, as would be expected.

Thanks for giving btrfs a try. Usually when read results are too low it is because there isn't enough read ahead being done. The two easy ways to control readahead are to use a much larger buffer size (10MB for example) or to tune the bdi parameters.

Btrfs does crcs after reading, and sometimes it needs a larger readahead window to perform as well as the other filesystems. You could confirm this by turning crcs off (mount -o nodatasum).
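For reference, a sketch of the remount Chris suggests; /mnt/btrfs is a placeholder for the actual mount point, and the guard keeps it a no-op where nothing is mounted there:

```shell
# Remount an existing btrfs mount with data checksums disabled (needs root).
if mountpoint -q /mnt/btrfs; then
    mount -o remount,nodatasum /mnt/btrfs
fi
```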

Linux uses a bdi (backing dev info) to collect readahead and a few other device statistics. Btrfs creates a virtual bdi so that it can easily manage multiple devices. Sometimes it doesn't pick the right read ahead values for faster raid devices.

In /sys/class/bdi you'll find directories named btrfs-N where N is a number (1,2,3) for each btrfs mount. So /sys/class/bdi/btrfs-1 is the first btrfs filesystem. /sys/class/bdi/btrfs-1/read_ahead_kb can be used to boost the size of the kernel's internal read ahead buffer. Triple whatever is in there and see if your performance changes.
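Chris's "triple whatever is in there" step might look like this in practice; a sketch assuming the first btrfs mount shows up as btrfs-1 (the directory check keeps it harmless on machines without a btrfs mount):

```shell
# Triple the kernel readahead window for the first btrfs mount (needs root).
triple() { echo $(( $1 * 3 )); }

bdi=/sys/class/bdi/btrfs-1
if [ -d "$bdi" ]; then
    cur=$(cat "$bdi/read_ahead_kb")
    triple "$cur" > "$bdi/read_ahead_kb" 2>/dev/null \
        || echo "run as root to apply"
fi
```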

If that doesn't do it, just let me know. Most of the filesystems scale pretty well on streaming reads and writes to a single file, so we should be pretty close on this system.

Hi Chris,
Thanks for the explanation and suggestion.
Before seeing it, I did try an older parallel SCSI card, an LSI MegaRAID 320-2x with some Fujitsu U320 disks in RAID 0. The card has 512 MB of BBU cache ...no way I know of to adjust that. [ ...unless you meant a kernel adjustment? ]
The results showed the same WRITE-over-READ skew as before, but even more so:
"Writer report"
"64"
"8388608" 244679

"Re-writer report"
"64"
"8388608" 231935

"Reader report"
"64"
"8388608" 51755

"Re-Reader report"
"64"
"8388608" 50160

Then I found & used your suggestion of nodatasum and changed the readahead value from 4096 to 12288 [ *SAS600 type btrfs (rw,noatime,nodatasum)], and that looks like it definitely fixed the WRITE-faster-than-READ oddity for the 9211-8i SAS2 card. (It has no buffer/cache onboard, but the HDDs have 64 MB and the adapter is set so the disk cache is on.)

It is a bit slower too for READ ... but no drama.
Like most everybody else, I won't be using PAS disks much longer, so I put those numbers up there just as information for you, in case they're needed.
...

On the upside, man, look at those numbers. btrfs just walloped ext4 in this test!
That 490,296 kB/s is the fastest I've ever seen here for a WRITE. By all means, please keep up the good work!

I'll look at the buffering, but with the 9211 HBA there's not much to do for it. Perhaps the disks' cache buffering got turned off between Linux and the MS OS somehow. It should not have, as it is an adapter setting, but the LSI2008/LSI2108 kernel module (mpt2sas) is relatively new. ...it'll take a while to get the software running to find out.

Hi Chris,
Thanks for the explanation and suggestion.
Before seeing it, I did try an older parallel SCSI card, an LSI MegaRAID 320-2x with some Fujitsu U320 disks in RAID 0. The card has 512 MB of BBU cache ...no way I know of to adjust that. [ ...unless you meant a kernel adjustment? ]
The results showed the same WRITE-over-READ skew as before, but even more so:
"Writer report"
"64"
"8388608" 244679

"Re-writer report"
"64"
"8388608" 231935

"Reader report"
"64"
"8388608" 51755

"Re-Reader report"
"64"
"8388608" 50160

Then I found & used your suggestion of nodatasum and changed the readahead value from 4096 to 12288 [ *SAS600 type btrfs (rw,noatime,nodatasum)], and that looks like it definitely fixed the WRITE-faster-than-READ oddity for the 9211-8i SAS2 card. (It has no buffer/cache onboard, but the HDDs have 64 MB and the adapter is set so the disk cache is on.)

It is a bit slower too for READ ... but no drama.
Like most everybody else, I won't be using PAS disks much longer, so I put those numbers up there just as information for you, in case they're needed.
...

-Ric

Thanks for trying this out. nodatasum will improve both writes and reads because it isn't doing the checksum during the write.

On raid cards with writeback cache (and sometimes even single drives with writeback cache), the cache may allow the card to process writes faster than it can read. This is because the cache gives the drive the chance to stage the IO and perfectly order it, while reads must be done more or less immediately. Good cards have good readahead logic, but this doesn't always work out.

So, now that we have the kernel readahead tuned (btw, you can try larger numbers in the bdi read_ahead_kb field), the next step is to make sure the kernel is using the largest possible requests on the card.

cd /sys/block/xxxx/queue, where xxxx is the device for your drive. You want the physical device, and if you're using MD you want to do this for each drive in the MD raid set (for example, cd /sys/block/sda/queue).
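As a sketch of what that looks like (sda is assumed here; in that queue directory, max_sectors_kb is the tunable cap on request size and max_hw_sectors_kb is the hardware ceiling the card reports):

```shell
# Show the request-size limits for one physical drive (sda is a placeholder).
q=/sys/block/sda/queue
if [ -d "$q" ]; then
    echo "max_sectors_kb:    $(cat "$q/max_sectors_kb")"
    echo "max_hw_sectors_kb: $(cat "$q/max_hw_sectors_kb")"
fi
```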