Time for a deeper look at what's going on here...I installed RHEL6 Beta
2 yesterday, on the presumption that since the release version just came
out this week it was likely the same version Marti tested against. It
was also the one I already had an installation DVD for. This was on a
laptop with a 7200 RPM hard drive, already containing an Ubuntu
installation for comparison's sake.
Initial testing was done with the PostgreSQL test_fsync utility, just to
get a rough idea of which configurations were plausibly flushing data to
disk correctly, and which ones could not possibly be doing so. 7200 RPM
= 120 rotations/second, which puts an upper limit of 120 true fsync
executions per second. The test_fsync released with PostgreSQL 9.0 now
reports its results on a scale you can compare directly against that
figure (earlier versions reported seconds/commit, not commits/second).
First I built test_fsync from inside of an existing PostgreSQL 9.1 HEAD
checkout:
$ cd [PostgreSQL source code tree]
$ cd src/tools/fsync/
$ make
And I started with looking at the Ubuntu system running ext3, which
represents the status quo we've been seeing the past few years.
Initially the drive write cache was turned on:
Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC
2010 i686 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.04
DISTRIB_CODENAME=jaunty
DISTRIB_DESCRIPTION="Ubuntu 9.04"
/dev/sda5 on / type ext3 (rw,relatime,errors=remount-ro)
$ ./test_fsync
Loops = 10000
Simple write:
8k write 88476.784/second
Compare file sync methods using one write:
(unavailable: open_datasync)
open_sync 8k write 1192.135/second
8k write, fdatasync 1222.158/second
8k write, fsync 1097.980/second
Compare file sync methods using two writes:
(unavailable: open_datasync)
2 open_sync 8k writes 527.361/second
8k write, 8k write, fdatasync 1105.204/second
8k write, 8k write, fsync 1084.050/second
Compare open_sync with different sizes:
open_sync 16k write 966.047/second
2 open_sync 8k writes 529.565/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 1064.177/second
8k write, close, fsync 1042.337/second
Two notable things here. One, there is no open_datasync defined in this
older kernel. Two, all methods of commit give equally inflated commit
rates, far faster than the drive is capable of. This proves this setup
isn't flushing the drive's write cache after commit.
You can get safe behavior out of the old kernel by disabling its write
cache:
$ sudo /sbin/hdparm -W0 /dev/sda
/dev/sda:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
$ ./test_fsync
Loops = 10000
Simple write:
8k write 89023.413/second
Compare file sync methods using one write:
(unavailable: open_datasync)
open_sync 8k write 106.968/second
8k write, fdatasync 108.106/second
8k write, fsync 104.238/second
Compare file sync methods using two writes:
(unavailable: open_datasync)
2 open_sync 8k writes 51.637/second
8k write, 8k write, fdatasync 109.256/second
8k write, 8k write, fsync 103.952/second
Compare open_sync with different sizes:
open_sync 16k write 109.562/second
2 open_sync 8k writes 52.752/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 107.179/second
8k write, close, fsync 106.923/second
And now results are as expected: just under 120/second.
Onto RHEL6. Setup for this initial test was:
$ uname -a
Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010
x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
$ mount
/dev/sda7 on / type ext4 (rw)
And I started with the write cache off to see a straight comparison
against the above:
$ sudo hdparm -W0 /dev/sda
/dev/sda:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
$ ./test_fsync
Loops = 10000
Simple write:
8k write 104194.886/second
Compare file sync methods using one write:
open_datasync 8k write 97.828/second
open_sync 8k write 109.158/second
8k write, fdatasync 109.838/second
8k write, fsync 20.872/second
Compare file sync methods using two writes:
2 open_datasync 8k writes 53.902/second
2 open_sync 8k writes 53.721/second
8k write, 8k write, fdatasync 109.731/second
8k write, 8k write, fsync 20.918/second
Compare open_sync with different sizes:
open_sync 16k write 109.552/second
2 open_sync 8k writes 54.116/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 20.800/second
8k write, close, fsync 20.868/second
A few changes here, then. open_datasync is available now. It looks
slightly slower than the alternatives in this test, but I didn't see
that in the later tests, so I suspect it's just run-to-run variation.
For some reason plain fsync is dramatically slower in this kernel than
in earlier ones. Perhaps a lot more metadata is being flushed all the
way to disk in that case now?
The issue that I think Marti has been concerned about is highlighted in
this interesting subset of the data:
Compare file sync methods using two writes:
2 open_datasync 8k writes 53.902/second
8k write, 8k write, fdatasync 109.731/second
The results here aren't surprising; if you do two dsync writes, that
will take two disk rotations, while two writes followed by a single sync
only take one. But that does mean that with small values for
wal_buffers, like the default, you could easily end up paying a rotation
sync penalty more than once per commit.
Next question is what happens if I turn the drive's write cache back on:
$ sudo hdparm -W1 /dev/sda
/dev/sda:
setting drive write-caching to 1 (on)
write-caching = 1 (on)
$ ./test_fsync
Loops = 10000
Simple write:
8k write 104198.143/second
Compare file sync methods using one write:
open_datasync 8k write 110.707/second
open_sync 8k write 110.875/second
8k write, fdatasync 110.794/second
8k write, fsync 28.872/second
Compare file sync methods using two writes:
2 open_datasync 8k writes 55.731/second
2 open_sync 8k writes 55.618/second
8k write, 8k write, fdatasync 110.551/second
8k write, 8k write, fsync 28.843/second
Compare open_sync with different sizes:
open_sync 16k write 110.176/second
2 open_sync 8k writes 55.785/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 28.779/second
8k write, close, fsync 28.855/second
This is nice to see from a reliability perspective. On all three of the
viable sync methods here, the speed seen suggests the drive's volatile
write cache is being flushed after every commit. This is going to be
bad for people who have gotten used to doing development on systems
where that's not honored and they don't care, because this looks like a
90% drop in performance on those systems. But since the new behavior is
safe and the earlier one was not, it's hard to get mad about it.
Developers probably just need to be taught to turn synchronous_commit
off to speed things up when playing with test data.
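For reference, that's a one-line change in postgresql.conf, appropriate
only for throwaway development data:

```
# Development only: commits no longer wait for WAL to reach disk.
# A crash can lose recent transactions, but can't corrupt the database.
synchronous_commit = off
```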
test_fsync writes to /var/tmp/test_fsync.out by default, paying no
attention to what directory you're in. So to use it to test another
filesystem, you have to give it an explicit full path with -f.
Next I tested against the old Ubuntu partition that was formatted with
ext3, with the write cache still on:
# mount | grep /ext3
/dev/sda5 on /ext3 type ext3 (rw)
# ./test_fsync -f /ext3/test_fsync.out
Loops = 10000
Simple write:
8k write 100943.825/second
Compare file sync methods using one write:
open_datasync 8k write 106.017/second
open_sync 8k write 108.318/second
8k write, fdatasync 108.115/second
8k write, fsync 105.270/second
Compare file sync methods using two writes:
2 open_datasync 8k writes 53.313/second
2 open_sync 8k writes 54.045/second
8k write, 8k write, fdatasync 55.291/second
8k write, 8k write, fsync 53.243/second
Compare open_sync with different sizes:
open_sync 16k write 54.980/second
2 open_sync 8k writes 53.563/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 105.032/second
8k write, close, fsync 103.987/second
Strange...it looks like ext3 is executing cache flushes, too. Note that
all of the "Compare file sync methods using two writes" results are half
speed now; it's as if ext3 is flushing the first write out immediately?
This result was unexpected, and I don't trust it yet; I want to validate
this elsewhere.
What about XFS? That's a first class filesystem on RHEL6 too:
[root(at)meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
Loops = 10000
Simple write:
8k write 71878.324/second
Compare file sync methods using one write:
open_datasync 8k write 36.303/second
open_sync 8k write 35.714/second
8k write, fdatasync 35.985/second
8k write, fsync 35.446/second
I stopped that there, sick of waiting for it, as there's obviously some
serious work (mounting options or such at a minimum) that needs to be
done before XFS matches the other two. Will return to that later.
So, what have we learned so far:
1) On these newer kernels, both ext4 and ext3 seem to be pushing data
out through the drive write caches correctly.
2) On single writes, there's no performance difference among the three
viable methods (open_datasync, open_sync, fdatasync), while the straight
fsync method has a serious regression in this use case.
3) WAL writes that are forced by wal_buffers filling will turn into a
commit-length write when using the new, default open_datasync. Using
the older default of fdatasync avoids that problem, in return for
causing WAL writes to pollute the OS cache. The main benefit of O_DSYNC
writes over fdatasync ones is avoiding the OS cache.
I want to next go through and replicate some of the actual database
level tests before giving a full opinion on whether this data proves
it's worth changing the wal_sync_method detection. So far I'm torn
between whether that's the right approach, or if we should just increase
the default value for wal_buffers to something more reasonable.
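For concreteness, the two alternatives being weighed amount to
postgresql.conf settings along these lines (the wal_buffers figure is
just an example value, not a recommendation):

```
# Option 1: keep the old default sync method rather than open_datasync
wal_sync_method = fdatasync

# Option 2: raise wal_buffers so fewer WAL writes are forced early by
# the buffers filling mid-transaction
wal_buffers = 16MB
```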
--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books