Monday, March 10, 2014

Many people think buffered writes (write()/pwrite()) are fast because
they don't touch the disk. But this is not always true. A buffered write
sometimes does disk access by itself, or waits for disk accesses issued
by other threads. Here are three common cases where write() takes a long
time (i.e., causes stalls).

1. Read Modify Write

Suppose the following logic: open aaa.dat without O_DIRECT/O_SYNC,
write 1000 bytes sequentially 100,000 times, then flush with
fsync().

You might expect each write() to finish quickly (in well under 0.1ms)
because it shouldn't do any disk access. But that is not always true.
The operating system manages I/O in pages; the page size is 4KB on most
Linux environments. If you modify the first 1000 bytes of a 4KB page,
Linux first needs to read the 4KB page, modify the first 1000 bytes, then
write the page back. The page will sooner or later be written to disk. So
yes, reading the page is needed. This is called RMW (Read Modify Write).
If the page is not cached in the filesystem cache (page cache), it must
be read from disk, which may take tens of milliseconds on an HDD.
This problem often happens when overwriting large files. You can easily reproduce it with the following steps.

1. Creating a large file (cached in fs cache)

dd if=/dev/zero of=aaa bs=4096 count=1000000

2. Uncache the file (i.e. echo 3 > /proc/sys/vm/drop_caches)

3. Writing to the file (using write()/pwrite()) => the
target pages do not exist in the fs cache, so they are read from disk.
You can verify this with iostat.

There are a couple of solutions to avoid slow Read Modify Write.

Appending to a file, not updating in place

Appending to a file means newly allocated pages are always cached, so the slow Read Modify Write issue doesn't happen.
In MySQL, binary logs are appended, not overwritten. InnoDB log files are always overwritten, so this workaround can't be used there.
Note that if you need full durability (calling fsync()/fdatasync() after
each write()), appending is much more expensive than overwriting on most filesystems.
It is well known that sync_binlog=1 is very slow in MySQL up to 5.5; the
main reasons were that group commit was broken and that appending + fsync()
was not fast. In 5.6, group commit is supported, so multi-threaded write
throughput improved a lot.

Always cache target files within filesystem cache

If the target pages are cached in the filesystem cache (page cache), write() doesn't trigger disk reads.
The obvious disadvantage is that this approach wastes memory, and RAM is
expensive. If you have 128GB of InnoDB log files, you don't want to
dedicate 128GB of RAM to them; that RAM should go to the InnoDB buffer
pool.

Aligning the write() I/O unit size to a 4KB multiple

If you always write in Linux-page-size-aligned units (4KB multiples), the Read Modify Write issue can be avoided.
One approach to aligned writes is zero-filling. Below is an example.

In this example, you write 1000 bytes of application data twice, but
actually write 4KB twice, 8KB in total. The remaining 8192-2000=6192 bytes
are zero-filled. With this approach, disk reads don't happen even if the pages are not in the filesystem cache.
The downside is that it needs more space: in the above case 6192 bytes are wasted.
Another approach is to remember the application data offset but still write in 4KB-aligned units.

2. write() may be blocked for "stable page writes"

Dirty pages

Most writes don't flush to disk immediately. write()/pwrite()
write to the Linux page cache and mark the pages dirty, unless the
target file is opened with O_DIRECT/O_SYNC. write() and pwrite() are
basically the same, except that pwrite() takes an explicit offset. By
write() here I mean a dirty-page write.
Dirty pages are sooner or later flushed to disk. This is done by many
processes/functions, such as the kernel flusher threads (formerly
bdflush/pdflush), fsync(), and sync_file_range(). The exact flushing
behavior depends heavily on the filesystem; I consider only XFS here.
While a dirty page is being written to disk, a write() to the same page
is blocked until the flush completes. This is called a "Stable Page Write".
It can cause write() stalls, especially on slower disks: without a write
cache, a flush to disk usually takes ~10ms, and ~100ms in bad
cases.

Suppose the following example: there are two clients. One repeatedly
writes 10 bytes via write(); the other calls fsync() on
the same file.

If you run this on a slower disk (an HDD with the write cache disabled),
you may notice that write() (t2-t1) sometimes takes more than 10ms.
fsync() taking time is expected, because fsync() flushes dirty pages and
metadata to disk, but the buffered write also sometimes takes time, due
to stable page writes.
Another annoying issue is that while write() is blocked, it holds an
exclusive inode mutex, which blocks all writes and disk reads
to/from the same file (all pages within the same file).

Disabling Stable Page Writes

Fortunately, there is a patch to disable Stable Page Writes, and some Linux distributions have adopted it.
With this patch, on most filesystems write() no longer waits for dirty
page writeback, which helps avoid write() latency spikes.

3. Waiting for journal block allocation in ext3/4

If you are using ext3 or ext4, you may still suffer occasional write() stalls, even after disabling stable page writes. This happens when write() waits for journal block allocation. An easier workaround is using xfs, which has no such stalls.

Summary

Question: Why does buffered write() sometimes stall? It just writes to the kernel buffer and doesn't hit disk.
Answer:
1. write() does disk reads when needed. To avoid this, append to files instead of overwriting them, or use OS-page-aligned writes.
2. write() may be blocked by "stable page writes". To avoid this, use a newer Linux kernel that supports disabling stable page writes.
3. If you really care about latency, I don't recommend ext3/ext4. Use xfs instead.

Many of these issues can be mitigated by a battery/flash-backed write cache on the RAID controller, but that is not always possible, and batteries often expire.
In the next post, I'll describe why sync_file_range(SYNC_FILE_RANGE_WRITE) sometimes stalls.


About Me

I am a database engineer at Facebook.
Before joining Facebook, I was a principal database and infrastructure architect at DeNA. My primary responsibility at DeNA was to make our database infrastructure more reliable, faster and more scalable. Before joining DeNA, I worked at MySQL/Sun/Oracle as a lead MySQL consultant in APAC for four years.
You can contact me on Yoshinori.Matsunobu_at_gmail.com (replace _at_ with @).