Thursday, March 13, 2014

There is a relatively new, platform-dependent (Linux-specific) flushing function called sync_file_range(). Some databases (not MySQL) use sync_file_range() internally.
Recently I investigated stall issues caused by buffered writes and sync_file_range(). I learned a lot during the investigation, but I don't think these behaviors are well known to the public, so here I summarize my understanding.

Understanding differences between sync_file_range() and fsync()/fdatasync()

sync_file_range() has some important behavior differences from fsync().

sync_file_range() has a flag to flush to disk asynchronously, while fsync() always flushes to disk synchronously.
sync_file_range(SYNC_FILE_RANGE_WRITE) does async writes (async sync_file_range()), and sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does sync writes (sync sync_file_range()). With async sync_file_range(), you can *usually* return from sync_file_range() very quickly and let Linux flush the pages to disk later. As I describe later, async sync_file_range() is actually not always asynchronous, and sometimes blocks waiting for writeback. It is also important to note that I/O errors are not reported to the caller when using async sync_file_range().
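As an illustration, here is a minimal sketch of the two flag combinations, using Python's ctypes to call the Linux-only syscall. The flag values come from <linux/fs.h>; the wrapper function is my own, not from any library.

```python
import ctypes, os, tempfile

# Flag values from <linux/fs.h> (Linux-only).
SYNC_FILE_RANGE_WAIT_BEFORE = 1
SYNC_FILE_RANGE_WRITE = 2
SYNC_FILE_RANGE_WAIT_AFTER = 4

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.sync_file_range.argtypes = [ctypes.c_int, ctypes.c_longlong,
                                 ctypes.c_longlong, ctypes.c_uint]

def sync_file_range(fd, offset, nbytes, flags):
    """Thin wrapper around the syscall; raises OSError on failure."""
    if libc.sync_file_range(fd, offset, nbytes, flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 8192)

# Async: start writeback of the range and return without waiting.
# Note: I/O errors are not reported back to the caller here.
sync_file_range(fd, 0, 8192, SYNC_FILE_RANGE_WRITE)

# Sync: start writeback of the range and wait until it completes.
sync_file_range(fd, 0, 8192,
                SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER)

os.close(fd)
os.unlink(path)
```

The man page also documents SYNC_FILE_RANGE_WAIT_BEFORE for waiting on any writeback already in flight before starting a new one.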

sync_file_range() lets you specify a file range (starting offset and size) to flush to disk, while fsync() always flushes all dirty pages of the file. Ranges are rounded to page-size units. For example, sync_file_range(fd, 100, 300) will flush from offset 0 to 4096 (flushing page #1), not just offsets 100 to 300, because the minimum I/O unit is a page.
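To make the rounding concrete, here is a small helper of my own (assuming the common 4KB page size) that computes which bytes actually get written back for a given range:

```python
PAGE = 4096  # assuming the common 4KB page size

def flushed_range(offset, nbytes, page=PAGE):
    """Return the page-aligned byte range the kernel actually writes
    back for sync_file_range(fd, offset, nbytes, ...)."""
    start = (offset // page) * page              # round start down
    end = -(-(offset + nbytes) // page) * page   # round end up
    return start, end

# sync_file_range(fd, 100, 300) only touches page #1,
# so bytes 0..4096 are flushed, not just 100..400.
print(flushed_range(100, 300))  # (0, 4096)
```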

sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does not wait for metadata flushing, while fsync() waits until both data and metadata are flushed. fdatasync() skips flushing metadata if the file size does not change (fsync() also skips flushing metadata in that case, depending on the filesystem). sync_file_range() does not wait for metadata flushing even if the file size changes. So if a file is appended (not overwritten in place), sync_file_range() does not guarantee the file can be recovered after a crash, while fsync()/fdatasync() do.
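For appends, this means fdatasync() is the safe choice. The sketch below (my own example, not from the post) shows a durable append, with a comment on why sync_file_range() alone would not be enough:

```python
import os, tempfile

def durable_append(fd, data):
    """Append and make the data recoverable after a crash.
    fdatasync() flushes the data and any metadata needed to read it
    back (e.g. the new file size), so the append survives a crash."""
    os.write(fd, data)
    os.fdatasync(fd)

fd, path = tempfile.mkstemp()
durable_append(fd, b"log record\n")
# Using only sync_file_range(fd, off, len,
#   SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER)
# here would flush the data blocks but not the file-size update,
# so the appended bytes might be unreachable after a crash.
os.close(fd)
os.unlink(path)
```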

The behavior of sync_file_range() depends heavily on the kernel version and filesystem.

xfs flushes neighboring dirty pages in addition to the specified ranges. For example, sync_file_range(fd, 8192, 16384) does not only trigger flushing pages #3 to #4, but also flushes many more dirty pages (i.e. up to page #16). This works very well on HDDs because the effective I/O unit size becomes larger: in general, synchronously writing 1MB 1,000 times is much faster than writing 4KB 256,000 times. ext3 and ext4 do not do neighbor page flushing.

sync_file_range() is generally faster than fsync() because it can restrict the dirty page range and skips waiting for metadata flushing. But sync_file_range() can't be used to guarantee durability, especially when the file size changes.

A practical use of sync_file_range() is where you don't need full durability but want to control (reduce) dirty pages. For example, Facebook's HBase uses sync_file_range() for compactions and HLog writes. HBase does not need full durability (fsync()) per write because it relies on HDFS, and HDFS can recover from its replicas. Compactions write huge volumes of data, so periodically calling sync_file_range() makes sense to avoid burst writes. Calling sync_file_range() on 1MB 1,000 times periodically gives a more stable workload than flushing 1GB at once. RocksDB also uses sync_file_range().
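The periodic-flush pattern can be sketched roughly as follows. The function name and the 1MB chunk size are my own illustrative choices, and the ctypes call is Linux-only:

```python
import ctypes, os, tempfile

SYNC_FILE_RANGE_WRITE = 2  # from <linux/fs.h>; Linux-only
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.sync_file_range.argtypes = [ctypes.c_int, ctypes.c_longlong,
                                 ctypes.c_longlong, ctypes.c_uint]

CHUNK = 1 << 20  # flush every 1MB instead of letting 1GB pile up

def write_with_periodic_flush(fd, data):
    """Hypothetical compaction-style writer: buffered write()s, with
    an async sync_file_range() after each chunk so writeback is
    spread out instead of bursting all at once."""
    written = 0
    while written < len(data):
        n = os.write(fd, data[written:written + CHUNK])
        # Kick off writeback of the chunk we just wrote; this usually
        # returns quickly instead of waiting for the disk.
        libc.sync_file_range(fd, written, n, SYNC_FILE_RANGE_WRITE)
        written += n
    return written

fd, path = tempfile.mkstemp()
total = write_with_periodic_flush(fd, b"z" * (4 * CHUNK))
os.close(fd)
os.unlink(path)
```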

Async sync_file_range() is not always asynchronous

Sometimes you may want to flush pages/files earlier than the kernel flusher threads (bdflush) would, in order to avoid burst writes. fsync() and sync sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)) can be used for that purpose, but both take a long time (~10ms) on HDD if the RAID write cache is disabled. You probably don't want to execute them from a user-facing thread.
How about using async sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE)) from a user-facing thread? It is supposed not to wait for I/O, so latency should be minimal. But I don't recommend calling sync_file_range() from a user-facing thread like that. It is actually not always asynchronous, and in many cases it spends time waiting for disk I/O.
Below are a couple of examples where async sync_file_range() takes a long time. In the following examples, I assume stable page writes are already disabled.

Stall example 1: Small-range sync_file_range()

In example 1, with the stable page write fix, write() does not wait for dirty pages to be written back to disk. But sync_file_range() does wait for writeback. When stable page writes are disabled, a page can be under writeback and marked dirty at the same time. Here is an example scenario: write() dirties part of a page, async sync_file_range() starts writing that page back, write() dirties the same page again while writeback is still in progress, and then async sync_file_range() is called again.

In this case, the second sync_file_range(SYNC_FILE_RANGE_WRITE) is blocked until the flush to disk triggered by the first sync_file_range() completes, which may take tens of milliseconds.
Here is an example stack trace when sync_file_range() is blocked.

Stall example 2: Bulk sync_file_range()

What happens if we call write() multiple times and then call sync_file_range(SYNC_FILE_RANGE_WRITE) for multiple pages at once? In the example below, write() is called 21 times, then a flush is triggered by sync_file_range().
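A rough sketch of that scenario (my own reconstruction; the ctypes call is Linux-only):

```python
import ctypes, os, tempfile

SYNC_FILE_RANGE_WRITE = 2  # Linux-only flag from <linux/fs.h>
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.sync_file_range.argtypes = [ctypes.c_int, ctypes.c_longlong,
                                 ctypes.c_longlong, ctypes.c_uint]

fd, path = tempfile.mkstemp()

# 21 small, unaligned write()s: consecutive calls share pages, so a
# page can be dirtied again while an earlier flush of it is still in
# flight on the device.
for _ in range(21):
    os.write(fd, b"a" * 1000)

# One bulk async flush over everything written so far. If any page in
# the range is already under writeback, this call blocks on it.
size = os.fstat(fd).st_size
ret = libc.sync_file_range(fd, 0, size, SYNC_FILE_RANGE_WRITE)

os.close(fd)
os.unlink(path)
```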

Unfortunately, sync_file_range() may take time in this case too.
On xfs it works as follows. Since xfs flushes neighboring pages via sync_file_range(), a page may be both under writeback and marked dirty at the same time.

Note that if the write volume (and the overall disk utilization) is low enough relative to disk speed, page #6 should be flushed to disk before the second sync_file_range() starts. In that case it doesn't have to wait at all.

Stall example 3: Aligned page writes

The main reason async sync_file_range() was blocked is that write() was not aligned to the page size. What if we do fully page-aligned writes (writing multiples of 4KB)?
With aligned page writes, async sync_file_range() does not wait as shown in examples 1 and 2, and gives much better throughput. But even with aligned page writes, async sync_file_range() sometimes waits for disk I/O.
sync_file_range() submits page write I/O requests to the disk. If there are many outstanding read/write requests in the disk queue, new I/O requests are blocked until a free slot becomes available in the queue. This blocks sync_file_range() too.
The queue size is managed via /sys/block/sdX/queue/nr_requests. You may increase it to a larger value.

echo 1024 > /sys/block/sda/queue/nr_requests

This mitigates stalls in sync_file_range() on busy disks, but it won't solve the problem entirely. If you submit many more write I/O requests, read requests take longer to serve (writes starving reads), which badly hurts user-facing query latency.

Solution for the stalls

Make sure to use a Linux kernel that supports disabling stable page writes; otherwise write() can be blocked. My previous post covers this topic. sync_file_range(SYNC_FILE_RANGE_WRITE) is supposed to be asynchronous, but it actually blocks on writeback in many patterns, so calling sync_file_range() from a user-facing thread is not recommended if you really care about latency. Calling sync_file_range() from a background (non-user-facing) thread is a better solution here.
Buffered writes and sync_file_range() are important for some databases such as HBase and RocksDB. For HBase/Hadoop, using JBOD is one of the well-known best practices. HLog writes are buffered and not flushed to disk on every write (put operation), and some HBase/Hadoop distributions support sync_file_range() to reduce outstanding dirty pages. From the operating system's point of view, HLog files are appended, and the file size is not small (64MB by default). This means all HLog writes go to a single disk in a JBOD configuration, so that single disk tends to be overloaded. An overloaded disk takes longer to flush dirty pages (via sync_file_range() or bdflush), which may block further sync_file_range() calls. To get better latency, it is important to use a Linux kernel that can disable stable page writes and to call sync_file_range() from background threads (not from user-facing threads).
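As a closing illustration, here is a minimal sketch of the background-thread pattern. The queue-based structure is my own; real systems like HBase and RocksDB organize this differently, and the ctypes call is Linux-only.

```python
import ctypes, os, queue, tempfile, threading

SYNC_FILE_RANGE_WRITE = 2  # Linux-only flag from <linux/fs.h>
libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.sync_file_range.argtypes = [ctypes.c_int, ctypes.c_longlong,
                                 ctypes.c_longlong, ctypes.c_uint]

ranges = queue.Queue()  # (fd, offset, nbytes) to flush; None = stop

def flusher():
    """Background thread: even if sync_file_range() blocks on
    writeback here, user-facing threads keep serving requests."""
    while True:
        item = ranges.get()
        if item is None:
            break
        fd, off, nbytes = item
        libc.sync_file_range(fd, off, nbytes, SYNC_FILE_RANGE_WRITE)

t = threading.Thread(target=flusher)
t.start()

fd, path = tempfile.mkstemp()
off = os.write(fd, b"b" * 8192)  # fast buffered write on the hot path
ranges.put((fd, 0, off))         # hand the flush to the background
ranges.put(None)
t.join()
os.close(fd)
os.unlink(path)
```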



About Me

I am a database engineer at Facebook.
Before joining Facebook, I was a principal database and infrastructure architect at DeNA. My primary responsibility at DeNA was to make our database infrastructure more reliable, faster and more scalable. Before joining DeNA, I worked at MySQL/Sun/Oracle as a lead MySQL consultant in APAC for four years.
You can contact me at Yoshinori.Matsunobu_at_gmail.com (replace _at_ with @).