As proposed by Chris, Dave and Jan, don't start foreground writeback IO inside balance_dirty_pages(). Instead, simply let it idle sleep for some time to throttle the dirtying task. In the meantime, kick off the per-bdi flusher thread to do the background writeback IO.

RATIONALE
=========

The current balance_dirty_pages() does IO rather inefficiently, in two ways.

- concurrent writeback of multiple inodes (Dave Chinner)

If every thread doing writes and being throttled starts foreground writeback, there are N IO submitters writing at least N different inodes at the same time. That ends up with N different sets of IO being issued with potentially zero locality to each other, resulting in much lower elevator sort/merge efficiency, so the disk seeks all over the place to service the different sets of IO. OTOH, if there is only one submission thread, it doesn't jump between inodes in the same way when congestion clears - it keeps writing to the same inode, so large related chunks of sequential IO are issued to the disk. This is more efficient than the above foreground writeback because the elevator works better and the disk seeks less.

- IO size too small for fast arrays and too large for slow USB sticks

The write_chunk used by the current balance_dirty_pages() cannot simply be raised to some large value (e.g. 128MB) for better IO efficiency, because that could lead to user perceivable stalls of more than 1 second. Even the current 4MB write size may be too large for slow USB sticks. The fact that balance_dirty_pages() starts IO on its own couples the IO size to the wait time, which makes it hard to pick a suitable IO size while keeping the wait time under control.
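To put rough numbers on that coupling (illustrative figures, not measurements from this patch): with the dirtier submitting its own IO,

	stall time ~ write_chunk / bdi write bandwidth

	128MB chunk on a ~128MB/s array: ~1s stall per throttle point
	  4MB chunk on a   ~1MB/s stick: ~4s stall per throttle point

so no single write_chunk value suits both fast arrays and slow USB sticks.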

For the above two reasons, it's much better to shift the IO to the flusher threads and let balance_dirty_pages() just wait for enough time or progress.

Jan Kara, Dave Chinner and I explored a scheme to let balance_dirty_pages() wait for enough writeback IO completions to safeguard the dirty limit. However, it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have large accounting errors, leading to long throttle wait times and jitter (see the arithmetic sketch after this list).

- NFS may kill a large number of unstable pages with one single COMMIT. Because the NFS server serves COMMIT with expensive fsync() IOs, it is desirable to delay COMMITs and reduce their number. So such bursty IO completions are not likely to be optimized away, nor are the resulting large (and tiny) stall times they cause in IO-completion based throttling.
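To put a number on the first problem (assuming the stock percpu_counter batch of max(32, 2*nr_cpus); illustrative figures, not measurements): each CPU may hold up to one batch of counts that the global counter has not yet seen, so

	max error ~ nr_cpus * batch

	e.g. 256 CPUs: 256 * max(32, 512) = 131072 pages = 512MB

which can rival the dirty threshold itself on a large NUMA machine.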

So here is a pause time oriented approach, which tries to control the pause time of each balance_dirty_pages() invocation, by controlling the number of pages dirtied before calling balance_dirty_pages(), for smooth and efficient dirty throttling.
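A minimal sketch of that control flow, in kernel-style C (nr_dirtied, nr_dirtied_pause and task_ratelimit are illustrative names for this sketch, not necessarily the final kernel identifiers):

	/* per-page fast path: just count; call the slow path only once
	 * per nr_dirtied_pause dirtied pages */
	void balance_dirty_pages_ratelimited(struct address_space *mapping)
	{
		current->nr_dirtied++;
		if (current->nr_dirtied < current->nr_dirtied_pause)
			return;
		balance_dirty_pages(mapping, current->nr_dirtied);
		current->nr_dirtied = 0;
	}

	/* slow path inside balance_dirty_pages(): sleep, don't do IO */
	pause = HZ * pages_dirtied / task_ratelimit;
	if (pause > MAX_PAUSE)		/* MAX_PAUSE: see the diff below */
		pause = MAX_PAUSE;
	__set_current_state(TASK_UNINTERRUPTIBLE);
	io_schedule_timeout(pause);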

Users will notice that the applications get throttled once they cross the global (background + dirty)/2 = 15% threshold, and are then balanced around 17.5%. Before this patch, the behavior was to just throttle them at the 20% dirtyable memory limit.
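Working the numbers (the defaults implied above are dirty_background_ratio = 10 and dirty_ratio = 20):

	soft throttle threshold = (10% + 20%) / 2 = 15% of dirtyable memory
	balance point          ~= (15% + 20%) / 2 = 17.5%
	old hard throttle point =                   20%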

Since tasks will be soft throttled earlier than before, end users may perceive this as a performance "slow down" if their application happens to dirty more than 15% of dirtyable memory.

THINK TIME
==========

The task's think time is taken into account when computing the final pause time, which makes for accurate throttle bandwidth. In the rare case that the task slept longer than the period time, the extra sleep time will also be compensated for in the next period, if it's not too big (<500ms). Accumulated errors are carefully avoided as long as the task doesn't sleep for too long. There are two cases, shown in the diagrams below and sketched in code after them.

case 1: period > think

	pause = period - think
	paused_when += pause

                     period time
          |======================================>|
              think time
          |===============>|
    ------|----------------|----------------------|-----------
     paused_when          jiffies

case 2: period <= think

	don't pause, and reduce the future pause time by:
	paused_when += period

                   period time
          |=========================>|
                         think time
          |======================================>|
    ------|--------------------------+------------|-----------
     paused_when                              jiffies
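The same logic as a code sketch (paused_when is the per-task timestamp from the diagrams above; the names follow the text rather than the exact kernel code):

	think  = jiffies - current->paused_when;
	period = HZ * pages_dirtied / task_ratelimit;

	if (period > think) {			/* case 1 */
		pause = period - think;
		current->paused_when = jiffies;
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(pause);
		current->paused_when += pause;
	} else {				/* case 2: don't pause */
		current->paused_when += period;	/* shrink the next pause */
	}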

In general there are no big IO performance changes for desktop users, except for some noticeable reduction of CPU overheads. The change mainly benefits file servers with heavy concurrent writers on fast storage arrays, as can be demonstrated by 10/100 concurrent dd's on xfs:

- 1 dirtier case:    the same
- 10 dirtiers case:  CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%; IO size and
                     throughput increase by 10%

 /*
- * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
- * will look to see if it needs to force writeback or throttling.
+ * Don't sleep more than 200ms at a time in balance_dirty_pages().
  */
-static long ratelimit_pages = 32;
-
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
+#define MAX_PAUSE	max(HZ/5, 1)
 
-/*
- * If ratelimit_pages is too high then we can get into dirty-data overload
- * if a large number of processes all perform writes at the same time.
- * If it is too low then SMP machines will call the (expensive)
- * get_writeback_state too often.
- *
- * Here we set ratelimit_pages to a level which ensures that when all CPUs are
- * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high. Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time. So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
- */
-
-void writeback_set_ratelimit(void)
-{
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
-	if (ratelimit_pages < 16)
-		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
-}
-
-static int __cpuinit
-ratelimit_handler(struct notifier_block *self, unsigned long u, void *v)
-{
-	writeback_set_ratelimit();
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block __cpuinitdata ratelimit_nb = {
-	.notifier_call = ratelimit_handler,
-	.next = NULL,
-};
-
 /*
  * Called early on to tune the page writeback dirty limits.
  *
  * We used to scale dirty pages according to how total memory
@@ -1492,9 +1463,6 @@ void __init page_writeback_init(void)
 {
 	int shift;