Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> >
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the
> > behavior for cases with write and avoid the "running always in full
> > timeout" issue without writes.
>
> They're pretty much equivalent and would work. But there are two
> things I still don't understand:
>
> 1: Why is direct reclaim calling congestion_wait() at all? If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed. What's preventing it from doing so?
>
> 2: This is, I think, new behaviour. A regression. What caused it?

120+ kernels and a lot of hurt later;

Short summary: the number of times kswapd and the page allocator call congestion_wait, and the length of time they spend in it, have both been increasing since 2.6.29. Oddly, it has little to do with the page allocator itself.

Test scenario
=============
X86-64 machine, 1 socket, 4 cores
4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
on-board and a piece of crap, and a decent RAID card could blow the budget.
Booted mem=256 to ensure it is fully IO-bound and match closer to what
Christian was doing

At each test, the disks are partitioned, the raid arrays created and an
ext2 filesystem created. iozone sequential read/write tests are run with
an increasing number of processes, up to 64. Each test creates 8G of files,
i.e. 1 process = 1x8G, 2 processes = 2x4G etc.

Large differences in this do not necessarily show up in iozone because the
disks are so slow that the stalls are a tiny percentage overall. However, in
the event that there are many disks, it might be a greater problem. I believe
Christian is hitting a corner case where small delays trigger a much larger
stall.

Why The Increases
=================

The big problem here is that there was no one change. Instead, it has been
a steady build-up of a number of problems. The ones I identified are in the
block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
but need backporting and others I expect are a major surprise. Whether they
are worth backporting or not heavily depends on whether Christian's problem
is resolved.

Some of the "fixes" below are obviously not fixes at all. Gathering this data
took a significant amount of time. It'd be nice if people more familiar with
the relevant problem patches could spring a theory or patch.

1. Congestion queues changed from read/write to sync/async
   fixed in mainline? yes, in 2.6.31
   affects: 2.6.30

2.6.30 replaced congestion queues based on read/write with sync/async in
commit 1faa16d2. Problems were identified with this and fixed in 2.6.31 but
not backported. Backporting 8aa7e847 and 373c0a7e brings 2.6.30 in line with
2.6.29 performance. It's not an issue for 2.6.31.

2. TTY using high order allocations more frequently
   fix title: ttyfix
   fixed in mainline? yes, in 2.6.34-rc2
   affects: 2.6.31 to 2.6.34-rc1

2.6.31 made pty's use the same buffering logic as tty. Unfortunately, it was also allowed to make high-order GFP_ATOMIC allocations. This triggers some high-order reclaim and introduces some stalls. It's fixed in 2.6.34-rc2 but needs back-porting.

3. Page reclaim evict-once logic
   fix title: revertevict

For reasons that are not immediately obvious, the evict-once patches
*really* hurt the time spent on congestion and the number of pages
reclaimed. Rik, I'm afraid I'm punting this to you for explanation because
clearly you tested this for AIM7 and might have some theories. For the
purposes of testing, I just reverted the changes.

4. CFQ: fairness for sync no-idle queues (commit 718eee057)
   fix title: none available

A bisection fingerprinted this patch as a problem introduced between 2.6.32
and 2.6.33. It slightly increases the number of times the page allocator
stalls but drastically increases the number of pages reclaimed. It's not
clear why the commit is such a problem.

Unfortunately, I could not test a revert of this patch. The CFQ and block IO
changes made in this window were extremely convoluted and overlapped heavily
with a large number of patches altering the same code as touched by commit
718eee057. I tried reverting everything made on and after this commit but
the results were unsatisfactory.

Hence, there is no fix in the results below.

Results
=======

Here are the highlights of kernels tested. I'm omitting the bisection
results for obvious reasons. The metrics were gathered at two points:
after filesystem creation and after IOZone completed.

At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
queues" has lodged itself deep within CFQ and I couldn't tear it out or
see how to fix it. Fixing tty and reverting evict-once helps but the number
of stalls is significantly increased and a much larger number of pages get
reclaimed overall.

Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get
back to 2.6.29 performance.

Next Steps
==========

Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
2.6.30.x (assuming that is still maintained, Greg?)?

Rik, any suggestions on what can be done with evict-once?

Corrado, any suggestions on what can be done with CFQ?

Christian, can you test the following amalgamated patch on 2.6.32.10 and
2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
cleanly to 2.6.32, but they will against 2.6.33. It's a combination of
ttyfix and revertevict. If your problem goes away, it implies that the
stalls I can measure are roughly correlated to the more significant problem
you have.

We allocate during interrupts so while our buffering is normally diced up
small anyway, on some hardware at speed we can pressure the VM excessively
for page pairs. We don't really need big buffers to be linear so don't try
so hard.

In order to make this work well we will tidy up excess callers to
request_room, which cannot itself enforce this break up.

 /*
  * We default to dicing tty buffer allocations to this many characters
- * in order to avoid multiple page allocations. We assume tty_buffer itself
- * is under 256 bytes. See tty_buffer_find for the allocation logic this
- * must match
+ * in order to avoid multiple page allocations. We know the size of
+ * tty_buffer itself but it must also be taken into account that the
+ * the buffer is 256 byte aligned. See tty_buffer_find for the allocation
+ * logic this must match
  */