On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> The poor IO patterns thing is a regression. Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to. AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.

I just know that we XFS guys have been complaining about it a lot..

But that was mostly a tuning issue - before writeout mostly happened
from pdflush. If we got into kswapd or direct reclaim we already
did get horrible I/O patterns - it just happened far less often.

> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone. Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from.
> It's quite possibly livelockable, too.

As Chris mentioned, currently btrfs and ext4 do not actually do delalloc
conversions from this path, so for typical workloads the amount of
writeout that can happen from this path is extremely limited. And unless
we get things fixed we will have to do the same for XFS. I'd be much
happier if we could just sort it out at the VM level, because this
means we have one sane place for this kind of policy instead of three
or more hacks down inside the filesystems. It's rather interesting
that all people on the modern fs side completely agree here what the
problem is, but it seems rather hard to convince the VM side to do
anything about it.

> To solve the stack-usage thing: dunno, really. One could envisage code
> which skips pageout() if we're using more than X amount of stack, but
> that sucks.

And it doesn't solve other issues, like the whole lock taking problem.

> Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way. The helper
> thread could of course do writearound.

Allowing the flusher threads to do targeted writeout would be the
best from the FS POV. We'll still have one source of the I/O, just
with another knob on how to select the exact region to write out.
We can still synchronously wait for the I/O for lumpy reclaim if really
necessary.
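
To make the "select the exact region" part a bit more concrete, here is a
rough userspace sketch of how a request for one target page could be widened
into an aligned range for a flusher thread to write. Everything in it is
hypothetical: struct writeout_range, targeted_range() and the power-of-two
window are illustrative assumptions, not existing kernel interfaces.

```c
#include <assert.h>

/* Hypothetical request descriptor, not a kernel structure. */
struct writeout_range {
	unsigned long start;	/* first page index to write, inclusive */
	unsigned long end;	/* last page index to write, inclusive */
};

/*
 * Widen a single target page into an aligned window of `window` pages
 * that contains it, clamped to a file of `nr_pages` pages.
 * Assumption: `window` is a power of two, so the mask below aligns down.
 */
static inline struct writeout_range
targeted_range(unsigned long target, unsigned long window,
	       unsigned long nr_pages)
{
	struct writeout_range r;

	assert(window > 0 && (window & (window - 1)) == 0);
	assert(nr_pages > 0 && target < nr_pages);

	/* Align the start down to a window boundary. */
	r.start = target & ~(window - 1);

	/* Clamp the end to the last page of the file. */
	if (nr_pages - r.start < window)
		r.end = nr_pages - 1;
	else
		r.end = r.start + window - 1;

	return r;
}
```

With a 256-page window, a request for page 1000 of a 4096-page file becomes
the range 768..1023, so the page reclaim cares about is written together with
its neighbours instead of as a single isolated page.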