Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

Andrew Morton wrote:
> On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>
>>> It still feels a bit unnatural though that the page allocator waits on
>>> congestion when what it really cares about is watermarks. Even if this
>>> patch works for Christian, I think it still has merit so will kick it a
>>> few more times.
>>
>> In whatever way I can look at it, watermark_wait should be superior to
>> congestion_wait, because, as Mel points out, waiting for watermarks is
>> what is semantically correct there.
>
> If a direct-reclaimer waits for some thresholds to be achieved then what
> task is doing reclaim?
>
> Ultimately, kswapd. This will introduce a hard dependency upon kswapd
> activity. This might introduce scalability problems. And latency
> problems if kswapd is off doodling with a slow device (say), or doing a
> journal commit. And perhaps deadlocks if kswapd tries to take a lock
> which one of the waiting-for-watermark direct reclaimers holds.

So why not let the process do something about it instead of going to sleep if no writes are outstanding? It might be able to take care of its bad situation alone, maybe by calling try_to_free again.

> Generally, kswapd is an optional, best-effort latency optimisation
> thing and we haven't designed for it to be a critical service.
> Probably stuff would break were we to do so.
>
> This is one of the reasons why we avoided creating such dependencies in
> reclaim. Instead, what we do when a reclaimer is encountering lots of
> dirty or in-flight pages is
>
>	msleep(100);
>
> then try again. We're waiting for the disks, not kswapd.
>
> Only the hard-wired 100 is a bit silly, so we made the "100" variable,
> inversely dependent upon the number of disks and their speed. If you
> have more and faster disks then you sleep for less time.
>
> And that's what congestion_wait() does, in a very simplistic fashion.
> It's a facility which direct-reclaimers use to ratelimit themselves in
> inverse proportion to the speed with which the system can retire writes.

I would totally agree if I didn't have that scenario suffering so much from that mechanism.

In the scenario Mel, Nick and I discussed for a while there are no writes at all, but a lot of page cache reads. In this scenario the direct reclaimer quite frequently runs into the "did_some_progress && !page" case, which leads to congestion_wait calls in the caller of direct_reclaim - eventually always waiting the full timeout, as there are no writes.

I think reclaim in this case is just done by dropping clean page cache pages in try_to_free_pages - so still no writes. For the solution it is hard to find the right layer, as the race is in direct_reclaim but the wait call is outside of it.

The alternatives we have so far are:
a) congestion_wait, which works fine with writes in flight in the system, but with a huge drawback for non-writing systems.
b) watermark wait, which covers writes like congestion_wait does (if they free up enough), but also any other kind of reclaimer, like processes freeing up stuff or other page cache droppers.

New suggestions:
These ideas came up when trying to view it from your position. I don't know exactly whether all of them are doable/feasible, but as we are going to wait anyway we could do complex things in that path.

c) If direct reclaim made reasonable progress in try_to_free but did not get a page, AND there is no write in flight at all, then let it try again to free up something. This could be extended by some kind of max retry count to avoid weird looping cases as well.

d) Another way might be as easy as letting congestion_wait return immediately if there are no outstanding writes - this would keep the behavior for cases with writes and avoid the "always waiting the full timeout" issue without writes.

e) Like d), but fall through to the watermark wait if no writes exist.

So I don't consider option a) a solution, as we have real-world scenarios with huge impact. Even though it puts more burden on kswapd's shoulders, b) is still better - remember, as long as writes are there it is almost the same as congestion_wait, but it wakes up at the right time (awoken allocations will still fail if below the watermark). And c)-e), well, I'm not sure yet - just things that came to my mind.

For the moment I would suggest going forward with Mel's watermark wait towards the stable tree, as it "fixes" a huge issue there (or rather its symptoms), and the patch is small, neat and matches .32. We can then separately, without any pressure, continue to discuss how to finally get rid of all these race/latency/kswapd issues in 2.6.3n+1.