On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote:
> > ==== CUT HERE ====
> > vmscan: Do not writeback filesystem pages in direct reclaim
> >
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> >
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back. If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> >
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
>
> This is also a good step towards reducing pageout() calls. For better
> IO performance the flusher threads should take more work from pageout().
>

This is true for better IO performance all right, but reclaim does require specific pages to be cleaned. The strict requirement is when lumpy reclaim is involved; the looser requirement is that any pages within a zone be cleaned.

That is why I'm expecting the "shrink oldest inode" patchset to help. It still requires a certain amount of luck, but callers that encounter dirty pages will be delayed.

It's also because a certain amount of luck is required that the last patch in the series aims at reducing the number of dirty pages encountered by reclaim. The closer that is to 0, the less important the timing of the flusher threads is.
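As background, the check the quoted changelog describes reduces to a predicate along the following lines. This is a sketch only, not the actual patch hunk; may_write_to_storage() is a made-up name, although current_is_kswapd() and page_is_file_cache() are existing helpers:

/*
 * Illustrative only, not the patch itself. Direct reclaim may
 * write back anonymous pages (shallow call-chain, and no flusher
 * thread will ever clean them for us), but file-backed pages are
 * left dirty on the LRU for the BDI flusher threads or kswapd.
 */
static bool may_write_to_storage(struct page *page)
{
	if (current_is_kswapd())
		return true;

	return !page_is_file_cache(page);
}

A dirty page failing a check like this is put back on the LRU instead of being passed to pageout().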

> I'd rather take the logic as "there are
> too many dirty pages, shrink them to avoid some future pageout() calls
> and/or congestion_wait() stalls".
>

What do you mean by "shrink them"? They cannot be reclaimed until they are clean.

> So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times. Let's remove it?
>

This loop only applies to direct reclaimers in lumpy reclaim mode and to memory containers. Both need specific pages to be cleaned and freed. Hence, the loop stalls them to wait on the flusher threads up to a point. Otherwise they can cause a reclaim storm of clean pages that can't be used.
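For reference, the shape of that stall is roughly as follows. This is only a sketch against the existing shrink_page_list()/congestion_wait() interfaces; the local dirty_pages list and retry counter are illustrative, not the real code:

int retries = 0;

/*
 * Stall for the BDI flusher threads, then retry reclaim of the
 * pages that were found dirty. MAX_SWAP_CLEAN_WAIT bounds the
 * retries so a direct reclaimer cannot stall indefinitely.
 */
while (!list_empty(&dirty_pages) && retries++ < MAX_SWAP_CLEAN_WAIT) {
	congestion_wait(BLK_RW_ASYNC, HZ/10);

	/* pages that are now clean can be unmapped and freed */
	nr_reclaimed += shrink_page_list(&dirty_pages, sc, PAGEOUT_IO_SYNC);
}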

Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached, but I am inferring this from timing data rather than from a direct measurement.

The whole patch is a change of behaviour, but in this case it also makes sense to focus on just the dirty pages. The first shrink_page_list pass decided that the pages could not be unmapped and reclaimed, probably because they were referenced. That is not likely to change during the loop.
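Concretely, "focus on just the dirty pages" means splitting them off the list of pages that failed reclaim before retrying, something like the following, where failed_pages is a hypothetical name for that list:

LIST_HEAD(dirty_pages);
struct page *page, *next;

/* retry only the pages that failed reclaim because they were dirty */
list_for_each_entry_safe(page, next, &failed_pages, lru) {
	if (PageDirty(page))
		list_move(&page->lru, &dirty_pages);
}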

Testing with a version of the patch that processed the full list added significant stalls when sync writeback was involved. Test time tripled in one case, implying that the loop was continually reaching MAX_SWAP_CLEAN_WAIT.

The intention of this loop is "wait on dirty pages to be cleaned". It's a change of behaviour, but one that makes sense, and testing indicates it's a good idea.