[RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible

Date

Tue, 8 Jun 2010 10:02:19 +0100

I finally got a chance last week to revisit the topic of direct reclaim avoiding writing out pages. As it came up during discussions the last time, I also had a stab at making the VM write ranges of pages instead of individual pages. I am not proposing this for merging yet; first I want to see what people think of this general direction and whether we can agree on if it is the right one or not.

To summarise, there are two big problems with page reclaim right now. The first is that page reclaim uses a_ops->writepage to write a page back under the page lock, which is inefficient from an IO perspective due to seeky patterns. The second is that direct reclaim calling into the filesystem splices two potentially deep call paths together and can overflow the stack on complex storage or filesystems. This series is an early draft at tackling both of these problems and is in three stages.

The first 4 patches are a forward-port of tracepoints that are partly based on tracepoints defined by Larry Woodman but never merged. They trace parts of kswapd, direct reclaim, LRU page isolation and page writeback. The tracepoints can be used to evaluate what is happening within reclaim and whether things are getting better or worse. They do not have to be part of the final series but might be useful during discussion.

Patch 5 writes out contiguous ranges of pages where possible using a_ops->writepages. When writing a range, the inode is pinned and the page lock released before submitting to writepages(). This potentially generates a better IO pattern and it should avoid a lock inversion problem where the filesystem wants the same page lock already held by the VM. The downside with writing ranges is that the VM may now be generating more IO than necessary.

Patch 6 prevents direct reclaim from writing out pages at all; instead, dirty pages are put back on the LRU. For lumpy reclaim, the caller will briefly wait on dirty pages to be written out before trying to reclaim the dirty pages a second time.

The last patch increases the responsibility of kswapd somewhat because it is now cleaning pages on behalf of direct reclaimers, but kswapd seemed a better fit than the background flushers to clean pages as it knows where the pages needing cleaning are. As it is async IO, it should not cause kswapd to stall (at least until the queue is congested), but the order in which pages are reclaimed from the LRU is altered: dirty pages that would have been reclaimed by direct reclaimers get another lap on the LRU. The dirty pages could have been put on a dedicated list, but that would have increased counter overhead and the number of lists, and it is unclear whether it is necessary.

The series has survived performance and stress testing, particularly around high-order allocations on X86, X86-64 and PPC64. The results showed that while lumpy reclaim had a slightly lower success rate when allocating huge pages, the rates were still very acceptable; reclaim was a lot less disruptive and allocation latency was lower.