During allocator-intensive workloads, kswapd will be woken frequently, causing free memory to oscillate between the high and min watermarks. This is expected behaviour. Unfortunately, if the highest zone is small, a problem occurs.

When balance_pgdat() returns, it may be at a lower classzone_idx than it started because the highest zone was unreclaimable. Before checking if it should go to sleep though, kswapd checks pgdat->classzone_idx which, when there is no other activity, will be MAX_NR_ZONES-1. It interprets this as having been woken up while reclaiming, skips scheduling and reclaims again. As there is no useful reclaim work to do, it enters a loop of shrinking slab, consuming loads of CPU until the highest zone becomes reclaimable for a long period of time.

There are two problems here. 1) If the returned classzone or order is lower, it'll continue reclaiming without scheduling. 2) If the highest zone was marked unreclaimable but balance_pgdat() returns immediately at DEF_PRIORITY, the new lower classzone is not communicated back to kswapd() for sleeping.

This patch does two things that are related. First, if the end_zone is unreclaimable, this information is communicated back. Second, if the classzone or order was reduced due to failing to reclaim, new information is not read from pgdat and instead an attempt is made to go to sleep. Due to this, it is also necessary that pgdat->classzone_idx be initialised each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as wakeups.
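
The decision described above can be sketched as a small user-space model. This is not the patch itself: struct pgdat_model, struct reclaim_state, reset_request() and next_action() are hypothetical names made up for illustration. The model assumes that a lower classzone index or a higher order counts as a harder request. It only shows the intended rule: new requests are read from pgdat only when the last pass was not reduced, and the request fields are reset on every read.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the pg_data_t fields involved. */
struct pgdat_model {
	int nr_zones;		/* number of zones in the node */
	int classzone_idx;	/* highest zone index requested by a waker */
	int kswapd_max_order;	/* highest order requested by a waker */
};

/* What a reclaim pass was asked to balance, or what it reported back. */
struct reclaim_state {
	int order;
	int classzone_idx;
};

enum kswapd_action { KSWAPD_TRY_SLEEP, KSWAPD_RECLAIM_AGAIN };

/* Reset the request fields so that a later re-read does not look like a wakeup. */
static void reset_request(struct pgdat_model *pgdat)
{
	assert(pgdat->nr_zones > 0);
	pgdat->kswapd_max_order = 0;
	pgdat->classzone_idx = pgdat->nr_zones - 1;
}

/*
 * Decide what to do after a balancing pass.
 *   requested: what the pass set out to balance
 *   balanced:  what the pass reported back (lower if reclaim failed)
 */
static enum kswapd_action next_action(struct pgdat_model *pgdat,
				      struct reclaim_state requested,
				      struct reclaim_state balanced)
{
	struct reclaim_state next = requested;

	/* Only read a new request if the last pass was not reduced. */
	if (balanced.classzone_idx >= requested.classzone_idx &&
	    balanced.order == requested.order) {
		next.order = pgdat->kswapd_max_order;
		next.classzone_idx = pgdat->classzone_idx;
		reset_request(pgdat);
	}

	/* Assumption: larger order or lower classzone is a harder request. */
	if (requested.order < next.order ||
	    requested.classzone_idx > next.classzone_idx)
		return KSWAPD_RECLAIM_AGAIN;

	return KSWAPD_TRY_SLEEP;
}
```

In this model, a pass that asked for classzone 2 and came back at classzone 1 leads to KSWAPD_TRY_SLEEP without touching pgdat, whereas a pass that succeeded and then finds a higher kswapd_max_order in pgdat leads to KSWAPD_RECLAIM_AGAIN.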