On Wed, Oct 12, 2011 at 02:57:53PM -0500, Christoph Lameter wrote:
> > I think we've discussed switching vm_stat[] to a contention-avoiding
> > counter scheme. Simply using <percpu_counter.h> would be the simplest
> > approach. They'll introduce inaccuracies but hopefully any problems
> > from that will be minor for the global page counters.
>
> We already have a contention avoiding scheme for counter updates in
> vmstat.c. The problem here is that vm_stat is frequently read. Updates
> from other cpus that fold counter updates in a deferred way into the
> global statistics cause cacheline eviction. The updates occur too frequent
> in this load.
>

There is also a correctness issue to be concerned with. In the patch,
there is a two second window during which the counters are not being
read. This increases the risk that the system gets too overcommitted
when overcommit_memory == OVERCOMMIT_GUESS.

If vm_enough_memory is being heavily hit as well, it implies that this
workload is mmap-intensive which is pretty inefficient in itself. I
guess it would also apply to workloads that are malloc-intensive for
large buffers but I'd expect the cache line bounces to only dominate if
there was little or no computation on the resulting buffers.

As a result, I wonder how realistic this test workload is and how
useful fixing this problem is in general.

> > otoh, I think we've been round this loop before and I don't recall why
> > nothing happened.
>
> The update behavior can be tuned using /proc/sys/vm/stat_interval.
> Increase the interval to reduce the folding into the global counter (set
> maybe to 10?). This will reduce contention.

Unless the thresholds for per-cpu drift are being hit. If they are
allocating and freeing pages in large numbers for example, we'll be
calling __mod_zone_page_state(NR_FREE_PAGES) in large batches,
overflowing the counters, calling zone_page_state_add() and dirtying the
global vm_stat that way. In that case, increasing stat_interval alone is
not the answer.

> The other approach is to
> increase the allowed delta per zone if frequent updates occur via the
> overflow checks in vmstat.c. See calculate_*_threshold there.
>

If this approach is taken, be careful that the threshold is an s8 so it
is limited in size.

> Note that the deltas are current reduced for memory pressure situations
> (after recent patches by Mel). This will cause a significant increase in
> vm_stat cacheline contention compared to earlier kernels.
>

That statement is misleading. The thresholds are reduced while
kswapd is awake to avoid the possibility of all memory being allocated
and the machine livelocking. If the system is under enough pressure for
kswapd to be awake for prolonged periods of time, the overhead of cache
line bouncing while updating vm_stat is going to be a lesser concern.

I like the idea of the threshold being scaled under normal circumstances
depending on the size of the central counter. Conceivably it could be
done as part of refresh_cpu_vm_stats() using the old value of the
central counter while walking each per_cpu_pageset.