Peter Zijlstra wrote:
> On Mon, 2008-12-08 at 18:00 -0500, Theodore Tso wrote:
>> On Mon, Dec 08, 2008 at 11:20:35PM +0100, Peter Zijlstra wrote:
>>> atomic_t is pretty good on all archs, but you get to keep the cacheline
>>> ping-pong.
>>
>> Stupid question --- if you're worried about cacheline ping-pongs, why
>> aren't each cpu's delta counter cacheline aligned? With a 64-byte
>> cache-line, and a 32-bit counters entry, with less than 16 CPU's we're
>> going to be getting cache ping-pong effects with percpu_counter's,
>> right? Or am I missing something?
>
> sorta - a new per-cpu allocator is in the works, but we do cacheline
> align the per-cpu allocations (or used to), also, the allocations are
> node affine.

Then I tried to use atomic_t (or atomic_long_t) for 'counters', but got a
10% slowdown of __percpu_lcounter_add(), even when never hitting the slow
path: atomic_long_add_return() is really expensive, even on a non-contended
cacheline.

struct percpu_lcounter {
	atomic_long_t count;
#ifdef CONFIG_SMP
#ifdef CONFIG_HOTPLUG_CPU
	struct list_head list;	/* All percpu_counters are on a list */
#endif
	atomic_long_t *counters;
#endif
};

So I believe a percpu_lcounter_sum() that tries to reset all cpu local
counts to 0 would be really too expensive, if it slows down _add() so much.