Scalable Statistics Counters

Overview:

In-kernel counters are used to collect statistics of various events and
render the same on requests (Eg., total packets transmitted from
a network interface, context swithces, disk usage statistics, etc). These
counters are usually not serialised by any mutex, since such counters need
to be reasonably accurate only. Depending on the purpose and load
conditons, these counters may be be updated very frequently. In this
document, a new set of interfaces are proposed, which should improve
performance of such counters in multiprocessor environments.

The basic idea behind the approach is to have per cpu versions of these
counters and update per-cpu versions. The per cpu versions can be
consolidated on a read request to the counter. This improves performance
significantly in multiprocessor environments by reducing
the bouncing of the counter cache lines among different CPUs. At the
same time, this implementation switches to code that is completely
devoid of overheads for SMP when compiled without CONFIG_SMP option
set.

Interfaces:

Counters where exactness is not important (like statistics counters)
can be declared as of type statctr_t. This is an opaque type
and users of the counters do not need to bother about its fields.
For an SMP kernel (CONFIG_SMP=y), this is a structure that
has information which can be used to quickly compute the appropriate
per-cpu counter. For uniprocessor kernel, this is just a
unsigned long and thus there is no overhead.

All statctrs must be initialized using one of these interfaces.
These counters cannot be just used by statically allocating statctr_t
types at compile time, simply because the counter is going to refer to
per cpu versions in different cachelines (which is going
to be allocated during statctr_init()). Counters allocated using
these functions must be freed using statctr_cleanup()
or statctr_ncleanup().
For, uniprocessor kernels, statctr_init()/statctr_ninit() reduce to
simply initializing the unsigned long counter or an array of them, there is no
separate allocation involved here. These functions return 0 on success and
a non-zero value on failure.

void statctr_cleanup(statctr_t *stctr)

void statctr_ncleanup(statctr_t *stctr, int count)

These interfaces must be used to free statistics counters
initialized using statctr_init() or statctr_ninit()
respectively. These function frees up the per-cpu counter(s)/internal
data structure(s) allocated during statctr_init() or
statctr_ninit(). They are NOP for uniprocessor kernels.

void statctr_inc(statctr_t *stctr)

This interface increments the counter by one. Internally, only the
per-cpu counter is incremented. For uniprocessor kernels, it simply
non-atomically increments the unsigned long statistics counter.

void statctr_dec(statctr_t *stctr)

Decrements the counter by one. Internally, only the per-cpu
counter is decremented. For uniprocessor kernels, it non-atomically
decrements the counter.

void statctr_add(statctr_t *stctr, unsigned long addend)

Adds the addend value to counter. Internally, only the per-cpu counter
is added for SMP kernels.

void statctr_sub(statctr_t *stctr, unsigned long subtrahend)

Subtracts the subrtrahend value from the counter. Internally,
only the per-cpu counter is updated for SMP kernels.

unsigned long statctr_read(statctr_t *stctr)

Returns the counter value. This function reads all of the
per-cpu versions of this counters, adds them and returns to the
caller. For uniprocessor kernels, it just returns the
value of the counter.

void statctr_set(statctr_t *stctr, unsigned long i)

Sets the statctr to the value i. If statctr_read()
is invoked on a counter after statctr_set(), the return
value should reflect the value set.

Statistics Counter Maintenance:

In order to avoid sharing of cache lines between CPUs, the per-CPU
versions of a statctr_t counter are located in different cache lines.
However, since the counter size is a lot smaller than typical
cache line sizes, different statistic counters interlaced their
per-CPU versions and thus save on space.

We use a separate allocator to allocate and maintain interlaced
per-cpu counters. Currently there are two implementations of
this allocator - Fixed and Flexible. They differ in flexibility
provided, specially with respect to handling addition or removal
of CPUs in the system. The statctr implementation uses a common
set of APIs to access the per-cpu counters irrespective of the
underlying allocator. A per-cpu counter is represented by
an opaque type pcpu_ctr_t. The APIs are -

pcpu_ctr_t *pcpu_ctr_alloc(int flag)

This allocates a per-CPU counter including all the data structures
necessary in the underlying implementation. Returns NULL on failure.

Here a statistics counter declared and initialized using
statctr_init() will be allocated from a
smp_num_cpus * SMP_CACHE_BYTES sized block with each per-CPU
version residing in a different cache line aligned chunk of
SMP_CACHE_BYTES size. The next counter initialized will share
the same block by placing the per-CPU versions interlaced.
When one set of blocks get exhausted, another set is allocated.

The statctr_t opaque type contains a pointer to the CPU #0 version
of the statistics counter. This pointer resides on line 0 of the
statctr_t block (of size smp_num_cpus * SMP_CACHE_BYTES).
The address of the per-cpu version of this counter is computed using
pointer arithmatic.

Because SMP_CACHE_BYTES will be a power of 2, reasonable
compliers would generate code without using costlier instructions.

Flexible counters:

In order to make statctr work with CPU-hotplug features, we need
to use a different approach to maintain per-cpu counters. For
CPU online/offline to work with the Fixed counters, we have to
allocate cacheline sized counter chunks (of size SMP_CACHE_BYTES)
for NR_CPUS cpus which is likely to be a waste on many SMP machines.
Instead, we use a flexible array based scheme that allows dynamic expansion
of per-cpu chunks allocated for counters.

Each statctr is represented by a an array of size NR_CPUS
that contains the pointers to the per-cpu versions of the counters.
Initially we fill out the array with the pointers
for the currently online CPUs. Onlining/Offlining would only
requiring updates to the affected CPUs entry in this array.
The data structures that contain the statctr will not need
any modification and will continue to available to other
CPUs for increment/decrement/read operations.

The allocator maintains the counters in counter blocks.
Each block consists of an array of size NR_CPUS with
each element corresponding to an online CPU pointing to
a cacheline sized, cacheline aligned counter chunk (of SMP_CACHE_BYTES).
These chunks are allocated using the existing kmem_cache_alloc()
slab allocator. The counter block
also has a free list to allocate a single counter within itself.

The fast path code (increment/decrement) involves an extra
load to get the given CPUs version of the counter, but it
isn't likely to be as significant compared to the savings
since the extra load will not result in cache line bouncing.

We may maintain a list of all such counters allocated
(linked list of the counter pointer-arrays). During onlining or
offlining CPUs, this list is to be traversed and the corresponding
pointer in the array updated. There is no locking here necessary since
different CPUs will change their own pointers. Of course locking is needed
for creating new counters.

If the number of CPUs in the system decreases, then we can live
with letting the previously allocated cache lines be around
unused. If the number of CPUs increase, then we need to allocate
an additional cache line for each additional CPU.

This scheme will also handle sparse CPU numbers.

Results:

Microbenchmarking:

Microbenchmarking, while not indicative of any real workload, is
still beneficial to verify if a certain implementation has
inherent bottlenecks. So, here are the results of cost measurement
for global counters vs. statctr averaged over 200 million operations.

Obvservations:

6 cycles per global counter increment includes the overhead
of loop termination check and a delay of couple of cycles to allow
cacheline requests from other CPUs to come in before the next
increment. This is a more realistic scenerio.

Single processor cost (11 cycles) is greater for statctr,
but the measurements were done using an SMP kernel. In UP kernel,
statctrs are reduced to a long..

The cost of statctr while greater is negated by cacheline delays
with global counters enough that we see benefits from 2 CPUs upwards
for frequently used counters.

Other than negligible associativity misses, statctr implementation
doesn't have any bad cache overhead.

Tbench over loopback:

The loopback driver uses 4 counters to maintain Tx/Rx statistics.
This measurement used tbench over lo to compare relative
advantage and disadvantage of using statctrs for these statistics.
Amount of cycles needed to increment 4 counters as well as
profile count for loopback_xmit() where measured.

Obvservations:

Cost of statctr is higher for 1 processor, but that is for
SMP kernels, for UP kernels, statctr is a long.

Effect of cacheline bouncing in loopback_xmit under
tbench load is sufficient enough to see benefit of statctr from 2
CPUs upwards.

The flattening of the cost curve from 2 CPUs upward is probably
due to the fact that loopback_xmit() is serialized. The cacheline
related delay is always limited to getting a modified cacheline
and there aren't multiple CPUs trying to modify the
cacheline simultaneously.