Distributed pools attempt to minimize the impact of "false sharing" and thread contention on a Multi-core PC. Instead of having all threads access one global memory store concurrently (which can become a bottleneck), per-thread memory pools can be used, balancing private and global pools to leverage fast non-locking algorithms. Such distributed solutions are not simple. The latest versions of Microsoft Windows provide optimized synchronization primitives, and we can take advantage of them to design simple, yet still fast, pool solutions.

The simplest distributed pool is an array of pools. Application threads are "distributed" among these pools according to some distribution rule. The rule may depend on how threads are created and terminated: it can be a uniform distribution established at application start-up, or it can be a "hash-based" distribution. To try these techniques out, we introduce a template class, Pool_t, representing one element of a pool array. Then we implement a class, Array_of_pools_t, representing the distributed pool itself. For a Pool_t object, we use a straightforward implementation leveraging the Windows Interlockedxxx API for singly linked lists:
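The original listing is not reproduced here. As a rough, portable sketch of the same idea, the following uses a C++11 std::atomic compare-exchange loop in place of the Windows InterlockedPushEntrySList/InterlockedPopEntrySList calls; all names besides Pool_t are illustrative:

```cpp
#include <atomic>

// Sketch only: a lock-free LIFO free list standing in for the article's
// Pool_t. std::atomic is a portable analog of the Windows Interlockedxxx
// singly-linked-list API. Unlike the real SList API, which versions the
// head with a sequence counter, this naive pop is exposed to the ABA
// problem under concurrent pops.
struct Node {
    Node* next = nullptr;
    // ... payload managed by the pool ...
};

template <typename T>
class Pool_t {
    std::atomic<T*> m_head{nullptr};  // shared head of the linked list
public:
    void push(T* node) {
        T* old = m_head.load(std::memory_order_relaxed);
        do {
            node->next = old;  // link in front of the current head
        } while (!m_head.compare_exchange_weak(
                     old, node,
                     std::memory_order_release, std::memory_order_relaxed));
    }
    T* pop() {
        T* old = m_head.load(std::memory_order_acquire);
        while (old && !m_head.compare_exchange_weak(
                          old, old->next,
                          std::memory_order_acquire, std::memory_order_relaxed))
            ;  // retry: another thread changed the head
        return old;  // nullptr when the pool is empty
    }
};
```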

This pool implementation is thread safe. It is fast on a uniprocessor PC, but on a Multi-core PC, sharing the head of the list, m_head, incurs a performance penalty. The impact depends on the hardware and may be quite significant. By introducing a distributed pool, we want different threads to update list heads at different memory locations, so that thread contention and "false sharing" become less significant. If all threads are created at application start-up, a simple solution is to distribute them uniformly within an array of pools:
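A minimal sketch of this "per thread" rule follows. Pool_t is reduced to an opaque placeholder so the distribution logic stands on its own, and std::atomic is assumed where the article's code would presumably use InterlockedIncrement:

```cpp
#include <atomic>
#include <cstddef>

// Sketch only: the uniform "per thread" distribution rule.
struct Pool_t { /* lock-free list omitted */ };

template <std::size_t SIZE>
class Array_of_pools_t {
    Pool_t m_pools[SIZE];
    std::atomic<unsigned> m_next{0};  // round-robin thread counter
public:
    // Each thread calls this once at start-up and caches the result;
    // the modulo spreads N threads uniformly over SIZE pools.
    std::size_t get_pool_index() {
        return m_next.fetch_add(1, std::memory_order_relaxed) % SIZE;
    }
};
```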

When a thread is created, it first obtains and stores index = get_pool_index(). This index addresses the pool within the array that the thread will access in all further pop/push calls; it is important that a thread uses the same index all the time. Another way to create a thread distribution is to use a hash as the pool index, for example:
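One possible shape of such a rule is sketched below with Knuth's multiplicative hash; on Windows the thread id would come from GetCurrentThreadId(), but it is passed in as a parameter here so the hash itself is portable:

```cpp
#include <cstdint>
#include <cstddef>

// Sketch only: a multiplicative ("Fibonacci") hash of the thread id
// selects the pool. 2654435769u is approximately 2^32 divided by the
// golden ratio, a classic multiplicative-hash constant.
inline std::size_t get_pool_index(std::uint32_t thread_id,
                                  std::size_t pool_count) {
    const std::uint32_t h = thread_id * 2654435769u;
    return (h >> 16) % pool_count;  // mix high bits before reducing
}
```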

Figure 4. Using a multiplicative hash to distribute threads within an array of pools.

The test results for a single Pool_t object and an Array_of_pools_t object are summarized in Figure 5. Eight threads (4 writers and 4 readers) concurrently accessed the pools on a Quad-core PC. Each writer thread popped a buffer from the pool, filled it with random bytes, calculated the CRC, and pushed the buffer back. Each reader thread popped a buffer from the pool, calculated the CRC, and compared it with the CRC saved by the writer (an exception would be thrown on a CRC mismatch, which would indicate an incorrect pool implementation). The test application is attached to this article.

The average duration per read/write in microseconds was calculated over 50 tests, each including 1,000,000 iterations per writer and 4,000,000 iterations per reader. The results show that the distributed pool Array_of_pools_t considerably outperforms a single pool, improving up to ~5-6 times when the number of pools increases from 1 to 4 on a Quad-core PC. When the number of threads increases from 8 to 32, the improvement grows further, up to 10 times. Two of the factors affecting the performance of these pools are "false sharing" and thread contention. To reduce the impact of thread contention, we may try a "per processor" distribution rule in place of the "per thread" rule. Figure 6 illustrates the "per processor" modifications to the Pool_t and Array_of_pools_t code:
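The Figure 6 listing is omitted here; the core change might look like the following sketch, which replaces the cached per-thread index with the current CPU number. The article targets Windows Vista's GetCurrentProcessorNumber(); sched_getcpu() is used below as its GNU/Linux counterpart so the sketch is self-contained:

```cpp
#include <sched.h>   // sched_getcpu(); the article's Windows code would
                     // call GetCurrentProcessorNumber() instead
#include <cstddef>

// Sketch only: the "per processor" distribution rule. The index is
// re-evaluated on every pop/push call, so it follows the thread when
// the scheduler migrates it to another core.
template <std::size_t SIZE>
std::size_t get_pool_index_per_processor() {
    const int cpu = sched_getcpu();  // returns -1 on failure
    return (cpu < 0 ? 0u : static_cast<std::size_t>(cpu)) % SIZE;
}
```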

The idea of the "per processor" distribution is to reduce the number of threads that can access the shared members of a pool object at the same time on a Multi-core PC. This should further minimize thread contention on the pool heads, and thus improve performance. However, the test results in Figure 5 show little difference in performance between the "per processor" and "per thread" distributions; "per thread" distributed pools are not slower. To observe the difference in contention between the two distributions, we may introduce our own test version of the Interlockedxxx functions for singly linked lists with a "contention" counter. The 32-bit implementation below is simple, and is perhaps not too far from what Microsoft does in its 32-bit singly linked list API:
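The original listing is omitted here; a portable sketch of the idea, with std::atomic in place of a 32-bit compare-exchange loop, could count every failed compare-exchange as one contention event (all names are illustrative):

```cpp
#include <atomic>

// Sketch only: an instrumented lock-free LIFO list. Every failed
// compare-exchange bumps the "contention" counter, approximating the
// retry loop inside a 32-bit singly-linked-list pop/push.
// compare_exchange_strong is used so the counter reflects only real
// rivals, not spurious failures.
struct Entry { Entry* next = nullptr; };

class Counting_slist {
    std::atomic<Entry*>   m_head{nullptr};
    std::atomic<unsigned> m_contention{0};  // failed CAS attempts
public:
    void push(Entry* e) {
        Entry* old = m_head.load(std::memory_order_relaxed);
        for (;;) {
            e->next = old;
            if (m_head.compare_exchange_strong(
                    old, e,
                    std::memory_order_release, std::memory_order_relaxed))
                return;
            m_contention.fetch_add(1, std::memory_order_relaxed);
        }
    }
    Entry* pop() {
        Entry* old = m_head.load(std::memory_order_acquire);
        while (old) {
            if (m_head.compare_exchange_strong(
                    old, old->next,
                    std::memory_order_acquire, std::memory_order_relaxed))
                break;
            m_contention.fetch_add(1, std::memory_order_relaxed);
        }
        return old;
    }
    unsigned contention() const { return m_contention.load(); }
};
```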

Despite the higher values of the "contention" counter for the "per thread" distributed pools, they are not slower, but even faster than pools with the "per processor" distribution; see Figure 5. If a distributed pool uses fast synchronization primitives (such as the Windows SList API), thread contention does not appear to be the only factor affecting performance. Moreover, the test results for "per thread" distributed pools in Figure 9 show that increasing the number of pools beyond 4 does not further improve performance, even when the number of threads increases from 8 to 32.

If the number of threads equals 32 and the number of pools equals 4, eight threads concurrently access each pool in the array. In this case, we might expect performance to improve with a further increase in the number of pools. But, as Figure 9 shows, both the "contention" counter and the performance stay the same on a Quad-core PC once the number of pools reaches 4. In practice, this means there is no point in creating a private memory pool for each application thread (as is sometimes suggested). Depending on the hardware, creating a distributed pool whose size (number of pools) exceeds the number of processors/cores may also be a waste of resources.

Conclusion

Using a simple array of pools may considerably improve the performance of x32/x64 applications developed with the Microsoft VC++ 9.0 compiler for Microsoft Windows Vista and running on a Multi-core PC.

Creating more pools in the array than the number of processors/cores may be a disadvantage.

Acknowledgments

Many thanks to Dmitriy V'jukov and Serge Dukhopel. It was their idea to try "per processor" distributed pools on a Multi-core PC, including the methods of thread distribution within pools. Dmitriy V'jukov also made valuable comments on the usage and limitations of the code in this article.