Introduction

Concurrent containers are important for multi-threaded applications. Third-party libraries, such as the award-winning Intel Threading Building Blocks (TBB) C++ library [1], offer parallel containers for various hardware, operating systems, and compilers. The TBB library seems to be the first choice to employ. On the other hand, if thread concurrency is not a high priority, we may consider simple techniques to add some concurrency to our containers.

For relatively large containers such as arrays, vectors, or maps, it is unlikely that all the application threads access the same elements at the same time. We can try to leverage element scope fine-grained locking along with simple lock-free techniques, or fine-tuned concurrent access techniques similar to [2], while performing element scope operations (reading/writing element values). A bottleneck might be the global scope operations (inserts/deletes). Recently, Microsoft has offered an effective slim read/write lock API object, SRWLOCK. For scenarios where global scope operations are not frequent, we might want to take advantage of this API to facilitate container global scope locking.

To try out these techniques, we will create and test containers using the STL std::map, stdext::hash_map, and different locking algorithms. Then, we will compare performance with the TBB tbb::concurrent_hash_map. When the test application starts, a thread creates a container having 1,000,000 elements, and then rebuilds it, inserting/deleting elements. In parallel, 4 to 32 threads concurrently read and write element values. A writer thread fills an element buffer with random bytes and calculates the CRC. A reader thread reads the buffer, recalculates the CRC, and compares it with the value saved by the writer (an exception is supposed to be thrown in case of a CRC error, which would indicate incorrectness of the implementation). To compare the performance of different containers, we calculate the average duration per read/write in microseconds based on 50 tests; each of them includes 1,000,000 iterations for a thread inserting/deleting elements, and 4,000,000 iterations for writer and reader threads. The test application is attached to this article.

A simple approach underlies the tested containers:

When a thread inserts or deletes an element, it acquires the global scope exclusive lock. While the global exclusive lock is held, no other thread can access the container (unfortunately, this means no parallelism and no scalability, and it might be a bottleneck depending on the global scope lock implementation). Inserts are demonstrated in Figure 1:

Figure 1. Using a global scope exclusive lock when changing the container structure.

When a writer thread changes an element value, it uses two-step locking. First, the writer acquires the global scope shared lock. While the global shared lock is held, no thread can insert or delete elements. However, other writer/reader threads can still run in parallel. Then, the writer acquires the element scope exclusive lock and updates the element value (depending on the element scope locking, writes can be more or less parallel and scalable). The pseudo-code in Figure 2 demonstrates the updates:

Figure 2. Using a global scope shared lock and an element scope exclusive lock when writing a container element value.

A reader thread behaves similarly to a writer, except that it acquires the element scope shared lock in place of the element exclusive lock. Reads can be parallel and scalable. Figure 3 demonstrates reads.

A simple implementation of the global scope lock is a wrapper of the Windows API SRWLOCK. Custom read/write locks might be a better choice in scenarios where writer performance matters [3]. We use, however, the "standard" SRWLOCK object, for the sake of clarity and comparability of the test results.

As for element scope fine-grained locking, we may try the simplest spin-locks per element. Internally, a container in Figure 4 stores _internal_value<value_type> objects, which encapsulate both the element m_value and the corresponding spin-lock variable m_lock. The spin-lock status variable m_lock (1 or 0) controls m_value access per element.

The macro ELEMENT_SPINLOCK defines the Spin_lock class. We intentionally use the simplest implementation of the Spin_lock, in order to highlight how thread contention (even when supposedly low at first glance) impacts performance. Only one thread is allowed to access an element at a time, which might be a disadvantage for multiple readers. Figures 5 and 6 show performance degradation (more than 3 times compared with the Intel TBB tbb::concurrent_hash_map class) as the number of threads increases up to 32 (16 reader and 16 writer threads).

We could slightly improve this solution by using spin counters, SwitchToThread() calls, and other amendments usually applied to spin-locks. However, this would not facilitate more concurrency, in particular for multiple readers.

A simple solution, which might pave the way for better reader parallelism and scalability, could be the usage of reader/writer locks in place of basic spin-locks. No changes to the template class SRWL_map_t are needed – we implement a reader/writer lock class, Spin_rwlock, and change the macro ELEMENT_SPINLOCK to use Spin_rwlock. An unsophisticated implementation of the Spin_rwlock can resemble the reader/writer spin-lock from the article [3]:

The HIWORD of the m_rw_status variable (0 or 1) indicates that a writer is active. The LOWORD counts active readers. Multiple readers now access the same elements in parallel, which improves reader performance (up to 2 times for 32 threads, see Figures 5, 6). The disadvantage of the Spin_rwlock is the degradation of writer performance as the number of threads increases; see writer performance in Figures 8, 9. Both spin-locks and read/write spin-locks work fine when thread contention is low, but for highly concurrent containers they do not. For such scenarios, we might try to leverage more efficient locks, for instance, Windows SRWLOCK objects.

A wonderful, almost 10-year-old solution, providing fine-tuned concurrent element access for large containers, was introduced by Ian Emmons in his article [2]. The solution uses a simple technique of hashing container element addresses to a lock object, an element of an array of 256 locks. This approach underlies the CritSecTable_locker_t and CST_map_t classes we implemented for test purposes. The container shows great performance for both readers and writers; see the CS_table results in Figures 5, 6 and 8, 9. The advantage of this approach is that it facilitates the usage of complex and effective objects for fine-grained locking in place of basic spin-locks, which improves performance in a wide range of scenarios. Fine-tuned concurrent access in [2] is not, however, a fine-grained locking technique, because different threads may hash to and share the same lock while accessing different elements. The more hash collisions we have, the further "fine-tuned access" departs from fine-grained locking.

To facilitate fine-grained locking, we can use a pool of locks in place of hashing element addresses. When a thread needs access to a container element, it retrieves a lock object from the pool. Other threads address this lock while accessing the same container element. The lock is pushed back to the pool when no threads are accessing the element. In other words, we need a pool of shared objects (locks). The minimum capacity of such a pool equals the maximum number_of_threads (assuming that all the reader/writer threads might access different elements). The implementation of the Map_t container, which uses a pool of locks, is shown in Figure 10. For the sake of test code simplicity, we use the same template argument lock_type for both the pool locks and the global lock. The template could be easily adjusted if we wanted to use different lock classes for global- and element-scope locking. To ease the usage of a lock pool in Map_t, we introduce a helper class, Container_locker_t; see Figure 11.

Internally, the container holds Container_locker::internal_value<value_type> objects, encapsulating both the m_value and the corresponding m_lock per element. The HIWORD of the m_lock member is the index of the lock object (pool element index), which controls thread access to m_value. The LOWORD is the number of threads currently sharing the lock. When a thread needs to access a container element, it retrieves the index of a free lock object within the pool, and writes this value to the m_lock HIWORD. The LOWORD counter is incremented when threads use this index to address an actual lock from the pool. The counter is decremented after a thread has finished accessing the element. Finally, the lock index is pushed back to the pool when both HIWORD and LOWORD are zero (no threads are accessing the lock).

To make these operations thread safe, we implement a pool wrapper, Pool_of_shared_elements; see Figure 12. The elements' m_lock values are passed to the Pool_of_shared_elements pop/push methods (as the reference_counter argument), where both HIWORD and LOWORD are set atomically. A simple implementation of a base pool class, Pool_t, can be similar to the pool from the article [4]. We slightly adjusted that code in order to address elements by indexes. The size of the Pool_t is defined by the template class parameter as pool_size = (2*number_of_threads - 1). We use a greater pool capacity than the minimum number_of_threads for the sake of simplicity of the Pool_of_shared_elements code: when a pair of threads races on the same m_lock, one thread of the pair may transiently hold two pool locks. As a result, two threads may hold three locks, three threads five locks, and so on. Considering all the thread combinations by pairs, the required pool capacity is 2*number_of_threads - 1 (to illustrate this scenario, we additionally attach the code of a test class, Pool_of_shared_elements_crash_test).

Two of the factors affecting the performance of Pool_t are "false sharing" and thread contention. To make pool operations faster on a multi-core PC, a distributed pool solution might be a better choice [4]. The Array_of_pools_t class, borrowed from [4] and having "per processor" distributed pools, is used in this article when implementing the Map_t class. The container shows high performance for both readers and writers; see the Lock_pool results in Figures 5, 6 and 8, 9. The advantage of this approach is that it better facilitates fine-grained locking for readers and writers, improving performance and scalability.

Conclusion

The simple locking techniques considered in this article can be used to facilitate more concurrency and scalability for element scope operations.

Locking techniques using the simplest spin-locks per container element can be leveraged for large, not highly contended containers. For scenarios with a "reader favor", using reader/writer spin-locks considerably improves reader concurrency and performance.

Using pools of locks is a fast and robust solution for highly concurrent containers on multi-core machines, giving a reasonable balance of scalability and element scope reader/writer performance.