Multi-threaded applications often scale poorly with the standard memory allocator because the heap is a bottleneck. When multiple threads allocate or free memory simultaneously, the allocator serializes them. As more threads are added, more of them end up waiting, and the wait times grow longer, resulting in increasingly slower execution. Because of this behavior, programs that use the allocator intensively can actually slow down as the number of processors increases. The standard malloc therefore works well in single-threaded applications, but poses serious scalability problems for multi-threaded applications running on multi-processor (SMP) servers.

Solution: libumem, a userland slab allocator

Sun started shipping libumem, a userland slab (memory) allocator, with Solaris 9 Update 3. libumem provides faster and more efficient memory allocation through an object-caching mechanism. Object caching is a strategy in which memory that is frequently allocated and freed is cached, so the overhead of repeatedly constructing the same data structures is reduced considerably. In addition, per-CPU sets of caches (called magazines) improve libumem's scalability by allowing a far less contentious locking scheme when requesting memory from the system. Thanks to this object-caching strategy, the application runs faster with lower lock contention among multiple threads.

libumem obtains memory from the system in page-sized slabs (the default page size is 8K on Solaris/SPARC) and carves them into fixed-size buffers. A request is satisfied from the cache whose buffer size is the nearest fit: if an application asks for 20 bytes, libumem rounds the request up to the nearest cache size (24 bytes on the SPARC platform) and returns a pointer to the allocated buffer. As these requests add up, the rounding leads to internal fragmentation: the extra memory that was allocated by libumem but never requested by the application is wasted. libumem also uses 8 bytes of every buffer it creates to keep metadata about that buffer. For these reasons, there will be a slight increase in the per-process memory footprint.

Quick tip: Run "truss -c -p <pid>" against the process, and stop the data collection with Ctrl-C (^C) after some time, say 60 seconds. A large number of calls to lwp_park, lwp_unpark, or lwp_mutex_timedlock is an indication that the application is suffering from lock contention and hence may not scale well. Consider linking the application with the libumem library, or pre-loading libumem at run time, for better scalability.