If you have multithreaded application which allocates a large number of objects (using standard
new and delete operators) then most likely you were faced with performance
degradation problem at multiprocessor systems. You program is running fine at computer with single CPU.
Then you try to run it at the system with 2 or more processors and instead of expected
increase of performance 2 or more times, you find out that you application runs ten or more times slower!
What is happening? Why adding more CPU significantly decrease speed of your application?
One of the possible answers is memory allocator. Standard system memory allocator use
mutexes to prevent concurrent access to allocator structures and preserve their consistency.
Locking mutex is not a problem when there are no lock conflicts - it is just few atomic assembler instructions.
But at multiprocessor platform, different threads concurrently invokes memory allocator and as a result
we have a large number of lock conflicts. So most of the time is now spent is locking/unlocking mutexes.
Even if threads in your application are working autonomously (thread is accessing objects created only by
itself) and so require no synchronization, there will be still a lot of conflicts between threads
caused by memory allocator. As a result instead of expected linear increase of performance with adding new
CPUs, you will have exponential degradation of performance.

How to solve the problem? The obvious solution is to avoid creation of objects in dynamic memory at all
and allocate objects on stack instead. It is almost the ideal solution fro performance point of view.
Unfortunately it is not always possible or convenient (for example thread stack has also limited
size so we can not create large number of objects in the stack). Another solution is to
have separate memory allocator for each thread, so that each allocator can allocate memory without
interaction with other allocators. The product contains implementation of such allocator.

This allocator is optimized for allocation of small objects. It maintains chains of blocks with the same size.
Number of chains and maximal size of allocated object can be customized. By default, allocator
uses 5 chains with block size 16, 32, 48, 64 and 128 bytes. Objects with size greater than 128 are allocated
using standard malloc. When allocator is requested to allocate memory, it rounds requested size
to nearest block size. If chain of block with this size is not empty, object is allocated by just unlinking
block from the chain. If the chain is empty, then allocator allocates new page (default size is 4Kb),
split it into blocks of requested size and add all these blocks to the correspondent chain.
Page is also allocated using standard malloc. When object is deallocated, it is linked in
the correspondent chain.

So you can see, that most frequently allocate and free operation requires no synchronization at all -
them are just implemented as L1 list link/unlink. So in case when thread works locally
(objects are deallocated from the same thread which allocates them), performance is very high. And adding
extra processors can only speedup you application. But sometimes data exchange between threads is needed
(so objects created in one thread are sent to another thread where them are deallocated). In this case
it is not possible to avoid synchronization at all. But it is possible to reduce it overhead.
Instead of using one centralized mutex, this algorithm creates one mutex per thread. When free method
find out that object was allocated by some other thread, it lock the mutex of this thread and add
this block to the pending requests list. Each thread checks this list at each allocate
operation. When number of pending requests exceeds some specified threshold, thread lock the mutex,
store head of the list, reset the list, unlocks the mutex and then deallocate all these elements.
So time of holding mutex is minimal, number of locks is significantly reduces and locking can block only
one thread (instead of centralized lock which can block all running threads).

When thread is terminated, algorithm checks counters of live objects at each page and deallocate pages with
zero counter of used objects. Other pages will be deallocated later by other threads which will deallocate
these living objects.

You can use them instead of standard malloc, free and realloc in your C application.
Just include "threadalloc.h" header file and "threadalloc.obj" file
in the linker list for your project.

In C++ application you can redefine default new and delete operators if you
compile your application with included everywhere "threadalloc.h" header file
and defined REDEFINE_DEFAULT_NEW_OPERATOR macro.

This allocator can be build using make utility. Makefile for Microsoft Visual C++ (makefile.mvc)
and GCC (makefile) are provided. Command file make.bat can be used
to invoke Microsoft nmake utility for code>makefile.mvc.

Make will build allocator object file and four tests: stdalloc_test1, stdalloc_test2,
threadalloc_test1, threadalloc_test2. Test 1 implements model where thread works in isolation from
each other (allocates and deallocates objects locally). In this case performance benefit
comparing with standard allocator are most meaningful. Test 2 implements model of communication
between two threads: object are created in one thread (producer) and deleted in another thread (consumer).
The table below summarize results for single and multiprocess systems (time of test execution in seconds).
The tests were performed under Scientific Linux 2.6.15.5 for x86_64.

Test

Description of test

Intel(R) Pentium(R) 4 CPU 3.00GHz [Hyperthreading=On]

Intel(R) Pentium(R) 4 CPU 3.00GHz [Hyperthreading=Off]

Dual Intel(R) Xeon(TM) CPU 3.00GHz [Hyperthreading=On]

Dual Intel(R) Xeon(TM) CPU 3.00GHz [Hyperthreading=Off]

stdalloc_test1

standard allocator with 4 threads working in isolation

6.596

7.948

4.797

5.014

stdalloc_test2

standard allocator with producer and consumer threads

10.853

9.986

11.200

9.613

threadalloc_test1

this allocator with 4 threads working in isolation

2.711

1.690

1.079

0.837

threadalloc_test2

this allocator with producer and consumer threads

8.719

7.848

7.804

6.966

As you can see from these result, proposed allocator is 2-3 times faster than standard allocator
at single processor system and almost 5-7 times faster at dual processor platform! Certainly there are many
other factors which are limiting scalability of real application except memory allocator, but I hope
that using this allocator can really help you in improving performance of your application.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the Software), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions: