Introduction

Anyone who works with applications that use vast allocation and freeing of dynamic memory during their run knows that this can cause decrease in performance, because this operations needs to be guarded with some kind of a global lock, so different modules can affect one another's performance. Further more, over time the memory is fragmented from all sizes of allocations, so if an application is kind of a resident process on a machine, its performance deteriorates over time.

In this article I will try to give a partial solution for this problem. Why partial? 'Cause I'm not trying to give a better implementation of the Memory Manager above the OS level. What I can optimize is modules that use vast allocation and freeing of buffers of the same size, which from my experience are the most common.

The solution I'm suggesting is a buffer pool that will allocate and free buffers of a predefined size, the complexity of the allocation and free operations is O(1), and each buffer pool object is self synchronized, so it wont affect other modules using another buffer pool. Furthermore the dynamic allocations of the buffer pool are done in segments of memory the size of N buffers, so the fragmentation of the memory over time is reduced significantly. More then that, because each segment is allocated as one block of memory, the paging of memory will be reduced also.

Using the code

First declare a CBufferPool object

CBufferPool BuffPoolObj;

Next create the Buffer pool by calling the Create method.

BuffPoolObj.Create(nBufferSize, //Buffer Size
nBuffersPerSegment, //Number of buffers in each segment
nInitialNumberOfSegments, //Initial number of segments
nMinNumberOfSegments, //Minimum number of segments during run
nMaxNumberOfSegments, //Maximum number of segments during run
nFreeBuffersPercentForSegmentDeletion);
//Percentage of all free buffers relative to
//size of a segment that allows a segments deletion

The last parameter nFreeBuffersPercentForSegmentDeletion is not so intuitively understood, so I'll try to explain its purpose. In the implementation of the buffer pool, the buffers are given from a segment that is pre-allocated. There is that 'gray zone' where if a buffer is the first allocated buffer in a segment, we allocate a new segment for it, and if in the next operation we free this same buffer, we should free the whole segment and so on. This will cause a decrease in the performance that is worse than if we'll allocate the buffers in the traditional way. To prevent this from happening, I'll free a segment if I'll have a safe number of free buffers from whole segments. This safe distance is determined by this parameter.

Then you can call Allocate method to get the required buffer, and the Free method to return it to the pool.

The CBufferPool has a synchronization scheme, so this methods can be called asynchronously. To destroy the buffer pool object call the Destroy method. This method will free all segments, in debug it will check that all buffers where freed.

Working the demo

The demo project demonstrates how the buffer pool can improve performance of a program compared to simple new and delete operators. Furthermore it can be used to check if your system can be improved, and the approximate amount of improvement that can be achieved with a certain configuration.

There are 3 tests that you can perform with the demo project, test the buffer pool and then the global allocation (new and delete) to see the improvement the buffer pool achieves over global allocation, and the interaction test that can illustrate the severe decrease in performance effect that the global allocation suffers from, in comparison to the buffer pool.

The parameters in the demo are divided into three, session parameters that are used to run both the buffer pool and the global allocation, and the two other sections for each of the two. For the parameters that are not intuitively understood, you can understand using the code. The buffer pool code is very well documented, and I tried to keep the demo application code simple.

For those who compile the demo in debug mode, let me just say that the buffer pool is added in debug mode tests to ensure the buffers or segments that are freed will not be used and buffers that are being allocated are not already in use. And these tests (commonly memsets to trash the memory out for it to be unusable) will decrease the performance beyond measurement, so do the tests in release.

Implementation

I will just try to explain the implementation in general, the buffer pool is constructed of a doubly linked list of segments, each segment has a single linked list of free buffers. The link between the buffers uses the buffer itself. Each buffer has apart from its required memory, a header that keeps its index in the segment, so the free operation will be direct with no searching and its complexity O(1). All the memory for the segment and buffers is allocated as one block. The segments list is ordered during the run so if the segment is fully used, it will be inserted to the end of the list, and if it has one or more of its buffers freed, it will move to the head of the list. The allocation operation just pops one free buffer out of the list. When a buffer is free a calculation is made to extract it's segment, a segment is allocated each time there is no free buffer, and freed if we have more then nFreeBuffersPercentForSegmentDeletion multiplied by nBuffersPerSegment amount of free buffers all together in the buffer pool. This is the basic idea for specifics. See the code, it is really well documented.

Object pool

The object pool is an implementation of a pool for object that is a direct descendent of the buffer pool. What is needed in order to use it is to declare an Object pool with a declared type.

CObjectPool<SomeType> SomeTypeObjectPool;

Next create The object pool by calling the Create method.

SomeTypeObjectPool.Create
( nObjectsPerSegment, //Number of objects in each segment
nInitialNumberOfSegments, //Initial number of segments
nMinNumberOfSegments, //Minimum number of segments during run
nMaxNumberOfSegments, //Maximum number of segments during run
nFreeObjectsPercentForSegmentDeletion);
//Percentage of all free objects
//in relative to the size of a segment that allows a segments deletion

The object pool implementation has very little code, it uses a new and delete "placement" (as the documentations calls this method) that receives the pointer to a buffer pool which is initialized with the object size as the buffer size at the Create method. I think the implementation is very exciting because of this new\delete "placement" method that is rarely used.

A light example of the object pool is shown in the OnInitDialog function in the CDemoDlg.

You can use the CBufferPool \ CObjectPool on any platform. All you need to do, is replace the critical section object with the other platform dependent mutual exclusion lock.

Thanks to George Anescu for the CPreciseTimer class used in the demo project.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

The project was a high-speed data server for stock market quotes and trades. With this type of server the information is continuously updated, removed, added to over the course of a day - 100 million or more messages per day.

This specific application was designed for multi-processor Win2k systems and used asynchronous thread processing - threads were assigned to different processors and had different priorities.

Two tools were important for analyzing and optimizing the program. The first is "Performance Monitor" and the other was NuMega TrueTime (part of the Bounds Checker suite).

One of the biggest performance robbers is locking for multithread access - it can cause a multithread application to process a single thread at a time due to contention over the same resouce - the threads are not asynchronous anymore.

To overcome this contention, we used seperate lists for the free and data items; we only needed to lock one or the other. This eliminated a lot of contention. It also eliminated the overhead of using a double-link list; a single-link list is twice as fast. Free items should be placed at the head of the free list; it increases the chance that a memory reference will be in the CPU cache.

To reduce contention even further, we designed the application level to work with lists of data or free blocks; we locked the manager once to retrieve 50 or 100 blocks.

This still left contention when we were accessing the data... to make sure it wasn't freed from underneath us by another thread.

We eliminated this by putting freed data on another list; the items on this list were "expired" after a reasonable time period that ensured there were no references to the item and then it was placed on the free list.

Actually my initial inclination was to use seperate heaps too; ie, HeapCreate, HeapAlloc, etc. One of the advantages of seperate heaps is protection from corruption. But the class for using virtual memory was already in place.

So I did some reading, and then I did some testing. It is much faster to do a VirtualAlloc and then partition it into blocks than it was to use HeapAlloc/HeapFree. And for our application, we needed the ability to have large memory blocks that were locked into memory.

We did use seperate allocations for each type of object, and seperate manager objects for each type.

I would guess that pre-allocated memory from a local heap would be just as fast as pre-allocated memory using the virtual functions, though. But the heaps are serialized by default so if you're going to do your own management you might want to turn this off.