Introduction

What do you expect when running your application on a dual-processor computer? That it will run 2 times faster than on a single-processor computer? 1.5 times faster? And what if it runs slower?

With this article, I want to open a discussion on writing scalable server applications for SMMP computers (SMMP stands for Shared Memory MultiProcessor). I did not manage to find much information on the subject, so I would like to share my experience here and invite you to comment and discuss. This article covers only one of the many problems a developer can face when programming for multiprocessor computers; my experience is not enough to cover the subject fully. I just believe you will find the story interesting.

Program that cannot scale

What does it mean for a program to scale? It means that the program takes advantage of all the resources the hardware (and money) can provide. If there are 2 CPUs, the program does the same work faster. If there are two separate computers, two instances of the program can be run to utilize both of them and finish the work sooner.

It so happened that one of our project's background services is implemented around a 3rd party COM object DLL. This COM object performs a kind of statistical analysis, usually consuming up to 100% CPU and huge amounts of memory. Since it is a black box, its performance is a constant for us (to be fair, the performance of the component itself is quite good).

When working with multiple threads on single-processor servers, our background service performed very predictably. Two or three parallel worker threads took slightly longer to process X documents than one worker thread did. And the processing time of a single document increased linearly with the number of worker threads, so there is no point in having more than one worker thread per CPU.

It was natural to expect that a modern dual-processor server would improve the performance, if not twice then at least 1.7 times. To our surprise, it turned out that 2 parallel worker threads ran about 1.5 times slower than one worker thread! It ran even slower than the same parallel worker threads on a single-processor computer! The second CPU somehow made it slower. The service was not able to scale.

First we thought about a possible synchronization issue in the 3rd party component. But according to its developers, it has very little internal synchronization between object instances, so it should not be the problem. And even if it were the problem, there was no way to fix or replace the component. I was forced to search for other possible causes.

After reading some books and sources on the Internet, I started to wonder if the problem might come from the run-time library memory manager. A multithreaded application must use a multithreaded memory manager, which allocates and frees blocks of memory for all threads from the same memory pool. If an application creates several in-process COM objects implemented in a DLL, each COM object will compete for the same memory manager.

Because of this, parallel threads in a memory-intensive application must wait for each other until the memory manager is released. This is not a big problem on a single-processor system, because only one thread can execute at any moment anyway. But it becomes a problem on a shared memory multiprocessor system, because the memory manager executing on one CPU stalls all the other CPUs. Even on a single-processor computer it might be a problem, if we take into account the hyper-threading feature of the Intel Pentium 4 CPU.

Okay, this could explain why the application does not benefit from having multiple processors. But why does its performance degrade? I will return to this in the theory section below. Now let's look at the hard numbers.

Tests and results

In order to confirm that there IS a problem with the run-time memory manager, I wrote a simple test application that is supplied with the article.

There are 4 types of tests I thought about:

Pure computational test.

This test is a mix of mathematical functions that should scale very efficiently on multiple CPUs. It does not implement any real algorithm; I only wanted this test to fit into the cache and avoid accessing main memory.

Pure memory allocation/destruction test.

If the guess is correct, this test should scale poorly on multiple CPUs. To simulate numerous allocations, I simply took the STL std::list<char> container, which allocates a long linked list of chars. Since it does not store its elements contiguously, it allocates and then frees each single character one by one - exactly what I need. Note that there is nothing wrong with STL! Any linked list that uses the run-time library memory manager will behave exactly the same.
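A minimal sketch of such an allocation kernel might look like this (my own reconstruction, not the code shipped with the article; the function name is mine):

```cpp
#include <cstddef>
#include <list>

// One iteration of the pure allocation test: build a long std::list<char>,
// forcing one heap allocation per node, then destroy it, forcing one free
// per node. With N parallel threads running this loop, every new/delete
// goes through the shared run-time heap.
std::size_t allocation_kernel(std::size_t nodes)
{
    std::list<char> chain;
    for (std::size_t i = 0; i < nodes; ++i)
        chain.push_back(static_cast<char>(i & 0x7F)); // one allocation each
    std::size_t count = chain.size();
    // chain's destructor frees every node one by one here.
    return count;
}
```

Each node allocation and deallocation contends for the run-time heap, which is exactly the behavior under test.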

Mix of memory allocation/destruction and read/write.

This test may or may not scale - it depends on the balance between the two parts. Its implementation is almost the same as test 2; I just added iterating through the container and doing some computations. During that loop, the memory manager should be free for other threads. I did not pay much attention to balancing this test.

Pure memory read/write test.

This test should scale almost as well as the pure computational test. It may be just a bit less efficient because of memory bus contention.
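A harness for running any of these kernels in N parallel threads can be sketched as follows. The original test application uses Win32 threads and timers; this portable sketch uses std::thread and <chrono> instead, and the names are mine:

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Run `kernel` in `threads` parallel threads and return the wall-clock
// time in seconds. Comparing run_parallel(1, k) with run_parallel(2, k)
// on a multiprocessor machine shows whether the kernel scales.
template <typename Kernel>
double run_parallel(unsigned threads, Kernel kernel)
{
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threads; ++i)
        pool.emplace_back(kernel);   // each thread runs one kernel instance
    for (auto& t : pool)
        t.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

A scalable kernel keeps the elapsed time roughly constant as the thread count grows up to the CPU count; the allocation kernel from test 2 does not.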

Each single test iteration was designed to take about 4-6 seconds on the test computers: a single-processor Pentium III 850 MHz with 256 MB RAM and a dual-processor Pentium 4 2.4 GHz with 2 GB RAM (it actually took 4-6 seconds on the Pentium III and about half that on the Pentium 4). All test data is generated on the fly and fits into RAM, so the HDD cannot affect the results. By the way, I noticed that my Pentium III 850 was only about two times slower than the Pentium 4 2.4 when running a single thread. I guess this is because of the different priorities for foreground tasks that you can configure in Windows 2000 for desktop and server computers.

I executed 24 iterations with different numbers of parallel worker threads (1, 2, 3, 4, and 6) and repeated the entire set of tests 3 times to see the distribution of the results.

The total time of a single test execution was about (5 sec * 24 "documents" * 5 combinations * 4 tests * 3 iterations) = more than 2 hours on the Pentium III.

On a single-processor computer, the results are very predictable. Tests 1 and 4 take almost the same total time no matter how many parallel threads work together. They may lose a bit on context switching, but I could not even notice it. Just note that, at the same time, parallel threads increase the processing time of each single test iteration linearly. Both tests with memory allocation start losing performance quadratically. This happens because the CPU cache cannot hold all the data required by the parallel worker threads: the more memory the parallel threads access, the less efficient caching becomes. Note that test 4 is unaffected by this problem because all parallel threads access the same memory blocks, thus fully utilizing the CPU cache.

I normalized the results to percentages instead of giving milliseconds because it significantly simplifies the analysis. From each group of three result sets, I also chose the one that looked most representative. You can find the original results attached together with the source code.

On a dual-processor computer the results are much more interesting! The pure computational test reaches 200% performance with two threads and 300% with four. Here I have to mention that this computer has the hyper-threading feature enabled, so Windows 2000 even thinks there are four processors and displays four CPU graphs in Task Manager. The memory access test performs slightly less efficiently, exactly as predicted (two processors and two caches compete for one memory bus).

At the same time, the pure memory allocation test starts degrading significantly in performance once there is more than one thread executing in parallel. Four parallel threads process the same amount of documents in twice the time! Do not confuse this with four parallel threads on a single-processor computer - here we have two physical or four virtual processors (taking hyper-threading into account). It should not be surprising that 6 parallel threads improve the picture - having more than 2 threads per physical CPU allows more efficient caching of the memory manager's data structures (it may look like I am contradicting myself, but with 2 caches, cache coherence has a much more serious performance impact than the inefficiency of a single cache). The mixed test first gains performance with two parallel threads but then repeats the pattern of the pure memory allocation test.

Now let's try to discover why the test application's performance degrades so significantly on a multiprocessor system. First, it is necessary to explain the most basic features of multiprocessing on the Intel x86 architecture and the Microsoft Windows operating system.

A bit of theory

Starting from NT 4.0, Microsoft Windows is an SMP operating system (SMP stands for Symmetric MultiProcessor). Simply speaking, this means that its kernel is able to process interrupts on any available CPU. (Windows NT 3.51 implemented asymmetric multiprocessing and was able to process interrupts only on CPU 0, which is less effective since this CPU becomes a bottleneck under specific workloads.) It also means that all threads with the same priority are distributed evenly between the available processors. When allocating CPU time to a thread, the operating system prefers the CPU where the thread executed last, thus making caching more efficient (this effect is called CPU affinity). If at any single moment there is only one active thread executing on a multiprocessor computer, having multiple processors will not make any difference - the second processor will simply not be used. Usually only server computers that process numerous parallel requests feel the difference.

The Intel processor family implements a Shared-Memory MultiProcessor architecture (abbreviated as SMMP). This means that while each CPU has its own L1 and L2 caches and internal bus, all CPUs work with the main memory over one shared memory bus. In practice, this requires any change in the internal cache of one processor to be written back to the main memory right away whenever another processor tries to use that memory. This is done very efficiently using a cache coherence protocol called MESI in Intel's terminology (look for a very good explanation of cache coherence in Windows 2000 Performance Guide, section 5.2). Unfortunately, two or more threads that actively read and write the same memory location still cause the affected CPUs to frequently flush their cache contents back to the main memory, stalling their execution pipelines. This does not happen on a single processor, which has no reason to write cache contents back to the main memory until it needs the cache space for other tasks.
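The effect is easy to demonstrate with a small sketch (mine, not from the article): two counters that share a cache line force the coherence protocol to bounce that line between CPUs, while padding each counter onto its own line removes the interference. The 64-byte line size is an assumption typical for x86:

```cpp
#include <cstdint>
#include <thread>

// Two counters packed into one cache line: every write by one CPU
// invalidates the line in the other CPU's cache ("false sharing").
struct SharedLine {
    std::uint64_t a = 0;
    std::uint64_t b = 0;
};

// The same counters padded onto separate 64-byte lines: each CPU keeps
// its line in the Exclusive/Modified state and runs at full speed.
struct PaddedLines {
    alignas(64) std::uint64_t a = 0;
    alignas(64) std::uint64_t b = 0;
};

// Increment c.a and c.b from two threads. The results are identical for
// both layouts; on a multiprocessor machine only the timing differs.
template <typename Counters>
void hammer(Counters& c, std::uint64_t iters)
{
    std::thread t1([&] { for (std::uint64_t i = 0; i < iters; ++i) ++c.a; });
    std::thread t2([&] { for (std::uint64_t i = 0; i < iters; ++i) ++c.b; });
    t1.join();
    t2.join();
}
```

Timing hammer() on a SharedLine versus a PaddedLines instance with two CPUs gives a direct view of the coherence cost discussed above.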

Theoretically, Windows 2000 running on a dual-processor computer without hyper-threading can work up to 1.7 times faster than an identical single-processor computer (see Windows 2000 Performance Guide, section 5.2). A four-processor system could work about 3 times faster (very similar to our results in the pure computational test). Instead, in our pure memory allocation test, excessive data dependency between parallel worker threads lets cache coherence seriously impact the application's performance. The funny thing is that this data dependency is not imposed by incorrect algorithms or by sharing our program's data structures. We are perfectly clean in terms of our programming language (C++): we use only basic services like the new/delete operators, and each thread operates on its own independent data. It is the run-time library that shares its internal data structures between parallel execution threads.

Most of the run-time library routines do not use shared data structures; the memory allocation routines are the exception. The run-time memory manager allocates large blocks of virtual memory from the operating system and supplies all threads with the memory blocks they request. This is much more efficient than satisfying memory allocation requests through the operating system's memory allocation facility. The C++ language is unaware of parallel threads, and each pointer must be freely accessible from any line of code. Therefore, the run-time library memory allocation routines use shared internal data structures to manage the allocated memory blocks. The multithreaded version of the C++ run-time library synchronizes access to these internal data structures in order to protect a multithreaded application from corrupting them. The problem is not only the synchronization but also the fact that all threads allocate memory using the same internal data structures. When the application intensively allocates and frees memory blocks, these data structures become a real bottleneck on a multiprocessor system.

Of course, this does not mean that multiprocessor computers are inefficient. Nor does it mean that the programming language is inefficient (maybe only the way we use it). It does mean that a solution that works perfectly on a single-processor system may not improve, and may even degrade, in performance when moved to a multiprocessor system.

The solution that enables our application to scale is to isolate parallel execution threads from one another. More precisely, it is to remove standard memory allocations from parallel execution threads as much as possible. This minimizes the spin locks and cache coherence issues and enables effective usage of all available CPUs.

Solution #1: Per-object memory management

The most obvious and most effective solution would be to have a custom memory manager for the internal needs of each object instance. Note that I am still speaking about my concrete example, where each object is created to perform complex statistical computations. With a custom memory manager, it would be possible to allocate all the memory required for the computations much more efficiently. It would also be possible to pre-allocate the memory, reuse it as needed, and free it all at once upon completion. And it would not need any synchronization.

There are numerous custom memory allocation schemes developed for concrete applications. You can simply use the HeapCreate function to create a dynamic heap for each thread and/or object. Windows XP/2003 introduces a new low-fragmentation dynamic heap built on top of the standard heap functions. You may find the corresponding chapter of "Modern C++ Design: Generic Programming and Design Patterns Applied" by Andrei Alexandrescu interesting. Or you might like the set of memory allocator classes from the ACE framework. There are several articles on the subject here on CodeProject as well.
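As an illustration of the idea (a sketch under my own naming, not the article's code), here is a minimal per-object arena: it takes one large block from the shared heap up front, hands out pieces with no locking, and releases everything at once. On Windows, the same approach can be built on HeapCreate, as mentioned above:

```cpp
#include <cstddef>

// A per-object bump allocator: one big allocation from the shared heap,
// then sub-allocation with no synchronization, and a single free at the
// end. No locking is needed because each object instance owns its arena.
class Arena {
public:
    explicit Arena(std::size_t bytes)
        : base_(new char[bytes]), size_(bytes), used_(0) {}
    ~Arena() { delete[] base_; }            // everything freed at once

    void* allocate(std::size_t bytes) {
        std::size_t aligned = (bytes + 15) & ~std::size_t(15);
        if (used_ + aligned > size_)
            return nullptr;                 // arena exhausted
        void* p = base_ + used_;
        used_ += aligned;
        return p;
    }

    void reset() { used_ = 0; }             // reuse the memory for the next run

private:
    Arena(const Arena&);                    // non-copyable
    Arena& operator=(const Arena&);

    char*       base_;
    std::size_t size_;
    std::size_t used_;
};
```

A statistical computation could allocate all its temporary nodes from such an arena, call reset() between documents, and never touch the shared run-time heap in its inner loop.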

In my specific case, this solution is not applicable, because the problematic component is a 3rd party DLL. So I started searching in a different direction.

UNIX-thinking vs. Windows-thinking

For a long, long time, the UNIX community lived without any thread support. Spawning a process in UNIX is fast and efficient, so any background task can be executed as a separate process. This is well supported by the UNIX philosophy of fine-grained autonomous command-line utilities which can be combined using pipelining. A separate process runs in a separate address space, protecting other processes from crashing in case of any failure. A disadvantage of spawning is that any data exchange between processes requires a more or less complex inter-process communication (IPC) protocol.

Windows, almost from the beginning, supported threads. Threading has the advantage of more efficient data exchange because all threads within the same application have access to the same memory. But this can also become a disadvantage, as in our case: the need to coordinate the activities of multiple threads complicates design and may negatively affect performance and stability.

Nevertheless, when properly used (as with any other technology), threading is a very powerful feature. Many UNIX flavors now support threading (in Linux, for instance, it is quite a recent addition). A properly designed multithreaded server application will in most cases be more efficient than a similar multiprocess application (because it needs no IPC protocols).

On the other hand, spawned processes are better isolated from one another and do not need synchronization of internal RTL functions. This observation, actually, gave me the idea of how to work around the problem. If each parallel processing task runs in a separate process and creates only one instance of the COM object, even a multithreaded run-time library that synchronizes its basic functions will run very efficiently, since there is no concurrent access. Each process attaches its own instance of the same DLL, so they will not interfere.

Solution #2: Spawning processes

The implementation of this solution was pretty simple. The only thing I needed was a suitable IPC protocol for managing multiple processing tasks. I also needed to solve the problem of expensive process creation: creating a new process for each processed document would ruin the performance of the entire system (there may be hundreds of input documents per second). There was an additional complication imposed by the client application: it is written in Java and uses the JNI (Java Native Interface), which requires a stateless reentrant DLL (not exactly true, because I can call back to the Java object's properties and even change them, but then I would need to map pointers to some Java data types, and I was not sure that was a safe approach).

What I knew for sure was that the input and output information is compact (about 1 KB) and that only a few processing tasks will execute in parallel (it is not efficient to have more than 2-3 parallel running processes per CPU anyway). Each processing task may take from tens of milliseconds up to minutes, depending on the input information; in most cases it varies from 0.1 to 10 seconds.

I decided to exchange input and output data using a shared memory table implemented in the JNI DLL that Java requires anyway. This DLL declares a shared memory table, creates new processes if necessary, passes input parameters, and collects the results. Access to the shared memory table is synchronized by named mutexes and events. The processing tasks also load this DLL and register themselves. The system is self-regulating: if a processing task crashes, the DLL will create a new one; if a processing task gets no input requests within a specific timeout, it terminates automatically. The shared memory table does have one limitation - it is a bottleneck itself :-). I decided that this is not significant under my system requirements: a four-processor server is able to run up to 8 parallel processing tasks (assuming hyper-threading), and managing 8 entries in the shared memory table does not impact performance noticeably.
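The slot-management logic of such a table can be sketched roughly as follows. The real implementation lives in a shared memory section guarded by named mutexes and events; in this portable illustration std::mutex stands in for the named Win32 objects, and all the names are mine, not the actual DLL's:

```cpp
#include <cstring>
#include <mutex>

// Fixed-size request table: one slot per worker process. A real version
// would place this struct in a file mapping (CreateFileMapping) and guard
// it with a named mutex so several processes can share it.
const int kMaxWorkers = 8;        // ~2 tasks per CPU on a 4-way server

struct Slot {
    bool busy;                    // slot claimed by a pending request
    char input[1024];             // compact input document (about 1 KB)
    char output[1024];            // result written back by the worker
};

struct RequestTable {
    std::mutex lock;              // stand-in for a named mutex
    Slot slots[kMaxWorkers] = {};

    // Claim a free slot and copy the request in; returns -1 if all busy,
    // in which case the caller may spawn another worker or wait.
    int submit(const char* request) {
        std::lock_guard<std::mutex> guard(lock);
        for (int i = 0; i < kMaxWorkers; ++i) {
            if (!slots[i].busy) {
                slots[i].busy = true;
                std::strncpy(slots[i].input, request,
                             sizeof(slots[i].input) - 1);
                return i;
            }
        }
        return -1;
    }

    // Free the slot once the worker's output has been read.
    void release(int i) {
        std::lock_guard<std::mutex> guard(lock);
        slots[i].busy = false;
    }
};
```

Since the critical section only flips a flag and copies about 1 KB, contention on the table lock stays negligible even with all 8 slots active.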

But before diving into design and implementation, I needed to prove that this solution would let the component scale. To check it, I ran the same test application in parallel processes instead of parallel threads. Look at the diagrams - they speak for themselves. The memory allocation test scales perfectly now.

A couple of important notes are in order here. Note that the best effect is seen with four parallel threads/processes, because the hyper-threading feature makes two processors work like four. Another important thing is that while we gain a 200% speed boost with four parallel processes, the processing time of each single "document" doubles. This makes having six parallel threads/processes inefficient: while the total processing time changes insignificantly, the single "document" processing time increases.

I also executed a similar test with parallel processes on a single-processor computer. I executed only test 2, so I do not provide the diagram here. It is just important to note that multiple processes behave exactly like the pure computational test in the first run: the graph is a line parallel to the X-axis. The results of this test are also attached together with the source code.

In real life, this solution did work for the background service I developed. By working around a 3rd party component that we could not change, we got an average 1.5-2.0 times speed improvement. More importantly, this solution allowed the component and the service to scale.

Conclusion

As a result of this small research, I would summarize the following:

The standard multithreaded C++ run-time library memory manager may have a serious impact on application performance on multiprocessor servers. This impact is easy to miss during the application design stage because it is hidden inside the run-time library.

It is especially important for applications that allocate complex data structures and are supposed to run in parallel threads (OOP techniques often encourage intensive memory allocation).

In most cases, the problem can be avoided by using a custom memory manager. When it is impossible or difficult to replace the C++ run-time memory manager, there is a way to work around the problem by distributing the job to parallel processes instead of threads.

I would appreciate your feedback on the article. Did I miss something obvious? Am I mistaken? Are there any other options?

Appendix A: What's next?

There are some basic Win32 API services that can significantly impact application performance on multiprocessor computers. For example, we once found a scalability problem with a component we used, and the company that developed it confirmed that the problem was caused by overuse of the GlobalAlloc() function. They plan to fix it in future releases. There may be other, less trivial, problems and solutions. It would be nice to see more information and articles covering scalability issues.

Appendix B: Attached files

The attached archive contains the source code and a compiled executable of the test application. The test application is a console command-line utility. Its source code is located in the sources sub-folder. The test application was developed with Visual C++ 6.0 SP5, and you should not experience big problems moving it to newer versions of Visual Studio. For debugging this test application, I used an improved set of debug macros (QAFDebug.h/cpp) described in my article "Code that debugs itself".

There are several batch files developed for Windows 2000/XP command shell.

run_threads.bat executes the complete test (3 sets) with parallel threads. This batch file does not require parameters; it calls run_threads_impl.bat to do the work. The test produces three log files named run.log.<X>, where <X> is the test set number.

run_processes.bat executes a partial test with parallel processes. It is difficult (if possible at all) to wait for the completion of several processes using Windows shell commands, therefore I ran each iteration manually. This batch file requires three parameters and calls run_threads_impl.bat. It produces many small files named test.<T>.<N>.<P>.log, where <T> is the test number, <N> is the count of parallel processes, and <P> is the process number. The batch file parameters are:

run_processes.bat <T> <N> <I>

where <I> is the count of iterations for each parallel process; <T> and <N> are described above.

The archive also contains four sub-folders with the results:

threads1cpu_results - contains the results of the parallel thread tests on a single-processor computer.

threads2cpu_results - contains the results of the parallel thread tests on a multiprocessor computer.

processes1cpu_results - contains the results of the parallel process tests on a single-processor computer.

processes2cpu_results - contains the results of the parallel process tests on a multiprocessor computer.

Appendix C: References

Windows 2000 Performance Guide by Mark Friedman and Odysseas Pentakalos, 2002

This book has an excellent chapter on multiprocessing (chapter 5, to be exact). It actually led me to the solution of the problem I had. The theory section of this article is built mostly on this book's material.

An excellent book on performance optimization. It does not cover multiprocessor systems, but it explains the architecture and performance characteristics of modern Intel processors and memory chips. Unfortunately, at the moment this book has been published only in Russian, so you will not find it on Amazon (let me know if you do).

Effective STL by Scott Meyers, 2001

This book covers interesting aspects of custom memory allocators in STL and provides useful references. Look at the items 10 and 11.

This reference covers the dynamic heap functions and the new low-fragmentation heap introduced in Windows XP/2003.

Distributed Application Programming in C++ by Randall A. Maddox, 2000

This book discusses very basic things but it has a couple of paragraphs popularly describing the difference between multithreading and multiprocessing (at the end of chapter 11).

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Programming computers since entering university in 1992, but dreaming of programming long before putting my hands on my first computer.

Experienced in cross-platform software development using C++ and Java, as well as rapid GUI development using Delphi/C#. Strong background in networking, relational databases, Web development, and mobile platforms.

I like playing guitar, visiting historical sites (not on the Internet, in the car) and cooking meat with friends (apologies to the vegetarians). Look for more information on www.schetinin.com

Thanks very much for this article. It is becoming more of a requirement to scale modern-day applications on available hardware. My initial idea was to make use of threading, but I never considered the implications of shared memory. This article has definitely helped a "beginner" like myself.

Rolling your own custom solution is not necessarily the best approach. Consider using Hoard. It's a plug-in replacement for malloc/free, and not only does it scale well for multiple processors/cores, it's also much faster than the NT allocator. Incidentally, there are lots of other problems with the NT allocator beyond heap contention; the Hoard web site has more info.

You are right, there is really no point in developing an in-house solution, at least without serious reasons.

But there is a point in using a single-threaded memory allocator, or a specialized fixed-record-size allocator, for specific purposes (like caches or buffers). I suppose they will beat any general-purpose multithreaded memory allocator, no matter how efficient it is.

In this forum, there were many messages about Hoard and other similar libraries, some commercial (check the forum for their titles). There is no good comparison of these libraries, only those done by their developers, with quite predictable results.

I think that your point - that it is a good idea to use a special-purpose ("custom") memory allocator for specific purposes - is a widely-held misconception. In fact, custom allocators appear to be of only very narrow usefulness when compared against a good memory allocator (read: not the Windows allocator). I wrote a paper on this topic in 2002 called "Reconsidering Custom Memory Allocation" that shows that a good general-purpose allocator (namely, the Lea allocator) matches the runtime performance of most custom allocators. There's one important exception -- "region" or "pool" like allocators can do better, although not much better than the generalized allocator ("reap") presented in the paper. Here's a link to the paper.

As you can see, many choices made by the NT design team were flawed, untested, and unworkable. "Dave" wasn't as smart as he thought he was.

For one thing, all I/O processing should have been on one CPU.

The "bottleneck" argument is pure fantasy. For one thing, device drivers are only allowed to execute a very small number of instructions at interrupt time. Most of the work of a device driver is done in a DPC (the kernel equivalent of an APC). So lack of parallelism in interrupt handlers would make no discernible difference, compared to the HUGE problem of synchronization in device drivers on an SMP. On a single-processor machine, all you need to do is raise IRQL to DPC level; on SMP you need to use a lock across all CPUs.

In fact, you may find that older device drivers cause a BSOD on SMP systems.

I voted excellent for this article. Not so long ago, I developed a portable parallel processing program. I tested it on Solaris, AIX and XP. The program was very scalable on UNIX, but on XP I was puzzled to see that it made no difference if I was using more than 1 processor.

Even if each thread was not accessing exactly the same memory locations, they were using the same memory array. Your article gives me some leads on how to solve my XP performance problem.

I agree with you on scalability on the Windows XP platform. Even after upgrading from VC6 to VS2003, I faced the same problem.

My program runs faster on a single-CPU system than on a multi-CPU system. In certain cases (initial programming), the system actually ran slower. It was bizarre; things just didn't make sense, and I began to suspect the memory manager as the culprit.

In my case, I developed image processing algorithms running in parallel on a system with 2 dual-core AMD Opterons. Spawning 4 threads may not increase the performance linearly. Since my image is usually larger than the cache size of the CPU, and the algorithm requires access at different x,y positions in the image, the parallel algorithm cannot work. Now my system is left on the shelf. I have no ideas at the moment.

This problem obsessed me for a while, but I had to let it go since it was just for a university assignment.

The problem was to compute values for a grid over x iterations. Each cell needed its neighbors' values to compute. So the algorithm was to split the grid by the number of processors. The first variant I tried was to use a shared grid and synchronize access to the subgrid edges. It didn't work. I thought it could be because all the threads were working on the same array. Then I tried to use private grids for each thread, where the edge values were transmitted to other threads with memcpy after each iteration. It didn't work either. I even tried to fiddle with thread affinity to force the assignment of each thread to a given CPU. It made no difference.

It is also important to note that in my program no malloc was performed during the parallel processing; malloc was called only during initialisation.

The biggest difficulty I had during debugging is that I had no access to profiling data. I developed the program at my place with VC++6, which comes with a profiler, but the SMMP system I had access to had VS2003.NET, which comes with no profiler, and installing VC++6 on that machine was not an option. Anyway, I'm not even sure that the VC++6 profiler behaves correctly on an SMMP system.

I guess the key to making it work is to have a good multithreaded profiler. Only when you know what is wrong can you fix it. However, I'm terribly disappointed by Windows performance. It should not be that hard to make a parallel program work on XP when it works perfectly well on other platforms!

In my findings, the problem is not related to allocation and deallocation during processing. The issue is memory access.

During my loop, each CPU (set with affinity) takes data from the main memory to be processed. While it is being processed, the thread for that particular CPU might be preempted by another task for a short moment. This might also cause part or all of the cache to be flushed out, requiring the data to be fetched into the CPU again.

Additionally, when using two dual-core CPUs, I tried running both threads on CPU 2 and, voila, I noticed some performance increase, but a mere 20-30%. From this I assume that a multi-core CPU performs faster when the data accessed by multiple threads is in close proximity.

Hence I still have no idea how to improve my system's performance on an SMP system without restructuring my entire image-processing algorithm.

Hmm, I do not have any more suggestions to offer. I wish you good luck finding a solution, and when you find one, I hope you will report it back here. I would be very interested in the outcome of your multithread experience, as it looks very similar to mine.

"Solaris, AIX, and XP" sounds like "SUN, AIX, and Intel". I assume they have very different multi-processing organization. I've also heard that parallel processing on SUN is way better than on Intel (usually from DBA guys).

Nice article, thanks. I'd say that VC6 is somewhat outdated by now. I got more than a 2x speed boost simply by recompiling one of my projects with VC++ 2005 (execution time dropped from 25 msec to 10 msec). I'd bet the MS guys revised the memory management routines they wrote back in 1996-97.

And I found the comment posted by Emery Berger, one of the inventors of the Hoard memory management library, extremely interesting. What he is saying is that while the VC++ RTL memory management is not that effective, it is better to choose another (more effective) general-purpose memory management library than to invent a custom one. The research is quite convincing. I quote it here.

Emery Berger wrote:

Hi Andrew,

I think that your point - that it is a good idea to use a special-purpose ("custom") memory allocator for specific purposes - is a widely-held misconception. In fact, custom allocators appear to be of only very narrow usefulness when compared against a good memory allocator (read: not the Windows allocator). I wrote a paper on this topic in 2002 called "Reconsidering Custom Memory Allocation" that shows that a good general-purpose allocator (namely, the Lea allocator) matches the runtime performance of most custom allocators. There's one important exception -- "region" or "pool" like allocators can do better, although not much better than the generalized allocator ("reap") presented in the paper. Here's a link to the paper.

That's quite interesting. The memory address crosses the 32-bit boundary. That would mean a 64-bit architecture. Or is it a typo?

I'm afraid few programs today can be recompiled for a 64-bit architecture without proper testing. There are too many assumptions in today's sources that types are 4 bytes long. I've had some bad experience with that.

If you have the program's sources, run the program in a debugger, and reproduce the problem. There should be an invalid pointer somewhere...

The article at http://www.tecchannel.de/hardware/986/5.html describes that Windows initializes the stack pointer of each new thread 1 MB above the stack pointer of the previous thread, so the lower 20 bits of the stack pointers of different threads are equal. The cache controller decides which cache line to use for mirroring main memory by the lower 16 bits of an address, so both threads' stack start addresses will be mirrored in the same cache line, and that cache line has to be exchanged on every switch between the threads (if they access local variables). A solution would be to move each thread's stack pointer manually, as described in the article mentioned above. I don't know if this is correct, but it seems to fit here.

Interesting... I'm wondering, shouldn't MS engineers know about such a problem? If it were a serious problem, they would have worked around it years ago. Maybe the performance gain is not worth the price? Maybe there are problems with the workaround?

I am a consultant working in the NYC area for a financial company and we are experiencing scalability problems that we believe are associated with limitations on standard memory allocation libraries on the OS (Windows 2000). Our application simply does not scale at all.

Has anyone done any performance evaluations of the available SMP memory managers?

If so, could you share your data? If not, I will consider doing an evaluation and posting an article about our findings.

What test program should be used to run a thorough performance evaluation on the available libraries? Any ideas?

Should I pluck an already existing test program or develop my own?

I'm thinking that a useful test program would cover a range of allocation sizes, say 1-100, 1-1000, or 100-10000. The number of threads should be selectable at startup, and each thread should have its own pointers that it uses to malloc and free.

Any ideas?

I think this would be rather interesting, to separate the wheat from the chaff, so to speak.

There may be different memory utilization scenarios, and different memory managers may be more suitable for different scenarios. A draft organization of a test program could look like the following...

By allocation size:

1. Allocations of fixed size (stacks/queues, say, for network packets, mostly small)
2. Allocations of irregular size (file/string processing, may allocate large blocks)
3. Mixed allocations, but many of them of several fixed sizes (OOP program with lists or trees of objects)

Point #2 though, testing by allocation frequency: I don't understand the 'processing' portion of it. How does that test an allocator? I can see a) many allocations, then b) many deallocations, but I have trouble with the 'processing' part.

Point #3, by thread context: not sure I understand that one either. How does it test the throughput of an allocator if one thread does the allocations and another thread does the deallocations?

What I envision is something like the following:

1) Do some number of preallocations.
2) Enter a tight loop, then call free() and reallocate with a different size than was just freed.

Pretty simple I think.

I am looking at a program by Larson and Krishnan called larson.c that comes with Hoard. I'm not sure I like this program as-is: it takes user input via scanf(), and I need something that can be run from a batch file or a UNIX shell script. I might still go with a slightly modified version of it for starters, modified to take its input from command-line args instead of scanf() so that it can be run from a batch file or a UNIX shell script, or better yet, from a Perl script.

I want to test everything from 1 to 200 or so threads, allocation sizes from 100 to 5000 bytes, and anywhere from 100 to 3000 preallocated pointers per thread. Things like that.

We'll see if this larson program is up to the task, or if I need to seek out another or write my own. I will post within the next day or two about what I decide.

I'm still looking for someone who already has results, though, as that would save me time. But I will do this on my own if need be, as I can apply the output to our work-related scalability problems.

For Point #2 - you are right, no need in "processing". But there is a clear need for these two scenarios. One will allocate many objects and then free all of them at once. Another scenario might allocate/deallocate objects of different sizes randomally. It loads the memory manager differently since it has to compact holes in its memory buffers.

For Point #3 - when all allocations and deallocations are done from the same thread (per-thread), Hoard and the like perform very efficiently. But once you allocate memory in one thread and pass the pointers to another thread, which processes and then frees them, Hoard works as slowly as the regular CRT memory manager.

Regarding the number of threads - I see little point in testing 200 threads. Having more than one thread per processor is inefficient for CPU- or memory-bound threads; only I/O-bound threads may run efficiently with more than one thread per processor. So it very much depends on the number of CPUs in the test computer. Normally, you should test different memory managers with a thread count less than or equal to the number of CPUs on your test computer. Having more threads than CPUs will just ruin the performance.

Andrew, regarding your last comment on Point #3, where the allocations are done by one thread and the frees by another: based on my experience, I believe that is out of the ordinary. In all of the multithreaded server applications I've been involved with, I've never seen it done, not once that I'm aware of. I believe the more common situation is where a single thread does a buildup with malloc(), then, once it has the memory it needs, starts freeing and reallocating, and when complete, frees all of its memory.

As for the number of threads, where I stated that testing should include hundreds of simultaneous threads: again, based on my experience, the number of threads running in an application was never limited by the number of CPUs in the system. I've worked on applications where there were literally between 100 and 200 threads running at any time. I've never worked on anything larger than that, but I've heard of such applications.

Basically, the application I'm currently working on can have around 100 threads at any given time, and each of those threads puts some pressure on the memory allocation libraries. Therefore, for me to be confident that any allocator we plug into our application will work and scale properly, I'll need to test it with at least 100 threads, and I plan on testing beyond that. If the allocator falls on its face after 50 threads in a test environment, what confidence do I have that it'll solve my current problem with 100 threads? More than likely, not much, even though our application would never put as much pressure on the memory allocator as a test program would.

As for looking into the 'larson' program shipped with Hoard, I've canned that; it's the command-line parameter thing. If need be, I'll write my own test program to do this, but in the meantime I'm going to continue looking for alternatives that have already been developed.

Again on Point #3: it was Thomas George who provided an example of an application with such a memory usage scenario, where the memory is freed by another thread. So there is such an application.

About hundreds of threads - if they are all I/O-bound and spend most of their time waiting for something, then it's okay. It's just what I've learned from my experiments: when I have, say, 3 memory- or CPU-bound threads working on 1 physical CPU, they take the same total time as 1 thread doing the 3 tasks one by one, but with 3 threads each individual task takes 3 times longer to finish.

So I've given up my search for an acceptable test program for performance testing memory management libraries and have decided to write my own. Once I complete the tests, I plan on writing an article based on the results and posting it here at The Code Project. It'll be my first article.

I plan on testing the following libraries, because they were the ones mentioned here:

I'll be asking for demo copies of smartheap and winheap, most likely sometime this afternoon. I imagine that by the end of the week or early next week I'll have the results from the performance testing.

What follows is the guts of the program I have written to do the testing. Please comment on it if you'd like. I think it addresses most of my concerns about how and what to test, along with some of Andrew's concerns as well. There are some things missing from the testing, but overall I think it will be a good test.

I don't know how it's going to look, as I'm cutting and pasting it into the box. This is not the entire program, just the guts:

Sigh. Last Thursday I was moved from the scalability project to fixing crucial bugs, possibly at customer sites, over the next few weeks. I'm not sure when I'll have a chance to get back to this, or even whether we'll address the scalability issue at all.

More seriously, I did look at the test program that comes with Hoard, and it's crap. What I came up with might be similar, but what I wrote is readable; what comes with Hoard is not. I'd be ashamed to put my name on it. In fact, I wouldn't put my name on it.

I haven't seen the winheap test program. If anyone gets around to doing this benchmarking, perhaps we should use the one that comes with winheap? If you have it, could you send it to me?

When you do get around to running those benchmarks, add ESA from Cherrystone Software as well. We did a performance comparison late last year against one of the libraries on your list, and ESA blew the doors off of it. I won't say which one, but I'd like to see benchmarks against all of the others as well. We're still in demo mode with ESA, and if any library can touch it, we'd happily consider using that instead.

I have to echo Steve's recommendation of ESA. Great product. Nothing can touch it in terms of speed, and the scalability is fantastic in our product running on up to 32 CPUs. We haven't had any problems with it either.

But, after it was all said and done, they sent us collared shirts with the Cherrystone logo on them. That was unexpected, and a very pleasant surprise. Never had a software company do that for me before.

Well, I am sorry to say that I never did get around to writing that paper on memory allocator performance, but we did complete our internal review of SMP-based memory manager libraries. We ended up hiring another company to do the testing for us, as we were just flat-out too busy to do it ourselves. I can't publish the paper that Comcast paid this company for, but I can summarize the results.

The following libraries were tested on Windows 2000/2003, Solaris, and Linux.

The hands-down winner in speed was ESA from Cherrystone. It was almost unbelievably faster than both smartheap and Hoard. It was pretty much exactly 2x faster than smartheap in the SMP tests while using less memory than smartheap. And the ESA-to-Hoard comparison was hardly a comparison at all, with ESA blowing it out of the water by as much as 10-20x in most cases. Hoard wasn't even considered for our project in the end because of this.

In terms of memory usage, Hoard was the hands-down winner; neither ESA nor smartheap could touch it. ESA overall used less memory than smartheap.

So, we chose ESA for our project. I just wish I could have written and published the paper, but time got in the way.

Any chance at all you can somehow manage to post the paper? I'd be very interested in it.

Did this 3rd party company by chance test ESA vs smartheap utilizing the EsaSetLock() functionality of ESA? This adds locking to the various memory pools that the threads utilize. I'd be curious how ESA fared against smartheap in this regard.

Peter, I am sorry, but I can't post the paper. I went to management twice and was turned down both times. I can however talk about the results without posting the paper.

To answer some of your questions: yes, they did test ESA in both locking and non-locking modes, and ESA is faster than smartheap either way. For instance, on Windows 2000 on a 4-way Xeon machine, here's a summary of the results (lower numbers are better):

Threads: 20, CPUs: 4

ESA (non-locking): 29 secs ESA (locking): 44 secs Smartheap: 64 secs.

Threads: 50, CPUs: 4

ESA (non-locking): 47 secs ESA (locking): 67 secs Smartheap: 95 secs.

So in general, ESA is twice as fast as smartheap if you don't need the locking, and still much faster even if locking is on.

The locking mode of ESA, from what I understand, isn't needed by all applications. We have several multithreaded applications that needed it, and a few that didn't. If you don't need it, by all means don't use it, since it slows ESA down a bit, but ESA is just plain faster no matter how you slice it. Significantly faster.