I recently discovered that dynamic memory allocation is pretty slow, so I've been looking for a way around that. I'm considering going into low-level video game engine programming after college/uni (studying Maths/Physics at college, then Computer Science at uni), and it occurs to me that dynamic memory allocation is a fairly important part of an engine; if it's slow, it's a potential bottleneck.

I imagine one of the problems with dynamic memory allocation is that you have to ask the OS for the memory, then wait for it to find some and give it to you (please feel free to correct me if I'm mistaken). So I came up with the idea of allocating the memory to a buffer during initialisation, then I can use my own custom methods to manage that memory throughout the application. Since I have already allocated the memory, I'm simply passing around addresses from my buffer rather than asking the OS for more memory. I can't imagine this is something new, though, so I'd appreciate it if anyone can give me some advice, show me some material to look at, or point out any problems with this idea (e.g. is there actually a significant performance boost using this method?).

This is something I quickly threw together as an example; it is by no means supposed to be a real-world implementation:

Quote:

So I came up with the idea of allocating the memory to a buffer during initialisation, then I can use my own custom methods to manage that memory throughout the application.

You've just reinvented what malloc does, only you haven't figured out how to free memory yet. Nor have you determined a way to handle fragmentation, nor a way to make your approach cache- and TLB-friendly. You're also not thread-safe. And you have no way to handle more than 1GB, nor a way to even detect you've gone beyond that limit. And if you fix that, you'd need to fix it again to handle more than 2GB on common systems.

Granted, maybe you intended to leave these out for a performance gain. Have you benchmarked your implementation against the standard library?

Quote:

Since I have already allocated the memory, I'm simply passing around addresses from my buffer rather than asking the OS for more memory.

In modern OS implementations, asking for 1GB of memory doesn't mean it gets allocated. You first have to use it - and that means page faults for your implementation just like malloc.

Quote:

I can't imagine this is something new, though, so I'd appreciate it if anyone can give me some advice, show me some material to look at, or point out any problems with this idea (e.g. is there actually a significant performance boost using this method?).

There are many ways to get it wrong, so unless you have code which hits some strange corner case in the default implementation you're probably wasting your time. Even then you're as likely to make it worse than better unless you have a lot of time to spend on it. It would be an interesting learning experience, though, but remember that you're competing against something that's been in development for decades.

08-22-2011

Ushakal

Quote:

Originally Posted by KCfromNC

Granted, maybe you intended to leave these out for a performance gain. Have you bench marked your implementation against the standard library?

Like I said, the code I posted isn't meant to be a real world implementation; it's simply an example of what I mean (and the actual amount of memory allocated isn't of importance right now). I want to know if allocating memory into a buffer within my application in order to reassign that memory using my own methods would see a performance gain over using new/delete all the time. Naturally I'd have to account for all of those situations you listed in a real world example, I just want to know if there's anything to gain here. I don't mean to try and replace malloc/new/delete, simply "pre-allocate" the memory so it's readily available, and reassign it when needed.

Quote:

In modern OS implementations, asking for 1GB of memory doesn't mean it gets allocated. You first have to use it - and that means page faults for your implementation just like malloc.

What if I initialise the buffer then? Will that avoid that problem? I tried it with my simple example, by adding this loop in main:

Code:

for(int i = 0; i < 1024 * 1024 * 1024; ++i)
    g_buffer[i] = 0;

On my PC, it takes about 2300-2700 ms to initialise the entire GiB, which may be worthwhile if the performance gains later on in the app are notable. Does doing this fully load the physical memory I've allocated, or will the memory be unloaded if it isn't used again for a while?

Naturally the result may vary depending on the machine/OS, but the results look promising (unless I've made a mistake somewhere). I think I'll look into memory pools and custom allocators for some more information anyway (thanks for the direction, laserlight).

Quote:

There are many ways to get it wrong, so unless you have code which hits some strange corner case in the default implementation you're probably wasting your time. Even then you're as likely to make it worse than better unless you have a lot of time to spend on it. It would be an interesting learning experience, though, but remember that you're competing against something that's been in development for decades.

Yeah, I know that far more experienced coders than I have been mulling over a multitude of different ways to improve memory allocation performance, but like you said it could still be a good learning experience to investigate (so long as it's not too much of a waste of time).

I'm interested in low level systems programming, so things like this are quite interesting to me. Incidentally, do you know of any other good articles that cover other low level systems programming topics?

08-22-2011

jinhao

Code:

for(int i = 0; i < 1024 * 1024 * 1024; ++i)
    g_buffer[i] = 0;

This snippet of code is a real performance killer. There are better ways to do it.

Quote:

Originally Posted by Ushakal

Naturally the result may vary depending on the machine/OS, but the results look promising (unless I've made a mistake somewhere). I think I'll look into memory pools and custom allocators for some more information anyway (thanks for the direction, laserlight).

During dynamic memory allocation, the OS/allocator takes a lock while it searches for free memory, which means your allocation can be blocked until another thread or process that is allocating finishes its allocation. Allocating a big block of memory in advance and managing it yourself is a way to reduce this race condition.
Additionally, if your program does not allocate/deallocate frequently, the performance increase from using a memory pool will not be obvious.

That is just wrong. Memory may still be paged in and out even after using it.

That said, happily, paging has nothing to do with the performance differences between average use of a general purpose allocator and appropriate use of a purpose built allocator.

Soma

08-24-2011

Ushakal

Quote:

Originally Posted by phantomotap

That is just wrong. Memory may still be paged in and out even after using it.

Can you point me in the direction of any decent articles about memory paging? I know it's wrong, I just don't know why. What are the consequences of a page fault and why do they occur? Are there any good programming practices that result in fewer page faults?

08-24-2011

AndrewHunter

Quote:

Originally Posted by Ushakal

Can you point me in the direction of any decent articles about memory paging? I know it's wrong, I just don't know why. What are the consequences of a page fault and why do they occur? Are there any good programming practices that result in fewer page faults?

Quote:

Originally Posted by jinhao

During dynamic memory allocation, the OS/allocator takes a lock while it searches for free memory, which means your allocation can be blocked until another thread or process that is allocating finishes its allocation. Allocating a big block of memory in advance and managing it yourself is a way to reduce this race condition.

Additionally, if your program does not allocate/deallocate frequently, the performance increase from using a memory pool will not be obvious.

Both things you did here are examples of premature optimization that won't speed things up at all, because any half-decent modern optimizer will do this for you.
Look up "partial loop unrolling" and "common sub-expression elimination".

It only makes the code harder to read and maintain, and potentially slower, because the compiler/optimizer may get confused by the more complex code and fail to perform other optimizations.

Learn what the optimizer can do first, and if you want to optimize further, focus on what it CANNOT do. Most of the time that means algorithmic optimization.

Also, this is not a race condition; a race condition is something else. This is just a synchronization bottleneck.

08-24-2011

cyberfish

Quote:

I want to know if allocating memory into a buffer within my application in order to reassign that memory using my own methods would see a performance gain over using new/delete all the time.

All modern C standard libraries do this for you already. On POSIX systems, malloc() uses sbrk() (or mmap()) to allocate large blocks from the OS, divides them up, and hands the pieces to you. Something similar probably happens on Windows.

The reason malloc() is slow (relatively speaking) is that it knows nothing about your memory usage pattern, yet must work reliably and be reasonably fast and space-efficient in all situations (imagine a program that allocates alternating 10 GB and 2-byte blocks and frees them in random order).

The only way you can make something faster is if you have additional information about your memory usage pattern that you can exploit -
For example, all allocations are in 4KB blocks, you always deallocate in reverse order of allocation, etc.

If you could just write a better general-purpose allocator, it would have been made malloc() already.

08-25-2011

Elysia

If you do many small allocations, then malloc/new will generally be rather slow. This is a known problem, and that is why memory pools exist.
If you do larger, but fewer allocations, then new will generally be efficient.