That is almost 400x as slow! Is this an anomaly of me running Windows on a virtual machine, or are new/delete always this slow compared with local variable allocation? I also did not find that simply dereferencing j caused any significant slowdown in the code.

In the first case, the loop has essentially nothing to do: the local int will have been assigned to a register, or at worst a slot on the stack. Either way, setting it to 0 costs next to nothing. Most likely your compiler is not even bothering to set it since it's unused, and may even be removing the loop altogether since it does no work.

In the second case, each time you call 'new int' the runtime allocator has to find space for it (occasionally asking the operating system for more memory), performing a fair amount of work to choose the best place to allocate the int, with perhaps further logic to maintain the heap; delete does much the same in reverse.

A lot of this behaviour in C++ is implementation dependent, but as a general rule in a nutshell:

- stack allocation requires incrementing a pointer by sizeof(WhatIWant)
- heap allocation requires synchronizing over all threads and walking some kind of list until a suitable chunk of memory is found, modifying the list accordingly, and returning the pointer.

Note that this applies to C++ only. On modern Java implementations, for example, new is silly cheap. Also note that the Debug runtime of MSVC, for example, does a lot of extra work (like initializing the allocated memory with a special value and padding the allocated memory with another special value).

Also note that an optimizing compiler will probably be able to completely eliminate the function you are benchmarking, since it can detect that nothing in there has a side effect. I would not trust the benchmark as written. Either the compiler is not optimizing (in which case it's probably the default Debug configuration of MSVC, and as such completely useless for any benchmark) or the function and its containing loop were probably removed completely.

I started programming in 1982 and didn't write anything with dynamic memory allocation until 1992. Dynamic-memory allocation is a high-level construct that isn't obvious to implement in assembly, and therefore it's not a surprise that it's expensive. You should probably avoid using it in tight loops.

I didn't say it doesn't come with a price somewhere else. If Java's new were simply better, with no negative side effect at any other point, everyone would stop using C++ and start using Java (well, not really). The important point above was that new being expensive is something specific to C++. It comes with costs and benefits all over the place. Other languages, even if they have an identical-looking new keyword, do it differently, and things come with different costs and benefits.

That is almost 400x as slow! Is this an anomaly of me running Windows on a virtual machine, or are new/delete always this slow compared with local variable allocation?

There are two ways in which these results can be explained:

Allocation of local variables will typically be done on "the stack", though strictly speaking this is an architecture-specific implementation detail (I don't believe C++ mandates a stack explicitly, only "automatic storage"). Allocation of objects with new and delete will typically be done on "the heap" (technically, "the free store" according to the C++ standard), which normally involves function calls that look for "gaps" in virtual memory (crudely speaking).

A modern compiler will likely eliminate the first loop entirely. You might find a tool-chain that could eliminate the second loop, too, though this is less likely as I guess it would have to be done as a global optimization to ensure new/delete haven't been overridden.

I also did not find that simply dereferencing j caused any significant slowdown in the code.

If you want a thorough understanding about what's going on here, a really good exercise is to find out how to look at the assembly code your compiler generates and see if you can follow what it has done. Reading it will necessarily involve understanding what registers are, what "the stack" means on x86 and a bit about the calling convention(s) used.

Apart from the already mentioned obvious (the first function is a no-op), I don't consider 787 cycles for a general allocator slow. If you subtract the cycles for your for loop and for the non-inlineable library call, that's around 700-750 cycles altogether.

Actually, for a non-specialized allocator which has to deliver regardless of size and situation, and is running in debug mode (unless you use the world's worst compiler in existence, this is demonstrably a non-optimized build), that's a pretty awesome performance.

In comparison, a single cache miss is on the order of several hundred cycles. Thus, realistically, for non-trivial code (one that actually uses a million allocated memory locations), the overhead of the allocator wouldn't matter much.

I was actually trying to generate cache misses with this code. I have read all about them and would like to experience them for myself. I wanted to know if it was better to send in a pointer to a structure for an often-called piece of code, or to just pass the structure by value to avoid a potential cache miss. Can you advise on which you think would be better in that case? I might not be getting cache misses because I have a 15MB L2(3?) cache, but I know not all systems have that.

It depends on the underlying allocator, but most allocators are sophisticated enough to properly optimize for that. All but the most stupid implementations will add a freed block to a kind of free list (some even do this per-thread), and return an element from there on the next allocation.

Therefore, if you allocate/free/allocate/free/allocate an integer, you will very likely get the same address or one of a handful of addresses every time. Thus, cache misses are not going to happen.

To provoke cache misses, you need to either allocate one large block and index into it randomly, or do some ten thousand allocations without freeing, and dereference the returned pointers (preferably in random order). Any access pattern that is either localized or has a constant stride (e.g. accessing every 64th or 253rd byte) will very likely not give a lot of cache misses, due to either several accesses landing in the same cache line anyway or triggering the CPU's built-in prefetch mechanism to anticipate your requests.

I was actually trying to generate cache misses with this code. I have read all about them and would like to experience them for myself.

Without wanting to sound too disheartening, I wouldn't worry about this kind of stuff for the time being. Concentrate on writing solid code and understanding the language (for instance, free store vs automatic storage semantics is a particularly important point in a value-based language such as C++).

If you really care about how much you're missing the cache, it's not something you want to be working out in your head, for the most part. You first want to write code sensibly and then use profiling tools to tell you where you might be able to improve it in terms of performance. Intel's PCM will give you really low-level information of this kind.

I wanted to know if it was better to send in a pointer to a structure for an often-called piece of code,

Neither. Typically use reference-to-const for non-primitive types. Tweak according to profile results. Again, IMHO getting the basics of the language down first is much more worthwhile.

or to just pass the structure by value to avoid a potential cache miss.

Whether or not an object is on the stack or heap makes a difference, as does its size, and the amount of indirection inherent in its data structure. The full answer is extremely complicated.

I might not be getting cache misses because I have a 15MB L2(3?) cache, but I know not all systems have that.

Alvaro: I assumed that by copying the data into the stack then accessing it there, I would certainly not miss the cache. Because the CPU is fast this should be fast also.
edd: Actually my real goal is to implement an Observer scheme where I have, for example, DigitalButtonObservable, AnalogButtonObservable, 2DAxisObservable, etc., and then have a single InputPolyObserver class built into a vector in each UIController. By doing that I could, for example, map the jump of my playable character to three separate actions, i.e. Space Bar, X (360 controller) and maybe a mouse move for some Indie coolness. I know that it is not exactly logical to have the same action map to such disparate input methods, but I want to start with a completely generic setup for the engine code, then simplify when making the actual game. I really want to avoid having to code the input system twice, but rather script the input in any way I please, including mixing input types i.e. Digital vs. Analog.

The reason I can't pass by const ref in this case is that, for example, a digital button would return a bool for buttonDown and a mouse would return its int or float x, and y coords as a struct or whatnot. The only way I could think of to implement this as a single overload of a function in the abstract Observer class, named something such as GetObservableState(...), is to pass something like a void*, then requiring my client code to interpret the event, perhaps by also passing a type, such as an enum specifying whether the button was a digital press or an analog press i.e. DIGITAL_PRESS, ANALOG_PRESS.

I could then simply load up an observer with a function pointer for its code and link it with any button or combination of buttons from any controller, keyboard, or mouse, and the code would simply work as requested. Do you see another way to implement this?

Alvaro: I assumed that by copying the data into the stack then accessing it there, I would certainly not miss the cache. Because the CPU is fast this should be fast also.

It doesn't matter how fast the CPU is (when it comes to this cache/memory stuff); the bottleneck is the memory speed, not the CPU speed. Also, if there is a cache miss when using the object, there will be cache misses when copying the object as well. Just an FYI.

I fail to see what any of that has to do with performance. Input is, relative to your game loop, a rare event. There is usually little point trying to optimise it; there are typically much "hotter" areas where you can actually reap measurable benefit.

There is a lot of stuff going on in this thread. First you were talking about the allocator, then the cache, now it seems to be about design. Try to keep a thread coherent, and create additional threads if necessary for unrelated topics.

Alvaro: I assumed that by copying the data into the stack then accessing it there, I would certainly not miss the cache. Because the CPU is fast this should be fast also.

But where are you copying it from? Any memory you read from there has to go through the cache hierarchy too. I don't want to start sounding like a broken record, but I'd say get the basics down first.

The reason I can't pass by const ref in this case is that, for example, a digital button would return a bool for buttonDown and a mouse would return its int or float x, and y coords as a struct or whatnot.

I still don't see why you can't use references to const, or even which part of the whole system you're talking about here. There are surely a whole bunch of functions involved, any of which could take arguments by value, reference or pointer(?).

The only way I could think of to implement this as a single overload of a function in the abstract Observer class, named something such as GetObservableState(...), is to pass something like a void*, then requiring my client code to interpret the event, perhaps by also passing a type, such as an enum specifying whether the button was a digital press or an analog press i.e. DIGITAL_PRESS, ANALOG_PRESS.

It sounds like you're perhaps doing the observer pattern backwards. The thing being observed tells the things observing it what's going on. You really don't want to start falling back to void*.

I could then simply load up an observer with a function pointer

void* and function pointers are all that's available in C. In C++, use virtual functions instead.

Do you see another way to implement this?

We're wandering well off-topic here (for the second time), and my suggestion may not fit what you're doing, but I'd go with something like this:

So a mouse might have an "observable<mouse_position>", for example. Anything that wants to know when the mouse has moved registers an observer<mouse_position> implementation. This is a very noddy version of something like boost::signal<>.

Thank you very much edd, that is very instructive. I never considered the template version. Can you tell me how I can quote others and post clean and colored code like you did? Also, I have never heard the word "noddy."

Thank you very much edd, that is very instructive. I never considered the template version. Can you tell me how I can quote others and post clean and colored code like you did?

Under each post, there's a button that says "Quote". Click it to quote someone. Alternatively, add [ quote ][ /quote ] (without spaces) tags in your post to create your own quote (but it won't display their name, post, or timestamp if you don't click "Quote"). To get formatted code, use [ code ][ /code ] tags (again, without spaces).