I have a 3D physics engine, which has a couple of performance critical sections (broadphAse collision detection, narrowphase detection and collision response). I've pushed them quite far (as far as possible from an algorithmic standpoint) and would like to try and extract just a little more out of all the systems.

I would like to start looking at the likes of cache misses and cache hits in the program , in specific sections. Using visual studio (professional) how would I g o about detecting these things, and trying to optimise my methods? Note that I know there are better ways to get performance boosts, however this is a leaning exercise.

There are many other tools in case you want to try them out. Each one has its own strengths and weaknesses.

Intel has vtune, which is not free but extremely powerful, and 'pin tools', which are mostly BSD-licensed.

Often people with a budget for both time and money will use the AMD tools on AMD boxes, and the Intel tools on Intel boxes.

You mentioned Visual Studio, but if you are going cross platform, valgrind, gprof and 'perf tools' are Linux profiling mainstay tools. Two of the Valgrind systems, cachegrind and callgrind, do what you describe very nicely, but are linux only.

The last of the big ones, for better or worse, is the Windows Performance Analysis Toolkit. It is often difficult to use well, and the results are usually very verbose. It is an interesting beast. It either gives you exactly the details you are looking for in great abundance, or a huge amount of data that doesn't help. Combine all those facts and it makes the toolkit frustrating; when learning the tools it is difficult to determine if the enormous pile of data you are looking at is useful or not. I'd recommend it if you are comfortable with performance tools, but avoid it if you are still learning.

Once you have performance metrics in hand, make one single change, then get new performance metrics. ALWAYS verify that your changes give better results, and be prepared to revert the changes in source control if they don't help.

I've use the profiler in vs a bit, and I've used codexl for opengl debugging, but the only thing I could make from it was"here's where your method is spending the most time" which stops being helpful. Can I look at cache hits or
Misses? That's pretty much the only other thing I'm aware I should be optimising for. I'm not too concerned about x platform (although my code does work on mac as well, not tested on Linux) but I'm most familiar with windows as a dev environment, so I'd like to start here. I've come across vtune before but figure it's not gonna tell me anymore than the vs profiler with my current level of knowledge.

Step through your code and see what the memory access patterns are like. Have you taken cache behaviour into account when designing the code so far? Can it be redesigned so that the memory access patterns are either more linear or have much more locality? Do you run a single algorithm at a time, progressing in stages, or do you jump around a lot with things like calling virtual void OnHit() in the middle of collision resolution?

CodeXL has support for cache line utilization if you run that type of profile session. Make sure you have the current version, the docs say it was added to the program last November. That is why I mentioned both CodeXL and CodeAnalyst, CodeXL is not mature in some ways, so having both is a good option.

Profilers give you numbers, not really recommendations. For cache use, it will give you a percentage called "cache line utilization", also "Total L1 eviction", "Bytes/L1 Eviction", "Accesses/L1 Eviction"

But that brings us to the other issue...

the only thing I could make from it was"here's where your method is spending the most time" which stops being helpful. ... [Cache is] pretty much the only other thing I'm aware I should be optimising for.

Optimizing is an art and science. It usually requires a lot of experience, plus a broad knowledge base about programming generally. You need to take profiler results, a bunch of numbers that give you statistics about performance, and turn them into actionable items. It is rather hard, even for people who do it regularly.

ALL NUMBERS IN THIS THOUGHT EXPERIMENT ARE MADE UP, BUT APPROXIMATELY CORRECT.

Sampling profilers just pause the program every few microseconds, look where it is, and unpause it. That means numbers are not exact. It also means you need to reason through the system very carefully. Often you will need to get several profile runs where the system is approximately the same, but it will still be slightly different every run, and you'll need to think in terms of generalities, approximations, averages, and Fermi estimates.

So you first run a sampling profile for one second during the middle of the game. You see a bunch of sample numbers. main() is taking 99.93% of the time. That is good, probably. It means most of the time the profiler stopped the program, it was stopped inside the program and not doing some other task. RunPhysics was sampled 44 times, DrawBoard was sampled 42 times, and Sprite::Draw was sampled 1536 times. Is that good? Or bad? Matrix44::Multiply was sampled 52134 times and is taking 47% of the time. Is that bad?

The problem is that the numbers might be good or bad, and it takes a deeper understanding of the system to understand it.

So with RunPhysics 44 times and DrawBoard 42 times, you probably had 45 or so frames. So you think about the frames. Is approximately 1536 calls to Sprite::Draw a reasonable number? It means about Is that about how many times you would expect it to be called? Is it being called too many times? Perhaps you know you have 32 sprites on the screen, and 1536 / 45 = 34, so it is approximately right. Even if the number matches the number of frames, you need to ask the deeper question of if that is reasonable for the hardware. Does that mean 1536 texture swaps? Is that reasonable for the hardware? Does that mean 1536 calls passing raw information over the bus, or is it 1536 calls using data already stored on the card?

You look at the number of matrix multiply operations. The 52134 seems pretty high, it means roughly 1200 matrix multiplies per frame. For only 32 sprites, that is probably a bit high. So you need to hunt down the call stacks to see if there is some unexpected work going on. Is it reasonable for RunPhysics to sample 1142 matrix multiplies per frame?

Is the length of time spent in MatrixMultiply reasonable? You can see that the 52134 samples takes 79472 CPU clocks. Is that good? Is that bad?

The profiler enables you to ask questions about your code. Interpreting the answers gets very tricky.

So let's say you think your MatrixMultiply is taking too much time. You then need to figure out an alternate solution. So you do some search on Google, find a better algorithm written with SIMD intrinsics.

You put the changes in to the program, run the program with the profiler, get to about the same spot in code, run the profiler for about a second, and you get a totally different set of performance numbers.

This time you do the numbers and figure you got approximately 55 frames. MatrixMultiply had 64129 samples and only took 48726 CPU clocks. Well, that is certainly an improvement. It is roughly half the time it used to be. Is that good enough? Can you do better? Since it is an improvement you might as well submit the change to version control. You look it over, figure you don't see how to improve the algorithm you found online, so move on to another item.

Your next oddity is the number of matrix multiplies in RunPhysics from the first run. When you divided it by 45 frames you estimated about 1142 matrix multiplies per frame. This time dividing the total by 55 frames it looks like 1092 matrix multiplies per frame. Again, is this reasonable? You think it is a bit high so you review the algorithms in your physics. After a bit of study you see something that is suspicious. For this thought exercise, we'll say it is a loop where every physics object is tested against every other physics object for collisions. You think that is a bit odd. That's an N-squared algorithm, so each new physics object greatly increases.

You think some more. What are your alternatives. An all-to-all test is pretty bad for large numbers, but in this case you have 32 physics objects. That means about 1024 comparisons. Is that really too much? Your game will never go bigger than that. There are several ways to reduce the number. You could add a spatial tree to only check nearby objects for collisions, but that means writing a spatial tree, keeping it updated on every physics change, and using it in various places. You can limit the tests to only objects marked as having moved, but that requires adding some flags and some extra processing elsewhere.

Continuing the thought, you also have the option to do nothing. Having 1024 tests really isn't that high of a number. The cost of adding and maintaining a spatial tree is pretty high, plus you'll need to spend the time to debug it, so that is probably out. Having a list of "dirty" objects, objects that have moved and require collision tests, it requires more tests than the spatial tree would require, but fewer tests than the all-to-all test. Since objects are frequently moving in your game, you figure that usually 15 to 20 tests will normally be required every frame, so it probably isn't worth the effort right now. You still make a note about it: "// TODO: Possibly add a spatial tree rather than all-to-all comparison." You also make a note on your trusty paper pad, so the code comment doesn't get forgotten in oblivion.

Continuing on since you mentioned cache specifically, you run a new profile sample for cache usage. You find a function that has a high number of L1 evictions. One of your search routines had about 12000 L1 evictions over the sample. So you start asking the questions again. Is that actually bad? Assuming it is actually bad, what techniques do you know that can help fix it?

You need to decide if the results can be improved, estimate how much you can improve them and how long it will take, and decide if the effort is worth it. It means you need to go mentally through your treasure trove of experience. What techniques exist that are cache friendly? Why are they cache friendly? How can we reduce the work? How can we eliminate the work entirely? Might those specific techniques work in this situation? You can also search online, looking for algorithms that match.

And THAT paragraph is the critical one for optimization.

What experience do you have in alternative solutions when the first one doesn't work? Some algorithms you can just plug into a search engine. Let Google find you a well-written matrix multiply, it is such a common algorithm, it is well documented, and there are some really cool implementations out there. But you can only rely on the Internet for a small number of algorithms. The job of programmers is to come up with algorithms to solve problems. Your game is unique, your implementation is unique, there are many unique challenges that nobody else has ever optimized, or if they have, they probably didn't blog about it.

The crux of optimization is that you MUST have enough experience to know alternatives. You also MUST have a knowledge of how all the parts work.

Only having one isn't enough. You may know something is slow but if you don't have the experience to form an alternative solution, you are stuck with the slow existing solution. Or, if you do know some alternative algorithms but you don't know how the parts work know why they have the performance characteristics they do, you are left to just guessing at each of the alternatives without any knowledge, blindly trying them until one hopefully is better.

If you have both experience and broad knowledge, you can use it in optimization. Let's say for some specific sorting condition you might have a lot of background knowledge. You can recall about 20 different sorting algorithms from your studies. Bubble sort is potentially an incredibly slow algorithm, with a large number of potential passes. But if you have a bit of knowledge, such as knowing that your data set is probably completely sorted, and if not sorted then nothing is off by more than a single swap, a slightly modified bubble sort could be your ideal algorithm. You also have knowledge that a single linear pass through the data is very cache friendly. You can take both the experience and knowledge, and construct a single pass algorithm that is based on the bubble sort, but in your specific case due to the pre-sorted collection you can guarantee that a single pass is enough. You can then convert a potentially horrible algorithm into a potentially amazing solution for a specific problem.

But in order to do that, you need both knowledge and experience. With both, you can wrap your brain around problems, figure out alternatives, estimate the costs and benefits of the alternatives, test them out. If they work, submit them, if they don't work, try again.

That is the process. It is also why profiling and optimizing code is a difficult task, usually left to the more experienced developers. There are almost always a few easily identified and easily corrected changes. It is can be obvious that when you have millions of calls to strlen() something is wrong and you can hunt down why. You can dabble in optimization and try to learn the steps. But once the easy stuff is gone, to really do it well requires some fairly rare skill sets, broad experience, broad knowledge, patience, and lots of thinking.