Cross-platform vectorization.

Right now, my physics library implements its own 2D vector operations. In short, what are the options for vectorizing this code in a cross-platform manner? I'm not even sure what to Google for at this point.
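For reference, the operations in question are all simple scalar 2D ops along these lines (a simplified sketch, not the actual code — the names are just illustrative):

```c
/* Illustrative sketch of typical scalar 2D vector ops; not the
   library's actual API. */
typedef struct { float x, y; } Vec2;

static inline Vec2 vec2_add(Vec2 a, Vec2 b){
    Vec2 r = { a.x + b.x, a.y + b.y };
    return r;
}

static inline Vec2 vec2_mult(Vec2 v, float s){
    Vec2 r = { v.x * s, v.y * s };
    return r;
}

static inline float vec2_dot(Vec2 a, Vec2 b){
    return a.x * b.x + a.y * b.y;
}
```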

Yeah, I'm pretty proud of it. It's mostly based on Erin Catto's contact persistence idea, but is implemented from the ground up to store the contacts in a different way. I've also been tuning my spatial hashing code for about a year now.

Running Shark on the box demo again, the main impulse solver function uses 30% of the CPU time, and the functions that apply the impulses are using almost as much. fminf() and fmaxf() are using 12% (6% more for dyld_stub_fmaxf()). I declared the impulse application functions as extern inline, but that's never seemed to work. What's the deal? Also, aren't fminf() and fmaxf() supposed to be built-in functions? I suppose I could just write my own and inline them.
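By "write my own" I mean something along these lines (untested sketch — note these skip fminf()/fmaxf()'s IEEE NaN handling, which I don't think I need):

```c
/* Hand-inlined min/max to dodge the dynamic-library stub overhead
   showing up in the profile. Unlike fminf()/fmaxf(), these do not
   special-case NaN inputs. */
static inline float my_fminf(float a, float b){ return (a < b) ? a : b; }
static inline float my_fmaxf(float a, float b){ return (a > b) ? a : b; }
```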

After adding a midphase AABB check, I was able to take off another couple percent, bringing it to a 30% drop in CPU use. Not bad for less than an hour of work. I didn't expect to get that much more out of it even with vectorization. Maybe I was just barking up the wrong tree.
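The check itself is nothing exotic — roughly this (simplified sketch, names illustrative):

```c
#include <stdbool.h>

/* Axis-aligned bounding box: left/bottom/right/top extents. */
typedef struct { float l, b, r, t; } AABB;

/* Midphase rejection test: if the boxes don't overlap on both axes,
   the expensive narrowphase collision check can be skipped entirely. */
static inline bool aabb_overlap(AABB a, AABB b){
    return (a.l <= b.r && b.l <= a.r) && (a.b <= b.t && b.b <= a.t);
}
```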

However, now the function that calculates the collision impulses is using 70% of the CPU in the box stacking example. Unsurprisingly, it does a lot of vector operations. If vectorizing the vector operations even speeds them up by 10%, that would still probably lead to another 5% overall speedup.
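To make the question concrete, the kind of pattern I imagine is a packed type behind #ifdefs with a scalar fallback — an illustrative sketch only, the type and function names are mine (an AltiVec branch for PPC would presumably slot in the same way):

```c
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>
/* Two 2D vectors packed as (x0, y0, x1, y1) in one SSE register. */
typedef __m128 Vec2x2;
static inline Vec2x2 v_load(const float *p){ return _mm_loadu_ps(p); }
static inline void v_store(float *p, Vec2x2 v){ _mm_storeu_ps(p, v); }
static inline Vec2x2 v_add(Vec2x2 a, Vec2x2 b){ return _mm_add_ps(a, b); }
#else
/* Scalar fallback for builds without SSE. */
typedef struct { float f[4]; } Vec2x2;
static inline Vec2x2 v_load(const float *p){
    Vec2x2 v; memcpy(v.f, p, sizeof v.f); return v;
}
static inline void v_store(float *p, Vec2x2 v){
    memcpy(p, v.f, sizeof v.f);
}
static inline Vec2x2 v_add(Vec2x2 a, Vec2x2 b){
    Vec2x2 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] + b.f[i];
    return r;
}
#endif
```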

Oh, and what ever happened to auto-vectorization? I thought that was supposed to be a big deal.

What happened to it? It exists in the compiler; you can turn it on if you like (Xcode has a checkbox, or you can add -ftree-vectorize to your CFLAGS).
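Concretely, something like this (the file name is just an example):

```shell
# Turn on GCC's auto-vectorizer; it needs at least -O2 to do anything.
# The verbose flag makes GCC 4.x report which loops it vectorized.
gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 -c solver.c
```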

The auto-vectorizer in GCC 4.0 (that Apple uses) is next to useless. There's so little code it can vectorize, and when it does, it's just as likely to slow it down as speed it up. The auto-vectorizers in GCC 4.1 and the soon-to-be-released 4.2 are much better, but who knows when Apple'll upgrade.

Despite the bleak description of the -ffast-math option in the GCC man page, I tried it. It gives another sizable gain in performance on my G5 (another 6-10%). It doesn't even seem to affect the simulations at all, either. The gain is minimal, if not non-existent, on my MacBook though.

I'm a bit confused though: according to the man page, it assumes that I'm not using a lot of things that I am (Infs, etc.), and warns that the results aren't exact. Yet it still works identically (near as I can tell) on both my PPC and Intel machines. Strange, then, that the PPC and Intel versions run slightly differently, but using -ffast-math doesn't seem to have any effect on that.

Should I be worried that this is going to cause problems in the future? It sounds like those optimizations should break the math I'm doing outright.
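For example, one concrete breakage I gather is possible: -ffast-math lets the compiler assume NaNs and Infs never occur, so the standard self-comparison NaN test can get folded away (sketch; this behaves correctly when compiled without -ffast-math):

```c
#include <stdbool.h>

/* IEEE 754: NaN is the only value that compares unequal to itself.
   Under -ffast-math (which implies -ffinite-math-only), GCC may assume
   no NaNs exist and fold this test to `false`, silently breaking it. */
static inline bool is_nan_f(float x){ return x != x; }
```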