[Test] Simple x87 vs SSE2 Performance Test With Matrix Multiplication

In this news item, we learned that the current version of the PhysX engine is compiled for the x87 instruction set and does not use modern sets like SSE2. I found a simple, ready-to-use matrix multiplication code sample that lets us see the speed difference between the x87 and SSE/SSE2 sets.

As the author of the code says, it’s a very artificial scenario, but it shows the difference between the math instruction sets well.

I limited the main loop counter to 1 million matrix multiplications, which is enough.

In the VS2005 project properties, I set Optimization to Minimize Size (/O1) and changed the Enable Enhanced Instruction Set option to test the different sets.
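To make the setup concrete, here is a minimal sketch of this kind of timing loop. It is not the code from the test pack: the `Mat4` type and the `multiply` and `time_multiplications` names are made up for illustration, and `std::chrono` (which needs a C++11 compiler, so not VS2005) stands in for the Windows timer. The same source is simply rebuilt with a different instruction set option, for example `/arch:SSE` or `/arch:SSE2` with 32-bit MSVC.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical 4x4 row-major matrix; the type used in the original sample may differ.
struct Mat4 {
    float m[4][4];
};

// Plain scalar multiplication. Whether this compiles to x87 or SSE/SSE2
// instructions depends only on the compiler settings.
void multiply(const Mat4& a, const Mat4& b, Mat4& out) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a.m[i][k] * b.m[k][j];
            }
            out.m[i][j] = sum;
        }
    }
}

// Runs `iterations` multiplications and returns the elapsed time in milliseconds.
// `checksum` accumulates one result element per iteration so that the optimizer
// cannot throw the whole loop away.
long long time_multiplications(int iterations, float& checksum) {
    assert(iterations > 0);
    Mat4 a = {};
    Mat4 b = {};
    Mat4 c = {};
    for (int i = 0; i < 4; ++i) {
        b.m[i][i] = 1.0f;  // b is the identity matrix
    }

    const auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < iterations; ++n) {
        a.m[0][0] = static_cast<float>(n & 7);  // vary the input on every pass
        multiply(a, b, c);
        checksum += c.m[0][0];
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Calling `time_multiplications(1000000, checksum)` mirrors the 1 million iterations used in this test. Printing the checksum afterwards is a cheap way to keep the compiler from optimizing the work out.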

The test pack is available for download.

There are three executables in the pack: x87_test.exe, sse_test.exe and sse2_test.exe. I added a batch file for each one so that the console pauses at the end.

Test 1 – Instruction Set: No Set or x87
– Elapsed time: 2373 ms

Test 2 – Instruction Set: SSE
– Elapsed time: 2368 ms

Test 3 – Instruction Set: SSE2
– Elapsed time: 1112 ms

No doubt, SSE2 is the way to get fast math. Recompiling the PhysX engine with the SSE2 instruction set would be very nice. But as I said in this news item, a simple recompilation might lead to some incorrect calculation results, so NVIDIA will have to test such a build first, which may take some time…

Here is the assembly output of the core of the matrix multiplication for the different instruction sets:

In my experience, auto-vectorization (the compiler generating SSE code automatically) doesn’t really work that well. This means that even if you tell the compiler to generate SSE(2/3/4) code, you won’t always get SSE instructions. SSE code needs to be written by hand (preferably with intrinsics) and often requires some modification to adapt the code to SIMD instructions.
Only in very simple scenarios can the compiler do this automatically.
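
To illustrate what hand-coding with intrinsics looks like, here is a minimal sketch of a 4x4 single-precision multiplication written with SSE intrinsics, next to a scalar reference. The `Mat4` type and the function names are made up for this example and are not taken from the test code. It assumes row-major storage and uses unaligned loads so that no alignment requirement is placed on the data.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ...
#include <cmath>

// Hypothetical 4x4 row-major matrix type, used only for this illustration.
struct Mat4 {
    float m[4][4];
};

// Scalar reference: one multiply-add at a time.
Mat4 mul_scalar(const Mat4& a, const Mat4& b) {
    Mat4 c;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a.m[i][k] * b.m[k][j];
            }
            c.m[i][j] = sum;
        }
    }
    return c;
}

// SSE version: each output row is a linear combination of the four rows of b,
// so one row of the result is computed with four packed multiplies and three
// packed adds.
Mat4 mul_sse(const Mat4& a, const Mat4& b) {
    Mat4 c;
    const __m128 b0 = _mm_loadu_ps(b.m[0]);
    const __m128 b1 = _mm_loadu_ps(b.m[1]);
    const __m128 b2 = _mm_loadu_ps(b.m[2]);
    const __m128 b3 = _mm_loadu_ps(b.m[3]);
    for (int i = 0; i < 4; ++i) {
        // Broadcast each element of row i of a across a whole register.
        __m128 row = _mm_mul_ps(_mm_set1_ps(a.m[i][0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a.m[i][3]), b3));
        _mm_storeu_ps(c.m[i], row);
    }
    return c;
}

// Element-wise comparison with a tolerance, for checking one version against the other.
bool nearly_equal(const Mat4& x, const Mat4& y, float eps) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            if (std::fabs(x.m[i][j] - y.m[i][j]) > eps) {
                return false;
            }
        }
    }
    return true;
}
```

Note how the loop structure had to change: instead of computing one dot product per output element, the SSE version works on whole rows. That is the kind of restructuring a compiler often will not do on its own.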

@Mars_999: good question! As far as I can tell, SSE and SSE2 are the only instruction sets available with VS2005. I just looked at VS2010, and only SSE/SSE2 are available there too. Actually, I think the Intel C++ Compiler is required to get SSE3/4…

@Mars_999: SSE3 & SSE4 are _extensions_ of the older SSE implementations, so they don’t replace SSE2 functions with faster ones; instead, they add new ones. So whether a newer SSE version helps depends on your code. And vector/matrix math is mostly covered by SSE & SSE2.

I changed the code a little (removed the Windows part) and tested it on a puny Atom.
Instead of using the timeGetTime function, I used the Linux time command; the result is the time that the program spends in user space. This also counts the initialization, but that is only about 0.00001% of the whole time. An acceptable error.
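
For anyone who wants the same user-space figure from inside the program, so that the initialization can be left out, here is a small sketch using the POSIX `getrusage` call. The `user_cpu_seconds` helper name is made up for this example, and it only works on POSIX systems such as Linux.

```cpp
#include <sys/resource.h>  // getrusage, struct rusage (POSIX)

// Returns the CPU time this process has spent in user space so far, in seconds,
// or -1.0 if the query fails. This is the same quantity that the `time` command
// reports as "user".
double user_cpu_seconds() {
    struct rusage usage = {};
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1.0;
    }
    return static_cast<double>(usage.ru_utime.tv_sec) +
           static_cast<double>(usage.ru_utime.tv_usec) / 1e6;
}
```

Taking one reading right before the main loop and one right after, then subtracting, gives the user time of the loop alone.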

A big improvement without optimization, but with optimization the time is basically the same with or without SSE.
Any explanation? Probably the Atom executes the SSE instructions quite slowly.
By the way, looking at the assembly, the SSE version is a lot shorter. 🙂

That doesn’t look too good… I think there is a problem in the code, because if SSE2 were really slower than x87, it would be completely unnecessary and wouldn’t even be implemented in AMD CPUs. But it is implemented and used in nearly everything. Intel also disabled SSE2 usage in their compiler when the compiled program runs on AMD, and obviously they wouldn’t have disabled it if it were really slower, because their goal was to make programs run slower on AMD CPUs, not faster.