(This article was originally written in late July, as a response to David Kanter’s article, to be published on another website. Due to circumstances it was never published, so I am posting it here now)

If we want to start at the beginning, we have to go back a few years, when a new company by the name of Ageia introduced the first physics accelerator, a Physics Processing Unit, or PPU, which they named PhysX. The accelerator was driven by a physics library originally known as NovodeX. Ageia acquired NovodeX to power their hardware, and renamed the software to PhysX as well.

This caused a stir in the world of computers. On the one hand it threatened high-end CPUs, which had been performing the heavy physics operations up to that point. On the other hand, it also threatened other physics libraries, since PhysX was an all-in-one solution for both conventional CPUs (and consoles) and the new PPU, whereas other libraries could not offer hardware acceleration through the PPU.

Obviously the use of physics accelerators, be they PPUs or GPUs, was not an attractive prospect for CPU vendors Intel and AMD. Since AMD had acquired ATi, the problem solved itself on AMD's side. Intel decided to approach it from the other direction, and acquired Havok.

For AMD it was no longer a real issue. If Intel focused on CPU-only physics, AMD's CPU division would automatically benefit as well. And if Havok continued with GPU-accelerated physics, AMD's GPU division would benefit instead. A classic win-win situation.

Not much has been heard of Havok's GPU acceleration since Intel's acquisition, unsurprisingly. This left nVidia high and dry. They had caught a glimpse of GPU-accelerated physics, and realized the potential it had. With the introduction of the GeForce 8800 series, physics was one of the features they promoted. So nVidia decided they could do the same as Intel, and acquired Ageia. nVidia quickly ported the PhysX library to their Cuda GPGPU framework, and killed off the PPU hardware.

This turned the situation around completely. Physics acceleration no longer required additional hardware. PhysX now ran on standard GeForce 8 series cards, which already had a very large installed base. nVidia was also a much more familiar name than Ageia, and made PhysX support part of its The Way It's Meant To Be Played program. nVidia's marketing team quickly made PhysX a household name.

AMD realized that they could no longer just bet on Intel and Havok pushing CPU physics. nVidia was determined to make GPU physics a success, and AMD needed some kind of answer. Initially that answer came in the form of OpenCL. The new standard for parallel computing would potentially be interesting for Intel to support, and would allow AMD to implement GPU acceleration. So AMD and Havok announced that they were working on OpenCL acceleration.

AMD also used the open character of OpenCL to point out that Cuda and PhysX were proprietary (implying that this was a bad thing), while AMD supported open standards. However, nVidia announced full OpenCL support in their GPU drivers before AMD did. And after some initial demonstrations, nothing OpenCL-related was ever heard from Havok again. This made AMD's arguments less and less convincing. nVidia already had a working solution with PhysX, and if Havok ever released OpenCL support, nVidia would support it out of the box anyway. It is not hard to see why Havok never released anything: Intel owns Havok, and Intel has nothing to gain from OpenCL, and a lot to lose.

With Havok slipping, AMD instead announced a partnership with the open source Bullet Physics library, again based on OpenCL. However, Bullet is nowhere near as popular as Havok, so even if Bullet gets OpenCL support, it will be very hard for AMD to get game developers to adopt its GPU-accelerated solution on a large scale.

So apart from promoting their own (non-existent) OpenCL-powered solutions, AMD also has to do damage control against PhysX. They have to make PhysX look bad in every way they can. An example came earlier this year, in an interview in which AMD's Richard Huddy accused nVidia of deliberately disabling multithreading in PhysX on the CPU.

nVidia responded by explaining that PhysX does support multithreading (there are various examples of multithreaded PhysX, such as 3DMark Vantage and FluidMark), but that the multithreading is not automated yet. The library is thread-safe, but developers have to manage the threads themselves. It has always been like this, even back in the Ageia and NovodeX days, so nVidia did not disable anything, let alone deliberately; Huddy simply made false accusations. nVidia is working on automated threading as part of the Apex framework, to be released with the upcoming PhysX 3.0 SDK.
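To illustrate the difference between "thread-safe" and "automatically threaded": in the manual model, the application partitions the work and spawns the threads itself, while the library only guarantees that concurrent calls are safe. The sketch below is a minimal, hypothetical illustration of that pattern (the `Body` struct and functions are made up for this example, not actual PhysX API):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a thread-safe, per-body physics update.
struct Body { float pos; float vel; };
void integrate(Body& b, float dt) { b.pos += b.vel * dt; }

// The application, not the library, decides how to partition the bodies
// across threads and when to join them -- this mirrors the manual
// threading model described above.
void step_world(std::vector<Body>& bodies, float dt, unsigned nthreads) {
    std::vector<std::thread> pool;
    std::size_t chunk = (bodies.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(bodies.size(), lo + chunk);
        if (lo >= hi) break;  // more threads than work
        pool.emplace_back([&bodies, lo, hi, dt] {
            for (std::size_t i = lo; i < hi; ++i) integrate(bodies[i], dt);
        });
    }
    for (auto& th : pool) th.join();
}
```

Automated threading, as promised for the Apex framework, would move exactly this partition-and-join boilerplate from the game developer into the library.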

Kanter then makes claims about the gains to be had from converting PhysX's CPU code from x87 to SSE. He claims that SSE could quadruple performance in theory, and that in practice a boost of more than 2x would be possible. He also claims that a modern optimizing compiler can easily vectorize the code for SSE automatically, so such gains could be had from just a recompile; nVidia is simply leaving all this performance on the table. What's more, if PhysX really were 2-4 times faster on the CPU, it would be a genuine threat to GPU-accelerated physics. In short, Kanter claims that PhysX is hobbled on the CPU, and that nVidia is doing this deliberately to make GPU physics look good.

He suggests that this can be verified by compiling an open source physics library such as Bullet to x87 and SSE. Kanter, however, does not bother to actually perform the recompilation and publish the results to support his case. This is where experienced developers would get suspicious. They know from experience that it is not that simple to get such gains from SSE, especially not from just recompiling; Kanter's estimates sound about as optimistic and theoretical as Intel's own SSE marketing. And downloading and compiling Bullet only takes a few minutes, so why did Kanter not do it himself? Or did he? Well, some developers figured it was easy enough to make sure. As it turns out, the situation with Bullet is actually the reverse of PhysX: it compiles to SSE by default, with a few SSE optimizations in the source code as well. The SSE switch and these source-level optimizations have to be disabled to get an x87 build. This means that Bullet is probably a more favourable case for SSE than PhysX would be.

The results speak for themselves. Although nVidia certainly leaves some performance on the table, it is nowhere near as dramatic as Kanter claimed. In synthetic tests, there is about 8% to be gained from recompiling, nowhere near Kanter's 2-4x figure. In fact, 8% faster PhysX processing would translate to even less than 8% higher framerates in games, since PhysX is not the only CPU-intensive task in a game. The net gain in framerate would probably be closer to 3-4%, depending on the game. In other words, recompiling PhysX with SSE would not make CPU physics threaten GPU physics. Not even close. The difference would most likely be lost in the margin of error.

Now, of course one could go beyond the compiler, and manually optimize the code with SSE. But even then, 2-4x gains overall will be hard to achieve. Kanter's claim that SSE uses 4-way packed arithmetic, and can therefore be up to 4 times as fast, is true. But this is a very theoretical figure, which only holds when you can actually use 4-way SIMD code without running into other bottlenecks. Certain parts of a physics library may well benefit considerably from SIMD, and speed up 3-4 times. But other factors, such as calling overhead, branching, memory bottlenecks and thread synchronization, will bring the overall gains down considerably. 4x overall would be extremely optimistic; 1.5-2x might be achievable with well-optimized code. That, however, requires a lot of code rewriting on nVidia's part, and goes way beyond just 'flicking a switch and recompiling' to pick up the 'performance left on the table', as Kanter suggests. And even assuming a completely hand-optimized PhysX ran twice as fast as it does today, it would still be no replacement for GPU-accelerated PhysX on the higher-end cards: at the end of the day, no matter how you slice it, the fastest GPUs have about an order of magnitude more GFLOPS of raw processing power than the fastest CPUs. Again, we are only talking about the physics workload; the overall framerate in a game will not double if only the physics run twice as fast while the rest of the code remains the same.
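For a concrete sense of where the theoretical 4x comes from, here is a hedged sketch of hand-vectorized SSE (assuming an x86 compiler with SSE intrinsics available; the function is illustrative, not PhysX code). Each packed instruction handles four floats at once, but the scalar remainder loop, the loads and stores, and everything around the arithmetic are exactly the kind of overhead that keeps real-world gains below 4x:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Scalar version: one multiply-add per iteration.
void saxpy_scalar(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// Hand-vectorized SSE version: four lanes per iteration. In theory 4x
// the arithmetic throughput, but memory bandwidth, the scalar tail and
// surrounding non-SIMD work erode the overall speedup considerably.
void saxpy_sse(float a, const float* x, float* y, std::size_t n) {
    __m128 va = _mm_set1_ps(a);         // broadcast a to all 4 lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
    for (; i < n; ++i) y[i] = a * x[i] + y[i];  // scalar tail
}
```

A tight, regular loop like this is the best case for SIMD; the branchy, pointer-chasing parts of a physics solver (broad-phase culling, constraint graphs) vectorize far less cleanly, which is why the library-wide gain lands well below the per-loop 4x.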

It doesn’t stop there, however. Kanter’s article, wrong as it may be, is linked on many news sites and forums all over the web, and many discussions ensue. Most people buy into Kanter’s article, and some sites make even bolder claims than Kanter himself, referring to his article as ‘absolute proof’ of nVidia’s evil actions. This is exactly what AMD needs.

Even on forums with more industry insiders and developers who should spot the mistakes in Kanter’s article, such as Beyond3D, oddly enough there is hardly any mention of them. It would appear that Kanter is being shielded by the members of Beyond3D, perhaps because he is a regular there (under the name DKanter). After a lengthy email exchange, Andrew Lauritzen finally decided to verify the Bullet compilation, and came to the same conclusion: recompiling only gains about 8%. He did not explicitly state that Kanter’s claims are way off, however, so Kanter essentially gets away with it. Many other sites look to Beyond3D as a reliable source of information, and in this case Lauritzen’s brief post is the only indication that Kanter’s article is unreliable, and it simply goes unnoticed.

It would seem that it’s much easier to spread this misinformation than it is to get the truth out. Kanter was contacted by email repeatedly, pointing out the mistakes in the article, but he outright refused to do anything about it. When presented with the results showing that just recompiling with an SSE switch does little for performance, he insisted:

“And actually it is just a flick of the switch to get SSE instead of x87. I’m sure your familiar with GCC:

-mfpmath=sse

-march=prescott

Both of those would radically improve the situation. fastmath would as well.

At any rate, I’d prefer to continue this discussion in our forums.”

Apparently he is not a programmer himself. This is something he has been told by someone, and he is sticking with it. Had he actually tried it, he would have known that it does not work the way he claims. A few more emails did not change his mind. What’s more, there are already posts on his own forum pointing out that compiling with gcc’s SSE flags will not yield his claimed gains, and arguing against various other points in the article. But Kanter has done nothing to amend it.
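The flags Kanter quotes only switch the compiler's scalar floating-point math from the x87 stack to scalar SSE instructions; they do not by themselves produce the packed 4-wide SIMD his 2-4x figure assumes. A minimal sketch of why (the function is hypothetical, and the described code generation assumes GCC targeting x86):

```cpp
// A scalar float multiply like this one...
float scale(float a, float b) { return a * b; }

// ...compiles to an x87 'fmul' by default on 32-bit x86, and to a
// *scalar* SSE 'mulss' under -mfpmath=sse -march=prescott. Either way,
// it is still one float per instruction. The packed 'mulps' (four
// floats per instruction) behind the theoretical 4x only appears when
// the compiler can auto-vectorize a loop -- and proving that safe for
// branchy, pointer-heavy physics code is far from automatic.
```

So flicking the switch mainly trades one scalar instruction set for another, which is consistent with the roughly 8% gain measured on Bullet.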

He intends to stick with his original article. But he knows about his errors; there is no doubt about that after the email exchanges and the forum posts. So he is deliberately spreading misinformation, for whatever reasons he may have. And he is getting away with it, too. So much for integrity.