Saturday, November 20, 2010

Analysis: PhysX On Systems With AMD Graphics Cards

Rarely does an issue divide the gaming community like PhysX has. We go deep into explaining CPU- and GPU-based PhysX processing, run PhysX with a Radeon card from AMD, and put some of today's most misleading headlines about PhysX under our microscope.
The history and development of game physics is often compared to that of the motion picture. The comparison might be a bit exaggerated and arrogant, but there’s some truth to it. As 3D graphics have evolved to almost photo-realistic levels, the lack of truly realistic and dynamic environments is becoming increasingly noticeable. The better the games look, the more jarring they seem from their lack of realistic animations and movements.
When comparing early VGA games with today's popular titles, it’s amazing how far we’ve come in 20 to 25 years. Instead of animated pixel sprites, we now measure graphics quality by looking at breathtaking natural occurrences like water, reflections, fog, smoke, and their movement and animation. Since all of these things are based on highly complex calculations, most game developers use so-called physics engines with prefabricated libraries containing, for example, character animations (ragdoll effects) or complex movements (vehicles, falling objects, water, and so on).
Of course, PhysX is not the only physics engine. Up until now, Havok has been used in many more games. But while both the 2008 edition Havok engine and the PhysX engine offer support for CPU-based physics calculations, PhysX is the only established platform in the game sector with support for faster GPU-based calculations as well.
This is where our current dilemma begins. There is only one official way to take advantage of PhysX (with Nvidia-based graphics cards) but two GPU manufacturers. This creates a potential for conflict, or at least enough for a bunch of press releases and headlines. Like the rest of the gaming community, we’re hoping that things pan out into open standards and sensible solutions. But as long as the gaming industry is stuck with the current situation, we simply have to make the most of what’s supported universally by publishers: CPU-based physics.

Preface
Why did we write this article? You might see conflicting news and articles on this topic, but we want to shine some light on the details of recent developments, especially for those without any knowledge of programming. Therefore, we will have to simplify and skip a few things. On the following pages, we’ll examine whether and to what extent Nvidia is limiting PhysX CPU performance in favor of its own GPU-powered solutions, whether CPU-based PhysX is multi-thread-capable (which would make it competitive), and finally whether all physics calculations really can be implemented on GPU-based PhysX as easily and with as many benefits as Nvidia claims.
Additionally, we will describe how to enable a clever tweak that lets users with AMD graphics cards use Nvidia-based secondary boards as dedicated PhysX cards. We are interested in the best combination of different cards and what slots to use for each of them.

We used a new test system for this article, since it supports up to quad-GPU graphics, an overclockable CPU, huge amounts of memory, and a powerful PSU.

This configuration fares well by modern gaming standards and should stay suitable for heavy 3D gaming into the near future.

Relevance of the CPU PhysX solution
Let’s first examine the fact that Nvidia currently only allows GPU-accelerated PhysX on its own graphics cards, thus forcing everyone else to calculate the PhysX instructions implemented in games using the CPU. The result for non-Nvidia gamers is usually an unplayable game when you turn PhysX on without a GeForce card installed. Obviously, the goal of this article is not to judge business decisions, but rather to understand the lack of performance experienced on systems not equipped with Nvidia graphics cards. Why is CPU PhysX so much slower than GPU PhysX in modern games?

Assuming that a calculation can be parallelized, a GPU with its multiple shader units is faster than a conventional CPU with two, three, four, or even six cores. According to Nvidia, physics calculations are two to four times faster on GPUs than CPUs. That’s just half of the truth, though, because there are no physics features that couldn’t be implemented solely on the CPU. Quite often, games use a combined CPU + GPU approach, with the highly parallelizable calculations, such as particle effects, performed by the GPU and the more static, non-parallelizable calculations, such as ragdolls, performed by the CPU. This is the case in Sacred 2, for example. In theory, the ratio of highly parallelizable calculations should in many cases be too low to really take noticeable advantage of the immense GPU speed.
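The distinction can be sketched in a few lines of plain Python (hypothetical data and function names, not actual engine code): particle updates are independent and map naturally onto many shader units, while a joint chain carries a serial dependency from one link to the next.

```python
def update_particles(positions, velocities, dt):
    # Every particle is independent of the others, so each iteration
    # could run on its own shader unit; this is the work a GPU eats up.
    return [p + v * dt for p, v in zip(positions, velocities)]

def solve_joint_chain(angles, stiffness):
    # Each joint correction depends on the previous joint's result: an
    # inherently serial chain, which is why ragdolls tend to stay on the CPU.
    corrected, carry = [], 0.0
    for a in angles:
        carry = (a + carry) * stiffness
        corrected.append(carry)
    return corrected
```

The first function could be split across thousands of threads without changing its result; the second cannot, because every step consumes the output of the step before it.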
But then why is the difference often so drastic in practice?
There are at least two reasons for this. The first one is that, in almost all of the games tested, CPU-based PhysX uses just a single thread, regardless of how many cores are available. The second one is that Nvidia seems to be intentionally not optimizing the CPU calculations in order to make the GPU solution look better. We’ll have to investigate multithreading at a later time with a suitable battery of benchmarks. Right now, we want to explore Nvidia deliberately leaving its code in a state where CPUs just can’t compete with GPUs.

CPU PhysX and Old Commands
In an interesting article at Real World Technologies, David Kanter used Intel’s VTune to analyze CPU-based PhysX. Looking at the results, he found loads of x87 instructions and x87 micro-operations.

Explanation: x87 is a small part of the x86 architecture’s instruction set used for floating point calculations. It is a so-called instruction set extension, a hardware implementation providing essential elements for solving common numerical tasks faster (sine and cosine calculations, for example). Since the introduction of the SSE2 instruction set, the x87 extension has lost much of its former importance. However, for calculations requiring a mantissa of 64 bits, only possible with the 80-bit wide x87 registers, x87 remains important.
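The mantissa point is easy to demonstrate with a minimal sketch (plain Python, not PhysX code). Python floats are ordinary 64-bit doubles with a 53-bit mantissa, so the sum below silently loses its low bit; the 80-bit x87 format, with its 64-bit mantissa, would represent it exactly.

```python
# 2^53 is the last point at which a 64-bit double can still represent
# every integer exactly; one step further and the low bit is rounded away.
big = 2.0 ** 53
lost = (big + 1) - big  # the +1 falls below the 53-bit mantissa and vanishes
```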

David speculated that optimizing PhysX code using the more modern and faster SSE2 instruction set extension instead of x87 might make it run more efficiently. His assessment hinted at 1.3 to 2 times better performance. He also carefully noted that Nvidia would have nothing to gain from such optimizations, considering the company’s focus on people using its GPUs.
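Some back-of-the-envelope arithmetic puts that estimate in context (idealized figures, not measurements): a 128-bit SSE2 register packs four 32-bit floats per instruction, while x87 processes one value at a time.

```python
X87_FLOATS_PER_OP = 1   # x87 is scalar: one operand per instruction
SSE2_FLOATS_PER_OP = 4  # one 128-bit SSE2 register holds four 32-bit floats

# The idealized packed single-precision ceiling; Kanter's 1.3x-2x estimate
# sits well below it because memory traffic, branches, and code that cannot
# be vectorized dominate real physics workloads.
ideal_speedup = SSE2_FLOATS_PER_OP / X87_FLOATS_PER_OP
```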
We reconstructed these findings using Mafia II instead of Cryostasis, switching back to our old Intel-based test rig, since VTune unfortunately would not work with our AMD CPU.

Assessment
Our own measurements fully confirm Kanter's results. However, the predicted performance increase from merely changing the compiler options is smaller than the headlines from SemiAccurate might indicate. Testing with the Bullet Benchmark only showed a difference of 10% to 20% between the x87- and SSE2-compiled files. This might seem like a big increase on paper, but in practice it’s rather marginal, especially if PhysX only runs on one CPU core. If the game wasn’t playable before, this little performance boost isn’t going to change much.
Nvidia wants to give a certain impression by enabling the SSE2 setting by default in its SDK 3.0. But ultimately it’s still up to developers to decide how and to what extent SSE2 will be used. The story above shows that there’s still potential for performance improvements, but also that some news headlines are a bit sensationalistic. Still, even after putting things in perspective, it’s obvious that Nvidia is making a business decision here, rather than doing what would be best for performance overall.

Does CPU PhysX Really Not Support Multiple Cores?
Our next problem is that, in almost all previous benchmarks, only one CPU core has really been used for PhysX in the absence of GPU hardware acceleration, or so some say. Again, this seems like somewhat of a contradiction given our measurements of fairly good CPU-based PhysX scaling in Metro 2033 benchmarks.

Graphics card: GeForce GTX 480 1.5 GB
Dedicated PhysX card: GeForce GTX 285 1 GB
Graphics drivers: GeForce 258.96
PhysX: 9.10.0513

First, we measure CPU core utilization. We switch to DirectX 11 mode with its multi-threading support to get a real picture of performance. The top section of the graph below shows that CPU cores are rather evenly utilized when extended physics is deactivated.
In order to widen the bottleneck effect of the graphics card, we start out with a resolution of just 1280x1024. The less the graphics card acts as a limiting factor, the better the game scales with more cores. This would change with the DirectX 9 mode, as it limits the scaling to two CPU cores.
We notice a small increase in CPU utilization when activating GPU-based PhysX because the graphics card needs to be supplied with data for calculations. However, the increase is much larger with CPU-based PhysX activated, indicating a fairly successful parallelization implementation by the developers.
Looking at Metro 2033, we also see that a reasonable use of PhysX effects is playable, even if no PhysX acceleration is available. This is because Metro 2033 is mostly limited by the main graphics card and its 3D performance, rather than added PhysX effects. There is one exception, though: the simultaneous explosions of several bombs. In this case, the CPU suffers from serious frame rate drops, although the game is still playable. Most people won’t want to play at such low resolutions, so we switched to the other extreme.
Performing these benchmarks with a powerful main graphics card and a dedicated PhysX card was a deliberate choice, given that a single Nvidia card normally suffers from some performance penalties with GPU-based PhysX enabled. Things would get quite bad in this already-GPU-constrained game. In this case, the difference between CPU-based PhysX on a fast six-core processor with well-implemented multi-threading and a single GPU is almost zero.

Assessment
Contrary to some headlines, the Nvidia PhysX SDK actually offers multi-core support for CPUs. When used correctly, it even comes dangerously close to the performance of a single-card, GPU-based solution. Despite this, however, there's still a catch. PhysX automatically handles thread distribution, moving the load away from the CPU and onto the GPU when a compatible graphics card is active. Game developers need to shift some of the load back to the CPU.
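As a schematic illustration of what shifting that load back entails (plain Python with hypothetical names, not the PhysX SDK API; real engines use native threads, and Python's GIL makes this purely illustrative), a developer would partition independent physics objects across a worker pool instead of stepping them all on one core:

```python
from concurrent.futures import ThreadPoolExecutor

def step_object(state):
    # Stand-in for one object's physics update over a fixed timestep.
    pos, vel = state
    return (pos + vel * 0.5, vel)

def step_world(objects, workers=4):
    # Distribute independent objects across a pool of CPU worker threads,
    # the kind of explicit scheduling the GPU path does automatically.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(step_object, objects))
```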
Why does this so rarely happen?
The effort and expenditure required to implement coding changes obviously works as a deterrent. We still think that developers should be honest and openly admit this, though. Studying certain games (with a certain logo in the credits) raises the question of whether this additional expense was spared for commercial or marketing reasons. On one hand, Nvidia has a duty to developers, helping them integrate compelling effects that gamers will be able to enjoy that might not have made it into the game otherwise. On the other hand, Nvidia wants to prevent (and with good reason) prejudices from getting out of hand. According to Nvidia, SDK 3.0 already offers these capabilities, so we look forward to seeing developers implement them.

Preface to the PhysX Hybrid Solution
We devote this part of the article to those who use a Radeon as their main graphics card, but still want to enjoy hardware-accelerated PhysX. As of this writing, our methods here work just fine. Refer to the links below, however, because as Nvidia releases new driver updates, new versions of this hybrid solution tweak will have to be released as well.
This dodgy game with Nvidia unfortunately only has one losing side: the users. It makes commercial sense for Nvidia to exclude its competitors through driver limitations, but the company’s economic welfare might not be the biggest concern for AMD users who desire the admittedly impressive benefit of PhysX.

System Requirements

A primary graphics card has to be used as the image output device. With version 1.04ff of the tweak, the dedicated PhysX graphics card no longer needs to be connected to a monitor. Among other benefits, this frees more resources for physics calculations. The graphics card does not have to be SLI-capable, but check your PSU to confirm that it can output sufficient power.

The Software
You can get the latest tweak from nqohq.com: Download and information. The necessary drivers and PhysX downloads are offered by the respective manufacturers’ Web sites. We haven't offered a direct link to the tweak for two reasons: the continuous driver updates will make the link obsolete, and we respect the work of the developers enough to link to the original source.

The Installation

The procedure:

Shut down the computer and unplug the power

Plug in the Nvidia card that will be used for PhysX

Start up your computer

Install the appropriate drivers from Nvidia and AMD

Check and install the appropriate version of PhysX (see the list)

Download the appropriate version of the tweak (Windows 7, 32- or 64-bit)

Extract the files from the RAR archive

Run the tweak as Administrator (right-click the file and choose “Run as Administrator”)

Reboot your computer

Switching modes via the CMD file:

The relevant files are in the subdirectory

Run the desired function as Administrator

That’s it. You can now test PhysX with GPU-Z or Fluidmark. Enjoy running PhysX and AMD Radeon cards at the same time! Even if the overhead and additional power requirements are somewhat disturbing, it's quite worth it.

Important Notice
Every time you install a new version of the graphics driver or PhysX, you will have to apply the tweak again. Only the currently-tested version is compatible. We take no responsibility for any overload of components or the future operation of the tweak. This is a guide, not a recommendation.

Test Sequence and Combinations
We start by combining our test subjects and benchmarking them in the following configurations:

AMD main graphics card + GPU PhysX (Nvidia card)

Nvidia main graphics card + GPU PhysX (Nvidia card)

A single graphics card running GPU-based PhysX

CPU-based PhysX

Instead of using the games Metro 2033 and Cryostasis for benchmarks, we opted for the recently-published Mafia II. Its ratio of graphics to physics is quite balanced, and it allows us to make a direct reference to a current game so our recommendations are more relevant.

OS: Windows 7 Ultimate x64
Game: Mafia 2 via Steam
Version: Updated 08.09.2010

Below is the chart we created using the different combinations of graphics cards and manufacturers:
As expected, using a dedicated graphics card for PhysX makes a difference. Pairing it with a high-end model from each camp results in a rather even playfield. The GeForce GTX 480 can neither pull ahead much from the Radeon HD 5870, nor really make the GeForce GTX 460 and Radeon HD 5850 eat its dust. All of the GPU + GPU combinations are significantly faster than using just a single Nvidia card for both graphics and PhysX.
The single cards are already dangerously close to the lower limits of playability. The chart shows the average frame rates, but obviously the difference will be seen most clearly in minimum frame rate numbers. Most of the time you will be walking around, and the frame rates will be the same regardless of whether you are using a dedicated PhysX card or not. But as soon as something happens that requires physics calculations, that's where the difference lies. Since this happens only briefly and occasionally, we chose to show you the overall picture instead.
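The difference between the two metrics is easy to illustrate with hypothetical frame times (milliseconds per frame; the numbers below are invented for the example, not from our benchmark runs):

```python
def fps_stats(frame_times_ms):
    # Convert per-frame times to instantaneous fps, then report both
    # the average and the worst-case (minimum) frame rate.
    fps = [1000.0 / t for t in frame_times_ms]
    return sum(fps) / len(fps), min(fps)

# A mostly smooth run with one physics-heavy spike: the average still
# looks comfortable, while the minimum tells the real story.
avg_fps, min_fps = fps_stats([16.0, 16.0, 16.0, 50.0])
```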

How much do you need?
Generally, faster is better. Of course, it would be nonsense to use a GeForce GTX 480 as a dedicated PhysX card. Even using a rather expensive GeForce GTX 285 could hardly be called economically sensible. But let's take a look at our Mafia II benchmarks.
Again, we chose this game because of its very good compromise between physics and traditional graphical effects. Cryostasis uses a disproportionate amount of PhysX. Conversely, Metro 2033 is too heavy on graphics to make a good gauge of PhysX-based performance.
Looking at the graph, you can see very clearly that a card slower than a GeForce GT 240 or 9600 GT makes little sense, even if it should be able to support PhysX in theory. Using a GeForce 8400 GS is actually 15% slower than using a single GeForce GTX 480, which is extremely counterproductive. We therefore left those results out of the chart.

What PCIe slot is good enough?
A popular question centers on how fast the PCIe slot for the PhysX card needs to be. We used a motherboard with PCIe slots of different speeds, measuring speed simply by moving the PhysX card around between them.
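For context, some rough per-direction bandwidth arithmetic for PCIe 2.0, the standard of our test board's era (assumed spec figures, not measurements): 5 GT/s per lane with 8b/10b encoding leaves about 500 MB/s of usable bandwidth per lane.

```python
MB_PER_LANE = 500  # usable MB/s per PCIe 2.0 lane, per direction

def slot_bandwidth_mb(lanes):
    # x4 -> 2000 MB/s, x8 -> 4000 MB/s, x16 -> 8000 MB/s
    return lanes * MB_PER_LANE
```

Even the x4 figure leaves considerable headroom for the modest data traffic a dedicated PhysX card generates, which fits the small differences we measured.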
Clearly, a faster card is slightly bottlenecked by an x4 slot compared to the other two. The difference between x8 and x16 is so marginal that it can be disregarded. A GeForce GT 220 is too slow to notice any difference, as would be a GeForce GT 240 and a 9600 GT. Even the GeForce GTX 285 doesn't suffer that badly. An x4 slot seems to be OK, though an x8 slot is the safer bet for faster cards.

Assessment
In the end, it comes down to cost. Spending $80 on a used GeForce GTS 250 will bring your computer with a Radeon HD 5870 to the same level of PhysX performance as a single GeForce GTX 480 card. However, the combined cost of these two cards is higher than the single GTX 480. Real added value is obtained only by using an additional GeForce GTX 260 or better. This is where costs get out of hand and scare everyone but true enthusiasts away. We would only recommend adding an additional card if you already have a spare lying around due to a recent upgrade, for example. Then the effort might be worthwhile, even if the extra idle power consumption might gnaw at your conscience.

CPU-Based PhysX summary
Summarizing the headlines of the last few months alongside our own test results, we can conclude the following:

The CPU-based PhysX mode mostly uses only the older x87 instruction set instead of SSE2.

Testing other compilations in the Bullet benchmark shows only a maximum performance increase of 10% to 20% when using SSE2.

The optimization performance gains would thus only be marginal in a purely single-core application.

Contrary to many reports, CPU-based PhysX supports multi-threading.

There are scenarios in which PhysX is better on the CPU than the GPU.

A game like Metro 2033 shows that CPU-based PhysX could be quite competitive.

Then why is the performance picture so dreary right now?

With CPU-based PhysX, the game developers are largely responsible for fixing thread allocation and management, while GPU-based PhysX handles this automatically.

This is a time and money issue for the game developers.

The current situation is also architected to help promote GPU-based PhysX over CPU-based PhysX.

With SSE2 optimizations and good threading management for the CPU, modern quad-core processors would be highly competitive compared to GPU PhysX. Predictably, Nvidia’s interest in this is lackluster.

The AMD graphics card + Nvidia graphics card (as dedicated PhysX card) hybrid mode
Here, too, our verdict is a bit more moderate compared to the recent hype. We conclude the following:

Pros:
One can claim that using the additional card results in a huge performance gain if PhysX was previously running on the CPU instead of the GPU. In such cases, the performance of a Radeon HD 5870 with a dedicated PhysX card is far superior to a single GeForce GTX 480. Even if you combine the GTX 480 with the same dedicated PhysX card, the lead of the GTX 480 is very small. The GPU-based PhysX solution is possible for all AMD users if the dedicated Nvidia PhysX-capable board is powerful enough. Mafia II shows that there are times when even a single GeForce GTX 480 reaches its limits and that “real” PhysX with highly-playable frame rates is only possible with a dedicated PhysX card.

Cons:
On the other hand, we have the fact that Nvidia incorporates strategic barriers in its drivers to prevent these combinations and performance gains if non-Nvidia cards are installed as primary graphics solutions.
It's good that the community does not take this lying down, but instead continues to produce pragmatic countermeasures. But there are more pressing drawbacks. In addition to the high costs of buying an extra card, we have added power consumption. If you use an older card, this is disturbingly noticeable, even in idle mode or normal desktop operation. Everyone will have to decide just how much money an enthusiast project like this is worth. It works, and it's fun. But whether it makes sense for you is something only you can decide for yourself.