At the 2016 Game Developers Conference (GDC), NVIDIA announced the GameWorks 3.1 development kit, which introduces several new physics simulation solutions – PhysX GRB and NVIDIA Flow. Let’s take a closer look at them:

PhysX GRB

PhysX GRB is the new GPU-accelerated rigid body simulation pipeline. It is based on a heavily modified branch of PhysX SDK 3.4, but retains all the features of the standard SDK and an almost identical API. PhysX GRB currently utilizes CUDA and requires an NVIDIA card for GPU acceleration.

Unlike previous implementations, PhysX GRB features a hybrid CPU/GPU rigid body solver: the simulation can be executed either on the CPU or the GPU with almost no difference in behavior, supported object types, or features (GPU articulations are not implemented yet, however).

GRB provides a GPU-accelerated broad phase, contact generation, constraint solver, and body/shape management. In addition, it introduces new implementations of island management and pair management that have been optimized to handle the order-of-magnitude more complex scenes that can be simulated on the GPU compared to the CPU. New mechanisms to parallelize event notification callbacks and a new feature to lazily update scene queries asynchronously are also provided.

GPU acceleration can be enabled on a per-scene basis by setting specific flags on the scene, and in theory any application that uses the PhysX 3.4 SDK or later can choose to run some or all of its rigid body simulation on an NVIDIA GPU with no additional programming effort.
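Based on the public PhysX 3.4 API, the per-scene setup amounts to a few fields on the scene descriptor. The fragment below is a sketch following the names in that SDK; `cudaContextManager` is assumed to have been created beforehand via `PxCreateCudaContextManager()`:

```cpp
// Sketch of a PxSceneDesc configured for GPU rigid bodies (PhysX 3.4 names).
PxSceneDesc sceneDesc(physics->getTolerancesScale());
sceneDesc.gravity       = PxVec3(0.0f, -9.81f, 0.0f);
sceneDesc.cpuDispatcher = PxDefaultCpuDispatcherCreate(4);

// GPU-specific setup: a CUDA context manager plus two scene settings.
sceneDesc.cudaContextManager = cudaContextManager;      // assumed created earlier
sceneDesc.flags |= PxSceneFlag::eENABLE_GPU_DYNAMICS;   // run rigid body dynamics on the GPU
sceneDesc.broadPhaseType = PxBroadPhaseType::eGPU;      // GPU-accelerated broad phase

PxScene* scene = physics->createScene(sceneDesc);
```

Clearing the flag (or omitting the CUDA context manager) falls back to the CPU path, which is what makes the per-scene opt-in essentially free for existing code.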

GPU rigid body simulation takes advantage of the massive parallelism afforded by modern GPUs and can provide speed-ups in the region of 4x-6x and above compared to CPU simulation in scenes with a large number of objects.

However, when simulating smaller scenes (commonly less than 500-1000 bodies depending on the hardware involved), simulating on the GPU tends to be slightly slower than simulating on the CPU, because there is a fixed cost associated with simulating a scene on the GPU. This cost is related to the overhead of DMAing input data to the GPU, dispatching the kernels required for GPU simulation, DMAing back the results and synchronizing with the GPU.

The following graph shows a performance comparison between an i7-5930K CPU (6 threads are used) and a GTX 980 GPU when simulating a grid of stacks of 4 convex hulls. This scene does not put a massive amount of strain on either the broad phase or the narrow phase, so most of the load in this demo is borne by the constraint solver.

Provided by NVIDIA

As can be seen, the results at 16384 stacks are completely skewed by the far slower CPU performance: GPU simulation shows up to a 10-15x improvement.

Provided below is a graph demonstrating results for 1-4096 stacks to enable further performance analysis.

Provided by NVIDIA

As can be seen, the cross-over point (where GPU simulation becomes faster than CPU simulation) lies somewhere in the 2ms range, between 256 convex stacks and 1024 convex stacks.

The second test scene simulates dropping a pile of random convex shapes and puts a lot of pressure not only on the solver, but also on the broad phase and contact generation, because it involves a far larger number of contact pairs that must be processed.

Provided by NVIDIA

From these results, we can see that GPU simulation outperforms the CPU in the larger scenes.

Omitting the results from 27648 convexes in a pile, we see that the CPU and GPU take roughly the same amount of time when simulating 1728 convexes, and that the GPU is significantly faster than the CPU when simulating either 6912 or 13824 convexes.

Provided by NVIDIA

The cross-over point for performance lies somewhere in the 2-3ms range, after which the GPU’s performance advantage is substantial.

The PhysX GRB SDK and demo should be released to the public in the coming weeks.

NVIDIA Flow

NVIDIA Flow is a new computational fluid dynamics solution that simulates combustible fluids such as fire and smoke.

Flow features a dynamic grid-based simulation and volume rendering. It also includes a hardware-agnostic DX11/DX12 implementation.

Really impressive, thanks for the article. We should see more impressive scenes with higher rigid body counts given that performance increase over the CPU.

“GPU acceleration can be enabled on a per-scene basis by setting specific flags on the scene and theoretically, any application that uses the PhysX 3.4 SDK or later can choose to run some or all of its rigid body simulation on an NVIDIA GPU with no additional programming effort.”

Hopefully we’ll see an exponential amount of games with a GPU acceleration option too!

Have they mentioned if they plan on releasing those particular demos on the dev page with the SDK?

Spets: Hopefully we’ll see an exponential amount of games with a GPU acceleration option too!

There is a catch here – as you may have noticed, there is a cross-over point between CPU and GPU performance, and it lies at 1000+ bodies – a lot more than is usually used in current games.
So you can’t simply enable GPU acceleration anywhere and get a boost, especially considering that the CPU GRB path is also faster than standard PhysX 3.3/3.4, since it has received optimizations as well.
But GRB will be very useful for VFX, certain types of games (physics playgrounds like Besiege), cloud computing, etc.

Spets: Have they mentioned if they plan on releasing those particular demos on the dev page with the SDK?

In the coming weeks, as I heard.
But it is still to be decided whether the PhysX GRB SDK will replace the standard PhysX SDK or not. It is similar but also different (so it may “break stuff” in middleware integrations), and is not very useful on platforms like consoles and mobiles, so most likely GRB and the standard SDK will coexist for quite some time.

The beauty of GRB being a hybrid simulation is that you can transition between CPU-only simulation and hybrid CPU/GPU simulation at run-time. We’re still ironing out how to best take advantage of this but it effectively means that, when there isn’t much going on, you could potentially run CPU-only simulation and, as the number of active bodies increases, the simulation can transition to the GPU, effectively removing the concern about smaller scenes being slower on the GPU. It should also be noted that the numbers involved are still very small (< 2ms) and that 2ms is the total time between calling simulate() and fetchResults(), rather than the time that the GPU is actually doing any work. The majority of this time in smaller scenes is host-side (CPU) overhead marshalling data and dispatching kernels rather than actual time that the GPU is busy doing anything. It is something that potential users should be aware of but, when simulating simpler scenes, the intention is that GRB should eventually never be slower (although the current state should still not cause significant problems because the performance in tiny scenes is still extremely quick).

Will there be a demonstration that actively shows the transitioning between CPU/GPU anytime soon (or currently)?

We plan to add support for articulation later but, at the moment, we are limited to just rigid bodies and joints.

The CPU-GPU hot-swap feature is still under development and the version present in the demo is a very early, inefficient prototype that was introduced only to support CPU/GPU switching in the demo. The final feature should be fast-enough to not cause noticeable hitches when a switch occurs.

I’m back after a few years! So let me get this straight: back in the UDK days with PhysX 2.8.4 (ancient by now), the GRBs were not CUDA and did not utilize a correct balance between the CPU and GPU? It appears to me this new SDK is strictly CUDA-bound, whereas the older APEX PhysX SDK used more internal CPU code but was still accelerated on a GPU. Maybe I got something wrong there, but that’s what it seems to be.

So, I was wrong about the post above. After doing extensive research on how GRBs are used, I found out that UDK used an algorithm called radix sort, and this radix sort was implemented in CUDA, an NVIDIA framework. From the radix sorting algorithm and the CUDA developer framework, it was implemented into UDK, and apparently a new sorting algorithm is being implemented into UE4?