I was looking at some GL 3.3 code and noticed that its author sent a single MVP matrix to the shader instead of sending a model-view matrix and a projection matrix separately. I found this kind of odd. The GPU is better suited to doing these kinds of calculations (matrix multiplication, in this case) quickly. The only explanation I can think of is that sending 32 floats instead of 16 costs more time than doing the matrix multiplication on the CPU and sending the result. Is this the case?

The question is not whether it is faster to calculate the final MVP matrix on the CPU or the GPU, but whether it is faster to calculate the matrix once per model on the CPU or once per vertex on the GPU. In theory, the shader compiler could precompute the product once just before the program is executed, but there is no guarantee that it does.
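To make the "once per model versus once per vertex" distinction concrete, here is a minimal CPU-only sketch in C++. It is not code from the project being discussed: the `Mat4`/`Vec4` aliases, the helper names, and the uniform-scale matrix standing in for a projection are all assumptions made for illustration. It shows that both approaches give the same transformed vertices, and the only difference is where the matrix products happen and how often.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Column-major 4x4 matrix: element (row, col) is stored at m[col * 4 + row].
using Mat4 = std::array<float, 16>;
using Vec4 = std::array<float, 4>;

// Matrix-matrix product a * b (64 multiplies).
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};  // zero-initialized accumulator
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            for (int k = 0; k < 4; ++k)
                r[col * 4 + row] += a[k * 4 + row] * b[col * 4 + k];
    return r;
}

// Matrix-vector product m * v (16 multiplies).
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (int row = 0; row < 4; ++row)
        for (int k = 0; k < 4; ++k)
            r[row] += m[k * 4 + row] * v[k];
    return r;
}

// Illustrative helpers (hypothetical, chosen so the numbers are easy to check).
Mat4 translation(float x, float y, float z) {
    return Mat4{{1.0f, 0.0f, 0.0f, 0.0f,  0.0f, 1.0f, 0.0f, 0.0f,
                 0.0f, 0.0f, 1.0f, 0.0f,  x, y, z, 1.0f}};
}

Mat4 uniformScale(float s) {
    return Mat4{{s, 0.0f, 0.0f, 0.0f,  0.0f, s, 0.0f, 0.0f,
                 0.0f, 0.0f, s, 0.0f,  0.0f, 0.0f, 0.0f, 1.0f}};
}

// CPU-combined path: build MVP once per model, then one multiply per vertex.
std::vector<Vec4> transformWithMvp(const Mat4& proj, const Mat4& view, const Mat4& model,
                                   const std::vector<Vec4>& vertices) {
    const Mat4 mvp = multiply(proj, multiply(view, model));  // done once
    std::vector<Vec4> out;
    out.reserve(vertices.size());
    for (const Vec4& v : vertices) out.push_back(transform(mvp, v));
    return out;
}

// Shader-like path: three matrix-vector multiplies repeated for every vertex.
std::vector<Vec4> transformSeparately(const Mat4& proj, const Mat4& view, const Mat4& model,
                                      const std::vector<Vec4>& vertices) {
    std::vector<Vec4> out;
    out.reserve(vertices.size());
    for (const Vec4& v : vertices)
        out.push_back(transform(proj, transform(view, transform(model, v))));
    return out;
}

// Usage example: both paths agree on every component.
void selfCheck() {
    const Mat4 model = translation(1.0f, 2.0f, 3.0f);
    const Mat4 view = translation(0.0f, 0.0f, -5.0f);
    const Mat4 proj = uniformScale(0.5f);  // stand-in, not a real projection matrix
    const std::vector<Vec4> verts = {Vec4{{1.0f, 1.0f, 1.0f, 1.0f}},
                                     Vec4{{0.0f, 2.0f, -1.0f, 1.0f}}};
    const std::vector<Vec4> a = transformWithMvp(proj, view, model, verts);
    const std::vector<Vec4> b = transformSeparately(proj, view, model, verts);
    for (std::size_t i = 0; i < verts.size(); ++i)
        for (std::size_t k = 0; k < 4; ++k)
            assert(std::fabs(a[i][k] - b[i][k]) < 1e-5f);
}
```

Calling `selfCheck()` from `main` exercises both paths. In a real renderer the loop in `transformWithMvp` would be the vertex shader, and `mvp` would be the single uniform uploaded per model.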

You also have to consider where the bottleneck is. Assume, for a moment, that it is actually faster to do the calculation in the shader on the GPU. If the GPU is already saturated, then offloading the multiplication to the CPU is a net win, even if the individual operation is slower there.

As always, benchmark to see where the problem is and how to rectify it.
Edited January 14, 2013 by Brother Bob

I was looking at some GL 3.3 code and noticed that its author sent a single MVP matrix to the shader instead of sending a model-view matrix and a projection matrix separately. I found this kind of odd.

It is actually quite standard.
First, there is overhead in sending data to shaders as uniforms. Not only should the amount of data sent be kept to a minimum, but updates should also be reserved for the uniforms that have actually changed.
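One way to update only the uniforms that have changed is a small dirty-flag wrapper on the CPU side. The sketch below is an assumption about how that might look, not any particular engine's API: `CachedMat4Uniform` is a hypothetical class, and the upload callback stands in for a real call such as `glUniformMatrix4fv(location, 1, GL_FALSE, value.data())` so that the example runs without a GL context.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <utility>

using Mat4 = std::array<float, 16>;

// Hypothetical wrapper around a single mat4 uniform.
class CachedMat4Uniform {
public:
    // The callback performs the actual upload (for example, glUniformMatrix4fv).
    explicit CachedMat4Uniform(std::function<void(const Mat4&)> upload)
        : upload_(std::move(upload)) {
        assert(upload_ != nullptr);
    }

    // Store a new value and mark the uniform dirty only if the value changed.
    void set(const Mat4& value) {
        if (value != current_) {
            current_ = value;
            dirty_ = true;
        }
    }

    // Upload if needed. Returns true when an upload actually happened.
    bool flush() {
        if (!dirty_) return false;
        upload_(current_);
        dirty_ = false;
        return true;
    }

private:
    std::function<void(const Mat4&)> upload_;
    Mat4 current_{};     // last value passed to set()
    bool dirty_ = true;  // start dirty so the first flush always uploads
};
```

A renderer would call `set()` whenever it recomputes a matrix and `flush()` just before the draw call. The check is conservative: setting a value and then setting the previous one back before a flush still triggers one upload.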

Ignoring the dirty flag for each matrix, since they will likely be dirty just as often either way, updating 3 uniforms instead of 1 is already likely to be slower than doing the matrix math on the CPU.

So the CPU side is already either winning or fairly close. Then, with 3 uniforms instead of 1, the GPU falls behind for every single vertex you have, because each vertex needs 3 matrix multiplies instead of 1.
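A rough multiply count shows how quickly the per-vertex cost adds up. This is only back-of-the-envelope arithmetic under stated assumptions: the shader is taken to evaluate three matrix-vector products per vertex (16 scalar multiplies each), the CPU path does two 4x4 matrix products per model (64 multiplies each) plus one matrix-vector product per vertex, and additions, upload cost, and hardware parallelism are ignored. The function names are made up for the example.

```cpp
#include <cassert>

// Scalar multiplies in one 4x4 * vec4 product and one 4x4 * 4x4 product.
constexpr long long kMulPerMatVec = 16;
constexpr long long kMulPerMatMat = 64;

// Three separate uniforms: three matrix-vector products for every vertex.
constexpr long long separateMultiplies(long long vertices) {
    return 3 * kMulPerMatVec * vertices;
}

// Combined MVP: two matrix products once per model, then one product per vertex.
constexpr long long combinedMultiplies(long long vertices) {
    return 2 * kMulPerMatMat + kMulPerMatVec * vertices;
}

static_assert(separateMultiplies(10000) == 480000, "3 * 16 * 10000");
static_assert(combinedMultiplies(10000) == 160128, "2 * 64 + 16 * 10000");
```

Under these assumptions the two approaches break even at 4 vertices per model, and for anything larger the combined matrix does roughly a third of the multiplies. Whether that translates into measurable frame time still depends on where the bottleneck is.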

It is true that you sometimes need to upload the world and view matrices separately anyway. That leaves the upload cost the same, but the GPU would fall behind by that much more if it also had to combine those matrices itself for each vertex.

In general, there are almost no situations (if any) in which it is a winning move to combine matrices on the GPU for each vertex rather than once on the CPU side. If a single matrix multiply and upload on the CPU side can replace 2 matrix uploads and leave you with a more efficient vertex shader, it is always the way to go.