Tuesday, February 24, 2009

Thursday, February 19, 2009

Today I measured the performance of the current implementation. The result: the scene on the right contains about 8.0M RLE elements in the view frustum, of which 4.7M survive culling and 280k are visible, rendered as 450k pixels. At a frame rate of about 25 fps, this means the renderer processes about 117M RLE elements per second.

My graphics card's maximum untextured triangle rate is 280M/s when the triangles share vertices, and about 133M/s when the triangles have independent vertices. The maximum vertex transform rate is about 400M/s.

This means that if the landscape were visualized using splats, each rendered as a single triangle, at least 8M triangles would be required. Without any culling, this would lead to a frame rate of about 133/8 = 16 fps. The geometry shader might be used here to accelerate the rendering: it would be possible to send only one vertex per splat, from which the geometry shader generates a quad or triangle.

In case we visualized each voxel inside the landscape using conventional polygons, we would have to use at least 2 triangles per voxel to create a quad. This means, taking shared vertices into account, we would have to render at least 16M triangles, resulting in a theoretical frame rate of 280/16 = 17.5 fps.

If some of you are thinking of writing a CUDA program, here are a couple of things to keep in mind:

1.) Reduce the number of used registers to run more parallel threads
2.) Reduce the number of memory accesses
3.) Store runtime variables in registers
4.) Do not use local arrays in your code like int a[3]={1,2,3} - better use variables such as a0=1; a1=... etc. if possible
5.) Write small kernels. If you have one large kernel, try to split it up into multiple small ones - it might be faster due to fewer used registers
6.) Use textures to store your data where possible. Texture reads are cached - global memory reads aren't
7.) Conditional jumps should branch the same way for all threads
8.) Avoid loops which are run only by a minority of threads while the others are idle
9.) Use fast math routines where possible
10.) A complex calculation is often faster than a large lookup table
11.) Writing your own cache manager that uses shared memory for caching might not be an advantage
12.) Try to avoid multiple threads accessing the same memory element (accesses get serialized - also for shared memory)
13.) Try to coalesce global memory accesses
14.) Try to avoid bank conflicts when reading memory
15.) Small lookup tables can be stored in shared memory
16.) Experiment with the number of parallel threads to find the optimum. In case you run out of registers, use --maxrregcount=...

17.) If you can implement your method using GLSL, it might be faster than CUDA. In GLSL you get a lot of calculations for free, like alpha blending, fog, z-buffer testing, and interpolation of variables between pixels, and perhaps better thread handling too. Also, you do not have to copy the rendered image around as a PBO, and you'll save development time since there is no bluescreen from a bad pointer.