Hello all, I am attempting to write a 2D isometric tile rendering algorithm. I am using the z-buffer and frustum culling, and I am currently planning to use instancing to render the tiles (attempting to draw 256x256 individual quads resulted in 65536 updates 60 times per second, a really bad idea).

I get the core idea. This question is about one point in said tutorial that I do not understand.

I am attempting to recreate the instancing code from the above tutorial. It is stated that:

"we only have to update a small buffer each frame (the center of the particles) and not a huge mesh. This is a x4 bandwidth gain!".

So that means 1 center position (4 GLfloats) for every particle instance. However, I thought OpenGL required 1 position per vertex? So if I am drawing a quad with 4 vertices (using an EBO), I must then also provide 4 center values, not 1.

This implies that for every particle created there will be exactly 4 GLfloats, namely x, y, z and the size. Is this possible? I thought of setting the stride to 0, but that simply tells OpenGL that the data is tightly packed and to work out the stride itself.

If it's possible to define a single quad with only the center position (3 GLfloats), that would be an absolute godsend. :D

Any help on this would be greatly appreciated.

Dark Photon

04-05-2017, 05:55 PM

So it meant 1 center position (4 GLfloats) for every particle instance.

Right.

However, I thought OpenGL required 1 position per vertex? So if I am drawing a quad with 4 vertices (using an EBO), I must then also provide 4 center values, not 1.

Ultimately 4 positions must be produced per quad, yes.

What geometry instancing gets you is being able to extract what's different between the instances (quad center positions) and then represent the part that's common between them all just once (e.g. 4 vertex positions centered around "some" center position). By telling the GPU how to combine these two pieces, you can have it generate 4 distinct vertices for each quad on the GPU.

Think of this instancing problem like having a square cookie cutter. You only have one cookie cutter (4 vertices) -- this is the instance definition. But you can apply it multiple times in different places to generate multiple square cookies. All you need is a position offset for the center of each cookie -- this is the per-instance varying data.

Geometry instancing is more flexible than this, but you get the idea.
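
The cookie-cutter idea can be emulated on the CPU to see what the GPU ends up producing. This is only a sketch, not the tutorial's code: `Vec3`, `Instance`, `kCorners` and `expandInstance` are names made up for illustration. In a real program the combination would happen in the vertex shader, with the per-instance attribute marked by glVertexAttribDivisor(attrib, 1).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical types, for illustration only.
struct Vec3 { float x, y, z; };

// Per-instance data: quad centre plus size (the "4 GLfloats" per particle).
struct Instance { float x, y, z, size; };

// The shared "cookie cutter": 4 corner offsets of a unit quad centred on
// the origin. Stored once, no matter how many quads are drawn.
const std::array<Vec3, 4> kCorners = {{
    {-0.5f, -0.5f, 0.0f},
    { 0.5f, -0.5f, 0.0f},
    { 0.5f,  0.5f, 0.0f},
    {-0.5f,  0.5f, 0.0f},
}};

// Emulates what a vertex shader would compute for the 4 vertices of one
// instance: shared corner offset, scaled by size, moved to the centre.
std::array<Vec3, 4> expandInstance(const Instance& inst) {
    assert(inst.size > 0.0f);
    std::array<Vec3, 4> out{};
    for (std::size_t i = 0; i < kCorners.size(); ++i) {
        out[i] = Vec3{inst.x + kCorners[i].x * inst.size,
                      inst.y + kCorners[i].y * inst.size,
                      inst.z + kCorners[i].z * inst.size};
    }
    return out;
}
```

Calling `expandInstance` once per instance yields 4 distinct vertices per quad, even though only one centre and one size were stored for it.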

GClements

04-06-2017, 05:58 AM

Instancing allows you to generate MxN vertices from M+N elements. Think of it as a 2D table; each attribute is either per-row (instanced) or per-column (non-instanced).

What instancing gets you is a four-fold reduction in the data; to draw 65536 quads, you'd have arrays holding instanced attributes with 65536 elements and arrays holding non-instanced attributes with 4 or 6 elements (depending upon whether you use quads or triangles, and in the latter case whether you use glDrawElements or glDrawArrays).
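
The M+N versus MxN arithmetic can be checked with a few lines. This is a rough sketch with made-up function names; it counts array elements only and ignores how many floats each element holds.

```cpp
#include <cassert>
#include <cstddef>

// Elements stored when every vertex of every instance carries its own copy
// of the data (the M x N table written out in full).
std::size_t elementsWithoutInstancing(std::size_t instances,
                                      std::size_t vertsPerInstance) {
    return instances * vertsPerInstance;
}

// Elements stored with instancing: one per instance (the rows) plus one
// shared set of vertices (the columns).
std::size_t elementsWithInstancing(std::size_t instances,
                                   std::size_t vertsPerInstance) {
    return instances + vertsPerInstance;
}

// Ratio between the two layouts.
double reductionFactor(std::size_t instances, std::size_t vertsPerInstance) {
    assert(instances > 0 && vertsPerInstance > 0);
    return static_cast<double>(
               elementsWithoutInstancing(instances, vertsPerInstance)) /
           static_cast<double>(
               elementsWithInstancing(instances, vertsPerInstance));
}
```

For 65536 quads with 4 vertices each, this gives 262144 elements without instancing and 65540 with it, which is where the roughly four-fold figure comes from.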

But why would you be updating all 256x256 tiles every frame? If you are, that suggests a more fundamental problem than something that will be solved by instancing. Also, note that instancing requires OpenGL 3.1 (and 3.2 for it to be useful), which seems a bit excessive for 2D.

OTOH, if you're going to be requiring at least OpenGL 3.2 for other reasons, look into glDrawElementsBaseVertex (or glDrawElementsInstancedBaseVertex if you need instancing). That makes it trivially easy to render a fixed-size rectangular region of a 2D grid, needing to change only the base vertex in order to "scroll" the region.
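
A CPU-side sketch of the base-vertex idea follows. It assumes a layout that is not stated in the thread: every tile owns 4 consecutive vertices, and tiles are stored row by row, so tile (x, y) starts at vertex 4 * (y * mapCols + x). The function names are hypothetical. The index list is built once for a window whose top-left tile is (0, 0); scrolling only changes the base vertex, which OpenGL adds to every index when glDrawElementsBaseVertex is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds triangle indices for a viewCols x viewRows window of tiles whose
// top-left tile is (0, 0). Assumes 4 vertices per tile, stored row-major.
std::vector<std::uint32_t> buildRegionIndices(int mapCols, int viewCols,
                                              int viewRows) {
    assert(viewCols > 0 && viewRows > 0 && viewCols <= mapCols);
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(viewCols) * viewRows * 6);
    for (int row = 0; row < viewRows; ++row) {
        for (int col = 0; col < viewCols; ++col) {
            const std::uint32_t b =
                4u * static_cast<std::uint32_t>(row * mapCols + col);
            // Two triangles per tile: (0, 1, 2) and (0, 2, 3).
            const std::uint32_t tile[6] = {b, b + 1, b + 2, b, b + 2, b + 3};
            indices.insert(indices.end(), tile, tile + 6);
        }
    }
    return indices;
}

// Base vertex that moves the window so its top-left tile is (tileX, tileY).
// This is the value that would be passed as the last argument of
// glDrawElementsBaseVertex.
int baseVertexFor(int mapCols, int tileX, int tileY) {
    assert(tileX >= 0 && tileY >= 0 && tileX < mapCols);
    return 4 * (tileY * mapCols + tileX);
}
```

With this layout the index buffer never changes while the camera moves; only the single integer returned by `baseVertexFor` does.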

johnsonpapa

04-07-2017, 03:37 PM

Dark Photon:
Thank you very much. I understand the concept of instancing now. I wasn't aware that OpenGL could perform such a task, so it is possible to reuse the vertices after all through glVertexAttribDivisor. (This is the best thing I've ever heard.)

I have managed to get it working now, and it is much more efficient than calling render 256x256 times per frame. It's simply beautiful. :)

Also, loving the cookie-cutter metaphor. It helped.

GClements:
Yes you are indeed correct. It does save a lot of overhead of storing redundant data that could be simply reproduced for all instances.

I am already using EBO and indices for drawing the quads. Thank you. :)

I have actually gained more than a four-fold data reduction because I have set glVertexAttribDivisor to 0 for the texture coordinates, vertices (all tile sizes are the same), colour and the normal direction (all facing the same direction in 2D). The only thing that changes is the instance's centre values. This has cut memory usage roughly a thousandfold.

I was updating 256x256 objects per frame because I had represented each tile as an actual quad object, with its own VBOs and VAO holding vertices, texture coordinates, etc. I have only recently converted from immediate mode drawing to core profile, so it was a horrible beginner's mistake to assume that the GPU can process function calls as fast as the CPU (data transfer bottleneck...).

Instancing is a miracle solution to this problem, though I cannot use frustum culling with instancing (only back face culling, rip). I think segmenting the map into chunks and loading only 9 at once would be a good solution to this (similar to how Minecraft does it).
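
One way the "9 chunks at once" idea could look is sketched below. The chunk size, the names and the clipping at the map border are all assumptions made for illustration; nothing here comes from an actual implementation in this thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical chunk coordinate, measured in chunks rather than tiles.
struct ChunkCoord { int cx, cy; };

// Returns the chunks in the 3x3 neighbourhood around the chunk containing
// the camera tile, clipped to the map, so at most 9 chunks come back.
std::vector<ChunkCoord> chunksAround(int cameraTileX, int cameraTileY,
                                     int chunkSize, int chunksX, int chunksY) {
    assert(chunkSize > 0 && chunksX > 0 && chunksY > 0);
    assert(cameraTileX >= 0 && cameraTileY >= 0);
    const int centreX = cameraTileX / chunkSize;
    const int centreY = cameraTileY / chunkSize;
    std::vector<ChunkCoord> result;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const int cx = centreX + dx;
            const int cy = centreY + dy;
            if (cx < 0 || cy < 0 || cx >= chunksX || cy >= chunksY) {
                continue;  // neighbour lies outside the map
            }
            result.push_back(ChunkCoord{cx, cy});
        }
    }
    return result;
}
```

Each returned chunk could then be drawn with its own instanced draw call, so only the tiles near the camera are submitted.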

Thank you for the clarifications guys. You have been very helpful. :)

Dark Photon

04-08-2017, 07:07 PM

...Thank you very much. I understand the concept of instancing now.

Good deal! Glad you got it working.

I have only recently converted from immediate mode drawing to core profile, so it was a horrible beginner's mistake to assume that the GPU can process function calls as fast as the CPU (data transfer bottleneck...).

It's worse than that. The GPU can't process those immediate mode function calls you're sending down directly. The GL driver has to queue up your immediate mode calls and dynamically translate them to a completely different representation (one closer to what you get with VBOs) before it can submit the work to the GPU. Result = lots of CPU time wasted by all this processing in the GL driver.

Instancing is a miracle solution to this problem, though I cannot use frustum culling with instancing (only back face culling, rip).

You actually can do per-instance frustum culling. However, if the aggregate cost of running the vertex shader for all the out-of-frustum instances (for the worst case) doesn't amount to much, then it's probably not worth it.

You can easily determine what this potential savings is by finding a worst-case scenario (max number of off-screen instances), and then benchmarking performance when rendering 1) all instances, vs. 2) just those in the view frustum. Take the frame times for each part and subtract them. For part #2 in this simple test, just do a pre-process of the instances on the CPU and dynamically generate your instance list to contain only those instances which would be in-frustum in this case.
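
The CPU pre-process for part #2 could look roughly like the sketch below. It assumes each instance carries a bounding sphere and that the view volume is given as planes with unit-length normals pointing inwards; the struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Plane in the form nx*x + ny*y + nz*z + d = 0. Assumption: the normal is
// unit length and points towards the inside of the view volume.
struct Plane { float nx, ny, nz, d; };

// Hypothetical per-instance data: bounding-sphere centre and radius.
struct TileInstance { float x, y, z, radius; };

// True if the bounding sphere is at least partly inside every plane.
bool sphereInFrustum(const TileInstance& s, const std::vector<Plane>& planes) {
    assert(s.radius >= 0.0f);
    for (const Plane& p : planes) {
        const float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius) {
            return false;  // sphere lies entirely behind this plane
        }
    }
    return true;
}

// Builds the per-frame instance list containing only in-frustum instances.
std::vector<TileInstance> cullInstances(const std::vector<TileInstance>& all,
                                        const std::vector<Plane>& planes) {
    std::vector<TileInstance> visible;
    visible.reserve(all.size());
    for (const TileInstance& inst : all) {
        if (sphereInFrustum(inst, planes)) {
            visible.push_back(inst);
        }
    }
    return visible;
}
```

Uploading the result of `cullInstances` as the instance buffer and timing the frame against the unculled version gives the potential saving described above.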

If you do see that there's a worthwhile savings to be had here, you can move that per-instance view-frustum culling to the GPU using transform feedback to possibly save some of that time. If you'd like more info on that, just say so.

johnsonpapa

04-11-2017, 03:48 AM

Dark Photon:
Yes, the communication bottleneck between the CPU and the GPU really is severe. I was initially deterred from using Core Profile at all because my aim was solely to create a 2D game as my very first project. Before this, my tiling algorithm in immediate mode was simply the following:

Calculate the bounding box (startX, startY, endX, endY) of the part of the scene that is visible to the camera
Convert the bounding coordinates to tile position indices in a 2D array (by simply dividing by the tile sprite size)
Render tiles in the 2D array with the indices obtained
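
The three steps above can be sketched as follows. This is an illustration rather than the code from the game: it assumes an axis-aligned grid of square tiles starting at the world origin, and the names `TileRange` and `visibleTiles` are invented here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Inclusive range of tile indices into the 2D tile array.
struct TileRange { int x0, y0, x1, y1; };

// Converts a world coordinate to a tile index, clamped to [0, count - 1].
static int toTileIndex(float world, float tileSize, int count) {
    const int idx = static_cast<int>(std::floor(world / tileSize));
    return std::max(0, std::min(idx, count - 1));
}

// Steps 1 and 2: take the camera's bounding box in world units and turn it
// into tile indices by dividing by the tile size.
TileRange visibleTiles(float startX, float startY, float endX, float endY,
                       float tileSize, int mapCols, int mapRows) {
    assert(tileSize > 0.0f && mapCols > 0 && mapRows > 0);
    assert(startX <= endX && startY <= endY);
    TileRange r;
    r.x0 = toTileIndex(startX, tileSize, mapCols);
    r.y0 = toTileIndex(startY, tileSize, mapRows);
    r.x1 = toTileIndex(endX, tileSize, mapCols);
    r.y1 = toTileIndex(endY, tileSize, mapRows);
    return r;
}
```

Step 3 is then a double loop over `y0..y1` and `x0..x1` that draws only those tiles.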

The above algorithm performs quite well in immediate mode, with a small overhead from recalculating the camera position once every frame. Although it gets none of the benefit of the GPU's processing units, I would go as far as to say it is comparable to the Core Profile implementation.

However, now that the complexity has greatly increased (3D, custom shaders, rendering using the z-buffer, etc.), the performance gain from storing the tile vertices on the GPU for static drawing is far greater, since most of the rendering time is spent on tiles (almost 90% of everything).

My current implementation uses the z-buffer in orthographic view to sort tiles from front to back. For each tile sprite, I use instancing to render all tiles of that type at once (there is only one set of texture coordinates per batch to store and to update for animation, which saves both space and time). This is extended to static objects, e.g. trees. Since the z-buffer is used, render order is no longer a concern and no re-ordering is needed.

Currently, on a 64x64 (4096) tile map, the CPU usage is about 2 to 4% in total (AMD Phenom X4 Quad Core 3.4GHz with GTX 750Ti, so 8 to 16% of a single core). Other factors include updating uniforms for global and diffuse lighting, and updating the camera coordinates every frame. Although the cost is not too high, on larger tile maps (e.g. 128x128x32, which has 128 times as many tiles) it would increase dramatically. It would be interesting to optimize it further because I am a bit particular about CPU usage.

I was unaware that we can perform frustum culling per instance. Maybe it can be done in the vertex (or is it geometry?) shader. I do not know how to do it.

Any performance increase is a wonder increase for me. Thank you for your help. :)

Dark Photon

04-12-2017, 06:37 AM

I was unaware that we can perform frustum culling per instance. Maybe it can be done in the vertex (or is it geometry?) shader. I do not know how to do it.

You can, either on the CPU or the GPU. But like I said, first step is to establish if/how much you could potentially save by doing a per-instance view-frustum culling pre-pass before you do your instanced draw pass. Then consider whether that gain is worth going after. If it is...

Since you've already got instancing working, I don't need to go into as much detail. But basically it involves changing your current instanced draw pass into two passes: a culling pass followed by a draw pass.

The way I've done this before is this: Pass 1 is an instanced draw call of POINT primitives where the instance list is your full list of instances, and the "culled-in" set of instances is captured with transform feedback. In the instance data for each point, you provide info on the bounding sphere of each instance, and you use a "geometry shader with selective emission" to selectively write-out data to a buffer object for only those instances that pass the view-frustum cull test. Of course you pass in your frustum planes in shader uniforms so the shader can actually do a cull test on each instance's bounding sphere. This pass gives you a list of in-frustum instances stored in a buffer object on the GPU.

The Pass 2 draw is an instanced draw call like you've already done before which operates on this instance list, except that it sources the draw call's arguments from a buffer object rather than the application draw call (check out glDrawElementsIndirect (https://www.khronos.org/opengl/wiki/GLAPI/glDrawElementsIndirect)).
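
For reference, the arguments that glDrawElementsIndirect reads from the buffer bound to GL_DRAW_INDIRECT_BUFFER are five tightly packed 32-bit values. The sketch below mirrors that layout in C++ and fills it on the CPU purely for illustration; the helper name is hypothetical. In the scheme described above, the instance count would be produced by the culling pass, and how it gets into the buffer is not shown in this thread, so that step is left out.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the command layout read by glDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices per instance
    std::uint32_t instanceCount;  // number of instances to draw
    std::uint32_t firstIndex;     // offset into the index buffer, in indices
    std::int32_t  baseVertex;     // value added to every index
    std::uint32_t baseInstance;   // must be 0 before OpenGL 4.2
};

static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "the command must be five tightly packed 32-bit values");

// Hypothetical helper: builds the command for drawing the instances that
// survived culling, using the whole index buffer from its start.
DrawElementsIndirectCommand makeDrawCommand(std::uint32_t indicesPerInstance,
                                            std::uint32_t visibleInstances) {
    assert(indicesPerInstance > 0);
    DrawElementsIndirectCommand cmd{};
    cmd.count = indicesPerInstance;
    cmd.instanceCount = visibleInstances;
    cmd.firstIndex = 0;
    cmd.baseVertex = 0;
    cmd.baseInstance = 0;
    return cmd;
}
```

For a quad drawn as two indexed triangles, `makeDrawCommand(6, n)` describes a draw of `n` visible instances.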

In case you're interested, here's a post that collects links to other info you may find useful. In particular, read rastergrid's articles for a great primer, keeping in mind that there are even more efficient ways to do things nowadays: