Primitive restart extended functionality

Let's start from the problem: it is common to have a mesh built with different types of primitives, but the gl*Draw* functions let you draw only one type of primitive per call. So you are forced either to convert all the geometry to one primitive type (in most cases GL_TRIANGLE_STRIP is the best choice) or to call gl*Draw* multiple times, sorting the primitives by type. Primitive restart indices help to separate primitives "on the fly" without calling gl*Draw* for each individual primitive or inserting duplicate indices, but the mesh still has to be optimized to combine individual triangles into primitives of the same type. However, merging the triangles of arbitrarily shaped geometry into strips only, using blind machine algorithms (even advanced ones), produces a considerable number of short strips. Some of that "garbage" may fit better into a GL_TRIANGLE_FAN than a GL_TRIANGLE_STRIP, or may even be better rendered as individual triangles. But since we are forced to use strips for all primitives, we still have to use separator indices for each of those small chunks.

So here is my solution for that mess: let's extend the functionality of the primitive restart index so that it can also change the primitive mode. To do that, we need more than one index value to carry a special meaning instead of referencing a set of vertex attributes. I think the most elegant way is to state that once the extended functionality of the Primitive Restart Index (PRI for short) is enabled, any index whose value is equal to or higher than the value specified by glPrimitiveRestartIndex has a special meaning. If the index is equal to the specified PRI, the primitive type stays the same (the primitive is just restarted); if the index is higher than the PRI, the primitive type changes, and all subsequent indices are used to construct primitives of the new type.

So what are the rules for converting a special index value into the desired primitive type? Let's have a look at the defined values of all currently known primitives (let's keep the legacy types in the table just for consistency):
GL_POINTS 0x00
GL_LINES 0x01
GL_LINE_LOOP 0x02
GL_LINE_STRIP 0x03
GL_TRIANGLES 0x04
GL_TRIANGLE_STRIP 0x05
GL_TRIANGLE_FAN 0x06
GL_QUADS 0x07
GL_QUAD_STRIP 0x08
GL_POLYGON 0x09
GL_LINES_ADJACENCY 0x0A
GL_LINE_STRIP_ADJACENCY 0x0B
GL_TRIANGLES_ADJACENCY 0x0C
GL_TRIANGLE_STRIP_ADJACENCY 0x0D
GL_PATCHES 0x0E
So far there are only 15 values, defined as consecutive numbers. The straightforward approach is to treat the index just above the PRI value as a base offset for those numbers. So when an index equal to PRI is encountered, the primitive is just restarted and its type is left unchanged; if the index value is equal to (PRI+1), the primitive type switches to GL_POINTS; if the index is (PRI+2), the new primitive type becomes GL_LINES, and so on.
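
To make the rule concrete, here is a minimal CPU-side sketch of how such an index stream could be split into batches. It is only an illustration of the proposal, not anything a driver does today: the names `Batch` and `split_batches` are made up for this example, and treating values above PRI+1+0x0E as a plain restart is an assumption, since the proposal does not define them.

```c
#include <assert.h>
#include <stddef.h>

#define LAST_MODE 0x0Eu   /* GL_PATCHES, the highest mode value in the table above */

/* One run of ordinary indices that is drawn with a single primitive mode. */
typedef struct {
    unsigned mode;    /* primitive mode in effect for this run */
    size_t   first;   /* position of the run's first index in the stream */
    size_t   count;   /* number of ordinary indices in the run */
} Batch;

/* Hypothetical decoder for the proposed scheme:
 *   index == pri          -> restart, mode unchanged
 *   index == pri + 1 + m  -> restart and switch to mode m
 * Values above pri + 1 + LAST_MODE are treated as a plain restart here
 * (an assumption of this sketch). Returns the number of batches written. */
size_t split_batches(const unsigned *idx, size_t n, unsigned pri,
                     unsigned initial_mode, Batch *out, size_t max_batches)
{
    assert(idx != NULL && out != NULL);

    size_t   nb    = 0;
    size_t   start = 0;
    unsigned mode  = initial_mode;

    for (size_t i = 0; i <= n; ++i) {
        /* The end of the stream closes the current run like a restart does. */
        int special = (i == n) || (idx[i] >= pri);
        if (!special)
            continue;

        if (i > start && nb < max_batches) {
            out[nb].mode  = mode;
            out[nb].first = start;
            out[nb].count = i - start;
            ++nb;
        }
        if (i < n && idx[i] > pri && idx[i] - pri - 1 <= LAST_MODE)
            mode = idx[i] - pri - 1;   /* switch the primitive type */
        start = i + 1;
    }
    return nb;
}
```

Each resulting batch corresponds to what would otherwise be a separate glDrawElements call (or a separate restart-delimited primitive within one call), which is exactly the bookkeeping the proposal wants to move into the index buffer.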

The way I have described this may be raw and messy, I know, but I hope the idea will find support.
The extended primitive restart functionality would let meshes be constructed from primitives of different types and would greatly simplify drawing them and storing them in files. The number of gl*DrawElements* calls required to draw a single mesh would be reduced to one, as all the required information could be stored right in the GL_ELEMENT_ARRAY_BUFFER, making the "mode" parameter of the gl*DrawElements* functions obsolete.

Actually, strips have not been the optimal primitive type for a long time. glDrawElements with GL_TRIANGLES is preferred (you can arrange the data in strip order if you wish, but you don't have to). id Software pushed this as the optimal path for Quake III (in 1999); indexed triangles give better vertex reuse, and this is what vendors have optimized around. See also http://tomsdxfaq.blogspot.ie/2005_12_01_archive.html

Let me say this just once, because academics are still spending time and money researching this subject. You're wasting your time. Strips are obsolete - they are optimising for hardware that no longer exists. Indexed vertex caches are much better at this than non-indexed strips, and the hardware is totally ubiquitous. Please go and spend your time and money researching something more interesting.

The exception is if you're talking about a mobile device where you know that strips are still preferred, but that's OpenGL ES, not OpenGL.

"...Indexed vertex caches are much better at this than non-indexed strips..." - what I see here is a comparison between indexed drawing of the least optimal primitive type (GL_TRIANGLES) and non-indexed GL_TRIANGLE_STRIP. This emphasizes the performance gain from the post-transform vertex cache; it has nothing to do with the actual type of the primitive. If the comparison is made between indexed drawing of both types, no doubt strips will win. Well, at least the index buffer will be smaller.

Drawing 2 connected triangles independently requires 6 invocations of the vertex shader, but if they are drawn using GL_TRIANGLE_FAN or GL_TRIANGLE_STRIP, there are only 4 invocations. Yes, with the help of the post-transform vertex cache the number of vertex shader invocations may be equal to the number of distinct vertices, but there is still a gain from the smaller index buffer: n triangles take 3*n indices, while a strip (or fan) of n triangles takes n+2+1 indices (including the PRI). So if there are 2 or more connected triangles in a mesh, it is better to draw them as a strip or a fan. And in most cases any triangle in a mesh has at least 3 neighbors.
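
The index-count arithmetic above can be checked with a couple of one-line helpers. This is just the formula from the post written as C; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Indices needed for n triangles drawn as independent GL_TRIANGLES. */
size_t indices_as_triangles(size_t n)
{
    return 3 * n;
}

/* Indices needed for one strip or fan of n connected triangles,
 * plus one restart index separating it from the next primitive. */
size_t indices_as_strip(size_t n)
{
    assert(n > 0);
    return n + 2 + 1;
}
```

For a single triangle the strip form costs one index more (4 against 3); from two connected triangles upward it is smaller (5 against 6), and for ten triangles it is 13 against 30.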

For a vertex cache of size 2 or more, there will be only 4 invocations for indexed triangles as well.
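
A small simulation shows where the number 4 comes from. It assumes a plain FIFO cache, which is a simplification: real hardware may use a different replacement policy or batch vertices differently, and `count_vertex_invocations` is a name made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE 64   /* upper bound on the simulated cache size */

/* Counts vertex shader invocations (cache misses) for an index stream,
 * assuming a simple FIFO post-transform cache with cache_size entries. */
size_t count_vertex_invocations(const unsigned *idx, size_t n, size_t cache_size)
{
    assert(idx != NULL && cache_size <= MAX_CACHE);

    unsigned cache[MAX_CACHE];
    size_t used = 0;     /* entries currently filled */
    size_t next = 0;     /* slot holding the oldest entry once the cache is full */
    size_t misses = 0;

    for (size_t i = 0; i < n; ++i) {
        int hit = 0;
        for (size_t c = 0; c < used; ++c) {
            if (cache[c] == idx[i]) {
                hit = 1;
                break;
            }
        }
        if (hit)
            continue;

        ++misses;                       /* the vertex has to be shaded */
        if (cache_size == 0)
            continue;                   /* no cache at all */
        if (used < cache_size) {
            cache[used++] = idx[i];     /* still filling up */
        } else {
            cache[next] = idx[i];       /* evict the oldest entry */
            next = (next + 1) % cache_size;
        }
    }
    return misses;
}
```

For two connected triangles indexed as {0, 1, 2, 2, 1, 3}, this gives 6 invocations with no cache, 5 with a single entry, and 4 with two entries, the same as the strip {0, 1, 2, 3}.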

The gain in index buffer size is only relevant if you actually build long strips, but in that case you can't make effective use of the vertex cache:

In an optimal triangle mesh, every vertex is part of 6 triangles, but in a strip it is only part of at most 3 triangles, so (on average) every vertex has to be part of two strips. If you optimize for long strips to reduce memory consumption, you'll no longer have that vertex cached the second time it is needed, so you need twice the number of vertex shader invocations for strips.

So am I just outdated? Are you guys saying that everybody nowadays uses triangles only and nobody bothers with strips and fans? So there would be no use for the proposed extension at all?

Originally Posted by mbentrup

In an optimal triangle mesh, every vertex is part of 6 triangles, but in a strip it is only part of at most 3 triangles, so (on average) every vertex has to be part of two strips. If you optimize for long strips to reduce memory consumption, you'll no longer have that vertex cached the second time it is needed, so you need twice the number of vertex shader invocations for strips.

If I have a mesh that can be tessellated into long strips, long enough to exceed the capacity of the cache, I can't imagine how I could draw it with triangles to make better use of the cache. Drawing it area by area will still push the border vertices out of the cache. Perhaps it may even be worse than with strips. Still, strips can be trimmed to smaller sizes, and even after that the number of indices will be smaller than what the individual triangles would take.

In an optimal triangle mesh, every vertex is part of 6 triangles, but in a strip it is only part of at most 3 triangles, so (on average) every vertex has to be part of two strips. If you optimize for long strips to reduce memory consumption, you'll no longer have that vertex cached the second time it is needed, so you need twice the number of vertex shader invocations for strips.

And vertices are much bigger than indices too, so the saving on vertices more than offsets the increased index count.

Originally Posted by Yandersen

So am I just outdated? Are you guys saying that everybody nowadays uses triangles only and nobody bothers with strips and fans? So there would be no use for the proposed extension at all?

Strips are still relevant in the mobile world and this would probably be useful for GL ES. There may be some cost from both restarting a primitive and switching the primitive type, but the saving from fewer draw calls may offset that (draw calls are a much bigger deal with ES than they are on the desktop).

Let's skip arguing about which primitive type is *the best* and focus on the actual idea of the proposed extension. The most important question, as I see it, is performance. Is it possible to predict whether switching the primitive type along with restarting the primitive would be considerably slower than a simple primitive restart? Is there a way to implement the proposed primitive switching functionality and keep the rendering speed the same?

Maybe the way of primitive switching I described earlier is not the best one to implement. Maybe the switching indices have to be set independently using some special function:

In other words, the implementation will have a set of supported primitive switching indices (PSI) ranging from GL_PRIMITIVE_SWITCHING_INDEX0 to GL_PRIMITIVE_SWITCHING_INDEXi, where i is equal to GL_MAX_PRIMITIVE_SWITCHING_INDICES-1. Each of those index binding points has an index value implicitly associated with it. That value is based on the primitive restart index, so for the target GL_PRIMITIVE_SWITCHING_INDEX0 the actual index value is PRI+1, for GL_PRIMITIVE_SWITCHING_INDEX1 it is PRI+2, and so on. The actual number of supported PSI is implementation dependent and may be smaller than the number of supported primitive types. Maybe. I don't know, I am not a driver developer.
So we associate a primitive type with each PSI we need, then enable that PSI.

It is just another way to implement index-based primitive switching.
It can also be extended. For example, if the mode of some switching index is set to GL_NONE, let that index be a so-called termination index, acting just like the 0 terminator in character strings and aborting execution of the index array:

Code :

glPrimitiveSwitchingIndexMode(GL_PRIMITIVE_SWITCHING_INDEX2, GL_NONE); //Once the index equal to PRI+3 encountered, all subsequent indices will be ignored
glEnable(GL_PRIMITIVE_SWITCHING_INDEX2); //Now the index 255 will stop the index array execution
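
The table-based variant, including the termination index, could be simulated on the CPU roughly like this. Everything here is hypothetical: `SwitchTable`, `execute_stream`, `MAX_PSI` and `MODE_NONE` are names invented for the sketch. Note that it uses a separate sentinel for "none" because in the real GL headers GL_NONE and GL_POINTS both have the value 0, so an actual extension would need some way to tell them apart.

```c
#include <assert.h>
#include <stddef.h>

#define MODE_NONE 0xFFFFu   /* stand-in for GL_NONE; distinct from GL_POINTS (0) */
#define MAX_PSI   4         /* stand-in for GL_MAX_PRIMITIVE_SWITCHING_INDICES */

typedef struct {
    unsigned pri;             /* primitive restart index */
    unsigned mode[MAX_PSI];   /* mode bound to slot i, matched by index pri + 1 + i */
    int      enabled[MAX_PSI];/* nonzero if slot i is enabled */
} SwitchTable;

/* Walks the index stream and returns how many entries are executed before
 * an enabled slot bound to MODE_NONE terminates it (n if none is met).
 * *mode is updated whenever an enabled switching index is encountered. */
size_t execute_stream(const SwitchTable *t, const unsigned *idx, size_t n,
                      unsigned *mode)
{
    assert(t != NULL && idx != NULL && mode != NULL);

    for (size_t i = 0; i < n; ++i) {
        if (idx[i] <= t->pri)
            continue;                      /* ordinary index or plain restart */

        unsigned slot = idx[i] - t->pri - 1;
        if (slot >= MAX_PSI || !t->enabled[slot])
            continue;                      /* not a switching index in this setup */

        if (t->mode[slot] == MODE_NONE)
            return i;                      /* termination: ignore the rest */
        *mode = t->mode[slot];             /* switch the primitive type */
    }
    return n;
}
```

With PRI set to 252, slot 0 bound to GL_TRIANGLE_STRIP and slot 2 bound to the "none" mode, the index 253 switches to strips and the index 255 ends processing, as in the two calls above.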

I don't want to be too harsh, but what is the purpose of extending the primitive restart to change the primitive type? Is it expected that the index buffer will be made by a feedback mechanism? Is it to just avoid an extra draw call?

Let's review why primitive restart was/is a good idea. A long time ago, in a galaxy far away, folks used triangle strips. They used them because one index per triangle was great. It was really great in immediate mode. Then came indexed mode and TnL hardware, where vertex processing was on the GPU (instead of the CPU). Once there, a post vertex transform cache was christened and folks realized that one can get better than 1 triangle per processed vertex if one uses GL_TRIANGLES and primes for the cache. Finally, we get to primitive restart. With primitive restart one can get one triangle per index (often) and better than 1 triangle per processed vertex. From an API point of view we always had glMultiDraw*, but those are really done by software.

What is the gain for being able to change the primitive type in the index stream? What I see are:

1. avoid issuing another draw call

2. potential to reuse more vertices from the post-vertex cache

3. if the index stream is made by a feedback process (i.e. by the GPU)

Here are my thoughts for each one of the above:

1. Avoiding issuing a draw call is not really a big deal. Most drivers batch many draw calls together before sending them down to the kernel to send to the GPU, so what matters for many draw calls is whether there are state changes between them. In this case the state change is the primitive mode, which is not really a big deal. One can argue that the index buffer offset changes as well.

2. Likely we are talking about saving maybe 3 or 4 vertices max between mode changes, not a big deal UNLESS one has a very small number of vertices between primitive mode changes; this also applies to 1.

3. One can rig a feedback process to also output when the primitive mode changes, into an array of streams indexed by primitive mode.

So what I see, the main benefit is if the batches are tiny between changing primitive modes and the draw order MUST be in that order [for otherwise one can organize the stream by primitive mode].

From the hardware point of view, it makes life harder because now the primitive type, rather than being set only at draw call, would need to be propagated down the pipeline in addition to the logic at vertex fetch units to recognize that the primitive type changed. That is going to cost some sand.

Well, my main intention was to make an index buffer a pretty much self-descriptive item, requiring only a single draw call to render an entire model. An array of indices technically represents an array of batches, but the primitive type of each batch and the bounds of each batch are not stored in the index array; that "mapping" is supplied externally by draw calls providing the missing arguments. The same arguments every time, which is senseless. It is logical for the client to store the range information that defines where inside the index buffer each level-of-detail submodel is located, but if we also have to store the mapping of every primitive batch of that submodel, it just becomes messy and doesn't make sense anymore. We may draw one submodel or another depending on conditions (distance from the viewer), but we never draw parts of a model separately just because they are built with different primitive types: the batches belonging to a single submodel are either all drawn or not drawn at all. An individual primitive is not a unit the user needs to manipulate independently, unless the whole submodel is built with a single primitive type. We also can't choose which primitive type to use when drawing a given batch of indices: a given set of indices requires a predefined primitive type to be assembled properly. This is not information the client needs to manipulate at all; the primitive type is information of the same atomic level as the indices themselves. The two are naturally bound: together they represent a single logical item, and changing either of them independently will not produce anything meaningful. So in my opinion the primitive type should not be separated from the indices, and the drawing routines should take only the starting index and the number of indices to render. IMO.

There is another argument of the glDrawElements command that, IMO, the client should not have to bother with either: the type of the indices. Again, it is information that is bound to the contents of an IBO. However, unlike the primitive type, the index type does not change dynamically, and in most cases an entire IBO shares the same index type (especially if all those indices refer to the same VBO). I think it would make perfect sense to store the index type as a state attribute of the GL_ELEMENT_ARRAY_BUFFER target, or better, of the IBO itself. The primitive switching index values, the PRI and the index type could all be set there, as all of those parameters are hard-bound to the array of indices. For example, the commands setting those values may look like this:

So, to keep backward compatibility, if a draw call is made with any of the currently known functions, the index type and primitive mode are taken from the function arguments rather than from the state of the GL_ELEMENT_ARRAY_BUFFER, and the single PRI value is the one set by glPrimitiveRestartIndex. To make use of the IBO state, I recommend introducing a set of functions for indexed drawing:

The indexFirst parameter is the "index of the first index" in the array stored in the IBO; the byte offset therefore depends on the type of the indices, which is taken from the parameters of the given IBO. That means that from the client's point of view the IBO represents an array of indices, and all the client needs to know when choosing which part of it to draw is the range where the desired model's data is located. The index type does not have to be respecified, and there is no primitive mapping to mess about with.
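
The conversion from indexFirst to a byte offset is the same arithmetic a client does today when filling in the offset argument of glDrawElements. A minimal sketch, with made-up helper names and the index type enums written out with their values from the GL headers so that it compiles without them:

```c
#include <assert.h>
#include <stddef.h>

/* Index type enums, with the values they have in the OpenGL headers. */
#define IDX_UNSIGNED_BYTE  0x1401u   /* GL_UNSIGNED_BYTE  */
#define IDX_UNSIGNED_SHORT 0x1403u   /* GL_UNSIGNED_SHORT */
#define IDX_UNSIGNED_INT   0x1405u   /* GL_UNSIGNED_INT   */

/* Size in bytes of one index of the given type, or 0 if the type is unknown. */
size_t index_size(unsigned type)
{
    switch (type) {
    case IDX_UNSIGNED_BYTE:  return 1;
    case IDX_UNSIGNED_SHORT: return 2;
    case IDX_UNSIGNED_INT:   return 4;
    default:                 return 0;
    }
}

/* Byte offset of the index at position index_first inside the buffer. */
size_t index_byte_offset(unsigned type, size_t index_first)
{
    size_t size = index_size(type);
    assert(size != 0);   /* only the three index types above are valid */
    return index_first * size;
}
```

With the index type stored as IBO state, this multiplication would move from the client into the implementation, and the client would pass only indexFirst and the count.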

Well, given that the IBO binding is part of VAO state anyway, all those parameters could alternatively be part of the VAO state rather than the GL_ELEMENT_ARRAY_BUFFER state. I am not sure about that...

It's valid for the index type to be allowed to change if you're streaming indices to a dynamic buffer. One type of object may need 32-bit indices, another may be fine with 16-bit. Allowing this means that you only need to create a single such buffer and bind it once, rather than having an extra buffer bind (or even a full VAO change) each time the type needs to change. This also works well with a persistent mapping setup.