This would allow the graphics module to bind the correct attributes based on the VertexFormat of the buffer. At the same time, it doesn't guarantee that each vertex is actually in the specified format, though that shouldn't be a problem as long as I'm controlling the data coming in from my own asset pipeline.

Is this a good way to go about things? What have other people done, and is there a clearly better way that I'm missing?

Well, you only need struct Vertex for programmatically created vertex data. Vertex data that is loaded from a file can be referred to via void*, and can use any layout.

For vertex formats, you can either have a hard-coded enum/list like in your example, and have the file specify a value from that list (usually you don't have too many unique formats, so this will be fairly maintainable), or, the file can actually encode the vertex format itself, with e.g. struct { int offset, type, size, stride, etc; } elements[numElements]; (which would allow people to use new formats without editing your code -- useful on bigger projects with more artists / tech-artists).

Yep, malicious/corrupt data will do bad things, but you can put the error checking code into the tool that generates your files.
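
To make the "file encodes the format itself" option a bit more concrete, here is a minimal sketch of the kind of check the file-generating tool could run. The VertexElement fields and the validateFormat name are made up for illustration (they are not from any particular engine), and the 1 to 4 component limit is an assumption borrowed from how GL attributes work.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical on-disk description of one vertex attribute.
struct VertexElement {
    std::uint32_t offset;     // byte offset from the start of a vertex
    std::uint32_t compSize;   // bytes per component (e.g. 4 for a float)
    std::uint32_t compCount;  // components per attribute (assumed 1 to 4)
};

// Returns true if every element fits inside one vertex of `stride` bytes
// and the vertex blob holds a whole number of vertices.
bool validateFormat(const std::vector<VertexElement>& elements,
                    std::uint32_t stride, std::size_t blobBytes) {
    if (stride == 0 || blobBytes % stride != 0) return false;
    for (const VertexElement& e : elements) {
        if (e.compCount < 1 || e.compCount > 4) return false;
        if (e.compSize == 0) return false;
        // 64-bit arithmetic so a corrupt offset cannot wrap around.
        const std::uint64_t end =
            std::uint64_t(e.offset) + std::uint64_t(e.compSize) * e.compCount;
        if (end > stride) return false;
    }
    return true;
}
```

Running this in the exporter rather than the engine keeps the runtime loader as a plain memcpy-style read.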

If you get too hung-up about "wasting precious space" then you're going to miss other avenues for optimization; memory usage is not the be-all-end-all and by focussing on that to the exclusion of everything else, you may actually end up running significantly slower. It's easy to fall into this trap because memory usage is something that's directly measurable and not influenced by (many) outside factors, but the reality is quite a bit more complex.

Out of the theoretical and into the practical - let's look at your specific example here.

Changing your vertex format can be expensive. You may need to unbind buffers, switch shaders, break current batches, upload new uniforms, etc. All of these actions will interrupt the pipeline and - while they won't directly cause pipeline stalls - they will cause a break in the flow of commands and data from the CPU to the GPU. You've got one state change that potentially requires a lot of other state changes as a knock-on, and things can pile up pretty badly. Do it too often per frame and you'll see your benchmark numbers go in the wrong direction.

How often is too often? There's no right-in-all-situations answer to that one; it depends on your program and data.

I'm not saying that you should deliberately set out to unnecessarily use huge chunks of memory here. Quite the opposite; you should instead be making decisions like this based on information obtained through profiling and benchmarking - questing to reduce memory usage in cases like this and without this info is premature optimization. If accepting some extra memory usage is not causing any problems for your program, then just accept it as a tradeoff for reduced state changes - it may well be the right decision for you.
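
As a rough way to put a number on "how often", the sketch below counts how many times the vertex format changes across a frame's draw list, before and after grouping draws by format. DrawCall, formatId and both function names are hypothetical, and sorting by format is just one possible way to cut switches down, not something your renderer has to do.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical record for one draw submitted this frame.
struct DrawCall {
    int formatId;  // which vertex format the draw uses
    int meshId;    // which mesh it draws
};

// Number of times consecutive draws use a different vertex format.
std::size_t countFormatSwitches(const std::vector<DrawCall>& draws) {
    std::size_t switches = 0;
    for (std::size_t i = 1; i < draws.size(); ++i) {
        if (draws[i].formatId != draws[i - 1].formatId) ++switches;
    }
    return switches;
}

// Groups draws with the same format together, keeping their relative order.
void sortByFormat(std::vector<DrawCall>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const DrawCall& a, const DrawCall& b) {
                         return a.formatId < b.formatId;
                     });
}
```

Logging that count per frame next to your frame time is a cheap first step before reaching for a full profiler.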

Edited by mhagain, 18 November 2012 - 10:09 AM.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.

note that you can calculate the binormal by cross(normal, tangent)
personally, i make a vertex struct for every usage type, since many things have different needs
for example a fullscreen shader needs only vertices that are [0, 1], and the tex coords can be derived from the vertices!
example:
// tex coords directly
texCoord = in_vertex.xy;
// transformation to screenspace in orthographic projection (0, 1, 0, 1, zmin, zmax)
gl_Position = vec4(in_vertex.xy * 2.0 - 1.0, 0.0, 1.0);
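
for the binormal remark above, a minimal CPU-side sketch (Vec3, cross and binormal are just illustrative names; in a shader you'd use the built-in cross). it assumes normal and tangent are unit length and perpendicular, and it ignores the handedness sign some pipelines store for mirrored UVs:

```cpp
struct Vec3 { float x, y, z; };

// Standard right-handed cross product.
Vec3 cross(const Vec3& a, const Vec3& b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// Binormal derived from the other two, so it need not be stored per vertex.
Vec3 binormal(const Vec3& normal, const Vec3& tangent) {
    return cross(normal, tangent);
}
```

that saves three floats per vertex at the cost of one cross product in the shader.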

unless you are confident this is your bottleneck, you should be in gDEBugger to find out where your performance bottlenecks really are!
one neat feature in gDEBugger is rendering in slow motion, so you can see the exact order of operations, including order of rendering your models
that way you can for example get a real estimate on how effective your z-culling and samples passed (query objects) will be

I'm going to 2nd mhagain's comment .... use a vertex definition that is a superset of all you expect to render and use as few vertex/pixel shaders as possible, even if you have to superset them. Creating arbitrary state buckets just to save a few thousand bytes of "precious space" is going to slow down your render.

To add a counter-viewpoint --- using a single fat vertex format for all meshes, to avoid some possible future performance problem, is definitely a premature optimisation in my eyes. Why not just use the API properly, and then take measures to merge objects into single batches later when profiling says you have to?

All of these actions will interrupt the pipeline and - while they won't directly cause pipeline stalls - they will cause a break in the flow of commands and data from the CPU to the GPU

That's a bit of an exaggeration. All that any GL function that talks to the GPU does is write command packets into a queue (which is read by the GPU many dozen milliseconds later). Writing new command packets can't cause a break in the flow of already-written packets, nor will it somehow stall later packets. On the GPU end, there is hardware reading/decoding this queue in parallel with actually doing work. As long as you're submitting large enough groups of work (and GPU groups/batches don't correspond to 'batches' on the CPU, which are usually regarded as individual glDraw* calls -- the GPU can merge multiple CPU draws into a single group depending on conditions), executing the groups will take longer than decoding them, so decoding (which includes applying/preparing state changes) is pretty much free. I.e. when you're giving the GPU enough work per batch, the pipeline looks like:

Decode #1|Decode #2| |Decode #3|
| Run #1 | Run #2 | Run #3 |

and if you're not giving it enough work, it might look like:

Decode #1| Decode #2 | Decode #3 |
|Run #1|stall|Run #2|stall|Run #3|

And yes, if you run into the second case, then to fix it you may want to increase the number of pixels/vertices processed per draw call, and one way to do that may be to merge shaders, which may in turn require the merging of vertex formats... But all that is an optimisation topic, which means it should be done under the supervision of a profiling tool.

N.B. the first pipeline diagram above actually has a 'break' between Decode #2 and Decode #3 (i.e. in the flow of commands from the CPU->GPU), but that isn't a bad thing ;)

As for "saving precious space", this isn't about saving RAM. Yep, RAM is cheap and ever growing. The reason you want to save space is bandwidth. Below are the specs on a high-end and low-end model GPU from 3 different generations of nVidia cards:

As you can see, the high-end cards can pretty much read or write every byte of their memory around once per frame, but the low-end cards can only touch a quarter of their RAM in any given frame. Moreover, large parts of your RAM have to be read/written more than once in a frame -- render targets with blending will require multiple reads/writes per frame, texels will likely be read many times, VBOs are shared between different models and thus reused, and even within the drawing of a single mesh verts are shared between triangles (and will be redundantly reshaded upon cache miss, about half the time).

When you get to profiling, it's just as likely that some of the fixes you'll have to apply will be bandwidth-saving measures, which could be the opposite of the above -- e.g. splitting a single shader into multiple ones that take different vertex inputs, and sample different amounts of textures.

I honestly wasn't trying to do premature optimization, if it seems like that. I just wanted better flexibility. Wouldn't I be required to swap shaders anyway if I had some geometry coming through that didn't use the shader that handles normal maps (or other things requiring tangent-space data)? For example, if half of my geometry has normal maps or bump maps and I then want to render some geometry which doesn't, I'd run into errors trying to process that geometry in a shader that requires a normal map that isn't there. So wouldn't it be pointless for me to have bound the extra attributes anyway?

Sorry if I'm way off here, I'm still trying to get the hang of shaders and how this whole process should work.

If you do have a need to use the same shader on both types of objects, it's possible to just use a "flat" 1x1 pixel normal map for objects that don't require normal mapping.
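
For reference, a minimal sketch of what that "flat" texel would contain, assuming the common 8-bit tangent-space encoding where each channel is decoded as value / 255 * 2 - 1 (the Rgb8 type and the names below are illustrative only):

```cpp
#include <cstdint>

struct Rgb8 { std::uint8_t r, g, b; };
struct Vec3 { float x, y, z; };

// The single texel of a 1x1 "flat" normal map: points straight along +Z
// in tangent space, so the lighting falls back to the interpolated normal.
constexpr Rgb8 kFlatNormalTexel{128, 128, 255};

// Mirrors the usual shader-side decode from [0, 255] to [-1, 1].
Vec3 decodeNormal(Rgb8 t) {
    return { t.r / 255.0f * 2.0f - 1.0f,
             t.g / 255.0f * 2.0f - 1.0f,
             t.b / 255.0f * 2.0f - 1.0f };
}
```

With 8-bit channels the x and y components decode to roughly 0.004 rather than exactly zero, which is normally invisible once the shader renormalizes.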

I know you can't really comment on whether or not that would be the best solution for me, but is that somewhat of a bandaid solution? Even if it is there to prevent shader swapping and setting new uniforms.

This is the format I use for VBO data, in a plain C application:
[source lang="cpp"]
#define POS_COUNT 3
#define NORMAL_COUNT 3
#define TEXCOORD_COUNT 2
#define COLOR_COUNT 4

typedef struct vertex_data_s {
    int this_size;       // size of this structure
    float* databuffer;   // holds all data
    // pointers into data buffer:
    float* pos;          // pointer to pos for 0th vertex
    float* normal;       // pointer to normal for 0th vertex
    float* color;        // pointer to color for 0th vertex
    float* texcoord;     // pointer to 0th texcoord for 0th vertex
    int texcoord_count;
    int stride;          // number of floats to next vertex
    int count;           // number of vertices
} vertex_data_t;

// In the above, pos, normal, color, texcoord are all staggered pointers
// into the same databuffer. To access a particular type of data, it's just:
// (assume 'd' is a vertex_data_t*)
&d->pos[d->stride * n];     // pointer to nth vertex position
&d->normal[d->stride * n];  // pointer to nth normal

// Suppose I have data that has position and normal, but no color or
// texcoords. d->color and d->texcoord will be null, and d->stride will be
// set to NORMAL_COUNT + POS_COUNT.

// What if I want 3 textures? Then d->stride can be
// NORMAL_COUNT + POS_COUNT + 3 * TEXCOORD_COUNT, and the three texcoords are:
&d->texcoord[d->stride * n];
&d->texcoord[d->stride * n + TEXCOORD_COUNT];
&d->texcoord[d->stride * n + TEXCOORD_COUNT * 2];

// in general:
d->texcoord[d->stride * n + TEXCOORD_COUNT * tn];      // the 's'
d->texcoord[d->stride * n + TEXCOORD_COUNT * tn + 1];  // the 't'
[/source]

These all get simplified by some macros, so I don't have to think about it. I just have macros like vertex_position( my_vbo, n) to get a pointer to the nth 3d vector (pointer to 3 floats).

Macros make it easy for me to change the underlying structure and recompile, without rewriting everything else. For C++, you could use accessor functions instead of macros. You can also return the correct data types if you do this. For instance, when I use the vertex_position macro, the macro casts the float* into a vertex_3*, which in my app is just 3 floats. So I can do vertex_position(my_vbo, n)->y or vertex_texcoord(my_vbo, n, tn)->y if I want. But in C++, you can do all kinds of other things with accessor functions, such as check for valid values of n, tn, etc., so you can use the same idea, but refine it a bit and make it more crash-proof.
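
A minimal sketch of what that C++ accessor version could look like, assuming an interleaved position + normal layout with a fixed stride of 6 floats. The VertexData class and its member names are made up for this example; the accessors copy values in and out rather than casting pointers, which is one of several ways to do it.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical wrapper over an interleaved [P N P N ...] float buffer.
class VertexData {
public:
    explicit VertexData(std::size_t count)
        : data_(count * kStride, 0.0f), count_(count) {}

    Vec3 position(std::size_t n) const { return read(n, kPosOffset); }
    Vec3 normal(std::size_t n) const { return read(n, kNormalOffset); }
    void setPosition(std::size_t n, Vec3 v) { write(n, kPosOffset, v); }
    void setNormal(std::size_t n, Vec3 v) { write(n, kNormalOffset, v); }
    std::size_t count() const { return count_; }

private:
    // Offsets and stride are in floats, as in the C struct above.
    enum : std::size_t { kPosOffset = 0, kNormalOffset = 3, kStride = 6 };

    // The bounds check a macro cannot easily give you.
    std::size_t index(std::size_t n, std::size_t offset) const {
        if (n >= count_) throw std::out_of_range("vertex index out of range");
        return n * kStride + offset;
    }
    Vec3 read(std::size_t n, std::size_t offset) const {
        const std::size_t i = index(n, offset);
        return { data_[i], data_[i + 1], data_[i + 2] };
    }
    void write(std::size_t n, std::size_t offset, Vec3 v) {
        const std::size_t i = index(n, offset);
        data_[i] = v.x; data_[i + 1] = v.y; data_[i + 2] = v.z;
    }

    std::vector<float> data_;
    std::size_t count_;
};
```

Changing the underlying layout then only touches the private part of the class, which is the same benefit the macros give in C.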

This structure also translates easily to OpenGL: call glBufferData with a pointer to d->databuffer and a size of d->stride * sizeof(float) * d->count. All the data is transferred to GL (and probably the graphics card's VRAM) in one quick call. Then when it's time to draw, just call glVertexPointer / glVertexAttribPointer with each attribute's offset into the buffer, give it the stride for each one (GL expects the stride in bytes, so d->stride * sizeof(float)), and then draw. Fast and easy. If a particular buffer has no normals, colors, etc., its d->normal or d->color pointer will be null and the stride will be smaller: no wasted space.
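
Here is a small sketch of the numbers that would be handed to those GL calls, kept free of actual GL so it compiles on its own. Layout, makeLayout and bufferBytes are hypothetical names; the attribute order (position, normal, color, texcoords) is assumed from the struct above.

```cpp
#include <cstddef>

constexpr std::size_t POS_COUNT = 3, NORMAL_COUNT = 3;
constexpr std::size_t COLOR_COUNT = 4, TEXCOORD_COUNT = 2;

// Byte stride and byte offsets for one interleaved buffer.
struct Layout {
    std::size_t strideBytes = 0;     // stride argument for the gl*Pointer calls
    std::size_t posOffset = 0;       // position always comes first
    std::size_t normalOffset = 0;    // only meaningful if normals are present
    std::size_t colorOffset = 0;     // only meaningful if colors are present
    std::size_t texcoordOffset = 0;  // offset of the 0th texcoord set
};

Layout makeLayout(bool hasNormal, bool hasColor, std::size_t texcoordSets) {
    Layout l;
    std::size_t floats = POS_COUNT;
    if (hasNormal) { l.normalOffset = floats * sizeof(float); floats += NORMAL_COUNT; }
    if (hasColor)  { l.colorOffset = floats * sizeof(float);  floats += COLOR_COUNT; }
    if (texcoordSets > 0) {
        l.texcoordOffset = floats * sizeof(float);
        floats += TEXCOORD_COUNT * texcoordSets;
    }
    l.strideBytes = floats * sizeof(float);
    return l;
}

// Size argument for glBufferData.
std::size_t bufferBytes(const Layout& l, std::size_t vertexCount) {
    return l.strideBytes * vertexCount;
}
```

For a position + normal buffer this gives a stride of 6 floats, and with three texcoord sets it grows to 12 floats, matching the arithmetic in the struct description.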

I have an alternate form that does a structure-of-arrays, but since it uses different accessor macros (because access is d->pos[POS_COUNT*n] instead of d->pos[d->stride*n]) and having two layouts is confusing, I don't have much use for it anymore. The idea was that if all the positions are contiguous in memory, they can be updated by the CPU without having to retransfer everything else (like texcoords). But it didn't take long to realize that it was better just to have two vertex_data_t structs, one with positions/normals flagged GL_DYNAMIC_DRAW, and the other with everything else as GL_STATIC_DRAW. All I needed to do was allow multiple of these to be bound to one model, which was not hard at all.

I suppose it is. When I was implementing this stuff, I considered many options:

[ ] denotes one VBO, P = position N = normal T = texcoord

interleaved: [PNTPNTPNTPNTPNT] (if model is static, I suspect this is ideal)

separate: [PPPPP] [NNNNN] [TTTTT] (I think this is the worst. advantage is T can be kept static, and P or N can be dynamic/streamed. Don't know when you would want to change P without changing N though) (This was also my first VBO implementation, just to get things working!)

sequential: [ PPPPP NNNNN TTTTT] (not sure about this one. I would guess it's the same or worse than interleaved. Not sure how it could be better.)

'twin' [PNPNPNPNPNP] [TTTTTTTTTTT] (If PN are dynamic and T is static, I suspect this is the best for CPU animated models)

Are there any other formats anyone uses? I suppose everyone has some custom attributes they like to pass, but you can just shove those into their own VBO, or extend my struct to allow custom0, custom1, custom2, etc to be part of the stride.
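
To compare the interleaved and sequential layouts listed above, here is a minimal sketch of where a given float ends up in each. Both function names are made up; indices are in floats, not bytes.

```cpp
#include <cstddef>

// interleaved: [PNTPNTPNT...]
// Vertex n starts at n * strideFloats; the attribute sits attribOffset
// floats into the vertex; c selects the component.
std::size_t interleavedIndex(std::size_t n, std::size_t strideFloats,
                             std::size_t attribOffset, std::size_t c) {
    return n * strideFloats + attribOffset + c;
}

// sequential: [PPPP NNNN TTTT]
// Each attribute has its own block starting at blockStart; inside the
// block, vertex n is attribFloats * n floats in.
std::size_t sequentialIndex(std::size_t n, std::size_t blockStart,
                            std::size_t attribFloats, std::size_t c) {
    return blockStart + n * attribFloats + c;
}
```

The 'separate' and 'twin' layouts are the same two formulas applied per VBO, with blockStart at 0 for each buffer.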

I suppose another interesting question is: Is the cost of having to bind 2 or more VBOs per model almost always less than the saving produced by tagging one VBO as static and the other as dynamic? For instance, given the choice between:

will A almost always beat-out B? I think so, unless the VBOs being drawn are so short (only handfuls of triangles), that you're spending all your time thrashing and binding; in which case you should reevaluate what's going on.

Another reason to use split (non-interleaved) streams is when you need to use the same mesh data with different vertex shaders, e.g. your render-to-shadow-map shader only requires positions, not normals or tex-coords. In that case, you might want to use [PPPPPPPP][NTNTNTNT] so that the shadow-shader doesn't unnecessarily have to skip over wasted normal/tex-coord bytes. I've also seen other engines that simply export [PPPPPPPP] and [PNTPNTPNTPNTPNTPNTPNTPNT] -- so that both shaders are optimal, at the cost of memory ;)
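
A back-of-the-envelope sketch of that trade-off, assuming float3 positions and normals and float2 tex-coords. StreamCost and the three function names are hypothetical, and "pass bytes" here just means the size of the streams a shader's inputs live in; real fetch traffic depends on caches and hardware.

```cpp
#include <cstddef>

constexpr std::size_t kPosBytes = 3 * sizeof(float);
constexpr std::size_t kNormalBytes = 3 * sizeof(float);
constexpr std::size_t kTexBytes = 2 * sizeof(float);
constexpr std::size_t kFatBytes = kPosBytes + kNormalBytes + kTexBytes;

struct StreamCost {
    std::size_t storedBytes;      // total vertex memory for the mesh
    std::size_t shadowPassBytes;  // size of the streams the shadow shader reads
    std::size_t mainPassBytes;    // size of the streams the main shader reads
};

// [PNTPNTPNT]: one stream, shadow pass strides over unused N and T.
StreamCost singleInterleaved(std::size_t verts) {
    return { kFatBytes * verts, kFatBytes * verts, kFatBytes * verts };
}

// [PPPP][NTNTNT]: same memory, shadow pass touches only positions.
StreamCost splitPositions(std::size_t verts) {
    return { kFatBytes * verts, kPosBytes * verts, kFatBytes * verts };
}

// [PPPP] plus [PNTPNTPNT]: both passes read one tight stream,
// paid for with a second copy of the positions.
StreamCost duplicatedPositions(std::size_t verts) {
    return { (kPosBytes + kFatBytes) * verts, kPosBytes * verts, kFatBytes * verts };
}
```

With 4-byte floats that is 32 bytes per vertex for the first two options and 44 for the duplicated one, while the shadow stream drops from 32 to 12 bytes per vertex once positions are split out.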

In my engine, the model/shader compilers read in a Lua configuration file, describing their options for laying out vertex data. For any sub-mesh, it first determines which vertex shaders might be used on that sub-mesh, and collects a list of "vertex formats" (i.e. vertex shader input structures) that the sub-mesh needs to be compatible with. It then uses that list, along with the list of attributes that the artist has authored (e.g. have they actually authored tex-coords?) and selects an appropriate stream storage format for the data to be exported in. e.g.