I haven't seen this posted anywhere on the site, so: a few months ago Valve held a presentation on porting their Source engine to Linux. It goes into detail on how to use modern OpenGL and get the best performance out of it. It's an hour-long video - a great wealth of info.

Is the "Vertex Attribute Object" they mention as being slow in this video the same thing as Vertex Arrays (eg glGenVertexArrays)

And if so, what kinds of things would make them slower than rebinding the buffers / calling glVertexAttribPointer every draw?

Yes, that's VAOs.

Really, all Valve have given us is their conclusions; we don't have their code, we don't have their test cases, we don't have their profiling data, and we don't know to what extent this was based on hardware vendor advice (it's interesting to note here that id Software don't use VAOs either - and in their case they have released their code so we can cross-check and confirm).

If you think about it, changing a VAO involves swapping out one huge chunk of state and swapping in another similarly huge chunk: the enabled arrays, their pointers, and the buffers used for them. It's easy enough to conceive of scenarios where not using VAOs may be more efficient. Maybe you just wanted to change one pointer but keep everything else the same; maybe you wanted to change the buffers (and remember that they're not using GL4.3, so no vertex attrib binding) but keep everything else; or maybe your GPU vendor just has a bad implementation of VAOs in their driver. Again, without the missing information from Valve it's hard to draw conclusions - we don't really know what kind of vertex formats they're using, how often they're changing them, and whether their usage patterns are consistent and sensible, or borderline insane.
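To make that first scenario concrete, here's a sketch of the two alternatives. The object names are illustrative, not from anyone's actual code, and both paths assume a current GL context with objects already created:

```c
/* Path 1: VAO switch - one call, but it swaps the entire vertex-input
 * state block (enabled arrays, pointers, element buffer binding). */
glBindVertexArray(vaoB);

/* Path 2: no VAOs - change just the one attribute pointer that differs,
 * leaving every other enabled array and pointer untouched. */
glBindBuffer(GL_ARRAY_BUFFER, positionVbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(float) * 8, (void *)0);
```

If only one pointer actually differs between draws, Path 2 touches far less state than a full VAO swap; which one wins in practice is exactly what you'd have to profile.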

Without all of that the only conclusion we can validly draw is that Valve have found one case where using VAOs is slower (and they're not giving us the full information we need to test and/or support their findings), but that doesn't necessarily hold true for all cases. Profile your own code, form your own conclusions that are appropriate for your own program, and use whichever approach gives you the best performance for your own use cases.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.

the only conclusion we can validly draw is that Valve have found one case where using VAOs is slower

…I’m responsible for performance inside of tri-Ace and I have yet to find a case at all in which a VAO is faster than manual switching, when manual switching is done properly.

It's not like they found a few cases in which the performance was better for lack of VAOs; rather, their searches, id Software's searches, and my own searches yielded no results in the pursuit of a better-case scenario for VAOs. VAOs simply do not offer better performance than you can get on your own via your own redundant state-tracking, and as Valve mentioned that is likely never to change, since it requires a scope outside of the driver's range.

In theory something like a VAO SHOULD be faster, because the driver can cache and validate the various bound buffers up front and convert them to a sane internal format.

In reality, not all your streams/buffers are going to be static, AND bind-to-edit allowances mean they probably don't cache to any such degree (bind VAO, bind new buffer, oops, edited the VAO...).

The 4.4 extension GL_ARB_multi_bind will probably end up being the fastest way of doing it, as you can use one API call to set multiple streams at once which, assuming the API is sane, should let you set 'static' stream data and then bind in 'instance' data as needed afterwards.
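For reference, the multi-bind entry point for vertex buffers looks like the sketch below. It needs a GL 4.4 context (or the GL_ARB_multi_bind extension), and the buffer names and strides here are purely illustrative:

```c
/* One call binds both a static mesh stream and a per-instance stream. */
GLuint   buffers[2] = { staticMeshVbo, instanceDataVbo };
GLintptr offsets[2] = { 0, 0 };
GLsizei  strides[2] = { sizeof(float) * 8, sizeof(float) * 4 };
glBindVertexBuffers(0, 2, buffers, offsets, strides);
```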

As for everything else from Valve on this; given they are pushing a Linux/OpenGL based OS I'd take what they say with varying degrees of salt.

Just to be clear on this, I've also benchmarked VAOs as being slower, even in the case where you create and bind a single VAO during startup then write the rest of your code as if VAOs didn't exist. I stopped short of saying "VAOs are always slower" because I can guarantee that somebody, somewhere, right now undoubtedly has a case where they are actually faster.
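For clarity, that benchmark setup is just the standard core-profile shim, something like this sketch (assumes a current GL context):

```c
/* Create and bind one VAO at startup, then never touch it again; a core
 * profile requires a non-zero VAO to be bound before any vertex specification.
 * All later code uses glBindBuffer/glVertexAttribPointer as if VAOs
 * didn't exist. */
GLuint dummyVao;
glGenVertexArrays(1, &dummyVao);
glBindVertexArray(dummyVao);
```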

I haven't benchmarked VAOs combined with GL4.3 vertex attrib binding but I suspect that this may be a faster path than pre-GL4.3 usage because it can involve just swapping out buffer specifications, leaving the rest of the vertex format intact. Valve of course aren't using that because they must target pre-GL4.3 hardware.
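Under GL 4.3 that suspected faster path would look roughly like the sketch below (the attribute layout is illustrative): the format is specified once, and per draw only the buffer binding changes.

```c
/* Once, at setup: describe the format independently of any buffer. */
glVertexAttribFormat(0, 3, GL_FLOAT, GL_FALSE, 0);  /* attrib 0: vec3 position */
glVertexAttribBinding(0, 0);                        /* attrib 0 reads binding 0 */
glEnableVertexAttribArray(0);

/* Per draw: swap just the buffer, leaving the format intact. */
glBindVertexBuffer(0, meshVbo, 0, sizeof(float) * 3);
```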

Despite all of this I still use VAOs because I find them convenient for state management, the performance impact is not too high, and there are bigger bottlenecks in GL anyway (updating dynamic buffer objects, for example, although I'm hoping that GL4.4 buffer storage will resolve much of that).


I'm surprised a VAO should be slower, because not only is it fewer API calls, and not only can the driver cache and validate the various buffers upfront, but it can also cache the validation. This is admittedly less expensive than in the case of an FBO (which is why it's faster to switch between 2 FBOs than to add/remove attachments on a single one), but it still necessarily means touching fewer objects spread out in memory, and thus fewer cache misses.

That said, it surprises me that they're discouraging MapBuffer too. In my experience, MapBufferRange is just about the same as CopyBufferSubData, with the difference that you can offload the copy to another thread. And if the GPU sync really bites you as they suggest, there's still MAP_UNSYNCHRONIZED_BIT, which you can use as described by Hrabcak and Masserann in Cozzi/Riccio's book. That not only avoids synchronization and lets you offload the copy to another thread, but it also avoids having the driver perform memory allocation and reclamation work.

I'm surprised a VAO should be slower, because not only is it fewer API calls, and not only can the driver cache and validate the various buffers upfront, but it can also cache the validation. This is admittedly less expensive than in the case of an FBO (which is why it's faster to switch between 2 FBOs than to add/remove attachments on a single one), but it still necessarily means touching fewer objects spread out in memory, and thus fewer cache misses.

Depends on Valve's usage, to be honest. E.g. a common enough scenario is to use the same vertex format and layout but to change the buffers; without GL_ARB_vertex_attrib_binding it's not possible to do this without respecifying the entire VAO, so there's not only no caching going on in this scenario, but also the extra overhead of VAO respecification and revalidation (at which point in time you may as well not be using VAOs at all).

I highly doubt that Valve are using GL_ARB_vertex_attrib_binding as many AMD cards, and all Intel cards, don't support it, and Valve's products must run on that hardware.

I'd also draw your attention to their earlier observation (in the same presentation) about GL being chatty but efficient, and not to judge a piece of code by number of calls. It's easy enough to conceive of a single API call that does a lot more work than multiple calls, so it really depends on the amount of work that each API call has to do. If - as I suspect - most vendors implement VAOs primarily as a user-mode software wrapper, with lazy state changes calling into kernel mode to flush changed VAO state to the hardware when a draw call is made, the API overhead of a single call versus multiple calls should really be very minimal.

That said, it surprises me that they're discouraging MapBuffer too. In my experience, MapBufferRange is just about the same as CopyBufferSubData, with the difference that you can offload the copy to another thread. And if the GPU sync really bites you as they suggest, there's still MAP_UNSYNCHRONIZED_BIT, which you can use as described by Hrabcak and Masserann in Cozzi/Riccio's book. That not only avoids synchronization and lets you offload the copy to another thread, but it also avoids having the driver perform memory allocation and reclamation work.

Surely the Valve guys would know about that technique?

Valve definitely know about this technique because it's the way D3D buffer updates work, so they've been using it in D3D for over 10 years now; it's very straightforward to port D3D discard/no-overwrite code to MapBufferRange (the API calls match up very well), so they must have another reason for not using MapBufferRange. Again, I'd suggest that this reason is that GL_ARB_map_buffer_range may not be available on all of their target hardware. Raw MapBuffer (i.e. without "Range") has several problems, so BufferSubData is definitely to be preferred over it in cases where MapBufferRange isn't available.
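For those unfamiliar with the D3D pattern, the discard/no-overwrite mapping onto MapBufferRange looks roughly like this sketch. All names here are illustrative; it assumes a GL 3.0+ context, a dynamic buffer of BUF_SIZE bytes already bound to GL_ARRAY_BUFFER, and a running write cursor:

```c
void *dst;
if (cursor + bytes > BUF_SIZE) {
    /* "discard" (cf. D3D's WRITE_DISCARD): orphan the old storage so the
     * driver can hand back fresh memory without stalling on the GPU. */
    cursor = 0;
    dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bytes,
                           GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
} else {
    /* "no-overwrite" (cf. WRITE_NO_OVERWRITE): append past in-flight data
     * without any synchronization - you promise not to touch it. */
    dst = glMapBufferRange(GL_ARRAY_BUFFER, cursor, bytes,
                           GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
}
memcpy(dst, src, bytes);
glUnmapBuffer(GL_ARRAY_BUFFER);
cursor += bytes;  /* draws issued now read from [cursor - bytes, cursor) */
```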


I only do it once, and after that I just upload new data when it changes... set and forget.

Am I doing it wrong?

That sounds fine; I'd expect it to run as well as possible, provided you don't change the buffer size during your "upload new data" step. If you do change the buffer size, the driver would need to allocate a new buffer behind the scenes, swap it with the old one, and wait until pending draw calls on the old one have completed before freeing it, which could lead to VAO respecification (this would be driver-internal behaviour, however, and is not specified by OpenGL).
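In other words, the safe version of "set and forget" looks like this sketch (names illustrative; assumes a current GL context):

```c
/* Once, at startup: allocate the storage at its final size. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, BUF_SIZE, NULL, GL_DYNAMIC_DRAW);

/* Later, per update: write within the existing allocation, so the size
 * never changes and the driver never has to swap out the storage. */
glBufferSubData(GL_ARRAY_BUFFER, 0, newBytes, newData);
```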

GL_ARB_vertex_attrib_binding is core in OpenGL 4.3, which I can understand not many cards will support

In theory the basic functionality behind vertex attrib binding should be able to work on any GL2.0+ hardware, as it's a fairly clean duplicate of the way D3D9+ specifies vertex formats. In practice the inclusion of instancing entry points, the L and I versions of glVertexAttribFormat, and the requirement for a non-zero VAO to be bound limit its ability to work on downlevel hardware. Vendors could however implement it as an extension for 3.0+ hardware if they wished, and there seems no good reason why they haven't done so.
