So it's not impossible, but certainly isn't trivial either. If you are thinking about Mac OSX as a potential platform and want to use modern GL features, you'll be faced with this problem. Otherwise, I wouldn't worry about the core profile at all, and instead gradually upgrade the parts of your application that will benefit from modern GL techniques (whether it be performance or new capabilities).

And behold - here lies the problem with it all! Yes, we want our code to work on MacOSX and Intel hardware but what the theoreticians completely overlook is that management also has a say in the matter, resulting in the following:

- no rewrite from the ground up
- no change of general program flow
- no time consuming changes

Of course it's easy to say 'you should have done...' and other smart-ass remarks, but they always fall wide of the mark. That's what some people seem to forget: the old legacy code exists, in some form it needs to continue to exist, and worse, it needs to be kept operable on more modern systems.

So here it goes:

Originally Posted by mhagain

No.

The point is that this isn't a GL3.x+ problem; this is a problem that goes all the way back to GL1.3 with the GL_ARB_vertex_buffer_object extension, so you've had more than ample time to get used to the idea of using buffer objects, and more than ample time to learn how to use them properly.

Yes, tell that to the people who made the mess more than 10 years ago. I'd fully agree that it was badly designed, but that's what I have to deal with, and no consideration you make will make the code go away.
But it's complete bullshit anyway. glBegin/glEnd was a tried and true feature until GL 2.1, so whatever you are trying to say here is way off the mark. You are arguing from a theoretical standpoint, completely forgetting that what I have to deal with is code that actually exists and actually needs to be kept working.
Plus, the performance characteristics of both methods are so totally different that there's simply no 1:1 transition - that's why the old code was never changed.

Originally Posted by mhagain

Talking about it as though it were a GL3.x+ problem and as if it were something new and horrible isn't helping your position. How about you try doing something constructive, like dealing with the problem instead?

Of course it's a GL 3.x problem, that's when the immediate mode stuff was deprecated and some driver makers decided to drop it without any equally performant feature to replace it.

And now to the other person who doesn't seem to have a grasp on the maintenance of old legacy code...

Originally Posted by thokra

Astrom: My take is simple: no legacy GL in new code.

If you're forced to maintain a legacy code base, usually due to economic, time, and compatibility constraints, by all means, keep the legacy well and clean. As mhagain already stated: there are core GL 4.4 features you can already use even in legacy code, the most prominent being plain vertex buffer objects.

Sorry, that doesn't work. 'Legacy' doesn't necessarily mean keeping the old feature set. What if you want to integrate some newer shader-based features but for one reason or another cannot afford a complete overhaul of your code base, be it for financial or time reasons? In that case you have to find a compromise.
So far the compromise has been the compatibility profile, but at my workplace everybody is in agreement that this is a stopgap measure at best, and that as soon as it's technically doable we should migrate to a core profile, so that we aren't locked to AMD and NVidia on Windows.

Originally Posted by thokra

I see you completely fail to realize that it will be a huge, possibly massive, investment of time anyway. The question is: do you invest the time in small steps, porting feature by feature, or do you go ahead and rewrite everything? Going from legacy to modern core OpenGL takes time and care - no doubt. Still, mhagain already proposed the first option - and he's right to do so, IMO.

The orders are not to make a huge investment of time. And as I already said, mhagain's proposal has already been nixed. It can't be done. End of story. Too much work for no gain. We'd have to do months of work with no result in sight; that's plainly not affordable.
So again, find a compromise that gets us where we want to be. (Yes, you read that correctly: the operative term is always 'compromise'...)

Originally Posted by thokra

Proof please. I'm not aware of any D3D10 feature that so massively kicks the GL's ass. Or am I simply not aware of something similar to persistently mapped buffers in D3D10? I thought the only thing giving you an advantage was D3D10_MAP_WRITE_NO_OVERWRITE with a D3D10_MAP_WRITE_DISCARD at frame begin.

The problem seems to be that buffer mapping is a lot more efficient with D3D than with OpenGL 3.x. All I can tell you is that the buffer updates were killing us with GL but not in a D3D test setup.

Originally Posted by thokra

Yeah yeah, you mentioned that already - several times - in another thread. It's high time you told us what frickin' exotic scenario you're talking about. Otherwise you'll simply stay in that magical position that no one here can disagree with, because there aren't enough hard facts to do so. Cut the crap and get real.

I think I said this countless times before: The code I have to deal with is sprinkled with immediate mode draw calls - one quad here, one triangle fan there, a triangle strip elsewhere. It's not exotic, it's just crufty, bad old code from another time. Due to the way all of this is done it's very hard to optimize. Since I am not allowed to disclose more information, you have to trust me when I say that the only way to port this to a buffer-based setup is to upload each primitive's data separately, issue a draw call, and move on. The code is inherently tied to such an approach (which was all nice and well when it was written a long time ago).
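To make the pattern concrete: each of those call sites could be routed through a small shim with the same call shape as glBegin/glVertex/glEnd, so the call sites stay untouched while the backend changes. This is a sketch of my own making, not the actual code (which I can't disclose); the GL upload and draw are shown in comments because they need a live context:

```c
#include <assert.h>

/* Hypothetical shim: same call shape as glBegin/glVertex/glEnd,
 * but vertices land in a CPU scratch array that a real backend
 * would upload and draw in one call per glBegin/glEnd pair. */
#define IMM_MAX_VERTS 64

static float imm_verts[IMM_MAX_VERTS][3];
static int   imm_count;
static int   imm_mode;          /* e.g. GL_TRIANGLE_FAN */
static int   imm_draws_issued;  /* statistics, for illustration */

void immBegin(int mode) { imm_mode = mode; imm_count = 0; }

void immVertex3f(float x, float y, float z) {
    assert(imm_count < IMM_MAX_VERTS);
    imm_verts[imm_count][0] = x;
    imm_verts[imm_count][1] = y;
    imm_verts[imm_count][2] = z;
    imm_count++;
}

void immEnd(void) {
    /* A real backend would do something like:
     *   glBufferSubData(GL_ARRAY_BUFFER, 0,
     *                   imm_count * 3 * sizeof(float), imm_verts);
     *   glDrawArrays(imm_mode, 0, imm_count);
     * i.e. one upload + one draw per former glBegin/glEnd pair. */
    imm_draws_issued++;
}
```

The point being: the call sites don't change, but every former glBegin/glEnd pair still turns into one buffer upload plus one draw call - which is exactly the per-primitive upload cost I'm talking about.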

Originally Posted by thokra

Microsoft did it. Maintaining backwards compatibility for over 20 years is ludicrous for something like OpenGL. D3D10/11 doesn't give a crap about the D3D9 API. The thing is, even if you only leverage the features that comply with the D3D9 feature subset still supported by D3D11, you still have to code against the D3D11 API. You can't even use the old D3D9 format descriptors. No way you're gonna have a D3D11 renderer and still write stuff similar to glEnableClientState(GL_FOG_COORD_ARRAY) - to mention just one example that makes me want to jump out the window, while at the same time wanting to have kids with the GL 4.4 spec because of GL_ARB_buffer_storage.

... which ultimately was the reason why we decided against porting to D3D. As soon as you need to move beyond the currently set-in-stone feature set you are screwed. A new API every 3 or 4 years is deadly if you have to work with software that may exist for a decade or more and also needs to be kept up to date to a degree - not to mention that D3D11 is restricted to newer versions of Windows, causing problems if it needs to run on an older system.

Originally Posted by thokra

And what are we gonna do anyway? Suppose there had been a compatibility break and we now were forced to either stay with GL3.0 at max OR start rewriting our code bases to use GL3.1+ core features - what would have been the alternative? Transition to D3D and a complete rewrite of everything? Also, where I work, we're supporting Win/Linux/Mac - go to D3D and you have to write another renderer if you want to keep Linux and Mac around.

What would have happened? Easy to answer: The code would have stayed as it was, limited to GL 2.1 features with no chance of ever being upgraded, with our bosses quaking in their shoes for fear that the old API would eventually vanish completely.

Originally Posted by thokra

IMO, you have to make sure that people have to adapt - see D3D. That's where the ARB failed - by letting us use the new stuff and the old crap side-by-side. I seriously doubt many companies would have been pissed off enough to abandon their GL renderers.

And here you are forgetting something:
D3D is mainly used for entertainment software, which MUST keep up with current technology. A 5-year-old D3D9 engine won't do anymore for a new product.
The same is not true for corporate software, which is often badly maintained, full of ancient cruft, and something a company's well-being relies on.
It's absolutely unfeasible to go at this with an 'out with the old, in with the new' approach; management would balk at it. Again, the nice word 'compromise' must be mentioned. And this is clearly where the compatibility profile comes in: there are lots of high-profile customers who simply cannot afford to port their software to an entirely different paradigm of working. The mere fact that a compatibility profile had to be established was a clear indicator that something was wrong with how the deprecation mechanism was used.

Originally Posted by thokra

And there is no substantial problem I know of that's solvable with GL2.1 but not with GL 3.1+ - if you have one, stop rambling and prove it with an example.

It's not about the inability to solve a problem but about the inability to redesign an existing solution without blowing it up. Face it: GL 3.x was completely missing an efficient method for small and frequent buffer updates, resulting in horrendous CPU-side primitive caching schemes and similar crutches to reduce the number of buffer uploads. I have written my share of those myself for other projects; all they did was cost a lot of time while providing absolutely no performance increase over immediate mode.
And frankly, this particular thing was the ONLY thing that was sorely missing from GL 3.x.

Originally Posted by thokra

Name one feature you're missing from GL 3.1+ that forced you to rewrite your entire application. I'm very, very curious. If your answer is gonna be what you repeatedly mentioned, i.e. that immediate mode vertex attrib submission is king and everything else is not applicable or too slow (which is a hilarious observation in itself), I refer you to my earlier proposition.

See above: The inability to just put some data into a buffer without some insane driver overhead. Yes, just an efficient method to replace immediate mode draw calls. You may ignore this problem as much as you like, that doesn't change anything about the cold hard fact that our 'big app''s life depends on it.

Originally Posted by thokra

Liberally invoking draw calls? Since when is someone writing a real-world application processing large vertex counts interested in liberally invoking draw calls? Please define 'liberally', and please state why you can't batch multiple liberal draw calls into one and source the attribs from a buffer object. Otherwise, this is just as vague as everything else you've stated so far to defend immediate mode attrib submission.

Again: The code exists, the code needs to continue to exist, it's one of the backbones of our company that this application continues working.
Again: It's very old, it's very crufty and today would be written in a different way.
Again: All of this doesn't eliminate the fact that I have to deal with the code as it was written more than a decade ago and liberally expanded over the years.

It's a simple question of economics - a rewrite would be too costly. There's no point in discussing this. The decision has been made, and I have to make do with what I can do - which is merely picking out the immediate mode draw calls and replacing them with anything that's compatible with a core profile and doesn't bog down performance.

Originally Posted by thokra

More than 15 years isn't enough? Seriously?

You cannot pull the rug out from under existing software in the vain hope that everyone can afford to take the time to reorganize all their data.

Originally Posted by thokra

See? That's what I'm talking about... the code to do that, except for a few lines, is exactly the same. In fact, with persistent mapping you have to do synchronization inside the draw loop yourself - a task that's non-trivial for non-trivial applications.

Huh? The point of persistent, coherent buffers was precisely to AVOID such schemes! Just write some data into a buffer, issue a draw call, and move on - allowing a perfect 1:1 translation of existing immediate mode code without any need for restructuring, and without the overhead of the inefficient way vertex data is specified in immediate mode.

Originally Posted by thokra

Persistent mapping is an optimization, and it doesn't make rewriting the code you wrote hundreds of times any easier. You, however, continue to state this perverted notion that persistently mapped buffers are the only viable remedy for something that was previously only adequately solvable with immediate mode... Have you ever had a look at the "approaching zero driver overhead" presentation from GDC14? Did you look at the code sample that transformed a non-persistent mapping into a persistent mapping? Your argument before was that you cannot replace immediate mode with anything other than persistently mapped buffers. If you're so sure about what you're saying, please explain the supposedly huge difference between an async mapping implementation and a persistent mapping implementation - because you didn't say that async mapping was too slow due to implicit syncing inside the driver or something (and that's AFAIK only reportedly the case with NVIDIA drivers, which really seem to hate MAP_UNSYNCHRONIZED), you said you couldn't do it at all.

The problem with a non-persistent mapping (using glMapBufferRange) is that each time I want to write data to the buffer I have to map the buffer, write some data into it, unmap it again, and then issue a draw call (since a draw call may not source from a mapped buffer). And that process is SLOW!!! Sure, it's doable, but it's far from performant; it was significantly slower than using immediate mode, to the point where it bogged down the app. Same for updating with glBuffer(Sub)Data. From day one of working with a core profile, my one and only gripe has been that a low-overhead buffer update mechanism was completely overlooked; everything was geared toward large static buffers, forgetting that not everything is large and static, and that not all code is easily rewritten to keep data large and static.

That's the main reason I jumped at persistent, coherent buffers: with those the code is actually FASTER than immediate mode, even on NVidia where glBegin/glEnd is still fast.
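For the record, the setup boils down to something like the following sketch of a persistently mapped ring buffer. The ring bookkeeping is illustrative (names like ring_alloc are made up); the GL calls are in comments since they need a live context, and a production version would fence each region with glFenceSync/glClientWaitSync before reusing it:

```c
#include <assert.h>
#include <stddef.h>

/* CPU side of a persistently mapped ring buffer (sketch).
 * In a real setup the base pointer would come from:
 *   glBufferStorage(GL_ARRAY_BUFFER, RING_SIZE, NULL,
 *       GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
 *   base = glMapBufferRange(GL_ARRAY_BUFFER, 0, RING_SIZE, same flags);
 * and wrap-around would wait on a fence guarding the region
 * about to be overwritten. */
enum { RING_SIZE = 1 << 16 };

typedef struct { size_t head; } Ring;

/* Returns the byte offset at which 'bytes' may be written,
 * wrapping to 0 when the request doesn't fit at the tail. */
size_t ring_alloc(Ring *r, size_t bytes) {
    assert(bytes <= RING_SIZE);
    if (r->head + bytes > RING_SIZE)
        r->head = 0;   /* real code: glClientWaitSync on this region */
    size_t offset = r->head;
    r->head += bytes;
    return offset;
}
```

Each former glBegin/glEnd block then becomes a memcpy into base + offset followed by one glDrawArrays, with no map/unmap per primitive - which is the 1:1 translation I mean.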

Originally Posted by thokra

Again, there is nothing of importance you can't do with core GL 3.1+ that you can do with GL 2.1 - except for quads, maybe. You have everything you need at your disposal to go from GL 2.1 to core GL 3.0 - and everything you write then is still usable even if you move directly to a GL 4.4 core context.

Aside from performance in some border cases, one of which our app unfortunately depends on, sure, you can do everything with GL 3.x core. (And from what I've learned, all these border cases stem from the convenience of using immediate mode drawing as a simple 'draw something to the screen' function, so it's something that has been heavily used in legacy code.)
The problem is that to make it work, a more extensive rewrite may be in order if you are dealing with legacy code from another generation. And it's precisely that extensive rewrite that corporate programmers often won't be allowed to undertake.

Originally Posted by thokra

Even if it means a little more work, it's almost definitely solvable and never a worse solution. If I'm wrong, please correct me with concrete examples.

Yes, unless that 'little more work' you are talking about is considered too much by management - then all your theories fall flat on their face with a loud 'thump'.

Originally Posted by thokra

Wrong again. Developers chose client side vertex arrays before VBOs because for amounts of data above a certain threshold, client side vertex arrays substantially improve transfer rates and substantially reduce draw call overhead. Plus, there is no way of rendering indexed geometry with immediate mode because you needed either an index array or, surprise, a buffer object holding indices.

Client-side vertex arrays - just like static vertex buffers - are nice when you can easily collect larger amounts of data. But they become close to useless if your primitives regularly consist of fewer than 10 vertices and on top of that are dynamically created, with frequent state changes that break up primitives. Sure, you can continue to collect them, but you also have to collect your state along with them, and in the end you save no time vs. glBegin/glEnd. The world doesn't entirely consist of 100+-vertex triangle strips.

Originally Posted by thokra

Again, pure speculation - and stating that immediate mode supposedly performs better than a buffer object sometimes... that's really something to behold. Unless the driver is heavily optimized to batch the vertex attributes you submit and send the whole batch once you hit glEnd() - or even uses some more refined optimizations - there is no way immediate mode submission can be faster than sourcing directly from GPU memory, not in theory and not in practice.

No speculation. You seem to operate from the assumption that once the data is in the buffer it will stay there. Yes, in that case buffers are clearly the way to go.
But believe it or not, there are usage scenarios where it's far more important to optimize the data's way into the buffer than anything else. For a strictly CPU-bottlenecked app it doesn't matter one bit how much data you can draw with a single draw call; all that matters is finding the fastest way to get your data onto the GPU - and that's exactly my problem. Restructuring the code to allow better batching would add bookkeeping overhead that's entirely on the CPU, where we are already at the limit and every small addition can be felt immediately.

TL;DR, I know, to make it easier to digest I'll post the summary separately.

Let's get back to the discussion about deprecation and its advantages and disadvantages. I firmly stand on the point that if something gets removed but at the same time has to be reinstated through a backdoor, something has gone horrendously wrong. If I want to deprecate stuff, I want to remove it eventually - permanently!
And to allow that you have to think twice about what features are in use, how they can be replaced and how much work needs to be invested to replace them.

But look at what happened: Stuff got deprecated. Fine!
But wait: There's tons of legacy apps that may want to use the new features - so let's add an extension that brings back all of the old.

Ugh...

Now, if things had been done seriously, at this point everyone should have stopped, thought about it for a moment - and then developed a way to actually remove the old stuff WITHOUT bringing it back through the backdoor! The moment the ARB_compatibility extension was established, the whole thing could have been considered a failure.

So it should be clear that the main reasons someone thought they NEEDED such an extension should have been addressed before actually removing anything.

Yes, we want our code to work on MacOSX and Intel hardware but what the theoreticians completely overlook is...

"Theoreticians" - that's a pretty bold assumption. The OpenGL architecture review board consists of people who design graphics hardware, write graphics drivers, author graphics engines, and use the OpenGL API in applications.

....management also has a say in the matter, resulting in the following:

- no rewrite from the ground up
- no change of general program flow
- no time consuming changes

In that case, you've got a pretty clear set of restrictions which would preclude you from doing much GL3 or GL4 specific work anyway, even with compatibility mode. While I'm sure they have their reasons, it seems a bit short-sighted to me. Dealing with third-party API changes is a normal part of software development, which software managers need to account for (usually as "software maintenance"). Legacy to core just happens to be a larger change than most, and one which isn't even being forced upon you.

However, under those development restrictions I don't see any problem with setting out the hardware and platform requirements in your application's system requirements (Windows - Nvidia or AMD).

You cannot pull the rug out from under existing software in the vain hope that everyone can afford to take the time to reorganize all their data.

Except that they didn't. Compatibility mode remains for those cases, for GL implementors willing to support it. Even Apple has a GL 2.1 compatibility profile - you just can't use GL4 features with it, which it sounds like your application couldn't use anyway.

See above: The inability to just put some data into a buffer without some insane driver overhead. Yes, just an efficient method to replace immediate mode draw calls. You may ignore this problem as much as you like, that doesn't change anything about the cold hard fact that our 'big app''s life depends on it.

Sounds like your app suffers from the "small batch problem" which a lot of people in the industry are attempting to resolve (AMD with Mantle, Microsoft with DirectX 12, Nvidia with their 337 series GL driver). Nvidia does well in immediate mode because it has an excellent emulation layer which batches up all the vertices for you into buffers. They also optimize their display lists during compilation. However, your mileage will vary, as some immediate mode and display list implementations are quite a bit slower.

Current hardware simply doesn't like small draw batches with rendering changes (GL state changes) between them, so it's up to the software to optimize the draws and buffer submissions. You can either do it yourself or hope the driver does a good job. The draw batcher we wrote provides very similar performance in Core GL mode to our previous immediate mode code. Perhaps a bit of profiling is required? Especially if, as you say, your big app's life depends upon it.
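For illustration, the core of such a batcher can be sketched like this - a toy version, not our actual code; the integer state key and the flush policy are simplifications. Consecutive draws sharing a state key are merged, and a state change (or a full buffer) forces a flush, which in a real renderer would be one buffer upload plus one glDrawArrays:

```c
#include <assert.h>

/* Toy draw batcher: consecutive draws that share a state key are
 * merged into one batch; a state change or a full buffer flushes. */
enum { BATCH_MAX = 4096 };

typedef struct {
    int state_key;    /* stand-in for texture/blend/shader state */
    int vert_count;   /* vertices accumulated in the current batch */
    int flushes;      /* actual draw calls issued so far */
} Batcher;

static void flush(Batcher *b) {
    if (b->vert_count == 0) return;
    /* real code: upload accumulated vertices (e.g. glBufferSubData)
     * and issue a single glDrawArrays here */
    b->flushes++;
    b->vert_count = 0;
}

void batch_draw(Batcher *b, int state_key, int nverts) {
    if (state_key != b->state_key || b->vert_count + nverts > BATCH_MAX)
        flush(b);
    b->state_key = state_key;
    b->vert_count += nverts;
}
```

With a scheme like this, N small draws under the same state collapse into one GPU draw call; the cost that remains is exactly the CPU-side state tracking being debated above.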

Coming back to this mess, because I think I found a solution to my problem - though it's far from what I would ever have expected.

I have been trying around with all kinds of buffer hacks, but to no avail: Uploading buffers with GL 3.x's feature set is horrendously slow, no matter what API is used. It only works for a relatively small number of buffers per frame, but not in my case, where I needed to do several thousand buffer uploads per frame. To solve this I would have had to cache the entire uniform state for hundreds of draw calls just to reduce the number of buffer uploads.

However, while working on something else, I noticed that uploading uniform arrays repeatedly is virtually free, with no perceptible performance loss at all.
So I had this crazy idea: don't put the vertices into a buffer object but into a uniform array, and merely use a static vertex buffer to index this uniform array - and to my never-ending surprise, after a little tweaking it worked! On the systems I have tested so far it's nearly the same performance as using immediate mode functions - something none of the buffer-related methods even remotely managed. And the best thing: I do not have to mess around with caching state on the CPU to reduce the number of API calls.

But now I ask myself: Why is buffer uploading so much slower than uniform uploading, to the point that in some extreme use cases it becomes completely worthless as a feature?

But now I ask myself: Why is buffer uploading so much slower than uniform uploading, to the point that in some extreme use cases it becomes completely worthless as a feature?

With the vertex buffers: are you just uploading buffer contents when drawing these small batches? Or are you also setting up the vertex state as well (glVertexAttribPointer)? I'd expect a buffer upload to be the same speed regardless of whether it's filling a VBO or a UBO. Specifying the vertex state can be more expensive, though.

Also, some drivers are picky about the vertex formats used. AMD drivers, for example, don't like vertex buffers with elements that aren't 4-byte aligned (such as an 8-bit vec2). Try using GL_KHR_debug to see if you're getting any performance warnings back from the driver.

I don't use a UBO, btw., just a plain, simple uniform array with 100 floats - enough to store 20 vertices - and glUniform1fv to upload my data. This way I easily manage to upload 40,000 batches with 200,000 vertices per frame, with no performance degradation compared to immediate mode.
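Roughly, the scheme looks like this. The shader below is my reconstruction of what I described, assuming a 5-floats-per-vertex layout (x, y, z plus two spare components) so that 100 floats hold 20 vertices - the exact layout in our app differs. The static VBO holds only a per-vertex index 0..19, and glUniform1fv refills the array before every draw:

```c
#include <assert.h>

/* Hypothetical vertex shader for the uniform-array scheme, kept as a
 * string for illustration. The attribute 'index' comes from a small
 * static VBO containing 0..19; positions come from the uniform array
 * that glUniform1fv rewrites before each draw call. */
static const char *vs_source =
    "#version 150\n"
    "uniform float verts[100];   // 20 vertices * 5 floats (assumed)\n"
    "in float index;             // 0..19, from a static VBO\n"
    "void main() {\n"
    "    int i = int(index) * 5;\n"
    "    gl_Position = vec4(verts[i], verts[i+1], verts[i+2], 1.0);\n"
    "}\n";

/* CPU-side mirror of the shader's addressing, used when packing the
 * float array that gets handed to glUniform1fv. */
int vert_base(int vertex_index) { return vertex_index * 5; }
```

The CPU side then just fills a 100-float array, calls glUniform1fv once, and issues the draw - no buffer mapping anywhere in the per-draw path.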

But now I ask myself: Why is buffer uploading so much slower than uniform uploading, to the point that in some extreme use cases it becomes completely worthless as a feature?

If you tell me whether your uniform values actually change between calls or not, I'll be probably able to answer to your question.

First, there is a limit on the number of uniforms you can use - usually about 4K floating-point entries for older cards (or 16K for new top models), i.e. 16KB (to 48KB). On the other hand, a VBO can be up to the size of available memory (several GB). That's the first difference.

Second, drivers optimize setting uniforms: if they haven't changed, nothing is sent to the graphics card. Try modifying 16KB of uniform space in each draw call - I bet it is more expensive than modifying a 16KB VBO in a single glBufferSubData() call.

Second, drivers optimize setting uniforms: if they haven't changed, nothing is sent to the graphics card. Try modifying 16KB of uniform space in each draw call - I bet it is more expensive than modifying a 16KB VBO in a single glBufferSubData() call.

Of course the uniform array changes! For each draw call it contains the vertices that were generated.
It's just that the performance of glUniform1fv is what I'd expect in this scenario. It's roughly the same as transferring the same amount of data via immediate mode, and somewhat slower than using a persistently mapped coherent buffer (as per GL_ARB_buffer_storage).

The buffer uploading must be getting hung up on some synchronization issue, but I've been unable to find out why. Of all the buffer upload methods I tried, glBufferSubData was the fastest, but it still increased frame processing time from 20ms to 80ms in my 40,000-draw-call test scenario.
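For reference, one commonly suggested mitigation for exactly this kind of stall is buffer orphaning: re-specifying the buffer store with glBufferData(..., NULL, GL_STREAM_DRAW) before each glBufferSubData, so the driver can hand out fresh memory instead of waiting on in-flight draws. Whether it helps depends heavily on the driver; the helper and names below are illustrative, with the GL calls in comments since they need a live context:

```c
#include <assert.h>
#include <stddef.h>

/* Orphaning sketch. Per draw, real code would do:
 *   glBufferData(GL_ARRAY_BUFFER, capacity, NULL, GL_STREAM_DRAW); // orphan
 *   glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, verts);
 *   glDrawArrays(mode, 0, nverts);
 * The helper rounds the per-draw byte count up to a multiple of a
 * fixed chunk, on the assumption (mine, not documented) that drivers
 * cope better when the re-specified store size is stable. */
size_t orphan_capacity(size_t bytes, size_t chunk) {
    return ((bytes + chunk - 1) / chunk) * chunk;
}
```

If even orphaning doesn't beat the 20ms baseline, that would at least confirm the cost is in the per-call driver overhead rather than in synchronization alone.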