UPDATE ON THE TOPIC: I have tried SSBOs, which basically let me have a pointer to a memory location that is mapped to the SSBO data, so I can easily change individual values if I must. But now I have a dilemma:

Should I use SSBOs or UBOs? From what I've seen, I can allocate space for the SSBO per frame depending on whether the number of active lights has changed. But isn't this worse than just allocating a MAX value using a UBO and only filling a part of it? Also, I've read that writing to a SSBO is slower than writing to a UBO, so shouldn't I just allocate a really big UBO and fill it with the active lights on a per-frame basis?

From what I can tell, only sending data for the lights that have been updated might prove to be harder than I initially thought since I'd have to find a way to deal with deleted lights. So I think I'll just stick with sending all the active lights.

From what I've seen, I can allocate space for the SSBO per frame depending on whether the number of active lights has changed. But isn't this worse than just allocating a MAX value using a UBO and only filling a part of it?

Yes, but you can do that with SSBOs too. Did you not read the part where I pointed out that, not 4 days ago, the ARB released an OpenGL feature whose primary purpose is to make it impossible to reallocate space for a buffer object?

Also, I've read that writing to a SSBO is slower than writing to a UBO

There is no such thing as an SSBO. Or a UBO. Or a VBO.

They are just buffer objects: unformatted linear arrays of memory stored and managed by OpenGL. You can use a buffer for shader storage purposes, then turn around and use it for UBO. You can do transform feedback into a buffer, then upload that data to a texture via PBO. You can use a buffer with a buffer texture, write with image load/store to it, then use it as vertex data with glVertexAttribPointer.

All buffer objects provide the same functionality.

What may be slower is reading from it in your shader. UBOs will (in all likelihood) be copied into the constant local storage of your shaders, so reading will be quite fast. SSBOs are basically just a nice form of Image Load/Store via buffer textures, so they're treated like global memory accesses.

So UBOs should be faster to read from in my shader? Since, as you said, it is impossible to reallocate space for a Buffer Object, I will have to allocate for example 100 lights whether I use UBOs or SSBOs. If this is true, then I should use UBOs, shouldn't I?

Also, in my code, I'm using glBufferData with NULL data, per frame, just before I use glMapBufferRange. My SSBO initialization is basically generating it and binding it as a GL_SHADER_STORAGE_BUFFER to a predefined binding point. My questions are:

1. If I have defined MAX_LIGHTS = 100 and I only need to upload 30 lights, what's the difference between using glBufferData with size = current_number_of_lights and size = MAX_LIGHTS? You said it doesn't reallocate space, so what does it really do? If I use glBufferData with size = 30 and the next frame use it with size = 31, what happens?

2. Should I really be using glBufferData each frame? Or should I use it once in initialization and then just make sure glBufferSubData doesn't upload something bigger than the space I've allocated?

3. Should I really be using glMapBufferRange or should I instead be using glBufferSubData? I figured glMapBufferRange either copies the data to host memory and then copies changes back to the GPU, or it returns a pointer directly to GPU memory, which would be risky. So it seems glBufferSubData is just better since it just copies the specific data you want to the GPU at that point.

EDIT: (EDIT2 responds to this) This is strange: I tried using the .length method to check the light array length and it always returns 1. I decided to see what would happen if I took out glBufferData (with NULL pointer), and nothing happened; it didn't seem to be doing anything at all. I am basically generating a buffer, binding it and using glBufferSubData to upload the data with 3 lights. This results in 3 lights being correctly evaluated in the shaders accessing light[0], light[1] and light[2], but light.length returns 1, so I'm basically accessing a memory position outside the array. I don't really understand what's going on...

EDIT2: After some testing it seems that light.length compiles but isn't supposed to even be used as it has nothing to do with array length. The correct way would be light.length() but there is a known bug where you need to wrap it with uint(uintBitsToFloat(light.length())) to get the unsigned int out of it. So now that I get the correct length of the array I managed to test some things:

If I use glBufferData(100,NULL) during initialization and then I use glBufferSubData(3,data) the array length is 100. If I only use glBufferSubData(3,data) it doesn't allocate anything as expected. If I use glBufferData(3,data) per frame and then change to glBufferData(4,data) the array's length also goes from 3 to 4, meaning it allocated a bigger space. But then how have they made it impossible to reallocate? And is it better to use glBufferData(3,data) per frame or use glBufferData(100,NULL) when initializing and glBufferSubData(3,data) per frame?

OK, let's just cut to the chase. Go read this and implement one of those streaming strategies.

Since, as you said, it is impossible to reallocate space for a Buffer Object

I didn't say it was impossible. I said that recent functionality allows you to make it impossible. And since that functionality exists to make using them faster, that's a strong hint that you shouldn't be doing it in the first place.

I will have to allocate for example 100 lights whether I use UBOs or SSBOs.

The ability to resize the storage for a buffer object has nothing to do with how you use it.

Uniform blocks must be of a specific size. Therefore, whatever buffer object you use for them must be at least that size. It could be bigger, but it can't be smaller.
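To make the "specific size" point concrete, here is a minimal CPU-side sketch of a fixed-size light block. The `Light` layout, the `LightBlock` name and the GLSL declaration in the comment are hypothetical and only for illustration; `MAX_LIGHTS = 100` is the figure used earlier in the thread, and 16384 bytes is the minimum value of GL_MAX_UNIFORM_BLOCK_SIZE that the specification guarantees (the real limit should be queried with glGetIntegerv).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical light layout: every member is vec4-sized, so the C++ struct
// matches a std140 block member-for-member without hidden padding.
struct Light {
    float position[4];   // xyz = position, w unused
    float color[4];      // rgb = color, a = intensity
    float direction[4];  // xyz = direction, w unused
    float params[4];     // attenuation / cone terms (illustrative only)
};

constexpr std::size_t MAX_LIGHTS = 100;

// Mirror of a block assumed to be declared in GLSL as:
//   layout(std140) uniform LightBlock { uvec4 count; Light lights[100]; };
struct LightBlock {
    std::uint32_t count[4];    // count[0] = number of active lights
    Light lights[MAX_LIGHTS];  // always MAX_LIGHTS entries, partly filled
};

static_assert(sizeof(Light) == 64, "Light must be four vec4s");
static_assert(sizeof(LightBlock) == 16 + 64 * MAX_LIGHTS, "unexpected padding");
static_assert(sizeof(LightBlock) <= 16384, "exceeds the guaranteed minimum UBO size");

// Byte offset of light i inside the block, e.g. for a partial update.
inline std::size_t light_offset(std::size_t i) {
    assert(i < MAX_LIGHTS);
    return offsetof(LightBlock, lights) + i * sizeof(Light);
}
```

Under these assumptions the buffer bound to the block has to hold at least `sizeof(LightBlock)` bytes no matter how many lights are active in a given frame; the active count travels in `count[0]`.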

You said it doesn't reallocate space

Where? I said that ARB_buffer_storage/GL 4.4 allows you to allocate buffers that cannot be reallocated. And that means that it was a mistake for OpenGL to let you reallocate them to begin with. So you should never do it.

I figured glMapBufferRange either copies the data to host memory and then copies changes back to the GPU, or it returns a pointer directly to GPU memory, which would be risky. So it seems glBufferSubData is just better since it just copies the specific data you want to the GPU at that point.

No it doesn't. It copies the specific data to the GPU eventually.

Consider this. If you map the buffer, generate your light data every frame into that pointer, and unmap it, the worst-case scenario is that the driver will have to DMA-copy the data from the mapped pointer into the buffer object. It will do that at a time of its choosing, but sometime before you do anything that reads from that data. The best-case scenario is that you're writing directly to the buffer object's storage. This is much more likely if you use GL_MAP_INVALIDATE_BUFFER_BIT to invalidate the buffer (since you're overwriting all of its contents).

If you use BufferSubData, you must generate your data into an array of your own, and you give that to BufferSubData. Worst-case, BufferSubData must then copy that array into temporary memory, and later DMA-copy that into the buffer. The reason why is quite simple. If the buffer is currently in use (is going to be read by GL commands that you have already issued that haven't executed yet), then it can't simply overwrite that data. The OpenGL memory model doesn't allow later commands to affect earlier ones. So the implementation must delay the actual DMA-copy into the buffer storage until that storage is no longer in use. And since BufferSubData cannot assume that the pointer it was given will still be around after BufferSubData returns, it must copy that data into temporary memory and DMA from that into the buffer later.

So worst-case with BufferSubData is that there are two temporary buffers. You had to generate your lighting data into one temporary buffer, and OpenGL had to copy it into another temporary buffer.

Best case with BufferSubData is that it is able to do the DMA immediately. But that almost never happens. Why? Because DMAs aren't instantaneous. They're an asynchronous operation. Also, DMAs typically can't happen directly from client memory. So most implementations of BufferSubData are still going to have to copy the buffer into some temporary, DMA-able memory, and then DMA it up to the GPU.

With mapped pointers, odds are very good that, if the pointer you get isn't actually the buffer, it's at least memory that's DMA-ready. So the worst-case scenario for mapping is equal to the best case scenario for BufferSubData.

So yes, if performance is a concern (and at this point, it shouldn't be. Stop prematurely optimizing stuff), mapping will only ever be equally as bad as BufferSubData, and can be a good deal faster.
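The difference between the two paths can be sketched without any GL calls. The `Light` layout and the function names below are hypothetical; `dst` stands in for whatever pointer the mapping returns, and the light values are placeholder data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Light { float position[4]; float color[4]; };  // hypothetical layout

// Mapping-style path: generate `count` lights straight into `dst`.
// In a GL program `dst` would be the pointer returned by glMapBufferRange;
// here it can be any writable memory. Returns the number of bytes written.
std::size_t write_lights(Light* dst, std::size_t capacity, std::size_t count) {
    assert(count <= capacity);
    for (std::size_t i = 0; i < count; ++i) {
        Light l{};
        l.position[0] = static_cast<float>(i);  // placeholder data
        l.color[0] = 1.0f;
        dst[i] = l;  // write-only: the destination is never read back
    }
    return count * sizeof(Light);
}

// BufferSubData-style path: the same data is first generated into an
// application-owned array, which the caller then hands to the upload call.
// This array is the first of the two temporary copies described above.
std::vector<Light> stage_lights(std::size_t count) {
    std::vector<Light> staging(count);
    write_lights(staging.data(), staging.size(), count);
    return staging;
}
```

With mapping, the call would look roughly like `write_lights(static_cast<Light*>(ptr), MAX_LIGHTS, n)` between glMapBufferRange (with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT) and glUnmapBuffer. With the staging path, the vector returned by `stage_lights(n)` would be passed to glBufferSubData, and the driver may then copy it a second time.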

From what you're telling me, mapped pointers are better if I'm rewriting the storage. So if I only allocate once, I should allocate 100 lights during the initialization phase using glBufferData with a null pointer. Then, every frame, I should use a mapped pointer to overwrite the data from lights 0 to current_number_of_lights.

What about using glBufferData with a null pointer every frame just before using the mapped pointer, as described in the streaming techniques link you posted? Will that be reallocating? (I'm under the impression that glBufferData always reallocates.) Or will that be more efficient since it is used to tell the driver that you don't really care about the previous piece of memory? I might be confusing buffer allocation with uniform block allocation, am I? After reading that link you gave me, it seems that using glBufferData with the same size as the initial allocation and with a null pointer will basically be faster, since I will be filling either a new buffer or the old buffer (if it is not being used).

Also, should I use glMapBufferRange with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT ? From that link you gave me, using GL_MAP_INVALIDATE_RANGE_BIT will be an optimization since I'm only writing and not reading. Also, using GL_MAP_UNSYNCHRONIZED_BIT would work since I'm only generating data into it before I actually render. Am I right?

And why should I stop prematurely optimizing? I must admit I'm a perfectionist but isn't optimization good?

Sorry, I know it's a lot of questions but this isn't just for optimization, optimization is just my own way of understanding things thoroughly and I don't want to be someone who just comes here and asks for people to fix stuff, I want to understand so I can teach others as well. In any case, you've already helped A LOT with my understanding of this and I thank you for that.

What part of "OK, let's just cut to the chase. Go read this and implement one of those streaming strategies," did you not understand?

And why should I stop prematurely optimizing? I must admit I'm a perfectionist but isn't optimization good?

No, it isn't. Optimization is a waste of time unless what you're optimizing is actually responsible for the poor performance of your application. There's the general 80/20 rule: 80% of your application's performance is governed by 20% of your code. Until your application is at least remotely close to working, you can't know which 20% is making it slow. And if you don't know what's making it slower, you can't know what to spend time optimizing. So oftentimes, you'll waste time optimizing something completely irrelevant.

Like say, how to efficiently stream a whole 16KB of data per frame to the GPU.

In the time it has taken us to have this discussion, you could have implemented any one of the general strategies you've suggested and moved on to something else. You can come back to this when a profiler tells you that it's making your application slower.

Yes, but I'm not delivering a product. I'm doing research, so my purpose is to understand everything I can. It may take me a while to implement this due to all these discussions of optimization, but once I've done it once, I will understand it well, so the next time I have to implement something I will know exactly how to do it properly. Experienced developers are always doing premature optimization without even noticing. When you are developing something, you probably make design choices that are optimized or as optimized as you can come up with from the top of your head. That's what I'm trying to achieve here. Sure, I am developing something, but my whole purpose in asking you guys is to make sure I fully understand it, so that the next time I do it I will already know how to optimize from the top of my head as I implement it. I don't think there is anything wrong with prematurely optimizing something if you already have the experience to do it right there at that moment without wasting time. Personally, I can't do something that is new to me without trying to fully comprehend it. It comes as incomplete learning for me if I just copy code from a website to use on my application. Don't get me wrong, I'm not saying any other way of doing it is wrong, but that's just how I am.

I read the link you sent me. I only tried to confirm that my interpretation of it applies to what I'm doing here, since you seem to be very experienced and educated on this. I wasn't trying to be lazy and make you explain what's in that link for me; I simply read it and still had some questions about how to apply it to what I'm doing. Most of those questions are simply YES or NO questions. Sorry if I should have known the answers just by reading the link you gave me but, though it may not seem so to you, English is neither my first nor my second language. I am not a native speaker, and sometimes I have trouble understanding things clearly through text and need some human confirmation of my own interpretation.

I understand if you don't want to help me fully understand rather than just use what's on the link. So if you're not helping any more, know that you were already helpful, not just with the other posts but also with that link, since it made me understand things better even if not fully.

- Initially use BufferData with a NULL pointer to specify the data storage size; do this only once.
- In your case you probably should use STREAM_DRAW usage.
- Use fixed size array(s) with maximum number of lights size.
- Pass number of lights actually used (since length would only tell the max size) to shaders using a uniform.
- When you update the buffer, you can map it all with explicit flushing, and manually flush only the first N lights which are in use.
- Use the invalidate bit. Invalidating the whole buffer is probably best.
- There is no need for BufferData(NULL) - that is just an older way to say invalidate.
- Using unsynchronized bit may be unsafe. When you update the data with the CPU, GPU may still be using the older data for previous frame. However, it is still worth experimenting with it. I found that it gives more performance and rendering errors were not an issue in my case.

In general, I would only use BufferData with NULL data and always use MapBufferRange to specify buffer contents, and never use BufferSubData. However, older OpenGL and unextended OpenGL ES versions before 3.0 do not have MapBufferRange. To support those, you could create an abstraction for buffers with MapBufferRange and flush operations; these can be implemented using Buffer(Sub)Data calls if they are not available in GL.
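The "map it all, flush only the first N lights" suggestion from the list above comes down to a small range computation. A minimal sketch follows, assuming a hypothetical `Light` layout and the `MAX_LIGHTS = 100` figure used earlier in the thread; `ByteRange` and `flush_range` are illustrative names, not part of any API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct Light { float position[4]; float color[4]; };  // hypothetical layout
constexpr std::size_t MAX_LIGHTS = 100;

struct ByteRange { std::size_t offset; std::size_t length; };

// Range covering the first `active` lights of a buffer that was allocated
// once for MAX_LIGHTS. `active` is clamped so the range can never extend
// past the allocation.
ByteRange flush_range(std::size_t active) {
    const std::size_t n = std::min(active, MAX_LIGHTS);
    ByteRange r{0, n * sizeof(Light)};
    assert(r.offset + r.length <= MAX_LIGHTS * sizeof(Light));
    return r;
}
```

In a GL program the buffer would be mapped with GL_MAP_WRITE_BIT | GL_MAP_FLUSH_EXPLICIT_BIT (plus an invalidate bit), filled, and then `r.offset` and `r.length` would be passed to glFlushMappedBufferRange before glUnmapBuffer. The offset given to that call is relative to the start of the mapped range, which is why it stays 0 here. The clamped light count would go to the shader in a separate uniform.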

Then understand that "premature optimization is the root of all evil". The 80/20 rule, or sometimes the even more strictly constrained 90/10 rule, is an established fact whose applicability has been observed by generations of developers on a multitude of systems. If you don't have substantial profiling data and know exactly which parts of the application are a limiting factor, you cannot really optimize anything.

Experienced developers are always doing premature optimization without even noticing.

For instance?

you probably make design choices that are optimized

You cannot make optimized design choices. A design is an abstract perspective on how components of your application work and interact. You can implement that design in any way possible in the language of your choice. How optimal the resulting code will be depends not only on the patterns and idioms you follow that apply to your programming language, but in large part on the compiler, the platform (OS) and the CPU architecture. Obviously, for some designs you know that there is no way of naively implementing them and getting good performance in the end. I argue, however, that the code in itself can still be fairly optimal while performance is constrained by other factors that have nothing to do with the quality of your code, like I/O performance (hard disks, network and so on), or a crappy operating system, etc. In general, the design alone cannot speak to the optimality of the resulting code and the performance of your application.

As I said, there are rules and idioms one should obey, like avoiding unnecessary copies of large sets of data and so on (and in this instance, deciding that a function take its arguments by reference is actually a design choice that will probably lead to faster code), but all in all that is not optimization: not doing so is actually premature pessimization - unless you have a good reason not to follow the general rule. So is the above mentioned choice of an obviously poor design.

as optimized as you can come up with from the top of your head

What you come up with from the top of your head is seldom optimal. If you want it really fast, you're gonna have to profile and check the data - always.

It comes as incomplete learning for me if I just copy code from a website to use on my application.

Where did you get a recommendation to do so?

I don't know your background, but I assume you're either a student or rather fresh post-grad, and I don't know if everyone here will agree with me, but please don't let senseless perfectionism take over. You're not gonna get anywhere if you try to tweak every single function and every expression in your code. Ivory tower thinking isn't well applicable in the real world.

I understand if you don't want to help me fully understand rather than just use what's on the link. So if you're not helping any more, know that you were already helpful, not just with the other posts but also with that link, since it made me understand things better even if not fully.

That is so not the point. The point is: You were already given sufficient help to tackle your problem at hand, at least on a basic level. If you have specific questions, no one on this forum will deny you their help until you understand what to do. In regards to performance, however, specific means providing actual data and pieces of code responsible for that data. If it's crappy, we'll tell you. If it's OK and you just can't do better on your current hardware, we'll tell you. If you don't seem to get what you're doing at all, we'll tell you. Personally, I think we got a very nice and helpful community here - you could do much, much worse.

Also, how can anyone actually say that they fully understand everything they do? How, pray tell? Do you know exactly how your hardware works? Do you know exactly what code your GLSL compiler generates for your current GPU? I could go on ... but I suspect the answer is "no!". Being able to fully understand everything you do when developing software is an illusion. Period.

As a general rule: First make it correct (which implies that it works in general) - then make it fast. This is exactly what Alfonse already told you above:

Originally Posted by Alfonse

In the time it has taken us to have this discussion, you could have implemented any one of the general strategies you've suggested and moved on to something else. You can come back to this when a profiler tells you that it's making your application slower.