avoiding the default framebuffer blit overhead

Hi,

First I will describe the problem.
As we know, the default framebuffer (0) is a remnant from the past that OpenGL, for some mysterious reason, is still dragging along like a bag of stones.
It is very inflexible and totally alien to many modern ways of doing things, e.g. deferred rendering.
One often needs to combine various color/depth/stencil buffers freely, which is easy with the FBO infrastructure.
But when we need to display something, there is a problem. The final image to be displayed is often not generated in the default framebuffer,
because we need the flexibility of FBOs. For example, we may need the depth buffer used to render the scene to be available as a texture or something.

Then we need to blit to the default framebuffer. This adds overhead, which may be something like 1-2 milliseconds per frame.

In Direct3D, the color buffer that can be displayed (the swapchain backbuffer) is a pure color-only object from the renderer's point of view and can be combined with other buffers just like the non-displayable ones.
This is unlike the OpenGL default framebuffer, which drags its own depth buffer along (or has none) and cannot be changed.

I experimented a bit with NVIDIA's WGL_NV_DX_interop2 extension.
I created a D3D11 device with its swapchain, then, using the extension, set up an OpenGL renderbuffer that corresponds to the swapchain backbuffer.
Then I did some rendering in OpenGL while using D3D's way of presenting the image to a window.
After some tweaking I managed to make that run faster than OpenGL's own blit-based path.

All the rendering was just a glClear(GL_COLOR_BUFFER_BIT) and then present the result.

The mentioned tweaking included removing the synchronization calls (wglDXLockObjectsNV and wglDXUnlockObjectsNV):
I call wglDXLockObjectsNV only once and the objects stay locked the whole time (otherwise OpenGL generates GL_INVALID_FRAMEBUFFER_OPERATION).

The render loop is basically:

    glClearColor(0, rand() % 256 * (1.0f / 256), 0, 1);
    glClear(GL_COLOR_BUFFER_BIT);
    glFlush();
    sc->Present(0, 0);

The backbuffer of the swapchain is bound to the OpenGL draw framebuffer.

Also, when the swapchain is created, BufferUsage must include the DXGI_USAGE_RENDER_TARGET_OUTPUT flag; otherwise performance is crippled.
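For anyone who wants to try this without digging through the attached source, the registration sequence looks roughly like this (a sketch, not my exact code; it assumes the d3d11 device and swapchain already exist, a current GL context, and the wglDX* entry points loaded via wglGetProcAddress; all error checking omitted):

    // Grab the swapchain backbuffer (swapchain created with
    // DXGI_USAGE_RENDER_TARGET_OUTPUT in BufferUsage, as noted above).
    ID3D11Texture2D* backbuffer = nullptr;
    sc->GetBuffer(0, __uuidof(ID3D11Texture2D), (void**)&backbuffer);

    // Open the interop device and register the backbuffer as a GL renderbuffer.
    HANDLE interopDev = wglDXOpenDeviceNV(d3dDevice);
    GLuint rbo;
    glGenRenderbuffers(1, &rbo);
    HANDLE interopRB = wglDXRegisterObjectNV(interopDev, backbuffer, rbo,
                                             GL_RENDERBUFFER,
                                             WGL_ACCESS_READ_WRITE_NV);

    // Attach it to an FBO so GL draws straight into the swapchain buffer.
    GLuint fbo;
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo);
    glFramebufferRenderbuffer(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                              GL_RENDERBUFFER, rbo);

    // Lock once and leave it locked, as described above. Note that
    // skipping the per-frame lock/unlock is outside what the extension
    // spec guarantees, so treat this as an experiment, not a recipe.
    wglDXLockObjectsNV(interopDev, 1, &interopRB);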

It is a shame that this ugly hack actually outperforms OpenGL's native way of outputting its graphics.
I think it is about time they got rid of the default framebuffer.
They could look at the iPad for an idea of how to do it.

Here is the test source if someone is interested in trying it.
Change "mode" to select among the 3 cases I mentioned - see the comment.
Ah, also, "start" is the program entry point (I set that in the linker options); you can rename it to WinMain or whatever.

Also, you never said what your actual results are, only that one was "noticeably slower". Oh, and I would be curious to see what you would get via query objects - that is, measuring the GPU time rather than the CPU time.

You have the source; feel free to test with QueryPerformanceCounter, queries, and whatever you like. The results I got were telling enough for me.
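For reference, a GPU-side measurement with timer queries would look roughly like this (a sketch assuming a current GL 3.3+ context with the entry points loaded, and with w/h standing for the framebuffer size; not part of the attached test program):

    // Time the blit on the GPU with GL_TIME_ELAPSED (core since GL 3.3).
    GLuint q;
    glGenQueries(1, &q);
    glBeginQuery(GL_TIME_ELAPSED, q);
    glBlitFramebuffer(0, 0, w, h, 0, 0, w, h, GL_COLOR_BUFFER_BIT, GL_NEAREST);
    glEndQuery(GL_TIME_ELAPSED);

    GLuint64 ns = 0;
    glGetQueryObjectui64v(q, GL_QUERY_RESULT, &ns); // blocks until the result is ready
    printf("blit took %.3f ms on the GPU\n", ns / 1e6);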

Both d3d-present cases did about 1000 fps on my machine, and the opengl-present case did somewhere between 500 and 600 fps.
The gl-clear/d3d-present case did a bit lower than pure d3d, but the difference was marginal.

To me it is clear that the gl-present case involves one additional buffer copy compared to the d3d-present cases.

The biggest question I have is this... what if you're not doing it the way you describe?

Consider the case of actually rendering something for real. You're doing deferred rendering; OK, fine. You have your g-buffers, where you have your actual data. Then you convert this into light reflectance as seen by the camera. But if you're doing HDR (which, let's be honest, is far more of a no-brainer than deferred rendering by this point), you're doing all of this accumulation into a floating-point buffer. You can't "present" that; you need to tone-map it first. Not only that, you probably have some transparent objects to render, so you need to do some blending. This should presumably be done in HDR space.

Now it's time to tone-map down to SRGB8_ALPHA8. But where should the output go? Why not... the default framebuffer?

In short, I'm not seeing the problem here. Your problem seems to be that you don't want to use the default framebuffer (as stated by your passive-aggressive introduction). That's fine, but... it's still there.

No matter how many threads on this forum you make, no matter how many alternative rendering systems you write, no matter how much you want it to be so, it's still there. It was there in OpenGL 4.1. It was there in OpenGL 4.2. It was there in OpenGL 4.3. Next year, it will still be there in OpenGL 4.4/5.0/etc. Whether you want to use it or not, it is there and available for use. So if you can, use it. And in most real cases, you can. So use it, and you won't have to worry about that copy being slow, since you won't be doing a copy.

If you spend more time using the API you have, rather than the API you want, you'll be a lot happier.

To me it is clear that the gl-present case involves one additional buffer copy compared to the d3d-present cases.

You're not honestly showing this code off because you had the revelation that copying is slower than not copying, are you?

In Windows Vista+, since the desktop is rendered through D3D, you do get an additional copy. But with Direct3D 9Ex there is a way to render more efficiently: basically, it passes a pointer to the surface so DWM can render it directly instead of requiring an extra copy. I guess this won't be possible under OpenGL. But really, the cost of one blit is nothing to worry about anyway.

I myself thought one extra blit would be nothing, but it turned out to have a noticeable impact on lower-end hardware. It's nothing spectacular and certainly not a show stopper, but it can still be bigger (depending on the GPU memory bandwidth) than many other things people make efforts to optimize.