Bad multi GPU performance scaling

Iím having some troubles running my OpenGL renderer in multi-gpu configuration. There are two Quadro graphics cards in my computer with one monitor connected to each card. My renderer creates two windows, one on each monitor with correct gpu affinity. After that, two rendering threads are created, each with itís own rendering context and with a local copy of data to render. Thereís no data sharing or synchronization between threads and also thereís no data sharing between render contexts.

The trouble is that there's almost no performance scaling and GPU utilization is below 50%. The framerate is exactly the same as when rendering on only one GPU.

I can run this renderer in a one-thread/one-window configuration. In that case, the selected GPU's utilization is almost 100% and the frame time is exactly half of the frame time in the situation above. Surprisingly, by running two instances of this renderer I can achieve perfect utilization of both GPUs.

I have observed similar behavior on Windows 8.1 64-bit and also on Linux, both running the latest NVIDIA drivers.

- What do I have to do to achieve good scaling from within a single-process/multiple-render-threads configuration? Do I need a special driver profile for my app? Are there other conditions to meet?

- Under Windows 8.1, it seems that GPU affinity is set correctly by default from the initial window position. I can verify with the Nsight performance profiler that each thread is sending commands to a different GPU.
- Under X11, there are two X server displays, and Xinerama is disabled.

For the purpose of testing, there's no data upload during the render loop: just binding of textures, binding of vertex buffers, and glDrawArraysInstancedBaseInstance calls. Data for each draw call is sourced from a shader storage buffer using gl_BaseInstanceID.
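To make the workload concrete, the per-frame work in each thread looks roughly like the sketch below. Object handles, counts, and offsets are placeholders, not values from my actual renderer:

```c
// Minimal per-frame loop as described above: no uploads, only binds and
// instanced draws. All GL objects (vao, tex, ssbo) were created at setup.
#include <GL/glcorearb.h>

void render_frame(GLuint vao, GLuint tex, GLuint ssbo,
                  GLsizei vertex_count, GLsizei instance_count)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    glBindVertexArray(vao);
    glBindTexture(GL_TEXTURE_2D, tex);

    // Per-draw data lives in an SSBO; the shader indexes into it from the
    // base instance (ARB_shader_draw_parameters).
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);

    // One call per batch; the base instance selects this batch's slice
    // of the SSBO.
    glDrawArraysInstancedBaseInstance(GL_TRIANGLES,
                                      0,              /* first vertex  */
                                      vertex_count,
                                      instance_count,
                                      0               /* base instance */);
}
```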

This is the same behavior I observed six years ago while trying to use three NVIDIA Quadro GPUs in a single system. I reported it to NVIDIA, who mentioned something about their OpenGL driver serializing all work within a process (but, as you noticed, not across different processes). They tracked the bug for a couple of years, didn't fix it, and it sounds like it must still be a problem today. For what it's worth, their Direct3D driver doesn't have this limitation.

At the time, their GPU affinity extension was being pushed alongside QuadroPlex systems and a paper on how well it scaled across multiple GPUs, so it was pretty surprising to find out that in practice the advertised scaling couldn't actually be obtained. Spending $10k on GPUs just to try out a feature based on that advertising was a pricey mistake...

I have switched back to GLFW, and after creating two fullscreen windows it's finally working! I'm getting around 85% performance scaling (4.45 ms for one window, 5.15 ms for two windows/threads on two GPUs). I was expecting slightly better results, but it's better than nothing. Later I tested two older AMD 5870 cards and got almost 100% scaling.

Funny thing: at the beginning I was using GLFW and fullscreen without much luck. But there might have been a bug on my side, because both contexts and windows were created from within the main thread. Because of that, I switched to custom window-creation code and never tried to create a fullscreen window again.
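For anyone hitting the same issue, the working arrangement can be sketched with GLFW roughly as below. Window sizes, titles, and the thread-spawning details are placeholders; the key points are one fullscreen window per monitor, no context sharing, and each context made current only on its own render thread:

```c
// Sketch: one fullscreen window per monitor, one render thread per window.
// GLFW windows must be created on the main thread; each context is then
// made current on its own render thread.
#include <GLFW/glfw3.h>

static void render_thread(GLFWwindow *win)
{
    glfwMakeContextCurrent(win);       // bind this context to this thread
    while (!glfwWindowShouldClose(win)) {
        /* ... bind and draw ... */
        glfwSwapBuffers(win);
    }
}

int main(void)
{
    if (!glfwInit())
        return 1;

    int count = 0;
    GLFWmonitor **monitors = glfwGetMonitors(&count);
    if (count < 2)
        return 1;                      // need one monitor per GPU

    // Passing a monitor makes the window fullscreen on that monitor;
    // the last argument is NULL so the contexts share nothing.
    GLFWwindow *a = glfwCreateWindow(1920, 1080, "gpu0", monitors[0], NULL);
    GLFWwindow *b = glfwCreateWindow(1920, 1080, "gpu1", monitors[1], NULL);

    /* spawn two threads running render_thread(a) and render_thread(b),
       then loop glfwPollEvents() on the main thread ... */

    glfwTerminate();
    return 0;
}
```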

Thank you for your inputs! I was starting to think that it's not possible to get it working...