Multithreaded Texture downloading issue

Hello,

I am writing a video compositing application that uses several video sources (SDI, NDI, CEF3 browser, libmpv player, etc.) as textures and composes the final image into another texture, which is displayed on screen and also downloaded to be sent out as NDI. All textures are uploaded and downloaded asynchronously via PBOs. Everything works fine with a single-threaded OpenGL renderer.

Now I am trying to download the final textures in a dedicated thread. Each frame is rendered into a round-robin list of 3 textures; then I call glFenceSync and signal the download thread. The download thread issues glWaitSync and starts the download of the current texture. Then it creates another fence with glFenceSync and maps the PBO from the previous frame to send it via NDI.

The problem is that it sometimes downloads an unfinished image that contains only the background color or only one video layer. It depends on GPU utilization and happens about once every few seconds (1080p60). On screen everything is OK; the problem exists only in the downloaded image. It looks like synchronization is being missed.
Under VMware everything works fine in both modes (single-threaded and multithreaded).
Could you help me fix this issue? I am out of ideas. My system is Ubuntu 16.04 with an NVIDIA 1050 Ti video card and the 384.11 driver.
Also see the attached OpenGL log for one of the frames in frames.zip.

It looks like glWaitSync does not work at all. If I put a usleep before the download in the thread, everything works fine. If I only call glWaitSync, the download starts immediately, before everything has been rendered to the texture. I have checked that the correct sync object is passed to glWaitSync, but for some reason it does not work:
291,,void glFlush(),0x00007f45cc6fe478,358,-,8978
292,,"GLsync glFenceSync(GLenum condition = GL_SYNC_GPU_COMMANDS_COMPLETE, GLbitfield flags = 0) = 3",0x00007f45cc6fe478,3454,-,8978
293,,"void glBindFramebuffer(GLenum target = GL_FRAMEBUFFER, GLuint framebuffer = 0)",0x00007f45cc6fe478,32558,-,8978
294,,"void glWaitSync(GLsync sync = 3, GLbitfield flags = 0, GLuint64 timeout = 18446744073709551615)",0x00007f459439e818,924,-,8997
295,,void glDeleteSync(GLsync sync = 3),0x00007f459439e818,2802,-,8997

0x00007f45cc6fe478 is the rendering thread and 0x00007f459439e818 is the download thread.

When you say "downloading", are you talking about copying the data to a GPU buffer (e.g. PBO) or to client memory?
If you're copying to client memory, you need to use glClientWaitSync() to block the CPU thread until the sync object is signalled.

I mean downloading from GPU to CPU via glGetTexImage. But why do I have to wait with glClientWaitSync? My understanding is that when I issue glWaitSync, the GL server has to queue the wait before the actual download. In a single thread it works without any synchronization because there is a single queue, but in a multithreaded application I have to sync both queues with glWaitSync. Also, all the NVIDIA papers ( http://on-demand.gputechconf.com/gtc...-Transfers.pdf ) indicate that I have to use glWaitSync (not glClientWaitSync()). That is exactly what I do. The only difference is that NVIDIA indicates the download thread should download the previous frame's texture, whereas I start the download for the current frame's texture and finish the download for the previous frame's texture. But if glWaitSync works, that should not matter.

I mean downloading from GPU to CPU via glGetTexImage. But why do I have to wait with glClientWaitSync?

Because that's the operation you want to do. The "client" is the CPU. `glWaitSync` tells the GPU to wait on the completion of a fence before processing further commands. But what you want is the CPU to wait until the completion of the fence. That's spelled "glClientWaitSync".

NVIDIA's PDF is wrong.

I am not sure. I am using asynchronous PBO transfers, which run on the GPU and are asynchronous for the CPU. In a single thread this works without any delays on the CPU side. glGetTexImage returns immediately in both cases because it only puts the command into the GPU queue. On the next frame I call glMapBuffer on this PBO, which has to wait on the client (application) side, but by that point the transfer has already finished. In a single thread, glGetTexImage puts the command at the right place in the queue, and the transfer starts once rendering has finished. All of the waiting happens on the GPU side. The problem is that in a multithreaded application glWaitSync does not actually sync the two queues on the GPU.

Let's see how it works in the single-threaded case:
APP: CLEAR N->RENDER N->GET IMAGE ASYNC N->MAP+UNMAP IMAGE (N-1)->SWAP
GPU: CLEAR(N-1)->RENDER(N-1)->TRANSFER(N-1)->CLEAR(N)->RENDER(N)->TRANSFER(N)

The APP thread blocks only on MAP, to wait for the previous frame to finish, and on SWAP, to wait for VSYNC.