This guy had a problem (slow FPS in a game), knew the general shape of how Wine works, and used the tools he knew to try to fix it.

I've never written a line of DX, GL, etc., but I'm familiar with the command buffers, driver synchronization, and AZDO concepts this article mentions.

I also play WoW on Linux, and am kind of embarrassed I didn't think to try perf monitoring the game for easy-to-fix huge slowdowns like this. I kind of assumed that since WoW is one of the most popular Wine games, and generally pushes the DX API support to make sure it always works, the main Wine devs would have optimized it more.

That being said, buffer_storage is a GL 4.4 extension and Wine has this awful habit of trying to strictly support OSX, which will never see OpenGL beyond 4.1, and I'm not sure if buffer_storage is available there. That alone might mean these patches are never merged mainline, which would be... inconvenient.

"Fundamentally, it’s a function that maps a slice of GPU memory into the host’s address space, typically for streaming geometry data or texture uploads"

Can someone clarify this for me? Are OpenGL/D3D buffers that get stuff memcpy'd into them by the CPU actually "slices of GPU memory," or are they more often reserved driver memory that eventually get DMA'd to the GPU? (I realize both probably happen at different times, but I'm curious which is more typical for modern systems)

It seems like spending CPU cycles writing every byte over the bus would perform much worse than a fast write to sysmem followed by a DMA transfer.

EDIT:
I looked into it, and it seems like the typical implementation is that map returns a pointer to some pinned driver sysmem and unmap kicks off an async DMA to GPU memory.

> Can someone clarify this for me? Are OpenGL/D3D buffers that get stuff memcpy'd into them by the CPU actually "slices of GPU memory," or are they more often reserved driver memory that eventually get DMA'd to the GPU?

The answer is, as with all things OpenGL, it depends. You might get back a pointer to GPU memory that you can directly write to, or you'll get back some chunk of system memory the driver has.

The ARB_buffer_storage extension improves matters as you can almost guarantee that you'll get GPU memory, and you can keep it mapped for the entire lifetime of your application (the old buffer APIs wouldn't let you keep it mapped during a draw call). The downside is that you're now responsible for synchronising access to that data.

But as for "is it quicker?", maybe. DMA transfers aren't free; they take time to set up. Usually they need to operate from a limited pool of source memory. If the driver has to take a local copy of your data to copy it (which it will do for every glBufferData/SubData call), then you might as well copy it yourself, GPUs aren't hurting for PCIe bandwidth these days. In addition, you can use a separate thread/CPU core to do the copy, since unlike every other OpenGL call, mapping memory and memcpy'ing doesn't require an OpenGL context.


But I was left wondering what the actual problem was. Why is glMapBuffer slow? Is it just the impedance mismatch between D3D and GL, which don't have the same synchronization guarantees for that specific call? Why does Wine have its own command handling thread when, very likely, the underlying OpenGL driver has one too?

glMapBuffer doesn't have any ability to declare that you won't overwrite data. All you can say is whether you want read/write access to the buffer. The driver therefore has to assume that the client might overwrite in-flight data, so synchronization is required.

As for the command stream handling, it makes decent sense to do the translation up-front, turning your drawing commands into a command stream so a separate thread can just hammer through it as fast as possible, rather than doing GL calls in-line with the translation. Partly so the game can return to doing its thing as fast as possible, and partly to fix issues with GL's threading model being horrible ( see e.g. https://bugs.winehq.org/show_bug.cgi?id=24684 )

Wine uses glMapBufferRange when available, and has logic to translate D3DLOCK_NOOVERWRITE to GL_MAP_UNSYNCHRONIZED_BIT. I don't understand what the current implementation is doing that is causing wined3d_resource_map to block.

This is an excellent question. The current wine implementation using OpenGL should mirror the behaviour of D3D, since it is using glMapBufferRange and passing GL_MAP_UNSYNCHRONIZED_BIT when possible. I am also wondering what is actually going on that is hurting performance.

EDIT:

After further research and more thought, I suspect that the "pipeline stall" doesn't involve waiting for the GPU to complete work using the buffer, just waiting for the driver. The map/unmap with overwrite or discard is working as intended, but the persistent buffer heap he implemented outperforms it because it reduces the number of calls into the driver required.

I initially had the impression that the existing wine implementation was somehow deficient, but really what the author did was find a way to use the new persistent buffers feature to optimize D3D code using the older per-frame map/unmap method.

This is in fact what the post essentially said (after a re-read), I just misunderstood and thought the pipeline stall was actually waiting for the GPU. The note in the post about the "GPU" line really being the driver is important.

There are two main parts to the stall, which aren't well illustrated by the diagram (I'll get on updating it):

1. Waiting for the resource to exit the command stream (wined3d_resource_wait_idle).

2. Waiting for the CS thread to finish after the map (occurs in wined3d_cs_map).

It's a pipeline stall because the D3D thread has to wait for the CS thread to do things, and thus is unable to dispatch more commands to the CS (and thus the GPU) during this time. I don't consider the actual glMapBufferRange to be part of the stall.

Another option is Gallium Nine, which uses a D3D9 state tracker directly in the driver, skipping the GL layer entirely. https://wiki.ixit.cz/d3d9 (though on NVIDIA, nouveau will probably be slower than the proprietary GL driver)

Excellent writeup. I would be curious as to the specific considerations given to a heap allocator on the GPU. Related: I'm not too familiar with Wine patches - what is the easiest way to view the final source code of this patch?