glDrawElements() blocking in CPU - takes 5ms to return

I am having a strange issue where calling glDrawElements appears to block, stalling the CPU for a moment while it waits for the function to return. I placed a system timer immediately before and after my call to glDrawElements, and the function consistently takes between 5 and 7 milliseconds to return.
First, this is very slow. I'm only rendering about 4000 vertices. Second, I thought the whole point of using VBOs was to eliminate immediate mode, so that glDrawElements would just issue commands to the GPU and the CPU can continue working while the GPU works at its own pace. This does not appear to be happening at all.

As I mentioned, my data is stored entirely in VBOs, which are initialized and filled once. I am using very simple shaders. The vertex shader is handed a 20-element array of mat4s as uniforms before each render; the pre-filled VBOs are merely bound and the vertex attribute pointers set.
The only thing I actually time is the call to glDrawElements, which, as I mentioned, takes about 5 to 7 ms.
Here are the general specs for my setup. Pretty modern. Should be fine.

Doesn't glDrawElements pass indices? Still seems very slow. You should be using an element buffer I think. I am not sure, but graphics hardware seems to be only able to manage one task at a time. So there could be interaction with other applications running on your computer.

EDITED: You might also make sure that your indices are aligned. If you just allocated them with new (assuming C++) then you should be fine. Anyway, passing unaligned memory can be incredibly slow. Just a hunch.

God have mercy on the soul that wanted hard decimal points and pure ctor conversion in GLSL.

glDrawElements doesn't just draw. Most modern drivers operate in a "lazy mode": state changes, shader changes, texture changes, etc. are cached locally by the driver as they happen, then evaluated and validated when a draw call occurs. What you're getting in your timing is the result of all of this as well as the cost of the draw call itself. You can confirm this by issuing a second draw call immediately afterwards with the exact same parameters and timing that too; you're most likely going to find that it returns almost immediately.
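To make the back-to-back test concrete, here is a minimal sketch, assuming C++11 and that you wrap your real draw in a lambda. The names `time_call_ms`, `time_two_draws` and `DrawTimings` are made up for illustration and are not part of any API.

```cpp
#include <chrono>
#include <utility>

// CPU wall-clock time, in milliseconds, spent inside one call to fn.
template <class Fn>
double time_call_ms(Fn&& fn) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    std::forward<Fn>(fn)();
    const auto stop = clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

struct DrawTimings {
    double first_ms;   // may include deferred state validation
    double second_ms;  // same parameters, so nothing new to validate
};

// draw is a hypothetical callable wrapping the real call, e.g.
//   [&] { glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, nullptr); }
template <class Draw>
DrawTimings time_two_draws(Draw&& draw) {
    DrawTimings t{};
    t.first_ms = time_call_ms(draw);
    t.second_ms = time_call_ms(draw);
    return t;
}
```

If `first_ms` comes back around 5 and `second_ms` is close to zero, the cost is most likely the stored-up state work rather than the draw itself.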

So based on that, the excessive time is going to be on account of something that happened before the glDrawElements call (but which the driver just stored up at the time it happened, and is only doing for real when the draw call is made), and the most likely looking suspect is that 20-element array of mat4s. Some info on how you're sending that to the driver will help in diagnosing further.

It's also possible that the timing functions you're using are not accurate (e.g. you might be using something like GetTickCount which has very poor resolution) in which case wrong times are to be expected. That's what you should double-check first.
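A quick way to sanity-check the timer is to measure the smallest step it ever advances by. This is only a sketch, assuming C++11 and `std::chrono`; `observed_tick_us` is a hypothetical helper name, and you would substitute whatever clock you are actually using.

```cpp
#include <chrono>

// Smallest step, in microseconds, that Clock was observed to advance by.
template <class Clock>
double observed_tick_us(int samples = 100) {
    double smallest = -1.0;
    for (int i = 0; i < samples; ++i) {
        const auto t0 = Clock::now();
        auto t1 = Clock::now();
        while (t1 <= t0) t1 = Clock::now();  // spin until the reading moves forward
        const double step =
            std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (smallest < 0.0 || step < smallest) smallest = step;
    }
    return smallest;
}

// Example: double tick = observed_tick_us<std::chrono::steady_clock>();
```

If the tick turns out to be in the millisecond range (GetTickCount is typically 10 to 16 ms), a 5 ms reading tells you nothing.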

^Yeah that makes sense. I was just reading an article on Wikipedia yesterday (comparing D3D and OpenGL) that claimed D3D's weakness was an inability to buffer user mode calls (before switching to kernel mode), presumably because the "IHV" layer Microsoft engineered does not allow for it. But the article says this only plagues D3D9 and was corrected for 10, I think. Still, ~5ms is a long time. A whole frame at 60fps is only about 16.7ms.

EDITED: 20-element array sounds like 5 matrices which seems like nothing. I've always assumed uploading all of the program registers at once would not be a big deal (unless your hardware supports a whole lot more than are normally required)

D3D9 and below actually do buffer user-mode calls; the claim that they don't is 100% a myth. The cost was in validation.

I'm reading the 20-element array as being 20 matrices, which equates to 80 vec4s. Worst case is 80 glUniform4f calls here, but even 20 glUniformMatrix4fv calls can be quite heavy (especially if each one is also accompanied by a run-time glGetUniformLocation). Of course it can also be done with a buffer object (which - if careless - may involve a CPU/GPU sync but the time for that wouldn't be expected to be measured with glDrawElements) or even a single glUniformMatrix4fv call, so we need to know how the OP is setting these.
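For reference, the single-call version could look roughly like this. It is a sketch under assumptions: the shader declares something like `uniform mat4 bones[20];` (the name is invented), the location was fetched once at init time, and `Mat4`, `kMatrixCount` and `upload_matrices` are illustrative names. The `upload` parameter stands in for `glUniformMatrix4fv` so the snippet doesn't depend on any particular GL loader.

```cpp
#include <array>
#include <cstddef>

// Column-major 4x4 matrix, the layout OpenGL expects when transpose is GL_FALSE.
struct Mat4 { float m[16]; };
static_assert(sizeof(Mat4) == 16 * sizeof(float), "Mat4 must be tightly packed");

constexpr std::size_t kMatrixCount = 20;  // array size from the original post

// upload is any callable shaped like
//   glUniformMatrix4fv(location, count, transpose, value)
// so in real code you would pass glUniformMatrix4fv itself.
template <class UploadFn>
void upload_matrices(UploadFn&& upload, int location,
                     const std::array<Mat4, kMatrixCount>& matrices) {
    // One call covers all 20 matrices because they sit contiguously in memory.
    upload(location, static_cast<int>(matrices.size()),
           /*transpose=*/false, &matrices[0].m[0]);
}
```

The point is that the location lookup happens once, outside the render loop, and the per-frame cost is one API call instead of 20 or 80.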

As a complement to measuring CPU time, use a query to measure the time as reported by the GPU. But be careful about asking for the result, as it may stall the pipeline. One way is to issue the query around the draw in one frame and only read its result at the next iteration, by which time it is most likely available.
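A rough sketch of that read-it-next-frame pattern follows, assuming OpenGL 3.3 or ARB_timer_query. `DeferredGpuTimer` and the `Backend` interface are invented for illustration; the comments show which real GL calls the backend would wrap, and the two query ids would come from `glGenQueries(2, ids)`.

```cpp
#include <cstdint>

// Backend must provide begin(id), end() and result_ns(id). With real OpenGL
// these would be thin wrappers (an assumption about your setup):
//   begin(id)     -> glBeginQuery(GL_TIME_ELAPSED, id)
//   end()         -> glEndQuery(GL_TIME_ELAPSED)
//   result_ns(id) -> glGetQueryObjectui64v(id, GL_QUERY_RESULT, &ns); return ns;
template <class Backend>
class DeferredGpuTimer {
public:
    DeferredGpuTimer(Backend& gl, unsigned id_a, unsigned id_b)
        : gl_(gl), ids_{id_a, id_b} {}

    // Wraps this frame's draw in a query and returns LAST frame's GPU time in
    // milliseconds (negative on the first frame, when nothing can be read yet).
    template <class Draw>
    double frame(Draw&& draw) {
        const unsigned current = ids_[index_];
        const unsigned previous = ids_[1 - index_];
        gl_.begin(current);
        draw();
        gl_.end();
        double last_ms = -1.0;
        if (has_previous_) last_ms = gl_.result_ns(previous) / 1.0e6;
        has_previous_ = true;
        index_ = 1 - index_;  // alternate between the two query objects
        return last_ms;
    }

private:
    Backend& gl_;
    unsigned ids_[2];
    int index_ = 0;
    bool has_previous_ = false;
};
```

Reading `GL_QUERY_RESULT` can still block if the previous frame has not finished on the GPU; checking `GL_QUERY_RESULT_AVAILABLE` first avoids that at the cost of occasionally skipping a sample.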

^mhagain, Direct3D9 lets you upload as many consecutive registers (of a class) as you need to. Which seems reasonable, as I would imagine the best approach would be to stream them all up in a block if possible. But I can also imagine the driver building a buffer and tagging each register in the upload for random access update. I recently programmed a lot with OpenGL ES (WebGL) and I don't remember there being an API for updating a block of registers (which would probably be very helpful for Javascript; as would bringing back display lists I think) but I did not really look. I guess OpenGL cannot do that then? Either way I don't think it would matter much beyond the unnecessary (presumably user mode) function calls.

For the record. I am not sure the spec (above) even explains it, but it sounds like (from a little searching about) you can pass a multiple of 16 sized array. But I am not 100% positive that you can select out the individual matrices for use in your script with the Float32Array spec. That may be a fundamental limitation of Javascript.

Sounds like I have some homework to do.

Doesn't glDrawElements pass indices? ... You might also be sure that your indices are aligned. If you just allocated them with new (assuming C++) then you should be fine. Anyway, passing unaligned memory can be incredibly slow. Just a hunch.

Forgive me. I'd forgotten how glBindBuffer interacts with glDrawElements. Look into it if you've not heard of it. Otherwise disregard these comments. I cannot edit the original post for correctness at this point.
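For anyone who hasn't run into it, here is a rough sketch of the interaction, assuming 16-bit indices (`GL_UNSIGNED_SHORT`). `index_byte_offset`, `ibo`, `count` and `first` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// With a buffer bound to GL_ELEMENT_ARRAY_BUFFER, the last argument of
// glDrawElements is not a pointer to client memory but a byte offset into
// the bound buffer, passed through the pointer parameter.
inline const void* index_byte_offset(std::size_t first_index) {
    return reinterpret_cast<const void*>(first_index * sizeof(std::uint16_t));
}

// Typical use (ibo is a buffer object that already holds the indices):
//   glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
//   glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT, index_byte_offset(first));
```

So with the indices in a buffer object, no index data is passed from client memory at draw time and my alignment worry above doesn't apply.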
