The effect of both is identical in terms of what they render (immediate mode can actually handle colors per vertex, but I don't use them anyway) and how they get their data, yet the immediate mode runs 1 FPS faster!

Is this an evil of NIO? Is it that, under the hood, copying the data to the card from the NIO buffer takes just as long as calling the glXXXXX methods would anyway?

Hmm... Maybe it's something to do with my video card? GeForce FX 5500, latest drivers. I find that unlikely though, because nVidia is pretty hardcore about optimizing, and vertex arrays have been around since OpenGL 1.1. I would think they would have optimized them as much as possible long ago.

It's actually several arrays that compose the objects on screen. They range from about 100-1000 verts each, plus corresponding normals and texture coordinates, all sent as arrays. Each is allocated as a direct ByteBuffer, then transformed into a FloatBuffer (long before the rendering stage).

That method does the same thing - allocates a direct ByteBuffer, then gets a FloatBuffer from that. If doing it yourself wasn't working, you were probably forgetting to set the ByteBuffer to native order.
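A minimal sketch of that allocation pattern using plain java.nio (no library-specific helper assumed); the nativeOrder() step is the part that's easy to forget:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class BufferAlloc {
    // Allocate a direct FloatBuffer suitable for passing to gl*Pointer calls.
    // Native byte order is essential: without it, the driver receives
    // byte-swapped floats on little-endian hardware.
    public static FloatBuffer newFloatBuffer(int floats) {
        return ByteBuffer.allocateDirect(floats * 4)      // 4 bytes per float
                         .order(ByteOrder.nativeOrder())  // match the CPU's endianness
                         .asFloatBuffer();
    }

    public static void main(String[] args) {
        FloatBuffer verts = newFloatBuffer(9);            // room for 3 vertices (xyz)
        verts.put(new float[] {0f,0f,0f, 1f,0f,0f, 0f,1f,0f});
        verts.flip();                                     // prepare for reading
        System.out.println(verts.isDirect() + " " + verts.remaining()); // true 9
    }
}
```

A FloatBuffer view created from a direct ByteBuffer is itself direct, which is what lets the driver read it without an intermediate Java-side copy.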

If your ByteBuffers are direct buffers, there shouldn't be any copying. That's kind of the point of these things. It's really strange that the frame rate actually increases when rendering without vertex arrays, since you're doing many extra JNI calls. The overhead of those must not be as big as I thought. Maybe you could use the TraceGL pipeline to check if you're making unnecessary calls in the vertex array case? Or DebugGL to check for errors?

Something you could also try is reducing the number of glEnable/DisableClientState calls you are making. Calling these methods probably triggers a state validation in the driver, which might slow things down. The enabling and disabling of the vertex array, for instance, seems pointless since you are always using it.

IIRC, going from glVertex3f to glDrawArrays made a big difference in my rendering code. Something that can give you a boost as well is using triangle strips and/or fans instead of triangles (I don't know if this is feasible in your specific case, of course).

Doing a pure glDrawArrays direct from system RAM is the worst way to use vertex arrays. It's probably doing a copy from system RAM into AGP RAM and thence onwards to the card, every frame; and what's more, if you're filling the system RAM buffers up with data every frame you've shafted your cache unnecessarily. You need to be using, at the very least, glDrawRangeElements; EXT_compiled_vertex_array is a crap extension and shouldn't be used any more. NV_vertex_array_range2 is the next best solution on Nvidia cards. And the very best way to do it is Vertex Buffer Objects.

What's more this bit of code quite easily could just be a microbenchmark anyway with all the silliness that entails. No mention of overheads, number of vertices, pixels drawn, fill rates, etc. etc. It's a testament to Nvidia's driver writers that they've got immediate mode as fast as plain vertex arrays in your special case.

The way glDrawArrays works is that you are effectively saying, "all this data needs to be drawn, entirely, now" - but rarely do you actually have this situation. Normally you are drawing chunks of elements from a big load of vertex data. So typically you'll be using glDrawElements anyway. But underlying it all:

glDrawArrays can be implemented by the drivers in various ways, but one of the ways it'll be optimised is to copy system RAM data out to AGP, and then let the card suck it over in its own time. It's not that glDrawArrays is slow; it's the particular kind of RAM you're using that's slow. The usage of glDrawArrays won't make any difference here - it's purely down to where those buffers came from.

Now then... glDrawRangeElements specifies minimum and maximum bounds for the indices normally used by glDrawElements. What this does is precisely enable the drivers to determine the range of vertex memory pointed at, and copy over to AGP RAM only what is necessary (along with a few other optimisations). If that data is already in AGP RAM by virtue of NV_vertex_array_range or VBOs, then the optimisations are not nearly as significant.
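As a rough illustration of the cost those start/end hints avoid, here is the kind of full scan a driver would otherwise have to do over a plain glDrawElements index array to learn which slice of vertex memory it actually needs (a CPU-side sketch, not actual driver code):

```java
import java.nio.IntBuffer;

public class IndexRange {
    // Without the explicit start/end bounds of glDrawRangeElements, the driver
    // must walk the entire index array to find the min/max vertex referenced
    // before it can copy just that range of vertex data to AGP RAM.
    public static int[] scanRange(IntBuffer indices) {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        for (int i = 0; i < indices.limit(); i++) {
            int idx = indices.get(i);
            if (idx < min) min = idx;
            if (idx > max) max = idx;
        }
        return new int[] { min, max };
    }

    public static void main(String[] args) {
        IntBuffer idx = IntBuffer.wrap(new int[] {5, 9, 7, 5, 6});
        int[] range = scanRange(idx);
        System.out.println(range[0] + ".." + range[1]); // 5..9
    }
}
```

glDrawRangeElements lets the application hand over that [start, end] pair up front, so the driver can skip the scan entirely.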

I'm wired on coffee and starving so I might be rambling and not making much sense here.

The only difference between glDrawArrays and glDraw(Range)Elements is the use of indirection via the indices array. I don't see how one of these methods could be noticeably more efficient than the other. To me it seems that they just serve different purposes. If you don't need the indexing provided by glDraw(Range)Elements, why use it in the first place?

I have to admit I've never done any low-level OpenGL profiling/debugging, so I'm only looking at this at an API level...

Thought about this a bit more... The only way I see glDraw(Range)Elements being inherently faster than glDrawArrays is if you reuse vertices multiple times. Using the glDraw(Range)Elements methods you just refer to them multiple times in the indices array, whereas with glDrawArrays you have to repeat the vertex/color/tex coord data each time, which means you're pushing more data to the video card. Wild guess, but maybe the driver can also reuse calculations if it encounters the same index twice?

Does this make any sense?
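A back-of-the-envelope sketch of that trade-off; the vertex format and sharing ratio below are made-up assumptions for illustration, not numbers from this thread:

```java
public class VertexDataSize {
    static final int BYTES_PER_VERTEX = 8 * 4; // pos(3) + normal(3) + texcoord(2) floats

    // Unindexed triangles: every triangle carries three full vertex records.
    static int unindexedBytes(int triangles) {
        return triangles * 3 * BYTES_PER_VERTEX;
    }

    // Indexed triangles: unique vertices are stored once, plus the index array.
    static int indexedBytes(int triangles, int uniqueVertices) {
        return uniqueVertices * BYTES_PER_VERTEX + triangles * 3 * 4; // 4-byte indices
    }

    public static void main(String[] args) {
        // In a typical closed mesh each vertex is shared by several triangles.
        int tris = 1000, unique = 500;
        System.out.println(unindexedBytes(tris));        // 96000
        System.out.println(indexedBytes(tris, unique));  // 28000
    }
}
```

With heavy vertex sharing, indexing pushes only a fraction of the data; with no sharing at all it just adds the index array as overhead, which matches the "serve different purposes" point above.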

It's specifically to do with optimising what data the drivers send over to the graphics card. If you've already decided what format and where the data is going to be by using VBO or NVVAR, it's neither here nor there - probably (unless there's some interesting "paging" or "windowing" or something going on in the hardware AGP bus). The drivers may still copy your data into a more efficient format. But in the case of plain old direct buffers of vertex data, the driver has absolute discretion over how it's going to get the data to the card, and this usually involves copying it to AGP RAM. If you are able to be more specific about what data you're going to be needing, you should use glDrawRangeElements, because then the driver doesn't need to needlessly copy tons of data or parse the entire index array to work out a min/max vertex.

Doing a pure glDrawArrays direct from system RAM is the worst way to use vertex arrays. It's probably doing a copy from system RAM into AGP RAM and thence onwards to the card, every frame; and what's more, if you're filling the system RAM buffers up with data every frame you've shafted your cache unnecessarily.

Well, if the entire array needs to be drawn, this is supposed to be the fastest way because it causes the least indirection, as opposed to the other GL array functions, which are better at working with portions of arrays.

Quote

You need to be using, at the very least, glDrawRangeElements; EXT_compiled_vertex_array is a crap extension and shouldn't be used any more. NV_vertex_array_range2 is the next best solution on Nvidia cards. And the very best way to do it is Vertex Buffer Objects.

glDrawRangeElements is better if you only want to use a portion of the data, but I am drawing all vertices, normals and tex coords. Using glDrawRangeElements wouldn't improve it. VBOs would, but that's VBOs - not vertex arrays, and not OpenGL 1.1.

Quote

What's more this bit of code quite easily could just be a microbenchmark anyway with all the silliness that entails. No mention of overheads, number of vertices, pixels drawn, fill rates, etc. etc. It's a testament to Nvidia's driver writers that they've got immediate mode as fast as plain vertex arrays in your special case.

Cas

This is not a full-blown benchmark, but it's a pretty realistic one. It's the renderer in my game for the ships and cockpit. It's a very diverse set of data, ranging from small to fairly large quantities of vertices and different combinations: normals - no normals, tex coords - no tex coords. The test is to remove the static objects (roids, planets, etc.) and leave nothing but moving objects, and then restrict all movement of the ships but allow them to rotate.

I do agree this is a testament to nVidia's drivers, but I really don't think this is a special case.

The way glDrawArrays works is that you are effectively saying, "all this data needs to be drawn, entirely, now" - but rarely do you actually have this situation. Normally you are drawing chunks of elements from a big load of vertex data. So typically you'll be using glDrawElements anyway. But underlying it all:

I have that situation all the time. My vertex arrays are all self-contained in a scene graph structure. Frustum culling determines exactly which nodes need to be drawn, and then they draw themselves entirely. This isn't a sliced buffer; each has its own. glDrawArrays isn't supposed to be faster for data transfer (though you save some); the performance boost is supposed to come from requiring fewer OpenGL calls... a lot fewer calls. The same amount of data or less will go to the card either way. What I don't understand is why it's not doing what it says on the box: faster, because it uses far fewer OpenGL calls, and cards may be able to optimize because of the predictability of the vertices being rendered (they are all triangles for the next 1000 verts, for example).

It's specifically to do with optimising what data the drivers send over to the graphics card. If you've already decided what format and where the data is going to be by using VBO or NVVAR, it's neither here nor there - probably (unless there's some interesting "paging" or "windowing" or something going on in the hardware AGP bus). The drivers may still copy your data into a more efficient format. But in the case of plain old direct buffers of vertex data, the driver has absolute discretion over how it's going to get the data to the card, and this usually involves copying it to AGP RAM. If you are able to be more specific about what data you're going to be needing, you should use glDrawRangeElements, because then the driver doesn't need to needlessly copy tons of data or parse the entire index array to work out a min/max vertex.

Cas

Although that's all true, it still doesn't apply if you do actually want to render all the buffers pointed to entirely, which I do. For a bulk render, where all the buffers pointed to through glVertexPointer, glNormalPointer, etc. are needed, glDrawArrays should be faster than glDrawElements... and it should be much faster than using the immediate calls. According to the red book, the only reason it would be as slow as immediate calls is if I have geometry saturation on the card, which really means that the card itself has become the bottleneck... which can't be the case. Considering JNI's overhead, in Java I would expect it to outperform the immediate calls by even more than in C.

Implementations denote recommended maximum amounts of vertex and index data, which may be queried by calling glGet with argument GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES. If end - start + 1 is greater than the value of GL_MAX_ELEMENTS_VERTICES, or if count is greater than the value of GL_MAX_ELEMENTS_INDICES, then the call may operate at reduced performance.

This bit is only mentioned in the context of glDrawRangeElements, but it might be equally valid for glDrawArrays. Maybe you can try limiting the amount of data you are passing by doing multiple calls to glDrawArrays, taking that GL_MAX_ELEMENTS_VERTICES value into account. I don't have a clue about all the memory transfers involved, but it might be worth a try.
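A hedged sketch of what that splitting might look like on the CPU side: compute the (first, count) argument pairs for a series of glDrawArrays(GL_TRIANGLES, ...) calls that each stay under the queried limit. The limit value in main is a placeholder, not a real glGet query:

```java
import java.util.ArrayList;
import java.util.List;

public class DrawBatcher {
    // Split one big glDrawArrays(GL_TRIANGLES, 0, vertexCount) into batches that
    // stay under a per-call vertex limit (e.g. the GL_MAX_ELEMENTS_VERTICES hint).
    // Batch sizes are rounded down to a multiple of 3 so no triangle is split.
    public static List<int[]> batches(int vertexCount, int maxVerts) {
        int step = (maxVerts / 3) * 3;                // whole triangles only
        List<int[]> out = new ArrayList<>();
        for (int first = 0; first < vertexCount; first += step) {
            int count = Math.min(step, vertexCount - first);
            out.add(new int[] { first, count });      // args for one glDrawArrays call
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] b : batches(1536, 1000)) {
            System.out.println("first=" + b[0] + " count=" + b[1]);
        }
        // first=0 count=999, then first=999 count=537
    }
}
```

Each pair would then be fed to one glDrawArrays call in the render loop.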

Hang on a sec: it really is a rare case to call glDrawArrays unless your scenegraph has some very peculiar characteristics - notably, that the entire scene has a single rendering state (texture, same vertex format) - and I can't think of many games where this is the case. Even my crappy 2D games have tons of different rendering states operating on the same vertex data.

By the way... this is a slightly different question, but it's on the same subject. I have my program loading ms3d models into vertex arrays / vertex buffer objects (it does both, it just checks first) but I'm curious what's the best way to handle skeletal animations.

I don't think I should be running through my entire array every frame and rewriting the position of each vertex. It's bad enough to do that with regular arrays, but doing it with put() seems slightly disingenuous. OpenGL handles transformations really well when you use rotatef/translatef etc. It's a hell of a lot faster than using my own matrix calculations anyway, but I don't know how to apply it to my model as it's being rendered. Do I just need to keep track of how the vertices have been transformed, and just push/pop transformations as I render them? Or is there some way to use opengl for a transformation and have it actually return the transformed vertices?

Hang on a sec: it really is a rare case to call glDrawArrays unless your scenegraph has some very peculiar characteristics - notably, that the entire scene has a single rendering state (texture, same vertex format) - and I can't think of many games where this is the case. Even my crappy 2D games have tons of different rendering states operating on the same vertex data.

Cas

You must be using one large block of RAM and then splitting it up. In a 2D game that is fine, but in a 3D game there is way too much data, unless you force everyone onto a minimum of 128MB cards or make them deal with major thrashing... one large detailed terrain alone could thrash a 16MB card.

Fortunately, there are lots of cases in a 3D engine where the only thing that changes is the data, not the state. Models typically consist of hundreds to thousands of vertices using the same format for textures, colors, normals, etc.

As a simplified example - picture a model of a sphere composed of 16x16 segments with colors and texture (default modulate), done with floats, rendered as triangles.

1536 vertices

1536 normals

1536 colors (rgb)

512 texture coordinates

Total amount of data that must get to the card to render a single frame: 17920 bytes (4 bytes per float, and excluding texture upload)
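For what it's worth, that arithmetic can be sketched generically; note that under the assumption of one position, normal, color and texture coordinate per unindexed vertex, the total comes out differently from the figure quoted above, so the exact number depends on how components are counted and shared between triangles:

```java
public class SphereDataSize {
    // Per-frame bytes for an unindexed vertex array, given how many floats
    // each attribute contributes per vertex.
    public static int bytesPerFrame(int vertices,
                                    int posFloats, int normalFloats,
                                    int colorFloats, int texFloats) {
        int floatsPerVertex = posFloats + normalFloats + colorFloats + texFloats;
        return vertices * floatsPerVertex * 4; // 4 bytes per float
    }

    public static void main(String[] args) {
        // Assumption: 16x16 segments -> 512 triangles -> 1536 unindexed vertices,
        // each with xyz position, xyz normal, rgb color and st texcoord.
        System.out.println(bytesPerFrame(1536, 3, 3, 3, 2)); // 67584
    }
}
```

Either way, the point stands: it is a few tens of kilobytes per frame for one modest model, all sent under a single rendering state.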

Implementations denote recommended maximum amounts of vertex and index data, which may be queried by calling glGet with argument GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES. If end - start + 1 is greater than the value of GL_MAX_ELEMENTS_VERTICES, or if count is greater than the value of GL_MAX_ELEMENTS_INDICES, then the call may operate at reduced performance.

This bit is only mentioned in the context of glDrawRangeElements, but it might be equally valid for glDrawArrays. Maybe you can try limiting the amount of data you are passing by doing multiple calls to glDrawArrays, taking that GL_MAX_ELEMENTS_VERTICES value into account. I don't have a clue about all the memory transfers involved, but it might be worth a try.

That shouldn't matter in this case. I am well below any limits and I am not using glGet.

By the way... this is a slightly different question, but it's on the same subject. I have my program loading ms3d models into vertex arrays / vertex buffer objects (it does both, it just checks first) but I'm curious what's the best way to handle skeletal animations.

I don't think I should be running through my entire array every frame and rewriting the position of each vertex. It's bad enough to do that with regular arrays, but doing it with put() seems slightly disingenuous. OpenGL handles transformations really well when you use rotatef/translatef etc. It's a hell of a lot faster than using my own matrix calculations anyway, but I don't know how to apply it to my model as it's being rendered. Do I just need to keep track of how the vertices have been transformed, and just push/pop transformations as I render them? Or is there some way to use opengl for a transformation and have it actually return the transformed vertices?

Off topic, but basically push and pop is what you need for each group as it is transformed from its parent's coordinate system. Doing full skeletal animation is more complicated than that, though, and will require some inverse kinematics.

ok... should I use glDrawRangeElements for that? Because a bone may influence only part of a mesh.

I don't need to do anything fancy with the skeleton stuff, since MS3D lets you model animations yourself and just specify keyframes. It gives you rotation/position keyframes for all the joints, and you can just interpolate the frames in between. My main issue is doing it quickly, and applying my own transformations in Java (which I could have done) is just slow as molasses.
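Interpolating between two position keyframes can be sketched like this. This is a hypothetical helper, not part of the MS3D format or any library API; note that rotation keyframes would really want quaternion slerp rather than a component-wise lerp of angles:

```java
public class Keyframes {
    // Linearly interpolate a joint's position between two keyframes.
    // times[] must be ascending; returns {x,y,z} at time t (clamped to range).
    public static float[] lerpPosition(float[] times, float[][] positions, float t) {
        if (t <= times[0]) return positions[0].clone();
        int last = times.length - 1;
        if (t >= times[last]) return positions[last].clone();
        int i = 1;
        while (times[i] < t) i++;                       // find the bracketing pair
        float a = (t - times[i - 1]) / (times[i] - times[i - 1]);
        float[] out = new float[3];
        for (int c = 0; c < 3; c++) {
            out[c] = positions[i - 1][c] * (1 - a) + positions[i][c] * a;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] times = {0f, 1f};
        float[][] pos = {{0f, 0f, 0f}, {2f, 4f, 0f}};
        float[] mid = lerpPosition(times, pos, 0.5f);
        System.out.println(mid[0] + " " + mid[1]);  // 1.0 2.0
    }
}
```

The interpolated joint transforms can then be applied via glRotatef/glTranslatef between push/pop, so the per-frame Java work is only a handful of lerps per joint rather than per vertex.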

It all depends on how the data for your models is allocated. If all the data for a model is in a single buffer, then yes, you will want to use the finer-grained calls to operate on subsets of the vertices... and if all the data for ALL your models is allocated in a single buffer, you'll get the thrashing I described above unless the user has an uber card, and you'll have no choice but to use the fine-grained calls.

It's actually several arrays that compose the objects on screen. They range from about 100-1000 verts each, plus corresponding normals and texture coordinates, all sent as arrays. Each is allocated as a direct ByteBuffer, then transformed into a FloatBuffer (long before the rendering stage).

There is probably some overhead in enabling/disabling the client state and calling glDrawArrays. Have you tried using vertex arrays only when there are many verts, like at least 500? Can you find a breaking point where vertex arrays are faster than immediate mode?

It might also be the case that transferring the vertices to the card is not the bottleneck; that when using immediate mode you are feeding the card at exactly the right pace. It's still strange that vertex arrays are slower.

If I were in your situation I would start making a whole lot of tests to find out what was going on.

if all the data for ALL your models is allocated in a single buffer, you'll get the thrashing I described above unless the user has an uber card, and you'll have no choice but to use the fine-grained calls.

I'm creating a buffer for each mesh, since vertices are not shared between meshes. I still have a big problem, though, since knowing how to divide up the vertices is not the same as dividing up the triangles. There can be any number of triangles in a mesh composed of vertices bound to separate joints, meaning they stretch when the bones move and don't just rotate/translate. The only way to do this is to transform some of the vertices and not others, so you can't really break these triangles into 'groups' and render them all at once. If I push/pop matrices for every transformation, I'd have to do so multiple times for each of the triangles that share these vertices.
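One way around the push/pop-per-triangle problem is to blend per vertex instead of per triangle: each vertex stores bone indices and weights, and its skinned position is a weighted sum of the bone-transformed positions. A CPU-side sketch; the column-major matrix layout and "weights sum to 1" convention here are assumptions, not something from the MS3D format:

```java
public class Skinning {
    // Blend a bind-pose vertex through several bone matrices (4x4, column-major,
    // the same layout glGetFloatv(GL_MODELVIEW_MATRIX, ...) would return).
    // A triangle whose corners are bound to different bones is then handled
    // per vertex, with no matrix push/pop per triangle.
    public static float[] skinVertex(float[] v, float[][] bones, float[] weights) {
        float[] out = new float[3];
        for (int b = 0; b < bones.length; b++) {
            float[] m = bones[b];
            for (int r = 0; r < 3; r++) {   // transformed position, row r
                float t = m[r] * v[0] + m[4 + r] * v[1] + m[8 + r] * v[2] + m[12 + r];
                out[r] += weights[b] * t;   // accumulate the weighted contribution
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[] identity = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
        float[] shiftX   = {1,0,0,0, 0,1,0,0, 0,0,1,0, 2,0,0,1}; // translate x by 2
        float[] v = skinVertex(new float[] {1f, 0f, 0f},
                               new float[][] {identity, shiftX},
                               new float[] {0.5f, 0.5f});
        System.out.println(v[0]); // 2.0 (halfway between 1.0 and 3.0)
    }
}
```

This is exactly the computation that hardware skinning later moved into vertex shaders, which is why a vertex bound half to one bone and half to another "stretches" correctly.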

I've been searching around all over the place for an elegant solution to this problem. With every major video game using skeletal animation these days, I cannot believe they're doing all the transformations in software. What are they doing differently?


Skinning can be done in vertex shaders; I'm guessing that's what most games are doing. But of course, they also provide a path that does it in software if the card doesn't support vertex shaders.

So you need to provide a software option. You just have to optimize the code to make it as fast as possible. This is one case where Java is slower than C/C++, but it's still adequate.
