On Windows/DirectX I have XMMATRIX, and on iOS I have GLKit and GLKMatrix. I can't seem to find a hardware-accelerated matrix library for Android that I can use on the C++/NDK side (for example, one that uses the NEON extension on ARMv7 for matrix multiplies). Does anyone have any recommendations?

I do appreciate the link to the SIMD library discussion. We don't currently use SIMD, but I would be willing to use it if I can get a performance increase out of it. I was looking for something that has the matrix math functions written in assembler, but I guess I'll have to dive into the assembler myself. Should be fun.

NEON is a SIMD instruction set, which is why I thought that's what you were asking for.

That Stack Overflow discussion is talking about math libraries that have been ported to use NEON intrinsics (and SSE, AltiVec, etc.).

Math code that's hand-written in assembly won't be any better or worse than math code that's carefully written in C/C++. I'd recommend just using a higher-level language and looking at the assembly output from your compiler to double-check that it's doing an OK job (and if it's not, don't resort to writing asm yourself; tweak the high-level code so that the compiler performs better).
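To illustrate, here's a minimal sketch of the kind of "carefully written C/C++" I mean; the column-major layout and the `Mat4`/`mul` names are just my own choices for the example. Contiguous data and simple loops give the compiler's auto-vectorizer an easy job:

```cpp
#include <cassert>

// Hypothetical column-major 4x4 float matrix (element at m[col*4 + row]).
struct Mat4 { float m[16]; };

// Plain C++ multiply, structured so the inner loop is a simple
// multiply-accumulate over contiguous memory -- exactly the shape
// auto-vectorizers like.
Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r{};  // zero-initialised accumulator
    for (int c = 0; c < 4; ++c)
        for (int k = 0; k < 4; ++k) {
            const float bkc = b.m[c * 4 + k];
            for (int row = 0; row < 4; ++row)
                r.m[c * 4 + row] += a.m[k * 4 + row] * bkc;
        }
    return r;
}
```

Compiling this with optimisation on and inspecting the assembly (e.g. `-O2 -S` with GCC/Clang) quickly shows whether the inner loop became NEON/SSE instructions or not.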

I'm sorry for interrupting this nice and polite conversation, but I still don't understand why hdxpete needs SSE or similar instructions to multiply two 4x4 matrices.

1. A system is only as slow as its slowest part. Matrix multiplication is a very rare operation in a typical visualization system, so improving the calculation time by a fraction of a millisecond will not improve overall performance.

2. I haven't seen other approaches, but XMMATRIX uses floats for its arguments. In all the applications I have developed, I needed double precision for the matrix calculations. This also rules out XMMATRIX for me.

3. Why do you think you really need a faster matrix manipulation library? Did you try benchmarking your application to find the bottleneck? Have you ever experienced performance problems with matrix operations?

1. If you're processing a large mesh (transforming, distorting, or whatever other usage you've encountered), then yes, matrix operations can dominate in that situation. hdxpete hasn't said that he's doing so; it might be that he simply wants to use an optimised library throughout his project right from the start, rather than one he writes himself. Why pick a slower option if a better one already exists?

2. All the applications you've developed required double precision? Then you're missing out on SIMD performance benefits, wasting memory needlessly, and not running on a console/mobile platform. The only situation where I currently encounter doubles is in space games, and even there I'm pushing to get as much as possible back into float, to avoid the conversions between the double-precision generation of the terrain and the single-precision vector representation used for rendering.

3. This is a valid question; however, it comes with the caveat that using a decent SIMD maths library might also force you to write code in a more data-centric way, so having it from the start might be a good idea.

I'm sorry for interrupting this nice and polite conversation, but I still don't understand why hdxpete needs SSE or similar instructions to multiply two 4x4 matrices.

1. A system is only as slow as its slowest part. Matrix multiplication is a very rare operation in a typical visualization system, so improving the calculation time by a fraction of a millisecond will not improve overall performance.

The main benefit of SIMD is not computation time (although it's a very nice bonus), but the amount of time you spend reading / writing data. Loading 4xfloat as a packed register is much quicker than the FPU equivalent. I'd also say that, in the case of a matrix multiply, it's a function that is a prime candidate for SIMD, since it's used all over the place. Generally, you always optimise the biggest bottlenecks first, and typically memory access is a bigger bottleneck than computation time.
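As a concrete illustration of the packed load / packed operate point, here's a tiny sketch using SSE intrinsics (SSE because it's easy to try on a PC; the NEON equivalents would be `vld1q_f32` / `vmulq_n_f32` / `vst1q_f32`). The `scale4` helper is my own name for the example:

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86 target

// Scale four floats with one packed load, one packed multiply,
// and one packed store -- versus four of each the scalar way.
void scale4(const float* in, float s, float* out) {
    __m128 v = _mm_loadu_ps(in);              // all four floats in one load
    __m128 r = _mm_mul_ps(v, _mm_set1_ps(s)); // four multiplies at once
    _mm_storeu_ps(out, r);                    // all four results in one store
}
```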

2. I haven't seen other approaches, but XMMATRIX uses floats for its arguments. In all the applications I have developed, I needed double precision for the matrix calculations. This also rules out XMMATRIX for me.

1. It is entirely possible to use double-precision SIMD instructions (OK, maybe not on NEON).

2. If you actually need double precision for a matrix, then you're either doing science (and need the accuracy), or you're doing something very wrong indeed. It may be that you have a huge game world and are losing precision when the position is far from the origin (in which case, store an integer-based 3D grid reference along with the matrix); otherwise the problem is possibly something that could be fixed by reordering your equations for better accuracy, or a simple orthogonalise might be what's needed.

3. Double precision typically halves the performance of your code (it doubles the amount of time spent reading / writing data). If your ultimate criterion is speed, double precision is not a good idea...
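The "integer grid reference plus local offset" idea in point 2 can be sketched like this; the `WorldCoord` name and the cell size are my own assumptions for the example:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fix for far-from-origin precision loss: per axis, store an
// integer grid-cell index plus a float offset inside the cell. The offset
// stays small, so it keeps full float precision anywhere in the world.
constexpr float kCellSize = 1024.0f;  // metres per cell; an arbitrary choice

struct WorldCoord {
    int32_t cell;   // which cell along this axis
    float   local;  // offset within the cell, in [0, kCellSize)
};

// Distance along one axis, computed without ever forming a huge float.
float axisDistance(WorldCoord a, WorldCoord b) {
    return float(b.cell - a.cell) * kCellSize + (b.local - a.local);
}
```

For rendering you'd subtract the camera's cell first, so everything the GPU sees is camera-relative and comfortably inside float range.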

Thank you guys for the replies, but I still have to clear a few things up...

I hope you don't mind; the whole community could benefit from this kind of debate.

1. If you're processing a large mesh (transforming, distorting, or whatever other usage you've encountered), then yes, matrix operations can dominate in that situation. hdxpete hasn't said that he's doing so; it might be that he simply wants to use an optimised library throughout his project right from the start, rather than one he writes himself. Why pick a slower option if a better one already exists?

I also thought at first glance that he had something serious to do with matrix calculus, but from the rest of his posts I really doubt it.

Transforming large meshes? On a CPU? Could you give an example? I really think that should be done on a GPU.

2. All the applications you've developed required double precision? Then you're missing out on SIMD performance benefits, wasting memory needlessly, and not running on a console/mobile platform. The only situation where I currently encounter doubles is in space games, and even there I'm pushing to get as much as possible back into float, to avoid the conversions between the double-precision generation of the terrain and the single-precision vector representation used for rendering.

I have to admit that I've never developed for a console platform, but on a PC I have never noticed a performance loss when using doubles instead of single-precision floats. As I've already said, those operations are too rare to be noticeable. Texture streaming, for example, is far more noticeable than the calculation of a transformation matrix, and that's where I'm using SIMD libraries (for example, for texture compression). By the way, large terrain rendering is the current focus of my graphics programming.

3. This is a valid question; however, it comes with the caveat that using a decent SIMD maths library might also force you to write code in a more data-centric way, so having it from the start might be a good idea.

It is not a bad idea per se; I just pointed out that it could create complications without a real need for them.

2. If you actually need double precision for a matrix, then you're either doing science (and need the accuracy), or you're doing something very wrong indeed. It may be that you have a huge game world and are losing precision when the position is far from the origin (in which case, store an integer-based 3D grid reference along with the matrix); otherwise the problem is possibly something that could be fixed by reordering your equations for better accuracy, or a simple orthogonalise might be what's needed.

3. Double precision typically halves the performance of your code (it doubles the amount of time spent reading / writing data). If your ultimate criterion is speed, double precision is not a good idea...

Currently I'm working on a high-precision massive-terrain algorithm. I've modeled the Earth based on the WGS84 ellipsoid, geoid undulation, and a high-precision DEM (sub-metre precision for the whole Earth), with the ability to place the viewer 1 micron above the surface without visible artifacts. Of course, there is no need to place the viewer at such a height; it is just a demonstration of the power.

Math is done partially on the CPU (in double precision, only for the viewer position) and partially on the GPU (in single precision, for all vertices of the terrain). The frame rate is really high, so I'm probably doing things right.

I know that double precision on a GPU requires SM5 cards and is at least two times slower (in fact the factor is much higher and depends on the concrete architecture). That's why I'm using single precision on the GPU. But on the CPU, I have never had performance problems with DP calculations. On the other hand, the things I'm working on would be impossible without DP.

I also thought at first glance that he had something serious to do with matrix calculus, but from the rest of his posts I really doubt it.
Transforming large meshes? On a CPU? Could you give an example? I really think that should be done on a GPU.

This is mobile, and even though mobile GPUs are quite powerful, they are still far from desktop systems, and sometimes you want to spend the GPU cycles on shading and offload some vertex-processing work to the CPU. In those cases it feels foolish not to use the entire CPU and ignore the excellent SIMD instructions. You can quite easily get up to 4x the processing performance by using NEON.

Also, there is more than the graphics pipeline that can use matrix calculations, and with limited resources you don't want to waste any needlessly.
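For the record, the kind of hot loop NEON speeds up looks like this (plain C++ shown here as the scalar reference; a NEON port would run the same loop with `vld1q_f32` / `vmlaq_n_f32`-style intrinsics). The `Vec4`/`transformBatch` names and column-major layout are my own assumptions:

```cpp
#include <cassert>
#include <cstddef>

struct Vec4 { float x, y, z, w; };

// Transform a batch of positions by a column-major 4x4 matrix on the CPU.
// Each output component is a dot product of the input with one matrix row.
void transformBatch(const float m[16], const Vec4* in, Vec4* out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const Vec4 v = in[i];
        out[i].x = m[0]*v.x + m[4]*v.y + m[8]*v.z  + m[12]*v.w;
        out[i].y = m[1]*v.x + m[5]*v.y + m[9]*v.z  + m[13]*v.w;
        out[i].z = m[2]*v.x + m[6]*v.y + m[10]*v.z + m[14]*v.w;
        out[i].w = m[3]*v.x + m[7]*v.y + m[11]*v.z + m[15]*v.w;
    }
}
```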

I also thought at first glance that he had something serious to do with matrix calculus, but from the rest of his posts I really doubt it.
Transforming large meshes? On a CPU? Could you give an example? I really think that should be done on a GPU.

This is mobile, and even though mobile GPUs are quite powerful, they are still far from desktop systems, and sometimes you want to spend the GPU cycles on shading and offload some vertex-processing work to the CPU. In those cases it feels foolish not to use the entire CPU and ignore the excellent SIMD instructions. You can quite easily get up to 4x the processing performance by using NEON.

Also, there is more than the graphics pipeline that can use matrix calculations, and with limited resources you don't want to waste any needlessly.

But that's ignoring the fact that you're going to have a lot of data to transfer to the GPU each frame, which will eat bandwidth and force you to deal with dynamic buffer management and contention. That alone (unless the data is truly dynamic to begin with) favours doing it on the GPU; yes, the extra shader instructions are extra GPU overhead, but I'm betting they're nothing compared to the bandwidth and contention overhead of a CPU-based approach.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.

In all the applications I have developed, I needed double precision for the matrix calculations.

To quote Tom Forsyth - "Double precision has no place in games. If you think you need double precision, you either need 64-bit fixed point, or you don't understand the algorithm."

He's being a bit facetious; they might occasionally have a use... but they definitely should not be your default choice, especially on 32-bit architectures. In my experience, doubles are very, very, very rarely used in games (and as Tom says, when you do see them used, it's often done without understanding -- "oh float was having trouble so I just changed it to double"). Float and double have a huge range, yes, but their logarithmic precision is usually not the most efficient choice.

Memory bandwidth is a bigger bottleneck than CPU ALU speeds these days. A PC x86 CPU will likely crunch through floats and doubles at the exact same speed (by treating them both as 80-bit internally...). The performance impact comes from the fact that doubles double your memory bandwidth usage, and if that's your bottleneck (which it often is these days), then this ~doubles your execution times.

As for the need for optimized matrix/vector classes: many games do skeletal animation on the CPU, where you might have two dozen characters, each with five dozen bones, which is well over a thousand local->global matrix computations that need to be done each frame. This is enough work to be >1ms and show up on your profiler.

I port a lot of code to mobile. What I'm currently porting uses some SIMD, so if I can find a CPU instruction to do it, I'd like to use it. I have my own matrix library for platforms that don't have special CPU instructions; while I haven't optimized it, it is pretty swift. But if I can get 2% of CPU time back by using a special instruction set, I would like to, so I can spend more time processing the rest of the frame.

To quote Tom Forsyth - "Double precision has no place in games. If you think you need double precision, you either need 64-bit fixed point, or you don't understand the algorithm."

He's being a bit facetious; they might occasionally have a use... but they definitely should not be your default choice, especially on 32-bit architectures. In my experience, doubles are very, very, very rarely used in games (and as Tom says, when you do see them used, it's often done without understanding -- "oh float was having trouble so I just changed it to double"). Float and double have a huge range, yes, but their logarithmic precision is usually not the most efficient choice.

I agree that doubles are more expensive than floats on GPUs (not on CPUs, as you already noted), but they make many things easier, and some of them even faster, on a GPU. When precision is needed, using native doubles instead of emulating them with pairs of single-precision floats is significantly faster.

In my example, how would one calculate (and draw) the precise positions of the vertices on the globe (where even the radius cannot be represented with metre accuracy in floats) without doubles on the CPU side? Maybe with some mathematical gymnastics, but what for? The calculation runs at the same speed on the CPU side, only floats (just several floating-point values for the hundreds of thousands of vertices generated on the GPU) are transferred to the GPU, and everything is done using FP arithmetic on the GPU side. With all respect to you and Tom Forsyth, it does not make any sense. Please, before dismissing something in general, consider the cases where it might be the better solution.

I agree that doubles are more expensive than floats on GPUs (not on CPUs, as you already noted)

I didn't quite note just that -- I pointed out that they have the potential to halve performance, because memory bandwidth is usually more of a bottleneck than CPU speed.

how would one calculate (and draw) the precise positions of the vertices on the globe (where even the radius cannot be represented with metre accuracy in floats) without doubles on the CPU side?

Neither floats nor doubles are a great choice for storing globe-surface points relative to the globe, because both formats dedicate the bulk of their precision to representing points within the globe's core. What a waste!

The surface of the Earth only varies vertically by about 20km, so if you need sub-metre height accuracy you could use a 16-bit int to store the height difference from the average, or a 32-bit int would give you near-micron accuracy. If you need the globe vertices displaced horizontally as well as vertically, then you could complement the height with two spherical coordinates, or the smoothed-cube coordinates that are trendy in planetary renderers.

Why? Because a more efficient storage format takes up less space, and efficiency in memory layouts is one of the primary optimisations on modern computers (arguably more important than reducing CPU cycles; in relative terms of bandwidth per CPU cycle, memory is getting slower every day...).
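A sketch of the 16-bit height idea: quantise elevation relative to the reference surface into a uint16. The range constants below are my own assumptions; you'd pick them to bracket your actual data:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr float kMinH = -11000.0f;                   // metres below reference
constexpr float kMaxH =   9000.0f;                   // metres above reference
constexpr float kStep = (kMaxH - kMinH) / 65535.0f;  // ~0.31 m per code

uint16_t encodeHeight(float metres) {
    float t = (metres - kMinH) / (kMaxH - kMinH);    // normalise to [0,1]
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return static_cast<uint16_t>(t * 65535.0f + 0.5f);  // round to nearest code
}

float decodeHeight(uint16_t code) {
    return kMinH + code * kStep;
}
```

A uniform quantisation step spends precision evenly across the range, rather than concentrating it near zero the way float does.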

Please, before dismissing something in general, consider the cases where it might be the better solution.

Keep in mind I only jumped in here because you claimed that all the applications you've developed required double precision -- that seems to be the same generalization from the other side of the fence.

The main benefit of SIMD is not computation time (although it's a very nice bonus), but the amount of time you spend reading / writing data. Loading 4xfloat as a packed register is much quicker than the FPU equivalent.

I'm not sure how true that is... Yes, you can load 4 values with one instruction (just as you can do many other ops 4-wide with one instruction), but those 16 bytes aren't magically transferred from RAM any faster than 16 bytes requested via 'normal' means. Many applications don't see any performance improvement after porting to SIMD (despite using ~4x fewer CPU cycles) exactly because the memory bandwidth has remained the same.

I didn't quite note just that -- I pointed out that they have the potential to halve performance, because memory bandwidth is usually more of a bottleneck than CPU speed.

We again misunderstood each other. I don't "promote" double-precision models, just double-precision calculation. There is no impact on the bandwidth, since only a few floats are sent to the GPU.

Neither floats nor doubles are a great choice for storing globe-surface points relative to the globe, because both formats dedicate the bulk of their precision to representing points within the globe's core. What a waste!
The surface of the Earth only varies vertically by about 20km, so if you need sub-metre height accuracy you could use a 16-bit int to store the height difference from the average, or a 32-bit int would give you near-micron accuracy. If you need the globe vertices displaced horizontally as well as vertically, then you could complement the height with two spherical coordinates, or the smoothed-cube coordinates that are trendy in planetary renderers.
Why? Because a more efficient storage format takes up less space, and efficiency in memory layouts is one of the primary optimisations on modern computers (arguably more important than reducing CPU cycles; in relative terms of bandwidth per CPU cycle, memory is getting slower every day...).

Can you elaborate on this, please?

I'm already using 16-bit storage for the height map (DEM). It gives 0.14m accuracy at the global level (without the need for average block values or differential coding), which is quite enough for the global elevation data currently available.

Keep in mind I only jumped in here because you claimed that all the applications you've developed required double precision -- that seems to be the same generalization from the other side of the fence.

We again misunderstood each other. I don't "promote" double-precision models, just double-precision calculation. There is no impact on the bandwidth, since only a few floats are sent to the GPU.

... Can you elaborate on this, please? I'm already using 16-bit storage for the height map (DEM).

Ah yes, I thought that you were storing vertices in double-precision format.

I guess you're reading in some compact data (e.g. 16-bit elevation), doing a bunch of double-precision transforms on it, then outputting 32-bit floats?

That's much less offensive to performance than what I assumed you were doing.

However, it may still be that double-precision calculations aren't necessary... you may be able to rearrange your order of operations, or the coordinate systems that you're working in so that everything works ok with just 32-bit precision. Whether that's at all worthwhile when you've already got a working solution is another whole topic though!

I guess if ALU time were a performance bottleneck for you and you wanted to make use of 4-wide (or 16-wide on new PC CPUs) SIMD, then it might be worthwhile; otherwise, if it ain't broke, don't fix it.

While on this topic, it's worth noting that some compilers, such as MSVC, actually output really horribly bad assembly code when you use floats, depending on the compiler settings. MSVC has an "Enhanced Instruction Set" option and a "Floating Point Model" option. With the FP model set to "strict" or "precise", it will produce assembly code with a LOT of redundant instructions to take every 80-bit intermediate value and round it down to 32-bit precision, so that your code behaves as if the FPU actually used 32-bit precision internally. When using double, it doesn't bother with all this redundant rounding code, which can actually make double seem much faster than float!

Personally, I always set the instruction set to SSE2 and the FP model to "fast", which makes MSVC produce more sensible x86 code for floats.
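For reference, those settings correspond to command-line switches roughly like the following (double-check the exact spellings against your compiler's documentation):

```shell
# MSVC: "fast" floating-point model plus SSE2 code generation
cl /O2 /fp:fast /arch:SSE2 main.cpp

# A rough GCC/Clang equivalent, for comparison
g++ -O2 -ffast-math -msse2 -mfpmath=sse main.cpp
```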

Ah yes, I thought that you were storing vertices in double-precision format.

I guess you're reading in some compact data (e.g. 16-bit elevation), doing a bunch of double-precision transforms on it, then outputting 32-bit floats?

That's much less offensive to performance than what I assumed you were doing.

Nope! In fact, I'm generating the terrain completely on the GPU. Only the 16-bit elevation data and various overlays are sent, through textures. Everything is rendered without a single attribute (in the GLSL sense). The CPU calculates the precise position on the globe and the relevant parameters used for the full ellipsoid calculation and height correction per vertex on the GPU. Everything is done in FP on the GPU side, but the coefficients are calculated on the CPU in DP, downcast to FP, and sent to the GPU as uniforms. Once again, no attributes are used; the representation cannot be more compact. But I still need DP to do accurate math on the CPU.

While on this topic, it's worth noting that some compilers, such as MSVC, actually output really horribly bad assembly code when you use floats, depending on the compiler settings. MSVC has an "Enhanced Instruction Set" option and a "Floating Point Model" option. With the FP model set to "strict" or "precise", it will produce assembly code with a LOT of redundant instructions to take every 80-bit intermediate value and round it down to 32-bit precision, so that your code behaves as if the FPU actually used 32-bit precision internally. When using double, it doesn't bother with all this redundant rounding code, which can actually make double seem much faster than float!

Personally, I always set the instruction set to SSE2 and the FP model to "fast", which makes MSVC produce more sensible x86 code for floats.

Thank you for the advice! Although I've been using VS since version 4.1, I have never had the need to tweak compiler options. I'll try what you've suggested!