Note:

1) It's very rare for a well-designed approximation not to be much faster than a general library call. These general library calls are (almost always) designed to have well-defined results for all inputs. Also, Java only supports the common analytic functions for doubles. So you could write a function mySin(float) such that mySin(x) == (float)Math.sin(x) for all values of 'x', which I would expect to be about 2x faster (a sketch of that contract follows this note). That can be improved further by throwing out specific results for one or more of the following: NaNs, denormals, and values outside a common or specifically required range.

2) The timing methods used here will tend to show table-based methods in a more favorable light than might be the case under real usage patterns.
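
A sketch of what that contract means; mySin below is just a stand-in that calls through to Math.sin, and the loop walks all 2^32 float bit patterns (exhaustive testing is feasible precisely because floats are only 32 bits):

```java
// Sketch of the contract mySin(x) == (float) Math.sin(x) for all floats.
// mySin is a placeholder here; a real implementation would be a float-only
// approximation that happens to round to the same result.
static float mySin(float x) {
    return (float) Math.sin(x); // stand-in for the hand-rolled approximation
}

// Exhaustive check: there are only 2^32 floats, so testing all of them is feasible.
static void verifyAll() {
    for (long bits = 0; bits <= 0xFFFFFFFFL; bits++) {
        float x = Float.intBitsToFloat((int) bits);
        float expected = (float) Math.sin(x);
        float actual   = mySin(x);
        // NaN != NaN, so compare bit patterns instead of values.
        if (Float.floatToRawIntBits(expected) != Float.floatToRawIntBits(actual))
            throw new AssertionError("mismatch at x = " + x);
    }
}
```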

Your benchmark was fundamentally flawed. It allowed the JVM to completely remove the Math.cos() call.
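
The usual fix is to consume every result so the optimizer cannot prove the call dead; a minimal sketch:

```java
// A timing loop the JIT cannot strip: the result of every Math.cos call
// feeds a running sum that is printed afterwards.
static void timeCos() {
    final int N = 10000000;
    double sink = 0.0; // consumed result defeats dead-code elimination
    long t0 = System.nanoTime();
    for (int i = 0; i < N; i++)
        sink += Math.cos(i * 1e-4);
    long t1 = System.nanoTime();
    System.out.println(((t1 - t0) / 1e6) + " ms (sink=" + sink + ")");
}
```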

Not mine, I just copy/pasted the one posted in this thread.

Good catch, as it was a big surprise to see Math.cos being so fast... and the results now seem to be what we would all expect them to be.

EDIT :

Just one point: you are measuring the error between mine and Java's, but Java does not give the exact result either (e.g. at 0.5 * PI the Taylor version is more accurate than the original Java cos), though not by that much, yes... (anyway, this Taylor cos version uses float, so it should be quite a bit less accurate).
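
For example, Java's own result at 0.5 * PI is not exactly zero, because the double Math.PI / 2 is itself only an approximation of pi/2, so the correctly rounded cosine of that double is a tiny nonzero number:

```java
// Math.PI / 2 is not exactly pi/2, so the cosine of that double
// is a tiny nonzero value rather than 0.
System.out.println(Math.cos(Math.PI / 2)); // prints roughly 6.1E-17
```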

Most hardware will execute floats faster than doubles (a notable counter-example is Intel-alikes using x87 instead of SSE).

Note, I'm only attempting to compare truncated power series vs. other polynomial approximation methods. Truncated power series ignore finite precision and are centered on a single point, while most numerical-approximation methods take finite precision into account and target a range. I'm ignoring things like argument reduction and the handling of special values (NaN, +/-Infinity, denormals, -zero).
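
To make the distinction concrete, a hedged sketch: a degree-5 truncated Taylor sin next to a range-targeted polynomial of the same degree (the latter's coefficients are in the style of Abramowitz & Stegun 4.3.96 and are quoted for illustration only):

```java
// Truncated power series for sin, centered on 0: very accurate near 0,
// with the error piling up toward the ends of the interval.
static float sinTaylor(float x) {
    float x2 = x * x;
    return x * (1f - x2 / 6f + (x2 * x2) / 120f);
}

// Range-targeted polynomial of the same degree (illustrative coefficients,
// A&S 4.3.96 style): the error is spread roughly evenly across [0, pi/2]
// instead of concentrating at the endpoint.
static float sinRange(float x) {
    float x2 = x * x;
    return x * (1f - 0.16605f * x2 + 0.00761f * (x2 * x2));
}
```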

Sorry, but I don't agree. IMHO double is probably faster, or at least the same speed (and for sure will become faster).

Most of these comments are ill informed. For example, an SSE machine can perform 4 float ops faster than the 2 double ops that will fit in the same registers. Example: dividing 4x4 floats has a throughput of 36 cycles and 2x2 doubles is 62 (for CPUID 69), so 9 cycles/divide for floats and 31 cycles/divide for doubles. I know of no hardware on which doubles are faster than floats and don't expect to see it in my lifetime (at least for consumer hardware).

Right... forgetting the fact that Java has no SIMD? Besides that, the statement 'has a throughput of 36 cycles' makes no sense at all.

Benchmark to show that float / double performance is nearly identical:
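
The original code isn't reproduced here; a minimal sketch of a benchmark along these lines (not the original, just the shape of it) might be:

```java
// Identical loops over float and double arrays; the sums are printed so the
// JIT keeps the work. Note such array-walking loops are largely memory bound.
static long benchFloat(float[] a) {
    long t0 = System.nanoTime();
    float s = 0f;
    for (int i = 0; i < a.length; i++) s += a[i] * a[i];
    long t1 = System.nanoTime();
    System.out.println("float:  " + s);
    return t1 - t0;
}

static long benchDouble(double[] a) {
    long t0 = System.nanoTime();
    double s = 0.0;
    for (int i = 0; i < a.length; i++) s += a[i] * a[i];
    long t1 = System.nanoTime();
    System.out.println("double: " + s);
    return t1 - t0;
}
```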

There's a widespread misconception that doubles are somehow 'better' than floats, and this is simply not the case. They simply have more precision, and you pay for that in terms of memory usage and (usually) execution time (some ops will have the same execution time for both). It's true that for some usages the extra precision is required, but the same is likewise true of doubles (some usages need even more precision than they offer). I find this somewhat strange, because you don't find people thinking that 64-bit integers are somehow 'better' than lower-bit-width integers; they will happily use whichever is most appropriate for the given usage.

Quote from: Riven

Right... forgetting the fact that Java has no SIMD?

Not at all, they are simply not exposed at the high level. It's the compiler's job to use these instructions (the runtime's, in the case of a VM). I'd expect vectorized support to improve, as runtimes like Mono have been working on this front. It'll probably require a new compiler framework, as Sun's seems to be showing its age (I haven't paid attention to what the next gen is up to). Both vmkit and Shark/Zero are based on LLVM, so they have "promise" in this respect (although LLVM needs work on auto-vectorization as well). If we had a VM that performed a reasonable amount of vectorization, there would be a huge speed difference between floats and doubles on SIMD hardware. But ignoring SIMDifying, scalar float operations are generally faster than double ones (and never slower) on all CPU architectures that I'm familiar with. Couple this with the ever-increasing speed gap between main memory and the CPU, and moving data becomes more and more of an issue.

Quote from: Riven

Besides that, the statement 'has a throughput of 36 cycles' makes no sense, at all.

Not sure what doesn't make sense. It is the measure of time required for the execution unit to complete the computation: a 4-wide float divide every 36 cycles works out to 9 cycles per divide.

Quote from: Riven

Benchmark to show that float / double performance is nearly identical:

Your example appears to be memory bound. As a simple counter-example: the speed difference between float and double multiply is much narrower than for divide, but if you take my last minimax sin example and simply change from float to double, you're likely to see a speed difference of 1.2-2.0x depending on your hardware (and assuming the Sun VM).
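
A sketch of the kind of compute-bound kernel meant here (coefficients arbitrary; the double twin is identical apart from the types):

```java
// Compute-bound kernel: a polynomial evaluated over and over in registers,
// so memory traffic can't hide the arithmetic cost. Coefficients are arbitrary.
static float kernelFloat(float x, int iters) {
    float s = 0f;
    for (int i = 0; i < iters; i++) {
        float x2 = x * x;
        s += x * (1f - 0.16605f * x2 + 0.00761f * x2 * x2);
        x += 1e-6f; // keep the input varying so nothing is constant-folded
    }
    return s; // caller must consume this to defeat dead-code elimination
}
// The double version is identical except for the types; timing both with
// System.nanoTime() is what exposes the 1.2-2.0x gap claimed above.
```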

The question of the performance of floats vs. doubles is really a hardware question (and the indicated thread was about hardware), and my original statement was to that effect. My follow-up was in reply to "IMHO double is probably faster or at least same speed (and for sure will become faster)", which I claim to be incorrect. Is it possible to have two otherwise identical routines where the double version will be faster than the float one? Yes, but that will be in the case where the stalls of the double version are hidden by some other stall (probably a memory read or write) more than in the float version. This will be a rare occurrence and tightly coupled to an exact hardware configuration. This is more of a desktop issue, for CPUs with many functional units.

I'd be happy to be proven wrong with a counter-example or a pointer to a CPU specification!

@DzzD: I started programming in the 8-bit days, but that isn't my problem. I use doubles and multi-precision elements all the time...when needed.

There are three important trends that form my opinion about doubles being unlikely to outperform floats on consumer hardware in my lifetime. The first two have already been mentioned: SIMD and the speed gap with main memory. The other is die size reduction. Making the channels narrower increases energy consumption (and therefore heat), so it is becoming more and more important to shut down subsystems which are not in use to reduce heat (and battery drain in the case of notebooks). Computation on doubles requires wider data paths and more stages (to oversimplify slightly).

In real life I do a fair amount of low-level and optimization work, so I read a fair number of CPU design specs and semi-keep up with hardware design research, and I'm seeing nothing to contradict this opinion. However, I'll admit that I never expected to see anything like the new decimal formats from IEEE 754-2008 either.

For sqrt, I tried Float.floatToRawIntBits, but it is slow on Android. I also played with DzzD's code, but as he indicated, it needs some smarts to be fast enough. Here is Riven's benchmark code, modified to show DzzD's sqrt algorithm:

On sqrt I have a couple of questions, and maybe I can come up with a different initial guess for an N-R based method.

1) How many N-R steps can you perform before they equal the speed of the default sqrt? And 1/sqrt? (One refinement step for each is sketched below.)
2) If they exist, can Math.getExponent and Math.scalb be used instead of bit inspection? I'd expect them to be slow as well, but it's worth checking.
3) Speed of conversion: fp to int and int to fp.
4) Speed of either leading or trailing zero count of 32-bit ints.
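
For reference on question 1, one Newton-Raphson refinement step of each, given a current guess g:

```java
// sqrt:   g' = 0.5 * (g + x / g)             -- costs a divide per step
// 1/sqrt: g' = g * (1.5 - 0.5 * x * g * g)   -- multiplies only, no divide
static float sqrtStep(float x, float g)    { return 0.5f * (g + x / g); }
static float invSqrtStep(float x, float g) { return g * (1.5f - 0.5f * x * g * g); }
```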

I finally got 'unlazy' and did a quick web search on Android devices. All that I saw are based on the ARM-11, so leading-zero counting is a hardware instruction and should be fast (unless someone forgot to hook it in).

I didn't get any idea about float support. Not being able to find tech specs is a big pet peeve of mine. The ARM-11 does not have an FPU, but ARM provides various FPUs as coprocessors.

I'll try to throw together a zero-counting-based guess version; a rough sketch is below. (On the log-2: I guess you see where I'm going with the zero counting.)
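
Roughly the idea, as a hedged sketch for the integer case only: the leading-zero count gives floor(log2 x), and half of that exponent seeds Newton-Raphson.

```java
// Leading-zero-count seeded integer sqrt: 31 - nlz(x) is floor(log2(x)),
// so 1 << (log2/2 + 1) is guaranteed to be >= sqrt(x). Newton-Raphson with
// integer division then decreases monotonically to floor(sqrt(x)).
static int isqrt(int x) {
    if (x <= 0) return 0;
    int log2 = 31 - Integer.numberOfLeadingZeros(x);
    int g = 1 << ((log2 >> 1) + 1); // initial guess, always >= sqrt(x)
    int prev;
    do {
        prev = g;
        g = (g + x / g) >>> 1;      // one N-R step
    } while (g < prev);             // stop when it no longer decreases
    return prev;
}
```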

When you say "forgot to hook it in" - to what? There isn't a corresponding bytecode, so are you referring to a hypothetical android.util.IntMath? I don't see why it should be there - the spec surely isn't designed around a particular CPU?

Re log 2 - yes, that was pretty obvious. I implemented a fixed-point sqrt once, but I can't remember what I did about the initial guess; just that it involved a lookup table.

When you say "forgot to hook it in" - to what? There isn't a corresponding bytecode, so are you referring to a hypothetical android.util.IntMath? I don't see why it should be there - the spec surely isn't designed around a particular CPU?

I mean that Integer.numberOfLeadingZeros is replaced at link time by a native method, rather than executing the bytecode. I haven't looked into the guts of Dalvik, but I'd expect it to do this.
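
For comparison, the pure-bytecode fallback is essentially the classic branch-free count (the well-known Hacker's Delight form), while the native/intrinsic path can map straight to the hardware CLZ instruction:

```java
// Pure-Java leading-zero count for timing against Integer.numberOfLeadingZeros.
// Each step tests whether the top half of the remaining window is empty and,
// if so, shifts the value up and adds the skipped width to the count.
static int nlz(int x) {
    if (x == 0) return 32;
    int n = 1;
    if (x >>> 16 == 0) { n += 16; x <<= 16; }
    if (x >>> 24 == 0) { n +=  8; x <<=  8; }
    if (x >>> 28 == 0) { n +=  4; x <<=  4; }
    if (x >>> 30 == 0) { n +=  2; x <<=  2; }
    n -= x >>> 31;
    return n;
}
```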

This is a quick hack to test a workaround for slow floatToRawIntBits on Android. As I don't have the hardware, I can't do timing tests myself. If someone with hardware is willing to test, please try timing "getExponent" against "timeMe".

It has the following methods: getExponent, getSignificand, scalb, and a first-pass isqrt. These were thrown together, so they probably have bugs, and isqrt is a quick attempt at making the reference version (included) not completely dog-slow.
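
For illustration only, a hypothetical stand-in for the exponent extraction (not the attached code): it recovers the unbiased exponent of a positive float with plain compares and multiplies, avoiding Float.floatToRawIntBits entirely. A real version would binary-search instead of looping linearly.

```java
// Hypothetical, naive exponent extraction without bit inspection:
// scale x into [1, 2) by halving/doubling, counting the steps.
static int getExponentNoBits(float x) {
    if (x <= 0f) throw new IllegalArgumentException("positive floats only");
    int e = 0;
    while (x >= 2f) { x *= 0.5f; e++; } // scale down until x is in [1, 2)
    while (x <  1f) { x *= 2f;   e--; } // scale up   until x is in [1, 2)
    return e;
}
```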

In general, I haven't found using fixed point to be any faster than just using floats, even though the G1 doesn't have an FPU. I only use fixed point with OpenGL ES, which is 16.16. I found that any real number crunching is going to have to be in native code. I didn't try using fixed point in native code, as it doesn't seem to be a bottleneck.
