Intel’s manuals for their x86/x64 processor clearly state that the fsin instruction (calculating the trigonometric sine) has a maximum error, in round-to-nearest mode, of one unit in the last place. This is not true. It’s not even close.

The worst-case error for the fsin instruction for small inputs is actually about 1.37 quintillion units in the last place, leaving fewer than four bits correct. For huge inputs it can be much worse, but I’m going to ignore that.

I was shocked when I discovered this. Both the fsin instruction and Intel’s documentation are hugely inaccurate, and the inaccurate documentation has led to poor decisions being made.

I discovered this while playing around with my favorite mathematical identity. If you add a double-precision approximation for pi to the sin() of that same value then the sum of the two values (added by hand) gives you a quad-precision estimate (about 33 digits) for the value of pi. This works because the sine of a number very close to pi is almost equal to the error in the estimate of pi. This is just calculus 101, and a variant of Newton’s method, but I still find it charming.

The sin() function in VC++ is accurate enough for this technique to work well. However this technique relies on printing the value of the double-precision approximation of pi to 33 digits, and up to VC++ 2013 this doesn’t work – the extra digits are all printed as zeroes. So, you either need custom printing code or you need to wait for Dev 14 which has improved printing code.

So, I tried g++ on 32-bit Linux (Ubuntu 12.04) because I knew that glibc handles the printing just fine. However the results weren’t adding up correctly. I eventually realized that the sin() function in 32-bit versions of glibc 2.15 is just a thin wrapper around fsin, and this instruction is painfully inaccurate when the input is near pi.

Catastrophic cancellation, enshrined in silicon

The first step in calculating trigonometric functions like sin() is range reduction. The input number, which Intel says can be up into the quintillion range, needs to be reduced down to between +/- pi/2.

Range reduction in general is a tricky problem, but range reduction around pi is actually very easy. You just have to subtract the input number from a sufficiently precise approximation of pi. The worst-case for range reduction near pi will be the value that is closest to pi. For double-precision this will be a 53-bit approximation to pi that is off by, at most, half a unit in the last place (0.5 ULP). If you want to have 53 bits of accuracy in the result then you need to subtract from an approximation to pi that is accurate to at least 106 bits – half the bits will cancel out but you’ll still be left with enough.

But you actually need more than that. The x87 FPU has 80-bit registers which have a 64-bit mantissa. If you are doing long-double math and you want your range reduction of numbers near pi to be accurate then you need to subtract from at least a 128-bit approximation to pi.

This is still easy. If you know your input number is near pi then just extract the mantissa and do high-precision integer math to subtract from a hard-coded 128-bit approximation to pi. Round carefully and then convert back to floating-point. In case you want to give this a try I converted pi to 192 bits of hexadecimal:

C90FDAA2 2168C234 C4C6628B 80DC1CD1 29024E08 8A67CC74

But Intel doesn’t use a 128-bit approximation to pi. Their approximation has just 66 bits. Here it is, taken from their manuals:

C90FDAA2 2168C234 C

Therein lies the problem. When doing range reduction from double-precision (53-bit mantissa) pi the results will have about 13 bits of precision (66 minus 53), for an error of up to 2^40 ULPs (53 minus 13). The measured error is lower at approximately 164 billion ULPs (~37 bits), but that’s still large. When doing range reduction from the 80-bit long-double format there will be almost complete cancellation and the worst-case error is 1.376 quintillion ULPs.

Oops.

Actually calculating sin

After range reduction the next step is to calculate sin. For numbers that are sufficiently close to pi this is trivial – the range-reduced number is the answer. Therefore, for double(pi) the error in the range reduction is the error in fsin, and fsin is inaccurate near pi, contradicting Intel’s documentation. QED.

Note that sin(double(pi)) should not return zero. The sine of pi is, mathematically, zero, but since float/double/long-double can’t represent pi it would actually be incorrect for sin() to return zero when passed an approximation to pi.

Frustration

It is surprising that fsin is so inaccurate. I could perhaps forgive it for being inaccurate for extremely large inputs (which it is) but it is hard to forgive it for being so inaccurate on pi which is, ultimately, a very ‘normal’ input to the sin() function. Ah, the age-old decisions that we’re then stuck with for decades.

I also find Intel’s documentation frustrating. In their most recent Volume 2: Instruction Set Reference they say that the range for fsin is -2^63 to +2^63. Then, in Volume 3: System Programming Guide, section 22.18.8 (Transcendental Instructions) they say “results will have a worst case error of less than 1 ulp when rounding to the nearest-even”. They reiterate this in section 8.3.10 (Transcendental Instruction Accuracy) of Volume 1: Basic Architecture.

Section 8.3.8 (Pi) of the most recent Volume 1: Basic Architecture documentation from Intel discusses the 66-bit approximation for pi that the x87 CPU uses, and says that it has been “chosen to guarantee no loss of significance in a source operand”, which is simply not true. This optimistic documentation has consequences. In 2001 an up-and-coming Linux kernel programmer believed this documentation and used it to argue against adding an fsin wrapper function to glibc.

Similar problems apply to fcos near pi/2. See this blog post for more details and pretty pictures.

I ran my tests on the 2.19 version of glibc (Ubuntu 14.04) and found that it no longer uses fsin. Clearly the glibc developers are aware of the problems with fsin and have stopped using it. That’s good because g++ sometimes calculates sin at compile time, and the compile-time version is far more accurate than fsin. This mismatch makes for some bizarre inconsistencies when using glibc 2.15, and can make floating-point determinism challenging.

Is it a bug?

When Intel’s fdiv instruction was found to be flawed it was clearly a bug. The IEEE-754 spec says exactly how fdiv should be calculated (perfectly rounded) but imposes no restrictions on fsin. Intel’s implementation of fsin is clearly weak, but for compatibility reasons it probably can’t be changed so it can’t really be considered a bug at this point.

However the documentation is buggy. By claiming near-perfect accuracy across a huge domain and then only providing it in a tiny domain, Intel has misled developers and caused them to make poor decisions. Scott Duplichan’s research shows that at long-double precision fsin fails to meet its guarantees at numbers as small as 3.01626.

For double-precision the guarantees are not met for numbers as small as about 3.14157. Another way of looking at this is that there are tens of billions of doubles near pi where the double-precision result from fsin fails to meet Intel’s precision guarantees.

The absolute error in the range I was looking at was fairly constant at about 4e-21, which is quite small. For many purposes this will not matter. However the relative error can get quite large, and for the domains where that matters the misleading documentation can easily be a problem.

I’m not the first to discover this by any means, but many people are not aware of this quirk. While it is getting increasingly difficult to actually end up using the fsin instruction, it still shows up occasionally, especially when using older versions of glibc. And, I think this behavior is interesting, so I’m sharing.

That’s it. You can stop reading here if you want. The rest is just variations and details on the theme.

How I discovered this

I ran across this problem when I was planning to do a tweet version of my favorite pi/sin() identity

I compiled this with g++, confident that it would work. But before sending the tweet I checked the results – I manually added the two rows of numbers – and I was confused when I got the wrong answer – I was expecting 33 digits of pi and I only got 21.

I then printed the integer representation of sin(pi_d) in hex and it was 0x3ca1a60000000000. That many zeroes is very strange for a transcendental function.

At first I thought that the x87 rounding mode had been set to float for some reason, although in hindsight the result did not even have float accuracy. When I discovered that the code worked on my desktop Linux machine I assumed that the virtual machine I’d been using for testing must have a bug.

It took a bit of testing and experimentation and stepping into various sin() functions to realize that the ‘bug’ was in the actual fsin instruction. It’s hidden on 64-bit Linux (my desktop Linux machine is 64-bit) and with VC++ because they both have a hand-written sin() function that doesn’t use fsin.

I experimented a bit with comparing sin() to fsin with VC++ to characterize the problem and eventually, with help from Scott Duplichan’s expose on this topic, I recognized the problem as being a symptom of catastrophic cancellation in the range reduction. I’m sure I would have uncovered the truth faster if Intel’s documentation had not filled me with unjustified optimism.

Amusingly enough, if I’d followed my usual practice and declared pi_d as const then the code would have worked as expected and I would never have discovered the problem with fsin!

Code!

If you want to try this technique for calculating pi but you don’t like manually adding 34-digit numbers then you can try this very crude code to do it for you:

In other words, if the sixth digit of sin(double(pi)) is four then you have the correct value for sin(), but if it is zero then you have the incorrect fsin version.

Compile-time versus run-time sin

g++ tries to do as much math as possible at compile time, which further complicates this messy situation. If g++ detects that pi_d is constant then it will calculate sin(pi_d) at compile time, using MPFR. In the sample code above if pi_d is declared as ‘const’ then the results are quite different. This difference between compile-time and run-time sin() in g++ with glibc 2.15 can be seen with this code and its non-unity result:

It’s almost a shame that these goofy results are going away in glibc 2.19.

Sine ~zero == ~zero

Some years ago I was optimizing all of the trigonometric functions for an embedded system. I was making them all run about five times faster and my goal was to get identical results. However I had to abandon that goal when I realized that many of the functions, including sin(), gave incorrect answers on denormal inputs. Knowing that the sine of a denormal is the denormal helped me realize that the original versions were wrong, and I changed my goal to be identical results except when the old results were wrong.

66 == 68

Intel describes their internal value for pi as having 66 bits, but it arguably has 68 bits because the next three bits are all zero, so their internal 66-bit value is a correctly rounded 68-bit approximation of pi. That’s why some of the results for fsin are slightly more accurate than you might initially expect.

Sin() is trivial (for almost half of input values)

For small enough doubles, the best answer for sin(x) is x. That is, for numbers smaller than about 1.49e-8, the closest double to the sine of x is actually x itself. I learned this from the glibc source code for sin(). Since half of all doubles have a magnitude below 1.0, and since doubles have enormous range, it turns out that those numbers with magnitudes below 1.49e-8 represent about 48.6% of all doubles. In the glibc sin() source code it tests for the top-half of the integer representation of abs(x) being less than 0x3e500000, roughly equivalent to <= 1.49e-8, and that represents 48.6% of the entire range for positive doubles (0x80000000). The ‘missing’ 1.4% is the numbers between 1.49e-8 and 1.

So, if asked to write a sin() function I will volunteer to do the easy (almost) half and leave the rest for somebody else.


About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs.
I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8.

this may be an ancient thread and i’m not sure if anyone will read it but won’t that approach possibly create problems when adding or subtracting n*double(pi)? but then again i suppose that trying to compute the sine of something stupidly large like 20 000 000 is futile anyway since floats are very grainy at that point

Range reduction is tricky but it is also well studied, and I believe can be done well, if that is important.

In most cases sin(large) is just a bad idea, agreed. On the other hand, sin(double(x)) always does have an actual best answer, and philosophically at least I like to know that there are sin() implementations that can calculate it.

Put another way, with very large numbers two adjacent doubles may be many multiples of pi apart, which makes sensible trigonometry ‘tricky’ to say the least. And yet, those numbers still have a best value for sin(x).

It’s reasonable for fsin to handle those poorly, as long as that is documented as such.

It could be interesting to compare the results of this with the D language, using dmd and gdc (gcc backend), with the double and real types. The real type in D tries to use the longest floating-point type on the system, i.e. the x87 FPU on a PC, while the double type is the typical 64-bit float.

They talk about it in the manual. They say the extra two bits are just for padding. However it turns out that the next two bits are supposed to be zero, so they could legitimately describe it as a 68-bit constant. Why don’t they? Ask Intel.

I would like to examine this further on other hardware platforms (IBM mainframe plus PowerPC, SUN, & HP distributed processors, all non-Intel). Could you provide a single program that could be compiled and run elsewhere (attempted at least) across multiple platforms to see how those fare (produce printed output or binary files if necessary).

Note that none of those CPUs you’ve mentioned has instructions for transcendentals (sin, cos, exp, etc). They all rely on software implementations.
Even on x86 CPUs, the trend is to use software implementations for those functions.

Well, not entirely true. The IBM PowerPC in the RISC/6000 has a “complex” instruction set (note the irony), with built-in hardware (microcode or firmware) for many of the instructions or functions being discussed here. But using software implementations does not necessarily mean they are doing it right. Testing will bear out how close to correct they are. Early Apple computers (Apple II, Macs) did not have hardware floating point, and made use of the SANE software package (IEEE based specifications). Yet, during my tenure using an Apple IIgs, one company came out with an FPU card that fit in a slot, which also was built to those IEEE specifications (including jumping from 64-bit floating point to 80-bit floating point). When installed, it replaced all the equivalent SANE functions, and we simply believed they were correct (with some justification, of course).

By the way, that FPU card cost more in the 90s than a whole computer does today.

This will affect programmers, who then have to work around the issue so that everyday computer users are not affected. The developers of VC++ and glibc had to write alternate versions, so that’s one thing. The inaccuracies could be enough to add up over repeated calls to sin and could lead to errors in flight control software, CAD software, games, various things. It’s hard to predict where the undocumented inaccuracy could cause problems.

It likely won’t now because most C runtimes don’t use fsin anymore and because the documentation will now be fixed.

One could just as well say “do you expect people to know that 10^18 means quintillion?” Large numbers are not familiar to most people in any notation. Note, also, that in C and C-like languages, used throughout this article, 10^18 is 24, so your suggestion depends on the reader knowing from context that just this time, ^ has a totally different meaning.

I don’t know if I can do superscripts in wordpress.com titles, and even if I can they would map poorly to a URL name. Quintillion has the advantage of being less ambiguous when googled, although to be really clear I should have said “1.3 exa ulps”.

I thought that the names of the large numbers in english are easy to remember if you know how numbers in latin are named. One just divides the tens power by three and subtracts one. For example 10^18 : 18/3-1=5 Latin “quinque” gives quintillion; 10^33 : 33/3-1=10 Latin “decem” gives decillion; 10^12 : 12/3-1=3 Latin “tria” gives trillion. There are only two exceptions in American English, 10^6 and 10^3.

Yep, British naming for the large numbers is different, with billion/trillion/quadrillion/quintillion being one million raised to the 2nd, 3rd, 4th, and 5th powers respectively. I usually avoid saying ‘billion’ because of this ambiguity, although I think the (more consistent) British names are losing out.

When needing a sin() and cos() function for rotation matrices, I was so frustrated with the lack of precision guarantees, that I decided to make my own implementation.

By mapping sin(a) with a > Pi/4 to cos(a – Pi/2) and sin(a) with a < -Pi/4 to -cos(a + Pi/2) I'm reducing the range to -Pi/4 to Pi/4. I've done so under the assumption that the Taylor series works best around a = 0.

After I wrote that question I gave it a try. Instead of reducing the range to ±Pi/4 by referencing sin()→cos() and cos()→sin() I only reduced it to ±Pi/2 by self-referencing sin()→sin() and cos()→cos().

The result of my test run was bit-for-bit identical output. So I conclude that, as far as double precision math goes, it doesn’t matter.

Kami – you have the right idea, but think in radians not fractions of Pi.
Consider the Taylor Maclaurin series for each:
sin(x) = x – x^3/3! + x^5/5! – x^7/7! + x^9/9! – x^11/11! + x^13/13! – x^15/15! + . . . and
cos(x) = 1 – x^2/2! + x^4/4! – x^6/6! + x^8/8! – x^10/10! + x^12/12! – x^14/14! + . . .
To achieve the greatest precision with the fewest terms, keep x < 1 (~ 57º). The key to speedy computations is
1) to table an array of the inverse factorials (constants). and
2) compute x^2. Each term in either equation is the running power of x (the previous power times x^2) times InvF(i).
If you keep x<1, precision is automatic. If you need 10^-35 precision cos(), you'd need 16 terms (ends with – x^30/30!) and the precision is better than the 17th term +x^32/32! or 3.80039075e-36 (at worst). For sin(), you'd also need 16 terms (ends with – x^31/31!) and the precision is better than the 17th term +x^33/33! or 1.15163356e-37 (at worst).
Today, I generally use my own T/M Sin and T/M Cos routines because I only need about 1.0e-5 precision when drawing to my screen or 1.0e-6 going to the printer.
Even at double precision, my software was 2.4 to 3 times faster than the original Pentium chips. Not true today.
One other observation. If I needed this level of precision, I'd sell the wife and get a good XEON system. (only kidding – about selling the wife).

Thanks for the pointer. I’ll definitely look into it when I find the time.

Right now the performance of sin()/cos() is completely negligible. You need them to create a rotation matrix which in turn is applied to thousands of coordinates. In my last profiling run there were
108111 calls of the cos() function for
46370280251 rays cast, which in turn led to
114611272720 calls to the Gaussian Elimination function.

Hi, I have combined the two given programs into one and here is what I get as a result. I haven’t changed the code, just put them in one file. My CPU is an AMD Athlon 64 X2 4200+ Brisbane, compiler g++ 4.9.1 on Debian testing.

Would it make sense to introduce another instruction (or pass a second argument to FSIN) to calculate sin(pi*x) instead of sin(x)? The argument reduction will be fast and error free. That way people caring about accuracy around integer multiples of pi can reformulate their calculations to take advantage of this.

If such a technique makes sense then it should be done in software, not in hardware. The whole concept of an fsin instruction is from another era. It is not something that hardware can do significantly better.

I am surprised that FSIN is indeed still used in some semi-recent variants of glibc (in 32-bit builds). I was of the impression that all those “special” math routines had been abandoned and replaced by software versions, which are much faster in practice, at least 10 years ago (well, that seems to be the case for exp() at least).

Avoiding fsin was discussed about a decade ago but (as shown in the link from “up-and-coming Linux kernel programmer”) Linus Torvalds believed Intel’s documentation and said that it wasn’t necessary. If the documentation had been accurate then sin() would probably have been updated then.

Way back around 1982, I was working as a research assistant for Cleve Moler, he of MATLAB fame. He has always been keenly interested in trig function algorithms and implementations. We tested the IBM 360 libraries and the then brand-new Intel 8087 to measure the error rates. We found the 360 trig functions displayed enormous errors around certain values, and so did the 8087, somewhat less severe. He tested simple algorithms that used a small table of values to adjust results as the calculation proceeded and provided less than 1 bit errors across the entire function. However, the idea of using up valuable RAM to hold a few dozen bytes of table data seemed outlandish, I don’t know if it ever went anywhere. I’m not surprised a hardware implementation of sine would have the kind of errors described here.

I’ve been trying to think of an example of when this error would be anything that would have any real effect; a picture would look wrong, a satellite would miss its orbit, a protein folding would be calculated incorrectly, a bank transfer would be incorrect. I’d love it if somebody would suggest a way this could have any kind of practical effect at all — I’m certain that there is, but I just can’t see it.

Perhaps some sort of simulation, or something where the internal inconsistencies cause issues. I’m not actually sure. It’s hard to say because issues caused by this would probably be found in test and the developers would then work around them, perhaps by not using fsin.

A general question here is how does the ‘accurate’ software version compare to the hardware FSIN/FCOS/FSINCOS implementations in terms of speed?
Now I know that “fast and inaccurate” is not often a good idea, but I can imagine a lot of cases where either the angles are small or only float accuracy is needed, and if the speed is different by an order of magnitude then it might be worth considering.

It depends. The glibc math functions for 64-bit computations should be correctly rounded. This means that for y = sin(x) the value of y is the closest possible floating point number to the ‘true’ sin of x. The performance of correctly rounded math functions is highly variable (input dependent). Indeed, at least for the pow(…) function there are some degenerate cases that can be 10,000 times slower than ‘usual’.

Thankfully no one in the scientific community uses either the glibc math functions (except when the algorithms depend on the results being correctly rounded) or the x87 unit in general (SSE/SSE2/AVX has taken its place even for scalar computations). A decent vector math library should be ~10 or so times faster than the x87 transcendental functions (double precision) with errors on the order of ~1 to ~2 ulp.

It is worth pointing out that the x87 functions are not ‘in silicon’; they are almost all in microcode nowadays. Hence, Intel can probably fix the error if they wanted to with a microcode update. (They won’t, as the issue affects ~0 real world applications.) Indeed, the fact that these instructions are in microcode is one of the reasons why they are so slow. As a reference point, on a Haswell CPU fsin can take between 47 and ~100 or so cycles. A Pentium 4 chip takes between 27 and ~100 or so cycles for the same operation. Or, in other words: no improvement beyond clock speed over 16 years.

I’m genuinely curious. Better rounding than fsin seems pretty easy, but correctly-rounded is a tougher challenge, and correctly rounded across a wide range presumably has limited benefits so it would be fascinating if they had decided to go that far.

BTW, the handy thing about software implementations is you can (if you have the source and the time) choose your own tradeoffs. This can include performance, precision, and supported range. Instructions for things like sine and cosine can’t offer that sort of flexibility.

Correct Rounding was a goal for IEEE754-2008 which was to have a section on accuracy for transcendental functions. They stumbled over this little thing of the “Table-Maker’s Dilemma” – how many digits must you generate before you can know which way to round the final digit? The worst cases along the FP number line are shockingly large, certain theoretical values are “Entscheidungsproblem” hard.

The trend nowadays is towards cheap hardware approximation functions. We already have them in SSE as RCP and RSQRT for getting a cheap approximation to 1/x and 1/sqrt(x) which the user can then polish with a Newton iteration or two. These will be extended further with AVX-512 (ERI). Similar instructions are already available on the Xeon Phi.

Indeed, now we have fused multiply add instructions available the days of hardware double precision div and sqrt are numbered. Expect the throughput to remain flat or even regress in the coming years. This already happened when AVX was introduced; we got 256-bit variants of sqrtpd and divpd but the latency is ~double that of the 128-bit variants. Incidentally, NVIDIA GPUs do not have double precision variants of either of these instructions — they use FMA operations to polish single precision approximations.

To clarify your comment, the reason the scientific community might not find this ideal is because this sin() function is accurate but not necessarily fast. Also note that the source code lists the cutoff point below which sin(x) == x — handy.

To those who aren’t familiar with the tradeoffs of implementing sin/division/etc. in hardware versus software, hardware has very few inherent advantages for this (the same operations must be done) and using software means that the ever improving out-of-order processing engine can improve performance. Plus software is more flexible.

Concerning the correct rounding of sin and cos in glibc, this is just a conjecture; but AFAIK it computes with up to 768 bits, which should be easily enough (there was also a bug in the past, but it is now fixed). The hardest-to-round cases are not known for the sin/cos/tan functions. For the best known algorithms to find them: http://hal.inria.fr/inria-00126474 (but this would still require many years of computations).

Remember that fsin is a x87 instruction, and that x87 has been deprecated for a decade.
Also, the documentation does clearly say that only 1 ulp is guaranteed for values that are in range.

Range reduction is extremely costly to do accurately, which is why they chose not to do it, especially since it is an 80-bit computation. Once truncated to 64-bit, the result is correct except for very large values.

No one serious about floating-point uses 32-bit. It’s ancient, and the ABI causes large problems precisely because of its usage of x87. Move to amd64 like the rest of the world.
I see that your glibc results are also based on the 32-bit variants, something that anyone in their right mind shouldn’t be using at all.

I see you conveniently quoted only part of my sentence. 1 ulp is only guaranteed *when the input is in range*. This also appears in the original documentation.

There are many types of software (games, for instance, but also many others) where 32-bit support is still required. Unfortunately enough consumers still use 32-bit operating systems that dropping 32-bit support is not always practical.

The “in range” part of your statement doesn’t really change the meaning because the range is defined elsewhere in the documentation as being +/- 2^63. pi is well within that range.

Are you actually suggesting that the current documentation is correct? Because even Intel doesn’t believe that — see their blog post from Thursday that I link to.

There is no good reason for games to run on 32-bit, even if a lot still do so due to special legacy issues that only people in the field would know. All hardware that has been released for ten years is 64-bit compatible, and games require relatively recent hardware.

I think most or all PC game developers would *love* to release their games as 64-bit only. And yet, so far most of them have not. That is because a non-trivial number of customers still run 32-bit operating systems – sometimes on quite new hardware. Yes, this is annoying. But it is reality, especially for games that are targeting the Chinese market, where 32-bit Windows XP is quite common.

I’m curious; does anyone have a non-contrived, real world example where this apparent deficiency in the documentation for the FPU has a genuine practical significance?

In most real world applications, the argument for the sin() function will be arrived at through computation. The argument will therefore be subject to the normal errors which arise in such FPU computations, meaning that in general there will be an error in the argument of the order of a ulp or more. The extra error of a fraction of a ulp due to the FPU approximation of pi will therefore be minor in almost all cases.

Some years ago, I had a brief look at the issue of the accuracy of the FPU approximation of pi and put a few notes in a README which can (or could) be found in the Linux kernel sources, or in places such as: https://www.kernel.org/doc/readme/arch-x86-math-emu-README
This gives the example of the integer 8227740058411162616 which is close to an integer multiple of pi and would require an approximation for pi with around 150 bits of precision in order to reduce the argument to a 64 bit precision number in the range +/- pi/2.

There’s a maths graduate on Slashdot who says that hardware instructions or anything less than “Maple, Matlab, Mathematica and similar” is insufficient. He certainly claims to know when this extra accuracy is necessary, but I’m not sure if he’ll explain.

Are you implying that being a math graduate makes someone an expert on numerical computing and floating-point arithmetic?
That’s not quite the case. Even among professors and researchers specialized in these fields, there are relatively few experts.

“I’ll tell you now that I wouldn’t rely on a FPU instruction to be anywhere near accurate. If I was doing anything serious, I’d be plugging into Maple, Matlab, Mathematica and similar who DO NOT rely on hardware instructions.”

Grouping Matlab in with Maple and Mathematica (both of which have arbitrary precision arithmetic capabilities) is an immediate red flag. Matlab uses the floating point capabilities of the underlying hardware (single or double precision), not arbitrary precision.

“Nobody serious would do this. Any serious calculation requires an error calculation to go with it. There’s a reason that there are entire software suites and library for arbitrary precision arithmetic.”

In the majority of physics and engineering applications the error from either the model or in the uncertainty of the initial conditions dominates. It is unusual for issues to come about due to accumulation of round-off errors. This is especially true when working at double precision. (One would be surprised just how many engineering packages — both open source and commercial — run in single precision without issue.)

As an example, if I want to work out the circumference of a circle with a radius of 1 m and use a value of pi good to 10 decimal places, then the resulting error will be in the vicinity of 1e-10 m, similar to the radius of an atom. Irrelevant.
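As a quick sanity check of that estimate (a Python sketch; the 10-digit pi constant here is mine, chosen purely for illustration):

```python
import math

# Error in the circumference of a 1 m circle from using pi rounded
# to 10 decimal places.
r = 1.0
pi10 = 3.1415926536
err = abs(2 * r * pi10 - 2 * r * math.pi)
print(err)   # about 2e-11 m, smaller than the radius of an atom
```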

“And just because two numbers “add up” on the computer, that’s FAR from a formal proof or even a value you could pass to an engineer.”

The addition of two floating point numbers (single or double precision) is specified to be exact with respect to the currently-employed rounding mode on all major hardware platforms. If this is round-to-nearest then c = a + b yields the closest floating point value to the ‘true’ value of a + b. While it is uncommon for manufacturers to provide formal proofs for their FPUs (AMD did many years ago, though), the elementary operations are expected to be compliant. Except for the Pentium FDIV bug I am not aware of any evidence to the contrary.
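That exactness is easy to check empirically. This Python sketch (needs Python 3.9+ for math.nextafter) uses exact rational arithmetic as a referee to verify that each hardware sum is the representable double nearest the true sum:

```python
import math
import random
from fractions import Fraction

# Check that hardware double addition returns the representable value
# nearest the exact sum, using exact rational arithmetic as the referee.
random.seed(1)
for _ in range(1000):
    a = random.uniform(-1e6, 1e6)
    b = random.uniform(-1e6, 1e6)
    exact = Fraction(a) + Fraction(b)   # exact: every double is a rational
    s = a + b                           # hardware round-to-nearest addition
    for nb in (math.nextafter(s, math.inf), math.nextafter(s, -math.inf)):
        assert abs(Fraction(s) - exact) <= abs(Fraction(nb) - exact)
print("all sums correctly rounded")
```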

Thanks for the detailed information. I was not aware of the distinction between MatLab and the others, and that is clearly important.

Regarding error accumulation, it certainly is possible to have errors accumulate if you use a poorly conditioned algorithm. However the fix is not to give up on hardware floating point, but to use a well conditioned algorithm or, in some cases, use double instead of float. I have made both of these fixes in software I have worked on and they have worked very nicely. Throwing arbitrary precision arithmetic at a problem when the algorithm is poorly conditioned is like using a sledgehammer on a thumbtack.

As has been mentioned already, the issues with large arguments to trigonometric functions are well known, as is the relationship of accuracy to the approximation for pi that is used in argument reduction. The 66-bit approximation of pi used in Intel x87 hardware implementations certainly will afford large errors for large arguments, which Intel developers certainly knew and documented.

The trigonometric functions implemented in Sun/Oracle libm/libsunmath/libmopt have for a long time used a so-called “infinite” precision approximation of pi, i.e. enough bits to guarantee a sufficiently accurate argument reduction to afford less than 1 ulp error even for large arguments to trig functions.

For a few years Sun/Oracle also made available the source code for libmcr, a library of common transcendental functions, including the trig functions, that produced correctly rounded results (i.e. the algorithm used was proved to compute correctly rounded results subject to the correctness of the implementation). What was not determined, though, was the upper bound of computation time due to the aforementioned problem of not knowing the maximum number of bits necessary for “hard” cases.

I don’t think that the issues with large arguments to trigonometric functions were “well known” except in a very narrow community — I know more than most about Intel floating point and I had no idea.

And, while I agree that the Intel developers knew about the argument reduction issues, they were *not* documented. If they had always been documented then Intel wouldn’t have done a blog post on Thursday (linked from the fourth paragraph) detailing the doc fixes they will make. That is, the 66-bit approximation of pi was documented, but its implications were not; the documentation actually denied them.

I have a genuine question about this – just how important is “correct rounding” to 1 bit?

Given you already have finite precision, would having, say, a 3-bit maximum precision extension (so you have an 87% probability of 1 LSB correctness and a 13% chance of it extending to a 2 LSB error) be good enough to deal with the “table maker’s dilemma” and give a provable limit on computation time?

For the basic operations, absolutely critical, since otherwise error analysis is impossible. And, there are times with sin/cos, I am sure, where perfect rounding is critical, although I am sure they are less common.

Getting to within 1 ULP of the correct answer is not, I believe, too hard. Getting the rounding perfect can be much trickier.

One significant advantage of perfect rounding is that it makes for consistency. Right now every sin() function is different, so code can break in subtle ways during compiler upgrades or when moving to a different machine. If perfect rounding is available everywhere then perfect consistency is available everywhere, and *that* is hugely valuable.

Am I right then in thinking that “perfect rounding” is 0.5 ULP, and if you don’t quite get that then you could be 0.50000…1 ULP away due to the “Table maker’s dilemma”?

I am not an expert in numerical analysis but I have written a lot of number crunching software, and I never expect to have perfect consistency from one compiler to another, nor from one optimisation level to another!

This is due to things like how the code is optimised (i.e. order of operations), and whether the FPU uses higher internal precision for temporary variables held in FPU registers (typical of the x87 family), rather than writing them out at float/double/etc. So while I can appreciate that NA folk would like to have such a degree of consistency for error analysis, my usual feeling is that if it is going to be a problem then you have an unstable algorithm in the first place, and you really need to address that first, long before you worry about 1 or 2 ULP.

Perfect rounding is defined as being equivalent to taking the infinitely precise result and rounding it according to the current rounding mode, which defaults to round-to-nearest even (ties go to even answers). In this default mode the maximum error is 0.5 ULP. This is easy to implement for add/subtract/multiply/divide/square-root, and tough for sin/cos.
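The default tie-breaking rule (round-half-to-even) can be demonstrated in a couple of lines of Python, which uses the hardware's double precision:

```python
# Round-half-to-even ("banker's rounding") in double precision:
# 1.0 + 2^-53 lies exactly halfway between 1.0 and 1 + 2^-52; the tie
# goes to the neighbor whose last significand bit is even, i.e. 1.0.
print(1.0 + 2.0**-53 == 1.0)                          # True
# (1 + 2^-52) + 2^-53 is also a tie, but here the larger neighbor
# 1 + 2^-51 has the even last significand bit, so the tie rounds upward.
print((1.0 + 2.0**-52) + 2.0**-53 == 1.0 + 2.0**-51)  # True
```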

You are correct that order of operations and other factors can make getting identical results extremely difficult (see https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/ for a discussion of many such factors). Given how hard determinism is, any time that one of the problems can be removed it is very helpful for those who need determinism.

As floating-point moves away from x87, and as the rules are strengthened, determinism is becoming far more practical. Because of that, and because tight error bounds are better than sloppy/uncertain error bounds, I think that provably correct transcendental functions are a nice step forwards, even if they will not affect most developers.

An interesting comparison, but I think that the fdiv bug was *far* more serious. It was impractical to work around and it could hit almost any software, because division is one of the core operations used to implement higher-level math operations, such as a software implementation of sin(). A bad fdiv can make implementing sin() difficult, but not vice-versa.

I enjoyed your floating point articles so much that I decided to look at how msvcr100.dll implements sin(). sin(double(pi)) is correct, and when x is restricted to some range (around [-28000pi, 28000pi]), around 15 in 10,000 values of sin(x) are 1 ULP off; the rest are correct.

msvcr100 uses a 122-bit value of pi, and Cody and Waite’s method is used for range reduction. The interval [0, 2pi] is divided into 64 equal regions, then sin(x) is computed using a 9th order Taylor series approximation around the center of the region containing x. Kahan summation and some other techniques are used to improve accuracy. I’ll probably write a complete analysis some time.
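The Cody and Waite idea can be sketched briefly: store pi/2 as a sum of several doubles whose high parts have enough trailing zero bits that n times each part is exact, so the subtractions cancel without rounding error. The constants below are the well-known fdlibm three-double split of pi/2, not necessarily the exact ones msvcr100 uses:

```python
import math

# fdlibm's three-double split of pi/2: the high part has only ~33
# significant bits, so n * C1 is exact for moderate n and each
# subtraction loses little or no precision.
C1 = 1.57079632673412561417e+00
C2 = 6.07710050650619224932e-11
C3 = 2.02226624879595063154e-21

def reduce_pio2(x):
    """Reduce x modulo pi/2, Cody-Waite style (valid for moderate |x|)."""
    n = round(x / (math.pi / 2))
    r = ((x - n * C1) - n * C2) - n * C3
    return n, r

n, r = reduce_pio2(10.0)
print(n, r)   # n = 6, and r close to 10 - 3*pi = 0.5752220392306203...
```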

Very cool. How did you find this out? IIRC VC++ doesn’t publish source code for the math functions — did you derive this from the assembly code? It would be interesting to look at the Dev 14 CRT (VMs in Azure available) to see if they’ve changed it at all.

From the disassembly of the DLL. The algorithm is the same from msvcr80 to msvcr120, SSE2 is used for arithmetic. The older ones use FSIN. I haven’t looked at appcrt140, I’m still looking for a copy of the DLL without downloading VS14.

Sometimes it seems like tweeting is the only way to get documentation bugs fixed, and that’s a pretty poor method. It is annoying when software projects or documentation websites don’t have a way of submitting feedback or bugs. The lack of formal ways to report Windows Performance Toolkit bugs is particularly frustrating to me.

Mark, this is essentially a documentation bug. These operations implemented in hardware are intended to be fast. The loss of accuracy here is related to the range reduction, which can take time. There exist fast range reduction algorithms, but Intel hasn’t implemented one. The reason may be that they want to remain “compatible” with past processors. On this subject, at some point AMD x86 processors were more accurate than Intel’s, and AMD users complained and reported bugs! Also note that in practice, even if fsin were correctly rounded, you would probably get the same loss of accuracy in your program, because the inputs would have to be represented with much more precision than is supported by fsin (except in the very rare cases of exact inputs for fsin); see: https://en.wikipedia.org/wiki/Condition_number

Vincent has answered this quite nicely. I’ll just add that the expectation (poorly documented) is that users will do their own range reduction and therefore avoid these inaccuracies. Most sin() implementations have gone farther than that and just skip the fsin instruction entirely, in the case of glibc giving perfectly rounded results.

After some search on Google… What was done in the AMD K5 processor and reverted later is an accurate range reduction, i.e. fixing the issue discussed here. On https://blogs.oracle.com/jag/entry/transcendental_meditation (posted in 2005), it is said: “This error has tragically become un-fixable because of the compatibility requirements from one generation to the next. The fix for this problem was figured out quite a long time ago. In the excellent paper /The K5 transcendental functions/ by T. Lynch, A. Ahmed, M. Schulte, T. Callaway, and R. Tisdale a technique is described for doing argument reduction as if you had an infinitely precise value for pi. As far as I know, the K5 is the only x86 family CPU that did sin/cos accurately. AMD went back to being bit-for-bit compatibility with the old x87 behavior, assumably because too many applications broke. Oddly enough, this is fixed in Itanium.”

The problem here is just bad range reduction. The range reduction for numbers near Pi is done by subtracting a hard-coded Pi constant, and that constant has only a bit more precision than a double. In fact, hand adding of double(pi) + fsin(double(pi)) produces the exact constant that Intel uses, since the result of fsin(double(pi)) is just their reduction constant minus double(pi).
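With an accurate sin() the hand-addition trick from the post does work; here is a quick Python sketch, using exact rational arithmetic to stand in for the by-hand addition (the 40-digit pi reference is just a truncation of the known decimal expansion):

```python
import math
from fractions import Fraction

d = math.pi          # the double-precision approximation of pi
corr = math.sin(d)   # ~= true pi minus d, since sin(pi - e) ~= e
# Adding the two doubles exactly ("by hand") yields a far better pi.
better = Fraction(d) + Fraction(corr)

# pi truncated to 40 digits, for reference
pi_40 = Fraction("3.141592653589793238462643383279502884197")
print(float(abs(Fraction(d) - pi_40)))   # ~1.2e-16: error of double pi
print(float(abs(better - pi_40)))        # ~1e-32: roughly quad precision
```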

But to blame your reported inaccuracy on the “range reduction” just does not help people realize the deeper reason behind this matter: the bad conditioning of your proposed procedure.

Here, it is useful to realize that calculating fsin(pi) correctly is highly difficult by nature, because of its high condition number. In numerical analysis, your forward error (which is the kind of error we are talking about) is controlled by your backward error and condition number, where the condition number is known to be problem-dependent but implementation-independent (see Higham’s book referenced above). This means that if your proposed procedure is badly conditioned, you will probably have a hard time making your numerical computation accurate.
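The standard relative condition number of f(x) = sin(x) is |x·cos(x)/sin(x)|, which blows up as x approaches pi; a short Python illustration of just how ill-conditioned evaluation at double(pi) is:

```python
import math

# Relative condition number of sin: cond(x) = |x * cos(x) / sin(x)|.
# Near pi, sin(x) -> 0 while cos(x) -> -1, so the condition number explodes.
def cond_sin(x):
    return abs(x * math.cos(x) / math.sin(x))

print(cond_sin(1.0))        # about 0.64: well conditioned
print(cond_sin(math.pi))    # about 2.6e16: a 1 ulp input error can
                            # change the result completely
```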

I get your point (more or less), but I think my method is still valid. It’s a stress test for sin implementations for sure – a stress test that VC++ handles well, and modern glibc implementations handle *perfectly*. At this point it is only Intel’s implementation (decades old and not changeable) that can’t handle my calculation.

Well conditioned algorithms are important to be sure. I fixed a function at my last job that had catastrophic cancellation that left close to zero digits of precision. The (trivial) fix caused it to give perfect (no rounding error) results. A clear win for better conditioning.

In this case, even though the range reduction is arguably a case of catastrophic cancellation, it is a situation that sin should handle. That’s its job.

Thanks for your feedback. But no, your method is highly sensitive to input errors, and therefore *invalid* numerically. This sensitivity is quantified by, as mentioned earlier, the condition number that is *born with* your proposed procedure.

Of course, when your procedure is too sensitive to input errors, you can always use the more precise glibc to compensate. But this does not contradict the fact that your proposed procedure behaves badly from a numerical perspective.

To sum up, in your case, IMHO, do not blame Intel, but blame the way you estimate pi. Disclaimer: I am not affiliated with Intel.

The right approach would be to standardize a set of trig functions that scale their argument by 2pi or some other similar number. Most applications use angles that, if represented accurately, would be rational multiples of a circle. Outside of trying to read out the modulus used for range reduction, can you think of any non-contrived situations where an implementation which computed the double nearest to the sine of exactly x would be more useful than one which computed the double nearest to the sine of some angle within 1/4 ulp of x, or of x scaled by the ratio of “real” 2pi to the value closest to 2pi? Even computation of the behavior of a harmonic oscillator whose natural period would be 2pi seconds will only benefit if the step size is a power-of-two fraction of a second. If it isn’t (e.g. if there are 1,000,000 steps/second) optimal accuracy will require that range reduction be performed with a modulus of 2,000,000pi so as to get the value into the range +/-500,000pi before values are scaled by 1,000,000, so attempts to improve accuracy outside the +/-pi/2 range will be irrelevant.

Some games use integers to represent angles such that, for example, 4096 represents a full rotation. This allows for table lookup and allows sin(pi) (actually sin(2048)) to return exactly zero.
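A sketch of that scheme (hypothetical code, not taken from any particular game): build a quarter-wave table and exploit symmetry, so that sin at exact multiples of the quarter turn comes out as exactly 0 or ±1:

```python
import math

# 4096 integer units per full turn; a quarter-wave table plus symmetry
# guarantees exact zeros at 0 and 2048 ("pi") and exact +/-1 at 1024/3072.
QUARTER = [math.sin(2 * math.pi * i / 4096) for i in range(1025)]

def isin(a):
    a &= 4095                        # wrap to one full turn
    if a <= 1024:  return QUARTER[a]
    if a <= 2048:  return QUARTER[2048 - a]
    if a <= 3072:  return -QUARTER[a - 2048]
    return -QUARTER[4096 - a]

print(isin(2048))   # 0.0 exactly, unlike sin(double(pi))
```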

But to call this “the right approach” in general is to focus too much on this one tiny problem. Radians are the correct units for angles because they make the math for calculating sin/cos etc., whether with Taylor series, trig identities, or otherwise, work out well. Coming up with yet another unit for measuring angles (we already have degrees, radians, and gradians) creates a large new problem while solving a small existing problem.

I am well aware of why radians make sense in mathematics. What I have not seen are any non-contrived circumstances in which code which is interested in precision would want to compute trig functions on angles outside the range 0 to pi/2. Unless one has an unusual problem domain in which the only angles one is interested in are precisely equal to “double” values and never fall between them, getting optimal precision will require that range reduction be performed before computing trig functions’ arguments. I do not think it would be unreasonable to assume that any code which passes a value beyond pi/2 to a trig function isn’t really interested in precision.

I also know of no efficient algorithms for computing sines and cosines of radian-based angles which would not be more efficient with angles specified as power-of-two fractions of a circle. Having functions which operate on power-of-two fractions, along with some functions to convert radians to quadrants and vice versa, would increase both speed and precision in cases where angles could be precisely expressed in units other than radians, and would have minimal effect in cases where angles would naturally be expressed as radians (a range-reducing radians-to-quadrants function which was designed for use with sin(x) would map values near pi/2 to values near zero, so “sin2pi(radToQuadForSin(x))” would be within one ulp of a perfectly-computed radian-based sin(x)).

My understanding is that *all* of the efficient algorithms for computing sines and cosines use radians – you’d have to invent new algorithms or adjust the existing ones, probably with precision loss. Every time you converted from base-2-multiple-of-pi to radians you would lose precision, and the only time you would gain precision would be at exact fractions of pi.

Anyway, I don’t think that a new angle format is a good idea, and it will never happen anyway so it’s all theoretical.

The functions sinPi(x) = sin(pi x) and cosPi(x) = cos(pi x) are standardized by IEEE 754-2008 (my suggestion was to use the factor 2pi instead of pi, but it hasn’t been followed). I think that indeed sinPi may be faster than sin for small arguments; but this depends on the kind of algorithms: if polynomial approximations are used with coefficients obtained from a table look-up, then sinPi may be as fast as sin. But for large arguments, an accurate sin needs a complex range reduction, which may take time, while the range reduction for sinPi is really trivial; so, sinPi should be faster than sin in such cases.
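The trivial reduction for sinPi comes from the fact that fmod by 2.0 is exact in floating point, so no pi constant ever enters the reduction; here is a Python sketch of sinPi (not IEEE-conformant in edge cases like signed zero, infinity, or NaN):

```python
import math

def sinpi(x):
    """sin(pi * x), with exact range reduction; edge cases not handled."""
    r = math.fmod(x, 2.0)            # exact: fmod introduces no rounding
    if r > 1.0:    r -= 2.0          # also exact (Sterbenz lemma)
    elif r < -1.0: r += 2.0
    if r > 0.5:    r = 1.0 - r       # fold into [-0.5, 0.5], where
    elif r < -0.5: r = -1.0 - r      # sin(pi*r) is well conditioned
    return math.sin(math.pi * r)

print(sinpi(1.0))     # 0.0 exactly: the reduction never touches a pi constant
print(sinpi(1e18))    # 0.0: even huge arguments reduce trivially and exactly
```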

For double precision round-to-nearest-even, the largest x where sin(x) = x is 2.1491193328908210e-8. This is a property of sin(), the precision, and the rounding mode.
In actual implementations of sin() the threshold for the “return x” case is usually lower. For example, the VS 2015 64-bit runtime (ucrtbase) uses 2^-27. For 2^-27 < |x| < 2.149e-8 it will do some computations, but the result should still be equal to x.
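The “return x” behavior is easy to observe from Python, which calls the platform libm; a sketch:

```python
import math

# For |x| below roughly 2^-26 the cubic term x**3/6 is under half an ulp
# of x, so a correctly rounded sin(x) is x itself; most libms
# short-circuit this case and simply return x.
x = 1e-10                             # well under the thresholds above
print(math.sin(x) == x)               # True
print(x**3 / 6 < 0.5 * math.ulp(x))   # True: why the identity holds
```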

Perhaps because the constant for the comparison is simpler thus faster to load? On the ARM, this would be the case for 0x3e400000 (2^(-27)), which fits on 8 bits with an even rotate, but not for 0x3e500000. If the VS 2015 source code is the same for the ARM, this could explain the reason.

x64 compares the whole 64-bit value so the constants have to be loaded from memory. ARM (Thumb-2) compares the top 32-bit half. 2^-27 is an immediate but 2^-13 has to be loaded from memory. I don't know how much code is shared between the x64 and ARM libraries but here are my observations:
– They all use the same 13th degree polynomial approximation
– x64 (FMA) and ARM have the x – x^3/6 part while x64 (SSE2) does not
– Range reduction is different on all three
This has been interesting so far, maybe I will study the range reduction next.

Digital Equipment Corporation VAX math library functions, circa 1978 and later, did argument reduction for trigonometric functions by doing a 30,000+ bit fixed point exact divide of the floating point argument, using an exact 30,000+ bit value of pi.

However, since the arguments were typically the results of floating point operations and not exact values, the correct action would have been to raise a loss-of-precision error.

For example, arctan(X), where X is the result of a computation that is stored as a quad precision floating point number, actually has no meaningful value when the uncertainty in a result that happened to be stored as 1Q50 is more than 2*pi.

So the arctan function should have raised a loss-of-precision error. Then, if the programmer knew that X was actually an exact value, he could have let the result stand rather than setting the result to NaN.

(Q = REAL*16 = total length 128 bits.)

I worked at Digital at the time and reported the problem to the developers, but they didn’t understand the loss-of-precision problem and let the accurate argument reduction of an inaccurate argument stand, with no error condition being raised.
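For comparison, here is a sketch of that style of exact reduction in Python, using a ~60-digit (~200-bit) decimal value of pi as the modulus; a toy stand-in for the VAX's huge fixed-point divide, adequate only for moderately large arguments:

```python
import math
from fractions import Fraction

# pi truncated to ~60 decimal digits: enough to reduce moderately large
# arguments essentially exactly before calling the double-precision sin.
PI = Fraction("3.141592653589793238462643383279502884197169399375105820974944")

def reduced_sin(x):
    k = round(Fraction(x) / (2 * PI))   # nearest multiple of 2*pi
    r = Fraction(x) - 2 * PI * k        # exact remainder, roughly in [-pi, pi]
    return math.sin(float(r))

# Agrees closely with a libm that does its own accurate reduction:
print(abs(reduced_sin(1e6) - math.sin(1e6)) < 1e-12)   # True
```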