Origin of Quake3's Fast InvSqrt()

Published on 29th Nov 2006, written by Rys for Software - Last updated: 21st Mar 2007

Introduction

Note! This article is a republishing of something I had up on my personal website a year or so ago before I joined Beyond3D, which is itself the culmination of an investigation started in April 2004. So if timeframes appear a little wonky, it's entirely on purpose! One for the geeks, enjoy.

Origin of Quake3's Fast InvSqrt()

To most folks the following bit of C code, found in a few places in the recently released Quake3 source code, won't mean much. To the Beyond3D crowd it might ring a bell or two. It might even make some sense.
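For reference, here is a minimal sketch of the routine in the form it is commonly quoted. The name InvSqrt follows this article's title, and memcpy stands in for the pointer casts usually seen in the Quake3 source, a substitution made here so the type punning is well defined in standard C. It assumes float is a 32-bit IEEE 754 value and that x is positive.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the fast inverse square root as commonly quoted.
   Assumptions: 32-bit IEEE 754 float, x > 0.
   memcpy replaces the original's pointer casts (a substitution, not the id code). */
float InvSqrt(float x)
{
    float xhalf = 0.5f * x;
    int32_t i;

    memcpy(&i, &x, sizeof i);        /* view the float's bit pattern as an integer */
    i = 0x5f3759df - (i >> 1);       /* magic constant minus half the bit pattern */
    memcpy(&x, &i, sizeof x);        /* view the result as a float: the initial guess */
    x = x * (1.5f - xhalf * x * x);  /* one Newton-Raphson refinement */
    return x;
}
```

A single Newton step leaves the result within a fraction of a percent of the true value, which is plenty for normalising vectors.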

Finding the inverse square root of a number has many applications in 3D graphics, not least of all the normalisation of 3D vectors, where each component is multiplied by the inverse square root of the vector's squared length. If you don't have something like the nrm instruction in a modern fragment processor, which on certain NVIDIA hardware gives you normalisation of an fp16 3-channel vector for free if you're (or the compiler is!) careful, or if you need to do it outside of a shader program for whatever reason, inverse square root is your friend. Most of you will know that you can calculate a square root using Newton-Raphson iteration and essentially that's what the code above does, but with a twist.

How the code works

The magic of the code, even if you can't follow it, stands out as the i = 0x5f3759df - (i>>1); line. Simplified, Newton-Raphson is an approximation that starts off with a guess and refines it with iteration. Taking advantage of the way 32-bit floating point numbers are laid out in memory, i, an integer, is initially set to the bit pattern of the floating point number you want to take the inverse square root of, by reinterpreting the float's bits through a pointer cast rather than converting its value. i is then set to 0x5f3759df, minus itself shifted one bit to the right. The right shift drops the least significant bit of i, essentially halving it.
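For anyone who wants the refinement step spelled out, here is the standard Newton-Raphson derivation for the inverse square root. The symbols x (the input), y (the running estimate) and f are notation chosen here for illustration.

```latex
% Solve f(y) = 0 for y, where the root is y = 1/\sqrt{x}
f(y) = \frac{1}{y^{2}} - x, \qquad f'(y) = -\frac{2}{y^{3}}

% One Newton-Raphson step
y_{n+1} = y_{n} - \frac{f(y_{n})}{f'(y_{n})}
        = y_{n} + \frac{y_{n}^{3}}{2}\left(\frac{1}{y_{n}^{2}} - x\right)
        = y_{n}\left(\frac{3}{2} - \frac{x}{2}\,y_{n}^{2}\right)
```

The final form needs only multiplies and a subtract, with no divide and no square root, which is why it suits this kind of routine so well.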

The integer view of the input is reused: the initial guess for Newton is the magic seed value minus that integer halved, a free divide by 2 courtesy of the CPU's shifter. Reinterpreted as a float again, the result is the starting point that Newton-Raphson then refines.
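Why would halving an integer do anything useful to a float? A small sketch helps: read as an integer, a positive float's bit pattern is roughly a scaled and offset base-2 logarithm of its value, so halving and negating the integer roughly halves and negates the logarithm, which is what an inverse square root does. The helper name approx_log2 is made up for this illustration, and the code assumes 32-bit IEEE 754 floats.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: treat a positive float's bits as a crude log2.
   Assumes IEEE 754 single precision: 8 exponent bits biased by 127,
   followed by 23 mantissa bits. */
float approx_log2(float x)
{
    int32_t i;

    memcpy(&i, &x, sizeof i);               /* integer view of the float */
    return (float)i / 8388608.0f - 127.0f;  /* divide by 2^23, remove the bias */
}
```

approx_log2(8.0f) comes out as exactly 3, while approx_log2(3.0f) gives 1.5 against a true value of about 1.585. That is crude, but good enough for a first guess.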

But why that constant to start the guessing game? Chris Lomont wrote a paper analysing it while at Purdue in 2003. He'd seen the code on the gamedev.net forums and that's probably also where DemoCoder saw it before commenting in the first NV40 Doom3 thread on B3D. Chris's analysis for his paper explains it for those interested in the base math behind the implementation. Suffice to say the constant used to start the Newton iteration is a very clever one. The paper's summary wonders who wrote it and whether they got there by guessing or derivation.
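To give a flavour of where a constant like this can come from, here is the usual back-of-the-envelope argument. It is a sketch, not a reproduction of the paper. The symbols are assumptions chosen here: I_x is the integer view of the float x, L = 2^23 scales the mantissa, B = 127 is the exponent bias, and sigma is a small tuning term that absorbs the error of approximating log2(1 + m) by m.

```latex
% Integer view of a positive float x = 2^{e}(1 + m), with 0 \le m < 1
I_{x} = L\,(e + B + m), \qquad L = 2^{23},\; B = 127

% Using \log_{2}(1 + m) \approx m + \sigma
\log_{2} x \approx \frac{I_{x}}{L} - B + \sigma

% For y = x^{-1/2} we need \log_{2} y = -\tfrac{1}{2}\log_{2} x, so
\frac{I_{y}}{L} - B + \sigma \approx -\frac{1}{2}\left(\frac{I_{x}}{L} - B + \sigma\right)
\;\Longrightarrow\;
I_{y} \approx \frac{3}{2}\,L\,(B - \sigma) - \frac{I_{x}}{2}
```

The first term is the magic number and the second is the right shift. A sigma of roughly 0.045 lands on a constant close to 0x5f3759df.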

So who did write it? John Carmack?

While discussing NV40's render path in the Doom3 engine as mentioned previously, the code was brought up and attributed to John Carmack; and he's the obvious choice since it appears in the source for one of his engines. Michael Abrash was mooted as a possible author too. Michael stands up here as x86 assembly optimiser extraordinaire, author of the legendary Zen of Assembly Language and Zen of Graphics Programming tomes, and employee of id during Quake's development where he worked alongside Carmack on optimising Quake's software renderer for the CPUs around at the time.

Despite having the noodle for it, John says nay and isn't sure if it's Michael either. The rebuttal was posted in the B3D thread, along with the suggestion of Terje Mathisen as a potential author, and was then pretty much forgotten about by all until recently. During John's Quakecon keynote speech this year he mentioned the opening of the complete Quake3 v1.32 source under the General Public License, including the 3D renderer, to big cheers from the assembled crowd. With Doom3 recently finished and published to critical acclaim, hacking minds turned to id to ask when Q3's fairly ancient (by current 3D standards) renderer would be available for people to look at, learn from and work with.

The source was duly released in the week following Quakecon, and Slashdot picked up the obvious story, where the question of who wrote that implementation of fast inverse square root came up again. Shortly after posting that I wondered why the hell I hadn't asked Terje!

Terje Mathisen, along with Michael Abrash and a few others, stands as one of the masters of assembly language optimisation for x86 microprocessors. Back in the days of software optimisation for 3D graphics, guys like Michael and Terje (and John, who's no slouch at it himself if you peek at the Quake source) would spend significant time testing hand-coded assembly optimisations for various critical-to-performance codes. The investment, one you wouldn't really make these days except in very special cases, paid off back in Doom and Quake's days. And if you hang around comp.lang.asm.x86 enough you'll spot Terje offering advice, optimisations, anecdotes and code snippets related to x86 assembly programming. The man knows his stuff. So, Terje, was it you?

Terje, any ideas?

> Hey Terje,
>
> This question has come up again since id released the source to Quake
> 3 Arena.
>
> Are you the guy who wrote that fast implementation of inverse square root?
> If so, do you have a history of where it came from and how you came up
> with it? A whole bunch of hackers and geeks would love to know and
> since John says it wasn't him or likely Michael, was it you?

Hello Ryszard, and hello again John, it's been a few years since we last met. :-(

Thanks for giving me as a possible author, when I first saw the subject I did indeed think it was some of my code that had been used. :-)

I wrote a very fast (pipelineable) & accurate invsqrt() 5+ years ago, to help a Swede with a computational fluid chemistry problem.

His simulation runs used to take about a week on either Alpha or x86 systems, with my modifications they ran in half the time, while delivering the exact same final printed results (8-10 significant digits).

The code shown below is not the same as what I wrote, I would guess it mostly stays within a fraction of a percent? The Swede needed at least 48 significant bits in his results, so I employed a much more straightforward table lookup plus NR-iteration. Since water molecules contain three atoms it was quite straightforward to calculate three such invsqrt() values in parallel, this was enough to avoid almost all bubbles in the fp pipelines.

I do think I recognize the style of the Q3A code however, it looks a lot like something you'll find in the old HAKMEM documents from MIT. :-)

Regards,

Terje "almost all programming can be viewed as an exercise in caching"

Terje's not our man, but he does show he's got the cojones for it. His own assembly version of invsqrt, accurate to 48 significant bits and pipelined on the x86 CPUs of 2000 - which had just gained SSE about a year before with A80525 (I'll let the geeks look that one up, KNI ring any bells?) by the way - uses a LUT and Newton-Raphson to maintain the precision his friend needed. A little analysis of his LUT would likely have seen him close to the InvSqrt() this article refers to. You might also remember Terje as one of the main guys behind public analysis of the original Pentium's FDIV bug.

While he can't pin down the author, he does drop more crumbs. I first came across the M.I.T. HAKMEM documents ages ago as I dove deep into low-level programming around the 1998-2000 timeframe. I was a wet-behind-the-ears hacker interested in 3D graphics back then, and my grasp of vector math was decent (it now sucks, relatively speaking, and I'm constantly having minor Eureka moments as I write about 3D hardware for a living). I even tried my hand at a Linux driver for the S3 Savage3D around the same time John was working on the Utah GLX project with Matrox hardware.

Drawing tris at a low level wasn't such a big deal 5 or 6 years ago, since there was no programmable hardware and only the basic OpenGL pipeline to follow. Getting hardware specs out of the IHVs was also easy enough. I keep my copy of the Savage3D's register and hardware spec close by to remind me how unlikely it would ever be to get the same documents for a modern GPU from the big IHVs.

More digging

With John and Terje unwilling to stake a claim on the code and Abrash mostly out of the running, more digging was required. Google, even with search-fu to make Larry and Sergey proud, didn't want to give up the goods. It was willing to provide some pointers towards NVIDIA though, with a posting to a Slashdot article that hinted that someone in Santa Clara was responsible.

A quick email to a friend at NVIDIA came back with the suggestion that the T in SST, one Gary Tarolli, was the one at NVIDIA most likely to know.

In almost every tale of 3D legend or lore you'll find 3dfx

For those new to 3D graphics, Gary Tarolli is one of the founders of the late 3dfx. One of the early pioneers of consumer 3D graphics hardware, 3dfx blazed a trail in 3D that started with coin-op arcade hardware, before the SST1 and Voodoo Graphics paired separate framebuffer and texture unit chips on a PCI board for the PC in late 1996. Voodoo 2 followed in 1998, powered by the brand new SST96 at arguably 3dfx's peak. Before the V2, 3dfx had no real competition in the consumer space, with NVIDIA's NV3 (Riva128) not quick enough and Rendition's Verite architecture not getting the market share needed to have the company ponder a follow up.

Voodoo3 arrived in 1999 after SLI Voodoo2s had ruled the roost for so long, and 3dfx dropped the SST prefix for their chips. The forgettable Voodoo3 was slapped around by NVIDIA's NV5 (TNT2 Ultra) and then, barely half a year later, by NV10 (the first GeForce 256). On the ropes, the 3dfx VSA-10x architecture saw Voodoo4 and 5 hit the market in 2000, but even the monster Voodoo5 5500 wasn't enough to keep 3dfx afloat. Some business decisions gone bad, and problems introducing the Voodoo5 6000, saw NVIDIA buy the ailing 3D chip company, bringing the legend to a close.

Mentioning Rampage, Sage and Spectre today is enough to widen eyes and get saliva glands working overtime in 3D geeks. 3dfx's stillborn next generation parts were to be the company's saviours. That wasn't to be, sadly, and most of 3dfx's staff were assimilated into NVIDIA, with the rest joining the likes of ATI or leaving 3D altogether to pursue other careers.

So, with Mr. T-Buffer the next person likely to know where the code comes from, I decided to ask him not if he knew, but if it was him. Known as a coder, his simulation code is what brought up SST1 and SST96 on their design hardware before production. His ability to hack made it a valid question to ask.

Must be you, Gary, surely?

A blast from the past! I definitely recognize the code below, but I can't take credit for it. I remember running across it over 10 years ago, and I also remember rederiving it. I think it's just Newton-Raphson iteration with a very clever first approx.

I also remember simulating different values for the hex constant 0x5f3759df. I may have done this for the IRIS indigo work I did, or some consulting at Kubota, I'm not 100% sure.

Given the amount of math it does, and its accuracy, and not requiring a table, it is a pretty great piece of code.

I especially like the integer ops i = 0x5f3759df - (i >> 1); which actually is doing a floating point computation in integer - it took a long time to figure out how and why this works, and I can't remember the details anymore.

Ah those were the days - fast integer and slow floating point....

So it did pass by my keyboard many many years ago, I may have tweaked the hex constant a bit or so, but other than that I can't take credit for it, except that I used it a lot and probably contributed to its popularity and longevity.

p.s. sorry in taking so long to reply

We'll forgive Gary for taking so long to reply since we pretty much hit the jackpot. While he can't take all the credit, he's definitely one of the guys responsible for it, likely from back when he worked at Silicon Graphics on the Indigo. Around 2001 I actually owned an R4K Indigo with 4 XS24Zs (Elan-class, since it had a depth buffer chip on the graphics boards and 4 geometry engines!) for a short while. Now you know where 3dfx got their multi-chip 3D ideas from.

The R4K stands out as an SGI box that actually used commodity PC parts in its construction, which helped me get it up and running after I bought it non-working for a collection of old SGI hardware that never really got off the ground. Those things are still pretty pricey to this day, even more so than 3 or 4 years ago!

What software Gary did for the Indigo project I've still to ask him, but it's likely core IRIX work on the code to drive the 3D hardware. His journey from SGI to NVIDIA via 3dfx is a common one, with many current NVIDIA employees having made a similar journey. Perhaps the most notable is Emmett Kilgariff, the man who's likely responsible for a large chunk of NV40 and G70's performance with the design of their fragment processors.

All done and dusted?

While Gary can't take the full credit, he's likely the last person easily available that had a hand in writing and refining the fast inverse square root implementation that sparked this investigation. When it takes you via John Carmack, Michael Abrash and Terje Mathisen, the source to an id-made Quake game, and then finally 3dfx, a 3D geek can't really complain. A really cool hack that deserves some limelight after all these years. It's over 15 years old at this point.

To all those wondering why John bothers to push out the source to id's game engines after the fact, the snippet of code at the very top of this article is a poster child for why. Not only do you get well-programmed and well-optimised 3D engines to modify and learn from, you get gems like the fast invsqrt function to show you that it's not all about the 3D hardware, and that software is arguably even more of a factor when analysing 3D performance.

So not all done and dusted in finding who wrote it, but maybe as close as it's likely to get. Hope you enjoyed the journey.