Doing Math in FPGAs, Part 3 (Floating-Point)

Floating-point numbers are similar to the "scientific notation" we learned in high school, but they are stored and manipulated using binary representations.

It seems I have something of a mini-series of blogs going on here.

First, I muttered some inanities about multiplication and division by 10 (see Doing Math in FPGAs, Part 1). Next, I rambled on about doing math in BCD (see Doing Math in FPGAs, Part 2 (BCD)). Now, it seems it's time to mutter something about floating-point representations of numbers and how to do some math with them. I considered using floating-point representations for this mysterious project that I've been alluding to (I'll get to that, one of these days, maybe), so I took a quick look at how to implement them.

Now, of course, there are plenty of ways one could represent a floating-point number. You can do it your way, I can do it my way, or we can all agree to follow a standard such as IEEE 754-2008. Of course, I'm not the first person here on EE Times to cover the topic of floating-point representations; in fact, Mr. Kjodavix described this way back in 2006 (see Tutorial: Floating-point arithmetic on FPGAs). Because of Mr. Kjodavix's article, I wondered whether I should even bother expounding on floating-point concepts. However, we all speak a little differently and we all learn a little differently, so maybe my take on this will make someone else's grasp a little better (I do recommend reading Mr. Kjodavix's article, though).

So what are floating-point numbers? Well, let's start with the fact that, due to the way in which we build our computers using two-state logic (let's not worry about experiments with ternary, or three-state, logic), we have to store numbers using some form of binary representation. It's relatively easy to use binary values to represent integers, but they don't lend themselves to directly storing real numbers; that is, numbers that include fractional values with digits after the decimal point. In other words, it's relatively easy to use binary to represent a value like 3, but it's less easy to represent a value like 3.141592. Similarly, it's relatively easy to create logic functions to implement mathematical operations on integer values, but it's less easy to work with real numbers.

Of course, we can store numbers in BCD (I talked about this in my previous blog), or we could use fixed-point representations (I will talk about this next time), but what do we actually mean by floating-point? Well, it's a lot like the "scientific notation" we learned at high school (e.g., 31.41592 × 10^-1), but it's stored and manipulated using binary representations.

So, how might we perform the mighty feat of representing a real number in binary? If we simply assume a binimal point (the base-2 equivalent of the decimal point in base 10) at some fixed position in the middle of the word, then we have a fixed-point representation as illustrated below:

I won't yammer on about this right now (that's for next time); suffice it to say that we would need a lot of bits to represent either a really big number or a really small one. Floating-point solves this problem by breaking the number up into three pieces: the sign, the mantissa (a.k.a. significand or coefficient), and the exponent (a.k.a. characteristic or scale). This gives us a fairly large dynamic range. The generic form is as follows:

n = ±x × b^y

Where:

n = the number being represented

± = the sign of the number

x = the mantissa of the number

b = the number system base (10 in decimal; 2 in binary)

y = the exponent (power) of the number (which can itself be positive, or negative)
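To make the generic form a little more concrete, here is a minimal Python sketch that splits a value into sign, mantissa, and exponent with b = 2. The function names are my own invention for illustration. Note that Python's math.frexp normalizes the mantissa into the range 0.5 to 1, whereas the IEEE 754 single-precision encoding stores a biased exponent and the fraction bits of a mantissa between 1 and 2, so the two views of the same number look different.

```python
import math
import struct

def decompose(value: float):
    """Return (sign, x, y) such that value == sign * x * 2**y.

    For non-zero inputs math.frexp guarantees 0.5 <= x < 1.
    """
    sign = -1 if math.copysign(1.0, value) < 0 else 1
    x, y = math.frexp(abs(value))
    return sign, x, y

def ieee754_single_fields(value: float):
    """Return the raw (sign bit, biased exponent, fraction) fields of the
    32-bit IEEE 754 encoding of value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 bits, leading 1 is implied
    return sign, exponent, fraction

sign, x, y = decompose(-6.5)
print(sign, x, y)                    # -1 0.8125 3
print(sign * x * 2 ** y)             # -6.5
print(ieee754_single_fields(-6.5))   # (1, 129, 5242880)
```

In the second result, 129 - 127 = 2 is the true exponent and 5242880 / 2^23 = 0.625 is the stored fraction, so the value is -(1 + 0.625) × 2^2 = -6.5.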

Easy, right? Well, maybe not so -- there are some tricks involved, as well as a variety of benefits and drawbacks. So, how do we represent floating-point in our device? Well, there are plenty of different ways to do this: there's your way, there's my way, and there's some other guy's way.

For example, the exponent is usually an integer. We could extend this by allowing the exponent to have a fractional representation if we really wanted. In general, though, I don't know why we'd want to do that, as the result would just be another fractional number that we could easily represent (unless the exponent and the mantissa were both negative, in which case we'd have a complex number, and there are easier ways to represent those).

The 9511 was a 32-bit floating-point chip with an 8-bit bus. Easy to hook to a Z80 etc. It directly handled a bunch of curve-type functions and the like. Mostly used it for sin/cos/tan things doing earth curvature work. The 9512 was much simpler but wider inside. Both ran hot, and cost a lot.

Oddly enough it looks like MicroMega currently sells an FPU for microcontroller projects. That has to be going away though, as the ARM 32F4 part I'm using today does floating point so fast I regularly use it in interrupt routines.

Those were AMD bit-slice micros? I used the Intel 300x and the AMD 291x, which were 2 and 4 bit slices.

16 bit FP is making a comeback. You can find it supported in some current GPUs. I believe it is used mostly to represent high dynamic range graphical data but there are probably other uses.

Of course, 8 bit FP was actually hugely important. The A-law and mu-law codecs used by all phone networks in the ISDN days, still used in some landlines and voice exchanges, were essentially FP with a sign, 3 bit exponent, and 4 bit fraction (with implied leftmost 1, just like IEEE formats).
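As a rough illustration of the layout described above (sign, 3-bit exponent, 4-bit fraction with an implied leading 1), here is a Python sketch of a generic 8-bit float decoder. This is an assumption-laden toy and not the actual A-law or mu-law codec, which add their own bias and bit-inversion details; the function name decode_minifloat is hypothetical.

```python
def decode_minifloat(byte: int) -> float:
    """Decode a toy 8-bit float laid out as s eee ffff.

    value = sign * (1 + ffff/16) * 2**eee
    The leading 1 of the mantissa is implied rather than stored.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 4) & 0x07   # 3-bit exponent, unbiased in this toy
    fraction = byte & 0x0F          # 4-bit fraction
    return sign * (1 + fraction / 16) * 2 ** exponent

print(decode_minifloat(0x00))   # 1.0    (smallest magnitude)
print(decode_minifloat(0x34))   # 10.0   (1.25 * 2**3)
print(decode_minifloat(0x7F))   # 248.0  (largest magnitude)
print(decode_minifloat(0xF8))   # -192.0 (-1.5 * 2**7)
```

Even with only eight bits, the step size grows with the exponent, so small values are resolved finely and large ones coarsely, which is the property that makes this kind of format attractive for audio levels.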

BetaJet, well yeah it would be nice to have a "proper analysis" but modern optimization software handles problems so huge (matrix dimensions millions of rows and columns) that no-one really has a proper theory of what happens. As you say, FP is actually a set of fractional approximations and there are situations where severe loss of precision can occur. Observation suggests there are real world reasons why actual optimization problems routinely come close to singularity. In practice all commercial packages have black art tweaks to detect and recover.

One of my colleagues wrote an infinite-precision (unlimited rationals) arithmetic package and we used that to get some insights and to check what the true optimal solutions were for some test cases. It was educational but too slow for real world use.
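For a feel of what exact rational arithmetic buys you, here is a small Python sketch using the standard library's fractions.Fraction. It is only a stand-in for the kind of package described above, not a reconstruction of it.

```python
from fractions import Fraction

# Ten additions of one tenth, first in binary floating-point...
float_total = sum([0.1] * 10)
print(float_total == 1.0)        # False: 0.1 has no exact binary form

# ...then with exact rationals.
exact_total = sum([Fraction(1, 10)] * 10)
print(exact_total == 1)          # True

# The price of exactness: numerators and denominators keep growing.
harmonic = sum(Fraction(1, k) for k in range(1, 21))
print(harmonic.denominator)
```

The growing denominators in the last line hint at why this approach is instructive for checking test cases but slow for large problems.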

The field has changed enormously since JvN's time. Heck, I think he died before Simplex even became widespread. Numerical optimization theory blossomed in the 1980s with real insights into non-linear, and then the implementations accelerated enormously in the 1990s and 2000s. Only the square root of the improvement due to hardware, the rest due to clever algorithms. I'm sure that John would love the kinds of optimization which we do today for monster problems like deep neural networks but it is a hugely different field than what he helped start.

LOL I think the main thing is to understand what one is trying to do and take the expected data and application into account. As you note, if you perform Y + X where Y is a very big value and X is a very small one, you will end up with just Y .... but if X and Y are both in the same ball-park size-wise, then the problem is much reduced.

Speaking of home built libs, I made some rough 16 bit floating point stuff a long time ago during the dinosaur micros. Also a couple of crude 12 and 8 bit versions. Don't laugh, it was sometimes kinda useful for working with curves like audio gains etc. And less bits made certain lookup tables possible, giving mathless math and fast calcs to a slow bit banger if you had a bodacious eprom etc.

If your calculations are becoming unstable even with double precision, it's time to step back and do a proper numerical analysis of your problem. Here's what too many people forget: floating-point numbers are not real numbers, so the normal laws of real numbers -- like associativity of addition -- do not apply. When you add a tiny floating-point number X to a big floating-point number Y, all the bits of X fall into the bit bucket and you end up with Y, not X+Y. Sometimes you need to use algebraic tricks to re-write your formulas into expressions that are stable for your problem and hope the compiler doesn't "optimize" them.
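The absorption effect and the loss of associativity are easy to reproduce. The following Python sketch uses ordinary 64-bit doubles; the specific values 1e16 and 1.0 are just convenient picks, and math.fsum is shown as one readily available way to sum more carefully, not as a cure-all.

```python
import math

big = 1e16
small = 1.0

# Absorption: near 1e16 adjacent doubles are 2 apart, so adding 1 is lost.
print(big + small == big)        # True

# Associativity fails: the grouping changes the answer.
left = (big + small) - big
right = (big - big) + small
print(left, right)               # 0.0 1.0

# Naive left-to-right summation versus an error-tracking summation.
values = [1e16, 1.0, -1e16]
print(sum(values))               # 0.0
print(math.fsum(values))         # 1.0
```

Reordering the expression by hand, as in the second grouping, is the sort of algebraic rewrite the comment above is talking about.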

I've read that John von Neumann greatly disliked floating-point because (1) he'd rather use those exponent bits for more precision, and (2) once you've done your numerical analysis you've already completed most of the work needed to represent your problem using fixed-point arithmetic.

@TanjB: ...as for the need for precision in a world where resistors might be accurate only to a percent, it is amazing how easy it is to get yourself into trouble with the math once you start doing simulations and...

Yep. And in practice decimal FP is not ideal for financial calculations.

FP calculations (in any radix) are common in engineering, science, and anything approximate. Even in finance they are perfectly fine to use in situations like estimating future or present value, or allocating budgets.

When it comes to accounting for the cents, however, fixed point is more likely what you want. Most of those operations are multiplies, adds and subtracts, which are exact in fixed point, with the occasional fraction like taxes which have rounding rules built in.
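A minimal Python sketch of the idea: keep money as an integer count of cents so that adds, subtracts, and multiplies by whole quantities are exact, and apply an explicit rounding rule only where a fraction such as a tax rate appears. The prices, the 8.25% rate, and the round-half-up rule here are all assumptions chosen for illustration; real rounding rules depend on the jurisdiction.

```python
# Binary floating-point cannot represent most decimal fractions exactly.
print(0.1 + 0.2 == 0.3)          # False

def tax_cents(amount_cents: int, rate_basis_points: int) -> int:
    """Tax on an amount in cents, rate given in basis points
    (1 basis point = 0.01%), rounded half up to the nearest cent."""
    return (amount_cents * rate_basis_points + 5000) // 10000

price_cents = 1999               # $19.99 stored as an integer
quantity = 3
subtotal = price_cents * quantity          # exact integer multiply
tax = tax_cents(subtotal, 825)             # assumed 8.25% rate
total = subtotal + tax

print(subtotal, tax, total)      # 5997 495 6492
```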

And as for the need for precision in a world where resistors might be accurate only to a percent, it is amazing how easy it is to get yourself into trouble with the math once you start doing simulations and (much, much trickier) optimizations. Simple components like transformers are nearly singularities. Numerical optimization packages are black arts mostly because of the clever tweaks needed to efficiently detect and work around problems with the limited (!) precision of 64 bit doubles.