Say I have two real number types. Each may be floating point or fixed point. How can I construct a new type whose set of values is a superset of the union of the two, using the minimal number of bits?

There are 3 cases to consider:

Fixed (Qa.x) $\cup$ Fixed (Qb.y) - I think the best choice here is Qmax(a,b).max(x,y). I think this is optimal, since I can't come up with anything smaller that exactly represents every value of both types.

Float (FaEx) $\cup$ Float (FbEy) - I think the best choice here is Fmax(a,b)Emax(x,y). Again, I can't think of anything smaller.

I am using Q notation for fixed point types. I don't know how floating point types are typically written, so I'm using an analogous notation where FaEx means $a$ bits of mantissa and $x$ bits of exponent.
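The two same-kind cases above can be sketched in a few lines of Python. This is only an illustration under assumptions the question does not state: the types are unsigned, the class and function names are hypothetical, and for the float case a wider exponent field is assumed to cover the range of a narrower one (as with a symmetric bias).

```python
from fractions import Fraction
from typing import NamedTuple


class Fixed(NamedTuple):
    int_bits: int   # a in Qa.x
    frac_bits: int  # x in Qa.x


class Float(NamedTuple):
    mant_bits: int  # a in FaEx
    exp_bits: int   # x in FaEx


def union_fixed(p: Fixed, q: Fixed) -> Fixed:
    """Qmax(a,b).max(x,y): widen each field to the larger of the two."""
    return Fixed(max(p.int_bits, q.int_bits), max(p.frac_bits, q.frac_bits))


def union_float(p: Float, q: Float) -> Float:
    """Fmax(a,b)Emax(x,y), assuming nested exponent ranges."""
    return Float(max(p.mant_bits, q.mant_bits), max(p.exp_bits, q.exp_bits))


def fixed_values(t: Fixed) -> set:
    """All values of an unsigned Qa.x type, as exact fractions."""
    scale = 1 << t.frac_bits
    return {Fraction(raw, scale) for raw in range(1 << (t.int_bits + t.frac_bits))}


# Brute-force check on a small example: Q2.1 and Q1.3 both fit in Q2.3.
u = union_fixed(Fixed(2, 1), Fixed(1, 3))
assert fixed_values(Fixed(2, 1)) | fixed_values(Fixed(1, 3)) <= fixed_values(u)
```

The brute-force check only confirms that the union type is a superset for one small pair. It says nothing about minimality.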

The difficult case is:

Fixed (Qa.x) $\cup$ Float (FbEy) - The best I can come up with is Qmax(a,n).max(x,m), where n is the minimal number of integer bits needed to represent the largest value the float can hold and m is the minimal number of fraction bits needed to represent the smallest positive value the float can hold. This seems extremely inefficient, as it extends the float's finest precision over its entire range. Thus, for any decent-sized floating point type, the resulting union type will be extremely large.
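To get a feel for how large this gets, here is a minimal sketch that computes n and m for a given FbEy. It assumes an IEEE 754-style layout, which the question does not specify: b counts the significand precision including the implicit leading bit, the exponent bias is $2^{y-1}-1$, the all-ones exponent is reserved, subnormals exist, and the sign bit is ignored. The function name is hypothetical.

```python
def fixed_cover_of_float(b: int, y: int):
    """Return (n, m) such that Qn.m holds every finite non-negative value
    of an IEEE 754-style float with b significand bits and y exponent bits.
    """
    emax = (1 << (y - 1)) - 1   # largest unbiased exponent
    emin = 1 - emax             # smallest normal unbiased exponent
    n = emax + 1                # largest finite value is just below 2**(emax + 1)
    m = (b - 1) - emin          # smallest subnormal is 2**(emin - (b - 1))
    return n, m


n, m = fixed_cover_of_float(24, 8)   # a single-precision-like F24E8
print(f"Q{n}.{m}, {n + m} bits")      # Q128.149, 277 bits
```

Under the same assumptions, a double-precision-like F53E11 comes out as Q1024.1074, which is 2098 bits before the sign.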

Here are some ASCII diagrams of the three cases (simplified), and why I think I'm wrong:

From my math the best I could do would be Q2.3, but it is fairly obvious that there should exist some floating point type that stops having the necessary accuracy once the floating point part's accuracy is no longer needed. Of course I have to be careful if the fixed point type is more accurate than even the most accurate range of the floating point type, but I still feel like I'm missing a nice solution.

Any idea what binary type will be the smallest superset of the union between a fixed and floating point type?

NOTE: I know that this also highlights the benefits and drawbacks of fixed and floating point types, but I feel like it should be possible to do at least a little better, especially when the types have known range boundaries.

2 Answers

Fixed (Qa.x) $\cup$ Float (FbEy) can be represented as Float (FcEz) where $c = a+x$ and $z = \max(\lceil \lg(x) \rceil,y)$. This is significantly more efficient than what you propose in the question.

In particular, Fixed (Qa.x) can always be coerced to Float (FcEw) where $c = a+x$ and $w = \lceil \lg(x) \rceil$. Once you've coerced Fixed (Qa.x) to Float (FcEw), the problem reduces to the case of Float (FcEw) $\cup$ Float (FbEy), which you already described how to handle.
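As a rough illustration of why $c = a+x$ mantissa bits are enough, here is a sketch of the coercion. It assumes a simplified, unnormalized float model in which the value is `mantissa * 2**exponent`, with an unsigned integer mantissa and no hidden bit. A normalized format would shift the exponent per value, and the sketch does not compute the exponent field width. The function names are hypothetical.

```python
from fractions import Fraction


def fixed_to_float_fields(raw: int, a: int, x: int):
    """Re-express an unsigned Qa.x raw integer as (mantissa, exponent).

    The raw integer already fits in a + x bits, so it can be used as the
    mantissa unchanged; the exponent is the constant scale -x.
    """
    if not 0 <= raw < (1 << (a + x)):
        raise ValueError("raw value does not fit in Qa.x")
    return raw, -x


def float_value(mantissa: int, exponent: int) -> Fraction:
    """Exact value of the simplified float: mantissa * 2**exponent."""
    return Fraction(mantissa) * Fraction(2) ** exponent


# Every Q2.3 value survives the coercion exactly.
for raw in range(1 << 5):
    mantissa, exponent = fixed_to_float_fields(raw, 2, 3)
    assert float_value(mantissa, exponent) == Fraction(raw, 8)
    assert mantissa.bit_length() <= 2 + 3
```

In this model the exponent takes a single value for all inputs, so the exponent field only has to be wide enough to encode the scale $x$.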

I believe the only way you can do better is if one of the types fully contains the other, in which case you can simply ignore the contained type. (You can't save fractional bits without going into entropy coding in a larger context, which is a whole other story.)