Guest

Kai Koehne wrote:
> Hi,
>
> I need an int to real32 conversion with IEEE-754 round-towards-zero mode
> ... That means, a method x, so that
>
> int i = 44554435
> float f = x(i)
>
> results in f = 3355432.0, and not f = 335546.0 .
>
> Are you aware of any mathematical packages that provide such a function,
> or a way to do this with pure Java?
>
> Regards,
>
> Kai Koehne

Huh?

I was going to mumble something about how automatic integer to floating
point promotion should work as advertised, but your question does not
seem to make sense.

If you absolutely must work with 32 bit floats instead of 64 bit
doubles or java.math.BigDecimal, you will lose precision when promoting
large integers. There's no way to squeeze 32 bits of precision into
less than 32 bits of precision, especially when a 32 bit float needs a
few bits for the exponent.
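
To make the precision loss concrete, here is a minimal sketch. It assumes only the standard IEEE-754 single-precision layout (a 24-bit significand, counting the implicit leading bit); the class and method names are made up for illustration.

```java
// Illustrative names only; nothing here comes from the original posts.
class FloatPrecisionDemo {
    // Every int with magnitude up to 2^24 fits in a float's 24-bit significand.
    static final int EXACT_LIMIT = 1 << 24; // 16777216

    /** True if converting i to float loses no information. */
    static boolean isExactAsFloat(int i) {
        // Compare in double: both int and float values convert to double exactly.
        return (double) (float) i == (double) i;
    }

    public static void main(String[] args) {
        System.out.println(isExactAsFloat(EXACT_LIMIT));     // true
        System.out.println(isExactAsFloat(EXACT_LIMIT + 1)); // false: needs 25 bits
        // The default cast rounds to the nearest float rather than truncating.
        System.out.println((float) (EXACT_LIMIT + 1));
    }
}
```

Above 2^24 the representable floats are spaced 2, 4, 8, ... apart, so most large ints have to land on a neighbouring float, and the only question is which neighbour.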

But if you really need a method that converts 44554435 stored in an int
to either 3355432.0 or 335546.0 in 32 bit floating point
representation, you can always write your own. I'd prefer you give it a
reasonable name, like "mungeNumber()" or "hashAFloat()" or
my.own.strange.Promotions.conversion_x() so that if it ever gets close
to any of my code my code doesn't get infected. But you can definitely
build it. I'd point you to the xxxToYyyBits() and yyyBitsToXxx()
methods of java.lang.Double and java.lang.Float, but I wouldn't want to
be accused of contributing to the crime.

> [...]
> But if you really need a method that converts 44554435 stored in an int
> to either 3355432.0 or 335546.0 in 32 bit floating point
> representation, you can always write your own. I'd prefer you give it a
> reasonable name, like "mungeNumber()" or "hashAFloat()" or
> my.own.strange.Promotions.conversion_x() so that if it ever gets close
> to any of my code my code doesn't get infected. But you can definitely
> build it. I'd point you to the xxxToYyyBits() and yyyBitsToXxx()
> methods of java.lang.Double and java.lang.Float, but I wouldn't want to
> be accused of contributing to the crime.
>

Huh, this is the tough way ... but I will try to find my way through the
implementation details, thank you for the idea

Guest

Kai Koehne wrote:
> schrieb:
> > Kai Koehne wrote:
> >> Hi,
> >>
> >> I need an int to real32 conversion with IEEE-754 round-towards-zero mode
> >> [...]
> >> Kai Koehne
> >
> > Huh?
> >
> > I was going to mumble something automatic integer to floating point
> > promotion should work as advertised, but your question does not seem to
> > make sense.
>
> Just to clarify my motivation: I am building an Occam 2 Java compiler
> ... and as it happens, occam supports two ways of converting an integer
> to a floating-point variable:
>
> REAL32 ROUND n -- conversion with round-towards-middle
> REAL32 TRUNC n -- conversion with round-towards-zero
>
> The first has the same semantics as (float)n , but I couldn't find a way
> to implement the second conversion.

Ahh. I see.
> > [...]
> > But if you really need a method that converts 44554435 stored in an int
> > to either 3355432.0 or 335546.0 in 32 bit floating point
> > representation, you can always write your own. I'd prefer you give it a
> > reasonable name, like "mungeNumber()" or "hashAFloat()" or
> > my.own.strange.Promotions.conversion_x() so that if it ever gets close
> > to any of my code my code doesn't get infected. But you can definitely
> > build it. I'd point you to the xxxToYyyBits() and yyyBitsToXxx()
> > methods of java.lang.Double and java.lang.Float, but I wouldn't want to
> > be accused of contributing to the crime.
> >
>
> Huh, this is the tough way ... but I will try to find my way through the
> implementation details, thank you for the idea

If what Chris suggests doesn't get there fast enough for you, you might
want to consider implementing it borrowing from java.math.BigDecimal
and BigInteger. If it turns out too slow, it could at least give you a
good baseline.
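
One possible shape for such a baseline, sketched with BigInteger only. This is an assumption about what "borrowing from BigInteger" could look like, not code from anyone in this thread; the class name and constant are invented for the example.

```java
import java.math.BigInteger;

// Hypothetical baseline: slow but easy to reason about.
class BigIntegerTruncate {
    // A float carries 24 significant bits (23 stored plus the implicit one).
    static final int FLOAT_SIGNIFICAND_BITS = 24;

    /** Converts i to float, rounding toward zero. */
    static float truncate(int i) {
        // BigInteger.abs() is safe even for Integer.MIN_VALUE.
        BigInteger magnitude = BigInteger.valueOf(i).abs();
        int excess = magnitude.bitLength() - FLOAT_SIGNIFICAND_BITS;
        if (excess > 0) {
            // Drop the low bits that cannot fit, which truncates the magnitude.
            magnitude = magnitude.shiftRight(excess).shiftLeft(excess);
        }
        // At most 24 significant bits remain, so this conversion is exact.
        float f = magnitude.floatValue();
        return i < 0 ? -f : f;
    }
}
```

Because the sign is handled separately and the magnitude is only ever reduced, the result always lies between zero and the original value.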
> Kind regards,
>
> Kai Koehne

That's clever. At least, I /hope/ it's clever because I don't understand why
it works. Do you care to expand a bit ? Please ?

I got interested in this myself and have tried out a number of approaches,
including an attempt to convert rounding into truncation by consideration of
ULPs (as I suggested to the OP earlier). The best, IMO, is to do the
truncation in the integer domain before converting into a float. It's a little
complicated to do, but it runs fast, and has the aesthetic advantage of not
requiring double-precision operations to produce a single-precision result
(which all the other things I tried do).
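
For readers who want the ULP idea in isolation, here is a minimal sketch. It is not necessarily identical to the ULP variant in the listing; the class name is made up, and it assumes only `Math.ulp(float)` from the standard library. Note that it compares in double, which is exactly the aesthetic cost mentioned above.

```java
// Hypothetical stand-alone version of the "round, then step back one ULP" idea.
class UlpTruncate {
    /** Converts i to float, rounding toward zero. */
    static float truncate(int i) {
        float f = (float) i; // the language's cast rounds to nearest
        double exact = i;    // int to double is always exact
        if (i > 0 && f > exact) {
            // Rounded up, away from zero. The ulp is taken one step closer to
            // zero because the spacing halves just below a power of two.
            f -= Math.ulp(f - Math.ulp(f));
        } else if (i < 0 && f < exact) {
            // Rounded down, away from zero: mirror image of the case above.
            f += Math.ulp(f + Math.ulp(f));
        }
        return f;
    }
}
```

The nested `Math.ulp` call matters only when the rounded value is an exact power of two, for example when Integer.MAX_VALUE rounds up to 2^31.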

I've appended the code, including the test-harness, for anyone who's
interested.

-- chris

=========== Test.java ===============
/*
* contains several static implementations of truncating a 32-bit integer
* to a 32-bit float with rounding "towards zero" rather than to the nearest
* float (as in the Java language spec for casting an int to a float).
* Also contains an exhaustive test which checks all the implementations
* against the simplest one (presumed to be correct).
*
* Approximate times when run on a 1.5.0_06-b05 JVM, in "client" mode on
* a 1.5 GHz celeron WinXP Pro box. In each case the time is the time in
* nanoseconds to convert one value averaged over the full 32-bit range of
* integers.
*
* Slow: 130
* ULP: 63.5
* FastULP: 65
* Twiddle: 19
* Patricia: 74.5
*
* Approximate times for inputs in the restricted range
* [-10,000,000, 10,000,000), 83% of which is exactly representable
* as a float.
*
* Slow: 12
* ULP: 12
* FastULP: 20.5
* Twiddle: 18.5
* Patricia: 24
*/
public class Test
{
public static void
main(String[] args)
{
int i = 0;
int errors = 0;
for (;;)
{
if (i % 10000000 == 0)
{
// progress...
System.out.printf("%11d...\r", i);
System.out.flush();
}
float shouldBe = slowTruncate(i);
float is;

/**
* Slow implementation of truncating an int to a 32-bit floating point.
* This implementation is intended to be the "so simple it can't possibly
* be wrong" benchmark against which other implementations can be compared
*/
public static float
slowTruncate(int i)
{
// since all +ve numbers have negations, but the reverse
// isn't true, we work with -ve numbers
if (i > 0)
return -slowTruncate(-i);

// adjust by one ulp towards 0
// the double ulp() is because ulp is the distance to the next
// number away from 0, we need the distance to the next float
// nearest zero
if (i > 0)
{
float ulp = Math.ulp(f-Math.ulp(f));
f -= ulp;
}
else
{
float ulp = Math.ulp(f+Math.ulp(f));
f += ulp;
}

return f;
}

/**
* Faster implementation of truncating an int to a 32-bit floating point.
* This implementation uses a simple test to catch a useful range of
* "easy" cases. If the input is likely to be in that range then
* this provides a useful optimisation, otherwise it's marginally
* slower.
* NB1: this version uses ulpTruncate() for the difficult cases, but
* it could just as easily use a different one -- even slowTruncate().
* NB2: since this consists only of a simple test guarding simple
* operations, it seems reasonable to hope that the JIT will inline
* it in callers
*/
public static float
fastTruncate(int i)
{
// these are all exact
if (i >= -8388608 && i <= 8388608)
return (float)i;
else
return ulpTruncate(i); // or any other xxxTruncate()
}

/**
* Faster implementation of truncating an int to a 32-bit floating point
* based on bit-twiddling.
* The idea is to take the number, discover how many high-bits it has which
* don't fit into the 24-bit mantissa of a float, and to mask off that many
* low bits before converting to a float. I.e. we do the truncation in the
* integer domain before converting to a float, so the conversion is
* always exact.
* The actual implementation is complicated by sign issues, but that's the
* basic idea.
* Note that we don't actually mess with the physical layout of an IEEE
* float, but only use the fact that it has exactly 24 bits in which to
* represent the mantissa
*/
public static float
twiddleTruncate(int i)
{
if (i >= 0)
{
// because i is +ve the high bit is always zero,
// and so (i >>> 24) is always in the range [0, 128)
return (float)(i & MASKS[i >>> 24]);
}
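
A self-contained sketch of the masking idea described in the comment above. The `MASKS` table and the handling of negative inputs are reconstructions made for this example; the listing's own table and sign handling are not shown here, so treat both as assumptions.

```java
// Hypothetical reconstruction of the integer-domain truncation.
class TwiddleTruncate {
    // MASKS[b] clears one low bit for every significant bit in the high byte b,
    // so that at most 24 significant bits survive the AND.
    static final int[] MASKS = new int[128];
    static {
        for (int b = 0; b < MASKS.length; b++) {
            // Bit length of b: 0 for b == 0, up to 7 for b == 127.
            int excess = 32 - Integer.numberOfLeadingZeros(b);
            MASKS[b] = -1 << excess;
        }
    }

    /** Converts i to float, rounding toward zero. */
    static float truncate(int i) {
        if (i >= 0) {
            // i is non-negative, so (i >>> 24) is in the range [0, 128).
            return (float) (i & MASKS[i >>> 24]);
        }
        if (i == Integer.MIN_VALUE) {
            // -2^31 has no positive counterpart, but it is a power of two
            // and therefore converts exactly.
            return (float) i;
        }
        // Assumed sign handling: truncate the magnitude, then negate.
        int magnitude = -i;
        float f = (float) (magnitude & MASKS[magnitude >>> 24]);
        return -f;
    }
}
```

After masking, the value has at most 24 significant bits, so the final cast cannot round and no floating-point comparison is needed.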

Sure. First I split the problem into two cases. If the result did not
require rounding, or the rounding was towards zero, the rounded result
and truncated result are equal, so I just returned the rounded result.

The code you quote is for those cases that round away from zero. The
float we want is the one with the same sign as rounded, and the largest
magnitude that is less than rounded.

NaNs, infinities, and zeros would all be special cases in a general
attempt to make the smallest possible reduction in the magnitude of a
float, but can be ignored here. Zero can only result from converting int
zero, and getting an exact result. The others cannot result from int
conversion.

If rounded is not an exact power of two, the truncation has a mantissa
one less than rounded. If it is an exact power of two, it has an
exponent one less than rounded, and mantissa all ones. Either way,
subtracting one from the rounded bit pattern gets the bit pattern for
truncated.
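
The explanation above can be sketched as follows. This is a reconstruction from the description, not a copy of the posted code; the class name is made up, and the comparison is done in double so that the extreme int values need no special handling.

```java
// Sketch of the "subtract one from the bit pattern" approach described above.
class BitPatternTruncate {
    /** Converts i to float, rounding toward zero. */
    static float truncate(int i) {
        float rounded = (float) i; // round-to-nearest, as the language defines
        // Both values are exact in double, so this comparison is reliable
        // even for Integer.MIN_VALUE and Integer.MAX_VALUE.
        if (Math.abs((double) rounded) <= Math.abs((double) i)) {
            return rounded; // exact, or rounding already went toward zero
        }
        // Rounded away from zero: the next float toward zero has a bit
        // pattern exactly one lower, whether or not rounded is a power of two.
        // The sign bit is untouched because the low 31 bits are non-zero.
        return Float.intBitsToFloat(Float.floatToIntBits(rounded) - 1);
    }
}
```

The decrement works across the power-of-two boundary because the exponent field sits directly above the mantissa field, so the borrow out of an all-zero mantissa lowers the exponent and leaves the mantissa all ones.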
> I got interested in this myself and have tried out a number of
> approaches, including an attempt to convert rounding into truncation
> by consideration of ULPs (as I suggested to the OP earlier). The
> best, IMO, is to do the truncation in the integer domain before
> converting into a float. Its a little complicated to do, but it runs
> fast, and has the aesthetic advantage of not requiring
> double-precision operations to produce a single-precision result
> (which all the other things I tried do).

Wow! Thank you all for this thorough discussion ... Two days ago I was
not even sure if truncating conversion can be done at all, and now I
have five working solutions + performance tests. I have the strong
feeling that I owe each of you a beverage of your choice.

Patricia Shanahan wrote:
> If rounded is not an exact power of two, the truncation has a mantissa
> one less than rounded. If it is an exact power of two, it has an
> exponent one less than rounded, and mantissa all ones. Either way,
> subtracting one from the rounded bit pattern gets the bit pattern for
> truncated.

Chris Uppal wrote:
> Patricia Shanahan wrote:
>
>
>>If rounded is not an exact power of two, the truncation has a mantissa
>>one less than rounded. If it is an exact power of two, it has an
>>exponent one less than rounded, and mantissa all ones. Either way,
>>subtracting one from the rounded bit pattern gets the bit pattern for
>>truncated.
>
>
> Ah, I see. The IEEE fp layout is subtle.
>
> Thank you.
>
>
>
>>If you prefer to avoid double, just do the abs in int:
....
> Unfortunately, the comparison doesn't work for Integer.MAX_VALUE, nor for values
> near Integer.MIN_VALUE. Two example failure cases are:
....
> I find the negative output from Math.abs() particularly entertaining ;-)

Good point. Shows the folly of not testing at least the maximum and
minimum value, as well as some ordinary values.
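
For anyone following along, a small sketch of why an int-only comparison is fragile at the extremes. The failure cases elided above are not reproduced here; the inputs below are simply the obvious boundary values, and the class and method names are invented for the example.

```java
// Illustrates two int-domain pitfalls at the ends of the range.
class IntEdgeCases {
    /** Naive check done entirely in int arithmetic; shown only as a pitfall. */
    static boolean naiveRoundedAway(int i) {
        float rounded = (float) i;
        // (int) rounded saturates at the int limits, and
        // Math.abs(Integer.MIN_VALUE) overflows back to Integer.MIN_VALUE.
        return Math.abs((int) rounded) > Math.abs(i);
    }

    /** Same check with the comparison done in double, where it is exact. */
    static boolean roundedAway(int i) {
        float rounded = (float) i;
        return Math.abs((double) rounded) > Math.abs((double) i);
    }

    public static void main(String[] args) {
        System.out.println(Math.abs(Integer.MIN_VALUE));         // -2147483648
        System.out.println(naiveRoundedAway(Integer.MAX_VALUE)); // false (wrong)
        System.out.println(roundedAway(Integer.MAX_VALUE));      // true
    }
}
```

Integer.MAX_VALUE rounds up to 2^31 as a float, but casting that float back to int clamps it to Integer.MAX_VALUE again, so the int comparison cannot see the rounding.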
