Joe Darcy has posted [URL="http://blogs.sun.com/darcy/entry/jvmlang_numbers"]a fine note[/URL] on the problem of managing integer overflow in Java. This was [URL="http://wiki.jvmlangsummit.com/What_the_JVM_needs"]raised again[/URL] last week at the JVM Language Summit. Joe&rsquo;s blog has already [URL="http://blogs.sun.com/darcy/entry/jvmlang_numbers#comments"]attracted some good comments[/URL]. Here is my comment, which has inflated itself into a blog entry of its own.
The problem is to take two machine words, do a common arithmetic operation on them, and obtain the arithmetically correct results. In particular, it means that at least one bit must be collected besides the machine word of non-overflowing result data. For example (as Joe notes), the extra bit required to report overflow could be implied by an exceptional return. In any case, there are a couple of degrees of freedom in this problem: What we will use the extra bit(s) for, and how to deliever it (or them). After examining a couple of half measures, I will enumerate about ten ways to return normally with a pair of machine words.[B]Use Cases[/B]

If overflow is expected to be rare (at most 10**-4, say) then an argument-bearing exception is reasonable and harmless. That is, the intrinsic that performs add (or multiply) can be specified to throw an exception that contains &ldquo;the rest of the story&rdquo;, if that story does not need to be told often. Joe suggests putting helpful information into the exception to help slow-path code. I think the slow path is likely to be able to &ldquo;start from scratch&rdquo;, so exception-carried information is likely to have only a diagnostic value.
If overflow is somewhat common (10**-2 or more, say), a [URL="http://blogs.sun.com/jrose/entry/longjumps_considered_inexpensive"]pre-constructed exception[/URL] is an option to consider. The catch for such an exception amounts to an alternate return point, which would contain slow-path code to the desired results in a robust way. The pre-constructed exception would have no information content.
If the overflowing operation is being used to build multi-precision math, overflow will be more common, even pushing 50% on random-looking inputs. (Considerations like [URL="http://en.wikipedia.org/wiki/Benford's_law"]Benford&rsquo;s law[/URL] suggest that random-looking inputs are paradoxically rare. Bignum addition is not likely, in practice, to create a bignum longer than its inputs, so the overflow condition will be rare on one out of the [I]N[/I] adds in the bignum operation.) If overflow is common (more than 10%), it is important to process the requested operation as fast (or nearly as fast) as the non-overflowing case.
The last use case is the demanding one. It seems to require that the machine code generated in the relevant inner loop be nearly as good as an assembly programmer would write, using any available CPU-specific overflow indicator. This takes us, oddly enough, to the question of API design.[B]API Design Cases[/B]

Various source code patterns are compatible with various optimizations. APIs which involve exceptions can be partially optimized, but not (reliably) down to the single instructions required for really tight bignum loops. So let&rsquo;s talk about APIs for delivering the extra result bits on the fast path. For ease of reference in [URL="http://blogs.sun.com/darcy/entry/jvmlang_numbers#comments"]subsequent commenting on Joe&rsquo;s blog[/URL], I will make a link for each type of API.[B][URL="http://fixunix.com/#overthrow"]Throw to Slow Path[/URL][/B]

Besides the half-measure of a slow path reached by an exception, there is another ingenious trick that [URL="http://blogs.sun.com/darcy/entry/jvmlang_numbers"]Joe considers[/URL] and rejects: Report the exceptional condition by returning null instead of an boxed result. The effect of this is to extend the dynamic range of the operand word (32 or 64 bits) by one more code point, a sort of NaN value. Since JVMs are very good at optimizing null checks, and (more recently) good at optimizing box/unbox optimizations, this trick might be easier to optimize than an equivalent throw-based trick. But it still feels like a trick, not a real API. Also, it cannot deliver additional result bits to the fast path.[B][URL="http://fixunix.com/#longintpair"]Longs as Int Pairs[/URL][/B]

Another option would be to work only with 32-bit ints and forget about longs. Then the API could stuff two ints into a long as needed. The code for addExact become trivial: return (long)a+b. Then, for the cases where overflow is rare and simply needs detection, it is a relatively simple pattern-matching optimization to extract the overflow bit from an expression like z!=(int)z. This little comparison could be intrinsified, perhaps, for greater likelihood of recognition, but it is already probably simple enough.
This techinque requires stereotyped shift-and-mask expressions to pack and unpack the value pairs. (The expressions contain characteristic 32-bit shift constants.) Such expressions are probably optimized in all JVM compilers. Therefore, it is probably best, at present, to code bignums in radix 2**32. (Or, if signed multiplication is a problem, perhaps radix 2**31 is better.)
Although the code of such things is really simple, it is not so simple as to prevent inventive programmers from creating alternative formulations which the optimizer doesn&rsquo;t expect and can&rsquo;t optimize. So to make this hack optimize reliably, it is worth defining standard constructors and selectors between int-pairs and longs, and encourage programmers to use that API.[INDENT]static long intPair(int a, int b) { return ((long)a >> 32)); }static int lowInt(long ab) { return (int)(ab >> 0); }static int highInt(long ab) { return (int)(ab >> 32); }[/INDENT]I have been unspecific about how many bits are really flying around. The simplest case is 32 bits of operand, and one extra bit of overflow. The pattern just presented has 32 bits of operand and 64 bits of result. A more general case is 64 bits of operand and 128 bits of result. Any exact (overflow-handling) arithmetic routine needs to return a logical tuple of (long,long). This case resists half-measures that report state but not information, or that work for ints but not longs. For the rest of this note, let us consider the problem of returning a 128 bit result.
[B][URL="http://fixunix.com/#tuples"]Multiple Return Values[/URL][/B]

You may have guessed already that my favorite (future-only) realization of this would be [URL="http://blogs.sun.com/jrose/entry/tuples_in_the_vm"]multiple return values[/URL]. The intrinsics would look something like this:[INDENT]static (long,long) multiplySignedExact(long a, long b) { ... }static (long,long) multiplyUnsignedExact(long a, long b) { ... }static boolean signedOverflow(long a, long b) { return (b != (a>>63)); }static boolean unsignedOverflow(long a, long b) { return (b != 0); }...(long z0, long z1) = multiplySignedExact(a, b);if (signedOverflow(z0, z1)) { ... }[/INDENT]The idiom for detecting overflow is simple enough that it could be detected by simple pattern matching in the optimizer. Making it into an intrinsic (as noted before) helps programmers to use patterns which the compiler is expecting. The overflow detection intrinsic itself does not need any special magic. The 128-bit multiple intrinsic will almost certainly be special-cased by an optimizing dynamic compiler.
If your eyes are still stinging from reading the pseudo-code above, I don&rsquo;t need to remind you that we don&rsquo;t have tuples or multiple value returns right now. But there are alternatives for wrapping up the return data. The sad story is that none of the alternatives is a compelling win.[B][URL="http://fixunix.com/#pairarray"]Arrays as Values Pairs[/URL][/B]

A simple expedient is to use a two-element array to hold the two returned values. Escape analysis (EA) is present in modern optimizing compilers. (See [URL="http://blogs.oracle.com/ohrstrom/2009/05/pulling_a_machine_code_rabbit.html"]this example with JRockit[/URL].) One could hope that the arrays would &ldquo;just evaporate&rdquo;. But EA patterns are fragile; a slight source code variation can unintentionally destroy them. Most ameliorations by the programmer, such as attempting to cache the array for reuse, will break the real optimization.[B][URL="http://fixunix.com/#pairwrapper"]Wrappers as Value Pairs[/URL][/B]

The same point goes for a specially constructed wrapper type modeled on the standard wrappers (like Long) but with multiple fields (think LongPair). This wrapper type would be immutable and therefore less liable to abuse, but its identity as a Java object would still be an obstacle to full optimization.[B][URL="http://fixunix.com/#pairvalue"]Value-like Pairs[/URL][/B]

If we add value types to the JVM, with relaxed identity semantics, a value-like LongPair would play better with EA, since pair values could be destroyed and reconstituted at will by the optimizer. Alas, that also is in the future.[B][URL="http://fixunix.com/#retbyref"]Return by Reference[/URL][/B]

Some Java API designers have tried the C-like expedient of passing a pointer to a result variable, as an extra parameter. As far as I know, the experience with this tactic is not satisfactory. (Such a buffer in Java is usually a one-element array.) Buffer re-use is a continual temptation to the programmer, and a downfall to EA. Returning a value by poking it into a mailbox is not something Java optimizers are looking for. If the whole idiom needs to boil down to an instruction or two, allocating an array is a hazard to optimization.[INDENT]static long multiplySignedExact(long a, long b, long[] z1a) { ...; z1a[0] = z1; return z0; }...long[] z1a = {0}long z0 = multiplySignedExact(a, b, z1a), z1 = z1a[0];if (signedOverflow(z0, z1)) { ... }[/INDENT][B][URL="http://fixunix.com/#retbythread"]Return by Thread Local[/URL][/B]

Another possible mailbox for delivering a second result would be a thread-local variable. The abstract model is like that of continuation-passing machines, in which multiple arguments and multiple return values are passed through a set of virtual registers. Java (pre-tuples) only provides access to one such register for return values, but additional registers could be modeled as thread-local values.
The optimization hazard here is that the thread local values are too widely visible, and it is hard to analyze their use in live ranges, as one can do with registers. (Specifically, they can be reached via pointers!) So the optimizer has to be careful working with side effects to such things. Still, I think this is a promising area of research.[B][URL="http://fixunix.com/#retbyssa"]Static Single Assignment[/URL][/B]

The idea of thread-locals is interesting, since they are almost structured enough to optimize as if they were registers. (This is counter-intuitive if you know that they are implemented, at least in the interpreter, via hash table lookups.) Perhaps there is a subclass of ThreadLocal that needs to be defined, with a restricted use pattern that can be optimized into simple register moves. (Such a standard restricted use pattern is [URL="http://en.wikipedia.org/wiki/Static_single_assignment_form"]static single-assignment[/URL], used in HotSpot.) If, after intrinsics are expanded, the second return value looks like a register (and nothing more) then the optimizer has full freedom to boil everything down to a few instructions.[B][URL="http://fixunix.com/#arithengine"]Return to an Engine Field[/URL][/B]

So far we have supposed that the arithmetic methods are all static, and this is (IMO) their natural form. But it could be argued that arithmetic should be done by an explicit ArithmeticEngine object which somehow encapsulates all that goodness of those natural numbers. If you can swallow that, there is an unexpected side benefit: The engine is an acceptable (I cannot bring myself to say natural) place to store the second return value from intrinsics that must return one.[INDENT]/*non-static*/ long multiplySignedExact(long a, long b) { ... this.highWord=z1; return z0; }...ArithmeticEngine ae = new TitanArithmeticEngine();long z0 = ae.multiplySignedExact(a, b), z1 = ae.getHighWord();if (signedOverflow(z0, z1)) { ... }[/INDENT]With some luck in EA, the engine field (highWord above) might get scalarized to a register, and then participate in SSA-type optimizations.[B][URL="http://fixunix.com/#cps"]Continue in a Lambda[/URL][/B]

With closures (coming to a JDK7 near you...) you could also drop to [URL="http://c2.com/cgi/wiki?ContinuationPassingStyle"]continuation-passing style[/URL] (CPS) to receive the two result words from the intrinsic:[INDENT]static long multiplySignedExact(long a, long b, LongReceiver z1r) { ... z1r.receive(z1); return z0; }...long z1 = 0;long z0 = multiplySignedExact(a, b, #(long z)->{z1=z});if (signedOverflow(z0, z1)) { ... }[/INDENT]Optimizing this down to an instruction or two requires mature closure optimization, something we&rsquo;ll surely get as closures mature. On the other hand, we will surely lack such optimizations at first.
Those are a lot of choices for API design! I encourage you to add your thoughts to [URL="http://blogs.sun.com/darcy/entry/jvmlang_numbers#comments"]Joe&rsquo;s blog[/URL].

[url=http://blogs.sun.com/jrose/entry/after_the_deluge]Read More about [after the deluge: how to tidy an overflow...[/url]