Monday, May 25, 2009

I’ve often said that Java’s original sin was not being a pure object oriented language - a language where everything is an object.

As one example, consider type char. When Java was introduced, the Unicode standard required 16 bits. This later changed, as 16 bits were inadequate to describe the world’s characters.

Tangent: Not really surprising if you think about it. For example, Apple’s internationalization support allocated 24 bits per character in the early 1990s. However, engineering shortcuts are endemic.

In the meantime, Java had committed to a 16 bit character type. Now, if characters were objects, their representation would be encapsulated, and nobody would very much affected how many bits are needed. A primitive type like char, however, advertises its representation to the world. Consequently, people dealing with unicode in Java have to deal with encoding code points themselves.

I was having this conversation for the umpteenth time last week. My interlocutor asked whether it would be possible for Java to be as efficient as it was without primitive types. The answer is yes, and prompted this post.

So how would we go about getting rid of primitive types without incurring a significant performance penalty?

Java has a mandatory static type system; it is compiled into a statically typed assembly language (Java byte codes, aka JVML). It supports final classes. I do not favor any of these features, but we will take them as a given. The only changes we will propose are those necessary to eradicate primitive types.

Assume that the we have a final class Int representing 32 bit integers. The compiler can translate occurrences of this type into type int. Hence, we can generate the same code for scalars as Java does today with no penalty whatsoever.

To make Int a suitable replacement for int, we would like the syntactic convenience of using operators:

Int twice(Int i) { return i + i;}

This is easy enough. We don’t want operator overloading of course. What we want is simply the ability to define methods whose names are operators. The language can support the same set of binary operators it does today, with the same fixed precedence. No need for special syntax and rules to define the precedence of operators etc. I leave that sort of thing to people who love complexity.

It is crucial that Ints behave as values. This requires that the == operator on Ints behaves the same as equality. There should be no way to detect if we’ve allocated one copy of the number 3 or a million of them. Since we can define == as a method, we can define it in Object to work the way it does for most objects, and override it with a method in Int and other value classes to work differently. We’ll want to take care of identityHash as well of course.

At this point, we can simply say that int stands for Int. Except for locking. What would it mean to synchronize on an instance of Int? To avoid this nastiness, we’ll just say that all instances of Int are locked when they are created. Hence no user code can ever synchronize on them.

In this hypothetical dialect, literal integers such as 3 are considered instances of Int. They, and all other integer results, can be stored in collections, passed to polymorphic code that requires type Object etc. The compiler sees to it that they are usually represented as the primitive integer type of the JVM. When they are used as objects, they get boxed into real instances of Int. When an object gets cast to an Int, it gets unboxed.

This is similar to what happens today, except that it is completely transparent. Because these objects have true value semantics for identity, you can never tell if a new instance was allocated by the boxing or not; indeed, you cannot tell if boxing or unboxing occur at all.

So far I haven’t described anything really new. A detailed proposal along the lines above existed years ago. It would allow adding new user defined value types like Complex as well.

A more restricted, but somewhat similar proposal was considered as part of JSR201 - the idea being that boxing would create value objects, without actually replacing the existing primitive types. It was rejected by the expert group. As I recall, only myself and Corky Cartwright supported it. There were concerns as to what would become of the existing wrapper classes (Integer and friends); those are used by reflection etc., and are hopelessly broken because they have public constructors that allocate new instances for all the world to see.

Tangent: I don’t know who came up with the idea of an Integer class that could have multiple distinct instances of 3. I assume it was another engineering shortcut (see above) by someone really clever.

To be clear - I am not suggesting that this discussion should be reopened. This is just a mental exercise.

There is only one place where this proposal introduces a performance penalty: polymorphic code on type Object that tests for identity. Now that identity testing is method, it will be slower. Not by all that much, and only for type Object.

Why? Because we preclude overriding == in non-value classes. There are several ways to arrange this. Either by fiat, or by rearranging the class hierarchy with types Value (for value types) and Reference (for regular objects) as the only subtypes of Object, making == final in those two types. Or some other variation

What has not been seriously discussed, to my knowledge, is what to do with arrays. As indicated above, scalar uses of primitive types don’t pose much of a problem.

However, what of types like int[]? Again, we can allocate these as arrays of 32 bit integers just as we do today. No memory penalty, no speed penalty when writing in ints or reading them out.

What makes things complicated is this: If int is a subtype of Object (as it is in the above) we’d expect int[] to be a subtype of Object[], because in Java we expect covariant subtyping among arrays.

Of course that isn’t type safe, and one could certainly argue that this our chance to correct that problem. But I won’t. Instead, assume we want to preserve Java’s covariant array subtyping.

The thing is, we do not want to box all the elements of an int[] when we assign it to an Object[].

The zeroth order solution is to leave the existing subtype relations among arrays of the predefined types unchanged. It’s ugly, but we are no worse off than we are today - and have a bunch of advantages because we are free of primitive types. But we can do better.

Suppose Object[] has methods getObjectElement and setObjectElement. These (respectively) retrieve and insert elements into the array.

We can equip int[] with versions of these methods that box/unbox elements that are being retrieved or inserted. We ensure that a[i] and a[i] = e are compiled differenty based upon the static type of a. If a is Object[], we use the getObjectElement and setObjectElement. If a is of type int[], we use methods that access the integer representation directly, say, getIntElement and setIntElement.

At this point, we can even introduce meaningful relations among array types like int[] and short[] using the same technique.

In essence, Object[] provides an interface that different array types may implement differently. We just provide a sugar for accessing this interface magically based on type. On a modern implementation like Hotspot, the overhead for this would be minimal.

Of course, if you make arrays invariant, the whole issue goes away. That just makes things too easy. Besides, I’ve come to the conclusion that Java made the right call on array covariance in the first place.

All in all - Java could have been purely object oriented with no significant performance hit. But it wasn’t, isn’t and likely won’t. Sic Transit Gloria Mundi.