SID-7 - Scala 2.8 Arrays

The Problem

Arrays have turned out to be one of the trickiest concepts to get right in
Scala. This has mostly to do with very hard constraints that clash with what’s
desirable. On the one hand, we want to use arrays for interoperation with
Java, which means that they need to have the same representation as in Java.
This low-level representation is also useful to get high performance out of
arrays. But on the other hand, arrays in Java are severely limited.

First, there’s actually not a single array type representation in Java but
nine different ones: One representation for arrays of reference type and
another eight for arrays of each of the primitive types byte, char,
short, int, long, float, double, and boolean. There is no common
type for these different representations which is more speciﬁc than just
java.lang.Object, even though there are some reﬂective methods to deal with
arrays of arbitrary type in java.lang.reflect.Array. Second, there’s no way
to create an array of a generic type; only monomorphic array creations are
allowed. Third, the only operations supported by arrays are indexing, updates,
and get length.

Contrast this with what we would like to have in Scala: Arrays should slot
into the collections hierarchy, supporting the hundred or so methods that are
deﬁned on sequences. And they should certainly be generic, so that one can
create an Array[T] where T is a type variable.

The Past

How to combine these desirables with the representation restrictions imposed
by Java interoperability and performance? There’s no easy answer, and I
believe we got it wrong the ﬁrst time when we designed Scala. The Scala
language up to 2.7.x “magically” wrapped and unwrapped arrays when required in
a process called boxing and unboxing, similarly to what is done to treat
primitive numeric types as objects. “Magically” means: the compiler generated
code to do so based on the static types of expressions. Additional magic made
generic array creation work. An expression like new Array[T] where T is a type
parameter was converted to new BoxedAnyArray[T]. BoxedAnyArray was a special
wrapper class which changed its representation depending on the type of the
concrete Java array to which it was cast. This scheme worked well enough for
most programs but the implementation “leaked” for certain combinations of type
tests and type casts, as well as for observing uninitialized arrays. It also
could lead to unexpectedly low performance. Some of the problems have been
described by David MacIver [1] and Matt Malone [2].

Boxed arrays were also unsound when combined with covariant collections. In
summary, the old array implementation technique was problematic because it was
a leaky abstraction that was complicated enough so that it would be very
tedious to specify where the leaks were to be expected.

Exploring the Solution Space

The obvious way to reduce the amount of magic needed for arrays is to have two
representations: One which corresponds closely to a Java array and another
which forms an integral part of Scala’s collection hierarchy. Implicit
conversions can be used to transparently convert between the two
representations. This is the gist of the array refactoring proposal of David
MacIver (with contributions by Stepan Koltsov) [3]. The main problem
with this proposal, as I see it, is that it would force programmers to choose
the kind of array to work with. The choice would not be clear-cut: The Java-
like arrays would be fast and interoperable whereas the Scala native arrays
would support a much nicer set of operations on them. With a choice like this,
one would expect different components and libraries to make different
decisions, which would result in incompatibilities and brittle, complex code.
MacIver and Koltsov introduce some compiler magic to alleviate this. They
propose to automatically split a method taking an array as an argument into
two overloaded versions: one taking a Java array and one taking a generic
Scala array. I believe this would solve some of the more egregious plumbing
issues, but it would simply hide the problem a bit better, not solve it.

A similar idea—- but with a slightly different slant—- is to “dress up” native
arrays with an implicit conversion that integrates them into Scala’s
collection hierarchy. This is similar to what’s been done with the String to
RichString conversion in pre-2.8 Scala. The difference to the MacIver/Koltsov
proposal is that one would not normally refer to Scala native arrays in user
code, just as one rarely referred to RichString in Scala. One would only rely
on the implicit conversion to add the necessary methods and traits to Java
arrays. Unfortunately, the String/RichString experience has shown that this is
also problematic. In par- ticular, in pre 2.8 versions of Scala, one had the
non-intuitive property that

The problem here was that the reverse method was inherited from class Seq
where it was defined to return another Seq. Since strings are not sequences,
the only feasible type reverse could return when called on a String was
RichString. But then the equals method on Strings which is inherited from
Java would not recognize that a String could be equal to a RichString.

2.8 Collections

The new scheme of Scala 2.8 solves the problems with both arrays and strings.
It makes critical use of the new 2.8 collections framework which accompanies
collection traits such as Seq with implementation traits that abstract over
the representation of the collection. For instance, in addition to trait Seq
there is now a trait

trait SeqLike[+Elem, +Repr] { ... }

That trait is parameterized with a representation type Repr. No assumptions
need to be made about this representation type; in particular it not required
to be a subtype of Seq. Methods such as reverse in trait SeqLike will
return values of the representation type Repr rather than Seq. The Seq
trait then inherits all its essential operations from SeqLike, instantiating
the Repr parameter to Seq.

trait Seq[+Elem] extends ... with SeqLike[Elem, Seq[Elem]] { ... }

A similar split into base trait and implementation trait applies to most other
kinds of collections, including Traversable, Iterable, and Vector.

Integrating Arrays

We can integrate arrays into this collection framework using two implicit
conversions. The first conversion will map an Array[T] to an object of type
ArrayOps, which is a subtype of type VectorLike[T, Array[T]]. Using this
conversion, all sequence operations are available for arrays at the natural
types. In particular, methods will yield arrays instead of ArrayOps values as
their results. Because the results of these implicit conversions are so short-
lived, modern VM’s can eliminate them altogether using escape analysis, so we
expect the calling overhead for these added methods to be essentially zero.

So far so good. But what if we need to convert an array to a real Seq, not
just call a Seq method on it? For this there is another implicit conversion,
which takes an array and converts it into a WrappedArray. WrappedArrays
are mutable Vectors that implement all vector operations in terms of a given
Java array. The difference between a WrappedArray and an ArrayOps object
is apparent in the type of methods like reverse: Invoked on a
WrappedArray, reverse returns again a WrappedArray, but invoked on an
ArrayOps object, it returns an Array. The conversion from Array to
WrappedArray is invertible. A dual implicit conversion goes from
WrappedArray to Array. WrappedArray and ArrayOps both inherit from an
implementation trait ArrayLike. This is to avoid duplication of code between
ArrayOps and WrappedArray; all operations are factored out into the common
ArrayLike trait.

Avoiding Ambiguities

So now that we have two implicit conversions from Array to ArrayLike
values, how does one choose between them and how does one avoid ambiguities?
The trick is to make use of a generalization of overloading and implicit
resolution in Scala 2.8. Previously, the most specific overloaded method or
implicit conversion would be chosen based solely on the method’s argument
types. There was an additional clause which said that the most specific method
could not be defined in a proper superclass of any of the other alternatives.
This scheme has been replaced in Scala 2.8 by the following, more liberal one:
When comparing two different applicable alternatives of an overloaded method
or of an implicit, each method gets one point for having more specific
arguments, and another point for being defined in a proper subclass. An
alternative “wins” over another if it gets a greater number of points in these
two comparisons. This means in particular that if alternatives have identical
argument types, the one which is defined in a subclass wins.

Applied to arrays, this means that we can prioritize the conversion from
Array to ArrayOps over the conversion from Array to WrappedArray by
placing the former in the standard Predef object and by placing the latter
in a class LowPriorityImplicits, which is inherited from Predef. This way,
calling a sequence method will always invoke the conversion to ArrayOps. The
conversion to WrappedArray will only be invoked when an array needs to be
converted to a sequence.

Integrating Strings

Essentially the same technique is applied to strings. There are two implicit
conversions: The first, which goes from String to StringOps, adds useful
methods to class String. The second, which goes from String to
WrappedString, converts strings to sequences.

Generic Array Creation and Manifests

That’s almost everything. The only remaining question is how to implement
generic array creation. Unlike Java, Scala allows an instance creation
new Array[T] where T is a type parameter. How can this be implemented, given
the fact that there does not exist a uniform array representation in Java? The
only way to do this is to require additional runtime information which
describes the type T. Scala 2.8 has a new mechanism for this, which is
called a Manifest. An object of type Manifest[T] provides complete
information about the type T. Manifest values are typically passed in implicit
parameters; and the compiler knows how to construct them for statically
known types T. There exists also a weaker form named ClassManifest which can
be constructed from knowing just the top-level class of a type, without
necessarily knowing all its argument types. It is this type of runtime
information that’s required for array creation.

Here’s an example. Consider the method tabulate which forms an array from
the results of applying a given function f on a range of numbers from 0
until a given length. Up to Scala 2.7, tabulate could be written as follows:

In Scala 2.8 this is no longer possible, because runtime information is
necessary to create the right representation of Array[T]. One needs to
provide this information by passing a ClassManifest[T] into the method as an
implicit parameter:

When calling tabulate on a type such as Int, or String, or List[T],
the Scala compiler can create a class manifest to pass as implicit argument to
tabulate. When calling tabulate on another type parameter, one needs to
propagate the requirement of a class manifest using another implicit parameter
or context bound. For instance:

def tabTen[T: ClassManifest](f: Int => T) = tabulate(10, f)

The move away form boxing and to class manifests is bound to break some
existing code that generated generic arrays as in the first version of
tabulate above. Usually, the necessary changes simply involve adding a
context bound to some type parameter.

Class GenericArray

For the case where generic array creation is needed but adding manifests is
not feasible, Scala 2.8 offers an alternative version of arrays in the
GenericArray class. This class is defined in package
scala.collection.mutable along the following lines.

Unlike normal arrays, GenericArrays can be created without a class manifest
because they have a uniform representation: all their elements are stored in
an Array[AnyRef], which corresponds to an Object[] array in Java. The
addition of GenericArray to the Scala collection library does demand a
choice from the programmer—- should one pick a normal array or a generic
array? This choice is easily answered, however: Whenever a class manifest for
the element type can easily be produced, it’s better to pick a normal array,
because it tends to be faster, is more compact, and has better
interoperability with Java. Only when producing a class manifest is infeasible
one should revert to a GenericArray. The only place where GenericArray is
used in Scala’s current collection framework is in the sortWith method of
class Seq. A call xs.sortWith(f) converts its receiver xs first to a
GenericArray, passes the resulting array to a Java sorting method defined in
java.util.Arrays, and converts the sorted array back to the same type of
Seq as xs. Since the conversion to an array is a mere implementation
detail of sortWith, we felt that it was unreasonable to demand a class
manifest for the element type of the sequence. Hence the choice of a
GenericArray.

Conclusion

In summary, the new Scala collection framework resolves some long-standing
problems with arrays and with strings. It removes a considerable amount of
compiler magic and avoids several pitfalls which existed in the previous
implementation. It relies on three new features of the Scala language that
should be generally useful in the construction of libraries and frameworks:
First, the generalization of overloading and implicit resolution allows one to
prioritize some implicits over others. Second, manifests provide type
information at run-time that was lost through erasure. Third, context bounds
are a convenient shorthand for certain forms of implicit arguments. These
three language features will be described in more detail in separate notes.