CSE 341: Mutable and optional data

In languages like Java (or C/C++), every object reference (or
pointer) may be null, and by default can be mutated.

In ML, as we have seen, all references to data default to being
immutable (unchangeable), and always point to some value; an
int * string tuple must contain an integer value and
a string value. This simplifies reasoning considerably:

When you extract parts of the value, you do not have to
worry about whether you'll get a value out.

When you construct a value, you know that it will retain
that value forevermore --- you cannot pass the value to some
other part of the program, which will update it behind your
back, thereby breaking some important invariant of your data
structure.

However, sometimes you need mutation (the ability to update a
value) and optional-ness. ML provides data types in the standard
library that provide them. Optional data is fairly
straightforward. Mutable data, however, represents a significant
departure from what we've covered previously.

Optional data

Usually, when you have a data type that requires an "empty"
case, you will define a customized constructor for that data type
--- for example, our polymorphic tree:

datatype 'a Tree = Empty | Node of 'a * 'a Tree * 'a Tree

However, sometimes it's annoying to define a new data type
whenever something is optional. What if you want to define a
find function over lists that only optionally returns
a value? You could define a new datatype:

datatype 'a FindResult = NotFound | Found of 'a

and then find could have type

(('a -> bool) * 'a list) -> 'a FindResult

But this is overkill, and you would have to do it for every
function that might optionally return an empty value. So ML
provides a standard polymorphic library datatype,
option:

datatype 'a option = NONE | SOME of 'a

You, as a user, can choose whether to use pattern-matching over
both cases, or raise an exception in the case of NONE. There's
also a getOpt function that allows you to provide a
default value to be returned in the NONE case:

- getOpt (NONE, ~1);
val it = ~1 : int
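
As a sketch of how these three choices look in practice, here is one possible find over lists that returns an option, followed by each way of consuming the result. The helper names isEven and describe are ours, introduced only for illustration.

```sml
(* find: return the first element satisfying pred, if any. *)
fun find (pred, []) = NONE
  | find (pred, x :: xs) = if pred x then SOME x else find (pred, xs)

(* Hypothetical helper predicate for the examples below. *)
fun isEven n = n mod 2 = 0

(* 1. Pattern-match on both cases. *)
fun describe r =
    case r of
        NONE => "nothing found"
      | SOME n => "found " ^ Int.toString n

val d1 = describe (find (isEven, [1, 3, 4, 5]))   (* "found 4" *)
val d2 = describe (find (isEven, [1, 3, 5]))      (* "nothing found" *)

(* 2. Supply a default with getOpt. *)
val g1 = getOpt (find (isEven, [1, 3, 5]), ~1)    (* ~1 *)

(* 3. Use valOf, which raises the Option exception on NONE. *)
val v1 = valOf (find (isEven, [2, 7]))                     (* 2 *)
val v2 = (valOf (find (isEven, [7])) handle Option => 0)   (* 0 *)
```

With this definition, find has type ('a -> bool) * 'a list -> 'a option, so no new datatype is needed for the "not found" case.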

Why not option everywhere?

Note that we could have used option instead of
defining multiple cases for our tree data:

datatype 'a Tree = Node of ('a * 'a Tree * 'a Tree) option;

In this representation, the argument of Node is optional; an
empty value is represented as follows:

Node NONE

This is more cumbersome, obviously. But actually, this is how
many languages (e.g., Java and C) typically encode data
types with an "empty" case. This is because in such languages,
all pointers can be null. Consider a typical Java tree node
class: its left and right child fields are ordinary object
references, and an empty subtree is represented by null.

Therefore, in Java-like languages, every reference to a type T
is really a reference to a type "T option". This means that the
programmer always has to consider whether some value might be null
and lead to a null pointer exception.
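
To make the comparison between the two ML encodings concrete, here is a sketch of a size function over both tree representations. The names OTree, ONode, size, and osize are introduced here for illustration only; the second datatype is renamed just so that both versions can coexist in one file.

```sml
(* Representation 1: a dedicated constructor for the empty case. *)
datatype 'a Tree = Empty | Node of 'a * 'a Tree * 'a Tree

fun size Empty = 0
  | size (Node (_, l, r)) = 1 + size l + size r

(* Representation 2: a single constructor wrapping an option. *)
datatype 'a OTree = ONode of ('a * 'a OTree * 'a OTree) option

fun osize (ONode NONE) = 0
  | osize (ONode (SOME (_, l, r))) = 1 + osize l + osize r

(* The same two-node tree in each representation. *)
val t  = Node (1, Node (2, Empty, Empty), Empty)
val ot = ONode (SOME (1, ONode (SOME (2, ONode NONE, ONode NONE)), ONode NONE))

val n1 = size t     (* 2 *)
val n2 = osize ot   (* 2 *)
```

Every pattern and every constructed value in the second version has to go through both ONode and SOME or NONE, which is the extra noise the text calls cumbersome.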

Mutable data

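Mutable data is handled in ML primarily using the 'a
ref polymorphic datatype, which has a single constructor,
ref. The contents of a ref cell are read with the prefix
dereference operator ! and updated with the infix assignment
operator :=.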
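
A short sketch of the basic ref operations: allocation with ref, reading with !, and updating with :=. The binding names (counter, alias, contents, and so on) are ours.

```sml
val counter = ref 0                (* allocate a cell holding 0; type int ref *)
val v0 = !counter                  (* read the cell: 0 *)
val () = counter := !counter + 1   (* assignment evaluates to unit *)
val v1 = !counter                  (* 1 *)

(* Binding a second name to the same cell creates an alias, not a copy. *)
val alias = counter
val () = alias := 10
val v2 = !counter                  (* 10 *)

(* ref is a constructor, so it can also appear in patterns. *)
fun contents (ref v) = v
val v3 = contents counter          (* 10 *)
```

Note that the binding counter itself never changes; only the contents of the cell it names do.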

In languages like Java or C, essentially all bindings
(including object fields, local variables, and class variables)
are actually bound to refs, because they can be updated. In fact,
in Java, all non-final object references are actually refs
to options, because they denote updatable locations that may
hold null.

(Thought question: what is the difference between an int
option ref and an int ref option?)

This is another example of ML's clean design and orthogonality
--- you do not get "more than you asked for" in a type, but you
can freely combine properties like mutability or optional-ness
when you want them.

Iteration in ML

Suppose you wanted to write an iterative sumList
function instead of a recursive one. Now that we have assignment,
we can do so --- it looks like this:
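
A minimal sketch of such a function, assuming an int list argument. The local names total and rest are our own choices, and the tail-recursive sumListRec is included only for comparison.

```sml
(* Iterative version: two ref cells and a while loop. *)
fun sumList (xs : int list) : int =
    let
        val total = ref 0     (* running sum *)
        val rest = ref xs     (* portion of the list not yet visited *)
    in
        (while not (null (!rest)) do
             (total := !total + hd (!rest);
              rest := tl (!rest));
         !total)
    end

(* Tail-recursive version, for comparison. *)
fun sumListRec (xs : int list) : int =
    let
        fun loop ([], acc) = acc
          | loop (y :: ys, acc) = loop (ys, acc + y)
    in
        loop (xs, 0)
    end

val s1 = sumList [1, 2, 3]      (* 6 *)
val s2 = sumListRec [1, 2, 3]   (* 6 *)
```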

Note our use of the (expr; ... ; expr) expression sequence
syntax. Even making allowances for the many explicit dereferences
that ML forces you to write, I claim this is clearly uglier than
the recursive version, even taking into account the tail-recursion
conversion.

Suggested exercise: try to write map,
filter, and foldl using iteration.
Which do you prefer, the iterative or recursive formulations of
these functions?

The polymorphic ref problem

Mutable data brings us to an interesting and rather subtle type
system problem. Suppose we could have a value of type 'a
ref (note: the following is not legal ML code, for
reasons we'll discuss shortly):

val x:'a list ref = ref [];

Seems to make perfect sense: [] has type 'a
list (it's a polymorphic value), so we should be able to
allocate a ref cell and assign that to a binding of type 'a
list ref. But now suppose we have the following
code:

fun f y = x := y;
f [17];

Since x has the type 'a list ref, the
function f ought to have the type 'a list ->
unit, and the body of f ought to typecheck ---
we're updating the contents of 'a list ref with a
value of type 'a list.

We should then be able to apply f to the value
[17] by instantiating f's type to
int list -> unit. Evaluation of f
[17] results in the list value [17] becoming
the target of x's ref cell.

Now, suppose we do this:

fun g () = !x;
val y:bool list = g();
if hd(y) then "hi" else "bye";

(Pretend you don't know about f and f
[17], because the typechecker doesn't.) This code ought to
typecheck as well! Consider the body of g: it
dereferences x, which has type 'a list
ref. Therefore, g should get type unit
-> 'a list (the return type is the result type from
dereferencing a 'a list ref).

Now, when we bind the result of g(), which has type 'a list, to a
bool list binding, we simply instantiate
'a with bool, so that binding is
well-typed.

Finally, we take the head of y and use it as a
boolean value. But, supposing we executed f [17] as
we did above, the head of y will not be a boolean
value --- it will be an integer. We have just violated type
safety. This is known as the "polymorphic ref problem" and comes
up wherever we have mutation and polymorphism together.

Where did we go wrong?

ML's answer is that we should not allow the type 'a list
ref for a val binding, because it could be
instantiated later with two different types for 'a
--- which, as we've shown, can lead to writing the ref cell at one
type, and reading it at another.
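
For contrast, here is a sketch of the same program with the ref cell fixed at a single element type. This version is legal ML: the cell is written and read at the same type. The annotation on x is the only change to the definitions from the text.

```sml
(* The annotation fixes the element type, so x is an int list ref. *)
val x : int list ref = ref []

fun f y = x := y          (* f : int list -> unit *)
val () = f [17]

fun g () = !x             (* g : unit -> int list *)
val y : int list = g ()
val first = hd y          (* 17 *)

(* The bool-list use from the text is now rejected by the typechecker:
       val bad : bool list = g ()
   so it is left commented out here. *)
```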

More generally, ML strongly restricts the introduction of
polymorphic types for val bindings. For a
binding

val name = expr

name is given polymorphic type only if
expr is a syntactic value.
Recall that a value is an expression that is "done" evaluating ---
a syntactic value is a syntactic representation
of an immutable value. Syntactic values include only the
following kinds of expressions:

Literal constants.

Anonymous function expressions (fn ... =>
...).

Constructors of immutable types applied to expressions that
are (recursively) syntactic values.

Note that function calls are not included. This rule is called
the value restriction. It suffices to make sure
that you're not creating mutable locations, either directly (by
constructing a mutable location) or indirectly (e.g., by calling a
function that constructs a ref cell).
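
A small sketch of the rule in action. The bindings below all have syntactic values on the right-hand side, so they receive polymorphic types and can be used at several types. The names (id, empty, pair, mapId) are ours.

```sml
(* Syntactic values: these bindings are generalized. *)
val id = fn x => x                 (* 'a -> 'a *)
val empty = []                     (* 'a list *)
val pair = (NONE, [])              (* 'a option * 'b list *)

(* Because they are polymorphic, each can be used at several types. *)
val a = id 3
val b = id "three"
val c = (1 :: empty, "s" :: empty)

(* A function call is not a syntactic value, so
       val mapId = map id
   would not be generalized. Wrapping the call in fn makes the
   right-hand side a syntactic value again. *)
val mapId = fn xs => map id xs     (* 'a list -> 'a list *)
val m1 = mapId [1, 2]
val m2 = mapId ["x"]
```

The rewrite of mapId is sometimes called eta-expansion; it is the usual workaround when a polymorphic function is defined by partial application.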

When you get a polymorphic type from a non-syntactic-value
expression, and attempt to bind it to a name, ML will instantiate
the polymorphic type with a dummy type. This is why ML complains
about the value restriction when you write:

val x = ref NONE;

Recall that NONE has polymorphic type 'a
option. ref NONE therefore, naively, has type
'a option ref; but this is not a syntactic value, so
the 'a, rather than being "passed through" to the
type of x, is instantiated with a fresh,
non-polymorphic dummy type that SML/NJ prints as
?.X1.
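
Two ways to stay within the value restriction, sketched under the assumption that we know which element type we want. The names newCell, ci, and cs are hypothetical.

```sml
(* An annotation makes the type monomorphic, so there is no type
   variable left to generalize and no dummy type appears. *)
val x : int option ref = ref NONE
val () = x := SOME 5
val v = getOpt (!x, 0)                 (* 5 *)

(* A function that allocates a cell is fine: function declarations
   are generalized, and each call creates a fresh cell. *)
fun newCell () = ref NONE              (* unit -> 'a option ref *)
val ci : int option ref = newCell ()
val cs : string option ref = newCell ()
val () = ci := SOME 1
val () = cs := SOME "one"
```

Because ci and cs are distinct cells, each with its own fixed type, there is no way to write one at int and read it back at string.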

Arrays

ML has other updatable data structures, including arrays, which
work similarly to refs. Array functions are found in the
Array structure (we haven't covered structures, but
for now think of a structure as something like a Java package or a
C++ namespace):
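
A brief sketch of some commonly used functions from the Basis Library's Array structure. The binding names are ours.

```sml
val arr = Array.array (3, 0)          (* three slots, each initialized to 0 *)
val () = Array.update (arr, 1, 42)    (* store 42 at index 1 *)
val second = Array.sub (arr, 1)       (* 42 *)
val len = Array.length arr            (* 3 *)

val fromL = Array.fromList ["a", "b", "c"]
val last = Array.sub (fromL, 2)       (* "c" *)

(* Out-of-range indices raise the Subscript exception. *)
val oob = (Array.sub (arr, 3) handle Subscript => ~1)   (* ~1 *)

(* Sum the elements with Array.foldl. *)
val total = Array.foldl (op +) 0 arr  (* 42 *)
```

Indices start at 0, and Array.update returns unit, just as := does for refs.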

One could instead build an array from existing pieces, as an
immutable vector of ref cells. The problem with this is that using
a ref cell has some overhead compared to using an ordinary value
reference, and it is quite challenging to remove this overhead in
the general case. The naive implementation of a vector of ref
cells is shown in Fig. 2.

Because programs that use arrays (for example, numerical
programs) typically require high time and space performance in
array operations, this cost was considered prohibitive. ML chose
to compromise its "purity" and offer an Array data
type that stands for a direct array of mutable locations.
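
The two approaches can be sketched side by side. This only illustrates the programming interface; the comments about indirection describe the naive implementation discussed above, not a measurement. The names cells, viaRefs, and viaArray are ours.

```sml
(* "Array" built from existing pieces: an immutable vector whose
   elements are individually allocated ref cells. *)
val cells : int ref vector = Vector.tabulate (3, fn _ => ref 0)
val () = Vector.sub (cells, 1) := 7       (* find the cell, then update it *)
val viaRefs = !(Vector.sub (cells, 1))    (* find the cell, then read it: 7 *)

(* Built-in array: the slots themselves are the mutable locations. *)
val arr = Array.array (3, 0)
val () = Array.update (arr, 1, 7)
val viaArray = Array.sub (arr, 1)         (* 7 *)
```

Vector.tabulate is used so that each slot gets its own freshly allocated cell.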

This work is licensed under a Creative Commons
License. Rights are held by the University of Washington,
Department of Computer Science and Engineering.