JEP draft: Type operator expressions in the JVM

Summary

Extend the space of JVM type descriptors to include type operators,
which are symbolic references to factory-made types. This is a
separable component of template classes.

Goals

Allow JVM type descriptors (for methods, fields, and constants) to
make new distinctions between types not already present in the system
of classes, primitives, and arrays. Support future translation
strategies which must make distinctions between different usages of
the same basic JVM type, or which must provide a way to specify
factory input to a class factory or template species factory.

Non-Goals

This work is a low-level VM hook, like invokedynamic, not a language
feature like lambdas. As such, it will not propose any specific
mechanism for representing parameterized types; it will only provide a
necessary "hook" to name such types. It will not provide a new way to
define classes; it will only provide a way to associate such classes
with a public symbolic descriptor. It will not define any language
features, nor translation strategies. It will not attempt to extend,
conflict with, or rationalize the current syntax for static generic
signatures (JVMS 4.7.9.1).

Success Metrics

Experimental translation strategies can be created which distinguish
List<Integer> from List<String> in classfiles. Experimental class
templating mechanisms will be able to create species that are
denotable from JVM type descriptors. Designers of language features
and translation strategies will be able to vary the encodings of new
source-level types by changing a bootstrap method, rather than
changing the JVM's core logic. Security proofs will be easier to
construct, given the black-box nature of type operators, decoupled
from the complex details of templates and other advanced language
features. Experimental migration strategies can be tested without
fully instantiating new language features, since new place-holder
types can easily be posited by simple changes in javac.

Motivation

Descriptors which can denote complex type instances, such as
List<int> or List<ComplexDouble>, are a necessary component of
"reified generics", which in turn are a goal of Project Valhalla. If
a value type is to "code like a class, work like an int", then it
seems necessary to be able to denote container types which are
customized to that value type, rather than being erased to Object
like a reference type.

Description

We will extend the JVM's fundamental syntax for field descriptors,
once for all future type schemes (we hope!). The syntax will allow
any single type descriptor to be modified by an optional suffix, which
has the effect of constraining the original type descriptor in an ad
hoc, programmable manner. The combination of the original type
descriptor and the suffix is called a type operator expression.

The resolvable semantic elements of this expression are:

carrier type: the original type descriptor (before the suffix)

type operator name: a resolved class name and/or simple identifier

type arguments: zero or more type descriptors and/or other constants

All of the above semantic elements are optional; any may be omitted.
If the type operator name is omitted, it will be derived from the
carrier type, as in the case of a template class whose top type is the
unspecialized class itself. If the carrier type is omitted, it is
defined to be Object, the customary carrier for untyped values in
the JVM.
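The defaulting rules above can be sketched as a small data model. This is only an illustration: the record TypeOperatorExpression, its accessors, and the use of null for an omitted element are assumptions made for the example, not part of the proposal.

```java
import java.util.List;

/** Hypothetical model of the three semantic elements; not a proposed API. */
record TypeOperatorExpression(String carrier, String operatorName, List<String> arguments) {

    /** Descriptor of the default carrier, Object. */
    static final String OBJECT = "Ljava/lang/Object;";

    TypeOperatorExpression {
        // An omitted argument list is treated as empty.
        arguments = (arguments == null) ? List.of() : List.copyOf(arguments);
    }

    /** An omitted (null) carrier is defined to be Object. */
    String effectiveCarrier() {
        return carrier == null ? OBJECT : carrier;
    }

    /** An omitted (null) operator name is derived from the carrier type. */
    String effectiveOperatorName() {
        return operatorName == null ? effectiveCarrier() : operatorName;
    }

    public static void main(String[] args) {
        // Map<int,String>: carrier Map, operator name omitted, arguments int and String.
        TypeOperatorExpression map = new TypeOperatorExpression(
                "Ljava/util/Map;", null, List.of("I", "Ljava/lang/String;"));
        // int?: carrier omitted, operator "?", one argument int.
        TypeOperatorExpression nullableInt = new TypeOperatorExpression(null, "?", List.of("I"));
        System.out.println(map.effectiveOperatorName());
        System.out.println(nullableInt.effectiveCarrier());
    }
}
```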

For example, here are some potential use cases for type operator
expressions:

reified generics: The carrier type is Map, the type operator
name is omitted, and the arguments are int and String.
The whole expression denotes Map<int,String>.

wildcards: The carrier type is List, the type operator name
is omitted, and the argument is the symbol (not type) ?. The
whole expression denotes List<?>, as distinct from raw List.
Given that wildcards are a special case of a concept called
"existential types", it is notable that type operator expressions
provide a way to wrap any bounded type (a carrier type) inside a
symbolically labeled existential type.

non-nullable references: The carrier type is String and the
type operator is ! (or java/lang/NotNull) with no arguments.
The whole expression denotes String!, a non-nullable string
reference.

nullable values: The omitted carrier type defaults to Object
and the type operator is ? (or java/lang/Nullable) with one
argument int. The whole expression denotes int?, a nullable
integer.

reified intersections: The carrier type is some interface I
and the type operator is & with a type argument J. The whole
expression denotes the intersection type I&J.

reified unions: The omitted carrier type defaults to Object
and the type operator name is | with two or more type arguments
I, J. The whole expression denotes the union type
I|J.

fixed-sized arrays: The carrier type is an array type
double[] and the type operator name is Array.length with one
argument 5. The whole expression denotes double[5], a
length-constrained array.

range constraints: The carrier type is a primitive type int
and the type operator name is Integer.interval with arguments
ge and 0. The whole expression denotes int constrained to
non-negative values.

null and notreached type tokens: The omitted carrier type
defaults to Object and the type operator is java/lang/Null or
java/lang/NotReached. The whole expression denotes a reference
constrained to be null, or a reference that is never delivered to
its consumer (i.e., the constraint always fails).

The concrete grammar for such descriptors, including new productions,
will be something like the following:

This grammar is built on a slightly edited form of the one in JVMS 4.3.
The new productions which support type operators are TypeExpr,
TypeCarrier, TypeOpName, TypeArg, NumberArg, and NameArg.
(They are starred.) The production for Identifier is taken from
JVMS 4.7.9.1.

A TypeExpr denotes a fresh type which is treated by the JVM as
distinct from any other type with a different descriptor string,
including primitives, arrays, classes, and other TypeExprs.

The syntactic components of a TypeExpr are a TypeCarrier, a
TypeOpName, and a sequence of zero or more TypeArgs. These denote
the resolvable semantic components of a resolved type operator
expression, which are respectively the carrier type, the type
operator name, and the type arguments.

Two TypeExprs with exactly the same spelling denote the same type.
Any FieldType which is a proper prefix of another FieldType is a
proper supertype of the longer FieldType. Other than those
relations, the JVM does not recognize any equivalences or relations
between types with differently spelled TypeExprs.

In particular, the verifier treats every distinct type operator
expression as a generic "black box" type, which starts with the
carrier type and constrains it in some way, unknowable to the
verifier.

Thus, the verifier will allow values of the type operator type to
implicitly convert to its carrier type, or any supertypes of its
carrier type, but it will not allow such values to be converted to any
other type. Also, the verifier will not convert implicitly from a
carrier type to a type operator type built on top of that carrier
type; such conversions must be performed by explicit bytecode
execution.
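The spelling-based rules above can be sketched as plain string comparisons. This is an illustration under stated assumptions: the class VerifierRelationSketch and its method names are hypothetical, both inputs are assumed to be well-formed FieldType strings, and conversion to supertypes of a legacy carrier class (which needs the class hierarchy) is left out.

```java
/** Hypothetical sketch of the verifier's spelling-based type relations. */
final class VerifierRelationSketch {

    /** Two descriptors denote the same type only if they are spelled identically. */
    static boolean sameType(String a, String b) {
        return a.equals(b);
    }

    /** A FieldType that is a proper prefix of another is its proper supertype. */
    static boolean isProperSupertype(String shorter, String longer) {
        return longer.length() > shorter.length() && longer.startsWith(shorter);
    }

    /**
     * Implicit conversion goes only from a type operator type toward its
     * carrier (a prefix), never from the carrier to the constrained type.
     */
    static boolean implicitlyConvertible(String from, String to) {
        return sameType(from, to) || isProperSupertype(to, from);
    }

    public static void main(String[] args) {
        System.out.println(implicitlyConvertible("I/$J;/$K;", "I/$J;")); // toward the carrier
        System.out.println(implicitlyConvertible("I/$J;", "I/$J;/$K;")); // would need explicit bytecode
        System.out.println(implicitlyConvertible("I/$J;/$K;", "I/$K;/$J;")); // unrelated spellings
    }
}
```

In this model, the reverse direction would be handled by an explicit operation such as checkcast rather than by the relation itself.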

Here are some syntax examples of descriptors containing type operator
expressions (along with some hypothetical meanings):

L/[$Arg;] (lone argument with no type operator: no hypothetical meaning)

L/LFoo[LBar;/$N;] (the type species Foo<N-Bar>)

L/LFoo[LBar;]/$N; (the type species N-Foo<Bar>)

[D/$length[5;]/$N; (N variant of fixed-sized array double[5])

[D/$N/$length[5;]; (fixed-sized variant of N-variant of double[])

The last four examples show that type operator expressions can nest.
For example, L/LFoo[LBar;/$N;] denotes a type which is derived first
from Bar by modifying it with N, then passing the modified type to
the parameterized type constructor Foo. (The carrier type of the
result is Object, not Foo.) The last two examples show that type
expressions can nest by piling up several TypeOp suffixes. The
order of these suffixes is significant purely because the descriptor
strings are different: I/$J;/$K; is a different verifier type from
I/$K;/$J; even if the computational effects of the J and K type
modifiers happen to commute.

The JVM will accept type operator expressions, structured as
TypeExpr strings, in the following contexts:

Normally, descriptor syntaxes are disjoint from the syntax of class
names that appear with CONSTANT_Class constants. For example, the
descriptor I is very different from the class name I. However, in
some cases the syntaxes can overlap; the class name of an array is the
same as its descriptor, including the trailing semicolon. We use this
trick with type operator expressions also, so that the same type
operator expression can be inserted directly into a descriptor, and
also used as a class name.

A class name string can be unambiguously distinguished as a type
operator expression in three steps:

1. check if the last character is ] or ; (otherwise, fail)

2. if the string begins with [, parse the array type name and look for a following /

3. otherwise, scan the string to see if the character ; or [ occurs

If the first step and any of the remaining steps pass, then the class
name string is proven not to be a plain class name or an array class
name, and may be assumed to be a type operator expression (or else
an erroneous input). Otherwise it can be assumed to be a plain
class name (or array class name). Another simpler technique (though
perhaps a slower one) is simply to parse the class name string as a
simple class or array name, and see if the end is reached, or else the
next remaining character is slash / introducing a type operator
suffix; in that case the second step must be executed first.

The second and third steps are expensive but necessary; they can be
deferred until after the first step, which is cheap. Note that the
JVMS specifies that a class name may not contain an open bracket [
unless it is an array type name, and in that case the bracket will not
follow a package separator /. Therefore the class name grammar is
not ambiguous, even after type operator expressions are added.
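A minimal sketch of the three-step check might look as follows. The class ClassNameClassifierSketch and its methods are hypothetical names; the sketch assumes that legacy array type names have the JVMS form (one or more [ followed by a primitive letter or an L...; class reference), and it reports a malformed array prefix as "not a type operator expression" so that ordinary class name validation can reject it.

```java
/** Hypothetical sketch of the class-name disambiguation steps. */
final class ClassNameClassifierSketch {

    /** True if the name should be handled as a type operator expression. */
    static boolean isTypeOperatorExpression(String name) {
        if (name.isEmpty()) return false;
        // Step 1 (cheap): the last character must be ']' or ';'.
        char last = name.charAt(name.length() - 1);
        if (last != ']' && last != ';') return false;
        if (name.charAt(0) == '[') {
            // Step 2: parse the array type name, then look for a following '/'.
            int end = endOfArrayTypeName(name);
            return end >= 0 && end < name.length() && name.charAt(end) == '/';
        }
        // Step 3: otherwise scan for ';' or '['.
        return name.indexOf(';') >= 0 || name.indexOf('[') >= 0;
    }

    /** Index just past a legacy array type name such as "[[I", or -1 if malformed. */
    static int endOfArrayTypeName(String s) {
        int i = 0;
        while (i < s.length() && s.charAt(i) == '[') i++;
        if (i == s.length()) return -1;
        char c = s.charAt(i);
        if ("BCDFIJSZ".indexOf(c) >= 0) return i + 1;   // primitive component
        if (c == 'L') {                                   // class component, ends at ';'
            int semi = s.indexOf(';', i);
            return semi < 0 ? -1 : semi + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] names = { "java/lang/String", "[Ljava/lang/String;", "[D/$length[5;]/$N;", "I/$J;/$K;" };
        for (String n : names) {
            System.out.println(n + " -> " + isTypeOperatorExpression(n));
        }
    }
}
```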

Some operations on a type expression require access to the inside of
the black box. These include loading a reflective constant for a type
expression, making a type test (checkcast), making an array type
whose component is the type expression, calling a method on an
instance whose verified type is a type expression, etc.

The built-in resolution mechanism for type operator expressions will
perform the following jobs:

Derive a bootstrap method ("BSM") from the TypeCarrier and TypeOpName.

Call the BSM on the TypeArgs, suitably parsed and reified.

(Also pass relevant context, such as the current class, the carrier, and the operator name.)

Receive in reply from the BSM a resolved type descriptor for the type.

Permanently and atomically record that descriptor for that exact type expression.

Use the descriptor to derive the various behaviors required for that type.

The details of these steps and the associated APIs are defined
elsewhere, and may be extended over time. See below for a sketch of
resolved type descriptors and their behavior. Type operators are
named by an optional class and optional identifier. If the class is
present, it will help determine the bootstrap method; for example, if
it is a template, the template will be specialized to the given
arguments. If the identifier only is present, the BSM will be a
centralized one which assigns fixed standard meanings to a small
number of names.
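The "call the BSM once, then record the result permanently" part of these steps can be sketched with an ordinary concurrent map. Everything here is an assumption for illustration: the names TypeExprResolutionSketch, Bootstrap, and Resolved, the BSM signature, and the choice to pass the BSM in directly rather than derive it from the carrier and operator name. A real JVM would also have to cope with recursive resolution of nested type expressions, which this sketch does not attempt.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of once-only resolution of a type operator expression. */
final class TypeExprResolutionSketch {

    /** Stand-in for the resolved type descriptor returned by a BSM. */
    record Resolved(String expression, Class<?> carrierClass) {}

    /** Hypothetical BSM shape: context plus the parsed and reified arguments. */
    interface Bootstrap {
        Resolved resolve(Class<?> caller, String carrier, String operatorName, List<Object> args);
    }

    private final ConcurrentMap<String, Resolved> resolved = new ConcurrentHashMap<>();

    /** Resolves an exact expression string at most once and records the result. */
    Resolved resolve(String expression, Class<?> caller, String carrier,
                     String operatorName, List<Object> args, Bootstrap bsm) {
        // computeIfAbsent applies the function atomically per key; later
        // requests for the same spelling see the recorded descriptor.
        return resolved.computeIfAbsent(expression,
                key -> bsm.resolve(caller, carrier, operatorName, args));
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Bootstrap bsm = (caller, carrier, op, a) -> {
            calls.incrementAndGet();
            return new Resolved(carrier + "/$" + op + ";", int.class);
        };
        TypeExprResolutionSketch resolver = new TypeExprResolutionSketch();
        Resolved first = resolver.resolve("I/$J;", TypeExprResolutionSketch.class, "I", "J", List.of(), bsm);
        Resolved second = resolver.resolve("I/$J;", TypeExprResolutionSketch.class, "I", "J", List.of(), bsm);
        System.out.println(first == second);
        System.out.println(calls.get());
    }
}
```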

When value types become available, type operator expressions will also
be allowed to interoperate with value types. A given type operator
expression will always be unambiguously assigned a kind, as a value or
a reference. If other kinds are invented, type operator expressions
will be "kinded" in the same way. For example, the '$' could be
followed by a kind character, or additional characters besides '$'
could be assigned to introduce type operator expressions of various
distinct kinds.

The resolved type descriptor will not be a Class but will have its own
reflective type and API. The descriptor will report a concrete
carrier Class which is compatible with all values described by the original
type operator expression. The BSM for a type operator may return a
resolved type descriptor which reports only Object as its carrier
class, or it may spin and load a new anonymous class, and use that.
In either case, the JVM will be able to use the carrier class as a
safe supertype for the type operator expression. The JVM will not
freely convert from the carrier class to the type operator type,
except via a checkcast bytecode, whose behavior is under the control
of the resolved type descriptor selected by the BSM.
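One possible shape for such a descriptor is sketched below. Only the ideas of a carrier class and a descriptor-controlled checkcast come from the text; the interface name, the checkcast method signature, and the NON_NULL_STRING example are hypothetical.

```java
/** Hypothetical sketch of a resolved type descriptor and its checkcast behavior. */
final class ResolvedTypeDescriptorSketch {

    /** Assumed reflective API; not a Class, and not a proposed signature. */
    interface ResolvedTypeDescriptor {
        /** A class compatible with all values of the type operator type. */
        Class<?> carrierClass();

        /** Models checkcast: returns the value, or throws ClassCastException. */
        Object checkcast(Object value);
    }

    /** Descriptor in the spirit of the non-nullable String use case. */
    static final ResolvedTypeDescriptor NON_NULL_STRING = new ResolvedTypeDescriptor() {
        @Override
        public Class<?> carrierClass() {
            return String.class;
        }

        @Override
        public Object checkcast(Object value) {
            if (value == null) {
                throw new ClassCastException("null is not a non-nullable String");
            }
            // The carrier check still applies: non-String values are rejected.
            return String.class.cast(value);
        }
    };

    public static void main(String[] args) {
        System.out.println(NON_NULL_STRING.carrierClass().getName());
        System.out.println(NON_NULL_STRING.checkcast("hello"));
        try {
            NON_NULL_STRING.checkcast(null);
        } catch (ClassCastException expected) {
            System.out.println("rejected null");
        }
    }
}
```

Converting in the other direction, from the constrained type to String, needs no check in this model, which mirrors the carrier class acting as a safe supertype.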

Note that the type operator expression language is self-contained and
pre-normalized. It does not make references into any constant pool,
nor is there any "calculus" for proving that two distinct type
expressions denote the same type.

It is an open question whether any of the ResolvedTypeDescriptor API should
be merged into the Class API. That decision could create a set of
secondary "crasses" (runtime type quasi-classes) which do not directly
represent a classfile, but instead represent a type somehow derived
from or related to one or more classfiles. There is some precedent
for this, since the existing Class instances for primitives and
void, and for arrays, may be viewed as "crasses". In that case, the
carrierClass API would probably be named getPrimaryClass, and
would map a "crass" to its nearest proper supertype (or Object or an
interface), and there would be a new query isTypeExpression.

Keeping the ResolvedTypeDescriptor API disjoint from the legacy
Class API would be cleaner, but would also require us to duplicate
or extend many APIs, such as Lookup, in which Class is a proxy for
a JVM type descriptor. An interface TypeDescriptor (proposed by the
Constable project) may give us a hook to generify those APIs, rather
than brutally duplicating them, and without introducing "crasses".
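As a rough illustration of that hook, the sketch below writes one API against java.lang.invoke.TypeDescriptor (available since Java 12, and implemented by Class) so that it accepts both a Class and a non-Class descriptor. The record TypeExprDescriptor and the method describe are hypothetical stand-ins, not proposed APIs.

```java
import java.lang.invoke.TypeDescriptor;

/** Hypothetical sketch: one API serving both Class and type-expression descriptors. */
final class TypeDescriptorHookSketch {

    /** Stand-in for a resolved type descriptor that is not a Class. */
    record TypeExprDescriptor(String descriptorString, Class<?> carrierClass)
            implements TypeDescriptor {}

    /** Written against TypeDescriptor, so no Class-specific overload is needed. */
    static String describe(TypeDescriptor type) {
        return type.descriptorString();
    }

    public static void main(String[] args) {
        System.out.println(describe(String.class));
        System.out.println(describe(new TypeExprDescriptor("I/$J;", int.class)));
    }
}
```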

Alternatives

This design can be viewed as a refinement of an earlier experimental
mechanism called "class-dynamic", which decoded a sub-language from
class name strings and spun classfiles on the fly in response to
resolution requests. That mechanism funneled the type operator
expression through the class name, which is similar to the above
design, but makes no distinction between a regular class reference and
a type operator expression.

The integration of type operators into the JVM seems to be cleaner if
the distinction between regular named classes and type expressions is
explicit from the beginning. In addition, we do not want to commit to
spinning classfiles in response to type operators; some use cases of
type operators intentionally alias regular classes, but with some
extra "annotation" payload injected. This cannot be done in a
framework which confuses class names with type expressions.

When we design template classes, we could attempt to add a
purpose-built descriptor syntax designed expressly for templates.
However, a design like the one in this JEP would be needed anyway.

We could try to live without reified generics altogether, in which
case the existing type descriptors would be serviceable.

Testing

// What kinds of test development and execution will be required in order
// to validate this enhancement, beyond the usual mandatory unit tests?

Risks and Assumptions

// Describe any risks or assumptions that must be considered along with
// this proposal.

Dependencies

// Describe all dependencies that this JEP has on other JEPs, JBS issues,
// components, products, or anything else.

Design FAQ

DRAFT DRAFT DRAFT
The following section will be part of the comments, not the JEP proper.

You didn't use dot . for type operator syntax; why not?
Because in some pathways, descriptors flow through class names, and
slashes are converted to dots and vice versa. Any distinction
between slash and dot would be lost at that point, without
complicated context-sensitive rules for dot-preservation or
dot-recovery.
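The loss can be seen with the usual character replacement between internal and binary names. In this sketch the class DotSlashSketch and the dot-based spelling are hypothetical; only the slash-based spelling follows the syntax used in this draft.

```java
/** Hypothetical sketch of why a dot-based operator separator would not survive. */
final class DotSlashSketch {

    /** Internal form to binary form, as done on some class-name pathways. */
    static String toBinaryName(String internalName) {
        return internalName.replace('/', '.');
    }

    /** Binary form back to internal form. */
    static String toInternalName(String binaryName) {
        return binaryName.replace('.', '/');
    }

    public static void main(String[] args) {
        String slashBased = "Ljava/util/List;/$N;";  // spelling used in this draft
        String dotBased = "Ljava/util/List;.$N;";    // hypothetical alternative
        // The slash-based spelling round-trips; the dot-based one does not.
        System.out.println(toInternalName(toBinaryName(slashBased)).equals(slashBased));
        System.out.println(toInternalName(toBinaryName(dotBased)).equals(dotBased));
    }
}
```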

That grammar is complicated: Everything seems optional. Why not
get rid of some optionality? Briefly, each optionality is
motivated as follows. The TypeCarrier could be removed in favor
of making it always Object, but many use cases for type operators
work within a static bound type, and it is wasteful not to allow
that static bound to appear as a true verifier type. Given a
TypeCarrier, it makes sense that the actual type operator should
sometimes be derived directly from the carrier and at other times be a
separately specified parameter, hence the optionality of the
TypeOpName. But if the TypeOpName is unrelated to the carrier
type, the carrier is often Object, hence a special abbreviation
for that common case that makes the TypeCarrier optional. So the
carrier can be either identical with the type operator, or
completely separate. The argument list is optional since some type
operators inherently require arguments while some are "just the
mode" (as with "not null"). The trailing semicolon ; for missing
a TypeArg list is a judgment call; it could be denoted instead by
[], but that seems egregiously noisy for a simple modifier like
"not null", and requiring a non-empty TypeArg list in the brackets
adds trivial complexity.

That grammar is complicated: Why are there different ways to
denote a type operator? Dropping the TypeOpName allows the
carrier and the type operator to come from the same class, as noted
above, while allowing the type operator to be a fully resolved class
name gives obvious modularity benefits. In the latter case allowing
an additional name to select a class member gives a way for one
class to expose a library of type operators. The final case, of a
simple identifier, allows either the carrier type to select a
class member (or "mode" argument such as "wildcarded"), or the
system to globally define a handful of type operators outside of the
package scoping system: ! (for "not null") and ? (for "maybe
null") are two such likely global operators.

That grammar is complicated: You allow too many kinds of type
operator arguments. Why not just have types as parameters? Type
arguments are clearly all you need to upgrade today's generics in
place, to reify their types inside of descriptors. But this is
short-sighted, since C++ templates allow many other kinds of
arguments. The grammar chosen above allows specification of a
reasonable array of non-type arguments corresponding to common use
cases of template arguments in C++ and other languages. After
types, strings are the obvious next candidate, and indeed strings
can denote anything else we need, and are agreeably fundamental in
the JVM. We threw in MethodType because that is a fundamental
construct in the JVM, and shouldn't be passed through a stringy
encoding channel. We threw in NumberArg because small integral
numbers are fundamental in various use cases, such as definite
arrays. All of the above correspond to natively encoded constant
pool entries (except integers which are larger than a long).

You forget MethodHandle and Double arguments, aren't those
fundamental also? Yes, they are, but they can be readily encoded
to bootstrap methods using combinations of the other argument types,
and designing a hardwired stringy encoding for them would be
needlessly complex. For a method handle, just pass several
arguments denoting its class, name, and type, with maybe a ref-kind
also. For a floating point number, consider using a string
containing its hex-float representation, to avoid problems with
rounding and ambiguity.
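A bootstrap method could decode such arguments along the following lines. This is a sketch under assumptions: the class EncodedArgumentSketch and its method names are hypothetical, only static methods are handled (no ref-kind argument), and any characters that are illegal in a descriptor string (such as the dot in a hex-float string) would still need an escaping scheme, which is ignored here.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/** Hypothetical sketch of decoding MethodHandle and double arguments in a BSM. */
final class EncodedArgumentSketch {

    /** Rebuilds a static method handle from class, name, and method descriptor. */
    static MethodHandle decodeStatic(String internalClassName, String name, String descriptor) {
        try {
            ClassLoader loader = EncodedArgumentSketch.class.getClassLoader();
            Class<?> owner = Class.forName(internalClassName.replace('/', '.'), false, loader);
            MethodType type = MethodType.fromMethodDescriptorString(descriptor, loader);
            return MethodHandles.publicLookup().findStatic(owner, name, type);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot decode method handle", e);
        }
    }

    /** Hex-float strings represent every double exactly, so no rounding occurs. */
    static String encodeDouble(double value) {
        return Double.toHexString(value);
    }

    static double decodeDouble(String hexFloat) {
        return Double.parseDouble(hexFloat);
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle max = decodeStatic("java/lang/Math", "max", "(II)I");
        int result = (int) max.invokeExact(3, 7);
        System.out.println(result);
        String encoded = encodeDouble(0.1);
        System.out.println(encoded + " -> " + decodeDouble(encoded));
    }
}
```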

Those identifier strings are useless without a way to quote the
illegal characters; why not have strings with proper quoting? The
limitations on TypeArg strings are the same as those on class
names, and there are standard systems (such as the "Symbolic
Freedom" encoding) for representing the handful of illegal
characters using escape sequences. Bootstrap methods which need
general strings should use such a scheme. This is much easier than
somehow telling the JVM it must start allowing hitherto "dangerous
characters" in small parts of the descriptor grammar.

Constrained primitive types, seriously? An earlier version of
the grammar assumed that the only carrier type was Object,
allowing the "head" of the type operator expression to be the type
operator name (such as a template class). This had two major
downsides: First, it didn't capture the fact that a template might
well be the supertype of all its instances; this is certainly true
for containers like List<int>; throwing away that type bound means
more checkcast bytecodes to restore it in method code, which seems a
sorry waste. Second, the L descriptor letter might be augmented
(at some point) by additional classy descriptors (such as the Q
descriptor of the "minimal value type" prototype). Allowing carrier
types to be any pre-existing verifier types seems prudent. Given
that, the primitive types and arrays come in pretty much "for free",
although it would be reasonable to disallow constrained primitives
if that turns out to be hard to implement, and add them in later
when primitives are unified more fully with other types.

Why doesn't the ArrayType production mention FieldType any
more? The array type syntax is our sole legacy syntax that is
similar to a type operator. From a prior component type it creates
a complex new array object type. We don't want to pretend that
there is a way to customize that array object type by adding
arbitrary "tweaks" to its component type -- it is hard enough to
manage constrained scalar types without cutting them into the "guts"
of the JVM's built-in array object mechanism. We take the simpler
choice of allowing array instances to be constrained without asking
questions about what is inside them. When arrays are virtualized
(made instances of interfaces) then we can fully nest constraints
within array component types, but not until then.