Summary

Improve Nashorn performance by making assumptions about specific types
used in arithmetic and array indexing operations, by being prepared to
recover when those assumptions prove incorrect, and by improving the
HotSpot JVM's ability to optimize non-Java bytecode.

Goals

Ensure, by guessing types, that bytecode generated by Nashorn
contains as many primitive, unboxed operations as possible, and that
Nashorn's runtime libraries take advantage of this.

Ensure, by guessing types, that bytecode generated by Nashorn uses
as much Java integer arithmetic as possible.

The resulting performance must be stable: it may not degrade over
time, and we cannot measure peak performance only. Unstable
performance would block Nashorn as an industrial-strength product.

Ensure that bytecode generated by Nashorn can revert optimistic
assumptions of the above kind with a continuation-passing and
exception-throwing mechanism. This is necessary because an integer
instruction cannot simply be swapped for, e.g., a double instruction
in bytecode without causing verify errors and requiring the method to
be modified on the fly.

Ensure that the built-in library functions in Nashorn are optimal
Java and use primitive types well.

Ensure that the JVM does a much better job of optimizing "alien"
(non-Java) bytecode, for example from Nashorn or from JRuby, which
faces similar problems.

Non-goals

This is not an effort that can be targeted by "we must be X% of the
performance of a native JavaScript runtime like v8", as the areas of
optimization are many and varied. The initial phase of the
implementation will be considered successful if it adds no more than
acceptable overhead to warmup, speeds up common JavaScript benchmarks
by orders of magnitude, as the technology promises, and causes no
regressions on benchmarks targeting unexplored performance areas.

Motivation

The Nashorn JavaScript runtime needs to cooperate with the JVM to
produce higher-performance code. Native JavaScript runtimes typically
outperform Nashorn on many tasks, such as pure number crunching. We
need to do our best to bring our bytecode-based, on-top-of-the-JVM
solution as close to that performance as possible. This effort should
take Nashorn performance closer to (but probably not past, in its
first iteration) native JavaScript runtime performance.

Using only Object operations ("a"-prefixed bytecodes) and
invokedynamic produces bytecode that runs too slowly on HotSpot, even
though it conservatively implements JavaScript correctly. We are not
sure that a world containing only boxed objects and invokedynamic
call sites will ever be fast, but this needs some exploration in the
JIT as well.

We need to attack this from two directions: generating bytecode that
contains as many primitive operations as possible, and being able to
revert code when an assumption about primitive types fails. HotSpot
also needs to get better at optimizing non-Java bytecode, mainly
constructs around invokedynamic and in java.lang.invoke.

Description

We propose the following two-fold solution, for Nashorn and the JVM
respectively. Note that it is to some extent speculation what is
feasible to do in the bytecode space and what the JVM will be able to
optimize. Thus, there is still some research scope in this proposal,
both for Nashorn and for the JVM.

For Nashorn

We know that hardcoding explicit types at compile time significantly
boosts the performance of the resulting bytecode. An interesting
thesis project would be to write a TypeScript frontend for Nashorn to
demonstrate this concept. This is something that should be done
anyway in the interest of dynamic language implementations on the
JVM.

Creating optimistic methods

We need to generate optimistic methods based on callsite types known
at runtime (the code for this is already in place but disabled). That
is, if a method is called with integer parameters, we can generate a
specialized version of that method.
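
As a hedged illustration (the names below are hypothetical, not
Nashorn's actual generated code), a script function add(a, b) could be
compiled into a conservative generic version plus a specialization for
a callsite observed to pass ints:

    // Hedged sketch, not Nashorn's actual code generation; it ignores
    // string concatenation and other generic JS addition cases.
    final class AddFunction {
        // Conservative version: always correct, always boxes.
        static Object addGeneric(Object a, Object b) {
            return ((Number) a).doubleValue() + ((Number) b).doubleValue();
        }

        // Optimistic specialization linked when the callsite passes ints;
        // overflow invalidates the assumption (see below).
        static int addSpecialized(int a, int b) {
            return Math.addExact(a, b);
        }
    }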

Gradual deoptimization

We also need to generate intra-method code that is optimistic, i.e.,
use integer additions even when the compiler can't statically prove
that we are dealing with ints, assuming that in fact we are. If we
are wrong, we need to be able to generate a new version of the code
that no longer makes the wrong assumption (and is thus somewhat
deoptimized compared to the previous one). We also need to interrupt
the execution of the wrong code and continue execution from that
point in the new code. For that, we need a continuation-based
on-stack-code-replacement mechanism implemented purely with existing
JVM capabilities (bytecode and method handles).

With JavaScript, it is more often than not impossible to precisely
infer the data type of many expressions statically. Such expressions
can conservatively be treated as Objects, but such treatment severely
reduces the speed of execution. By starting out with the computationally
fastest types and doing just-in-time gradual deoptimization to slower
types, we end up with the fastest code shape that can operate on the
given input.
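
A minimal sketch of the idea, with hypothetical names: the same
addition compiled at progressively less optimistic types, from int
through double down to a fully generic Object version. Execution
starts in the most optimistic version and, whenever an assumption
fails, continues in a regenerated, wider one.

    // Hedged sketch of gradual deoptimization for "a + b".
    static int    addV0(int a, int b)       { return Math.addExact(a, b); } // optimistic
    static double addV1(double a, double b) { return a + b; }  // after int overflow
    static Object addV2(Object a, Object b) {                  // fully deoptimized
        // Generic JS addition (numbers, strings, valueOf) would go here;
        // sketched as double arithmetic for brevity.
        return ((Number) a).doubleValue() + ((Number) b).doubleValue();
    }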

Primitive field representation

Field representation in Nashorn (objects in the scope are represented
as fields of Java classes) needs to be changed to something other
than just "Object". We have experimented with a pair of long/Object
fields, using the long for all primitive types and the Object for
non-primitives. Microbenchmarks have shown that this gives us
significant performance improvements where primitive types are
statically provable, but in the general case we again have too little
information for a speedup. This will most likely work much better
mated with the optimistic approach, and early experiments have
confirmed this.
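
A hedged sketch of the dual-field idea (all names are hypothetical):
each property slot carries both a long and an Object field, with
doubles stored in the long via their raw bit pattern:

    // Hedged sketch of the long/Object pair representation.
    class PropertySlot {
        long   primitiveValue; // holds ints, longs, and doubles (as raw bits)
        Object objectValue;    // holds everything else

        void setInt(int v)       { primitiveValue = v; }
        void setDouble(double v) { primitiveValue = Double.doubleToRawLongBits(v); }
        void setObject(Object v) { objectValue = v; }

        int    getInt()    { return (int) primitiveValue; }
        double getDouble() { return Double.longBitsToDouble(primitiveValue); }
        Object getObject() { return objectValue; }
    }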

We should also, in the interest of memory overhead, investigate the
earlier proof of concept for sun.misc.TaggedArray, a "both reference
and primitive, either/or" shortcut for getting around the Object
constraints.

Warmup is slow

To reduce warmup time we might need lazy code generation, which is
already implemented but not enabled. For some projects, like "npm" in
the node.js-to-Java port, we have shown that lazy code generation
indeed gives significant startup performance improvements. The JVM
needs to do a lot of warmup work too, though; see below.

Classic IR optimizations

We can do some classic code optimizations in the bytecode, because we
know more than the VM does. For example, in the loop

    var sum = 0;
    for (var i = 0; i < 10; i++) {
        sum += y;
    }

y might be a property with side effects or an overridden valueOf, but
if we check for that before the loop, we can turn the code into a
plain primitive loop, as in the sketch below.
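
As a hedged sketch, assuming the pre-loop check passes (reading y has
no side effects and yields a primitive number), the generated bytecode
can be equivalent to plain primitive Java arithmetic:

    // Hedged sketch of the transformed loop: with the check hoisted, the
    // body reduces to primitive int operations (and could even be
    // strength-reduced to sum = 10 * yInt).
    static int summedLoop(int yInt) { // yInt: y, proven primitive by the check
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            sum += yInt;
        }
        return sum;
    }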

For the JVM

Intrinsics for exact math

The various java.lang.Math operations for exact arithmetic (which
throw ArithmeticException on overflow) must be intrinsified. Every
optimistic JavaScript addition will have to check for overflow using
this mechanism, and unless it compiles down to just an arithmetic
operation and a jump-on-overflow instruction, the overhead will be
too great.

This is needed to preserve JavaScript semantics.
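
For illustration, a hedged sketch of what an optimistic addition looks
like at the Java level (the actual generated code recovers via code
regeneration rather than a local catch):

    // Hedged sketch: Math.addExact throws ArithmeticException on int
    // overflow. Unless the JIT compiles this down to an add plus a
    // jump-on-overflow instruction, every optimistic addition pays
    // too much.
    static double optimisticAdd(int a, int b) {
        try {
            return Math.addExact(a, b);     // fast path: pure int add
        } catch (ArithmeticException e) {
            return (double) a + (double) b; // overflow: widen to double
        }
    }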

Invoke

The MethodHandle.invoke (not invokeExact) case (used to implement,
e.g., apply) needs to be faster. According to John Rose, no
optimization has taken place there so far, but it is very common that
boxing gets in the way.
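
For illustration, a hedged sketch of the distinction, with Math.max
standing in for a script function:

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    // invokeExact requires an exact symbolic signature match, while
    // invoke adapts (and may box) arguments and return values, which is
    // what apply-style calls lean on.
    final class InvokeDemo {
        public static void main(String[] args) throws Throwable {
            MethodHandle mh = MethodHandles.lookup().findStatic(
                    Math.class, "max",
                    MethodType.methodType(int.class, int.class, int.class));
            int exact = (int) mh.invokeExact(1, 2);           // no adaptation
            Object boxed = mh.invoke((Object) 1, (Object) 2); // adapted, boxes
            System.out.println(exact + " " + boxed);
        }
    }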

The JRockit implementation of invokedynamic could turn the test method
from the example:

The LambdaForm implementation needs to be improved. LambdaForms
contribute a lot of bytecode generation, which makes warmup an issue
(even more so than having a bootstrap method per invokedynamic call
site, which is unavoidable). This can probably be alleviated with
lambda form caching; JIT profile information must not be polluted by
it, though.

Lambda form interpretation overhead is another issue we frequently
run into. Once Nashorn reaches a steady state, no more MethodHandles
should be created and no more lambda forms should be interpreted.

When a cached lambda form is about to be reinterpreted, its compiled
version should be retrieved instead, so that a lambda form that has
been compiled once is never interpreted again.

Furthermore, we have to ensure that all combinators in
java.lang.invoke have a fast path that avoids boxing or apply-like
semantics. Currently, the catchException combinator is an example of
something we use frequently that needs this optimization. Other
examples probably exist, especially among the combinators that take
many parameters.
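
A hedged sketch of the catchException pattern we depend on (the
handler here is hypothetical; the real one would trigger deoptimizing
recompilation):

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    // This composition wraps every optimistic operation, so it must
    // compile down to a fast path with no boxing.
    final class OptimisticLink {
        static int deoptimize(ArithmeticException e, int a, int b) {
            throw new UnsupportedOperationException("recompile and continue");
        }

        static MethodHandle link() throws ReflectiveOperationException {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle fast = lookup.findStatic(Math.class, "addExact",
                    MethodType.methodType(int.class, int.class, int.class));
            MethodHandle handler = lookup.findStatic(OptimisticLink.class,
                    "deoptimize", MethodType.methodType(int.class,
                            ArithmeticException.class, int.class, int.class));
            return MethodHandles.catchException(
                    fast, ArithmeticException.class, handler);
        }
    }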

Better type analysis is needed

Early experiments show that we miss opportunities to inline virtual
dispatch if type analysis is too conservative. As JavaScript uses
plenty of type guards due to its dynamic nature, this must be
addressed.
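
For context, a hedged sketch of the kind of type guard the generated
code is full of, built with the standard guardWithTest combinator (the
test and paths themselves are hypothetical):

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    // A typical guard: branch to a fast primitive path when the argument
    // is an Integer, otherwise fall back to a generic path. The JIT must
    // see through guards like this to inline the fast path.
    final class TypeGuard {
        static boolean isInteger(Object o)  { return o instanceof Integer; }
        static Object fastPath(Object o)    { return (Integer) o + 1; }
        static Object genericPath(Object o) { return o.toString() + "1"; } // JS-like fallback

        static MethodHandle guarded() throws ReflectiveOperationException {
            MethodHandles.Lookup l = MethodHandles.lookup();
            MethodType unary = MethodType.methodType(Object.class, Object.class);
            return MethodHandles.guardWithTest(
                    l.findStatic(TypeGuard.class, "isInteger",
                            MethodType.methodType(boolean.class, Object.class)),
                    l.findStatic(TypeGuard.class, "fastPath", unary),
                    l.findStatic(TypeGuard.class, "genericPath", unary));
        }
    }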

What we need to do in the JVM has been explored less, and we will
require at least one full-time compiler team resource to help us
here.

Testing

For semantic correctness, the standard Nashorn test suites will
suffice.

For performance verification, existing benchmarks will suffice; they
should be augmented with a microbenchmark suite that tests
invalidation of optimistic assumptions.

Any torture tests that SQE runs for a long time should still be run
to ensure that no resource leakage is introduced.

Risks and Assumptions

The main risk is that the JIT still won't be able to generate efficient
code from our optimistic type information.

A second risk is that bytecode might prove too narrow a
representation for efficient implementations of dynamic languages. We
might need to be a lot more explicit with our generated code,
possibly inventing a mechanism to communicate more information to
HotSpot's JITs than the current bytecode format allows. An existing
example of a system that can talk more directly to the JIT is the
Truffle project at Oracle Labs, whose Java-based interpreters tell
the Graal compiler, through annotations, which assumptions to make.
Experiments have shown that this can indeed produce very good peak
performance.

In the long run, LambdaForms may have to be completely rewritten and
replaced with something else. The JRockit JVM basically just
generated IR for a call site and fed it back to the JIT compiler,
which let all of its optimizations, such as constant propagation and
unboxing removal, run transparently on the call site. This is not as
simple in C2, due to its various issues with graph splicing, platform
dependencies, and global dependencies.