Java HotSpot™ Virtual Machine Performance Enhancements

Tiered compilation, introduced in Java SE 7, brings client startup speeds to
the server VM. Normally, a server VM uses the interpreter to collect profiling
information about methods, which is then fed into the compiler. In the tiered
scheme, in addition to the
interpreter, the client compiler is used to generate compiled versions of
methods that collect profiling information about themselves. Since the compiled
code is substantially faster than the interpreter, the program executes with
greater performance during the profiling phase. In many cases, a startup that is
even faster than with the client VM can be achieved because the final code
produced by the server compiler may be already available during the early stages
of application initialization. The tiered scheme can also achieve better peak
performance than a regular server VM because the faster profiling phase allows
a longer period of profiling, which may yield better optimization.

Both 32-bit and 64-bit modes are supported, as well as compressed oops (see the
next section). Use the -XX:+TieredCompilation flag with the
java command to enable tiered compilation.
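One way to confirm whether the flag took effect is to query the running VM. The following is a minimal sketch, assuming a HotSpot-based JVM that exposes the com.sun.management.HotSpotDiagnosticMXBean platform bean; the class name VmFlagCheck and the helper flagValue are illustrative names, not part of the JDK.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class; only the MXBean calls are real JDK APIs.
final class VmFlagCheck {

    /** Returns the flag's current value, or "unknown" if this VM does not expose it. */
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return "unknown"; // not a HotSpot-style VM
        }
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            return "unknown"; // the flag does not exist in this VM build
        }
    }

    public static void main(String[] args) {
        System.out.println("TieredCompilation = " + flagValue("TieredCompilation"));
    }
}
```

Running it as java -XX:+TieredCompilation VmFlagCheck should report the value the VM actually applied. The same helper can be pointed at other flags named in this document, such as UseCompressedOops or UseNUMA.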

An "oop", or ordinary object pointer in Java HotSpot
parlance, is a managed pointer to an object. An oop is normally the
same size as a native machine pointer, which means 64 bits on an
LP64 system. On an ILP32 system, maximum heap size is somewhat less
than 4 gigabytes, which is insufficient for many applications. On
an LP64 system, the heap used by a given program might have to be
around 1.5 times larger than when it is run on an ILP32 system.
This requirement is due to the expanded size of managed pointers.
Memory is inexpensive, but bandwidth and cache are in short supply,
so significantly increasing the size of the heap merely to get
just past the 4 gigabyte limit is undesirable.

Managed pointers in the Java heap point to objects which are
aligned on 8-byte address boundaries. Compressed oops represent
managed pointers (in many but not all places in the JVM software)
as 32-bit object offsets from the 64-bit Java heap base address.
Because they're object offsets rather than byte offsets, they can
be used to address up to four billion objects (not bytes),
or a heap size of up to about 32 gigabytes. To use them, they must
be scaled by a factor of 8 and added to the Java heap base address
to find the object to which they refer. Object sizes using
compressed oops are comparable to those in ILP32 mode.

The term decode is used to express the operation by
which a 32-bit compressed oop is converted into a 64-bit native
address into the managed heap. The inverse operation is referred to
as encoding.
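The arithmetic behind decoding and encoding can be sketched as follows. This is an illustration of the scaling described above, not the JVM's actual implementation: the class name, the method names, and the heap base value are assumptions, and special cases such as null references are ignored.

```java
// Illustrative sketch of compressed-oop arithmetic; all names are hypothetical.
final class CompressedOopSketch {

    /** log2 of the 8-byte object alignment. */
    static final int SHIFT = 3;

    /** Decode: scale the 32-bit object offset by 8 and add the heap base. */
    static long decode(long heapBase, int narrowOop) {
        long offset = Integer.toUnsignedLong(narrowOop); // treat the 32 bits as unsigned
        return heapBase + (offset << SHIFT);
    }

    /** Encode: subtract the heap base and divide by the 8-byte alignment. */
    static int encode(long heapBase, long address) {
        return (int) ((address - heapBase) >>> SHIFT);
    }

    public static void main(String[] args) {
        long heapBase = 0x0000_0008_0000_0000L; // assumed base address, for illustration only
        long address = heapBase + 64;           // an 8-byte-aligned object address

        int narrow = encode(heapBase, address);
        System.out.println(narrow);                              // 8
        System.out.println(decode(heapBase, narrow) == address); // true

        // 2^32 offsets, each covering 8 bytes, span 32 gigabytes.
        System.out.println(((1L << 32) << SHIFT) >> 30);         // 32
    }
}
```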

Compressed oops are supported and enabled by default in Java SE 6u23 and
later.
In Java SE 7, use of compressed oops is the default for 64-bit JVM
processes when -Xmx isn't specified and for values of
-Xmx less than 32 gigabytes. For JDK 6 before the 6u23
release, use the
-XX:+UseCompressedOops flag with the java
command to enable the feature.

When using compressed oops in a 64-bit Java Virtual Machine
process, the JVM software asks the operating system to reserve
memory for the Java heap starting at virtual address zero. If the
operating system supports such a request and can reserve memory for
the Java heap at virtual address zero, then zero-based compressed
oops are used.

Use of zero-based compressed oops means that a 64-bit pointer
can be decoded from a 32-bit object offset without adding in the
Java heap base address. For heap sizes less than 4 gigabytes, the
JVM software can use a byte offset instead of an object offset and
thus also avoid scaling the offset by 8. Encoding a 64-bit address
into a 32-bit offset is correspondingly efficient.
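The two cheaper decoding modes can be sketched in the same style. Again, this only illustrates the arithmetic described above under the assumption that the heap starts at virtual address zero; the class and method names are hypothetical.

```java
// Illustrative sketch of zero-based decoding; all names are hypothetical.
final class ZeroBasedOopSketch {

    /** Zero-based: no base to add, only the scaling by 8 remains. */
    static long decodeZeroBased(int narrowOop) {
        return Integer.toUnsignedLong(narrowOop) << 3;
    }

    /** Heap below 4 gigabytes: the 32-bit value is already the byte address. */
    static long decodeUnscaled(int narrowOop) {
        return Integer.toUnsignedLong(narrowOop);
    }

    public static void main(String[] args) {
        System.out.println(decodeZeroBased(8)); // 64
        System.out.println(decodeUnscaled(64)); // 64
    }
}
```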

For Java heap sizes up to around 26 gigabytes, the Solaris,
Linux, and Windows operating systems will typically be able to
allocate the Java heap at virtual address zero.

Based on escape analysis, an object's escape state might be one
of the following:

GlobalEscape – An object escapes the method
and thread. For example, an object stored in a static field,
stored in a field of an escaped object, or returned as the result
of the current method.

ArgEscape – An object is passed as an argument
or referenced by an argument but does not globally escape during a
call. This state is determined by analyzing the bytecode of the called
method.

NoEscape – A scalar replaceable object,
meaning its allocation could be removed from generated code.
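The three states can be illustrated with a small sketch. The class, field, and method names below are hypothetical, and the comments describe the state each allocation would nominally have; the classification the compiler actually assigns depends on its inlining decisions (for example, once a callee is inlined, an ArgEscape object may be treated as NoEscape).

```java
// Illustrative sketch of escape states; all names are hypothetical.
final class EscapeStates {

    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static Point lastSeen;

    /** GlobalEscape: the object is stored in a static field and returned. */
    static Point remember(int x, int y) {
        Point p = new Point(x, y);
        lastSeen = p;
        return p;
    }

    /** The callee only reads its argument and does not store it anywhere. */
    static int sum(Point p) {
        return p.x + p.y;
    }

    /** ArgEscape: the object is passed to sum() but does not escape globally. */
    static int viaArgument(int x, int y) {
        Point p = new Point(x, y);
        return sum(p);
    }

    /** NoEscape: the object never leaves this method, so it is scalar replaceable. */
    static int distanceSquared(int x, int y) {
        Point p = new Point(x, y);
        return p.x * p.x + p.y * p.y;
    }

    public static void main(String[] args) {
        System.out.println(remember(1, 2) == lastSeen); // true
        System.out.println(viaArgument(1, 2));          // 3
        System.out.println(distanceSquared(3, 4));      // 25
    }
}
```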

After escape analysis, the server compiler eliminates scalar
replaceable object allocations and associated locks from generated
code. The server compiler also eliminates locks for all
non-globally escaping objects. It does not replace a heap
allocation with a stack allocation for non-globally escaping
objects.

Some scenarios for escape analysis are described next.

The server compiler might eliminate certain object allocations.
Consider the example where a method makes a defensive copy of an
object and returns the copy to the caller.
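A minimal sketch of such a method follows. The Person and Employee classes, their fields, and the describe caller are assumptions made for illustration; only the getPerson method name comes from the discussion here.

```java
// Illustrative sketch of a defensive copy; class and field names are assumed.
final class Employee {
    private final Person person;

    Employee(Person person) {
        this.person = person;
    }

    /** Returns a defensive copy so the caller cannot alter the internal object. */
    Person getPerson() {
        return new Person(person);
    }

    /** A caller that only reads the copy; it never modifies it or lets it escape. */
    static String describe(Employee emp) {
        Person p = emp.getPerson();
        return p.getName() + " (" + p.getAge() + ")";
    }

    public static void main(String[] args) {
        Employee emp = new Employee(new Person("Ada", 36));
        for (int i = 0; i < 3; i++) {
            System.out.println(describe(emp)); // Ada (36)
        }
    }
}

final class Person {
    private String name;
    private int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    /** Copy constructor used for the defensive copy. */
    Person(Person other) {
        this(other.name, other.age);
    }

    String getName() { return name; }
    int getAge() { return age; }
    void setAge(int age) { this.age = age; } // mutability is why a copy is made
}
```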

The method makes a copy to prevent modification of the original
object by the caller. If the compiler determines that the
getPerson method is being invoked in a loop, it will
inline that method. In addition, through escape analysis, if the
compiler determines that the original object is never modified, it
might optimize and eliminate the call to make a copy.

The server compiler might eliminate synchronization blocks
(lock elision) if it determines that an object is thread
local. For example, methods of classes such as
StringBuffer and Vector are synchronized
because they can be accessed by different threads. However, in most
scenarios, they are used in a thread local manner. In cases where
the usage is thread local, the compiler might optimize and remove
the synchronization blocks.
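As a sketch of thread-local usage, the method below creates a StringBuffer that never leaves the method, so no other thread can ever contend for its lock. The class and method names are hypothetical, and whether the locks are actually removed is up to the compiler.

```java
// Illustrative sketch of a lock-elision candidate; names are hypothetical.
final class LockElisionSketch {

    static String join(String[] parts) {
        StringBuffer sb = new StringBuffer(); // local: only this thread can reach it
        for (String part : parts) {
            sb.append(part);                  // append() is a synchronized method
        }
        return sb.toString();                 // only the resulting String escapes
    }

    public static void main(String[] args) {
        System.out.println(join(new String[] {"a", "b", "c"})); // abc
    }
}
```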

The Parallel Scavenger garbage collector has been extended to
take advantage of machines with NUMA (Non-Uniform Memory
Access) architecture. Most modern computers are based on NUMA
architecture, in which it takes a different amount of time to
access different parts of memory. Typically, every processor in the
system has a local memory that provides low access latency and high
bandwidth, and remote memory that is considerably slower to
access.

In the Java HotSpot Virtual Machine, the NUMA-aware allocator
has been implemented to take advantage of such systems and provide
automatic memory placement optimizations for Java applications. The
allocator controls the eden space of the young generation of the
heap, where most of the new objects are created. The allocator
divides the space into regions each of which is placed in the
memory of a specific node. The allocator relies on a hypothesis
that a thread that allocates the object will be the most likely to
use the object. To ensure the fastest access to the new object, the
allocator places it in the region local to the allocating thread.
The regions can be dynamically resized to reflect the allocation
rate of the application threads running on different nodes. This
makes it possible to improve the performance of even single-threaded
applications. In addition, the "from" and "to" survivor spaces of the
young generation, the old generation, and the permanent generation
have page interleaving turned on for them. This ensures that all
threads have equal access latencies to these spaces on average.

The NUMA-aware allocator is available on the Solaris™
operating system starting in Solaris 9 12/02 and on the Linux
operating system starting in Linux kernel 2.6.19 and glibc
2.6.1.

The NUMA-aware allocator can be turned on with the
-XX:+UseNUMA flag in conjunction with the selection of
the Parallel Scavenger garbage collector. The Parallel Scavenger
garbage collector is the default for a server-class machine. The
Parallel Scavenger garbage collector can also be turned on
explicitly by specifying the -XX:+UseParallelGC
option.

The -XX:+UseNUMA flag was added in Java SE 6u2.

Note: There was a known bug in the Linux kernel that could cause the JVM to crash when run with -XX:+UseNUMA. The bug was fixed in 2012, so it should not affect recent versions of the Linux kernel. To see whether your kernel has this bug, you can run the native reproducer.

NUMA Performance Metrics

When evaluated against the SPEC JBB 2005 benchmark on an 8-chip
Opteron machine, NUMA-aware systems showed the following
performance increases: