Oracle Blog

Alejandro Murillo's Weblog

There is a widespread perception that Java still lags behind C in
performance on numerically intensive applications. However, in the
last two major releases of the Hotspot implementation of the Java
Platform, Standard Edition, Java SE 5 and Java SE 6, a great number of
performance improvements have been incorporated into Hotspot that have
ultimately closed the gap, and sometimes given Hotspot the edge, in
benchmarks intended to measure numerical performance.

Below we compare the performance of Hotspot against gcc on numerically
intensive applications using the SciMark 2.0 benchmark. We use this
benchmark because it provides both a Java and a C version. SciMark\*
2.0 is a benchmark for scientific and numerical computing from the
National Institute of Standards and Technology (NIST) that is widely
used to measure CPU performance. It runs several computational kernels
and reports a composite score in approximate Mflops (millions of
floating point operations per second); higher scores are better. It
consists of five computational kernels:

FFT – performs a complex 1D fast Fourier transform

SOR – solves the Laplace equation in 2D by successive over-relaxation

MC – computes π by Monte Carlo integration

MV – performs sparse matrix-vector multiplication

LU – computes the LU factorization of a dense N x N matrix

These kernels represent the types of calculations that commonly
occur in numerically intensive scientific applications. Each kernel
except MC has small and large problem sizes. The small problems are
designed to test raw CPU performance and the effectiveness of the
cache hierarchy. The large problems stress the memory subsystem
because they do not fit in cache. The MC kernel only uses scalars so
there is no distinction between the small and large problems.
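The idea behind the MC kernel can be sketched in a few lines of Java (a simplified illustration with names of my own, not the actual SciMark source): sample random points in the unit square and use the fraction that lands inside the quarter circle to approximate π.

```java
import java.util.Random;

// Monte Carlo integration in the spirit of the MC kernel: the fraction
// of random points (x, y) in the unit square with x*x + y*y <= 1
// approximates the area of a quarter circle, pi/4. Only scalars are
// used, which is why MC has no small/large problem-size distinction.
public class MonteCarloSketch {
    public static double approximatePi(int samples, long seed) {
        Random rng = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        // 4 * (points inside quarter circle) / (total points) -> pi
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println(approximatePi(100000, 42L));
    }
}
```

This kind of tight scalar loop with a data-dependent branch is exactly the sort of code a JIT compiler can optimize well once it is hot.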

Figure 1 below shows a SpiderWeb chart with the scores for the Small
dataset, while Table 1 shows the same scores in tabular form. Figure 2
shows a SpiderWeb chart with the scores for the Large dataset, while
Table 2 shows the same scores in tabular form. On the SpiderWeb
charts, green represents the gcc score, red the out-of-the-box JDK
6u6p score (BASE, no tuning), and blue the corresponding PEAK (tuned)
JDK 6u6p score. Similarly, in each table the baseline is the gcc
score, Result 1 is the out-of-the-box JDK 6u6p score (BASE, no
tuning), and Result 2 is the PEAK (tuned) JDK 6u6p score.

As we can see in those charts and tables, for both the Large and Small
problems, the JDK beats the gcc overall (composite) score, even
without any tuning. With the Small dataset, gcc has better scores than
perfJDK on both the Sparse and FFT workloads, but with the Large
dataset the gap in those workloads is much smaller. For the Monte
Carlo workload, gcc is better than the performance JDK without any
tuning, but the JDK beats the gcc score when tuned.

One could argue that the baseline C scores used here could be greatly
improved with further compilation tuning or by using optimized and
sometimes parallelized libraries, as is done in the document Optimizing
SciMark\* 2.0 Using Intel® Software Products (incidentally, since I
used an Intel based machine to obtain the numbers shown below, I used
the exact same baseline used in that document). But that requires
changing the benchmark itself. Furthermore, most of the time those
optimized libraries are tightly coupled to specific hardware and/or a
specific system. By contrast, most of the performance optimizations on
the Java side are incorporated into the virtual machine itself and do
not require changes to the arbiter (the benchmark), which makes life
easier for the programmer. In addition to those VM improvements, the
Java platform also provides a concurrency library that is optimized
for every platform where Hotspot is implemented. One could use this
library to change the source code of the benchmark as well, but the
significant difference is that the resulting code can be executed on
any compatible Java virtual machine without any change, and even
without recompiling, which is the old and sometimes forgotten
advantage that Java provides over any other language.

So, the obvious question you now have is: what are the improvements
that have made Java catch up with C? First and foremost are the
changes related to ergonomics, and secondly the runtime and
just-in-time compiler optimizations that have been, and are still
being, added to Hotspot. Among the most important features added
recently are:

Biased locking: this is a class of optimization that improves
uncontended synchronization performance by eliminating atomic
operations associated with the Java language’s synchronization
primitives. These optimizations rely on the property that not only
are most monitors uncontended, they are locked by at most one thread
during their lifetime. An object is "biased" toward the
thread which first acquires its monitor via a monitor enter bytecode
or synchronized method invocation; subsequent monitor-related
operations can be performed by that thread without using atomic
operations resulting in much better performance, particularly on
multiprocessor machines. Locking attempts by threads other than the
one toward which the object is "biased" will cause a
relatively expensive operation whereby the bias is revoked. The
benefit of the elimination of atomic operations must exceed the
penalty of revocation for this optimization to be profitable.
Applications with substantial amounts of uncontended synchronization
may attain significant speedups while others with certain patterns
of locking may see slowdowns. Biased Locking is enabled by default in Java SE 6 and later.
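The uncontended pattern that biased locking targets can be sketched as follows (class and method names are my own, for illustration; `-XX:+UseBiasedLocking` is the Hotspot flag that controls the feature).

```java
// A sketch of the case biased locking optimizes: a monitor that is
// only ever locked by one thread. With -XX:+UseBiasedLocking (the
// default in Java SE 6), after the first acquisition the monitor is
// biased toward this thread, and subsequent entries avoid the atomic
// compare-and-swap operations a normal lock acquisition would need.
public class BiasedLockingSketch {
    private int count;

    // Each call acquires the object's monitor; since it is always the
    // same thread and there is no contention, the bias is never revoked.
    public synchronized void increment() {
        count++;
    }

    public synchronized int count() {
        return count;
    }

    public static int run(int n) {
        BiasedLockingSketch c = new BiasedLockingSketch();
        for (int i = 0; i < n; i++) {
            c.increment();   // same thread every time: the bias holds
        }
        return c.count();
    }

    public static void main(String[] args) {
        System.out.println(run(1000000));
    }
}
```

Had a second thread tried to lock this object, the VM would have had to revoke the bias, which is the expensive case described above.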

Lock coarsening: there are patterns of locking where a lock is
released and then reacquired within a piece of code with no observable
operations in between. The lock coarsening optimization implemented in
Hotspot eliminates the unlock and relock operations in those
situations. It essentially reduces the amount of synchronization work
by enlarging an existing synchronized region. Doing this around a loop
could cause a lock to be held for long periods of time, so the
technique is only applied to non-looping control flow. Lock coarsening
is enabled by default in Java SE 6 and later.
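A classic instance of this pattern is a sequence of back-to-back calls to a synchronized API such as StringBuffer (a sketch of the opportunity, not of Hotspot's internal transformation):

```java
// Each append() below is a synchronized method on the same
// StringBuffer, so naively the monitor is acquired and released four
// times in a row with no observable work in between. Lock coarsening
// lets the JIT merge these into a single lock/unlock pair around the
// whole straight-line sequence.
public class CoarseningSketch {
    public static String greeting() {
        StringBuffer sb = new StringBuffer();
        sb.append("Hello");   // lock, append, unlock
        sb.append(", ");      // lock, append, unlock  <- candidates for
        sb.append("world");   // lock, append, unlock  <- coarsening into
        sb.append("!");       // lock, append, unlock  <- one region
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(CoarseningSketch.greeting());
    }
}
```

Note that the code itself is unchanged and still correct on any JVM; the coarsening happens transparently in the compiled code.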

Adaptive spinning: Adaptive spinning is an optimization
technique where a two-phase spin-then-block strategy is used by
threads attempting a contended synchronized enter operation. This
technique enables threads to avoid undesirable effects that impact
performance such as context switching and repopulation of
Translation Lookaside Buffers (TLBs). It is "adaptive"
because the duration of the spin is determined by policy decisions
based on factors such as the rate of success and/or failure of
recent spin attempts on the same monitor and the state of the
current lock owner.
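Contention of the kind adaptive spinning helps with looks like the following sketch: two threads repeatedly entering a very short synchronized region, where spinning briefly before blocking can avoid a context switch (the names are mine; the spin policy itself lives inside the VM and is not visible from Java code).

```java
// Two threads contend for a very short critical section. When one
// thread finds the monitor held, Hotspot may spin briefly (hoping the
// owner exits soon) before parking the thread; the spin duration
// adapts based on how often recent spins on this monitor succeeded.
public class ContendedSketch {
    private final Object lock = new Object();
    private long total;

    public long run(final int perThread) {
        Runnable work = new Runnable() {
            public void run() {
                for (int i = 0; i < perThread; i++) {
                    synchronized (lock) {  // short hold time: spinning can pay off
                        total++;
                    }
                }
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        synchronized (lock) {
            return total;
        }
    }

    public static void main(String[] args) {
        System.out.println(new ContendedSketch().run(100000));
    }
}
```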

Array copy performance improvements: copies performed via
System.arraycopy are recognized by the Hotspot compiler and compiled
into specialized machine code rather than a generic element-by-element
loop.
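Using System.arraycopy is therefore the idiomatic way to copy array ranges in performance-sensitive Java code; a small sketch (names are my own):

```java
// System.arraycopy is the array-copy path that benefits from these
// improvements: Hotspot treats it as an intrinsic and emits optimized
// copy code instead of a per-element loop.
public class ArrayCopySketch {
    public static int[] copyRange(int[] src, int from, int len) {
        int[] dst = new int[len];
        // Copies len elements from src[from .. from+len) into dst[0 .. len).
        System.arraycopy(src, from, dst, 0, len);
        return dst;
    }

    public static void main(String[] args) {
        int[] src = {10, 20, 30, 40, 50};
        int[] mid = copyRange(src, 1, 3);
        System.out.println(java.util.Arrays.toString(mid));  // [20, 30, 40]
    }
}
```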

Background compilation in HotSpot™ (both the server and client VM): in
early versions of Hotspot, the compiler did not, by default, compile
Java methods in the background. As a consequence, hyper-threaded or
multi-processor systems couldn't take advantage of spare CPU cycles to
optimize Java code execution speed. Currently, both the server and
client Hotspot VMs perform compilation in the background, concurrently
with the application.

Garbage collection: there have been numerous garbage collection
optimizations in the last two major releases of Hotspot. One that
deserves to be highlighted is the introduction of the parallel
compaction collector. Parallel compaction enables the parallel
collector to perform major collections in parallel, resulting in lower
garbage collection overhead and better application performance,
particularly for applications with large heaps. It is best suited to
platforms with two or more processors or hardware threads. Prior to
Java SE 6, while the young generation was collected in parallel, major
collections were performed using a single thread; for applications
with frequent major collections, this adversely affected scalability.
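Which collectors the VM has actually selected can be inspected at runtime through the standard java.lang.management API (collector names vary by VM and configuration, so this sketch only reports what is there):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Lists the garbage collectors the running VM has selected (for
// example, the parallel collector chosen by ergonomics on a
// server-class machine), with their cumulative collection counts.
public class GcInfoSketch {
    public static List<GarbageCollectorMXBean> collectors() {
        return ManagementFactory.getGarbageCollectorMXBeans();
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : collectors()) {
            System.out.println(gc.getName() + ": "
                    + gc.getCollectionCount() + " collections");
        }
    }
}
```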

Ergonomics: in Java SE 5, platform-dependent default selections for
the garbage collector, heap size, and runtime compiler were introduced
to better match the needs of different types of applications while
requiring less command-line tuning. New tuning flags were also
introduced to let users specify a desired behavior, which in turn
enabled the garbage collector to dynamically tune the size of the heap
to meet that behavior. In Java SE 6, the default selections have been
further enhanced to improve application runtime performance and
garbage collector efficiency.
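The machine characteristics that ergonomics bases its defaults on, chiefly the number of hardware threads and the available memory, are visible from plain Java code (the class name is mine; the "server-class machine" heuristic described here is the one used by Java SE 5/6):

```java
// Ergonomics derives the default collector, heap size, and compiler
// from machine characteristics like these; on a "server-class"
// machine (roughly 2+ CPUs and 2+ GB of memory in Java SE 5/6) it
// selects the server VM and the parallel collector by default.
public class ErgonomicsSketch {
    public static int cpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static long maxHeapBytes() {
        // The heap ceiling that ergonomics (or an explicit -Xmx)
        // chose for this run.
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.println("hardware threads: " + cpus());
        System.out.println("max heap (MB): " + maxHeapBytes() / (1024 * 1024));
    }
}
```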

These improvements, along with other non-trivial code generation
optimizations in the Hotspot compiler, have improved the Scimark
numbers we now obtain with Java. In follow-up blogs, I will try to
pick specific optimizations and pinpoint how each one helped improve
the scores on the Scimark benchmark. In the meantime, for further
details on these and other improvements added to Hotspot in the latest
releases, I encourage you to check these documents and the pointers
they provide: J2SE 5.0 Performance White Paper and Java SE 6
Performance White Paper.

It is important to highlight that, as we can see in the scores below,
the difference between the base (out-of-the-box) Java scores and the
peak (tuned) Java scores is not that big, and that is mostly due to
the ergonomics features in Hotspot that automatically select the best
default values for some garbage collection tunable parameters. This is
an ongoing source of performance improvements for Hotspot that greatly
reduces the burden on the programmer.

The scores below compare Java with gcc based exclusively on the
Scimark2 standalone benchmark. It is also appropriate to mention that
perfJDK is showing excellent results with Java benchmarks like
SPECjvm2008, which includes a series of sub-benchmarks covering a
wider range of applications and would make for a better comparison:
cryptographic applications, XML processing applications, a database
core application, and many others.