Portability of the Java programming language [12]
is obtained by compiling Java programs into
architecture-neutral instructions (bytecode)
of the Java Virtual Machine (JVM) [15],
rather than into native machine code. Bytecode runs on any platform
that supports an implementation of the JVM.
Although interpreting bytecode is substantially faster than
interpreting most high-level languages,
a performance penalty must still be paid for this portability.
Clearly, for computationally intensive Java applications,
it would be desirable to reduce this performance penalty
without sacrificing the portability of the language.
One approach, for example, is
to optimize the Java bytecode [2,4],
either at compile-time (where a
machine-independent bytecode-to-bytecode optimization
is added as an additional phase of the Java compiler) or
at run-time (where optimizations that require
knowledge of the target machine are applied before execution).
Some implementations of the JVM further improve
performance by means of 'just-in-time compilation' (JITC),
where, at run-time, bytecode is compiled into native machine code.

In this note, we explore a way to speed up certain
operations in Java programs using ideas from the mathematical software community.
In that community, it is widely accepted
that adopting a set of basic routines for problems in linear
algebra helps to improve the clarity, portability,
modularity, maintainability, robustness, and even the efficiency of
mathematical software. The best-known example of such a set of routines
is the Basic Linear Algebra Subprograms (BLAS) [9, ch. 5].
The original set of vector-vector operations is
now commonly referred to as Level 1 BLAS [13,14].
The set has been extended to Level 2 BLAS [7,8] and Level 3
BLAS [5,6] to provide more opportunities to exploit vector
processing facilities for matrix-vector operations
and memory hierarchies or parallelism
for matrix-matrix operations, respectively.
Once an efficient implementation of BLAS
is available, new mathematical software
can easily be built on top of these primitives.
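
To make the granularity of these primitives concrete, consider daxpy, the
Level 1 BLAS routine that computes y := alpha*x + y for two n-vectors; in
plain Java it amounts to a single loop (the class name and the simplified
signature, without the customary stride arguments, are ours for illustration):

    // daxpy: y = alpha*x + y, the classic Level 1 BLAS vector-vector update.
    // Plain Java reference version; class name and signature are illustrative.
    public final class Blas1 {
        public static void daxpy(int n, double alpha, double[] x, double[] y) {
            if (alpha == 0.0) return;      // y is unchanged in this case
            for (int i = 0; i < n; i++) {
                y[i] += alpha * x[i];      // one multiply-add per element
            }
        }
    }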

Obviously, a similar approach can be taken for Java by
extending the Java API (Application Programming Interface) with
an appropriate set of mathematical primitives.
As a first step, a Java implementation can be provided
for all these primitives, which preserves the portability of every Java program
in which the mathematical primitives are used.
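
A sketch of how such an API extension might look, with a pure-Java body
serving as the portable reference implementation (the class name is
hypothetical; only the signatures would be fixed by the API):

    // Hypothetical API extension: Level 1 BLAS primitives with pure-Java
    // bodies, so every program that uses them runs unchanged on any JVM.
    public final class MathPrimitives {
        /** ddot: returns the inner product of two n-vectors (Level 1 BLAS). */
        public static double ddot(int n, double[] x, double[] y) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += x[i] * y[i];
            }
            return sum;
        }
        // ... further primitives (daxpy, dnrm2, dscal) along the same lines
    }

Since callers depend only on these signatures, the bodies can later be
replaced by faster ones without any change to the programs that use them.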

On a particular machine, however,
the performance of all Java software that uses these
primitives can then be improved simply by providing native implementations
of the mathematical primitives.
Although providing a broad range of highly optimized
mathematical primitives would offer the best potential to exploit all the
characteristics of a particular target machine, this approach would
also require the most programming effort to port the mathematical
primitives in the API to different machines. Therefore, in this
research note, we explore the potential of extending the
API with straightforward native implementations of Level 1 BLAS only. We will
see that this extension alone can already improve
performance substantially, and that combining these native Level 1 BLAS
with multi-threading in Java may even provide a simple and portable way
to outperform compiled serial C code on multiprocessors.
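
As a minimal sketch of what such a native binding, combined with Java
multi-threading, might look like (the library name, the offset-carrying
signature, and the blocked splitting over threads are all assumptions made
for illustration, not a fixed design):

    // Sketch: a native Level 1 BLAS binding plus a simple multi-threaded
    // wrapper. Library name, signature, and splitting are illustrative only.
    public final class NativeBlas1 {
        static {
            System.loadLibrary("blas1");   // e.g. loads libblas1.so
        }

        // Implemented in C; the offsets select a segment of each array.
        public static native void daxpy(int n, double alpha,
                                        double[] x, int xOff,
                                        double[] y, int yOff);

        // Each thread updates a disjoint segment of y, so no further
        // synchronization is needed beyond the final join().
        public static void daxpyParallel(final int n, final double alpha,
                                         final double[] x, final double[] y,
                                         int threads)
                throws InterruptedException {
            Thread[] pool = new Thread[threads];
            final int chunk = (n + threads - 1) / threads;  // ceiling of n/threads
            for (int t = 0; t < threads; t++) {
                final int lo = t * chunk;
                final int len = Math.min(chunk, n - lo);
                if (len <= 0) break;
                pool[t] = new Thread(new Runnable() {
                    public void run() { daxpy(len, alpha, x, lo, y, lo); }
                });
                pool[t].start();
            }
            for (int t = 0; t < threads; t++) {
                if (pool[t] != null) pool[t].join();  // wait for all segments
            }
        }
    }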

In section 2, we briefly discuss how native methods are integrated
into Java. In section 3, we present
the results of a series of experiments, followed
by conclusions in section 4.