Search

Compiling Java with GCJ

Java has not become as pervasive as the
original hype suggested, but it is a popular language, used a lot
for in-house and server-side development and other applications.
Java has less mind-share in the free software world, although many
projects are now using it. Examples of free projects using Java
include Jakarta from the Apache Foundation
(jakarta.apache.org),
various XML tools from W3C
(www.w3.org) and Freenet
(freenet.sourceforge.net).
See also the FSF's Java page
(www.gnu.org/software/java).

One reason relatively few projects use Java has been the real
or perceived lack of quality, free implementations of Java. Two
free Java implementations, however, have been around since the
early days of Java. One is Kaffe
(www.kaffe.org),
originally written by Tim Wilkinson and still developed by the
company he cofounded, Transvirtual. The other is GCJ (the GNU
Compiler for the Java language), which I started in 1996 at Cygnus
Solutions (and which this article discusses). GCJ has been fully
integrated and supported as a GCC language since GCC version
3.0.

Traditional Java Implementation Model

The traditional way to implement Java is a two-step process:
a translation phase and an execution phase. (In this respect Java
is like C.) A Java program is compiled by javac, which produces one
or more files with the extension .class. Each such file is the
binary representation of the information in a single class,
including the expressions and statements of the class' methods. All
of these have been translated into bytecode, which is basically the
instruction set for a virtual, stack-based computer. (Because some
chips also have a Java bytecode instruction set, it also can be a
real instruction set.)

The execution phase is handled by a Java Virtual Machine
(JVM) that reads in and executes the .class files. Sun's version is
called plain “java”. Think of the JVM as a simulator for a
machine whose instruction set is Java bytecodes.

Using an interpreter (simulator) adds quite a bit of
execution overhead. A common solution for high-performance JVMs is
to use dynamic translation or just-in-time (JIT) compilers. In that
case, the runtime system will notice a method has been called
enough times to make it worthwhile to generate machine code for
that method on the fly. Future calls to the method will execute the
machine code directly.

A problem with JITs is startup overhead. It takes time to
compile a method, especially if you want to do any optimization,
and this compilation is done each time the application is run. If
you decide to compile only the methods most often executed, then
you have the overhead of measuring those. Another problem is that a
good JIT is complex and takes up a fair bit of space (plus the
generated code needs space, which may be on top of the space used
by the original bytecode). Little of this space can be in shared
memory.

Traditional Java implementation techniques also do not
interoperate well with other languages. Applications are deployed
differently (a Java Archive .jar file, rather than an executable);
they require a big runtime system, and calling between Java and
C/C++ is slow and inconvenient.

The GCJ Solution: Ahead-of-Time
Compilation

The approach of the GCJ Project is radically traditional. We
view Java as simply another programming language and implement it
the way we implement other compiled languages. As Cygnus had been
long involved with GCC, which was already being used to compile a
number of different programming languages (C, C++, Pascal, Ada,
Modula2, Fortran, Chill), it made sense to think about compiling
Java to native code using GCC.

On the whole, compiling a Java program is actually much
simpler than compiling a C++ program, because Java has no templates
and no preprocessor. The type system, object model and
exception-handling model are also simpler. In order to compile a
Java program, the program basically is represented as an abstract
syntax tree, using the same data structure GCC uses for all of its
languages. For each Java construct, we use the same internal
representation as the equivalent C++ would use, and GCC takes care
of the rest.

GCJ can then make use of all the optimizations and tools
already built for the GNU tools. Examples of optimizations are
common sub-expression elimination, strength reduction, loop
optimization and register allocation. Additionally, GCJ can do more
sophisticated and time-consuming optimizations than a just-in-time
compiler can. Some people argue, however, that a JIT can do more
tailored and adaptive optimizations (for example, change the code
depending on actual execution). In fact, Sun's HotSpot technology
is based on this premise, and it certainly does an impressive job.
Truthfully, running a program compiled by GCJ is not always
noticeably faster than running it on a JIT-based Java
implementation; sometimes it even may be slower, but that usually
is because we have not had time to implement Java-specific
optimizations and tuning in GCJ, rather than any inherent advantage
of HotSpot technology. GCJ is often significantly faster than
alternative JVMs, and it is getting faster as people improve
it.

A big advantage of GCJ is startup speed and modest memory
usage. Originally, people claimed that bytecode was more
space-efficient than native instruction sets. This is true to some
extent, but remember that about half the space in a .class file is
taken up by symbolic (non-instruction) information. These symbols
are duplicated for each .class file, while ELF executables or
libraries can do much more sharing. But where bytecodes really lose
out to native code is in terms of memory inside a JVM with a JIT.
Starting up Sun's JVM and JIT compiling and applications' classes
take a huge amount of time and memory. For example, Sun's IDE Forte
for Java (available in the NetBeans open-source version) is huge.
Starting up NetBeans takes 74MB (as reported by the
top command) before you actually
start doing anything. The amount of main memory used by Java
applications complicates their deployment. An illustration is
JEmacs
(JEmacs.sourceforge.net),
a (not very active) project of mine to implement Emacs in Java
using Swing (and Kawa, discussed below, for Emacs Lisp support).
Starting up a simple editor window using Sun's JDK1.3.1 takes 26MB
(according to top). XEmacs, in contrast, takes 8MB.

Running the Kawa test suite using GCJ vs. JDK1.3.1, GCJ is
about twice as fast, causes about half the page faults (according
to the time command) and uses
about 25% less memory (according to top). The test suite is a
script that starts the Java environment multiple times and runs too
many different things for a JIT to help (which penalizes JDK). It
also loads Scheme code interactively, so GCJ has to run it using
its interpreter (which penalizes GCJ). This experiment is not a
real benchmark, but it does indicate that even in its current
status you can get improved performance using GCJ. (As always, if
you are concerned about performance, run your own benchmark based
on your expected job mix.)

GCJ has other advantages, such as debugging with GDB and
interfacing with C/C++ (mentioned below). Finally, GCJ is free
software, based on the industry-standard GCC, allowing it to be
freely modified, ported and distributed.

Some have complained that ahead-of-time compilation loses the
big write-once, run-anywhere portability advantage of bytecodes.
However, that argument ignores the distinction between distribution
and installation. We do not propose native executables as a
distribution format, expect perhaps as prebuilt packages (e.g.,
RPMs) for a particular architecture. You still can use Java
bytecodes as a distribution format, even though they don't have any
major advantages over Java source code. (Java source code tends to
have fewer portability problems than C or C++ source.) We suggest
that when you install a Java application, you should compile it to
native code if it isn't already so compiled.

Compiling a Java Program with GCJ

Using GCC to run a Java program is familiar to anyone who has
used it for C or C++ programs. To compile the Java program
MyJavaProg.java, type:

gcj -c -g -O MyJavaProg.java

To link it, use the command:

gcj --main=MyJavaProg -o MyJavaProg MyJavaProg.o

This is just like compiling a C++ program mycxxprog.cc:

g++ -c -g -O mycxxprog.cc

and then linking to create an executable mycxxprog:

g++ -o mycxxprog mycxxprog.o

The only new aspect is the option --main=MyJavaProg. This is needed
because it is common to write Java classes containing a main method
that can be used for testing or small utilities. Thus, if you link
a bunch of Java-compiled classes together, there may be many main
methods, and you need to tell the linker which one should be called
when the application starts.

You also have the option of compiling a set of Java classes
into a shared library (.so file). In fact, the GCJ runtime system
is compiled to a .so file. While the details of this belong in
another article, if you are curious you can look at the Makefiles
of Kawa (discussed below) to see how this works.

Features and Limitations of GCJ

GCJ is not only a compiler. It is intended to be a complete
Java environment with features similar to Sun's JDK. If you specify
the -C option to gcj it will compile to standard .class files.
Specifically, the goal is that gcj -C should be
a plugin replacement for Sun's javac command.

GCJ comes with a bytecode interpreter (contributed by Kresten
Krab Thorup) and has a fully functional ClassLoader. The standalone
gij program works as a plugin replacement for Sun's java
command.

GCJ works with libgcj, which is included in GCC 3.0. This
runtime library includes the core runtime support, Hans Boehm's
well-regarded conservative garbage collector, the bytecode
interpreter and a large library of classes. For legal and technical
reasons, GCJ cannot ship Sun's class library, so it has its own.
The GNU Classpath Project now uses the same license and FSF
copyright that libgcj and libstdc++ use, and classes are being
merged between the two projects. We use the GPL but with the
special exception that if you link libgcj with other files to
produce an executable, this does not by itself cause the executable
to be compiled by the GPL. Thus, even proprietary programs can be
linked with the standard C++ or Java; runtime libraries.

The libgcj library includes most of the standard Java classes
needed to run non-GUI applications, including all or most of the
classes in the java.lang, java.io, java.util, java.net,
java.security, java.sql and java.math packages. The major missing
components are classes for doing graphics using AWT or Swing. Most
of the higher-level AWT classes are implemented, but the
lower-level peer classes are not complete enough to be useful.
Volunteers are needed to help out.

Interoperating with C/C++

Although you can do a lot using Java, sometimes you want to
call libraries written in another language. This could be because
you need to access low-level code that cannot be written to Java,
an existing library provides functionality that you need and don't
want to rewrite, or you need to do low-level performance hacks for
speed. You can do all of these by declaring some Java methods to be
native and, instead of writing a method body, provide an
implementation in some other language. In 1997, Sun released the
Java Native Interface (JNI), which is a standard for writing native
methods in either C or C++. The main goal of JNI is portability in
the sense that native methods written for one Java implementation
should work with another Java implementation, without recompiling.
This was designed for a closed-source, distributed-binaries world
and is less valuable in a free-software context, especially because
you do have to recompile if you change chips or the OS type.

To ensure JNI's portability, everything is done indirectly
using a table of functions. This makes JNI very slow. Even worse,
writing all these functions and following all the rules is tedious
and error-prone. Although GCJ does support JNI, it also provides an
alternative. The Compiled Native Interface (CNI, which could also
stand for Cygnus Native Interface) is based on the idea that Java
is basically a subset of C++ and that GCC uses the same calling
convention for C++ and Java. So, what could be more natural than
being able to write Java methods using C++ and using standard C++
syntax for access to Java fields and calls to Java methods? Because
they use the same calling conventions and data layout, no
conversion or magic glue is needed between C++ and Java.

Examples of CNI and JNI will have to wait for a future
article. The GCJ manual
(gcc.gnu.org/onlinedocs/gcj)
covers CNI fairly well, and the libgcj sources include many
examples.

Kawa: Compiling Scheme to Native via
Java

Java bytecodes are a fairly direct encoding of Java programs
not really designed for anything else. However, they have been used
to encode programs written in other languages. See
grunge.cs.tu-berlin.de/~tolk/vmlanguages.html
for a list of other programming languages implemented on top of
Java. Most of these are interpreters, but a few actually compile to
bytecode. The former could use GCJ as is; the latter potentially
can use GCJ to compile to native code.

One such compiler is Kawa, which I have been developing since
1996. Kawa is both a toolkit for implementing languages using Java
and an implementation of the Scheme programming language. You can
build and run Kawa using GCJ without needing any non-free software.
The Kawa home page
(www.gnu.org/software/kawa)
has instructions for downloading and building Kawa with GCJ.

You can use Kawa in interactive mode. Here, we first define
the factorial function and then call it:

An interesting thing to note is the factorial function
actually gets compiled by Kawa to bytecode and is immediately
loaded as a new class. This process uses Java's ClassLoader
mechanism to define a new class at runtime for a byte array
containing the bytecodes for the class. The methods of the new
class are interpreted by GCJ's bytecode interpreter.

To compile the class file to native code, you can use gckawa, a
script that sets up appropriate environment variables
(LD_LIBRARY_PATH and CLASSPATH) and calls gcj:

$ gckawa -o factorial
--main=factorial -g -O factorial*.class

Using the wildcard in factorial*.class is not needed in this case,
but it is a good idea in case Kawa needs to generate multiple
.class files.

Then, you can execute the resulting factorial program, which
is a normal GNU/Linux ELF executable. It links with the shared
libraries libgcj.so (the GCJ runtime library) and libkawa.so (the
Kawa runtime library).

The same approach can be used for other languages. For
example, I am currently working on implementing XQuery, W3C's new
XML-query language, using Kawa.

Other applications that have been built with GCJ include
Apache modules, GNU-Paperclips and Jigsaw.

Conclusion

GCJ has seen a lot of activity recently and is a solid
platform for many tasks. We hope that you consider Java for your
free software project, using GCJ as your preferred Java
implementation and that some of you will help make GCJ even
better.