The source code files from that site are used, and Java is benchmarked with -server as instructed. -client was tried as well for comparison, but it apparently still used the server VM, because the client VM is not available on 64-bit platforms. At least the results were double-verified: there was no difference between the Java results with -client and -server.

Memory usage is not measured; only program execution time matters in this benchmark. Note that the JVM gets a small benefit from this, because it can use the otherwise idle cores of the CPU for its background tasks. The biggest difference from the original benchmark is the use of new versions of both Java and GCC, where the original used versions from 2003.

In this benchmark, unlike in the "Java is Faster than C++" benchmark or in the great programming language shootout at http://shootout.alioth.debian.org/, each algorithm is analyzed (what it tests) and the implementations are examined to find out whether they actually are equivalent or not.

Note: Java was updated to 1.6.0_04 in the middle of the testing period, causing some tests to have been run with that version instead.

gcc version 4.2.2 (Gentoo 4.2.2 p1.0)

The C++ versions were compiled with all the prepackaged optimization levels:

-O0 (disable all optimizations, the default)

-O1 (some basic optimizations)

-O2 (more optimizations, this is what people normally use)

-O3 (extra optimizations that might make the program run slower)

Finally, the fastest optimized version (-O1, -O2 or -O3) was compiled with -fprofile-generate, run to collect profiling data, and then recompiled with -fprofile-use, which makes optimization decisions (branch prediction, etc.) based on the collected data. This is the kind of optimization that a good JIT compiler could do.

Benchmarks

Numeric calculation with recursion

The recursion benchmarks ackermann and fibo test the compiler's ability to optimize the recursion into some kind of loop structure and, when that fails, the performance of recursive function calls and of stack memory management, as the stack needs to be enlarged when function calls are made. These tests are virtually identical across most programming languages, free of differences in platform libraries or in benchmark implementation quality; they rely only on the very core of the language. Branching also plays an important role in both algorithms, and compiler knowledge of which path is taken most often is helpful.

Note: In order to make fibo.cpp functionally equivalent to the Java version, it was modified to use unsigned ints (32-bit) rather than unsigned longs (64-bit). This had about a 5 % performance effect in favor of 32-bit integers. On 32-bit platforms the unsigned longs are also 32-bit, so this was not an issue in the original benchmark.

Library facilities

The hash and hash2 benchmarks test the performance of the library implementation of hash maps. In Java, the standard library container HashMap is used, while C++ uses GCC's non-standard hash_map container. Additionally, integer-to-string conversions are used heavily, which also tests the quality of the library implementation. Both C++ versions use the C string facilities instead of C++'s.

Java's results in the hash2 test vary a lot between runs, jumping between 5.9 and 8.8 seconds; an average is recorded in the results. In the first hash test Java manages to benefit heavily from concurrency, using almost twice as much CPU time as real time (meaning that it fully uses two of the four available cores).

The C++ hash2 had an implementation problem. Where the Java version iterates over hash1 and takes each value directly from the iterator, the C++ implementation instead took the key from the iterator and looked the value up in the hashmap, an obviously much slower operation. Fixing the error reduced the C++ runtime from 16.8 to 8.4 seconds, still slightly slower than Java, due to a worse hashmap implementation or a bad hashing algorithm; a profiler shows that nearly all execution time is spent doing lookups in operator[].

User-made data structures

The heapsort and matrix benchmarks do not rely on the library, but rather implement their own data structures and perform operations on those. Floating-point numbers are used as the data in both tests.

Note: Java -server execution time was either 5.9 or 6.3 seconds on each heapsort run; an average was recorded.

Polymorphism

The purpose of the methcall test is to see how long it takes to make a call through a pointer whose actual object type must be resolved at runtime. Unfortunately Java apparently detects that the test does nothing and optimizes most of it away: the first half, which is supposed to make 2e9 function calls, takes no time at all, while in C++ both halves take approximately the same time, as expected. A more complex benchmark is needed for meaningful results.

The objinst test is based on the same classes, with more emphasis on object creation and destruction. It, too, is too simple for any meaningful results, as the construct-destruct pairs and other work can easily be optimized away by the compiler.

Since the original tests were clearly flawed, I decided to write a more realistic version of the methcall test, storing an array of base class pointers and calling methods on those polymorphically. This test, methcall2, seems to give correct results, showing that Java is slightly faster than C++ in Java's area of expertise.

Profiling shows that C++ resolves the virtual method only on the first t->activate() call, making that the bottleneck of the program. The second, identical line uses practically no CPU time at all, because it benefits from the work already done.

Nested loops

This test consists of six nested loops, each of which runs N rounds. In the innermost loop, a variable is incremented; effectively the program calculates N⁶ in a very slow way. I was actually somewhat surprised to find that neither Java nor GCC optimizes these loops away entirely in favor of calculating the result directly. In GCC's case this seems to be precisely because of benchmarks: applying -funroll-loops (an optimization flag not included even by -O3) makes the program execute quickly, as the loops are largely optimized away. The profiler optimization also implies this flag, affecting that part of the benchmark.

Update 2010-01-27: The LLVM C++ compiler Clang actually does figure it out, optimizing the entire test into a few mul instructions and no loops at all, making it execute instantly no matter how large the value of N.

Random number generation

More pure numeric calculation with just a few variables and no data structures.

Sieve of Eratosthenes

This is a simple algorithm for finding prime numbers efficiently. An array of bool values is used, with each element representing whether its index value can be prime or not.

As a side note, the C++ version would be three times as fast if it used vector<bool> instead of vector<char>, as it packs 8 bools into each byte instead of using a byte per bool. In Java the bit-packing approach is, for some reason, actually much slower; Java's boolean[] is internally implemented with bytes (AFAIK). In any case, the benchmark rules dictate that bit packing must not be used, so we'll stick to that.

String operations and I/O

There are three tests that deal with strings; two of them also involve I/O, and one of those also converts strings into numbers. These are heavily dependent on the library implementation, and a profiler shows that practically all execution time is spent inside the library and almost none inside the program itself.

The strcat test does nothing more than repeated string concatenation (adding a string to the end of another), a large number of times. The trick here is that the string usually keeps its data in one large contiguous buffer, which needs to be enlarged whenever so much data is added that the string no longer fits. Enlarging usually requires copying all the data to a new location, and this gets slow as the string grows. The C++ version is hand-optimized to double the buffer size whenever the string no longer fits, but this optimization is unnecessary, as the standard library does the same internally; removing the "optimization" actually reduced the runtime by 2 %. The Java version uses no tricks, just the standard library StringBuffer.

The Java strcat initially used StringBuffer, but this has been replaced with a non-thread-safe StringBuilder, which is otherwise functionally equivalent. This change reduced the runtime from 1.5 to 0.9 seconds.

A one-gigabyte data file was used for the sumcol and wc tests. Since the hard drive would otherwise have been the limiting factor, the file was preloaded into cache before the actual benchmark, so that all disk I/O was avoided; the program still had to make system calls to fetch the data from the kernel.

The C++ sumcol also uses a hand-made optimization, setting the buffer size used internally by iostreams. Unfortunately that is not the right optimization here; instead, one should disable the syncing of C++ I/O with C I/O, a feature that is completely unnecessary when the C stdio facilities are not used. Calling ios_base::sync_with_stdio(false) reduced the runtime from 42 to 18 seconds. Removing the buffer-size "optimization" as well reduced the runtime further, to the 11 seconds seen in the results.

In the other I/O test, wc, the same bad optimization was used in the original code; it has now been replaced with the sync disabling.

OpenGL 3D graphics

One area was almost completely missed by all the previous tests: system call performance. To test this, we'll run a simple OpenGL benchmark on both Java and C++. Why OpenGL? First of all, it looks cool. Secondly, JOGL is one of the most popular OpenGL libraries for Java, and it makes objective comparisons easy, since it is just a simple binding to the C OpenGL API. The C++ version uses the C API directly, so the Java and C++ programs can be identical apart from small syntactic differences. As 3D mathematics is involved, this also puts some burden on math performance, but the program precalculates the sine function so that this shouldn't be a limiting factor.

The gltest program renders a scene with 220 triangles in it, 10000 times (the test parameter), as fast as it can. This gives a fairly good estimate of what a 3D game might do, even though it obviously is far simpler than an actual game.

Conclusions

The benchmarks used can be divided into four separate groups: tests of the raw performance of the language, tests of raw performance with dynamic memory allocation, tests of the standard library implementation, and tests of system call performance. Most tests fall into the first three categories, with only sumcol slightly testing syscall performance (any good library implementation reads the data in very large chunks, so this is not going to be the bottleneck in that test) and the OpenGL test being the only one to fully stress external function call performance.

One somewhat surprising result is that while C++ allows one to control buffer sizes manually, the benchmarks clearly show that this should be avoided: the standard library actually does it better by itself, as demonstrated by the performance improvements gained in the string tests by removing the hand-made optimizations. The test programs were originally implemented for GCC 3.3 (in 2003 or earlier), and there have been tremendous improvements in the C++ library implementation since then, possibly explaining why hand-made optimization was needed then but not anymore.

As for the comparison of the languages, Java is already surprisingly fast, actually managing to beat C++ in the methcall2 test. Generally C++ is two or three times as fast as Java in the benchmarks that seem to be working correctly, with a few exceptions in each direction. While this is an excellent result compared to languages such as Python (which has 8-20 times the runtimes C++ has¹), even such a small difference can be important when the CPU is the limiting factor. Assuming just a 2x difference, one would need a 6 GHz Core2 CPU and equally faster RAM to get the speed that C++ gives on a 3 GHz CPU in computationally intensive tasks. Such CPUs do not exist, and even the high-end models boasting 3+ GHz cost a fortune.

It is also important to note that this benchmark is far from conclusive. Memory usage was not measured at all, and there was more than enough spare RAM available. The benchmarks also do not properly test high numbers of allocations and frees, mostly because writing a meaningful benchmark for that would require a large amount of code. However, it is generally known that Java's lazy GC benefits in speed from having a lot of extra RAM around, because then it does not have to do collection runs often. It is also known that memory-management-intensive algorithms consume more memory in Java, since memory no longer in use is freed only when the GC runs. On the other hand, GC can avoid memory fragmentation issues, reducing memory usage in some cases. More benchmarks are needed in this area.

¹) As measured by The Computer Language Benchmarks Game, linked to at the beginning of the article.

The file data.txt was generated with datagen.cpp. It contains 100 million lines, one value per line, with a total size of 1046512071 bytes.

Comments?

I would like to be informed of any mistakes in the tests. Also, if you feel that some area lacks testing, feel free to submit a new test in both Java and C++ and I'll include it. Examine the source code of the current tests for an idea of how a test should be written.

Contact me on IRC, by the nick Tronic, with your questions or feedback.