Also, if you use the std::partial_sum function instead, you get ~1100 with gcc 4.7, but it gets slower with clang …

Looks like the JVM outperformed most C++ compilers at loop unrolling / vectorisation in this particular microbenchmark.

Faster execution speeds (because it's fully compiled)

The biggest C++ performance advantage over Java right now is its support for value types and move semantics. This is something Java is missing, and it can result in a huge performance penalty if you're not aware of it. E.g. coding something like Point2D as a class in Java is a performance no-go.
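To make the value-type point concrete, here is a minimal C++ sketch. The struct layout and the helper names (sum_x, make_grid) are my own illustration, not code from the benchmark: a vector of Point2D is one contiguous buffer of doubles, with no per-object header and no pointer chasing.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical value type: two doubles stored inline.
struct Point2D {
    double x;
    double y;
};

// The object is just its two fields; there is no per-object header.
static_assert(sizeof(Point2D) == 2 * sizeof(double), "no per-object header");
static_assert(std::is_trivially_copyable<Point2D>::value, "plain data");

// Sums the x coordinates by walking a single contiguous buffer.
double sum_x(const std::vector<Point2D>& points) {
    double total = 0.0;
    for (const Point2D& p : points) {
        total += p.x;
    }
    return total;
}

// Builds n points (i, 0). Returning by value moves (or elides) the vector,
// so the buffer is handed over rather than copied element by element.
std::vector<Point2D> make_grid(std::size_t n) {
    std::vector<Point2D> points;
    points.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        points.push_back(Point2D{static_cast<double>(i), 0.0});
    }
    return points;
}
```

The equivalent Java array of Point2D objects would hold references to separately allocated objects, which is where the penalty mentioned above comes from.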

There are plans to change that, though, and IBM already has a prototype of PackedObjects, which lets you control memory layout manually (and even interface with native code without copying). I hope it gets into mainstream JVMs soon.

Another misconception about Java is that slow startup is caused mainly by JIT compilation and by slow execution until the code gets JITted. In most cases it is not. It is caused by lazy class loading and an object model that requires classes to be loaded fully. E.g. a basic Swing application needs to load thousands (!) of classes from rt.jar, which is ~50MB; this is a huge I/O hit.

@Lumpkin the benchmark files are on the blog rapidcoder linked to, although you might want to fix the out-of-bounds access in the iterator-based test (it segfaulted when I tried running it) and add std::partial_sum as the actual default C++ approach.
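For reference, a minimal sketch of what I mean by the two approaches. The function names are hypothetical and this is not the benchmark's actual code, just an in-place prefix sum written both ways, with the loop bounds kept inside the vector.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hand-written in-place prefix sum. The index starts at 1 and stops at
// size(), so it never reads or writes past the end (and an empty vector
// is simply a no-op).
void prefix_sum_loop(std::vector<int>& data) {
    for (std::size_t i = 1; i < data.size(); ++i) {
        data[i] += data[i - 1];
    }
}

// Same result via the standard library. std::partial_sum allows the
// output iterator to be the start of the input range.
void prefix_sum_std(std::vector<int>& data) {
    std::partial_sum(data.begin(), data.end(), data.begin());
}
```

Both leave element i holding the sum of elements 0 through i; which one the compiler unrolls or vectorises better is exactly what varies between gcc, clang and friends.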

Daniel Lemire wrote:

Of course, from a sample of 3 compilers on a single problem, I only provide an anecdote

As a matter of anecdote, on Oracle's own M5000 (32x2.5 GHz), I got:

Java (1.6.0_22) best out of 50 runs was 352.11 (ran as java -server -d64)

C++ (Sun Studio 12, compiled as CC -m64 -xO3) gave (best of 5)

straight sum (C-like) 381.679
basic sum (C++-like) 413.223
iterator-based sum (C++-like) 413.223 <- had to fix this one
std::partial_sum 409.836 <- added this one
...the "smart" sums were all much slower than Java

(it was hard to find a dev box which had java on it)

As for Intel, I looked at the assembly produced by the JIT (1.7.0_45) on my old Core i7 920; the main loop went this way: