On linux/x86 {{{dons+chris}}} beats {{{chris}}} by a factor of two, and on a G4 {{{dons+chris}}} is a few percent slower. Not too surprising, since the floating point hardware differs. The performance for this benchmark's machines (AMD Sempron, Intel P4) cannot be tuned from a G4.

So Don: submit this when you are happy with the tweaking, since I am blind to x86 performance here. Can you try passing architecture switches to gcc via -optc, such as -mtune? -- ChrisKuklewicz
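A build invocation along these lines would pass the tuning switch through to gcc; the exact -mtune value is an assumption and must match the test machine, and {{{nbody.hs}}} is a placeholder filename:

```shell
# Hypothetical build command: -optc forwards each flag to gcc,
# so -optc-mtune=pentium4 becomes gcc's -mtune=pentium4.
ghc -O2 -optc-O -optc-mtune=pentium4 -o nbody nbody.hs
```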

2 Proposed entry

A fast and relatively nice entry that runs in about 1.5x the time of the fastest C++ entry.

6 An STUArray Version (Joel Koerwer)

This one is slightly faster than chris+dons on my machine. The "proposed entry" has been submitted, so I pushed it and its associated benchmarks down.

I get a speedup from the STUArray version over the dons+chris version on a Powerbook G4: dons+chris takes 4.38 s user time, Joel's takes 3.74 s (17% faster). This makes sense to me, since it does not need to {{{shiftL}}} and {{{.|.}}} the indices. I was cleaning it up without making it slower and I accidentally made it faster (I think by hoisting dt and size). -- ChrisKuklewicz
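The hoisting mentioned above can be sketched as follows; the function and its workload are illustrative, not the shootout code. Loop-invariant values like dt are bound once in an outer scope so the inner loop merely closes over them:

```haskell
-- Hypothetical sketch of hoisting: dt is free in the worker 'go',
-- so it is bound once per call to 'advance' instead of being
-- recomputed or re-passed on every iteration.
advance :: Double -> Int -> [Double] -> [Double]
advance dt n vs = go n vs
  where
    go 0 xs = xs
    go k xs = go (k - 1) (map (+ dt) xs)

main :: IO ()
main = print (advance 0.5 4 [0, 1])
```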

7 Proposed Entry (dons+chris)

A carefully tuned version of Chris' breakthrough entry. Runs faster than
unoptimised gcc C. Be sure not to use -optc-O3 on linux, or badness may
ensue. The OPTIONS pragma passes -O to gcc last, so it will override any
other -O arguments passed to gcc on the command line.
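A pragma along these lines would pin the gcc flags from inside the source file for the GHC of that era; the exact flags in the submitted entry may differ, and this is only a fragment sketch:

```haskell
{-# OPTIONS -O2 -optc-O #-}
-- Pragma flags are appended after the command line, so the trailing
-- -optc-O means gcc sees -O last, overriding an earlier -optc-O3.
module Main where

main :: IO ()
main = putStrLn "built with pragma options"
```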

8 Improved entry (chris)

I made it go faster! By a factor of 2.5, which is not closing much of the gap. But any speed improvement is a breakthrough at this point. Hopefully someone else can see why it is faster and speed it up further. I am just tweaking things at random now.

All the n-body data (7 doubles: mass x y z vx vy vz) is put into a single IOUArray Int Double. Each body is offset by 8 indices, so there is a padding of 1 double to give a stride of 8. This let me calculate the offset of each double with {{{shiftL _ 3}}} and {{{.|.}}}. The array itself is a global variable.
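The index arithmetic for that flat layout can be sketched as below; the helper name {{{fieldIx}}} is illustrative, not from the submitted entry:

```haskell
import Data.Bits (shiftL, (.|.))

-- Each body occupies 8 consecutive Doubles in one flat array
-- (mass, x, y, z, vx, vy, vz, padding). Field f (0..6) of body b
-- lives at b*8 + f, computed with a shift and a bitwise-or
-- instead of a multiply and an add.
fieldIx :: Int -> Int -> Int
fieldIx b f = (b `shiftL` 3) .|. f

main :: IO ()
main = print [fieldIx b f | b <- [0, 1, 2], f <- [0, 6]]
```

The padding is what makes the stride a power of two, so {{{.|.}}} is safe: the low three bits of {{{b `shiftL` 3}}} are always zero.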

9 Old attempt by ChrisKuklewicz

I heavily transformed Don's code. This now runs almost exactly as fast as Einar's code on my machine. So I am neither happy nor sad.

Note that this creates the {{{data MutVec = V ...}}} and {{{data Body = B ...}}} values all at the beginning, puts them into an array, and never constructs any new ones. From then on it just looks up {{{IORef Double}}}s and works on those.
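The allocate-once pattern described above can be sketched as follows; the field layout and names here are illustrative, not the actual entry:

```haskell
import Data.Array (listArray, (!))
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')

-- Build every mutable cell at startup, store the wrappers in an
-- immutable array, and afterwards mutate only the IORef contents.
data MutVec = V (IORef Double) (IORef Double) (IORef Double)

newVec :: Double -> Double -> Double -> IO MutVec
newVec x y z = V <$> newIORef x <*> newIORef y <*> newIORef z

main :: IO ()
main = do
  vs <- mapM (\i -> newVec (fromIntegral i) 0 0) [0 .. 4 :: Int]
  let arr = listArray (0, 4 :: Int) vs  -- no new constructors after this
  let V xr _ _ = arr ! 2
  modifyIORef' xr (+ 1)   -- update body 2's x in place
  readIORef xr >>= print
```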

I cannot find any Haskell idioms that make this run 10x faster, but other languages, such as OCaml, manage to do it.

12.1 Smaller code

Anyone interested in seeing how small we can make the code? Here's my original attempt, which unfortunately suffers from a massive space leak. I'd be interested in seeing whether a succinct version without leaks is possible.
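A classic source of such a leak is a lazy accumulator; a minimal illustration (not the shootout code itself, just the general failure mode):

```haskell
import Data.List (foldl')

-- foldl builds a chain of unevaluated (+) thunks proportional to
-- the list length; foldl' forces the accumulator at each step,
-- keeping the loop in constant space.
leaky, strict :: Double
leaky  = foldl  (+) 0 [1 .. 1e6]
strict = foldl' (+) 0 [1 .. 1e6]

main :: IO ()
main = print strict
```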