Cooling is via 8 fans at the front in two rows, and with the power management software (RHEL5 extensions) there is very little difference in loudness between this box and a powerful desktop, and I mean that... it simply is a quiet beast - quieter than a single-core P4 Xeon 2.6 equipped Dell 1750 1U server or a dual-CPU Sun V210 1U.

I know that on the whole 2U servers are quieter than 1U ones due to the constraints in cooling a 1U requiring more powerful fans, but still, it is quiet... and power requirements aren't at all bad.

I'm not sure how an Athlon 64 would outperform overall (computationally and under heavy I/O load) a Quad HT box (that's 8 x 3GHz HT cores). I'm using most of those cores daily and the box, as I said, is handling a nice load very well.

For under $200 (for an entire unit) I assume you mean a Dual Core Athlon 64? Maybe you are right... never tried... 64-bit is tempting though, simply because compiling HipHop is a real issue on a 32-bit box and HipHop would be very useful for me at the moment.

Also I'm in the UK, where nothing is cheap. On eBay in the US, stuff is often 50% of the UK price... or less....

CentOS or RHEL is the only real OS for this box because the RHEL extensions for power management manage the fans superbly (and make the machine pretty quiet). I suspect any other Linux distribution would not have these Red Hat-developed extensions and the machine would just be hugely noisy and power hungry.

The RHEL extensions run without any issue under CentOS of course. I think it was the OS of choice for these servers.

I have a limited amount of academic exposure to computer architecture and microprocessor design theory, but I'll just mention a few things that came to mind while reading this thread:

First, congrats! The ProLiant line is a very reliable and well-built family of otherwise boring x86 servers. I've never had hardware trouble with any of my ProLiants, although the iLO software can be a PITA at times.

Second, on a single-threaded program you may or may not see linear improvements in execution time by changing the CPU. There are a number of factors to account for here. A big one is cache size - bigger cache = fewer clock cycles spent fetching from main memory. Floating-point performance also varies between processor families.

I think the best thing you could do at this point is to parallelize your operations across the entire CPU. Perhaps divide and conquer with your dataset to utilize the unused threads?
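
To make the divide and conquer idea a bit more concrete, here is a rough sketch in Python 3 (an assumption on my part, I don't know what language your program is in). It cuts the dataset into one slice per logical CPU, hands each slice to a separate worker process, and adds up the partial results. The names split_into_chunks, process_chunk and run_in_parallel are made up for illustration, and the sum of squares is only a placeholder for whatever your real per-item work is.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal slices."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    # Placeholder workload (hypothetical): replace with the real computation.
    return sum(x * x for x in chunk)


def run_in_parallel(items, n_workers=None, executor_cls=ProcessPoolExecutor):
    """Process each chunk in its own worker, then combine the partial results."""
    n_workers = n_workers or os.cpu_count() or 1
    chunks = split_into_chunks(items, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


if __name__ == "__main__":
    data = list(range(1_000_000))
    print(run_in_parallel(data))
```

This only pays off if the chunks are independent of each other and the work per chunk is large compared with the cost of shipping the data to the workers. On the quad HT box os.cpu_count() would presumably report the 8 logical CPUs, but hyperthreaded siblings share execution units, so I would not expect a clean 8x speedup.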