Yenya's World

Thu, 14 Feb 2008

AMD versus Intel

For a long time we have been using AMD Opterons and Athlons 64 for our
web and application servers. Everybody says that Intel has made big
progress in the last year or so, so I wondered whether an Intel architecture
would be better than the AMD one for our upcoming distributed computing project.
Benchmarks usually measure things like raw memory throughput or
video encoding/decoding (which can be done with the SIMD instructions), etc.
But how do the architectures perform on heavily branched code?

A part of our project is sorting large quantities of data. We have chunks
consisting of (4-byte key, 4-byte value) pairs, which have to be sorted
by key. Since the data is generated relatively slowly,
I have decided to pre-sort it using a bucket sort into a set of 256
(for now) bucket files, then sort each bucket file separately,
and finally concatenate the results. I have tried to measure how long the
"sort all buckets" step takes on a single core:
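The pre-sorting step can be sketched as follows. The post does not show code, so the record layout and the choice of the top key byte as the bucket index are my assumptions (for 256 buckets, the top 8 bits of a uniformly distributed key are the natural choice):

```c
#include <stdint.h>

/* Record layout as described in the text: a 4-byte key
 * followed by a 4-byte value. */
struct rec {
    uint32_t key;
    uint32_t val;
};

/* Hypothetical helper: the top 8 bits of the key select one of the
 * 256 bucket files; each record gets appended to its bucket file as
 * the data is generated. */
static unsigned bucket_of(uint32_t key)
{
    return key >> 24;
}
```

With 512 buckets, the same scheme would simply use the top 9 bits (`key >> 23`).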

Machine                                    cc -Os   cc -O6   cc -O6 w/o memcpy()
Athlon64 FX-51 (2.2 GHz, 1 MB L2)          16.9 s   12.5 s    9.6 s
Athlon64 X2 5600+ (2.8 GHz, 2x 1 MB L2)    12.5 s    8.3 s    7.1 s
Pentium D (3.0 GHz, 2 MB L2)                9.6 s    9.0 s    8.8 s

The first two variants used memcpy() inside the quicksort routine to swap
two entries (in order to be prepared for a possible future variable record
size); the last one used a single 64-bit move instead. There are two
interesting observations here:

- AMD is apparently slightly faster here.

- The -Os (optimize for size) GCC option is useless. I
  wonder why it is the default optimization option for kernel compiles
  nowadays.
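The two swap variants can be sketched like this; the function names are mine, but the idea follows the text (a generic memcpy() swap versus one 64-bit move for the fixed 8-byte record):

```c
#include <stdint.h>
#include <string.h>

/* Generic swap, as in the first two variants: memcpy() of entry_size
 * bytes, ready for a possible future variable record size. */
static void swap_generic(void *a, void *b, size_t entry_size)
{
    unsigned char tmp[16];           /* assumes entry_size <= 16 */
    memcpy(tmp, a, entry_size);
    memcpy(a, b, entry_size);
    memcpy(b, tmp, entry_size);
}

/* Fixed-size swap, as in the last variant: one 64-bit load/store pair
 * moves the whole 8-byte (key, value) record at once. */
static void swap64(uint64_t *a, uint64_t *b)
{
    uint64_t tmp = *a;
    *a = *b;
    *b = tmp;
}
```

The compiler cannot always see through a variable-length memcpy(), which is presumably where the 12.5 s vs. 9.6 s difference comes from.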

Another interesting part was the cache-size effect: the four biggest buckets had
1088232, 1046624, 872792, and 776224 bytes, respectively. Sorting those four
buckets took 2.26, 2.22, 0.63, and 0.21
seconds (on the above FX-51 machine). This means that somewhere around 800 KB
of data, the algorithm can no longer fit the data into the L2 cache,
resulting in a big slowdown: these four buckets together took more time
to sort than the remaining 252 buckets, even though they contain only
2.23 % of the total data size. I guess I will just use more (512?) buckets
in the production version.
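One rough way to pick the bucket count: keep the average bucket below the ~800 KB L2 budget observed above, then leave headroom for skew. This helper is hypothetical (the post gives no sizing formula), and the 170 MB total is my back-of-the-envelope estimate from the 2.23 % figure:

```c
#include <stdint.h>

/* Hypothetical helper: smallest power-of-two bucket count that keeps
 * the *average* bucket (total_bytes / nbuckets) under the L2 budget.
 * Real buckets are skewed (the biggest one above was roughly 1.6x the
 * average), so doubling the result leaves headroom. */
static unsigned pick_buckets(uint64_t total_bytes, uint64_t l2_budget)
{
    unsigned n = 1;
    while (n < (1u << 16) && total_bytes / n > l2_budget)
        n <<= 1;
    return n;
}
```

For ~170 MB of data and an 800 KB budget this yields 256 buckets on average; doubling for skew gives the 512 suggested above.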

So, what is your experience with compiler optimization settings, and with
speed of various CPUs and architectures?