Browsing through a manufacturer’s website can offer a startling view of the product line up. Such was the case when I sprawled through Gigabyte’s range, only to find that they offer server line products, including dual processor motherboards. These are typically sold in a B2B environment (to system builders and integrators) rather than to the public, but after a couple of emails they were happy to send over their GA-7PESH1 model and a couple of Xeon CPUs for testing. Coming from a background where we used dual processor systems for some serious CPU Workstation throughput, it was interesting to see how the Sandy Bridge-E Xeons compared to consumer grade hardware for getting the job done.

In my recent academic career as a computational chemist, we developed our own code to solve issues of diffusion and migration. This started with implicit grid solvers – everyone in the research group (coming from chemistry backgrounds rather than computer science backgrounds), as part of their training, wrote their own grid and solver classes in C++ which would be the backbone of the results obtained in their doctorate degree. Due to the idiosyncratic nature of coders and learning how to code, some of the students naturally wrote classes were easily multi-threaded at a high level, whereas some used a large amount of localized cache which made multithreading impractical. Nevertheless, single threaded performance was a major part in being able to obtain the results of the simulations which could last from seconds to weeks. As part of my role in the group, I introduced the chemists to OpenMP which sped up some of their simulations, but as a result caused the shift in writing this code towards the multithreaded. I orchestrated the purchasing of dual processor (DP) Nehalem workstations from Dell (the preferred source of IT equipment for the academic institution (despite my openness to build in-house custom hardware) in order to speed up the newly multithreaded code (with ECC memory for safety), and then embarked on my own research which looked at off-the-shelf FEM solvers then explicit calculations to parallelize the code at a low level, which took me to GPUs, which resulted in nine first author research papers overall in those three years.

In a lot of the simulations written during that period by the multiple researchers, one element was consistent – trying to use as much processor power as possible. When one of us needed more horsepower for a larger number of simulations, we used each other’s machines to get the job done quicker. Thus when it came to purchasing those DP machines, I explored the SR-2 route and the possibility of self-building the machines, but this was quickly shot down by the IT department who preferred pre-built machines with a warranty. In the end we purchased three dual E5520 systems, to give each machine 8 cores / 16 threads of processing power, as well as some ECC memory (thankfully the nature of the simulations required no more than a few megabytes each), to fit into the budget. When I left that position, these machines were still going strong, with one colleague using all three to correlate the theoretical predictions with experimental results.

Since leaving that position and working for AnandTech, I still partake in exploring other avenues where my research could go into, albeit in my spare time without funding. Thankfully moving to a single OCed Sandy Bridge-E processor let me keep the high level CPU code comparable to during the research group, even if I don’t have the ECC memory. The GPU code is also faster, moving from a GTX480 during research to 580/680s now. One of the benchmarks in my motherboard reviews is derived from one of my research papers – regular readers of our motherboard reviews will recognize the 3DPM benchmark from those reviews and in the review today, just to see how far computation has gone. Being a chemist rather than a computer scientist, the code for this benchmark could be comparable to similar non-CompSci trained individuals – from a complexity point of view it is very basic, slightly optimized to perform faster calculations (FMA) but not the best it could be in terms of full blown SSE/SSE2/AVX extensions et al.

With the vast number of possible uses for high performance systems, it would be impossible for me to cover them all. Johan de Gelas, our server reviewer, lives and breathes this type of technology, and hence his benchmark suite deals more with virtualization, VMs and database accessing. As my perspective is usually from performance and utility, the review of this motherboard will be based around my history and perspective. As I mentioned previously, this product is primarily B2B (business to business) rather than B2C (business to consumer), however from a home build standpoint, it offers an alternative to the two main Sandy Bridge-E based Xeon home-build workstation products in the market – the ASUS Z9PE-D8 WS and the EVGA SR-X. Hopefully we will get these other products in as comparison points for you.

I didn't read the article in full but what I did read was top notch. I found your simulations and mathematics very interesting. I took a Physics class which was focused in writing code to run mathematical simulations . Using the given java lib. I wrote my own code to calculate PI. When I returned from the gym the program had calculated 3.1 . I then re-wrote the program from scratch and ditched the built in libs. and reran. I had 20 decimals in 30 sec it was an epic improvement.

Hi Ian. Thanks for the nice article. I have one suggestion regarding the explicit finite difference code:

You could try to reorder the loops such that the memory access is more cache friendly. Right now 'pos' is incremented by NX (or even NX*NY in 3D) which will generate a lot cache misses for large grids. If you switch the x and the y loop (in the 2D case) this can be avoided.Reply

Either way I order the loops, each point has to read one up, one down, one left and one right. My current code tries to keep three as consecutive reads and jump once, keeping the old jump in local memory. If I adjusted the loops, I could keep the one dimension in local memory, but I'd have to jump outside twice (both likely cache misses) to get the other data. I couldn't cache those two values as I never use them again in the loop iteration.

When I did this code on the GPU, one method was to load an XY block into memory and iterate in the Z-dimension, meaning that each thread per loop iteration only loaded one element, with a few of them loading another for the halo, but all cache aligned.

Yes, but when you access the array 'cA' at 'pos' the CPU will fetch the entire cache line (64 byte in case of your machine, i.e. 16 floats) of the corresponding memory address into the CPU cache. That means that subsequent accesses to say 'pos + 1', 'pos + 2' and so on will be served by the cache. Accessing an array in such a sequential manner is therefore fast.

However, when you access an array in a nasty way, e.g. 'NX + x' -> '2*NX + x', -> '3*NX + x', then each such access implies a trip to main memory if NX happens to be large.

That you need to move up / down and sideways in memory does not matter. When you write down the accesses of the code with the reordered loops you will notice that they just access three "lines" in memory in a cache friendly way.

Not reusing the old values of the last iteration should not affect performance in a measurable way. Even if the compiler fails to see this optimization, the accesses will be served by the L1 cache.

Btw, did you allocate the array having NUMA in mind, i.e. did you initialize your memory in an OpenMP loop with the same access pattern as used in the algorithms? I am a bit surprised by the bad performance of your dual Xeon system.Reply

Memory was allocated via the new command as it is 1D. When using a 2D array the program was much slower. I was unaware you could allocate memory in an OpenMP way, which thinking about it could make the 2D array quicker. I also tried writing the code using the PPL and lambdas, but that was also slower than a simple OpenMP loop.

I'm coming at these algorithms from the point of view of a non-CompSci interested in hardware, and the others in the research group were chemists content to write single threaded code on multi-core machines. Transferring the OpenMP variations of that code from a 1P to a 2P, as the results show, give variable results depending on the algorithm.

There are always ways to improve the efficiency of the code (and many ways to make it unreadable), but for a large part moving to the 2P system all depends on how your code performs. Please understand that my examples being within my limits of knowledge and representative of the research I did :) I know that SSE2/SSE4/AVX would probably help, but I have never looked into those. More often than not, these environments are all about research throughput, so rather than spend a few week to improve efficiency by 10% (or less), they'd rather spend that money getting a faster system which theoretically increases the same code throughput 100%.

I'll have a look at switching the loops if I write an article similar to this in the future :)