Now, now, no reason to get so aggressive. Surely, all he did was abbreviate his suggestion a little, and he really meant to say:

Break into the SuperMicro offices, toss some Single-Slot Titans into the only four existing PCIe slots in this rather compact workstation, hook up a monitor, keyboard and mouse, find out what games will even install and run on Windows Server, and then explain to us in a lot of detail why games work even worse on multiple physical CPUs than some of those scientific benchmarks do. Start with something like "Well, turns out the Supermicro X9 doesn't actually support SLI."

It's a $20,000 system: $1k for the board, 4 × $3k for the CPUs, another $3k for the RAM, and then some for the SAS drives. The number of gamers considering buying this is pretty much exactly zero. Testing what a multi-CPU system does to gaming is fun, but that's why ASUS and EVGA have given us a number of dual-socket boards with SLI support. Installing and benchmarking a game on a quad-CPU, single-GPU board with a C602 chipset would be a horrible waste of Ian in my book.

Of course. Why not test an actual GNU/Linux distribution against MS Windows Server 2012? More than 90%, even 95%, of servers run Linux, and this model should be tested with the OSes it will actually run. CentOS, a Red Hat clone, would be the best choice, or the not-cheap RHEL and SUSE EE, of which you can surely get a free copy for testing. That would better serve the IT guys deciding whether or not to pick this machine.

Benchmarking with MS Windows Server 2012 after these results suggests nobody is going to buy it as a Windows server, and perhaps it can be a good choice for a Linux server with some virtualized machines. Even Xen or QEMU running Windows + AutoCAD with a VGA-passthrough configuration would be a great test.

So, the only manufacturer of quad-CPU boards has absolutely no clue about multi-CPU systems, and is consistently running the wrong OS on its own test installations? Windows Server is a profitable product for MS, it has a real market share (see, for example, http://w3techs.com/technologies/overview/operating... ) and it does not exactly cripple multi-CPU performance for software that supports it. Just look at the PovRay benchmark in this very article, or read some well-written material provided by MS on the topic: http://goo.gl/A6f23 .

Informing people about Linux as an option, and clarifying its capabilities and benefits, is something I can get behind, but being an obnoxious Linux fanboy won't convince anybody of anything.

Well, Windoze is a lame OS, no matter what the fanboys say. OTOH, SQL Server is a very good database, up to its limits. But that means using Windoze. If one goes the *nix way, then Oracle/DB2/Postgres are the databases to choose among.

Multiprocessor systems are more appropriate as heavyweight (for some definition of heavy) database machines. They can exploit CPU/RAM/SSD more than any other application.

He tried out multiple versions of Windows Server. He seems to be using a very serious version of Remote Desktop... either that, or he does in fact have access to someone at SuperMicro who can format things for him. But, the most important thing to say about this whole affair: Even SuperMicro, the builders of this desktop, could not get all 64 cores working on Windows.

He almost certainly used SuperMicro's IPMI 2.0 KVM-over-IP solution, which provides a remote desktop (including local optical storage and USB proxies) at the HW level. Doing BIOS setup and an OS install from remote DVD media (i.e. the media is physically at Ian's location instead of at the server) is a piece of cake.

You don't need Windows Server to run that type of hardware; just put Win 8 on there, since as far as I'm aware it can support up to 640 logical CPUs. So... yeah, I wish Linux gaming were benchmarkable, but in terms of graphics performance it really still isn't; only CPU benchmarks would be meaningful in a Linux distro, definitely not GPU testing. And anyway, Phoronix does a wonderful job on the Linux side of benchmark land. See http://blogs.msdn.com/b/b8/archive/2011/10/27/usin... for some extreme CPU Task Manager action.

"The main issue moving to 4P was having an operating system that actually detected all the threads possible and then communicated that to software using the Windows APIs. In both Windows Server 2008 R2 Standard and 2012 Standard, the system would detect all 64 threads in task manager, but only report 32 threads to software."

This is actually an old Windows API issue. While a piece of software can scale to a near infinite number of threads per process (only limited by address space), the Windows scheduler will only run a maximum of 32 per process concurrently. Even MS SQL Server only supports a maximum of 32 threads per DB on a single system (MS SQL Server will spawn another process per DB to scale higher as necessary).

Though with 32 real cores, it may pay off to simply disable HyperThreading for better scaling.

To clarify, this seems to be an issue with 32-bit software running on 64-bit hardware and making Windows API calls under WOW64. A good example is noted in the remarks of the API documentation for the GetLogicalProcessorInformationEx function, which describes the issues with passing a 64-bit KAFFINITY structure to a 32-bit client and the side effects that can cause: http://msdn.microsoft.com/en-us/library/windows/de...

As the author notes in the article, creating software that benefits from NUMA rather than being hamstrung by it requires another layer of knowledge on top of single-CPU software development. I'm sure Microsoft has figured out NUMA with MS SQL Server, considering the prevalence of multi-CPU solutions for that product essentially since multi-CPU hardware for Windows became common. Note TPC result id 112032702 for NEC running Windows Server R2 Enterprise and SQL Server 2012 Enterprise on 8 processors, 80 cores, and 160 threads.

The keyword is "processor groups", and APIs that deal with group affinity.

So I would suggest the reviewer get acquainted with these if he intends to keep using Windows Server 2012 (or later) as the test vehicle.

In the Xeon E5 case based on Sandy Bridge-EP this should still not be a problem, as long as the reviewer uses 64-bit processes, because the Xeon E5-4600 does not support more than 64 logical CPUs.

However, Ivy Bridge-EP can already have more than 64 logical processors with the E5-4600 v2 line. More than 64 logical CPUs were already possible with the Xeon E7 platform based on the Boxboro generation, and it will get even more scalable with Ivy Bridge-EX.

That processors are grouped is more important than the number of processors. For NUMA architectures, all logical processors belonging to a physical CPU (with or without hyperthreading) will belong to the same group. The SetProcessAffinityMask() Windows function can be used to prevent the scheduler from assigning the process's threads a logical processor that doesn't belong to the same group. This way all threads in that process always run on cores that have the same fast memory access.

The process affinity mask essentially allows using a subset of the NUMA hardware as if it were an SMP system. If you have, say, 4 processor groups, then you have to manually divide the data up into 4 sections handled by 4 processes, so that each group of threads operates on its own section with SMP memory access. MPI is then used to tie the 4 processes together, just like using a cluster. The difference is that the message passing on the NUMA system is faster than on a cluster of separate physical servers, but basically it maps the NUMA system as a cluster of independent SMP systems.

Data-dependent algorithms will greatly benefit from using the process affinity mask. Since a system like this doesn't make sense for data-independent algorithms (where GPU hardware would be faster and cheaper), only software designed for NUMA systems should be compared.

This is more a statement of why unified memory and cache are important to performance computing. I'd like to note that the 6-core 3930K beat the 4770K on all but the few single-threaded benchmarks, and the 8-core Xeon (I think it's 8-core?) beat the 3930K.

There are plenty of applications that scale up with core count. They just don't scale up with multiple sockets and slow interconnects between those cores.

B-b-but 25.6 GB/s QPI is supposed to be good <not in 2014>. We don't need no low-power stinkin' NoC (network-on-chip) at 1 or 2 Tbit/s like those ARM interconnects today.

"Intel describes the data throughput (in GB/s) by counting only the 64-bit data payload in each 80-bit "flit". However, Intel then doubles the result because the unidirectional send and receive link pair can be simultaneously active. Thus, Intel describes a 20-lane QPI link pair (send and receive) with a 3.2 GHz clock as having a data rate of 25.6 GB/s. A clock rate of 2.4 GHz yields a data rate of 19.2 GB/s. More generally, by this definition a two-link 20-lane QPI transfers eight bytes per clock cycle, four in each direction."

I notice that Anandtech tries to appeal to both industrial and enthusiast circles, and I appreciate how hard that is. This article seems targeted at the industrial/HPC segment, however, and I think a standard benchmark for HPC should include some codes frequently used in HPC. Everyone knows that Gaussian will leave a horse's head on your pillow if you try to benchmark their software, but you could easily run a massive DFT with the parallelized GAMESS, and I've seen previous articles benchmark Monte Carlo codes. Both chemists and Wall Street types would be interested in that. CFD programs are very popular with engineers; OpenFOAM is a popular option.

Yes, Monte Carlo codes are theoretically infinitely parallelizable, though as mentioned previously, specific implementations often don't meet that ideal. Large CFD jobs are also well parallelizable for some portions of the calculation. 3DS's Abaqus can auto-partition large models and process each partition in a separate thread, for instance.

As a scientist myself, I would be very interested to see how this scales with standard matrix operations using MATLAB's Parallel Computing Toolbox. I have noticed that on our grids (Xen domains running TORQUE for queuing) the only real speed advantage has been in Tesla GPU compute. The CPUs can take care of the overhead of a grid, but essentially it comes down to programming, as stated in the article. Custom code is the only way, and in most scientific applications the only availability. Thus, testing high-level languages with inherent multiprocessing support (parfor etc.) would be super interesting to see. Thank you for the great read.

Yes, it is problem specific. Data-independent operations, such as matrix multiplications, linear transforms, etc., are far better suited to GPU compute. But consider a problem solved by an iterative calculation, where the result of one iteration depends on the result(s) of previous iterations. GPU compute is inherently unsuited for such data-dependent problems. Many real-world problems have a data dependence, but the dependence is on a separate calculation that is itself data-independent.

Even within the HPC world, the hardware choice depends on the problems the system is to be used for. But to aim for as general purpose a system as can be had, it makes sense to use something like this 4-processor board along with several Tesla cards in its PCIe slots.

So the bottom line is that an HPC benchmark suite should contain a mix of problems. A simple matrix multiply will always be unfairly weighted towards GPU compute and will not be representative of a system's general HPC capabilities.

As a scientist, if you can't program an optimal assembly routine from your individual C routines (per the x264 coding style of assembly with a C fallback and check), then at least look at the far more optimal http://julialang.org/ in place of MATLAB to increase your algorithms' data throughput, and upstream all your speed/quality improvements to that open code base.

I like the F@H shoutout. There are certainly more than "a few" users running 4p setups. I'd put it at about 50 to 200 users based on the statistics for how many users are producing at the 500k ppd level most commonly attained with these setups. Many of those users have multiple 4p boards as well.

It is not a trivial process to take full advantage of these systems with F@H. The user community has worked to select the ideal Linux kernels and schedulers for this software, as well as created custom utilities to improve hardware efficiency. TheKraken is a software wrapper that locks threads from the F@H client to specific CPUs to prevent excessive memory transfer between CPUs. Another user-created tool, OCNG, is a custom BIOS and software utility that allows Supermicro 4P G34 boards to overclock CPUs and adjust memory timings.

To get the full performance of 4P systems, F@H users needed to go much further than loading up Windows and running the provided executable designed for single-CPU systems.

From looking at Ian's solver results, I think that there are actually (at least) two problems, and perhaps a third:

1. As he acknowledges, he isn't doing any sort of NUMA optimization

2. His overall rates and the obvious sensitivity to DDR speed/latency indicate that he probably didn't do much cache-blocking (at its most basic level this involves permuting the order in which elements are processed in order to optimize data access patterns for cache). If that's the case then he would end up going out to DDR more than he should, which would make his code highly sensitive to the latency impacts of NUMA.

3. He may also have some cache-line sharing problems, particularly in the 3D case (i.e. cache lines that are accessed concurrently by multiple threads, such that the coherency protocol "bounces" them around the system). That's the most likely explanation for the absolutely tragic performance of the 4P system in that benchmark.

The importance of cache blocking/optimization can't be overstated. I've seen several cases where proper cache blocking eliminated the need for NUMA optimization. An extra QPI hop adds ~50% to latency in E5-based systems, and that can be tolerated with negligible performance loss if the application prefetches far enough ahead and has good cache behavior.

Ian, would you be willing to share the source code for one or more of your solvers?

One additional question for Ian: You state that your finite-difference solvers use "2^n nodes in each direction".

Does this mean that the data offsets along the major axis (or axes, in the 3D case) are also integer multiples of a large power of two? For example, if you have a grid implemented as a 2D array named 'foo', what is the offset in bytes from foo[0][0] to foo[1][0]?

If those offsets have a large power-of-2 factor, then that would lead to pathological cache behavior and would explain the results you're getting. Experienced developers know to pad such arrays along the minor axis or axes. For example, if I wanted to use a 1024 x 1024 array, I might allocate it as 1024 x 1056 instead. The purpose of the extra 32 elements in each row is to ensure that consecutive rows don't contend for the same cache lines.

Really interesting article. I've written several implementations of finite-difference solvers, and used both COTS and open-source solvers on parallel machines. I'm really surprised by the results, but I agree with the conclusion: if you don't write your software appropriately, you won't take advantage of the hardware at your disposal.

I know it's outside the scope of this article, but I would be really interested to see a comparison of this 4-processor machine with a 'cluster' of dual-processor machines. Ideally it would be awesome to see two Sci Linux clusters, one with four 2x Xeon systems and one with two 4x Xeon systems. Put the same amount of RAM per core in both rigs and run computational benchmarks. When it comes to purchasing hardware for a large cluster, finding the price/performance break point is important. I would imagine that having more threads per machine would be faster than having to push your data over InfiniBand (or something like it).

Ian, do you have any idea how your code or these tests might run on an SGI UV 20 or 2000, given that they have a hardware MPI system and other features to aid with NUMA systems? The UV 20 is a quad-socket blade with up to 1.5TB RAM, while the 2000 scales to 256 sockets and up to 64TB RAM. They both use the Xeon E5-4600 series.

Maybe you could ask SGI if you could do a remote access test on one of their UVs?

Hi all, lately we did some tests on our photogrammetric software and stumbled on performance issues with Win2012 Datacenter edition on our dual-Xeon setups ( http://www.agisoft.ru/forum/index.php?topic=1330.0 ). In short, something is not OK with software performance on W2012; if we run the same test on Win7 or XP, the same hardware is much faster, up to 70% (Hyper-Threading related). Could a more in-depth benchmark/problem-solving article be put together? It could help a lot of people with real-world app usage.

In the real world, any heavily threaded computing workload wouldn't be running on Windows. There is a reason large supercomputers use Linux: it's much better at handling large NUMA systems.

In the future, can you please try Linux? I think Linux can do a far better job than Windows; the MS Windows Server environment is not that suitable for such benchmarks. And usually for a 4P+ server you use the Enterprise edition, not Standard. Sorry, this is just advice, not a demand, but please try Linux.

a big raspberry has to go to Ian Cutress himself for coming out of the idiot closet with such reckless abandon and to Anand for hiring his worthless ass to write for his site.

I want to know what kind of crack one needs to be smoking to get their hands on a 4-way hyperthreaded octo-core setup and decide to run a benchmark that uses 720p MPEG-2 as its source and 4Mb/s 720p x264 with the "very fast" preset (I think that's the one they use) as the target?!

If you really wanted to stress all the logical cores, a custom benchmark should have been used with all of the x264 settings maxed out and a much higher resolution, maybe even a 4K source, so that we could see some separation between multi-CPU and single-CPU setups.

Seriously, who in their right mind would build this kind of system and then encode 4Mb/s 720p AVC?

Get your head out of your Klavin and learn how to review a damn system.