Glorious

Hey guys, doesn't this also mean that the benchmarks we've been getting on 32-core Epyc Windows Server are all severely hampered by the same bad scheduler?

I mean, it's not a small discrepancy; a lot of the workloads are 2.5x faster on Linux.

I can't wait to see the actual numbers for a 32-core Epyc Windows Server against Xeon when this is fixed...
Expect the gap to increase a lot; well, 2.5x in certain workloads...

If this makes all the previous benchmark results on Threadripper and Epyc wrong... well, lol. This is huge.

I wonder what the 64-core is going to look like. Ridiculous, I'd say.

There are a couple of things going on here. I'd need some really low-level performance stats to know for sure, but my suspicion is simply that the Windows scheduler really isn't designed for this type of workload.

Unlike Linux, Windows is designed to let threads jump between cores in order to maximize thread uptime. For example, on Linux, if a thread gets bumped by another thread, it will be re-assigned to the same CPU core. The downside is that the thread may sit waiting for a while even if another core is capable of running it. That costs some performance for that specific thread, but you don't have to worry about accessing CPU cache across cores, or about hitting main memory as often as threads get re-assigned. By contrast, Windows will schedule a thread on whatever core is capable of running it at that moment. This results in threads jumping between cores, increasing uptime but also increasing the amount of cache/memory access required.
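
To make that concrete, here's a minimal sketch (my own illustration, nothing official from MSFT) of how a program can opt out of the migration behavior on Windows by pinning a thread to a single core:

// Minimal illustration (mine): pin the current thread to core 2 so the
// scheduler can't migrate it.
#include <windows.h>
#include <stdio.h>

int main() {
    DWORD_PTR mask = 1ull << 2;  // bit N of the mask = logical processor N
    DWORD_PTR prev = SetThreadAffinityMask(GetCurrentThread(), mask);
    if (prev == 0) {
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    // From here on this thread runs only on core 2. If something preempts
    // it, it waits for core 2 instead of hopping elsewhere -- roughly the
    // default behavior the Linux scheduler prefers.
    printf("pinned; previous mask was 0x%llx\n", (unsigned long long)prev);
    return 0;
}

Of course, pinning just trades migration cost for wait time, which is exactly the trade-off described above.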

If you have only a few threads, the Windows approach isn't going to hurt performance significantly, but as core counts increase you need to start considering the extra cache/memory access this approach causes. The problem gets significantly amplified in a NUMA environment. Essentially: the Windows thread scheduler is optimized for a single application that uses no more than a handful of threads.

As for the results on Linux, do notice that scaling starts to decline noticeably beyond 16 threads, which is about what I'd expect. If I'm particularly bored (very little work for me right now) I might actually compute some hard numbers to demonstrate.

Glorious

Other workloads will, but scaling declines once you get beyond 16 cores due to scheduling overhead. Not coincidentally, you can see that trend in the result set, where scaling drops off significantly past 16 cores.

Distinguished

That's not exactly correct; a true DX12 or Vulkan (Mantle) title should be able to scale well beyond 8 cores. In fact, it would be capable of breaking the workload into as many threads as there are free cores, or as many as is beneficial.
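
Roughly like this at the engine level (just an illustrative sketch, no actual graphics API calls):

// Illustrative sketch only: fan the workload out across however many
// hardware threads are free, the way a DX12/Vulkan engine can with its
// per-thread command lists.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

static void build_commands(unsigned chunk) {
    // stand-in for recording one chunk of rendering work
    std::printf("chunk %u recorded\n", chunk);
}

int main() {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(build_commands, i);
    for (auto& t : workers)
        t.join();
    return 0;
}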

Glorious

The problem is that as you scale up, especially for memory-heavy tasks like GPU rendering, you start to run into a lot of scheduling/resource overhead. Unless you have either insanely fast memory access or much larger CPU caches than we currently have, these effects will limit how well you can scale. The APIs can handle it, but the rest of the system can't.

Splendid

Well, gamerk, in case you haven't noticed, we do have insanely fast memory that is grossly underused ATM. Maybe when developers start actually taxing the memory subsystem we'll start noticing the bottlenecks you're talking about, but if you ask me, we haven't seen any of them yet.

My take is the same as before: we haven't seen a proper paradigm shift for economic reasons, not technical ones.

Glorious

Latency is more important here than bandwidth; you do NOT want a thread assigned to a core only to have it wait several hundred CPU cycles for the data it needs.
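
If you want to see that cost directly, here's a toy pointer-chase (my own sketch): every load depends on the previous one, so bandwidth can't hide the latency at all:

// Toy latency microbenchmark (mine): chase pointers through a buffer far
// too big for cache, so every hop pays a full memory round-trip.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // 256 MiB of indices: far bigger than any CPU cache.
    const size_t N = (256ull << 20) / sizeof(size_t);
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), size_t{0});

    // Sattolo's algorithm: one big cycle, so the chase never gets stuck
    // in a short, cache-resident loop.
    std::mt19937_64 rng{42};
    for (size_t k = N - 1; k > 0; --k) {
        std::uniform_int_distribution<size_t> pick(0, k - 1);
        std::swap(next[k], next[pick(rng)]);
    }

    // Every load depends on the previous one: pure latency, no bandwidth.
    const size_t hops = 20000000;
    auto t0 = std::chrono::steady_clock::now();
    size_t i = 0;
    for (size_t h = 0; h < hops; ++h) i = next[i];
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("~%.1f ns per dependent load (end=%zu)\n", ns / hops, i);
    return 0;
}

On typical hardware that lands somewhere around the ~100 ns mark per hop, i.e. several hundred CPU cycles, which is exactly the wait I mean.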

Splendid

Not disagreeing, but I'm pretty sure that's why NUMA has such a huge impact on Linux for AMD. Like you say, memory management for threading is really important, and AMD has it covered with NUMA (to a degree) for massively parallel workloads. Changing the topology of how you arrange memory for the CPUs has a deep impact on latencies across the hardware. I like the trade-off, to be honest; less effective memory in exchange for better average access times is most definitely a great trade-off. This just adds to my earlier comment that "we do have fast memory anyway".
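
For what it's worth, applications on Linux can already ask for that placement explicitly through libnuma. A rough sketch (mine; assumes libnuma is installed, build with -lnuma):

// Rough libnuma sketch (assumes a NUMA-capable Linux box; link with -lnuma):
// keep a thread and its memory on the same node for the latency win.
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("no NUMA support here\n");
        return 1;
    }
    const int node = 0;                         // first node, for the demo
    numa_run_on_node(node);                     // restrict this thread to it
    const size_t bytes = 64ul << 20;
    void* buf = numa_alloc_onnode(bytes, node); // memory on the same node
    if (!buf) return 1;
    // ... memory-heavy work here: every access stays node-local ...
    numa_free(buf, bytes);
    return 0;
}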

What I'd like to see is Microsoft pushing a NUMA-enabled version of Windows 10 for AMD. I know Server has it, but you have to enable it. So I don't think (or I hope?) it would be a massive cost for Microsoft to push it into the consumer space. There's a reason to do it now, at the very least, so might as well? Hell, maybe there's already a hidden option (since the kernel is massively similar anyway) that could enable it? Maybe there's already a way?
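
At the very least the API surface already exists, even on client Windows. Something along these lines (a sketch using the Win32 NUMA calls I know of):

// Sketch of the Win32 NUMA surface that already exists (client SKUs too).
#include <windows.h>
#include <stdio.h>

int main() {
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest)) return 1;
    printf("NUMA nodes visible: %lu\n", highest + 1); // 1 means plain UMA

    // Ask for 64 MiB placed preferentially on node 0.
    void* buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, 64ull << 20,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                   0 /* preferred node */);
    if (buf == NULL) return 1;
    // ... work on node-0-local memory here ...
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}

So at least some of the plumbing is already there; the question is whether the scheduler honors the topology by default.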

Glorious

The downside to NUMA is that your workloads must NEVER need to touch the same memory data, otherwise everything grinds to a halt. That's why Windows does so poorly on NUMA, given the way it handles threading (which is optimized for non-NUMA).
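
You can see a miniature version of that penalty without even leaving one socket. Toy example (mine): two threads hammering the same counter versus each thread owning its own cache line:

// Toy demo (mine) of the shared-data penalty at cache-line scale: two
// threads hammering the same line force it to ping-pong between cores
// (or worse, between NUMA nodes).
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Padded { alignas(64) std::atomic<long> v{0}; };  // one cache line each

static double run(std::atomic<long>& a, std::atomic<long>& b) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&a] { for (int i = 0; i < 10000000; ++i) a.fetch_add(1); });
    std::thread t2([&b] { for (int i = 0; i < 10000000; ++i) b.fetch_add(1); });
    t1.join(); t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::atomic<long> shared{0};
    Padded x, y;
    std::printf("same data:      %.2fs\n", run(shared, shared));
    std::printf("separate lines: %.2fs\n", run(x.v, y.v));
    return 0;
}

The shared version is typically several times slower, and across NUMA nodes the gap gets far worse.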

Honorable

Seriously though, with the number of 2S and 4S systems produced throughout history, are you honestly telling me that OS schedulers start to destroy performance gains at only 16 threads? Quad-core Xeons in a 4S config gave you 16 cores back in 2007. That's very, very close to where we currently are in the consumer space already.

Splendid

This might sound weird or even annoyingly direct, but... do you really think serious work on servers with a multitude of CPUs is done with Windows installed?

That is one of the many reasons why you never really use Windows Server for anything *remotely* serious.

All the people developing .NET applications must be delusional if they think their code is going to be running critical applications or critical infrastructure on Windows, if at all. At best, web apps or crappy load-balancing machines. Funny thing: did you know IIS craps out when the CPU is at 100%? It can't accept new connections, and it's still a thing they haven't fixed. Dayum!

Anyway, the point is MS hasn't really taken NUMA seriously, for some bizarre reason I personally don't know, even though they do "support" it in Windows Server.

Glorious

MSFT doesn't take NUMA seriously because the scheduler they use really isn't designed for that type of workload. At minimum they'd have to rewrite their thread scheduler from scratch, which is something they likely don't want to do at this juncture. And I suspect there are a lot of low-level Windows internals that don't play well with NUMA.

Windows was designed around Uniform Memory Access (UMA); it's no surprise that it starts to break down in a NUMA environment.

As for threading, the main problem as you increase thread count is that OS scheduling and memory access become a larger and larger overhead, often leading to significant decreases in performance scaling. You also need to remember that other tasks are trying to run too, and having one application take all the CPU resources often leads to every application losing performance as they constantly bump each other's threads while trying to finish their tasks. In an environment where your application is the only thing running and you have direct control of memory access, you could scale to infinity. But neither of those things is typically true, and scaling declines as thread count increases as a result.
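
Since I keep promising hard numbers, here's the usual back-of-the-envelope model for this (the Universal Scalability Law; the overhead coefficients below are made up purely for illustration):

// Universal Scalability Law sketch: speedup(n) = n / (1 + a*(n-1) + b*n*(n-1)).
// 'a' models contention (locks, scheduler overhead); 'b' models coherency
// (cross-core/cross-node data movement). Coefficients are invented, not measured.
#include <cstdio>

int main() {
    const double a = 0.03, b = 0.0008;  // made-up overheads for illustration
    const int counts[] = {1, 2, 4, 8, 16, 32, 64};
    for (int n : counts) {
        double s = n / (1.0 + a * (n - 1) + b * n * (n - 1));
        std::printf("%2d threads -> %5.2fx speedup (%3.0f%% efficiency)\n",
                    n, s, 100.0 * s / n);
    }
    return 0;
}

With those made-up coefficients, speedup peaks around 32 threads and actually falls by 64, which is the same shape you see in the result set.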

As for Windows-on-Windows (WoW64), MSFT is just converting all 32-bit calls into their 64-bit equivalents in real time in order to keep the OS happy. Everything is still going through the Win32 APIs under the hood.

Honorable

It reads to me like a hardcoded fix for an Intel bug getting in the way. The OS shouldn't be avoiding memory controllers, period, even if it's because there's no path to I/O on the XCC chips.

Windows seems to fall over like this regularly: issuing a bugfix for first-gen tech in a form that doesn't specifically target the hardware that has the problem.

Glorious

The problem is the scheduler has gotten so heavily optimized that anything different tends to break it in some really weird ways.

But yeah, it is odd that the scheduler was trying to put all the threads on just one node while avoiding the others entirely. That's unusual behavior.