Performance issues affecting 32-core AMD Threadripper and Epyc processors under Windows, previously blamed on how their multi-die design communicates with system memory, may be easier to resolve than first thought, with a small programme dubbed Coreprio unlocking full performance.

Designed for workstation and high-end desktop usage, AMD's Threadripper 2990WX is based on the server-centric Epyc 7551: both chips include 32 physical cores and 64 logical threads, with the Threadripper variant featuring four memory channels and the Epyc eight. At launch, though, the performance of both devices under Windows 10 was found to fall short of expectations, with testers blaming everything from memory bandwidth starvation to an inherent drawback of the multi-die design.

Level1Techs, however, suggests that the issue is considerably simpler: A flaw in the Windows kernel which has been sapping performance. Prompted by the discovery that the chips' performance in Windows was lower than when running Linux, and that similar performance issues could be seen in the Epyc chip with its considerably higher memory bandwidth, the site performed an in-depth investigation in partnership with Bitsum's Jeremy Collake. The result: the addition of a new feature to Bitsum's Coreprio tool, NUMA Dissociater, which works around the bug.

'Wendell at Level1Techs acquired a new Epyc system, and during benchmarking found the same performance regressions that were present in the 2990WX. Since that platform has a full 8 memory channels, it shouldn’t suffer the asymmetrical performance issues of the 2990WX,' Collake explains. 'When he put the system into UMA [Uniform Memory Access] mode, the regressions disappeared and performance of some benchmarks increased massively. Back in NUMA [Non-Uniform Memory Access] mode, he noticed that changing the CPU affinity of select benchmarks (e.g. Indigo) caused them to mysteriously increase in performance. It turns out the reason was because the CPU affinity has to be changed after the process starts; the effect is not seen when the CPU affinity is assigned during process creation.'
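
The distinction Collake draws, between affinity assigned at process creation and affinity changed once the process is running, can be illustrated with a minimal Python sketch. This is not Coreprio's implementation, which the source does not describe in detail; it only shows the general idea of launching a process with no affinity set and then changing it afterwards. The function names mask_for, set_affinity_after_start and launch_then_pin, and the one-second delay, are hypothetical choices made for illustration. OpenProcess and SetProcessAffinityMask are standard Win32 calls, reached here through ctypes, and the second function only works on Windows.

```python
import ctypes
import subprocess
import sys
import time

# Win32 access rights needed to change another process's affinity.
PROCESS_SET_INFORMATION = 0x0200
PROCESS_QUERY_INFORMATION = 0x0400


def mask_for(cpus):
    """Build an affinity bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def set_affinity_after_start(pid, cpus):
    """Change the affinity of an already-running process (Windows only)."""
    if sys.platform != "win32":
        raise OSError("this sketch relies on the Win32 API")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.OpenProcess.argtypes = [ctypes.c_uint32, ctypes.c_int, ctypes.c_uint32]
    kernel32.OpenProcess.restype = ctypes.c_void_p
    kernel32.SetProcessAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    kernel32.SetProcessAffinityMask.restype = ctypes.c_int
    kernel32.CloseHandle.argtypes = [ctypes.c_void_p]

    access = PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION
    handle = kernel32.OpenProcess(access, False, pid)
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        if not kernel32.SetProcessAffinityMask(handle, mask_for(cpus)):
            raise ctypes.WinError(ctypes.get_last_error())
    finally:
        kernel32.CloseHandle(handle)


def launch_then_pin(command, cpus, delay=1.0):
    """Start a process with default affinity, then change it once it is running."""
    proc = subprocess.Popen(command)  # no affinity is assigned at creation
    time.sleep(delay)                 # assumed grace period; arbitrary value
    set_affinity_after_start(proc.pid, cpus)
    return proc
```

On a 64-thread chip such as the 2990WX, a call along the lines of launch_then_pin(["benchmark.exe"], range(64)) would start a placeholder benchmark and then re-apply an all-cores mask to the running process. Whether this reproduces the gains reported by Level1Techs on a given system is not something the sketch can promise.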

The result of the fix, which Collake describes as 'bizarrely imprecise', is an impressive performance gain: the worst-affected benchmarks, including Indigo, in which a 16-core Threadripper variant had been outperforming the 32-core version, almost double their results with no other changes to the system.

A full deep-dive of the issue is available on Level1Techs, while Coreprio with its experimental NUMA Dissociater functionality can be downloaded for free from Bitsum.