Weaving The Fabric

Again, as you'll see in our benchmarks, we ran into some strange performance trends that didn't add up. Given Skylake-X's frequency advantage, reworked cache, and 2D mesh topology, we didn't expect Broadwell-E to stand a chance. But in some cases, the previous-gen flagship outperformed Core i9-7900X. Asked about these anomalies, Intel responded:

...we have noticed that there are a handful of applications where the Broadwell-E part is comparable or faster than the Skylake-X part. These inversions are a result of the “mesh” architecture on Skylake-X vs. the “ring” architecture of Broadwell-E.

Every new architecture implementation requires architects to make engineering tradeoffs with the goal of improving the overall performance of the platform. The “mesh” architecture on Skylake-X is no different.

While these tradeoffs impact a handful of applications; overall, the new Skylake-X processors offer excellent IPC execution and significant performance gains across a variety of applications.

We covered Skylake-X's mesh architecture in Intel Introduces New Mesh Architecture For Xeon And Skylake-X Processors. Check that piece out for more detail. Of course, there's a lot more to this story, and much of it remains under embargo. But this is a huge change to an already effective design, so it comes as no surprise that the mesh topology doesn't yield extra performance in all of our metrics.

The Background

Interconnects are pathways for moving data between key components inside a processor, including cores, caches, and PCIe and memory controllers. They affect latency and power consumption, which in turn affect performance and thermal design power.

Intel's ring bus debuted in 2008 with Nehalem, and AMD's HyperTransport was introduced in 2001. Both technologies evolved, but higher processor core counts, more cache, and greater I/O throughput have strained the interconnects. There are ways to improve interconnect performance, but they often require bumping up data rates, and thus voltage, to realize large gains.

[Image gallery: processor die shots, including Ryzen-Die-Shot-1]

Intel's bi-directional ring bus, pictured above in red on a Broadwell low core-count die, serves as a good example of the challenge. Data travels a circuitous route to reach the components, and latency amplifies as core count increases. The second image shows the Broadwell high core-count die with 24 cores. Aligning the building blocks into a monolithic bus imposes penalties that make it impractical, so Intel divided the larger die into two separate ring buses. This increases scheduling complexity, and the buffered switches that facilitate communication between the rings add a five-cycle penalty, limiting scalability.
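A simple back-of-envelope model illustrates why ring latency grows with core count. These are not Intel's figures, just an assumed uniform-traffic model where each stop costs one hop and traffic takes the shorter direction around the ring: the average trip grows roughly linearly with the number of stops, which is why splitting the die into two rings (and eating the buffered-switch penalty) eventually becomes the lesser evil.

```python
# Illustrative model (not Intel's actual figures): average hop count
# between two distinct stops on a bidirectional ring, assuming traffic
# always takes the shorter direction around the ring.

def avg_ring_hops(n_stops):
    """Mean shortest-path distance between two distinct ring stops."""
    total = sum(min(d, n_stops - d) for d in range(1, n_stops))
    return total / (n_stops - 1)

for n in (10, 16, 24):
    print(n, round(avg_ring_hops(n), 2))  # 10 -> 2.78, 16 -> 4.27, 24 -> 6.26
```

The near-linear growth in average hops is the scalability ceiling the mesh is meant to raise.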

In contrast, AMD introduced its Infinity Fabric with the Zen microarchitecture, currently implemented as two quad-core processor complexes (CCXes) communicating over a 256-bit bi-directional crossbar that also handles northbridge and PCIe traffic. The two CCXes also share a memory controller. The trip across the Infinity Fabric to the other quad-core CCX and its accompanying cache results in increased communication latency. We detailed the design and measured its latency in our AMD Ryzen 5 1600X Review. We also found that higher memory frequencies can improve the Infinity Fabric's latency characteristics, which is likely one of the key reasons that Ryzen's performance increases with faster memory data transfer rates.

AMD contends that software and platform optimizations can mitigate some of the performance oddities we've noticed in our testing, and from what we've seen, that is true. AMD's efforts, and an unrelenting string of BIOS, chipset, and software updates, have led to much better performance than we recorded in our inaugural Ryzen 7 review.

AMD's work continues. And now Intel faces the same challenge.

What A Mesh

Intel's 2D mesh architecture made its debut on the company's Knights Landing products. The mesh consists of rows and columns of interconnects between the cores, caches, and I/O controllers. As you can see, the latency-killing buffered switches are absent. The ability to 'stair-step' data through the cores allows for much more complex, and purportedly efficient, routing. Intel claims its 2D mesh operates at a lower voltage and frequency than the ring bus, yet still provides higher bandwidth and lower latency.
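The same kind of toy model shows why a mesh scales more gracefully. Again, this is an assumed uniform-traffic sketch with unit hop cost, not Intel's routing: the average Manhattan distance between tiles on a rows-by-columns grid grows roughly with the square root of the core count, versus linearly on a ring.

```python
# Illustrative model (assumed uniform traffic, unit hop cost): average
# Manhattan distance between two distinct tiles on a rows x cols mesh.
from itertools import product

def avg_mesh_hops(rows, cols):
    tiles = list(product(range(rows), range(cols)))
    dists = [abs(a[0] - b[0]) + abs(a[1] - b[1])
             for a in tiles for b in tiles if a != b]
    return sum(dists) / len(dists)

print(round(avg_mesh_hops(4, 5), 2))  # 20-tile mesh -> 3.0
```

Twenty stops on a ring average well over six hops in the model above; arranged as a 4x5 mesh, the average trip is only three. That gap is the scalability argument in miniature.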

Intel moved the DDR4 controllers to the left and the right sides of the 18-core high core-count die, similar to its Knights Landing design. Previously, they were at the bottom of the ring bus-based designs. The Skylake-X die shot suggests there are six memory controllers (second row down on the right and left columns), so it appears Intel disabled two controllers by default. The company likely uses its smaller LCC die for the Core i9-7900X, though representatives won't say for sure.

Things Get Meshy

Intel designed the mesh to increase scalability. There are trade-offs, however. We turned to SiSoftware Sandra's Processor Multi-Core Efficiency test, which measures inter-core, inter-module, and inter-package latency. The software offers Multi-Threaded, Multi-Core Only, and Single-Threaded metrics. We use the Multi-Threaded test with the "best pair match" setting (lowest latency).

The test measures performance between cores with all possible thread pairs, and for Intel's Core i9-7900X, that yields 189 separate measurements. We employ a data parser to boil the measurements down into average values.
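Our parser isn't anything exotic; a minimal sketch of the reduction step looks like the following, assuming (hypothetically) that Sandra's pair results have been exported as (thread_a, thread_b, latency_ns) rows:

```python
# Hypothetical sketch of the reduction we perform on the benchmark's
# per-pair output. Assumes results were exported as tuples of
# (thread_a, thread_b, latency_ns); the real export format differs.
def summarize(pairs):
    latencies = [ns for _, _, ns in pairs]
    return {
        "min_ns": min(latencies),
        "max_ns": max(latencies),
        "avg_ns": round(sum(latencies) / len(latencies), 2),
    }

sample = [(0, 1, 69.3), (0, 2, 75.6), (1, 2, 82.3)]
print(summarize(sample))  # {'min_ns': 69.3, 'max_ns': 82.3, 'avg_ns': 75.73}
```

The min/max pair feeds the latency ranges in the table below, and the mean feeds the average-latency column.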

| Processor | Intra-Core Latency | Core-To-Core Latency | Core-To-Core Average Latency | Average Transfer Bandwidth |
|---|---|---|---|---|
| Core i9-7900X | 14.5 - 16ns | 69.3 - 82.3ns | 75.56ns | 83.21 GB/s |
| Core i9-7900X @ 3200 MT/s | 16 - 16.1ns | 76.8 - 91.3ns | 83.93ns | 87.31 GB/s |
| Core i7-6950X | 13.5 - 15.4ns | 54.5 - 70.3ns | 64.64ns | 65.67 GB/s |
| Core i7-7700K | 14.7 - 14.9ns | 36.8 - 45.1ns | 42.63ns | 35.84 GB/s |
| Core i7-6700K | 16 - 16.4ns | 41.7 - 51.4ns | 46.71ns | 32.38 GB/s |

The intra-core measurement quantifies latency between threads that are resident on the same physical core, while the core-to-core numbers reflect thread-to-thread latency between two physical cores. Core i9-7900X is most comparable to the 10-core Core i7-6950X, but we included the four-core models as a reference point.

We recorded slightly higher intra-core latency on Skylake-X, along with a 10.92ns increase in average core-to-core latency over the Broadwell-E model. Despite Core i9-7900X's increased latency, we recorded a 17.54 GB/s advantage in average transfer bandwidth. That's a solid 26.7% increase. After generating our first set of -7900X results with DDR4-2666, we followed up with several DDR4-3200 tests and noticed an increase in mesh latency, but also higher average transfer bandwidth. These results are preliminary, and we are conducting further latency and game testing with different memory transfer rates and timings to provide a more in-depth analysis.
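The quoted deltas follow directly from the table values:

```python
# Reproducing the quoted deltas from the table values above.
broadwell_avg_ns, skylake_avg_ns = 64.64, 75.56  # core-to-core average latency
broadwell_bw, skylake_bw = 65.67, 83.21          # average bandwidth, GB/s

latency_delta = round(skylake_avg_ns - broadwell_avg_ns, 2)  # 10.92 ns
bw_delta = round(skylake_bw - broadwell_bw, 2)               # 17.54 GB/s
bw_gain_pct = round(100 * bw_delta / broadwell_bw, 1)        # 26.7 %
print(latency_delta, bw_delta, bw_gain_pct)
```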

| Processor | Intra-CCX Core-to-Core Latency | Cross-CCX Core-to-Core Latency | Cross-CCX Average Latency | Average Transfer Bandwidth |
|---|---|---|---|---|
| Ryzen 7 1800X | 40.5 - 82.8ns | 120.9 - 126.2ns | 122.96ns | 48.1 GB/s |
| Ryzen 5 1600X | 40.6 - 82.8ns | 121.5 - 128.2ns | 123.48ns | 43.88 GB/s |

AMD's Ryzen processors employ a vastly different architecture that yields different measurements. The intra-core latency measurements represent communication between two logical threads resident on the same physical core, and they're unaffected by memory speed. Intra-CCX measurements quantify latency between threads on the same CCX that are not resident on the same core. In the past, we observed slight variances, but intra-CCX latency is also largely unaffected by memory speed. However, we've seen up to a 50% decrease in cross-CCX latency, which denotes latency between threads located on two separate CCXes, by increasing the memory data transfer rate from DDR4-1333 to DDR4-3200.
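The three latency classes above come down to where a thread pair lands in the topology. A hypothetical classification sketch, assuming the common eight-core Ryzen layout (logical CPUs 0-7 on CCX0, 8-15 on CCX1, with SMT siblings numbered adjacently; real mappings vary by OS and BIOS):

```python
# Hypothetical thread-pair classification on an 8-core Ryzen part,
# ASSUMING logical CPUs 0-7 map to CCX0 and 8-15 to CCX1, with SMT
# siblings as (2n, 2n+1). Real enumerations vary by OS and BIOS.
def pair_kind(cpu_a, cpu_b):
    if cpu_a // 2 == cpu_b // 2:
        return "intra-core (SMT siblings)"
    if cpu_a // 8 == cpu_b // 8:
        return "intra-CCX"
    return "cross-CCX"

print(pair_kind(0, 1))   # intra-core (SMT siblings)
print(pair_kind(0, 6))   # intra-CCX
print(pair_kind(0, 12))  # cross-CCX
```

Only the third class crosses the Infinity Fabric, which is why it alone responds so strongly to memory data rates.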

Fabric Bandwidth

We also plotted the fabric bandwidth results from our tests. Core i9-7900X establishes a large advantage over its Broadwell-E predecessor. The Ryzen processors dwarf Intel's quad-core models, but provide far less average bandwidth than the 10-core Intel CPUs.

There will be a market for this platform, but it's going to be a tiny one. Most of that, I imagine, will be professional use, as some of the performance gains in areas like rendering might pay dividends over the longer term. As for the "enthusiast" market, there will always be those who want the fastest or most expensive new thing regardless of outlay, but the price/performance ratio of this new platform is poor compared to AMD's offerings. It's in the mainstream enthusiast market, where gamers form a huge share of the segment, that AMD will make its money and take hold once more. AMD is willing to sacrifice overall margin for volume sales into the biggest market segment, and this longer-term growth approach will see it regain much of the market share lost over the past decade.

Intel's new platform is designed and priced for the top 5% of the PC market; AMD is targeting the other 95%.

In fairness to Intel, I think the forthcoming Coffee Lake will be a much better proposition for the average user.

Core i9 doesn't just feel rushed; it also gives me the impression that Intel has purposely failed to innovate, probably because it didn't need to until now. We've been stuck with the same old i3, i5, and i7 configurations for around eight years, and if Intel really had the drive to push the boundaries, an i9 would have existed years ago.

Core i9 isn't just about clock speed, cores and cache. Look closer and it tells a bigger story.

This seems like a prime candidate for delidding. But who wants to delid a $1,999 CPU that should really live in a server? I'm surprised Intel went this route after the problems it had with Haswell overclocking.