Get your brains ready

Just before the weekend, Josh and I got a chance to speak with David Kanter about the AMD Zen architecture and what it might mean for the Ryzen processor due out in less than a month. For those of you not familiar with David and his work, he is an analyst and consultant on processor architectrure and design through Real World Tech while also serving as a writer and analyst for the Microprocessor Report as part of the Linley Group. If you want to see a discussion forum that focuses on architecture at an incredibly detailed level, the Real World Tech forum will have you covered - it's an impressive place to learn.

David was kind enough to spend an hour with us to talk about a recently-made-public report he wrote on Zen. It's definitely a discussion that dives into details most articles and stories on Zen don't broach, so be prepared to do some pausing and Googling phrases and technologies you may not be familiar with. Still, for any technology enthusiast that wants to get an expert's opinion on how Zen compares to Intel Skylake and how Ryzen might fare when its released this year, you won't want to miss it.

High Bandwidth Cache

Apart from AMD’s other new architecture due out in 2017, its Zen CPU design, there is no other product that has had as much build up and excitement surrounding it than its Vega GPU architecture. After the world learned that Polaris would be a mainstream-only design that was released as the Radeon RX 480, the focus for enthusiasts came straight to Vega. It’s been on the public facing roadmaps for years and signifies the company’s return to the world of high end GPUs, something they have been missing since the release of the Fury X in mid-2015.

Let’s be clear: today does not mark the release of the Vega GPU or products based on Vega. In reality, we don’t even know enough to make highly educated guesses about the performance without more details on the specific implementations. That being said, the information released by AMD today is interesting and shows that Vega will be much more than simply an increase in shader count over Polaris. It reminds me a lot of the build to the Fiji GPU release, when the information and speculation about how HBM would affect power consumption, form factor and performance flourished. What we can hope for, and what AMD’s goal needs to be, is a cleaner and more consistent product release than how the Fury X turned out.

The Design Goals

AMD began its discussion about Vega last month by talking about the changes in the world of GPUs and how the data sets and workloads have evolved over the last decade. No longer are GPUs only worried about games, but instead they must address profession workloads, enterprise workloads, scientific workloads. Even more interestingly, as we have discussed the gap in CPU performance vs CPU memory bandwidth and the growing gap between them, AMD posits that the gap between memory capacity and GPU performance is a significant hurdle and limiter to performance and expansion. Game installs, professional graphics sets, and compute data sets continue to skyrocket. Game installs now are regularly over 50GB but compute workloads can exceed petabytes. Even as we saw GPU memory capacities increase from Megabytes to Gigabytes, reaching as high as 12GB in high end consumer products, AMD thinks there should be more.

Coming from a company that chose to release a high-end product limited to 4GB of memory in 2015, it’s a noteworthy statement.

The High Bandwidth Cache

Bold enough to claim a direct nomenclature change, Vega 10 will feature a HBM2 based high bandwidth cache (HBC) along with a new memory hierarchy to call it into play. This HBC will be a collection of memory on the GPU package just like we saw on Fiji with the first HBM implementation and will be measured in gigabytes. Why the move to calling it a cache will be covered below. (But can’t we call get behind the removal of the term “frame buffer”?) Interestingly, this HBC doesn’t have to be HBM2 and in fact I was told that you could expect to see other memory systems on lower cost products going forward; cards that integrate this new memory topology with GDDR5X or some equivalent seem assured.

Core and Interconnect

The Skylake architecture is Intel’s first to get a full release on the desktop in more than two years. While that might not seem like a long time in the grand scheme of technology, for our readers and viewers that is a noticeable change and shift from recent history that Intel has created with the tick-tock model of releases. Yes, Broadwell was released last year and was solid product, but Intel focused almost exclusively on the mobile platforms (notebooks and tablets) with it. Skylake will be much more ubiquitous and much more quickly than even Haswell.

Skylake represents Intel’s most scalable architecture to date. I don’t mean only frequency scaling, though that is an important part of this design, but rather in terms of market segment scaling. Thanks to brilliant engineering and design from Intel’s Israeli group Intel will be launching Skylake designs ranging from 4.5 watt TDP Core M solutions all the way up to the 91 watt desktop processors that we have already reviewed in the Core i7-6700K. That’s a range that we really haven’t seen before and in the past Intel has depended on the Atom architecture to make up ground on the lowest power platforms. While I don’t know for sure if Atom is finally trending towards the dodo once Skylake’s reign is fully implemented, it does make me wonder how much life is left there.

Scalability also refers to the package size – something that ensures that the designs the engineers created can actually be built and run in the platform segments they are targeting. Starting with the desktop designs for LGA platforms (DIY market) that fits on a 1400 mm2 design on the 91 watt TDP implementation Intel is scaling all the way down to 330 mm2 in a BGA1515 package for the 4.5 watt TDP designs. Only with a total product size like that can you hope to get Skylake in a form factor like the Compute Stick – which is exactly what Intel is doing. And note that the smaller packages require the inclusion of the platform IO chip as well, something that H- and S-series CPUs can depend on the motherboard to integrate.

Finally, scalability will also include performance scaling. Clearly the 4.5 watt part will not offer the user the same performance with the same goals as the 91 watt Core i7-6700K. The screen resolution, attached accessories and target applications allow Intel to be selective about how much power they require for each series of Skylake CPUs.

Core Microarchitecture

The fundamental design theory in Skylake is very similar to what exists today in Broadwell and Haswell with a handful of significant and hundreds of minor change that make Skylake a large step ahead of previous designs.

This slide from Julius Mandelblat, Intel Senior Principle Engineer, shows a higher level overview of the entirety of the consumer integration of Skylake. You can see that Intel’s goals included a bigger and wider core design, higher frequency, improved right architecture and fabric design and more options for eDRAM integration. Readers of PC Perspective will already know that Skylake supports both DDR3L and DDR4 memory technologies but the inclusion of the camera ISP is new information for us.

It is common knowledge that computing power consistently improves throughout time as dies shrink to smaller processes, clock rates increase, and the processor can do more and more things in parallel. One thing that people might not consider: how fast is the actual architecture itself? Think of the problem of computing in terms of a factory. You can increase the speed of the conveyor belt and you can add more assembly lines, but just how fast are the workers? There are many ways to increase the efficiency of a CPU: from tweaking the most common or adding new instruction sets to allow the task itself to be simplified; to playing with the pipeline size for proper balance between constantly loading the CPU with upcoming instructions and needing to dump and reload the pipe when you go the wrong way down an IF/ELSE statement. Tom’s Hardware wondered this and tested a variety of processors since 2005 with their settings modified such that they could only use one core and only be clocked at 3 GHz. Can you guess which architecture failed the most miserably?

Pfft, who says you ONLY need a calculator?

(Image from Intel)

Netburst architecture was designed to get very large clock rates at the expensive of heat -- and performance. At the time, the race between Intel and its competitors was clock rate: the higher the clock the better it was for marketers despite a 1.3 GHz Athlon wrecking a 3.2 GHz Celeron in actual performance. If you are in the mood for a little chuckle, this marketing strategy was all destroyed when AMD decided to name their processors “Athlon XP 3200+” and so forth rather than by their actual clock rate. One of the major reasons that Netburst was so terrible was branch prediction. Branch prediction is a strategy you can use to speed up a processor: when you reach a conditional jump from one chunk of code to another, such as “if this is true do that, otherwise do this”, you do not know for sure what will come next. Pipelining is a method of loading multiple commands into a processor to keep it constantly working. Branch prediction says: “I think I’ll go down this branch” and loads the pipeline assuming that is true; if you are wrong, you need to dump the pipeline and correct your mistake. One way that Pentium Netburst kept high clock rates was by having a ridiculously huge pipeline, 2-4x larger than the first generation of Core 2 parts which replaced it; unfortunately the Pentium 4 branch prediction was terrible keeping the processor stuck needing to dump its pipeline perpetually.

The sum of all tests... at least time-based ones.

(Image from Tom's Hardware)

Now that we excavated Intel’s skeletons to air them out it is time to bury them again and look at the more recent results. On the AMD side of things, it looks as though there has not been too much innovation on the efficiency side of things only now getting within range of the architecture efficiency that Intel had back in 2007 with their first introduction of Core 2. Obviously efficiency per core per clock means little in the real world as it tells you neither about raw performance of a part nor how power efficient it is. Still, it is interesting to see how big of a leap Intel made away from their turkey of an architecture theory known as Netburst and model the future around the Pentium 3 and Pentium M architectures. Lastly, despite the lead, it is interesting to note exactly how much work went into the Sandy Bridge architecture. Intel, despite an already large lead and focus outside of the x86 mindset, still tightened up their x86 architecture by a very visible margin. It might not be as dramatic as their abandonment of Pentium 4, but is still laudable in its own right.