AMD Zen Architecture Overview: Focus on Ryzen

What Makes Ryzen Tick

We have seen details of the Zen architecture at the past several Hot Chips conferences, along with other information directly from AMD. Zen was a clean sheet design that borrowed some of the best features from the Bulldozer and Jaguar architectures while integrating many new ideas that had not been executed in AMD processors before. Ideas from higher performance cores, lower power cores, and experience gained in APU/GPU design have all come together in a very impressive package that is the Ryzen CPU.

It is well known that AMD brought back Jim Keller to head the CPU group after the company’s slow downward spiral in CPU design. While the Athlon 64 was a tremendous part for its time, the CPUs that followed did not retain that leadership position. The original Phenom had problems right off the bat and could not compete well with Intel’s latest dual and quad cores. The Phenom II shored up AMD’s position a bit, but in the end could not keep pace with the products that Intel continued to introduce on its newly minted “tick-tock” cadence. Bulldozer had issues out of the gate and did not deliver performance significantly greater than the previous generation “Thuban” six-core Phenom II, much less the Intel Sandy Bridge and Ivy Bridge products that it would compete with.

AMD attempted to stop the bleeding by iterating and evolving the Bulldozer architecture with Piledriver, Steamroller, and Excavator. The final products based on this design arc seemed to do fine for the markets they were aimed at, but they certainly did not win back any market share as AMD’s desktop numbers continued to shrink. No matter what AMD did, the base architecture just could not overcome some of the basic properties that impeded strong IPC performance.

The primary goal of this new architecture is to raise IPC to a level consistent with what Intel has to offer. AMD aimed to increase IPC by at least 40% over the previous Excavator core. This is a pretty aggressive goal considering that the Bulldozer architecture was focused on multi-threaded performance and high clock speeds rather than per-clock throughput. AMD claims that it has in fact increased IPC by an impressive 52% over the Excavator-based core. If that holds, AMD has not only hit its performance goal but exceeded it. AMD also plans to use the Zen architecture across its lineup, from mobile products to the highest TDP parts offered.
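To put an IPC target in perspective, single-thread performance scales roughly as IPC multiplied by clock speed. The short Python sketch below is only back-of-the-envelope arithmetic; the function name and the 10% clock deficit in the second call are illustrative assumptions, not AMD figures.

```python
def relative_performance(ipc_uplift, clock_ratio=1.0):
    """Single-thread performance relative to the baseline core.

    ipc_uplift  -- fractional IPC gain, e.g. 0.40 for +40%
    clock_ratio -- new clock divided by baseline clock (hypothetical input)
    """
    return (1.0 + ipc_uplift) * clock_ratio


if __name__ == "__main__":
    # The 40% design target at equal clocks.
    print(f"{relative_performance(0.40):.2f}x")
    # The same target if the new core had to run 10% slower (assumed figure).
    print(f"{relative_performance(0.40, clock_ratio=0.90):.2f}x")
```

The point of the second call is that an IPC gain only shows up as a performance gain to the extent that clock speeds hold.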

The Zen Core

The basis for Ryzen is the CCX module. Each CCX contains four Zen cores along with 8 MB of shared L3 cache. Each core has 64 KB of L1 I-cache, 32 KB of L1 D-cache, and 512 KB of L2 cache. The L1 and L2 caches are inclusive. The L3 cache acts as a victim cache which partially copies what is in L1 and L2 caches. AMD has improved the performance of its caches to a very large degree as compared to previous architectures. The arrangement here allows the individual cores to quickly snoop any changes in the caches of the others for shared workloads. So if a cache line is changed on one core, other cores requiring that data can quickly snoop into the shared L3 and read it. This keeps the core doing the actual work from being interrupted by cache read requests from other cores.
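For readers unfamiliar with the term, here is a minimal Python sketch of a victim cache in the textbook sense: the L3 is filled only by lines evicted from the L2, and a line that hits in the L3 moves back up rather than being copied. The class name, the LRU policy, and the tiny capacities are assumptions made for illustration. This is not a description of AMD's implementation, and how strictly Zen's L3 follows this model is not covered by the details above.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy two-level model: an L2 backed by an exclusive victim L3.

    Capacities are counted in cache lines and are illustrative only.
    """

    def __init__(self, l2_lines, l3_lines):
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines
        self.l2 = OrderedDict()  # least recently used line first
        self.l3 = OrderedDict()  # oldest victim first

    def access(self, line):
        """Touch one cache line and report where it was found."""
        if line in self.l2:
            self.l2.move_to_end(line)  # refresh LRU position
            return "L2 hit"
        if line in self.l3:
            del self.l3[line]  # the line moves up, it is not copied
            self._fill_l2(line)
            return "L3 hit"
        self._fill_l2(line)  # miss: fill L2 directly, bypassing L3
        return "miss"

    def _fill_l2(self, line):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # evict the LRU line
            self._fill_l3(victim)
        self.l2[line] = None

    def _fill_l3(self, victim):
        if len(self.l3) >= self.l3_lines:
            self.l3.popitem(last=False)  # oldest victim is dropped
        self.l3[victim] = None


if __name__ == "__main__":
    caches = VictimHierarchy(l2_lines=2, l3_lines=4)
    for line in (1, 2, 3, 1, 3):
        print(line, caches.access(line))
```

In this toy model a line never sits in both levels at once, which is why the capacities of an exclusive L2 and L3 can simply be added together.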

Each core can handle two threads, but unlike a Bulldozer module it has a single integer core. Bulldozer modules featured two integer units and a shared FPU/SIMD unit. Zen gets rid of CMT for good, and we have a single integer unit and a single FPU for each core. The core can address two threads by utilizing AMD’s version of SMT (simultaneous multi-threading). There is a primary thread that gets higher priority while the second thread has to wait until resources are freed up. This works far better in the real world than my explanation suggests, as resources are constantly being shuffled about and the primary thread will not monopolize all resources within the core.

There is not one area of the front end that AMD has not touched or added to. The TLB is larger, the branch prediction is more robust, and it has a four-wide decoder. The Neural Net Prediction functionality encompasses branch prediction, TLBs, and prefetch. There are some new algorithms that AMD added to improve this functionality, but the jury is still out on whether it is truly a “neural net” type setup. AMD made a very large addition to the architecture with the Op Cache. The Op Cache takes many of the common decoded ops and saves them locally. When the Decode stage detects that the Op Cache has the necessary micro-ops locally, it sends these to the micro-op queue for dispatch to the integer or FP units. While the decoder is four instructions wide, the queue has the ability to dispatch 6 micro-ops. This essentially ensures that when four instructions are decoded and sent to the queue, there is also enough room for extra ops, either from the Op Cache or from instructions decoded into multiple ops.
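The benefit of dispatching wider than the decoder can be seen in a toy model. The Python sketch below assumes a decoder that handles up to four instructions per cycle, each expanding into a given number of micro-ops, and a queue that dispatches up to six micro-ops per cycle. The function name and the single-cycle timing are assumptions for illustration; the model ignores the Op Cache and everything downstream of dispatch.

```python
def cycles_to_dispatch(uops_per_instruction, decode_width=4, dispatch_width=6):
    """Count cycles needed to decode and dispatch a stream of instructions.

    uops_per_instruction -- list with one entry per instruction giving the
                            number of micro-ops it decodes into
    """
    pending = list(uops_per_instruction)
    queued = 0   # micro-ops waiting in the queue
    cycles = 0
    while pending or queued:
        cycles += 1
        # Decode up to decode_width instructions into the micro-op queue.
        decoded, pending = pending[:decode_width], pending[decode_width:]
        queued += sum(decoded)
        # Dispatch up to dispatch_width micro-ops out of the queue.
        queued -= min(queued, dispatch_width)
    return cycles


if __name__ == "__main__":
    two_uop_stream = [2] * 8  # eight instructions, two micro-ops each
    print(cycles_to_dispatch(two_uop_stream, dispatch_width=6))
    print(cycles_to_dispatch(two_uop_stream, dispatch_width=4))
```

With a stream of two-op instructions, the six-wide dispatch drains the queue in fewer cycles than a hypothetical four-wide dispatch would, which is the headroom the paragraph above describes.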

There are four ALUs per core as well as two AGUs (address generation units), which handle the address work for loads and stores. Up to 192 instructions can be in flight at any one time, and the core features an 8-wide retire. This again is greater in ability than previous architectures from AMD, and again is made more robust to ensure that it will not bottleneck throughput. The floating point unit supports up to AVX2 and is composed of 2 x 128-bit units. It can achieve 256-bit AVX results by fusing results over two clocks (competing Intel units deliver 256-bit results in one cycle). So far AVX2 is not utilized in many consumer apps and is mostly relegated to HPC applications. The FPU/SIMD unit is obviously not as robust as the latest found in Skylake and Kaby Lake parts from Intel, but AMD thought it would be more than enough for the markets it is pursuing. The FPU features 2 x FADD and 2 x FMUL units, so in theory it can issue four floating point operations per clock. Actual throughput will of course vary by workload.
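The arithmetic behind those FPU figures can be sketched in a few lines of Python. The helper names are made up for this example, and the peak figure assumes all four 128-bit pipes (two add, two multiply) are kept busy every cycle, which real code rarely manages.

```python
import math


def uops_for_vector_op(vector_bits, unit_bits=128):
    """Number of passes a vector operation needs on units of a given width."""
    return math.ceil(vector_bits / unit_bits)


def peak_flops_per_cycle(element_bits, pipes=4, unit_bits=128):
    """Theoretical peak: every pipe processes a full-width vector each cycle."""
    return pipes * (unit_bits // element_bits)


if __name__ == "__main__":
    print(uops_for_vector_op(256))    # passes for a 256-bit AVX operation
    print(peak_flops_per_cycle(32))   # single precision, per core
    print(peak_flops_per_cycle(64))   # double precision, per core
```

A 256-bit operation takes two passes on 128-bit hardware, which is the two-clock behavior described above.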

From 10,000 feet the integer and FP cores are very similar to what we saw in Bulldozer, but the engineering behind them is improved dramatically. More in-flight instructions, better decode, better dispatch, and better retire, combined with greatly improved cache bandwidth and latencies, make the core much more efficient and higher performing by comparison.

"The L3 cache acts as a victim cache which partially copies what is in L1 and L2 caches."

It seems like it would be the opposite. The L3 is a victim cache for lines evicted from the L2, so by definition the L2 and L3 should never duplicate anything. This is why they are saying 20 MB for the 8 core chip (4 MB L2 + 16 MB L3).

You could very well be right. AMD was not exactly forthcoming about exactly how it works, but you are certainly correct about the definition of a victim cache (evicted cache lines). I need to ask a few more questions and get it correct. Thanks for reading!