Monday, November 23, 2015

An article about Zen and K12 by Yusuke Ohara gives a good overview of AMD's processor plans and a new and very interesting bit of information about AMD's high performance ARM design. As machine translators still struggle to provide clearly understandable translations of Japanese texts, multiple translators were tried and did not help. Therefore I asked the author to make sure that I got it right. He confirmed, that according to ARM officials, who are aware of the works of their architectural licensees, AMD is using at least a 4-wide design for their K12 core.

Jim Keller already said, that the smaller decoders for ARM instructions would leave room to add some performance improving features compared to x86. He also mentioned "a bigger engine" than in Zen. Looking at the microarchitecture diagram, one might ask, how AMD would utilize all these execution hardware, especially if there would be even more units and maybe even more than four instructions fetched and decoded per cycle. And given its target market, which is servers and datacenters, this might include one important feature: SMT. Some already speculated about that based on expectations, but there is AMD patent application US20150121046, which mentions SMT and its application in an AArch64 design very clearly and with many implementation details. This can be seen as an indicator of work being done for real products.

If K12 is a 4-wide or even wider SMT design similar to Zen (which is "only" 4-wide), this would put some substance behind Keller's announcements, which suggested many similarities between both designs. This is supported by the fact, that one of the inventors listed in the patents (Marius Evers) seemingly worked on both cores. Many other patents by him also cover both ARM and x86. He was also involved in one patent filed in 2007, which described a way to add SMT to the front end of a Bulldozer like module. SMT is not only useful to utilize execution units, if there are many of them. It also helps by keeping them busy, if there are multi-cycle FP instructions, branch mispredictions, or cache misses.

Of course, there are more differences between those two architectures than the ISAs alone, but many typical CPU components are either ISA-agnostic and reusable or could be adapted with much less effort than creating them from scratch. However, if it was done this way, such a strategy would not only have permitted AMD to make an efficient use of the limited R&D resources available, but it would have created a chance to produce a powerful ARM core for servers for an acceptable overhead. This is like applying SMT to R&D.