Monday, February 29, 2016

Just one week after my last blog posting, providing a hint of the maximum number of Zen cores supported per socket, a news wave about details of Zen based server processors given in the presentation of a CERN researcher hit the web. The guy works in the institution's Platform Competence Centre (PCC) and manages integration of predominantly prototype hardware according to his CERN profile. So it can be assumed, that anything he says about server platforms might have been provided by representatives coming from the different processor and server OEMs. The 8 memory channels haven't been mentioned before in a leak or patch. And the 32 core number is not related to my posting, as the CERN talk has been held on 29th of January while I published my posting (unaware of the talk) on the 1st of February, after first mentioning the patch already in December.

Now a new series of patches provides further information about the Zen core's IP blocks. They've been posted on 16/02/16 on the Linux kernel mailing list by an AMD employee, after an earlier round of patches in January, which even mention a "ZP" target, very likely being the abbreviation for "Zeppelin". The more recent patches cover additions to AMD's implementation of a scalable Machine Check Architecture (MCA), and handling of deferred errors. This is implemented in the Linux EDAC kernel module, which is responsible for hardware error detection and correction. The most interesting patch contains following sections, with some details highlighted:

The interconnect subsystem is called "Data Fabric", which knows so called coherent slaves according to the last enumeration list. The "FUSE subsystem" might be replaced by something else like "Parameter block", as it just means a block managing the processor's configuration.

The second list of enumerations contains a blocks found in the Zen core or close to it. I think, the highlighted "RES" element might actually stand for a real IP block, as it doesn't make much sense to have it sitting inmidst the other elements and not at the end. According to some other code in the patch, the L2 cache is seen as part of the core, while the L3 cache is not (as expected):

This is the first of many lists containing error strings, in this case for the load/store unit. Similar to the enumeration above, there is a reserved element, possibly hiding something, as this is a public mailing list. The strings I left out don't contain any surprises compared to the Bulldozer family. But overall I get the impression, that AMD significantly improved the RAS capabilities, which are very important for server processors. The following block contains error strings related to the instruction fetch block ("if"):

There is a new L0 ITLB, which is the only level 0 thing being mentioned so far, while VR World mentioned level 0 caches (besides other somewhat strange rumoured facts like no L3 cache in the APU variant - while this has been shown on the leaked Fudzilla slide). The only thing resembling such a L0 cache is a uOp cache, which has clearly been named in the new patch in a section related to the decode/dispatch block (indicated by "de"):

There are strings for both a "uop cache" and a "uop buffer". So far I knew about this uop buffer patent filed by AMD in 2012, which describes different related techniques aimed at saving power, e.g. when executing loops or to keep the buffer physically small by leaving immediate and displacement data of decoded instructions in an instruction byte buffer ("Insn buffer") sitting between instruction fetch and decode. The "uop cache" clearly seems to be a separate unit. Even without knowing how many uops per cycle can be provided by that cache, it will help to save power and remove an occaisional fetch/decode bottleneck when running two threads. The next interesting block is about the execution units:

Here is a first confirmation of a checkpoint mechanism. This has been described in several patents and might also be an enabler for hardware transactional memory, which has been proposed in the form of ASF back in 2009. Another use case is the quick recovery from branch mispredictions, where program flow can be redirected to a checkpoint created right before evaluating a difficult to predict branch condition.

There is a confirmation of the "GMI link" mentioned on an already leaked slide, which mentioned a bandwidth of 25 GB/s per link. The term "Data Fabric" also has been used on that slide.

When reporting about the 32 core support, I wrote that some patents used the same wording. It's actually "core processing complex" (CPC) and can contain multiple compute units (like Zen cores). So they are not the same. AMD patent filings using the term are US20150277521, US20150120978, and US20140331069.

Last but not least I have updated the Zen core diagram based on these new informations and some very likely related patents and papers:

Notable changes are:

uOp Cache has been added based on the new patch

FMUL/FADD for FMAC pairing removed, based on some corrections of the znver1 pipeline description.

32kB L1 I$ has been mentioned in some patents. With enough ways, a fast L2$ and a uOp cache this should be enough, I think.

issue port descriptions and more data paths added

2R1W and 4 cycle load-to-use-latency added for the L1 D$ based on info found on a LinkedIn profile and the given cylce differences in the znver1 pipeline description

Stack Cache speculatively added based on patents and some interesting papers. This doesn't help so much with performance, but a lot with power efficiency.

It's still interesting, what the first mentioning of fp3 port for FMAC operations was good for. I
thought, it was a typo, but more of the kind "fp3" instead of "fp2" in one case. It could still be related to register file port usage and/or bridged FMA, but probably not that useful for telling the compiler. Due to the correction patch I'm still looking further into the FPU topic, as promised earlier. I'll cover that in a followup posting.

Finally there is a hint at good hardware prefetcher performance (or bad interferences?), as AMD recommends to switch off default software prefetching for the znver1 target in GCC.

Monday, February 8, 2016

Kristian Gocnik (@I_biT_MySeLf) tipped me off about new mobile Bristol Ridge SKUs, which appeared on usb.org as you can see [UPDATE: the entries have been removed now - visiting the pages may delete your only cached copy in the browser] here and here. That's the same site, where the first Bristol Ridge SKU (FX-9830P) appeared on. I put this together with information found in the leaked slide by Benchlife.info, which you can find in my blog post about a WEI result of an A10-9600P.

Using the mobile Carrizo SKUs, the leaked A10-9600P clock, and some sorting, it was easy to map the SKUs to the leaked slide's data. Kristian Gocnik tried it independently and we got the same mapping, except for a consumer A8-9500P he speculatively derived from the pro model, but which is missing on usb.org. So the resulting table likely represents what AMD is going to release as mobile Bristol Ridge chips for the FP4 socket later this year.

The model numbers likely simply jumped by one thousand from Carrizo's and an additional thirty points for the 35W variants. Carrizo's wide TDP ranges got split into 15W and 35W TDPs. This might help to avoid the confusion about 15W and 35W Carrizos laptops. The CPU base clocks jumped significantly, while CPU Turbo and (maximum) GPU clocks kind of matured with the fab process.

"For its part, AMD engineers showed smart ways of squeezing as much as 15% more performance out of its Carrizo PC processor, simply by applying more aggressive power management to the 28nm design. The Bristol Ridge design was a study in using power management to overcome performance limits tied to heat, voltage and current."

Months after the first leaked WEI score, first true Bristol Ridge benchmarks will show, how this improvement translates into real world performance. Hopefully they get tested with dual channel memory, even if AMD or OEMs only provide single channel equipped/designed devices, as for the recent AnandTech Carrizo review.

"Core complex" should be similar to "compute unit" and has been used in some AMD patents already. The expression marked in red means a shift right by 3, which equals a division by 8. So with two logical cores per physical core due to SMT, a core complex should contain four Zen cores and a shared LLC.

The next line shows the socket ID being shifted left by 3, leaving 3 bits for the core complex ID, which suggests a maximum number of eight core complexes per socket, or 32 physical cores. This number should first be seen as a placeholder, but we've already seen rumours mentioning that many cores.