Qualcomm moved its Snapdragon designers to its ARM server chip. We peek at the results – The Register

Hot Chips Qualcomm moved engineers from its flagship Snapdragon chips, used in millions of smartphones and tablets, to its fledgling data center processor family Centriq.

This shift in focus, from building the brains of handheld devices to concentrating on servers, will be apparent on Tuesday evening, when the internal design of Centriq is due to be presented at engineering industry conference Hot Chips in Silicon Valley.

The reassignment of a number of engineers from Snapdragon to Centriq may explain why the mobile side switched from its in-house-designed Kryo cores to using off-the-shelf ARM Cortex cores, or minor variations of them. Effectively, it put at least a temporary pause on fully custom Kryo development.

Not all the mobile CPU designers were moved, and people can be shifted back as required, we’re told. Enough of the team remained on the mobile side to keep the Snapdragon family ticking over, The Register understands from conversations with company execs.

Late last year, Qualcomm unveiled the Snapdragon 835, its premium system-on-chip that will go into devices from top-end Android smartphones to Windows 10 laptops this year. That processor uses not in-house Kryo cores but slightly modified off-the-shelf CPU cores, likely a mix of four Cortex-A53s and four A72s or A73s licensed from ARM. Qualcomm dubs these “semi-custom” and “built on ARM Cortex technology.”

In May, Qualcomm launched more high-end Snapdragons for smartphones: the 660 and the 630. However, the 660 uses eight Kryo cores cannibalized from the Snapdragon 820 series, and the 630 uses eight stock ARM Cortex-A53 cores.

This isn’t to say ARM’s stock cores are naff. This shift means Qualcomm’s other designs (its GPUs, DSPs, machine-learning functions, and modems) have to shine to differentiate its mobile system-on-chips from rivals also using off-the-shelf Cortexes. It’s a significant step for Qualcomm, which is primarily known for its mobile processors and radio modem chipsets.

For what it’s worth, Qualcomm management say they’re simply using the right cores at the right time on the mobile side, meaning the off-the-shelf Cortex CPUs are as good as their internally designed Snapdragon ones.

On Tuesday evening, an outline of the Centriq 2400 blueprints will be presented by senior Qualcomm staffers to engineers and computer scientists at Hot Chips in Cupertino, California. We’ve previously covered the basics of this 10nm ARMv8 processor line. Qualy will this week stress that although its design team drew from the Snapdragon side, Centriq has been designed from scratch specifically for cloud and server workloads.

Centriq overview … Source: Qualcomm

This is where you can accuse Qualcomm of having its cake and eating it, though: in its Hot Chips slides, seen by The Register before the weekend, the biz boasts that Centriq uses a “5th generation custom core design” and yet is “designed from the ground up to meet the needs of cloud service providers.”

By that, it means the engineers working on it, some of whom are from the Snapdragon side, are on their fifth generation of custom CPU design, but started from scratch to make a server-friendly system-on-chip, said Chris Bergen, Centriq’s senior director of product management.

However you want to describe it, looking at the blueprints, you can tell it’s not exactly a fat smartphone CPU.

Its 48 cores, codenamed Falkor, run 64-bit ARMv8 code only. There’s no 32-bit mode. The system-on-chip supports ARM’s hypervisor privilege level (EL2), provides a TrustZone (EL3) environment, and optionally includes hardware acceleration for AES, SHA1 and SHA2-256 cryptography algorithms. The cores are arranged on a ring bus kinda like the one Intel just stopped using in its Xeons. Chipzilla wasn’t comfortable ramping up the number of cores in its chips using a ring, opting for a mesh grid instead, but Qualcomm is happy using a fast bidirectional band.

The shared L3 cache is attached to the ring and is evenly distributed among the cores, it appears. The ring interconnect has an aggregate bandwidth of at least 250GB/s, we’re told. The ring is said to be segmented, which we’re led to believe means there is more than one ring. So, 24 cores could sit on one ring, and 24 on another, and the rings hook up to connect everything together.
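
To get a feel for why a bidirectional ring helps, here is a rough Python sketch of hop counts on one ring segment. It assumes 24 stops per ring, one per core, purely for illustration: Qualcomm has not confirmed the segment layout, and since the cores are paired into duplexes the real stop count may well differ. The function name is ours.

```python
def ring_hops(src: int, dst: int, stops: int) -> int:
    """Shortest hop count between two stops on a bidirectional ring."""
    if stops <= 0:
        raise ValueError("a ring needs at least one stop")
    if not (0 <= src < stops and 0 <= dst < stops):
        raise ValueError("stop index out of range")
    clockwise = (dst - src) % stops
    # Traffic can travel either way round, so take the shorter direction.
    return min(clockwise, stops - clockwise)


STOPS = 24  # assumption: one ring segment with one stop per core

# Farthest any stop can be from stop 0 when both directions are usable.
worst_case = max(ring_hops(0, dst, STOPS) for dst in range(STOPS))
print(worst_case)
```

Under those assumptions the farthest stop is half the ring away, whereas a one-way ring would need up to 23 hops to reach the neighbor just behind the sender.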

Speaking of caches, Qualcomm is supposed to be shipping this chip in volume this year but is still rather coy about the cache sizes. Per core, there’s a 24KB 64-byte-line L0 instruction cache, a 64KB 64-byte-line L1 I-cache, and a 32KB L1 data cache. The rest, namely the L2 and L3 sizes, are still unknown. The silicon is sampling, and thus you have to assume Intel, the dominant server chipmaker, already has its claws on a few of them and has studied the design. Revealing these details wouldn’t tip Qualcomm’s hand to Chipzilla.
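
For the curious, the per-core instruction cache figures above translate into line counts as follows. This is plain arithmetic in Python, assuming “KB” means 1,024 bytes; the helper name is ours, and the L1 data cache is left out because its line size wasn’t given.

```python
KIB = 1024  # assumption: the quoted "KB" figures are binary kilobytes


def line_count(size_bytes: int, line_bytes: int) -> int:
    """Number of cache lines in a cache of the given size and line size."""
    if size_bytes % line_bytes:
        raise ValueError("cache size must be a whole number of lines")
    return size_bytes // line_bytes


l0_lines = line_count(24 * KIB, 64)   # L0 instruction cache
l1i_lines = line_count(64 * KIB, 64)  # L1 instruction cache

print(l0_lines, l1i_lines)
```

That works out to 384 lines in the L0 and 1,024 in the L1 I-cache.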

Get on my level … The L1 and L0 caches

The L0 cache is pretty interesting: it’s an instruction fetch buffer built as an extension to the L1 I-cache. In other words, it acts like a typical frontend buffer, slurping four instructions per cycle, but functions like a cache: it can be invalidated and flushed by the CPU, for example. The L2 cache holds both data and instructions, and is an eight-way job with 128-byte lines and a minimum latency of 15 cycles for a hit.

Let me level with you … The L2 cache

The L3 cache has a quality-of-service function that allows hypervisors and kernels to organize virtual machines and threads so that, say, a high priority VM is allowed to occupy more of the cache than another VM. The chip can also compress memory on the fly, with a two to four cycle latency, transparent to software. We’re told 128-byte lines can be squashed down to 64-byte lines, where possible, with error correction.
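
Qualcomm hasn’t said how the compression works, so the following Python sketch only illustrates the “where possible” part: a 128-byte line is stored in half the space only if it actually squashes down to 64 bytes or fewer. It uses zlib as a stand-in compressor, which is our assumption and nothing like what hardware with a two to four cycle budget would do.

```python
import hashlib
import zlib

LINE_BYTES = 128    # uncompressed line size quoted above
TARGET_BYTES = 64   # size a line must reach to be stored compressed


def fits_in_half_line(line: bytes) -> bool:
    """True if the line compresses to TARGET_BYTES or fewer (zlib stand-in)."""
    if len(line) != LINE_BYTES:
        raise ValueError("expected exactly one 128-byte line")
    return len(zlib.compress(line)) <= TARGET_BYTES


# A zero-filled line is highly repetitive and compresses easily.
zero_line = bytes(LINE_BYTES)

# Four SHA-256 digests give 128 deterministic, random-looking bytes.
noisy_line = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(4))

print(fits_in_half_line(zero_line), fits_in_half_line(noisy_line))
```

Repetitive data fits; random-looking data doesn’t, and would presumably stay at its full 128 bytes.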

When Qualcomm says you get 48 cores, you get 48 cores. There’s no hyperthreading or similar. The Falkors are paired into duplexes that share their L2 cache. Each core can be powered up and down, depending on the workload, from light sleep (CPU clock off) to full speed. The chip provides 32 lanes of PCIe 3, six channels of DDR4 memory with error correction and one or two DIMMs per channel, plus SATA, USB, serial and general purpose IO interfaces.

I’ve got the power … Energy-usage controls

Digging deeper, the pipeline is variable length, can issue up to three instructions plus a direct branch per cycle, and has eight dispatch lanes. It can execute out of order, and rename resources. There is a zero or one cycle penalty for each predicted branch, a 16-entry branch target instruction cache, and a three-level branch target address cache.

Well oiled system … The Centriq’s pipeline structure

Make like a tree and get outta here … The branch predictor

Hatched, matched, dispatched … The pipeline queues

Loaded questions … The load-store stages of the pipeline

It all adds up … The variable-length integer-processing portion

The chip has an immutable on-die ROM that contains a boot loader that can verify external firmware, typically held in flash, and run the code if it’s legit. A security controller within the processor can hold public keys from Qualcomm, the server maker, and the customer to authenticate this software. Thus the machine should only start up with trusted code, building a root of trust, provided no vulnerabilities are found in the ROM or the early stage boot loaders. There is a management controller on the chip whose job is to oversee the boot process.
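
The idea of each boot stage vouching for the next can be sketched in a few lines of Python. To be clear about the assumptions: the real chip authenticates firmware with public keys, whereas this sketch substitutes a table of trusted SHA-256 digests because that is all the standard library offers, and the stage names and images are made up.

```python
import hashlib
import hmac


def digest(image: bytes) -> str:
    """SHA-256 hex digest of a firmware image."""
    return hashlib.sha256(image).hexdigest()


def verify_chain(stages, trusted_digests):
    """Return the names of stages allowed to run, stopping at the first
    image whose digest is missing from or differs from the trusted table."""
    approved = []
    for name, image in stages:
        expected = trusted_digests.get(name)
        if expected is None or not hmac.compare_digest(digest(image), expected):
            break  # untrusted code: nothing after this point may run
        approved.append(name)
    return approved


# Hypothetical boot stages held in external flash.
firmware = [("stage1", b"first-stage loader"), ("stage2", b"second-stage loader")]
trusted = {name: digest(image) for name, image in firmware}

# Same chain, but with the second stage swapped for something else.
tampered = [firmware[0], ("stage2", b"tampered loader")]

print(verify_chain(firmware, trusted))
print(verify_chain(tampered, trusted))
```

The untouched chain boots both stages; the tampered one stops after the first, which is the behavior you want from a root of trust.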

We’ll be at Hot Chips this week, and will report back with any more info we can find. When prices, cache sizes and other info are known, we’ll do a Xeon-Centriq-Epyc specification comparison.