
Why I think ARM servers will always be slow

My thinking on this was triggered by an interesting discussion on
Slashdot speculating on how the X86 processor the Chinese licensed
from AMD will work.

I do not understand the ARM processor in detail, although I did spend
some time thinking about porting the CVC Verilog simulator to generate
ARM instead of X86_64 assembly.

I think there are reasons ARMs will be slower. The same reasons
may explain why ARMs will always be lower power, but power is not my area.

1) X86 uses a dense encoding for its instruction words. This means
faster decoding and execution of instructions, plus it allows special-purpose
instructions that modern compilers have figured out how to use.

2) The ARM capability to set a bit to turn off instruction execution adds
huge complexity, and I think speed problems, when combined with the
multi-issue and pipelining methods that X86 processors use. At best the
ARM instruction-disable bit causes hardware complexity with no
speed improvement. The Slashdot posters, I think, were saying that
fast hardware means precompiling the low-level parts of the flow graphs
emitted by programming-language compilers.

3) The complicated indexing modes of the X86_64 architecture are important
for fast execution. I think this was anticipated by von Neumann in his
architecture designs from the 1950s.

4) The seeming advantage of ARM (and reduced instruction set designs in
general?) of numerous regular registers was obviated by placing top-of-stack
locations in "internal" registers.

(1) The first thing x64 has to do is unpack instructions (decode), and it typically dedicates at least two extra pipeline stages to this. The headaches include instructions that wrap across fetches, speculative-execution jumps into the middle of a cache line, as well as the unpacking itself. This is so painful that for more than a decade x64 CPUs have kept their innermost instruction cache as unpacked micro-ops, not original instructions.
(2) predicated execution is not part of ARM64 ISA.
(3) The complex operations are in practice converted to multiple micro-ops, each of them simple, with some complexity in making the combinations appear "atomic". Code-generation guidelines avoid doing arithmetic on memory operands and recommend a load-store approach. When used with a load-store code style, ARM64 and X64 work very similarly, with the ability to combine registers and offsets when generating addresses (which most modern code needs).
(4) X64 and ARM64 have the same number of architectural registers. In practice the implementations use up to 200 physical registers, which are renamed (an IBM invention from the 1960s) to maximize opportunities for parallel operation.

In general there is little advantage from an ISA these days. The occasional novelty crops up, like the discovery that one 16-bit floating-point format works significantly better for ML than the one people used to use, so you can get a temporary acceleration by adding that format to your vector processor before the competition gets around to it. Then the gap closes again. And don't forget to analyze the vector processing units along with the conventional ones: ARM Neon vs. AVX512, for example.

The main difference ARM has brought is the access to the world of SOC/ASIC and mixing various off-the-shelf blocks with custom ones. This is a whole huge market that X64 has never really cracked, and it has a lot more to do with open vs. closed fab ecosystems than it does with instruction sets.

The IPC of Apple's processors is close to what we see in x86 land. And ARM's A76 is not too far behind either.

ARM is not intrinsically slower; entering the server market, though, takes much more than a fast core.

Adding extensions like AVX512 is interesting, but at some point we will realize that a general-purpose CPU should not have extensions for every possible application out there. A lot of silicon is wasted on those who make no use of the extensions, and the people who really need them will probably prefer dedicated hardware anyway (e.g. GPU, ML accelerator, ASIC, ...).

However, using hardware accelerators is still too hard; you will be dealing with proprietary libraries and probably some closed-source blobs. Hence we see our general-purpose CPUs becoming fatter.

The space taken up by vector processors on both ARM64 and X64 is surprisingly small. It seems a reasonable cost for adding efficiency to some widely used math and parallel data manipulation. More of an issue (depending on the chip) can be the power or clock setbacks when the vector units are fired up. Meanwhile, if you look at chips like the ones Apple uses, you will see the CPUs are dwarfed by the GPUs and other special-purpose accelerators they put into the SOC. It hardly matters what the CPU is, except that Apple by now probably has a lot of ARM IP and has wrestled the ARM licensing into something they can tolerate. I would sooner expect them to switch to something like MIPS V or their own design just to eliminate IP costs than to have any interest in switching to X64.

The heavy lift on server chips is in the cache and in the IO lanes (including the coherent interconnect between sockets). As you say, more than just a fast core. In principle the cache hierarchy looks just about the same no matter what the core is, so we will see ARM64 chips with server IP blocks on board. The question is, what is the motivation to switch? X64 has two quite competent players now, Intel and AMD; what does ARM64 bring to the table in servers?

Thanks for the posts. What about the extra indexing modes on X86_64s? I have seen very fast code that inverts various base, index, and offset registers. I do not think Apple IPC rates apply because everything is now decomposed into micro-ops.

1) X86 uses a dense encoding for its instruction words. This means
faster decoding and execution of instructions, plus it allows special-purpose
instructions that modern compilers have figured out how to use.

x86 does not use a dense encoding, nor is it fast to decode. AArch64 is significantly denser despite using fixed 32-bit instructions, and its decode typically takes 1 or 2 cycles, several cycles fewer than x86 needs. Compilers avoid complex CISC instructions on x86, so a large fraction of the instruction set is practically never used.

Originally Posted by smeyer0028

2) The ARM capability to set a bit to turn off instruction execution adds
huge complexity, and I think speed problems, when combined with the
multi-issue and pipelining methods that X86 processors use. At best the
ARM instruction-disable bit causes hardware complexity with no
speed improvement. The Slashdot posters, I think, were saying that
fast hardware means precompiling the low-level parts of the flow graphs
emitted by programming-language compilers.

You mean conditional execution? It certainly gives a speedup by reducing branch mispredictions. However, it used too much encoding space on Arm, so it has been replaced with conditional select and conditional compare on AArch64.

Originally Posted by smeyer0028

3) The complicated indexing modes of the X86_64 architecture are important
for fast execution. I think this was anticipated by von Neumann in his
architecture designs from the 1950s.

No, complex indexing modes are slow: they either need extra micro-ops or add an extra cycle of latency to loads. So compilers always try to use the simplest addressing modes.

Originally Posted by smeyer0028

4) The seeming advantage of ARM (and reduced instruction set designs in
general?) of numerous regular registers was obviated by placing top-of-stack
locations in "internal" registers.

A stack is never as fast as a register. No x86 implementation ever placed the top of the stack in registers. Having twice as many architectural registers is certainly a small advantage for AArch64.

Thanks for the posts. What about the extra indexing modes on X86_64s? I have seen very fast code that inverts various base, index, and offset registers. I do not think Apple IPC rates apply because everything is now decomposed into micro-ops.

The complex addressing modes just add complexity. I've not seen cases where it makes sense to repeatedly incur the addressing penalties and the codesize hit. Generally you get smaller and faster code if you compute an address in a temporary and use that repeatedly.

Maybe you're talking about the early days of 286 and 386 which were so slow you needed dirty assembler tricks to get code to run reasonably fast?

The question is, what is the motivation to switch? X64 has two quite competent players now, Intel and AMD; what does ARM64 bring to the table in servers?

Lots of reasons: higher density, more cores and higher bandwidth per socket, 2x gain in perf/Watt, lower cost. While AMD is finally in a better position, Intel has ~97% of the server market, so there is effectively no competition. Innovation is usually not mentioned, but if you want you can build an Arm CPU with a 1024-bit wide vector unit, 8+ memory channels or 128 cores - you're not limited to what Intel wants to sell you. This is why most supercomputers have adopted AArch64.

The complex addressing modes just add complexity. I've not seen cases where it makes sense to repeatedly incur the addressing penalties and the codesize hit. Generally you get smaller and faster code if you compute an address in a temporary and use that repeatedly.

Maybe you're talking about the early days of 286 and 386 which were so slow you needed dirty assembler tricks to get code to run reasonably fast?

The most common mode is register+offset. Base+index+offset can make sense in some code, for example C++ access to multiply inherited objects, or relocatable data with internal pointers. It decodes cleanly into micro-ops and has no unusual execution penalty. IIRC, you can do the same thing with the ARM64 instruction set.

The thing that was a classic problem for RISC, and is still a problem today, is using a memory operand in arithmetic instead of just load and store. It is not too bad when the destination is a register, but the implied atomicity when the destination is memory is painful.

Lots of reasons: higher density, more cores and higher bandwidth per socket, 2x gain in perf/Watt, lower cost. While AMD is finally in a better position, Intel has ~97% of the server market, so there is effectively no competition. Innovation is usually not mentioned, but if you want you can build an Arm CPU with a 1024-bit wide vector unit, 8+ memory channels or 128 cores - you're not limited to what Intel wants to sell you. This is why most supercomputers have adopted AArch64.

Those are all apocryphal. Try finding an actual shipping product and you will find that only a small subset of those advantages is available. Now, you may be able to select the IP blocks which deliver just the subset that interests you (add a wide vector unit, for example), but that is really a consequence of the fab ecosystem, not a property of AArch64. And it might make a chip no one else will use, which is why Intel (and AMD) stick to features they believe are broadly useful and well balanced.

The balance may shift. But building server class chips is complex, not just the chip but the entire process of supporting it into production. Compatibility with peripherals and memory, extensive validation, support for firmware, diagnostics, etc. Look at how many companies recently took a run at ARM64 server chips and dropped out with no product to show for it.

Making SOCs for dedicated products in a walled product garden is in many ways easier.

Any opinions on how new memory types like 3D XPoint and Crossbar will impact these processors and their integration into these systems, either internally or externally?

I don't see those replacing DRAM any time soon. Restrictions on endurance and throughput rule them out for carefree operation, not to mention the problem of higher latency. My personal opinion is that those new memories will play a large role in a NUMA approach, as an extended, lower-price tier behind the (still fairly large) working memory. There is a lot of maturity needed in both the hardware and the software to get from here to there. Today the most viable uses are apps and services specially tuned for these larger memories, but falling DRAM prices are making it a tougher market for new entrants. If they had been ready a year ago it would have been a slam dunk. In the long run there will be new memory types: DRAM hit a wall on capacitor size five years ago, so it is just a matter of time until a competitor gains enough of an edge. It is by far the largest, most lucrative sitting duck in the semiconductor industry.

But a really tough, well established and competent duck. Nothing yet quacks like it.


Last edited by Tanj; 4 Weeks Ago at 05:14 PM.
Reason: needed to be more complete

The most common mode is register+offset. Base+index+offset can make sense in some code, for example C++ access to multiply inherited objects, or relocatable data with internal pointers. It decodes cleanly into micro-ops and has no unusual execution penalty. IIRC, you can do the same thing with the ARM64 instruction set.

AArch64 supports base + index*scale and base + offset, but not index and offset together. The details of indexing on x86 are extremely complex, but extra micro-ops are required for many cases on all microarchitectures. Also, loads with a small immediate offset are one cycle faster than complex loads (large immediate or indexed). Basically, complex addressing modes are best avoided, and this has been true for many decades.

You also forgot to mention that the big cloud players want their own chips. Amazon is already going in that direction, and it makes sense for them. Fire up a cloud instance and you will be very upset with the performance you get. Not only are there losses related to the virtualization stack, but those large caches are often shared among many noisy neighbors. Also, most server processors used by cloud providers run at ~2.7GHz, which is quite slow. They become fast when you have 32 cores like that, but the most popular cloud instances use 2-4 real cores (4-8 vCPUs; providers double the number because of hyperthreading). Not to mention the huge premium for Intel's server chips. You would currently pay ~$200/month for a cloud instance with 4 real cores @ 2.7GHz. Cloud computing needs to lower prices by an order of magnitude, and this is where ARM fits in.