ARM Goes 64-bit

Memory Ordering Model

In defining ARMv8, the architects paid careful attention to specifying a clean memory model. This is particularly crucial for an architecture that will be implemented by many different teams, since memory ordering is responsible for some of the most complex and difficult bugs in both hardware and software.

ARMv8 has a Release Consistency memory model, which is relatively weak. It is very similar to the Itanium memory model, and aligns well with C++11. This choice was motivated primarily by power efficiency. Generally, weak ordering models are more difficult to program, because there are few guarantees. However, weak ordering can also reduce the buffering that is required for in-flight loads and stores in a multi-processor system and reduce power consumption.

In the ARMv8 memory model, an aligned memory access that targets a single GPR is guaranteed to be atomic. Load pair and store pair instructions are guaranteed to appear as two individual atomic accesses, if targeting GPRs and naturally aligned. Unaligned accesses are not atomic, and as a practical matter are likely to be split into at least two accesses and a shift. Moreover, vector memory accesses (whether SIMD or scalar FP) are not guaranteed to be atomic at all. To allow programmers to write concurrent software, a number of synchronization primitives are available.

ARMv7 and v8 feature three different types of barriers: a Data Synchronization Barrier (DSB), a Data Memory Barrier (DMB), and an Instruction Synchronization Barrier (ISB). A DSB stalls the processor until all pending loads and stores have completed. A DMB forces all earlier (in program order) memory accesses to become globally visible before any subsequent accesses. An ISB flushes the CPU pipeline and any prefetch buffers, forcing any subsequent instructions to be fetched from cache or memory. Since ARM does not have coherent instruction caching, this is necessary (but not sufficient) for modifying instructions in memory.
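The effect of a full data memory barrier can be illustrated with the classic store-buffering litmus test. The sketch below is an assumption-laden illustration in portable C++11, not ARM code from the architecture documents: the function names are hypothetical, and it relies on compilers commonly lowering a sequentially consistent fence to a DMB on ARM targets. DSB and ISB have no portable C++ equivalent and are not shown.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical store-buffering litmus test; names are illustrative only.
// Each thread stores to one variable, executes a full fence, then loads
// the other variable. Returns true if both loads observed 0, an outcome
// the fences forbid under the C++11 memory model.
bool both_reads_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        // Full barrier: the store above is ordered before the load below.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the litmus test and counts forbidden outcomes.
int count_forbidden_outcomes(int trials) {
    assert(trials > 0);
    int forbidden = 0;
    for (int i = 0; i < trials; ++i) {
        if (both_reads_zero_once()) ++forbidden;
    }
    return forbidden;
}
```

Without the two fences, a weakly ordered machine may let each load complete before the other thread's store becomes visible, so both threads could read 0. With the fences in place, at least one thread must observe the other's store.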

ARMv7 and v8 also incorporate exclusive (or atomic) memory accesses, which are sometimes described as load-linked and store-conditional (LL/SC). The load-linked instruction will read a value from an address in memory, and then the store-conditional will write a new value to the same address in memory if no other writes to the address have occurred since the load-linked. LL/SC is quite useful for constructing other synchronization primitives such as spinlocks. LL/SC can also be combined with pair instructions to atomically update a location that spans two registers.
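As a rough illustration of how a spinlock can be layered on LL/SC, here is a minimal sketch in portable C++11. It does not use the exclusive instructions directly. Instead it uses compare_exchange_weak, which is allowed to fail spuriously precisely so that compilers can implement it with a load-linked/store-conditional pair on architectures that have one. The SpinLock class and count_with_spinlock helper are hypothetical names introduced for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical spinlock for illustration; not taken from ARM documentation.
class SpinLock {
public:
    void lock() {
        bool expected = false;
        // Retry until the flag is atomically changed from false to true.
        // A spurious failure corresponds to a store-conditional that lost
        // its reservation; the loop simply tries again.
        while (!locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
            expected = false;  // a failed exchange overwrites `expected`
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Starts `threads` workers that each increment a shared, non-atomic
// counter `iters` times while holding the lock, then returns the total.
long count_with_spinlock(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

If the lock works, no increments are lost, so four threads performing 10,000 increments each should leave the counter at 40,000.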

ARMv8 introduces the new and elegant one-sided fences associated with Release Consistency: load-acquire and store-release. Unlike the barriers in ARMv7, these fences are address-based synchronization primitives. A load-acquire guarantees that any later (in program order) memory accesses will only be visible after the load-acquire. A store-release guarantees that all earlier memory accesses will be visible before the store-release becomes visible. Moreover, the store-release becomes visible to all caching agents in the system simultaneously. The two can also be combined to form a full fence: a store-release and a load-acquire will be globally visible in program order.
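The usual way to exercise load-acquire and store-release is message passing: one thread writes ordinary data and then publishes a flag with a store-release, while another thread polls the flag with a load-acquire before reading the data. The following is a minimal sketch in C++11, whose release and acquire orderings correspond closely to these one-sided fences. The pass_message function and its variable names are hypothetical, and the exact instructions a compiler emits for them depend on the target and toolchain.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical producer/consumer pair; names are illustrative only.
// Passes `value` from one thread to another through a plain int that is
// guarded by an atomic flag, and returns what the consumer observed.
int pass_message(int value) {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int seen = -1;

    std::thread producer([&] {
        payload = value;  // earlier access
        // Store-release: the payload write is visible before the flag.
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Load-acquire: later accesses cannot move ahead of this load.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // later access, ordered after the acquire
    });
    producer.join();
    consumer.join();

    assert(seen == value);  // guaranteed by the release/acquire pairing
    return seen;
}
```

Only the flag needs to be atomic. Because the fences are one-sided, the producer's earlier accesses and the consumer's later accesses are ordered, but unrelated accesses on the other side of each fence remain free to be reordered.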

The address-based synchronization primitives, load-acquire/store-release and LL/SC, are all limited to base register addressing, with no offsets, indexing, or increments, which simplifies the implementation.

Conclusion

The ARMv8 architecture is classically British: a clean and elegant 64-bit instruction set, with backwards compatibility for existing 32-bit software. The new AArch64 is certainly an improvement over ARMv7, with many enhancements above and beyond simply extending the virtual address space to 48 bits.

The most notable additions in ARMv8 are the larger and highly regular integer register file, double-precision vectors with IEEE support, and new synchronization primitives with a well-defined memory ordering model. In some respects though, the more significant changes came not from adding features, but from removing them.

Like x86, ARMv7 had a fair bit of cruft, and the architects took care to remove many of the byzantine aspects of the instruction set that were difficult to implement. The peculiar interrupt modes and banked registers are mostly gone. Predication and implicit shift operations have been dramatically curtailed. The load/store multiple instructions have also been eliminated, replaced with load/store pair. Collectively, these changes make AArch64 potentially more efficient than ARMv7 and easier to implement in modern process technology.

There are no ARMv8 implementations available to judge the merits of the architecture in practice. But overall, ARMv8 is clearly a sound design that was well thought out and should enable reasonable implementations.

The vast majority of companies will wait for a licensable core design from ARM. However, those with the resources and expertise to design a CPU core will forge ahead and should have a time to market advantage and a potential differentiating factor. Applied Micro should be first to market, but others will swiftly follow, including Cavium Networks, Qualcomm, Samsung, and Nvidia.

Certainly, the next few years should prove very interesting. The number of ARMv8 architecture licensees looks set to grow, which should inject some additional diversity into the industry. However, it is unclear whether the market is large enough to support so many companies in the long term. Future ARMv8 cores will undoubtedly be found in Apple’s iPhone and iPad, along with Android devices from TI, Samsung, and others. The real question is whether ARMv8 will enable ARM’s partners to move up the value chain to servers and notebooks. However, that requires competing with Intel, which has a massive advantage in process technology over the rest of the industry.