Cortex-A76 (codename Enyo) is the successor to the Cortex-A75, a low-power high-performance ARMmicroarchitecture designed by ARM Holdings for the mobile market. Enyo was designed by Arm's Austin, Texas team. This microarchitecture is designed as a synthesizable IP core and is sold to other semiconductor companies to be implemented in their own chips. The Cortex-A76, which implemented the ARMv8.2 ISA, is the a performant core which is often combined with a number of lower power cores (e.g. Cortex-A55) in a DynamIQ big.LITTLE configuration to achieve better energy/performance.

The Cortex-A76 is a high-performance synthesizable core designed by Arm as the successor to the Cortex-A75. It is delivered as Register Transfer Level (RTL) description in Verilog and is designed. This core supports the ARMv8.2 extension as well as a number of other partial extensions. The A76 is a 4-way superscalar out-of-order processor with a private level 1 and level 2 caches. It is designed to be implemented inside the DynamIQ Shared Unit (DSU) cluster along with other cores. The DSU cluster supports up to eight cores of any combination (e.g., with little cores such as the Cortex-A55 or other just more Cortex-A76).

The Cortex-A76 succeeds the Cortex-A75. It is designed to take advantage of the 7 nm node in order to deliver up to 40% higher performance at the same power level (measured at 750 mW/core), or alternatively, up to 50% lower power for the same performance compared to the Cortex-A75 on the 10 nm node. This is achieved through a combination of both microarchitectural improvements as well as process technology advantages. It's worth noting that the A76 brings higher performance at a slight hit to the area by going wider. On the 7 nm process, the Cortex-A76 targets frequencies of 3 GHz and higher.

The Cortex-A76 is a complex, 4-way superscalar out-of-order processor with an 8-issue back end. The pipeline is 13 stages with an 11-cycle branch misprediction penalty. It has a 64 KiB level 1instruction cache and a 64 KiB level 1data cache along with a private level 2 cache that is configurable as either 256 KiB (1 bank) or 512 KiB (2 banks)

Each cycle, up to 16 bytes are fetched from the L1 instruction cache. The instruction fetch works in tandem with the branch predictor in order to ensure the instruction stream is ready to be fetched. Additionally, there is a return stack which stores the address and instruction set state (AArch32/R14 or AArch64/X30) on branches. On a return (e.g., ret on AArch64), the return stack will pop. The branch prediction unit on the A76 is decoupled from the instruction fetch, allowing it to run ahead. To that end, it now operates on 32-byte instruction windows, twice the fetch size.

From the instruction fetch, up to four 32-bit instructions are sent to the decode queue (DQ) each cycle. For narrower 16-bit instructions (i.e., Thumb), this means up to eight instructions get queued. The A76 features a 4-way decode. Up to four instructions may be decoded into macro-operations each cycle.

From the front-end, up to four macro-operations may be sent each cycle to be renamed. The A76 has a capacity to handle up to 128 instructions in flight, a number that has no increased for a long time (the Cortex-A72 and even the Cortex-A57 had an out-of-order window of 128 instructions). Micro-operations are broken down into their µOP constituents and are scheduled for execution. Roughly 20% more µOPs are generated from the MOPs. From here, µOPs are sent to the instruction issue which controls when they can be dispatched to the execution pipelines. µOPs are queued in eight independent issue queues (120 entries in total).

The A76 issue is 8-wide, allow for up to eight µOPs to execute each cycle. The execution units can be grouped into three categories: integer, advanced SIMD, and memory.

There are four pipelines in the integer cluster - three for general math operations and a dedicate branch ALU. All three ports have a simple ALU. Those perform arithmetic and logical data processing operations. The third port has support for complex arithmetic (e.g. MAC, DIV).

There are two ASIMD/FP execution pipelines. In the Cortex-A75, each of the pipelines were 64-bit wide, on the A76, they were doubled to 128-bit. This means each pipeline is capable of 2 double-precision operations, 4 single-precision, 8 half-precision, or 16 8-bit integer operations. On the A76, those pipelines can also execute the cryptographic instructions if the extension is supported (not offered by default and requires an additional license from Arm).

The A76 includes two ports with an address-generation unit on each. The level 1 data cache is fixed at 64 KiB and can have an optional ECC protection per 32 bits. It is virtually indexed, physically tagged which behaves as a physically indexed, physically tagged 4-way set-associative cache. The L1D cache implements a pseudo-LRUcache replacement policy. It features a 4-cycle fastest load-to-use latency with two read ports and one write port meaning it can do two 16B loads/cycle and one 32B store/cycle. From the L1, the A76 supports up to 20 outstanding non-prefetch misses. The load buffer is 68 entries deep while the store buffer is 72-entry deep. In total, the A76 can have 140 simultaneous memory operations in-flight which is actually 25% more than the A76 instruction window.

The A76 can be configured with either 128, 256 or 512 KiB of level 2 cache. It implements a dynamic biased replacement policy and is ECC protected per 64 bits. The L2 is strictly inclusive of the L1 data cache and non-inclusive of the L1 instruction cache. There is a 256-bit write interface to the L2 and a 256-bit read interface from the L2 cache. The fastest load-to-use latency is 9 cycles. The L2 can support up to 46 outstanding misses to the L3 which is located in the DSU itself. The L3, which is shared by all the cores in the DynamIQ big.LITTLE and is configurable in size ranging from 2 MiB to 4 MiB with load-to-use ranging from 26 to 31 cycles. As with the L2, up to 32 bytes may be transferred from or to the L2 from the L3 cache. Up to 94 outstanding misses are supported from the L3 to main memory.

In addition to controlling memory accesses, ordering, and cache policies, the MMU is also responsible for the translation of virtual addresses to physical addresses on the system. This is done through a set of virtual-to-physical address mappings and attributes that are held in translation tables. The physical address size here is 40 bits. The Cortex-A76 incorporates a dedicated L1 TLB for instruction cache and another one for the data cache. Both the ITLB and the DTLB are 48-entry deep and are fully associative. On a memory access operation, the A76 will first perform lookup in there. If there is a miss in the L1 TLBs, the MMU will perform a lookup for the requested entry in the second-level TLB.

There is a unified level 2 TLB comprising of 1280 entries organized as 5-way set associative which is shared by both instruction and data. The STLB handles misses from the instruction and data L1 TLBs. Typically, STLB accesses take three cycles, however, longer latencies are possible when a different block or page size mapping is used. If there is a miss in the L2 TLB, the MMU will resort to a hardware translation table walk. Up to four TLB misses (i.e., translations table walks) can be performed in parallel. The STLB will stall if there are six successive misses. During table walks, the STLB can still perform up to two TLB lookups.

The TLB entries store one or both of the global indicator and an address space identifier (ASID), allowing context switching without TLB invalidation as well as a virtual machine identifier (VMID) which allows for VM switching by the hypervisor without TLB invalidation.