Introduction to ARMv8 64-bit Architecture

Introduction

The ARM architecture is a Reduced Instruction Set Computer (RISC) architecture; indeed, the name originally stood for “Acorn RISC Machine” and now stands for “Advanced RISC Machines”.
In recent years, with the spread of smartphones and tablets, ARM processors have become very popular, mostly thanks to their lower cost and better power efficiency compared with other architectures such as CISC:

Complex Instruction Set Computer (CISC) processors, like the x86, have a rich instruction set capable of doing complex things with a single instruction. Such processors often have significant amounts of internal logic that decode machine instructions to sequences of internal operations (microcode). RISC architectures, in contrast, have a smaller number of more general purpose instructions, that might be executed with significantly fewer transistors, making the silicon cheaper and more power efficient. Like other RISC architectures, ARM cores have a large number of general-purpose registers and many instructions execute in a single cycle. It has simple addressing modes, where all load/store addresses can be determined from register contents and instruction fields.

A peculiarity of RISC architectures (ARM, MIPS, …):

The load/store architecture only allows memory to be accessed by load and store operations, and all values for an operation need to be loaded from memory and be present in registers, so operations such as “add reg,[address]” are not permitted!

Another difference from CISC architectures: when a Branch with Link is executed (the equivalent of the “call” operation on Intel architectures), the return address is stored in a special register and not on the stack.

Cortex-A50: ARMv8-A 64-bit with load-acquire and store-release features, which are an excellent match for the C++11, C11 and Java memory models (2011).

Extensions

Every new version of the ARM architecture brings new extensions; the v8 architecture has these:

Jazelle is a Java hardware/software accelerator: “ARM Jazelle DBX (Direct Bytecode eXecution) technology for direct bytecode execution of Java”. On the software side, Jazelle MobileVM is a complete multi-tasking JVM, engineered to provide high-performance multi-tasking in a very small memory footprint.

Cryptographic Extension is an extension of the SIMD support and operates on the vector register file. It provides instructions for the acceleration of encryption and decryption to support the following: AES, SHA1, SHA2-256.

TrustZone is a system-wide approach to security for a wide array of client and server computing platforms, including payment protection technology, digital rights management, BYOD, and a host of secured enterprise solutions.

The virtualization extensions provide the basis for ARM architecture compliant processors to address the needs of both client and server devices for the partitioning and management of complex software environments into virtual machines.

The Large Physical Address extension provides the means for each of the software environments to utilize efficiently the available physical memory when handling large amounts of data.

T32: 16-bit instructions are decompressed transparently to full 32-bit ARM instructions in real time without performance loss. Thumb-2 technology made Thumb a mixed-length (32- and 16-bit) instruction set.

Data types

Data types are simply these:

Byte: 8 bits.

Halfword: 16 bits.

Word: 32 bits.

Doubleword: 64 bits.

Quadword: 128 bits.

The architecture also supports the following floating-point data types:

Half-precision floating-point formats.

Single-precision floating-point format.

Double-precision floating-point format.

In this short guide I don’t cover floating-point assembly instructions, to keep it from getting too long; if you want to know more, see the ARM Architecture Reference Manual.

Exception levels

There are four exception levels, which replace the 8 different processor modes. They work like the rings in Intel architectures; they are a form of privilege hierarchy:

EL0 is the least privileged level; indeed, it is called unprivileged execution. Apps run here.

EL1: the OS kernel can run here.

EL2 provides support for virtualization of Non-secure operation. The hypervisor can run here.

EL3 provides support for switching between two Security states, Secure state and Non-secure state. The Secure monitor can run here.

When executing in AArch64 state, execution can move between Exception levels only on taking an exception or on returning from an exception.
Each Exception level has its own banked Stack Pointer (SP_ELx); EL1, EL2 and EL3 also each have a private Exception Link Register (ELR_ELx) and Saved PSR (SPSR_ELx).

Interprocessing: AArch64 <=> AArch32

Interprocessing is the term used to describe moving between the AArch64 and AArch32 Execution states.
The Execution state can change only on a change of Exception level. This means that the Execution state can change only on taking an exception to a higher Exception level, or returning from an exception to a lower Exception level.
On taking an exception to a higher Exception level, the Execution state either:

Remains unchanged.

Changes from AArch32 state to AArch64 state.

On returning from an exception to a lower Exception level, the Execution state either:

Remains unchanged.

Changes from AArch64 state to AArch32 state.
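
The rules above can be condensed into a small check. This is only a sketch in Python (not something the architecture defines): the function name and the string labels for the states are made up for illustration, and Exception levels are represented as the integers 0 to 3.

```python
def state_change_allowed(from_el, to_el, from_state, to_state):
    """Return True if the Execution state transition follows the rules above."""
    if from_state == to_state:
        return True                       # the state may always remain unchanged
    if to_el > from_el:                   # exception taken to a higher level
        return (from_state, to_state) == ("AArch32", "AArch64")
    if to_el < from_el:                   # exception return to a lower level
        return (from_state, to_state) == ("AArch64", "AArch32")
    return False                          # same level: the state cannot change
```

For example, taking an exception from EL0 to EL1 may switch from AArch32 to AArch64, but never the other way round.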

The A64 Registers

A64 has 31 general-purpose (integer) registers, plus the zero register and the current stack pointer register. Here are all the registers:

Wn (32 bits): General-purpose register; n can be 0-30

Xn (64 bits): General-purpose register; n can be 0-30

WZR (32 bits): Zero register

XZR (64 bits): Zero register

WSP (32 bits): Current stack pointer

SP (64 bits): Current stack pointer
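
Wn and Xn are two views of the same register: Wn is the low 32 bits of Xn, and writing Wn clears the upper 32 bits of Xn. A minimal Python sketch of that relationship follows; the helper names read_w and write_w are invented for illustration.

```python
MASK32 = (1 << 32) - 1

def read_w(x_value):
    """Reading Wn returns the low 32 bits of Xn."""
    return x_value & MASK32

def write_w(old_x_value, value):
    """Writing Wn zero-extends: the old upper 32 bits of Xn are discarded."""
    return value & MASK32

x0 = 0x1122334455667788
w0 = read_w(x0)                   # low half of x0
x0 = write_w(x0, 0xFFFFFFFF)      # upper half of x0 is now zero
```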

How the registers should be used by compilers and programmers:

r30 (LR): The Link Register, is used as the subroutine link register (LR) and stores the return address when Branch with Link operations are performed.

r29 (FP): The Frame Pointer

r19…r28: Callee-saved registers

r18: The Platform Register, if needed; otherwise a temporary register.

r17 (IP1): The second intra-procedure-call temporary register (can be used by call veneers and PLT code); at other times may be used as a temporary register.

r16 (IP0): The first intra-procedure-call scratch register (can be used by call veneers and PLT code); at other times may be used as a temporary register.

r9…r15: Temporary registers

r8: Indirect result location register

r0…r7: Parameter/result registers

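Access to the PC (program counter) is limited: it is not a general-purpose register, and only a few instructions can modify it (branches such as B, BL and RET) or read it (PC-relative instructions such as ADR and ADRP).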

The use of Stack

The stack implementation is full-descending: on a push the stack pointer is decremented, i.e. the stack grows towards lower addresses.
Another feature is that the stack must be quadword aligned: SP mod 16 = 0.
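
To keep SP mod 16 = 0, a function that needs N bytes of locals has to reserve N rounded up to a multiple of 16. The following Python sketch shows the arithmetic; align_up and push_frame are illustrative names, not part of any toolchain.

```python
def align_up(size, alignment=16):
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

def push_frame(sp, size):
    """Full-descending stack: reserve a frame by moving SP to a lower address."""
    assert sp % 16 == 0, "SP must already be quadword aligned"
    new_sp = sp - align_up(size)
    assert new_sp % 16 == 0
    return new_sp
```

A function that needs 76 bytes would therefore reserve 80, which is why frame sizes emitted by compilers are always multiples of 16.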

A64 instructions can use the stack pointer only in a limited number of cases:

Load/Store instructions use the current stack pointer as the base address: when stack alignment checking is enabled by system software and the base register is SP, the current stack pointer must be quadword aligned, that is, aligned to 16 bytes. Misalignment generates a Stack Alignment fault.

Add and subtract data processing instructions in their immediate and extended register forms, use the current stack pointer as a source register or the destination register or both.

Logical data processing instructions in their immediate form use the current stack pointer as the destination register.

Process State

PSTATE (process state, CPSR on AArch32) holds process-state information. Its condition flags are changed by compare instructions, for example, and the processor then uses them to decide whether or not to take a branch (a jump, in Intel terminology).

Which instructions are missing compared to AArch32:

Conditional execution operations, because:

The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations. [source]

Load Multiple instructions, which load a subset, or possibly all, of the general-purpose registers and the PC from memory. So there are no push, pop, ldmia, etc.; these are replaced by load/store pair instructions.

Coprocessor instructions

Branches & Exception

Conditional branch
Conditional branches change the flow of execution depending on the current state of the condition flags or the value in a general-purpose register.

B.cond: Branch conditionally. Syntax: B.<cond> <label>

CBNZ: Compare and branch if nonzero. Syntax: CBNZ <Wt|Xt>, <label>

CBZ: Compare and branch if zero. Syntax: CBZ <Wt|Xt>, <label>

Unconditional branch

B: Branch unconditionally. Syntax: B <label>

BL: Branch with link. Syntax: BL <label>

The BL instruction writes the address of the sequentially following instruction, used for the return (see RET), to general-purpose register X30.

Unconditional branch (register)

BLR: Branch with link to register. Syntax: BLR <Xn>

BR: Branch to register. Syntax: BR <Xn>

RET: Return from subroutine. Syntax: RET {<Xn>}, where Xn is the register holding the address to be branched to; it defaults to X30 if absent.
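
How BL and RET cooperate through X30 can be sketched in a few lines of Python. This is a toy model under simple assumptions: registers live in a dictionary, the functions return the next PC, and the names bl and ret are just illustrative.

```python
def bl(pc, target, regs):
    """Branch with link: save the return address in x30, then jump."""
    regs["x30"] = pc + 4          # every A64 instruction is 4 bytes long
    return target                 # new program counter

def ret(regs, xn="x30"):
    """Return from subroutine: jump to the address held in Xn (x30 by default)."""
    return regs[xn]

regs = {}
pc = bl(0x1000, 0x2000, regs)     # call: pc becomes 0x2000
pc = ret(regs)                    # return: pc becomes 0x1004
```

Because the return address lives in a register, a function that itself calls another function has to save X30 (typically on the stack) before its own BL overwrites it.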

Exception generating

HVC: Generate exception targeting Exception level 2

SMC: Generate exception targeting Exception level 3

SVC: Generate exception targeting Exception level 1

Other instructions

NOP: No OPeration

WFE: Wait for event

WFI: Wait for interrupt

SEV: Send event

SEVL: Send event local

Load/Store register

There are many instructions in this class, moving data of various sizes (byte, halfword, word), but I show only four, just to give you the idea: two that move a single register and two that move a pair of registers. First, though, I have to describe how memory is accessed.

Load/Store addressing modes

This part is very important for understanding the different ARM addressing modes; the three most used are:

[base{, #imm}]: Base plus offset addressing means that the address is the value in the 64-bit base register plus an offset.

Example: ldrsw x0, [x29,76] // load signed word into x0

[base, #imm]! : Pre-indexed addressing means that the address is the sum of the value in the 64-bit base register and an offset, and the address is then written back to the base register.

[base], #imm : Post-indexed addressing means that the address is the value in the 64-bit base register, and the sum of the address and the offset is then written back to the base register.

Example: ldp x29, x30, [sp], 80 // load values from the stack
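
The difference between the three modes is which address is accessed and what ends up in the base register afterwards. A minimal Python model, with function names chosen only for this illustration, returns both values as a pair:

```python
def base_plus_offset(base, imm=0):
    """[base{, #imm}]: access base + imm, base register unchanged."""
    return base + imm, base            # (address accessed, new base value)

def pre_indexed(base, imm):
    """[base, #imm]!: access base + imm and write that address back."""
    address = base + imm
    return address, address

def post_indexed(base, imm):
    """[base], #imm: access base, then write base + imm back."""
    return base, base + imm
```

With sp = 0x1000, a pre-indexed store with offset -80 accesses 0xFB0 and leaves sp = 0xFB0; the post-indexed load with offset 80 from the example above then accesses 0xFB0 and restores sp to 0x1000.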

Now I can describe the load/store instructions. I don’t cover every addressing mode for each of them; I show only a few examples.

Single Register
Load a register from memory, or store a register to memory

ldr: Load register works with:

Register offset: LDR <Xt>, [<Xn|SP>, <R><m>{, <extend> {<amount>}}]

Immediate offset (post-index): LDR <Xt>, [<Xn|SP>], #<simm>

PC-relative literal: LDR <Xt>, <label>

str: Store register works with:

Register offset: STR <Xt>, [<Xn|SP>, <R><m>{, <extend> {<amount>}}]

Immediate offset (post-index): STR <Xt>, [<Xn|SP>], #<simm>

<simm> is a signed immediate byte offset, in the range -256 to 255

Pair of Registers
Load or store the two specified registers at the memory address held in Xn or SP

ldp load pair: LDP <Xt1>, <Xt2>, [<Xn|SP>], #<imm>

stp store pair: STP <Xt1>, <Xt2>, [<Xn|SP>], #<imm>

<imm> is a signed immediate byte offset, a multiple of 8 in the range -512 to 504
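
The two offset ranges quoted above are easy to get wrong when writing assembly by hand, so here is a small Python sketch that checks them. The function names are hypothetical; the pair check assumes 64-bit X registers, where the encoded field is a signed 7-bit value scaled by 8.

```python
def fits_simm9(offset):
    """Offset usable by the single-register post-index forms shown above."""
    return -256 <= offset <= 255

def fits_pair_imm(offset, scale=8):
    """Offset usable by LDP/STP: a multiple of scale in [-64*scale, 63*scale]."""
    return offset % scale == 0 and -64 * scale <= offset <= 63 * scale
```

For instance, the offset 80 used earlier in `ldp x29, x30, [sp], 80` passes the pair check, while 12 (not a multiple of 8) or 512 (out of range) would be rejected by the assembler.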

Data processing – immediate

Arithmetic (immediate)

ADD: Add (immediate). Syntax: ADD <Xd|SP>, <Xn|SP>, #<imm>{, <shift>} ;Rd = Rn + shift(imm)

ADDS: Add and set flags

SUB: Subtract. Syntax: SUB <Xd|SP>, <Xn|SP>, #<imm>{, <shift>} ;Rd = Rn - shift(imm)

SUBS: Subtract and set flags

CMP: Compare. Syntax: CMP <Xn|SP>, #<imm>{, <shift>}

CMN: Compare negative

Where <shift> is the optional left shift applied to the 12-bit unsigned immediate: either LSL #0 (the default) or LSL #12.
In the shifted-register forms of these instructions, the shift operators LSL (logical shift left), ASR (arithmetic shift right) and LSR (logical shift right) accept an immediate shift <amount> in the range 0 to one less than the register width of the instruction, inclusive.
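
In the immediate forms of ADD and SUB, the immediate is a 12-bit unsigned value that can optionally be shifted left by 12 bits. The Python sketch below (encode_add_imm is an invented helper name) shows which constants can be used directly; anything else has to be built in a register first.

```python
def encode_add_imm(value):
    """Return (imm12, shift) if value fits an ADD/SUB immediate, else None."""
    if 0 <= value <= 0xFFF:
        return value, 0                   # plain 12-bit immediate
    if 0 <= value <= 0xFFF000 and value & 0xFFF == 0:
        return value >> 12, 12            # 12-bit immediate, LSL #12
    return None                           # not encodable in one instruction
```

So `add x0, x1, #0x5000` is fine (imm12 = 5, LSL #12), whereas 0x1001 needs both halves and cannot be encoded in a single ADD.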

Logical

AND: Bitwise AND. Syntax: AND <Xd|SP>, <Xn>, #<imm> ;Rd = Rn AND imm

ANDS: Bitwise AND and set flags. Syntax: ANDS <Xd>, <Xn>, #<imm> ;Rd = Rn AND imm

EOR: Bitwise exclusive OR. Syntax: EOR <Xd|SP>, <Xn>, #<imm> ;Rd = Rn EOR imm

ORR: Bitwise inclusive OR. Syntax: ORR <Xd|SP>, <Xn>, #<imm> ;Rd = Rn OR imm

TST: Test bits. Syntax: TST <Xn>, #<imm> ;Rn AND imm

Move
Instructions to move a wide (16-bit) immediate:

MOVZ: Move wide with zero. Syntax: MOVZ <Xd>, #<imm>{, LSL #<shift>} ;Rd = LSL (imm16, shift)

MOVN: Move wide with NOT. Syntax: MOVN <Xd>, #<imm>{, LSL #<shift>} ;Rd = NOT (LSL (imm16, shift))

MOVK: Move 16-bit immediate into register, keeping other bits unchanged. Syntax: MOVK <Xd>, #<imm>{, LSL #<shift>} ;Rd<shift+15:shift> = imm16

There is also an instruction to move an immediate: MOV <Xd>, #<imm> ;Rd = imm
but its three versions are aliases of MOVZ, MOVN and ORR (immediate)
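
Since each wide move carries only 16 bits, an arbitrary 64-bit constant is usually built with one MOVZ followed by up to three MOVK instructions. The following Python model mirrors the pseudo-code in the comments above; load_constant is a hypothetical helper showing one simple strategy, not what any particular compiler emits.

```python
MASK64 = (1 << 64) - 1

def movz(imm16, shift=0):
    """Rd = LSL(imm16, shift); all other bits are zero."""
    return (imm16 << shift) & MASK64

def movn(imm16, shift=0):
    """Rd = NOT(LSL(imm16, shift))."""
    return ~(imm16 << shift) & MASK64

def movk(rd, imm16, shift=0):
    """Replace bits [shift+15:shift] of rd with imm16, keeping the rest."""
    return (rd & ~(0xFFFF << shift) & MASK64) | (imm16 << shift)

def load_constant(value):
    """Build value with MOVZ for the low 16 bits, MOVK for nonzero chunks."""
    rd = movz(value & 0xFFFF)
    for shift in (16, 32, 48):
        chunk = (value >> shift) & 0xFFFF
        if chunk:
            rd = movk(rd, chunk, shift)
    return rd
```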

PC-relative address calculation

The ADR instruction adds a signed, 21-bit immediate to the value of the program counter that fetched this instruction, and then writes the result to a general-purpose register: ADR <Xd>, <label>

The ADRP instruction permits the calculation of the address at a 4KB aligned memory region. In conjunction with an ADD (immediate) instruction, or a Load/Store instruction with a 12-bit immediate offset, this allows for the calculation of, or access to, any address within ±4GB of the current PC: ADRP <Xd>, <label>
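
The ADRP plus ADD combination splits an address into a 4KB page base and a 12-bit offset within the page. Here is a rough Python model of that arithmetic; the function names adrp and lo12 are illustrative (lo12 is borrowed from the GNU assembler's `:lo12:` operator), and the range check simply reflects the signed 21-bit page offset.

```python
MASK64 = (1 << 64) - 1

def adrp(pc, target):
    """Page base of target, computed relative to the page containing pc."""
    page_delta = (target >> 12) - (pc >> 12)        # offset counted in 4KB pages
    assert -(1 << 20) <= page_delta < (1 << 20), "target outside +/-4GB"
    return ((pc & ~0xFFF) + (page_delta << 12)) & MASK64

def lo12(target):
    """Low 12 bits of the address, added by the following ADD or load/store."""
    return target & 0xFFF

pc, target = 0x400124, 0x412345
x0 = adrp(pc, target)            # page base: 0x412000
x0 = x0 + lo12(target)           # full address: 0x412345
```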

The Move (register) instructions are aliases for other data processing instructions. They copy a value from a general-purpose register to another general-purpose register or the current stack pointer, or from the current stack pointer to a general-purpose register: MOV <Xd>, <Xm> ;Xd = Xm

CRC32
The optional CRC32 instructions operate on the general-purpose register file to update a 32-bit CRC value from an input value comprising 1, 2, 4, or 8 bytes.
There are two different classes of CRC instructions, CRC32 and CRC32C, that support two commonly used 32-bit polynomials, known as CRC-32 and CRC-32C.
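
For reference, here is a bit-by-bit software version of the two checksums in Python. It is only a sketch of what the polynomials compute, not a description of the hardware: the constants are the usual bit-reflected forms of the CRC-32 and CRC-32C polynomials, and the initial value and final inversion in crc() are the common software convention, assumed here rather than taken from the text above.

```python
import zlib

CRC32_POLY = 0xEDB88320     # bit-reflected 0x04C11DB7 (CRC-32)
CRC32C_POLY = 0x82F63B78    # bit-reflected 0x1EDC6F41 (CRC-32C)

def crc_update(crc, data, poly):
    """Fold the bytes of data into a running 32-bit CRC value."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc

def crc(data, poly):
    """Complete checksum: start from all ones, invert at the end."""
    return crc_update(0xFFFFFFFF, data, poly) ^ 0xFFFFFFFF

# The CRC-32 variant agrees with Python's zlib implementation.
assert crc(b"123456789", CRC32_POLY) == zlib.crc32(b"123456789")
```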

Conditional select
The Conditional select instructions select between the first or second source register, depending on the current state of the condition flags.

CSEL: Conditional select. Syntax: CSEL <Xd>, <Xn>, <Xm>, <cond> ;Rd = if cond then Rn else Rm

CSINC: Conditional select increment. Syntax: CSINC <Xd>, <Xn>, <Xm>, <cond> ;Rd = if cond then Rn else (Rm + 1)

CSINV: Conditional select inversion. Syntax: CSINV <Xd>, <Xn>, <Xm>, <cond> ;Rd = if cond then Rn else NOT (Rm)

CSNEG: Conditional select negation. Syntax: CSNEG <Xd>, <Xn>, <Xm>, <cond> ;Rd = if cond then Rn else -Rm

CSET: Conditional set. Syntax: CSET <Xd>, <cond> ;Rd = if cond then 1 else 0

CSETM: Conditional set mask. Syntax: CSETM <Xd>, <cond> ;Rd = if cond then -1 else 0

CINC: Conditional increment. Syntax: CINC <Xd>, <Xn>, <cond> ;Rd = if cond then Rn+1 else Rn

CINV: Conditional invert. Syntax: CINV <Xd>, <Xn>, <cond> ;Rd = if cond then NOT(Rn) else Rn

CNEG: Conditional negate. Syntax: CNEG <Xd>, <Xn>, <cond> ;Rd = if cond then -Rn else Rn
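
The last five instructions are aliases: each one is a CSINC, CSINV or CSNEG with the condition inverted and the zero register or the same source register used twice. The Python model below makes that relationship explicit; the condition is passed as a plain boolean, and XZR is modelled as the constant 0.

```python
MASK64 = (1 << 64) - 1
XZR = 0

def csel(cond, rn, rm):
    return rn if cond else rm

def csinc(cond, rn, rm):
    return rn if cond else (rm + 1) & MASK64

def csinv(cond, rn, rm):
    return rn if cond else ~rm & MASK64

def csneg(cond, rn, rm):
    return rn if cond else -rm & MASK64

# Aliases: same instructions with the condition inverted.
def cset(cond):
    return csinc(not cond, XZR, XZR)

def csetm(cond):
    return csinv(not cond, XZR, XZR)

def cinc(cond, rn):
    return csinc(not cond, rn, rn)

def cinv(cond, rn):
    return csinv(not cond, rn, rn)

def cneg(cond, rn):
    return csneg(not cond, rn, rn)
```

Results are kept in the unsigned 64-bit range, so the "-1" of CSETM shows up as 0xFFFFFFFFFFFFFFFF.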

Conditional comparison
The Conditional comparison instructions provide a conditional select for the NZCV condition flags, setting the flags to the result of an arithmetic comparison of its two source register values if the named input condition is true, or to an immediate value if the input condition is false. There are register and immediate forms. The immediate form compares the source register to a small 5-bit unsigned value.

<nzcv> is the flag bit specifier, an immediate in the range 0 to 15, giving the alternative state for the 4-bit NZCV condition flags, encoded in the nzcv field.

<imm> Is a five bit unsigned (positive) immediate encoded in the imm5 field.

How CCMP works:
it checks the NZCV flags against <cond>: if the condition holds (the previous comparison passed), it performs this comparison and sets NZCV from the result; otherwise it sets NZCV to <nzcv>.
Suppose we have to write this condition: x0 >= x1 && x2 == x3
In ARM assembly, with ccmp, we can do this:

cmp x0, x1
ccmp x2, x3, #0, ge
b.eq good
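
To see why the three-instruction sequence above implements the whole condition, it helps to trace the flags. The Python sketch below models CMP and CCMP on 64-bit values; the function names are invented, flags are packed as a 4-bit NZCV number, and only the two conditions used here (GE and EQ) are modelled.

```python
MASK64 = (1 << 64) - 1

def cmp_flags(a, b):
    """NZCV produced by CMP a, b (a subtraction whose result is discarded)."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    n = result >> 63                          # result is negative
    z = int(result == 0)                      # result is zero
    c = int(a >= b)                           # no borrow occurred
    v = ((a ^ b) & (a ^ result)) >> 63        # signed overflow
    return (n << 3) | (z << 2) | (c << 1) | v

def cond_ge(nzcv):
    """Signed greater-or-equal: N == V."""
    return ((nzcv >> 3) & 1) == (nzcv & 1)

def cond_eq(nzcv):
    """Equal: Z set."""
    return bool(nzcv & 0b0100)

def ccmp(flags, a, b, nzcv, cond):
    """If cond holds on the current flags, compare a and b; else use nzcv."""
    return cmp_flags(a, b) if cond(flags) else nzcv

def ge_and_eq(x0, x1, x2, x3):
    flags = cmp_flags(x0, x1)                 # cmp  x0, x1
    flags = ccmp(flags, x2, x3, 0, cond_ge)   # ccmp x2, x3, #0, ge
    return cond_eq(flags)                     # b.eq good
```

When x0 < x1 the CCMP does not compare at all: it loads NZCV with 0, which clears Z, so the final branch is not taken, exactly as the short-circuit `&&` requires.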

Assembly Example:

It’s time to code! As in other assembly tutorials, I show the C-like code first and then the ARM asm.