Windows on ARM - An assembly language primer

The ARM CPU has garnered significant attention in the recent past due to
its widespread usage in mobile devices.
With Windows 8, Microsoft has for the first time released a mainstream
Windows OS that runs on the ARM CPU, although Windows CE has been running
on ARM for more than a decade now.
Developers and support engineers working with the Windows on ARM (WoA)
platform need a basic understanding of the ARM CPU and ARM assembler
in order to be able to effectively troubleshoot and debug issues that
occur at the lowest levels of the operating system.
Although there is no shortage of information on the ARM CPU architecture and
assembly language, there is very little information on the usage of ARM
assembly on Windows 8.
This article attempts to provide the reader with enough information to gain
a basic understanding of the ARM assembly language as used by Windows.
It does not attempt to be a comprehensive reference manual for the ARM CPU;
please refer to the references section for detailed information on this topic.

Tools

This section covers some of the tools that were used to research this article.

In order to test the conversion of C/C++ constructs to ARM assembler,
the ARM cross compiler that ships with VS2013 was used.
To build the ARM executables the compiler was run from a console window
as shown below.
The following section assumes VS2013 is installed on the system in the default
install path.

To study how the individual ARM assembler instructions are translated
into ARM opcodes the ARM assembler was used. Once the assembler generated
the object (.OBJ) file, the linker (link.exe) was used to examine
opcode sequences.
All these steps are shown below.
The ARM assembler and linker also ship with Visual Studio 2013.

CPU Version

The research for this article was performed on a Microsoft Surface RT
(Generation 1) running on an Nvidia TEGRA 3 Quad Core CPU.
The Secure Boot Signing Policy that retail devices like Surface RT ship
with does not allow live kernel debugging.
It is however possible to configure Surface RT devices to generate complete
kernel memory dumps and these memory dumps can be loaded and analyzed on
both the X86 and X64 versions of WinDBG.
So all the research for this article was done using kernel mode and
user mode memory dumps generated on the Surface RT device.

To generate a complete kernel memory dump on a Surface RT system the
commands listed below were run from an administrative command prompt,
followed by a system reboot and finally bug-checking the system using the
RightCtrl+ScrollLock+ScrollLock key sequence, as described in
[8].

The "!sysinfo cpuinfo" command describes the CPU as ARM Family 7 Cortex-A9 r02p09.
Based on this information the ARM Cortex-A9 Technical Reference Manual
[3]
and the ARM Architecture Reference Manual for ARM-v7-A and ARM-v7-R
[4]
were used to research this article.

Registers

The ARM CPU can execute in User, System, Supervisor, Abort, Undefined,
Interrupt (IRQ) and Fast Interrupt (FIQ) modes.
In total the ARM CPU has 37 physical registers, each one 32-bits wide.
Out of these 37 registers, only 17 registers are visible to software at
any given point in time, depending on the mode the CPU is executing in.
These registers comprise thirteen general-purpose registers (r0 to
r12), three special-purpose registers (r13 to r15) and the Current Program
Status Register (CPSR).
The special-purpose registers r13, r14 and r15 are also referred to as
SP, LR, and PC respectively.
The CPSR register is similar to the X86/X64 flags register.
Unlike the X86, ARM does not contain any segment registers.

CPSR

Current Program Status Register. Similar to the EFlags register on X86.

APSR

Application Program Status Register.
This is not a separate register but the NZCVQ and GE bits of the CPSR
that are writable from user mode.

SPSR

Saved Program Status Register.
Copy of the CPSR at the time an exception occurs. SPSR contains the pre-exception
value of the CPSR. The CPU contains a separate instance of the SPSR for
every exception mode that is supported by the ARM CPU.

Of the 17 registers mentioned above, r0-r7 and r15 are unbanked registers
i.e. they map to the same physical registers irrespective of the mode the
CPU is executing in.
Registers r8 through r14 are banked i.e. they map to different physical
registers depending on the CPU's execution mode.
The purpose of banked registers is for the CPU to automatically preserve
these register contents across execution mode changes
and ensure that the registers are not overwritten during an exception.
Registers r13 and r14 are banked in every execution mode except System mode,
which shares its copies with User mode.
Registers r8–r12 are banked only in FIQ mode.
In addition, the CPSR register is banked into the SPSR registers in all
exception modes; User and System modes have no SPSR.
The ARM documentation refers to banked registers with the suffixes
svc, abt, und, irq or fiq representing the execution modes of the CPU in
which the registers are used.

The following table shows the banked and unbanked registers in all of the
different execution modes of the CPU:

User    System  Supervisor  Abort     Undefined  IRQ       FIQ
r0      r0      r0          r0        r0         r0        r0
r1      r1      r1          r1        r1         r1        r1
r2      r2      r2          r2        r2         r2        r2
r3      r3      r3          r3        r3         r3        r3
r4      r4      r4          r4        r4         r4        r4
r5      r5      r5          r5        r5         r5        r5
r6      r6      r6          r6        r6         r6        r6
r7      r7      r7          r7        r7         r7        r7
r8      r8      r8          r8        r8         r8        r8_fiq
r9      r9      r9          r9        r9         r9        r9_fiq
r10     r10     r10         r10       r10        r10       r10_fiq
r11     r11     r11         r11       r11        r11       r11_fiq
r12     r12     r12         r12       r12        r12       r12_fiq
SP      SP      SP_svc      SP_abt    SP_und     SP_irq    SP_fiq
LR      LR      LR_svc      LR_abt    LR_und     LR_irq    LR_fiq
PC      PC      PC          PC        PC         PC        PC
CPSR    CPSR    SPSR_svc    SPSR_abt  SPSR_und   SPSR_irq  SPSR_fiq
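
The banking rules in the table above can be expressed as a small lookup. The following is a minimal Python sketch, not anything Windows or the debugger provides; the function names and the lowercase mode strings ("usr", "sys", "svc" and so on) are chosen here purely for illustration. Counting the distinct physical names it produces gives the 37 registers mentioned earlier.

```python
# Exception modes that bank SP and LR and own a private SPSR.
EXCEPTION_MODES = ("svc", "abt", "und", "irq", "fiq")


def physical_register(mode, reg):
    """Map an architectural register to its physical (banked) name.

    mode: "usr", "sys", "svc", "abt", "und", "irq" or "fiq" (illustrative names)
    reg:  "r0".."r12", "sp", "lr" or "pc"
    """
    if mode == "fiq" and reg in ("r8", "r9", "r10", "r11", "r12"):
        return reg + "_fiq"           # r8-r12 are banked only in FIQ mode
    if mode in EXCEPTION_MODES and reg in ("sp", "lr"):
        return reg + "_" + mode       # SP and LR are banked per exception mode
    return reg                        # everything else is shared with User mode


def count_physical_registers():
    """Count the distinct physical registers across all modes."""
    names = set()
    regs = ["r%d" % i for i in range(13)] + ["sp", "lr", "pc"]
    for mode in ("usr", "sys") + EXCEPTION_MODES:
        for reg in regs:
            names.add(physical_register(mode, reg))
    names.add("cpsr")
    names.update("spsr_" + mode for mode in EXCEPTION_MODES)
    return len(names)
```

The total breaks down as 16 User mode registers, 5 FIQ-only copies of r8 to r12, 10 banked SP/LR copies, the CPSR and 5 SPSRs.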

The list of ARM registers can be examined using the debugger's register display
command:

Current Program Status Register (CPSR)

The five mode bits M[4:0] contain the values listed in the
following table indicating the mode the CPU is currently operating in:

Mode  Value  Description
USR   0x10   User Mode
FIQ   0x11   Fast Interrupt Mode
IRQ   0x12   Interrupt Mode
SVC   0x13   Supervisor Mode
ABT   0x17   Abort Mode
UDF   0x1B   Undefined Mode
SYS   0x1F   System Mode

The J & T bits determine if the CPU is in ARM or Thumb mode,
where J = Jazelle and T = Thumb.

J=0 & T=0 ARM Mode

J=0 & T=1 Thumb Mode
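
As an illustration, the mode, J/T and condition bits can be pulled out of a raw CPSR value with a few shifts and masks. This is a minimal Python sketch; the bit positions (N=31, Z=30, C=29, V=28, J=24, T=5, M=4:0) are taken from the ARMv7-A Architecture Reference Manual rather than from this article, and the function names are hypothetical.

```python
# Mode values as listed in the table above.
CPSR_MODES = {
    0x10: "USR", 0x11: "FIQ", 0x12: "IRQ", 0x13: "SVC",
    0x17: "ABT", 0x1B: "UDF", 0x1F: "SYS",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into the fields discussed in this section."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "J": (cpsr >> 24) & 1,
        "T": (cpsr >> 5) & 1,
        "mode": CPSR_MODES.get(cpsr & 0x1F, "unknown"),
    }


def instruction_set(cpsr):
    """Return "ARM" or "Thumb" for the two J/T combinations listed above."""
    fields = decode_cpsr(cpsr)
    return {(0, 0): "ARM", (0, 1): "Thumb"}.get((fields["J"], fields["T"]), "other")
```

For example, a CPSR value of 0x60000033 decodes to Supervisor mode with the T, Z and C bits set.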

The NZCV bits are used by conditional flow control instructions to alter
program execution based on the result of compare operations.
These bits are set by instructions like cmp, tst, or any other instruction
that has an "S" suffix.
The 2-letter acronyms in the Condition column in the following table are
used as suffixes to branch instructions.
The Flags column shows the value of one or more condition bits that would
result in the corresponding branch being taken.
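Examples of such conditional flow control instructions are Conditional
Branch (Bxx) and its variants, where xx is the condition suffix.
The Compare and Branch instructions (CBZ and CBNZ), by contrast, test a
register directly and do not use these flags.
The condition suffixes are shown below: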

Code      Condition  Flags                  Description
0000 (0)  EQ         Z == 1                 Equal
0001 (1)  NE         Z == 0                 Not equal
0010 (2)  CS         C == 1                 Carry set
0011 (3)  CC         C == 0                 Carry clear
0100 (4)  MI         N == 1                 Minus, negative
0101 (5)  PL         N == 0                 Plus, positive or zero
0110 (6)  VS         V == 1                 Overflow
0111 (7)  VC         V == 0                 No overflow
1000 (8)  HI         (C == 1) && (Z == 0)   Unsigned higher
1001 (9)  LS         (C == 0) || (Z == 1)   Unsigned lower or same
1010 (a)  GE         N == V                 Signed greater than or equal
1011 (b)  LT         N != V                 Signed less than
1100 (c)  GT         (Z == 0) && (N == V)   Signed greater than
1101 (d)  LE         (Z == 1) || (N != V)   Signed less than or equal
1110 (e)  AL         Any                    Always (unconditional)
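
To see how the condition suffixes relate to the NZCV bits, the following Python sketch models a 32-bit cmp followed by a conditional branch. It is an illustration only; the helper names are hypothetical, and the flag computation follows the usual ARM rule that cmp performs a subtraction in which carry set means no borrow occurred.

```python
# Each suffix maps to a predicate over the (N, Z, C, V) flags.
CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,
    "CC": lambda n, z, c, v: c == 0,
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}


def flags_after_cmp(a, b):
    """Return (N, Z, C, V) as set by `cmp a, b` on 32-bit operands."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                            # carry set means no borrow
    v = ((a ^ b) & (a ^ result)) >> 31         # signed overflow of a - b
    return n, z, c, v


def branch_taken(suffix, a, b):
    """Would `cmp a, b` followed by `b<suffix>` take the branch?"""
    return CONDITIONS[suffix](*flags_after_cmp(a, b))
```

Comparing 0xFFFFFFFF with 1 shows the difference between the unsigned and signed conditions: HI is satisfied because the first operand is the larger unsigned value, while LT is satisfied because the same bit pattern is -1 when treated as signed.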

Trap Frames

The trap frame structure (KTRAP_FRAME) is used by Windows to save
and restore register contents during interrupts, system calls and exceptions.
Due to the use of banked registers the ARM CPU does not push anything on the
stack during an exception, hence the trap frame on the ARM CPU is entirely
defined by software.
The trap frame structure for ARM CPU, defined in ntddk.h, is as follows:

As highlighted by the "NOTE:" in the above output, the trap frame structure
does not contain fields for non-volatile registers i.e. R4-R10. At the time of
an exception the non-volatile registers are saved in another structure called
the KEXCEPTION_FRAME.
The KEXCEPTION_FRAME structure is not exposed through public symbols but
it is defined in ntddk.h.
The macros GENERATE_EXCEPTION_FRAME and RESTORE_EXCEPTION_FRAME are defined in
the WDK header file kxarm.h. These macros are used at the beginning and end of
functions respectively to set up and tear down the KEXCEPTION_FRAME structure
on the stack.

In addition to the CPU registers described above, the KTRAP_FRAME also contains
a copy of the CPU's Breakpoint Value registers (Bvr) and the Breakpoint Control
Registers (Bcr) which control the configuration and usage of the Bvrs.
The KTRAP_FRAME also contains a copy of the CPU's Watchpoint Value Registers (Wvr)
and the Watchpoint Control Registers (Wcr) which control the configuration and usage
of the Wvrs.
All of the breakpoint and watchpoint registers reside in co-processor CP14;
co-processors are covered later in this article.
The maximum number of breakpoints and watchpoints that are available on a
CPU are defined in hardware and these values are cached in the Kernel Processor
Control Region (KPCR) structure.
The fields KPCR.MaxBreakpoints and KPCR.MaxWatchpoints cache the maximum
number of breakpoints and watchpoints respectively.
The content of these fields in the KPCR structure is shown below:

The trap frame also optionally points to the Vector Floating Point (VFP)
registers, which reside in co-processor CP10.
These registers are used as either 64-bit "D" floating point registers or
as the NEON 128-bit SIMD or "Q" registers.
These "D" and "Q" registers are aliased and they map to the same physical
bits in the VFP.
The VFP register values can be stored to memory using the VSTM instruction
and loaded from memory using the VLDM instruction.

The debugger's default register mask on the ARM i.e. 0x01 causes the 'r'
command to display only the integer registers.
The other registers described above can be examined by setting the
register mask to 0x4f as shown below:

Instruction Set

Windows, like all other modern operating systems, uses the ARM CPU in
Thumb-2 mode in which instructions are either 16 bits (Thumb) or 32 bits (ARM).
Thumb mode, which was introduced in early ARM processors, allows for higher
instruction density and uniform instruction coding but these instructions
are limited in functionality as compared to their 32-bit ARM counterparts.
Here are some of the limitations:

16-bit Thumb instructions only contain 3-bits to identify source and
destination registers.
Consequently only registers R0 - R7 can be accessed by them.
The 32-bit ARM instructions, on the other hand, can access the full set
of R0 - R15 registers.
Following are some examples of 16-bit instructions accessing registers r0 - r7.

Opcode  Mnemonic  Operand
2304    movs      r3,#4
4605    mov       r5,r0
2D00    cmp       r5,#0
3B01    subs      r3,#1
005C    lsls      r4,r3,#1

Thumb instructions cannot be predicated i.e. they cannot be made to
operate conditionally using the NZCV bits like the ARM instruction set.

Immediate Values are restricted to 12 bits, so only numbers from 0 to 4095
can be encoded with the instruction.
However using the barrel shifter, described later in this article, the
immediate number can be multiplied and added to an existing register
value to increase its range.

A Thumb routine can call both Thumb code and ARM code, but it cannot
contain non-Thumb instructions. The same goes for an ARM routine.

Thumb-2, introduced in modern ARM processors, allows these limitations
to be worked around by enabling compilers and the processor to generate
and understand functions which combine both Thumb and ARM instructions
in the same instruction stream, without requiring branch instructions to
switch from one mode to the other.

During a branch operation the ARM CPU must be told that the target of
the branch is a Thumb-2 instruction.
This is indicated by setting the least significant bit of the branch address.
As a consequence of this when a function pointer is examined in WinDBG
it always points at a one byte offset within the function as illustrated
below:

The function nt!IopErrorLogThread begins at address 0x836c8188, however
the field WORK_QUEUE_ITEM.WorkerRoutine contains the address 0x836c8189
which has the least significant bit set, indicating a Thumb-2 instruction stream.
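
When working with such pointers by hand, the Thumb bit has to be masked off to recover the address of the first instruction. A minimal Python sketch, with a hypothetical helper name:

```python
def split_code_pointer(pointer):
    """Return (instruction_address, is_thumb) for a code pointer.

    Bit 0 of a branch target selects the instruction set and is not
    part of the address itself.
    """
    return pointer & ~1, bool(pointer & 1)
```

Applied to the WorkerRoutine value above, this yields the start address of nt!IopErrorLogThread and reports that the Thumb bit was set.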

Instruction Encoding

Since instruction sizes in ARM Thumb-2 mode can be both 16 and 32 bit, the
way an instruction is encoded plays a critical role in determining the actual
instruction size.
32-bit instructions are encoded as 2 separate 16-bit half-words.
The value of bits[15:11] of the first half-word determines if the instruction
is made of a single half-word (16 bits) or double half-word (32-bits).
If the value of bits[15:11] of the first half-word is 11101, 11110 or 11111,
the half-word is the first half-word of a 32-bit instruction, otherwise it is
a 16-bit instruction.
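
This rule is simple enough to express directly. The following Python sketch, with hypothetical function names, determines the size of an instruction from its first half-word and uses that to split a stream of half-words into individual instructions:

```python
def thumb2_instruction_size(first_halfword):
    """Return the instruction size in bytes (2 or 4) from its first half-word."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_instructions(halfwords):
    """Group a list of 16-bit half-words into tuples, one per instruction."""
    instructions = []
    index = 0
    while index < len(halfwords):
        count = thumb2_instruction_size(halfwords[index]) // 2
        instructions.append(tuple(halfwords[index:index + count]))
        index += count
    return instructions
```

For instance, the half-word 0xf3bf has bits[15:11] equal to 11110 and therefore starts a 32-bit instruction, whereas 0xbf10 has 10111 and is a complete 16-bit instruction.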

The following excerpt from the ARMv7 Architecture Reference Manual
Section A8.8.43 shows the encoding of the above mentioned DMB instruction
in Thumb-2 mode.

Figure 2 : ARM 32-bit instruction encoding

The first (lower) 16 bit part of the opcode (0xf3bf) is represented by the binary
number "1111 0011 1011 1111" which matches the first half of the instruction encoding.

The second (higher) 16 bit part of the opcode (0x8f5b) is represented by the binary
number "1000 1111 0101 1011" which matches the second half of the instruction encoding.
The "option" value is binary 1011, and specifies the ISH option to the DMB instruction
as shown below:

The opcode (0xbf10) is represented by the binary number 1011 1111 0001 0000,
which matches the instruction bit encoding shown below.

Figure 4 : ARM 16-bit instruction encoding

Instructions on the ARM CPU have different variants depending on the suffix
that follows the primary mnemonic. These suffixes can be S, W, or .W and
determine how the instruction is encoded, whether the CPSR flags are affected
and how some of the operands are interpreted.

Instructions that have an S suffix change the NZCV bits of the CPSR
register based on the result of the operation.

Instructions that have a .W suffix are always encoded as 32-bit ARM
instructions as opposed to 16-bit Thumb instructions.

Instructions that have a W suffix zero extend their 12-bit immediate
value i.e. the 3rd operand. ARM 32-bit instructions that don't have the
W suffix treat their 3rd operand as a 12-bit constant value and decode it
based on the value of most significant 4 bits of the constant i.e. bits 11-8.
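
The difference between the two interpretations of the 12-bit field can be sketched in Python. The decoding below follows the ThumbExpandImm pseudocode in the ARMv7 Architecture Reference Manual as the author of this sketch understands it; the function names are hypothetical and the carry-out that the hardware also produces is ignored.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def zero_extend_imm12(imm12):
    """W-suffix form: the 12 bits are used as a plain value from 0 to 4095."""
    return imm12 & 0xFFF


def thumb_expand_imm(imm12):
    """Non-W form: the 12 bits are a 'modified immediate' constant."""
    imm8 = imm12 & 0xFF
    if (imm12 >> 10) & 0b11 == 0:
        pattern = (imm12 >> 8) & 0b11
        if pattern == 0:
            return imm8                          # 0x000000XY
        if pattern == 1:
            return (imm8 << 16) | imm8           # 0x00XY00XY
        if pattern == 2:
            return (imm8 << 24) | (imm8 << 8)    # 0xXY00XY00
        return imm8 * 0x01010101                 # 0xXYXYXYXY
    # Otherwise: an 8-bit value with its top bit forced to 1,
    # rotated right by the amount held in bits 11:7.
    unrotated = 0x80 | (imm12 & 0x7F)
    return ror32(unrotated, (imm12 >> 7) & 0x1F)
```

The same 12-bit field 0x4FF therefore means 1279 to a W-suffixed instruction but 0x7F800000 to its non-W counterpart, which is how constants well beyond 4095 can be encoded.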

Following are some variants of the ADD instruction with the same operands
encoded differently based on the suffix immediately following the instruction
mnemonic.
The first column is the opcode for the instruction.

Barrel Shifter

The ARM instruction set has the capability to combine shift and rotate
operations along with arithmetic, logical, compare, load and store
operations in a single instruction.
This is achieved through the barrel shifter, a hardware logic unit in
the CPU shown below:

Figure 5 : ARM Barrel Shifter

The barrel shifter implements shift and rotate operations that can be of
arithmetic or logical type like:
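
The shift and rotate types in question are logical shift left (LSL), logical shift right (LSR), arithmetic shift right (ASR) and rotate right (ROR). The Python sketch below models them on 32-bit values for shift amounts from 1 to 31, together with an ADD that shifts its second operand on the way in. The function names are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left: zeros enter from the right."""
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right: zeros enter from the left."""
    return (value & MASK32) >> amount


def asr(value, amount):
    """Arithmetic shift right: the sign bit is replicated from the left."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32          # reinterpret as a negative Python int
    return (value >> amount) & MASK32


def ror(value, amount):
    """Rotate right: bits leaving on the right re-enter on the left."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def add_shifted(rn, rm, shift, amount):
    """Model of `add rd, rn, rm, <shift> #amount` in a single step."""
    return (rn + shift(rm, amount)) & MASK32
```

A call such as add_shifted(0x1000, 3, lsl, 2) corresponds to indexing the fourth element of an array of 32-bit values in one instruction.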

Instruction Ordering

Modern compilers attempt to optimize program execution by generating instruction
sequences which may be different from what was intended by the high level
programming language.

Modern CPUs also perform multiple run time optimizations like instruction
pipelining, write buffering, instruction and data caching, speculative
execution and out of order execution.
While these optimizations result in faster program execution, there are
cases where they may lead to undesirable results.
This is especially true for low level operations performed by the OS like
cache operations, TLB flushes, page table updates and device register accesses.
Barriers prevent both the compiler and CPU from performing the above mentioned
optimizations.

The ARM CPU documentation uses the term barrier to refer to CPU optimization prevention.
There are 3 different types of barriers that can be used on the ARM CPU.

Instr.  Barrier Type
        Description

dmb     Data Memory Barrier
        Ensures that all explicit memory accesses before the DMB instruction
        complete before any explicit memory accesses after the DMB instruction start.
        The DMB instruction is automatically inserted by the compiler whenever
        any of the Interlocked family of functions is used in C or C++.
        Additionally, declaring a global variable as volatile results in the compiler
        generating DMB instructions provided the file is compiled with the
        /volatile:ms option instead of the /volatile:iso option.

dsb     Data Synchronization Barrier
        Completes when all instructions before this instruction complete.
        The DSB instruction can be directly inserted using the macro
        _DataSynchronizationBarrier() which is defined in winnt.h.

isb     Instruction Synchronization Barrier
        Flushes the pipeline in the CPU, so that all instructions following
        the ISB are fetched from cache or memory after the ISB has been completed.
        The ISB instruction can be directly inserted using the macro
        _InstructionSynchronizationBarrier() which is defined in winnt.h.

The scope of these barrier instructions can be restricted to sharing
domains as well as to specific memory access types.
These can be specified optionally as suffixes to the barrier
instructions.
If a barrier instruction does not have a suffix, its scope is assumed to be
system wide and it applies to both read and write type memory accesses.

Sharing Domain   Suffix    Description
Non-Shareable    NSH       Per-Core TLBs
Inner Shareable  ISH       System Memory
Outer Shareable  OSH       Device Memory
Full System      SY or ST  System and Device Memory

Access Type     Suffix  Comments
Read and Write  None    For full system read and write access, the sharing domain and
                        access is combined into the suffix SY.
Write only      ST      For full system write only access, the sharing domain and
                        access is combined into the suffix ST.
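
The domain and access type suffixes end up in the low four bits of the barrier opcode, as seen earlier with the DMB ISH encoding (0xf3bf 0x8f5b). The Python sketch below decodes the Thumb-2 barrier encodings; the option values are taken from the ARMv7 Architecture Reference Manual rather than from this article, and the function name is hypothetical.

```python
# Bits [7:4] of the second half-word select the barrier instruction.
BARRIER_OPS = {0b0100: "dsb", 0b0101: "dmb", 0b0110: "isb"}

# Bits [3:0] select the sharing domain and access type.
BARRIER_OPTIONS = {
    0b1111: "sy",  0b1110: "st",
    0b1011: "ish", 0b1010: "ishst",
    0b0111: "nsh", 0b0110: "nshst",
    0b0011: "osh", 0b0010: "oshst",
}


def decode_barrier(halfword1, halfword2):
    """Return (mnemonic, option) for a Thumb-2 barrier, or None otherwise."""
    if halfword1 != 0xF3BF or (halfword2 & 0xFF00) != 0x8F00:
        return None
    op = BARRIER_OPS.get((halfword2 >> 4) & 0xF)
    if op is None:
        return None
    return op, BARRIER_OPTIONS.get(halfword2 & 0xF, "reserved")
```

Note that this sketch applies the same option table to all three instructions for simplicity; in practice ISB is normally seen only with the full system (SY) option.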

The following annotated code snippet shows the usage of the ISB instruction to perform a
pipeline flush before updating the exception handling settings on the ARM CPU and another
one after the update to fetch subsequent instructions directly from memory.

Interlocked Operations

Unlike the X86 and X64 CPUs, which use the lock prefix before instructions to make
them atomic across multiple CPUs, the ARM CPU uses LDREX and STREX and their variants
to implement interlocked operations.
The LDREX and STREX instructions are used in pairs, but there can be other
intervening instructions between them.

The following code snippet shows the assembly instructions generated by the
compiler during a call to the function InterlockedIncrement ( &g_Lock );.

In the STREX example above the R3 register contains success (0) or
failure (1) depending on whether R2 was stored in memory pointed to by R1.
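
The retry pattern behind such a sequence can be illustrated with a toy model. The Python sketch below is not how the hardware exclusive monitor is implemented; it is a deliberately simplified single-reservation model, with hypothetical class and function names, that shows why the LDREX/STREX pair sits inside a loop: if anything else writes the location between the load and the store, STREX reports failure (1) and the sequence is retried.

```python
class ExclusiveMemory:
    """Toy model of memory with a single exclusive reservation."""

    def __init__(self):
        self.cells = {}
        self.reservation = None

    def ldrex(self, address):
        """Load a value and mark the address as reserved."""
        self.reservation = address
        return self.cells.get(address, 0)

    def strex(self, address, value):
        """Store only if the reservation is intact; return 0 on success, 1 on failure."""
        if self.reservation != address:
            return 1
        self.cells[address] = value
        self.reservation = None
        return 0

    def plain_store(self, address, value):
        """A write by another observer, which breaks any reservation on the address."""
        self.cells[address] = value
        if self.reservation == address:
            self.reservation = None


def interlocked_increment(memory, address, interfere=None):
    """Increment the value at `address` atomically and return the new value.

    `interfere` is an optional callback run once between the first LDREX and
    STREX to simulate another CPU touching the same location.
    """
    while True:
        value = (memory.ldrex(address) + 1) & 0xFFFFFFFF
        if interfere is not None:
            interfere()
            interfere = None
        if memory.strex(address, value) == 0:
            return value
```

When the interfering store changes the value to 10 in the middle of the first attempt, the first STREX fails and the second pass through the loop produces 11 rather than a stale result.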

Commonly Used Instructions

This section lists the most common instructions that are encountered in
functions on the WoA platform.
Familiarity with these instructions helps in reading and understanding
most of the assembler code generated by the Visual Studio compiler targeting
WoA.
Instruction opcodes are included to clearly distinguish between 16 and 32
bit Thumb-2 instructions.

PC Relative Conditional Branch if equal. If (CPSR.Z == 1) PC = BranchTarget. Similar to the X86 JZ instruction. Since the opcode for this instruction is 32-bits its target range is much larger than the previous instruction.

bb6b

cbnz r3,83429d80

PC Relative Compare and Branch on Nonzero. if ( R3 != 0 ) PC = BranchTarget. The range of such branches is +4 to +130 bytes.
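
The +4 to +130 byte range follows from the way the 16-bit CBZ/CBNZ opcode stores its offset: a 6-bit field scaled by 2 (0 to 126) that is added to the PC, which reads as the instruction address plus 4. The Python sketch below decodes the opcode according to the ARMv7 encoding as understood by the author of this sketch. The function names are hypothetical, and the instruction address used in the usage note is inferred from the branch target shown above rather than taken from a listing.

```python
def decode_cbz(opcode):
    """Decode a 16-bit CBZ/CBNZ opcode into (mnemonic, register, offset)."""
    if (opcode & 0xF500) != 0xB100:
        return None                               # not a CBZ/CBNZ encoding
    mnemonic = "cbnz" if opcode & 0x0800 else "cbz"
    i = (opcode >> 9) & 1
    imm5 = (opcode >> 3) & 0x1F
    offset = (i << 6) | (imm5 << 1)               # 0..126, always even
    return mnemonic, opcode & 0x7, offset


def cbz_target(address, opcode):
    """Compute the branch target for a CBZ/CBNZ located at `address`."""
    _, _, offset = decode_cbz(opcode)
    return address + 4 + offset
```

For the opcode bb6b this gives cbnz on r3 with an offset of 90 bytes, so an instruction located at 0x83429d22 would branch to 0x83429d80.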

Compare and Test

Opcode    Instruction   Operation
f0130f10  tst r3,#0x10  Set flags based on bitwise AND operation. CPSR.Flags = r3 & 0x10
ea930f00  teq r3,r0     Set flags based on bitwise XOR operation. CPSR.Flags = r3 ^ r0
2800      cmp r0,#0     Set flags based on subtraction operation. CPSR.Flags = r0 - ZeroExtend(0x0).
                        The immediate operand is zero extended to make it 32-bits wide.

Special Instructions

Windows uses the ARM CPU's capability of generating exceptions on undefined
instructions to process "well known" undefined instructions, which are essentially
opcodes that are construed as undefined by ARM but convey meaning to the Windows
exception handling mechanism.
16-bit instructions starting with 0xDE are undefined and lead to an
Undefined Instruction exception which is handled by nt!KiUndefinedInstructionException.
While handling an undefined instruction, the CPSR.Mode is set to 11011b i.e. Undefined.

KiUndefinedInstructionException() directly handles certain undefined
instructions like __rdpmccntr64, but for the rest, it simply dispatches the
exception to KiDispatchException() which in turn calls KiPreprocessInternalInvalidOpcode().
WoA uses the following undefined instructions:

Opcode  Mnemonic       Description
0xDEFE  __debugbreak   Breaks into the debugger. Used by ntdll!DbgUserBreakPoint().
0xDEFC  __assertfail   Used to indicate critical assertion failures in the kernel debugger.
                       Used by KeAccumulateTicks().
0xDEFB  __fastfail     Indicates fast fail conditions resulting in
                       KeBugCheckEx(KERNEL_SECURITY_CHECK_FAILURE). Called by functions like
                       InsertTailList() upon detecting a corrupted list, as described in [9].
0xDEFA  __rdpmccntr64  Reads the 64-bit performance counter co-processor register and returns
                       the value in R0+R1. Used by ReadTimeStampCounter(), KiCacheFlushTrial() etc.

Divide By Zero Exception, used by functions like nt!_rt_udiv and nt!_rt_udiv. Also generated by the compiler to check the divisor before division operations.
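
When scanning a disassembly or a raw memory dump, these opcodes can be recognized with a simple lookup. The Python sketch below covers only the four opcodes whose values are listed in the table above; the function name and the fallback string format are hypothetical.

```python
# "Well known" undefined opcodes listed in the table above.
WOA_UNDEFINED = {
    0xDEFE: "__debugbreak",
    0xDEFC: "__assertfail",
    0xDEFB: "__fastfail",
    0xDEFA: "__rdpmccntr64",
}


def classify_undefined(opcode):
    """Name a 16-bit 0xDExx opcode, or return None for any other opcode."""
    if (opcode >> 8) != 0xDE:
        return None
    # Unlisted 0xDExx opcodes are still undefined instructions; show the low byte.
    return WOA_UNDEFINED.get(opcode, "udf #0x%02x" % (opcode & 0xFF))
```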

Calling Convention

The ARM CPU and the X64 CPU have very similar calling conventions
in that the first four parameters to a function are passed via registers.
However, unlike the X64 that has a register spill space, the ARM compiler does
not reserve any space on the stack for register based parameters.
Another similarity between X64 and ARM is that only the function prolog
and epilog modify the value of the stack pointer (SP), the function body
never changes SP.
The registers used for parameter passing on the ARM CPU are listed below:

R0 = Parameter #1

R1 = Parameter #2

R2 = Parameter #3

R3 = Parameter #4

The fifth parameter onwards is stored on the stack.
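
As a rough illustration of this rule, the Python sketch below assigns a location to each parameter of a call. It assumes that every parameter is a 32-bit integer or pointer (64-bit and floating point parameters follow additional rules not covered here) and that stack offsets are measured from SP at the point of the call, before the callee's prolog runs. The function name is hypothetical.

```python
def assign_parameters(count):
    """Return where each of `count` 32-bit parameters is passed."""
    locations = []
    for index in range(count):
        if index < 4:
            locations.append("r%d" % index)                    # first four in r0-r3
        else:
            locations.append("[sp, #%d]" % ((index - 4) * 4))  # the rest on the stack
    return locations
```

For a function with seven such parameters, the last three would be found at [sp, #0], [sp, #4] and [sp, #8] on entry.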

The following figure shows assembler code sequence during a function call.

Figure 7 : Function Parameters

Function Prolog and Epilog

The following code snippet is an example of instructions that typically
make up the prolog of a non-leaf function:

The push instruction above saves the non-volatile registers R4, R5, R6, R7,
R8, R9, R10, R11 and LR (R14) on the stack. LR (R14) is used to return
execution control back to the caller.
The addw instruction sets up the r11 register to point to the location on the stack
where the old r11 register was saved. This creates a frame pointer chain
similar to the one created on the X86 with the EBP register.
And finally the sub instruction creates space on the stack for local variables.

The add instruction in the above snippet simply adjusts the stack pointer
to skip over the local variables.
The pop instruction restores back the contents of the non-volatile registers
which were saved in the function prolog.
The value of the saved LR register (i.e. the return address) is restored
back into the PC, thus returning control back to the caller and obviating
the need for an explicit branch instruction.

Figure 8 : Function Prolog and Epilog

The prolog and epilog for leaf functions (i.e. functions that don't call others)
are very different from the sequence shown above.
Following is the complete disassembly of a leaf function:

In the code snippet shown above, the LR register contains the return address
of the caller upon entry. Since this function does not modify the LR
register contents, returning to the caller simply involves branching to
LR i.e. "bx lr".

Function Disassembly Walkthrough

To tie together all the concepts introduced above, this section provides
a complete annotated listing of the user mode function CreateFileA() in kernelbase.dll.

Here is the prototype of CreateFileA() along with the registers and stack
locations that would contain the parameters passed in by the caller.

The term "callee" used in the following code snippet refers to the function
CreateFileW() which is called by CreateFileA().
Figure 9 depicts the state of the stack after the sub instruction has
executed i.e. prolog for CreateFileA() has completed.

WinDBG, as of version 6.3.9600, does not pay attention to mov instructions
because they do not fall under the category of flow control instructions.
WinDBG encounters the bx r12 instruction, and gives up on the static disassembly
because it assumes that the value of r12 will be determined at runtime.
It however misses the fact that the above sequence amounts to bx 0x83551978
which is nothing but a call to another function, as shown in the figure below:

Figure 6 : Indirect Branch

So any time WinDBG encounters an indirect branch via a register it
fails to follow the function in its entirety.

Co-Processor

The ARM CPU has multiple co-processors that implement functionality that is not
a part of core instruction execution. The co-processors that are used by Windows,
as well as other operating systems, are:

CP10 (Vector Floating Point Co-processor)

CP14 (Debug Co-processor)

CP15 (System control Co-processor)

The MRC and MCR instructions are used to access the co-processor registers.
The VFP (CP10) can also be accessed using the VMSR and VMRS instructions.

The compiler intrinsics _MoveFromCoprocessor() and _MoveToCoprocessor()
and their variants can be used to access ARM co-processors from C/C++.
The Visual Studio 2013 CRT source file "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\crt\src\ARM\helpexcept.c"
has examples on how to use these intrinsics.

Since the CP15 co-processor contains the most critical registers required
by Windows, some details of this co-processor are included in this section.
The CP15 registers are organized by function groups with each group
represented by a single primary co-processor register referred to as CRn.
The function group descriptions and the corresponding primary control registers
are listed in the table below:

CRn  Functionality
c0   ID and Feature Registers
c1   System Control Register
c2   Translation Table Base
c3   Domain Access Control
c5   Fault Status
c6   Fault Address Register
c7   Cache/Write Buffer Control
c8   TLB Maintenance Operations
c9   Performance Counters
c10  Memory Mapping Registers & TLB Operations
c11  DMA Control
c12  Security Extensions registers
c13  Process, Context & Thread ID Registers

The following table contains some examples of CP15 registers that are used
by Windows for various low level operations.
Individual CP15 registers are selected by the primary co-processor register
(CRn), the secondary co-processor register (CRm), OpCode #1 (Op1) and OpCode#2 (Op2).

CP#  Opc1  CRn  CRm  Opc2  Description
p15  0     c1   c0   0     SCTLR System Control Register (Used by
                           KiInitializeExceptionVectorTable to setup exception handling)

The following code snippet shows the MRC and MCR instructions accessing
the contents of the TPIDRURO register in CP15 using primary register c13,
secondary register c0, OpCode1=0 and OpCode2=3.
The MRC instruction reads the contents of TPIDRURO into ARM register r0.
The MCR instruction writes the contents of ARM register r0 to TPIDRURO.
Figure 10 labels the various operands passed to the MRC instruction.
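
The way these operands are packed into the opcode can be sketched as follows. This Python function assembles the two half-words of a Thumb-2 MRC or MCR instruction from its operands, following the encoding in the ARMv7 Architecture Reference Manual as understood by the author of this sketch; the function name and parameter names are hypothetical.

```python
def encode_coproc_move(read, coproc, opc1, rt, crn, crm, opc2):
    """Encode MRC (read=True) or MCR (read=False) as two Thumb-2 half-words.

    coproc: co-processor number (15 for CP15)
    rt:     ARM register number (0 for r0)
    crn, crm: co-processor register numbers (13 for c13)
    """
    direction = 1 if read else 0                  # 1 = co-processor to ARM register
    halfword1 = 0xEE00 | (opc1 << 5) | (direction << 4) | crn
    halfword2 = (rt << 12) | (coproc << 8) | (opc2 << 5) | 0x10 | crm
    return halfword1, halfword2
```

With the operands described above (p15, OpCode1=0, r0, c13, c0, OpCode2=3) the read form encodes as ee1d 0f70 and the write form as ee0d 0f70; the two differ only in the direction bit.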

System Calls

The SVC instruction causes a Supervisor Call exception.
This provides a mechanism for unprivileged software (user mode applications)
to make calls into the operating system (kernel routines).
WoA uses this mechanism to implement native system calls similar
to the int 0x2e, sysenter and syscall instructions on the X86 and X64 CPUs.
In the code snippet shown below the NTDLL native API NtClose uses the
SVC #1 instruction to invoke the exception handler for system call exceptions
(nt!KiSWIException).
The service index for NtClose() is 0x0d. The usage of register r12 to pass
the service index into the system call is recommended by the ARM
Application Binary Interface (ABI).

WoA uses a system service dispatch table similar to the one on X64.
The kernel variable nt!KiServiceTable points to a table that
contains 32 bit entries each containing a 28 bit relative service offset
and a 4 bit argument count.
The kernel initialization function nt!KeCompactServiceTable() sets up the table.
The logic ServiceAddress = KiServiceTable + (KiServiceTable[ServiceIndex] >> 4)
computes the address of the function that implements the native service.
The "return from exception" instruction (i.e. RFE sp) transfers execution back
to user mode.

The following example shows the address of the function nt!NtClose being
computed relative to the base of the table at nt!KiServiceTable using the
service index 0x0d.
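
The computation can be sketched in Python as follows. The sketch assumes that the 28-bit offset is signed, so that a service routine may lie below the table, and that the low 4 bits hold the argument count as described above; both the function names and the sample values in the usage note are hypothetical and are not taken from a real dump.

```python
def to_signed32(value):
    """Reinterpret a 32-bit value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def decode_service_entry(table_base, entry):
    """Return (service_address, argument_count) for one table entry.

    table_base: address of the table (the value of nt!KiServiceTable)
    entry:      the 32-bit value read from table_base + 4 * ServiceIndex
    """
    offset = to_signed32(entry) >> 4              # upper 28 bits: relative offset
    argument_count = entry & 0xF                  # lower 4 bits: argument count
    return (table_base + offset) & 0xFFFFFFFF, argument_count
```

For example, with a made-up table base of 0x83400000 and an entry value of 0x00123401, the service routine would be at 0x83412340 with an argument count of 1.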

Exception Handling

On the X86/X64 CPU the Interrupt Descriptor Table (IDT) contains pointers
to exception handlers, software interrupt handlers and hardware interrupt
handlers.
The ARM CPU, on the other hand, has a separate exception vector table that
contains instruction opcodes instead of function pointers.
The opcode for each type of exception in the table is the same (0xf8dff01c)
and it encodes an instruction that will transfer execution control to
the PC relative offset to the handler for that exception.
As a part of system startup, the kernel function nt!KiInitializeExceptionVectorTable()
writes the address of the Windows exception vector table (nt!KiArmExceptionVectors) to
the Vector Base Address Register (VBAR) in CP15.
The ARM exception table along with the registered exception handlers is shown below.
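
The opcode 0xf8dff01c can be decoded by hand to see what each vector entry does. The Python sketch below decodes the Thumb-2 "LDR (literal)" encoding as understood by the author of this sketch from the ARMv7 Architecture Reference Manual; the function names are hypothetical, and the address used in the usage note is a made-up example rather than the real location of nt!KiArmExceptionVectors.

```python
def decode_ldr_literal(halfword1, halfword2):
    """Decode a 32-bit Thumb-2 `ldr.w rt, [pc, #imm]` into (rt, offset)."""
    if (halfword1 & 0xFF7F) != 0xF85F:
        return None                               # not an LDR (literal) encoding
    add = bool(halfword1 & 0x0080)                # U bit: add or subtract the offset
    rt = halfword2 >> 12
    imm12 = halfword2 & 0xFFF
    offset = imm12 if add else -imm12
    return rt, offset


def literal_address(instruction_address, offset):
    """Address the load reads from: word-aligned (address + 4) plus offset."""
    pc = (instruction_address + 4) & ~3
    return pc + offset
```

The opcode decodes to register 15 (PC) with an offset of 0x1c, i.e. ldr.w pc,[pc,#0x1c]. For a vector entry at a hypothetical address of 0x1000 the handler address is loaded from 0x1020, which is 0x20 bytes past the entry. Since the eight vector entries occupy 0x20 bytes in total, this is consistent with a table of eight handler addresses placed immediately after the vectors, one per entry.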

On the X86 and X64 there is a single exception handler that handles all
types of page faults.
On the ARM CPU there are two different handlers one for data page faults
(nt!KiDataAbortException) and another one for code page faults
(nt!KiPrefetchAbortException). Both these exception handlers call the
common routine nt!KiCommonMemoryManagementAbort to perform the bulk of
page fault handling.

Fast IRQ handling is not supported on the WoA platform.
Examining the implementation of the FIQ exception handler (nt!KiFIQException)
shows that this function if ever called would bug-check the system with
the stop code 0x3d (INTERRUPT_EXCEPTION_NOT_HANDLED).

Interrupt Descriptor Tables

On the X86/X64 CPU, drivers register their interrupt service routines (ISRs)
through a system provided template directly in the interrupt descriptor table
(IDT).
ARM platforms that have a generic interrupt controller (GIC) do not
support vectored interrupts.
So WoA routes all hardware interrupts through a single entry point
(nt!KiInterruptException) which is responsible for determining the source
of the interrupt from the GIC and then dispatching the interrupt to the
appropriate driver's ISR.

Similar to the X64 CPU, WoA uses a total of 16 IRQLs.
The IRQLs associated with hardware devices are in the range 0x8 through 0xb.
For each device IRQL, the first 16 device interrupts at that IRQL are
registered directly in the KPCR->Idt[] array. Any overflow interrupts i.e.
those beyond the 16 interrupts per device IRQL, are registered in the
KPCR->IdtExt[] array.
The function KiConnectInterruptInternal() determines if there is an overflow
situation and accordingly allocates the extended IDT at KPCR->IdtExt from
NonPagedPool with 0x400 entries.
Both the primary IDT (KPCR->Idt[]) and the extended IDT (KPCR->IdtExt[])
contain pointers to KINTERRUPT structures that were allocated as a result
of drivers registering their ISRs.

Unlike the X86/X64 where the IDT is a hardware defined structure, on the
ARM CPU the IDT is software defined.
This has an interesting security benefit in that the KINTERRUPT structure
on ARM no longer needs to contain any executable code, as can be observed
from the size of the KINTERRUPT.DispatchCode[] array in the above output,
and hence it can be allocated out of Non-Executable NonPagedPool.

In addition to the primary and extended IDTs described above, WoA
also uses a global secondary IDT for General Purpose I/O (GPIO) interrupts.
This IDT is allocated from non-paged pool and is pointed to by the global
variable nt!KiGlobalSecondaryIDT.
Each entry in this table is of type KSECONDARY_IDT_ENTRY which contains
an embedded KINTERRUPT structure as shown below.
The current implementation allocates the secondary IDT with 0x100 entries.

Conclusion

This article described the ARM CPU, registers and Thumb-2 instructions.
It explained the functionality of the instructions typically seen in code
generated by the Visual Studio compiler as well as details of the function
calling convention.
It covered some unique aspects of the ARM CPU like the barrel shifter,
the co-processors, and explicit opcodes used for memory barriers and undefined
instructions, while also explaining how such aspects are used by Windows.
This article also highlighted some of the key differences between how
certain features like trap frames, exception handling, interrupt dispatching,
interrupt descriptor tables, system calls and interlocked operations are
implemented on ARM as compared to X86/X64.

Special thanks to Alex Ionescu (@aionescu) for his review and valuable feedback on this article.