ARMv5E - Upgraded ARM ISA that adds instructions for accelerating DSP operations. All ARMv5E+ hardware has at least a 16 bit wide multiplication unit, plus additional instructions for single cycle packed 16 bit fixed point multiplication and multiply-accumulate, as well as saturating addition.

ARMv6 - Upgraded with additional DSP-style operations and support for unaligned loads/stores (although these are slower). Adds support for packed SIMD-style addition and multiplication operations on 32 bit registers, and additional fixed point multiplication instructions (basically instructions that multiply, then shift the result).

NEON - Adds the NEON coprocessor, a separate SIMD core with thirty-two 64 bit registers which can also be accessed as sixteen 128 bit registers. All NEON instructions operate on the NEON register file; results can only be returned to the main core via coprocessor instructions or by writing out to memory, both of which incur high latency. NEON operations include vector fixed and floating point addition, multiplication, loads, stores and shifts. Most operations are fully pipelined, making them vastly more efficient than standard ARM operations.

With each ARM generation, scheduling becomes increasingly important. Fortunately, code scheduled for a later ARM processor runs at or near optimally on earlier ARM cores, with a few exceptions. Therefore one should ideally schedule for the ARM11 when writing code, even when developing on earlier processors.

PP5022/PP5024

Pipeline Interlocks: All single loads have a 1 cycle interlock if used immediately after load.

Similar to ARM7TDMI, but with a 5 stage pipeline and separate caches for instructions and data, which eliminates the unconditional ldr delay under some circumstances. Attention to pipelining becomes more important. On all ARM9 cores, a register should not be used on the cycle immediately after a load into it is issued.

AS3525v1

Cache: 8KB I, 8KB D, 0 cycle latency

IRAM: 320KB, performance is comparable to DRAM

Boosting: Yes (62/248MHz)

Examples: e200v2, m200v4, c200v2, fuzev1, clipv1

Memory performance is fairly poor when boosted, which hurts battery life if codecs require frequent boosting. IRAM seems no better than DRAM. Codecs run entirely from IRAM.

IRAM is significantly faster than DRAM, but still has higher latency than cache. In general, memory performance is very poor on this CPU. The memory bus speed is limited to 100MHz, so latency increases when boosting.

Pipeline Interlocks: All single loads have a 1 cycle interlock if the result is used immediately after the load; single multiplies have a 1 cycle interlock if the result is used outside the multiplier unit on the next cycle (e.g. a multiply-accumulate has no interlock on sequential cycles, but a multiply followed by a store does).

ARM11

Load/Stores: Still single issue; however, the load/store and ALU pipelines are independent, and loads can retire out of order if there are no dependencies. Load multiple instructions are now single cycle, with memory accesses occupying only the memory pipeline on subsequent cycles, so in principle one can load many registers in just one cycle if subsequent cycles are occupied with independent ALU ops.

Pipeline Interlocks: All single loads have a 3 cycle interlock if the result is used immediately after the load. Double word aligned multiple loads are much faster than single loads, and multiple loads can be faster than double loads due to the memory pipeline. Interlocks now occur when using some multiplication instructions on sequential cycles (e.g. smlawY accumulating to the same register will stall), so avoid accumulating into the same register on sequential cycles.

Similar to ARM9E but with an even longer pipeline, branch prediction, an added L2 cache, a 64 bit load/store unit, separate ALU and memory pipelines, and an ISA upgraded to v6. Load multiple and load double instructions now fetch two registers per clock if they are even word aligned. Load multiple and store multiple instructions issue in one cycle, but will stall if the registers involved are read before they are available (when loading), written before their contents are stored (when storing), or if any other memory access is started. Large interlock latencies mean that considering pipelining is essential. If properly scheduled, performance is substantially improved over ARM9E due to the improved cache, branch prediction and wider load/store unit.

iMX31

Cache: 16KB I, 16KB D, 0 cycle latency, 128KB L2

IRAM: 16KB, not used

Boosting: No.

Examples: Gigabeat S

High clock speed means memory is fairly slow.

Coldfire

These are RISC variants of the Motorola 68k architecture. Coldfire architecture versions:

When accessing data in DRAM, use 16 byte aligned movem.l wherever possible. Line burst transfers are ~2.5 times as fast as four separate 4 byte (longword) transfers.

Use the EMAC if the algorithm allows it. Standard multiplication instructions use the same multiplier, but they always take several cycles because they're synchronous. The EMAC is pipelined; stalls occur only if you fetch the result from %accN too early.

EMAC instructions can load from memory in parallel while multiplying, with only one extra cycle. The point above about long instructions still applies, though, so avoid using offsets if possible.

MCF5250

Same as MCF5249 except IRAM is 128 KB single cycle (64 KB + 64 KB; only first block is DMA capable).