What is Latency?

In this context, latency is the time (in either clocks or nanoseconds) taken to transfer a block of data from main memory or the GPU caches. We want the data as quickly as possible, thus the lower the time the better. The size of the data block we request is usually the size of a native pointer (4 bytes on 32-bit, 8 bytes on 64-bit).

As a GPU (or APU) executes instructions, both the instructions themselves and the data they operate on must be brought into registers; until the instruction/data is available, the GPU cannot proceed and must wait. Even advanced designs that can execute out-of-order eventually need the data.

Latency is generally measured in core "clocks" (1/frequency) for caches (as they usually run at GPU speed) and in nanoseconds (10^-9 s) for main memory.

Why is it important to measure it?

The latency of main memory directly influences the efficiency of the GPU, and thus its performance: reducing wait time can be more important than increasing execution speed. Unfortunately, memory latency is huge (today, a factor of 100 or more over the core clock): a GPU waiting 100 clocks for data would run at 1/100 efficiency, i.e. 1% of theoretical performance!

Modern GPUs have internal "memory caches" that mirror instructions/data from main memory but at far lower latencies; they allow the GPU to get data much faster and thus increase efficiency. Unfortunately the faster the cache the smaller it needs to be, thus modern GPUs contain various cache hierarchy levels (L1, L2, L3) that get progressively bigger but slower.

Memory is differentiated not only by the speed it runs at (MHz) but also by its type (e.g. DDR3, GDDR3, GDDR5, etc.) and by the timings (command latencies) it supports (e.g. tCAS/CL, tRP, tRCD, tRAS, etc.). The lower the timings, the lower the overall latency of the memory.

What kinds of memory do GPUs have?

While CPUs have "data" and "code/instruction" memory/caches, GPUs have additional memory types that, from a compute perspective, serve different purposes and have different characteristics. As GPUs are generally SIMT designs, threads execute in groups (blocks/warps), not independently as on multi-core/multi-threaded CPUs.

Code/Instruction Memory: Read-only global memory that holds the instructions to be executed, not the data they operate on. Generally cached.

Are the Cache / Memory latencies fixed?

No. Modern GPUs also contain "data prefetchers" which bring data into the caches speculatively, i.e. they guess which instruction/data will be needed next and fetch it so it is ready when needed. Thus the GPU does not need to wait for data to be brought all the way from main memory but can get it from the cache.

Prefetchers work by recognising patterns (spatial, temporal, etc.) in data accesses as code executes. Thus the latency of accessing data depends entirely on whether the prefetchers have "understood" the pattern and fetched the right data into the caches.

Sandra allows you to test various access patterns and thus observe the latencies of the various cache levels and memory, as well as the effect of the prefetchers.

What is the TLB?

The "page table" is what maps virtual to physical addresses and thus virtual pages to real memory. The TLB (translation look-aside buffer) is a hardware cache of the most recent mappings from the page table.

If the TLB does not contain the required mapping - a "TLB miss" - the page table itself must be walked, which is very much slower: the "page-walk" penalty. GPUs may contain multiple TLB levels - just like cache levels - but typically cover only 512 entries x 4kB pages = 2MB (the "TLB range"). This is tiny compared to the 8-16GB of memory in today's computers.

How does this relate to latency measurement?

As the TLB range is relatively small, an algorithm accessing a large memory block in a random pattern is likely to miss the TLB and thus incur the "TLB miss" penalty. The total latency to access a data item not cached in the L1D/L2 caches is then not just the L3/memory access latency but also this additional penalty.

The latency values published by the manufacturers are naturally "best case", and include only L1D/L2/L3/Memory access times and not any additional latencies incurred in practice.

Given the small native page size and thus small TLB range, we do not believe it is realistic that algorithms would avoid the page-walk penalty when accessing memory outside L1D/L2.

What are the memory access patterns Sandra uses?

Sandra allows you to test various access patterns and thus observe the latencies of the various cache levels and memory, as well as the effect of the prefetchers:

Sequential Access Pattern: Memory is accessed sequentially which is an easy pattern for prefetchers - "a show-case for prefetchers"; thus the latencies will be "best case", very much reduced.

In-Page Random Access Pattern: Memory is accessed in a random pattern within the page (either native or large): this ensures there are no "TLB miss" latencies, just raw cache/memory latencies. Some prefetchers (e.g. "adjacent line prefetcher") still have an impact.

Full Random Access Pattern: Memory is accessed in a random pattern within the whole block. Large blocks may incur a "TLB miss" depending on the "TLB range".

Note: OpenCL was used as it is supported by all the GPUs/APUs tested. The tests are also available through CUDA, which provides more precise clock timings (via the core clock tick counter) that are not available in OpenCL or DirectX ComputeShader.

Hardware Specifications

Here are the GPUs and APUs we are comparing in this article:

GPU / APU | Core (CU) Speed / Turbo | Cores (CU) / Threads (SP) | Memory / Speed | Registers / Const / Shared / L2+L3+L4 cache
nVidia GeForce 8800 GTS (GT80) | 1188MHz | 12C / 96SP | 640MB GDDR3 800MHz 320-bit | 8k / 64kB / 16kB
nVidia GeForce GTX 260 (GT200) | 1295MHz | 24C / 192SP | 896MB GDDR3 1GHz 448-bit | 16k / 64kB / 16kB
nVidia GeForce 555M (Fermi) | 1180MHz | 3C / 144SP | 1.5GB DDR3 1.8GHz 192-bit | 32k / 64kB / 48kB / 384kB
nVidia GeForce 660 TI (Kepler) | 980MHz / 1100MHz | 7C / SP | 2GB GDDR5 6GHz 192-bit | 64k / 64kB / 48kB / 384kB
AMD A6-3650 APU (Llano) / Radeon HD 6530D | 444MHz | 4C / 320SP | 512MB DDR3 1.33GHz 128-bit (shared out of 8GB) | 16k / 64kB / 32kB / 64kB
AMD Radeon HD 6850 (Barts) | 775MHz | 12C / 960SP | 1GB GDDR5 4GHz 256-bit | 16k / 64kB / 32kB / 256kB
Intel i7-3xxxM APU (Ivy Bridge) / GT2 HD 4000 | 650MHz / 1050MHz | 16C / 16SP* | 512MB DDR3 1.33GHz 128-bit (shared out of 8GB) | 16k / 64kB / 64kB / 2MB
Intel i7-4xxxM APU (Haswell) / GT3 HD 5200 | 600MHz / 800MHz | 40C / 40SP* | 512MB DDR3 1.6GHz 128-bit (shared out of 8GB) | 16k / 64kB / 64kB / 2MB + 128MB eDRAM

Global Cache/Memory Latency

"Global Memory" is device memory - either dedicated memory in the case of a GPU or shared system memory in the case of an APU. It can hold any data type, can be read or written, and can be accessed by any thread running on the GPU.

While not all GPUs cache global memory, they do have TLB caches - just like modern CPUs. The "random in-page" access pattern that Sandra uses is especially designed to avoid TLB misses and thus measure the "real" cache/memory latencies. The "full random" access pattern can be used to measure TLB miss penalties where desired.

GPU | L1D (clk) | L2 (clk) | L3 (clk) | Memory (clk) | Comment
GeForce 8800 GTS | n/a | n/a | n/a | ~502clk / ~577ns | It is pretty clear that there are no caches for global memory on the old G80. Over 4MB we see TLB miss penalties, as we are using the full random access pattern.
GeForce 260 GTX | n/a | n/a | n/a | ~493clk / ~380ns | No caching effects on GT200 either, with pretty small TLB miss penalties.
GeForce GT 555M | 4kB ~20clk | 32kB ~100clk | 256kB ~320clk | ~680clk / ~575ns | Fermi adds caching to global memory in a pretty much textbook result, with extremely low latency (if small) L1D/L2D caches and a reasonably performant global L3 cache. TLB penalties are pretty significant, with very high worst-case memory latency - DDR3 memory does not help.
Radeon HD 6850 | 8kB ~320clk | 256kB ~365clk | n/a | ~545clk / ~703ns | L1D cache as slow as Fermi's L3 (!) and the L2D not much help, but better than nothing. At least TLB miss penalties are not as bad as Fermi's; while memory latency is lower in terms of clocks, it is slower in real time even with GDDR5 (~703ns vs ~575ns).
AMD Llano APU | 8kB ~320clk | 64kB ~363clk | n/a | ~493clk / ~1110ns | Same L1D as the HD 6850 but a smaller L2D - at least no worse than a dedicated GPU, which is not bad for a 1st-gen APU. Memory latency is higher, no doubt due to the shared DDR3 memory.
Intel Ivy Bridge APU | 128kB ~90clk | n/a | n/a | ~300clk / ~272ns | We only find one cache here (L1D), but it is 3 times faster than AMD's (90clk vs 363clk), matches Fermi's L2 and is reasonably large. Main memory latency is extremely low for an APU - only ~270ns even including TLB miss penalties - again 3 times faster than Llano using the same 1.33GHz DDR3 memory.
Intel Haswell APU | 256kB ~100clk | n/a | n/a | ~410ns | We only find one cache here (L1D), double Ivy's size and 10clk slower. However, main memory latency is huge, almost 2x (410ns vs. 272ns). Whatever changes were made to the RingBus, latencies seem to have increased considerably - perhaps that is the reason for doubling the L1D?

An impressive result for "Fermi" and a good result for APUs, with an impressive memory controller on "Ivy Bridge". The Radeon 6850 comes off worst even though it is the only dedicated GPU using GDDR5 memory.

Constant Cache/Memory Latency

"Constant Memory" is read-only memory and as such more likely to be cached. As it is limited in size (e.g. 64kB) it needs to be used judiciously. No TLBs are generally needed as it may span 1 or very few pages.

GPU | L1D (clk) | L2 (clk) | Memory (clk) | Comment
GeForce 8800 GTS | 2kB ~90clk | 32kB ~210clk | ~365clk / ~307ns | Unlike global memory, constant memory is cached - though still much slower than shared memory.
Radeon HD 6850 | n/a | n/a | n/a | Worse L1D latency than global memory (+20clk) - might as well not bother with constant memory at all! That is not what we expected here at all. High worst-case latency as well (~500ns), over 2x as much as Fermi (~200ns) - and this is a dedicated GPU!
Intel Ivy Bridge APU | n/a | n/a | n/a | Likely due to the large L1D cache we saw with global memory, latency is constant throughout the range - and the same as for global memory. Very low worst-case time of ~90ns, the lowest by miles - even Fermi is 2x slower.
Intel Haswell APU | n/a | n/a | ~100clk / ~135ns | As with Ivy, we do not detect any caches here; assuming the only cache is the large L1D one, the constant cache is too small to observe any caching effects. As with global memory, latency seems 10clk higher.

Disastrous results for AMD's Radeon 6850 and Llano APU - just don't use constant memory. Great results on Fermi at small block sizes, but overall Intel's Ivy Bridge has great performance throughout the range.

Shared Memory Latency

"Shared Memory" is thread-group memory used to transfer data between threads running in the same group. As such it is pretty limited in size (e.g. 16-32kB) and not cached.

GPU | Latency (clk / ns) | Comment
Radeon HD 6850 | n/a | Pretty slow against the competition (5x slower than Fermi!) but half the latency of the global/const L1D cache. It may be worth copying constant memory data into shared memory where possible!
AMD Llano APU | ~163clk / ~368ns | Similar to the Radeon 6850 result: slow, but not as slow as the global/const L1D cache.
Intel Ivy Bridge APU | ~76clk / ~78ns | Somewhat high latency, but still lower than global/const L1D, thus the normal optimisations still apply.
Intel Haswell APU | ~84clk / ~108ns | Again ~10clk higher latency than Ivy, but still lower than global/const as well as the competitor APUs. Still, somewhat disappointing that the latency has not improved.

Bad results for AMD's Radeon 6850 and Llano APU again - though an opportunity for optimisation arises: copying constant data to shared memory halves the latency! That is exactly how we improved the GP Cryptography benchmarks (AES encrypt/decrypt kernels), with great success. This optimisation also benefits Ivy Bridge, but can be worse on Fermi, where the global/const L1D cache is 50% faster.

Private Memory Latency

"Private Memory" is thread-local memory, used for thread data manipulation. Each thread has a limited number of registers available for this purpose (roughly the register file per CU divided by the number of active threads per CU) - any "overspill" causes global memory to be used instead. As global memory latency is huge compared to registers (1clk), overspill has to be avoided at all costs.

GPU | L1D (clk) | L2 (clk) | Memory (clk) | Comment
GeForce 8800 GTS | n/a | n/a | n/a | Up to 8kB, latency is similar to global memory; over that size, latencies are even higher.
GeForce 260 GTX | 8kB ~450clk | n/a | ~480clk / ~370ns | Similarly to G80, up to 8kB latency is similar to global memory and higher over that size. Overspills are costly.
GeForce GT 555M | 1/2kB? ~280clk | 32kB ~323clk | ~558clk / ~472ns | It is not conclusive whether there is a 1/2kB L1D, but the L2D is clearly visible and, while "slow", it does help - the competition has no caches at all! Worst-case latency is high though, comparable to global memory latency.
Radeon HD 6850 | n/a | n/a | ~532clk / ~687ns | No caching is visible and overspills are costly: worst-case latency is high (~687ns) - though comparable with the competition.
AMD Llano APU | n/a | n/a | ~514clk / ~1159ns | Similar to the Radeon 6850 result: no caching visible and costly overspills. While slightly lower in terms of clocks, the real-time latency (~1160ns) is very high.
Intel Ivy Bridge APU | 128kB ~119clk | n/a | ~119clk / ~113ns | While it may appear there is no caching here either, the L1D cache we saw in global/const has similar latencies to what we see here - we are nowhere near global memory worst-case latencies. Real-time worst-case latency (~113ns) is 1/10 that of Llano and 1/5 that of Fermi!
Intel Haswell APU | TBA | TBA | TBA | TBA

Bad results for AMD's Radeon 6850 and Llano APU yet again, with costly overspill penalties - Ivy Bridge rules both APUs and GPUs! Fermi's honour is saved by its caches.

Texture Cache/Memory Latency

This article does not investigate texture cache/memory latencies.

GeForce 8800 GTS

The 8800 (GT80) was the world's first "mass-market" GPGPU, supporting CUDA 1.0 - and DirectX 10 - and thus just as revolutionary as the original GPU (Riva TNT). Its unified shaders could, for the 1st time, perform a more varied set of tasks - like GPGPU - and even today it can run CUDA and OpenCL applications. Its 8-SP-per-SM design remained unchanged until CUDA 2.0.

Its GDDR3 memory and wide bus are holding their own against the modern DDR3/GDDR3 competition, and while global memory is not cached, most CUDA/OpenCL applications should have taken this into account long ago. Constant memory is cached, with decently fast L1D and L2, and shared memory is fast also - similar to modern GPGPUs.

GeForce 260 GTX

The 260 GTX, like its big brother the 280 GTX, was based on the 2nd-generation GPGPU architecture (GT200), supporting CUDA 1.3 and, for the first time, double/FP64 support in hardware. High-precision scientific applications (those requiring 64-bit precision) could finally be ported to GPGPU.

While its GDDR3 memory is wider (448-bit vs 320-bit) and faster, latencies are pretty much similar to the previous G80's. Global memory is still uncached, and constant and shared memory latencies are comparable.

GeForce GT 555M (Fermi)

"Fermi" (CUDA 2.x) has one major improvement over the previous G80/GT200 (CUDA 1.x) architectures: global memory is now cached, with a 3-level cache visible - the same as constant memory. Previously only TLB caches (L1 TLB, L2 TLB) existed for global memory, though constant memory was cached. Fermi is thus more "forgiving" of memory accesses, though optimisation is still required.

Its 3-level cache architecture (L1D, L2D, L3D) is the most complex design here, but not unexpected - it is similar to the cache architecture of modern CPUs.

A very fast but small L1D (4kB ~20clk) and L2D (32kB ~100clk) keep latencies down for global and constant memory, but latencies increase with block size until the worst-case value (~680clk) is over 30x higher! This is the only design where the L1D is faster than shared memory.

Shared memory is fast with constant latency throughout the range. Private memory used for overspills is slow but, due to caching, faster than the competition.

AMD A6-3650 APU (Llano) / Radeon HD 6530D

"Llano" was the 1st mainstream (both desktop and mobile) APU and as such it has enjoyed mass-market appeal. While recently replaced with "Trinity", it is still used in a vast number of systems.

Its 32nm DirectX 11 GPU (BeaverCreek/Sumo) is based on the Radeon 5500 series (Redwood) and is thus a VLIW5 design with 80 SP per CU and 4-5 CUs. Here we test the 4C/320SP version.

It is quite clear that there is little point in using constant memory: stick to global (-20clk). If possible, copy constant data into shared memory, which is 3 times faster (~163clk vs. ~351clk). Private memory used for overspill is very slow (~490clk), so overspills have to be avoided like the plague.

AMD Radeon HD 6850 (Barts)

"Barts" (6800 series) is the successor of "Cypress" (5800 series), the 1st of the "Northern Islands" family but still a VLIW5 design on 40nm. While it boasts various improvements, it also lacks some features (e.g. double/FP64 support) and, with far fewer SIMD units, actually performs lower - it is all about lower cost.

You need to look at the 6900 series for a worthy successor to the 5800 series or even the 7800 series.

Somewhat surprisingly, the same comments as "Llano" apply here, even though it is a dedicated GPU and not an APU. Code optimised for one will run equally well on the other.

Intel i7-3xxxM APU (Ivy Bridge) / GT2

"Ivy Bridge" is the first true Intel APU (GT1/GT2, EU v7), as it includes GPGPU capabilities; the previous "Sandy Bridge" model did contain a built-in GPU (GT1/GT2, EU v6) but had no GPGPU capabilities - they were emulated in software on the CPU.

The world's 1st 22nm device, it has few but complex EUs (CUs) with an undisclosed number of SPs (6-8) per EU. It also contains a 2MB cache for code and data.

Intel i7-4xxxM APU (Haswell) / GT3 HD 5200

"Haswell" is the 2nd-generation Intel APU (GT1/GT2 and GT3, EU v7.5) on the same 22nm process. Our GT3 sample has more than double the EUs (40!) of GT2 (16), as well as a 128MB eDRAM/L4 cache; while its core speed is only slightly lower (600MHz vs. 650MHz), Turbo speed is much lower (800MHz vs. 1.1GHz). All things considered, GT3 offers higher performance for less power - but it is not cheap!

Final Thoughts / Conclusions

While nVidia GPU latencies have been investigated in more detail by other parties before, here we compare GPUs and APUs from multiple vendors using the common OpenCL interface (and DirectX ComputeShader). Fermi shows one major improvement - cached global memory - and while its 3-level cache architecture is complex, it works as expected, similar to modern CPUs.

The AMD GPU and APU throw up a few surprises which, whatever their nature (hardware, compiler, etc.), mean that some kernels may need to be optimised differently for best performance. We (SiSoftware) have ourselves used these latency results to optimise the GP Cryptography benchmarks (AES encrypt/decrypt kernels) with great success (details in a future article).

Intel's first APU is also interesting to test: it behaves similarly to AMD's own APU and GPU (albeit with far lower latencies) rather than nVidia's GPUs. While its cache architecture is very simple (one large L1), it works well.

We have shown that there is no single "latency": latencies vary greatly with memory type and access pattern. The way kernels access memory and the type of memory used directly influence the latencies they will experience.