The framework for data prefetch in GCC supports the capabilities
of a variety of targets. Optimizations within GCC that involve prefetching
data pass relevant information to the target-specific prefetch
support, which can either take advantage of it or ignore it. The
information here about data prefetch support in GCC targets was
originally gathered as input for determining the operands to GCC's
prefetch RTL pattern, but might continue to be useful
to those adding new prefetch optimizations.

Existing data prefetch support in GCC includes:

A generic prefetch RTL pattern.

Target-specific support for several targets.

A __builtin_prefetch function that does nothing on targets
that do not support prefetch or for
which prefetch support has not yet been added to GCC.

An optimization enabled by -fprefetch-loop-arrays that
prefetches arrays used in loops.
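In GCC, __builtin_prefetch takes the address to prefetch and two optional
constant arguments: rw (0 for read, the default, or 1 for write) and
locality (0, no temporal locality, through 3, high temporal locality, the
default). A minimal sketch of its use in a loop; the look-ahead of 8
elements is an illustrative choice, not a tuned value:

```c
#include <stddef.h>

/* Sum an array, hinting that upcoming elements will be read once.
   On targets without prefetch support the builtin expands to nothing,
   so the code stays portable. */
long sum_with_prefetch(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0 /* read */, 1 /* low temporal locality */);
        total += a[i];
    }
    return total;
}
```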

Possibilities for future work include:

Greedy prefetch [22] of data referenced by pointer
variables, controlled by an option like -fprefetch-pointers.
Jan Hubicka has said he is interested in doing this.

Prefetch support for additional targets.

Running benchmarks and analyzing results on various targets to validate
prefetch optimization heuristics.

Data prefetch, or cache management, instructions allow a compiler
or an assembly language programmer to minimize cache-miss latency
by moving data into a cache before it is accessed.
Data prefetch instructions are generally treated as hints;
they affect the performance but not the functionality of software in
which they are used.

Data prefetch instructions often include information about the
locality of expected accesses to prefetched memory. Such
hints can be used by the implementation to move the data into the
cache level where it will do the most good, or the least harm.
Prefetched data in the same cache line as other data likely to be
accessed soon, such as neighboring array elements, has
spatial locality.
Data with temporal locality, or persistence, is expected
to be accessed multiple times and so should be left in a cache when it is
prefetched so it will continue to be readily accessible.
Accesses to data with no temporal locality are transient; the data
is unlikely to be accessed multiple times and, if possible, should not be
left in a cache where it would displace other data that might be needed soon.
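These locality classes correspond to the locality argument of GCC's
__builtin_prefetch, from 0 for transient data up through 3 for data with
high temporal locality. A sketch of the mapping; the macro names here are
invented for illustration:

```c
/* Illustrative mapping of the locality classes above onto
   __builtin_prefetch's third argument (a GCC convention; how each
   value is honored is target-specific). */
#define PREFETCH_TRANSIENT(p)  __builtin_prefetch((p), 0, 0) /* no temporal locality */
#define PREFETCH_LOW(p)        __builtin_prefetch((p), 0, 1)
#define PREFETCH_MODERATE(p)   __builtin_prefetch((p), 0, 2)
#define PREFETCH_PERSISTENT(p) __builtin_prefetch((p), 0, 3) /* high temporal locality */

int first_element(const int *p)
{
    PREFETCH_TRANSIENT(p);   /* read once, then let it be evicted */
    return p[0];
}
```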

Some data prefetch instructions allow specifying in which level of
the cache the data should be left.

Locality hints determined in GCC optimization passes can be ignored in
the machine description for targets that do not support them.

Some data prefetch instructions make a distinction between memory
which is expected to be read and memory which is expected to be written.
When data is to be written, a prefetch instruction can move a block
into the cache so that the expected store will be to the cache.
Prefetch for write generally brings the data into the cache in an
exclusive or modified state.

A prefetch for data to be written can usually be replaced with a
prefetch for data to be read; this is what happens on implementations
that define both kinds of instructions but do not support prefetch for
writes.

At least one target's data prefetch instructions have a
base update form, which modifies the prefetch address after
the prefetch. Base update, or pre/post increment, is also supported
on load and store instructions for some targets, and this could be
taken into consideration in code that uses data prefetch.

Some architectures provide prefetch instructions that cause
faults when the address to prefetch is invalid or not cacheable.
The data prefetch support in GCC assumes that only non-faulting
prefetch instructions will be used.

Prefetch timing is important. The data should be in the cache
by the time it is accessed, but without a delay that would allow
other data to displace it before it is used.
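A common way to handle timing is a fixed prefetch distance: each iteration
prefetches the data some number of iterations ahead. The sketch below uses
a distance of 16 elements purely for illustration; a real value would be
tuned to the target's miss latency and the loop's cost per iteration:

```c
#include <stddef.h>

/* Prefetch DIST elements ahead: far enough that the line arrives
   before it is needed, near enough that it is not evicted first. */
enum { DIST = 16 };

double dot(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n) {
            __builtin_prefetch(&x[i + DIST], 0, 1);
            __builtin_prefetch(&y[i + DIST], 0, 1);
        }
        acc += x[i] * y[i];
    }
    return acc;
}
```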

Using prefetches that are too speculative can have negative effects,
because there are costs associated with data prefetch instructions.
These include wasting bandwidth, kicking other data out of the cache and
causing additional conflict misses, consuming slots for memory
instructions [26], and increasing code size, which
can bump useful instructions out of the instruction cache.

Similarly, prefetching data that is already in the cache increases
overhead without providing any benefit
[25]. Data might already be in the
cache if it is in the same cache line as data already prefetched
(spatial locality), or if the data has been used recently (temporal
locality).

On some (but not all) targets it makes sense to combine prefetching
arrays in loops with loop unrolling
[23][26].
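One reason unrolling helps is that a single prefetch fetches a whole cache
line, so issuing one prefetch per element is mostly redundant. The sketch
below assumes a 64-byte line holding eight 8-byte elements; real code would
take the line size from the target rather than hard-coding it:

```c
#include <stddef.h>

/* Unroll by 8 so that one prefetch covers a full 64-byte cache line
   of 8-byte elements (the line size is an assumption here). */
long sum_unrolled(const long *a, size_t n)
{
    long t = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 1); /* 8 chunks = 8 lines ahead */
        t += a[i] + a[i+1] + a[i+2] + a[i+3]
           + a[i+4] + a[i+5] + a[i+6] + a[i+7];
    }
    for (; i < n; i++)          /* remainder loop */
        t += a[i];
    return t;
}
```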

Faulting variants of prefetch instructions are not described here.
Some implementations of the architectures described below recognize data
prefetch instructions but treat them as nop instructions.
Prefetch instructions are generally ignored for pages that are not cacheable.
The exception to this is prefetch instructions with base update forms,
for which the base address is updated even if the addressed memory
cannot be prefetched.

The descriptions that follow are meant to describe the basic
functionality of data prefetch instructions. For complete information
about data prefetch support on a particular processor, refer to the
technical documentation for that processor; the references provide a
starting point for that information.

The Alpha architecture supports data prefetch via load instructions
with a destination of register R31 or F31, which
prefetch the cache line containing the addressed data
[2][3].
Instruction LDS with a destination of register F31
prefetches for a store.

LDBU, LDF, LDG, LDL,
LDT, LDWU

Normal cache line prefetches.

LDS

Prefetch with modify intent; sets the dirty and modified bits.

LDQ

Prefetch, evict next; no temporal locality.

Addresses used for prefetch should be aligned to prevent alignment
traps.

Data prefetch instructions are ignored on pre-21264 implementations
of Alpha.

Data prefetch support in the AltiVec instruction set architecture
is quite different from that of other architectures that GCC supports.
Rather than prefetching a single block of data, it prefetches a
data stream made up of the following elements
[4]:

EA

the effective address of the first unit in the sequence;
there are no alignment restrictions

unit size

the number of quad words (16 bytes each)
in each unit; between 0 and 31

count

the number of units in the sequence; between 0 and 255

stride

the number of bytes between the effective address of one unit
and the effective address of the next unit in the sequence; this can be
negative, but should not be smaller than 16 bytes

dss

(Data Stream Stop); stop a data stream if no more data from it is
needed

dssall

(Data Stream Stop All); stop all data streams

A prefetch instruction specifies one of four data streams, each of
which can prefetch up to 128K bytes, 12K bytes in a contiguous block.
Reuse of a data stream aborts prefetch of the current data stream and
begins a new one. The data stream stop instructions can be used when
data from a stream is no longer needed, for example for an early exit
of a loop processing array elements.
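On AltiVec targets the vec_dst intrinsic starts such a stream, taking a
control word that packs the unit size, count, and stride fields described
above. The sketch below uses the commonly documented encoding (block size
in bits 24-28, count in bits 16-23, signed stride in the low 16 bits);
treat the exact layout as an assumption to verify against [4]:

```c
#include <stdint.h>

/* Pack unit size (quad words, 0-31), count (0-255), and byte stride
   (signed 16-bit) into a dst/vec_dst control operand.  The field
   layout is the commonly documented one; check it against the
   AltiVec documentation before relying on it. */
static inline uint32_t dst_control(unsigned size, unsigned count, int stride)
{
    return ((uint32_t)(size  & 0x1f) << 24)
         | ((uint32_t)(count & 0xff) << 16)
         | ((uint32_t)stride & 0xffff);
}
```

On an AltiVec target one would then write, for example,
vec_dst(p, dst_control(2, 8, 64), 0) to start stream 0; the helper itself
is plain integer packing and target independent.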

Additional AltiVec instructions for cache control are
lvxl (Load Vector Indexed LRU) and stvxl
(Store Vector Indexed LRU), which indicate that an access
is likely to be the final one to a cache block and that the address
should be treated as least recently used, to allow other data to
replace it in the cache.

The differences between AltiVec's cache control instructions and
the PowerPC instructions dcbt and dcbtst are
discussed in section 5.2.1.7 of [4].

GCC data prefetch support for AltiVec could use the
PowerPC prefetch support, which fits into the
prefetch framework.
Using a constant unit size and always using a count of 1 would make a data
stream touch behave like data prefetch instructions on other targets,
allowing it to fit in GCC's data prefetch framework, but this would require
specifying a data stream for each prefetch and keeping track of which ones
are in use.

The IA-32 Streaming SIMD Extension (SSE) instructions are used on several
platforms, including the Pentium III and Pentium 4 [6]
and IA-32 support on IA-64 [8].
The SSE prefetch instructions are included in the AMD extensions to 3DNow!
and MMX used for x86-64 [5].

The IA-64 lfetch (Line Prefetch) instruction has versions for
read and write prefetches, and an optional modifier to specify the
locality of the memory access and the cache level to which the data
would best be allocated [8].

The possible values for the locality hint are:

none

Temporal locality for cache level 1 and higher (all levels).

nt1

No temporal locality for level 1, temporal for level 2 and higher.

nt2

No temporal locality for level 2, temporal for levels above 2.

nta

No temporal locality at any level.

There are two base update forms of lfetch, which increment
the register containing the address and then implicitly prefetch the new
address, as well as the original address. The increment value is either
in a second general register or is an immediate value.

Line size is implementation dependent; it is a power of 2 and at
least 32 bytes.

Load and store instructions can also be used to prefetch data.
The base update forms of these instructions imply a prefetch, and
have a completer that specifies the locality of the memory access.

The PREF (Prefetch) instruction, supported by MIPS32
[9] and MIPS64 [10],
takes a hint with one of the following values:

load

data is expected to be read, not modified

store

data is expected to be stored or modified

load_streamed

data is expected to be read but not reused

store_streamed

data is expected to be stored but not reused

load_retained

data is expected to be read and reused extensively

store_retained

data is expected to be stored and reused extensively

writeback_invalidate

data is no longer expected to be used

PrepareForStore

prepare the cache for writing an entire line

The "streamed" versions place the prefetched data into the cache in
such a way that it will not displace data prefetched as "retained".
The "retained" versions place the data in the cache so that it will not
be displaced by data prefetched as "streamed."
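GCC's generic __builtin_prefetch arguments map naturally onto these hints:
the rw argument selects between the load and store forms, and a low
locality value suggests a streamed variant. Whether a MIPS backend
actually emits store_streamed for the pattern below is target dependent;
the sketch only shows the portable source-level idiom:

```c
#include <stddef.h>

/* Fill a buffer that will not be re-read soon: prefetch for write with
   no temporal locality, which a MIPS backend could lower to a
   store_streamed PREF (whether it does is up to the target). */
void fill_streamed(int *dst, size_t n, int value)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&dst[i + 16], 1 /* write */, 0 /* streamed */);
        dst[i] = value;
    }
}
```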

The prefetch moves a block of data into the cache. The size is
implementation specific.

There are no alignment restrictions.

The PREFX (Prefetch Indexed) instruction, supported by MIPS64,
differs in the addressing mode and is for use with floating point data.

On PA-RISC, prefetch and cache control are also supported for accesses
of semaphores.

Some load and store instructions modify the base register, providing
either pre-increment or post-increment, and some provide a cache control
hint; a load instruction can specify spatial locality, and
a store instruction can specify block copy or spatial locality.
The spatial locality hint implies that there is poor temporal locality
and that the prefetch should not displace existing data in the cache.
The block copy hint indicates that the program is likely to store a
full cache line of data.

There are no alignment requirements on the address of prefetched data;
the low order part of the address is ignored.

The SPARC version 9 instruction set architecture defines
the PREFETCH (Prefetch Data) and
PREFETCHA (Prefetch Data from Alternate Space)
[15] instructions, whose variants are specified
by the fcn field:

0

prefetch for several reads

Move the data into the cache nearest the processor (high degree of
temporal locality).

1

prefetch for one read

Prefetch with minimal disturbance to the cache (low degree of
temporal locality).

2

prefetch for several writes (and possibly reads)

Gain exclusive ownership of the cache line (high degree of
temporal locality).

3

prefetch for one write

Prefetch with minimal disturbance to the cache (low degree of
temporal locality).

4

prefetch page

Shorten the latency of a page fault.

UltraSPARC-I treats these instructions as nops [18].
UltraSPARC-II and UltraSPARC-IIi support them by mapping the variants
listed above onto two variants for read and write prefetch with no or low
temporal locality [19][20].

There are no alignment restrictions on the address to prefetch; the
instructions ignore the 5 least significant bits.

The Intel XScale processor incorporates ARM's DSP-enhanced instructions,
including the PLD (Preload) instruction.
This instruction prefetches the 32-byte cache line that includes
the specified data address.

NOTE: More investigation is necessary; [23]
has an example that implies that base update might be available.

These references need cleanup and should actually be used in the text
above that uses the information. Many of the links will likely be out
of date soon, but they'll stay here until the initial rush of prefetch
work is done.

[21a] Compiler Writer's Guide for the Alpha 21264,
Order Number EC-RJ66A-TE, June 1999.

[21b] Compiler Writer's Guide for the 21264/21364,
Order Number EC-0100A-TE, January 2002.

[22] Compiler-Based Prefetching for Recursive Data Structures,
Chi-Keung Luk and Todd C. Mowry, linked from
http://www.cs.cmu.edu/~tcm/Papers.html.
That location also has links to several other papers about data prefetch
by Todd C. Mowry.


Copyright (C)
Free Software Foundation, Inc.
Verbatim copying and distribution of this entire article is
permitted in any medium, provided this notice is preserved.