Spatial locality and block size

So far we assumed that the block size -- the amount of data in each
cache entry -- is one word. By storing multiple (2, 4, ...) consecutive
words in a cache entry and fetching all the words on a cache miss, we
can improve performance due to spatial locality (see Gottlieb's diagram
of 4-word blocks). There is a limit to the benefit of increasing block
size, however:

it reduces the number of cache entries.

it increases the miss penalty: the time required to fetch a cache
entry on a cache miss. To reduce the miss penalty, modern main memories
are designed to fetch multiple words on successive clock cycles.
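A minimal sketch of the block-size effect, using a hypothetical
direct-mapped cache model and a purely sequential access pattern (the
cache size and block sizes below are illustrative, not from the notes):

```python
# Toy direct-mapped cache: count the miss rate as block size grows.
# With sequential accesses, only the first word of each block misses,
# so the miss rate falls as 1 / block_words (spatial locality).

def miss_rate(addresses, num_entries, block_words):
    """Fraction of accesses that miss in a direct-mapped cache of
    num_entries entries, each holding block_words consecutive words."""
    tags = [None] * num_entries          # tag stored in each cache entry
    misses = 0
    for addr in addresses:
        block = addr // block_words      # memory block containing this word
        index = block % num_entries      # entry the block maps to
        tag = block // num_entries
        if tags[index] != tag:           # miss: fetch the whole block
            tags[index] = tag
            misses += 1
    return misses / len(addresses)

seq = list(range(1024))                  # sequential word addresses
for bw in (1, 2, 4, 8):
    print(bw, miss_rate(seq, num_entries=64, block_words=bw))
```

For this access pattern the printed miss rates halve each time the
block size doubles; a pattern with no spatial locality would not
benefit, which is the other side of the trade-off above.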

Strategies for memory writes

Two basic strategies:

write-through: writes always update both the cache and memory. So that
the processor does not have to wait for the memory write to finish, we
include a write buffer, which holds information on store instructions
that have not yet been written to main memory.

write-back: writes only update the block in the cache; when the block
is replaced in the cache, the modified words are written back to main
memory. This is more complex but reduces the main memory traffic, since
a program may modify a memory word several times while it is in the
cache.
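The traffic difference can be seen in a toy model (a hypothetical
one-entry cache and an invented write pattern, purely for
illustration):

```python
# Count main-memory writes under the two policies for a program that
# writes the same block repeatedly before moving on.

def memory_writes(writes, policy):
    """writes: list of block numbers written; one-entry cache for simplicity."""
    cached, dirty, traffic = None, False, 0
    for block in writes:
        if block != cached:              # miss: replace the cached block
            if policy == "write-back" and dirty:
                traffic += 1             # write the modified block back on eviction
            cached, dirty = block, False
        if policy == "write-through":
            traffic += 1                 # every store also updates memory
        else:
            dirty = True                 # write-back: just mark the block modified
    if policy == "write-back" and dirty:
        traffic += 1                     # final flush of the dirty block
    return traffic

pattern = [0] * 10 + [1] * 10            # 10 writes to block 0, then 10 to block 1
print(memory_writes(pattern, "write-through"))   # 20 memory writes
print(memory_writes(pattern, "write-back"))      # 2 memory writes
```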

Effect on performance: effective memory access time

The goal is to have the effective memory access time be close to the
access time of the fastest memory (the cache).
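The standard formula is: effective access time = hit time + miss rate
x miss penalty. A quick computation with assumed illustrative numbers
(not from the notes):

```python
# Effective (average) memory access time for a single cache level.
hit_time = 1          # cycles to access the cache on a hit
miss_rate = 0.05      # fraction of accesses that miss
miss_penalty = 100    # cycles to fetch a block from main memory

amat = hit_time + miss_rate * miss_penalty
print(amat)           # about 6 cycles -- close to the 1-cycle hit time
```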

Effect on performance: CPI (p. 475-477)

Calculate cache performance in terms of its effect on the CPI: assume
each miss (for an instruction fetch, data load, or data store) incurs a
miss penalty, measured in clock cycles, resulting from the CPU stalling
while it waits for data from main memory.
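A sketch in the style of the textbook's examples, with assumed miss
rates and penalty (the numbers are illustrative, not taken from pp.
475-477):

```python
# Effect of cache misses on CPI: add the average stall cycles per
# instruction from instruction misses and data misses to the base CPI.
base_cpi = 1.0
i_miss_rate = 0.02     # instruction-cache miss rate
d_miss_rate = 0.04     # data-cache miss rate
loads_stores = 0.36    # fraction of instructions that access data
miss_penalty = 100     # stall cycles per miss

stall_cycles = (i_miss_rate * miss_penalty
                + loads_stores * d_miss_rate * miss_penalty)
cpi = base_cpi + stall_cycles
print(cpi)             # memory stalls dominate the base CPI here
```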

Unified vs. split instruction / data cache

Having separate caches for instructions and data does not improve the
hit rate but does support increased bandwidth -- one can fetch an
instruction and a data word at the same time. Most current processors
have separate L1 I and D caches.

Two-level cache

As the gap between CPU speed and memory speed grows, the penalty for a
cache miss becomes unacceptably high. To address this problem, all
modern high-end CPUs have at least two levels of caches: a very fast,
and hence not very big, first-level (L1) cache together with a larger
but slower L2 cache. Some recent microprocessors (e.g., Core i7) have 3
levels.

When a miss occurs in L1, L2 is examined, and only if a miss occurs
there is main memory referenced. (Performance analysis, p. 485).
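The single-level access-time formula extends naturally: an L1 miss
pays the L2 access time, and only an L2 miss pays the full main-memory
penalty. With assumed illustrative numbers:

```python
# Average memory access time with a two-level cache.
l1_hit, l1_miss_rate = 1, 0.05     # cycles; fraction of L1 accesses that miss
l2_hit, l2_miss_rate = 10, 0.25    # cycles; fraction of L2 accesses that miss
mem_penalty = 200                  # cycles for a main-memory access

amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty)
print(amat)
```

With these numbers the L2 cache cuts the average access time sharply:
without it, every L1 miss would pay the full 200-cycle penalty.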