In class notes

SIMD
• Single instruction on multiple data
– This form of parallel processing has existed since the
1960s
– The idea is rather than executing array operations by loop,
we execute all of the array operations in parallel on
different processing elements (ALUs)
• we convert for(i=0;i<n;i++) a[i]++; into a single operation, say
A=A+1
– Not only do we get a speedup from the parallelism, we
also get to remove the looping operation (incrementing i,
the comparison and conditional branch)
• There have been 3 approaches to SIMD
– Vector architectures (including matrix architectures)
– Multimedia SIMD instruction set extensions
– Graphics processing units (GPUs)
• here we concentrate on the first and second
Two Views
• If we have n processing
elements
– we view the CPU as having a
control unit and n ALUs
(processing elements, or PEs in
the figure)
– each PE handles 1 datum from
the array where data is cached
in a PE’s local cache
• Otherwise, we use pipelined
functional units
– rather than executing the
instruction on n data
simultaneously, in each cycle
we start the next array
operation in the functional unit
pipeline
The Pipelined Approach
• Although simultaneous execution on n PEs is the faster of
the two views, the pipelined approach is preferred in
modern architectures for several reasons
– It is a lot cheaper than having n PEs
– We already have pipelined functional units so the vector
processing does not require a significant change to our
ALU
– The simultaneous execution is limited to n parallel
operations per cycle because of the limitation in PEs and
so we still may need to execute the looping mechanism
• e.g., a loop of 100 array elements on an architecture of 8 PEs still
needs the loop to iterate 13 times
– There is no need to support numerous individual caches
with parallel access
• although we will use multi-banked caches
– It consumes less power, which is a significant concern today
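The iteration count in the 8-PE example above is a ceiling division. A minimal Python sketch of that arithmetic follows; the function name passes_needed is illustrative and does not come from the notes.

```python
def passes_needed(n_elements, n_pes):
    """Loop iterations needed when each pass handles at most n_pes elements."""
    return -(-n_elements // n_pes)  # ceiling division without floating point


# 100 array elements on 8 PEs: 12 full passes plus one pass of 4 elements
print(passes_needed(100, 8))
```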
VMIPS
• We alter MIPS to now support vector operations
– The idea is that we will combine array elements into
storage so that we can fetch several array elements in one
cycle from memory (cache) and store them in large (wide)
registers
• we will use vector registers where one register stores multiple
array elements, a portion of the entire array
– This requires widening the bus and also costs us in terms of
greater memory access times because we are retrieving
numerous words at a time
• in VMIPS, a register can store 64 elements of 64-bit items and
there are 8 such registers
– additionally, there are scalar registers (32 integer and 32 FP)
• the registers all connect via ports to all of the functional units as
well as the load/store unit; there are numerous ports to support
parallel data movement (see slide 2 or figure 4.2 page 265)
VMIPS Instruction Set
• Aside from the ordinary MIPS instructions (scalar
operations), we enhance MIPS with the following:
– LV, SV – load vector, store vector
• LV V1, R1 – load vector register V1 with the data starting at the memory
location stored in R1
• also LVI/SVI for indexed (gather/scatter) addressing, and LVWS and
SVWS for strided addressing
– ADDVV.D V1, V2, V3 (V1 ← V2 + V3)
– ADDVS.D V1, V2, F0 (V1 ← V2 + F0, adding the scalar F0 to every element)
• similarly for SUB, MUL and DIV
– S--VV.D V1, V2 and S--VS.D V1, F0 to compare pairwise
elements in V1 and V2 or V1 and F0
• -- is one of EQ, NE, GT, LT, GE, LE
• result of comparison is a set of boolean values placed into the bit vector
register VM which we can then use to implement if statements
– POP R1, VM – count number of 1s in the VM and store in R1
• this is only a partial list covering the FP operations; integer
operations are omitted (see figure 4.3 for more detail)
Example
• Let’s look at a typical vector processing problem,
computing Y = a*X + Y
– Where X & Y are vectors and a is a scalar (e.g.,
y[i]=y[i]+a*x[i])
• The MIPS code is listed first and the VMIPS code second

MIPS:
      L.D     F0, a
      DADDI   R4, Rx, #512
Loop: L.D     F2, 0(Rx)
      MUL.D   F2, F2, F0
      L.D     F4, 0(Ry)
      ADD.D   F4, F4, F2
      S.D     F4, 0(Ry)
      DADDI   Rx, Rx, #8
      DADDI   Ry, Ry, #8
      DSUB    R20, R4, Rx
      BNEZ    R20, Loop

VMIPS:
      L.D     F0, a
      LV      V1, Rx
      MULVS.D V2, V1, F0
      LV      V3, Ry
      ADDVV.D V4, V2, V3
      SV      V4, Ry

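In MIPS, we execute almost 600 instructions (2 of setup
plus 9 per iteration over 64 iterations, or 578) whereas in
VMIPS, only 6 (there are 64 elements in the array to
process, each is 8 bytes long). The vector code has no
loop, so there are no control hazards, and any stall for a
data dependence occurs once per vector instruction
rather than once per element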
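The instruction counts can be checked with a short Python sketch. It assumes the loop shown in the notes: 2 setup instructions, a 9-instruction loop body, and 64 elements. The function names are illustrative only.

```python
def daxpy(a, x, y):
    """Reference semantics of Y = a*X + Y, element by element."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def scalar_instruction_count(n, setup=2, per_iteration=9):
    """Dynamic instruction count of the scalar MIPS loop over n elements."""
    return setup + per_iteration * n


VMIPS_INSTRUCTIONS = 6  # L.D, LV, MULVS.D, LV, ADDVV.D, SV

print(scalar_instruction_count(64), "scalar instructions vs", VMIPS_INSTRUCTIONS)
print(daxpy(2.0, [1.0, 2.0], [10.0, 20.0]))
```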
Vector Execution Time
• Although we typically compute execution time in
seconds (ns) or clock cycles, for vector operations,
architects are more interested in the number of
distinct issues required to execute some chunk of code
– This requires some explanation
– The vector processor’s performance is impacted by the
length of the vector (the number of array values stored in a
single vector), any structural hazards (caused by limitations
to the number and type of functional units) and data
dependencies between vectors
• we will ignore the last one, at least for now
– The vector processor’s performance then is primarily based
on the length of the vector
• for instance, in VMIPS, our vector length is 64 doubles, but if our
array holds 128 doubles, then we have to do our vector operation
twice
Convoys and Chimes
• A convoy is a set of sequential vector operations that
can be issued together without a structural hazard
– Because we are operating on vectors in a pipeline, the
execution of these operations can be overlapped
• e.g., LV V1, Rx followed by ADDVV.D V3, V1, V2 would allow
us to retrieve the first element of V1 and then start the addition
while retrieving the second element of V1
• A chime is the amount of time it takes to execute a
convoy
– We will assume that there are no stalls in executing the
convoy, so the chime will take n + x - 1 cycles where x is
the length of the convoy and n is the number of data in the
vector
– A program of m convoys will take m chimes, or
m * (n + x - 1) cycles (again, assuming no stalls)
– The chime time ignores pipeline overhead, and so
architects prefer to discuss performance in chimes
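The timing formula above can be written as a small Python sketch. It takes the n + x - 1 formula from these notes at face value and assumes no stalls. Letting each convoy have its own length x is a generalization added here; the m * (n + x - 1) form is the special case where every convoy has the same length.

```python
def chime_cycles(n, x):
    """Cycles for one convoy of x instructions over n-element vectors (no stalls)."""
    return n + x - 1


def program_cycles(convoy_lengths, n):
    """Total cycles for a sequence of convoys, one chime each."""
    return sum(chime_cycles(n, x) for x in convoy_lengths)


print(chime_cycles(64, 2))
print(program_cycles([2, 2, 2], 64))  # same as 3 * (64 + 2 - 1)
```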
Convoy Example
• Assume we have 1 functional unit for each operation
(load/store, add, multiply, divide)
– We have the following VMIPS code executing on a vector of
64 doubles
• LV V1, Rx
• MULVS.D V2, V1, F0
• LV V3, Ry
• ADDVV.D V4, V2, V3
• SV V4, Ry
• The first LV and MULVS.D can be paired in a convoy, but
not the next LV because there is only 1 load unit
– Similarly, the second LV and ADDVV.D are paired but not the
final SV
– This gives us 3 convoys:
• LV, MULVS.D
• LV, ADDVV.D
• SV
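The grouping above can be sketched as a greedy pass over the instruction stream in Python. A new convoy starts whenever the next instruction needs a functional unit already claimed in the current convoy. The opcode-to-unit table is an assumption matching the "1 functional unit for each operation" setup, and the sketch considers structural hazards only, not data dependences.

```python
# Assumed mapping of opcodes to the single functional unit of each kind.
UNIT = {"LV": "load/store", "SV": "load/store",
        "MULVS.D": "multiply", "ADDVV.D": "add"}


def build_convoys(opcodes, unit_of=UNIT):
    """Greedily group opcodes into convoys with no structural hazard."""
    convoys, current, busy = [], [], set()
    for op in opcodes:
        unit = unit_of[op]
        if unit in busy:          # structural hazard: close the current convoy
            convoys.append(current)
            current, busy = [], set()
        current.append(op)
        busy.add(unit)
    if current:
        convoys.append(current)
    return convoys


code = ["LV", "MULVS.D", "LV", "ADDVV.D", "SV"]
for number, convoy in enumerate(build_convoys(code), start=1):
    print(number, convoy)
```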
Multiple Lanes
• The original idea behind SIMD was to have n PEs so that n
vector elements could be executed at the same time
– We can combine the pipeline and the n PEs, in which case, the
parallel functional units are referred to as lanes
• Without lanes, we launch 1 FP operation per cycle in our
pipelined functional unit
– With lanes, we launch n FP operations per cycle, one per lane,
where elements are placed in a lane based on their index
• for instance, if we have 4 lanes, lane 0 gets all elements with index i % 4
== 0 whereas lane 1 gets all elements with i % 4 == 1
• To support lanes, we need lengthy vectors
– If our vector is 64 doubles and we have 16 lanes on a 7 cycle
multiply functional unit, then we are issuing 16 multiplies per
cycle over 4 cycles and before we finish the first multiplies, we
are out of data, so we don’t get full advantage of the pipelined
nature of the functional units!
– We also need multi-banked caches to permit multiple
loads/stores per cycle to keep up with the lanes
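The index-to-lane rule and the 16-lane example can be checked with a few lines of Python. The helper names are illustrative; the 7-cycle multiply latency is the figure used in the notes.

```python
def elements_per_lane(n, lanes):
    """Element indices handled by each lane: element i goes to lane i % lanes."""
    return [[i for i in range(n) if i % lanes == lane] for lane in range(lanes)]


def issue_cycles(n, lanes):
    """Cycles needed to issue n element operations, `lanes` per cycle."""
    return -(-n // lanes)  # ceiling division


print(elements_per_lane(8, 4))   # lane 0 gets 0 and 4, lane 1 gets 1 and 5, ...
print(issue_cycles(64, 16))      # fewer cycles than the 7-cycle multiply latency
```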
Handling Vectors > 64 Elements
• The obvious question with respect to SIMD is what
happens if our vector’s length > the size of our vector
register (which we will call maximum vector length
or MVL)
– If this is the case, then we have to issue the vector code
multiple times, in a loop
• by resorting to a loop, we lose some of the advantage of vector
code (the absence of branch penalties and loop mechanisms)
– On the other hand, we cannot provide an infinitely long (or
ridiculously long) vector register in hopes to satisfy all
array usage
– Strip mining is the process of generating code to handle
such a loop in which the number of loop iterations is n /
MVL, rounded up, where n is the size of the program’s vector
• Note: if n / MVL leaves a remainder then our last iteration will
take place on only a partial vector
• see the discussion on pages 274-275 for more detail
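A minimal Python sketch of strip mining follows. It splits an n-element vector into pieces of at most MVL elements and puts the partial piece last, as described above (a compiler could equally handle the odd-sized piece first). The function name strip_mine is illustrative.

```python
def strip_mine(n, mvl=64):
    """Return (start, length) pairs covering n elements in strips of <= mvl."""
    strips = []
    start = 0
    while start < n:
        length = min(mvl, n - start)   # the final strip may be partial
        strips.append((start, length))
        start += length
    return strips


# 150 elements with MVL = 64: two full strips and one strip of 22 elements
print(strip_mine(150))
```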
Handling If Statements
• As with loop unrolling, if our vector code
employs if statements, we can find this a
challenge to deal with
– Consider for instance
    for(i=0;i<n;i++)
        if(x[i] != 0)
            x[i] = x[i] - y[i];
– The corresponding VMIPS code is
    LV       V1, Rx
    LV       V2, Ry
    L.D      F0, #0
    SNEVS.D  V1, F0
    SUBVV.D  V1, V1, V2
    SV       V1, Rx
– We cannot launch the subtraction down the FP
adder functional unit until we know the result of
the condition
• In order to handle such a problem, vector
processors use a vector mask register
– The condition is applied in a pipelined unit
creating a list of 1s and 0s, one per vector element
– This is stored in the VM (vector mask) register
– Vector mask operations are available so that the
functional unit only executes on vector elements
where the corresponding mask bit is 1
• Notice SUBVV.D is a normal subtract instruction; we
need to modify it to execute using the vector mask
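The effect of the vector mask can be sketched in Python. Here set_mask_ne plays the role of SNEVS.D and masked_sub the role of a subtract that honors the mask. Both function names are illustrative and are not VMIPS instructions.

```python
def set_mask_ne(vector, scalar):
    """Build the mask: 1 where the element differs from the scalar, else 0."""
    return [1 if element != scalar else 0 for element in vector]


def masked_sub(v1, v2, mask):
    """Subtract only where the mask bit is 1; other elements pass through."""
    return [a - b if bit else a for a, b, bit in zip(v1, v2, mask)]


x = [3.0, 0.0, 5.0]
y = [1.0, 9.0, 2.0]
vm = set_mask_ne(x, 0.0)
print(vm, masked_sub(x, y, vm))
```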
Memory Bank Support
• We will use non-blocking caches with critical
word first/early restart
– However, that does not necessarily guarantee 1 vector
element per cycle to keep up with the pipelined
functional unit because we may not have enough banks
to accommodate the MVL
• the Cray T90 has 32 processors, each capable of generating
up to 4 loads and 2 stores per clock cycle
• the processor’s clock cycle is 2.167 ns and cache has a
response time of 15 ns, so each access keeps a bank busy for
15 / 2.167 = 6.92, rounded up to 7, clock cycles
• to support the full performance of the processor, we need
7 * 32 * (4 + 2) = 1344 accesses in progress at once,
thus 1344 banks! It actually has 1024 banks (altering the
SRAM to permit pipelined accesses makes up for this)
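The bank calculation can be reproduced in Python. The sketch assumes, as in the example above, that a bank stays busy for the whole response time rounded up to whole clock cycles; the function and parameter names are illustrative.

```python
import math


def banks_needed(processors, loads, stores, response_ns, clock_ns):
    """Banks required so every reference per cycle finds a free bank."""
    busy_cycles = math.ceil(response_ns / clock_ns)   # 15 / 2.167 rounds up to 7
    return processors * (loads + stores) * busy_cycles


print(banks_needed(32, 4, 2, 15.0, 2.167))
```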
Strides
• Notice that the vector registers store consecutive memory
locations (e.g., a[i], a[i+1], a[i+2], …)
– In some cases, code does not visit array locations in sequential
order, this is especially problematic in 2-D array code
• a[i][j]=a[i][j] + b[i][k] * d[k][j]
• A stride is the distance separating elements in a given
operation
– The optimal stride is 1 but for the above code, we would either
have difficulty when accessing b[i][k] or d[k][j] depending on
loop ordering, resulting in a stride as large as 100 (e.g., for
100 x 100 matrices)
– The larger the stride, the less effective the vector operations may
be because multiple vector register loads will be needed cycle
after cycle
• blocking (refer back to one of the compiler optimizations for cache) can
be used to reduce the impact
– To support reducing such an impact, we use cache banks and
also a vector load that loads vector elements based on strides
rather than consecutive elements
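Where the stride of 100 comes from can be shown with a short Python sketch. It assumes row-major storage (as in C) and 100 elements per row; both are assumptions made here for illustration, as are the helper names.

```python
def flat_index(row, col, ncols):
    """Position of element [row][col] in a row-major array with ncols columns."""
    return row * ncols + col


def stride(indices):
    """Distance, in elements, between consecutive accesses."""
    return indices[1] - indices[0]


NCOLS = 100
row_walk = [flat_index(0, k, NCOLS) for k in range(4)]   # b[i][k] as k varies
col_walk = [flat_index(k, 0, NCOLS) for k in range(4)]   # d[k][j] as k varies
print(stride(row_walk), stride(col_walk))
```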
SIMD Extensions for Multimedia
• When processors began to include graphics
instructions, architects realized that operations need
not always work on full 32-bit operands
– Graphics for instance often operates on several 8-bit
values (one each for red, green, blue, transparency)
• so while a datum might be 32 bits in length, it really encodes 4
pieces of data, each of which can be operated on simultaneously
within the adder
– Additionally, sounds are typically stored as segments of 8
or 16 bit data
• Thus, vector SIMD operations were incorporated into
early MMX style architectures
– This required little additional hardware (mainly breaking
the adder’s carry chain at element boundaries), just new
instructions to take advantage of the hardware already
available
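The idea of operating on four 8-bit values inside one 32-bit adder can be illustrated in Python with a "SIMD within a register" trick. This is a software analogy, not how any particular MMX instruction is implemented: the top bit of each byte is handled separately so that no carry crosses a byte boundary.

```python
def packed_add_u8(a, b):
    """Add four unsigned 8-bit lanes packed in 32-bit words, wrapping per lane."""
    low7 = 0x7F7F7F7F   # the low 7 bits of every byte
    high = 0x80808080   # the top bit of every byte
    partial = (a & low7) + (b & low7)          # carries stay inside each byte
    return (partial ^ ((a ^ b) & high)) & 0xFFFFFFFF


# Lanes (hex): 01+01, FF+01, 10+F0, F0+20, each wrapping modulo 256
print(hex(packed_add_u8(0x01FF10F0, 0x0101F020)))
```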
Instructions
• Unsigned add/subtract
• Maximum/minimum
• Average
• Shift right/left
– With 256-bit registers, these all allow for 32 8-bit,
16 16-bit, 8 32-bit or 4 64-bit operations
• Floating point
– 16 16-bit, 8 32-bit, 4 64-bit or 2 128-bit
• Usually no conditional execution instructions because there
would not necessarily be a vector mask register
• No sophisticated addressing modes to permit strides (or deal
with sparse matrices, a topic we skipped)
• The MMX extension to x86 architectures and its successors
have together introduced hundreds of new instructions
• The streaming SIMD extensions (SSE) to x86 in 1999
added 128-bit wide registers and the advanced vector
extensions (AVX) in 2010 added 256-bit registers
Example
      L.D     F0, a
      MOV     F1, F0
      MOV     F2, F0
      MOV     F3, F0
      DADDI   R4, Rx, #512
Loop: L.4D    F4, 0(Rx)
      MUL.4D  F4, F4, F0
      L.4D    F8, 0(Ry)
      ADD.4D  F8, F8, F4
      S.4D    F8, 0(Ry)
      DADDI   Rx, Rx, #32
      DADDI   Ry, Ry, #32
      DSUB    R20, R4, Rx
      BNEZ    R20, Loop

The 4D extension used with register F0 means that we are
actually using F0, F1, F2, F3 combined
L.4D/S.4D moves 4 array elements at a time
• In this example, we use a 256-bit SIMD MIPS
– The 4D suffix implies 4 doubles per instruction
– The 4 doubles are operated on in parallel
• either because the FP unit is wide enough to
accommodate 256 bits or because there
are 4 parallel FP units
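A Python sketch of the 4-wide loop shows where the savings come from. It assumes 64 elements, 5 setup instructions and a 9-instruction loop body, as in the listing above. The function name and the instruction-count arithmetic are illustrative additions.

```python
def simd_daxpy(a, x, y, width=4):
    """Y = a*X + Y processed `width` elements per loop iteration."""
    assert len(x) == len(y) and len(x) % width == 0
    out = list(y)
    iterations = 0
    for base in range(0, len(x), width):
        for lane in range(width):          # these run in parallel in hardware
            out[base + lane] = a * x[base + lane] + y[base + lane]
        iterations += 1
    return out, iterations


result, iterations = simd_daxpy(2.0, [1.0] * 64, [3.0] * 64)
print(iterations, 5 + 9 * iterations)   # loop iterations, dynamic instructions
```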
TLP: Multiprocessor Architectures
• In chapter 3, we looked at ways to directly
support threads in a processor, here we expand
our view to multiple processors
– We will differentiate among them as follows
• Multiple cores
• Multiple processors each with one core
• Multiple processors each with multiple cores
• And whether the processors/cores share memory
– When processors share memory, they are known as
tightly coupled and they can promote two types of
parallelism
• Parallel processing of multiple threads (or processes) which
are collaborating on a single task
• Request-level parallelism which has relatively independent
processes running on separate processors (sometimes called
multiprogramming)
Shared Memory Architecture
• We commonly refer to this type of architecture
as symmetric multiprocessors (SMP)
– Tightly coupled, or shared memory
• also known as a uniform memory access (UMA)
multiprocessor
– Probably only a few processors in this
architecture (no more than 8 or shared
memory becomes a bottleneck)
• Although in the past, multiprocessor computers
could fall into this category, today we typically view
this category as a multicore processor; true
multiprocessor computers will use distributed
memory instead of shared memory
Challenges
• How much parallelism exists within a single program
to take advantage of the multiple processors?
– Within this challenge, we want to minimize the
communication that will arise between processors (or
cores) because the latency is so much higher than the
latency of a typical memory access
• suppose we wish to achieve an 80 times speedup from 100
processors; using Amdahl’s Law, compute the fraction of the
original computation that must run in parallel, with the processors
working on their own (not communicating together).
Solution: 99.75% (solution on page 349)
– What is the impact of the latency of communication?
• we have 32 processors and a 200 ns time for communication
latency which stalls the processor; if the processor’s clock rate is
3.3 GHz and the ideal CPI is 0.5, how much faster is a machine
with no interprocessor communication versus one in which 0.2% of
instructions communicate? 3.4 times faster (solution on page 350)
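Both problems above can be worked through in a short Python sketch. The first function inverts Amdahl's Law for the parallel fraction. The second treats each communication as a stall of latency times clock rate cycles and assumes 0.2% of instructions communicate. With unrounded arithmetic the second ratio comes out near 3.6, slightly above the 3.4 quoted above, which presumably reflects rounding of intermediate values in the textbook solution. Function names are illustrative.

```python
def parallel_fraction(speedup, processors):
    """Solve speedup = 1 / ((1 - f) + f / processors) for f."""
    return (1 - 1 / speedup) / (1 - 1 / processors)


def communication_slowdown(latency_ns, clock_ghz, base_cpi, comm_fraction):
    """Ratio of effective CPI with communication stalls to the base CPI."""
    stall_cycles = latency_ns * clock_ghz          # ns * cycles/ns
    effective_cpi = base_cpi + comm_fraction * stall_cycles
    return effective_cpi / base_cpi


print(round(parallel_fraction(80, 100), 4))
print(round(communication_slowdown(200, 3.3, 0.5, 0.002), 2))
```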
Cache Coherence
• The most challenging aspect of a shared memory
architecture is ensuring data coherence across
processors
– What happens if two processors both read the same
datum? If one changes the datum, the other has a stale
value, how do we alert it to update the value?
• As an example, consider the following time line
of events
Time  Event              A’s cache   B’s cache   Memory
                         storing X   storing X   item X
0                        ----        ----        1
1     A reads X          1           ----        1
2     B reads X          1           1           1
3     A stores 0 into X  0           1           0
Cache Coherence Problem
• We need our memory system to be both coherent and
consistent
– A memory system is coherent if
• a read by processor P to X that follows a write by P to X, with no
writes of X by any other processor in between, always returns the
value written by P
• a read by a processor to X following a write by another processor
to X returns the written value if the read and write are separated by
a sufficient amount of time
• writes to the same location are serialized so that the writes are seen
by all processors in the same order
– Consistency determines when a written value will be
returned by a later read
• we will assume that a write is only complete once that write
becomes available to all processors (that is, a write to a local cache
does not mean a write has completed, the write must also be made
to shared memory)
• if two writes take place, to X and Y, then all processors must see
the two writes in the same order (X first and then Y for instance)
Snooping Coherence Protocol
• In an SMP, all of the processors have caches which are
connected to a common bus
– The snoopy cache listens to the bus for write updates
• Data falls into one of these categories
– Shared – datum that can be read by anyone and is valid
– Modified – datum has been modified by this processor, so
copies in all other caches must be invalidated or updated
– Invalid – datum has been modified by another processor but not
yet updated in this cache
• The snooping protocol has two alternatives
– Write-invalidate – upon a write, the other caches must mark
their own copies as invalid and retrieve the updated datum
before using it
• if two processors attempt to write at the same time, only one wins, the
other must invalidate its write, obtain the new datum and then reperform
its operation(s) on the new datum
– Write-update – upon a write, update all other caches at the same
time by broadcasting the new datum
Processor    Bus           A’s cache  B’s cache               Memory
                           ----       ----                    0
A reads X    Cache miss    0          ----                    0
B reads X    Cache miss    0          0 (from A, not memory)  0
A writes X   Invalidate X  1          ----                    1 (or ----)
B reads X    Cache miss    1          1                       1
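The write-invalidate sequence in the table can be replayed with a toy Python model. It tracks a single address X and assumes write-back caches, so memory stays stale after A's write until B's miss forces the owner to supply the value (the "----" alternative in the table). As a simplification, clean copies are served from memory. The class and method names are illustrative.

```python
class SnoopyBus:
    """Toy write-invalidate, write-back protocol for a single address X."""

    def __init__(self, names, memory_value):
        self.caches = {name: None for name in names}  # None means no valid copy
        self.dirty = None            # name of the cache holding a modified copy
        self.memory = memory_value

    def read(self, name):
        if self.caches[name] is None:                 # read miss goes on the bus
            if self.dirty is not None:                # owner supplies, writes back
                self.memory = self.caches[self.dirty]
                self.dirty = None
            self.caches[name] = self.memory
        return self.caches[name]

    def write(self, name, value):
        for other in self.caches:                     # broadcast invalidate
            if other != name:
                self.caches[other] = None
        self.caches[name] = value
        self.dirty = name


bus = SnoopyBus(["A", "B"], memory_value=0)
bus.read("A")
bus.read("B")
bus.write("A", 1)
print(bus.caches, bus.memory)   # B's copy is invalid, memory is stale
print(bus.read("B"), bus.caches, bus.memory)
```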
Extensions to Protocol
• MESI – adds a state called Exclusive
– If a datum is exclusive to the cache, it can be written
without generating an invalidate message to the bus
– If a read miss occurs to a datum that is exclusive to a cache,
then the cache must intercept the miss, send the datum to
the requesting cache and modify the state to S (shared)
• MOESI – adds a state called Owned
– In this case, the cache owns the datum AND the datum is
out of date in memory (hasn’t been written back yet)
– This cache MUST respond to any requests for the datum
since memory is out of date
– But the advantage is that when another cache requests a
modified block, the block can be changed to Owned and
supplied directly, avoiding a write back to memory at this time
A Variation of the SMP
• As before, each
processor has its own
L1 and L2 caches
– snooping must occur
at the interconnection
network in order to
modify the L1/L2
caches
• A shared L3 cache is
banked to improve
performance
• The shared memory
level is the backup to
L3 as usual and is
also banked
Performance for Shared Memory
• Here, we concentrate just on memory accesses of a multicore
processor with a snoopy protocol (not the performance of the
processors themselves)
– Overall cache performance is a combination of
• uniprocessor miss rate as derived from compulsory, conflict and capacity
misses
• traffic from communication including invalidations and cache misses after
invalidations, these are referred to as coherence misses; they are true
sharing misses when the word accessed was actually written by another
processor and false sharing misses when only some other word in the
same block was written
• Example
– Assume that x1 and x2 are in the same cache block and are shared
by P1 and P2, indicate the true and false misses and hits from
below:
Step  P1        P2        Classification
1     write x1            true (P1 must send out invalidate signal)
2               read x2   false (block was invalidated)
3     write x1            false (block marked as shared because of P2’s read of x2)
4               write x2  false (block marked as shared with P1)
5     read x2             true (need new value from P2)
Commercial Workloads
• To demonstrate the performance of the snoopy cache
protocol on an SMP, we look at a 1998 study done on the
DEC Alpha 21164
– 4 processors, each issuing up to 4 instr/clock cycle,
with 3 levels of cache
• L1: 8 KB/8 KB instr/data cache, direct-mapped, 32 byte blocks, 7 cycle
miss penalty
• L2: 96 KB, 3 way set assoc, 32 byte block, 21 cycles
• L3: 2 MB, direct mapped, 64 byte block, 80 cycle miss
– As a point of comparison, the Intel i7 has these three caches
• L1: 32 KB/32 KB instr/data cache, 4 way/8 way, 64 byte blocks, 10
cycle miss penalty
• L2: 256 KB, 8 way set assoc, 64 byte block, 35 cycles
• L3: 2 MB (per core), 16 way, 64 byte block, 100 cycle miss
• The study looks at 3 benchmarks:
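– OLTP (online transaction processing) – user mode 71%, kernel time 18%, idle 11%
– DSS (decision support system) – user mode 87%, kernel time 4%, idle 9%
– AltaVista (search engine) – user mode 98%, kernel time <1%, idle <1%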
Distributed Shared Memory
• The tightly coupled (shared memory)
multiprocessor is useful for promoting parallelism
within tasks (whether 1 process, a group of
threads, or related processes)
• However, when processes generally will not
communicate with each other, there is little need
to force the architect to build a shared memory
system
– The loosely coupled, or distributed memory system, is
generally easier to construct and possibly cheaper
• in fact, any network of computers can be thought of as a
loosely coupled multiprocessor
– Any multicore multiprocessor will be of this
configuration
DSM Architecture
• Here, each multicore MP is an SMP as per
our previous slides
• Connecting each processor together is an
interconnection network
– An example ICN is shown to the right, there
are many topologies including nearest
neighbors of 1-D, 2-D, 3-D and hypercubes
Directory-based Protocol
• The snoopy protocol requires that caches broadcast
invalidates to other caches
• For a DSM, this is not practical because of the
lengthy latencies in communication
– Further, there is no central bus that all processors are
listening to for such messages (the ICN is at a lower level
of the hierarchy, past the caches but possibly before a
shared memory)
• So the DSM requires a different form for handling
coherence, so we turn to the directory-based protocol
– We keep track of every block that may be cached in a
central repository called a directory
– This directory maintains information for each block:
• in which caches it is stored
• whether it is dirty
• who currently “owns” the block
The Basics of the Protocol
• Cache blocks will have one of three states
– Shared – one or more nodes currently have the block that
contains the datum and the value is up to date in all caches
and main memory
– Uncached – no node currently has the datum, only memory
– Modified – the datum has been modified by one node,
called the owner
• for a node to modify a datum, it must be the only node to store the
datum, so this permits exclusivity
• if a node intends to modify a shared datum, it must first seek to
own the datum from the other caches, this allows a node to
modify a datum without concern that the datum is being or has
been modified by another node while the ownership request is
in flight
• once modified, the copy in memory (and in any other cache) is
out of date; the owner’s copy is dirty
The Directory(ies)
• The idea is to have a single directory which is
responsible for keeping track of every block
– But it is impractical to use a single directory because such
an approach is not scalable
• Therefore, the directory must be distributed
– Refer back to the figure 3 slides ago, we enhance this by
adding a directory to each MP
• each MP now has its multicores & L1/L2 caches, a shared L3
cache, I/O, and a directory
– The local directory consists of 1 entry per block in the
caches (assuming we are dealing with multicore processors
and not collections of processors)
• we differentiate between the local node (the one making a
request) and the home node (the node storing or owning the
datum) and a remote node (a node that has requested an item from
the owner or a node that requires invalidation once the owner has
modified the datum)
Protocol Messages
Type              Source     Dest       Content  Function
Read miss         Local      Directory  P, A     P has a read miss at A, requests data to make P a sharer
Write miss        Local      Directory  P, A     P has a write miss at A, requests data and makes P owner
Invalidate        Local      Directory  A        Invalidate all remote caches for A
Invalidate        Directory  Remote     A        Invalidate a shared copy of A
Fetch             Directory  Remote     A        Fetch block A from remote cache, send to home directory and change A to shared
Fetch/Invalidate  Directory  Remote     A        Fetch block A from remote cache, send to home directory and change remote cache’s A to invalid
Data value reply  Directory  Local      D        Return datum from home
Data write-back   Remote     Directory  A, D     Write A back
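How a home directory might react to the read miss and write miss messages in the table can be sketched in Python. This is a simplified model for one block, using the three states from the earlier slide (Uncached, Shared, Modified). It records only the messages the directory sends and is not a full implementation of the protocol. The class, its fields, and the node names P1 and P2 are illustrative.

```python
class Directory:
    """Toy home directory tracking one block's state and sharers."""

    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()
        self.log = []   # messages sent by the directory, as (type, destination)

    def read_miss(self, p):
        if self.state == "Modified":
            (owner,) = self.sharers
            if owner != p:
                self.log.append(("Fetch", owner))   # owner's copy becomes shared
        self.state = "Shared"
        self.sharers.add(p)
        self.log.append(("Data value reply", p))

    def write_miss(self, p):
        if self.state == "Modified":
            (owner,) = self.sharers
            if owner != p:
                self.log.append(("Fetch/Invalidate", owner))
        elif self.state == "Shared":
            for node in sorted(self.sharers - {p}):
                self.log.append(("Invalidate", node))
        self.state = "Modified"
        self.sharers = {p}                          # p is now the sole owner
        self.log.append(("Data value reply", p))


d = Directory()
d.read_miss("P1")
d.read_miss("P2")
d.write_miss("P1")
d.read_miss("P2")
for message in d.log:
    print(message)
```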
Example Protocol