
 CQ (Completion Queue) - an event notification block used when the processor needs to be notified that BTE or FMA transactions have completed.
 NAT (Network Address Translation) - validates and translates addresses from the network address format to an address on the local node.
 AMO (Atomic Memory Operation) - handles AMO-type transactions.
 ORB (Outstanding Request Buffer) - processes requests to the network and matches responses from the network to the original requests.
 RMT (Receive Message Table) - tracks groups of packets, or sequences, transmitted from remote nodes of the network.
 SSID (Synchronization Sequence Identification) - tracks all of the request packets that originate and all of the response packets that terminate at the NIC, in order to perform completion notifications for transactions; assists in identifying SW operations and processes impacted by errors; monitors errors detected by other NIC blocks.

Figure 2: Logical and physical views of striping. Four application processes write a variable amount of data sequentially within a shared file. This shared file is striped over 4 OSTs with 1 MB stripe sizes. This write operation is not stripe-aligned; therefore some processes write their data to stripes used by other processes. Some stripes are accessed by more than one process (which may cause contention). Additionally, the OSTs are accessed by variable numbers of processes (3 on OST0, 1 on OST1, 2 on OST2, and 2 on OST3).

Figure 3: Write performance for serial I/O at various Lustre stripe counts. File size is 32 MB per OST utilized, and write operations are 32 MB in size. Utilizing more OSTs does not increase write performance; the best performance is seen when the stripe size matches the size of the write operations.

Figure 4: Write Performance for serial I/O at various Lustre stripe sizes and I/O operation sizes. File utilized is 256 MB written to a single OST. Performance is limited by small operation sizes and small stripe sizes. Either can become a limiting factor in write performance. The best performance is obtained in each case when the I/O operation and stripe sizes are similar.

These loops were taken from the nuccor application and provided by Rebecca Hartman-Baker of ORNL. She began by comparing various compilers and optimization levels. The rewrites that follow came at the suggestion of Vince Graziano of Cray.

This code plays better to the strengths of the CPU: more cache reuse, easier prefetching, and a better chance of vectorizing.

Original: 13.938244 s
Reordered: 7.955379 s

This code further improves on the last by allowing slightly better cache reuse, but a significantly better opportunity to vectorize on both a and b. I asked the compiler team why the loop nest on the left was only partially vectorized, and they said that their studies showed it would probably not be profitable (likely due to the tmat7 array striding on the second dimension).

Original: 13.938244 s
Reordered: 7.955379 s
Fissioned: 2.481636 s
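The nuccor loops themselves are not reproduced in these notes, so the sketch below is only a generic illustration (names and sizes invented) of the two transformations being described: reordering loops so the innermost index gives stride-1 access, then fissioning the body so the compiler sees independent, vectorizable loops.

```c
#include <stddef.h>

#define N 64

/* Illustrative only: after interchange, j is innermost, so every array
   access is stride-1 in row-major C. After fission, the two updates are
   independent simple loops, each of which the compiler can vectorize. */
void scale_and_accumulate(double a[N][N], double b[N][N],
                          double c[N][N], double s)
{
    /* first fissioned loop: stride-1 writes to a */
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = s * c[i][j];

    /* second fissioned loop: stride-1 updates to b, independent of a */
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            b[i][j] += c[i][j];
}
```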

The following Cache Blocking example was created by Steve Whalen of Cray.

See http://en.wikipedia.org/wiki/Restrict for more information on the C99 restrict keyword.
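As a minimal sketch of what restrict buys (the function and array names here are invented for illustration):

```c
#include <stddef.h>

/* With restrict, the compiler may assume a, b, and c never alias,
   so it can vectorize this loop without runtime overlap checks. */
void vadd(size_t n, double *restrict c,
          const double *restrict a, const double *restrict b)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```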

The following come from Kim McMahon (Cray)

Figure 5: Write performance of a file-per-process I/O pattern as a function of the number of files/processes. The file size is 128 MB with 32 MB write operations. Performance increases as the number of processes/files increases, until OST and metadata contention hinder further improvement; each file is still subject to the limitations of serial I/O. Improved performance can be obtained from a parallel file system such as Lustre. However, at large process counts (large numbers of files), metadata operations and OSS/OST contention may hinder overall performance.

Figure 8: Write performance of a single shared file as the number of processes increases. A file size of 32 MB per process is utilized with 32 MB write operations. For each I/O library (POSIX, MPI-IO, and HDF5), performance levels off at high core counts.

6.
 A cache line is 64B
 Cache is a “victim cache”
 All references go to L1 immediately and get evicted down the caches
 A cache line is usually only in one level of cache
 Hardware prefetcher detects forward and backward strides through memory
 Each core can perform a 128b add and 128b multiply per clock cycle
 This requires SSE packed instructions
 “Stride-one vectorization”

15.
 With the snoop filter, a STREAM test shows 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth
 This feature will be key for two-socket Magny-Cours nodes, which share the same architecture
Cray Inc. Preliminary and Proprietary SC09 17

22.
 Two Gemini ASICs are packaged on a pin-compatible mezzanine card
 Topology is a 3-D torus
 Each lane of the torus is composed of 4 Gemini router “tiles”
 Systems with SeaStar interconnects can be upgraded by swapping this card
 100% of the 48 router tiles on each Gemini chip are used

29.
Cool air is released into the computer room. The hot air stream passes through the evaporator and rejects heat to R134a via a liquid-vapor phase change (evaporation). R134a absorbs energy only in the presence of heated air. Phase change is 10x more efficient than pure water cooling.
[Diagram: R134a enters the evaporator as a liquid and exits as a liquid/vapor mixture.]

37.
 Compiler feedback is enabled with -Minfo and -Mneginfo
 This can provide valuable information about which optimizations were or were not done, and why.
 To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations
 It’s possible to disable optimizations included with -fast if you believe one is causing problems
 For example: -fast -Mnolre enables -fast and then disables loop-carried redundancy elimination
 To get more information about any compiler flag, add -help with the flag in question
 pgf90 -help -fast will give more information about the -fast flag
 OpenMP is enabled with the -mp flag

38.
Some compiler options may affect both performance and accuracy. Lower accuracy often means higher performance, but accuracy can also be enforced.
 -Kieee: All FP math strictly conforms to IEEE 754 (off by default)
 -Ktrap: Turns on processor trapping of FP exceptions
 -Mdaz: Treat all denormalized numbers as zero
 -Mflushz: Set SSE to flush-to-zero (on with -fast)
 -Mfprelaxed: Allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations
 Some other compilers turn this on by default; PGI chooses to favor accuracy over speed by default.

42.
 Make sure it is available
 module avail PrgEnv-cray
 To access the Cray compiler
 module load PrgEnv-cray
 To target the various chips
 module load xtpe-[barcelona,shanghai,istanbul]
 Once you have loaded the module, “cc” and “ftn” are the Cray compilers
 Recommended: just use the default options
 Use -rm (Fortran) and -hlist=m (C) to find out what happened
 man crayftn

48.
ftn-6289 ftn: VECTOR File = resid.f, Line = 29
A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
A loop starting at line 37 was vectorized.

50.
 OpenMP is ON by default
 Optimizations controlled by -Othread#
 To shut it off, use -Othread0, -xomp, or -hnoomp
 Autothreading is NOT on by default
 -hautothread to turn it on
 Modernized version of the Cray X1 streaming capability
 Interacts with OMP directives
 If you do not want to use OpenMP but have OMP directives in the code, be sure to do a run with OpenMP shut off at compile time

65.
 In FFTs, the problems are
 Which library to use?
 How to use complicated interfaces (e.g., FFTW)?
 Standard FFT practice
 Do a plan stage
 Deduce machine and system information and run micro-kernels
 Select the best FFT strategy
 Do an execute
Our system knowledge can remove some of this cost!

66.
 CRAFFT is designed with simple-to-use interfaces
 Planning and execution stage can be combined into one function call
 Underneath the interfaces, CRAFFT calls the appropriate FFT kernel
 CRAFFT provides both offline and online tuning
 Offline tuning
 Which FFT kernel to use
 Pre-computed PLANs for common-sized FFTs
 No expensive plan stages
 Online tuning is performed as necessary at runtime as well
 At runtime, CRAFFT will adaptively select the best FFT kernel to use based
on both offline and online testing (e.g. FFTW, Custom FFT)

123.
 Cache blocking is a combination of strip mining and loop interchange, designed
to increase data reuse.
 Takes advantage of temporal reuse: re-reference array elements already
referenced
 Good blocking will take advantage of spatial reuse: work with the cache
lines!
 Many ways to block any given loop nest
 Which loops get blocked?
 What block size(s) to use?
 Analysis can reveal which ways are beneficial
 But trial-and-error is probably faster
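A minimal cache-blocked example in C, showing the strip-mine-plus-interchange pattern the bullets describe (the transpose kernel, sizes, and names are invented for illustration, not taken from the original slides):

```c
#include <stddef.h>

#define N  128
#define BS 16   /* block size; tune so two BSxBS tiles fit in cache */

/* The i/j loops are strip-mined into ii/jj tile loops, so each BSxBS
   tile of both arrays stays cache-resident while it is used -- the
   scattered dst[j][i] writes are confined to one tile at a time,
   exploiting both temporal and spatial (cache-line) reuse. */
void transpose_blocked(double dst[N][N], double src[N][N])
{
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t jj = 0; jj < N; jj += BS)
            for (size_t i = ii; i < ii + BS; i++)
                for (size_t j = jj; j < jj + BS; j++)
                    dst[j][i] = src[i][j];
}
```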

139.
 Linux has a “first touch policy” for memory allocation
 *alloc functions don’t actually allocate your memory
 Memory gets allocated when “touched”
 Problem: A code can allocate more memory than available
 Linux assumes “swap space” exists; the compute nodes don’t have any
 Applications won’t fail from over-allocation until the memory is finally
touched
 Problem: Memory will be put on the core of the “touching” thread
 Only a problem if thread 0 allocates all memory for a node
 Solution: Always initialize your memory immediately after allocating it
 If you over-allocate, it will fail immediately, rather than at a strange place in your code
 If every thread touches its own memory, it will be allocated on the
proper socket
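The “initialize immediately after allocating” advice can be sketched as below (the function name is invented; the OpenMP pragma only has an effect when compiled with OpenMP enabled, and is what places each page near the thread that will use it):

```c
#include <stdlib.h>
#include <stdio.h>

/* malloc only reserves address space; physical pages are assigned on
   first touch. Touching the memory right away means (a) over-allocation
   fails here, not deep in the run, and (b) under OpenMP, each page is
   faulted in on the NUMA node of the thread that touches it. */
double *alloc_first_touch(size_t n)
{
    double *x = malloc(n * sizeof *x);
    if (!x) { perror("malloc"); exit(EXIT_FAILURE); }

    /* without OpenMP this pragma is ignored and thread 0 touches all */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        x[i] = 0.0;   /* the touch: pages are faulted in here */
    return x;
}
```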

140.
 Short Message Eager Protocol
 The sending rank “pushes” the message to the receiving rank
 Used for messages MPICH_MAX_SHORT_MSG_SIZE bytes or less
 Sender assumes that receiver can handle the message
 Matching receive is posted - or -
 Has available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space
(MPICH_UNEX_BUFFER_SIZE) to store the message
 Long Message Rendezvous Protocol
 Messages are “pulled” by the receiving rank
 Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
 Sender sends small header packet with information for the receiver to pull
over the data
 Data is sent only after matching receive is posted by receiving rank
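The size-based protocol selection above can be sketched conceptually as follows (this is an illustration only, not MPICH source; the 128000-byte value is a stand-in default, and the real cutoff is whatever MPICH_MAX_SHORT_MSG_SIZE is set to on your system):

```c
#include <stddef.h>

enum protocol { EAGER, RENDEZVOUS };

/* Eager for messages of max_short_msg_size bytes or less (sender
   pushes, assuming the receiver can buffer it); rendezvous otherwise
   (sender sends a header, receiver pulls the data after the matching
   receive is posted). */
enum protocol choose_protocol(size_t msg_bytes, size_t max_short_msg_size)
{
    return msg_bytes <= max_short_msg_size ? EAGER : RENDEZVOUS;
}
```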