
8.
Data Locality
• The most critical factor in performance? Google argues otherwise!
• Not guaranteed by a JVM
• Spatial - data accessed in small, nearby regions, e.g. stepping through an array in a loop
• Temporal - data accessed once has a high probability of being reused before long
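The two kinds of locality above can be sketched with a plain 2D array. This is an illustrative example (the class and method names are made up): Java stores each row contiguously, so row-major traversal walks memory sequentially, while column-major traversal strides across rows.

```java
// A minimal sketch of spatial vs. temporal locality on a 2D array.
final class Locality {
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int[] row : m) {       // each row is a contiguous block
            for (int v : row) {     // sequential access: spatial locality
                sum += v;           // `sum` reused every iteration: temporal locality
            }
        }
        return sum;
    }

    // Same result, but touches one element per row before moving on,
    // striding across memory - much worse spatial locality.
    static long sumColumnMajor(int[][] m) {
        long sum = 0;
        for (int col = 0; col < m[0].length; col++) {
            for (int row = 0; row < m.length; row++) {
                sum += m[row][col];
            }
        }
        return sum;
    }
}
```

On large arrays the row-major version is typically much faster, purely because of cache behaviour; the arithmetic is identical.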

13.
Cache Write Strategies
• Write through: a changed cache line immediately goes back to main memory
• Write back: the cache line is marked dirty when changed; eviction sends it back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning

14.
Exclusive versus Inclusive
• Only relevant below L3
• AMD is exclusive
  – Progressively more costly due to eviction
  – Can hold more data
  – Bulldozer uses "write through" from L1d back to L2
• Intel is inclusive
  – Can be better for inter-processor memory sharing
  – More expensive, as lines in L1 are also in L2 & L3
  – If a line is evicted in a higher-level cache, it must be evicted below as well

16.
MESI+F Cache Coherency Protocol
• Specific to data cache lines
• Request for Ownership (RFO): issued when a processor tries to write to a cache line it does not own
• Modified: the local processor has changed the cache line, implying it is the only one holding it
• Exclusive: one processor is using the cache line; not modified
• Shared: multiple processors are using the cache line; not modified
• Invalid: the cache line is invalid and must be re-fetched
• Forward: the cache designated to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
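The core state transitions can be sketched in a few lines. This is a toy model, not a real coherence implementation (the enum and method names are invented for illustration): a local write always ends in Modified, after an RFO if other cores may hold the line, and seeing another core's RFO invalidates the local copy.

```java
// Toy model of two MESI transitions; real hardware also handles
// reads, write-backs, and the Forward optimisation.
enum LineState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

final class MesiSketch {
    // Local write: SHARED/INVALID lines must first broadcast an RFO
    // and wait for every other processor to acknowledge it; in all
    // cases the line ends up MODIFIED.
    static LineState onLocalWrite(LineState s) {
        boolean needsRfo = (s == LineState.SHARED || s == LineState.INVALID);
        // (a real cache would stall here until all acks arrive when needsRfo)
        return LineState.MODIFIED;
    }

    // Another core's RFO for this line invalidates our copy.
    static LineState onRemoteRfo(LineState s) {
        return LineState.INVALID;
    }
}
```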

17.
Static RAM (SRAM)
• Requires 6-8 transistors per datum
• Cycle-rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak; it does not need to be refreshed/recharged

23.
Registers
• On-core storage for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 integer & 128 floating-point registers

24.
Store Buffers
• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent "stalls" in execution on a thread when the cache line being written is not local to the core
• ~1 cycle

25.
Level Zero (L0)
• Added in Sandy Bridge
• A cache of the last 1,536 decoded uops
• Well-suited for hot loops
• Not the same as the older "trace" cache

27.
Level Two (L2)
• 256KB per core on Sandy Bridge, Ivy Bridge & Haswell
• 2MB per "module" on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set is larger than L2, misses grow

28.
Level Three (L3)
• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture: laptops might have 6-8MB, but server-class parts might have 30MB
• 14-38 cycles to access

29.
Level Four??? (L4)
• Some versions of Haswell will have a 128MB L4 cache!
• No latency benchmarks for this yet

33.
Programming Optimizations
• Stack-allocated data is cheap
• Pointer interaction - you have to retrieve the data being pointed to, even into registers
• Avoid locking and the resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match the workload to the size of the last-level cache (LLC, L3/L4)
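The CAS point above can be illustrated with `java.util.concurrent.atomic`. A minimal sketch (the counter class itself is invented for illustration): instead of taking a lock and risking kernel arbitration, the thread retries `compareAndSet` until its update wins.

```java
import java.util.concurrent.atomic.AtomicLong;

// Lock-free increment via compare-and-swap: no lock, no kernel
// involvement; contention is resolved by retrying on-thread.
final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    long increment() {
        long current, next;
        do {
            current = value.get();
            next = current + 1;
            // compareAndSet fails (and we retry) if another thread
            // changed the value between our get() and this call
        } while (!value.compareAndSet(current, next));
        return next;
    }

    long get() { return value.get(); }
}
```

The retry loop is the "algorithms become more complex" part: correctness now depends on the whole read-modify-write being re-attempted atomically rather than on mutual exclusion.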

34.
What about Functional Programming?
• You have to allocate more and more space for your data structures, which leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default; profile to find poor performance
• Use mutable data in targeted locations
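"Mutable data in targeted locations" might look like the following sketch (names are illustrative): mutation is confined to a local, pre-sized `ArrayList` with good spatial locality and no per-step reallocation, while callers only ever see an immutable view.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class Squares {
    static List<Integer> firstNSquares(int n) {
        // mutable only inside this method; pre-sized to avoid regrowth
        List<Integer> out = new ArrayList<>(n);
        for (int i = 1; i <= n; i++) {
            out.add(i * i);
        }
        // the outside world gets an unmodifiable view
        return Collections.unmodifiableList(out);
    }
}
```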

35.
Hyperthreading
• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• Each hardware thread gets half of the core's cache resources
• NOTE - Haswell only has Hyperthreading on the i7!

42.
Phase Change Memory (PRAM)
• Higher performance than today's DRAM
• Intel seems more fascinated by this; it released its "neuromorphic" chip design last fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change; maybe fixed?