Computer Architecture in the 21st Century

Chuck Thacker
Microsoft Research Silicon Valley
January, 2013
Outline
• Some review
• Why networking and architecture must
merge.
• How can we continue to improve both?
• Some remaining challenges.
Some physics – Dennard scaling (1975)
• P = CV²f
– Relates power, capacitive load, operating voltage, and
frequency in CMOS circuits.
• If we scale the dimensions and V down by k:
– P′ = C′V′²f = (C/k)(V/k)²f = CV²f/k³
– Capacitance is C/k because although the area goes down by k²,
the dielectric is 1/k as thick.
– k is the scaling of the “feature size”: 90 nm -> 45 nm is k = 2.
– Our new circuit is 1/k² the area and 1/k³ the power (at the
same frequency) of the original version.
– Our new chip should also be cheaper, since $ = Area.
– This works well for feature sizes > 100 nm or so.
– It doesn’t work today, since we can’t continue to scale V,
and we can’t get rid of the heat.
• If f’ = kf, the power per unit area is unchanged. Making
a bigger chip means making a hotter chip.
• Semiconductor makers have used scaling in very
different ways for memories and CPUs.
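As a quick sanity check on the scaling arithmetic above, a small sketch (with illustrative constants, not numbers from the talk) evaluates P = CV²f before and after scaling by k:

```python
# Dennard scaling sketch: scale dimensions and V down by k, hold f fixed.
def power(C, V, f):
    return C * V ** 2 * f

def scaled_power(C, V, f, k):
    # C' = C/k (the thinner dielectric offsets the k^2 area shrink), V' = V/k
    return power(C / k, V / k, f)

P = power(1e-12, 1.0, 1e9)             # C = 1 pF, V = 1 V, f = 1 GHz
P2 = scaled_power(1e-12, 1.0, 1e9, 2)  # 90 nm -> 45 nm, k = 2
# P2 is P / k^3 = P / 8
```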
An early 21st century PC “motherboard”
A 2012 “System on a Chip” (SOC)
The arrows show the off-chip connections
SOC basics
• Usually a few traditional CPU cores,
• With other specialized functions:
– Network and disk controller
– Memory controller
– Low speed I/O controllers
• Connected by an on-chip network.
• Optimized for a specific application:
– Mobile phone
– Tablet
– Desktop
– Data center server
• SOCs make sense only in high-volume multi-generational
applications, where they are mandatory for low cost.
• Specialization is the key to making innovative, differentiated
products that consume less energy.
• A specialized block can be 100X the energy efficiency of the same function
implemented in software running on a CPU.
• But specialization conflicts with the need for high volume.
A Typical SOC
• Nvidia Tegra 2: 75% of chip is specialized HW
SOC requirements
• To use SOCs effectively, a manufacturer must have:
– A reliable source of IP (CPU/cache, on-chip networks,
device and memory controllers). This can be internal or
external.
– Design tools to aggregate and interconnect the IP. CAD
vendors supply these tools today.
– Product designers to specify the functions and employ the
tools.
– Designs and their functionality must delight customers, or the
needed volumes will not be achieved.
• Do not need:
– A silicon fabrication facility
– A final product assembly plant.
Another approach to specialization
• Field-programmable gate arrays (FPGAs)
– Clean-slate design at the gate (logic) level.
– High volumes, since differentiated designs can be produced with the same chips.
– Simple silicon, so FPGAs can be introduced rapidly in a new process.
– A lot of the transistors in an FPGA consume no power whatsoever.
– Energy efficiency is not as great as with full-custom or ASIC designs.
• But it is frequently good enough, particularly for algorithms that don’t work well in software running on
a general-purpose CPU (e.g. compression, crypto).
– So modern FPGAs contain “hard” functions that can be interconnected (and powered down
when not used):
• Lots of logic
• CPU cores (Xilinx Zynq family has 2 ARM cores with caches)
• DSPs (hundreds)
• Embedded RAM (megabits)
• High-speed I/O (for networking)
• External DRAM controllers
• Controllers for common I/O standards (e.g. Ethernet)
• The MAXC computer would fit in a single modern FPGA
– So would a few dozen Alto computers (with their memory and disks)
• Problems:
– Can’t use all the logic; wires are usually the thing that limits a design
– You pay for logic that you don’t use.
A Field-programmable Gate Array (FPGA)
Another use for FPGAs: Networks
• The Internet resists innovation
– Protocols are standardized
– To communicate, all parties must use common protocols.
• Data centers are different:
– Single owner
– Fewer nodes (tens of thousands rather than billions)
– Links are short and have similar performance
– Topology is known
– The network within a data center requires much higher
bandwidth than the connections into and out of it.
• Can we take advantage of these differences to
build a lower-cost more energy-efficient network
for our data centers?
– Yes
Current data center practice
24–48-port switches are common
Problem: Need more bandwidth near the root of the tree.
The AN3 network
• Assumes shipping containers are used, but this is optional.
– 64 containers (“forty footers”)
– Each with two rows of 16 44U racks
– Each half-rack has 22 1U servers, each with a 4-port 5 Gb/s switch/NIC
(network interface controller), that together form a small sub-network. Two
of the NICs have a 10 Gb/s “uplink” port.
– Each L0 sub-network connects to two Level 1 switches, each with 128
10 Gb/s ports.
– 64 L1 ports are for the connections to L0 sub-nets; 64 are used for connections
to 63 other containers and the network operations center (NOC).
– No central switching needed
• 1408 servers/container, 90,112 servers/center
• Somewhat surprisingly, both the NICs and L1 switches can be
implemented cost-effectively with FPGAs
– These have “hard” 10 Gb/s transceivers
– They have enough buffering – “block RAMs”
– They have enough logic
• We do not need “top-of-rack” switches.
– Less engineering
– Lower cost
– Higher bandwidth for traffic within the subnet
The switches in one container
A 64-container data center has a bisection bandwidth of 32*64*10*2 Gb/s = 41 Tb/s
The entire Internet has been estimated to have a bisection bandwidth of 100 Tb/s
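The counts on the preceding slides follow directly from the container parameters; this sketch simply restates the slides' arithmetic:

```python
# AN3 arithmetic from the slides.
racks_per_container = 2 * 16                 # two rows of 16 racks
half_racks = 2 * racks_per_container         # 64 L0 sub-networks per container
servers_per_container = 22 * half_racks      # 22 1U servers per half-rack
servers_per_center = 64 * servers_per_container
bisection_gbps = 32 * 64 * 10 * 2            # Gb/s across the bisection

print(servers_per_container)   # 1408
print(servers_per_center)      # 90112
print(bisection_gbps)          # 40960, i.e. ~41 Tb/s
```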
The L0 Sub-network
44 cables (2816 total)
The AN3 NIC
Circuit- vs. packet-switched networks
• Both types are common
– LANs and the Internet use packets
– Phone networks use circuits
• Primary difference is the way congestion is handled
– Packet switches drop packets on congested paths.
– Circuit switching uses admission control – don’t accept a call if it would oversubscribe a switch
or link.
• Data flows are handled differently
– Packet switches use variable-length packets for data
– Circuit switches send data in fixed-length cells, carried in larger fixed-size frames. Each slot
carries a cell for a particular flow, established at call setup time.
• Circuit switches are simpler than packet switches
– Little buffering is needed, since cells are forwarded quickly. Packet switches use queueing to
handle momentary overload, and have huge buffers.
– Routing decisions are much simpler, since routing and buffer allocation are done at setup time.
This makes scaling to higher link rates (40–100 Gb/s) easier.
• In a large network, packets are used because “call setup” overhead is too high.
– In a phone network, this is OK, since voice calls are usually much longer than call setup (dialing).
– Also OK if you can statistically multiplex many connections onto a single flow. Large ISPs do
this.
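The admission-control idea above can be sketched in a few lines. This is a hypothetical illustration, not the AN3 setup protocol: a call is accepted only if every link on its path has spare capacity, and bandwidth is reserved at setup time.

```python
# Hypothetical circuit-switch admission control: accept a call only if no
# link on its path would be oversubscribed; reserve bandwidth at setup.
def admit(path_links, capacity, allocated, demand):
    if any(allocated[l] + demand > capacity[l] for l in path_links):
        return False                  # reject: would oversubscribe a link
    for l in path_links:
        allocated[l] += demand        # reserve bandwidth for this flow
    return True

capacity = {"a": 10, "b": 10}         # link capacities, e.g. Gb/s
allocated = {"a": 0, "b": 0}
admit(["a", "b"], capacity, allocated, 6)   # accepted: 6 <= 10 on both links
admit(["a", "b"], capacity, allocated, 6)   # rejected: only 4 Gb/s left
```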
Key ideas in AN3
• Circuit switching rather than packet switching
• Fast call setup
• Speculative transmission
• Eliminate the usual top-of-rack switches
– A full data center has only 128 L1 switches, but
90,000 NICs
– Can reduce the NIC’s cost even further with ASICs
Another area for improvement: Memory systems
• Approaches to overcome the memory wall:
– Use new semiconductor technologies
• Flash, Phase Change, MRAM
– Get the memory closer to the CPU
• Die stacking
– Change the way the CPU accesses memory
• Transactional memory
Flash memory
• Properties:
– Two kinds: NOR and NAND
• Both store data by trapping charge on a “floating gate”
– Density is ~4X DRAM
– Nonvolatile, no refresh
– Block oriented
• Must erase a block (to all 1’s) before writing it
• Writing turns 1’s into 0’s, but not the other way.
• Writes must be to sequential addresses
– Reads are fast, writes are slow.
– Wears out with use
• ~10⁶ erase/write cycles
• This may have been solved recently
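The erase/write asymmetry above can be shown on a single byte. This is a toy model of NAND program semantics (real devices program whole pages): erase sets all bits to 1, and programming can only clear bits.

```python
# Toy model of NAND flash write semantics on one byte: erase sets all
# bits to 1, and a program operation can only turn 1s into 0s.
ERASED = 0xFF

def program(cell, data):
    return cell & data               # can clear bits, never set them

cell = ERASED
cell = program(cell, 0b1010_1100)    # first write after erase stores the data
cell = program(cell, 0b1010_1111)    # later write cannot restore cleared bits
# cell is still 0b1010_1100: the 0 bits from the first write remain 0
```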
Flash Applications (2)
• NOR: BIOS storage, FPGA bitstream storage
• NAND: Disk replacement (SSD = “solid state disk”)
– Cost per bit is 10X a magnetic disk
– No rotational delay
– No seek time
– Great for data that is mostly read randomly, and which
can be written sequentially (a log).
• This is done in a “Flash translation layer”
– Some vendors use non-disk interfaces to improve
performance
– Another option: Change the system storage hierarchy
• Now: CPU -> Cache -> DRAM -> Disk
• Future: CPU -> Cache -> DRAM -> Flash -> Disk
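The flash translation layer mentioned above turns random logical writes into the sequential writes flash requires. A minimal sketch, with a hypothetical `TinyFTL` class (ignoring erase blocks and garbage collection):

```python
# Toy flash translation layer: random logical writes are appended
# sequentially to an append-only log, with a map from logical page
# numbers to physical log positions.
class TinyFTL:
    def __init__(self):
        self.log = []        # physical flash modeled as an append-only page log
        self.map = {}        # logical page number -> index in self.log

    def write(self, lpage, data):
        self.map[lpage] = len(self.log)   # remap to the newly written page
        self.log.append(data)             # sequential (log-structured) write

    def read(self, lpage):
        return self.log[self.map[lpage]]

ftl = TinyFTL()
ftl.write(7, "v1")
ftl.write(7, "v2")    # overwrite goes to a new physical page, not in place
# ftl.read(7) == "v2"; the stale "v1" page awaits garbage collection
```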
Die stacking
• A packaging trick that allows us to change the
architectural interface to the DRAM.
• Can transfer more data at once because the
wires are smaller, and have less capacitance
• Many problems:
– Must thin the DRAM die
– Must have vias through the die
– Must use known-good die
• Can’t repair a stack
Transactional Memory
• > 500 papers in the last decade
• Problem with parallel computing: Shared, mutable state.
– Usually handled with mutual exclusion, but programming with
locks is very hard, since locking abstractions don’t compose.
• Simple idea from databases. Transactions are regions of a
program with ACID properties:
– Atomic: “All or nothing”
– Consistent: Data is never visible in an inconsistent state
• Debit/credit example
– Isolated: Execution is equivalent to some serial interleaving of
the transactions. If one transaction would affect another, one of
them aborts.
– Durable: Survives system crashes, power failures
• Hardware transactional memory usually doesn’t do this.
• TM is beginning to appear in CPU architectures
– But we need to gain experience in using it in real programs.
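The debit/credit example above can be sketched in software. This is an illustrative optimistic, transaction-style retry loop, not hardware TM: reads and computation run speculatively, then a commit phase validates that nothing changed and either applies both updates or retries.

```python
# Software illustration of the debit/credit transaction: speculative
# reads, then validate-and-commit; on conflict, retry (like an abort).
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.version = 0                     # bumped on every committed update
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    while True:                              # retry until the transaction commits
        v_src, v_dst = src.version, dst.version
        new_src = src.balance - amount       # speculative computation
        new_dst = dst.balance + amount
        first, second = sorted([src, dst], key=id)   # fixed lock order
        with first.lock, second.lock:        # commit phase
            if src.version == v_src and dst.version == v_dst:
                src.balance, dst.balance = new_src, new_dst
                src.version += 1
                dst.version += 1
                return                       # atomic: both updates or neither

a, b = Account(100), Account(50)
transfer(a, b, 30)
# a.balance == 70, b.balance == 80; the total is conserved
```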
Final thoughts
• We still have a lot of unsolved problems in
computing.
• Until/unless a radical new technology appears,
improvements in architecture and software
are our major lever.