Title: How Computer Architecture Trends May Affect Future Distributed Systems
1. How Computer Architecture Trends May Affect Future Distributed Systems
- Mark D. Hill
- Computer Sciences Department
- University of Wisconsin--Madison
- http://www.cs.wisc.edu/~markhill
- PODC 00 Invited Talk
2. Three Questions
- What is a System Area Network (SAN) and how will it affect clusters?
  - E.g., InfiniBand
- How fat will multiprocessor servers be, and how do we build larger ones?
  - E.g., Wisconsin Multifacet's Multicast & Timestamp Snooping
- Future of multiprocessor servers & clusters?
  - A merging of both?
3. Outline
- Motivation
- System Area Networks
- Designing Multiprocessor Servers
- Server Cluster Trends
4. Technology Push: Moore's Law
- What do the following intervals have in common?
  - Prehistory to 2000
  - 2001 to 2002
- Answer: equal progress in absolute processor speed (and more doubling in 2003-4, 2005-6, etc.)
  - Consider salary doubling
- Corollary: cost halves every two years
- Jim Gray: "In a decade you can buy a computer for less than its sales tax today"
5. Application Pull
- Should use computers in currently wasteful ways
  - Already computers in electric razors & greeting cards
- New business models
  - B2C, B2B, C2B, C2C
  - Mass customization
- More proactive (beyond interactive) [Tennenhouse]
  - Today: P2C, where P = Person, C = Computer
  - More C2P: a mattress adjusts to save your back
  - More C2C: agents surf the web for the optimal deal
- More sensors (physical/logical worlds coupled)
- More hidden computers (cf. electric motors)
- Furthermore, I am wrong
6. The Internet Iceberg
- Internet Components
- Clients -- mobile, wireless
- On Ramp -- LANs/DSL/Cable Modems
- WAN Backbone -- IPv6, massive BW
- and ...
- SERVICES
- Scale Storage
- Scale Bandwidth
- Scale Computation
- High Availability
7. Outline
- Motivation
- System Area Networks
- What is a SAN?
- InfiniBand
- Virtualizing I/O with Queue Pairs
- Predictions
- Designing Multiprocessor Servers
- Server Cluster Trends
8. Regarding Storage/Bandwidth
- Currently resides on I/O Bus (PCI)
- HW & SW protocol stacks
- Must add hosts to add storage/bandwidth
[Figure: host processor and memory connect through a bridge to an I/O bus with I/O slots 0 ... n-1]
9. Want: System Area Network (SAN)
- SAN vs. Local Area Network (LAN)
  - Higher bandwidth (10 Gbps)
  - Lower latency (a few microseconds or less)
  - More limited size
  - Other (e.g., single administrative domain, short distance)
- Examples: Tandem ServerNet, Myricom Myrinet
- Emerging standard: InfiniBand
  - www.infinibandta.org, with spec 1.0 in Summer 2000
  - Compaq, Dell, HP, IBM, Intel, Microsoft, Sun, others
  - 2.5 Gbit/s x 1, 4, or 12 wires (i.e., 2.5, 10, or 30 Gbit/s)
10. InfiniBand Model (from website)
[Figure: hosts reach a switched fabric through Host Channel Adapters (HCAs); targets (disks) attach through Target Channel Adapters (TCAs)]
11. InfiniBand Advantages
- Storage/network made orthogonal to computation
- Reduced hardware stack -- no I/O bridge
- Reduced software stack -- hardware support for
  - Connected Reliable
  - Connected Unreliable
  - Datagram
  - Reliable Datagram
  - Raw Datagram
- Can eliminate system call for SAN use (next slide)
12. Virtualizing InfiniBand
- I/O traditionally virtualized with system calls
  - System enforces isolation
  - System permits authorized sharing
- Memory virtualized differently
  - System trap/call for setup
  - Virtual memory hardware for common-case translation
- InfiniBand exploits queue pairs (QPs) in memory
  - Cf. Intel Virtual Interface Architecture (VIA), IEEE Micro, Mar/Apr 98
  - Users issue sends, receives, & remote DMA reads/writes
13. Queue Pair
- QP set up via system call
  - Connect with process
  - Connect with remote QP (not shown here)
  - QP placed in pinned virtual memory
- User directly accesses the QP
  - E.g., sends, receives, & remote DMA reads/writes (see the sketch below)
[Figure: a process and the HCA share a QP in pinned main memory; the queues hold receive1, send1, receive2, send2, dma-R3, and dma-W4]
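As a concrete illustration, here is a minimal C sketch of user-level posting to a QP. The descriptor layout, the qp_post_send name, and the doorbell register are hypothetical (InfiniBand's real verbs interface differs), but it shows the slide's point: after the one-time setup system call, each send is just stores into pinned memory plus a doorbell write, with no system call per operation.

```c
/* Hypothetical QP sketch: descriptor rings in pinned memory, shared
 * by the user process and the HCA. All names/layouts are assumed. */
#include <stdint.h>

#define RING_SIZE 16

typedef struct {
    uint64_t buf_addr;   /* virtual address of the user buffer */
    uint32_t length;     /* bytes to transfer */
    uint32_t opcode;     /* e.g., SEND, RDMA_READ, RDMA_WRITE */
} work_desc_t;

typedef struct {
    work_desc_t send_ring[RING_SIZE];  /* send work queue */
    work_desc_t recv_ring[RING_SIZE];  /* receive work queue */
    volatile uint32_t send_tail;       /* advanced by the user */
    volatile uint32_t *doorbell;       /* memory-mapped HCA register */
} queue_pair_t;

/* Post a send with no system call: write a descriptor into pinned
 * memory, then ring the HCA's doorbell so it starts the DMA. */
void qp_post_send(queue_pair_t *qp, void *buf, uint32_t len, uint32_t op)
{
    work_desc_t *d = &qp->send_ring[qp->send_tail % RING_SIZE];
    d->buf_addr = (uint64_t)(uintptr_t)buf;
    d->length = len;
    d->opcode = op;
    qp->send_tail++;                 /* publish the descriptor */
    *qp->doorbell = qp->send_tail;   /* notify the HCA */
}
```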
14. InfiniBand, cont.
- Roadmap
  - NGIO/FIO merger in 99
  - Spec in 00
  - Products in 03-10
- My Assessment
  - PCI needs a successor
  - InfiniBand has the necessary features (but also many others)
  - InfiniBand has considerable industry buy-in (but it is recent)
  - Gigabit Ethernet will be the only competitor
    - Good name, with backing from Cisco et al.
    - But TCP/IP is a killer
  - InfiniBand for storage will be key
15. InfiniBand Research Issues
- Software: wide open
  - Industry will do local optimization (e.g., still have the device driver virtualized with system calls)
  - But what is the right way to do software?
  - Is there a theoretical model for this software?
- Other SAN Issues
  - A theoretical model of a service-provider's site?
  - How to trade performance and availability?
  - Utility of broadcast or multicast support?
  - Obtaining quasi-real-time performance?
16. Outline
- Motivation
- System Area Networks
- Designing Multiprocessor Servers
- How Fat?
- Coherence for Servers
- E.g., Multicast Snooping
- E.g., Timestamp Snooping
- Server Cluster Trends
17. How Fat Should Servers Be?
- In use today
  - PCs -- cheap but small
  - Workgroup servers -- medium cost & medium size
  - Large servers -- premium cost & size
- One answer: yes (all of the above)
18. How Do We Build the Big Servers?
- (Industry knows how to build the small ones)
- A key problem is the memory system
  - Memory Wall: e.g., a 100 ns memory access costs 400 instruction opportunities on a 4-way-issue 1 GHz processor (100 ns x 1 GHz x 4 instructions/cycle = 400)
- Use per-processor caches to reduce
  - Effective latency
  - Effective bandwidth used
- But the cache coherence problem ...
19. Coherence 101
[Figure: two processors with private caches and memory on an interconnection network; both read block 100 (r0 <- m[100], r1 <- m[100]) and cache the value 4; one then writes m[100] <- 5, leaving the other cache's copy of 4 stale]
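A minimal software mock-up of the slide's scenario (real caches do this in hardware; the variables below merely model the two private copies):

```c
/* The coherence problem of slide 19, modeled in plain C: two private
 * cached copies of memory block 100 go stale after a write. */
#include <stdio.h>

int main(void)
{
    int memory_100 = 4;          /* memory block 100 holds 4 */
    int cache_p0 = memory_100;   /* r0 <- m[100]: P0 caches 4 */
    int cache_p1 = memory_100;   /* r1 <- m[100]: P1 caches 4 */

    cache_p0 = 5;                /* m[100] <- 5, absorbed by P0's cache */

    /* Without a coherence protocol, P1 still reads the stale 4: */
    printf("P0 sees %d, P1 sees stale %d\n", cache_p0, cache_p1);
    return 0;
}
```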
20. Broadcast Snooping
[Figure: P2 broadcasts a GETX transaction for the block; every processor and memory snoops it, and the data goes directly to P2]
21. Broadcast Snooping
- Symmetric Multiprocessor (SMP)
  - Most commercially successful parallel computer architecture
  - Performs well by finding data directly
  - Scales poorly
- Improvements, e.g., Sun E10000
  - Split address & data transactions
  - Split address & data networks (e.g., bus & crossbar)
  - Multiple address buses (e.g., four, multiplexed by address; see the sketch below)
  - Address bus is a broadcast tree (not shared wires)
- But
  - Broadcasts all address transactions (expensive)
  - All processors must snoop all transactions
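To illustrate "multiplexed by address" (the constants below are assumptions, e.g., 64-byte blocks and four buses): block-address bits statically select one broadcast tree, so each bus carries roughly a quarter of the traffic while transactions for any given block still appear in a single total order.

```c
/* Sketch of E10000-style address-bus interleaving: the block address
 * picks one of four address buses, preserving per-block order. */
#include <stdint.h>

#define BLOCK_BITS 6   /* assumed 64-byte coherence blocks */
#define NUM_BUSES  4

unsigned address_bus_for(uint64_t addr)
{
    return (unsigned)((addr >> BLOCK_BITS) % NUM_BUSES);
}
```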
22. Directories
[Figure: GETX requests from P1 and P2 go to the directory at the block's home memory, which orders them and forwards the data point-to-point to each requester in turn]
23. Directories
- Directory-Based Cache Coherence
  - E.g., SGI/Cray Origin2000
  - Allows arbitrary point-to-point interconnection networks
  - Scales up well
- But
  - Cache-to-cache transfers common in demanding apps (55-62% sharing misses for OLTP [Barroso, ISCA 98])
  - Many applications can't use 100s of processors
  - Must also scale down well
24. Wisconsin Multifacet Big Picture
- Build servers for the Internet economy
  - Moderate multiprocessor sizes: 2-8, then 16-64, but not 1K
  - Optimize for these workloads (e.g., cache-to-cache transfers)
- Key tool: multiprocessor prediction & speculation
  - Make a guess ... verify it later
  - Uniprocessor predecessors: branch & set predictors
  - Recent multiprocessor work: Mukherjee/Hill ISCA 98, Kaxiras/Goodman HPCA 99, Lai/Falsafi ISCA 99
- Multicast Snooping
- Timestamp Snooping
25. Comparison of Coherence Methods
- Snooping finds data directly but broadcasts everything; directories scale but add indirection
- Use prediction to improve on both?
26. Multicast Snooping
- On a cache miss
  - Predict a "multicast mask" (e.g., a bit vector of processors)
  - Issue the transaction on a multicast address network
- Networks
  - Address network that totally orders address multicasts
  - Separate point-to-point data network
- Processors snoop all incoming transactions
  - If it's your own, it "occurs" now
  - If another's, then invalidate and/or respond
- Simplified directory (at memory)
  - Purpose: allows masks to be wrong (explained later)
27. Predicting Masks
- Performed at the requesting processor
  - Include the owner (GETS/GETX) & all sharers (GETX only)
  - Exclude most other processors
- Techniques (one sketched below)
  - Many straightforward cases (e.g., stack, code, space-sharing)
  - Many options (network load, PC, software, local/global)
[Figure: a Mask Predictor maps a block address, plus feedback from past transactions, to a predicted mask]
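A sketch of one simple predictor organization, with all names and sizes assumed (the slide leaves the design space open): a direct-mapped table of destination masks indexed by block address, trained by feedback about which processors actually needed each transaction.

```c
/* Hypothetical multicast-mask predictor: a direct-mapped table of
 * 16-bit destination masks indexed by block address. Table size,
 * block size, and the home-node mapping are all assumptions. */
#include <stdint.h>

#define TABLE_SIZE 1024
#define BLOCK_BITS 6   /* assumed 64-byte blocks */
/* Assumed home-node mapping: block address picks one of 16 nodes. */
#define HOME_MASK(addr) ((uint16_t)1 << (((addr) >> BLOCK_BITS) % 16))

static uint16_t mask_table[TABLE_SIZE];

static unsigned index_of(uint64_t block_addr)
{
    return (unsigned)((block_addr >> BLOCK_BITS) % TABLE_SIZE);
}

/* Predict: previously observed sharers plus the home memory, so a
 * wrong guess can still be caught by the simplified directory. */
uint16_t predict_mask(uint64_t block_addr)
{
    return mask_table[index_of(block_addr)] | HOME_MASK(block_addr);
}

/* Feedback: remember which destinations the last transaction needed. */
void train_mask(uint64_t block_addr, uint16_t actual_sharers)
{
    mask_table[index_of(block_addr)] = actual_sharers;
}
```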
28. Implementing an Ordered Multicast Network
- Address Network
  - Must create the illusion of a total order of multicasts
  - May deliver a multicast to destinations at different times
- Wish List
  - High throughput for multicasts
  - No centralized bottlenecks
  - Low latency and cost (~ pipelined broadcast tree)
  - ...
- Sample Solutions
  - Isotach Networks [Reynolds et al., IEEE TPDS 4/97]
  - Indirect Fat Tree [ISCA 99]
  - Direct Torus
29. Indirect Fat Tree [ISCA 99]
[Figure: an indirect fat-tree network; processor (P), directory (D), and memory (M) nodes sit at the leaves, with switches above them]
30. Indirect Fat Tree, cont.
- Basic Idea (merge rule sketched below)
  - Processors send transactions up to the roots
  - Roots send transactions down with logical timestamps
  - Switches stall transactions to keep them in order
  - Null transactions are sent to avoid deadlock
- Assessment
  - Viable, with high cross-section bandwidth
  - Many "backplane" ASICs mean higher cost
  - Often stalls transactions
- Want
  - Lower cost of direct connections
  - Always deliver transactions as soon as possible (ASAP)
  - Sacrifice some cross-section bandwidth
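A sketch of the per-switch ordering rule just described, under the slide's assumptions: each input stream already arrives in timestamp order, and idle inputs periodically carry null transactions so the merge never waits forever. Structure names are illustrative.

```c
/* Down-path switch merge: forward the smallest-timestamp head, but
 * only when every input has a head (otherwise the next timestamp on
 * an empty input is unknown and the switch must stall). */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t timestamp;
    bool is_null;   /* null transaction: advances time, carries no work */
} txn_t;

typedef struct {
    txn_t head;     /* next undelivered transaction from this input */
    bool has_head;
} port_t;

/* Returns 1 and fills *out with a real transaction to forward,
 * 0 when a null was consumed (it only advances logical time),
 * -1 when the switch must stall for a missing head. */
int switch_step(port_t in[], int nports, txn_t *out)
{
    int min = -1;
    for (int i = 0; i < nports; i++) {
        if (!in[i].has_head)
            return -1;                  /* must stall */
        if (min < 0 || in[i].head.timestamp < in[min].head.timestamp)
            min = i;
    }
    *out = in[min].head;                /* smallest timestamp goes next */
    in[min].has_head = false;           /* consume it */
    return out->is_null ? 0 : 1;
}
```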
31. Direct 2-D Torus (work in progress)
- Features
  - Each processor is a switch
  - Switches are directly connected
  - E.g., the network of the Compaq 21364
- Network order?
  - Broadcasts are unordered
  - Snooping needs a total order
- Solution
  - Create order with logical timestamps instead of network delivery order
  - Called Timestamp Snooping [ASPLOS 00]
[Figure: a 4x4 2-D torus of sixteen nodes (0-15)]
32. Timestamp Snooping
- Timestamp Snooping
  - Snooping with order determined by logical timestamps
  - Broadcast (not multicast) in [ASPLOS 00]
- Basic Idea (sketched below)
  - Assign a timestamp to each coherence transaction at the sender
  - Broadcast transactions over the unordered network ASAP
  - Transactions carry timestamps (2 bits)
  - Processors process transactions in timestamp order
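A sketch of the per-processor reordering step (slide 33 notes the priority queue): arrivals are buffered, then processed in timestamp order once the network guarantees nothing older can still arrive. Names, the 32-bit timestamps, and the fixed-size queue are assumptions; per the slide, real timestamps need only a couple of bits, and flow control is omitted here.

```c
/* Buffer out-of-order arrivals; process them in timestamp order once
 * the network promises no smaller timestamp can still arrive. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t timestamp;   /* logical time assigned at the sender */
    uint32_t address;     /* block address of the coherence txn */
} txn_t;

#define QCAP 64
static txn_t pending[QCAP];   /* kept sorted by timestamp */
static int npending;          /* overflow/flow control omitted */

static void process_snoop(const txn_t *t)   /* stand-in for real snoop */
{
    printf("snoop ts=%u addr=%u\n", t->timestamp, t->address);
}

void enqueue(txn_t t)         /* insertion sort into the small queue */
{
    int i = npending++;
    while (i > 0 && pending[i - 1].timestamp > t.timestamp) {
        pending[i] = pending[i - 1];
        i--;
    }
    pending[i] = t;
}

/* Network guarantee: no future arrival has timestamp < guarantee,
 * so everything older is now safe to process in order. */
void drain(uint32_t guarantee)
{
    int done = 0;
    while (done < npending && pending[done].timestamp < guarantee)
        process_snoop(&pending[done++]);
    for (int i = done; i < npending; i++)    /* compact the queue */
        pending[i - done] = pending[i];
    npending -= done;
}
```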
33. Timestamp Snooping Issues
- More address bandwidth
  - For 16 processors, a 4-ary butterfly, and 64-byte blocks
  - Directory: 38, 372 (240% more)
  - Timestamp Snooping: 218, 372, 384 (< 60% more)
- Network must guarantee timestamps
  - Assert that future transactions will have greater timestamps (so a processor can process older transactions)
  - Isotach [Reynolds, IEEE TPDS 4/97] does this more aggressively
- Other
  - Priority queue at each processor to order transactions
  - Flow control and buffering issues
34. Initial Multifacet Results
- Multicast Snooping [ISCA 99]
  - Ordered multicast of coherence transactions
  - Find data directly from memory or caches
  - Reduce bandwidth to permit some scaling
  - 32-processor results show 2-6 destinations per multicast
- Timestamp Snooping [ASPLOS 00]
  - Broadcast snooping with order determined by logical timestamps carried by coherence transactions
  - No bus: allows arbitrary memory interconnects
  - No directory or directory indirection
  - 16-processor results show 25% faster for 25% more traffic
35. Selected Issues
- Multicast Snooping
  - What program property are mask predictors exploiting?
  - Why is there no good model of locality, or of the 90-10 rule in general?
  - How does one build multicast networks?
  - What about fault tolerance?
- Timestamp Snooping
  - What is an optimal network topology?
  - What about buffering, deadlock, etc.?
  - Implementing switches and priority queues?
36. Outline
- Motivation
- System Area Networks
- Designing Multiprocessor Servers
- Server Cluster Trends
- Out-of-box and highly-available servers
- High-performance communication for clusters
37. Multiprocessor Servers
- High-performance communication within a box
  - SMPs (e.g., Intel Pentium Pro Quads)
  - Directory-based (SGI Origin2000)
- Trend toward hierarchical out-of-box solutions
  - Build bigger servers from smaller ones
  - Intel Profusion, Sequent NUMA-Q, Sun WildFire (pictured)
38. Multiprocessor Servers, cont.
- Traditionally had poor error isolation
  - A double-bit ECC error crashes everything
  - A kernel error crashes everything
  - Poor match for highly available Internet infrastructure
- Improve error isolation
  - IBM 370 virtual machines
  - Stanford HIVE cells
39. Clusters
- Traditionally
  - Good error isolation
  - Poor communication performance (especially latency)
  - LANs are not optimized for clusters
- Enter Early SANs
  - Berkeley NOW w/ Myricom Myrinet
  - IBM SP w/ proprietary network
- What now, with the InfiniBand SAN (or alternatives)?
40. A Prediction
- Blurring of cluster & server boundaries
- Clusters
  - High communication performance
- Servers
  - Better error isolation
  - Multi-box solutions
- Use the same hardware & configure it in the field
- Issues
  - How do we model these hybrids?
  - Should PODC & SPAA also converge?
41. Three Questions
- What is a System Area Network (SAN) and how will it affect clusters?
  - E.g., InfiniBand
  - Make computation, storage, & network orthogonal
- How fat will multiprocessor servers be, and how do we build larger ones?
  - Varying sizes for soft & hard state
  - E.g., Multicast Snooping & Timestamp Snooping
- Future of multiprocessor servers & clusters?
  - Servers will support higher availability & extra-box solutions
  - Clusters will get server communication performance