Title: How Computer Architecture Trends May Affect Future Distributed Systems
1. How Computer Architecture Trends May Affect Future Distributed Systems
- Mark D. Hill
- Computer Sciences Department
- University of Wisconsin--Madison
- http://www.cs.wisc.edu/~markhill
- PODC 00 Invited Talk
2. Three Questions
- What is a System Area Network (SAN) and how will it affect clusters?
  - E.g., InfiniBand
- How fat will multiprocessor servers be, and how do we build larger ones?
  - E.g., Wisconsin Multifacet's Multicast & Timestamp Snooping
- Future of multiprocessor servers & clusters?
  - A merging of both?
3. Outline
- Motivation
- System Area Networks
- Designing Multiprocessor Servers
- Server Cluster Trends
4. Technology Push: Moore's Law
- What do the following intervals have in common?
  - Prehistory to 2000
  - 2001 to 2002
- Answer: equal progress in absolute processor speed (and more doubling in 2003-4, 2005-6, etc.)
  - Consider salary doubling
- Corollary: cost halves every two years
- Jim Gray: "In a decade you can buy a computer for less than its sales tax today"
5. Application Pull
- Should use computers in currently wasteful ways
  - Already computers in electric razors & greeting cards
- New business models
  - B2C, B2B, C2B, C2C
  - Mass customization
- More proactive (beyond interactive) [Tennenhouse]
  - Today: P2C, where P = Person, C = Computer
  - More C2P: a mattress adjusts to save your back
  - More C2C: agents surf the web for the optimal deal
- More sensors (physical/logical worlds coupled)
- More hidden computers (cf. electric motors)
- Furthermore, I am wrong
6. The Internet Iceberg
- Internet Components
- Clients -- mobile, wireless
- On Ramp -- LANs/DSL/Cable Modems
- WAN Backbone -- IPv6, massive BW
- and ...
- SERVICES
- Scale Storage
- Scale Bandwidth
- Scale Computation
- High Availability
7. Outline
- Motivation
- System Area Networks
- What is a SAN?
- InfiniBand
- Virtualizing I/O with Queue Pairs
- Predictions
- Designing Multiprocessor Servers
- Server Cluster Trends
8. Regarding Storage/Bandwidth
- Currently resides on I/O Bus (PCI)
- HW & SW protocol stacks
- Must add hosts to add storage/bandwidth
[Figure: host processor and memory connect through a bridge to an I/O bus with I/O slots 0 ... n-1]
9. Want: System Area Network (SAN)
- SAN vs. Local Area Network (LAN)
  - Higher bandwidth (10 Gbps)
  - Lower latency (a few microseconds or less)
  - More limited size
  - Other (e.g., single administrative domain, short distance)
- Examples: Tandem ServerNet, Myricom Myrinet
- Emerging standard: InfiniBand
  - www.infinibandta.org, with spec 1.0 in Summer 2000
  - Compaq, Dell, HP, IBM, Intel, Microsoft, Sun, others
  - 2.5 Gbit/s x 1, 4, or 12 wires (i.e., 2.5, 10, or 30 Gbit/s)
10. InfiniBand Model (from website)
[Figure: hosts reach a switched fabric through Host Channel Adapters (HCAs); targets (disks) attach through Target Channel Adapters (TCAs)]
11. InfiniBand Advantages
- Storage/network made orthogonal to computation
- Reduced hardware stack -- no I/O bridge
- Reduced software stack -- hardware support for
  - Connected Reliable
  - Connected Unreliable
  - Datagram
  - Reliable Datagram
  - Raw Datagram
- Can eliminate system call for SAN use (next slide)
12. Virtualizing InfiniBand
- I/O traditionally virtualized with system calls
  - System enforces isolation
  - System permits authorized sharing
- Memory virtualized differently
  - System trap/call for setup
  - Virtual memory hardware for common-case translation
- InfiniBand exploits queue pairs (QPs) in memory
  - Cf. Intel Virtual Interface Architecture (VIA), IEEE Micro, Mar/Apr 98
  - Users issue sends, receives, & remote DMA reads/writes
13. Queue Pair
- QP set up via system call
  - Connect with process
  - Connect with remote QP (not shown here)
  - QP placed in pinned virtual memory
- User directly accesses the QP
  - E.g., sends, receives, & remote DMA reads/writes (see the sketch below)
[Figure: a process and the HCA share a QP in pinned main memory; the queues hold receive1, send1, receive2, send2, dma-R3, and dma-W4]
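As a concrete illustration, here is a minimal C sketch of user-level posting to a QP. The descriptor layout, the qp_post_send name, and the doorbell register are hypothetical (InfiniBand's real verbs interface differs), but it shows the slide's point: after the one-time setup system call, each send is just stores into pinned memory plus a doorbell write, with no system call per operation.

```c
/* Hypothetical QP sketch: descriptor rings in pinned memory, shared
 * by the user process and the HCA. All names/layouts are assumed. */
#include <stdint.h>

#define RING_SIZE 16

typedef struct {
    uint64_t buf_addr;   /* virtual address of the user buffer */
    uint32_t length;     /* bytes to transfer */
    uint32_t opcode;     /* e.g., SEND, RDMA_READ, RDMA_WRITE */
} work_desc_t;

typedef struct {
    work_desc_t send_ring[RING_SIZE];  /* send work queue */
    work_desc_t recv_ring[RING_SIZE];  /* receive work queue */
    volatile uint32_t send_tail;       /* advanced by the user */
    volatile uint32_t *doorbell;       /* memory-mapped HCA register */
} queue_pair_t;

/* Post a send with no system call: write a descriptor into pinned
 * memory, then ring the HCA's doorbell so it starts the DMA. */
void qp_post_send(queue_pair_t *qp, void *buf, uint32_t len, uint32_t op)
{
    work_desc_t *d = &qp->send_ring[qp->send_tail % RING_SIZE];
    d->buf_addr = (uint64_t)(uintptr_t)buf;
    d->length = len;
    d->opcode = op;
    qp->send_tail++;                 /* publish the descriptor */
    *qp->doorbell = qp->send_tail;   /* notify the HCA */
}
```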
14. InfiniBand, cont.
- Roadmap
  - NGIO/FIO merger in 99
  - Spec in 00
  - Products in 03-10
- My Assessment
  - PCI needs a successor
  - InfiniBand has the necessary features (but also many others)
  - InfiniBand has considerable industry buy-in (but it is recent)
  - Gigabit Ethernet will be the only competitor
    - Good name, with backing from Cisco et al.
    - But TCP/IP is a killer
  - InfiniBand for storage will be key
15. InfiniBand Research Issues
- Software: wide open
  - Industry will do local optimization (e.g., still have the device driver virtualized with system calls)
  - But what is the right way to do software?
  - Is there a theoretical model for this software?
- Other SAN Issues
  - A theoretical model of a service-provider's site?
  - How to trade performance and availability?
  - Utility of broadcast or multicast support?
  - Obtaining quasi-real-time performance?
16. Outline
- Motivation
- System Area Networks
- Designing Multiprocessor Servers
- How Fat?
- Coherence for Servers
- E.g., Multicast Snooping
- E.g., Timestamp Snooping
- Server Cluster Trends
17. How Fat Should Servers Be?
- In use today
  - PCs -- cheap but small
  - Workgroup servers -- medium cost & medium size
  - Large servers -- premium cost & size
- One answer: yes (all of the above)
18. How Do We Build the Big Servers?
- (Industry knows how to build the small ones)
- A key problem is the memory system
  - Memory Wall: e.g., a 100 ns memory access costs 400 instruction opportunities on a 4-way-issue 1 GHz processor (100 ns x 1 GHz x 4 instructions/cycle = 400)
- Use per-processor caches to reduce
  - Effective latency
  - Effective bandwidth used
- But the cache coherence problem ...
19. Coherence 101
[Figure: two processors with private caches and memory on an interconnection network; both read block 100 (r0 <- m[100], r1 <- m[100]) and cache the value 4; one then writes m[100] <- 5, leaving the other cache's copy of 4 stale]
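A minimal software mock-up of the slide's scenario (real caches do this in hardware; the variables below merely model the two private copies):

```c
/* The coherence problem of slide 19, modeled in plain C: two private
 * cached copies of memory block 100 go stale after a write. */
#include <stdio.h>

int main(void)
{
    int memory_100 = 4;          /* memory block 100 holds 4 */
    int cache_p0 = memory_100;   /* r0 <- m[100]: P0 caches 4 */
    int cache_p1 = memory_100;   /* r1 <- m[100]: P1 caches 4 */

    cache_p0 = 5;                /* m[100] <- 5, absorbed by P0's cache */

    /* Without a coherence protocol, P1 still reads the stale 4: */
    printf("P0 sees %d, P1 sees stale %d\n", cache_p0, cache_p1);
    return 0;
}
```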
20. Broadcast Snooping
[Figure: P2 broadcasts a GETX transaction for the block; every processor and memory snoops it, and the data goes directly to P2]
21. Broadcast Snooping
- Symmetric Multiprocessor (SMP)
  - Most commercially successful parallel computer architecture
  - Performs well by finding data directly
  - Scales poorly
- Improvements, e.g., Sun E10000
  - Split address & data transactions
  - Split address & data networks (e.g., bus & crossbar)
  - Multiple address buses (e.g., four, multiplexed by address; see the sketch below)
  - Address bus is a broadcast tree (not shared wires)
- But
  - Broadcasts all address transactions (expensive)
  - All processors must snoop all transactions
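To illustrate "multiplexed by address" (the constants below are assumptions, e.g., 64-byte blocks and four buses): block-address bits statically select one broadcast tree, so each bus carries roughly a quarter of the traffic while transactions for any given block still appear in a single total order.

```c
/* Sketch of E10000-style address-bus interleaving: the block address
 * picks one of four address buses, preserving per-block order. */
#include <stdint.h>

#define BLOCK_BITS 6   /* assumed 64-byte coherence blocks */
#define NUM_BUSES  4

unsigned address_bus_for(uint64_t addr)
{
    return (unsigned)((addr >> BLOCK_BITS) % NUM_BUSES);
}
```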
22. Directories
[Figure: GETX requests from P1 and P2 go to the directory at the block's home memory, which orders them and forwards the data point-to-point to each requester in turn]
23. Directories
- Directory-Based Cache Coherence
  - E.g., SGI/Cray Origin2000
  - Allows arbitrary point-to-point interconnection networks
  - Scales up well
- But
  - Cache-to-cache transfers common in demanding apps (55-62% sharing misses for OLTP [Barroso, ISCA 98])
  - Many applications can't use 100s of processors
  - Must also scale down well
24. Wisconsin Multifacet Big Picture
- Build servers for the Internet economy
  - Moderate multiprocessor sizes: 2-8, then 16-64, but not 1K
  - Optimize for these workloads (e.g., cache-to-cache transfers)
- Key tool: multiprocessor prediction & speculation
  - Make a guess ... verify it later
  - Uniprocessor predecessors: branch & set predictors
  - Recent multiprocessor work: Mukherjee/Hill ISCA 98, Kaxiras/Goodman HPCA 99, Lai/Falsafi ISCA 99
- Multicast Snooping
- Timestamp Snooping
25. Comparison of Coherence Methods
- Snooping finds data directly but broadcasts everything; directories scale but add indirection
- Use prediction to improve on both?
26. Multicast Snooping
- On a cache miss
  - Predict a "multicast mask" (e.g., a bit vector of processors)
  - Issue the transaction on a multicast address network
- Networks
  - Address network that totally orders address multicasts
  - Separate point-to-point data network
- Processors snoop all incoming transactions
  - If it's your own, it "occurs" now
  - If another's, then invalidate and/or respond
- Simplified directory (at memory)
  - Purpose: allows masks to be wrong (explained later)
27. Predicting Masks
- Performed at the requesting processor
  - Include the owner (GETS/GETX) & all sharers (GETX only)
  - Exclude most other processors
- Techniques (one sketched below)
  - Many straightforward cases (e.g., stack, code, space-sharing)
  - Many options (network load, PC, software, local/global)
[Figure: a Mask Predictor maps a block address, plus feedback from past transactions, to a predicted mask]
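A sketch of one simple predictor organization, with all names and sizes assumed (the slide leaves the design space open): a direct-mapped table of destination masks indexed by block address, trained by feedback about which processors actually needed each transaction.

```c
/* Hypothetical multicast-mask predictor: a direct-mapped table of
 * 16-bit destination masks indexed by block address. Table size,
 * block size, and the home-node mapping are all assumptions. */
#include <stdint.h>

#define TABLE_SIZE 1024
#define BLOCK_BITS 6   /* assumed 64-byte blocks */
/* Assumed home-node mapping: block address picks one of 16 nodes. */
#define HOME_MASK(addr) ((uint16_t)1 << (((addr) >> BLOCK_BITS) % 16))

static uint16_t mask_table[TABLE_SIZE];

static unsigned index_of(uint64_t block_addr)
{
    return (unsigned)((block_addr >> BLOCK_BITS) % TABLE_SIZE);
}

/* Predict: previously observed sharers plus the home memory, so a
 * wrong guess can still be caught by the simplified directory. */
uint16_t predict_mask(uint64_t block_addr)
{
    return mask_table[index_of(block_addr)] | HOME_MASK(block_addr);
}

/* Feedback: remember which destinations the last transaction needed. */
void train_mask(uint64_t block_addr, uint16_t actual_sharers)
{
    mask_table[index_of(block_addr)] = actual_sharers;
}
```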
28. Implementing an Ordered Multicast Network
- Address Network
  - Must create the illusion of a total order of multicasts
  - May deliver a multicast to destinations at different times
- Wish List
  - High throughput for multicasts
  - No centralized bottlenecks
  - Low latency and cost (~ pipelined broadcast tree)
  - ...
- Sample Solutions
  - Isotach Networks [Reynolds et al., IEEE TPDS 4/97]
  - Indirect Fat Tree [ISCA 99]
  - Direct Torus
29. Indirect Fat Tree [ISCA 99]
[Figure: an indirect fat-tree network; processor (P), directory (D), and memory (M) nodes sit at the leaves, with switches above them]
30. Indirect Fat Tree, cont.
- Basic Idea (merge rule sketched below)
  - Processors send transactions up to the roots
  - Roots send transactions down with logical timestamps
  - Switches stall transactions to keep them in order
  - Null transactions are sent to avoid deadlock
- Assessment
  - Viable, with high cross-section bandwidth
  - Many "backplane" ASICs mean higher cost
  - Often stalls transactions
- Want
  - Lower cost of direct connections
  - Always deliver transactions as soon as possible (ASAP)
  - Sacrifice some cross-section bandwidth
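A sketch of the per-switch ordering rule just described, under the slide's assumptions: each input stream already arrives in timestamp order, and idle inputs periodically carry null transactions so the merge never waits forever. Structure names are illustrative.

```c
/* Down-path switch merge: forward the smallest-timestamp head, but
 * only when every input has a head (otherwise the next timestamp on
 * an empty input is unknown and the switch must stall). */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t timestamp;
    bool is_null;   /* null transaction: advances time, carries no work */
} txn_t;

typedef struct {
    txn_t head;     /* next undelivered transaction from this input */
    bool has_head;
} port_t;

/* Returns 1 and fills *out with a real transaction to forward,
 * 0 when a null was consumed (it only advances logical time),
 * -1 when the switch must stall for a missing head. */
int switch_step(port_t in[], int nports, txn_t *out)
{
    int min = -1;
    for (int i = 0; i < nports; i++) {
        if (!in[i].has_head)
            return -1;                  /* must stall */
        if (min < 0 || in[i].head.timestamp < in[min].head.timestamp)
            min = i;
    }
    *out = in[min].head;                /* smallest timestamp goes next */
    in[min].has_head = false;           /* consume it */
    return out->is_null ? 0 : 1;
}
```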
31. Direct 2-D Torus (work in progress)
- Features
  - Each processor is a switch
  - Switches are directly connected
  - E.g., the network of the Compaq 21364
- Network order?
  - Broadcasts are unordered
  - Snooping needs a total order
- Solution
  - Create order with logical timestamps instead of network delivery order
  - Called Timestamp Snooping [ASPLOS 00]
[Figure: a 4x4 2-D torus of sixteen nodes (0-15)]
32. Timestamp Snooping
- Timestamp Snooping
  - Snooping with order determined by logical timestamps
  - Broadcast (not multicast) in [ASPLOS 00]
- Basic Idea (sketched below)
  - Assign a timestamp to each coherence transaction at the sender
  - Broadcast transactions over the unordered network ASAP
  - Transactions carry timestamps (2 bits)
  - Processors process transactions in timestamp order
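A sketch of the per-processor reordering step (slide 33 notes the priority queue): arrivals are buffered, then processed in timestamp order once the network guarantees nothing older can still arrive. Names, the 32-bit timestamps, and the fixed-size queue are assumptions; per the slide, real timestamps need only a couple of bits, and flow control is omitted here.

```c
/* Buffer out-of-order arrivals; process them in timestamp order once
 * the network promises no smaller timestamp can still arrive. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t timestamp;   /* logical time assigned at the sender */
    uint32_t address;     /* block address of the coherence txn */
} txn_t;

#define QCAP 64
static txn_t pending[QCAP];   /* kept sorted by timestamp */
static int npending;          /* overflow/flow control omitted */

static void process_snoop(const txn_t *t)   /* stand-in for real snoop */
{
    printf("snoop ts=%u addr=%u\n", t->timestamp, t->address);
}

void enqueue(txn_t t)         /* insertion sort into the small queue */
{
    int i = npending++;
    while (i > 0 && pending[i - 1].timestamp > t.timestamp) {
        pending[i] = pending[i - 1];
        i--;
    }
    pending[i] = t;
}

/* Network guarantee: no future arrival has timestamp < guarantee,
 * so everything older is now safe to process in order. */
void drain(uint32_t guarantee)
{
    int done = 0;
    while (done < npending && pending[done].timestamp < guarantee)
        process_snoop(&pending[done++]);
    for (int i = done; i < npending; i++)    /* compact the queue */
        pending[i - done] = pending[i];
    npending -= done;
}
```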
33. Timestamp Snooping Issues
- More address bandwidth
  - For 16 processors, a 4-ary butterfly, and 64-byte blocks
  - Directory: 38, 372 (240% more)
  - Timestamp Snooping: 218, 372, 384 (< 60% more)
- Network must guarantee timestamps
  - Assert that future transactions will have greater timestamps (so a processor can process older transactions)
  - Isotach [Reynolds, IEEE TPDS 4/97] does this more aggressively
- Other
  - Priority queue at each processor to order transactions
  - Flow control and buffering issues
34. Initial Multifacet Results
- Multicast Snooping [ISCA 99]
  - Ordered multicast of coherence transactions
  - Find data directly from memory or caches
  - Reduce bandwidth to permit some scaling
  - 32-processor results show 2-6 destinations per multicast
- Timestamp Snooping [ASPLOS 00]
  - Broadcast snooping with order determined by logical timestamps carried by coherence transactions
  - No bus: allows arbitrary memory interconnects
  - No directory or directory indirection
  - 16-processor results show 25% faster for 25% more traffic
35. Selected Issues
- Multicast Snooping
  - What program property are mask predictors exploiting?
  - Why is there no good model of locality, or of the 90-10 rule in general?
  - How does one build multicast networks?
  - What about fault tolerance?
- Timestamp Snooping
  - What is an optimal network topology?
  - What about buffering, deadlock, etc.?
  - Implementing switches and priority queues?
36. Outline
- Motivation
- System Area Networks
- Designing Multiprocessor Servers
- Server Cluster Trends
- Out-of-box and highly-available servers
- High-performance communication for clusters
37. Multiprocessor Servers
- High-performance communication within a box
  - SMPs (e.g., Intel Pentium Pro Quads)
  - Directory-based (SGI Origin2000)
- Trend toward hierarchical out-of-box solutions
  - Build bigger servers from smaller ones
  - Intel Profusion, Sequent NUMA-Q, Sun WildFire (pictured)
38. Multiprocessor Servers, cont.
- Traditionally had poor error isolation
  - A double-bit ECC error crashes everything
  - A kernel error crashes everything
  - Poor match for highly available Internet infrastructure
- Improve error isolation
  - IBM 370 virtual machines
  - Stanford HIVE cells
39. Clusters
- Traditionally
  - Good error isolation
  - Poor communication performance (especially latency)
  - LANs are not optimized for clusters
- Enter Early SANs
  - Berkeley NOW w/ Myricom Myrinet
  - IBM SP w/ proprietary network
- What now, with the InfiniBand SAN (or alternatives)?
40. A Prediction
- Blurring of cluster & server boundaries
- Clusters
  - High communication performance
- Servers
  - Better error isolation
  - Multi-box solutions
- Use the same hardware & configure it in the field
- Issues
  - How do we model these hybrids?
  - Should PODC & SPAA also converge?
41. Three Questions
- What is a System Area Network (SAN) and how will it affect clusters?
  - E.g., InfiniBand
  - Make computation, storage, & network orthogonal
- How fat will multiprocessor servers be, and how do we build larger ones?
  - Varying sizes for soft & hard state
  - E.g., Multicast Snooping & Timestamp Snooping
- Future of multiprocessor servers & clusters?
  - Servers will support higher availability & extra-box solutions
  - Clusters will get server communication performance