Title: Interconnection Networks
1Interconnection Networks
2Overview
- Physical Layer and Message Switching
- Network Topologies
- Metrics
- Deadlock and Livelock
- Routing Layer
- The Messaging Layer
3Interconnection Networks
- Fabric for scalable, multiprocessor architectures
- Distinct from traditional networking architectures such as Internet Protocol (IP)-based systems
4Resource View of Parallel Architectures
[Figure: pools of processor (P) and memory (M) resources interconnected by a network]
- How do we present these resources?
- What are the costs of different interconnection networks?
- What are the design considerations?
- What are the applications?
5Example Clusters Google Hardware Infrastructure
- VME rack: 19 in. wide, 6 feet tall, 30 inches deep
- Per side: 40 1-Rack-Unit (RU) PCs + 1 HP Ethernet switch (4 RU); each blade can contain 8 100-Mbit/s EN or a single 1-Gbit Ethernet interface
- Front + back → 80 PCs + 2 EN switches/rack
- Each rack connects to 2 128 x 1-Gbit/s EN switches
- Dec 2000: 40 racks at most recent site
- 6000 PCs, 12,000 disks: almost 1 petabyte!
- PC operates at about 55 Watts
- Rack → about 4500 Watts, 60 amps
6Reliability
- For 6000 PCs, 12,000 disks, 200 EN switches
- 20 PCs will need to be rebooted/day
- 2 PCs/day hardware failure, or 2-3% / year
- 5% due to problems with motherboard, power supply, and connectors
- 30% DRAM: bits change + errors in transmission (100 MHz)
- 30% disks fail
- 30% disks go very slow (10^-3 of expected BW)
- 200 EN switches, 2-3 fail in 2 years
- 6 Foundry switches: none failed, but 2-3 of the 96 switch blades have failed (16 blades/switch)
- Collocation site reliability
- 1 power failure, 1 network outage per year per site
7The Practical Problem
From Ambuj Goyal, "Computer Science Grand Challenge: Simplicity of Design," Computing Research Association Conference on "Grand Research Challenges" in Computer Science and Engineering, June 2002
8Example Embedded Devices
picoChip: http://www.picochip.com/
- Issues
- Execution performance
- Power dissipation
- Number of chip types
- Size and form factor
PACT XPP Technologies: http://www.pactcorp.com/
9Physical Layer and Message Switching
10Messaging Hierarchy
- Routing Layer: Where? Destination decisions, i.e., which output port
- Switching Layer: When? When is data forwarded
- Physical Layer: How? Synchronization of data transfer
- This organization is distinct from traditional networking implementations
- Emphasis is on low latency communication
- Only recently have standards been evolving
- InfiniBand: http://www.infinibandta.org/home
11The Physical Layer
[Figure: a message's data is divided into packets, each carrying a header, data, and checksum]
- Flit: flow control digit
- Phit: physical flow control digit
- Data is transmitted based on a hierarchical data structuring mechanism
- Messages → packets → flits → phits
- While flits and phits are fixed size, packets and data may be variable sized
12Flow Control
- Flow control: synchronized transfer of a unit of information
- Based on buffer management
- Asynchronous vs. synchronous flow control
- Flow control occurs at multiple levels
- message flow control
- physical flow control
- Mechanisms
- Credit-based flow control (sketched below)
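As a concrete illustration of the credit-based mechanism, here is a minimal sketch in C, assuming a fixed downstream buffer of MAX_CREDITS flit slots; all names are illustrative and not taken from any particular router implementation.

    /* Minimal sketch of credit-based flow control (illustrative names only).
     * The sender holds one credit per free buffer slot at the downstream
     * receiver; it may transmit a flit only while credits remain, and each
     * credit returned by the receiver reports one freed slot. */
    #include <stdbool.h>

    #define MAX_CREDITS 8           /* assumed downstream buffer depth (flits) */

    typedef struct {
        int credits;                /* free slots believed available downstream */
    } credit_link_t;

    static void link_init(credit_link_t *l) { l->credits = MAX_CREDITS; }

    /* Called by the sender: may a flit be forwarded right now? */
    static bool can_send_flit(const credit_link_t *l) { return l->credits > 0; }

    static void send_flit(credit_link_t *l) {
        l->credits--;               /* consume a credit for the slot the flit occupies */
    }

    /* Called when a credit arrives from the receiver (a slot was drained). */
    static void credit_returned(credit_link_t *l) {
        if (l->credits < MAX_CREDITS)
            l->credits++;
    }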
13Switching Layer
- Comprised of three sets of techniques
- switching techniques
- flow control
- buffer management
- Organization and operation of routers are largely determined by the switching layer
- Connection-oriented vs. connectionless communication
14Generic Router Architecture
[Figure: generic router architecture, annotated with wire delay, switching delay, and routing delay]
15Virtual Channels
- Each virtual channel is a pair of unidirectional channels
- Independently managed buffers multiplexed over the physical channel
- De-couples buffers from physical channels
- Originally introduced to break cyclic dependencies
- Improves performance through reduction of blocking delay
- Virtual lanes vs. virtual channels
- As the number of virtual channels increases, the increased channel multiplexing has two effects
- decrease in header delay
- increase in average data flit delay
- Impact on router performance
- switch complexity
16Circuit Switching
- Hardware path setup by a routing header or probe
- End-to-end acknowledgment initiates transfer at full hardware bandwidth
- Source routing vs. distributed routing
- System is limited by the signaling rate along the circuits → wave pipelining
17Packet Switching
- Blocking delays in circuit switching are avoided in packet-switched networks → full link utilization in the presence of data
- Increased storage requirements at the nodes
- Packetization and in-order delivery requirements
- Buffering
- use of local processor memory
- central queues
18Virtual Cut-Through
- Messages cut through to the next router when feasible
- In the absence of blocking, messages are pipelined
- pipeline cycle time is the larger of intra-router and inter-router flow control delays
- When the header is blocked, the complete message is buffered
- High-load behavior approaches that of packet switching
19Wormhole Switching
- Messages are pipelined, but buffer space is on the order of a few flits
- Small buffers + message pipelining → small, compact buffers
- Supports variable-sized messages
- Messages cannot be interleaved over a channel: routing information is only associated with the header
- Base latency is equivalent to that of virtual cut-through
20Comparison of Switching Techniques
- Packet switching and virtual cut-through (compare the no-load latency expressions below)
- consume network bandwidth proportional to network load
- predictable demands
- VCT behaves like wormhole at low loads and like packet switching at high loads
- link-level error control for packet switching
- Wormhole switching
- provides low latency
- lower saturation point
- higher variance of message latency than packet or VCT switching
- Virtual channels
- blocking delay vs. data delay
- router flow control latency
- Optimistic vs. conservative flow control
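To make the comparison concrete, here is an approximate set of no-load latency expressions in the style of the standard textbook models (e.g., Duato, Yalamanchili, and Ni); the symbols are assumptions introduced here: D hops, per-hop routing delay t_r, switch and wire delays t_s and t_w, message length L, channel (phit) width W, and header length L_h. Contention is ignored.

    % Approximate no-load latency models (assumed symbols; contention ignored)
    t_{circuit}  \approx D\,[\,t_r + 2(t_s + t_w)\,] + \lceil L/W \rceil\, t_w
                 % path setup round trip, then data at full bandwidth
    t_{packet}   \approx D\,[\,t_r + (t_s + t_w)\,\lceil (L + L_h)/W \rceil\,]
                 % store-and-forward: the whole packet crosses every hop
    t_{vct} \approx t_{wormhole} \approx D\,(t_r + t_s + t_w) + \max(t_s, t_w)\,\lceil L/W \rceil
                 % header routed hop by hop, payload pipelined behind it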
21Saturation
22Network Topologies
23Direct Networks
- Generally fixed degree
- Modular
- Topologies
- Meshes
- Multidimensional tori
- Special case of tori: the binary hypercube
24Indirect Networks
- indirect networks
- uniform base latency
- centralized or distributed control
- Engineering approximations to direct networks
[Figures: a Multistage Network and a Fat Tree Network; in the fat tree, bandwidth increases as you go up the tree]
25Generalized MINs
- Columns of k x k switches and connections between switches
- All switches are identical
- Directionality and control
- May concentrate, expand, or just connect
26Specific MINs
- Switch sizes and interstage interconnect establish distinct MINs
- The majority of interesting MINs have been shown to be topologically equivalent
27Metrics
28Evaluation Metrics
[Figure: a bisection cut of the network]
- Bisection bandwidth (a worked calculation follows below)
- This is the minimum bandwidth across any bisection of the network
- Bisection bandwidth is a limiting attribute of performance
- Latency
- Message transit time
- Node degree
- These are related to pin/wiring constraints
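As a worked example of the first metric, a small C sketch that computes the bisection width and bisection bandwidth of k-ary n-cube networks under the usual counting assumptions (a torus cut in half across one dimension severs 2*k^(n-1) channels because of the wraparound links; a mesh severs k^(n-1)). The channel bandwidth value is a placeholder.

    /* Sketch: bisection bandwidth of k-ary n-cube networks (assumed formulas). */
    #include <stdio.h>

    static long ipow(long base, int exp) {
        long r = 1;
        while (exp-- > 0) r *= base;
        return r;
    }

    /* channels crossing the bisection */
    static long bisection_width_torus(int k, int n) { return 2 * ipow(k, n - 1); }
    static long bisection_width_mesh (int k, int n) { return     ipow(k, n - 1); }

    int main(void) {
        double channel_gbps = 10.0;    /* placeholder per-channel bandwidth */
        /* 32-ary 2-cube vs. 10-ary 3-cube, as compared on a later slide */
        printf("32-ary 2-cube torus: %ld channels, %.0f Gbps bisection\n",
               bisection_width_torus(32, 2),
               bisection_width_torus(32, 2) * channel_gbps);
        printf("10-ary 3-cube torus: %ld channels, %.0f Gbps bisection\n",
               bisection_width_torus(10, 3),
               bisection_width_torus(10, 3) * channel_gbps);
        return 0;
    }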
29Constant Resource Analysis Bisection Width
30Constant Resource Analysis Pin out
31Latency Under Contention
32-ary 2-cube vs. 10-ary 3-cube
32Deadlock and Livelock
33Deadlock and Livelock
- Deadlock freedom can be ensured by enforcing constraints
- For example, following dimension-order routing in 2D meshes (sketched below)
- Similar constraints apply to other topologies
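A minimal sketch of dimension-order (XY) routing for a 2D mesh, the example constraint named above; the enum and function names are illustrative only.

    /* Sketch of dimension-order (XY) routing on a 2D mesh: always route to
     * completion in X before routing in Y, which removes the cyclic channel
     * dependencies that cause deadlock. Names are illustrative. */
    typedef enum { PORT_LOCAL, PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS } port_t;

    static port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
        if (dst_x > cur_x) return PORT_XPLUS;   /* finish the X dimension first */
        if (dst_x < cur_x) return PORT_XMINUS;
        if (dst_y > cur_y) return PORT_YPLUS;   /* then route in Y */
        if (dst_y < cur_y) return PORT_YMINUS;
        return PORT_LOCAL;                      /* arrived: eject to the local port */
    }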
34Occurrence of Deadlock
- Deadlock is caused by dependencies between buffers
35Deadlock in a Ring Network
36Deadlock Avoidance Principle
- Deadlock is caused by dependencies between buffers
37Routing Constraints on Virtual Channels
- Add multiple virtual channels to each physical channel
- Place routing restrictions between virtual channels
38Break Cycles
39Channel Dependence Graph
40Routing Layer
41Routing Protocols
42Key Routing Categories
- Deterministic
- The path is fixed by the source-destination pair
- Source routing
- Path is looked up prior to message injection (see the sketch after this list)
- May differ each time the network and NIs are initialized
- Adaptive routing
- Path is determined by run-time network conditions
- Unicast
- Single source to single destination
- Multicast
- Single source to multiple destinations
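For the source-routed case, a sketch of how a precomputed path can be carried in the header and consumed one hop at a time (the style Myrinet uses, discussed later); the structure, sizes, and field names are assumptions.

    /* Sketch of source routing: the sender looks up the whole path and places
     * it in the header as a list of output ports; each switch consumes the
     * next entry and forwards the rest. Names and sizes are illustrative. */
    #define MAX_HOPS 16

    typedef struct {
        unsigned char hops_left;           /* entries remaining in port_list */
        unsigned char port_list[MAX_HOPS]; /* output port to take at each switch */
    } src_route_hdr_t;

    /* Executed at each switch: returns the output port for this hop. */
    static int next_hop(src_route_hdr_t *hdr) {
        if (hdr->hops_left == 0)
            return -1;                     /* path exhausted: deliver locally */
        int port = hdr->port_list[0];
        /* shift the remaining path forward so the next switch sees its entry */
        for (int i = 1; i < hdr->hops_left; i++)
            hdr->port_list[i - 1] = hdr->port_list[i];
        hdr->hops_left--;
        return port;
    }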
43Software Layer
44The Message Layer
- Message layer background
- Cluster computers
- Myrinet SAN
- Design properties
- End-to-End communication path
- Injection
- Network transmission
- Ejection
- Overall performance
45Cluster Computers
- Cost-effective alternative to supercomputers
- Number of commodity workstations
- Specialized network hardware and software
- Result: large pool of host processors
Courtesy of C. Ulmer
46For Example..
Courtesy of C. Ulmer
47For Example..
Courtesy of C. Ulmer
48Cluster Networks
- Beowulf clusters
- Use Ethernet TCP/IP
- Cheap, but poor Host-to-Host performance
- Latencies 70-100 µs
- Bandwidths 80-800 Mbps
- System Area Network (SAN) clusters
- Custom hardware/software
- Examples: Myrinet, SCI, InfiniBand, QsNet
- Expensive, but good Host-to-Host performance
- Latencies as low as 3 µs
- Bandwidths up to 3 Gbps
Courtesy of C. Ulmer
49Myrinet
- Descendant of Caltech Mosaic project
- Wormhole network
- Source routing
- High-speed, Ultra-reliable network
- Configurable topology: switches, NICs, and cables
Courtesy of C. Ulmer
50Myrinet Switches Links
- 16-port crossbar chip
- 2.0+2.0 Gbps per port
- 300 ns latency
- Line card
- 8 Network ports
- 8 Backplane ports
- Backplane cabinet
- 17 line card slots
- 128 Hosts
Courtesy of C. Ulmer
51Myrinet NI Architecture
- Custom RISC CPU
- 33-200MHz
- Big endian
- gcc is available
- SRAM
- 1-9MB
- No CPU cache
- DMA Engines
- PCI / SRAM
- SRAM / Tx
- Rx / SRAM
[Figure: Myrinet NIC block diagram: LANai processor (RISC CPU + SRAM) with a host DMA engine on the PCI side and SAN DMA engines on the Tx/Rx side]
Courtesy of C. Ulmer
52Message Layers
Courtesy of C. Ulmer
53Message Layer Communication Software
- Message layers are the enabling technology for clusters
- Enable a cluster to function as a single-image multiprocessor system
- Responsible for transferring messages between resources
- Hide hardware details from end users
Courtesy of C. Ulmer
54Message Layer Design Issues
- Performance is critical
- Competing with SMPs, where overhead is <1 µs
- Use every trick to get performance
- Single cluster user -- remove device sharing overhead
- Little protection -- co-operative environment
- Reliable hardware -- optimize for the common case of few errors
- Smart hardware -- offload host communication
- Arch hacks -- x86 is a turkey; use MMX, SSE, WC..
Courtesy of C. Ulmer
55Message Layer Organization
[Figure: message layer organization: user-space application → user-space message layer library → kernel NI device driver → NI firmware]
Courtesy of C. Ulmer
56End Users Perspective
[Figure: a message sent from Processor A to Processor B]
Courtesy of C. Ulmer
57End-to-End Communication Path
- Three phases of data transfer
- Injection
- Network
- Ejection
[Figure: source and destination hosts (CPU, memory, NI) connected by the SAN, showing the three phases: injection, network transmission, ejection]
Courtesy of C. Ulmer
58Injecting Data
Courtesy of C. Ulmer
59Injecting Data into the NI
[Figure: send(dest, data, size) fragments the message into the NIC's outgoing message queue (a header and data per fragment), which the Tx engine drains over the PCI bus]
Courtesy of C. Ulmer
60Host-NI Data Injections
- Host-NI transfers are challenging
- Host lacks a DMA engine
- Multiple transfer methods
- Programmed I/O
- DMA
- What about virtual/physical addresses?
[Figure: host CPU, cache, and main memory behind the memory controller, connected over the PCI bus to the network interface (PCI DMA engine and NI memory)]
Courtesy of C. Ulmer
61Virtual and Physical Addresses
- Virtual address space
- Application's view
- Contiguous
- Physical address space
- Manage physical memory
- Paged, non-contiguous
- PCI devices part of PA
- PCI devices only use PAs
- Viewing PCI device memory
- Memory map
[Figure: a user-space application's virtual address space mapped onto paged physical host memory and PCI device memory]
Courtesy of C. Ulmer
62Addresses and Injections
- Programmed I/O (user-space)
- Translation is handled automatically by the host CPU
- Example: memcpy( ni_mem, source, size ); a sketch follows below
- Can be enhanced by use of MMX, SSE registers
- DMA (kernel space)
- One-copy
- Copy data into pinned, contiguous block
- DMA out of block
- Zero-copy
- Transfer data right out of VA pages
- Translate address and pin each page
Courtesy of C. Ulmer
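A sketch of the programmed-I/O path described above: the NI's SRAM is memory-mapped into the process once, and the host CPU copies each fragment directly into the outgoing queue. The device path, mmap offset, sizes, and queue layout here are hypothetical, not the actual Myrinet software interface.

    /* Sketch of programmed-I/O injection: map NI memory into user space once,
     * then inject by copying straight into the NI's outgoing queue.
     * Device path, offsets, and queue layout are hypothetical. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NI_MEM_SIZE (1 << 20)   /* assumed 1 MB of mappable NI SRAM */

    static volatile unsigned char *ni_mem;

    static int ni_map(const char *devpath) {
        int fd = open(devpath, O_RDWR);
        if (fd < 0) return -1;
        ni_mem = mmap(NULL, NI_MEM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
        close(fd);
        return ni_mem == MAP_FAILED ? -1 : 0;
    }

    /* Programmed I/O: the host CPU performs the copy; address translation is
     * automatic because the copy runs through the process's own mappings. */
    static void pio_inject(size_t queue_offset, const void *src, size_t len) {
        memcpy((void *)(ni_mem + queue_offset), src, len);
        /* a real message layer would then update a doorbell/producer pointer */
    }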
63TPIL Performance LANai 9 NI with Pentium
III-550 MHz Host
[Figure: injection bandwidth (MBytes/s) vs. injection size (bytes)]
Courtesy of C. Ulmer
64Network Delivery (NI-NI)
Courtesy of C. Ulmer
65Network Delivery (NI-NI)
- Reliably transfer messages between pairs of NIs
- Each NI basically has two threads: Send and Receive
- Reliability
- SANs are usually error free
- Worried about buffer overflows in NI cards
- Two approaches to flow control: host-level, NI-level
[Figure: sending and receiving network interfaces connected through the SAN]
Courtesy of C. Ulmer
66Host-managed Flow Control
- Reliability managed by the host
- Host-level credit system
- NI just transfers messages between host and wire
- Good points
- Easier to implement
- Host CPU faster than NI
- Bad points
- Poor NI buffer utilization
- Retransmission overhead
Courtesy of C. Ulmer
67NI-Managed Flow Control
- NI manages reliable transmission of messages
- NIs use control messages (ACK/NACK); a sketch follows below
- Good points
- Better dynamic buffer use
- Offloads host CPU
- Bad points
- Harder to implement
- Added overhead for NI
[Figure: DATA fragments flow from the sending endpoint's NI across the SAN; the receiving NI returns ACKs]
Courtesy of C. Ulmer
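A sketch of the NI-managed approach: the sending NI keeps each fragment buffered until the receiving NI acknowledges it, and retransmits on a NACK (for example, when the receiver's buffers are full). The state machine, window size, and names are illustrative, not the firmware of any particular NIC.

    /* Sketch of NI-managed reliability: the sending NI retires a buffered
     * fragment only when the receiving NI ACKs it; a NACK leaves it queued
     * for retransmission. Illustrative names only. */
    #include <stdbool.h>
    #include <stdint.h>

    #define WINDOW 16                      /* assumed outstanding-fragment limit */

    typedef struct {
        uint32_t seq;                      /* sequence number of this fragment */
        bool     in_flight;                /* sent but not yet acknowledged */
    } tx_slot_t;

    typedef struct {
        tx_slot_t slot[WINDOW];
        uint32_t  next_seq;
    } ni_tx_state_t;

    /* Receiving NI: accept in-order fragments while buffer space remains;
     * otherwise the caller sends a NACK. */
    static bool rx_accept(uint32_t expected_seq, uint32_t got_seq, int free_bufs) {
        return got_seq == expected_seq && free_bufs > 0;
    }

    /* Sending NI: handle a control message (ACK or NACK) from the peer NI. */
    static void tx_on_control(ni_tx_state_t *tx, uint32_t seq, bool acked) {
        tx_slot_t *s = &tx->slot[seq % WINDOW];
        if (!s->in_flight || s->seq != seq)
            return;                        /* stale or duplicate control message */
        if (acked)
            s->in_flight = false;          /* slot can be reused for a new fragment */
        /* on NACK the fragment stays buffered and is rescheduled for sending */
    }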
68Ejection (NI-Host)
Courtesy of C. Ulmer
69Message Ejection (NI-Host)
- Move message to host
- Store close to host CPU
- Incoming message queue
- Pinned, contiguous memory
- NI can write directly
- Host extracts messages
- Reassemble fragments
- How does the host see new messages?
[Figure: the NI writes incoming messages directly into the incoming message queue in host memory]
Courtesy of C. Ulmer
70Notification Polling
- Applications explicitly call extract() (sketched below)
- Call examines the queue's front/back pointers
- Processes a message if available
- Good points
- Good performance
- Can tuck away in a thread
- User has more control
- Bad points
- Waste time if no messages
- Queue can back up
- Code can be messy
Courtesy of C. Ulmer
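A sketch of the polling style described above: the host compares the incoming queue's producer and consumer indices and handles a message only if one is present. The queue layout, sizes, and names are assumptions.

    /* Sketch of polling-based notification: extract() compares the queue's
     * front/back indices and handles at most one message per call.
     * Queue layout and names are illustrative. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define RXQ_SLOTS 256

    typedef struct {
        volatile uint32_t back;                   /* producer index, advanced by the NI */
        uint32_t          front;                  /* consumer index, advanced by the host */
        unsigned char     slot[RXQ_SLOTS][2048];  /* pinned, NI-writable memory */
    } rx_queue_t;

    typedef void (*msg_handler_t)(const void *msg, size_t len);

    /* Returns true if a message was found and processed. */
    static bool extract(rx_queue_t *q, msg_handler_t handle) {
        if (q->front == q->back)
            return false;                         /* nothing arrived: a wasted poll */
        /* a real queue entry would carry the actual message length */
        handle(q->slot[q->front % RXQ_SLOTS], sizeof q->slot[0]);
        q->front++;                               /* hand the slot back to the NI */
        return true;
    }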
71Notification Interrupts
- NI invokes an interrupt after putting the message in the queue
- Host stops whatever it was doing
- Device driver's interrupt service routine (ISR) catches it
- ISR uses the UNIX signal infrastructure to pass it to the application
- Application catches the signal, executes extract()
- Good points
- No wasted polling time
- Bad points
- High overhead
- Interrupts ~10 µs
- Constantly.. interrupted
Courtesy of C. Ulmer
72Other APIs Remote Memory Ops
- Often just passing data
- Don't disturb the receiving application
- Remote memory operations
- Fetch/store remote memory
- NI executes the transfer directly (no need for notification)
- Virtual addresses are translated by the NI (and cached); a sketch follows below
Courtesy of C. Ulmer
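A sketch of what a remote store might look like when the NI executes the transfer directly: the host posts a descriptor naming the remote node, remote virtual address, and payload, and the remote NI translates the address and writes memory with no receiver-side notification. All structures and names here are hypothetical.

    /* Sketch of a remote memory (put) operation: the remote NI translates the
     * virtual address (with a cached, TLB-like lookup) and writes host memory
     * directly, without notifying the receiving application. Hypothetical. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint16_t dest_node;      /* remote host */
        uint64_t remote_vaddr;   /* virtual address in the target process */
        uint32_t length;         /* bytes to store */
        /* payload follows the descriptor in NI memory */
    } remote_put_desc_t;

    /* Remote NI side: translate the virtual address and perform the write. */
    static void ni_handle_put(const remote_put_desc_t *d, const void *payload,
                              void *(*translate)(uint64_t vaddr)) {
        void *host_mem = translate(d->remote_vaddr);  /* cached translation */
        if (host_mem != NULL)
            memcpy(host_mem, payload, d->length);     /* write into host memory */
    }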
73The Message Path
[Figure: the message path through each host's CPU, OS, memory, PCI bus, and NI, with the network in between]
- Wire bandwidth is not the bottleneck!
- Operating system and/or user-level software limits performance
74Universal Performance Metrics
[Figure: timeline of a message transfer: sender overhead (processor busy), transmission time (size / bandwidth), time of flight, transport latency, receiver overhead (processor busy), total latency]
Total Latency = Sender Overhead + Time of Flight + Message Size / BW + Receiver Overhead
- Includes header/trailer in the BW calculation?
75Simplified Latency Model
- Total Latency = Overhead + Message Size / BW
- Overhead = Sender Overhead + Time of Flight + Receiver Overhead
- Can relate overhead to network bandwidth utilization (see the sketch below)
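Plugging numbers into this model, a short sketch that reports total latency and delivered (effective) bandwidth as message size grows; the overhead and link bandwidth constants are placeholders, roughly in the SAN range quoted earlier.

    /* Sketch of the simplified model: Total Latency = Overhead + Size / BW,
     * with effective bandwidth = Size / Total Latency. Constants are placeholders. */
    #include <stdio.h>

    int main(void) {
        double overhead_us = 10.0;        /* sender + flight + receiver overhead */
        double bw_bytes_per_us = 250.0;   /* ~2 Gbps raw link bandwidth */

        for (int size = 64; size <= 1 << 20; size *= 4) {
            double latency_us = overhead_us + size / bw_bytes_per_us;
            double eff_bw = size / latency_us;    /* delivered bytes per microsecond */
            printf("%8d B: latency %9.1f us, effective BW %6.1f MB/s\n",
                   size, latency_us, eff_bw);
        }
        return 0;
    }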
76Commercial Example
77Scalable Switching Fabrics for Internet Routers
- Internet bandwidth growth → routers with
- large numbers of ports
- high bisection bandwidth
- Historically these solutions have used
- Backplanes
- Crossbar switches
- White paper: "Scalable Switching Fabrics for Internet Routers," by W. J. Dally, http://www.avici.com/technology/whitepapers/
78Requirements
- Scalable
- Incremental
- Economical → cost linear in the number of nodes
- Robust
- Fault tolerant → path diversity and reconfiguration
- Non-blocking features
- Performance
- High bisection bandwidth
- Quality of Service (QoS)
- Bounded delay
79Switching Fabric
- Three components
- Topology → 3D torus
- Routing → source routing with randomization
- Flow control → virtual channels and virtual networks
- Maximum configuration: 14 x 8 x 5 = 560
- Channel speed is 10 Gbps
80Packaging
- Uniformly short wires between adjacent nodes
- Can be built in passive backplanes
- Run at high speed
- Bandwidth inversely proportional to the square of wire length
- Cabling costs
- Power costs
Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
81Available Bandwidth
- Distinguish between capacity and I/O bandwidth
- Capacity: traffic that will load a link to 100%
- I/O bandwidth: bit rate in or out
- Discontinuities
Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
82Properties
- Path diversity
- Avoids tree saturation
- Edge-disjoint paths for fault tolerance
- Heartbeat checks (100 microseconds); packets are deflected while tables are updated
Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
83Properties
Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
84Use of Virtual Channels
- Virtual channels aggregated into virtual networks
- Two networks for each output port
- Distinct networks prevent undesirable coupling
- Only bandwidth on a link is shared
- Fair arbitration mechanisms
- Distinct networks enable QoS constraints to be met
- Separate best-effort and constant-bit-rate traffic
85Summary
- Distinguish between traditional networking and high-performance multiprocessor communication
- Hierarchy of implementations
- Physical, switching, and routing
- Protocol families and protocol layers (the protocol stack)
- Datapath and architecture of the switches
- Metrics
- Bisection bandwidth
- Reliability
- Traditional latency and bandwidth
86Study Guide
- Given a topology and relevant characteristics such as channel widths and link bandwidths, compute the bisection bandwidth
- Distinguish between switching mechanisms based on how channel buffers are reserved/used during message transmission
- Latency expressions for different switching mechanisms
- Compute the network bisection bandwidth when the software overheads of message transmission are included
- Identify the major delay elements in the message transmission path, starting at the send() call and ending with the receive() call
- How do costs scale in different topologies?
- Latency scaling
- Unit of upgrade → cost of upgrade