Title: Interconnection Networks
1Interconnection Networks
2Overview
- Physical Layer and Message Switching
- Network Topologies
- Metrics
- Deadlock and Livelock
- Routing Layer
- The Messaging Layer
3Interconnection Networks
- Fabric for scalable, multiprocessor architectures
- Distinct from traditional networking architectures such as Internet Protocol (IP)-based systems
4Resource View of Parallel Architectures
[Figure: pools of processor (P) and memory (M) resources interconnected by a network]
- How do we present these resources?
- What are the costs of different interconnection networks?
- What are the design considerations?
- What are the applications?
5Example Clusters Google Hardware Infrastructure
- VME rack: 19 in. wide, 6 feet tall, 30 inches deep
- Per side: 40 1-Rack-Unit (RU) PCs + 1 HP Ethernet switch (4 RU); each blade can contain 8 100-Mbit/s EN or a single 1-Gbit Ethernet interface
- Front + back → 80 PCs + 2 EN switches/rack
- Each rack connects to 2 128 x 1-Gbit/s EN switches
- Dec 2000: 40 racks at most recent site
- 6000 PCs, 12,000 disks: almost 1 petabyte!
- PC operates at about 55 Watts
- Rack → about 4500 Watts, 60 amps
6Reliability
- For 6000 PCs, 12,000 disks, 200 EN switches
- 20 PCs will need to be rebooted/day
- 2 PCs/day hardware failure, or 2-3% / year
- 5% due to problems with motherboard, power supply, and connectors
- 30% DRAM: bits change + errors in transmission (100 MHz)
- 30% disks fail
- 30% disks go very slow (10^-3 of expected BW)
- 200 EN switches, 2-3 fail in 2 years
- 6 Foundry switches: none failed, but 2-3 of the 96 switch blades have failed (16 blades/switch)
- Collocation site reliability
- 1 power failure, 1 network outage per year per site
7The Practical Problem
From Ambuj Goyal, "Computer Science Grand Challenge: Simplicity of Design," Computing Research Association Conference on "Grand Research Challenges" in Computer Science and Engineering, June 2002
8Example Embedded Devices
picoChip: http://www.picochip.com/
- Issues
- Execution performance
- Power dissipation
- Number of chip types
- Size and form factor
PACT XPP Technologies: http://www.pactcorp.com/
9Physical Layer and Message Switching
10Messaging Hierarchy
- Routing Layer: Where? Destination decisions, i.e., which output port
- Switching Layer: When? When is data forwarded
- Physical Layer: How? Synchronization of data transfer
- This organization is distinct from traditional networking implementations
- Emphasis is on low latency communication
- Only recently have standards been evolving
- InfiniBand: http://www.infinibandta.org/home
11The Physical Layer
[Figure: a message's data is divided into packets, each carrying a header, data, and checksum]
- Flit: flow control digit
- Phit: physical flow control digit
- Data is transmitted based on a hierarchical data structuring mechanism
- Messages → packets → flits → phits
- While flits and phits are fixed size, packets and data may be variable sized
12Flow Control
- Flow control: synchronized transfer of a unit of information
- Based on buffer management
- Asynchronous vs. synchronous flow control
- Flow control occurs at multiple levels
- message flow control
- physical flow control
- Mechanisms
- Credit-based flow control (sketched below)
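As a concrete illustration of the credit-based mechanism, here is a minimal sketch in C, assuming a fixed downstream buffer of MAX_CREDITS flit slots; all names are illustrative and not taken from any particular router implementation.

    /* Minimal sketch of credit-based flow control (illustrative names only).
     * The sender holds one credit per free buffer slot at the downstream
     * receiver; it may transmit a flit only while credits remain, and each
     * credit returned by the receiver reports one freed slot. */
    #include <stdbool.h>

    #define MAX_CREDITS 8           /* assumed downstream buffer depth (flits) */

    typedef struct {
        int credits;                /* free slots believed available downstream */
    } credit_link_t;

    static void link_init(credit_link_t *l) { l->credits = MAX_CREDITS; }

    /* Called by the sender: may a flit be forwarded right now? */
    static bool can_send_flit(const credit_link_t *l) { return l->credits > 0; }

    static void send_flit(credit_link_t *l) {
        l->credits--;               /* consume a credit for the slot the flit occupies */
    }

    /* Called when a credit arrives from the receiver (a slot was drained). */
    static void credit_returned(credit_link_t *l) {
        if (l->credits < MAX_CREDITS)
            l->credits++;
    }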
13Switching Layer
- Comprised of three sets of techniques
- switching techniques
- flow control
- buffer management
- Organization and operation of routers are largely determined by the switching layer
- Connection-oriented vs. connectionless communication
14Generic Router Architecture
[Figure: generic router architecture, annotated with wire delay, switching delay, and routing delay]
15Virtual Channels
- Each virtual channel is a pair of unidirectional channels
- Independently managed buffers multiplexed over the physical channel
- De-couples buffers from physical channels
- Originally introduced to break cyclic dependencies
- Improves performance through reduction of blocking delay
- Virtual lanes vs. virtual channels
- As the number of virtual channels increases, the increased channel multiplexing has two effects
- decrease in header delay
- increase in average data flit delay
- Impact on router performance
- switch complexity
16Circuit Switching
- Hardware path setup by a routing header or probe
- End-to-end acknowledgment initiates transfer at full hardware bandwidth
- Source routing vs. distributed routing
- System is limited by the signaling rate along the circuits → wave pipelining
17Packet Switching
- Blocking delays in circuit switching are avoided in packet-switched networks → full link utilization in the presence of data
- Increased storage requirements at the nodes
- Packetization and in-order delivery requirements
- Buffering
- use of local processor memory
- central queues
18Virtual Cut-Through
- Messages cut through to the next router when feasible
- In the absence of blocking, messages are pipelined
- pipeline cycle time is the larger of intra-router and inter-router flow control delays
- When the header is blocked, the complete message is buffered
- High-load behavior approaches that of packet switching
19Wormhole Switching
- Messages are pipelined, but buffer space is on the order of a few flits
- Small buffers + message pipelining → small, compact buffers
- Supports variable-sized messages
- Messages cannot be interleaved over a channel: routing information is only associated with the header
- Base latency is equivalent to that of virtual cut-through
20Comparison of Switching Techniques
- Packet switching and virtual cut-through (compare the no-load latency expressions below)
- consume network bandwidth proportional to network load
- predictable demands
- VCT behaves like wormhole at low loads and like packet switching at high loads
- link-level error control for packet switching
- Wormhole switching
- provides low latency
- lower saturation point
- higher variance of message latency than packet or VCT switching
- Virtual channels
- blocking delay vs. data delay
- router flow control latency
- Optimistic vs. conservative flow control
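To make the comparison concrete, here is an approximate set of no-load latency expressions in the style of the standard textbook models (e.g., Duato, Yalamanchili, and Ni); the symbols are assumptions introduced here: D hops, per-hop routing delay t_r, switch and wire delays t_s and t_w, message length L, channel (phit) width W, and header length L_h. Contention is ignored.

    % Approximate no-load latency models (assumed symbols; contention ignored)
    t_{circuit}  \approx D\,[\,t_r + 2(t_s + t_w)\,] + \lceil L/W \rceil\, t_w
                 % path setup round trip, then data at full bandwidth
    t_{packet}   \approx D\,[\,t_r + (t_s + t_w)\,\lceil (L + L_h)/W \rceil\,]
                 % store-and-forward: the whole packet crosses every hop
    t_{vct} \approx t_{wormhole} \approx D\,(t_r + t_s + t_w) + \max(t_s, t_w)\,\lceil L/W \rceil
                 % header routed hop by hop, payload pipelined behind it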
21Saturation
22Network Topologies
23Direct Networks
- Generally fixed degree
- Modular
- Topologies
- Meshes
- Multidimensional tori
- Special case of tori: the binary hypercube
24Indirect Networks
- indirect networks
- uniform base latency
- centralized or distributed control
- Engineering approximations to direct networks
[Figures: a Multistage Network and a Fat Tree Network; in the fat tree, bandwidth increases as you go up the tree]
25Generalized MINs
- Columns of k x k switches and connections between switches
- All switches are identical
- Directionality and control
- May concentrate, expand, or just connect
26Specific MINs
- Switch sizes and interstage interconnect establish distinct MINs
- The majority of interesting MINs have been shown to be topologically equivalent
27Metrics
28Evaluation Metrics
[Figure: a bisection cut of the network]
- Bisection bandwidth (a worked calculation follows below)
- This is the minimum bandwidth across any bisection of the network
- Bisection bandwidth is a limiting attribute of performance
- Latency
- Message transit time
- Node degree
- These are related to pin/wiring constraints
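As a worked example of the first metric, a small C sketch that computes the bisection width and bisection bandwidth of k-ary n-cube networks under the usual counting assumptions (a torus cut in half across one dimension severs 2*k^(n-1) channels because of the wraparound links; a mesh severs k^(n-1)). The channel bandwidth value is a placeholder.

    /* Sketch: bisection bandwidth of k-ary n-cube networks (assumed formulas). */
    #include <stdio.h>

    static long ipow(long base, int exp) {
        long r = 1;
        while (exp-- > 0) r *= base;
        return r;
    }

    /* channels crossing the bisection */
    static long bisection_width_torus(int k, int n) { return 2 * ipow(k, n - 1); }
    static long bisection_width_mesh (int k, int n) { return     ipow(k, n - 1); }

    int main(void) {
        double channel_gbps = 10.0;    /* placeholder per-channel bandwidth */
        /* 32-ary 2-cube vs. 10-ary 3-cube, as compared on a later slide */
        printf("32-ary 2-cube torus: %ld channels, %.0f Gbps bisection\n",
               bisection_width_torus(32, 2),
               bisection_width_torus(32, 2) * channel_gbps);
        printf("10-ary 3-cube torus: %ld channels, %.0f Gbps bisection\n",
               bisection_width_torus(10, 3),
               bisection_width_torus(10, 3) * channel_gbps);
        return 0;
    }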
29Constant Resource Analysis Bisection Width
30Constant Resource Analysis Pin out
31Latency Under Contention
32-ary 2-cube vs. 10-ary 3-cube
32Deadlock and Livelock
33Deadlock and Livelock
- Deadlock freedom can be ensured by enforcing constraints
- For example, following dimension-order routing in 2D meshes (sketched below)
- Similar constraints apply to other topologies
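A minimal sketch of dimension-order (XY) routing for a 2D mesh, the example constraint named above; the enum and function names are illustrative only.

    /* Sketch of dimension-order (XY) routing on a 2D mesh: always route to
     * completion in X before routing in Y, which removes the cyclic channel
     * dependencies that cause deadlock. Names are illustrative. */
    typedef enum { PORT_LOCAL, PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS } port_t;

    static port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
        if (dst_x > cur_x) return PORT_XPLUS;   /* finish the X dimension first */
        if (dst_x < cur_x) return PORT_XMINUS;
        if (dst_y > cur_y) return PORT_YPLUS;   /* then route in Y */
        if (dst_y < cur_y) return PORT_YMINUS;
        return PORT_LOCAL;                      /* arrived: eject to the local port */
    }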
34Occurrence of Deadlock
- Deadlock is caused by dependencies between buffers
35Deadlock in a Ring Network
36Deadlock Avoidance Principle
- Deadlock is caused by dependencies between buffers
37Routing Constraints on Virtual Channels
- Add multiple virtual channels to each physical channel
- Place routing restrictions between virtual channels
38Break Cycles
39Channel Dependence Graph
40Routing Layer
41Routing Protocols
42Key Routing Categories
- Deterministic
- The path is fixed by the source-destination pair
- Source routing
- Path is looked up prior to message injection (see the sketch after this list)
- May differ each time the network and NIs are initialized
- Adaptive routing
- Path is determined by run-time network conditions
- Unicast
- Single source to single destination
- Multicast
- Single source to multiple destinations
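For the source-routed case, a sketch of how a precomputed path can be carried in the header and consumed one hop at a time (the style Myrinet uses, discussed later); the structure, sizes, and field names are assumptions.

    /* Sketch of source routing: the sender looks up the whole path and places
     * it in the header as a list of output ports; each switch consumes the
     * next entry and forwards the rest. Names and sizes are illustrative. */
    #define MAX_HOPS 16

    typedef struct {
        unsigned char hops_left;           /* entries remaining in port_list */
        unsigned char port_list[MAX_HOPS]; /* output port to take at each switch */
    } src_route_hdr_t;

    /* Executed at each switch: returns the output port for this hop. */
    static int next_hop(src_route_hdr_t *hdr) {
        if (hdr->hops_left == 0)
            return -1;                     /* path exhausted: deliver locally */
        int port = hdr->port_list[0];
        /* shift the remaining path forward so the next switch sees its entry */
        for (int i = 1; i < hdr->hops_left; i++)
            hdr->port_list[i - 1] = hdr->port_list[i];
        hdr->hops_left--;
        return port;
    }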
43Software Layer
44The Message Layer
- Message layer background
- Cluster computers
- Myrinet SAN
- Design properties
- End-to-End communication path
- Injection
- Network transmission
- Ejection
- Overall performance
45Cluster Computers
- Cost-effective alternative to supercomputers
- Number of commodity workstations
- Specialized network hardware and software
- Result: large pool of host processors
Courtesy of C. Ulmer
46For Example..
Courtesy of C. Ulmer
47For Example..
Courtesy of C. Ulmer
48Cluster Networks
- Beowulf clusters
- Use Ethernet TCP/IP
- Cheap, but poor Host-to-Host performance
- Latencies 70-100 µs
- Bandwidths 80-800 Mbps
- System Area Network (SAN) clusters
- Custom hardware/software
- Examples: Myrinet, SCI, InfiniBand, QsNet
- Expensive, but good Host-to-Host performance
- Latencies as low as 3 µs
- Bandwidths up to 3 Gbps
Courtesy of C. Ulmer
49Myrinet
- Descendant of Caltech Mosaic project
- Wormhole network
- Source routing
- High-speed, Ultra-reliable network
- Configurable topology: switches, NICs, and cables
Courtesy of C. Ulmer
50Myrinet Switches Links
- 16-port crossbar chip
- 2.0+2.0 Gbps per port
- 300 ns latency
- Line card
- 8 Network ports
- 8 Backplane ports
- Backplane cabinet
- 17 line card slots
- 128 Hosts
Courtesy of C. Ulmer
51Myrinet NI Architecture
- Custom RISC CPU
- 33-200MHz
- Big endian
- gcc is available
- SRAM
- 1-9MB
- No CPU cache
- DMA Engines
- PCI / SRAM
- SRAM / Tx
- Rx / SRAM
[Figure: Myrinet NIC block diagram: LANai processor (RISC CPU + SRAM) with a host DMA engine on the PCI side and SAN DMA engines on the Tx/Rx side]
Courtesy of C. Ulmer
52Message Layers
Courtesy of C. Ulmer
53Message Layer Communication Software
- Message layers are the enabling technology for clusters
- Enable a cluster to function as a single-image multiprocessor system
- Responsible for transferring messages between resources
- Hide hardware details from end users
Courtesy of C. Ulmer
54Message Layer Design Issues
- Performance is critical
- Competing with SMPs, where overhead is <1 µs
- Use every trick to get performance
- Single cluster user -- remove device sharing overhead
- Little protection -- co-operative environment
- Reliable hardware -- optimize for the common case of few errors
- Smart hardware -- offload host communication
- Arch hacks -- x86 is a turkey; use MMX, SSE, WC..
Courtesy of C. Ulmer
55Message Layer Organization
[Figure: message layer organization: user-space application → user-space message layer library → kernel NI device driver → NI firmware]
Courtesy of C. Ulmer
56End Users Perspective
[Figure: a message sent from Processor A to Processor B]
Courtesy of C. Ulmer
57End-to-End Communication Path
- Three phases of data transfer
- Injection
- Network
- Ejection
[Figure: source and destination hosts (CPU, memory, NI) connected by the SAN, showing the three phases: injection, network transmission, ejection]
Courtesy of C. Ulmer
58Injecting Data
Courtesy of C. Ulmer
59Injecting Data into the NI
[Figure: send(dest, data, size) fragments the message into the NIC's outgoing message queue (a header and data per fragment), which the Tx engine drains over the PCI bus]
Courtesy of C. Ulmer
60Host-NI Data Injections
- Host-NI transfers are challenging
- Host lacks a DMA engine
- Multiple transfer methods
- Programmed I/O
- DMA
- What about virtual/physical addresses?
[Figure: host CPU, cache, and main memory behind the memory controller, connected over the PCI bus to the network interface (PCI DMA engine and NI memory)]
Courtesy of C. Ulmer
61Virtual and Physical Addresses
- Virtual address space
- Application's view
- Contiguous
- Physical address space
- Manage physical memory
- Paged, non-contiguous
- PCI devices part of PA
- PCI devices only use PAs
- Viewing PCI device memory
- Memory map
[Figure: a user-space application's virtual address space mapped onto paged physical host memory and PCI device memory]
Courtesy of C. Ulmer
62Addresses and Injections
- Programmed I/O (user-space)
- Translation is handled automatically by the host CPU
- Example: memcpy( ni_mem, source, size ); a sketch follows below
- Can be enhanced by use of MMX, SSE registers
- DMA (kernel space)
- One-copy
- Copy data into pinned, contiguous block
- DMA out of block
- Zero-copy
- Transfer data right out of VA pages
- Translate address and pin each page
Courtesy of C. Ulmer
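A sketch of the programmed-I/O path described above: the NI's SRAM is memory-mapped into the process once, and the host CPU copies each fragment directly into the outgoing queue. The device path, mmap offset, sizes, and queue layout here are hypothetical, not the actual Myrinet software interface.

    /* Sketch of programmed-I/O injection: map NI memory into user space once,
     * then inject by copying straight into the NI's outgoing queue.
     * Device path, offsets, and queue layout are hypothetical. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NI_MEM_SIZE (1 << 20)   /* assumed 1 MB of mappable NI SRAM */

    static volatile unsigned char *ni_mem;

    static int ni_map(const char *devpath) {
        int fd = open(devpath, O_RDWR);
        if (fd < 0) return -1;
        ni_mem = mmap(NULL, NI_MEM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
        close(fd);
        return ni_mem == MAP_FAILED ? -1 : 0;
    }

    /* Programmed I/O: the host CPU performs the copy; address translation is
     * automatic because the copy runs through the process's own mappings. */
    static void pio_inject(size_t queue_offset, const void *src, size_t len) {
        memcpy((void *)(ni_mem + queue_offset), src, len);
        /* a real message layer would then update a doorbell/producer pointer */
    }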
63TPIL Performance LANai 9 NI with Pentium
III-550 MHz Host
[Figure: injection bandwidth (MBytes/s) vs. injection size (bytes)]
Courtesy of C. Ulmer
64Network Delivery (NI-NI)
Courtesy of C. Ulmer
65Network Delivery (NI-NI)
- Reliably transfer messages between pairs of NIs
- Each NI basically has two threads: Send and Receive
- Reliability
- SANs are usually error free
- Worried about buffer overflows in NI cards
- Two approaches to flow control: host-level, NI-level
[Figure: sending and receiving network interfaces connected through the SAN]
Courtesy of C. Ulmer
66Host-managed Flow Control
- Reliability managed by the host
- Host-level credit system
- NI just transfers messages between host and wire
- Good points
- Easier to implement
- Host CPU faster than NI
- Bad points
- Poor NI buffer utilization
- Retransmission overhead
Courtesy of C. Ulmer
67NI-Managed Flow Control
- NI manages reliable transmission of messages
- NIs use control messages (ACK/NACK); a sketch follows below
- Good points
- Better dynamic buffer use
- Offloads host CPU
- Bad points
- Harder to implement
- Added overhead for NI
[Figure: DATA fragments flow from the sending endpoint's NI across the SAN; the receiving NI returns ACKs]
Courtesy of C. Ulmer
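A sketch of the NI-managed approach: the sending NI keeps each fragment buffered until the receiving NI acknowledges it, and retransmits on a NACK (for example, when the receiver's buffers are full). The state machine, window size, and names are illustrative, not the firmware of any particular NIC.

    /* Sketch of NI-managed reliability: the sending NI retires a buffered
     * fragment only when the receiving NI ACKs it; a NACK leaves it queued
     * for retransmission. Illustrative names only. */
    #include <stdbool.h>
    #include <stdint.h>

    #define WINDOW 16                      /* assumed outstanding-fragment limit */

    typedef struct {
        uint32_t seq;                      /* sequence number of this fragment */
        bool     in_flight;                /* sent but not yet acknowledged */
    } tx_slot_t;

    typedef struct {
        tx_slot_t slot[WINDOW];
        uint32_t  next_seq;
    } ni_tx_state_t;

    /* Receiving NI: accept in-order fragments while buffer space remains;
     * otherwise the caller sends a NACK. */
    static bool rx_accept(uint32_t expected_seq, uint32_t got_seq, int free_bufs) {
        return got_seq == expected_seq && free_bufs > 0;
    }

    /* Sending NI: handle a control message (ACK or NACK) from the peer NI. */
    static void tx_on_control(ni_tx_state_t *tx, uint32_t seq, bool acked) {
        tx_slot_t *s = &tx->slot[seq % WINDOW];
        if (!s->in_flight || s->seq != seq)
            return;                        /* stale or duplicate control message */
        if (acked)
            s->in_flight = false;          /* slot can be reused for a new fragment */
        /* on NACK the fragment stays buffered and is rescheduled for sending */
    }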
68Ejection (NI-Host)
Courtesy of C. Ulmer
69Message Ejection (NI-Host)
- Move message to host
- Store close to host CPU
- Incoming message queue
- Pinned, contiguous memory
- NI can write directly
- Host extracts messages
- Reassemble fragments
- How does the host see new messages?
[Figure: the NI writes incoming messages directly into the incoming message queue in host memory]
Courtesy of C. Ulmer
70Notification Polling
- Applications explicitly call extract() (sketched below)
- Call examines the queue's front/back pointers
- Processes a message if available
- Good points
- Good performance
- Can tuck away in a thread
- User has more control
- Bad points
- Waste time if no messages
- Queue can back up
- Code can be messy
Courtesy of C. Ulmer
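A sketch of the polling style described above: the host compares the incoming queue's producer and consumer indices and handles a message only if one is present. The queue layout, sizes, and names are assumptions.

    /* Sketch of polling-based notification: extract() compares the queue's
     * front/back indices and handles at most one message per call.
     * Queue layout and names are illustrative. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define RXQ_SLOTS 256

    typedef struct {
        volatile uint32_t back;                   /* producer index, advanced by the NI */
        uint32_t          front;                  /* consumer index, advanced by the host */
        unsigned char     slot[RXQ_SLOTS][2048];  /* pinned, NI-writable memory */
    } rx_queue_t;

    typedef void (*msg_handler_t)(const void *msg, size_t len);

    /* Returns true if a message was found and processed. */
    static bool extract(rx_queue_t *q, msg_handler_t handle) {
        if (q->front == q->back)
            return false;                         /* nothing arrived: a wasted poll */
        /* a real queue entry would carry the actual message length */
        handle(q->slot[q->front % RXQ_SLOTS], sizeof q->slot[0]);
        q->front++;                               /* hand the slot back to the NI */
        return true;
    }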
71Notification Interrupts
- NI invokes an interrupt after putting the message in the queue
- Host stops whatever it was doing
- Device driver's interrupt service routine (ISR) catches it
- ISR uses the UNIX signal infrastructure to pass it to the application
- Application catches the signal, executes extract()
- Good points
- No wasted polling time
- Bad points
- High overhead
- Interrupts ~10 µs
- Constantly.. interrupted
Courtesy of C. Ulmer
72Other APIs Remote Memory Ops
- Often just passing data
- Don't disturb the receiving application
- Remote memory operations
- Fetch/store remote memory
- NI executes the transfer directly (no need for notification)
- Virtual addresses are translated by the NI (and cached); a sketch follows below
Courtesy of C. Ulmer
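A sketch of what a remote store might look like when the NI executes the transfer directly: the host posts a descriptor naming the remote node, remote virtual address, and payload, and the remote NI translates the address and writes memory with no receiver-side notification. All structures and names here are hypothetical.

    /* Sketch of a remote memory (put) operation: the remote NI translates the
     * virtual address (with a cached, TLB-like lookup) and writes host memory
     * directly, without notifying the receiving application. Hypothetical. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint16_t dest_node;      /* remote host */
        uint64_t remote_vaddr;   /* virtual address in the target process */
        uint32_t length;         /* bytes to store */
        /* payload follows the descriptor in NI memory */
    } remote_put_desc_t;

    /* Remote NI side: translate the virtual address and perform the write. */
    static void ni_handle_put(const remote_put_desc_t *d, const void *payload,
                              void *(*translate)(uint64_t vaddr)) {
        void *host_mem = translate(d->remote_vaddr);  /* cached translation */
        if (host_mem != NULL)
            memcpy(host_mem, payload, d->length);     /* write into host memory */
    }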
73The Message Path
[Figure: the message path through each host's CPU, OS, memory, PCI bus, and NI, with the network in between]
- Wire bandwidth is not the bottleneck!
- Operating system and/or user-level software limits performance
74Universal Performance Metrics
[Figure: timeline of a message transfer: sender overhead (processor busy), transmission time (size / bandwidth), time of flight, transport latency, receiver overhead (processor busy), total latency]
Total Latency = Sender Overhead + Time of Flight + Message Size / BW + Receiver Overhead
- Includes header/trailer in the BW calculation?
75Simplified Latency Model
- Total Latency = Overhead + Message Size / BW
- Overhead = Sender Overhead + Time of Flight + Receiver Overhead
- Can relate overhead to network bandwidth utilization (see the sketch below)
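Plugging numbers into this model, a short sketch that reports total latency and delivered (effective) bandwidth as message size grows; the overhead and link bandwidth constants are placeholders, roughly in the SAN range quoted earlier.

    /* Sketch of the simplified model: Total Latency = Overhead + Size / BW,
     * with effective bandwidth = Size / Total Latency. Constants are placeholders. */
    #include <stdio.h>

    int main(void) {
        double overhead_us = 10.0;        /* sender + flight + receiver overhead */
        double bw_bytes_per_us = 250.0;   /* ~2 Gbps raw link bandwidth */

        for (int size = 64; size <= 1 << 20; size *= 4) {
            double latency_us = overhead_us + size / bw_bytes_per_us;
            double eff_bw = size / latency_us;    /* delivered bytes per microsecond */
            printf("%8d B: latency %9.1f us, effective BW %6.1f MB/s\n",
                   size, latency_us, eff_bw);
        }
        return 0;
    }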
76Commercial Example
77Scalable Switching Fabrics for Internet Routers
- Internet bandwidth growth → routers with
- large numbers of ports
- high bisection bandwidth
- Historically these solutions have used
- Backplanes
- Crossbar switches
- White paper: "Scalable Switching Fabrics for Internet Routers," by W. J. Dally, http://www.avici.com/technology/whitepapers/
78Requirements
- Scalable
- Incremental
- Economical → cost linear in the number of nodes
- Robust
- Fault tolerant → path diversity and reconfiguration
- Non-blocking features
- Performance
- High bisection bandwidth
- Quality of Service (QoS)
- Bounded delay
79Switching Fabric
- Three components
- Topology → 3D torus
- Routing → source routing with randomization
- Flow control → virtual channels and virtual networks
- Maximum configuration: 14 x 8 x 5 = 560
- Channel speed is 10 Gbps
80Packaging
- Uniformly short wires between adjacent nodes
- Can be built in passive backplanes
- Run at high speed
- Bandwidth inversely proportional to the square of wire length
- Cabling costs
- Power costs
Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
81Available Bandwidth
- Distinguish between capacity and I/O bandwidth
- Capacity: traffic that will load a link to 100%
- I/O bandwidth: bit rate in or out
- Discontinuities
Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
82Properties
- Path diversity
- Avoids tree saturation
- Edge-disjoint paths for fault tolerance
- Heartbeat checks (100 microseconds); packets are deflected while tables are updated
Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
83Properties
Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
84Use of Virtual Channels
- Virtual channels aggregated into virtual networks
- Two networks for each output port
- Distinct networks prevent undesirable coupling
- Only bandwidth on a link is shared
- Fair arbitration mechanisms
- Distinct networks enable QoS constraints to be met
- Separate best-effort and constant-bit-rate traffic
85Summary
- Distinguish between traditional networking and high-performance multiprocessor communication
- Hierarchy of implementations
- Physical, switching, and routing
- Protocol families and protocol layers (the protocol stack)
- Datapath and architecture of the switches
- Metrics
- Bisection bandwidth
- Reliability
- Traditional latency and bandwidth
86Study Guide
- Given a topology and relevant characteristics such as channel widths and link bandwidths, compute the bisection bandwidth
- Distinguish between switching mechanisms based on how channel buffers are reserved/used during message transmission
- Latency expressions for different switching mechanisms
- Compute the network bisection bandwidth when the software overheads of message transmission are included
- Identify the major delay elements in the message transmission path, starting at the send() call and ending with the receive() call
- How do costs scale in different topologies?
- Latency scaling
- Unit of upgrade → cost of upgrade