Transcript and Presenter's Notes

Title: Interconnection Networks


1
Interconnection Networks
2
Overview
  • Physical Layer and Message Switching
  • Network Topologies
  • Metrics
  • Deadlock and Livelock
  • Routing Layer
  • The Messaging Layer

3
Interconnection Networks
  • Fabric for scalable, multiprocessor architectures
  • Distinct from traditional networking
    architectures such as Internet Protocol (IP)
    based systems

4
Resource View of Parallel Architectures
[Figure: a parallel machine viewed as a pool of processors (P) and memory modules (M) connected by an interconnection network]
  • How do we present these resources?
  • What are the costs of different interconnection
    networks?
  • What are the design considerations?
  • What are the applications?

5
Example Clusters: Google Hardware Infrastructure
  • VME rack: 19 in. wide, 6 feet tall, 30 inches
    deep
  • Per side: 40 1-Rack-Unit (RU) PCs + 1 HP Ethernet
    switch (4 RU); each blade can contain 8
    100-Mbit/s EN or a single 1-Gbit Ethernet
    interface
  • Front + back → 80 PCs + 2 EN switches/rack
  • Each rack connects to 2 128-port 1-Gbit/s EN switches
  • Dec 2000: 40 racks at most recent site
  • 6000 PCs, 12000 disks: almost 1 petabyte!
  • PC operates at about 55 Watts
  • Rack → 4500 Watts, 60 amps

6
Reliability
  • For 6000 PCs, 12000 disks, 200 EN switches
  • 20 PCs will need to be rebooted/day
  • 2 PCs/day hardware failure, or 2-3% per year
  • 5% due to problems with motherboard, power
    supply, and connectors
  • 30% DRAM: bits change or errors in transmission
    (100 MHz)
  • 30% disks fail
  • 30% disks go very slow (10-30% of expected BW)
  • 200 EN switches, 2-3 fail in 2 years
  • 6 Foundry switches: none failed, but 2-3 of the 96
    switch blades have failed (16 blades/switch)
  • Collocation site reliability
  • 1 power failure, 1 network outage per year per site

7
The Practical Problem
From Ambuj Goyal, "Computer Science Grand
Challenge: Simplicity of Design," Computing
Research Association Conference on "Grand
Research Challenges" in Computer Science and
Engineering, June 2002
8
Example Embedded Devices
picoChip (http://www.picochip.com/)
  • Issues
  • Execution performance
  • Power dissipation
  • Number of chip types
  • Size and form factor

PACT XPP Technologies (http://www.pactcorp.com/)
9
Physical Layer and Message Switching
10
Messaging Hierarchy
Routing Layer: Where? Destination decisions, i.e., which output port
Switching Layer: When? When is data forwarded
Physical Layer: How? Synchronization of data transfer
  • This organization is distinct from traditional
    networking implementations
  • Emphasis is on low latency communication
  • Only recently have standards been evolving
  • InfiniBand (http://www.infinibandta.org/home)

11
The Physical Layer
[Figure: a message is split into packets, each carrying a header and checksum; packets are split into flits, and flits into phits]
Flit: flow control digit
Phit: physical flow control digit
  • Data is transmitted based on a hierarchical data
    structuring mechanism
  • Messages → packets → flits → phits
  • While flits and phits are fixed size, packets and
    data may be variable sized

12
Flow Control
  • Flow control: the synchronized transfer of a
    unit of information
  • Based on buffer management
  • Asynchronous vs. synchronous flow control
  • Flow control occurs at multiple levels
  • message flow control
  • physical flow control
  • Mechanisms
  • credit-based flow control (a sketch follows below)

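The credit-based scheme can be illustrated with a minimal C sketch (the buffer depth and the function names such as ni_send_flit are assumptions for illustration, not any particular router's implementation): the sender tracks how many free flit slots remain downstream, spends one credit per flit sent, and stalls at zero until credits return.

/* Sketch of credit-based flow control: one credit per free flit slot in the
 * downstream buffer. Buffer depth and names are assumptions. */

#include <stdbool.h>
#include <stdio.h>

#define RX_BUFFER_FLITS 8                 /* assumed downstream buffer depth */

static int credits = RX_BUFFER_FLITS;     /* free slots known to the sender  */

static void ni_send_flit(const void *flit)/* stand-in for the link transmit  */
{
    (void)flit;
    printf("flit sent, %d credits left\n", credits);
}

/* Sender side: a flit may leave only if the receiver has a free slot. */
bool try_send_flit(const void *flit)
{
    if (credits == 0)
        return false;                     /* stall: downstream buffer full   */
    credits--;                            /* consume one credit              */
    ni_send_flit(flit);
    return true;
}

/* Called when the receiver drains a flit and returns a credit upstream. */
void credit_returned(void)
{
    credits++;
}

int main(void)
{
    char flit = 0;
    for (int i = 0; i < 10; i++)          /* the last two attempts stall     */
        if (!try_send_flit(&flit))
            printf("stalled waiting for credits\n");
    credit_returned();                    /* a credit comes back...          */
    try_send_flit(&flit);                 /* ...and transmission resumes     */
    return 0;
}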
13
Switching Layer
  • Comprised of three sets of techniques
  • switching techniques
  • flow control
  • buffer management
  • Organization and operation of routers are largely
    determined by the switching layer
  • Connection Oriented vs. Connectionless
    communication

14
Generic Router Architecture
[Figure: generic router architecture, annotated with wire delay, switching delay, and routing delay]
15
Virtual Channels
  • Each virtual channel is a pair of unidirectional
    channels
  • Independently managed buffers multiplexed over
    the physical channel
  • De-couples buffers from physical channels
  • Originally introduced to break cyclic
    dependencies
  • Improves performance through reduction of
    blocking delay
  • Virtual lanes vs. virtual channels
  • As the number of virtual channels increases, the
    increased channel multiplexing has two effects
  • decrease in header delay
  • increase in average data flit delay
  • Impact on router performance
  • switch complexity

16
Circuit Switching
  • Hardware path setup by a routing header or probe
  • End-to-end acknowledgment initiates transfer at
    full hardware bandwidth
  • Source routing vs. distributed routing
  • System is limited by signaling rate along the
    circuits → wave pipelining

17
Packet Switching
  • Blocking delays in circuit switching are avoided in
    packet-switched networks → full link
    utilization in the presence of data
  • Increased storage requirements at the nodes
  • Packetization and in-order delivery requirements
  • Buffering
  • use of local processor memory
  • central queues

18
Virtual Cut-Through
  • Messages cut-through to the next router when
    feasible
  • In the absence of blocking, messages are
    pipelined
  • pipeline cycle time is the larger of intra-router
    and inter-router flow control delays
  • When the header is blocked, the complete message
    is buffered
  • High load behavior approaches that of packet
    switching

19
Wormhole Switching
  • Messages are pipelined, but buffer space is on
    the order of a few flits
  • Small buffers + message pipelining → small,
    compact buffers
  • Supports variable sized messages
  • Messages cannot be interleaved over a channel:
    routing information is only associated with the
    header
  • Base Latency is equivalent to that of virtual
    cut-through

20
Comparison of Switching Techniques
  • Packet switching and virtual cut-through
  • consume network bandwidth proportional to network
    load
  • predictable demands
  • VCT behaves like wormhole at low loads and like
    packet switching at high loads
  • link level error control for packet switching
  • Wormhole switching
  • provides low latency
  • lower saturation point
  • higher variance of message latency than packet or
    VCT switching
  • Virtual channels
  • blocking delay vs. data delay
  • router flow control latency
  • Optimistic vs. conservative flow control

21
Saturation
22
Network Topologies
23
Direct Networks
  • Generally fixed degree
  • Modular
  • Topologies
  • Meshes
  • Multidimensional tori
  • Special case of tori: the binary hypercube

24
Indirect Networks
  • indirect networks
  • uniform base latency
  • centralized or distributed control
  • Engineering approximations to direct networks

Multistage Network
Fat Tree Network
Bandwidth increases as you go up the tree
25
Generalized MINs
  • Columns of k x k switches and connections between
    switches
  • All switches are identical
  • Directionality and control
  • May concentrate or expand or just connect

26
Specific MINs
  • Switch sizes and interstage interconnect
    establish distinct MINs
  • Majority of interesting MINs have been shown to
    be topologically equivalent

27
Metrics
28
Evaluation Metrics
[Figure: a network cut into two halves by a bisection]
  • Bisection bandwidth
  • The minimum bandwidth across any bisection of
    the network (a worked example follows below)
  • Bisection bandwidth is a limiting attribute of
    performance
  • Latency
  • Message transit time
  • Node degree
  • These are related to pin/wiring constraints

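To make the bisection metric concrete, here is a small worked example with assumed numbers (an 8-ary 2-cube and 1 GB/s links; these figures are illustrative, not from the slides). Bisecting the 8 x 8 torus cuts 2k = 16 links, so the bisection bandwidth is 16 x 1 GB/s = 16 GB/s. The corresponding 8 x 8 mesh has only k = 8 links crossing the same cut, for 8 GB/s. Doubling the channel width would double both numbers, which is why topologies are usually compared under a fixed bisection-width or pin-count budget, as in the next two slides.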
29
Constant Resource Analysis: Bisection Width
30
Constant Resource Analysis: Pin-out
31
Latency Under Contention
32-ary 2-cube vs. 10-ary 3-cube
32
Deadlock and Livelock
33
Deadlock and Livelock
  • Deadlock freedom can be ensured by enforcing
    constraints
  • For example, following dimension-order routing in
    2D meshes (see the sketch below)
  • Similar

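As a concrete illustration of such a routing constraint, here is a minimal C sketch of dimension-order (XY) routing for a 2D mesh; the port names are hypothetical. A packet fully corrects its X offset before turning into Y, which removes the cyclic channel dependencies that cause deadlock in an unrestricted mesh.

/* Dimension-order (XY) routing sketch for a 2D mesh. Port names are
 * illustrative assumptions. */

#include <stdio.h>

typedef enum { PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH, PORT_LOCAL } port_t;

/* Route fully in X first, then in Y. */
static port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL;                    /* arrived: deliver locally */
}

int main(void)
{
    /* A packet at router (1,3) headed for (4,0) first moves east, and only
     * after reaching column 4 does it start moving in the Y dimension. */
    printf("next port: %d\n", xy_route(1, 3, 4, 0));
    return 0;
}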
34
Occurrence of Deadlock
  • Deadlock is caused by dependencies between buffers

35
Deadlock in a Ring Network
36
Deadlock Avoidance Principle
  • Deadlock is caused by dependencies between buffers

37
Routing Constraints on Virtual Channels
  • Add multiple virtual channels to each physical
    channel
  • Place routing restrictions between virtual
    channels

38
Break Cycles
39
Channel Dependence Graph
40
Routing Layer
41
Routing Protocols
42
Key Routing Categories
  • Deterministic
  • The path is fixed by the source/destination pair
  • Source Routing
  • Path is looked up prior to message injection
  • May differ each time the network and NIs are
    initialized
  • Adaptive routing
  • Path is determined by run-time network conditions
  • Unicast
  • Single source to single destination
  • Multicast
  • Single source to multiple destinations

43
Software Layer
44
The Message Layer
  • Message layer background
  • Cluster computers
  • Myrinet SAN
  • Design properties
  • End-to-End communication path
  • Injection
  • Network transmission
  • Ejection
  • Overall performance

45
Cluster Computers
  • Cost-effective alternative to supercomputers
  • A number of commodity workstations
  • Specialized network hardware and software
  • Result: a large pool of host processors

Courtesy of C. Ulmer
46
For Example..
Courtesy of C. Ulmer
47
For Example..
Courtesy of C. Ulmer
48
Cluster Networks
  • Beowulf clusters
  • Use Ethernet and TCP/IP
  • Cheap, but poor host-to-host performance
  • Latencies: 70-100 µs
  • Bandwidths: 80-800 Mbps
  • System Area Network (SAN) clusters
  • Custom hardware/software
  • Examples: Myrinet, SCI, InfiniBand, QsNet
  • Expensive, but good host-to-host performance
  • Latencies as low as 3 µs
  • Bandwidths up to 3 Gbps

Courtesy of C. Ulmer
49
Myrinet
  • Descendant of Caltech Mosaic project
  • Wormhole network
  • Source routing
  • High-speed, ultra-reliable network
  • Configurable topology: switches, NICs, and cables

Courtesy of C. Ulmer
50
Myrinet Switches and Links
  • 16-port crossbar chip
  • 2.0 + 2.0 Gbps per port (full duplex)
  • 300 ns latency
  • Line card
  • 8 network ports
  • 8 backplane ports
  • Backplane cabinet
  • 17 line-card slots
  • Up to 128 hosts

Courtesy of C. Ulmer
51
Myrinet NI Architecture
  • Custom RISC CPU
  • 33-200 MHz
  • Big endian
  • gcc is available
  • SRAM
  • 1-9 MB
  • No CPU cache
  • DMA engines
  • PCI ↔ SRAM
  • SRAM → Tx
  • Rx → SRAM

[Figure: Myrinet network interface card block diagram: LANai processor with SRAM, RISC CPU, PCI interface, Tx/Rx ports, host DMA, and SAN DMA engines]
Courtesy of C. Ulmer
52
Message Layers
Courtesy of C. Ulmer
53
Message Layer Communication Software
  • Message layers are enabling technology for
    clusters
  • Enable the cluster to function as a single-image
    multiprocessor system
  • Responsible for transferring messages between
    resources
  • Hide hardware details from end users

Courtesy of C. Ulmer
54
Message Layer Design Issues
  • Performance is critical
  • Competing with SMPs, where overhead is <1 µs
  • Use every trick to get performance
  • Single cluster user -- remove device sharing
    overhead
  • Little protection -- co-operative environment
  • Reliable hardware -- optimize for common case
    of few errors
  • Smart hardware -- offload host communication
  • Arch hacks -- x86 is a turkey, use MMX, SSE, WC..

Courtesy of C. Ulmer
55
Message Layer Organization
[Figure: message layer organization: user-space application, user-space message layer library, kernel NI device driver, and NI firmware]
Courtesy of C. Ulmer
56
End User's Perspective
[Figure: a message (Msg) passed from Processor A to Processor B]
Courtesy of C. Ulmer
57
End-to-End Communication Path
  • Three phases of data transfer
  • Injection
  • Network
  • Ejection

[Figure: source host (CPU, memory, NI) connected through the SAN to the destination host (NI, memory, CPU); (1) injection at the source, (2) network transmission, (3) ejection at the destination]
Courtesy of C. Ulmer
58
Injecting Data
Courtesy of C. Ulmer
59
Injecting Data into the NI
[Figure: send(dest, data, size) fragments the message and copies it across the PCI bus into the outgoing message queue in NI memory (msg0 ... msgn-1, each with header and data), from which the Tx engine drains it onto the network]
Courtesy of C. Ulmer
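The fragmentation step can be sketched in a few lines of C. This is an illustration only, not the Myrinet message layer's actual code: MAX_FRAG, the header layout, and ni_enqueue_fragment() are assumed names standing in for the copy into the NI's outgoing queue.

/* Illustrative sketch of how a send() call might fragment a message into
 * fixed-size pieces for the NI's outgoing message queue. */

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_FRAG 4096                     /* assumed max payload per fragment */

struct frag_header {
    uint32_t dest;                        /* destination node id              */
    uint32_t seq;                         /* fragment sequence number         */
    uint32_t len;                         /* payload bytes in this fragment   */
    uint32_t last;                        /* nonzero on the final fragment    */
};

/* Stand-in for the copy (PIO or DMA) into the NI's outgoing message queue. */
static void ni_enqueue_fragment(const struct frag_header *h, const void *payload)
{
    (void)payload;
    printf("queued frag %u to node %u (%u bytes%s)\n",
           h->seq, h->dest, h->len, h->last ? ", last" : "");
}

void msg_send(uint32_t dest, const void *data, size_t size)
{
    const uint8_t *p = data;
    uint32_t seq = 0;

    while (size > 0) {
        size_t chunk = size < MAX_FRAG ? size : MAX_FRAG;
        struct frag_header h = { dest, seq++, (uint32_t)chunk, size == chunk };
        ni_enqueue_fragment(&h, p);       /* one entry in the outgoing queue  */
        p    += chunk;
        size -= chunk;
    }
}

int main(void)
{
    uint8_t buf[10000] = { 0 };
    msg_send(3, buf, sizeof buf);         /* 10000 bytes -> 3 fragments       */
    return 0;
}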
60
Host-NI Data Injections
  • Host-NI transfers are challenging
  • Host lacks DMA engine
  • Multiple transfer methods
  • Programmed I/O
  • DMA
  • What about virtual/physical addresses?

[Figure: host CPU, cache, and main memory behind the memory controller, connected over the PCI bus to the network interface, which has its own PCI DMA engine and memory]
Courtesy of C. Ulmer
61
Virtual and Physical Addresses
  • Virtual address space
  • The application's view
  • Contiguous
  • Physical address space
  • Manage physical memory
  • Paged, non-contiguous
  • PCI devices are part of the physical address space
  • PCI devices only use PAs
  • Viewing PCI device memory
  • Memory map

[Figure: the user-space application's contiguous virtual address space mapped onto paged physical host memory]
Courtesy of C. Ulmer
62
Addresses and Injections
  • Programmed I/O (user space; see the sketch below)
  • Translation is handled automatically by the host CPU
  • Example: memcpy( ni_mem, source, size )
  • Can be enhanced by use of MMX, SSE registers
  • DMA (kernel space)
  • One-copy
  • Copy data into a pinned, contiguous block
  • DMA out of that block
  • Zero-copy
  • Transfer data directly out of the VA pages
  • Translate the address and pin each page

Courtesy of C. Ulmer
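The programmed-I/O path can be sketched in C as follows. The mapped NI address and the doorbell step are assumptions; a static buffer stands in for the NI's SRAM so the sketch compiles and runs on its own.

/* Minimal sketch of programmed-I/O injection: the host CPU itself copies the
 * message into memory-mapped NI SRAM (simulated here by a static buffer;
 * in a real system the driver mmap()s the NI's PCI memory). */

#include <stdint.h>
#include <string.h>

static uint8_t ni_sram_stand_in[1 << 20];            /* pretend NI memory   */
static volatile uint8_t *ni_mem = ni_sram_stand_in;  /* "mapped" NI address */

void pio_inject(const void *msg, size_t size)
{
    /* Every word of this copy is a store by the host CPU that crosses the
     * PCI bus (possibly write-combined); no DMA engine is involved. */
    memcpy((void *)ni_mem, msg, size);

    /* A real message layer would now ring a doorbell / advance a producer
     * pointer so the NI firmware notices the new message (omitted). */
}

int main(void)
{
    char msg[] = "hello";
    pio_inject(msg, sizeof msg);
    return 0;
}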
63
TPIL Performance: LANai 9 NI with Pentium III 550 MHz Host
[Figure: bandwidth (MBytes/s) vs. injection size (bytes)]
Courtesy of C. Ulmer
64
Network Delivery (NI-NI)
Courtesy of C. Ulmer
65
Network Delivery (NI-NI)
  • Reliably transfer messages between pairs of NIs
  • Each NI basically has two threads: Send and
    Receive
  • Reliability
  • SANs are usually error free
  • Worried about buffer overflows in NI cards
  • Two approaches to flow control: host-level and
    NI-level

[Figure: sending and receiving network interfaces connected by the SAN]
Courtesy of C. Ulmer
66
Host-managed Flow Control
  • Reliability managed by the host
  • Host-level credit system
  • NI just transfers messages between host and wire
  • Good points
  • Easier to implement
  • Host CPU faster than NI
  • Bad points
  • Poor NI buffer utilization
  • Retransmission overhead

Courtesy of C. Ulmer
67
NI-Managed Flow Control
  • NI manages reliable transmission of messages
  • NIs use control messages (ACK/NACK); a sketch
    follows below
  • Good points
  • Better dynamic buffer use
  • Offloads host CPU
  • Bad points
  • Harder to implement
  • Added overhead for NI

[Figure: DATA messages flow from the sending endpoint's NI across the SAN to the receiving endpoint's NI, and ACK control messages return in the other direction; a PCI bus connects each NI to its host]
Courtesy of C. Ulmer
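A minimal C sketch of the NI-managed approach is shown below. The window size, names, and bookkeeping are assumptions rather than real NI firmware; an actual implementation would also keep a copy of each pending message for retransmission.

/* Sketch of NI-managed flow control with ACK/NACK control messages and a
 * retransmission window. */

#include <stdbool.h>
#include <stdint.h>

#define WINDOW 16                         /* assumed max unacknowledged msgs */

struct pending { uint32_t seq; bool in_use; };

static struct pending window[WINDOW];
static uint32_t next_seq;

/* Sending NI: keep the slot occupied until the receiving NI acknowledges. */
bool ni_send_reliable(const void *msg)
{
    struct pending *slot = &window[next_seq % WINDOW];
    if (slot->in_use)
        return false;                     /* window full: back-pressure host */
    slot->seq = next_seq++;
    slot->in_use = true;
    (void)msg;                            /* transmit msg tagged with seq    */
    return true;
}

/* Receiving NI returns ACK once it has buffered the message... */
void ni_on_ack(uint32_t seq)  { window[seq % WINDOW].in_use = false; }

/* ...and NACK when it had to drop it, triggering a retransmit of the copy. */
void ni_on_nack(uint32_t seq) { (void)seq; /* resend window[seq % WINDOW] */ }

int main(void)
{
    char msg[] = "payload";
    if (ni_send_reliable(msg))
        ni_on_ack(0);                     /* ACK frees the window slot       */
    return 0;
}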
68
Ejection (NI-Host)
Courtesy of C. Ulmer
69
Message Ejection (NI-Host)
  • Move message to host
  • Store close to host CPU
  • Incoming message queue
  • Pinned, contiguous memory
  • NI can write directly
  • Host extracts messages
  • Reassemble fragments
  • How does host see new messages?

[Figure: the network interface writes incoming messages into a queue in host memory, close to the host CPU]
Courtesy of C. Ulmer
70
Notification Polling
  • Applications explicitly call extract() (sketched below)
  • The call examines the queue's front and back pointers
  • Processes a message if one is available
  • Good points
  • Good performance
  • Can tuck away in a thread
  • User has more control
  • Bad points
  • Wastes time if there are no messages
  • Queue can back up
  • Code can be messy

Courtesy of C. Ulmer
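A small C sketch of polling-based extraction follows. The queue geometry and the message type are assumptions, not the actual message-layer structures; the point is simply the front/back pointer comparison.

/* Sketch of polling-based notification: extract() compares the queue's
 * producer (front) and consumer (back) indices. */

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_SLOTS 256

struct msg { uint32_t len; uint8_t payload[256]; };

struct in_queue {
    volatile uint32_t front;              /* advanced by the NI (producer)   */
    uint32_t back;                        /* advanced by the host (consumer) */
    struct msg slots[QUEUE_SLOTS];
};

static struct in_queue rxq;               /* pinned and contiguous in reality */

/* Copies out one message if any is pending; returns false otherwise. */
bool extract(struct msg *out)
{
    if (rxq.back == rxq.front)
        return false;                     /* nothing new: this poll is wasted */
    *out = rxq.slots[rxq.back % QUEUE_SLOTS];
    rxq.back++;
    return true;
}

int main(void)
{
    struct msg m;
    if (!extract(&m))
        printf("queue empty\n");          /* the cost of polling with no data */
    return 0;
}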
71
Notification Interrupts
  • NI invokes interrupt after putting message in
    queue
  • Host stops whatever it was doing
  • The device driver's interrupt service routine (ISR)
    catches it
  • The ISR uses the UNIX signal infrastructure to pass it
    to the application
  • The application catches the signal and executes extract()
  • Good points
  • No wasted polling time
  • Bad points
  • High overhead
  • Interrupts take roughly 10 µs
  • Constantly interrupted

Courtesy of C. Ulmer
72
Other APIs: Remote Memory Ops
  • Often just passing data
  • Don't disturb the receiving application
  • Remote memory operations (an API sketch follows below)
  • Fetch, store remote memory
  • NI executes the transfer directly (no need for
    notification)
  • Virtual addresses translated by the NI (and
    cached)

Courtesy of C. Ulmer
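A hypothetical remote-memory API might look like the sketch below. The names and signatures are illustrative, not a specific library; the stub bodies stand in for the work the NI's DMA engine and address-translation cache would do.

/* Hypothetical remote-memory-operation API sketch: the NI performs the
 * transfer, so the receiving host needs no notification. */

#include <stddef.h>
#include <stdint.h>

/* Store 'len' bytes from local memory into virtual address 'remote_va' on
 * node 'node'. */
int rm_put(int node, uint64_t remote_va, const void *local, size_t len)
{
    (void)node; (void)remote_va; (void)local; (void)len;
    return 0;   /* NI would translate remote_va and DMA the data out */
}

/* Fetch 'len' bytes from 'remote_va' on node 'node' into local memory. */
int rm_get(int node, uint64_t remote_va, void *local, size_t len)
{
    (void)node; (void)remote_va; (void)local; (void)len;
    return 0;   /* NI would issue the read and DMA the reply into 'local' */
}

int main(void)
{
    uint64_t counter = 0;
    /* e.g., push a local counter into a peer's memory without disturbing it */
    return rm_put(1, 0x1000, &counter, sizeof counter);
}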
73
The Message Path
[Figure: end-to-end message path: memory, CPU, OS, and PCI bus on the sending host, through the NI, across the network, and up through the NI, PCI bus, OS, CPU, and memory on the receiving host]
  • Wire bandwidth is not the bottleneck!
  • Operating system and/or user level software
    limits performance

74
Universal Performance Metrics
[Figure: message timeline: sender overhead (processor busy), time of flight, transmission time (= message size / bandwidth), receiver overhead (processor busy); transport latency spans time of flight plus transmission time, and total latency spans the entire interval]

Total Latency = Sender Overhead + Time of Flight + (Message Size / BW) + Receiver Overhead

Includes header/trailer in BW calculation?
75
Simplified Latency Model
  • Total Latency = Overhead + Message Size / BW
  • Overhead = Sender Overhead + Time of Flight +
    Receiver Overhead
  • Can relate overhead to network bandwidth
    utilization (a worked example follows below)

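As a quick worked example with assumed numbers (not taken from the slides): suppose the sender overhead is 5 µs, the time of flight is 1 µs, the receiver overhead is 10 µs, and the link bandwidth is 1 GB/s. For a 4 KB message, Overhead = 5 + 1 + 10 = 16 µs and Message Size / BW = 4096 B / 1 GB/s ≈ 4.1 µs, so Total Latency ≈ 20.1 µs. The delivered bandwidth is 4096 B / 20.1 µs ≈ 204 MB/s, only about 20% of the link rate, which is the sense in which software overhead rather than the wire limits performance.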
76
Commercial Example
77
Scalable Switching Fabrics for Internet Routers
  • Internet bandwidth growth → routers with
  • large numbers of ports
  • high bisection bandwidth
  • Historically these solutions have used
  • Backplanes
  • Crossbar switches
  • White paper: "Scalable Switching Fabrics for
    Internet Routers," by W. J. Dally,
    http://www.avici.com/technology/whitepapers/

78
Requirements
  • Scalable
  • Incremental
  • Economical → cost linear in the number of nodes
  • Robust
  • Fault tolerant → path diversity and reconfiguration
  • Non-blocking features
  • Performance
  • High bisection bandwidth
  • Quality of Service (QoS)
  • Bounded delay

79
Switching Fabric
  • Three components
  • Topology → 3D torus
  • Routing → source routing with randomization
  • Flow control → virtual channels and virtual
    networks
  • Maximum configuration: 14 x 8 x 5 = 560 nodes
  • Channel speed is 10 Gbps

80
Packaging
  • Uniformly short wires between adjacent nodes
  • Can be built in passive backplanes
  • Run at high speed
  • Bandwidth inversely proportional to square of
    wire length
  • Cabling costs
  • Power costs

Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
81
Available Bandwidth
  • Distinguish between capacity and I/O bandwidth
  • Capacity: traffic that will load a link to 100%
  • I/O bandwidth: bit rate in or out
  • Discontinuities

Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
82
Properties
  • Path diversity
  • Avoids tree saturation
  • Edge disjoint paths for fault tolerance
  • Heartbeat checks (100 microseconds); deflecting
    while tables are updated

Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
83
Properties
Figures are from Scalable Switching Fabrics for
Internet Routers, by W. J. Dally (can be found at
www.avici.com)
84
Use of Virtual Channels
  • Virtual channels aggregated into virtual networks
  • Two networks for each output port
  • Distinct networks prevent undesirable coupling
  • Only bandwidth on a link is shared
  • Fair arbitration mechanisms
  • Distinct networks enable QoS constraints to be
    met
  • Separate best effort and constant bit rate traffic

85
Summary
  • Distinguish between traditional networking and
    high performance multiprocessor communication
  • Hierarchy of implementations
  • Physical, switching and routing
  • Protocol families and protocol layers (the
    protocol stack)
  • Datapath and architecture of the switches
  • Metrics
  • Bisection bandwidth
  • Reliability
  • Traditional latency and bandwidth

86
Study Guide
  • Given a topology and relevant characteristics
    such as channel widths and link bandwidths,
    compute the bisection bandwidth
  • Distinguish between switching mechanisms based on
    how channel buffers are reserved/used during
    message transmission
  • Latency expressions for different switching
    mechanisms
  • Compute the network bisection bandwidth when the
    software overheads of message transmission are
    included
  • Identify the major delay elements in the message
    transmission path, starting at the send() call and
    ending with the receive() call
  • How do costs scale in different topologies?
  • Latency scaling
  • Unit of upgrade → cost of upgrade