Title: Cache Coherence and Interconnection Network Design
1Cache Coherence and Interconnection Network Design
- Adapted from UC, Berkeley Notes
2Cache-based (Pointer-based) Schemes
- How they work
- home only holds pointer to rest of directory info
- distributed linked list of copies, weaves through caches
- cache tag has pointer, points to next cache with a copy
- on read, add yourself to head of the list (comm. needed)
- on write, propagate chain of invalidations down the list
- What if a link fails? => Scalable Coherent Interface (SCI) IEEE Standard
- doubly linked list
- What to do on replacement?
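A minimal sketch of the sharing list for one memory block, assuming a toy in-memory model. The class and method names (`SharingList`, `read`, `write`, `evict`) are hypothetical and not taken from the SCI standard. It shows a reader inserting itself at the head, a writer walking the list to invalidate, and replacement unlinking a node through both of its neighbours.

```python
class SharingList:
    """Toy model of the sharing list for a single memory block."""

    def __init__(self):
        self.head = None   # the only pointer kept at the home node
        self.next = {}     # forward pointer held in each cache's tag
        self.prev = {}     # backward pointer (doubly linked, as in SCI)

    def read(self, cache):
        """A reader adds itself at the head of the list."""
        if cache in self.next:
            return                      # already a sharer
        old_head = self.head
        self.next[cache] = old_head
        self.prev[cache] = None
        if old_head is not None:
            self.prev[old_head] = cache
        self.head = cache

    def sharers(self):
        """Walk the list from the head, one cache at a time."""
        out, node = [], self.head
        while node is not None:
            out.append(node)
            node = self.next[node]
        return out

    def write(self, writer):
        """Invalidate all other sharers; return how many were invalidated."""
        invalidations = sum(1 for c in self.sharers() if c != writer)
        self.head = writer
        self.next = {writer: None}
        self.prev = {writer: None}
        return invalidations

    def evict(self, cache):
        """Replacement: unlink the cache using both of its neighbours."""
        before = self.prev.pop(cache)
        after = self.next.pop(cache)
        if before is None:
            self.head = after
        else:
            self.next[before] = after
        if after is not None:
            self.prev[after] = before
```

The invalidation count returned by `write` grows with the number of sharers, and each step of the walk needs the previous sharer's pointer, which is why both traffic and latency on a write scale with the list length.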
3Scaling Properties (Cache-based)
- Traffic on write: proportional to number of sharers
- Latency on write: proportional to number of sharers!
- don't know identity of next sharer until reach current one
- also assist processing at each node along the way
- (even reads involve more than one other assist: home and first sharer on list)
- Storage overhead: quite good scaling along both axes
- Only one head pointer per memory block
- rest is all proportional to cache size
- Very complex!!!
4Hierarchical Directories
- Directory is a hierarchical data structure
- leaves are processing nodes, internal nodes just directory
- logical hierarchy, not necessarily physical
- (can be embedded in general network)
5Caching in the Internet (Server side caching,
Client side caching, Network caching)
6Network Caching: Proxy Caching
- Ref: Jia Wang, "A Survey of Web Caching Schemes for the Internet," ACM Computing Surveys
7Comparison with P2P
- Directory is similar to the tracker in BitTorrent (BT) or the root in a P2P network: Plaxton's root has one pointer and BT has all pointers
- Real object may be somewhere else (call it the home node), as pointed to by the directory
- The shared caches are extra locations of the object, equivalent to peers
- Duplicate directories possible: see hierarchical directory design (MIND) later
- Lots of possible designs for the directory. Can we do a similar design in P2P?
8Plaxton's Scheme
- Similar to Distributed Shared Memory (DSM) Cache Protocol
- Root is the Home node in DSM, where the directory is kept but not the real object. The root points to the nearest home peer, where a shared copy can be found.
- Similar to the Hierarchical Directory scheme, where intermediate nodes point to nearest object nodes (equivalent to shared copy). However, in the hierarchical directory scheme, intermediate pointers point to all shared copies.
9P2P Research
- How many pointers do we need at the tracker? Cost model? Depends on popularity? How many peers to supply to client? Communication model? How to update directory?
- Cache-based (linked-list) approach like SCI?
- Hierarchical directory: Plaxton's paper. Introduce such a method for BT network?
- Combine directory + tree + DHTs?
- Is a dirty copy possible in P2P? Then develop a protocol similar to cache coherence
- Sub-page coherence like DSM?
10Interconnection Network Design
- Adapted from UC, Berkeley Notes
11Scalable, High Perf. Interconnection Network
- At Core of Parallel Computer Arch.
- Requirements and trade-offs at many levels
- Elegant mathematical structure
- Deep relationships to algorithm structure
- Managing many traffic flows
- Electrical / Optical link properties
- Little consensus
- interactions across levels
- Performance metrics?
- Cost metrics?
- Workload?
- => need holistic understanding
12Goals
- Job of a multiprocessor network is to transfer information from source node to destination node in support of network transactions that realize the programming model
- latency as small as possible
- as many concurrent transfers as possible
- operation bandwidth
- data bandwidth
- cost as low as possible
13Formalism
- network is a graph: $V$ = {switches and nodes} connected by communication channels $C \subseteq V \times V$
- Channel has width $w$ and signaling rate $f = 1/\tau$
- channel bandwidth $b = wf$
- phit (physical unit): data transferred per cycle
- flit: basic unit of flow-control
- Number of input (output) channels is switch degree
- Sequence of switches and links followed by a message is a route
- Think streets and intersections
14What characterizes a network?
- Topology (what)
- physical interconnection structure of the network graph
- direct: node connected to every switch
- indirect: nodes connected to specific subset of switches
- Routing Algorithm (which)
- restricts the set of paths that msgs may follow
- many algorithms with different properties
- gridlock avoidance?
- Switching Strategy (how)
- how data in a msg traverses a route
- circuit switching vs. packet switching
- Flow Control Mechanism (when)
- when a msg or portions of it traverse a route
- what happens when traffic is encountered?
15Topological Properties
- Routing Distance: number of links on route
- Diameter: maximum routing distance between any two nodes in the network
- Average Distance: sum of distances between nodes / number of node pairs
- Degree of a Node: number of links connected to a node => cost high if degree is high
- A network is partitioned by a set of links if their removal disconnects the graph
- Fault-tolerance: number of alternate paths between two nodes in a network
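As an illustration, these properties can be computed for a small topology with breadth-first search. This is a sketch under stated assumptions: the network is an undirected graph given as an adjacency dict, routes are shortest paths, and the average is taken over ordered pairs of distinct nodes. The helper names (`bfs_distances`, `topology_metrics`, `ring`) are hypothetical.

```python
from collections import deque

def bfs_distances(adj, src):
    """Shortest hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def topology_metrics(adj):
    """Diameter, average distance, and maximum node degree of a connected graph."""
    pair_distances = []
    for src in adj:
        dist = bfs_distances(adj, src)
        pair_distances += [d for node, d in dist.items() if node != src]
    return {
        "diameter": max(pair_distances),
        "average_distance": sum(pair_distances) / len(pair_distances),
        "max_degree": max(len(neighbors) for neighbors in adj.values()),
    }

def ring(n):
    """Bidirectional ring of n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

metrics = topology_metrics(ring(8))
```

For the 8-node ring the diameter is 4 and every node has degree 2; a higher-degree topology would shorten the distances at a higher cost per node.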
16Typical Packet Format
- Two basic mechanisms for abstraction
- encapsulation
- fragmentation
17Review Performance Metrics
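[Figure: timeline of one message transfer. Sender: sender overhead (processor busy). Network: time of flight, then transmission time (size ÷ bandwidth). Receiver: receiver overhead (processor busy). Transport latency covers time of flight plus transmission time; total latency spans sender overhead through receiver overhead.]
$\text{Total Latency} = \text{Sender Overhead} + \text{Time of Flight} + \frac{\text{Message Size}}{\text{BW}} + \text{Receiver Overhead}$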
Includes header/trailer in BW calculation?
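A small sketch of the latency breakdown, assuming all times are in seconds and bandwidth is in bytes per second. The function names and the sample numbers are hypothetical, chosen only to exercise the formula.

```python
def transport_latency(time_of_flight, message_bytes, bandwidth_bytes_per_s):
    """Time of flight plus transmission time (size / bandwidth)."""
    return time_of_flight + message_bytes / bandwidth_bytes_per_s

def total_latency(sender_overhead, time_of_flight, message_bytes,
                  bandwidth_bytes_per_s, receiver_overhead):
    """Sender overhead + transport latency + receiver overhead."""
    return (sender_overhead
            + transport_latency(time_of_flight, message_bytes, bandwidth_bytes_per_s)
            + receiver_overhead)

# Hypothetical numbers: 1 us overheads, 0.5 us flight, 1000 bytes at 1 GB/s.
example = total_latency(1e-6, 0.5e-6, 1000, 1e9, 1e-6)
```

Whether `message_bytes` should include the header and trailer is exactly the question raised on the slide; the sketch leaves that choice to the caller.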
18Switching Techniques
- Circuit Switching: A control message is sent from source to destination and a path is reserved. Communication starts. The path is released when communication is complete.
- Store-and-forward policy (Packet Switching): each switch waits for the full packet to arrive in the switch before sending it to the next switch (good for WAN)
- Cut-through routing or wormhole routing: switch examines the header, decides where to send the message, and then starts forwarding it immediately
- In wormhole routing, when the head of the message is blocked, the message stays strung out over the network, potentially blocking other messages (needs to buffer only the piece of the packet that is sent between switches). CM-5 uses it, with each switch buffer being 4 bits per port.
- Cut-through routing lets the tail continue when the head is blocked, storing the whole message in an intermediate switch. (Requires a buffer large enough to hold the largest packet.)
20Store and Forward vs. Cut-Through
- Advantage
- Latency reduces from a function of (number of intermediate switches x size of the packet) to (time for 1st part of the packet to negotiate the switches + packet size ÷ interconnect BW)
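21Store&Forward vs Cut-Through Routing
- $h(n/b + \Delta)$ vs $n/b + h\Delta$ ($h$ = number of hops, $n$ = message size, $b$ = channel bandwidth, $\Delta$ = routing delay per hop)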
- what if message is fragmented?
- wormhole vs virtual cut-through
23Example
- Q. Compare the efficiency of store-and-forward (packet switching) vs. wormhole routing for transmission of a 20-byte packet between a source and destination that are d nodes apart. Each node takes 0.25 microsecond and the link transfer rate is 20 MB/sec.
- Answer: Time to transfer 20 bytes over a link = 20 bytes / 20 MB/sec = 1 microsecond.
- Packet switching: # nodes x (node delay + transfer time) = d x (0.25 + 1) = 1.25 d microseconds
- Wormhole: (# nodes x node delay) + transfer time = 0.25 d + 1 microseconds
- Book: For d = 7, packet switching takes 8.75 microseconds vs. 2.75 microseconds for wormhole routing
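The arithmetic in this example can be checked with a short sketch. The function names are hypothetical; times are in microseconds, and 20 MB/sec is treated as 20 bytes per microsecond.

```python
def store_and_forward_us(d, node_delay_us, transfer_us):
    """Every hop pays the node delay plus the full packet transfer time."""
    return d * (node_delay_us + transfer_us)

def wormhole_us(d, node_delay_us, transfer_us):
    """Every hop pays only the node delay; the transfer time is paid once."""
    return d * node_delay_us + transfer_us

packet_bytes = 20
link_bytes_per_us = 20                      # 20 MB/sec
transfer_us = packet_bytes / link_bytes_per_us

packet_switching = store_and_forward_us(7, 0.25, transfer_us)
wormhole = wormhole_us(7, 0.25, transfer_us)
```

The gap widens with distance: the store-and-forward cost multiplies the transfer time by d, while the wormhole cost adds it only once.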
24Contention
- Two packets trying to use the same link at same time
- limited buffering
- drop?
- Most parallel mach. networks block in place
- link-level flow control
- tree saturation
- Closed system - offered load depends on delivered
25Delay with Queuing
- Suppose there are L links per node. Each link sends a packet to another link at $\lambda$ packets/sec. The service rate (link + switch) is $\mu$ packets per second. What is the delay over a distance D?
- Ans: There is a queue at each output link to hold extra packets. Model each output link as an M/M/1 queue with $L\lambda$ input rate and $\mu$ service rate.
- Delay through each link = Queuing time + Service time = S
- Delay over a distance D = S x D
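A sketch of this model, assuming the standard M/M/1 result that the mean time in the system (queuing plus service) is 1 / (service rate - arrival rate) when the arrival rate is below the service rate. The function names and sample rates are hypothetical.

```python
def mm1_delay(arrival_rate, service_rate):
    """Mean time in an M/M/1 system: queuing time plus service time."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

def path_delay(links_per_node, lam, mu, distance):
    """Delay over `distance` hops, each output link modeled as M/M/1."""
    per_link = mm1_delay(links_per_node * lam, mu)   # S in the text
    return distance * per_link

# Hypothetical load: 4 links, 100 packets/sec each, service rate 1000 packets/sec.
delay_seconds = path_delay(4, 100, 1000, 5)
```

As the offered load approaches the service rate, the per-link delay grows without bound, which is the queuing view of contention.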
26Congestion Control
- Packet switched networks do not reserve bandwidth; this leads to contention (connection based limits input)
- Solution: prevent packets from entering until contention is reduced (e.g., freeway on-ramp metering lights)
- Options
- Packet discarding: If a packet arrives at a switch and there is no room in the buffer, the packet is discarded (e.g., UDP)
- Flow control: between pairs of receivers and senders; use feedback to tell the sender when it is allowed to send the next packet
- Back-pressure: separate wires to tell to stop
- Window: give original sender the right to send N packets before getting permission to send more; overlaps latency of interconnection with overhead to send & receive packet (e.g., TCP), adjustable window
- Choke packets: aka "rate-based"; each packet received by a busy switch in warning state is sent back to the source via a choke packet. Source reduces traffic to that destination by a fixed % (e.g., ATM)
27Routing
- Recall routing algorithm determines
- which of the possible paths are used as routes
- how the route is determined
- $R: N \times N \to C$, which at each switch maps the destination node $n_d$ to the next channel on the route
- Issues
- Routing mechanism
- arithmetic
- source-based port select
- table driven
- general computation
- Properties of the routes
- Deadlock free
28Routing Mechanism
- need to select output port for each input packet
- in a few cycles
- Simple arithmetic in regular topologies
- ex: $\Delta x$, $\Delta y$ routing in a grid
- west (-x): $\Delta x < 0$
- east (+x): $\Delta x > 0$
- south (-y): $\Delta x = 0$, $\Delta y < 0$
- north (+y): $\Delta x = 0$, $\Delta y > 0$
- processor: $\Delta x = 0$, $\Delta y = 0$
- Reduce relative address of each dimension in order
- Dimension-order routing in k-ary d-cubes
- e-cube routing in n-cube
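The port-select rule for a 2D grid can be sketched as follows. The function names and the coordinate convention (east is +x, north is +y) are assumptions for illustration, and the header is modeled simply as the remaining offset to the destination.

```python
def select_port(dx, dy):
    """Pick the output port from the relative address (dx, dy)."""
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "processor"          # dx == 0 and dy == 0: deliver locally

STEP = {"west": (-1, 0), "east": (1, 0), "south": (0, -1), "north": (0, 1)}

def route(src, dst):
    """List of ports taken from src to dst, x dimension first, then y."""
    x, y = src
    ports = []
    while True:
        port = select_port(dst[0] - x, dst[1] - y)
        ports.append(port)
        if port == "processor":
            return ports
        x, y = x + STEP[port][0], y + STEP[port][1]

example = route((0, 0), (2, -1))
```

Each switch needs only two comparisons per packet, which is what makes the decision feasible in a few cycles.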
29Routing Mechanism (cont)
[Figure labels: P0, P1, P2, P3]
- Source-based
- message header carries series of port selects
- used and stripped en route
- CRC? Packet Format?
- CS-2, Myrinet, MIT Arctic
- Table-driven
- message header carries index for next port at next switch
- $o = R[i]$
- table also gives index for following hop
- $o, i' = R[i]$
- ATM, HPPI
30Properties of Routing Algorithms
- Deterministic
- route determined by (source, dest), not intermediate state (i.e. traffic)
- Adaptive
- route influenced by traffic along the way
- Minimal
- only selects shortest paths
- Deadlock free
- no traffic pattern can lead to a situation where no packets move forward
31Deadlock Freedom
- How can it arise?
- necessary conditions
- shared resource
- incrementally allocated
- non-preemptible
- think of a channel as a shared resource that is acquired incrementally
- source buffer then dest. buffer
- channels along a route
- How do you avoid it?
- constrain how channel resources are allocated
- ex dimension order
- How do you prove that a routing algorithm is deadlock free?
32Proof Technique
- resources are logically associated with channels
- messages introduce dependencies between resources as they move forward
- need to articulate the possible dependences that can arise between channels
- show that there are no cycles in Channel Dependence Graph
- find a numbering of channel resources such that every legal route follows a monotonic sequence
- => no traffic pattern can lead to deadlock
- network need not be acyclic, only channel dependence graph
33Example k-ary 2D array
- Theorem: x,y routing is deadlock free
- Numbering
- +x channel (i,y) -> (i+1,y) gets i
- similarly for -x with 0 as most positive edge
- +y channel (x,j) -> (x,j+1) gets N+j
- similarly for -y channels
- any routing sequence: x direction, turn, y direction is increasing
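The numbering argument can be checked exhaustively on a small array. This sketch makes several assumptions: a k x k array without wraparound, +x channel (i,y) -> (i+1,y) numbered i, the -x channels numbered in mirror image so that 0 sits at the most positive edge, y channels numbered the same way with an offset N, and N = k (any offset above the largest x number would do). All function names are hypothetical.

```python
from itertools import product

def xy_route(src, dst):
    """Channels (from_node, to_node) used by x-then-y routing."""
    (x, y), (tx, ty) = src, dst
    channels = []
    while x != tx:
        nx = x + (1 if tx > x else -1)
        channels.append(((x, y), (nx, y)))
        x = nx
    while y != ty:
        ny = y + (1 if ty > y else -1)
        channels.append(((x, y), (x, ny)))
        y = ny
    return channels

def channel_number(channel, k):
    """Number a channel so that every x,y route sees increasing numbers."""
    (x0, y0), (x1, y1) = channel
    n = k                                   # assumed offset N for y channels
    if y0 == y1:                            # x channel
        return x0 if x1 > x0 else (k - 1) - x0
    return n + (y0 if y1 > y0 else (k - 1) - y0)

def numbering_is_increasing(k):
    """True if channel numbers strictly increase along every x,y route."""
    nodes = list(product(range(k), repeat=2))
    for src, dst in product(nodes, repeat=2):
        nums = [channel_number(c, k) for c in xy_route(src, dst)]
        if any(a >= b for a, b in zip(nums, nums[1:])):
            return False
    return True
```

Since every route acquires channels in strictly increasing order, no set of routes can form a cycle of waiting channels under these assumptions.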
34Deadlock free wormhole networks?
- Basic dimension order routing techniques don't work for k-ary d-cubes
- only for k-ary d-arrays (bi-directional)
- Idea: add channels!
- provide multiple "virtual channels" to break the dependence cycle
- good for BW too!
- Do not need to add links, or xbar, only buffer resources
- This adds nodes to the CDG, remove edges?
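A sketch of why wraparound links break dimension-order routing and how virtual channels help, using a unidirectional ring of k nodes as the simplest case. The scheme shown is a common "dateline" assignment, chosen here as an assumption: a packet starts on virtual channel 0 and switches to virtual channel 1 after passing node 0. The function names are hypothetical.

```python
def ring_route(src, dst, k, use_virtual_channels):
    """Channels (link index, virtual channel) used going clockwise."""
    hops, node, vc = [], src, 0
    while node != dst:
        hops.append((node, vc))        # link `node` goes node -> (node + 1) % k
        node = (node + 1) % k
        if node == 0 and use_virtual_channels:
            vc = 1                     # passed the dateline: switch channels
    return hops

def dependence_edges(k, use_virtual_channels):
    """Channel dependence graph: an edge for each pair of consecutive hops."""
    edges = set()
    for src in range(k):
        for dst in range(k):
            hops = ring_route(src, dst, k, use_virtual_channels)
            edges.update(zip(hops, hops[1:]))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}                         # node -> "visiting" or "done"

    def visit(n):
        state[n] = "visiting"
        for m in graph[n]:
            if state.get(m) == "visiting" or (m not in state and visit(m)):
                return True
        state[n] = "done"
        return False

    return any(n not in state and visit(n) for n in graph)
```

With a single channel per link the dependence graph of the ring is itself a ring; with the second virtual channel the graph has more nodes but the edge that closed the cycle is gone, and no new links or crossbar ports were needed.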
35Breaking deadlock with virtual channels
36Adaptive Routing
- $R: C \times N \times \Sigma \to C$
- Essential for fault tolerance
- at least multipath
- Can improve utilization of the network
- Simple deterministic algorithms easily run into bad permutations
- fully/partially adaptive, minimal/non-minimal
- can introduce complexity or anomalies
- little adaptation goes a long way!