Hardware Microarchitecture Lecture-1 <Ch. 16,17,18,19> - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Hardware Microarchitecture Lecture-1 <Ch.
16,17,18,19>
  • ELE-580i PRESENTATION-I
  • 04/01/2003
  • Canturk ISCI

2
ROUTER ARCHITECTURE
  • Router
  • Registers
  • Switches
  • Functional Units
  • Control Logic
  • Implements
  • Routing Flow Control
  • Pipelined
  • Use credits for buffer space
  • Flits → <downstream>
  • Credits → <upstream>
  • Constitute the credit loop
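The credit loop above can be sketched as a pair of counters: a flit may leave upstream only while a credit is held, and each drained flit returns a credit. This is a minimal illustration under assumed names (`CreditChannel` is mine, not the slides'), not the book's implementation:

```python
# Minimal sketch of credit-based flow control over one channel.
from collections import deque

class CreditChannel:
    def __init__(self, buffers):
        self.credits = buffers        # free flit buffers downstream
        self.downstream = deque()     # flits buffered downstream

    def send_flit(self, flit):
        """Upstream may send only while it holds a credit."""
        if self.credits == 0:
            return False              # credit stall: flit must wait
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def drain_flit(self):
        """Downstream forwards a flit and returns a credit upstream."""
        flit = self.downstream.popleft()
        self.credits += 1             # credit travels back upstream
        return flit

ch = CreditChannel(buffers=2)
assert ch.send_flit("head") and ch.send_flit("body")
assert not ch.send_flit("tail")       # out of credits → stall
ch.drain_flit()                       # credit returned
assert ch.send_flit("tail")
```

The two `deque`/counter updates are the two halves of the credit loop: flits flow downstream, credits flow upstream.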

3
ROUTER Diagram
  • Virtual Channel Router
  • Datapath
  • Input Units Switch Output Units
  • Control
  • Router, VC allocator, Switch Allocator
  • Input Unit
  • State vector (for each VC) + flit buffer (for
    each VC)
  • State vector fields: G, R, O, P, C
  • Output Unit
  • Latches outgoing flits
  • State vector fields: G, I, C
  • Switch
  • Connect I/p to o/p according to SA
  • VCA
  • Arbitrate o/p channel RQs from each I/p packet
  • Once for each packet!!
  • SA
  • Arbitrates o/p port RQs from I/p ports
  • Done for each flit
  • Router
  • Determines o/p ports for packets

4
VC State Fields
  • Input virtual channel
  • G: Global state
  • I, R, V, A, C
  • R: Route
  • O/p port for packet
  • O: O/p VC
  • O/p VC of port R for packet
  • P: Pointers
  • Flit head and tail pointers
  • C: Credit count
  • # of credits for o/p VC R.O
  • Output virtual channel
  • G: Global state
  • I, A, C
  • I: I/p VC
  • I/p port.VC forwarding this o/p VC
  • C: Credit count
  • # of free buffers at the downstream node

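The two state vectors above (G/R/O/P/C per input VC, G/I/C per output VC) can be sketched as records; the field types here are illustrative assumptions, not the book's encoding:

```python
# Sketch of the per-VC state vectors from the slide.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class G(Enum):              # global state of a VC
    IDLE = "I"
    ROUTING = "R"
    VC_ALLOC = "V"
    ACTIVE = "A"
    CREDIT_WAIT = "C"

@dataclass
class InputVCState:
    g: G = G.IDLE                       # G: global state (I, R, V, A, C)
    r: Optional[int] = None             # R: o/p port for the packet
    o: Optional[int] = None             # O: o/p VC of port R
    p: Tuple[Optional[int], Optional[int]] = (None, None)  # P: head/tail ptrs
    c: int = 0                          # C: credits for o/p VC R.O

@dataclass
class OutputVCState:
    g: G = G.IDLE                       # G: global state (I, A, C)
    i: Optional[Tuple[int, int]] = None # I: i/p (port, VC) feeding this VC
    c: int = 0                          # C: free buffers downstream

# One state vector per VC, as the slide's "× (# of VCs)" annotation implies:
input_vcs = [InputVCState() for _ in range(4)]
assert all(vc.g is G.IDLE for vc in input_vcs)
```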
5
How it works
  • 1) Packet → input controller →
  • Router → o/p port (e.g. P3)
  • VCA → o/p VC (e.g. P3.VC2)
  • → Route determined
  • 2) Each flit → input controller →
  • SA → timeslot over switch
  • → Flit forwarded to o/p unit
  • 3) Each flit → output unit →
  • Drives downstream physical channel
  • → Flit transferred

6
Router Pipeline
  • Route Compute
  • Define the o/p port for packet header
  • VC Allocate
  • Assign a VC from the port if available
  • Switch Allocate
  • Schedule switch state according to o/p port
    requests
  • Switch Traverse
  • I/p drives the switch for o/p port
  • Transmit
  • Transmit the flit over downstream channel
  • RC, VA → only for the header flit
  • O/p channel is assigned for the whole packet
  • SA, ST, TX → for all flits
  • Flits from different packets compete continuously
  • Flits Transmitted sequentially for routing in
    next hops
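The stage assignment above (RC and VA once per packet, SA/ST/TX per flit) can be sketched as a toy schedule; the cycle numbering assumes an ideal stall-free packet with one flit entering SA per cycle:

```python
# Sketch of per-flit pipeline occupancy: the head flit sees all five
# stages (RC, VA, SA, ST, TX); body/tail flits skip RC and VA.
HEAD_STAGES = ["RC", "VA", "SA", "ST", "TX"]
BODY_STAGES = ["SA", "ST", "TX"]

def schedule(num_flits):
    """Map each flit to {cycle: stage} for a stall-free traversal."""
    sched = {}
    for i in range(num_flits):
        stages = HEAD_STAGES if i == 0 else BODY_STAGES
        start = 0 if i == 0 else 2 + i      # head occupies SA at cycle 2
        sched[i] = {start + c: s for c, s in enumerate(stages)}
    return sched

s = schedule(3)
assert s[0][2] == "SA"                        # head: RC@0, VA@1, SA@2
assert s[1][3] == "SA" and s[2][4] == "SA"    # one flit into SA per cycle
```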

7
Pipeline Walkthrough
  • (0) <start>
  • P4.VC3 (i/p VC)
  • G=I R=x O=x P=x C=x
  • Packet arrives at i/p port P4
  • Packet header → VC3 →
  • Packet stored in P4.VC3
  • (1) <RC>
  • P4.VC3
  • G=R R=x O=x P=<head>,<tail??> C=x
  • Packet header → Router → select o/p port P3
  • (2) <VA>
  • P4.VC3
  • G=V R=P3 O=x P=<head>,<tail??> C=x
  • P3.VC2 (o/p VC)
  • G=I I=x C=x
  • P3 → VCA → allocate VC for o/p port P3: VC2

8
Pipeline Walkthrough
  • (3) <SA>
  • P4.VC3 (i/p VC)
  • G=A R=P3 O=VC2 P=<head>,<tail??> C
  • P3.VC2 (o/p VC)
  • G=A I=P4.VC3 C
  • Packet processing complete
  • Flit-by-flit switch allocation/traverse/
    transmit
  • Head flit allocated on switch →
  • Move pointers
  • Decrement P4.VC3.Credit
  • Send a credit to upstream node to declare the
    available buffer space
  • (4) <ST>
  • Head flit arrives at output VC
  • (5) <TX>
  • Head flit transmitted to downstream
  • (6) <Tail in SA>
  • Packet done
  • (7) <Release resources>
  • P4.VC3 (i/p VC)

9
Pipeline Stalls
  • Packet Stalls
  • P1) I/p VC Busy stall
  • P2) Routing stall
  • P3) VC Allocation stall
  • Flit Stalls
  • F1) Switch Allocation Stall
  • F2) Buffer empty stall
  • F3) Credit stall
  • Credit return cycles = pipeline(4) + round
    trip(4) + CT(1) + CU(1) + next SA(1) = 11

10
Channel Reallocation
  • 1) Conservative
  • Wait until credit received for tail from
    downstream to reallocate o/p VC
  • 2) Aggressive, single global state
  • Reallocate o/p VC when tail passes SA
  • (same as VA stall)
  • Reallocate downstream I/p VC when tail passes SA
  • (Same as I/p VC busy stall)

11
Channel Reallocation
  • 2) Aggressive, double global state
  • Reallocate o/p VC when tail passes SA
  • (same as VA stall)
  • Eliminate I/p VC busy stall
  • Needs 2 I/p VC state vectors at downstream
  • For A:
  • G=A R=Px O=VCx P=<head A>,<tail A> C
  • For B:
  • G=R R=x O=x P=<head B>,<tail??> C=x

12
Speculation and Lookahead
  • Reduce latency by reducing pipe stages →
    speculation (and lookahead)
  • Speculate virtual channel allocation
  • Do VA and SA concurrently
  • If the VC set from RC spans more than 1 port,
    speculate that as well
  • Lookahead
  • Do route compute for node i at node i-1
  • Start at VA at each node; overlap NRC & VA

13
Flit and Credit Format
  • Two ways to distinguish credits/flits
  • Piggybacking Credit
  • Include a credit field on each flit
  • No types required
  • Define types
  • e.g. 10 → start credit, 11 → start flit, 0x → idle
  • Flit Format
  • Credit Format

Head Flit: | VC | Type | (Credit) | Route info | Payload | CRC |
Body Flit: | VC | Type | (Credit) | Payload | CRC |
Credit:    | VC | Type | Check |
14
ROUTER COMPONENTS
  • Datapath
  • Input buffer
  • Hold waiting flits
  • Switch
  • Route flits from I/p → o/p
  • Output unit
  • Send flits to downstream
  • Control
  • Arbiter
  • Grant access to shared resources
  • Allocator
  • Allocate VCs to packets and switch time to flits

15
Input Buffer
  • Smoothes down flit traffic
  • Hold flits awaiting
  • VCs
  • Switch BW or
  • Channel BW
  • Organization
  • Centralized
  • Partitioned into physical channels
  • Partitioned into VCs

16
Centralized Input Buffer
  • Combined single memory across entire router
  • No separate switch, but
  • Need to multiplex I/ps to memory
  • Need to demultiplex memory o/p to o/p ports
  • PROs
  • Flexibility in allocating memory space
  • CONs
  • High memory BW requirement
  • 2× # of ports (write from all I/ps + read to
    all o/ps per flit time)
  • Flit deserialization / reserialization latency
  • Need to gather several flits from VCs before
    writing a wide word to MEM

17
Partitioned Input Buffers
  • 1 buffer per physical I/p port
  • Each memory BW = 2 (1 read + 1 write)
  • Buffers shared across VCs for a fixed port
  • Buffers not shared across ports
  • Less flexibility
  • 1 buffer per VC
  • Enable switch I/p speedup
  • Obviously, bigger switch
  • Too fine granularity
  • Inefficient mem usage
  • Intermediate solutions

18
Input Buffer Data Structures
  • Data structures required to
  • Track flit/ packet locations in Memory
  • Manage available free memory
  • Allocate multiple VCs
  • Prevent blocking
  • Two common types
  • Circular buffers
  • Static, simpler yet inefficient mem usage
  • Linked Lists
  • Dynamic, complex, but fairer mem usage
  • Nomenclature
  • Buffer (flit buffer) entire structure
  • Cell (flit cell) storage for a single flit

19
Circular Buffer
  • Fixed First and Last ptrs
  • Specify the memory boundary for a VC
  • Head and Tail specify current content boundary
  • Flit added at tail
  • Tail incremented (modulo)
  • Tail == Head → buffer full
  • Flit removed from head
  • Head incremented (modulo)
  • Head == Tail → buffer empty
  • Choose size N a power of 2 so that the low
    log2(N) bits do the circular increment
  • e.g. like cache line index + byte offset
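A hedged sketch of the circular buffer above: the power-of-2 size makes the modular increment a cheap bit-mask, and an explicit count (my addition, not on the slide) disambiguates the full and empty cases that both leave Head == Tail:

```python
# Sketch of a per-VC circular flit buffer with power-of-2 size.
class CircularFlitBuffer:
    def __init__(self, n):
        assert n & (n - 1) == 0, "size must be a power of 2"
        self.cells = [None] * n
        self.mask = n - 1          # low log2(n) bits do the wrap-around
        self.head = self.tail = 0
        self.count = 0             # disambiguates full from empty

    def push(self, flit):          # flit added at tail
        if self.count == len(self.cells):
            return False           # tail caught head → buffer full
        self.cells[self.tail] = flit
        self.tail = (self.tail + 1) & self.mask
        self.count += 1
        return True

    def pop(self):                 # flit removed from head
        if self.count == 0:
            return None            # head caught tail → buffer empty
        flit = self.cells[self.head]
        self.head = (self.head + 1) & self.mask
        self.count -= 1
        return flit

buf = CircularFlitBuffer(4)
for f in "abcd":
    assert buf.push(f)
assert not buf.push("e")           # full
assert buf.pop() == "a"
```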

20
Linked List Buffer
  • Each cell has a ptr field for next cell
  • Head and Tail specify 1st and last cells
  • NULL for empty buffers
  • Free list: linked list of free cells
  • Free points to head of list
  • Counter registers
  • Count of allocated cells for each buffer
  • Count of cells in free list
  • Bit errors have more severe effect compared to
    circular buffer
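The linked-list organization can be sketched with index-based next-pointers and a shared free list; here −1 plays the role of NULL, and the two counters mirror the slide's allocated-cell and free-list counts (a simplified single-buffer sketch, not the book's design):

```python
# Sketch of a linked-list flit buffer drawing cells from a free list.
class LinkedListBuffer:
    def __init__(self, mem):
        self.next = [i + 1 for i in range(mem)]  # next-cell pointers
        self.next[-1] = -1                       # -1 plays the role of NULL
        self.data = [None] * mem
        self.free = 0                            # head of the free list
        self.free_count = mem                    # cells in the free list
        self.head = self.tail = -1               # empty buffer
        self.count = 0                           # allocated-cell counter

    def push(self, flit):
        if self.free == -1:
            return False                         # no free cells left
        cell, self.free = self.free, self.next[self.free]
        self.free_count -= 1
        self.data[cell], self.next[cell] = flit, -1
        if self.head == -1:
            self.head = cell                     # first cell in buffer
        else:
            self.next[self.tail] = cell          # link after current tail
        self.tail = cell
        self.count += 1
        return True

    def pop(self):
        if self.head == -1:
            return None                          # buffer empty
        cell = self.head
        flit, self.head = self.data[cell], self.next[cell]
        if self.head == -1:
            self.tail = -1
        self.next[cell], self.free = self.free, cell  # recycle the cell
        self.free_count += 1
        self.count -= 1
        return flit

b = LinkedListBuffer(3)
assert b.push("x") and b.push("y")
assert b.pop() == "x" and b.pop() == "y"
```

A corrupted next-pointer here silently re-links the queue, which is why the slide notes that bit errors hit this scheme harder than a circular buffer.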

21
Buffer Memory Allocation
  • Prevent a greedy VC from flooding all memory
    and blocking others!
  • Add a count register to each I/p VC state vector
  • Keep number of allocated cells
  • Additional counter for free list
  • Simple policy: reserve 1 cell for each VC
  • Add flit to buffer[VCi] if buffer[VCi] is empty
    or (# cells in free list) > (# empty VCs)
  • Detailed policy: sliding limit allocator (r =
    reserved cells per buffer, f = fraction of
    empty space to use)
  • Add flit to buffer[VCi] if |buffer[VCi]| < r or
    r ≤ |buffer[VCi]| < f·(# cells in free list) + r
  • f = r = 1 → same as simple policy
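The sliding-limit admission check might be sketched as below; the threshold reading (|buffer| < f·free + r) is my interpretation of the slide's formula, and the parameter defaults are illustrative:

```python
# Sketch of the sliding-limit buffer-allocation predicate:
# r = reserved cells per VC, f = fraction of free space a VC may claim.
def may_add_flit(occupancy, free_cells, r=1, f=0.5):
    """Decide whether a flit may enter this VC's buffer."""
    if occupancy < r:                      # within the reserved cells
        return True
    return occupancy < f * free_cells + r  # within the sliding limit

# A VC under its reservation is always admitted.
assert may_add_flit(occupancy=0, free_cells=0)
# Beyond the reservation, admission shrinks as free space shrinks,
# so no single VC can flood the whole memory.
assert may_add_flit(occupancy=3, free_cells=8, r=1, f=0.5)      # 3 < 5
assert not may_add_flit(occupancy=5, free_cells=8, r=1, f=0.5)  # 5 ≥ 5
```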

22
SWITCH
  • Core directs packets/flits to their destination
  • Speedup = (provided switch BW) / (min. required
    switch BW for full thruput on all I/ps and o/ps
    of the router)
  • Adding speedup simplifies allocation and yields
    higher thruput and lower latency
  • Realizations
  • Bus switch
  • Crossbar
  • Network switch

23
Bus Switches
  • Switches in time
  • Input port accumulates P phits of a flit,
    arbitrates for the bus, transmits P phits over
    the bus to any o/p unit
  • e.g. <fig. 17.5, P=4>
  • Feasible only if flits have # phits ≥ P
  • (preferably an integer multiple of P)
  • Fragmentation loss
  • If phits per flit not a multiple of P

24
Bus timing diagram
25
Bus Pros Cons
  • Simple switch allocation
  • I/p port owning bus can access all o/p ports
  • Multicast made easy
  • Wasted port BW
  • Port BW = b → router BW = Pb → bus BW = Pb, I/p
    deserializer BW = Pb, o/p serializer BW = Pb →
  • Available internal BW = P × Pb = P²b
  • Used bus BW = Pb (speedup = 1)
  • Increased latency
  • 2P worst case <see 17.6, bus timing diagram>
  • Can vary from P+1 to 2P (phit times)

26
Xbar Switches
  • Primary issue speedup
  • 1. k×k → no speedup (fig 17.10(a))
  • 2. sk×k → I/p speedup s (fig 17.10(b))
  • 3. k×sk → o/p speedup s (fig 17.11(a))
  • 4. sk×sk → speedup s (fig 17.11(b))
  • (Speedup simplifies allocation)

27
Xbar Throughput
  • Ex: random separable allocator, I/p speedup s,
    uniform traffic
  • Thruput = P(at least one of the sk flits is
    destined for a given o/p) = 1 − P(none of the
    sk I/ps chooses the given o/p) →
  • Thruput = 1 − ((k−1)/k)^(sk)
  • sk → ∞ ⇒ thruput → 100% (doesn't verify as
    above!!)
  • O/p speedup
  • Need to implement a reverse allocator
  • More complicated for the same gain
  • Overall speedup (both I/p & o/p)
  • Can achieve > 100% thruput
  • Cannot sustain it, since
  • o/p buffers would expand to infinity
  • and I/p buffers would need to start filled with
    infinitely many flits
  • I/p speedup si + o/p speedup so (si > so) →
  • Similar to I/p speedup (si/so), with overall
    speedup so →
  • Thruput
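The throughput bound Thruput = 1 − ((k−1)/k)^(sk) is easy to check numerically (the function name is mine):

```python
# Numeric check of the random-separable-allocator throughput bound
# for a k-port crossbar with input speedup s.
def xbar_throughput(k, s):
    return 1 - ((k - 1) / k) ** (s * k)

# With no speedup, a large switch approaches 1 - 1/e ≈ 63% throughput:
assert 0.63 < xbar_throughput(k=64, s=1) < 0.64
# Input speedup s = 2 recovers most of the loss:
assert xbar_throughput(k=64, s=2) > 0.86
```

As the slide notes, the formula approaches but never reaches 100%, so the "sk → ∞ ⇒ 100%" claim holds only in the limit.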

28
Network Switches
  • A network of smaller switches
  • Reduces # of crosspoints
  • Localizes logic
  • Reduces wire length
  • Requires complex control or intermediate
    buffering
  • Not very profitable!
  • Ex: 7×7 switch as 3 3×3 switches
  • 3×9 = 27 crosspoints instead of 7×7 = 49

29
OUTPUT UNIT
  • Essentially a FIFO to match switch speed
  • If switch o/p speedup = 1
  • merely latch the flits to downstream
  • No need to partition across VCs
  • Provide backpressure to SA to prevent buffer
    overflow
  • SA should block traffic to the choking o/p buffer

30
ARBITER
  • Resolves multiple requests for a single
    resource (N→1)
  • Building block for allocators (N1→N2)
  • Communication and timing
  • Communication and timing

31
Arbiter Types
  • Types
  • Fixed priority: r0 > r1 > r2 > …
  • Variable (iterative) priority: rotate priorities
  • Make a carry chain; a hot 1 inserted from the
    priority inputs
  • e.g. r1 > r2 > … > r0 → (p0,p1,p2,…,pn) = 0100…
  • Matrix: implements an LRS (least recently
    served) scheme
  • Uses a triangular array
  • M(r,c) = 1 → RQr > RQc
  • Queueing: first come, first served
  • <The bank/STA Travel style>
  • Ticket counter
  • Gives current ticket to requester
  • Increments with each ticket
  • Served counter

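The matrix (least-recently-served) arbiter can be sketched as follows; the update rule (clear the winner's row, set its column, dropping it to lowest priority) is the standard one, and the class name is mine:

```python
# Sketch of a matrix LRS arbiter: M[r][c] == 1 means requester r
# currently has priority over requester c.
class MatrixArbiter:
    def __init__(self, n):
        # triangular init: lower-numbered requesters start on top
        self.n = n
        self.M = [[1 if c > r else 0 for c in range(n)] for r in range(n)]

    def arbitrate(self, requests):
        """Grant the requester that beats every other active requester."""
        for r in range(self.n):
            if requests[r] and all(self.M[r][c]
                                   for c in range(self.n)
                                   if c != r and requests[c]):
                # winner drops to lowest priority: clear row, set column
                for c in range(self.n):
                    self.M[r][c] = 0
                    self.M[c][r] = 1
                return r
        return None

arb = MatrixArbiter(3)
assert arb.arbitrate([1, 1, 0]) == 0   # 0 starts above 1
assert arb.arbitrate([1, 1, 0]) == 1   # 0 was just served → 1 now wins
```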
32
ALLOCATOR
  • Provides a matching
  • Multiple resources ↔ multiple requesters
  • e.g. switch allocator
  • Every cycle, match I/p ports → o/p ports
  • 1 flit per I/p port
  • 1 flit goes to each o/p port
  • n×m allocator
  • rij: requester i wants access to resource j
  • gij: requester i granted access to resource j
  • Request & Grant matrices
  • Allocation rules
  • gij → rij (grant only if requested)
  • gij → no other gik (only 1 grant per requester
    I/p)
  • gij → no other gkj (only 1 grant per resource
    o/p)

33
Allocation Problem
  • Can be represented as finding the maximum
    matching grant matrix
  • Also a maximum matching in a bipartite graph
  • Exact algorithms
  • Augmenting path method
  • Always finds maximum matching
  • Not feasible in time budget
  • Faster Heuristics
  • Separable allocators
  • 2 stages of arbitration
  • Across I/ps & across o/ps
  • In either order: I/p-first OR o/p-first

34
4x3 Input-first Separable Allocator