Hardware Microarchitecture Lecture-1 <Ch. 16,17,18,19> - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Hardware Microarchitecture Lecture-1 <Ch.
16,17,18,19>
  • ELE-580i PRESENTATION-I
  • 04/01/2003
  • Canturk ISCI

2
ROUTER ARCHITECTURE
  • Router
  • Registers
  • Switches
  • Functional Units
  • Control Logic
  • Implements
  • Routing Flow Control
  • Pipelined
  • Use credits for buffer space
  • Flits → <downstream>
  • Credits → <upstream>
  • Constitute the credit loop
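The credit loop above can be sketched as a pair of counters: a flit may leave upstream only while a credit is held, and each drained flit returns a credit. This is a minimal illustration under assumed names (`CreditChannel` is mine, not the slides'), not the book's implementation:

```python
# Minimal sketch of credit-based flow control over one channel.
from collections import deque

class CreditChannel:
    def __init__(self, buffers):
        self.credits = buffers        # free flit buffers downstream
        self.downstream = deque()     # flits buffered downstream

    def send_flit(self, flit):
        """Upstream may send only while it holds a credit."""
        if self.credits == 0:
            return False              # credit stall: flit must wait
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def drain_flit(self):
        """Downstream forwards a flit and returns a credit upstream."""
        flit = self.downstream.popleft()
        self.credits += 1             # credit travels back upstream
        return flit

ch = CreditChannel(buffers=2)
assert ch.send_flit("head") and ch.send_flit("body")
assert not ch.send_flit("tail")       # out of credits → stall
ch.drain_flit()                       # credit returned
assert ch.send_flit("tail")
```

The two `deque`/counter updates are the two halves of the credit loop: flits flow downstream, credits flow upstream.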

3
ROUTER Diagram
  • Virtual Channel Router
  • Datapath
  • Input Units Switch Output Units
  • Control
  • Router, VC allocator, Switch Allocator
  • Input Unit
  • State vector (for each VC) + flit buffer (for
    each VC)
  • State vector fields: G, R, O, P, C
  • Output Unit
  • Latches outgoing flits
  • State vector fields: G, I, C
  • Switch
  • Connect I/p to o/p according to SA
  • VCA
  • Arbitrate o/p channel RQs from each I/p packet
  • Once for each packet!!
  • SA
  • Arbitrates o/p port RQs from I/p ports
  • Done for each flit
  • Router
  • Determines o/p ports for packets

4
VC State Fields
  • Input virtual channel
  • G: Global state
  • I, R, V, A, C
  • R: Route
  • O/p port for packet
  • O: O/p VC
  • O/p VC of port R for packet
  • P: Pointers
  • Flit head and tail pointers
  • C: Credit count
  • # of credits for o/p VC R.O
  • Output virtual channel
  • G: Global state
  • I, A, C
  • I: I/p VC
  • I/p port.VC forwarding this o/p VC
  • C: Credit count
  • # of free buffers at the downstream node

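The two state vectors above (G/R/O/P/C per input VC, G/I/C per output VC) can be sketched as records; the field types here are illustrative assumptions, not the book's encoding:

```python
# Sketch of the per-VC state vectors from the slide.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class G(Enum):              # global state of a VC
    IDLE = "I"
    ROUTING = "R"
    VC_ALLOC = "V"
    ACTIVE = "A"
    CREDIT_WAIT = "C"

@dataclass
class InputVCState:
    g: G = G.IDLE                       # G: global state (I, R, V, A, C)
    r: Optional[int] = None             # R: o/p port for the packet
    o: Optional[int] = None             # O: o/p VC of port R
    p: Tuple[Optional[int], Optional[int]] = (None, None)  # P: head/tail ptrs
    c: int = 0                          # C: credits for o/p VC R.O

@dataclass
class OutputVCState:
    g: G = G.IDLE                       # G: global state (I, A, C)
    i: Optional[Tuple[int, int]] = None # I: i/p (port, VC) feeding this VC
    c: int = 0                          # C: free buffers downstream

# One state vector per VC, as the slide's "× (# of VCs)" annotation implies:
input_vcs = [InputVCState() for _ in range(4)]
assert all(vc.g is G.IDLE for vc in input_vcs)
```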
5
How it works
  • 1) Packet → input controller →
  • Router → o/p port (e.g. P3)
  • VCA → o/p VC (e.g. P3.VC2)
  • → Route determined
  • 2) Each flit → input controller →
  • SA → timeslot over switch
  • → Flit forwarded to o/p unit
  • 3) Each flit → output unit →
  • Drives downstream physical channel
  • → Flit transferred

6
Router Pipeline
  • Route Compute
  • Define the o/p port for packet header
  • VC Allocate
  • Assign a VC from the port if available
  • Switch Allocate
  • Schedule switch state according to o/p port
    requests
  • Switch Traverse
  • I/p drives the switch for o/p port
  • Transmit
  • Transmit the flit over downstream channel
  • RC, VA → only for the header flit
  • O/p channel is assigned for the whole packet
  • SA, ST, TX → for all flits
  • Flits from different packets compete continuously
  • Flits Transmitted sequentially for routing in
    next hops
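The stage assignment above (RC and VA once per packet, SA/ST/TX per flit) can be sketched as a toy schedule; the cycle numbering assumes an ideal stall-free packet with one flit entering SA per cycle:

```python
# Sketch of per-flit pipeline occupancy: the head flit sees all five
# stages (RC, VA, SA, ST, TX); body/tail flits skip RC and VA.
HEAD_STAGES = ["RC", "VA", "SA", "ST", "TX"]
BODY_STAGES = ["SA", "ST", "TX"]

def schedule(num_flits):
    """Map each flit to {cycle: stage} for a stall-free traversal."""
    sched = {}
    for i in range(num_flits):
        stages = HEAD_STAGES if i == 0 else BODY_STAGES
        start = 0 if i == 0 else 2 + i      # head occupies SA at cycle 2
        sched[i] = {start + c: s for c, s in enumerate(stages)}
    return sched

s = schedule(3)
assert s[0][2] == "SA"                        # head: RC@0, VA@1, SA@2
assert s[1][3] == "SA" and s[2][4] == "SA"    # one flit into SA per cycle
```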

7
Pipeline Walkthrough
  • (0) <start>
  • P4.VC3 (i/p VC)
  • G=I R=x O=x P=x C=x
  • Packet arrives at i/p port P4
  • Packet header → VC3 →
  • Packet stored in P4.VC3
  • (1) <RC>
  • P4.VC3
  • G=R R=x O=x P=<head>,<tail??> C=x
  • Packet header → Router → select o/p port P3
  • (2) <VA>
  • P4.VC3
  • G=V R=P3 O=x P=<head>,<tail??> C=x
  • P3.VC2 (o/p VC)
  • G=I I=x C=x
  • P3 → VCA → allocate VC for o/p port P3: VC2

8
Pipeline Walkthrough
  • (3) <SA>
  • P4.VC3 (i/p VC)
  • G=A R=P3 O=VC2 P=<head>,<tail??> C
  • P3.VC2 (o/p VC)
  • G=A I=P4.VC3 C
  • Packet processing complete
  • Flit-by-flit switch allocation/traverse/
    transmit
  • Head flit allocated on switch →
  • Move pointers
  • Decrement P4.VC3.Credit
  • Send a credit to upstream node to declare the
    available buffer space
  • (4) <ST>
  • Head flit arrives at output VC
  • (5) <TX>
  • Head flit transmitted to downstream
  • (6) <Tail in SA>
  • Packet done
  • (7) <Release resources>
  • P4.VC3 (i/p VC)

9
Pipeline Stalls
  • Packet Stalls
  • P1) I/p VC Busy stall
  • P2) Routing stall
  • P3) VC Allocation stall
  • Flit Stalls
  • F1) Switch Allocation Stall
  • F2) Buffer empty stall
  • F3) Credit stall
  • Credit return cycles = pipeline(4) + round
    trip(4) + CT(1) + CU(1) + next SA(1) = 11

10
Channel Reallocation
  • 1) Conservative
  • Wait until credit received for tail from
    downstream to reallocate o/p VC
  • 2) Aggressive, single global state
  • Reallocate o/p VC when tail passes SA
  • (same as VA stall)
  • Reallocate downstream I/p VC when tail passes SA
  • (Same as I/p VC busy stall)

11
Channel Reallocation
  • 2) Aggressive, double global state
  • Reallocate o/p VC when tail passes SA
  • (same as VA stall)
  • Eliminate I/p VC busy stall
  • Needs 2 I/p VC state vectors at downstream
  • For A:
  • G=A R=Px O=VCx P=<head A>,<tail A> C
  • For B:
  • G=R R=x O=x P=<head B>,<tail??> C=x

12
Speculation and Lookahead
  • Reduce latency by reducing pipe stages →
    speculation (and lookahead)
  • Speculate virtual channel allocation
  • Do VA and SA concurrently
  • If the VC set from RC spans more than 1 port,
    speculate that as well
  • Lookahead
  • Do route compute for node i at node i-1
  • Start at VA at each node; overlap NRC & VA

13
Flit and Credit Format
  • Two ways to distinguish credits/flits
  • Piggybacking Credit
  • Include a credit field on each flit
  • No types required
  • Define types
  • e.g. 10 → start credit, 11 → start flit, 0x → idle
  • Flit Format
  • Credit Format

Head Flit: | VC | Type | (Credit) | Route info | Payload | CRC |
Body Flit: | VC | Type | (Credit) | Payload | CRC |
Credit:    | VC | Type | Check |
14
ROUTER COMPONENTS
  • Datapath
  • Input buffer
  • Hold waiting flits
  • Switch
  • Route flits from I/p → o/p
  • Output unit
  • Send flits to downstream
  • Control
  • Arbiter
  • Grant access to shared resources
  • Allocator
  • Allocate VCs to packets and switch time to flits

15
Input Buffer
  • Smoothes down flit traffic
  • Hold flits awaiting
  • VCs
  • Switch BW or
  • Channel BW
  • Organization
  • Centralized
  • Partitioned into physical channels
  • Partitioned into VCs

16
Centralized Input Buffer
  • Combined single memory across entire router
  • No separate switch, but
  • Need to multiplex I/ps to memory
  • Need to demultiplex memory o/p to o/p ports
  • PROs
  • Flexibility in allocating memory space
  • CONs
  • High memory BW requirement
  • 2× # of ports (write from all I/ps + read to
    all o/ps per flit time)
  • Flit deserialization / reserialization latency
  • Need to gather several flits from VCs before
    writing a wide word to MEM

17
Partitioned Input Buffers
  • 1 buffer per physical I/p port
  • Each memory BW = 2 (1 read + 1 write)
  • Buffers shared across VCs for a fixed port
  • Buffers not shared across ports
  • Less flexibility
  • 1 buffer per VC
  • Enable switch I/p speedup
  • Obviously, bigger switch
  • Too fine granularity
  • Inefficient mem usage
  • Intermediate solutions

18
Input Buffer Data Structures
  • Data structures required to
  • Track flit/ packet locations in Memory
  • Manage available free memory
  • Allocate multiple VCs
  • Prevent blocking
  • Two common types
  • Circular buffers
  • Static, simpler yet inefficient mem usage
  • Linked Lists
  • Dynamic, complex, but fairer mem usage
  • Nomenclature
  • Buffer (flit buffer) entire structure
  • Cell (flit cell) storage for a single flit

19
Circular Buffer
  • Fixed First and Last ptrs
  • Specify the memory boundary for a VC
  • Head and Tail specify current content boundary
  • Flit added at tail
  • Tail incremented (modulo)
  • Tail == Head → buffer full
  • Flit removed from head
  • Head incremented (modulo)
  • Head == Tail → buffer empty
  • Choose size N a power of 2 so that the low
    log2(N) bits do the circular increment
  • e.g. like cache line index + byte offset
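A hedged sketch of the circular buffer above: the power-of-2 size makes the modular increment a cheap bit-mask, and an explicit count (my addition, not on the slide) disambiguates the full and empty cases that both leave Head == Tail:

```python
# Sketch of a per-VC circular flit buffer with power-of-2 size.
class CircularFlitBuffer:
    def __init__(self, n):
        assert n & (n - 1) == 0, "size must be a power of 2"
        self.cells = [None] * n
        self.mask = n - 1          # low log2(n) bits do the wrap-around
        self.head = self.tail = 0
        self.count = 0             # disambiguates full from empty

    def push(self, flit):          # flit added at tail
        if self.count == len(self.cells):
            return False           # tail caught head → buffer full
        self.cells[self.tail] = flit
        self.tail = (self.tail + 1) & self.mask
        self.count += 1
        return True

    def pop(self):                 # flit removed from head
        if self.count == 0:
            return None            # head caught tail → buffer empty
        flit = self.cells[self.head]
        self.head = (self.head + 1) & self.mask
        self.count -= 1
        return flit

buf = CircularFlitBuffer(4)
for f in "abcd":
    assert buf.push(f)
assert not buf.push("e")           # full
assert buf.pop() == "a"
```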

20
Linked List Buffer
  • Each cell has a ptr field for next cell
  • Head and Tail specify 1st and last cells
  • NULL for empty buffers
  • Free list: linked list of free cells
  • Free points to head of list
  • Counter registers
  • Count of allocated cells for each buffer
  • Count of cells in free list
  • Bit errors have more severe effect compared to
    circular buffer
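The linked-list organization can be sketched with index-based next-pointers and a shared free list; here −1 plays the role of NULL, and the two counters mirror the slide's allocated-cell and free-list counts (a simplified single-buffer sketch, not the book's design):

```python
# Sketch of a linked-list flit buffer drawing cells from a free list.
class LinkedListBuffer:
    def __init__(self, mem):
        self.next = [i + 1 for i in range(mem)]  # next-cell pointers
        self.next[-1] = -1                       # -1 plays the role of NULL
        self.data = [None] * mem
        self.free = 0                            # head of the free list
        self.free_count = mem                    # cells in the free list
        self.head = self.tail = -1               # empty buffer
        self.count = 0                           # allocated-cell counter

    def push(self, flit):
        if self.free == -1:
            return False                         # no free cells left
        cell, self.free = self.free, self.next[self.free]
        self.free_count -= 1
        self.data[cell], self.next[cell] = flit, -1
        if self.head == -1:
            self.head = cell                     # first cell in buffer
        else:
            self.next[self.tail] = cell          # link after current tail
        self.tail = cell
        self.count += 1
        return True

    def pop(self):
        if self.head == -1:
            return None                          # buffer empty
        cell = self.head
        flit, self.head = self.data[cell], self.next[cell]
        if self.head == -1:
            self.tail = -1
        self.next[cell], self.free = self.free, cell  # recycle the cell
        self.free_count += 1
        self.count -= 1
        return flit

b = LinkedListBuffer(3)
assert b.push("x") and b.push("y")
assert b.pop() == "x" and b.pop() == "y"
```

A corrupted next-pointer here silently re-links the queue, which is why the slide notes that bit errors hit this scheme harder than a circular buffer.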

21
Buffer Memory Allocation
  • Prevent a greedy VC from flooding all memory
    and blocking others!
  • Add a count register to each I/p VC state vector
  • Keep number of allocated cells
  • Additional counter for free list
  • Simple policy: reserve 1 cell for each VC
  • Add flit to buffer[VCi] if buffer[VCi] is empty
    or (# cells in free list) > (# empty VCs)
  • Detailed policy: sliding limit allocator (r =
    reserved cells per buffer, f = fraction of
    empty space to use)
  • Add flit to buffer[VCi] if |buffer[VCi]| < r or
    r ≤ |buffer[VCi]| < f·(# cells in free list) + r
  • f = r = 1 → same as simple policy
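The sliding-limit admission check might be sketched as below; the threshold reading (|buffer| < f·free + r) is my interpretation of the slide's formula, and the parameter defaults are illustrative:

```python
# Sketch of the sliding-limit buffer-allocation predicate:
# r = reserved cells per VC, f = fraction of free space a VC may claim.
def may_add_flit(occupancy, free_cells, r=1, f=0.5):
    """Decide whether a flit may enter this VC's buffer."""
    if occupancy < r:                      # within the reserved cells
        return True
    return occupancy < f * free_cells + r  # within the sliding limit

# A VC under its reservation is always admitted.
assert may_add_flit(occupancy=0, free_cells=0)
# Beyond the reservation, admission shrinks as free space shrinks,
# so no single VC can flood the whole memory.
assert may_add_flit(occupancy=3, free_cells=8, r=1, f=0.5)      # 3 < 5
assert not may_add_flit(occupancy=5, free_cells=8, r=1, f=0.5)  # 5 ≥ 5
```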

22
SWITCH
  • Core directs packets/flits to their destination
  • Speedup = (provided switch BW) / (min. required
    switch BW for full thruput on all I/ps and o/ps
    of the router)
  • Adding speedup simplifies allocation and yields
    higher thruput and lower latency
  • Realizations
  • Bus switch
  • Crossbar
  • Network switch

23
Bus Switches
  • Switches in time
  • Input port accumulates P phits of a flit,
    arbitrates for the bus, transmits P phits over
    the bus to any o/p unit
  • e.g. <fig. 17.5, P=4>
  • Feasible only if flits have # phits ≥ P
  • (preferably an integer multiple of P)
  • Fragmentation loss
  • If phits per flit not a multiple of P

24
Bus timing diagram
25
Bus Pros Cons
  • Simple switch allocation
  • I/p port owning bus can access all o/p ports
  • Multicast made easy
  • Wasted port BW
  • Port BW = b → router BW = Pb → bus BW = Pb, I/p
    deserializer BW = Pb, o/p serializer BW = Pb →
  • Available internal BW = P × Pb = P²b
  • Used bus BW = Pb (speedup = 1)
  • Increased latency
  • 2P worst case <see 17.6, bus timing diagram>
  • Can vary from P+1 to 2P (phit times)

26
Xbar Switches
  • Primary issue speedup
  • 1. k×k → no speedup (fig 17.10(a))
  • 2. sk×k → I/p speedup s (fig 17.10(b))
  • 3. k×sk → o/p speedup s (fig 17.11(a))
  • 4. sk×sk → speedup s (fig 17.11(b))
  • (Speedup simplifies allocation)

27
Xbar Throughput
  • Ex: random separable allocator, I/p speedup s,
    uniform traffic
  • Thruput = P(at least one of the sk flits is
    destined for a given o/p) = 1 − P(none of the
    sk I/ps chooses the given o/p) →
  • Thruput = 1 − ((k−1)/k)^(sk)
  • sk → ∞ ⇒ thruput → 100% (doesn't verify as
    above!!)
  • O/p speedup
  • Need to implement a reverse allocator
  • More complicated for the same gain
  • Overall speedup (both I/p & o/p)
  • Can achieve > 100% thruput
  • Cannot sustain it, since
  • o/p buffers would expand to infinity
  • and I/p buffers would need to start filled with
    infinitely many flits
  • I/p speedup si + o/p speedup so (si > so) →
  • Similar to I/p speedup (si/so), with overall
    speedup so →
  • Thruput
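The throughput bound Thruput = 1 − ((k−1)/k)^(sk) is easy to check numerically (the function name is mine):

```python
# Numeric check of the random-separable-allocator throughput bound
# for a k-port crossbar with input speedup s.
def xbar_throughput(k, s):
    return 1 - ((k - 1) / k) ** (s * k)

# With no speedup, a large switch approaches 1 - 1/e ≈ 63% throughput:
assert 0.63 < xbar_throughput(k=64, s=1) < 0.64
# Input speedup s = 2 recovers most of the loss:
assert xbar_throughput(k=64, s=2) > 0.86
```

As the slide notes, the formula approaches but never reaches 100%, so the "sk → ∞ ⇒ 100%" claim holds only in the limit.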

28
Network Switches
  • A network of smaller switches
  • Reduces # of crosspoints
  • Localizes logic
  • Reduces wire length
  • Requires complex control or intermediate
    buffering
  • Not very profitable!
  • Ex: 7×7 switch as 3 3×3 switches
  • 3×9 = 27 crosspoints instead of 7×7 = 49

29
OUTPUT UNIT
  • Essentially a FIFO to match switch speed
  • If switch o/p speedup = 1
  • merely latch the flits to downstream
  • No need to partition across VCs
  • Provide backpressure to SA to prevent buffer
    overflow
  • SA should block traffic to the choking o/p buffer

30
ARBITER
  • Resolves multiple requests for a single
    resource (N→1)
  • Building block for allocators (N1→N2)
  • Communication and timing
  • Communication and timing

31
Arbiter Types
  • Types
  • Fixed priority: r0 > r1 > r2 > …
  • Variable (iterative) priority: rotate priorities
  • Make a carry chain; a hot 1 inserted from the
    priority inputs
  • e.g. r1 > r2 > … > r0 → (p0,p1,p2,…,pn) = 0100…
  • Matrix: implements an LRS (least recently
    served) scheme
  • Uses a triangular array
  • M(r,c) = 1 → RQr > RQc
  • Queueing: first come, first served
  • <The bank/STA Travel style>
  • Ticket counter
  • Gives current ticket to requester
  • Increments with each ticket
  • Served counter

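The matrix (least-recently-served) arbiter can be sketched as follows; the update rule (clear the winner's row, set its column, dropping it to lowest priority) is the standard one, and the class name is mine:

```python
# Sketch of a matrix LRS arbiter: M[r][c] == 1 means requester r
# currently has priority over requester c.
class MatrixArbiter:
    def __init__(self, n):
        # triangular init: lower-numbered requesters start on top
        self.n = n
        self.M = [[1 if c > r else 0 for c in range(n)] for r in range(n)]

    def arbitrate(self, requests):
        """Grant the requester that beats every other active requester."""
        for r in range(self.n):
            if requests[r] and all(self.M[r][c]
                                   for c in range(self.n)
                                   if c != r and requests[c]):
                # winner drops to lowest priority: clear row, set column
                for c in range(self.n):
                    self.M[r][c] = 0
                    self.M[c][r] = 1
                return r
        return None

arb = MatrixArbiter(3)
assert arb.arbitrate([1, 1, 0]) == 0   # 0 starts above 1
assert arb.arbitrate([1, 1, 0]) == 1   # 0 was just served → 1 now wins
```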
32
ALLOCATOR
  • Provides a matching
  • Multiple resources ↔ multiple requesters
  • e.g. switch allocator
  • Every cycle, match I/p ports → o/p ports
  • 1 flit per I/p port
  • 1 flit goes to each o/p port
  • n×m allocator
  • rij: requester i wants access to resource j
  • gij: requester i granted access to resource j
  • Request & Grant matrices
  • Allocation rules
  • gij → rij (grant only if requested)
  • gij → no other gik (only 1 grant per requester
    I/p)
  • gij → no other gkj (only 1 grant per resource
    o/p)

33
Allocation Problem
  • Can be represented as finding the maximum
    matching grant matrix
  • Also a maximum matching in a bipartite graph
  • Exact algorithms
  • Augmenting path method
  • Always finds maximum matching
  • Not feasible in time budget
  • Faster Heuristics
  • Separable allocators
  • 2 stages of arbitration
  • Across I/ps & across o/ps
  • In either order: I/p-first OR o/p-first

34
4x3 Input-first Separable Allocator