1
Multiprocessors on A Snoopy Bus
2
Agenda
  • Goal is to understand what influences the
    performance, cost and scalability of SMPs
  • Details of physical design of SMPs
  • At least three goals of any design: correctness, performance, low hardware complexity
  • Performance gains are normally achieved by
    pipelining memory transactions and having
    multiple outstanding requests
  • These performance optimizations occasionally
    introduce new protocol races involving transient
    states leading to correctness issues in terms of
    coherence and consistency

3
Correctness goals
  • Must enforce coherence and write serialization
  • Recall that write serialization guarantees all
    writes to a location to be seen in the same order
    by all processors
  • Must obey the target memory consistency model
  • If sequential consistency is the goal, the system
    must provide write atomicity and detect write
    completion correctly (write atomicity extends the
    definition of write serialization for any
    location i.e. it guarantees that positions of
    writes within the total order seen by all
    processors be the same)
  • Must be free of deadlock, livelock and starvation
  • Starvation confined to a part of the system is
    not as problematic as deadlock and livelock
  • However, system-wide starvation leads to livelock

4
A simple design
  • Start with a rather naïve design
  • Each processor has a single level of data and
    instruction caches
  • The cache allows exactly one outstanding miss at a time, i.e. a cache miss request is blocked if another one is already outstanding (this serializes all bus requests from a particular processor)
  • The bus is atomic i.e. it handles one request at
    a time

5
Cache controller
  • Must be able to respond to bus transactions as
    necessary
  • Handled by the snoop logic
  • The snoop logic should have access to the cache
    tags
  • A single set of tags cannot allow concurrent
    accesses by the processor-side and the bus-side
    controllers
  • When the snoop logic accesses the tags the
    processor must remain locked out from accessing
    the tags
  • Possible enhancements: two read ports in the tag RAM allow concurrent reads; duplicate copies are also possible; multiple banks also reduce the contention
  • In all cases, updates to tags must still be atomic or must be applied to both copies in case of duplicate tags; however, tag updates are a lot less frequent compared to reads
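A minimal sketch of the duplicate-tag idea, assuming a simple two-copy organization (names and sizes are illustrative, not from any particular machine): each side reads its own copy without contention, and the infrequent tag updates are applied to both copies under a lock so the copies never diverge.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_SETS 1024   /* illustrative size */

/* Two copies of the tag array: one read by the processor side, one read by
 * the bus-side snoop logic, so lookups do not contend with each other. */
static uint64_t proc_tags[NUM_SETS];
static uint64_t snoop_tags[NUM_SETS];

/* Tag updates (fills, invalidations) are rare; serialize them and apply
 * them to both copies. */
static pthread_mutex_t tag_update_lock = PTHREAD_MUTEX_INITIALIZER;

uint64_t proc_read_tag(int set)  { return proc_tags[set];  }  /* no lock needed */
uint64_t snoop_read_tag(int set) { return snoop_tags[set]; }  /* no lock needed */

void update_tag(int set, uint64_t new_tag)
{
    pthread_mutex_lock(&tag_update_lock);
    proc_tags[set]  = new_tag;
    snoop_tags[set] = new_tag;
    pthread_mutex_unlock(&tag_update_lock);
}
```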

6
Snoop logic
  • A couple of decisions need to be taken while designing the snoop logic
  • How long should the snoop decision take?
  • How should processors convey the snoop decision?
  • Snoop latency (three design choices)
  • Possible to set an upper bound in terms of number of cycles; advantage: no change in memory controller hardware; disadvantage: potentially large snoop latency (Pentium Pro, Sun Enterprise servers)
  • The memory controller samples the snoop results every cycle until all caches have completed snoop (SGI Challenge uses this approach where the memory controller fetches the line from memory, but stalls if all caches haven't yet snooped)
  • Maintain a bit per memory line to indicate if it
    is in M state in some cache

7
Snoop logic
  • Conveying snoop result
  • For MESI the bus is augmented with three wired-OR snoop result lines (shared, modified, valid); the valid line is active low
  • The original Illinois MESI protocol requires cache-to-cache transfer even when the line is in S state; this may complicate the hardware enormously due to the involved priority mechanism
  • Commercial MESI protocols normally allow cache-to-cache sharing only for lines in M state
  • SGI Challenge and Sun Enterprise allow cache-to-cache transfers only in M state; Challenge updates memory when going from M to S, while Enterprise exercises a MOESI protocol

8
Writebacks
  • Writebacks are essentially eviction of modified
    lines
  • Caused by a miss mapping to the same cache index
  • Needs two bus transactions: one for the miss and one for the writeback
  • Definitely the miss should be given first
    priority since this directly impacts forward
    progress of the program
  • Need a writeback buffer (WBB) to hold the evicted
    line until the bus can be acquired for the second
    time by this cache
  • In the meantime a new request from another processor may be launched for the evicted line; the evicting cache must provide the line from the WBB and cancel the pending writeback (need an address comparator with WBB)
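A minimal sketch of the writeback buffer with its snoop address comparator, assuming a single-entry WBB and a 128-byte line (all names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128   /* assumed cache line size */

typedef struct {
    bool     valid;                /* a writeback is pending             */
    uint64_t line_addr;            /* address of the evicted line        */
    uint8_t  data[LINE_BYTES];     /* most up-to-date copy of the line   */
} wbb_t;

/* Called by the snoop logic for every bus request while the writeback is
 * still waiting for the bus.  On a match, the WBB sources the line and the
 * pending writeback is cancelled. */
bool wbb_snoop(wbb_t *wbb, uint64_t snoop_addr, uint8_t *reply)
{
    if (wbb->valid && wbb->line_addr == snoop_addr) {
        memcpy(reply, wbb->data, LINE_BYTES);  /* provide line from WBB   */
        wbb->valid = false;                    /* cancel the writeback    */
        return true;                           /* this cache sourced it   */
    }
    return false;                              /* no match: normal snoop  */
}
```

If the comparator misses, the snoop proceeds against the cache tags as usual and the buffered writeback is issued once the bus is granted.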

9
A simple design
10
Inherently non-atomic
  • Even though the bus is atomic, a complete protocol transaction involves quite a few steps which together form a non-atomic transaction
  • Issuing processor request
  • Looking up cache tags
  • Arbitrating for bus
  • Snoop action in other cache controller
  • Refill in requesting cache controller at the end
  • Different requests from different processors may
    be in a different phase of a transaction
  • This makes a protocol transition inherently
    non-atomic

11
Inherently non-atomic
  • Consider an example
  • P0 and P1 have cache line C in shared state
  • Both proceed to write the line
  • Both cache controllers look up the tags, put a
    BusUpgr into the bus request queue, and start
    arbitrating for the bus
  • P1 gets the bus first and launches its BusUpgr
  • P0 observes the BusUpgr and now it must
    invalidate C in its cache and change the request
    type to BusRdX
  • So every cache controller needs to do an associative lookup of the snoop address against its pending request queue and, depending on the request type, take appropriate actions
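A sketch of that associative lookup, assuming a small pending-request queue and only the BusUpgr-to-BusRdX conversion described above (names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { BUS_RD, BUS_RDX, BUS_UPGR } bus_op_t;

typedef struct {
    bool     valid;
    bus_op_t op;
    uint64_t line_addr;
} pending_req_t;

#define MAX_PENDING 16   /* small number of outstanding requests per processor */

/* Associative lookup of the snooped address against our pending requests. */
void snoop_pending_queue(pending_req_t q[], uint64_t snoop_addr, bus_op_t snoop_op)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!q[i].valid || q[i].line_addr != snoop_addr)
            continue;
        /* Our shared copy is being taken away (S -> I while we wait),
         * so a pending BusUpgr must become a BusRdX. */
        if ((snoop_op == BUS_UPGR || snoop_op == BUS_RDX) && q[i].op == BUS_UPGR)
            q[i].op = BUS_RDX;   /* we now need the data as well as ownership */
    }
}
```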

12
Inherently non-atomic
  • One way to reason about the correctness is to
    introduce transient states
  • Possible to think of the last problem as the line C being in a transient S→M state
  • On observing a BusUpgr or BusRdX, this state transitions to I→M which is also transient
  • The line C goes to stable M state only after the
    transaction completes
  • These transient states are not really encoded in
    the state bits of a cache line because at any
    point in time there will be a small number of
    outstanding requests from a particular processor
    (today the maximum I know of is 16)
  • These states are really determined by the state
    of an outstanding line and the state of the cache
    controller

13
Write serialization
  • Atomic bus makes it rather easy, but
    optimizations are possible
  • Consider a processor write to a shared cache line
  • Is it safe to continue with the write and change
    the state to M even before the bus transaction is
    complete?
  • After the bus transaction is launched it is totally safe because the bus is atomic and hence the position of the write is committed in the total order; therefore there is no need to wait any further (note that the exact point in time when the other caches invalidate the line is not important)
  • If the processor decides to proceed even before
    the bus transaction is launched (very much
    possible in ooo execution), the cache controller
    must take the responsibility of squashing and
    re-executing offending instructions so that the
    total order is consistent across the system

14
Fetch deadlock
  • Just a fancy name for a pretty intuitive deadlock
  • Suppose P0's cache controller is waiting to get the bus for launching a BusRdX to cache line A
  • P1 has a modified copy of cache line A
  • P1 has launched a BusRd to cache line B and
    awaiting completion
  • P0 has a modified copy of cache line B
  • If both keep on waiting without responding to
    snoop requests, the deadlock cycle is pretty
    obvious
  • So every controller must continue to respond to
    snoop requests while waiting for the bus for its
    own requests
  • Normally the cache controller is designed as two
    separate independent logic units, namely, the
    inbound unit (handles snoop requests) and the
    outbound unit (handles own requests and
    arbitrates for bus)

15
Livelock
  • Consider the following example
  • P0 and P1 try to write to the same cache line
  • P0 gets exclusive ownership, fills the line in
    cache and notifies the load/store unit (or
    retirement unit) to retry the store
  • While all this is happening, P1's request appears on the bus and P0's cache controller modifies the tag state to I before the store could retry
  • This can easily lead to a livelock
  • Normally this is avoided by giving the load/store
    unit higher priority for tag access (i.e. the
    snoop logic cannot modify the tag arrays when
    there is a processor access pending in the same
    clock cycle)
  • This is even rarer in multi-level cache hierarchy
    (more later)

16
Starvation
  • Some amount of fairness is necessary in the bus
    arbiter
  • An FCFS policy is possible for granting bus, but
    that needs some buffering in the arbiter to hold
    already placed requests
  • Most machines implement an aging scheme which
    keeps track of the number of times a particular
    request is denied and when the count crosses a
    threshold that request becomes the highest
    priority (this too needs some storage)
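A sketch of such an aging arbiter, assuming a fixed-priority fallback and an illustrative threshold of 4 denials (all parameters are made up for the example):

```c
#include <stdbool.h>

#define NPROC 8
#define AGE_THRESHOLD 4

static int deny_count[NPROC];

/* request[i] is true if processor i wants the bus this cycle.
 * Returns the index of the granted requester, or -1 if none. */
int arbitrate(const bool request[NPROC])
{
    int winner = -1;

    /* First pass: any starved requester gets absolute priority. */
    for (int i = 0; i < NPROC; i++)
        if (request[i] && deny_count[i] >= AGE_THRESHOLD) { winner = i; break; }

    /* Otherwise fall back to a simple fixed priority (lowest index wins). */
    if (winner < 0)
        for (int i = 0; i < NPROC; i++)
            if (request[i]) { winner = i; break; }

    /* Age everyone who asked and lost; reset the winner's counter. */
    for (int i = 0; i < NPROC; i++) {
        if (i == winner)      deny_count[i] = 0;
        else if (request[i])  deny_count[i]++;
    }
    return winner;
}
```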

17
More on LL/SC
  • We have seen that both LL and SC may suffer from
    cache misses (a read followed by an upgrade miss)
  • Is it possible to save one transaction?
  • What if I design my cache controller in such a
    way that it can recognize LL instructions and
    launch a BusRdX instead of BusRd?
  • This is called Read-for-Ownership (RFO); also used by the Intel atomic xchg instruction
  • Nice idea, but you have to be careful
  • By doing this you have just enormously increased the probability of a livelock; before the SC executes there is a high probability that another LL will take away the line
  • Possible solution is to buffer incoming snoop requests until the SC completes (buffer space is proportional to P); may introduce new deadlock cycles (especially for modern non-atomic busses)
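To connect this to ordinary synchronization code, a test-and-test-and-set lock written with C11 atomics is sketched below; on LL/SC ISAs the atomic exchange is typically compiled into an LL/SC retry loop, while on x86 it maps to a locked exchange that performs the read-for-ownership in a single transaction. The spinning plain read keeps the failing processors hitting their shared copy of the line, which is exactly the traffic/livelock concern discussed above. This is an illustrative sketch, not code from the slides.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

void lock(spinlock_t *l)
{
    for (;;) {
        /* Test: spin on the local (possibly S-state) copy without
         * generating bus traffic while the lock is held. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;  /* a real implementation would pause/back off here */

        /* Test-and-set: the atomic exchange needs exclusive ownership
         * of the line (RFO / BusRdX). */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;   /* acquired */
    }
}

void unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```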

18
Multi-level caches
  • We have talked about multi-level caches and the
    involved inclusion property
  • Multiprocessors create new problems related to
    multi-level caches
  • A bus snoop result may be relevant to inner
    levels of cache e.g., bus transactions are not
    visible to the first level cache controller
  • Similarly, modifications made in the first level
    cache may not be visible to the second level
    cache controller which is responsible for
    handling bus requests
  • Inclusion property makes it easier to maintain
    coherence
  • Since L1 cache is a subset of L2 cache a snoop
    miss in L2 cache need not be sent to L1 cache

19
Recap of inclusion
  • A processor read
  • Looks up L1 first and in case of miss goes to L2,
    and finally may need to launch a BusRd request if
    it misses in L2
  • Finally, the line is in S state in both L1 and L2
  • A processor write
  • Looks up L1 first and if it is in I state sends a
    ReadX request to L2 which may have the line in M
    state
  • In case of L2 hit, the line is filled in M state
    in L1
  • In case of L2 miss, if the line is in S state in L2 it launches BusUpgr, otherwise it launches BusRdX; finally, the line is in M state in both L1 and L2
  • If the line is in S state in L1, it sends an upgrade request to L2 and either there is an L2 hit or L2 just conveys the upgrade to the bus (Why can't it get changed to BusRdX?)

20
Recap of inclusion
  • L1 cache replacement
  • Replacement of a line in S state may or may not
    be conveyed to L2
  • Replacement of a line in M state must be sent to
    L2 so that it can hold the most up-to-date copy
  • The line is in I state in L1 after replacement; the state of the line remains unchanged in L2
  • L2 cache replacement
  • Replacement of a line in S state may or may not generate a bus transaction; it must send a notification to the L1 caches so that they can invalidate the line to maintain inclusion
  • Replacement of a line in M state first asks the
    L1 cache to send all the relevant L1 lines (these
    are the most up-to-date copies) and then launches
    a BusWB
  • The state of line in both L1 and L2 is I after
    replacement

21
Recap of inclusion
  • Replacement of a line in E state from L1?
  • Replacement of a line in E state from L2?
  • Replacement of a line in O state from L1?
  • Replacement of a line in O state from L2?
  • In summary
  • A line in S state in L2 may or may not be in L1
    in S state
  • A line in M state in L2 may or may not be in L1 in M state. Why? Can it be in S state?
  • A line in I state in L2 must not be present in L1

22
Inclusion and snoop
  • BusRd snoop
  • Look up L2 cache tag: if in I state, no action; if in S state, no action; if in M state, assert the wired-OR M line, send a read intervention to the L1 data cache, the L1 data cache sends the line back, the L2 controller launches the line on the bus, and both the L1 and L2 lines go to S state
  • BusRdX snoop
  • Look up L2 cache tag: if in I state, no action; if in S state, invalidate and also notify L1; if in M state, assert the wired-OR M line, send a readX intervention to the L1 data cache, the L1 data cache sends the line back, the L2 controller launches the line on the bus, and both the L1 and L2 lines go to I state
  • BusUpgr snoop
  • Similar to BusRdX without the cache line flush
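A sketch of this L2 snoop handler for an inclusive two-level hierarchy with MSI-style states, following the three cases above; the helper functions are hypothetical stand-ins for the real datapath actions:

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { ST_I, ST_S, ST_M } line_state_t;
typedef enum { BUS_RD, BUS_RDX, BUS_UPGR } bus_op_t;

/* Stubs standing in for the real hardware actions. */
static void assert_wired_or_M(void)                { puts("assert M line"); }
static void l1_intervention(uint64_t a, int readX) { printf("L1 %s intervention %llx\n", readX ? "readX" : "read", (unsigned long long)a); }
static void l1_invalidate(uint64_t a)              { printf("L1 invalidate %llx\n", (unsigned long long)a); }
static void flush_line_on_bus(uint64_t a)          { printf("flush %llx\n", (unsigned long long)a); }

void l2_snoop(bus_op_t op, uint64_t addr, line_state_t *state)
{
    if (*state == ST_I)
        return;                              /* I: no action for any snoop */

    if (op == BUS_RD) {
        if (*state == ST_M) {                /* must source the dirty line  */
            assert_wired_or_M();
            l1_intervention(addr, 0);        /* pull latest copy from L1    */
            flush_line_on_bus(addr);
            *state = ST_S;                   /* both L1 and L2 end up in S  */
        }                                    /* S: no action                */
    } else {                                 /* BUS_RDX or BUS_UPGR         */
        if (*state == ST_M) {                /* only reachable for BUS_RDX  */
            assert_wired_or_M();
            l1_intervention(addr, 1);
            if (op == BUS_RDX)
                flush_line_on_bus(addr);     /* BusUpgr: same, minus flush  */
        } else {
            l1_invalidate(addr);             /* S: invalidate, notify L1    */
        }
        *state = ST_I;                       /* both L1 and L2 end up in I  */
    }
}
```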

23
L2 to L1 interventions
  • Two types of interventions
  • One is read/readX intervention that requires data
    reply
  • Other is plain invalidation that does not need
    data reply
  • Data interventions can be eliminated by making L1
    cache write-through
  • But this introduces too much write traffic to L2
  • One possible solution is to have a store buffer that can handle the stores in the background obeying the available BW, so that the processor can proceed independently; this can easily violate sequential consistency unless the store buffer also becomes a part of the snoop logic
  • Useless invalidations can be eliminated by
    introducing an inclusion bit in L2 cache state

24
Invalidation acks?
  • On a BusRdX or BusUpgr, in case of a snoop hit in S state, the L2 cache sends invalidations to the L1 caches
  • Does the snoop logic wait for an invalidation
    acknowledgment from L1 cache before the
    transaction can be marked complete?
  • Do we need a two-phase mechanism?
  • What are the issues?

25
Intervention races
  • Writebacks introduce new races in multi-level
    cache hierarchy
  • Suppose L2 sends a read intervention to L1 and in
    the meantime L1 decides to replace that line (due
    to some conflicting processor access)
  • The intervention will naturally miss the
    up-to-date copy
  • When the writeback arrives at L2, L2 realizes that the intervention race has occurred (need extra hardware to implement this logic; what hardware?)
  • When the intervention reply arrives from L1, L2 can apply the newly received writeback and launch the line on the bus
  • Exactly the same situation may arise even in a uniprocessor if a dirty replacement from L2 misses the line in L1 because L1 just replaced that line too

26
Tag RAM design
  • A multi-level cache hierarchy reduces tag
    contention
  • L1 tags are mostly accessed by the processor
    because L2 cache acts as a filter for external
    requests
  • L2 tags are mostly accessed by the system because
    hopefully L1 cache can absorb most of the
    processor traffic
  • Still, some machines maintain duplicate tags at all levels or at the outermost level only

27
Exclusive cache levels
  • AMD K7 (Athlon XP) and K8 (Athlon64, Opteron)
    architectures chose to have exclusive levels of
    caches instead of inclusive
  • Definitely provides you much better utilization
    of on-chip caches since there is no duplication
  • But complicates many issues related to coherence
  • The uniprocessor protocol is to refill requested lines directly into L1 without placing a copy in L2; only on an L1 eviction put the line into L2; on an L1 miss look up L2 and in case of an L2 hit move the line from L2 into L1 (may have to replace multiple L1 lines to accommodate the full L2 line; not sure what K8 does; possible to maintain an inclusion bit per L1 line sector in L2 cache)
  • For multiprocessors one solution could be to have
    one snoop engine per cache level and a tournament
    logic that selects the successful snoop result

28
Split-transaction bus
  • Atomic bus leads to underutilization of bus
    resources
  • Between the time the address is taken off the bus and the time the snoop responses become available, the bus stays idle
  • Even after the snoop result is available the bus may remain idle due to high memory access latency
  • Split-transaction bus divides each transaction into two parts: request and response
  • Between the request and response of a particular
    transaction there may be other requests and/or
    responses from different transactions
  • Outstanding transactions that have not yet
    started or have completed only one phase are
    buffered in the requesting cache controllers

29
New issues
  • Split-transaction bus introduces new protocol
    races
  • P0 and P1 have a line in S state and both issue
    BusUpgr, say, in consecutive cycles
  • Snoop response arrives later because it takes
    time
  • Now both P0 and P1 may think that they have
    ownership
  • Flow control is important since buffer space is
    finite
  • In-order or out-of-order response?
  • Out-of-order response may better tolerate
    variable memory latency by servicing other
    requests
  • Pentium Pro uses in-order response
  • SGI Challenge and Sun Enterprise use out-of-order
    response i.e. no ordering is enforced

30
SGI Powerpath-2 bus
  • Used in SGI Challenge
  • Conflicts are resolved by not allowing multiple
    bus transactions to the same cache line
  • Allows eight outstanding requests on the bus at
    any point in time
  • Flow control on buffers is provided by negative acknowledgments (NACKs); the bus has a dedicated NACK line which remains asserted if the buffer holding outstanding transactions is full; a NACKed transaction must be retried
  • The request order determines the total order of memory accesses, but the responses may be delivered in a different order depending on their completion times
  • In subsequent slides we call this design
    Powerpath-2 since it is loosely based on that

31
SGI Powerpath-2 bus
  • Logically two separate buses
  • Request bus for launching the command type
    (BusRd, BusWB etc.) and the involved address
  • Response bus for providing the data response, if
    any
  • Since responses may arrive in an order different
    from the request order, a 3-bit tag is assigned
    to each request
  • Responses launch this tag on the tag bus along
    with the data reply so that the address bus may
    be left free for other requests
  • The data bus is 256-bit wide while a cache line
    is 128 bytes
  • One data response phase needs four bus cycles
    along with one additional hardware turnaround
    cycle
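Some back-of-the-envelope arithmetic on these numbers (using the 47.6 MHz Powerpath-2 clock mentioned later in the comparison with the Gigaplane); the bandwidth figures below are derived for illustration, not quoted from the slides:

```c
/* Back-of-the-envelope Powerpath-2 data-bus arithmetic (illustrative). */
#include <stdio.h>

int main(void)
{
    const double clock_hz   = 47.6e6;   /* bus clock (47.6 MHz)               */
    const int    bus_bytes  = 256 / 8;  /* 256-bit data bus = 32 B per cycle  */
    const int    line_bytes = 128;      /* cache line size                    */

    int cycles_per_line = line_bytes / bus_bytes + 1;  /* 4 transfer + 1 turnaround */

    printf("raw data bandwidth : %.2f MB/s\n", clock_hz * bus_bytes / 1e6);
    printf("cycles per line    : %d\n", cycles_per_line);
    printf("line bandwidth     : %.2f MB/s\n",
           line_bytes / (cycles_per_line / clock_hz) / 1e6);
    return 0;
}
```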

32
SGI Powerpath-2 bus
  • Essentially two main buses and various control
    wires for snoop results, flow control etc.
  • Address bus: five-cycle arbitration, used during request
  • Data bus: five-cycle arbitration, five-cycle transfer, used during response
  • Three different transactions may be in one of
    these three phases at any point in time

33
SGI Powerpath-2 bus
  • Forming a total order
  • After the decode cycle during request phase every
    cache controller takes appropriate coherence
    actions i.e. BusRd downgrades M line to S, BusRdX
    invalidates line
  • If a cache controller does not get the tags due
    to contention with the processor it simply
    lengthens the ack phase beyond one cycle
  • Thus the total order is formed during the request
    phase itself i.e. the position of each request in
    the total order is determined at that point

34
SGI Powerpath-2 bus
  • BusWB case
  • BusWB only needs the request phase
  • However needs both address and data lines
    together
  • Must arbitrate for both together
  • BusUpgr case
  • Consists only of the request phase
  • No response or acknowledgment
  • As soon as the ack phase of address arbitration
    is completed by the issuing node, the upgrade has
    sealed a position in the total order and hence is
    marked complete by sending a completion signal to
    the issuing processor by its local bus controller
    (each node has its own bus controller to handle
    bus requests)

35
Bus interface logic
A request table entry is freed when the response
is observed on the bus
36
Snoop results
  • Three snoop wires: shared, modified, inhibit (all wired-OR)
  • The inhibit wire helps in holding off snoop
    responses until the data response is launched on
    the bus
  • Although the request phase determines who will
    source the data i.e. some cache or memory, the
    memory controller does not know it
  • The cache with a modified copy keeps the inhibit line asserted until it gets the data bus and flushes the data; this prevents the memory controller from sourcing the data
  • Otherwise memory controller arbitrates for the
    data bus
  • When the data appears all cache controllers
    appropriately assert the shared and modified line
  • Why not launch snoop results as soon as they are
    available?

37
Conflict resolution
  • Use the pending request table to resolve
    conflicts
  • Every processor has a copy of the table
  • Before arbitrating for the address bus every
    processor looks up the table to see if there is a
    match
  • In case of a match the request is not issued and
    is held in a pending buffer
  • Flow control is needed at different levels
  • Essentially need to detect if any buffer is full
  • SGI Challenge uses a separate NACK line for each
    of address and data phases
  • Before the phases reach the ack cycle any cache controller can assert the NACK line if it runs out of some critical buffer; this invalidates the transaction and the requester must retry (may use back-off and/or priority)
  • Sun Enterprise requires the receiver to generate
    the retry when it has buffer space (thus only one
    retry)
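A sketch of the per-processor copy of the request table used for this conflict check, assuming the eight-entry table mentioned earlier; a match means the new request is held back in a pending buffer instead of being issued (names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_ENTRIES 8        /* eight outstanding bus transactions */

typedef struct {
    bool     valid;
    uint64_t line_addr;
} req_table_entry_t;

typedef struct {
    req_table_entry_t entry[TABLE_ENTRIES];
} req_table_t;

/* Returns true if a transaction to this line is already outstanding on the
 * bus; the caller then holds the new request in a pending buffer instead
 * of arbitrating for the address bus. */
bool conflicts(const req_table_t *t, uint64_t line_addr)
{
    for (int i = 0; i < TABLE_ENTRIES; i++)
        if (t->entry[i].valid && t->entry[i].line_addr == line_addr)
            return true;
    return false;
}
```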

38
Path of a cache miss
  • Assume a read miss
  • Look up the request table; in case of a match with BusRd just mark the entry indicating that this processor will snoop the response from the bus and that it will also assert the shared line
  • In case of a request table hit with BusRdX the cache controller must hold on to the request until the conflict resolves
  • In case of a request table miss the requester arbitrates for the address bus; while arbitrating, if a conflicting request arrives, the controller must put a NOP transaction within the slot it is granted and hold on to the request until the conflict resolves

39
Path of a cache miss
  • Suppose the requester succeeds in putting the
    request on address/command bus
  • Other cache controllers snoop the request, register it in the request table (the requester also does this), and take appropriate coherence action within their own cache hierarchy; main memory also starts fetching the cache line
  • If a cache holds the line in M state it should source it on the bus during the response phase; it keeps the inhibit line asserted until it gets the data bus, then it lowers the inhibit line and asserts the modified line; at this point the memory controller aborts the data fetch/response and instead fields the line from the data bus for writing back

40
Path of a cache miss
  • If the memory fetches the line even before the
    snoop is complete, the inhibit line will not
    allow the memory controller to launch the data on
    bus
  • After the inhibit line is lowered, the memory controller cancels the data response depending on the state of the modified line
  • If no one has the line in M state, the requester
    grabs the response from memory
  • A store miss is similar
  • Only difference is that even if a cache has the
    line in M state, the memory controller does not
    write the response back
  • Also any pending BusUpgr to the same cache line must be converted to BusRdX

41
Write serialization
  • In a split-transaction bus setting, the request
    table provides sufficient support for write
    serialization
  • Requests to the same cache line are not allowed
    to proceed at the same time
  • A read to a line after a write to the same line can be launched only after the write response phase has completed; this guarantees that the read will see the new value
  • A write after a read to the same line can be started only after the read response has completed; this guarantees that the value of the read cannot be altered by the value written

42
Write atomicity and SC
  • Sequential consistency (SC) requires write
    atomicity i.e. total order of all writes seen by
    all processors should be identical
  • Since a BusRdX or BusUpgr does not wait until the
    invalidations are actually applied to the caches,
    you have to be careful
  • P0: A=1; B=1
  • P1: print B; print A
  • Under SC, (A, B) = (0, 1) is not allowed
  • Suppose to start with P1 has the line containing
    A in cache, but not the line containing B
  • The stores of P0 queue the invalidation of A in P1's cache controller
  • P1 takes a read miss for B, but the response of B is re-ordered by P1's cache controller so that it overtakes the invalidation (thinking it may be better to prioritize reads)

43
Another example
  • P0: A=1; print B
  • P1: B=1; print A
  • Under SC, (A, B) = (0, 0) is not allowed (a runnable version of this litmus test is sketched after this list)
  • Same problem if P0 executes both instructions
    first, then P1 executes the write of B (which
    let's assume generates an upgrade so that it is
    marked complete as soon as the address
    arbitration phase finishes), then the upgrade
    completion is re-ordered with the pending
    invalidation of A
  • So, the reason these two cases fail is that the
    new values are made visible before older
    invalidations are applied
  • One solution is to have a strict FIFO queue
    between the bus controller and the cache
    hierarchy
  • But a strict FIFO is not necessary; it is sufficient as long as replies do not overtake invalidations; beyond that, the bus responses can be re-ordered without violating write atomicity and hence SC (e.g., if there are only read and write responses in the queue, it sometimes may make sense to prioritize read responses)
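The second litmus test above can be written as a runnable program using C11 sequentially consistent atomics; under SC (which seq_cst atomics provide) the outcome (A, B) = (0, 0) must never be observed. This is an illustrative test harness, not code from the slides.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int A, B;
static int r_a, r_b;           /* values "printed" by the two threads */

static void *p0(void *arg)
{
    atomic_store(&A, 1);       /* A = 1   (seq_cst by default) */
    r_b = atomic_load(&B);     /* print B */
    return arg;
}

static void *p1(void *arg)
{
    atomic_store(&B, 1);       /* B = 1   */
    r_a = atomic_load(&A);     /* print A */
    return arg;
}

int main(void)
{
    for (int i = 0; i < 100000; i++) {
        pthread_t t0, t1;
        atomic_store(&A, 0);
        atomic_store(&B, 0);
        pthread_create(&t0, NULL, p0, NULL);
        pthread_create(&t1, NULL, p1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r_a == 0 && r_b == 0)
            printf("SC violated at iteration %d\n", i);  /* must not happen */
    }
    return 0;
}
```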

44
In-order response
  • In-order response can simplify quite a few things
    in the design
  • The fully associative request table can be
    replaced by a FIFO queue
  • Conflicting requests where one is a write can
    actually be allowed now (multiple reads were
    allowed even before although only the first one
    actually appears on the bus)
  • Consider a BusRdX followed by a BusRd from two
    different processors
  • With in-order response it is guaranteed that the
    BusRdX response will be granted the data bus
    before the BusRd response (which may not be true
    for ooo response and hence such a conflict is
    disallowed)
  • So when the cache controller generating the
    BusRdX sees the BusRd it only notes that it
    should source the line for this request after its
    own write is completed

45
In-order response
  • The performance penalty may be huge
  • Essentially because of the memory
  • Consider a situation where three requests are
    pending to cache lines A, B, C in that order
  • A and B map to the same memory bank while C is in
    a different bank
  • Although the response for C may be ready long
    before that of B, it cannot get the bus

46
Multi-level caches
  • Split-transaction bus makes the design of
    multi-level caches a little more difficult
  • The usual design is to have queues between levels
    of caches in each direction
  • How do you size the queues? Between the processor and L1 one buffer is sufficient (assume one outstanding processor access), L1-to-L2 needs P+1 buffers (why?), L2-to-L1 needs P buffers (why?), L1 to processor needs one buffer
  • With smaller buffers there is a possibility of deadlock: suppose the L1-to-L2 and L2-to-L1 queues have one entry each, there is a request in the L1-to-L2 queue and there is also an intervention in the L2-to-L1 queue; clearly L1 cannot pick up the intervention because it does not have space to put the reply in the L1-to-L2 queue, while L2 cannot pick up the request because it might need space in the L2-to-L1 queue in case of an L2 hit

47
Multi-level caches
  • Formalizing the deadlock with dependence graph
  • There are four types of transactions in the cache hierarchy: 1. Processor requests (outbound requests), 2. Responses to processor requests (inbound responses), 3. Interventions (inbound requests), 4. Intervention responses (outbound responses)
  • Processor requests need space in the L1-to-L2 queue; responses to processors need space in the L2-to-L1 queue; interventions need space in the L2-to-L1 queue; intervention responses need space in the L1-to-L2 queue
  • Thus a message in the L1-to-L2 queue may need space in the L2-to-L1 queue (e.g. a processor request generating a response due to an L2 hit); also a message in the L2-to-L1 queue may need space in the L1-to-L2 queue (e.g. an intervention response)
  • This creates a cycle in queue space dependence
    graph

48
Dependence graph
  • Represent a queue by a vertex in the graph
  • Number of vertices = number of queues
  • A directed edge from vertex u to vertex v is
    present if a message at the head of queue u may
    generate another message which requires space in
    queue v
  • In our case we have two queues
  • L2→L1 and L1→L2; the graph is not a DAG, hence deadlock

[Figure: two vertices, L2→L1 and L1→L2, with edges in both directions forming a cycle]
49
Multi-level caches
  • In summary
  • L2 cache controller refuses to drain the L1-to-L2 queue if there is no space in the L2-to-L1 queue; this is rather conservative because the message at the head of the L1-to-L2 queue may not need space in the L2-to-L1 queue, e.g., in case of an L2 miss or if it is an intervention reply; but after popping the head of the L1-to-L2 queue it is impossible to backtrack if the message does need space in the L2-to-L1 queue
  • Similarly, L1 cache controller refuses to drain
    L2-to-L1 queue if there is no space in L1-to-L2
    queue
  • How do we break this cycle?
  • Observe that responses for processor requests are
    guaranteed not to generate any more messages and
    intervention requests do not generate new
    requests, but can only generate replies

50
Multi-level caches
  • Solving the queue deadlock
  • Introduce one more queue in each direction i.e.
    have a pair of queues in each direction
  • L1-to-L2 processor request queue and L1-to-L2
    intervention response queue
  • Similarly, L2-to-L1 intervention request queue
    and L2-to-L1 processor response queue
  • Now L2 cache controller can serve L1-to-L2
    processor request queue as long as there is space
    in L2-to-L1 processor response queue, but there
    is no constraint on L1 cache controller for
    draining L2-to-L1 processor response queue
  • Similarly, L1 cache controller can serve L2-to-L1
    intervention request queue as long as there is
    space in L1-to-L2 intervention response queue,
    but L1-to-L2 intervention response queue will
    drain as soon as bus is granted
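A sketch of the resulting drain rules for the four queues, assuming simple counting queues (the names PR, IY, PY, IR follow the next slide); the key property is that the two response queues drain unconditionally, so the request queues can always make progress:

```c
#include <stdbool.h>

typedef struct { int count, capacity; } queue_t;

static bool has_space(const queue_t *q) { return q->count < q->capacity; }
static bool not_empty(const queue_t *q) { return q->count > 0; }

/* L1 -> L2 */
queue_t pr;   /* processor requests          */
queue_t iy;   /* intervention replies        */
/* L2 -> L1 */
queue_t py;   /* processor (request) replies */
queue_t ir;   /* intervention requests       */

/* L2 serves processor requests only if it can sink the possible reply. */
bool l2_may_serve_pr(void) { return not_empty(&pr) && has_space(&py); }
/* Intervention replies go out to the bus and generate no new messages. */
bool l2_may_serve_iy(void) { return not_empty(&iy); }
/* L1 serves interventions only if it can sink the intervention reply. */
bool l1_may_serve_ir(void) { return not_empty(&ir) && has_space(&iy); }
/* Processor replies terminate at the processor, so they always drain. */
bool l1_may_serve_py(void) { return not_empty(&py); }
```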

51
Dependence graph
  • Now we have four queues
  • Processor request (PR) and intervention reply
    (IY) are L1 to L2
  • Processor reply (PY) and intervention request
    (IR) are L2 to L1

[Figure: dependence graph with four vertices PR and IY (L1 to L2) and PY and IR (L2 to L1); PR has an edge to PY and IR has an edge to IY, so there is no cycle]
52
Dependence graph
  • Possible to combine PR and IY into a supernode of
    the graph and still be cycle-free
  • Leads to one L1 to L2 queue
  • Similarly, possible to combine IR and PY into a
    supernode
  • Leads to one L2 to L1 queue
  • Cannot do both
  • Leads to cycle as already discussed
  • Bottom line: need at least three queues for a two-level cache hierarchy

53
Multiple outstanding requests
  • Today all processors allow multiple outstanding
    cache misses
  • We have already discussed issues related to ooo
    execution
  • Not much needs to be added on top of that to
    support multiple outstanding misses
  • For multi-level cache hierarchy the queue depths
    may be made bigger for performance reasons
  • Various other buffers such as writeback buffer
    need to be made bigger

54
SGI Challenge
Supports 36 MIPS R4400 (4 per board) or 18 MIPS R8000 (2 per board) processors. The A-chip has the address bus interface and the request table; the CC-chip handles coherence through the duplicate set of tags; each D-chip handles 64 bits of data, and as a whole 4 D-chips interface to a 256-bit wide data bus.
55
Sun Enterprise
Supports up to 30 UltraSPARC processors, with 2 processors and 1 GB of memory per board. Wide 64-byte memory bus, and hence two memory cycles to transfer the entire cache line (128 bytes).
56
Sun Gigaplane bus
  • Split-transaction, 256 bits data, 41 bits
    address, 83.5 MHz (compare to 47.6 MHz of SGI
    Powerpath-2)
  • Supports 16 boards
  • 112 outstanding transactions (up to 7 from each
    board)
  • Snoop result is available 5 cycles after the
    request phase
  • Memory fetches data speculatively
  • MOESI protocol

57
Special Topics
58
Virtually indexed caches
  • Recall that to have concurrent accesses to TLB
    and cache, L1 caches are often made virtually
    indexed
  • Can read the physical tag and data while the TLB
    lookup takes place
  • Later compare the tag for hit/miss detection
  • How does it impact the functioning of coherence
    protocols and snoop logic?
  • Even for a uniprocessor there is the synonym problem
  • Two different virtual addresses may map to the
    same physical page frame
  • One simple solution may be to flush all cache
    lines mapped to a page frame at the time of
    replacement
  • But this clearly prevents page sharing between
    two processes

59
Virtual indexing
  • Software normally employs page coloring to solve
    the synonym issue
  • Allow two virtual pages to point to the same physical page frame only if the two virtual addresses have the lower k bits in common, where k is equal to the cache line block offset bits plus log2(number of cache sets); a worked example appears after this list
  • This guarantees that in a virtually indexed cache, lines from both pages will map to the same index range
  • What about the snoop logic?
  • Putting virtual address on the bus requires a VA
    to PA translation in the snoop so that physical
    tags can be generated (adds extra latency to
    snoop and also requires duplicate set of
    translations)
  • Putting physical address on the bus requires a
    reverse translation to generate the virtual index
    (requires an inverted page table)
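The worked example referred to above, with illustrative cache and page parameters (32 KB, 2-way, 64-byte lines, 4 KB pages; none of these numbers come from the slides):

```c
/* Page-coloring arithmetic: k = line-offset bits + log2(number of sets).
 * Two virtual pages may map to the same physical frame only if their
 * addresses agree in the low k bits, i.e. they share the same "color"
 * bits above the page offset. */
#include <stdio.h>

static int log2i(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void)
{
    const unsigned cache_bytes = 32 * 1024;  /* assumed 32 KB L1            */
    const unsigned ways        = 2;          /* assumed 2-way associativity */
    const unsigned line_bytes  = 64;         /* assumed 64 B lines          */
    const unsigned page_bytes  = 4096;       /* assumed 4 KB pages          */

    unsigned sets  = cache_bytes / (ways * line_bytes);  /* 256          */
    int k          = log2i(line_bytes) + log2i(sets);    /* 6 + 8 = 14   */
    int color_bits = k - log2i(page_bytes);              /* 14 - 12 = 2  */

    printf("sets = %u, k = %d, color bits = %d\n",
           sets, k, color_bits > 0 ? color_bits : 0);
    return 0;
}
```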

60
Virtual indexing
  • Dual tags (Goodman, 1987)
  • Hardware solution to avoid synonyms in shared
    memory
  • Maintain virtual and physical tags; each corresponding tag pair points to the other
  • Assume no page coloring
  • Use the virtual address to look up the cache (i.e. virtual index and virtual tag) from the processor side; if it hits, everything is fine; if it misses, use the physical address to look up the physical tag and, if it hits, follow the physical-tag-to-virtual-tag pointer to find the index
  • If the virtual tag misses and the physical tag hits, that means the synonym problem has happened, i.e. two different VAs are mapped to the same PA; in this case invalidate the cache line pointed to by the physical tag, replace the line at the virtual index of the current virtual address, place the contents of the invalidated line there, and update the physical tag pointer to point to the new virtual index

61
Virtual indexing
  • Goodman, 1987
  • Always use physical address for snooping
  • Obviates the need for a TLB in memory controller
  • The physical tag is used to look up the cache for
    snoop decision
  • In case of a snoop hit the pointer stored with
    the physical tag is followed to get the virtual
    index and then the cache block can be accessed if
    needed (e.g., in M state)
  • Note that even if there are two different types
    of tags the state of a cache line is the same and
    does not depend on what type of tag is used to
    access the line

62
Virtual indexing
  • Multi-level cache hierarchy
  • Normally the L1 cache is designed to be virtually
    indexed and other levels are physically indexed
  • L2 sends interventions to L1 by communicating the
    PA
  • L1 must determine the virtual index from that to access the cache; dual tags are sufficient for this purpose

63
TLB coherence
  • A page table entry (PTE) may be held in multiple
    processors in shared memory because all of them
    access the same shared page
  • A PTE may get modified when the page is swapped
    out and/or access permissions are changed
  • Must tell all processors having this PTE to
    invalidate
  • How to do it efficiently?
  • No TLB: virtually indexed, virtually tagged L1 caches
  • On an L1 miss directly access the PTE in memory and bring it into the cache; then use normal cache coherence, because the PTEs also reside in the shared memory segment
  • On page replacement the page fault handler can flush the cache line containing the replaced PTE
  • Too impractical: fully virtual caches are rare, and a TLB is still used for the upper levels (Alpha 21264 instruction cache)

64
TLB coherence
  • Hardware solution
  • Extend snoop logic to handle TLB coherence
  • PowerPC family exercises a tlbie instruction (TLB
    invalidate entry)
  • When OS modifies a PTE it puts a tlbie
    instruction on bus
  • Snoop logic picks it up and invalidates the TLB
    entry if present in all processors
  • This is well suited for bus-based SMPs, but not
    for DSMs because broadcast in a large-scale
    machine is not good

65
TLB shootdown
  • Popular TLB coherence solution
  • Invoked by an initiator (the processor which modifies the PTE) by sending interrupts to processors that might be caching the PTE in their TLBs; before doing so the OS also locks the involved PTE to avoid any further access to it in case of TLB misses from other processors
  • The receiver of the interrupt simply invalidates
    the involved PTE if it is present in its TLB and
    sets a flag in shared memory on which the
    initiator polls
  • On completion the initiator unlocks the PTE
  • SGI Origin uses a lazy TLB shootdown i.e. it
    invalidates a TLB entry only when a processor
    tries to access it next time (will discuss in
    detail)
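A sketch of the shootdown handshake described above, written as two routines with C11 atomics; the lock/unlock, PTE-update, IPI, and TLB-invalidate hooks are hypothetical stubs standing in for OS and hardware operations:

```c
#include <stdatomic.h>

#define NPROC 8

typedef struct { atomic_int done[NPROC]; } shootdown_t;

/* Stubs for the OS/hardware hooks (illustrative only). */
static void lock_pte(void *p)             { (void)p; /* mark PTE busy           */ }
static void unlock_pte(void *p)           { (void)p; }
static void modify_pte(void *p)           { (void)p; /* change mapping/perms    */ }
static void send_ipi(int cpu)             { (void)cpu; /* interprocessor intr   */ }
static void local_tlb_invalidate(void *p) { (void)p; /* e.g. invlpg / tlbie     */ }

void initiator(shootdown_t *s, void *pte, int self)
{
    lock_pte(pte);                     /* block further TLB refills of this PTE */
    modify_pte(pte);
    for (int cpu = 0; cpu < NPROC; cpu++) {
        if (cpu == self) continue;
        atomic_store(&s->done[cpu], 0);
        send_ipi(cpu);
    }
    for (int cpu = 0; cpu < NPROC; cpu++)          /* poll on shared flags */
        while (cpu != self && atomic_load(&s->done[cpu]) == 0)
            ;
    unlock_pte(pte);
}

void receiver(shootdown_t *s, void *pte, int self)  /* runs in the IPI handler */
{
    local_tlb_invalidate(pte);         /* drop the entry if cached in this TLB */
    atomic_store(&s->done[self], 1);   /* tell the initiator we are done       */
}
```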

66
Snooping on a ring
  • Length of the bus limits the frequency at which
    it can be clocked which in turn limits the
    bandwidth offered by the bus leading to a limited
    number of processors
  • A ring interconnect provides a better solution
  • Connect a processor only to its two neighbors
  • Short wires, much higher switching frequency,
    better bandwidth, more processors
  • Each node has private local memory (more like a
    distributed shared memory multiprocessor)
  • Every cache line has a home node, i.e. the node whose memory contains this line; the home node can be determined from the upper few bits of the PA
  • Transactions traverse the ring node by node

67
Snooping on a ring
  • Snoop mechanism
  • When a transaction passes by the ring interface
    of a node it snoops the transaction, takes
    appropriate coherence actions, and forwards the
    transaction to its neighbor if necessary
  • The home node also receives the transaction eventually; let's assume that it has a dirty bit associated with every memory line (otherwise you need a two-phase protocol)
  • A request transaction is removed from the ring
    when it comes back to the requester (serves as an
    acknowledgment that every node has seen the
    request)
  • The ring is essentially divided into time slots where a node can insert a new request or response; if there is no free time slot it must wait until one passes by; this is called a slotted ring

68
Snooping on a ring
  • The snoop logic must be able to finish coherence
    actions for a transaction before the next time
    slot arrives
  • The main problem of a ring is the end-to-end
    latency, since the transactions must traverse
    hop-by-hop
  • Serialization and sequential consistency are trickier
  • The order of two transactions may be differently
    seen by two processors if the source of one
    transaction is between the two processors
  • The home node can resort to NACKs if it sees
    conflicting outstanding requests
  • Introduces many races in the protocol

69
Scaling bandwidth
  • Data bandwidth
  • Make the bus wider: costly hardware
  • Replace the bus by a point-to-point crossbar: since only the address portion of a transaction is needed for coherence, the data transaction can be directly between source and destination
  • Add multiple data busses
  • Snoop or coherence bandwidth
  • This is determined by the number of snoop actions
    that can be executed in unit time
  • Having concurrent non-conflicting snoop actions
    definitely helps improve the protocol throughput
  • Multiple address busses: a separate snoop engine is associated with each bus on each node
  • Order the address busses logically to define a
    partial order among concurrent requests so that
    these partial orders can be combined to form a
    total order

70
AMD Opteron
  • Each node contains an x86-64 core, 64 KB L1 data and instruction caches, a 1 MB L2 cache, an on-chip integrated memory controller, three fast routing links called HyperTransport, and local DDR memory
  • Glueless MP: just connect 8 Opteron chips via HT to design a distributed shared memory multiprocessor
  • L2 cache supports 10 outstanding misses

Produced from IEEE Micro
71
AMD Opteron
  • Integrated memory controller and north bridge
    functionality help a lot
  • Can clock the memory controller at processor
    frequency (2 GHz)
  • No need to have a cumbersome motherboard: just buy the Opteron chip and connect it to a few peripherals (system maintenance is much easier)
  • Overall, improves performance by 20-25% over Athlon
  • Snoop throughput and bandwidth is much higher
    since the snoop logic is clocked at 2 GHz
  • Integrated hyperTransport provides very high
    communication bandwidth
  • Point-to-point links, split-transaction and full
    duplex (bidirectional links)
  • On each HT link you can connect a processor or I/O

72
Opteron servers
Produced from IEEE Micro
73
AMD Hammer protocol
  • Opteron uses the snoop-based Hammer protocol
  • First the requester sends a transaction to the home node
  • The home node starts accessing main memory and in
    parallel broadcasts the request to all the nodes
    via point-to-point messages
  • The nodes individually snoop the request, take appropriate coherence actions in their local caches, and send data (if someone has it in M or O state) or an empty completion acknowledgment to the requester; the home memory also sends the data speculatively
  • After gathering all responses the requester sends
    a completion message to the home node so that it
    can proceed with subsequent requests (this
    completion ack may be needed for serializing
    conflicting requests)
  • This is one example of a snoopy protocol over a
    point-to-point interconnect unlike the shared bus
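A sketch of the requester-side gather step of this flow, assuming one response per node and abstracting the point-to-point messaging into a response array (all names and the 64-byte line size are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NNODES     8
#define LINE_BYTES 64

typedef enum { RSP_ACK, RSP_MEM_DATA, RSP_CACHE_DATA } rsp_kind_t;
typedef struct { rsp_kind_t kind; uint8_t data[LINE_BYTES]; } response_t;

/* Gathers the responses for one read miss and assembles the line. */
void gather_responses(const response_t rsp[NNODES], uint8_t line[LINE_BYTES])
{
    bool have_owner_data = false;

    for (int i = 0; i < NNODES; i++) {
        if (rsp[i].kind == RSP_CACHE_DATA) {
            memcpy(line, rsp[i].data, LINE_BYTES);   /* M/O owner data wins   */
            have_owner_data = true;
        } else if (rsp[i].kind == RSP_MEM_DATA && !have_owner_data) {
            memcpy(line, rsp[i].data, LINE_BYTES);   /* speculative memory data */
        }
        /* RSP_ACK carries no data: that node had nothing to contribute. */
    }
    /* At this point the requester sends a completion message to the home
     * node so that conflicting requests to this line can be serialized. */
}
```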