Title: Multiprocessors on a Snoopy Bus
1. Multiprocessors on a Snoopy Bus
2. Agenda
- Goal is to understand what influences the performance, cost and scalability of SMPs
- Details of the physical design of SMPs
- At least three goals of any design: correctness, performance, low hardware complexity
- Performance gains are normally achieved by pipelining memory transactions and having multiple outstanding requests
- These performance optimizations occasionally introduce new protocol races involving transient states, leading to correctness issues in terms of coherence and consistency
3. Correctness goals
- Must enforce coherence and write serialization
- Recall that write serialization guarantees that all writes to a location are seen in the same order by all processors
- Must obey the target memory consistency model
- If sequential consistency is the goal, the system must provide write atomicity and detect write completion correctly (write atomicity extends the definition of write serialization to any location i.e. it guarantees that the positions of writes within the total order seen by all processors are the same)
- Must be free of deadlock, livelock and starvation
- Starvation confined to a part of the system is not as problematic as deadlock and livelock
- However, system-wide starvation leads to livelock
4. A simple design
- Start with a rather naïve design
- Each processor has a single level of data and instruction caches
- The cache allows exactly one outstanding miss at a time i.e. a cache miss request is blocked if another is already outstanding (this serializes all bus requests from a particular processor)
- The bus is atomic i.e. it handles one request at a time
5. Cache controller
- Must be able to respond to bus transactions as necessary
- Handled by the snoop logic
- The snoop logic should have access to the cache tags
- A single set of tags cannot allow concurrent accesses by the processor-side and the bus-side controllers
- When the snoop logic accesses the tags, the processor must remain locked out from accessing the tags
- Possible enhancements: two read ports in the tag RAM allow concurrent reads; duplicate copies are also possible; multiple banks also reduce contention
- In all cases, updates to tags must still be atomic or must be applied to both copies in case of duplicate tags; however, tag updates are a lot less frequent compared to reads
6. Snoop logic
- A couple of decisions need to be made while designing the snoop logic
- How long should the snoop decision take?
- How should processors convey the snoop decision?
- Snoop latency (three design choices)
- Possible to set an upper bound in terms of number of cycles; advantage: no change in memory controller hardware; disadvantage: potentially large snoop latency (Pentium Pro, Sun Enterprise servers)
- The memory controller samples the snoop results every cycle until all caches have completed the snoop (SGI Challenge uses this approach where the memory controller fetches the line from memory, but stalls if all caches haven't yet snooped)
- Maintain a bit per memory line to indicate if it is in M state in some cache
7. Snoop logic
- Conveying the snoop result
- For MESI the bus is augmented with three wired-OR snoop result lines (shared, modified, valid); the valid line is active low
- The original Illinois MESI protocol requires cache-to-cache transfer even when the line is in S state; this may complicate the hardware enormously due to the involved priority mechanism
- Commercial MESI protocols normally allow cache-to-cache sharing only for lines in M state
- SGI Challenge and Sun Enterprise allow cache-to-cache transfers only in M state; Challenge updates memory when going from M to S while Enterprise exercises a MOESI protocol
8. Writebacks
- Writebacks are essentially evictions of modified lines
- Caused by a miss mapping to the same cache index
- Needs two bus transactions: one for the miss and one for the writeback
- The miss should definitely be given first priority since this directly impacts forward progress of the program
- Need a writeback buffer (WBB) to hold the evicted line until the bus can be acquired for the second time by this cache
- In the meantime a new request from another processor may be launched for the evicted line; the evicting cache must provide the line from the WBB and cancel the pending writeback (need an address comparator on the WBB; see the sketch below)
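To make the WBB race concrete, here is a minimal C sketch of the snoop-side address check, under the assumption of a single-entry buffer and a 128-byte line; the struct and function names are hypothetical, not from any real controller.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128  /* illustrative cache line size */

/* One-entry writeback buffer (WBB); names are hypothetical */
struct wbb {
    bool     valid;            /* a victim line is waiting for the bus */
    uint64_t line_addr;        /* address of the evicted (modified) line */
    uint8_t  data[LINE_BYTES]; /* the dirty data itself */
};

/* Snoop-side check: if an incoming bus request hits the WBB, the
 * evicting cache must source the line from the buffer and cancel the
 * pending BusWB, since responsibility for the data now passes to the
 * requester (or memory). */
bool wbb_snoop(struct wbb *w, uint64_t snoop_addr, uint8_t *reply)
{
    if (w->valid && w->line_addr == snoop_addr) {
        memcpy(reply, w->data, LINE_BYTES); /* flush line from WBB */
        w->valid = false;                   /* cancel pending writeback */
        return true;                        /* we sourced the data */
    }
    return false;  /* memory (or another cache) responds instead */
}
```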
9. A simple design
10. Inherently non-atomic
- Even though the bus is atomic, a complete protocol transaction involves quite a few steps which together form a non-atomic transaction
- Issuing the processor request
- Looking up cache tags
- Arbitrating for the bus
- Snoop action in the other cache controllers
- Refill in the requesting cache controller at the end
- Different requests from different processors may be in different phases of a transaction
- This makes a protocol transition inherently non-atomic
11. Inherently non-atomic
- Consider an example
- P0 and P1 have cache line C in shared state
- Both proceed to write the line
- Both cache controllers look up the tags, put a BusUpgr into the bus request queue, and start arbitrating for the bus
- P1 gets the bus first and launches its BusUpgr
- P0 observes the BusUpgr and now it must invalidate C in its cache and change the request type to BusRdX
- So every cache controller needs to do an associative lookup of the snoop address against its pending request queue and, depending on the request type, take appropriate actions (see the sketch below)
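A minimal C sketch of that associative check, assuming a simple array-based pending-request queue; all identifiers here are hypothetical. It shows only the BusUpgr-to-BusRdX conversion from the example above.

```c
#include <stdint.h>

enum bus_op { BUS_RD, BUS_RDX, BUS_UPGR };

struct pending_req {
    int         valid;
    uint64_t    line_addr;
    enum bus_op op;
};

/* On every snooped transaction, match the snoop address against our
 * pending requests. A pending BusUpgr whose shared copy has just been
 * invalidated by someone else's BusUpgr/BusRdX must be converted to a
 * BusRdX, because it now needs the data as well as ownership. */
void snoop_pending_queue(struct pending_req *q, int n,
                         uint64_t snoop_addr, enum bus_op snoop_op)
{
    for (int i = 0; i < n; i++) {
        if (!q[i].valid || q[i].line_addr != snoop_addr)
            continue;
        if (q[i].op == BUS_UPGR &&
            (snoop_op == BUS_UPGR || snoop_op == BUS_RDX)) {
            q[i].op = BUS_RDX;  /* our S copy is gone: fetch the data too */
        }
    }
}
```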
12. Inherently non-atomic
- One way to reason about correctness is to introduce transient states
- Possible to think of the last problem as the line C being in a transient S→M state
- On observing a BusUpgr or BusRdX, this state transitions to I→M which is also transient
- The line C goes to the stable M state only after the transaction completes
- These transient states are not really encoded in the state bits of a cache line because at any point in time there will be only a small number of outstanding requests from a particular processor (today the maximum I know of is 16)
- These states are really determined by the state of an outstanding line and the state of the cache controller
13. Write serialization
- The atomic bus makes it rather easy, but optimizations are possible
- Consider a processor write to a shared cache line
- Is it safe to continue with the write and change the state to M even before the bus transaction is complete?
- After the bus transaction is launched it is totally safe because the bus is atomic and hence the position of the write is committed in the total order; therefore there is no need to wait any further (note that the exact point in time when the other caches invalidate the line is not important)
- If the processor decides to proceed even before the bus transaction is launched (very much possible in out-of-order execution), the cache controller must take the responsibility of squashing and re-executing offending instructions so that the total order is consistent across the system
14. Fetch deadlock
- Just a fancy name for a pretty intuitive deadlock
- Suppose P0's cache controller is waiting to get the bus for launching a BusRdX to cache line A
- P1 has a modified copy of cache line A
- P1 has launched a BusRd to cache line B and is awaiting completion
- P0 has a modified copy of cache line B
- If both keep waiting without responding to snoop requests, the deadlock cycle is pretty obvious
- So every controller must continue to respond to snoop requests while waiting for the bus for its own requests
- Normally the cache controller is designed as two separate independent logic units, namely, the inbound unit (handles snoop requests) and the outbound unit (handles its own requests and arbitrates for the bus)
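A small C sketch of the key property of that split, assuming the controller state can be modeled as a few flags (all names hypothetical): the inbound unit is serviced unconditionally every cycle, so a controller stalled on bus arbitration still answers snoops, breaking the fetch-deadlock cycle above.

```c
#include <stdbool.h>

/* Hypothetical controller state, for illustration only */
struct cache_ctrl {
    bool snoop_pending;    /* inbound: a bus transaction to respond to */
    bool own_req_pending;  /* outbound: our request waiting for the bus */
    bool bus_granted;      /* arbiter granted us the bus this cycle */
};

/* One clock of the controller. The inbound unit never waits on the
 * outbound unit: even while we wait for the bus to launch our BusRdX,
 * we keep sourcing modified lines for other processors. */
void controller_cycle(struct cache_ctrl *c)
{
    if (c->snoop_pending) {
        /* ... look up tags, source data or invalidate ... */
        c->snoop_pending = false;
    }
    if (c->own_req_pending && c->bus_granted) {
        /* ... launch BusRd/BusRdX/BusUpgr ... */
        c->own_req_pending = false;
    }
}
```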
15. Livelock
- Consider the following example
- P0 and P1 try to write to the same cache line
- P0 gets exclusive ownership, fills the line in its cache and notifies the load/store unit (or retirement unit) to retry the store
- While all this is happening, P1's request appears on the bus and P0's cache controller changes the tag state to I before the store could retry
- This can easily lead to a livelock
- Normally this is avoided by giving the load/store unit higher priority for tag access (i.e. the snoop logic cannot modify the tag arrays when there is a processor access pending in the same clock cycle)
- This is even rarer in a multi-level cache hierarchy (more later)
16. Starvation
- Some amount of fairness is necessary in the bus arbiter
- An FCFS policy is possible for granting the bus, but that needs some buffering in the arbiter to hold already placed requests
- Most machines implement an aging scheme which keeps track of the number of times a particular request is denied; when the count crosses a threshold, that request becomes the highest priority (this too needs some storage; see the sketch below)
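A minimal sketch of such an aging arbiter in C. The threshold, the round-robin tie-break below the threshold, and the lowest-index tie-break among starved requesters are all assumptions for illustration, not details of any particular machine.

```c
#include <stdint.h>

#define NREQ 8        /* requesters on the bus */
#define AGE_LIMIT 4   /* denials before a request becomes top priority */

/* Grant one requester per cycle. pending[i] is nonzero if requester i
 * wants the bus; age[i] counts how many times it has been denied.
 * Returns the winner's index, or -1 if nobody is requesting. */
int arbitrate(const int pending[NREQ], uint8_t age[NREQ], int last_winner)
{
    int winner = -1;

    /* starved requests win unconditionally (lowest index first) */
    for (int i = 0; i < NREQ; i++)
        if (pending[i] && age[i] >= AGE_LIMIT) { winner = i; break; }

    /* otherwise simple round-robin from the last winner */
    if (winner < 0)
        for (int k = 1; k <= NREQ; k++) {
            int i = (last_winner + k) % NREQ;
            if (pending[i]) { winner = i; break; }
        }

    /* age every denied requester, reset the winner's counter */
    for (int i = 0; i < NREQ; i++)
        if (pending[i] && i != winner && age[i] < 255) age[i]++;
    if (winner >= 0) age[winner] = 0;
    return winner;
}
```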
17. More on LL/SC
- We have seen that both LL and SC may suffer from cache misses (a read followed by an upgrade miss)
- Is it possible to save one transaction?
- What if I design my cache controller in such a way that it can recognize LL instructions and launch a BusRdX instead of BusRd? This is called Read-for-Ownership (RFO); it is also used by the Intel atomic xchg instruction
- Nice idea, but you have to be careful
- By doing this you have just enormously increased the probability of a livelock: before the SC executes there is a high probability that another LL will take away the line (see the sketch below)
- A possible solution is to buffer incoming snoop requests until the SC completes (buffer space is proportional to P), but this may introduce new deadlock cycles (especially for modern non-atomic busses)
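To show where the livelock window sits, here is a C sketch of the classic LL/SC lock acquire. The `load_linked`/`store_conditional` functions are single-threaded models of the instructions (real code is inline assembly, and the reservation is actually cleared by a snoop hit on the linked line); all names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

static volatile uint32_t *reservation = NULL; /* model of the link register */

static uint32_t load_linked(volatile uint32_t *addr)
{
    reservation = addr;  /* BusRd, or BusRdX if the controller uses RFO */
    return *addr;
}

static bool store_conditional(volatile uint32_t *addr, uint32_t v)
{
    if (reservation != addr)
        return false;    /* line was stolen between LL and SC */
    *addr = v;
    reservation = NULL;
    return true;
}

/* With RFO the LL already brings the line in M state, saving the later
 * upgrade; but another processor's RFO between our LL and SC steals the
 * line and fails our SC. The window between the two loops below is
 * exactly where snoops may need to be buffered to avoid livelock. */
void lock_acquire(volatile uint32_t *lock)
{
    uint32_t v;
    do {
        do {
            v = load_linked(lock);
        } while (v != 0);                   /* spin while the lock is held */
    } while (!store_conditional(lock, 1));  /* retry if the line was lost */
}
```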
18. Multi-level caches
- We have talked about multi-level caches and the involved inclusion property
- Multiprocessors create new problems related to multi-level caches
- A bus snoop result may be relevant to inner levels of the cache hierarchy, e.g., bus transactions are not visible to the first level cache controller
- Similarly, modifications made in the first level cache may not be visible to the second level cache controller which is responsible for handling bus requests
- The inclusion property makes it easier to maintain coherence
- Since the L1 cache is a subset of the L2 cache, a snoop miss in the L2 cache need not be sent to the L1 cache
19. Recap of inclusion
- A processor read
- Looks up L1 first and in case of a miss goes to L2, and finally may need to launch a BusRd request if it misses in L2
- Finally, the line is in S state in both L1 and L2
- A processor write
- Looks up L1 first and if the line is in I state sends a ReadX request to L2 which may have the line in M state
- In case of an L2 hit, the line is filled in M state in L1
- In case of an L2 miss, if the line is in S state in L2 it launches BusUpgr, otherwise it launches BusRdX; finally, the line is in state M in both L1 and L2
- If the line is in S state in L1, it sends an upgrade request to L2 and either there is an L2 hit or L2 just conveys the upgrade to the bus (Why can't it get changed to BusRdX?)
20. Recap of inclusion
- L1 cache replacement
- Replacement of a line in S state may or may not be conveyed to L2
- Replacement of a line in M state must be sent to L2 so that it can hold the most up-to-date copy
- The line is in I state in L1 after replacement; the state of the line remains unchanged in L2
- L2 cache replacement
- Replacement of a line in S state may or may not generate a bus transaction; it must send a notification to the L1 caches so that they can invalidate the line to maintain inclusion
- Replacement of a line in M state first asks the L1 cache to send all the relevant L1 lines (these are the most up-to-date copies) and then launches a BusWB
- The state of the line in both L1 and L2 is I after replacement
21. Recap of inclusion
- Replacement of a line in E state from L1?
- Replacement of a line in E state from L2?
- Replacement of a line in O state from L1?
- Replacement of a line in O state from L2?
- In summary
- A line in S state in L2 may or may not be in L1 in S state
- A line in M state in L2 may or may not be in L1 in M state (Why? Can it be in S state?)
- A line in I state in L2 must not be present in L1
22. Inclusion and snoop
- BusRd snoop
- Look up the L2 cache tag; if in I state, no action; if in S state, no action; if in M state, assert the wired-OR M line, send a read intervention to the L1 data cache, the L1 data cache sends the lines back, the L2 controller launches the line on the bus, and both the L1 and L2 lines go to S state
- BusRdX snoop
- Look up the L2 cache tag; if in I state, no action; if in S state, invalidate and also notify L1; if in M state, assert the wired-OR M line, send a readX intervention to the L1 data cache, the L1 data cache sends the lines back, the L2 controller launches the line on the bus, and both the L1 and L2 lines go to I state
- BusUpgr snoop
- Similar to BusRdX without the cache line flush
23. L2 to L1 interventions
- Two types of interventions
- One is the read/readX intervention that requires a data reply
- The other is plain invalidation that does not need a data reply
- Data interventions can be eliminated by making the L1 cache write-through
- But that introduces too much write traffic to L2
- One possible solution is to have a store buffer that can handle the stores in the background obeying the available bandwidth, so that the processor can proceed independently; this can easily violate sequential consistency unless the store buffer also becomes a part of the snoop logic
- Useless invalidations can be eliminated by introducing an inclusion bit in the L2 cache state
24. Invalidation acks?
- On a BusRdX or BusUpgr, in case of a snoop hit in S state the L2 cache sends invalidations to the L1 caches
- Does the snoop logic wait for an invalidation acknowledgment from the L1 cache before the transaction can be marked complete?
- Do we need a two-phase mechanism?
- What are the issues?
25. Intervention races
- Writebacks introduce new races in a multi-level cache hierarchy
- Suppose L2 sends a read intervention to L1 and in the meantime L1 decides to replace that line (due to some conflicting processor access)
- The intervention will naturally miss the up-to-date copy
- When the writeback arrives at L2, L2 realizes that the intervention race has occurred (need extra hardware to implement this logic; what hardware?)
- When the intervention reply arrives from L1, L2 can apply the newly received writeback and launch the line on the bus
- Exactly the same situation may arise even in a uniprocessor if a dirty replacement from L2 misses the line in L1 because L1 just replaced that line too
26. Tag RAM design
- A multi-level cache hierarchy reduces tag contention
- L1 tags are mostly accessed by the processor because the L2 cache acts as a filter for external requests
- L2 tags are mostly accessed by the system because hopefully the L1 cache can absorb most of the processor traffic
- Still, some machines maintain duplicate tags at all levels or at the outermost level only
27. Exclusive cache levels
- AMD K7 (Athlon XP) and K8 (Athlon64, Opteron) architectures chose to have exclusive levels of caches instead of inclusive
- Definitely provides much better utilization of on-chip caches since there is no duplication
- But complicates many issues related to coherence
- The uniprocessor protocol is to refill requested lines directly into L1 without placing a copy in L2; only on an L1 eviction put the line into L2; on an L1 miss look up L2 and in case of an L2 hit replace the line from L2 and put it in L1 (may have to replace multiple L1 lines to accommodate the full L2 line; not sure what K8 does; possible to maintain an inclusion bit per L1 line sector in the L2 cache)
- For multiprocessors one solution could be to have one snoop engine per cache level and a tournament logic that selects the successful snoop result
28. Split-transaction bus
- An atomic bus leads to underutilization of bus resources
- Between the time the address is taken off the bus and the snoop responses are available, the bus stays idle
- Even after the snoop result is available the bus may remain idle due to high memory access latency
- A split-transaction bus divides each transaction into two parts: request and response
- Between the request and response of a particular transaction there may be other requests and/or responses from different transactions
- Outstanding transactions that have not yet started or have completed only one phase are buffered in the requesting cache controllers
29. New issues
- The split-transaction bus introduces new protocol races
- P0 and P1 have a line in S state and both issue BusUpgr, say, in consecutive cycles
- The snoop response arrives later because it takes time
- Now both P0 and P1 may think that they have ownership
- Flow control is important since buffer space is finite
- In-order or out-of-order response?
- Out-of-order response may better tolerate variable memory latency by servicing other requests
- Pentium Pro uses in-order response
- SGI Challenge and Sun Enterprise use out-of-order response i.e. no ordering is enforced
30. SGI Powerpath-2 bus
- Used in SGI Challenge
- Conflicts are resolved by not allowing multiple bus transactions to the same cache line
- Allows eight outstanding requests on the bus at any point in time
- Flow control on buffers is provided by negative acknowledgments (NACKs): the bus has a dedicated NACK line which remains asserted if the buffer holding outstanding transactions is full; a NACKed transaction must be retried
- The request order determines the total order of memory accesses, but the responses may be delivered in a different order depending on their completion times
- In subsequent slides we call this design Powerpath-2 since it is loosely based on that
31. SGI Powerpath-2 bus
- Logically two separate buses
- A request bus for launching the command type (BusRd, BusWB etc.) and the involved address
- A response bus for providing the data response, if any
- Since responses may arrive in an order different from the request order, a 3-bit tag is assigned to each request
- Responses launch this tag on the tag bus along with the data reply so that the address bus may be left free for other requests
- The data bus is 256 bits wide while a cache line is 128 bytes
- One data response phase needs four bus cycles along with one additional hardware turnaround cycle
32. SGI Powerpath-2 bus
- Essentially two main buses and various control wires for snoop results, flow control etc.
- Address bus: five cycle arbitration, used during request
- Data bus: five cycle arbitration, five cycle transfer, used during response
- Three different transactions may be in one of these three phases at any point in time
33. SGI Powerpath-2 bus
- Forming a total order
- After the decode cycle during the request phase, every cache controller takes appropriate coherence actions i.e. BusRd downgrades an M line to S, BusRdX invalidates the line
- If a cache controller does not get the tags due to contention with the processor, it simply lengthens the ack phase beyond one cycle
- Thus the total order is formed during the request phase itself i.e. the position of each request in the total order is determined at that point
34. SGI Powerpath-2 bus
- BusWB case
- BusWB only needs the request phase
- However it needs both address and data lines together
- Must arbitrate for both together
- BusUpgr case
- Consists only of the request phase
- No response or acknowledgment
- As soon as the ack phase of address arbitration is completed by the issuing node, the upgrade has sealed a position in the total order and hence is marked complete by a completion signal sent to the issuing processor by its local bus controller (each node has its own bus controller to handle bus requests)
35. Bus interface logic
A request table entry is freed when the response is observed on the bus
36. Snoop results
- Three snoop wires: shared, modified, inhibit (all wired-OR)
- The inhibit wire helps in holding off snoop responses until the data response is launched on the bus
- Although the request phase determines who will source the data i.e. some cache or memory, the memory controller does not know it
- The cache with a modified copy keeps the inhibit line asserted until it gets the data bus and flushes the data; this prevents the memory controller from sourcing the data
- Otherwise the memory controller arbitrates for the data bus
- When the data appears, all cache controllers appropriately assert the shared and modified lines
- Why not launch snoop results as soon as they are available?
37. Conflict resolution
- Use the pending request table to resolve conflicts
- Every processor has a copy of the table
- Before arbitrating for the address bus, every processor looks up the table to see if there is a match
- In case of a match the request is not issued and is held in a pending buffer
- Flow control is needed at different levels
- Essentially need to detect if any buffer is full
- SGI Challenge uses a separate NACK line for each of the address and data phases
- Before the phases reach the ack cycle, any cache controller can assert the NACK line if it runs out of some critical buffer; this invalidates the transaction and the requester must retry (may use back-off and/or priority)
- Sun Enterprise requires the receiver to generate the retry when it has buffer space (thus only one retry)
38. Path of a cache miss
- Assume a read miss
- Look up the request table; in case of a match with BusRd just mark the entry indicating that this processor will snoop the response from the bus and that it will also assert the shared line
- In case of a request table hit with BusRdX, the cache controller must hold on to the request until the conflict resolves
- In case of a request table miss, the requester arbitrates for the address bus; while arbitrating, if a conflicting request arrives, the controller must put a NOP transaction within the slot it is granted and hold on to the request until the conflict resolves (see the sketch below)
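Here is a C sketch of that decision for a read miss, assuming a fully associative table of eight entries as on Powerpath-2; the field and function names are illustrative, not from SGI documentation.

```c
#include <stdint.h>
#include <stdbool.h>

enum bus_op { BUS_RD, BUS_RDX, BUS_UPGR, BUS_WB };

/* One entry of the request table every controller keeps */
struct req_entry {
    bool        valid;
    uint64_t    line_addr;
    enum bus_op op;
    uint8_t     tag;          /* 3-bit response tag on Powerpath-2 */
    bool        will_snarf;   /* we will grab the response off the bus */
};

#define TABLE_SIZE 8          /* eight outstanding requests on the bus */

enum action { SNARF_RESPONSE, HOLD_REQUEST, ARBITRATE };

/* Read miss to addr: a match with BusRd lets us share the in-flight
 * response (and assert the shared line); a match with BusRdX forces us
 * to hold the request; no match means we may arbitrate for the bus. */
enum action read_miss_path(struct req_entry t[TABLE_SIZE], uint64_t addr)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!t[i].valid || t[i].line_addr != addr)
            continue;
        if (t[i].op == BUS_RD) {
            t[i].will_snarf = true;   /* piggyback on the pending read */
            return SNARF_RESPONSE;
        }
        return HOLD_REQUEST;          /* conflicting write in flight */
    }
    return ARBITRATE;
}
```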
39. Path of a cache miss
- Suppose the requester succeeds in putting the request on the address/command bus
- Other cache controllers snoop the request, register it in the request table (the requester also does this), and take appropriate coherence actions within their own cache hierarchies; main memory also starts fetching the cache line
- If a cache holds the line in M state it should source it on the bus during the response phase; it keeps the inhibit line asserted until it gets the data bus; then it lowers the inhibit line and asserts the modified line; at this point the memory controller aborts the data fetch/response and instead fields the line from the data bus for writing it back
40. Path of a cache miss
- If the memory fetches the line even before the snoop is complete, the inhibit line will not allow the memory controller to launch the data on the bus
- After the inhibit line is lowered, depending on the state of the modified line, memory cancels the data response
- If no one has the line in M state, the requester grabs the response from memory
- A store miss is similar
- The only difference is that even if a cache has the line in M state, the memory controller does not write the response back
- Also, any pending BusUpgr to the same cache line must be converted to BusRdX
41. Write serialization
- In a split-transaction bus setting, the request table provides sufficient support for write serialization
- Requests to the same cache line are not allowed to proceed at the same time
- A read to a line after a write to the same line can be launched only after the write response phase has completed; this guarantees that the read will see the new value
- A write after a read to the same line can be started only after the read response has completed; this guarantees that the value of the read cannot be altered by the value written
42. Write atomicity and SC
- Sequential consistency (SC) requires write atomicity i.e. the total order of all writes seen by all processors should be identical
- Since a BusRdX or BusUpgr does not wait until the invalidations are actually applied to the caches, you have to be careful
- P0: A=1; B=1
- P1: print B; print A
- Under SC, (A, B) = (0, 1) is not allowed
- Suppose to start with P1 has the line containing A in its cache, but not the line containing B
- The stores of P0 queue the invalidation of A in P1's cache controller
- P1 takes a read miss for B, but the response for B is reordered by P1's cache controller so that it overtakes the invalidation (though it may seem better to prioritize reads)
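The litmus test above, written out as a runnable C program with POSIX threads. This only restates the example: if P1's controller lets the response for B overtake the queued invalidation of A, the forbidden output (A, B) = (0, 1) appears. Note that plain int accesses are used for brevity; a real test would also need volatile or C11 atomics to stop the compiler reordering.

```c
#include <pthread.h>
#include <stdio.h>

int A = 0, B = 0;

/* P0 writes A before B */
void *p0(void *arg) { A = 1; B = 1; return 0; }

/* P1 reads B before A; under SC, seeing B == 1 implies A == 1 */
void *p1(void *arg)
{
    int b = B;   /* read miss: the response may be reordered */
    int a = A;   /* stale hit if the invalidation is still queued */
    printf("(A, B) = (%d, %d)\n", a, b);  /* (0, 1) would violate SC */
    return 0;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, 0, p0, 0);
    pthread_create(&t1, 0, p1, 0);
    pthread_join(t0, 0);
    pthread_join(t1, 0);
    return 0;
}
```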
43. Another example
- P0: A=1; print B
- P1: B=1; print A
- Under SC, (A, B) = (0, 0) is not allowed
- The same problem arises if P0 executes both instructions first, then P1 executes the write of B (which, let's assume, generates an upgrade so that it is marked complete as soon as the address arbitration phase finishes), and then the upgrade completion is reordered with the pending invalidation of A
- So, the reason these two cases fail is that the new values are made visible before older invalidations are applied
- One solution is to have a strict FIFO queue between the bus controller and the cache hierarchy
- But it is sufficient as long as replies do not overtake invalidations; otherwise the bus responses can be reordered without violating write atomicity and hence SC (e.g., if there are only read and write responses in the queue, it sometimes may make sense to prioritize read responses)
44. In-order response
- In-order response can simplify quite a few things in the design
- The fully associative request table can be replaced by a FIFO queue
- Conflicting requests where one is a write can actually be allowed now (multiple reads were allowed even before, although only the first one actually appears on the bus)
- Consider a BusRdX followed by a BusRd from two different processors
- With in-order response it is guaranteed that the BusRdX response will be granted the data bus before the BusRd response (which may not be true for out-of-order response, and hence such a conflict is disallowed there)
- So when the cache controller generating the BusRdX sees the BusRd, it only notes that it should source the line for this request after its own write is completed
45. In-order response
- The performance penalty may be huge
- Essentially because of the memory system
- Consider a situation where three requests are pending to cache lines A, B, C in that order
- A and B map to the same memory bank while C is in a different bank
- Although the response for C may be ready long before that of B, it cannot get the bus
46. Multi-level caches
- A split-transaction bus makes the design of multi-level caches a little more difficult
- The usual design is to have queues between levels of caches in each direction
- How do you size the queues? Between the processor and L1 one buffer is sufficient (assume one outstanding processor access), L1-to-L2 needs P+1 buffers (why?), L2-to-L1 needs P buffers (why?), L1 to processor needs one buffer
- With smaller buffers there is a possibility of deadlock: suppose the L1-to-L2 and L2-to-L1 queues have one entry each, there is a request in the L1-to-L2 queue and there is also an intervention in the L2-to-L1 queue; clearly L1 cannot pick up the intervention because it does not have space to put the reply in the L1-to-L2 queue, while L2 cannot pick up the request because it might need space in the L2-to-L1 queue in case of an L2 hit
47. Multi-level caches
- Formalizing the deadlock with a dependence graph
- There are four types of transactions in the cache hierarchy: 1. processor requests (outbound requests), 2. responses to processor requests (inbound responses), 3. interventions (inbound requests), 4. intervention responses (outbound responses)
- Processor requests need space in the L1-to-L2 queue; responses to processors need space in the L2-to-L1 queue; interventions need space in the L2-to-L1 queue; intervention responses need space in the L1-to-L2 queue
- Thus a message in the L1-to-L2 queue may need space in the L2-to-L1 queue (e.g. a processor request generating a response due to an L2 hit); also a message in the L2-to-L1 queue may need space in the L1-to-L2 queue (e.g. an intervention response)
- This creates a cycle in the queue space dependence graph
48. Dependence graph
- Represent a queue by a vertex in the graph
- Number of vertices = number of queues
- A directed edge from vertex u to vertex v is present if a message at the head of queue u may generate another message which requires space in queue v
- In our case we have two queues, L2→L1 and L1→L2; the graph is not a DAG, hence deadlock
49. Multi-level caches
- In summary
- The L2 cache controller refuses to drain the L1-to-L2 queue if there is no space in the L2-to-L1 queue; this is rather conservative because the message at the head of the L1-to-L2 queue may not need space in the L2-to-L1 queue, e.g., in case of an L2 miss or if it is an intervention reply, but after popping the head of the L1-to-L2 queue it is impossible to backtrack if the message does need space in the L2-to-L1 queue
- Similarly, the L1 cache controller refuses to drain the L2-to-L1 queue if there is no space in the L1-to-L2 queue
- How do we break this cycle?
- Observe that responses to processor requests are guaranteed not to generate any more messages, and intervention requests do not generate new requests but can only generate replies
50. Multi-level caches
- Solving the queue deadlock
- Introduce one more queue in each direction i.e. have a pair of queues in each direction
- An L1-to-L2 processor request queue and an L1-to-L2 intervention response queue
- Similarly, an L2-to-L1 intervention request queue and an L2-to-L1 processor response queue
- Now the L2 cache controller can serve the L1-to-L2 processor request queue as long as there is space in the L2-to-L1 processor response queue, but there is no constraint on the L1 cache controller for draining the L2-to-L1 processor response queue
- Similarly, the L1 cache controller can serve the L2-to-L1 intervention request queue as long as there is space in the L1-to-L2 intervention response queue, but the L1-to-L2 intervention response queue will drain as soon as the bus is granted (see the sketch below)
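A compact C sketch of the four-queue drain discipline. The queue and function names are assumptions, and the "PY drains into the processor" and "IY drains onto the bus" steps are modeled as unconditional pops; the point is that each drain rule depends only on queues that themselves drain unconditionally, so no cycle remains.

```c
#include <stdbool.h>

struct queue { int count, capacity; };

static bool full(const struct queue *q)  { return q->count == q->capacity; }
static bool empty(const struct queue *q) { return q->count == 0; }

struct l1_l2_queues {
    struct queue pr;  /* L1->L2 processor requests    */
    struct queue iy;  /* L1->L2 intervention replies  */
    struct queue py;  /* L2->L1 processor replies     */
    struct queue ir;  /* L2->L1 intervention requests */
};

void drain_one_step(struct l1_l2_queues *q)
{
    /* L2 serves a processor request only if the reply it may generate
     * (on an L2 hit) has guaranteed space in PY */
    if (!empty(&q->pr) && !full(&q->py)) {
        q->pr.count--; q->py.count++;   /* e.g. an L2 hit produces a reply */
    }
    /* L1 serves an intervention only if its reply has space in IY */
    if (!empty(&q->ir) && !full(&q->iy)) {
        q->ir.count--; q->iy.count++;
    }
    /* PY sinks into the processor, IY sinks onto the bus; neither
     * generates further messages, so both always drain eventually */
    if (!empty(&q->py)) q->py.count--;
    if (!empty(&q->iy)) q->iy.count--;
}
```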
51. Dependence graph
- Now we have four queues
- Processor request (PR) and intervention reply (IY) queues go from L1 to L2
- Processor reply (PY) and intervention request (IR) queues go from L2 to L1
- In the dependence graph over the vertices PR, PY, IR, IY, the only edges are PR→PY and IR→IY; PY and IY have no outgoing edges, so the graph is acyclic
52. Dependence graph
- Possible to combine PR and IY into a supernode of the graph and still be cycle-free
- Leads to one L1 to L2 queue
- Similarly, possible to combine IR and PY into a supernode
- Leads to one L2 to L1 queue
- Cannot do both
- Leads to a cycle as already discussed
- Bottomline: need at least three queues for a two-level cache hierarchy
53. Multiple outstanding requests
- Today all processors allow multiple outstanding cache misses
- We have already discussed issues related to out-of-order execution
- Not much needs to be added on top of that to support multiple outstanding misses
- For a multi-level cache hierarchy the queue depths may be made bigger for performance reasons
- Various other buffers such as the writeback buffer need to be made bigger
54. SGI Challenge
- Supports 36 MIPS R4400 (4 per board) or 18 MIPS R8000 (2 per board) processors
- The A-chip has the address bus interface and the request table
- The CC-chip handles coherence through the duplicate set of tags
- Each D-chip handles 64 bits of data, and as a whole 4 D-chips interface to a 256-bit wide data bus
55. Sun Enterprise
- Supports up to 30 UltraSPARC processors
- 2 processors and 1 GB memory per board
- Wide 64-byte memory bus and hence two memory cycles to transfer the entire cache line (128 bytes)
56. Sun Gigaplane bus
- Split-transaction, 256 bits data, 41 bits address, 83.5 MHz (compare to 47.6 MHz of SGI Powerpath-2)
- Supports 16 boards
- 112 outstanding transactions (up to 7 from each board)
- Snoop result is available 5 cycles after the request phase
- Memory fetches data speculatively
- MOESI protocol
57. Special Topics
58. Virtually indexed caches
- Recall that to have concurrent accesses to the TLB and the cache, L1 caches are often made virtually indexed
- Can read the physical tag and data while the TLB lookup takes place
- Later compare the tag for hit/miss detection
- How does it impact the functioning of coherence protocols and snoop logic?
- Even for a uniprocessor there is the synonym problem
- Two different virtual addresses may map to the same physical page frame
- One simple solution may be to flush all cache lines mapped to a page frame at the time of replacement
- But this clearly prevents page sharing between two processes
59. Virtual indexing
- Software normally employs page coloring to solve the synonym issue
- Allow two virtual pages to point to the same physical page frame only if the two virtual addresses have at least the lower k bits in common, where k is equal to the cache line block offset plus log2(number of cache sets)
- This guarantees that in a virtually indexed cache, lines from both pages will map to the same index range (see the sketch below)
- What about the snoop logic?
- Putting the virtual address on the bus requires a VA to PA translation in the snoop logic so that physical tags can be generated (adds extra latency to the snoop and also requires a duplicate set of translations)
- Putting the physical address on the bus requires a reverse translation to generate the virtual index (requires an inverted page table)
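A small C sketch of the page-coloring rule stated above. The line size and set count below are illustrative values, not from any particular machine; the rule itself (k = offset bits + index bits) is exactly the slide's condition.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 64        /* => 6 block-offset bits (illustrative)  */
#define NUM_SETS   512       /* => 9 index bits (illustrative)         */
#define K_BITS     (6 + 9)   /* k = log2(LINE_BYTES) + log2(NUM_SETS)  */

/* Two virtual addresses may share a physical frame only if they agree
 * in the low k bits. When they do, both synonyms select the same cache
 * set, so a virtually indexed cache can never hold the same physical
 * line in two different sets. */
bool same_color(uint64_t va1, uint64_t va2)
{
    uint64_t mask = (1ULL << K_BITS) - 1;
    return (va1 & mask) == (va2 & mask);
}
```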
60. Virtual indexing
- Dual tags (Goodman, 1987)
- A hardware solution to avoid synonyms in shared memory
- Maintain virtual and physical tags; each corresponding tag pair points to the other
- Assume no page coloring
- Use the virtual address to look up the cache (i.e. virtual index and virtual tag) from the processor side; if it hits, everything is fine; if it misses, use the physical address to look up the physical tag and if it hits, follow the physical-tag-to-virtual-tag pointer to find the index
- If the virtual tag misses and the physical tag hits, that means the synonym problem has occurred i.e. two different VAs are mapped to the same PA; in this case invalidate the cache line pointed to by the physical tag, replace the line at the virtual index of the current virtual address, place the contents of the invalidated line there, and update the physical tag pointer to point to the new virtual index
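A minimal C sketch of this dual-tag lookup on a direct-mapped cache. The field names, the 24-bit tag split, and the lookup interface are illustrative assumptions (data movement is omitted); the control flow follows the two cases described above.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 512

struct vtag_entry { bool valid; uint64_t vtag; int ptag_idx; };
struct ptag_entry { bool valid; uint64_t ptag; int vindex;   };

static struct vtag_entry vtags[NUM_SETS];
static struct ptag_entry ptags[NUM_SETS];

/* Processor-side lookup: try the virtual tags first; on a miss, probe
 * the physical tags to catch a synonym. Returns the virtual index of
 * the hit, or -1 on a true miss. */
int cache_lookup(uint64_t va, uint64_t pa, int vindex, int pindex)
{
    if (vtags[vindex].valid && vtags[vindex].vtag == (va >> 24))
        return vindex;                      /* ordinary virtual-tag hit */

    if (ptags[pindex].valid && ptags[pindex].ptag == (pa >> 24)) {
        /* synonym: the same physical line lives at another virtual index */
        int old = ptags[pindex].vindex;
        vtags[old].valid = false;           /* invalidate the old copy */
        vtags[vindex].valid    = true;      /* re-install under the new VA */
        vtags[vindex].vtag     = va >> 24;
        vtags[vindex].ptag_idx = pindex;
        ptags[pindex].vindex   = vindex;    /* repoint the physical tag */
        /* (the data of the invalidated line would be moved here too) */
        return vindex;
    }
    return -1;                              /* true miss: go to L2/bus */
}
```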
61. Virtual indexing
- Goodman, 1987
- Always use the physical address for snooping
- Obviates the need for a TLB in the memory controller
- The physical tag is used to look up the cache for the snoop decision
- In case of a snoop hit, the pointer stored with the physical tag is followed to get the virtual index and then the cache block can be accessed if needed (e.g., in M state)
- Note that even if there are two different types of tags, the state of a cache line is the same and does not depend on what type of tag is used to access the line
62. Virtual indexing
- Multi-level cache hierarchy
- Normally the L1 cache is designed to be virtually indexed and the other levels are physically indexed
- L2 sends interventions to L1 by communicating the PA
- L1 must determine the virtual index from that to access the cache; dual tags are sufficient for this purpose
63. TLB coherence
- A page table entry (PTE) may be held by multiple processors in shared memory because all of them access the same shared page
- A PTE may get modified when the page is swapped out and/or access permissions are changed
- Must tell all processors having this PTE to invalidate it
- How to do it efficiently?
- No TLB: virtually indexed virtually tagged L1 caches
- On an L1 miss directly access the PTE in memory and bring it into the cache; then use normal cache coherence because the PTEs also reside in the shared memory segment
- On page replacement the page fault handler can flush the cache line containing the replaced PTE
- Too impractical: fully virtual caches are rare, and a TLB is still used for the upper levels (e.g., the Alpha 21264 instruction cache)
64. TLB coherence
- Hardware solution
- Extend the snoop logic to handle TLB coherence
- The PowerPC family exercises a tlbie instruction (TLB invalidate entry)
- When the OS modifies a PTE it puts a tlbie instruction on the bus
- The snoop logic picks it up and invalidates the TLB entry if present, in all processors
- This is well suited for bus-based SMPs, but not for DSMs because broadcast in a large-scale machine is not good
65. TLB shootdown
- A popular TLB coherence solution
- Invoked by an initiator (the processor which modifies the PTE) by sending interrupts to processors that might be caching the PTE in their TLBs; before doing so the OS also locks the involved PTE to avoid any further access to it in case of TLB misses from other processors
- The receiver of the interrupt simply invalidates the involved PTE if it is present in its TLB and sets a flag in shared memory on which the initiator polls
- On completion the initiator unlocks the PTE
- SGI Origin uses a lazy TLB shootdown i.e. it invalidates a TLB entry only when a processor tries to access it the next time (will discuss in detail)
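A minimal C sketch of the shootdown handshake just described. Interrupt delivery is abstracted into shared flags, and every name here (struct fields, function names, NCPU) is a hypothetical placeholder, not a real OS interface.

```c
#include <stdbool.h>

#define NCPU 8

struct pte { unsigned frame; bool locked; };

static volatile bool done[NCPU];  /* per-CPU "I have invalidated" flags */

void shootdown_initiator(struct pte *p, int self)
{
    p->locked = true;              /* block TLB refills of this PTE */
    /* ... modify the PTE: swap the page out or change permissions ... */
    for (int c = 0; c < NCPU; c++)
        done[c] = (c == self);     /* clear flags, then send IPIs (abstracted) */
    for (int c = 0; c < NCPU; c++)
        while (!done[c])
            ;                      /* poll until every CPU acknowledges */
    p->locked = false;             /* shootdown complete: unlock the PTE */
}

/* Interrupt handler running on a receiving CPU */
void shootdown_receiver(int self)
{
    /* ... invalidate the TLB entry if present (e.g. invlpg on x86) ... */
    done[self] = true;             /* signal the polling initiator */
}
```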
66. Snooping on a ring
- The length of the bus limits the frequency at which it can be clocked, which in turn limits the bandwidth offered by the bus, leading to a limited number of processors
- A ring interconnect provides a better solution
- Connect a processor only to its two neighbors
- Short wires, much higher switching frequency, better bandwidth, more processors
- Each node has private local memory (more like a distributed shared memory multiprocessor)
- Every cache line has a home node i.e. the node whose memory contains the line; it can be determined from the upper few bits of the PA
- Transactions traverse the ring node by node
67. Snooping on a ring
- Snoop mechanism
- When a transaction passes by the ring interface of a node, the node snoops the transaction, takes appropriate coherence actions, and forwards the transaction to its neighbor if necessary
- The home node also receives the transaction eventually; let's assume that it has a dirty bit associated with every memory line (otherwise you need a two-phase protocol)
- A request transaction is removed from the ring when it comes back to the requester (this serves as an acknowledgment that every node has seen the request)
- The ring is essentially divided into time slots where a node can insert a new request or response; if there is no free time slot it must wait until one passes by; this is called a slotted ring
68. Snooping on a ring
- The snoop logic must be able to finish coherence actions for a transaction before the next time slot arrives
- The main problem of a ring is the end-to-end latency, since transactions must traverse hop by hop
- Serialization and sequential consistency are trickier
- The order of two transactions may be seen differently by two processors if the source of one transaction is between the two processors
- The home node can resort to NACKs if it sees conflicting outstanding requests
- Introduces many races in the protocol
69. Scaling bandwidth
- Data bandwidth
- Make the bus wider: costly hardware
- Replace the bus by a point-to-point crossbar: since only the address portion of a transaction is needed for coherence, the data transaction can go directly between source and destination
- Add multiple data busses
- Snoop or coherence bandwidth
- This is determined by the number of snoop actions that can be executed in unit time
- Having concurrent non-conflicting snoop actions definitely helps improve the protocol throughput
- Multiple address busses: a separate snoop engine is associated with each bus on each node
- Order the address busses logically to define a partial order among concurrent requests so that these partial orders can be combined to form a total order
70. AMD Opteron
- Each node contains an x86-64 core, 64 KB L1 data and instruction caches, a 1 MB L2 cache, an on-chip integrated memory controller, three fast routing links called HyperTransport, and local DDR memory
- Glueless MP: just connect 8 Opteron chips via HT to build a distributed shared memory multiprocessor
- The L2 cache supports 10 outstanding misses
Reproduced from IEEE Micro
71. AMD Opteron
- Integrated memory controller and north bridge functionality help a lot
- Can clock the memory controller at processor frequency (2 GHz)
- No need to have a cumbersome motherboard; just buy the Opteron chip and connect it to a few peripherals (system maintenance is much easier)
- Overall, improves performance by 20-25% over the Athlon
- Snoop throughput and bandwidth are much higher since the snoop logic is clocked at 2 GHz
- Integrated HyperTransport provides very high communication bandwidth
- Point-to-point links, split-transaction and full duplex (bidirectional links)
- On each HT link you can connect a processor or I/O
72. Opteron servers
Reproduced from IEEE Micro
73. AMD Hammer protocol
- Opteron uses the snoop-based Hammer protocol
- First the requester sends a transaction to the home node
- The home node starts accessing main memory and in parallel broadcasts the request to all the nodes via point-to-point messages
- The nodes individually snoop the request, take appropriate coherence actions in their local caches, and send data (if someone has it in M or O state) or an empty completion acknowledgment to the requester; the home memory also sends the data speculatively
- After gathering all responses the requester sends a completion message to the home node so that it can proceed with subsequent requests (this completion ack may be needed for serializing conflicting requests)
- This is one example of a snoopy protocol over a point-to-point interconnect, unlike the shared bus
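A C sketch of the requester's response-gathering step in this flow. Message transport is abstracted away and every identifier is a hypothetical placeholder, not AMD's actual implementation; the precedence rule (a cache copy in M or O state beats the speculative memory copy) follows the description above.

```c
#include <stddef.h>

/* Kinds of per-node responses the requester may collect */
enum resp_kind {
    RESP_ACK,          /* empty ack: that node had no M/O copy */
    RESP_DATA_CACHE,   /* data sourced from a cache in M or O state */
    RESP_DATA_MEMORY   /* speculative data from the home memory */
};

struct response { enum resp_kind kind; const void *line; };

/* Gather one response per peer node plus the home memory's speculative
 * reply; the cache copy, if any, is the up-to-date one, so it takes
 * precedence over the possibly stale memory copy. Returns the line to
 * install; afterwards the requester sends its completion message to the
 * home node so the next conflicting request can be released. */
const void *gather_responses(const struct response *r, size_t n)
{
    const void *from_memory = NULL, *from_cache = NULL;
    for (size_t i = 0; i < n; i++) {
        if (r[i].kind == RESP_DATA_CACHE)  from_cache  = r[i].line;
        if (r[i].kind == RESP_DATA_MEMORY) from_memory = r[i].line;
        /* RESP_ACK carries no data and needs no action here */
    }
    return from_cache ? from_cache : from_memory;
}
```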