Title: Multiprocessors on a Snoopy Bus
1. Multiprocessors on a Snoopy Bus
2. Agenda
- Goal is to understand what influences the performance, cost and scalability of SMPs
- Details of the physical design of SMPs
- At least three goals of any design: correctness, performance, low hardware complexity
- Performance gains are normally achieved by pipelining memory transactions and having multiple outstanding requests
- These performance optimizations occasionally introduce new protocol races involving transient states, leading to correctness issues in terms of coherence and consistency
3. Correctness goals
- Must enforce coherence and write serialization
- Recall that write serialization guarantees that all writes to a location are seen in the same order by all processors
- Must obey the target memory consistency model
- If sequential consistency is the goal, the system must provide write atomicity and detect write completion correctly (write atomicity extends the definition of write serialization to any location i.e. it guarantees that the positions of writes within the total order seen by all processors are the same)
- Must be free of deadlock, livelock and starvation
- Starvation confined to a part of the system is not as problematic as deadlock and livelock
- However, system-wide starvation leads to livelock
4. A simple design
- Start with a rather naïve design
- Each processor has a single level of data and instruction caches
- The cache allows exactly one outstanding miss at a time i.e. a cache miss request is blocked if another is already outstanding (this serializes all bus requests from a particular processor)
- The bus is atomic i.e. it handles one request at a time
5. Cache controller
- Must be able to respond to bus transactions as necessary
- Handled by the snoop logic
- The snoop logic should have access to the cache tags
- A single set of tags cannot allow concurrent accesses by the processor-side and the bus-side controllers
- When the snoop logic accesses the tags, the processor must remain locked out from accessing the tags
- Possible enhancements: two read ports in the tag RAM allow concurrent reads; duplicate copies are also possible; multiple banks also reduce contention
- In all cases, updates to tags must still be atomic or must be applied to both copies in case of duplicate tags; however, tag updates are a lot less frequent compared to reads
6. Snoop logic
- A couple of decisions need to be made while designing the snoop logic
- How long should the snoop decision take?
- How should processors convey the snoop decision?
- Snoop latency (three design choices)
- Possible to set an upper bound in terms of number of cycles; advantage: no change in memory controller hardware; disadvantage: potentially large snoop latency (Pentium Pro, Sun Enterprise servers)
- The memory controller samples the snoop results every cycle until all caches have completed the snoop (SGI Challenge uses this approach where the memory controller fetches the line from memory, but stalls if all caches haven't yet snooped)
- Maintain a bit per memory line to indicate if it is in M state in some cache
7. Snoop logic
- Conveying the snoop result
- For MESI the bus is augmented with three wired-OR snoop result lines (shared, modified, valid); the valid line is active low
- The original Illinois MESI protocol requires cache-to-cache transfer even when the line is in S state; this may complicate the hardware enormously due to the involved priority mechanism
- Commercial MESI protocols normally allow cache-to-cache sharing only for lines in M state
- SGI Challenge and Sun Enterprise allow cache-to-cache transfers only in M state; Challenge updates memory when going from M to S while Enterprise exercises a MOESI protocol
8. Writebacks
- Writebacks are essentially evictions of modified lines
- Caused by a miss mapping to the same cache index
- Needs two bus transactions: one for the miss and one for the writeback
- The miss should definitely be given first priority since this directly impacts forward progress of the program
- Need a writeback buffer (WBB) to hold the evicted line until the bus can be acquired for the second time by this cache
- In the meantime a new request from another processor may be launched for the evicted line; the evicting cache must provide the line from the WBB and cancel the pending writeback (need an address comparator on the WBB; see the sketch below)
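To make the WBB race concrete, here is a minimal C sketch of the snoop-side address check, under the assumption of a single-entry buffer and a 128-byte line; the struct and function names are hypothetical, not from any real controller.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128  /* illustrative cache line size */

/* One-entry writeback buffer (WBB); names are hypothetical */
struct wbb {
    bool     valid;            /* a victim line is waiting for the bus */
    uint64_t line_addr;        /* address of the evicted (modified) line */
    uint8_t  data[LINE_BYTES]; /* the dirty data itself */
};

/* Snoop-side check: if an incoming bus request hits the WBB, the
 * evicting cache must source the line from the buffer and cancel the
 * pending BusWB, since responsibility for the data now passes to the
 * requester (or memory). */
bool wbb_snoop(struct wbb *w, uint64_t snoop_addr, uint8_t *reply)
{
    if (w->valid && w->line_addr == snoop_addr) {
        memcpy(reply, w->data, LINE_BYTES); /* flush line from WBB */
        w->valid = false;                   /* cancel pending writeback */
        return true;                        /* we sourced the data */
    }
    return false;  /* memory (or another cache) responds instead */
}
```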
9. A simple design
10. Inherently non-atomic
- Even though the bus is atomic, a complete protocol transaction involves quite a few steps which together form a non-atomic transaction
- Issuing the processor request
- Looking up cache tags
- Arbitrating for the bus
- Snoop action in the other cache controllers
- Refill in the requesting cache controller at the end
- Different requests from different processors may be in different phases of a transaction
- This makes a protocol transition inherently non-atomic
11. Inherently non-atomic
- Consider an example
- P0 and P1 have cache line C in shared state
- Both proceed to write the line
- Both cache controllers look up the tags, put a BusUpgr into the bus request queue, and start arbitrating for the bus
- P1 gets the bus first and launches its BusUpgr
- P0 observes the BusUpgr and now it must invalidate C in its cache and change the request type to BusRdX
- So every cache controller needs to do an associative lookup of the snoop address against its pending request queue and, depending on the request type, take appropriate actions (see the sketch below)
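A minimal C sketch of that associative check, assuming a simple array-based pending-request queue; all identifiers here are hypothetical. It shows only the BusUpgr-to-BusRdX conversion from the example above.

```c
#include <stdint.h>

enum bus_op { BUS_RD, BUS_RDX, BUS_UPGR };

struct pending_req {
    int         valid;
    uint64_t    line_addr;
    enum bus_op op;
};

/* On every snooped transaction, match the snoop address against our
 * pending requests. A pending BusUpgr whose shared copy has just been
 * invalidated by someone else's BusUpgr/BusRdX must be converted to a
 * BusRdX, because it now needs the data as well as ownership. */
void snoop_pending_queue(struct pending_req *q, int n,
                         uint64_t snoop_addr, enum bus_op snoop_op)
{
    for (int i = 0; i < n; i++) {
        if (!q[i].valid || q[i].line_addr != snoop_addr)
            continue;
        if (q[i].op == BUS_UPGR &&
            (snoop_op == BUS_UPGR || snoop_op == BUS_RDX)) {
            q[i].op = BUS_RDX;  /* our S copy is gone: fetch the data too */
        }
    }
}
```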
12. Inherently non-atomic
- One way to reason about correctness is to introduce transient states
- Possible to think of the last problem as the line C being in a transient S→M state
- On observing a BusUpgr or BusRdX, this state transitions to I→M which is also transient
- The line C goes to the stable M state only after the transaction completes
- These transient states are not really encoded in the state bits of a cache line because at any point in time there will be only a small number of outstanding requests from a particular processor (today the maximum I know of is 16)
- These states are really determined by the state of an outstanding line and the state of the cache controller
13. Write serialization
- The atomic bus makes it rather easy, but optimizations are possible
- Consider a processor write to a shared cache line
- Is it safe to continue with the write and change the state to M even before the bus transaction is complete?
- After the bus transaction is launched it is totally safe because the bus is atomic and hence the position of the write is committed in the total order; therefore there is no need to wait any further (note that the exact point in time when the other caches invalidate the line is not important)
- If the processor decides to proceed even before the bus transaction is launched (very much possible in out-of-order execution), the cache controller must take the responsibility of squashing and re-executing offending instructions so that the total order is consistent across the system
14. Fetch deadlock
- Just a fancy name for a pretty intuitive deadlock
- Suppose P0's cache controller is waiting to get the bus for launching a BusRdX to cache line A
- P1 has a modified copy of cache line A
- P1 has launched a BusRd to cache line B and is awaiting completion
- P0 has a modified copy of cache line B
- If both keep waiting without responding to snoop requests, the deadlock cycle is pretty obvious
- So every controller must continue to respond to snoop requests while waiting for the bus for its own requests
- Normally the cache controller is designed as two separate independent logic units, namely, the inbound unit (handles snoop requests) and the outbound unit (handles its own requests and arbitrates for the bus)
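A small C sketch of the key property of that split, assuming the controller state can be modeled as a few flags (all names hypothetical): the inbound unit is serviced unconditionally every cycle, so a controller stalled on bus arbitration still answers snoops, breaking the fetch-deadlock cycle above.

```c
#include <stdbool.h>

/* Hypothetical controller state, for illustration only */
struct cache_ctrl {
    bool snoop_pending;    /* inbound: a bus transaction to respond to */
    bool own_req_pending;  /* outbound: our request waiting for the bus */
    bool bus_granted;      /* arbiter granted us the bus this cycle */
};

/* One clock of the controller. The inbound unit never waits on the
 * outbound unit: even while we wait for the bus to launch our BusRdX,
 * we keep sourcing modified lines for other processors. */
void controller_cycle(struct cache_ctrl *c)
{
    if (c->snoop_pending) {
        /* ... look up tags, source data or invalidate ... */
        c->snoop_pending = false;
    }
    if (c->own_req_pending && c->bus_granted) {
        /* ... launch BusRd/BusRdX/BusUpgr ... */
        c->own_req_pending = false;
    }
}
```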
15. Livelock
- Consider the following example
- P0 and P1 try to write to the same cache line
- P0 gets exclusive ownership, fills the line in its cache and notifies the load/store unit (or retirement unit) to retry the store
- While all this is happening, P1's request appears on the bus and P0's cache controller changes the tag state to I before the store could retry
- This can easily lead to a livelock
- Normally this is avoided by giving the load/store unit higher priority for tag access (i.e. the snoop logic cannot modify the tag arrays when there is a processor access pending in the same clock cycle)
- This is even rarer in a multi-level cache hierarchy (more later)
16. Starvation
- Some amount of fairness is necessary in the bus arbiter
- An FCFS policy is possible for granting the bus, but that needs some buffering in the arbiter to hold already placed requests
- Most machines implement an aging scheme which keeps track of the number of times a particular request is denied; when the count crosses a threshold, that request becomes the highest priority (this too needs some storage; see the sketch below)
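A minimal sketch of such an aging arbiter in C. The threshold, the round-robin tie-break below the threshold, and the lowest-index tie-break among starved requesters are all assumptions for illustration, not details of any particular machine.

```c
#include <stdint.h>

#define NREQ 8        /* requesters on the bus */
#define AGE_LIMIT 4   /* denials before a request becomes top priority */

/* Grant one requester per cycle. pending[i] is nonzero if requester i
 * wants the bus; age[i] counts how many times it has been denied.
 * Returns the winner's index, or -1 if nobody is requesting. */
int arbitrate(const int pending[NREQ], uint8_t age[NREQ], int last_winner)
{
    int winner = -1;

    /* starved requests win unconditionally (lowest index first) */
    for (int i = 0; i < NREQ; i++)
        if (pending[i] && age[i] >= AGE_LIMIT) { winner = i; break; }

    /* otherwise simple round-robin from the last winner */
    if (winner < 0)
        for (int k = 1; k <= NREQ; k++) {
            int i = (last_winner + k) % NREQ;
            if (pending[i]) { winner = i; break; }
        }

    /* age every denied requester, reset the winner's counter */
    for (int i = 0; i < NREQ; i++)
        if (pending[i] && i != winner && age[i] < 255) age[i]++;
    if (winner >= 0) age[winner] = 0;
    return winner;
}
```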
17. More on LL/SC
- We have seen that both LL and SC may suffer from cache misses (a read followed by an upgrade miss)
- Is it possible to save one transaction?
- What if I design my cache controller in such a way that it can recognize LL instructions and launch a BusRdX instead of BusRd? This is called Read-for-Ownership (RFO); it is also used by the Intel atomic xchg instruction
- Nice idea, but you have to be careful
- By doing this you have just enormously increased the probability of a livelock: before the SC executes there is a high probability that another LL will take away the line (see the sketch below)
- A possible solution is to buffer incoming snoop requests until the SC completes (buffer space is proportional to P), but this may introduce new deadlock cycles (especially for modern non-atomic busses)
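To show where the livelock window sits, here is a C sketch of the classic LL/SC lock acquire. The `load_linked`/`store_conditional` functions are single-threaded models of the instructions (real code is inline assembly, and the reservation is actually cleared by a snoop hit on the linked line); all names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

static volatile uint32_t *reservation = NULL; /* model of the link register */

static uint32_t load_linked(volatile uint32_t *addr)
{
    reservation = addr;  /* BusRd, or BusRdX if the controller uses RFO */
    return *addr;
}

static bool store_conditional(volatile uint32_t *addr, uint32_t v)
{
    if (reservation != addr)
        return false;    /* line was stolen between LL and SC */
    *addr = v;
    reservation = NULL;
    return true;
}

/* With RFO the LL already brings the line in M state, saving the later
 * upgrade; but another processor's RFO between our LL and SC steals the
 * line and fails our SC. The window between the two loops below is
 * exactly where snoops may need to be buffered to avoid livelock. */
void lock_acquire(volatile uint32_t *lock)
{
    uint32_t v;
    do {
        do {
            v = load_linked(lock);
        } while (v != 0);                   /* spin while the lock is held */
    } while (!store_conditional(lock, 1));  /* retry if the line was lost */
}
```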
18. Multi-level caches
- We have talked about multi-level caches and the involved inclusion property
- Multiprocessors create new problems related to multi-level caches
- A bus snoop result may be relevant to inner levels of the cache hierarchy, e.g., bus transactions are not visible to the first level cache controller
- Similarly, modifications made in the first level cache may not be visible to the second level cache controller which is responsible for handling bus requests
- The inclusion property makes it easier to maintain coherence
- Since the L1 cache is a subset of the L2 cache, a snoop miss in the L2 cache need not be sent to the L1 cache
19. Recap of inclusion
- A processor read
- Looks up L1 first and in case of a miss goes to L2, and finally may need to launch a BusRd request if it misses in L2
- Finally, the line is in S state in both L1 and L2
- A processor write
- Looks up L1 first and if the line is in I state sends a ReadX request to L2 which may have the line in M state
- In case of an L2 hit, the line is filled in M state in L1
- In case of an L2 miss, if the line is in S state in L2 it launches BusUpgr, otherwise it launches BusRdX; finally, the line is in state M in both L1 and L2
- If the line is in S state in L1, it sends an upgrade request to L2 and either there is an L2 hit or L2 just conveys the upgrade to the bus (Why can't it get changed to BusRdX?)
20. Recap of inclusion
- L1 cache replacement
- Replacement of a line in S state may or may not be conveyed to L2
- Replacement of a line in M state must be sent to L2 so that it can hold the most up-to-date copy
- The line is in I state in L1 after replacement; the state of the line remains unchanged in L2
- L2 cache replacement
- Replacement of a line in S state may or may not generate a bus transaction; it must send a notification to the L1 caches so that they can invalidate the line to maintain inclusion
- Replacement of a line in M state first asks the L1 cache to send all the relevant L1 lines (these are the most up-to-date copies) and then launches a BusWB
- The state of the line in both L1 and L2 is I after replacement
21. Recap of inclusion
- Replacement of a line in E state from L1?
- Replacement of a line in E state from L2?
- Replacement of a line in O state from L1?
- Replacement of a line in O state from L2?
- In summary
- A line in S state in L2 may or may not be in L1 in S state
- A line in M state in L2 may or may not be in L1 in M state (Why? Can it be in S state?)
- A line in I state in L2 must not be present in L1
22. Inclusion and snoop
- BusRd snoop
- Look up the L2 cache tag; if in I state, no action; if in S state, no action; if in M state, assert the wired-OR M line, send a read intervention to the L1 data cache, the L1 data cache sends the lines back, the L2 controller launches the line on the bus, and both the L1 and L2 lines go to S state
- BusRdX snoop
- Look up the L2 cache tag; if in I state, no action; if in S state, invalidate and also notify L1; if in M state, assert the wired-OR M line, send a readX intervention to the L1 data cache, the L1 data cache sends the lines back, the L2 controller launches the line on the bus, and both the L1 and L2 lines go to I state
- BusUpgr snoop
- Similar to BusRdX without the cache line flush
23. L2 to L1 interventions
- Two types of interventions
- One is the read/readX intervention that requires a data reply
- The other is plain invalidation that does not need a data reply
- Data interventions can be eliminated by making the L1 cache write-through
- But that introduces too much write traffic to L2
- One possible solution is to have a store buffer that can handle the stores in the background obeying the available bandwidth, so that the processor can proceed independently; this can easily violate sequential consistency unless the store buffer also becomes a part of the snoop logic
- Useless invalidations can be eliminated by introducing an inclusion bit in the L2 cache state
24. Invalidation acks?
- On a BusRdX or BusUpgr, in case of a snoop hit in S state the L2 cache sends invalidations to the L1 caches
- Does the snoop logic wait for an invalidation acknowledgment from the L1 cache before the transaction can be marked complete?
- Do we need a two-phase mechanism?
- What are the issues?
25. Intervention races
- Writebacks introduce new races in a multi-level cache hierarchy
- Suppose L2 sends a read intervention to L1 and in the meantime L1 decides to replace that line (due to some conflicting processor access)
- The intervention will naturally miss the up-to-date copy
- When the writeback arrives at L2, L2 realizes that the intervention race has occurred (need extra hardware to implement this logic; what hardware?)
- When the intervention reply arrives from L1, L2 can apply the newly received writeback and launch the line on the bus
- Exactly the same situation may arise even in a uniprocessor if a dirty replacement from L2 misses the line in L1 because L1 just replaced that line too
26. Tag RAM design
- A multi-level cache hierarchy reduces tag contention
- L1 tags are mostly accessed by the processor because the L2 cache acts as a filter for external requests
- L2 tags are mostly accessed by the system because hopefully the L1 cache can absorb most of the processor traffic
- Still, some machines maintain duplicate tags at all levels or at the outermost level only
27. Exclusive cache levels
- AMD K7 (Athlon XP) and K8 (Athlon64, Opteron) architectures chose to have exclusive levels of caches instead of inclusive
- Definitely provides much better utilization of on-chip caches since there is no duplication
- But complicates many issues related to coherence
- The uniprocessor protocol is to refill requested lines directly into L1 without placing a copy in L2; only on an L1 eviction put the line into L2; on an L1 miss look up L2 and in case of an L2 hit replace the line from L2 and put it in L1 (may have to replace multiple L1 lines to accommodate the full L2 line; not sure what K8 does; possible to maintain an inclusion bit per L1 line sector in the L2 cache)
- For multiprocessors one solution could be to have one snoop engine per cache level and a tournament logic that selects the successful snoop result
28. Split-transaction bus
- An atomic bus leads to underutilization of bus resources
- Between the time the address is taken off the bus and the snoop responses are available, the bus stays idle
- Even after the snoop result is available the bus may remain idle due to high memory access latency
- A split-transaction bus divides each transaction into two parts: request and response
- Between the request and response of a particular transaction there may be other requests and/or responses from different transactions
- Outstanding transactions that have not yet started or have completed only one phase are buffered in the requesting cache controllers
29. New issues
- The split-transaction bus introduces new protocol races
- P0 and P1 have a line in S state and both issue BusUpgr, say, in consecutive cycles
- The snoop response arrives later because it takes time
- Now both P0 and P1 may think that they have ownership
- Flow control is important since buffer space is finite
- In-order or out-of-order response?
- Out-of-order response may better tolerate variable memory latency by servicing other requests
- Pentium Pro uses in-order response
- SGI Challenge and Sun Enterprise use out-of-order response i.e. no ordering is enforced
30. SGI Powerpath-2 bus
- Used in SGI Challenge
- Conflicts are resolved by not allowing multiple bus transactions to the same cache line
- Allows eight outstanding requests on the bus at any point in time
- Flow control on buffers is provided by negative acknowledgments (NACKs): the bus has a dedicated NACK line which remains asserted if the buffer holding outstanding transactions is full; a NACKed transaction must be retried
- The request order determines the total order of memory accesses, but the responses may be delivered in a different order depending on their completion times
- In subsequent slides we call this design Powerpath-2 since it is loosely based on that
31. SGI Powerpath-2 bus
- Logically two separate buses
- A request bus for launching the command type (BusRd, BusWB etc.) and the involved address
- A response bus for providing the data response, if any
- Since responses may arrive in an order different from the request order, a 3-bit tag is assigned to each request
- Responses launch this tag on the tag bus along with the data reply so that the address bus may be left free for other requests
- The data bus is 256 bits wide while a cache line is 128 bytes
- One data response phase needs four bus cycles along with one additional hardware turnaround cycle
32. SGI Powerpath-2 bus
- Essentially two main buses and various control wires for snoop results, flow control etc.
- Address bus: five cycle arbitration, used during request
- Data bus: five cycle arbitration, five cycle transfer, used during response
- Three different transactions may be in one of these three phases at any point in time
33. SGI Powerpath-2 bus
- Forming a total order
- After the decode cycle during the request phase, every cache controller takes appropriate coherence actions i.e. BusRd downgrades an M line to S, BusRdX invalidates the line
- If a cache controller does not get the tags due to contention with the processor, it simply lengthens the ack phase beyond one cycle
- Thus the total order is formed during the request phase itself i.e. the position of each request in the total order is determined at that point
34. SGI Powerpath-2 bus
- BusWB case
- BusWB only needs the request phase
- However it needs both address and data lines together
- Must arbitrate for both together
- BusUpgr case
- Consists only of the request phase
- No response or acknowledgment
- As soon as the ack phase of address arbitration is completed by the issuing node, the upgrade has sealed a position in the total order and hence is marked complete by a completion signal sent to the issuing processor by its local bus controller (each node has its own bus controller to handle bus requests)
35. Bus interface logic
A request table entry is freed when the response is observed on the bus
36. Snoop results
- Three snoop wires: shared, modified, inhibit (all wired-OR)
- The inhibit wire helps in holding off snoop responses until the data response is launched on the bus
- Although the request phase determines who will source the data i.e. some cache or memory, the memory controller does not know it
- The cache with a modified copy keeps the inhibit line asserted until it gets the data bus and flushes the data; this prevents the memory controller from sourcing the data
- Otherwise the memory controller arbitrates for the data bus
- When the data appears, all cache controllers appropriately assert the shared and modified lines
- Why not launch snoop results as soon as they are available?
37. Conflict resolution
- Use the pending request table to resolve conflicts
- Every processor has a copy of the table
- Before arbitrating for the address bus, every processor looks up the table to see if there is a match
- In case of a match the request is not issued and is held in a pending buffer
- Flow control is needed at different levels
- Essentially need to detect if any buffer is full
- SGI Challenge uses a separate NACK line for each of the address and data phases
- Before the phases reach the ack cycle, any cache controller can assert the NACK line if it runs out of some critical buffer; this invalidates the transaction and the requester must retry (may use back-off and/or priority)
- Sun Enterprise requires the receiver to generate the retry when it has buffer space (thus only one retry)
38. Path of a cache miss
- Assume a read miss
- Look up the request table; in case of a match with BusRd just mark the entry indicating that this processor will snoop the response from the bus and that it will also assert the shared line
- In case of a request table hit with BusRdX, the cache controller must hold on to the request until the conflict resolves
- In case of a request table miss, the requester arbitrates for the address bus; while arbitrating, if a conflicting request arrives, the controller must put a NOP transaction within the slot it is granted and hold on to the request until the conflict resolves (see the sketch below)
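Here is a C sketch of that decision for a read miss, assuming a fully associative table of eight entries as on Powerpath-2; the field and function names are illustrative, not from SGI documentation.

```c
#include <stdint.h>
#include <stdbool.h>

enum bus_op { BUS_RD, BUS_RDX, BUS_UPGR, BUS_WB };

/* One entry of the request table every controller keeps */
struct req_entry {
    bool        valid;
    uint64_t    line_addr;
    enum bus_op op;
    uint8_t     tag;          /* 3-bit response tag on Powerpath-2 */
    bool        will_snarf;   /* we will grab the response off the bus */
};

#define TABLE_SIZE 8          /* eight outstanding requests on the bus */

enum action { SNARF_RESPONSE, HOLD_REQUEST, ARBITRATE };

/* Read miss to addr: a match with BusRd lets us share the in-flight
 * response (and assert the shared line); a match with BusRdX forces us
 * to hold the request; no match means we may arbitrate for the bus. */
enum action read_miss_path(struct req_entry t[TABLE_SIZE], uint64_t addr)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!t[i].valid || t[i].line_addr != addr)
            continue;
        if (t[i].op == BUS_RD) {
            t[i].will_snarf = true;   /* piggyback on the pending read */
            return SNARF_RESPONSE;
        }
        return HOLD_REQUEST;          /* conflicting write in flight */
    }
    return ARBITRATE;
}
```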
39. Path of a cache miss
- Suppose the requester succeeds in putting the request on the address/command bus
- Other cache controllers snoop the request, register it in the request table (the requester also does this), and take appropriate coherence actions within their own cache hierarchies; main memory also starts fetching the cache line
- If a cache holds the line in M state it should source it on the bus during the response phase; it keeps the inhibit line asserted until it gets the data bus; then it lowers the inhibit line and asserts the modified line; at this point the memory controller aborts the data fetch/response and instead fields the line from the data bus for writing it back
40. Path of a cache miss
- If the memory fetches the line even before the snoop is complete, the inhibit line will not allow the memory controller to launch the data on the bus
- After the inhibit line is lowered, depending on the state of the modified line, memory cancels the data response
- If no one has the line in M state, the requester grabs the response from memory
- A store miss is similar
- The only difference is that even if a cache has the line in M state, the memory controller does not write the response back
- Also, any pending BusUpgr to the same cache line must be converted to BusRdX
41. Write serialization
- In a split-transaction bus setting, the request table provides sufficient support for write serialization
- Requests to the same cache line are not allowed to proceed at the same time
- A read to a line after a write to the same line can be launched only after the write response phase has completed; this guarantees that the read will see the new value
- A write after a read to the same line can be started only after the read response has completed; this guarantees that the value of the read cannot be altered by the value written
42. Write atomicity and SC
- Sequential consistency (SC) requires write atomicity i.e. the total order of all writes seen by all processors should be identical
- Since a BusRdX or BusUpgr does not wait until the invalidations are actually applied to the caches, you have to be careful
- P0: A=1; B=1
- P1: print B; print A
- Under SC, (A, B) = (0, 1) is not allowed
- Suppose to start with P1 has the line containing A in its cache, but not the line containing B
- The stores of P0 queue the invalidation of A in P1's cache controller
- P1 takes a read miss for B, but the response for B is reordered by P1's cache controller so that it overtakes the invalidation (though it may seem better to prioritize reads)
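The litmus test above, written out as a runnable C program with POSIX threads. This only restates the example: if P1's controller lets the response for B overtake the queued invalidation of A, the forbidden output (A, B) = (0, 1) appears. Note that plain int accesses are used for brevity; a real test would also need volatile or C11 atomics to stop the compiler reordering.

```c
#include <pthread.h>
#include <stdio.h>

int A = 0, B = 0;

/* P0 writes A before B */
void *p0(void *arg) { A = 1; B = 1; return 0; }

/* P1 reads B before A; under SC, seeing B == 1 implies A == 1 */
void *p1(void *arg)
{
    int b = B;   /* read miss: the response may be reordered */
    int a = A;   /* stale hit if the invalidation is still queued */
    printf("(A, B) = (%d, %d)\n", a, b);  /* (0, 1) would violate SC */
    return 0;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, 0, p0, 0);
    pthread_create(&t1, 0, p1, 0);
    pthread_join(t0, 0);
    pthread_join(t1, 0);
    return 0;
}
```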
43. Another example
- P0: A=1; print B
- P1: B=1; print A
- Under SC, (A, B) = (0, 0) is not allowed
- The same problem arises if P0 executes both instructions first, then P1 executes the write of B (which, let's assume, generates an upgrade so that it is marked complete as soon as the address arbitration phase finishes), and then the upgrade completion is reordered with the pending invalidation of A
- So, the reason these two cases fail is that the new values are made visible before older invalidations are applied
- One solution is to have a strict FIFO queue between the bus controller and the cache hierarchy
- But it is sufficient as long as replies do not overtake invalidations; otherwise the bus responses can be reordered without violating write atomicity and hence SC (e.g., if there are only read and write responses in the queue, it sometimes may make sense to prioritize read responses)
44. In-order response
- In-order response can simplify quite a few things in the design
- The fully associative request table can be replaced by a FIFO queue
- Conflicting requests where one is a write can actually be allowed now (multiple reads were allowed even before, although only the first one actually appears on the bus)
- Consider a BusRdX followed by a BusRd from two different processors
- With in-order response it is guaranteed that the BusRdX response will be granted the data bus before the BusRd response (which may not be true for out-of-order response, and hence such a conflict is disallowed there)
- So when the cache controller generating the BusRdX sees the BusRd, it only notes that it should source the line for this request after its own write is completed
45. In-order response
- The performance penalty may be huge
- Essentially because of the memory system
- Consider a situation where three requests are pending to cache lines A, B, C in that order
- A and B map to the same memory bank while C is in a different bank
- Although the response for C may be ready long before that of B, it cannot get the bus
46. Multi-level caches
- A split-transaction bus makes the design of multi-level caches a little more difficult
- The usual design is to have queues between levels of caches in each direction
- How do you size the queues? Between the processor and L1 one buffer is sufficient (assume one outstanding processor access), L1-to-L2 needs P+1 buffers (why?), L2-to-L1 needs P buffers (why?), L1 to processor needs one buffer
- With smaller buffers there is a possibility of deadlock: suppose the L1-to-L2 and L2-to-L1 queues have one entry each, there is a request in the L1-to-L2 queue and there is also an intervention in the L2-to-L1 queue; clearly L1 cannot pick up the intervention because it does not have space to put the reply in the L1-to-L2 queue, while L2 cannot pick up the request because it might need space in the L2-to-L1 queue in case of an L2 hit
47. Multi-level caches
- Formalizing the deadlock with a dependence graph
- There are four types of transactions in the cache hierarchy: 1. processor requests (outbound requests), 2. responses to processor requests (inbound responses), 3. interventions (inbound requests), 4. intervention responses (outbound responses)
- Processor requests need space in the L1-to-L2 queue; responses to processors need space in the L2-to-L1 queue; interventions need space in the L2-to-L1 queue; intervention responses need space in the L1-to-L2 queue
- Thus a message in the L1-to-L2 queue may need space in the L2-to-L1 queue (e.g. a processor request generating a response due to an L2 hit); also a message in the L2-to-L1 queue may need space in the L1-to-L2 queue (e.g. an intervention response)
- This creates a cycle in the queue space dependence graph
48. Dependence graph
- Represent a queue by a vertex in the graph
- Number of vertices = number of queues
- A directed edge from vertex u to vertex v is present if a message at the head of queue u may generate another message which requires space in queue v
- In our case we have two queues, L2→L1 and L1→L2; the graph is not a DAG, hence deadlock
49. Multi-level caches
- In summary
- The L2 cache controller refuses to drain the L1-to-L2 queue if there is no space in the L2-to-L1 queue; this is rather conservative because the message at the head of the L1-to-L2 queue may not need space in the L2-to-L1 queue, e.g., in case of an L2 miss or if it is an intervention reply, but after popping the head of the L1-to-L2 queue it is impossible to backtrack if the message does need space in the L2-to-L1 queue
- Similarly, the L1 cache controller refuses to drain the L2-to-L1 queue if there is no space in the L1-to-L2 queue
- How do we break this cycle?
- Observe that responses to processor requests are guaranteed not to generate any more messages, and intervention requests do not generate new requests but can only generate replies
50. Multi-level caches
- Solving the queue deadlock
- Introduce one more queue in each direction i.e. have a pair of queues in each direction
- An L1-to-L2 processor request queue and an L1-to-L2 intervention response queue
- Similarly, an L2-to-L1 intervention request queue and an L2-to-L1 processor response queue
- Now the L2 cache controller can serve the L1-to-L2 processor request queue as long as there is space in the L2-to-L1 processor response queue, but there is no constraint on the L1 cache controller for draining the L2-to-L1 processor response queue
- Similarly, the L1 cache controller can serve the L2-to-L1 intervention request queue as long as there is space in the L1-to-L2 intervention response queue, but the L1-to-L2 intervention response queue will drain as soon as the bus is granted (see the sketch below)
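A compact C sketch of the four-queue drain discipline. The queue and function names are assumptions, and the "PY drains into the processor" and "IY drains onto the bus" steps are modeled as unconditional pops; the point is that each drain rule depends only on queues that themselves drain unconditionally, so no cycle remains.

```c
#include <stdbool.h>

struct queue { int count, capacity; };

static bool full(const struct queue *q)  { return q->count == q->capacity; }
static bool empty(const struct queue *q) { return q->count == 0; }

struct l1_l2_queues {
    struct queue pr;  /* L1->L2 processor requests    */
    struct queue iy;  /* L1->L2 intervention replies  */
    struct queue py;  /* L2->L1 processor replies     */
    struct queue ir;  /* L2->L1 intervention requests */
};

void drain_one_step(struct l1_l2_queues *q)
{
    /* L2 serves a processor request only if the reply it may generate
     * (on an L2 hit) has guaranteed space in PY */
    if (!empty(&q->pr) && !full(&q->py)) {
        q->pr.count--; q->py.count++;   /* e.g. an L2 hit produces a reply */
    }
    /* L1 serves an intervention only if its reply has space in IY */
    if (!empty(&q->ir) && !full(&q->iy)) {
        q->ir.count--; q->iy.count++;
    }
    /* PY sinks into the processor, IY sinks onto the bus; neither
     * generates further messages, so both always drain eventually */
    if (!empty(&q->py)) q->py.count--;
    if (!empty(&q->iy)) q->iy.count--;
}
```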
51. Dependence graph
- Now we have four queues
- Processor request (PR) and intervention reply (IY) queues go from L1 to L2
- Processor reply (PY) and intervention request (IR) queues go from L2 to L1
- In the dependence graph over the vertices PR, PY, IR, IY, the only edges are PR→PY and IR→IY; PY and IY have no outgoing edges, so the graph is acyclic
52. Dependence graph
- Possible to combine PR and IY into a supernode of the graph and still be cycle-free
- Leads to one L1 to L2 queue
- Similarly, possible to combine IR and PY into a supernode
- Leads to one L2 to L1 queue
- Cannot do both
- Leads to a cycle as already discussed
- Bottomline: need at least three queues for a two-level cache hierarchy
53. Multiple outstanding requests
- Today all processors allow multiple outstanding cache misses
- We have already discussed issues related to out-of-order execution
- Not much needs to be added on top of that to support multiple outstanding misses
- For a multi-level cache hierarchy the queue depths may be made bigger for performance reasons
- Various other buffers such as the writeback buffer need to be made bigger
54. SGI Challenge
- Supports 36 MIPS R4400 (4 per board) or 18 MIPS R8000 (2 per board) processors
- The A-chip has the address bus interface and the request table
- The CC-chip handles coherence through the duplicate set of tags
- Each D-chip handles 64 bits of data, and as a whole 4 D-chips interface to a 256-bit wide data bus
55. Sun Enterprise
- Supports up to 30 UltraSPARC processors
- 2 processors and 1 GB memory per board
- Wide 64-byte memory bus and hence two memory cycles to transfer the entire cache line (128 bytes)
56. Sun Gigaplane bus
- Split-transaction, 256 bits data, 41 bits address, 83.5 MHz (compare to 47.6 MHz of SGI Powerpath-2)
- Supports 16 boards
- 112 outstanding transactions (up to 7 from each board)
- Snoop result is available 5 cycles after the request phase
- Memory fetches data speculatively
- MOESI protocol
57. Special Topics
58. Virtually indexed caches
- Recall that to have concurrent accesses to the TLB and the cache, L1 caches are often made virtually indexed
- Can read the physical tag and data while the TLB lookup takes place
- Later compare the tag for hit/miss detection
- How does it impact the functioning of coherence protocols and snoop logic?
- Even for a uniprocessor there is the synonym problem
- Two different virtual addresses may map to the same physical page frame
- One simple solution may be to flush all cache lines mapped to a page frame at the time of replacement
- But this clearly prevents page sharing between two processes
59. Virtual indexing
- Software normally employs page coloring to solve the synonym issue
- Allow two virtual pages to point to the same physical page frame only if the two virtual addresses have at least the lower k bits in common, where k is equal to the cache line block offset plus log2(number of cache sets)
- This guarantees that in a virtually indexed cache, lines from both pages will map to the same index range (see the sketch below)
- What about the snoop logic?
- Putting the virtual address on the bus requires a VA to PA translation in the snoop logic so that physical tags can be generated (adds extra latency to the snoop and also requires a duplicate set of translations)
- Putting the physical address on the bus requires a reverse translation to generate the virtual index (requires an inverted page table)
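A small C sketch of the page-coloring rule stated above. The line size and set count below are illustrative values, not from any particular machine; the rule itself (k = offset bits + index bits) is exactly the slide's condition.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 64        /* => 6 block-offset bits (illustrative)  */
#define NUM_SETS   512       /* => 9 index bits (illustrative)         */
#define K_BITS     (6 + 9)   /* k = log2(LINE_BYTES) + log2(NUM_SETS)  */

/* Two virtual addresses may share a physical frame only if they agree
 * in the low k bits. When they do, both synonyms select the same cache
 * set, so a virtually indexed cache can never hold the same physical
 * line in two different sets. */
bool same_color(uint64_t va1, uint64_t va2)
{
    uint64_t mask = (1ULL << K_BITS) - 1;
    return (va1 & mask) == (va2 & mask);
}
```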
60. Virtual indexing
- Dual tags (Goodman, 1987)
- A hardware solution to avoid synonyms in shared memory
- Maintain virtual and physical tags; each corresponding tag pair points to the other
- Assume no page coloring
- Use the virtual address to look up the cache (i.e. virtual index and virtual tag) from the processor side; if it hits, everything is fine; if it misses, use the physical address to look up the physical tag and if it hits, follow the physical-tag-to-virtual-tag pointer to find the index
- If the virtual tag misses and the physical tag hits, that means the synonym problem has occurred i.e. two different VAs are mapped to the same PA; in this case invalidate the cache line pointed to by the physical tag, replace the line at the virtual index of the current virtual address, place the contents of the invalidated line there, and update the physical tag pointer to point to the new virtual index
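A minimal C sketch of this dual-tag lookup on a direct-mapped cache. The field names, the 24-bit tag split, and the lookup interface are illustrative assumptions (data movement is omitted); the control flow follows the two cases described above.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 512

struct vtag_entry { bool valid; uint64_t vtag; int ptag_idx; };
struct ptag_entry { bool valid; uint64_t ptag; int vindex;   };

static struct vtag_entry vtags[NUM_SETS];
static struct ptag_entry ptags[NUM_SETS];

/* Processor-side lookup: try the virtual tags first; on a miss, probe
 * the physical tags to catch a synonym. Returns the virtual index of
 * the hit, or -1 on a true miss. */
int cache_lookup(uint64_t va, uint64_t pa, int vindex, int pindex)
{
    if (vtags[vindex].valid && vtags[vindex].vtag == (va >> 24))
        return vindex;                      /* ordinary virtual-tag hit */

    if (ptags[pindex].valid && ptags[pindex].ptag == (pa >> 24)) {
        /* synonym: the same physical line lives at another virtual index */
        int old = ptags[pindex].vindex;
        vtags[old].valid = false;           /* invalidate the old copy */
        vtags[vindex].valid    = true;      /* re-install under the new VA */
        vtags[vindex].vtag     = va >> 24;
        vtags[vindex].ptag_idx = pindex;
        ptags[pindex].vindex   = vindex;    /* repoint the physical tag */
        /* (the data of the invalidated line would be moved here too) */
        return vindex;
    }
    return -1;                              /* true miss: go to L2/bus */
}
```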
61. Virtual indexing
- Goodman, 1987
- Always use the physical address for snooping
- Obviates the need for a TLB in the memory controller
- The physical tag is used to look up the cache for the snoop decision
- In case of a snoop hit, the pointer stored with the physical tag is followed to get the virtual index and then the cache block can be accessed if needed (e.g., in M state)
- Note that even if there are two different types of tags, the state of a cache line is the same and does not depend on what type of tag is used to access the line
62. Virtual indexing
- Multi-level cache hierarchy
- Normally the L1 cache is designed to be virtually indexed and the other levels are physically indexed
- L2 sends interventions to L1 by communicating the PA
- L1 must determine the virtual index from that to access the cache; dual tags are sufficient for this purpose
63. TLB coherence
- A page table entry (PTE) may be held by multiple processors in shared memory because all of them access the same shared page
- A PTE may get modified when the page is swapped out and/or access permissions are changed
- Must tell all processors having this PTE to invalidate it
- How to do it efficiently?
- No TLB: virtually indexed virtually tagged L1 caches
- On an L1 miss directly access the PTE in memory and bring it into the cache; then use normal cache coherence because the PTEs also reside in the shared memory segment
- On page replacement the page fault handler can flush the cache line containing the replaced PTE
- Too impractical: fully virtual caches are rare, and a TLB is still used for the upper levels (e.g., the Alpha 21264 instruction cache)
64. TLB coherence
- Hardware solution
- Extend the snoop logic to handle TLB coherence
- The PowerPC family exercises a tlbie instruction (TLB invalidate entry)
- When the OS modifies a PTE it puts a tlbie instruction on the bus
- The snoop logic picks it up and invalidates the TLB entry if present, in all processors
- This is well suited for bus-based SMPs, but not for DSMs because broadcast in a large-scale machine is not good
65. TLB shootdown
- A popular TLB coherence solution
- Invoked by an initiator (the processor which modifies the PTE) by sending interrupts to processors that might be caching the PTE in their TLBs; before doing so the OS also locks the involved PTE to avoid any further access to it in case of TLB misses from other processors
- The receiver of the interrupt simply invalidates the involved PTE if it is present in its TLB and sets a flag in shared memory on which the initiator polls
- On completion the initiator unlocks the PTE
- SGI Origin uses a lazy TLB shootdown i.e. it invalidates a TLB entry only when a processor tries to access it the next time (will discuss in detail)
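A minimal C sketch of the shootdown handshake just described. Interrupt delivery is abstracted into shared flags, and every name here (struct fields, function names, NCPU) is a hypothetical placeholder, not a real OS interface.

```c
#include <stdbool.h>

#define NCPU 8

struct pte { unsigned frame; bool locked; };

static volatile bool done[NCPU];  /* per-CPU "I have invalidated" flags */

void shootdown_initiator(struct pte *p, int self)
{
    p->locked = true;              /* block TLB refills of this PTE */
    /* ... modify the PTE: swap the page out or change permissions ... */
    for (int c = 0; c < NCPU; c++)
        done[c] = (c == self);     /* clear flags, then send IPIs (abstracted) */
    for (int c = 0; c < NCPU; c++)
        while (!done[c])
            ;                      /* poll until every CPU acknowledges */
    p->locked = false;             /* shootdown complete: unlock the PTE */
}

/* Interrupt handler running on a receiving CPU */
void shootdown_receiver(int self)
{
    /* ... invalidate the TLB entry if present (e.g. invlpg on x86) ... */
    done[self] = true;             /* signal the polling initiator */
}
```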
66. Snooping on a ring
- The length of the bus limits the frequency at which it can be clocked, which in turn limits the bandwidth offered by the bus, leading to a limited number of processors
- A ring interconnect provides a better solution
- Connect a processor only to its two neighbors
- Short wires, much higher switching frequency, better bandwidth, more processors
- Each node has private local memory (more like a distributed shared memory multiprocessor)
- Every cache line has a home node i.e. the node whose memory contains the line; it can be determined from the upper few bits of the PA
- Transactions traverse the ring node by node
67. Snooping on a ring
- Snoop mechanism
- When a transaction passes by the ring interface of a node, the node snoops the transaction, takes appropriate coherence actions, and forwards the transaction to its neighbor if necessary
- The home node also receives the transaction eventually; let's assume that it has a dirty bit associated with every memory line (otherwise you need a two-phase protocol)
- A request transaction is removed from the ring when it comes back to the requester (this serves as an acknowledgment that every node has seen the request)
- The ring is essentially divided into time slots where a node can insert a new request or response; if there is no free time slot it must wait until one passes by; this is called a slotted ring
68. Snooping on a ring
- The snoop logic must be able to finish coherence actions for a transaction before the next time slot arrives
- The main problem of a ring is the end-to-end latency, since transactions must traverse hop by hop
- Serialization and sequential consistency are trickier
- The order of two transactions may be seen differently by two processors if the source of one transaction is between the two processors
- The home node can resort to NACKs if it sees conflicting outstanding requests
- Introduces many races in the protocol
69. Scaling bandwidth
- Data bandwidth
- Make the bus wider: costly hardware
- Replace the bus by a point-to-point crossbar: since only the address portion of a transaction is needed for coherence, the data transaction can go directly between source and destination
- Add multiple data busses
- Snoop or coherence bandwidth
- This is determined by the number of snoop actions that can be executed in unit time
- Having concurrent non-conflicting snoop actions definitely helps improve the protocol throughput
- Multiple address busses: a separate snoop engine is associated with each bus on each node
- Order the address busses logically to define a partial order among concurrent requests so that these partial orders can be combined to form a total order
70. AMD Opteron
- Each node contains an x86-64 core, 64 KB L1 data and instruction caches, a 1 MB L2 cache, an on-chip integrated memory controller, three fast routing links called HyperTransport, and local DDR memory
- Glueless MP: just connect 8 Opteron chips via HT to build a distributed shared memory multiprocessor
- The L2 cache supports 10 outstanding misses
Reproduced from IEEE Micro
71. AMD Opteron
- Integrated memory controller and north bridge functionality help a lot
- Can clock the memory controller at processor frequency (2 GHz)
- No need to have a cumbersome motherboard; just buy the Opteron chip and connect it to a few peripherals (system maintenance is much easier)
- Overall, improves performance by 20-25% over the Athlon
- Snoop throughput and bandwidth are much higher since the snoop logic is clocked at 2 GHz
- Integrated HyperTransport provides very high communication bandwidth
- Point-to-point links, split-transaction and full duplex (bidirectional links)
- On each HT link you can connect a processor or I/O
72. Opteron servers
Reproduced from IEEE Micro
73. AMD Hammer protocol
- Opteron uses the snoop-based Hammer protocol
- First the requester sends a transaction to the home node
- The home node starts accessing main memory and in parallel broadcasts the request to all the nodes via point-to-point messages
- The nodes individually snoop the request, take appropriate coherence actions in their local caches, and send data (if someone has it in M or O state) or an empty completion acknowledgment to the requester; the home memory also sends the data speculatively
- After gathering all responses the requester sends a completion message to the home node so that it can proceed with subsequent requests (this completion ack may be needed for serializing conflicting requests)
- This is one example of a snoopy protocol over a point-to-point interconnect, unlike the shared bus
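A C sketch of the requester's response-gathering step in this flow. Message transport is abstracted away and every identifier is a hypothetical placeholder, not AMD's actual implementation; the precedence rule (a cache copy in M or O state beats the speculative memory copy) follows the description above.

```c
#include <stddef.h>

/* Kinds of per-node responses the requester may collect */
enum resp_kind {
    RESP_ACK,          /* empty ack: that node had no M/O copy */
    RESP_DATA_CACHE,   /* data sourced from a cache in M or O state */
    RESP_DATA_MEMORY   /* speculative data from the home memory */
};

struct response { enum resp_kind kind; const void *line; };

/* Gather one response per peer node plus the home memory's speculative
 * reply; the cache copy, if any, is the up-to-date one, so it takes
 * precedence over the possibly stale memory copy. Returns the line to
 * install; afterwards the requester sends its completion message to the
 * home node so the next conflicting request can be released. */
const void *gather_responses(const struct response *r, size_t n)
{
    const void *from_memory = NULL, *from_cache = NULL;
    for (size_t i = 0; i < n; i++) {
        if (r[i].kind == RESP_DATA_CACHE)  from_cache  = r[i].line;
        if (r[i].kind == RESP_DATA_MEMORY) from_memory = r[i].line;
        /* RESP_ACK carries no data and needs no action here */
    }
    return from_cache ? from_cache : from_memory;
}
```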