Title: Hardware Transactional Memory
1. Hardware Transactional Memory
- Royi Maimon
- Merav Havuv
- 27/5/2007
2. References
- M. Herlihy and J. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures"
- C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, Sean Lie, "Unbounded Transactional Memory"
- Hammond, Wong, Chen, Carlstrom, Davis (Jun 2004), "Transactional Memory Coherence and Consistency"
3. Today
- What are transactions?
- What is Hardware Transactional Memory?
- Various implementations of HTM
4. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
5. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
6. Lock-free
- A shared data structure is lock-free if its operations do not require mutual exclusion.
- If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object.
7. Lock-free (cont.)
- Lock-free data structures avoid common problems associated with conventional locking techniques in highly concurrent systems:
- Priority inversion
- Convoying: occurs when a process holding a lock is descheduled, and other processes that could otherwise run are unable to make progress.
- Deadlock
8. Priority inversion
- Priority inversion occurs when a lower-priority process is preempted while holding a lock needed by higher-priority processes.
9. Deadlock
- Deadlock: two or more processes are waiting indefinitely for an event that can be caused by only one of the waiting processes.
- Let S and Q be two resources:

    P0          P1
    Lock(S)     Lock(Q)
    Lock(Q)     Lock(S)
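The P0/P1 cycle above disappears if both processes acquire the two locks in one fixed global order. A minimal single-threaded sketch (the toy int "locks" and the `lock_pair` helper are invented for illustration, not a real mutex API):

```c
#include <assert.h>
#include <stdint.h>

/* Toy locks for the resources S and Q above (0 = free, 1 = held).
 * Single-threaded illustration -- not a real mutex. */
int S = 0, Q = 0;

/* Acquire both locks in a fixed global order (here: by address), so
 * two processes can never each hold one lock while waiting for the
 * other. This breaks the cycle shown on the slide. */
void lock_pair(int *a, int *b) {
    if ((uintptr_t)a > (uintptr_t)b) { int *t = a; a = b; b = t; }
    assert(*a == 0); *a = 1;   /* Lock(first)  */
    assert(*b == 0); *b = 1;   /* Lock(second) */
}

void unlock_pair(int *a, int *b) {
    *a = 0; *b = 0;
}
```

Whichever order the caller names the locks, `lock_pair(&S, &Q)` and `lock_pair(&Q, &S)` take them in the same internal order.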
10. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
11. What is a transaction?
- A transaction is a sequence of memory loads and stores executed by a single process that either commits or aborts.
- If a transaction commits, all the loads and stores appear to have executed atomically.
- If a transaction aborts, none of its stores take effect.
- Transaction operations aren't visible until they commit or abort.
12. Transaction properties
- A transaction satisfies the following properties:
- Serializability
- Atomicity
- A simplified version of the traditional database ACID properties (Atomicity, Consistency, Isolation, and Durability).
13. Transactional Memory
- A new multiprocessor architecture
- The goal: implementing lock-free synchronization that is
- efficient
- easy to use
- compared to conventional techniques based on mutual exclusion.
- Implemented by straightforward extensions to multiprocessor cache-coherence protocols.
14. An Example
- Locks:

    if (i < j) {
        a = i; b = j;
    } else {
        a = j; b = i;
    }
    Lock(L[a]);
    Lock(L[b]);
    Flow[i] = Flow[i] + X;
    Flow[j] = Flow[j] - X;
    Unlock(L[b]);
    Unlock(L[a]);

- Transactional Memory:

    StartTransaction;
    Flow[i] = Flow[i] + X;
    Flow[j] = Flow[j] - X;
    EndTransaction;
15. Transactional Memory
- Transactions execute in commit order
[Diagram: two transactions accessing the value 0xbeef, serialized along a time axis]
16. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
17. Cache-Coherence Protocol
- A protocol for managing the caches of a multiprocessor system, ensuring that
- no data is lost
- no data is overwritten before it is transferred from a cache to the target memory.
- In a multiprocessor, each processor may have its own memory cache that is separate from the shared memory.
18. The Problem (Cache-Coherence)
- The problem is solved in either of two ways:
- directory-based
- snooping
19. Snoopy Cache
- All caches watch the activity (snoop) on a global bus to determine whether they have a copy of the block of data that is requested on the bus.
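As a rough software model of write-invalidate snooping (the `Cache` array and `bus_write` helper are invented names for this sketch, not part of any real protocol):

```c
#include <assert.h>

#define NCACHES 4

/* Toy model: each cache either holds a valid copy of one shared
 * block or not. Names and sizes are illustrative only. */
typedef struct { int valid; int data; } Cache;

Cache caches[NCACHES];

/* A write is broadcast on the shared bus. Every other cache snoops
 * the bus and, on seeing a write to a block it holds, invalidates
 * its copy (write-invalidate snooping). */
void bus_write(int cpu, int value) {
    for (int i = 0; i < NCACHES; i++)
        if (i != cpu)
            caches[i].valid = 0;       /* snooped write: invalidate */
    caches[cpu].valid = 1;
    caches[cpu].data  = value;
}
```

After one CPU writes, only its cache holds a valid copy; the others must re-fetch over the bus.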
20. Directory-based
- The data being shared is placed in a common directory that maintains the coherence between caches.
- The directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache.
- When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
21. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
22. How Does It Work?
- The following primitive instructions for accessing memory are provided:
- Load-transactional (LT): reads the value of a shared memory location into a private register.
- Load-transactional-exclusive (LTX): like LT, but hints that the location is likely to be modified.
- Store-transactional (ST): tentatively writes a value from a private register to a shared memory location.
- Commit (COMMIT)
- Abort (ABORT)
- Validate (VALIDATE): tests the current transaction status.
23. Some definitions
- Read set: the set of locations read by LT during a transaction.
- Write set: the set of locations accessed by LTX or ST during a transaction.
- Data set (footprint): the union of the read and write sets.
- A set of values in memory is inconsistent if it couldn't have been produced by any serial execution of transactions.
24. Intended Use
- Instead of acquiring a lock, executing the critical section, and releasing the lock, a process would:
- use LT or LTX to read from a set of locations,
- use VALIDATE to check that the values read are consistent,
- use ST to modify a set of locations,
- use COMMIT to make the changes permanent.
- If either the VALIDATE or the COMMIT fails, the process returns to Step (1).
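The read-validate-write-commit loop above can be sketched in C. Since LT/ST/VALIDATE/COMMIT are hardware instructions, they are mocked here as trivially-succeeding software stubs over a two-word "memory" (all names invented for this sketch) so the retry pattern itself can run:

```c
#include <assert.h>

/* Software stand-ins for the HTM primitives: real hardware would
 * provide these as instructions; here they always succeed so the
 * intended-use loop can execute. */
static int memory[2];      /* "shared" locations            */
static int tentative[2];   /* tentative writes (write set)  */

int  LT(int addr)            { return memory[addr]; }
void ST(int addr, int value) { tentative[addr] = value; }
int  VALIDATE(void)          { return 1; }  /* always consistent here */
int  COMMIT(void) {           /* make the tentative writes permanent */
    memory[0] = tentative[0];
    memory[1] = tentative[1];
    return 1;
}

/* The intended-use pattern: read, validate, write, commit,
 * and retry from the top on failure. Moves x from location 0
 * to location 1. */
void transfer(int x) {
    while (1) {
        int a = LT(0), b = LT(1);
        if (!VALIDATE()) continue;    /* back to step (1) */
        ST(0, a - x);
        ST(1, b + x);
        if (COMMIT()) break;          /* retry on abort   */
    }
}
```

In real hardware the `continue`/retry paths would actually be taken when another processor's transaction conflicts with this one.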
25. Implementation
- Transactional memory is implemented by modifying standard multiprocessor cache-coherence protocols.
- We describe here how to extend a snoopy cache protocol for a shared bus to support transactional memory.
- Our transactions are short-lived activities with a relatively small data set.
26. The basic idea
- Any protocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost.
- Once a transaction conflict is detected, it can be resolved in a variety of ways.
27. Implementation
- Each processor maintains two caches:
- a regular cache for non-transactional operations,
- a transactional cache for transactional operations. It holds all the tentative writes, without propagating them to other processors or to main memory (until commit).
- Why use two caches?
28. Cache line states
- Each cache line (regular or transactional) has one of the following states:
[Table of cache line states omitted]
- The transactional cache extends these states.
29. Cleanup
- When the transactional cache needs space for a new entry, it searches for:
- an EMPTY entry,
- if not found, a NORMAL entry,
- finally, an XCOMMIT entry.
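The replacement order above can be written as a small search function. This is an illustrative sketch: `choose_victim` is an invented name, and the XABORT state comes from the original Herlihy-Moss paper rather than this slide:

```c
#include <assert.h>

/* Transactional-cache entry states named on this slide, plus XABORT
 * from the original paper (not listed here). */
enum state { EMPTY, NORMAL, XCOMMIT, XABORT };

/* Search for a replacement victim in the order the slide gives:
 * first an EMPTY entry, then a NORMAL one, finally an XCOMMIT one.
 * Returns the index of the chosen entry, or -1 if none qualifies. */
int choose_victim(const enum state *lines, int n) {
    const enum state pref[] = { EMPTY, NORMAL, XCOMMIT };
    for (int p = 0; p < 3; p++)
        for (int i = 0; i < n; i++)
            if (lines[i] == pref[p])
                return i;
    return -1;
}
```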
30. Processor actions
- Each processor maintains two flags:
- the transaction active (TACTIVE) flag indicates whether a transaction is in progress;
- the transaction status (TSTATUS) flag indicates whether that transaction is active (True) or aborted (False).
- Non-transactional operations behave exactly as in the original cache-coherence protocol.
31. Example: LT operation
[Flowchart omitted: the lookup's found / not-found outcomes, a cache miss, and the successful or unsuccessful read results]
32. Snoopy cache actions
- Both the regular cache and the transactional cache snoop on the bus.
- A cache ignores any bus cycles for lines not in that cache.
- The transactional cache's behavior:
- if TSTATUS == False, or if the operation isn't transactional, the cache acts just like the regular cache, but ignores entries with a state other than NORMAL;
- on an LT from another CPU, if the state is VALID, the cache returns the value; for all other transactional operations it returns BUSY.
33. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
34. Simulation
- We'll see example code for the producer/consumer algorithm using the transactional memory architecture.
- The simulation runs on both cache-coherence protocols: snoopy cache and directory-based.
- The simulation uses 32 processors.
- The simulation finishes when 2^16 operations have completed.
35. Part Of Producer/Consumer Code

    typedef struct {
        Word deqs;                 /* Holds the head's index */
        Word enqs;                 /* Holds the tail's index */
        Word items[QUEUE_SIZE];
    } queue;

    unsigned queue_deq(queue *q) {
        unsigned head, tail, result;
        unsigned backoff = BACKOFF_MIN;
        unsigned wait;
        while (1) {
            result = QUEUE_EMPTY;
            tail = LTX(&q->enqs);
            head = LTX(&q->deqs);
            if (head != tail) {               /* queue not empty? */
                result = LT(&q->items[head % QUEUE_SIZE]);
                /* advance counter */
                ST(&q->deqs, head + 1);
            }
            if (COMMIT()) break;
            /* abort => backoff */
            wait = random() % (01 << backoff);
            while (wait--);
            if (backoff < BACKOFF_MAX) backoff++;
        }
        return result;
    }
36. The results
[Performance charts omitted]
37. So Far
- In both HTM and STM, transactions shouldn't touch many memory locations.
- There is a (small) bound on the transaction's footprint.
- In addition, there is a duration limit.
38. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
39. Unbounded Transactional Memory (UTM)
- UTM: a new thesis that supports transactions of arbitrary footprint and duration.
- The UTM architecture allows:
- transactions as large as virtual memory,
- transactions of unlimited duration,
- transactions which can migrate between processors.
- UTM supports a semantics for nested transactions.
- In contrast to previous HTM implementations, UTM is optimized for transactions below a certain size, but still operates correctly for larger transactions.
40. The Goal of UTM
- The primary goals:
- make concurrent programming easier,
- reduce implementation overhead.
- Why do we want unbounded TM? Neither programmers nor compilers can easily cope with an imposed hard limit on transaction size.
41. UTM architecture
- The transaction log: a data structure that maintains bookkeeping information for a transaction.
- Why is it needed?
- It enables transactions to survive time-slice interrupts.
- It enables process migration from one processor to another.
42. Two new instructions
- All the programmer must specify is where a transaction begins and ends:
- XBEGIN pc:
- begins a new transaction; the entry point of an abort handler is specified by pc;
- if the transaction must fail, the processor and memory state are rolled back to what they were when XBEGIN was executed, and control jumps to pc;
- we can think of an XBEGIN instruction as a conditional branch to the abort handler.
- XEND:
- ends the current transaction; if XEND completes, the transaction is committed and appears atomic.
- Nested transactions are subsumed into the outer transaction.
43. Transaction Semantics

    A:      XBEGIN L1
            ADD R1, R1, R1
            ST 1000, R1
            XEND

    B:  L2: XBEGIN L2
            ADD R1, R1, R1
            ST 2000, R1
            XEND

- Two transactions:
- A has an abort handler at L1
- B has an abort handler at L2
- Here B's retry is very simplistic: its abort handler L2 is the XBEGIN itself, so an aborted B simply restarts.
44. Register renaming
- A name dependence occurs when two instructions Inst1 and Inst2 use the same register (or memory location), but no data is transmitted between Inst1 and Inst2.
- If the register is renamed so that Inst1 and Inst2 do not conflict, the two instructions can execute simultaneously or be reordered.
- This technique, which dynamically eliminates name dependences in registers, is called register renaming.
- Register renaming can be done statically (by the compiler) or dynamically (by hardware).
45. Rolling back processor state
- After an XBEGIN instruction, we take a snapshot of the rename table.
- To keep track of busy registers, we maintain an S (saved) bit for each physical register to indicate which registers are part of the active transaction, and we include the S bits with every renaming-table snapshot.
- An active transaction's abort-handler address, nesting depth, and snapshot are part of its transactional state.
46. Memory State
- UTM represents the set of active transactions with a single data structure held in system memory, the x-state (short for transaction state).
47. X-state Implementation
- The x-state contains a transaction log for each active transaction in the system.
- Each log consists of:
- a commit record, which maintains the transaction's status:
- pending
- committed
- aborted
- a vector of log entries, each corresponding to a memory block that the transaction has read or written. An entry provides:
- a pointer to the block,
- the block's old value (for rollback),
- a pointer to the commit record,
- pointers that form a linked list of all entries in all transaction logs that refer to the same block (the reader list).
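The log layout above can be pictured as C structs. This is purely illustrative: field and type names are invented for the sketch, and the real x-state is a hardware/OS-managed structure, not a C API:

```c
#include <assert.h>
#include <stddef.h>

/* Commit record: the transaction's status. */
typedef enum { PENDING, COMMITTED, ABORTED } CommitRecord;

/* One log entry, for one memory block the transaction touched. */
typedef struct LogEntry {
    void            *block;          /* pointer to the memory block   */
    unsigned long    old_value;      /* old value, kept for rollback  */
    CommitRecord    *commit_record;  /* back-pointer to the record    */
    struct LogEntry *next_reader;    /* reader list: entries in other
                                        logs touching the same block  */
} LogEntry;

/* One transaction log: a commit record plus a vector of entries. */
typedef struct {
    CommitRecord status;             /* pending / committed / aborted */
    LogEntry    *entries;
    size_t       n_entries;
} TransactionLog;
```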
48. X-state Implementation (cont.)
- The final part of the x-state consists of, for each memory block:
- a log pointer,
- a read-write (RW) bit.
49. X-state Data Structure
[Diagram: transaction log entries in the x-state (block pointer, old value, reader list, commit-record pointer) link to blocks in application memory, each block tagged with a log pointer and an RW bit (R or W)]
50. More on x-state
- When a processor references a block that is already part of a pending transaction, the system checks the RW bit and the log pointer to determine the correct action:
- use the old value,
- use the new value,
- abort the transaction.
51. Commit action
[Diagram: the x-state data structure of slide 49 during a commit]
52. Cleanup action
[Diagram: the x-state data structure of slide 49 during cleanup]
53. Abort action
[Diagram: the x-state data structure of slide 49 during an abort]
54. Transaction Conflicts
- A conflict: two or more pending transactions have accessed a block, and at least one of the accesses is for writing.
- Performing a transactional load:
- check that the log pointer refers to an entry in the current transaction's log, or that the RW bit is R.
- Performing a transactional store:
- check that the log pointer references no other transaction.
- In case of a conflict, some of the conflicting transactions are aborted.
- Which transaction should be aborted?
55. Caching
- For a small transaction that fits in the cache, UTM, like earlier proposed HTM systems, uses the cache-coherence protocol to identify conflicts.
- For transactions too big to fit in the cache, the transaction's x-state overflows into the ordinary memory hierarchy.
- Most log entries don't need to be created:
- a transaction log is only created when the transaction runs out of physical memory.
56. UTM's Goals
- Support transactions that:
- run for an indefinite length of time,
- migrate from one processor to another,
- have footprints bigger than the physical memory.
- The main technique proposed is to treat the x-state as a system-wide data structure that uses global virtual addresses.
57. Benefits and Limits of UTM
- Limits:
- Complicated implementation
- Benefits:
- Unlimited footprint
- Unlimited duration
- Migration possible
- Good performance in the common case (small transactions)
58. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
59. LTM: Visible, Large, Frequent, Scalable
- Large Transactional Memory
- Not truly unbounded, but simple and cheap
- Minimal architectural changes, high performance
- Small modifications to cache and processor core
- No changes to main memory or the cache-coherence protocol
- Can be pin-compatible with conventional processors
60. LTM's Restrictions
- A transaction's footprint is limited to (nearly) the size of physical memory.
- A transaction's duration must be less than a time slice.
- Transactions cannot migrate between processors.
- With these restrictions, we can implement LTM by modifying only the cache and processor core.
61. LTM vs. UTM
- Like UTM, LTM maintains data about pending transactions in the cache and detects conflicts using the cache-coherence protocol.
- Unlike UTM, LTM does not treat the transaction as a data structure. Instead, it binds a transaction to a particular cache.
- Transactional data overflows from the cache into a hash table in main memory.
- LTM and UTM have similar semantics: the XBEGIN and XEND instructions are the same.
- In LTM, the cache plays a major part.
62. Additions to the Cache
- LTM adds a bit (T) per cache line to indicate that the line's data has been accessed as part of a pending transaction.
- An additional bit (O) is added per cache set to indicate that the set has overflowed.
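A toy model of the T and O bits (the `Set`/`Line` types, the 2-way size, and the `tx_access` helper are all invented for this sketch; real LTM manages these bits in hardware):

```c
#include <assert.h>
#include <stdbool.h>

#define WAYS 2   /* toy 2-way set, for illustration only */

/* One cache line with its T bit; one set with its O bit. */
typedef struct { unsigned tag; bool valid, T; } Line;
typedef struct { Line line[WAYS]; bool O; } Set;

/* Record a transactional access to `tag` in the set. If every way
 * already holds pending-transaction data, the access would spill to
 * the in-memory overflow hash table, so the set's O bit is raised
 * to tell later misses to probe that table too. */
void tx_access(Set *s, unsigned tag) {
    for (int w = 0; w < WAYS; w++)
        if (s->line[w].valid && s->line[w].tag == tag) {
            s->line[w].T = true;       /* already cached: mark it */
            return;
        }
    for (int w = 0; w < WAYS; w++)
        if (!s->line[w].valid || !s->line[w].T) {
            s->line[w] = (Line){ tag, true, true };  /* take this way */
            return;
        }
    s->O = true;   /* all ways hold transactional data: overflow */
}
```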
63. Cache overflow mechanism
- Running example, stepped through on the following slides:

    ST 1000, 55
    XBEGIN L1
    LD R1, 1000
    ST 2000, 66
    ST 3000, 77
    LD R1, 1000
    XEND

[Diagram: a cache with O / T / Tag / Data columns next to an overflow hash table with Key / Data columns]
64. Cache overflow mechanism
[Diagram: the example continues]
65. Cache overflow: recording reads
[Diagram: the transactional LD sets the T bit on its cache line]
66. Cache overflow: recording writes
[Diagram: the transactional STs set the T bits on their cache lines]
67. Cache overflow: spilling
[Diagram: a transactional line evicted from a full set is spilled to the overflow hash table, and the set's O bit is set]
68. Cache overflow: miss handling
[Diagram: on a miss in a set whose O bit is set, the overflow hash table is also probed]
69. LTM - Summary
- Transactions as large as physical memory
- Scalable overflow and commit
- Easy to implement!
- Low overhead
70. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
71. Transactional Memory Coherence and Consistency (TCC)
- Hammond, Wong, Chen, Carlstrom, Davis (Jun 2004), "Transactional Memory Coherence and Consistency"
- All transactions, all the time! Code is partitioned into transactions by the programmer or by tools
- possibly at run time, for legacy code!
- All writes are buffered in caches; CPUs arbitrate system-wide for which one gets to commit
- Updates are broadcast to all CPUs; CPUs detect conflicts with their own transactions and abort
72. TCC Implementation
[Diagram: each CPU core issues loads and stores into its local cache hierarchy (lines tagged with read/modified bits, a valid bit, tag, and data) and a write buffer; commit control broadcasts stores only over a shared bus or network, which all nodes snoop for commits]
73. Conclusions
- Unbounded, scalable, and efficient Transactional Memory systems can be built.
- They support large, frequent, and concurrent transactions.
- They allow programmers to (finally!) use our parallel systems!
- Three architectures:
- LTM: easy to realize, almost unbounded
- UTM: truly unbounded
- TCC: high performance
74. THE END