Title: Hardware Transactional Memory
1. Hardware Transactional Memory
- Royi Maimon
- Merav Havuv
- 27/5/2007
2. References
- M. Herlihy and J. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures"
- C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, Sean Lie, "Unbounded Transactional Memory"
- Hammond, Wong, Chen, Carlstrom, Davis (Jun 2004), "Transactional Memory Coherence and Consistency"
3. Today
- What are transactions?
- What is Hardware Transactional Memory?
- Various implementations of HTM
4. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
5. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
6. Lock-free
- A shared data structure is lock-free if its operations do not require mutual exclusion.
- If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object.
7. Lock-free (cont.)
- Lock-free data structures avoid common problems associated with conventional locking techniques in highly concurrent systems:
- Priority inversion
- Convoying: occurs when a process holding a lock is descheduled, and other processes that could otherwise run are unable to make progress.
- Deadlock
8. Priority inversion
- Priority inversion occurs when a lower-priority process is preempted while holding a lock needed by higher-priority processes.
9. Deadlock
- Deadlock: two or more processes are waiting indefinitely for an event that can be caused by only one of the waiting processes.
- Let S and Q be two resources:

    P0          P1
    Lock(S)     Lock(Q)
    Lock(Q)     Lock(S)
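The P0/P1 cycle above disappears if both processes acquire the two locks in one fixed global order. A minimal single-threaded sketch (the toy int "locks" and the `lock_pair` helper are invented for illustration, not a real mutex API):

```c
#include <assert.h>
#include <stdint.h>

/* Toy locks for the resources S and Q above (0 = free, 1 = held).
 * Single-threaded illustration -- not a real mutex. */
int S = 0, Q = 0;

/* Acquire both locks in a fixed global order (here: by address), so
 * two processes can never each hold one lock while waiting for the
 * other. This breaks the cycle shown on the slide. */
void lock_pair(int *a, int *b) {
    if ((uintptr_t)a > (uintptr_t)b) { int *t = a; a = b; b = t; }
    assert(*a == 0); *a = 1;   /* Lock(first)  */
    assert(*b == 0); *b = 1;   /* Lock(second) */
}

void unlock_pair(int *a, int *b) {
    *a = 0; *b = 0;
}
```

Whichever order the caller names the locks, `lock_pair(&S, &Q)` and `lock_pair(&Q, &S)` take them in the same internal order.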
10. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
11. What is a transaction?
- A transaction is a sequence of memory loads and stores executed by a single process that either commits or aborts.
- If a transaction commits, all the loads and stores appear to have executed atomically.
- If a transaction aborts, none of its stores take effect.
- Transaction operations aren't visible until they commit or abort.
12. Transaction properties
- A transaction satisfies the following properties:
- Serializability
- Atomicity
- A simplified version of the traditional database ACID properties (Atomicity, Consistency, Isolation, and Durability).
13. Transactional Memory
- A new multiprocessor architecture
- The goal: implementing lock-free synchronization that is
- efficient
- easy to use
- compared to conventional techniques based on mutual exclusion.
- Implemented by straightforward extensions to multiprocessor cache-coherence protocols.
14. An Example
- Locks:

    if (i < j) {
        a = i; b = j;
    } else {
        a = j; b = i;
    }
    Lock(L[a]);
    Lock(L[b]);
    Flow[i] = Flow[i] + X;
    Flow[j] = Flow[j] - X;
    Unlock(L[b]);
    Unlock(L[a]);

- Transactional Memory:

    StartTransaction;
    Flow[i] = Flow[i] + X;
    Flow[j] = Flow[j] - X;
    EndTransaction;
15. Transactional Memory
- Transactions execute in commit order
[Diagram: two transactions accessing the value 0xbeef, serialized along a time axis]
16. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
17. Cache-Coherence Protocol
- A protocol for managing the caches of a multiprocessor system, ensuring that
- no data is lost
- no data is overwritten before it is transferred from a cache to the target memory.
- In a multiprocessor, each processor may have its own memory cache that is separate from the shared memory.
18. The Problem (Cache-Coherence)
- The problem is solved in either of two ways:
- directory-based
- snooping
19. Snoopy Cache
- All caches watch the activity (snoop) on a global bus to determine whether they have a copy of the block of data that is requested on the bus.
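As a rough software model of write-invalidate snooping (the `Cache` array and `bus_write` helper are invented names for this sketch, not part of any real protocol):

```c
#include <assert.h>

#define NCACHES 4

/* Toy model: each cache either holds a valid copy of one shared
 * block or not. Names and sizes are illustrative only. */
typedef struct { int valid; int data; } Cache;

Cache caches[NCACHES];

/* A write is broadcast on the shared bus. Every other cache snoops
 * the bus and, on seeing a write to a block it holds, invalidates
 * its copy (write-invalidate snooping). */
void bus_write(int cpu, int value) {
    for (int i = 0; i < NCACHES; i++)
        if (i != cpu)
            caches[i].valid = 0;       /* snooped write: invalidate */
    caches[cpu].valid = 1;
    caches[cpu].data  = value;
}
```

After one CPU writes, only its cache holds a valid copy; the others must re-fetch over the bus.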
20. Directory-based
- The data being shared is placed in a common directory that maintains the coherence between caches.
- The directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache.
- When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
21. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
22. How Does It Work?
- The following primitive instructions for accessing memory are provided:
- Load-transactional (LT): reads the value of a shared memory location into a private register.
- Load-transactional-exclusive (LTX): like LT, but hints that the location is likely to be modified.
- Store-transactional (ST): tentatively writes a value from a private register to a shared memory location.
- Commit (COMMIT)
- Abort (ABORT)
- Validate (VALIDATE): tests the current transaction status.
23. Some definitions
- Read set: the set of locations read by LT during a transaction.
- Write set: the set of locations accessed by LTX or ST during a transaction.
- Data set (footprint): the union of the read and write sets.
- A set of values in memory is inconsistent if it couldn't have been produced by any serial execution of transactions.
24. Intended Use
- Instead of acquiring a lock, executing the critical section, and releasing the lock, a process would:
- use LT or LTX to read from a set of locations,
- use VALIDATE to check that the values read are consistent,
- use ST to modify a set of locations,
- use COMMIT to make the changes permanent.
- If either the VALIDATE or the COMMIT fails, the process returns to Step (1).
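The read-validate-write-commit loop above can be sketched in C. Since LT/ST/VALIDATE/COMMIT are hardware instructions, they are mocked here as trivially-succeeding software stubs over a two-word "memory" (all names invented for this sketch) so the retry pattern itself can run:

```c
#include <assert.h>

/* Software stand-ins for the HTM primitives: real hardware would
 * provide these as instructions; here they always succeed so the
 * intended-use loop can execute. */
static int memory[2];      /* "shared" locations            */
static int tentative[2];   /* tentative writes (write set)  */

int  LT(int addr)            { return memory[addr]; }
void ST(int addr, int value) { tentative[addr] = value; }
int  VALIDATE(void)          { return 1; }  /* always consistent here */
int  COMMIT(void) {           /* make the tentative writes permanent */
    memory[0] = tentative[0];
    memory[1] = tentative[1];
    return 1;
}

/* The intended-use pattern: read, validate, write, commit,
 * and retry from the top on failure. Moves x from location 0
 * to location 1. */
void transfer(int x) {
    while (1) {
        int a = LT(0), b = LT(1);
        if (!VALIDATE()) continue;    /* back to step (1) */
        ST(0, a - x);
        ST(1, b + x);
        if (COMMIT()) break;          /* retry on abort   */
    }
}
```

In real hardware the `continue`/retry paths would actually be taken when another processor's transaction conflicts with this one.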
25. Implementation
- Transactional memory is implemented by modifying standard multiprocessor cache-coherence protocols.
- We describe here how to extend a snoopy cache protocol for a shared bus to support transactional memory.
- Our transactions are short-lived activities with a relatively small data set.
26. The basic idea
- Any protocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost.
- Once a transaction conflict is detected, it can be resolved in a variety of ways.
27. Implementation
- Each processor maintains two caches:
- a regular cache for non-transactional operations,
- a transactional cache for transactional operations. It holds all the tentative writes, without propagating them to other processors or to main memory (until commit).
- Why use two caches?
28. Cache line states
- Each cache line (regular or transactional) has one of the following states:
[Table of cache line states omitted]
- The transactional cache extends these states.
29. Cleanup
- When the transactional cache needs space for a new entry, it searches for:
- an EMPTY entry,
- if not found, a NORMAL entry,
- finally, an XCOMMIT entry.
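The replacement order above can be written as a small search function. This is an illustrative sketch: `choose_victim` is an invented name, and the XABORT state comes from the original Herlihy-Moss paper rather than this slide:

```c
#include <assert.h>

/* Transactional-cache entry states named on this slide, plus XABORT
 * from the original paper (not listed here). */
enum state { EMPTY, NORMAL, XCOMMIT, XABORT };

/* Search for a replacement victim in the order the slide gives:
 * first an EMPTY entry, then a NORMAL one, finally an XCOMMIT one.
 * Returns the index of the chosen entry, or -1 if none qualifies. */
int choose_victim(const enum state *lines, int n) {
    const enum state pref[] = { EMPTY, NORMAL, XCOMMIT };
    for (int p = 0; p < 3; p++)
        for (int i = 0; i < n; i++)
            if (lines[i] == pref[p])
                return i;
    return -1;
}
```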
30. Processor actions
- Each processor maintains two flags:
- the transaction active (TACTIVE) flag indicates whether a transaction is in progress;
- the transaction status (TSTATUS) flag indicates whether that transaction is active (True) or aborted (False).
- Non-transactional operations behave exactly as in the original cache-coherence protocol.
31. Example: LT operation
[Flowchart omitted: the lookup's found / not-found outcomes, a cache miss, and the successful or unsuccessful read results]
32. Snoopy cache actions
- Both the regular cache and the transactional cache snoop on the bus.
- A cache ignores any bus cycles for lines not in that cache.
- The transactional cache's behavior:
- if TSTATUS == False, or if the operation isn't transactional, the cache acts just like the regular cache, but ignores entries with a state other than NORMAL;
- on an LT from another CPU, if the state is VALID, the cache returns the value; for all other transactional operations it returns BUSY.
33. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
34. Simulation
- We'll see example code for the producer/consumer algorithm using the transactional memory architecture.
- The simulation runs on both cache-coherence protocols: snoopy cache and directory-based.
- The simulation uses 32 processors.
- The simulation finishes when 2^16 operations have completed.
35. Part Of Producer/Consumer Code

    typedef struct {
        Word deqs;                 /* Holds the head's index */
        Word enqs;                 /* Holds the tail's index */
        Word items[QUEUE_SIZE];
    } queue;

    unsigned queue_deq(queue *q) {
        unsigned head, tail, result;
        unsigned backoff = BACKOFF_MIN;
        unsigned wait;
        while (1) {
            result = QUEUE_EMPTY;
            tail = LTX(&q->enqs);
            head = LTX(&q->deqs);
            if (head != tail) {               /* queue not empty? */
                result = LT(&q->items[head % QUEUE_SIZE]);
                /* advance counter */
                ST(&q->deqs, head + 1);
            }
            if (COMMIT()) break;
            /* abort => backoff */
            wait = random() % (01 << backoff);
            while (wait--);
            if (backoff < BACKOFF_MAX) backoff++;
        }
        return result;
    }
36. The results
[Performance charts omitted]
37. So Far
- In both HTM and STM, transactions shouldn't touch many memory locations.
- There is a (small) bound on the transaction's footprint.
- In addition, there is a duration limit.
38. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
39. Unbounded Transactional Memory (UTM)
- UTM: a new thesis that supports transactions of arbitrary footprint and duration.
- The UTM architecture allows:
- transactions as large as virtual memory,
- transactions of unlimited duration,
- transactions which can migrate between processors.
- UTM supports a semantics for nested transactions.
- In contrast to previous HTM implementations, UTM is optimized for transactions below a certain size, but still operates correctly for larger transactions.
40. The Goal of UTM
- The primary goals:
- make concurrent programming easier,
- reduce implementation overhead.
- Why do we want unbounded TM? Neither programmers nor compilers can easily cope with an imposed hard limit on transaction size.
41. UTM architecture
- The transaction log: a data structure that maintains bookkeeping information for a transaction.
- Why is it needed?
- It enables transactions to survive time-slice interrupts.
- It enables process migration from one processor to another.
42. Two new instructions
- All the programmer must specify is where a transaction begins and ends:
- XBEGIN pc:
- begins a new transaction; the entry point of an abort handler is specified by pc;
- if the transaction must fail, the processor and memory state are rolled back to what they were when XBEGIN was executed, and control jumps to pc;
- we can think of an XBEGIN instruction as a conditional branch to the abort handler.
- XEND:
- ends the current transaction; if XEND completes, the transaction is committed and appears atomic.
- Nested transactions are subsumed into the outer transaction.
43. Transaction Semantics

    A:      XBEGIN L1
            ADD R1, R1, R1
            ST 1000, R1
            XEND

    B:  L2: XBEGIN L2
            ADD R1, R1, R1
            ST 2000, R1
            XEND

- Two transactions:
- A has an abort handler at L1
- B has an abort handler at L2
- Here B's retry is very simplistic: its abort handler L2 is the XBEGIN itself, so an aborted B simply restarts.
44. Register renaming
- A name dependence occurs when two instructions Inst1 and Inst2 use the same register (or memory location), but no data is transmitted between Inst1 and Inst2.
- If the register is renamed so that Inst1 and Inst2 do not conflict, the two instructions can execute simultaneously or be reordered.
- This technique, which dynamically eliminates name dependences in registers, is called register renaming.
- Register renaming can be done statically (by the compiler) or dynamically (by hardware).
45. Rolling back processor state
- After an XBEGIN instruction, we take a snapshot of the rename table.
- To keep track of busy registers, we maintain an S (saved) bit for each physical register to indicate which registers are part of the active transaction, and we include the S bits with every renaming-table snapshot.
- An active transaction's abort-handler address, nesting depth, and snapshot are part of its transactional state.
46. Memory State
- UTM represents the set of active transactions with a single data structure held in system memory, the x-state (short for transaction state).
47. X-state Implementation
- The x-state contains a transaction log for each active transaction in the system.
- Each log consists of:
- a commit record, which maintains the transaction's status:
- pending
- committed
- aborted
- a vector of log entries, each corresponding to a memory block that the transaction has read or written. An entry provides:
- a pointer to the block,
- the block's old value (for rollback),
- a pointer to the commit record,
- pointers that form a linked list of all entries in all transaction logs that refer to the same block (the reader list).
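The log layout above can be pictured as C structs. This is purely illustrative: field and type names are invented for the sketch, and the real x-state is a hardware/OS-managed structure, not a C API:

```c
#include <assert.h>
#include <stddef.h>

/* Commit record: the transaction's status. */
typedef enum { PENDING, COMMITTED, ABORTED } CommitRecord;

/* One log entry, for one memory block the transaction touched. */
typedef struct LogEntry {
    void            *block;          /* pointer to the memory block   */
    unsigned long    old_value;      /* old value, kept for rollback  */
    CommitRecord    *commit_record;  /* back-pointer to the record    */
    struct LogEntry *next_reader;    /* reader list: entries in other
                                        logs touching the same block  */
} LogEntry;

/* One transaction log: a commit record plus a vector of entries. */
typedef struct {
    CommitRecord status;             /* pending / committed / aborted */
    LogEntry    *entries;
    size_t       n_entries;
} TransactionLog;
```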
48. X-state Implementation (cont.)
- The final part of the x-state consists of, for each memory block:
- a log pointer,
- a read-write (RW) bit.
49. X-state Data Structure
[Diagram: transaction log entries in the x-state (block pointer, old value, reader list, commit-record pointer) link to blocks in application memory, each block tagged with a log pointer and an RW bit (R or W)]
50. More on x-state
- When a processor references a block that is already part of a pending transaction, the system checks the RW bit and the log pointer to determine the correct action:
- use the old value,
- use the new value,
- abort the transaction.
51. Commit action
[Diagram: the x-state data structure of slide 49 during a commit]
52. Cleanup action
[Diagram: the x-state data structure of slide 49 during cleanup]
53. Abort action
[Diagram: the x-state data structure of slide 49 during an abort]
54. Transaction Conflicts
- A conflict: two or more pending transactions have accessed a block, and at least one of the accesses is for writing.
- Performing a transactional load:
- check that the log pointer refers to an entry in the current transaction's log, or that the RW bit is R.
- Performing a transactional store:
- check that the log pointer references no other transaction.
- In case of a conflict, some of the conflicting transactions are aborted.
- Which transaction should be aborted?
55. Caching
- For a small transaction that fits in the cache, UTM, like earlier proposed HTM systems, uses the cache-coherence protocol to identify conflicts.
- For transactions too big to fit in the cache, the transaction's x-state overflows into the ordinary memory hierarchy.
- Most log entries don't need to be created:
- a transaction log is only created when the transaction runs out of physical memory.
56. UTM's Goals
- Support transactions that:
- run for an indefinite length of time,
- migrate from one processor to another,
- have footprints bigger than the physical memory.
- The main technique proposed is to treat the x-state as a system-wide data structure that uses global virtual addresses.
57. Benefits and Limits of UTM
- Limits:
- Complicated implementation
- Benefits:
- Unlimited footprint
- Unlimited duration
- Migration possible
- Good performance in the common case (small transactions)
58. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
59. LTM: Visible, Large, Frequent, Scalable
- Large Transactional Memory
- Not truly unbounded, but simple and cheap
- Minimal architectural changes, high performance
- Small modifications to cache and processor core
- No changes to main memory or the cache-coherence protocol
- Can be pin-compatible with conventional processors
60. LTM's Restrictions
- A transaction's footprint is limited to (nearly) the size of physical memory.
- A transaction's duration must be less than a time slice.
- Transactions cannot migrate between processors.
- With these restrictions, we can implement LTM by modifying only the cache and processor core.
61. LTM vs. UTM
- Like UTM, LTM maintains data about pending transactions in the cache and detects conflicts using the cache-coherence protocol.
- Unlike UTM, LTM does not treat the transaction as a data structure. Instead, it binds a transaction to a particular cache.
- Transactional data overflows from the cache into a hash table in main memory.
- LTM and UTM have similar semantics: the XBEGIN and XEND instructions are the same.
- In LTM, the cache plays a major part.
62. Additions to the Cache
- LTM adds a bit (T) per cache line to indicate that the line's data has been accessed as part of a pending transaction.
- An additional bit (O) is added per cache set to indicate that the set has overflowed.
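A toy model of the T and O bits (the `Set`/`Line` types, the 2-way size, and the `tx_access` helper are all invented for this sketch; real LTM manages these bits in hardware):

```c
#include <assert.h>
#include <stdbool.h>

#define WAYS 2   /* toy 2-way set, for illustration only */

/* One cache line with its T bit; one set with its O bit. */
typedef struct { unsigned tag; bool valid, T; } Line;
typedef struct { Line line[WAYS]; bool O; } Set;

/* Record a transactional access to `tag` in the set. If every way
 * already holds pending-transaction data, the access would spill to
 * the in-memory overflow hash table, so the set's O bit is raised
 * to tell later misses to probe that table too. */
void tx_access(Set *s, unsigned tag) {
    for (int w = 0; w < WAYS; w++)
        if (s->line[w].valid && s->line[w].tag == tag) {
            s->line[w].T = true;       /* already cached: mark it */
            return;
        }
    for (int w = 0; w < WAYS; w++)
        if (!s->line[w].valid || !s->line[w].T) {
            s->line[w] = (Line){ tag, true, true };  /* take this way */
            return;
        }
    s->O = true;   /* all ways hold transactional data: overflow */
}
```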
63. Cache overflow mechanism
- Running example, stepped through on the following slides:

    ST 1000, 55
    XBEGIN L1
    LD R1, 1000
    ST 2000, 66
    ST 3000, 77
    LD R1, 1000
    XEND

[Diagram: a cache with O / T / Tag / Data columns next to an overflow hash table with Key / Data columns]
64. Cache overflow mechanism
[Diagram: the example continues]
65. Cache overflow: recording reads
[Diagram: the transactional LD sets the T bit on its cache line]
66. Cache overflow: recording writes
[Diagram: the transactional STs set the T bits on their cache lines]
67. Cache overflow: spilling
[Diagram: a transactional line evicted from a full set is spilled to the overflow hash table, and the set's O bit is set]
68. Cache overflow: miss handling
[Diagram: on a miss in a set whose O bit is set, the overflow hash table is also probed]
69. LTM - Summary
- Transactions as large as physical memory
- Scalable overflow and commit
- Easy to implement!
- Low overhead
70. Outline
- Lock-Free
- Hardware Transactional Memory (HTM)
- Transactions
- Cache coherence protocol
- General Implementation
- Simulation
- UTM
- LTM
- TCC (briefly)
- Conclusions
71. Transactional Memory Coherence and Consistency (TCC)
- Hammond, Wong, Chen, Carlstrom, Davis (Jun 2004), "Transactional Memory Coherence and Consistency"
- All transactions, all the time! Code is partitioned into transactions by the programmer or by tools
- possibly at run time, for legacy code!
- All writes are buffered in caches; CPUs arbitrate system-wide for which one gets to commit
- Updates are broadcast to all CPUs; CPUs detect conflicts with their own transactions and abort
72. TCC Implementation
[Diagram: each CPU core issues loads and stores into its local cache hierarchy (lines tagged with read/modified bits, a valid bit, tag, and data) and a write buffer; commit control broadcasts stores only over a shared bus or network, which all nodes snoop for commits]
73. Conclusions
- Unbounded, scalable, and efficient Transactional Memory systems can be built.
- They support large, frequent, and concurrent transactions.
- They allow programmers to (finally!) use our parallel systems!
- Three architectures:
- LTM: easy to realize, almost unbounded
- UTM: truly unbounded
- TCC: high performance
74. THE END