Title: Log-Based Transactional Memory
1. Log-Based Transactional Memory
2. Motivation
- Chip multiprocessors (multi-core/many-core) are here
  - "Intel has 10 projects in the works that contain four or more computing cores per chip" -- Paul Otellini, Intel CEO, Fall '05
- We must effectively program these systems
  - But programming with locks is challenging
  - "Blocking on a mutex is a surprisingly delicate dance" -- OpenSolaris, mutex.c
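3. Locks are Hard
// WITH LOCKS
void move(T s, T d, Obj key) {
    LOCK(s);
    LOCK(d);
    tmp = s.remove(key);
    d.insert(key, tmp);
    UNLOCK(d);
    UNLOCK(s);
}
- Moreover
  - Coarse-grain locking limits concurrency
  - Fine-grain locking is difficult
- DEADLOCK! (for example, when one thread runs move(a, b, key1) while another runs move(b, a, key2))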
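A conventional lock-based fix for the deadlock is to acquire both locks in one fixed global order. The sketch below is illustrative only: `Table` and its `remove`/`insert` methods are hypothetical stand-ins for the slide's type `T`, and ordering locks by `id()` is an assumption chosen for brevity.

```python
import threading

class Table:
    """Hypothetical stand-in for the slide's type T: a dict guarded by a lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.items = {}

    def remove(self, key):
        return self.items.pop(key)

    def insert(self, key, value):
        self.items[key] = value

def move(s, d, key):
    """Move key from s to d; assumes s and d are distinct tables."""
    # Always lock in the same global order, whatever the argument order.
    first, second = sorted((s, d), key=id)
    with first.lock:
        with second.lock:
            d.insert(key, s.remove(key))

a, b = Table(), Table()
a.insert("k1", 1)
b.insert("k2", 2)
t1 = threading.Thread(target=move, args=(a, b, "k1"))
t2 = threading.Thread(target=move, args=(b, a, "k2"))
t1.start()
t2.start()
t1.join()
t2.join()
```

The ordering rule has to be followed by every caller that takes two locks, which is the composability burden that the atomic-block version avoids.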
4. Transactional Memory (TM)
void move(T s, T d, Obj key) {
    atomic {
        tmp = s.remove(key);
        d.insert(key, tmp);
    }
}
- Programmer says
  - "I want this atomic"
- TM system
  - "Makes it so"
- Software TM (STM) Implementations
  - Currently slower than locks
  - Always slower than hardware?
- Hardware TM (HTM) Implementations
  - Leverage cache coherence & speculation
  - Fast
  - But hardware overheads and virtualization challenges
5. Goals for Transactional Memory
- Efficient Implementation
  - Make the common case fast
  - Can't justify expensive HW (yet)
- Virtualizing TM
  - Don't limit programming model
  - Allow transactions of any size and duration
6. Implementing TM
- Version Management
  - New values for commit
  - Old values for abort
  - Must keep both
- Conflict Detection
  - Find read-write, write-read or write-write conflicts among concurrent transactions
  - Allows multiple readers OR one writer
- Large state (must be precise)
- Checked often (must be fast)
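The conflict rule above (multiple readers or one writer) can be stated as a predicate over read and write sets. This is a minimal sketch; representing the sets as Python sets of block addresses is an assumption made for illustration.

```python
def conflicts(reads_a, writes_a, reads_b, writes_b):
    """Return True if two concurrent transactions conflict.

    A conflict exists when one transaction writes a location that the
    other reads or writes. Read-read sharing is allowed.
    """
    return bool(writes_a & (reads_b | writes_b)) or bool(writes_b & reads_a)

both_read = conflicts({0x00}, set(), {0x00}, set())
read_write = conflicts({0x00}, set(), set(), {0x00})
write_write = conflicts(set(), {0x40}, set(), {0x40})
disjoint = conflicts({0x00}, {0x40}, {0x80}, {0xC0})
```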
7. LogTM: Log-Based Transactional Memory
- Combined Hardware/Software Transactional Memory
  - Conservative hardware conflict detection
  - Software version management (with some hardware support)
- Eager Version Management
  - Stores new values in place
  - Stores old values in user virtual memory (the transaction log)
- Eager Conflict Detection
  - Detects transaction conflicts on each load and store
8. LogTM Publications
- HPCA 2006: LogTM: Log-based Transactional Memory
- ASPLOS 2006: Supporting Nested Transactional Memory in LogTM
- HPCA 2007: LogTM-SE: Decoupling Hardware Transactional Memory from Caches
- ISCA 2007: Performance Pathologies in Hardware Transactional Memory
9. Outline
- Introduction
- Background
- LogTM
- Implementing LogTM
- Evaluation
- Extending LogTM
- Related Work
- Conclusion
10. LOGTM
11. LogTM: Log-Based Transactional Memory
- Eager Software-Based Version Management
  - Store new values in place
  - Store old values in the transaction log
  - Undo failed transactions in software
- Eager All-Hardware Conflict Detection
  - Isolate new values
  - Fast conflict detection for all transactions
12. LogTM's Eager Version Management
- New values stored in place
- Old values stored in the transaction log
  - A per-thread linear (virtual) address space (like the stack)
  - Filled by hardware (during transactions)
  - Read by software (on abort)
[Figure: example. Cache blocks at VA 00, 40 and C0 with their data and R/W bits, next to a transaction log delimited by the Log Base and Log Ptr registers, plus a TM count register]
13. Eager Version Management Discussion
- Advantages
  - No extra indirection (unlike STM)
  - Fast Commits
    - No copying
    - Common case
- Disadvantages
  - Slow/Complex Aborts
    - Undo aborting transaction
  - Relies on Eager Conflict Detection/Prevention
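The trade-off above (fast commits, slow aborts) follows directly from keeping new values in place and old values in an undo log. The toy model below is a sketch, not LogTM's hardware: the class name is hypothetical, memory is a dict of single values rather than cache blocks, and the `written` set is an assumed stand-in for per-block W bits.

```python
class EagerVersionedMemory:
    """Toy model: new values go in place, old values go to an undo log."""
    def __init__(self, initial):
        self.mem = dict(initial)   # address -> value, updated in place
        self.log = []              # (address, old value) undo records
        self.written = set()       # addresses already logged in this transaction

    def store(self, addr, value):
        if addr not in self.written:        # log only the first write
            self.log.append((addr, self.mem[addr]))
            self.written.add(addr)
        self.mem[addr] = value              # eager: update in place

    def commit(self):
        self.log.clear()                    # fast: discard the log, copy nothing
        self.written.clear()

    def abort(self):
        while self.log:                     # slow: undo records in LIFO order
            addr, old = self.log.pop()
            self.mem[addr] = old
        self.written.clear()

m = EagerVersionedMemory({0x00: 12, 0x40: 23, 0xC0: 34})
m.store(0xC0, 56)
m.store(0x40, 24)
m.abort()
after_abort = dict(m.mem)
m.store(0xC0, 56)
m.commit()
after_commit = dict(m.mem)
```

Commit cost is independent of the write-set size, while abort cost grows with the number of undo records.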
14. LogTM's Eager Conflict Detection
- Requirements for Conflict Detection in LogTM
  - Transactions Must Be Well Formed
    - Each thread must obtain read isolation on all memory locations read and write isolation on all locations written
  - Isolation Must Be Strict Two-Phase
    - Any thread that acquires read or write isolation on a memory location in a transaction must maintain that isolation until the end of the transaction
  - Isolation Must Be Released at the End of a Transaction
    - Because conflicts may prevent transactions from making progress, a thread completing a transaction must release isolation when it aborts or commits a transaction
15. LogTM's Conflict Detection in Practice
- LogTM detects conflicts using coherence
  - Requesting processor issues coherence request to memory system
  - Coherence mechanism forwards to other processor(s)
  - Responding processor detects conflict using local state and informs requesting processor of conflict
  - Requesting processor resolves conflict (discussed later)
16. Example Implementation (LogTM-Dir)
- P0 store
  - P0 sends get exclusive (GETX) request
  - Directory responds with data (old)
  - P0 executes store
[Figure: the directory entry goes from I to M@P0 (memory holds old); P0's line goes from I (none) to M (old) to M (new) with metadata (W-); P1's line stays I (none) with metadata (--)]
17. Example Implementation (LogTM-Dir)
- In-cache transaction conflict
  - P1 sends get shared (GETS) request
  - Directory forwards to P0
  - P0 detects conflict and sends NACK
[Figure: P1 sends GETS to the directory (M@P0, old); the directory sends Fwd_GETS to P0, which holds the line in M (new) with metadata (W-) and answers with a NACK; P1's line stays I (none) with metadata (--)]
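The responder-side check in the two examples can be sketched as a function of the forwarded request type and the local R/W bits. The function name and the "ACK" label for a non-conflicting response are assumptions made for illustration; the slides only name the NACK.

```python
def respond(request, r_bit, w_bit):
    """Response of a processor holding a block to a forwarded coherence request.

    request: "GETS" (another processor wants to read) or
             "GETX" (another processor wants to write).
    r_bit / w_bit: whether the local transaction has read / written the block.
    """
    if request == "GETS":
        conflict = w_bit                 # remote read vs. local write
    elif request == "GETX":
        conflict = r_bit or w_bit        # remote write vs. local read or write
    else:
        raise ValueError(f"unknown request: {request}")
    return "NACK" if conflict else "ACK"

# The case in the figure: P0 has written the block, P1 asks to read it.
p0_response = respond("GETS", r_bit=False, w_bit=True)
```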
18. Conflict Resolution
- Conflict Resolution
  - Can wait, risking deadlock
  - Can abort, risking livelock
  - Wait/abort transaction at requesting or responding processor?
- LogTM resolves conflicts at requesting processor
  - Requester can wait (using coherence nacks/retries)
  - But must abort if deadlock is possible
- Requester Stalls Policy
  - Logically order transactions with timestamps
  - On conflict notification, wait unless already causing an older transaction to wait
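The requester-stalls policy can be sketched with one flag per transaction. The names below (`Transaction`, `possible_cycle`, the "STALL"/"ABORT" labels) are hypothetical, and the extra condition that the NACK comes from an older transaction is an assumption that goes slightly beyond the wording on the slide.

```python
class Transaction:
    def __init__(self, timestamp):
        self.timestamp = timestamp      # smaller timestamp = older transaction
        self.possible_cycle = False     # set once we make an older transaction wait

    def on_incoming_conflict(self, requester):
        """We NACK `requester`; remember whether that stalls an older transaction."""
        if requester.timestamp < self.timestamp:
            self.possible_cycle = True
        return "NACK"

    def on_nack(self, responder):
        """We were NACKed by `responder`: keep waiting, or abort to break a cycle."""
        if self.possible_cycle and responder.timestamp < self.timestamp:
            return "ABORT"
        return "STALL"

old, young = Transaction(1), Transaction(2)
young.on_incoming_conflict(old)        # young makes old wait
young_decision = young.on_nack(old)    # young would now wait on old as well
old_decision = old.on_nack(young)      # old is not stalling anyone older
```

Only the younger transaction aborts, so the older one always makes progress and the wait cycle cannot persist.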
19. LogTM API
- User: begin_transaction(), commit_transaction(), abort_transaction()
- System/Library: Initialize_logtm_transactions(), Register_abort_handler(void () handler)
- Low-Level: Undo_log_entry(), Complete_abort_with_restart(), Complete_abort_wo_restart()
20. IMPLEMENTING LOGTM
21. Version Management Trade-offs
- Hardware vs. Software Register Checkpointing
- Implicit vs. Explicit Logging
- Buffered vs. Direct Logging
- Logging Granularity
- Logging Location
22. Compiler-Supported Software Logging
- Software Register Checkpointing
  - Compiler generates instructions to save registers to transaction log
- Software-only logging
  - Compiler generates instructions to save old values and addresses to the transaction log
- Lowest implementation cost
  - All-software version management
- High overhead
  - Slow to start transactions (save registers)
  - Slow writes (extra load instructions)
23. In-Cache Hardware Logging
- Hardware Register Checkpointing
  - Bulk save architectural registers (like USIII)
- Hardware Logging
  - Hardware saves old values and virtual address to memory at the first level of writeback cache
- Best Performance
  - Little or no logging-induced delay
  - Single-cycle transaction begin/commit
- Complex implementation
  - Shadow register file
  - Buffering and forwarding logic in caches
24. In-Cache Hardware Logging
[Figure: CPUs with store buffers and L1 D caches above a two-bank L2 cache; each bank shows a VA and Log Target path, a Store Target path and a Data path, all with ECC]
25. Hardware/Software Hybrid Buffered Logging
- Hardware Register Checkpointing
  - Bulk save architectural registers (like USIII)
- Buffered Logging
  - Hardware saves old values and virtual address to a small buffer
- Good Performance
  - Little or no logging-induced delay for small transactions
  - Single-cycle transaction begin/commit
  - Reduces processor-to-cache memory traffic
- Less-complex implementation
  - Shadow register file
  - No changes to caches
26. Hardware/Software Hybrid Buffered Logging
[Figure: two views, "Transaction Execution" and "Buffer Spill", of a CPU with a register file, a store buffer and a log buffer in front of the cache, which holds the VA, Log Target and Store Target]
27. Implementing Conflict Detection
- Existing cache coherence mechanisms can support conflict detection for cached data by adding an R (read) and W (write) bit to each cache line
- Challenges for detecting conflicts on un-cached data differ for broadcast and directory systems
- Broadcast
  - Easy to find all possible conflicts
  - Hard to filter false conflicts
- Directory
  - Hard to find all possible conflicts
  - Easy to filter false conflicts
28. LogTM-Bcast
- Adds a Bloom filter to track memory blocks touched in a transaction, then evicted from the cache
  - Allows any number of addresses to be added to the filter
  - Detects all true conflicts
  - Allows some false conflicts
[Figure: CPU with L1 I and L1 D caches above an L2 cache (Tag, Data, RW bits) that is extended with an overflow filter holding R and W entries]
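The properties listed above (any number of addresses, no missed conflicts, some false conflicts) are those of a Bloom filter. The sketch below is a software illustration only: the class name, the filter size and the use of SHA-256 as the hash are assumptions, and real hardware would use much simpler hash functions.

```python
import hashlib

class OverflowFilter:
    """Small Bloom filter over block addresses."""
    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, addr):
        # Derive num_hashes bit positions from the address.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def may_contain(self, addr):
        # True for every added address; may also be True for others.
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        self.bits = 0

f = OverflowFilter()
evicted = [0x1000, 0x1040, 0x2080]
for addr in evicted:
    f.add(addr)
```

Because bits are only ever set until the filter is cleared, an added address can never be reported absent, which is why all true conflicts are detected.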
29. LogTM-Dir
- Extends a standard MESI directory with sticky states
  - The directory continues to forward coherence traffic for a memory location to processors that touch that location in a transaction then evict it from the cache
- Removes most false conflicts with a single overflow bit per cache
30. Sticky States
Cache State | Directory State M | Directory State S | Directory State I
M | M | |
E | E | |
S | | S |
I | Sticky-M | Sticky-S | I
31. LogTM-Dir Conflict Detection w/ Cache Overflow
- At overflow at processor P0
  - Set P0's overflow bit (1 bit per processor)
  - Allow writeback, but set directory state to Sticky@P0
- At (potential) conflicting request by processor P1
  - Directory forwards P1's request to P0
  - P0 tells P1 "no conflict" if overflow is reset
  - But asserts conflict if set (w/ small chance of false positive)
- At transaction end (commit or abort) at processor P0
  - Reset P0's overflow bit
  - Clean sticky states lazily on next access
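The three phases above (overflow, conflicting request, transaction end) can be sketched as a small simulation. The class and method names are hypothetical, the directory is reduced to a map from addresses to sticky owners, and the response labels follow the NACK/CLEAN/DATA messages used in the LogTM-Dir example slides.

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.overflow = False               # one overflow bit per processor

    def evict_transactional_block(self, directory, addr):
        self.overflow = True
        directory.sticky_owner[addr] = self  # writeback allowed, state stays sticky

    def end_transaction(self):
        self.overflow = False               # sticky states are cleaned lazily

    def check_forwarded(self):
        # Conservative: with the bit set, every forwarded request is a conflict.
        return "NACK" if self.overflow else "CLEAN"

class Directory:
    def __init__(self):
        self.sticky_owner = {}              # addr -> processor with a sticky state

    def request(self, addr):
        owner = self.sticky_owner.get(addr)
        if owner is None:
            return "DATA"
        if owner.check_forwarded() == "NACK":
            return "NACK"
        del self.sticky_owner[addr]         # lazy cleanup on the next access
        return "DATA"

d, p0 = Directory(), Processor("P0")
p0.evict_transactional_block(d, 0x40)
during = d.request(0x40)    # P0's transaction is still running
p0.end_transaction()
after = d.request(0x40)     # sticky state is cleaned on this access
```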
32. LogTM-Dir
- Cache overflow
  - P0 sends put exclusive (PUTX) request
  - Directory acknowledges
  - P0 writes data back to memory
[Figure: P0 (TM count 1, metadata (W-)) sends PUTX, receives ACK and sends DATA; the directory entry goes from M@P0 (old) to Msticky@P0 (new); P0's line goes from M (new) to I (none); P1 (TM count 0, metadata (--)) holds I (none)]
33. LogTM-Dir
- Out-of-cache conflict
  - P1 sends GETS request
  - Directory forwards to P0
  - P0 detects a (possible) conflict
  - P0 sends NACK
[Figure: P1 (TM count 0, signature (--)) sends GETS; the directory (Msticky@P0, new) sends Fwd_GETS to P0 (TM count 1, signature (-W)), whose line is I (none); P0 answers with a NACK]
34. LogTM-Dir
- Commit
  - P0 clears TM count and signature
[Figure: P0's TM count goes from 1 to 0 and its signature from (-W) to (--); the directory entry remains Msticky@P0 (new); P1 is unchanged (TM count 0, signature (--), line I (none))]
35. LogTM-Dir
- Lazy cleanup
  - P1 sends GETS request
  - Directory forwards request to P0
  - P0 detects no conflict, sends CLEAN
  - Directory sends Data to P1
[Figure: P1 sends GETS; the directory (Msticky@P0, new) sends Fwd_GETS to P0 (TM count 0, signature (--)), which answers CLEAN; the directory sends DATA to P1 and its entry becomes S(P1) (new); P1's line goes from I (none) to S (new) with signature (R-)]
36. EVALUATION
37. System Model
- LogTM-Dir
- In-Cache Hardware Logging & Hybrid Buffered Logging
Component | Settings
Processors | 32, 1 GHz, single-issue, in-order, non-memory IPC = 1
L1 Cache | 16 kB 4-way split, 1-cycle latency
L2 Cache | 4 MB 4-way unified, 12-cycle latency
Memory | 4 GB, 80-cycle latency
Directory | Full-bit-vector sharers list, directory cache, 6-cycle latency
Interconnection Network | Hierarchical switch topology, 14-cycle link latency
38. Benchmarks
Benchmark | Synchronization | Inputs
Shared Counter | Counter lock | 2500 cycle random think time
B-Tree | Transactions only | 9-ary tree, 5 levels deep
Barnes | Locks on tree nodes | 512 bodies
Cholesky | Task queue locks | 14
Berkeley DB (BkDB) | Locks on object lists | 512 operations
MP3D | Locks | 4096 molecules
Radiosity | Task queue locks | Large room
Raytrace | Work list and counter locks | Car
39. Read Set Size
40. Write Set Size
41. Microbenchmark Scalability
[Figure panels: "Btree 0, 10 and 20 Updates"; "Shared Counter LogTM vs. Locks"]
42. Benchmark Scalability
[Figure panels: Barnes; BkDB]
43. Benchmark Scalability
[Figure panels: Cholesky; MP3D]
44. Benchmark Scalability
[Figure panels: Radiosity; Raytrace]
45. Scalability Summary
- Benchmarks scale as well as or better using LogTM transactions
  - Performance is better for all benchmarks
  - LogTM improves the scalability of some benchmarks, but not others
  - Abort rates are low
- Next
  - Write set prediction
  - Buffered Logging
  - Log Granularity
46. Write Set Prediction
- Predicts if the target of each load will be modified in this transaction
  - Eagerly acquires write isolation
  - Reduces "waits for" cycles that force aborts in LogTM
- Four Predictors
  - None -- Never predict
  - 1-Entry -- Remembers a single address
  - Load PC -- History based on PC of load instruction
  - Always -- Acquire write isolation for all loads and stores
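The four predictors can be sketched behind one interface. Everything here is an assumption made for illustration: the class names, the `predict`/`train` methods, and the training rule (train is called when a store hits a block that an earlier load in the transaction brought in) are not specified on the slide.

```python
class NonePredictor:
    def predict(self, pc, addr):
        return False                     # never request write isolation early

    def train(self, pc, addr):
        pass

class AlwaysPredictor:
    def predict(self, pc, addr):
        return True                      # every load acquires write isolation

    def train(self, pc, addr):
        pass

class OneEntryPredictor:
    """Remembers the address of one block that was loaded and then stored."""
    def __init__(self):
        self.addr = None

    def predict(self, pc, addr):
        return addr == self.addr

    def train(self, pc, addr):
        self.addr = addr

class LoadPCPredictor:
    """Remembers PCs of loads whose target was later stored."""
    def __init__(self):
        self.pcs = set()

    def predict(self, pc, addr):
        return pc in self.pcs

    def train(self, pc, addr):
        self.pcs.add(pc)

def request_for_load(predictor, pc, addr):
    """Coherence request to issue for a transactional load."""
    return "GETX" if predictor.predict(pc, addr) else "GETS"

p = LoadPCPredictor()
first = request_for_load(p, pc=0x400, addr=0x40)
p.train(pc=0x400, addr=0x40)             # the loaded block was later written
second = request_for_load(p, pc=0x400, addr=0x80)
```

Issuing GETX on the load avoids the later read-to-write upgrade, which is where two transactions that both read a block can end up waiting on each other.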
47. Abort Rate with Write Set Prediction
48. Performance Impact of WSP
49. Impact of Buffer-Spill Stalls
50. Log Granularity
51. Modeling Abort Penalty
- Abort penalty
  - Delays coherence requests
  - Delays transaction restart
- Penalty consists of
  - Trap overhead (constant)
  - Rollback overhead (per log entry)
- Measured performance for 3 settings
  - Ideal -- single-cycle abort
  - Medium -- 200-cycle trap, 40 cycles per undo record
  - Slow -- 1000-cycle trap, 200 cycles per undo record
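The penalty model is a constant plus a per-record term. The sketch below encodes the three settings from the slide; the function and dictionary names are hypothetical, and treating the Ideal setting as exactly one cycle regardless of log size is an assumption based on "single-cycle abort".

```python
SETTINGS = {
    "Medium": {"trap": 200, "per_record": 40},
    "Slow": {"trap": 1000, "per_record": 200},
}

def abort_penalty(setting, undo_records):
    """Cycles charged for one abort: constant trap cost plus per-record rollback."""
    if setting == "Ideal":
        return 1                      # single-cycle abort, independent of log size
    s = SETTINGS[setting]
    return s["trap"] + s["per_record"] * undo_records

medium = abort_penalty("Medium", 5)   # 200 + 40 * 5
slow = abort_penalty("Slow", 5)       # 1000 + 200 * 5
```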
52. Sensitivity to Abort Penalty (no WSP)
53. Sensitivity to Abort Penalty (with WSP)
54. EXTENDING LOGTM
55. Extending LogTM
- Supporting Nesting in LogTM
  - Support nested VM by segmenting the transaction log
  - Non-transactional escape actions facilitate OS interactions
- Virtualizing Conflict Detection with Signatures
  - LogTM-Signature Edition (LogTM-SE) tracks read and write sets with signatures (like Bloom filters)
  - Supports thread switching and paging by saving, restoring and manipulating signatures
56. RELATED WORK
57. Related Work
- Hardware Support for Database Transactions
- Early Transactional Memory Systems
- Hardware TM (HTM)
- Software TM (STM)
- Hybrid TM
- TM Virtualization
58. Early Transactional Memory Systems
- Hardware Support for Database Transactions
  - 801 Storage System
    - Database-like transactions on 1-level store (memory and disk)
    - Transactions are durable
- Early HTM
  - Knight
    - Used transactions to parallelize code written in mostly functional languages
  - Herlihy and Moss
    - First HTM
    - Implementation based on a separate transaction cache
    - Transactions limited to cached data
59. Unbounded Transactional Memory
- Uses Eager VM and Eager CD
- Supports unbounded transactions in hardware
- Complex hardware
- Pointer and state bits for each line in memory
- Hardware state machine for transaction rollback
- Global virtual address space
60. Transactional Memory Coherence and Consistency (TCC)
- On-Chip Interconnect: Broadcast-Based Communication
- Write buffer: 4 kB, Fully-Associative
- L2 Cache: Logically Shared
- L1 Cache tracks read set
[Figure: CPU with an L1 D cache whose lines carry an R bit]
61. Bulk
- Encodes read and write sets in signatures (like Bloom filters)
- Like TCC, uses lazy VM and lazy CD
- Can detect conflicts for non-cached data
62. Hybrid Transactional Memory
- Combines HTM and STM
- Executes small transactions in hardware, large transactions in software
- Allows program execution on existing hardware (without HTM support)
63. Transaction Virtualization
- Virtual Transactional Memory (VTM)
  - Rajwar and Herlihy
  - Adds a virtualization mechanism to limited HTM (e.g. Herlihy and Moss TM)
  - Implements CD and VM for transactions that exceed hardware capabilities in micro-code
- Page-granularity Transaction Virtualization
  - PTM -- Chuang et al.
  - XTM -- Chung et al.
64. HTM Virtualization Mechanisms
System | Before Virtualization: Miss | Before Virtualization: Commit | Before Virtualization: Abort | Before Virtualization: Eviction | After Virtualization: Miss | After Virtualization: Commit | After Virtualization: Abort | After Virtualization: Eviction | Paging | Thread Switch
UTM | - | - | - | H | H | H | HC | H | H | H
VTM | - | - | - | S | S | SC | S | S | S | SWV
UnrestrictedTM | - | - | - | A | B | B | B | B | AS | AS
XTM | - | - | - | ASC | - | SCV | S | SC | SC | AS
XTM-g | - | - | - | SC | - | SCV | S | SC | SC | AS
PTM-Copy | - | - | - | SC | S | S | SC | SC | S | S
PTM-Select | - | - | - | S | H | S | S | S | S | S
LogTM-SE | - | - | SC | - | - | S | SC | - | S | S
Legend: Shaded = virtualization event; - = handled in simple HW; H = complex hardware; S = handled in software; A = abort transaction; C = copy values; W = walk cache; V = validate read set; B = block other transactions
65. Conclusion
- TM can make parallel programs faster and easier to write
- LogTM provides
  - Hardware/Software Implementation
    - Simple, flexible hardware
  - Software-Based Eager Version Management
    - Makes the common case (commit) fast
    - Reduces hardware complexity
  - Hardware-Based Eager Conflict Detection
    - Allows blocking to reduce wasted work
66. Thanks to my Collaborators
- Jayaram Bobba, Mark Hill, Derek Hower, Steve
Jackson, Nick Kidd, Ben Liblit, Mike Marty,
Michelle Moravan, Tom Reps, Mike Swift, Haris
Volos, David Wood, Luke Yen
67. BACKUP SLIDES
68. Database Locks and Cache Coherence States
Database | Cache State
No Lock | I
S | E, O, S
X | M
- Coherence states are analogous to short database locks
- Most protocols have no provision to hold long locks
69. Herlihy and Moss, ISCA 1993
- Transaction cache
  - Stores all data accessed by transactions
  - 2 copies of each updated cache line
  - Fully associative
  - Acts as a victim cache
- Long Locks
  - Processors are allowed to refuse coherence requests
[Figure: Memory, a Cache with lines in M, S and S, and a Transaction Cache holding XCommit and XAbort copies]
70. Transactions Limited by Cache Size and Associativity
- Exposes the size of the transaction cache to the architecture
- Requires minimum associativity
- Difficult for dynamic transactions
71. Transactional Lock Removal (TLR)
- Uses Speculative Lock Elision (SLE) to elide lock operations in short critical sections
- Extends SLE with lock-based concurrency control
- Long locks: processors can defer coherence responses during speculative transactions
72. LogTM-SE Processor Hardware
- Segmented log, like LogTM
- Track R / W sets with R / W signatures
  - Over-approximate R / W sets
  - Tracks physical addresses
  - Summary signature used for virtualization
- Conflict detection by coherence protocol
  - Check signatures on every memory access for SMT
[Figure: SMT thread context holding Registers, Register Checkpoint, LogFrame, LogPtr, TMcount, Read and Write signatures, and SummaryRead and SummaryWrite signatures; the data caches are labeled "NO TM STATE"]
73. Escape Actions
- Allow non-transactional escapes from a transaction
  - (e.g., system calls, I/O)
- Similar to Zilles's pause/unpause
- Escape actions never
  - Abort
  - Stall
  - Cause other transactions to abort
  - Cause other transactions to stall
- Commit and compensating actions
  - Similar to open nests
Not recommended for the average programmer!
74. Case Study: System Calls in Solaris
Category | # | Examples
Read-Only | 57 | getpid, times, stat, access, mincore, sync, pread, gettimeofday
Undoable (without global side effects) | 40 | chdir, dup, umask, seteuid, nice, seek, mprotect
Undoable (with global side effects) | 27 | chmod, mkdir, link, mknod, stime
Calls not handled by escape actions | 89 | kill, fork, exec, umount
75. Escape Actions in LogTM
- Loads and stores to non-transactional blocks behave as normal coherent accesses
- Loads return the latest value in coherent memory
  - Loads to a transactionally modified cache block trigger a writeback (sticky-M state)
  - Memory responds with an uncacheable copy of the block
- Stores modify coherent memory
  - Stores to transactionally modified blocks trigger writebacks (sticky-M)
  - Updates the value in memory (non-cacheable write through)
76. Thread Switching Support
- Why?
  - Support long-running transactions
- What?
  - Conflict Detection for descheduled transactions
- How?
  - Summary Read / Write signatures
  - If thread t of process P is scheduled to use an active signature, the corresponding summary signature holds the union of the saved signatures from all descheduled threads from process P.
  - Updated using TLB-shootdown-like mechanism
77. Handling Thread Switching
[Figure: threads T1, T2 and T3 run on processors P1-P4; each processor has R and W signatures and R and W summary signatures; the OS holds a summary with W = 00000000 and R = 00000000]
78. Handling Thread Switching
[Figure: deschedule step; the signatures W = 01001000 and R = 01010010 are saved and the OS summary becomes W = 01001000, R = 01010010]
79. Handling Thread Switching
[Figure: deschedule step, continued; the OS summary is W = 01001000, R = 01010010]
80. Handling Thread Switching
[Figure: T1, T2 and T3 after the thread switch; the OS summary is W = 01001000, R = 01010010]
81. Thread Switching Support Summary
- Summary Read / Write signatures
  - Summarizes descheduled threads with active transactions
  - One OS structure per process
  - Check summary signature on every memory access
- Updated on transaction deschedule
  - Similar to TLB shootdown
Coherence
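Since a summary signature holds the union of the saved signatures of descheduled threads, it can be sketched as a bitwise OR over bit-vector signatures. The function names are hypothetical, and `addr_bit` stands for an already-hashed signature index, which is a simplification of how a real signature maps addresses to bits.

```python
from functools import reduce

def summary_signature(saved_signatures):
    """Bitwise OR of the saved signatures of all descheduled threads."""
    return reduce(lambda acc, sig: acc | sig, saved_signatures, 0)

def may_conflict(summary, addr_bit):
    """Conservative check of one memory access against the summary signature."""
    return bool(summary & (1 << addr_bit))

saved_write_sigs = [0b01001000, 0b00000010]
summary = summary_signature(saved_write_sigs)
```

Rescheduling a thread means recomputing the OR without that thread's saved signature, since individual bits cannot be subtracted from a union.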
82. Improving LogTM
83. Comparing HTMs
84. Multifacet Group Projects
- IEEE Computer - Simulating a $2M Commercial Server on a $2K PC, Alaa R. Alameldeen, Milo M.K. Martin, Carl J. Mauer, Kevin E. Moore, Min Xu, Daniel J. Sorin, Mark D. Hill and David A. Wood
- ASPLOS 2000 - Timestamp Snooping: An Approach for Extending SMPs, Milo M. K. Martin, Daniel J. Sorin, Anastassia Ailamaki, Alaa R. Alameldeen, Ross M. Dickson, Carl J. Mauer, Kevin E. Moore, Manoj Plakal, Mark D. Hill, and David A. Wood
85. How Do Transactional Memory Systems Differ?
- (Data) Version Management
  - Eager: record old values elsewhere; update in place (fast commit)
  - Lazy: update elsewhere; keep old values in place
- (Data) Conflict Detection
  - Eager: detect conflict on every read/write (less wasted work)
  - Lazy: detect conflict at end (commit/abort)
86. Transaction Log Example
- Initial State
  - LogBase = LogPointer
  - R & W bits are clear
[Figure: cache blocks at VA 00 (data 12), 40 (data 23) and C0 (data 34), all with R = 0 and W = 0; Log Base = 1000, Log Ptr = 1000, TM mode = 1; log addresses 1000, 1040 and 1080 are empty]
87. Transaction Log Example
- Load r1, (00) /* r1 gets 12 */
  - Set R bit for block (00) (no changes to log)
[Figure: block 00 now has R = 1; blocks 40 and C0, Log Base (1000) and Log Ptr (1000) are unchanged]
88. Transaction Log Example
- Store r2, (c0) /* r2 = 56 */
  - Set W bit for block (c0)
  - Store address (c0) and old data on the log
  - Increment Log Ptr to 1048
  - Update memory
[Figure: block C0 now holds 56 with W = 1; the log entry at 1000 holds address c0 and the old block (34); Log Ptr = 1048]
89. Transaction Log Example
- Load r3, (78)
  - Set R bit for block (40)
- r3 = r3 + 1
- Store r3, (78)
  - Set W bit for block (40)
  - Store address (40) and old data on the log
  - Increment Log Ptr to 1090
  - Update memory
[Figure: block 40 now holds 24 with R = 1 and W = 1; the log holds (c0, old block 34) at 1000 and (40, old block 23) at 1048; Log Ptr = 1090]
90. Transaction Log Example
- Commit transaction
  - Clear R & W for all blocks
  - Reset Log Ptr to Log Base (1000)
  - Clear TM mode
[Figure: blocks 00 (12), 40 (24) and C0 (56) keep their new values with R = 0 and W = 0; Log Ptr goes from 1090 back to 1000; TM mode goes from 1 to 0]
91. Transaction Log Example
- Abort transaction
  - Replay log entries to undo the transaction
  - Reset Log Ptr to Log Base (1000)
  - Clear R & W bits for all blocks
  - Clear TM mode
[Figure: block 40 is restored from 24 to 23 and block C0 from 56 to 34; all R and W bits are 0; Log Ptr goes from 1090 to 1048 to 1000; TM mode goes from 1 to 0]
92. Primitive Logging
- Software defined log location (in virtual memory)
- Based on log pointer register
- Hardware copies old values and virtual address to memory at log pointer
- Overlaps logging with stores
- Allows logging with library calls
93. Primitive Address Matching
- Software creates and activates multiple contexts
- Not strictly nested
- Many uses
- Hand-over-hand locking
- Pointer alias checks
- Transactional memory
94. LogTM Interface
- User-Level
  - Begin/commit/abort
- System/Library
  - Initialize transactions
  - Register conflict handler
- Low-Level
  - Undo log entry
  - Complete abort with/without restart
- Currently, undo log to abort, but conflict managers in future
95. HTM (in general)
- Version Management
  - New values in cache
  - Old values in memory
- Conflict Detection
  - Coherence protocol detects conflicts
  - Invalidate
[Figure: memory and two caches; one cache holds a line in M with the NEW value while the other cache's copies are in S or I]
96. Conflict Detection in Other TM Schemes
- Cache overflow of transactional data is hard for (hardware) TM
  - Prohibit: Herlihy/Moss TM
  - Action at overflow: LTM, VTM, TCC
97. Outline
- Background/Motivation
- Multicores are here
- We need to program them
- Need HW/SW solution
- HW primitives
- SW Control
- TM
- Clear, intuitive model
- Likely benefits
- But, all-HW won't work
- LogTM
- LogTM Family
- Eager Version Management
- Basic Log
- Segmented Log
- Eager Conflict Detection
- Signatures
- Coherence
- Sticky States
98. Software Transactional Memory
- Transactional programming w/o hardware support
- Atomic swap of pointers to enforce atomicity
- Adds a level of indirection
99. MSI Coherence 101 (per memory block)
- States
  - M - one writer
  - S - many readers
  - I - no access
- Protocol detects & orders data conflicts
  - write-read
  - read-write
  - write-write
- E.g., a writer that seeks an M copy must invalidate S copies
100. Why Hardware Transactional Memory (HTM)?
- Speed: HTMs faster than STMs
  - Leverage cache coherence
  - Mitigate extra indirection & copying
- Speed: HTMs faster than some lock regimes
  - Auto-magical fine-grain
  - Don't have to get lock
- Speed: Whole reason for parallelism
- But: HTM virtualization issues
  - Cache size & associativity, OS calls
  - Paging, process switching & migration
101. Conflict Detection in HTM
- Most Hardware TMs
  - Do eager conflict detection (at read/writes)
  - Leveraging invalidation-based cache coherence
- Most Hardware TMs
  - Add per-processor transactional write (W) & read (R) bits
  - Setting W bit requires M state; setting R requires S or M
  - Ensures coherence protocol detects transactional data conflicts
  - E.g., Writer seeks M copy, seeks S copies, finds R bit set
102. The State of the World (in computer architecture)
- GHz race is over
  - Frequency increase limited by heat and power constraints
- Size of processor limited by communication delay, not transistors
  - Increasing wire delay on chip
- All high-performance processors will be CMP
- Software must become parallel
103. Parallel Programming is Hard!
- Data races cause subtle bugs
- Locks are a mess
- Deadlock
- Granularity problem
- Not composable
- Lock-free solutions still challenging
- We need a better way to write parallel software
104. Solution: Let the hardware help
- Provide a better interface for parallel software
  - Plenty of transistors
  - Access to run-time information
- Transactional Memory
  - Intuitive interface -- serial execution
  - High performance -- run transactions in parallel when possible
  - Current cache coherence schemes already do much of the work
105. LogTM Overview
- Hardware Transactional Memory promising
  - Most use lazy version management
    - Old values in place
    - New values elsewhere
  - Commits slower than aborts
    - But commits more common
- New: LogTM: Log-based Transactional Memory
  - Uses eager version management (like most databases)
    - Old values to log in thread-private virtual memory
    - New values in place
  - Makes common commits fast!
  - Also allows cache overflow & software abort handling
106. What is Transactional Memory?
void move(T s, T d, Obj key) {
    atomic {
        tmp = s.remove(key);
        d.insert(key, tmp);
    }
}
void move(T s, T d, Obj key) {
    LOCK(s);
    LOCK(d);
    tmp = s.remove(key);
    d.insert(key, tmp);
    UNLOCK(d);
    UNLOCK(s);
}
- Atomic and isolated execution
- Replaces locks for many applications
  - No lock granularity problem
  - No deadlock
  - Composable synchronization
DEADLOCK!
107. Single-CMP System
[Figure: single-CMP system with an Interconnect, L2 and DRAM]
108. Methods
- Simulated Machine: 32-way non-CMP
  - 32 SPARC V9 processors running Solaris 9 OS
  - 1 GHz in-order processors w/ ideal IPC = 1 & private caches
  - 16 kB 4-way split L1 cache, 1-cycle latency
  - 4 MB 4-way unified L2 cache, 12-cycle latency
  - 4 GB main memory, 80-cycle access latency
  - Full-bit vector directory w/ directory cache
  - Hierarchical switch interconnect, 14-cycle latency
- Simulation Infrastructure
  - Virtutech Simics for full-system function
  - Multifacet GEMS for memory system timing (Ruby only); GPL release: http://www.cs.wisc.edu/gems/
  - Magic no-op instructions for begin_transaction(), etc.
109. Microbenchmark Analysis
- Shared Counter
  - All threads update the same counter
  - High contention
  - Small Transactions
- LogTM v. Locks
  - EXP - Test-And-Test-And-Set Locks with Exponential Backoff
  - MCS - Software Queue-Based Locks
BEGIN_TRANSACTION();
new_total = total.count + 1;
private_data[id].count = total.count;
total.count = new_total;
COMMIT_TRANSACTION();
110. Shared Counter
- LogTM (like other HTMs) does not read/write lock
- LogTM has few aborts despite conflicts
111. SPLASH2 Benchmarks
Benchmark | Input | Synchronization
Barnes | 512 Bodies | Locks on tree nodes
Cholesky | 14 | Task queue locks
Ocean | Contiguous partitions, 258 | Barriers
Radiosity | Room | Task queue and buffer locks
Raytrace | Small image (teapot) | Work list and counter locks
Raytrace-Opt | Small image (teapot) | Work list and counter locks
Water N-Squared | 512 Molecules | Barriers
112. SPLASH2 Benchmark Results
113. SPLASH2 Benchmark Results
Benchmark | Transactions | Stalls | Aborts | R-M-W
Barnes | 3,067 | 4.89 | 15.3 | 27.9
Cholesky | 22,309 | 4.54 | 2.07 | 82.3
Ocean | 6,693 | .30 | .52 | 100
Radiosity | 279,750 | 3.96 | 1.03 | 82.7
Raytrace-Base | 48,285 | 24.7 | 1.24 | 99.9
Raytrace-Opt | 47,884 | 2.04 | .41 | 99.9
Water | 35,398 | 0 | .11 | 99.6
? Conflicts Less Common ?
? Aborts ?
Most trans. data read before written
114. LogTM
- No limit on transaction size
- New values stored in place (even in main memory)
- All-hardware conflict detection using sticky states
- Aborts processed in software
[Figure: virtual memory holding new values in place, with transaction logs holding old values]
HPCA 2006 - LogTM: Log-Based Transactional Memory, Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill and David A. Wood
115. Nested LogTM
- Supports closed and open nesting by
  - Splitting log into frames (like a stack of activation records)
  - Replicating R/W bits
- Escape actions provide non-transactional execution for system calls and I/O
[Figure: transaction log split into frames; Level 0: Header, Undo record, Undo record; Level 1: Header, Undo record, Undo record]
ASPLOS 2006 - Supporting Nested Transactional Memory in LogTM, Michelle J. Moravan, Jayaram Bobba, Kevin E. Moore, Luke Yen, Mark D. Hill, Ben Liblit, Michael M. Swift and David A. Wood
116. LogTM-SE: Signature Edition
- Nested LogTM has several implementation issues
- Nesting depth limited by hardware
- Multiple R and W bits per cache block
- SMT makes this worse
- Mucks with latency critical L1 cache
- Not easy to virtualize
- Decouple conflict detection from L1 cache array
- Use Signatures to conservatively detect conflicts
- E.g., Bloom filters
- Small filters sufficient for most transactions
117. LogTM-SE and Nesting
- Single hardware signature
  - Save current signature on nested begin
- On conflict, abort inner transaction and reload signature
  - Check if conflict resolved, if not repeat
- Closed nested commit
  - No change to hardware signature
  - Child merges with parent
- Open nested commit
  - Restore saved signature from log
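The signature handling for nesting can be sketched with one live signature and a stack of saved copies. The class and method names are hypothetical, the signature is a plain bit mask, and read and write signatures are collapsed into one for brevity; in LogTM-SE the saved copies would live in the log frames.

```python
class NestedSignature:
    """One hardware signature plus copies saved at each nested begin."""
    def __init__(self):
        self.sig = 0          # current (hardware) signature as a bit mask
        self.saved = []       # signatures saved on nested begin

    def access(self, bit):
        self.sig |= 1 << bit

    def nested_begin(self):
        self.saved.append(self.sig)

    def closed_commit(self):
        self.saved.pop()      # child merges with parent: signature unchanged

    def open_commit(self):
        self.sig = self.saved.pop()   # restore: the child's isolation is released

    def abort_inner(self):
        self.sig = self.saved.pop()   # reload the signature saved at nested begin

s = NestedSignature()
s.access(1)            # parent touches block bit 1
s.nested_begin()
s.access(4)            # closed child touches bit 4
s.closed_commit()
closed = s.sig         # parent keeps the child's bit
s.nested_begin()
s.access(6)            # open child touches bit 6
s.open_commit()
opened = s.sig         # the open child's bit is dropped
```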
118. Virtualizing LogTM-SE
- Cache overflow
  - Sticky states or broadcast coherence
    - Ensures conflict detection
  - Filter (conservatively) checks for conflicts
- Thread suspension/migration
  - Second hardware signature
    - Summarizes suspended transactions
  - OS manages on scheduling events
- Paging
  - Pageout checks for (potential) conflict, OS saves state
  - Pagein updates filters with new physical address
119. Characterization of Java Middleware
- ICPP 2005 - Exploring Processor Design Options for Java Based Middleware
- HPCA 2003 - Memory System Behavior of Java-Based Middleware
Martin Karlsson, Kevin E. Moore, Erik Hagersten and David A. Wood
120. Closed Nesting in LogTM
- Conflict Detection
  - Nested LogTM replicates R/W bits for each level
  - Flash-Or circuit merges child and parent R/W bits
- Version Management
  - Nested LogTM segments the log into frames
  - (similar to a stack of activation records)
[Figure: processor with Registers, Register Checkpoint, LogFrame, LogBase, LogPtr and TMcount; data caches with Tag, Data and replicated R and W bits per line]
121. Hardware State
- R and W bit per cache line
  - Track read and write sets
- Overflow bit
- Register checkpoint
  - Fast save/restore
- Log Base and Log Pointer registers
- TM nesting count
[Figure: processor with Registers, Register Checkpoint, LogBase, LogPtr and TMcount; data cache with R, W, Tag and Data per line plus an Overflow bit]
122. How Do Transactional Memory Systems Differ?
 | Lazy Version Management | Eager Version Management
Lazy Conflict Detection | Databases with Optimistic Conc. Ctrl.; Stanford TCC; UIUC Bulk | Not done (yet)
Eager Conflict Detection | Herlihy/Moss TM; MIT LTM; Intel/Brown VTM | Databases with Conservative C. Ctrl.; MIT UTM; Wisconsin LogTM
123Virtualization Challenge
- Hardware TM Implementations
- Finite Hardware Signatures
- Multiplexed: Thread Switching, Virtual Memory
- LogTM-SE
- Version Management (already virtualized)
- Transaction Log
- Virtual Memory
- Conflict Detection (coming up)
- Signatures
- Physical Addresses
124Open Nesting
Child transaction exposes state on commit
(i.e., before the parent commits)
- Raise level of abstraction for isolation and abort
- Eliminates semantically unnecessary conflicts
- Increases concurrency
- Higher-level isolation
- Release memory-level isolation
- Programmer enforces isolation at higher level (e.g., locks)
- Use commit action to release isolation at parent commit
- Higher-level abort
- Child's memory updates not undone if parent aborts
- Use compensating action to undo the child's forward action at a higher level of abstraction
- E.g., malloc() compensated by free()
125Commit and Compensating Actions
- Commit Actions
- Execute when innermost open ancestor commits
- Outermost transaction is considered open
- Use to release isolation at higher level of abstraction
- Compensating Actions
- Discard when innermost open ancestor commits
- Execute in LIFO order when ancestor aborts
- Execute in the state that held when its forward action committed [Moss, TRANSACT 06]
126Open Nested Example
insert(int key, int value) {
  open_begin;
  leaf = find_leaf(key);
  entry = insert_into_leaf(key, value);
  // lock entry to isolate node
  entry->lock = 1;
  open_commit(abort_action(delete(key)),
              commit_action(unlock(key)));
}
insert_set(set S) {
  open_begin;
  while ((key, value) = next(S))
    insert(key, value);
  open_commit(abort_action(delete_set(S)));
}
- Isolate entry at higher level of abstraction
- Delete entry if ancestor aborts
- Release high-level isolation on ancestor commit
- Replace compensating action with higher-level action on commit
127Condition O1
Condition O1: An open nested child transaction never modifies a memory location that has been modified by any ancestor.
- If condition O1 holds, programmers need not reason about the interaction between compensation and undo
- All implementations of nesting (so far) agree on semantics when O1 holds
128Timing of Compensating Actions
- LogTM behaves correctly
- Compensating action sees the state of the counter when the open transaction committed (2)
- Decrement restores the value to what it was before the open nest executed (1)
- Undo of the parent restores the value back to (0)
- TCC doesn't
- Counter ends up at 1
// initialize to 0
counter = 0;
transaction_begin();  // top-level 1
counter++;            // counter gets 1
open_begin();         // level 2
counter++;            // counter gets 2
open_commit(abort_action(counter--));
...
// Abort and run compensating action
// Expect counter to be restored to 0
...
transaction_commit(); // not executed
Condition O1: No writes to blocks written by an ancestor transaction.
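The LogTM behavior described on this slide can be sketched in a few lines of Python. This is a toy model, not the LogTM implementation: the `Log` class, its method names, and the record layout are all assumptions made for illustration. It only shows how walking a single log newest-first interleaves the compensating action with the parent's undo records.

```python
class Log:
    """Toy LogTM-style log (hypothetical): undo and compensation records in one list."""

    def __init__(self, memory):
        self.memory = memory
        self.records = []   # appended in program order
        self.frames = []    # start index of each active frame

    def begin(self):
        self.frames.append(len(self.records))

    def store(self, addr, value):
        # Eager versioning: log the old value, then update memory in place.
        self.records.append(("undo", addr, self.memory[addr]))
        self.memory[addr] = value

    def open_commit(self, compensate=None):
        # Discard the child's frame (its updates stay in memory) and
        # optionally append a compensating action record.
        start = self.frames.pop()
        del self.records[start:]
        if compensate is not None:
            self.records.append(("comp", compensate))

    def abort(self):
        # Walk the current frame newest-first, interleaving compensating
        # actions with restoration of old values.
        start = self.frames.pop()
        while len(self.records) > start:
            rec = self.records.pop()
            if rec[0] == "undo":
                self.memory[rec[1]] = rec[2]
            else:
                rec[1](self.memory)


def run_counter_example():
    mem = {"counter": 0}
    log = Log(mem)
    log.begin()                                # top-level transaction
    log.store("counter", mem["counter"] + 1)   # counter gets 1
    log.begin()                                # open nested child
    log.store("counter", mem["counter"] + 1)   # counter gets 2

    def decrement(m):
        m["counter"] -= 1

    log.open_commit(compensate=decrement)
    log.abort()                                # parent aborts
    return mem["counter"]
```

In this sketch the compensating action runs first and sees the counter at 2, then the parent's undo record restores 0, matching the order the slide describes.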
129Open Nesting in LogTM
- Conflict Detection
- R/W bits cleared on open commit
- (no flash-or)
- Version Management
- Open commit pops the most recent frame off the log
- (Optionally) add commit and compensating action records
- Compensating actions are run by the software abort handler
- Software handler interleaves restoration of memory state and compensating action execution
130Open Nested Commit
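- Discard child's log frame
- (Optionally) append commit and compensating actions to log
[Figure: transaction log after an open nested commit: parent header and undo records, followed by the appended commit action and compensating action records; LogFrame and LogPtr point into the log; TM count goes from 2 to 1]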
131Paging Support
- Why?
- Support Large Transactions.
- What?
- Physical Relocation of Virtual Pages
- How?
- Update Signatures on paging activity
132Updating Signatures
Suppose Virtual Page (VP) 0x40000 -> Physical Frame (PP) 0x1000
Signature A: 0x1040, 0x1080, 0x30c0
At Page Out: Remember 0x40000 -> 0x1000
At Page In: Suppose 0x40000 -> 0x2000
Signature A: 0x1040, 0x1080, 0x2040, 0x2080, 0x30c0
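The update above can be sketched as follows. Everything in this block is an assumption for illustration: the `Signature` class, its two hash functions, the 64-byte block size, the 4 KB page size, and the 1024-bit filter are not taken from the slides. The sketch tests every block of the old frame against the signature and inserts the corresponding address in the new frame; old addresses stay in the signature, which is conservative but safe.

```python
BLOCK = 0x40    # assumed 64-byte cache blocks
PAGE = 0x1000   # assumed 4 KB pages
BITS = 1024     # assumed signature size in bits


class Signature:
    """Hypothetical Bloom-filter-style signature over physical block addresses."""

    def __init__(self):
        self.bits = 0

    def _hashes(self, addr):
        block = addr // BLOCK
        return (block % BITS, (block * 7 + 3) % BITS)

    def insert(self, addr):
        for h in self._hashes(addr):
            self.bits |= 1 << h

    def might_contain(self, addr):
        # May report false positives, never false negatives.
        return all((self.bits >> h) & 1 for h in self._hashes(addr))


def page_in(sig, old_frame, new_frame):
    """Re-insert blocks of a relocated page under their new physical addresses."""
    if old_frame == new_frame:
        return
    for offset in range(0, PAGE, BLOCK):
        if sig.might_contain(old_frame + offset):
            sig.insert(new_frame + offset)


sig_a = Signature()
for addr in (0x1040, 0x1080, 0x30C0):
    sig_a.insert(addr)
page_in(sig_a, old_frame=0x1000, new_frame=0x2000)
```

After `page_in`, `sig_a` reports the blocks at 0x2040 and 0x2080 as well as the three original addresses, mirroring the slide's example.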
133Paging Support Summary
- Problem
- Changing page frames
- Need to maintain isolation on transactional blocks
- Solution
- On Page-Out
- Save Virtual -> Physical mapping
- On Page-In
- If different page frame, update signatures with physical address of transactional blocks in new page frame.
134The State of the World (in computer architecture)
- Chip-multiprocessors/Multi-core/Many-core are here
- Intel has 10 projects in the works that contain four or more computing cores per chip -- Paul Otellini, Intel CEO, Fall 05
- GHz race is over
- Frequency increase limited by heat and power constraints
- Size of processor limited by communication delay, not transistors
- Increasing wire delay on chip
- All high-performance processors will be CMP
- Software must become parallel
135Parallel Programming is Hard!
- Data races cause subtle bugs
- Locks are a mess
- Deadlock
- Granularity problem
- Not composable
- Lock-free solutions still challenging
- We need a better way to write parallel software
136Solution Let the hardware help
- Provide a better interface for parallel software
- Plenty of transistors
- Access to run-time information
- Transactional Memory
- Intuitive interface -- serial execution
- High performance -- run transactions in parallel when possible
- Current cache coherence schemes already do much of the work
137LogTM Log-Based Transactional Memory
- Combined Hardware/Software Implementation
- Conflicts detected in hardware
- Aborts processed in software
- Policy-Free Hardware
- Simple hardware primitives
- Software-accessible state
- Supports Transactions with
- Large memory footprints
- Thread switching
- Unbounded nesting
- Paging
138Transactional Memory
- Promising programming technique
- begin_transaction atomic execution end_transaction
- Good first step
- Likely benefits
- Can be integrated into current hardware and programming languages
- Will not save the world
139Nested Transactions for Software Composition
- Modules expose interfaces, NOT implementations
- Example
- Insert() calls getID() from within a transaction
- The getID() transaction is nested inside the insert() transaction
int getID() {
  // child TX
  begin_transaction();
  id = global_id++;
  commit_transaction();
  return id;
}
void insert(object o) {
  // parent TX
  begin_transaction();
  t.insert(getID(), o);
  commit_transaction();
}
140Closed Nesting
Child transactions remain isolated until parent
commits
- On Commit child transaction is merged with its parent
- Flat
- Nested transactions flattened into a single transaction
- Only outermost begins/commits are meaningful
- Any conflict aborts to outermost transaction
- Partial rollback
- Child transaction can be aborted independently
- Can avoid costly re-execution of parent transaction
- But child merges transaction state with parent on commit
- So most conflicts with child end up affecting the parent
141Thesis We need new hardware and software
- Architects should devote resources to support parallelism
- Manycore will succeed only if we find a way to program it (only if software is parallel)
- Using resources to facilitate parallelism is less risky
- Hardware Primitives Software Solutions
- HW Implements difficult functions
- Coordinated by SW
142Segmented Transaction Log for Nesting
- LogTM's log is a stack of frames
- A frame contains
- Header (including saved registers and pointer to parent's frame)
- Undo records (block address, old value pairs)
- Garbage headers (headers of committed closed transactions)
- Commit action records
- Compensating action records
[Figure: segmented transaction log: a header followed by undo records, then a nested header followed by more undo records; LogFrame points to the current frame's header and LogPtr to the end of the log; TM count register shown alongside]
143Closed Nested Commit
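- Merge child's log frame with parent's
- Mark child's header as dummy header
- Copy pointer from child's header to LogFrame
[Figure: transaction log after a closed nested commit: the child's header remains in the log as a dummy header between the parent's and child's undo records; LogFrame points back to the parent's header; TM count goes from 2 to 1]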
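The closed nested commit steps above can be illustrated with a small Python model of a segmented log. This is a sketch under stated assumptions: the `SegmentedLog` class, the dictionary record format, and the use of list indices as log addresses are all hypothetical, and register checkpoints are omitted.

```python
class SegmentedLog:
    """Toy segmented log (hypothetical layout): headers and undo records in one list."""

    def __init__(self, memory):
        self.memory = memory
        self.log = []           # list index stands in for a log address
        self.log_frame = None   # index of the current frame's header (LogFrame)
        self.tm_count = 0       # nesting depth (TMcount)

    def begin(self):
        # A new header remembers the parent's frame.
        self.log.append({"kind": "header", "parent": self.log_frame})
        self.log_frame = len(self.log) - 1
        self.tm_count += 1

    def store(self, addr, value):
        self.log.append({"kind": "undo", "addr": addr, "old": self.memory[addr]})
        self.memory[addr] = value

    def closed_commit(self):
        header = self.log[self.log_frame]
        header["kind"] = "dummy"            # mark child's header as dummy
        self.log_frame = header["parent"]   # copy parent pointer to LogFrame
        self.tm_count -= 1

    def abort(self):
        # Undo the current frame newest-first; dummy headers are skipped.
        while len(self.log) - 1 > self.log_frame:
            rec = self.log.pop()
            if rec["kind"] == "undo":
                self.memory[rec["addr"]] = rec["old"]
        header = self.log.pop()
        self.log_frame = header["parent"]
        self.tm_count -= 1


mem = {"a": 0, "b": 0}
tx = SegmentedLog(mem)
tx.begin()
tx.store("a", 1)
tx.begin()
tx.store("b", 2)
tx.closed_commit()   # child's undo records now belong to the parent
tx.abort()           # aborting the parent also undoes the child's store
```

Because the child's frame is merged rather than discarded, aborting the parent after the closed commit restores both `a` and `b`. Calling `abort()` before the closed commit would instead undo only the child's store, which is the partial rollback case.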
144LogTM-SE Signatures
- Conflict-detection signatures
- Summarize read and write sets
- Similar to Bulk ISCA 2006
- Aliasing is a performance issue
- Results in false conflicts
- Rare for current apps
- Version-management signatures
- Prevent redundant entries in the log
- Aliasing is a functional issue
- Results in incorrect abort
- Use small full-address filter
- Some redundant log entries
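The two kinds of aliasing can be contrasted with a short Python sketch. The 8-bit signature, the single modulo hash, the 64-byte block size, and the `LogFilter` class with its capacity of two are illustrative assumptions, not details from the slides. A hashed conflict-detection signature may report a false conflict, which only costs performance. A log filter that stores full addresses never claims a block was logged when it was not, so forgetting an entry only produces a redundant log record.

```python
from collections import deque


def make_signature(addresses, bits=8, block=64):
    """Hash each block address into a small bit vector (assumed hash: modulo)."""
    sig = 0
    for addr in addresses:
        sig |= 1 << ((addr // block) % bits)
    return sig


def may_conflict(sig, addr, bits=8, block=64):
    """True if addr hashes to a set bit; aliasing can make this a false conflict."""
    return bool(sig & (1 << ((addr // block) % bits)))


class LogFilter:
    """Small filter of full block addresses used to skip redundant log entries."""

    def __init__(self, capacity=2):
        self.recent = deque(maxlen=capacity)   # oldest entry is evicted when full

    def should_log(self, addr):
        if addr in self.recent:
            return False        # exact match: already logged, safe to skip
        self.recent.append(addr)
        return True             # unknown or forgotten: log (possibly redundantly)


write_sig = make_signature([0x1040])
false_conflict = may_conflict(write_sig, 0x1240)   # different block, same bit
```

Here 0x1040 and 0x1240 map to the same signature bit, so `false_conflict` is true even though the second address was never written.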
145LogTM-SE Unbounded Nesting Support
- Why?
- Composability libraries
- Software Constructs: Retry, OrElse [Harris, PPoPP 05]
- What?
- Signatures for each nesting level
- How?
- One R / W signature set per SMT thread
- Save / Restore signatures using Transaction Log
146Nested Begin
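[Figure: nested begin, before the inner xbegin takes effect. Program: xbegin LD ST xbegin. Processor state: R = 01001000, W = 01010010, TMCount = 1, Log Frame and Log Ptr. Transaction log: outer Xact header (saved signatures 00000000) followed by undo entries; the current R and W signatures are copied into a new Xact header]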
147Nested Begin
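[Figure: nested begin, after the inner xbegin. Program: xbegin LD ST xbegin. Processor state: R = 01001000, W = 01010010, TMCount = 2, Log Frame and Log Ptr. Transaction log: outer Xact header and undo entries, then a new Xact header holding the saved signatures 01001000 and 01010010]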
148Partial Abort
[Figure: partial abort. Program: xbegin LD ST xbegin LD ST, then ABORT!. Processor state: R restored from 01001001 to 01001000 and W from 01110110 to 01010010, using the values saved in the inner Xact header; TMCount goes from 2 to 1. Transaction log: outer Xact header and undo entries, inner Xact header (01001000, 01010010) and the inner undo entries]
149Nested Commit
[Figure: nested commit. Program: xbegin LD ST xbegin LD ST xend. Processor state values shown: R 01001001 and 01001000, W 01110110 and 01010010, TMCount 2 and 1, Log Frame and Log Ptr. Transaction log: outer Xact header and undo entries, inner Xact header (01001000, 01010010) and the inner undo entries]
150Unbounded Nesting Support Summary
- Closed nesting
- Begin: save signatures
- Abort: restore signatures
- Commit: no signature action
- Open nesting
- Begin: save signatures
- Abort: restore signatures
- Commit: restore signatures
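The summary above can be sketched as a save/restore stack. This is a minimal model, assuming signatures are plain bit masks held in Python integers and that the saved copies live in a list standing in for the transaction log; the `SignatureThread` class and the masks passed to `load` and `store` are hypothetical. The bit patterns are the ones shown on the preceding Nested Begin and Partial Abort slides.

```python
class SignatureThread:
    """One R/W signature pair per thread, saved to the log at each nested begin."""

    def __init__(self):
        self.r = 0
        self.w = 0
        self.saved = []   # stands in for signatures stored in log headers

    def begin(self):
        self.saved.append((self.r, self.w))      # begin: save signatures

    def load(self, mask):
        self.r |= mask                           # add to the read signature

    def store(self, mask):
        self.w |= mask                           # add to the write signature

    def abort(self):
        self.r, self.w = self.saved.pop()        # abort: restore signatures

    def closed_commit(self):
        self.saved.pop()                         # closed commit: no signature action

    def open_commit(self):
        self.r, self.w = self.saved.pop()        # open commit: restore signatures


t = SignatureThread()
t.begin()
t.load(0b01001000)
t.store(0b01010010)
t.begin()                 # nested begin saves 01001000 / 01010010
t.load(0b00000001)        # R becomes 01001001
t.store(0b00100100)       # W becomes 01110110
t.abort()                 # partial abort restores the saved pair
```

A closed commit leaves the child's bits in the parent's signatures, so the parent keeps isolating the child's accesses, while an open commit drops them by restoring the saved pair.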
151Terminology
- Transaction: A transformation of state that is
- Atomic (all or nothing),
- Consistent,
- Isolated (serializable) and
- Durable (permanent)
Commit: Successful completion of a transaction
Abort: Unsuccessful termination of a transaction, requiring that all updates from the transaction are undone
Conflict: Two transactions conflict if both access the same object and at least one of the accesses is an update
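The conflict definition can be written as a small predicate over read and write sets. This is a sketch with exact sets; the dictionary layout with `reads` and `writes` keys is an assumption for illustration, and real hardware would use cache bits or signatures instead.

```python
def conflicts(a, b):
    """True if a and b access a common object and at least one access is an update."""
    write_read = a["writes"] & (b["reads"] | b["writes"])   # a's update vs any access by b
    read_write = a["reads"] & b["writes"]                   # a's read vs b's update
    return bool(write_read or read_write)


t1 = {"reads": {"x", "y"}, "writes": set()}
t2 = {"reads": {"x"}, "writes": set()}
t3 = {"reads": set(), "writes": {"y"}}
```

With these sets, `t1` and `t2` only share reads and do not conflict, while `t1` and `t3` conflict because `t3` updates `y`.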