Title: Hardware Transactional Memory for GPU Architectures
Slide 1: Hardware Transactional Memory for GPU Architectures
- Wilson W. L. Fung
- Inderpreet Singh
- Andrew Brownsword
- Tor M. Aamodt
- University of British Columbia
- In Proc. 2011 ACM/IEEE International Symposium on Microarchitecture (MICRO-44)
Slide 2: Motivation
- Lifetime of GPU Application Development
Slide 3: Talk Outline
- What we mean by GPU in this work.
- Data Synchronization on GPUs.
- What is Transactional Memory (TM)?
- TM is compatible with OpenCL.
- but is TM compatible with GPU hardware?
- KILO TM: A Hardware TM for GPUs
- Results
Slide 4: What is a GPU (in this work)?
- GPU is an NVIDIA/AMD-like compute accelerator
- SIMD HW + aggressive memory subsystem => high compute throughput and efficiency
- Non-graphics APIs: OpenCL, DirectCompute, CUDA
- Programming model: hierarchy of scalar threads
- Today: limited communication and synchronization
[Figure: a kernel is a hierarchy of thread blocks (work groups); threads in a block share local memory and synchronize at barriers, and all blocks access global memory]
Slide 5: Baseline GPU Architecture
[Figure: SIMT cores connected through an interconnection network to multiple memory partitions; each partition contains an atomic op. unit, a last-level cache bank, and an off-chip DRAM channel]
Slide 6: Stack-Based SIMD Reconvergence (SIMT)
(Levinthal SIGGRAPH 1984; Fung MICRO 2007)
[Figure: control-flow graph with per-block active masks: A/1111 and B/1111 diverge into C/1001 and D/0110, then reconverge at E/1111 and G/1111]
Slide 7: Data Synchronization on GPUs
- Motivation: solve a wider range of problems on the GPU
- Data race => data synchronization
- Current solution: atomic read-modify-write (32-bit/64-bit). Best solution?
- Why Transactional Memory?
  - E.g. N-Body with 5M bodies (traditional sync, not TM):
    - CUDA SDK, O(n^2): 1640 s (barrier)
    - Barnes-Hut, O(n log n): 5.2 s (atomics, harder to get right)
  - Easier to write/debug efficient algorithms
  - Practical efficiency: want the efficiency of the GPU with reasonable (not superhuman) effort and time
Slide 8: Data Synchronization on GPUs
- Writing deadlock-free code with fine-grained locks and 10,000 hardware-scheduled threads is hard
- Other general problems with lock-based synchronization:
  - Implicit relationship between locks and the objects they protect
  - Code is not composable
Slide 9: Data Synchronization Problems Specific to GPUs
- Interaction between locks and SIMT control flow can cause deadlocks:

    A: done = 0;
    B: while (!done) {
    C:   if (atomicCAS(&lock, 0, 1) == 0) {
    D:     // Critical Section
    E:     lock = 0;
    F:     done = 1;
    G:   }
    H: }

- The lane that acquires the lock diverges from its warp-mates; stack-based reconvergence can keep it inactive while the losing lanes spin waiting for the lock to be released, so no lane makes progress: deadlock.
Slide 10: Transactional Memory
- Program specifies atomic code blocks called transactions [Herlihy 1993]

TM version:

    atomic { Xc = Xa + Xb; }

Lock version:

    Lock(Xa); Lock(Xb); Lock(Xc);
    Xc = Xa + Xb;
    Unlock(Xc); Unlock(Xb); Unlock(Xa);
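A minimal sketch of what the TM version might look like in CUDA. The __tx_begin()/__tx_commit() intrinsics are hypothetical placeholders for whatever interface a compiler would expose; they are our assumption, not real CUDA and not from the talk.

    // Hypothetical transaction intrinsics (assumed, not real CUDA).
    __device__ void __tx_begin();
    __device__ void __tx_commit();

    __device__ float Xa, Xb, Xc;    // shared data guarded by the transaction

    __global__ void update() {
        __tx_begin();               // start tracking reads and writes
        Xc = Xa + Xb;               // loads of Xa/Xb and the store to Xc are logged
        __tx_commit();              // validate; re-execute the block on conflict
    }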
Slide 11: Transactional Memory
- Programmer's view: the two transactions appear to execute atomically, in some serial order
[Figure: two alternative timelines, TX1 before TX2 OR TX2 before TX1]
Slide 12: Transactional Memory
- Each transaction has 3 phases (see the sketch below):
  - Execution
    - Track all memory accesses (read-set and write-set)
  - Validation
    - Detect any conflicting accesses between transactions
    - Resolve conflicts if needed (abort/stall)
  - Commit
    - Update global memory
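A hedged software model of this lifecycle, for intuition only: the types and the conflictsWith() helper are illustrative assumptions, and this sketches generic lazy versioning rather than KILO TM's hardware.

    struct LogEntry { unsigned addr; unsigned value; };

    struct TxState {
        LogEntry readSet[64];  int nReads;    // Phase 1 (Execution) fills these
        LogEntry writeSet[64]; int nWrites;   // writes buffered, not yet visible
    };

    // Assumed helper: true if another transaction conflicts with tx.
    __device__ bool conflictsWith(const TxState& tx);

    __device__ bool tryCommit(TxState& tx, unsigned* mem) {
        // Phase 2 (Validation): detect conflicting accesses.
        if (conflictsWith(tx))
            return false;                     // resolve by aborting; caller retries
        // Phase 3 (Commit): publish the buffered write-set to global memory.
        for (int i = 0; i < tx.nWrites; ++i)
            mem[tx.writeSet[i].addr] = tx.writeSet[i].value;
        return true;
    }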
Slide 13: Transactional Memory on OpenCL
- A natural extension to the OpenCL programming model
- A program can launch many more threads than the hardware can execute concurrently
- GPU TM? Threads currently running transactions do not need to wait for future, not-yet-scheduled threads
[Figure: many logical threads multiplexed onto the GPU HW]
Slide 14: Are TM and GPUs Incompatible?
- The problems with GPUs (from a TM perspective):
  - 1000s of concurrent threads
  - Inter-thread spatial locality is common
  - No cache coherence
  - No private cache for each thread (where to buffer?)
  - Tx abort => control flow divergence
Slide 15: Hardware TM for GPUs, Challenge: Conflict Detection
[Figure: TX1-TX4, each with a private data cache and a signature, tied together by scalable coherence, as in CPU HTMs]
- No coherence on GPUs? Each scalar thread needs its own cache?
Slide 16: Hardware TM for GPUs, Challenge: Transaction Rollback
[Figure: checkpointing registers for 1000s of threads, against only 2MB of total on-chip storage]
Slide 17: Hardware TM for GPUs, Challenge: Access Granularity and Write Buffer
[Figure: a CPU core can buffer transactional writes in its L1 data cache; a GPU core (SM) shares one small L1 data cache among all its threads]
- Problem: 384 lines / 1536 threads < 1 line per thread!
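For scale, under Fermi-class assumptions consistent with the slide's numbers (a 48 KB L1 with 128-byte lines and up to 1536 threads per core): 48 KB / 128 B = 384 lines, and 384 / 1536 = 0.25 lines per thread, so per-thread write buffering in the L1 is untenable.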
Slide 18: Hardware TM on GPUs, Challenge: SIMT Hardware
- On GPUs, scalar threads in a warp/wavefront execute in lockstep
- Example: a warp with 8 scalar threads runs

    ... TxBegin; LD r2,B; ADD r2,r2,2; ST r2,A; TxCommit; ...

- If some threads in the warp commit and others abort, how do they reconverge?
Slide 19: Goal
- We take it as a given that most programmers trying lock-based programming on a GPU will give up before they manage to get their application working.
- Hence, our goal was to find the most efficient approach to implementing TM on a GPU.
Slide 20: KILO TM
- Supports 1000s of concurrent transactions
- Transaction-aware SIMT stack
- No cache coherence protocol dependency
- Word-level conflict detection
- Captures 59% of FG lock performance
- 128X faster than serialized Tx execution
Slide 21: KILO TM Design Highlights
- Value-based conflict detection
  - Self-validation + abort => simple communication
  - No dependence on cache coherence
- Speculative validation
  - Increases commit parallelism
Slide 22: High-Level GPU Architecture: KILO TM Implementation Overview
[figure only]
Slide 23: KILO TM: SIMT Core Changes
- SW register checkpoint
  - Observation: most overwritten registers are not used again
  - Compiler analysis can identify what to checkpoint
- Transaction abort
  - Do-while loop (see the sketch below)
  - Extend the SIMT stack with special entries to track aborted transactions in each warp
- Example transaction: TxBegin; LD r2,B; ADD r2,r2,2; ST r2,A; TxCommit
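A hedged sketch of the compiled form (txbegin()/txcommit() are assumed intrinsics, not real CUDA): the compiler wraps the transaction in a do-while retry loop, and the transaction-aware SIMT stack keeps aborted lanes active in that loop.

    __device__ int A, B;
    __device__ void txbegin();     // assumed intrinsic: start logging accesses
    __device__ bool txcommit();    // assumed intrinsic: false => aborted

    __global__ void txloop() {
        do {
            txbegin();
            int r2 = B;            // LD  r2, B   (logged in the read-set)
            r2 = r2 + 2;           // ADD r2, r2, 2
            A = r2;                // ST  r2, A   (kept in the write-log)
        } while (!txcommit());     // abort => re-execute the whole block
        // r2 is redefined before any use inside the loop, so it needs no
        // register checkpoint: the common case behind the slide's observation.
    }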
Slide 24: Transaction-Aware SIMT Stack
[figure only]
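As a hedged illustration of what this slide's figure likely shows: each stack entry pairs an active mask with PCs, and an extra entry type marks lanes whose transactions aborted so the warp re-executes them. The field names and entry types below are our assumptions, not the paper's exact encoding.

    // Hedged sketch of a transaction-aware SIMT stack entry (assumed layout).
    enum EntryType {
        NORMAL,      // ordinary divergence/reconvergence entry
        TX_RETRY     // lanes whose transaction aborted; re-enter the Tx loop
    };

    struct SimtStackEntry {
        unsigned  pc;          // next PC for the lanes in activeMask
        unsigned  reconvPc;    // where diverged lanes reconverge
        unsigned  activeMask;  // one bit per lane in the warp
        EntryType type;        // TX_RETRY entries restart aborted transactions
    };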
Slide 25: KILO TM: Value-Based Conflict Detection
[Figure: global memory holds A=1, B=0. TX1 runs atomic { B = A + 1 } (TxBegin; LD r1,A; ADD r1,r1,1; ST r1,B; TxCommit), keeping its read value A=1 and write value B=2 in private memory. TX2 runs atomic { A = B + 2 } (TxBegin; LD r2,B; ADD r2,r2,2; ST r2,A; TxCommit), keeping its read value B=0 and write value A=2 in private memory.]
- Self-validation at commit: re-read the read-set and compare values
- Only detects the existence of a conflict (not its identity)
  - => No Tx-to-Tx messages: simple communication
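A hedged sketch of the self-validation step this slide illustrates: re-read every address in the read-log and compare against the originally observed value. Types are illustrative, and addresses are treated as word indices for simplicity.

    struct LogEntry { unsigned addr; unsigned value; };

    // Returns true iff the read-set still matches global memory.
    __device__ bool selfValidate(const LogEntry* readLog, int nReads,
                                 const unsigned* globalMem) {
        for (int i = 0; i < nReads; ++i)
            if (globalMem[readLog[i].addr] != readLog[i].value)
                return false;  // a conflicting commit changed a value: abort
        return true;           // values unchanged: safe to commit the write-log
    }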
Slide 26: Parallel Validation? Data Race!?!
[Figure: TX1 (atomic { B = A + 1 }) and TX2 (atomic { A = B + 2 }) validate against global memory (A=1, B=0) in parallel; each sees its read value unchanged, so both commit (A=2, B=2), an outcome no serial order of TX1 and TX2 could produce]
Slide 27: Serialize Validation?
[Figure: TX1 validates and commits, then TX2 validates and commits; TX2 stalls in the meantime]
- Benefit 1: no data race
- Benefit 2: no livelock (a generic lazy-TM problem)
- Drawback: serializes non-conflicting transactions (collateral damage)
Slide 28: Identifying Non-Conflicting Tx, Step 1: Leverage Parallelism
[Figure: each global memory partition has its own commit unit; TX1 and TX2 can validate and commit in parallel at different partitions]
Slide 29: Solution: Speculative Validation
- Key idea: split validation into two parts (see the sketch after this list)
  - Part 1: check against recently committed transactions
  - Part 2: check against concurrently committing transactions
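A hedged software model of the split; the helper predicates are assumptions standing in for the last-writer-history lookup and the in-pipeline hazard check, not the actual commit-unit logic.

    // Assumed helpers standing in for commit-unit structures:
    __device__ bool recentlyCommittedWrote(unsigned addr);  // Part 1 lookup
    __device__ bool olderCommitterWrites(unsigned addr);    // Part 2 hazard check

    // Validate one transaction's read addresses speculatively.
    __device__ bool speculativeValidate(const unsigned* readAddrs, int nReads) {
        for (int i = 0; i < nReads; ++i) {
            // Part 1: conflict with a recently committed transaction => abort.
            if (recentlyCommittedWrote(readAddrs[i])) return false;
            // Part 2: an older, concurrently committing transaction may write
            // this address; wait for its outcome instead of validating blindly.
            if (olderCommitterWrites(readAddrs[i])) return false;  // stall/retry
        }
        return true;  // no hazards: this transaction can commit in parallel
    }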
Slide 30: KILO TM: Speculative Validation
- The memory subsystem is deeply pipelined and highly parallel
[Figure: commit unit at a global memory partition. Transactions TX1 (R(C), W(D)), TX2 (R(A), W(B)), and TX3 (R(D), W(E)) transfer their read-logs and write-logs and flow through the stages: Validation Queue, Log Transfer, Spec. Validation, Hazard Detection, Validation Wait, Finalize Outcome, Commit]
Slide 31: KILO TM: Speculative Validation (cont.)
[Figure: same pipeline with TX1 (R(C), W(D)), TX2 (R(A), W(B)), TX3 (R(D), W(E)). TX3's read of D is a hazard against TX1's pending write to D, so TX3 STALLs at Hazard Detection until TX1's outcome is finalized]
Slide 32: Log Storage
- Transaction logs are stored in the private memory of each thread (sketched below)
- Located in DRAM, cached in the L1 and L2 caches
[Figure: each thread in a wavefront has a read-log pointer and a write-log pointer]
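A hedged sketch of what one thread's logs might look like; sizes and layout are our assumptions. The point is that these live in ordinary private (local) memory, so they are backed by DRAM and cached in L1/L2 like any other private data.

    struct LogEntry { unsigned addr; unsigned value; };

    // Per-thread transaction logs, resident in private (local) memory.
    struct ThreadTxLogs {
        LogEntry readLog[32];   // address + value observed during execution
        LogEntry writeLog[32];  // address + value to publish at commit
        int nReads, nWrites;
    };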
Slide 33: Log Transfer
- Log entries heading to the same memory partition can be grouped into a larger packet
[Figure: read-log and write-log entries gathered by partition before transfer to the commit units]
Slide 34: Distributed Commit / HW Organization
[figure only]
Slide 35: ABA Problem?
- Classic example: linked-list based stack
- Thread 0: pop()

    while (true) {
        t = top;
        next = t->next;
        // thread 2: pop A, pop B, push A
        if (atomicCAS(&top, t, next) == t) break;  // succeeds!
    }
Slide 36: ABA Problem?
- atomicCAS protects only a single word
  - Only part of the data structure
- Value-based conflict detection protects all relevant parts of the data structure

    while (true) {
        t = top;
        next = t->next;
        if (atomicCAS(&top, t, next) == t) break;  // succeeds!
    }
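For contrast, a hedged sketch of pop() as a transaction, reusing the assumed txbegin()/txcommit() intrinsics from earlier (not real CUDA). Because validation is value-based over the whole read-set, both top and the node's next field are checked, so the A-B-A interleaving above forces a retry.

    struct Node { Node* next; int data; };

    __device__ Node* top;
    __device__ void txbegin();     // assumed intrinsic, not real CUDA
    __device__ bool txcommit();    // false => transaction aborted

    __device__ Node* tmPop() {
        Node* t;
        do {
            txbegin();
            t = top;               // top enters the read-set
            if (t) top = t->next;  // t->next is read (and validated) too
        } while (!txcommit());     // conflicting commit => re-execute
        return t;
    }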
Slide 37: Evaluation Methodology
- GPGPU-Sim 3.0 (BSD license)
  - Detailed: IPC correlation of 0.93 vs. GT200
  - KILO TM (timing-driven memory accesses)
- GPU TM applications:
  - Hash Table (HT-H, HT-L)
  - Bank Account (ATM)
  - Cloth Physics (CL)
  - Barnes-Hut (BH)
  - CudaCuts (CC)
  - Data Mining (AP)
Slide 38: GPGPU-Sim Correlation
- GPGPU-Sim 3.0.x running SASS (via decuda): 0.976 correlation on the subset of the CUDA SDK that decuda correctly disassembles
- Note: the rest of the data uses PTX instead of SASS (0.93 correlation)
- We believe GPGPU-Sim is a reasonable proxy.
Slide 39: Performance (vs. Serializing Tx)
[figure only]
Slide 40: Absolute Performance (IPC)
[Figure: IPC chart]
- TM on a GPU performs well for applications with low contention
- It performs poorly with memory divergence, low parallelism, or a high conflict rate (tackle through algorithm design/tuning?)
- CPU vs. GPU?
  - CC: the FG-lock version is 400X faster than its CPU version
  - BH: the FG-lock version is 2.5X faster than its CPU version
Slide 41: Performance (Exec. Time)
- Captures 59% of FG lock performance
- 128X faster than serialized Tx execution
Slide 42: KILO TM Scaling
[figure only]
Slide 43: Abort/Commit Ratio
- Increasing the number of concurrent TXs => increased probability of conflict
- Two possible solutions (future work):
  - Solution 1: application performance tuning (easier with TM than with FG locks)
  - Solution 2: transaction scheduling
Slide 44: Thread Cycle Breakdown
- Status of a thread at each cycle
- Categories:
  - TC: in a warp stalled by concurrency control
  - TO: in a warp committing its transactions
  - TW: has passed commit, waiting for other threads in the warp to pass
  - TA: executing an eventually aborted transaction
  - TU: executing an eventually committed transaction (useful work)
  - AT: acquiring a lock or doing an atomic operation
  - BA: waiting at a barrier
  - NL: doing non-transactional (normal) work
Slide 45: Thread Cycle Breakdown
[Figure: stacked-bar chart of the thread cycle breakdown for the KL, FGL, KL-UC, and IDEAL configurations across HT-H, HT-L, ATM, CL, BH, CC, and AP]
Slide 46: Core Cycle Breakdown
- Action performed by a core at each cycle
- Categories:
  - EXEC: issuing a warp for execution
  - STALL: stalled by a downstream warp
  - SCRB: all warps blocked by the scoreboard due to data hazards, concurrency control, or pending commits (or any combination thereof)
  - IDLE: none of the warps are ready in the instruction buffer
Slide 47: Core Cycle Breakdown
[Figure: stacked-bar chart of the core cycle breakdown for the KL, FGL, KL-UC, and IDEAL configurations]
Slide 48: Read-Write Buffer Usage
[figure only]
Slide 49: In-Flight Buffers
[figure only]
Slide 50: Implementation Complexity
- Logs in private memory @ L1 data cache
- Commit unit:
  - 5kB last writer history unit
  - 19kB transaction status
  - 32kB read-set and write-set buffer
- CACTI 5.3 @ 40nm:
  - 0.40mm^2 x 6 memory partitions
  - 0.5% of 520mm^2
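Checking the arithmetic: 0.40mm^2 x 6 = 2.4mm^2 of commit-unit area, and 2.4 / 520 is roughly 0.46%, which rounds to the 0.5% quoted.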
Slide 51: Summary
- KILO TM
  - 1000s of concurrent transactions
  - Value-based conflict detection
  - Speculative validation for commit parallelism
  - 59% of fine-grained locking performance
  - 0.5% area overhead
Slide 52: Backup Slides
Slide 53: Logical Stage Organization
Slide 54: Execution Time Breakdown