Transcript and Presenter's Notes

Title: Hardware Transactional Memory for GPU Architectures*


1
Hardware Transactional Memory for GPU
Architectures
  • Wilson W. L. Fung
  • Inderpeet Singh
  • Andrew Brownsword
  • Tor M. Aamodt
  • University of British Columbia
  • In Proc. 2011 ACM/IEEE Int'l Symp. on Microarchitecture (MICRO-44)

2
Motivation
  • Lifetime of GPU Application Development

3
Talk Outline
  • What we mean by GPU in this work
  • Data Synchronization on GPUs
  • What is Transactional Memory (TM)?
  • TM is compatible with OpenCL...
  • ...but is TM compatible with GPU hardware?
  • KILO TM: A Hardware TM for GPUs
  • Results

4
What is a GPU (in this work)?
  • GPU is an NVIDIA/AMD-like Compute Accelerator
  • SIMD HW + Aggressive Memory Subsystem => High Compute Throughput and Efficiency
  • Non-Graphics API: OpenCL, DirectCompute, CUDA
  • Programming Model: Hierarchy of scalar threads
  • Today: Limited Communication & Synchronization

[Diagram: a kernel consists of blocks (work groups / thread blocks); threads within a block share local memory and a barrier; all threads access global memory.]
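To make that hierarchy concrete, here is a minimal CUDA sketch (the kernel name and sizes are illustrative, not from the slides): a kernel launches a grid of thread blocks, and threads within a block share on-chip memory and can synchronize only at a block-wide barrier.

  __global__ void scaleKernel(float *data, float factor, int n)
  {
      __shared__ float tile[256];                    // shared (local) memory
      int i = blockIdx.x * blockDim.x + threadIdx.x; // one scalar thread per element
      if (i < n) tile[threadIdx.x] = data[i];
      __syncthreads();                               // barrier: block-wide only
      if (i < n) data[i] = tile[threadIdx.x] * factor;
  }
  // Launch: scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);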
5
Baseline GPU Architecture
Memory Partition
Memory Partition
Memory Partition
Atomic Op. Unit
Interconnection Network
Last-Level Cache Bank
Off-Chip DRAM Channel
6
Stack-Based SIMD Reconvergence (SIMT)
(Levinthal, SIGGRAPH '84; Fung, MICRO '07)

[Diagram: a control-flow graph annotated with per-block active masks for a 4-thread warp: A/1111, B/1111, C/1001, D/0110, E/1111, G/1111; the divergent paths C and D execute serially and the warp reconverges at E.]
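A hedged illustration of how such masks arise (the code is hypothetical, chosen to match the diagram): a branch splits a 4-thread warp into masks 1001 and 0110, the SIMT stack executes each side serially, and it pops back to the full mask at the reconvergence point.

  __device__ void divergent(int *out)
  {
      int tid = threadIdx.x % 4;   // 4-thread warp for illustration
      int x = out[tid];            // blocks A, B: all threads active (1111)
      if (tid == 0 || tid == 3)
          x += 1;                  // block C: active mask 1001
      else
          x -= 1;                  // block D: active mask 0110
      out[tid] = x;                // block E: reconverged (1111)
  }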
7
Data Synchronizations on GPUs
  • Motivation
  • Solve a wider range of problems on the GPU
  • Data Race → Data Synchronization
  • Current Solution: Atomic read-modify-write (32-bit/64-bit)
  • Best Solution?
  • Why Transactional Memory?
  • E.g. N-Body with 5M bodies (traditional sync, not TM): CUDA SDK, O(n²): 1640 s (barrier); Barnes-Hut, O(n log n): 5.2 s (atomics, harder to get right)
  • Easier to Write/Debug Efficient Algorithms
  • Practical efficiency: want the efficiency of the GPU with reasonable (not superhuman) effort and time.

8
Data Synchronizations on GPUs
  • Deadlock-free code with fine-grained locks and 10,000 hardware-scheduled threads is hard
  • Other general problems with lock-based synchronization
  • Implicit relationship between locks and objects
    being protected
  • Code is not composable

9
Data Synchronization Problems Specific to GPUs
  • Interaction between locks and SIMT control flow
    can cause deadlocks

  A: done = 0;
  B: while (!done) {
  C:   if (atomicCAS(lock, 0, 1) == 0) {  // acquired: old value was 0
  D:     // Critical Section
  E:     lock = 0;
  F:     done = 1;
  G:   }
  H: }
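Why this hangs, as a hedged walkthrough (this execution order is one possibility, not taken from the slides): suppose thread 0 of a 4-thread warp wins the CAS.

  // mask 1111: B  while(!done)      all four threads iterate
  // mask 0111: C  atomicCAS fails   losers branch back to B and spin
  // mask 0111: B  while(!done)      lock still held, so they spin forever
  // mask 1000: D  (never scheduled) winner's critical section waits
  //
  // The SIMT stack runs one side of a branch to the reconvergence point
  // before switching sides; the losers' spin never reaches that point, so
  // the warp never executes the winner's path that would release the lock.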
10
Transactional Memory
  • Program specifies atomic code blocks called transactions [Herlihy '93]

TM Version:
  atomic { Xc = Xa + Xb; }

Lock Version:
  Lock(Xa); Lock(Xb); Lock(Xc);
  Xc = Xa + Xb;
  Unlock(Xc); Unlock(Xb); Unlock(Xa);
11
Transactional Memory
Programmer's View:

[Diagram: over time, TX1 and TX2 appear to execute atomically in some serial order: TX1 then TX2, OR TX2 then TX1.]
12
Transactional Memory
  • Each transaction has 3 phases
  • Execution
  • Track all memory accesses (Read-Set and
    Write-Set)
  • Validation
  • Detect any conflicting accesses between
    transactions
  • Resolve conflict if needed (abort/stall)
  • Commit
  • Update global memory
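A minimal sketch of this lifecycle from one thread's perspective; txBegin/txRead/txWrite/txCommit are illustrative names, not the paper's API.

  // hypothetical device-side TM intrinsics (not a real CUDA API):
  __device__ void txBegin();
  __device__ int  txRead(int *addr);
  __device__ void txWrite(int *addr, int value);
  __device__ bool txCommit();

  __device__ void txExample(int *A, int *B)
  {
      bool committed = false;
      while (!committed) {
          txBegin();               // Execution: start tracking accesses
          int a = txRead(A);       //   logs (address, value) in the Read-Set
          txWrite(B, a + 1);       //   buffers the new value in the Write-Set
          committed = txCommit();  // Validation: detect conflicting accesses;
      }                            // Commit: update global memory, or abort
  }                                //   and re-execute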

13
Transactional Memory on OpenCL
  • A natural extension to the OpenCL Programming Model
  • A program can launch many more threads than the hardware can execute concurrently
  • GPU TM? Threads currently running transactions do not need to wait for future, unscheduled threads
14
Are TM and GPUs Incompatible?
  • The problem with GPUs (from a TM perspective):
  • 1000s of concurrent threads
  • Inter-thread spatial locality common
  • No cache coherence
  • No private cache for each thread (buffering?)
  • Tx Abort → Control flow divergence

15
Hardware TM for GPUs Challenge: Conflict Detection

[Diagram: conventional HW TM tracks each transaction's accesses (TX1-TX4) with signatures over private data caches, relying on scalable cache coherence. No coherence on GPUs? Does each scalar thread need its own cache?]
16
Hardware TM for GPUs Challenge: Transaction Rollback

[Diagram: checkpointing every thread's state for rollback; 2MB total on-chip storage.]
17
Hardware TM for GPUs Challenge: Access Granularity and Write Buffer

[Diagram: a CPU core can buffer transactional writes in its L1 data cache; a GPU core (SM) shares one L1 data cache among all of its threads.]

Problem: 384 lines / 1536 threads < 1 line per thread!
18
Hardware TM on GPUs Challenge: SIMT Hardware
  • On GPUs, scalar threads in a warp/wavefront execute in lockstep

A Warp with 8 Scalar Threads:
  ...
  TxBegin
  LD  r2,B
  ADD r2,r2,2
  ST  r2,A
  TxCommit
  ...
Reconvergence?
19
Goal
  • We take it as a given that most programmers trying lock-based programming on a GPU will give up before they get their application working.
  • Hence, our goal was to find the most efficient approach to implementing TM on a GPU.

20
KILO TM
  • Supports 1000s of concurrent transactions
  • Transaction-aware SIMT stack
  • No cache coherence protocol dependency
  • Word-level conflict detection
  • Captures 59% of FG Lock Performance
  • 128X Faster than Serialized Tx Exec.

21
KILO TM Design Highlights
  • Value-Based Conflict Detection
  • Self-Validation + Abort: Simple Communication
  • No Cache Coherence Dependence
  • Speculative Validation
  • Increases Commit Parallelism

22
High-Level GPU Architecture: KILO TM Implementation Overview
23
KILO TM SIMT Core Changes
  • SW Register Checkpoint
  • Observation: most overwritten registers are not used again
  • Compiler analysis can identify what to checkpoint
  • Transaction Abort
  • Do-While Loop
  • Extend SIMT Stack with special entries to track aborted transactions in each warp (see the sketch below)

  TxBegin
  LD  r2,B
  ADD r2,r2,2
  ST  r2,A
  TxCommit
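A hedged sketch of the resulting retry structure (the pseudo-assembly is illustrative): only registers overwritten inside the transaction and still live afterwards need a SW checkpoint, and an abort restores them and re-enters the loop.

  MOV  r9, r2    // SW register checkpoint (compiler-selected)
  retry:
  TxBegin
  LD   r2,B
  ADD  r2,r2,2
  ST   r2,A
  TxCommit       // on abort: the SIMT stack's special entry re-activates
                 // the aborted threads, r2 is restored from r9, and
                 // execution branches back to retry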
24
Transaction-Aware SIMT Stack
25
KILO TM: Value-Based Conflict Detection

[Diagram: global memory holds A=1, B=0; TX1 runs atomic {B = A + 1} and TX2 runs atomic {A = B + 2}; each transaction keeps the values it reads and the values it intends to write (e.g. B=2, A=2) in private memory until commit.]

  TX1: TxBegin; LD r1,A; ADD r1,r1,1; ST r1,B; TxCommit
  TX2: TxBegin; LD r2,B; ADD r2,r2,2; ST r2,A; TxCommit

  • Self-Validation + Abort:
  • Only detects the existence of a conflict (not its identity)
  • => No Tx-to-Tx Messages: Simple Communication
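A software model of the self-validation step (the structures are illustrative): at commit, re-read every address in the read-log and compare against the value observed during execution; any mismatch means some conflicting commit happened, so the transaction aborts and re-executes.

  struct LogEntry { int *addr; int value; };

  __device__ bool selfValidate(const LogEntry *readLog, int n)
  {
      for (int i = 0; i < n; ++i)
          if (*readLog[i].addr != readLog[i].value)
              return false;  // a conflict exists (identity unknown): abort
      return true;           // read-set still matches global memory: commit
  }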
26
Parallel Validation?
Data Race!?!

[Diagram: if TX1 (atomic {B = A + 1}) and TX2 (atomic {A = B + 2}) validate in parallel against global memory (A=1, B=0), both can pass validation and commit; their concurrent validations and commits race on global memory.]
27
Serialize Validation?
[Diagram: TX1 and TX2 validate (V) and commit (C) one at a time; TX2 stalls until TX1 has finished.]

  • Benefit 1: No Data Race
  • Benefit 2: No Livelock (a generic lazy-TM problem)
  • Drawback: Serializes Non-Conflicting Transactions (collateral damage)

28
Identifying Non-conflicting Tx, Step 1: Leverage Parallelism

[Diagram: each global memory partition has its own commit unit; TX1 and TX2, whose accesses map to different partitions, validate and commit at different commit units in parallel.]
29
Solution: Speculative Validation
  • Key Idea: split validation into two parts
  • Part 1: Check recently committed transactions
  • Part 2: Check concurrently committing transactions

30
KILO TM: Speculative Validation
  • Memory subsystem is deeply pipelined and highly parallel

[Diagram: each global memory partition's commit unit processes transactions through pipelined stages: Validation Queue → Log Transfer / Spec. Validation → Hazard Detection → Validation Wait → Finalize Outcome → Commit; e.g. TX1 with read-log R(A) and write-log W(B), TX2 with R(C), W(D), and TX3 with R(D), W(E) flow through the unit concurrently.]
31
KILO TM: Speculative Validation

[Diagram, continued: TX3's read of D is a hazard against TX2's pending write of D, so TX3 STALLs in Hazard Detection / Validation Wait until TX2 finalizes its outcome and commits.]
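A software model of that hazard check between concurrently committing transactions (the structures are illustrative; the hardware uses a last-writer-history lookup rather than a loop): a younger transaction whose read-log overlaps an older transaction's pending write-log must wait and re-validate, as TX3 does above.

  // True if any read address is pending in an older transaction's
  // write-log, i.e. the younger transaction must wait for its outcome.
  __device__ bool hasHazard(int *const *reads, int nr,
                            int *const *pendingWrites, int nw)
  {
      for (int i = 0; i < nr; ++i)
          for (int j = 0; j < nw; ++j)
              if (reads[i] == pendingWrites[j])
                  return true;   // e.g. TX3 R(D) vs TX2 W(D): stall
      return false;
  }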
32
Log Storage
  • Transaction logs are stored in the private memory of each thread
  • Located in DRAM, cached in the L1 and L2 caches (see the layout sketch below)

[Diagram: each thread in a wavefront keeps its own logs, addressed by a Read-Log Ptr and a Write-Log Ptr.]
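An illustrative layout of the per-thread logs (capacities and field widths are assumptions, not from the slides):

  #define READ_LOG_MAX  64   // illustrative capacity
  #define WRITE_LOG_MAX 64

  struct LogEntry { int *addr; int value; };

  struct TxLogs {                        // lives in private (local) memory,
      LogEntry readLog[READ_LOG_MAX];    //   so it is DRAM-backed and
      LogEntry writeLog[WRITE_LOG_MAX];  //   cached in L1/L2
      int readPtr;                       // the Read-Log Ptr above
      int writePtr;                      // the Write-Log Ptr above
  };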
33
Log Transfer
  • Entries heading to the same memory partition can be grouped into a larger packet

[Diagram: the Read-Log Ptr and Write-Log Ptr walk the logs; entries destined for the same partition are coalesced before transfer to its commit unit.]
34
Distributed Commit / HW Org.
35
ABA Problem?
  • Classic Example: Linked-List-Based Stack
  • Thread 0: pop()

  while (true) {
    t = top;
    next = t->Next;
    // thread 2: pop A, pop B, push A
    if (atomicCAS(top, t, next) == t) break;  // succeeds!
  }
36
ABA Problem?
  • atomicCAS protects only a single word
  • Only part of the data structure
  • Value-based conflict detection protects all relevant parts of the data structure

  while (true) {
    t = top;
    next = t->Next;
    if (atomicCAS(top, t, next) == t) break;  // succeeds!
  }
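Concretely (a hedged illustration, not from the slides): a transactional pop logs both words it depends on, so the ABA interleaving is caught at validation even though top holds the same value again.

  // read-log of a transactional pop():
  //   { addr = &top,      value = A }  // still matches after pop A,
  //                                    //   pop B, push A
  //   { addr = &A->Next,  value = B }  // A->Next has changed: mismatch,
  //                                    //   so the transaction aborts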
37
Evaluation Methodology
  • GPGPU-Sim 3.0 (BSD license)
  • Detailed IPC correlation of 0.93 vs GT200
  • KILO TM (timing-driven memory accesses)
  • GPU TM Applications
  • Hash Table (HT-H, HT-L)
  • Bank Account (ATM)
  • Cloth Physics (CL)
  • Barnes Hut (BH)
  • CudaCuts (CC)
  • Data Mining (AP)

38
  • GPGPU-Sim 3.0.x running SASS (via decuda)

0.976 correlation on the subset of the CUDA SDK that decuda correctly disassembles. Note: the rest of the data uses PTX instead of SASS (0.93 correlation). (We believe GPGPU-Sim is a reasonable proxy.)
39
Performance (vs. Serializing Tx)
40
Absolute Performance (IPC)
[Chart: absolute performance (IPC) per benchmark]

  • TM on GPU performs well for applications with low contention.
  • Performs poorly with memory divergence, low parallelism, or a high conflict rate
  • (tackle through algorithm design/tuning?)
  • CPU vs GPU?
  • CC: FG-Lock version 400X faster than its CPU version
  • BH: FG-Lock version 2.5X faster than its CPU version

41
Performance (Exec. Time)
Captures 59% of FG Lock Performance; 128X Faster than Serialized Tx Exec.
42
KILO TM Scaling
43
Abort/Commit Ratio

Increasing the number of TXs => increased probability of conflict. Two possible solutions (future work): Solution 1: application performance tuning (easier with TM than with FG Lock). Solution 2: transaction scheduling.
44
Thread Cycle Breakdown
  • Status of a thread at each cycle
  • Categories:
  • TC: In a warp stalled by concurrency control
  • TO: In a warp committing its transactions
  • TW: Has passed commit, waiting for other threads in the warp to pass
  • TA: Executing an eventually aborted transaction
  • TU: Executing an eventually committed transaction (useful work)
  • AT: Acquiring a lock or doing an atomic operation
  • BA: Waiting at a barrier
  • NL: Doing non-transactional (normal) work

45
Thread Cycle Breakdown
[Chart: thread cycle breakdown for the KL, FGL, KL-UC, and IDEAL configurations across HT-H, HT-L, ATM, CL, BH, CC, and AP]
46
Core Cycle Breakdown
  • Action performed by a core at each cycle
  • Categories:
  • EXEC: Issuing a warp for execution
  • STALL: Stalled by a downstream warp
  • SCRB: All warps blocked by the scoreboard, due to data hazards, concurrency control, or pending commits (or any combination thereof)
  • IDLE: None of the warps are ready in the instruction buffer

47
Core Cycle Breakdown
[Chart: core cycle breakdown for the KL, FGL, KL-UC, and IDEAL configurations]
48
Read-Write Buffer Usage
49
In-Flight Buffers
50
Implementation Complexity
  • Logs in Private Memory @ L1 Data Cache
  • Commit Unit:
  • 5kB Last Writer History Unit
  • 19kB Transaction Status
  • 32kB Read-Set and Write-Set Buffer
  • CACTI 5.3 @ 40nm:
  • 0.40mm² x 6 Memory Partitions
  • = 0.5% of 520mm²

51
Summary
  • KILO TM
  • 1000s of Concurrent Transactions
  • Value-Based Conflict Detection
  • Speculative Validation for Commit Parallelism
  • 59% of Fine-Grained Locking Performance
  • 0.5% Area Overhead

52
Backup Slides
53
Logical Stage Organization
54
Execution Time Breakdown