1
ME964 High Performance Computing for Engineering
Applications
  • CUDA Memory Spaces and Access Overhead
  • Gauging Kernel Performance
  • Oct. 2, 2008

2
Before we get started
  • Last Time
  • Details on the CUDA memory spaces and access
    related overhead
  • Relevant for getting good performance out of your
    GPU application
  • Covered registers, constant memory, global memory
  • Today
  • Wrap up discussion on CUDA memory spaces
  • Discuss the shared memory
  • Gauging the extent to which you use HW resources
    in CUDA
  • HW5, matrix convolution, to be posted on the
    class website. It also requires some reading of
    the parallel programming patterns book.
  • NOTE: Next Tu., Michael Garland, Senior Researcher
    at NVIDIA, will be our guest lecturer.

3


[Figure: Examples of coalesced memory access patterns
(fetching floats from global memory leads to one memory
transaction). Threads 0 through 15 read consecutive
4-byte words at addresses 128 through 188, all falling
within a single 64B segment.]
4



[Figure: Examples of float memory accesses that are
non-coalesced in CUDA 1.1 and cost 16 memory
transactions (permuted or misaligned accesses by
threads 0 through 15, spanning 32B, 64B, and 128B
segments). NOTE: All of these are coalesced memory
transactions in CUDA 2.0 (released in summer 2008) and
result in one or two memory transactions.]
5
Coalesced Global Memory Access(Concluding
Remarks)
  • Happens when a half warp (16 threads) accesses a
    contiguous region of device memory
  • 16 data elements loaded in one instruction
  • int, float: 64 bytes (fastest)
  • int2, float2: 128 bytes
  • int4, float4: 256 bytes (2 transactions)
  • If un-coalesced, 16 sequential loads are issued
  • CUDA 2.0 became more lax with these requirements;
    it's simpler to get coalesced memory operations
  • NOTE: when you have 2D (Dx, Dy) and 3D (Dx, Dy,
    Dz) blocks, count on this indexing scheme of your
    threads when considering memory coalescence (see
    the sketch below)
  • 2D: the thread ID in the block for the thread of
    index (x,y) is x + Dx*y
  • 3D: the thread ID in the block for the thread of
    index (x,y,z) is x + Dx*(y + Dy*z)
  • To conclude, the x thread index runs the fastest,
    followed by y, and then by z.
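
A minimal sketch of the indexing above inside a kernel (the kernel
name and the one-float-per-thread fetch are illustrative assumptions,
not part of the slides):

    // For a 2D block (Dx, Dy), thread (x, y) has in-block ID x + Dx*y;
    // for 3D, thread (x, y, z) has ID x + Dx*(y + Dy*z). Making the
    // fastest-running index, threadIdx.x, address consecutive words keeps
    // the half-warp inside one segment, i.e., the access coalesces.
    __global__ void coalescedFetch2D(const float* d_in, float* d_out)
    {
        int tidInBlock  = threadIdx.x + blockDim.x * threadIdx.y;
        int blockOffset = (blockIdx.x + gridDim.x * blockIdx.y)
                          * (blockDim.x * blockDim.y);
        int gid = blockOffset + tidInBlock;
        d_out[gid] = d_in[gid];   // consecutive threads touch consecutive floats
    }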

6
Exercise: coalesced memory access
  • Suppose b is a pointer to an int array that lives
    in the global memory space
  • Suppose a is of type int and is a register
    variable
  • Consider the two lines below, each supposed to be
    part of a kernel launched with a 1D grid
  • a = b[threadIdx.x];
  • a = b[2*threadIdx.x];
  • Are these loads leading to coalesced or
    non-coalesced memory transactions? (see the sketch
    below)
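
A hedged sketch of the two exercise lines in context (kernel and
array names are assumptions): the first load lets a half-warp read 16
consecutive ints out of one 64B segment, while the second strides by
two ints and spreads the half-warp over 128 bytes.

    __global__ void exerciseLoads(const int* b, int* out)
    {
        int a;
        a = b[threadIdx.x];       // half-warp reads 16 consecutive ints in
        out[threadIdx.x] = a;     // one 64B segment: coalesced, 1 transaction

        a = b[2 * threadIdx.x];   // stride-2 read spanning 128 bytes: breaks
        out[threadIdx.x] += a;    // the CUDA 1.1 coalescing rules (16 loads)
    }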

7
Standard Trick: Load/Store (Memory Read/Write)
Clustering/Batching
  • Use LD to hide LD latency (non-dependent LD ops
    only)
  • Use same thread to help hide own latency
  • Instead of
  • LD 0 (long latency)
  • Dependent MATH 0
  • LD 1 (long latency)
  • Dependent MATH 1
  • Do
  • LD 0 (long latency)
  • LD 1 (long latency - hidden)
  • MATH 0
  • MATH 1
  • Compiler typically handles this on your behalf
  • But you must have enough non-dependent LDs and
    math
  • This is where loop unrolling comes into play and
    can have a significant impact (see the sketch
    below)
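
A sketch of the idea applied by hand through unrolling (the array
names d_a, d_b, d_c are assumptions): the independent loads are
issued back to back before either result is consumed, so the latency
of the later loads overlaps with that of the first.

    __global__ void batchedLoads(const float* d_a, const float* d_b,
                                 float* d_c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        float a0 = d_a[2 * i];        // LD 0 (long latency)
        float a1 = d_a[2 * i + 1];    // LD 1 issued immediately, latency hidden
        float b0 = d_b[2 * i];
        float b1 = d_b[2 * i + 1];

        d_c[2 * i]     = a0 * b0;     // MATH 0: operands already in flight
        d_c[2 * i + 1] = a1 * b1;     // MATH 1
    }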

8
Shared Memory
  • Each SM has 16 KB of Shared Memory
  • Physically organized as 16 banks of 4 byte words
  • Note that shared memory can store less data than
    the registers (16 vs. 32 KB)
  • The 16 banks of the Shared Memory are organized
    like benches in a movie theater
  • You have 256 rows of benches; each row has 16
    benches, and each bench seats a family of four
    (bytes). Note that a bank represents a column of
    benches in the movie theater
  • CUDA uses Shared Memory as shared storage visible
    to all threads in a thread block
  • All threads in the block have read/write access

[Figure: SM block diagram showing the multithreaded
instruction buffer, register file, constant cache (C$
L1), Shared Mem, instruction cache (I$ L1), operand
select logic, and the MAD and SFU execution units.]
9
Q: Is 16 KB of Shared Memory Enough? Revisit the
Matrix Multiplication Example
  • One block computes one square sub-matrix Csub of
    size Block_Size
  • One thread computes one element of Csub
  • Assume that the dimensions of A and B are
    multiples of Block_Size and square shape
  • Doesn't have to be like this, but it keeps the
    example simpler and focused on the concepts of
    interest

[Figure: tiled matrix multiplication. A block of
Block_Size x Block_Size threads, indexed by (tx, ty),
computes the sub-matrix Csub of C; A is hA x wA, B is
wA x wB.]
10
Matrix Multiplication Shared Memory Usage
  • Each block requires 2 x WIDTH^2 x 4 bytes of
    shared memory storage (see the sketch below)
  • For WIDTH = 16, each block requires 2 KB, so up to
    8 blocks can fit into the Shared Memory of an SM
  • Since each SM can only take 768 threads, each SM
    can only take 3 blocks of 256 threads each
  • Shared memory size is not a limitation for our
    implementation of the Matrix Multiplication
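
A sketch of a standard tiled kernel consistent with the numbers above
(Block_Size = 16; kernel and tile names are assumptions, not
necessarily those of the course code): each block stages one 16 x 16
tile of A and one of B in shared memory, i.e., 2 x 16 x 16 x 4 = 2 KB.

    #define BLOCK_SIZE 16

    __global__ void matMulTiled(const float* A, const float* B, float* C,
                                int wA, int wB)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];   // 1 KB per block
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];   // 1 KB per block

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        // March over the tiles of A's row band and B's column band.
        for (int t = 0; t < wA / BLOCK_SIZE; ++t) {
            As[threadIdx.y][threadIdx.x] =
                A[row * wA + t * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] =
                B[(t * BLOCK_SIZE + threadIdx.y) * wB + col];
            __syncthreads();
            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * wB + col] = sum;
    }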

11
Shared Memory Architecture
  • Common sense observation: in a parallel machine,
    many threads access memory at the same time
  • To service more than one thread, memory is
    divided into banks
  • Essential to achieve high bandwidth
  • Each bank can service one address per cycle
  • A memory can service as many simultaneous
    accesses as it has banks
  • Multiple simultaneous accesses to a bank result
    in a bank conflict
  • Conflicting accesses are serialized

12
Bank Addressing Examples
  • No Bank Conflicts
  • Linear addressing, stride = 1
  • No Bank Conflicts
  • Random 1:1 permutation

13
Bank Addressing Examples
  • 2-way Bank Conflicts
  • Linear addressing, stride = 2
  • 8-way Bank Conflicts
  • Linear addressing, stride = 8

14
Shared Memory Bank Conflicts
  • Shared memory is as fast as registers if there
    are no bank conflicts
  • The fast case
  • If all threads of a half-warp access different
    banks, there is no bank conflict
  • If all threads of a half-warp access an
    identical address for a fetch operation, there is
    no bank conflict (broadcast)
  • The slow case
  • Bank Conflict: multiple threads in the same
    half-warp access the same bank
  • Must serialize the accesses
  • Cost = max number of simultaneous accesses to a
    single bank

15
How addresses map to banks on G80
  • Each bank has a bandwidth of 32 bits per clock
    cycle
  • Successive 32-bit words are assigned to
    successive banks
  • G80 has 16 banks
  • Bank you work with = (address in 32-bit words) %
    16 (see the sketch below)
  • Same as the number of threads in a half-warp
  • NOTE: There is no such thing as bank conflicts
    between threads belonging to different half-warps;
    this issue is only relevant for threads from
    within a single half-warp
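
A small sketch of that mapping (the helper name is made up, not a
CUDA API): for 4-byte words the bank is the word index modulo 16.

    // Bank holding the 4-byte shared memory word at byte offset
    // byteAddr on G80 (16 banks, successive words in successive banks).
    __host__ __device__ inline int bankOfWord(int byteAddr)
    {
        return (byteAddr / 4) % 16;
    }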

16
Linear Addressing
  • Given
  • __shared__ float sharedM[256];
  • float foo = sharedM[baseIndex + s * threadIdx.x];
  • This is bank-conflict-free if s shares no common
    factors with the number of banks
  • 16 on G80, so s must be odd

[Figure: bank access patterns for strides s = 1 and
s = 3, both conflict-free.]
17
The Math Behind Bank Conflicts
  • We are in a half-warp, and the question is whether
    thread t1 and thread t2 > t1 might access the same
    bank of shared memory
  • Let b be the base of the array (the sharedM
    pointer on the previous slide)
  • The words at b + s*t1 and b + s*t2 fall in the
    same bank if and only if s*t2 = s*t1 + 16*k for
    some positive integer k, i.e., s*(t2 - t1) is a
    multiple of 16
  • How should you not choose s?
  • If s = 2, take k = 1; then any threads t1 and t2
    which are eight apart satisfy the condition above
    and will have a bank conflict ((0,8), (1,9),
    etc.): a two-way conflict
  • If s = 4, take k = 1; then any threads t1 and t2
    which are four apart will have a bank conflict
    ((0,4,8,12), (1,5,9,13), etc.): a four-way
    conflict
  • NOTE: you can't get a bank conflict if s is odd
    (no quartet k, s, t1, t2 satisfies the bank
    conflict condition above). So take stride
    s = 1, 3, 5, etc.
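
A small host-side sketch (the helper name is an assumption) that
brute-forces the condition above: for a given stride s it counts how
many threads of a half-warp land in each of the 16 banks; the maximum
count is the serialization factor (1 for odd s, 2 for s = 2, 4 for
s = 4).

    #include <stdio.h>

    // Worst-case number of half-warp threads hitting one bank when
    // thread t accesses word b + s*t (the base b only shifts every
    // access by the same constant, so it is omitted here).
    int conflictDegree(int s)
    {
        int count[16] = {0};
        for (int t = 0; t < 16; ++t)
            ++count[(s * t) % 16];
        int worst = 0;
        for (int bank = 0; bank < 16; ++bank)
            if (count[bank] > worst) worst = count[bank];
        return worst;                 // 1 means conflict-free
    }

    int main(void)
    {
        for (int s = 1; s <= 8; ++s)
            printf("stride %d -> %d-way\n", s, conflictDegree(s));
        return 0;
    }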

18
Data types and bank conflicts
  • This has no conflicts if the type of shared is
    32 bits wide
  • foo = shared[baseIndex + threadIdx.x];
  • But not if the data type is smaller
  • 4-way bank conflicts
  • extern __shared__ char shared[];
  • foo = shared[baseIndex + threadIdx.x];
  • 2-way bank conflicts
  • extern __shared__ short shared[];
  • foo = shared[baseIndex + threadIdx.x];

19
Structs and Bank Conflicts
  • Struct assignments compile into as many memory
    accesses as there are struct members
  • struct vector { float x, y, z; };
  • struct myType { float f; int c; };
  • __shared__ struct vector vectors[64];
  • __shared__ struct myType myTypes[64];
  • This has no bank conflicts for vector: the struct
    size is 3 words
  • 3 accesses per thread, contiguous banks (no
    common factor with 16)
  • struct vector v = vectors[baseIndex + threadIdx.x];
  • This has 2-way bank conflicts for myType (2
    accesses per thread)
  • struct myType m = myTypes[baseIndex + threadIdx.x];

[Figure: threads 0 through 15 of a half-warp mapped to
banks 0 through 15.]
20
Common Array Bank Conflict Patterns (1D)
  • Each thread loads 2 elements into shared memory
  • 2-way-interleaved loads result in 2-way bank
    conflicts
  • int tid = threadIdx.x;
  • shared[2*tid] = global[2*tid];
  • shared[2*tid+1] = global[2*tid+1];
  • This makes sense for traditional CPU threads:
    locality in cache line usage and reduced sharing
    traffic
  • Not so for shared memory usage, where there are
    no cache line effects but there are banking
    effects

21
A Better Array Access Pattern
  • Each thread loads one element in every
    consecutive group of blockDim.x elements
  • shared[tid] = global[tid];
  • shared[tid + blockDim.x] = global[tid + blockDim.x];

22
Vector Reduction with Bank Conflicts (assume 1024
vector entries)
[Figure: interleaved-addressing reduction over the
array elements (floats); at steps 1, 2, 3 each active
thread combines a pair of elements whose separation
doubles every step.]
23
No Bank Conflicts
[Figure: reduction ordered so that the elements
combined by the active threads at steps 1, 2, 3 fall in
distinct banks, avoiding bank conflicts.]
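
A hedged sketch of the two schemes (kernel and array names are
assumptions; the conflict-free version uses sequential addressing,
one common arrangement that realizes the idea of the figure above):
a 1024-entry vector reduced by a single block of 512 threads.

    // (a) Interleaved addressing, as in the "with bank conflicts" slide:
    // at the first step thread t adds s[2t+1] into s[2t], so a half-warp
    // touches only 8 distinct banks (2-way conflicts, worse later on).
    __global__ void reduceInterleaved(const float* d_in, float* d_out)
    {
        __shared__ float s[1024];
        int t = threadIdx.x;
        s[2 * t]     = d_in[2 * t];      // each thread loads 2 elements
        s[2 * t + 1] = d_in[2 * t + 1];
        __syncthreads();

        for (int stride = 1; stride < 1024; stride *= 2) {
            int i = 2 * stride * t;
            if (i + stride < 1024)
                s[i] += s[i + stride];
            __syncthreads();
        }
        if (t == 0) *d_out = s[0];
    }

    // (b) Sequential addressing: active threads read and write
    // consecutive words, so a half-warp always covers 16 distinct banks.
    __global__ void reduceSequential(const float* d_in, float* d_out)
    {
        __shared__ float s[1024];
        int t = threadIdx.x;
        s[2 * t]     = d_in[2 * t];
        s[2 * t + 1] = d_in[2 * t + 1];
        __syncthreads();

        for (int stride = 512; stride > 0; stride >>= 1) {
            if (t < stride)
                s[t] += s[t + stride];
            __syncthreads();
        }
        if (t == 0) *d_out = s[0];
    }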
24
Common Bank Conflict Patterns (2D)
[Figure: bank indices of a 16x16 array of floats in
shared memory, without padding: all 16 elements of a
column fall in the same bank.]
  • Operating on a 2D array of floats in shared
    memory, e.g., image processing
  • Example: a 16x16 block
  • Threads in a block access the elements in each
    column simultaneously (example: bank 1 in purple)
  • 16-way bank conflicts
  • Solution 1) pad the rows
  • Add one float to the end of each row (see the
    sketch after this slide)
  • Solution 2) transpose before processing
  • Suffer bank conflicts during the transpose
  • But possibly save them later

[Figure: bank indices of the same array with one float
of padding per row: the 16 elements of a column now
fall in 16 different banks.]
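
A minimal sketch of Solution 1 (kernel and tile names are
assumptions): the padded 16x16 tile is loaded row-wise and then read
column-wise, e.g., to write out a transpose; the extra float per row
rotates the bank of each column element.

    #define TILE 16

    __global__ void transposeTilePadded(const float* d_in, float* d_out)
    {
        // Element [r][c] occupies shared word r*(TILE+1) + c, i.e. bank
        // (r + c) % 16, so the 16 elements of a column lie in 16 banks.
        __shared__ float tile[TILE][TILE + 1];

        int c = threadIdx.x;                  // fastest-varying index
        int r = threadIdx.y;

        tile[r][c] = d_in[r * TILE + c];      // coalesced row-wise load
        __syncthreads();

        d_out[r * TILE + c] = tile[c][r];     // column-wise read: 16-way
                                              // conflict without the pad,
                                              // conflict-free with it
    }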