Title: ME964 High Performance Computing for Engineering Applications
1. ME964 High Performance Computing for Engineering Applications
- Gauging Kernel Performance; Control Flow in CUDA
- Oct. 9, 2008
2. Before we get started
- Last Time
- Guest lecturer: Michael Garland, researcher at NVIDIA - "Writing Efficient CUDA Algorithms"
- Today
- Gauging the extent to which you use hardware resources in CUDA
- Control flow in CUDA
- Homework related
- HW6 available for download (exclusive scan operation)
- HW5, 2D matrix convolution, due at 11:59 PM today
3. Exercise: Does Matrix Multiplication Incur Shared Memory Bank Conflicts?
Scenario A. The tile matrix is computed as follows: one half-warp computes one row of the tile at a time.
(figure: the 16x16 tile, entries indexed 0 through 15 along each row)
In scenario A, all threads in a half-warp access the same shared memory entry, leading to a broadcast. Below, what is highlighted is the second step of computing the 5th row of the tile.
(figure: the tile, with the entries accessed in that second step highlighted)
In scenario A, all threads in a half-warp also access elements in neighboring banks as they walk through the computation.
(figure: the tile, with the accessed entries spread across the banks)
Scenario B. The tile matrix is computed as follows: one half-warp computes one column of the tile at a time (do it yourself).
(figure: the 16x16 tile, entries indexed 0 through 15 along each row)
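As a point of reference for working through the two scenarios, here is a minimal sketch of the tile inner product; the function name and TILE_SIZE are illustrative, and this is not the full matrix multiplication kernel.

#define TILE_SIZE 16

// As and Bs are the shared-memory tiles already loaded by the block;
// (ty, tx) locates the thread within the 16x16 tile.
__device__ float tileDotRow(const float As[TILE_SIZE][TILE_SIZE],
                            const float Bs[TILE_SIZE][TILE_SIZE],
                            int ty, int tx)
{
    float Csub = 0.0f;
    for (int k = 0; k < TILE_SIZE; ++k) {
        // In scenario A, the 16 threads of a half-warp share ty, so As[ty][k]
        // is a single word (broadcast), while Bs[k][tx] spans 16 consecutive
        // words, i.e., 16 different banks.
        Csub += As[ty][k] * Bs[k][tx];
    }
    return Csub;
}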
4. Final Comments, Memory Access
- Given the GPU memory spaces and their latency, a typical programming pattern emerges at the thread level (a minimal sketch follows below):
- Load data from device memory into shared memory (coalesced if possible)
- Synchronize with all the other threads of the block to avoid data access hazards
- Process the data you just brought over in shared memory
- Synchronize as needed
- Write the results back to global memory (coalesced if possible)
- NOTE: for CUDA computing, always try hard to make your computation fit this model
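Below is a minimal sketch of this pattern; the kernel name, block size, and the processing step are made up for the illustration, not code from the lecture.

#define BLOCK_SIZE 256

__global__ void processArray(const float* in, float* out, int n)
{
    __shared__ float buf[BLOCK_SIZE];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Load from device (global) memory into shared memory, coalesced.
    buf[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;

    // 2. Synchronize so every thread of the block sees the loaded data.
    __syncthreads();

    // 3. Process the data while it sits in shared memory (placeholder step).
    float result = 2.0f * buf[threadIdx.x];

    // 4. Synchronize again if threads read each other's shared-memory
    //    entries during processing (not strictly needed for this step).
    __syncthreads();

    // 5. Write the results back to global memory, again coalesced.
    if (gid < n)
        out[gid] = result;
}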
5. CUDA Programming: Common Sense Advice
- Keep this in mind:
- Allocating memory on the device or host is expensive
- Moving data back and forth between the host and device is a killer
- Global memory accesses are going to be slow
- If they are not coalesced, they are even slower
- Make sure that you keep the SM
- Occupied (currently, 24 warps can be managed concurrently)
- Busy (avoid data starvation, have it crunch numbers)
- If you can, avoid bank conflicts. Not that big of a deal though.
6. Gauging the level of HW use
- In order to gauge how well your code uses the HW, you need to use the CUDA occupancy calculator (Google it):
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
7. Gauging the level of HW use (cont.)
- Three things are asked of you:
- Number of threads per block (this is trivial to provide)
- Number of registers per thread
- Number of bytes of shared memory used by each block
- The last two quantities you get by adding --ptxas-options=-v to the compile command line:
$(CUDA_BIN_PATH)\nvcc.exe -cuda --ptxas-options=-v -I"$(CUDA_INC_PATH)" -I./ -I../../common/inc -I"$(VCInstallDir)\include" -I"$(VCInstallDir)\PlatformSDK\include" -o $(ConfigurationName)\matrixmul.gen.c matrixmul.cu
- In Visual Studio, right-click the main .cu file, go to Properties, and edit the Custom Build Step by adding --ptxas-options=-v
8. Gauging the level of HW use (cont.)
- Open the generated object file in a text editor to find, in a pile of stuff that doesn't make any sense, a chunk of text that looks like this:

code {
  name = _Z15MatrixMulKernel6MatrixS_S_
  lmem = 0
  smem = 2112
  reg = 14
  bar = 0
  bincode {
    0x3004c815 0xe43007c0 0x10008025 0x00000003
    0xd0800205 0x00400780 0xa000000d 0x04000780
    0xa0000205 0x04000780 0x10056003 0x00000100
    0x30040601 0xc4100780 0x20000001 0x04004780
  }
}

- This is telling you that MatrixMulKernel (which is the name I gave my kernel) uses 2112 bytes of shared memory and 14 registers per thread, and that there is no use of local memory (lmem = 0)
9. Alternatively, in Developer Studio
This is what you are interested in: smem = 2672 bytes, registers = 9.
10. End Discussion on Memory Spaces (Access and Latency Issues); Begin Control Flow
11. Objective
- Understand the implications of control flow on
- Branch divergence overhead
- SM execution resource utilization
- Learn better ways to write code with control flow
- Understand compiler/HW predication
- An idea meant to reduce the impact of control flow
- There is a cost involved with this process
HK-UIUC
12. Quick terminology review
- Thread: concurrent code, and its associated state, executed on the CUDA device (in parallel with other threads)
- The unit of parallelism in CUDA
- Number of threads used is controlled by the user
- Warp: a group of threads executed physically in parallel on G80
- Number of threads in a warp is not controlled by the user
- Block: a group of threads that are executed together and form the unit of resource assignment
- Number of blocks used is controlled by the user
- Grid: a group of thread blocks that must all complete before the next phase of the program can begin
HK-UIUC
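As a quick refresher on how these terms show up in code, here is a minimal, made-up launch (kernel name and sizes are illustrative only): the user picks the block and grid dimensions, while warps are formed by the hardware.

__global__ void myKernel(float* data)
{
    // ... one thread's share of the work ...
}

int main()
{
    float* dData = 0;
    cudaMalloc((void**)&dData, 1024 * sizeof(float));

    dim3 dimBlock(256, 1, 1);                // threads per block (user choice)
    dim3 dimGrid(4, 1, 1);                   // blocks per grid (user choice)
    myKernel<<<dimGrid, dimBlock>>>(dData);  // warps: carved out by the HW

    cudaThreadSynchronize();                 // wait for the whole grid
    cudaFree(dData);
    return 0;
}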
13. How thread blocks are partitioned
- Each thread block is partitioned into warps
- Thread IDs within a warp are consecutive and increasing
- Remember: in multidimensional blocks, the x thread index runs first, followed by the y thread index, and finally by the z thread index
- Warp 0 starts with thread ID 0
- Partitioning is always the same
- Thus you can use this knowledge in control flow
- However, the exact size of warps may change from release to release
- While you can rely on ordering among threads, DO NOT rely on any ordering among warps
- Remember, the concept of warp is not something you control through CUDA
- If there are any dependencies between threads, you must use __syncthreads() to get correct results
HK-UIUC
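The partitioning rule can be written down explicitly; the helper names below are made up, and the warp size of 32 is a G80-specific assumption rather than a CUDA guarantee.

// Linear thread ID: x runs first, then y, then z.
__device__ unsigned int linearThreadId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// Consecutive linear IDs land in the same warp; warp 0 starts at ID 0.
__device__ unsigned int warpIdInBlock()
{
    return linearThreadId() / 32;   // 32 = warp size on G80
}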
14. Control Flow Instructions
- The main performance concern with branching is divergence
- Threads within a single warp take different paths
- Different execution paths are serialized in G80
- The control paths taken by the threads in a warp are traversed one at a time until there are no more
- NOTE: don't forget that divergence can manifest only at the warp level. You cannot discuss this concept in relation to code executed by threads in different warps
HK-UIUC
15. Control Flow Instructions (cont.)
- A common case: avoid divergence when the branch condition is a function of the thread ID
- Example with divergence:
- if (threadIdx.x > 2)
- This creates two different control paths for threads in a block
- Branch granularity < warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
- Example without divergence:
- if (threadIdx.x / WARP_SIZE > 2)
- Also creates two different control paths for threads in a block
- Branch granularity is a whole multiple of the warp size: all threads in any given warp follow the same path
HK-UIUC
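The two cases above, written out as compilable code; the kernel and the commented-out work functions are placeholders for illustration.

__global__ void branchExamples()
{
    // Divergent: within the first warp, threads 0..2 take one path and
    // threads 3..31 take the other, so the warp executes both paths serially.
    if (threadIdx.x > 2) {
        // doSomething();
    } else {
        // doSomethingElse();
    }

    // Non-divergent: the condition is constant across each warp of 32
    // threads, so every warp follows a single path (warps 0-2 take the
    // else branch; warps 3 and up take the if branch).
    const unsigned int WARP_SIZE = 32;
    if (threadIdx.x / WARP_SIZE > 2) {
        // doSomething();
    } else {
        // doSomethingElse();
    }
}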
16. Illustration: Parallel Reduction
- Use the "Parallel Reduction" algorithm as a vehicle to discuss the issue of control flow
- Given an array of values, "reduce" them in parallel to a single value
- Examples
- Sum reduction: sum of all values in the array
- Max reduction: maximum of all values in the array
- Typical parallel implementation:
- Recursively halve the number of threads, add two values per thread
- Takes log(n) steps for n elements, requires n/2 threads
HK-UIUC
17. A Vector Reduction Example
- Assume an in-place reduction using shared memory
- We are in the process of summing up a 512-element array
- The shared memory is used to hold a partial sum vector
- Each iteration brings the partial sum vector closer to the final sum
- The final sum will be stored in element 0
HK-UIUC
18. A simple implementation
- Assume we have already loaded the array into
- __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}
HK-UIUC
19. The Bank Conflicts Aspect
(figure: reduction tree over array elements 0..11; iteration 1 forms the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0..3, 4..7, 8..11; iteration 3 forms 0..7, 8..15)
HK-UIUC
20. The Branch Divergence Aspect
(figure: the same reduction tree over array elements 0..11, now with the active threads labeled; in the first iteration only the even-numbered threads, Thread 0, 2, 4, 6, 8, 10, perform an addition)
HK-UIUC
21. Some Observations
- In each iteration, two control flow paths will be sequentially traversed for each warp
- Threads that perform the addition and threads that do not
- Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence
HK-UIUC
22. Some Observations (cont.)
- No more than half of the threads will be executing at any time
- All odd-index threads are disabled right from the beginning!
- On average, less than 1/4 of the threads will be activated for all warps over time
- After the 5th iteration, entire warps in each block will be disabled: poor resource utilization, but no divergence
- This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated until all warps retire
HK-UIUC
23. Shortcomings of the implementation
- Assume we have already loaded the array into
- __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2 * stride) == 0)                    // BAD: divergence due to interleaved branch decisions
        partialSum[t] += partialSum[t + stride];  // BAD: bank conflicts due to stride
}
HK-UIUC
24. A better implementation
- Assume we have already loaded the array into
- __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
HK-UIUC
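For completeness, here is one way to wrap the loop above into a full kernel; the kernel name, the 512-thread block size, and the per-block output array are assumptions made for this sketch, not part of the original slide.

#define BLOCK_SIZE 512

// Assumes the input length is a multiple of BLOCK_SIZE; each block writes
// its partial result to blockSums[blockIdx.x].
__global__ void reduceSum(const float* in, float* blockSums)
{
    __shared__ float partialSum[BLOCK_SIZE];

    unsigned int t   = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;

    partialSum[t] = in[gid];     // load phase (coalesced)

    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    if (t == 0)                  // the final sum ends up in element 0
        blockSums[blockIdx.x] = partialSum[0];
}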
25. No Divergence Until < 16 Sub-sums
(figure: reduction tree for the improved kernel; in the first iteration the partial sums 0+16 through 15+31 are formed, and each subsequent iteration halves the active range)
HK-UIUC
26. Some Observations About the New Implementation
- Only the last 5 iterations will have divergence
- Entire warps will be shut down as the iterations progress
- For a 512-thread block, it takes 4 iterations to shut down all but one warp in the block
- Better resource utilization; the block will likely retire warps and thus execute faster
- Recall, no bank conflicts either
HK-UIUC
27. A Potential Further Refinement, but a Bad Idea
- For the last 6 loops only one warp is active (i.e., tids 0..31)
- Shared reads and writes are SIMD synchronous within a warp
- So skip __syncthreads() and unroll the last 6 iterations

unsigned int tid = threadIdx.x;
for (unsigned int d = n >> 1; d > 32; d >>= 1)
{
    __syncthreads();
    if (tid < d)
        shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid < 32)  // unroll last 6 predicated steps
{
    shared[tid] += shared[tid + 32];
    shared[tid] += shared[tid + 16];
    shared[tid] += shared[tid + 8];
    shared[tid] += shared[tid + 4];
    shared[tid] += shared[tid + 2];
    shared[tid] += shared[tid + 1];
}
HK-UIUC
28. A Potential Further Refinement, but a Bad Idea (cont.)
- Concluding remarks on the further refinement:
- This would not work properly if the warp size decreases
- It also doesn't look that attractive if the warp size increases
- Finally, you would need __syncthreads() between each statement!
- Having __syncthreads() in an if-statement is problematic
HK-UIUC
29. Control Flow Instructions
- if, switch, for, and while can significantly impact the effective instruction throughput when threads of the same warp diverge
- If this happens, the execution is serialized
- This increases the number of instructions executed for this warp
- When all the different execution paths have completed, the threads converge back to the same execution path
- Not only do you execute more instructions, you also need the logic associated with this process (bookkeeping)
30. Predicated Execution Concept
- Thread divergence can be avoided in some cases by using the concept of predication
- <p1> LDR r1, r2, 0
- If p1 is TRUE, the assembly instruction above executes normally
- If p1 is FALSE, the instruction is treated as a NOP
HK-UIUC
31. Predication Example

if (x == 10)
    c = c + 1;

     LDR r5, X
     p1 <- r5 eq 10
<p1> LDR r1 <- C
<p1> ADD r1, r1, 1
<p1> STR r1 -> C
HK-UIUC
32. Predication: very helpful for if-else
(figure: control flow graph of an if-else, with block A branching to B or C and both rejoining at D, shown next to the straight-line predicated sequence A, B, C, D)
HK-UIUC
33. If-else example

     p1, p2 <- r5 eq 10
<p1> inst 1 from B
<p1> inst 2 from B
<p1> ...
<p2> inst 1 from C
<p2> inst 2 from C

     p1, p2 <- r5 eq 10
<p1> inst 1 from B
<p2> inst 1 from C
<p1> inst 2 from B
<p2> inst 2 from C
<p1> ...

This is what gets scheduled (the second, interleaved version).
The cost is that extra instructions will be issued each time the code is executed. However, there is no branch divergence.
HK-UIUC
34. Instruction Predication in G80
- Your comparison instructions set condition codes (CC)
- Instructions can be predicated to write results only when the CC meets the criterion (CC != 0, CC > 0, etc.)
- The compiler tries to predict if a branch condition is likely to produce many divergent warps
- If that is the case, it goes ahead and predicates if the branch has < 7 instructions
- If that is not the case, it only predicates if the branch has < 4 instructions
- Note: it's pretty bad if you predicate when it was obvious that there would have been no divergence
HK-UIUC
35. Instruction Predication in G80 (cont.)
- ALL predicated instructions take execution cycles
- Those with false conditions don't write their output and do not evaluate addresses or read operands
- This saves branch instructions, so it can be cheaper than serializing divergent paths
- If all this business is confusing, remember this:
- Avoid thread divergence
- It's not 100% clear to me, but I believe that there is no cost if a subset of the threads belonging to a warp sit there and do nothing while the other threads of the warp are all running the same instruction
HK-UIUC