ME964 High Performance Computing for Engineering Applications

Transcript and Presenter's Notes

1
ME964 High Performance Computing for Engineering
Applications
  • Gauging Kernel Performance; Control Flow in CUDA
  • Oct. 9, 2008

2
Before we get started
  • Last Time
  • Guest Lecturer: Michael Garland, Researcher at
    NVIDIA
  • Writing Efficient CUDA Algorithms
  • Today
  • Gauging the extent to which you use hardware
    resources in CUDA
  • Control Flow in CUDA
  • Homework related
  • HW6 available for download (exclusive scan
    operation)
  • HW5, 2D matrix convolution, due at 11:59 PM today

2
3
Exercise: Does Matrix Multiplication Incur Shared
Memory Bank Conflicts?
Scenario A. The tile matrix is computed as
follows: one half-warp computes one row of the
tile at a time.
[Figure: 16x16 tile; each row of the tile is drawn with the shared
memory bank indices 0 through 15 of its elements]
In scenario A, all threads in a half-warp access
the same shared memory entry, leading to a
broadcast. What is highlighted below is the
second step of computing the 5th row of the tile.
In scenario A, all threads in a half-warp access
elements in neighboring banks as they walk
through the computation.
Scenario B. The tile matrix is computed as
follows: one half-warp computes one column of the
tile at a time (do it yourself).
3
4
Final Comments, Memory Access
  • Given the GPU memory spaces and their latency, a
    typical programming pattern emerges at the thread
    level
  • Load data from device memory into shared memory
    (coalesced if possible)
  • Synchronize with all the other threads of the
    block to avoid data access hazards
  • Process the data that you just brought over in
    shared memory
  • Synchronize as needed
  • Write the results back to global memory
    (coalesced if possible)
  • NOTE: for CUDA computing, always try hard to make
    your computation fit this model (a minimal kernel
    sketch of the pattern follows below)
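
A minimal sketch of the load / sync / process / sync / store pattern
(the kernel name, TILE, and the trivial "processing" step are
placeholders chosen for illustration, not from the lecture):

#define TILE 256

__global__ void patternSketch(const float* d_in, float* d_out, int n)
{
    __shared__ float s_buf[TILE];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Load from global (device) memory into shared memory, coalesced
    if (gid < n)
        s_buf[threadIdx.x] = d_in[gid];

    // 2. Synchronize with all the other threads of the block
    __syncthreads();

    // 3. Process the data now sitting in shared memory (placeholder step)
    float result = 0.0f;
    if (gid < n)
        result = 2.0f * s_buf[threadIdx.x];

    // 4. Synchronize again if the processing has cross-thread dependencies
    __syncthreads();

    // 5. Write the results back to global memory, coalesced
    if (gid < n)
        d_out[gid] = result;
}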

4
5
CUDA Programming Common Sense Advice
  • Keep this in mind
  • Allocating memory on device or host is expensive
  • Moving data back and forth between the host and
    device is a killer
  • Global memory accesses are going to be slow
  • If they are not coalesced they are even slower
  • Make sure that you keep the SM
  • Occupied (currently, 24 warps can be managed
    concurrently)
  • Busy (avoid data starvation, have it crunch
    numbers)
  • If you can, avoid bank conflicts. Not that big
    of a deal, though.

5
6
Gauging the level of HW use
  • In order to gauge how well your code uses the HW,
    you need to use the CUDA occupancy calculator
    (google it)

http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
6
7
Gauging the level of HW use (cont.)
  • Three things are asked of you
  • Number of threads per block (this is trivial to
    provide)
  • Number of registers per thread
  • Number of bytes of shared memory used by each
    block
  • You get the last two quantities by adding the
    --ptxas-options=-v flag to the compile command
    line

$(CUDA_BIN_PATH)\nvcc.exe -cuda --ptxas-options=-v -I"$(CUDA_INC_PATH)"
-I./ -I../../common/inc -I"$(VCInstallDir)\include"
-I"$(VCInstallDir)\PlatformSDK\include"
-o $(ConfigurationName)\matrixmul.gen.c matrixmul.cu
  • In Visual Studio, right-click the main .cu file,
    go to Properties, and edit the Custom Build Step
    by adding --ptxas-options=-v (an equivalent
    stand-alone command is sketched below)
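
Outside of Visual Studio, a bare-bones invocation that produces the same
per-kernel resource report (using the same matrixmul.cu) would be:

nvcc --ptxas-options=-v -c matrixmul.cu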

7
8
Gauging the level of HW use (cont.)
  • Open the object file in a text editor to find, in
    a pile of stuff that doesn't make any sense, a
    chunk of text that looks like this

code {
    name = _Z15MatrixMulKernel6MatrixS_S_
    lmem = 0
    smem = 2112
    reg = 14
    bar = 0
    bincode { 0x3004c815 0xe43007c0 0x10008025 0x00000003
              0xd0800205 0x00400780 0xa000000d 0x04000780
              0xa0000205 0x04000780 0x10056003 0x00000100
              0x30040601 0xc4100780 0x20000001 0x04004780 }
}
  • This is telling you that MatrixMulKernel (which
    is the name I gave my kernel) uses 2112 bytes of
    shared memory, 14 registers per thread, and that
    there is no use of local memory (lmem); a rough
    occupancy estimate based on these numbers follows
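
Back-of-the-envelope check of what the occupancy calculator will report
(my own estimate, assuming the 16x16 = 256-thread blocks of the matrix
multiply example and G80-era limits of 8192 registers, 16 KB of shared
memory, and 768 threads (24 warps) per SM; allocation granularities are
ignored):

registers:     256 threads x 14 registers = 3584  ->  8192 / 3584 = 2 blocks per SM
shared memory: 16384 bytes / 2112 bytes           ->  7 blocks per SM
threads:       768 / 256                          ->  3 blocks per SM

The register budget is the binding constraint: 2 blocks = 512 threads =
16 warps out of the 24 an SM can manage, i.e., roughly 67% occupancy.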

8
9
Alternatively, in Developer Studio
This is what you are interested in: smem = 2672
bytes, registers = 9
9
10
End Discussion on Memory Spaces (Access and
Latency Issues) Begin Control Flow
10
11
Objective
  • Understand the implications of control flow on
  • Branch divergence overhead
  • SM execution resource utilization
  • Learn better ways to write code with control flow
  • Understand compiler/HW predication
  • An idea meant to reduce the impact of control
    flow
  • There is a cost involved with this process.

11
HK-UIUC
12
Quick terminology review
  • Thread: concurrent code, and an associated state,
    executed on the CUDA device (in parallel with
    other threads)
  • The unit of parallelism in CUDA
  • Number of threads used is controlled by the user
  • Warp: a group of threads executed physically in
    parallel on G80
  • Number of threads in a warp is not controlled by the user
  • Block: a group of threads that are executed
    together and form the unit of resource assignment
  • Number of blocks used is controlled by the user
  • Grid: a group of thread blocks that must all
    complete before the next phase of the program can
    begin

12
HK-UIUC
13
How thread blocks are partitioned
  • Each thread block is partitioned into warps
  • Thread IDs within a warp are consecutive and
    increasing
  • Remember: in multidimensional blocks, the x
    thread index runs first, followed by the y thread
    index, and finally by the z thread index (see the
    sketch after this list)
  • Warp 0 starts with Thread ID 0
  • Partitioning is always the same
  • Thus you can use this knowledge in control flow
  • However, the exact size of warps may change from
    release to release
  • While you can rely on ordering among threads, DO
    NOT rely on any ordering among warps
  • Remember, the concept of warp is not something
    you control through CUDA
  • If there are any dependencies between threads,
    you must use __syncthreads() to get correct results
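
A short sketch of the linearization and warp partitioning described
above (the hard-coded 32 is the G80 warp size; real code should query
warpSize instead):

// Linear thread index within a block: x runs first, then y, then z
__device__ int linearThreadId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// Consecutive linear IDs fall in the same warp
__device__ int warpIdOf(int tid) { return tid / 32; }  // which warp of the block
__device__ int laneOf(int tid)   { return tid % 32; }  // position within that warp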

13
HK-UIUC
14
Control Flow Instructions
  • Main performance concern with branching is
    divergence
  • Threads within a single warp take different paths
  • Different execution paths are serialized in G80
  • The control paths taken by the threads in a warp
    are traversed one at a time until there are no
    more
  • NOTE: Don't forget that divergence can manifest
    only at the warp level. You cannot discuss this
    concept in relation to code executed by threads
    in different warps

14
HK-UIUC
15
Control Flow Instructions (cont.)
  • A common case: avoid divergence when the branch
    condition is a function of the thread ID
  • Example with divergence
  • if (threadIdx.x > 2)
  • This creates two different control paths for
    threads in a block
  • Branch granularity < warp size: threads 0, 1, and 2
    follow a different path than the rest of the
    threads in the first warp
  • Example without divergence
  • if (threadIdx.x / WARP_SIZE > 2)
  • Also creates two different control paths for
    threads in a block
  • Branch granularity is a whole multiple of warp
    size: all threads in any given warp follow the
    same path (both cases are sketched below)
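
Both cases written out as a minimal sketch (the kernel and array names
are mine; WARP_SIZE is assumed to be 32, the G80 warp size):

#define WARP_SIZE 32

// Divergent: threads 0..2 of the first warp take a different path
// than threads 3..31 of that very same warp.
__global__ void divergentBranch(float* d_data)
{
    if (threadIdx.x > 2)
        d_data[threadIdx.x] *= 2.0f;
    else
        d_data[threadIdx.x] = 0.0f;
}

// Non-divergent: the branch granularity is a whole multiple of the warp
// size, so all threads of any given warp evaluate the condition alike.
__global__ void warpAlignedBranch(float* d_data)
{
    if (threadIdx.x / WARP_SIZE > 2)
        d_data[threadIdx.x] *= 2.0f;
    else
        d_data[threadIdx.x] = 0.0f;
}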

15
HK-UIUC
16
Illustration: Parallel Reduction
  • Use the Parallel Reduction algorithm as a
    vehicle to discuss the issue of control flow
  • Given an array of values, reduce them in
    parallel to a single value
  • Examples
  • Sum reduction: sum of all values in the array
  • Max reduction: maximum of all values in the array
  • Typical parallel implementation:
  • Recursively halve the number of threads, add two
    values per thread
  • Takes log(n) steps for n elements, requires n/2
    threads

16
HK-UIUC
17
A Vector Reduction Example
  • Assume an in-place reduction using shared memory
  • We are in the process of summing up a 512-element
    array
  • Shared memory is used to hold a partial sum
    vector
  • Each iteration brings the partial sum vector
    closer to the final sum
  • The final sum will be stored in element 0

17
HK-UIUC
18
A simple implementation
  • Assume we have already loaded the array into
  • __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}
18
HK-UIUC
19
The Bank Conflicts Aspect
[Figure: array elements combined in place — pairwise (0+1, 2+3, ...,
10+11) in iteration 1, then in groups of four (0..3, 4..7, 8..11) in
iteration 2, then in groups of eight (0..7, 8..15) in iteration 3; the
strided indexing determines which shared memory banks each step touches]
19
HK-UIUC
20
The Branch Divergence Aspect
[Figure: the same reduction tree, annotated with the threads doing the
work in the first iteration (Thread 0, 2, 4, 6, 8, 10); only
even-numbered threads are active, so every warp mixes active and idle
threads]
20
HK-UIUC
21
Some Observations
  • In each iteration, two control flow paths will
    be sequentially traversed for each warp
  • Threads that perform the addition and threads that
    do not
  • Threads that do not perform the addition may still
    cost extra cycles, depending on the implementation
    of divergence

21
HK-UIUC
22
Some Observations (cont.)
  • No more than half of the threads will be
    executing at any time
  • All odd-index threads are disabled right from the
    beginning!
  • On average, less than 1/4 of the threads will be
    activated for all warps over time
  • After the 5th iteration, entire warps in each
    block will be disabled: poor resource utilization,
    but no divergence
  • This can go on for a while, for up to 4 more
    iterations (512/32 = 16 = 2^4 warps), where each
    active warp has only one activated thread, until
    all warps retire (see the count below)
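
A small host-side check of that claim (plain C, my own illustration),
counting how many of the 512 threads pass the naive predicate at each
stride:

#include <stdio.h>

int main(void)
{
    const int n = 512;                      /* threads per block */
    for (int stride = 1; stride < n; stride *= 2) {
        int active = 0;
        for (int t = 0; t < n; ++t)
            if (t % (2 * stride) == 0)      /* the naive branch condition */
                ++active;
        printf("stride %3d: %3d active threads\n", stride, active);
    }
    return 0;
}

It prints 256, 128, 64, ..., 1 active threads for strides 1 through 256:
511 thread-activations over 9 iterations, i.e., an average of about 57
of the 512 threads, well under 1/4.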

22
HK-UIUC
23
Shortcomings of the implementation
  • Assume we have already loaded the array into
  • __shared__ float partialSum[]

BAD: Divergence due to interleaved branch decisions

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

BAD: Bank conflicts due to the stride
23
HK-UIUC
24
A better implementation
  • Assume we have already loaded the array into
  • __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
    __syncthreads();
    if (t < stride)                              // active threads are contiguous
        partialSum[t] += partialSum[t + stride];
}
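
Wrapped into a complete kernel for context (a hedged sketch: the kernel
name, global load/store, and per-block output array are mine, and the
input length is assumed to be a multiple of the block size):

#define BLOCK_SIZE 512

__global__ void reduceSum(const float* d_in, float* d_blockSums)
{
    __shared__ float partialSum[BLOCK_SIZE];

    unsigned int t   = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory (coalesced)
    partialSum[t] = d_in[gid];

    // The improved loop: contiguous active threads, no bank conflicts
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    // Element 0 holds this block's sum; one thread writes it out
    if (t == 0)
        d_blockSums[blockIdx.x] = partialSum[0];
}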
24
HK-UIUC
25
No Divergence until < 16 sub-sums
[Figure: small 32-element example of the improved scheme — in the first
iteration, thread t adds element t+16 into element t (0+16 through
15+31), so the active threads are contiguous and whole warps retire
together as the stride shrinks]
HK-UIUC
26
Some Observations About the New Implementation
  • Only the last 5 iterations will have divergence
  • Entire warps will be shut down as iterations
    progress
  • For a 512-thread block, 4 iterations to shut down
    all but one warp in the block
  • Better resource utilization: warps will likely
    retire earlier, and thus the block executes faster
  • Recall, there are no bank conflicts either

26
HK-UIUC
27
A Potential Further Refinement but Bad Idea
  • For the last 6 loops only one warp is active (i.e.,
    tids 0..31)
  • Shared reads/writes are SIMD synchronous within a
    warp
  • So skip __syncthreads() and unroll the last 6
    iterations

unsigned int tid = threadIdx.x;
for (unsigned int d = n >> 1; d > 32; d >>= 1) {
    __syncthreads();
    if (tid < d)
        shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid < 32) {  // unroll last 6 predicated steps
    shared[tid] += shared[tid + 32];
    shared[tid] += shared[tid + 16];
    shared[tid] += shared[tid + 8];
    shared[tid] += shared[tid + 4];
    shared[tid] += shared[tid + 2];
    shared[tid] += shared[tid + 1];
}
27
HK-UIUC
28
A Potential Further Refinement but bad idea
  • Concluding remarks on the further refinement
  • This would not work properly if the warp size
    decreases
  • It also doesn't look that attractive if the warp
    size increases
  • Finally, you would need __syncthreads() between
    each statement!
  • Having __syncthreads() in an if-statement is
    problematic.

28
HK-UIUC
29
Control Flow Instructions
  • if, switch, for, while can significantly impact
    the effective instruction throughput when threads
    of the same warp diverge
  • If this happens, the execution is serialized
  • This increases the number of instructions
    executed for this warp
  • When all the different execution paths have
    completed, the threads converge back to the same
    execution path
  • Not only do you execute more instructions, but
    you also need the logic associated with this
    process (book-keeping)

29
30
Predicated Execution Concept
  • The thread divergence can be avoided in some
    cases by using the concept of predication
  • <p1> LDR r1, r2, 0
  • If p1 is TRUE, the assembly code instruction
    above executes normally
  • If p1 is FALSE, instruction treated as NOP

30
HK-UIUC
31
Predication Example
if (x == 10)
    c = c + 1;

LDR r5, X
p1 <- r5 eq 10
<p1> LDR r1 <- C
<p1> ADD r1, r1, 1
<p1> STR r1 -> C
31
HK-UIUC
32
Predication: very helpful for if-else
[Diagram: branch structure A -> (B or C) -> D on the left; on the right,
the predicated version issues the blocks A, B, C, D straight through]
32
HK-UIUC
33
If-else example
p1, p2 <- r5 eq 10
<p1> inst 1 from B
<p1> inst 2 from B
<p1> ...
<p2> inst 1 from C
<p2> inst 2 from C

This is what gets scheduled:
p1, p2 <- r5 eq 10
<p1> inst 1 from B
<p2> inst 1 from C
<p1> inst 2 from B
<p2> inst 2 from C
<p1> ...
The cost is that extra instructions will be issued
each time the code is executed. However, there is
no branch divergence.
33
HK-UIUC
34
Instruction Predication in G80
  • Your comparison instructions set condition codes
    (CC)
  • Instructions can be predicated to write results
    only when the CC meets the criterion (CC != 0,
    CC > 0, etc.)
  • The compiler tries to predict if a branch
    condition is likely to produce many divergent
    warps
  • If that's the case, go ahead and predicate if the
    branch has < 7 instructions
  • If that's not the case, only predicate if the
    branch has < 4 instructions
  • Note: it's pretty bad if you predicate when it
    was obvious that there would have been no
    divergence

34
HK-UIUC
35
Instruction Predication in G80 (cont.)
  • ALL predicated instructions take execution cycles
  • Those with false conditions don't write their
    output, and do not evaluate addresses or read
    operands
  • Saves branch instructions, so it can be cheaper
    than serializing divergent paths
  • If all this business is confusing, remember this:
  • Avoid thread divergence
  • It's not 100% clear to me, but I believe that
    there is no cost if a subset of threads belonging
    to a warp sits there and does nothing while the
    other warp threads are all running the same
    instruction

35
HK-UIUC