Title: A short introduction to nVidias CUDA
1A short introduction to nVidias CUDA
- Alexander Heinecke
- Technical University of Munich
http//home.in.tum.de/heinecke/fa2007
2Overview
- Differences CPU GPU 3
- General CPU/GPU properties
- Compare specifications
- CUDA Programming Model 10
- Application stack
- Thread implementation
- Memory Model
- CUDA API 13
- Extension of the C/C Programming Lang.
- Example structure of a CUDA application
- Examples 15
- Matrix Addition
- Matrix Multiplication
- Jacobi Gauß Seidel
- Benchmark Results 21
3Differences between CPU and GPU
- GPU nearly all transistors are ALUs
- CPU most of the transistors are Cache
(taken from NV1)
4AMD Opteron Dieshot
5Intel Itanium2 Dual-Core Dieshot
6Intel Core Architecture Pipeline / Simple Example
(taken from IN1)
Pipeline
RET 1
RET 2
RET 3
Step 5
EXEC 1
EXEC 2
EXEC 3
EXEC 4
Step 4
OFETCH 1
OFETCH 2
OFETCH 3
OFETCH 4
OFETCH 5
Step 3
IDEC 1
IDEC 2
IDEC 3
IDEC 4
IDEC 5
IDEC 6
Step 2
IFETCH 1
IFETCH 2
IFETCH 3
IFETCH 4
IFETCH 5
IFETCH 6
IFETCH 7
Step 1
cycle
1
2
3
4
5
6
7
7nVidia G80 Pipeline
8Properties of CPU and GPU
9History Power of GPUs in the last four years
(taken from NV1)
10Application stack of CUDA
(taken from NV1)
11Thread organization in CUDA
(taken from NV1)
12Memory organization in CUDA
(taken from NV1)
13Extensions to C (functions and varaible)
- CUDA Code is saved in special files (.cu)
- These are precompiled by nvcc (nvidia compiler)
- There are some function type qualifiers, which
decide the execution place - __host__ (CPU only, called by CPU)
- __global__ (GPU only, called by CPU)
- __device__ (GPU only, called by GPU)
- For varaibles __device__, __constant__,
__shared__
14Example structure of a CUDA application
- min. two functions to isolate CUDA Code from your
app. - First function
- Init CUDA
- Copy data to device
- Call kernel with execution settings
- Copy data to host and shut down (automatic)
- Second function (kernel)
- Contains problem for ONE thread
15Tested Algorithms (2D Arrays)
- All tested algorithms operate on 2D Arrays
- Matrix Addtion
- Matrix Multiplication
- Jacobi Gauß-Seidel (iterative solver)
16Example Matrix Addition (Init function)
- CUT_DEVICE_INIT()
- // allocate device memory
- float d_A
- CUDA_SAFE_CALL(cudaMalloc((void) d_A,
mem_size)) -
- // copy host memory to device
- CUDA_SAFE_CALL(cudaMemcpy(d_A, ma_a, mem_size,
cudaMemcpyHostToDevice) ) -
- cudaBindTexture(0, texRef_MaA, d_A, mem_size)
// texture binding -
- dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU)
- dim3 grid(n_dim / threads.x, n_dim / threads.y)
- // execute the kernel
- cuMatrixAdd_kernelltltlt grid, threads gtgtgt(d_C,
n_dim) - cudaUnbindTexture(texRef_MaA) // texture
unbinding -
- // copy result from device to host
- CUDA_SAFE_CALL(cudaMemcpy(ma_c, d_C, mem_size,
cudaMemcpyDeviceToHost) ) -
17Example Matrix Addition (kernel)
- // Block index
- int bx blockIdx.x
- int by blockIdx.y
- // Thread index
- int tx threadIdx.x
- int ty threadIdx.y
- int start (n_dim by BLOCK_SIZE_GPU) bx
BLOCK_SIZE_GPU - Cstart (n_dim ty) tx
- tex1Dfetch(texRef_MaA, start (n_dim ty)
tx) tex1Dfetch(texRef_MaB, start (n_dim ty)
tx)
18Example Matrix Multiplication (kernel)
- int tx2 tx BLOCK_SIZE_GPU
- int ty2 n_dim ty
- float Csub1 0.0 float Csub2 0.0
- int b bBegin
- for (int a aBegin a lt aEnd a aStep)
-
- __shared__ float AsBLOCK_SIZE_GPUBLOCK_SIZE_GP
U - AS(ty, tx) Aa ty2 tx
- __shared__ float B1sBLOCK_SIZE_GPUBLOCK_SIZE_G
PU2 - B1S(ty, tx) Bb ty2 tx
- B1S(ty, tx2) Bb ty2 tx2
- __syncthreads()
- Csub1 AS(ty, 0) B1S(0, tx)
- // more calcs
- b bStep
-
- __syncthreads()
- // Write result back
19Example Jacobi (kernel), no internal loops
- // Block index
- int bx blockIdx.x int by blockIdx.y
- // Thread index
- int tx threadIdx.x1 int ty threadIdx.y1
- int ustart ((by BLOCK_SIZE_GPU) n_dim )
(bx BLOCK_SIZE_GPU) - float res tex1Dfetch(texRef_MaF, ustart (ty
n_dim) tx) qh - res tex1Dfetch(texRef_MaU, ustart (ty
n_dim) tx - 1) tex1Dfetch(texRef_MaU, ustart
(ty n_dim) tx 1) - res tex1Dfetch(texRef_MaU, ustart ((ty1)
n_dim) tx) tex1Dfetch(texRef_MaU, ustart
((ty-1) n_dim) tx) - res 0.25f res
- ma_uustart (ty n_dim) tx res
20Example Jacobi (kernel), internal loops
- int tx threadIdx.x1 int ty threadIdx.y1
- // some more inits
- // load to calc u_ij
- __shared__ float UsBLOCK_SIZE_GPU2BLOCK_SIZE_G
PU2 - US(ty, tx) tex1Dfetch(texRef_MaU, ustart (ty
n_dim) tx) - // init edge u
-
- for (unsigned int i 0 i lt n_intern_loops i)
-
- res funk
- res US(ty, tx - 1) US(ty, tx 1)
- res US(ty - 1, tx) US(ty 1, tx)
- res 0.25f res
- __syncthreads() // not used in parallel jacobi
- US(ty, tx) res
21Performance Results (1)
22Performance Results (2)
23Performance Results (3)
24Performance Results (4)
25Conclusion (Points to take care of)
- Be care of / you should use
- min. number of memory accesses
- use unrolling instead of for loops
- use blocking algorithms
- only algorithms, which are not extremly memory
bounded (NOT matrix addition) should be
implemented with CUDA - try to do not use the if statement, or other
programmecontrolling statements (slow)
26Appendix - References
- NV1 NVIDIA CUDA Compute Unified Device
Architecture, Programming Guide nVidia
Corporation, Version 1.0, 23.06.2007 - IN1/2/3 Intel Architecture Handbook, Version
November 2006 - NR Numerical receipies (online generated
pdf)
http//home.in.tum.de/heinecke/fa2007