A short introduction to nVidias CUDA - PowerPoint PPT Presentation

1 / 26

About This Presentation

Title:

A short introduction to nVidias CUDA

Description:

Floats in register. 128. 4 #Cores / SPEs. nVidia G80 (8800 ... Example structure of a CUDA application. min. two functions to isolate CUDA Code from your app. ... – PowerPoint PPT presentation

Number of Views:196

Avg rating:3.0/5.0

Slides: 27

Provided by: alexander124

Category:

more less

Transcript and Presenter's Notes

Title: A short introduction to nVidias CUDA

1
A short introduction to nVidias CUDA

Alexander Heinecke
Technical University of Munich

http//home.in.tum.de/heinecke/fa2007
2
Overview

Differences CPU GPU 3
General CPU/GPU properties
Compare specifications
CUDA Programming Model 10
Application stack
Thread implementation
Memory Model
CUDA API 13
Extension of the C/C Programming Lang.
Example structure of a CUDA application
Examples 15
Matrix Addition
Matrix Multiplication
Jacobi Gauß Seidel
Benchmark Results 21

3
Differences between CPU and GPU

GPU nearly all transistors are ALUs
CPU most of the transistors are Cache

(taken from NV1)
4
AMD Opteron Dieshot
5
Intel Itanium2 Dual-Core Dieshot
6
Intel Core Architecture Pipeline / Simple Example
(taken from IN1)
Pipeline
RET 1
RET 2
RET 3
Step 5
EXEC 1
EXEC 2
EXEC 3
EXEC 4
Step 4
OFETCH 1
OFETCH 2
OFETCH 3
OFETCH 4
OFETCH 5
Step 3
IDEC 1
IDEC 2
IDEC 3
IDEC 4
IDEC 5
IDEC 6
Step 2
IFETCH 1
IFETCH 2
IFETCH 3
IFETCH 4
IFETCH 5
IFETCH 6
IFETCH 7
Step 1
cycle
1
2
3
4
5
6
7
7
nVidia G80 Pipeline
8
Properties of CPU and GPU
9
History Power of GPUs in the last four years
(taken from NV1)
10
Application stack of CUDA
(taken from NV1)
11
Thread organization in CUDA
(taken from NV1)
12
Memory organization in CUDA
(taken from NV1)
13
Extensions to C (functions and varaible)

CUDA Code is saved in special files (.cu)
These are precompiled by nvcc (nvidia compiler)
There are some function type qualifiers, which
decide the execution place
__host__ (CPU only, called by CPU)
__global__ (GPU only, called by CPU)
__device__ (GPU only, called by GPU)
For varaibles __device__, __constant__,
__shared__

14
Example structure of a CUDA application

min. two functions to isolate CUDA Code from your
app.
First function
Init CUDA
Copy data to device
Call kernel with execution settings
Copy data to host and shut down (automatic)
Second function (kernel)
Contains problem for ONE thread

15
Tested Algorithms (2D Arrays)

All tested algorithms operate on 2D Arrays
Matrix Addtion
Matrix Multiplication
Jacobi Gauß-Seidel (iterative solver)

16
Example Matrix Addition (Init function)

CUT_DEVICE_INIT()
// allocate device memory
float d_A
CUDA_SAFE_CALL(cudaMalloc((void) d_A,
mem_size))
// copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_A, ma_a, mem_size,
cudaMemcpyHostToDevice) )
cudaBindTexture(0, texRef_MaA, d_A, mem_size)
// texture binding
dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU)
dim3 grid(n_dim / threads.x, n_dim / threads.y)
// execute the kernel
cuMatrixAdd_kernelltltlt grid, threads gtgtgt(d_C,
n_dim)
cudaUnbindTexture(texRef_MaA) // texture
unbinding
// copy result from device to host
CUDA_SAFE_CALL(cudaMemcpy(ma_c, d_C, mem_size,
cudaMemcpyDeviceToHost) )

17
Example Matrix Addition (kernel)

// Block index
int bx blockIdx.x
int by blockIdx.y
// Thread index
int tx threadIdx.x
int ty threadIdx.y
int start (n_dim by BLOCK_SIZE_GPU) bx
BLOCK_SIZE_GPU
Cstart (n_dim ty) tx
tex1Dfetch(texRef_MaA, start (n_dim ty)
tx) tex1Dfetch(texRef_MaB, start (n_dim ty)
tx)

18
Example Matrix Multiplication (kernel)

int tx2 tx BLOCK_SIZE_GPU
int ty2 n_dim ty
float Csub1 0.0 float Csub2 0.0
int b bBegin
for (int a aBegin a lt aEnd a aStep)
__shared__ float AsBLOCK_SIZE_GPUBLOCK_SIZE_GP
U
AS(ty, tx) Aa ty2 tx
__shared__ float B1sBLOCK_SIZE_GPUBLOCK_SIZE_G
PU2
B1S(ty, tx) Bb ty2 tx
B1S(ty, tx2) Bb ty2 tx2
__syncthreads()
Csub1 AS(ty, 0) B1S(0, tx)
// more calcs
b bStep
__syncthreads()
// Write result back

19
Example Jacobi (kernel), no internal loops

// Block index
int bx blockIdx.x int by blockIdx.y
// Thread index
int tx threadIdx.x1 int ty threadIdx.y1
int ustart ((by BLOCK_SIZE_GPU) n_dim )
(bx BLOCK_SIZE_GPU)
float res tex1Dfetch(texRef_MaF, ustart (ty
n_dim) tx) qh
res tex1Dfetch(texRef_MaU, ustart (ty
n_dim) tx - 1) tex1Dfetch(texRef_MaU, ustart
(ty n_dim) tx 1)
res tex1Dfetch(texRef_MaU, ustart ((ty1)
n_dim) tx) tex1Dfetch(texRef_MaU, ustart
((ty-1) n_dim) tx)
res 0.25f res
ma_uustart (ty n_dim) tx res

20
Example Jacobi (kernel), internal loops

int tx threadIdx.x1 int ty threadIdx.y1
// some more inits
// load to calc u_ij
__shared__ float UsBLOCK_SIZE_GPU2BLOCK_SIZE_G
PU2
US(ty, tx) tex1Dfetch(texRef_MaU, ustart (ty
n_dim) tx)
// init edge u
for (unsigned int i 0 i lt n_intern_loops i)
res funk
res US(ty, tx - 1) US(ty, tx 1)
res US(ty - 1, tx) US(ty 1, tx)
res 0.25f res
__syncthreads() // not used in parallel jacobi
US(ty, tx) res

21
Performance Results (1)
22
Performance Results (2)
23
Performance Results (3)
24
Performance Results (4)
25
Conclusion (Points to take care of)

Be care of / you should use
min. number of memory accesses
use unrolling instead of for loops
use blocking algorithms
only algorithms, which are not extremly memory
bounded (NOT matrix addition) should be
implemented with CUDA
try to do not use the if statement, or other
programmecontrolling statements (slow)

26
Appendix - References

NV1 NVIDIA CUDA Compute Unified Device
Architecture, Programming Guide nVidia
Corporation, Version 1.0, 23.06.2007
IN1/2/3 Intel Architecture Handbook, Version
November 2006
NR Numerical receipies (online generated
pdf)

http//home.in.tum.de/heinecke/fa2007

Write a Comment

User Comments (0)