A short introduction to nVidia - PowerPoint PPT Presentation

1 / 26
About This Presentation
Title:

A short introduction to nVidia

Description:

... 00 46991.00 24061.00 82939.00 82818.00 47958.00 24731.00 86450.00 82487.00 48929.00 23868.00 86811.00 83473.00 50119.00 24357.00 88180.00 83588.00 45689.00 ... – PowerPoint PPT presentation

Number of Views:198
Avg rating:3.0/5.0
Slides: 27
Provided by: AlexanderH150
Category:

less

Transcript and Presenter's Notes

Title: A short introduction to nVidia


1
A short introduction to nVidias CUDA
  • Alexander Heinecke
  • Technical University of Munich

http//home.in.tum.de/heinecke/fa2007
2
Overview
  1. Differences CPU GPU 3
  2. General CPU/GPU properties
  3. Compare specifications
  4. CUDA Programming Model 10
  5. Application stack
  6. Thread implementation
  7. Memory Model
  8. CUDA API 13
  9. Extension of the C/C Programming Lang.
  10. Example structure of a CUDA application
  11. Examples 15
  12. Matrix Addition
  13. Matrix Multiplication
  14. Jacobi Gauß Seidel
  15. Benchmark Results 21

3
Differences between CPU and GPU
  • GPU nearly all transistors are ALUs
  • CPU most of the transistors are Cache

(taken from NV1)
4
AMD Opteron Dieshot
5
Intel Itanium2 Dual-Core Dieshot
6
Intel Core Architecture Pipeline / Simple Example
(taken from IN1)
Pipeline
RET 1
RET 2
RET 3
Step 5
EXEC 1
EXEC 2
EXEC 3
EXEC 4
Step 4
OFETCH 1
OFETCH 2
OFETCH 3
OFETCH 4
OFETCH 5
Step 3
IDEC 1
IDEC 2
IDEC 3
IDEC 4
IDEC 5
IDEC 6
Step 2
IFETCH 1
IFETCH 2
IFETCH 3
IFETCH 4
IFETCH 5
IFETCH 6
IFETCH 7
Step 1
cycle
1
2
3
4
5
6
7
7
nVidia G80 Pipeline
8
Properties of CPU and GPU
Intel Xeon X5355 nVidia G80 (8800 GTX)
Clock Speed 2,66 GHz 575 MHz
Cores / SPEs 4 128
Floats in register 96 131072
Max. GFlop/s (float) 84 (prac) 85 (theo) 460 (prac) 500 (theo)
Max. Instructions RAM limited 2 Million G80 ASM Instr.
typ. dur. Inst. 1-2 cycles (SSE) min. 4 cycles
Price () 800 500
9
History Power of GPUs in the last four years
(taken from NV1)
10
Application stack of CUDA
(taken from NV1)
11
Thread organization in CUDA
(taken from NV1)
12
Memory organization in CUDA
(taken from NV1)
13
Extensions to C (functions and varaible)
  • CUDA Code is saved in special files (.cu)
  • These are precompiled by nvcc (nvidia compiler)
  • There are some function type qualifiers, which
    decide the execution place
  • __host__ (CPU only, called by CPU)
  • __global__ (GPU only, called by CPU)
  • __device__ (GPU only, called by GPU)
  • For varaibles __device__, __constant__,
    __shared__

14
Example structure of a CUDA application
  • min. two functions to isolate CUDA Code from your
    app.
  • First function
  • Init CUDA
  • Copy data to device
  • Call kernel with execution settings
  • Copy data to host and shut down (automatic)
  • Second function (kernel)
  • Contains problem for ONE thread

15
Tested Algorithms (2D Arrays)
  • All tested algorithms operate on 2D Arrays
  • Matrix Addtion
  • Matrix Multiplication
  • Jacobi Gauß-Seidel (iterative solver)

16
Example Matrix Addition (Init function)
  • CUT_DEVICE_INIT()
  • // allocate device memory
  • float d_A
  • CUDA_SAFE_CALL(cudaMalloc((void) d_A,
    mem_size))
  • // copy host memory to device
  • CUDA_SAFE_CALL(cudaMemcpy(d_A, ma_a, mem_size,
    cudaMemcpyHostToDevice) )
  • cudaBindTexture(0, texRef_MaA, d_A, mem_size)
    // texture binding
  • dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU)
  • dim3 grid(n_dim / threads.x, n_dim / threads.y)
  • // execute the kernel
  • cuMatrixAdd_kernelltltlt grid, threads gtgtgt(d_C,
    n_dim)
  • cudaUnbindTexture(texRef_MaA) // texture
    unbinding
  • // copy result from device to host
  • CUDA_SAFE_CALL(cudaMemcpy(ma_c, d_C, mem_size,
    cudaMemcpyDeviceToHost) )

17
Example Matrix Addition (kernel)
  • // Block index
  • int bx blockIdx.x
  • int by blockIdx.y
  • // Thread index
  • int tx threadIdx.x
  • int ty threadIdx.y
  • int start (n_dim by BLOCK_SIZE_GPU) bx
    BLOCK_SIZE_GPU
  • Cstart (n_dim ty) tx
  • tex1Dfetch(texRef_MaA, start (n_dim ty)
    tx) tex1Dfetch(texRef_MaB, start (n_dim ty)
    tx)

18
Example Matrix Multiplication (kernel)
  • int tx2 tx BLOCK_SIZE_GPU
  • int ty2 n_dim ty
  • float Csub1 0.0 float Csub2 0.0
  • int b bBegin
  • for (int a aBegin a lt aEnd a aStep)
  • __shared__ float AsBLOCK_SIZE_GPUBLOCK_SIZE_GP
    U
  • AS(ty, tx) Aa ty2 tx
  • __shared__ float B1sBLOCK_SIZE_GPUBLOCK_SIZE_G
    PU2
  • B1S(ty, tx) Bb ty2 tx
  • B1S(ty, tx2) Bb ty2 tx2
  • __syncthreads()
  • Csub1 AS(ty, 0) B1S(0, tx)
  • // more calcs
  • b bStep
  • __syncthreads()
  • // Write result back

19
Example Jacobi (kernel), no internal loops
  • // Block index
  • int bx blockIdx.x int by blockIdx.y
  • // Thread index
  • int tx threadIdx.x1 int ty threadIdx.y1
  • int ustart ((by BLOCK_SIZE_GPU) n_dim )
    (bx BLOCK_SIZE_GPU)
  • float res tex1Dfetch(texRef_MaF, ustart (ty
    n_dim) tx) qh
  • res tex1Dfetch(texRef_MaU, ustart (ty
    n_dim) tx - 1) tex1Dfetch(texRef_MaU, ustart
    (ty n_dim) tx 1)
  • res tex1Dfetch(texRef_MaU, ustart ((ty1)
    n_dim) tx) tex1Dfetch(texRef_MaU, ustart
    ((ty-1) n_dim) tx)
  • res 0.25f res
  • ma_uustart (ty n_dim) tx res

20
Example Jacobi (kernel), internal loops
  • int tx threadIdx.x1 int ty threadIdx.y1
  • // some more inits
  • // load to calc u_ij
  • __shared__ float UsBLOCK_SIZE_GPU2BLOCK_SIZE_G
    PU2
  • US(ty, tx) tex1Dfetch(texRef_MaU, ustart (ty
    n_dim) tx)
  • // init edge u
  • for (unsigned int i 0 i lt n_intern_loops i)
  • res funk
  • res US(ty, tx - 1) US(ty, tx 1)
  • res US(ty - 1, tx) US(ty 1, tx)
  • res 0.25f res
  • __syncthreads() // not used in parallel jacobi
  • US(ty, tx) res

21
Performance Results (1)
22
Performance Results (2)
23
Performance Results (3)
24
Performance Results (4)
25
Conclusion (Points to take care of)
  • Be care of / you should use
  • min. number of memory accesses
  • use unrolling instead of for loops
  • use blocking algorithms
  • only algorithms, which are not extremly memory
    bounded (NOT matrix addition) should be
    implemented with CUDA
  • try to do not use the if statement, or other
    programmecontrolling statements (slow)

26
Appendix - References
  • NV1 NVIDIA CUDA Compute Unified Device
    Architecture, Programming Guide nVidia
    Corporation, Version 1.0, 23.06.2007
  • IN1/2/3 Intel Architecture Handbook, Version
    November 2006
  • NR Numerical receipies (online generated
    pdf)

http//home.in.tum.de/heinecke/fa2007
Write a Comment
User Comments (0)
About PowerShow.com