Title: General Purpose Computing using GPU (GPGPU)
1. General Purpose Computing using GPU (GPGPU)
2. GeForce 6 Series Architecture
Courtesy GPGPU - Sam Adams
3. GeForce 8800 GPU
- Built around programmable units
- Unified shader architecture
4. GPU Architecture Flow
[Diagram: architecture flow of the GeForce 8 series (unified shaders) compared with the 6 series (separate pipeline stages)]
5. CUDA: C with no shader limitations!
- Integrated host + device application C program
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device kernel C code

Serial Code (host)
Parallel Kernel (device): KernelA<<<nBlk, nTid>>>(args);
Serial Code (host)
Parallel Kernel (device): KernelB<<<nBlk, nTid>>>(args);
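As a concrete illustration of this host/device alternation, here is a minimal sketch (the kernel names, sizes, and launch configuration are illustrative, not from the slides):

#include <cuda_runtime.h>

// Hypothetical kernels standing in for KernelA and KernelB.
__global__ void KernelA(float *d, int n) { /* highly parallel part */ }
__global__ void KernelB(float *d, int n) { /* highly parallel part */ }

int main(void)
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));

    // Serial code (host): choose the launch configuration.
    int nTid = 256;                        // threads per block
    int nBlk = (n + nTid - 1) / nTid;      // enough blocks to cover n

    KernelA<<<nBlk, nTid>>>(d, n);         // parallel kernel (device)
    // Serial code (host) runs here between launches.
    KernelB<<<nBlk, nTid>>>(d, n);         // parallel kernel (device)

    cudaDeviceSynchronize();               // wait for device work to finish
    cudaFree(d);
    return 0;
}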
7. Block IDs and Thread IDs
- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
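For example (a sketch, not from the slides), a 1D kernel combines the two IDs into one global index:

__global__ void scale(float *data, float factor, int n)
{
    // Block ID and thread ID together pick this thread's element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the last block may be partly idle
        data[i] *= factor;
}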
8. CUDA Memory Model Overview
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long-latency access
- Constant memory
- Shared memory
- Local registers
(Declarations for each space are sketched after the diagram below.)
[Diagram: CUDA memory model. The host reads/writes global memory; a grid contains blocks (0,0) and (1,0), each with its own shared memory; threads (0,0) and (1,0) within each block have private registers.]
Courtesy Nvidia Corp.
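A minimal sketch of how these memory spaces appear in CUDA C (the kernel and the fixed block size of 256 are assumptions for illustration):

__constant__ float coeffs[16];         // constant memory: read-only on device

__global__ void example(float *g_in)   // g_in points into global memory
{
    __shared__ float tile[256];        // shared memory: one copy per block
    float x = g_in[threadIdx.x];       // local scalar: lives in a register
    tile[threadIdx.x] = x * coeffs[0];
    __syncthreads();                   // make shared-memory writes visible
    g_in[threadIdx.x] = tile[255 - threadIdx.x];
}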
9. CUDA Example: Matrix Multiplication - A Simple Host Version in C
// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
[Diagram: Width x Width matrices M, N, and P; row i of M and column j of N combine over index k to give element P(i, j).]
Courtesy David Kirk
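For reference, the host version could be driven like this (a sketch; the sizes and fill values are illustrative):

int Width = 4;
float *M = (float*)malloc(Width * Width * sizeof(float));
float *N = (float*)malloc(Width * Width * sizeof(float));
float *P = (float*)malloc(Width * Width * sizeof(float));
for (int i = 0; i < Width * Width; ++i) { M[i] = 1.0f; N[i] = 2.0f; }
MatrixMulOnHost(M, N, P, Width);   // every element of P should equal 2 * Width
free(M); free(N); free(P);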
10. CUDA Example: Matrix Multiplication - The GPU Version
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
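The slide omits error handling; each CUDA runtime call returns a cudaError_t that is worth checking. One common pattern, sketched with a hypothetical helper macro (not part of the CUDA API):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&Md, size));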
11. CUDA Example: Matrix Multiplication - The GPU Version
(Same code as slide 10; highlighted step: allocate memory on the device.)
12. CUDA Example: Matrix Multiplication - The GPU Version
(Same code as slide 10; highlighted step: do the matrix multiplication on the device.)
13. CUDA Example: Matrix Multiplication - The GPU Version
(Same code as slide 10; highlighted step: free device memory.)
14. CUDA Example: Matrix Multiplication - The GPU Version (continued)
// Matrix multiplication kernel - per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pval stores the Pd element computed by this thread
    float Pval = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pval += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pval;
}
[Diagram: thread (tx, ty) reads row ty of Md and column tx of Nd, looping over k, to compute element Pd(ty, tx); all matrices are WIDTH x WIDTH.]
Courtesy David Kirk
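Note that this kernel indexes only with threadIdx, so a single block computes all of Pd and Width is capped by the per-block thread limit. A common extension (a sketch, not from these slides) adds blockIdx so a grid of blocks covers larger matrices:

__global__ void MatrixMulKernelMultiBlock(float* Md, float* Nd, float* Pd, int Width)
{
    // Global row/column of the Pd element this thread computes.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float Pval = 0;
        for (int k = 0; k < Width; ++k)
            Pval += Md[row * Width + k] * Nd[k * Width + col];
        Pd[row * Width + col] = Pval;
    }
}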
15. Problem Description
- Solve delay differential equations that describe the dynamics of genetic processes
- Solve multiple equations in parallel, as sketched below
- Optimize for memory access
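To illustrate "solve multiple equations in parallel" (a sketch under an assumed equation form and Euler stepping; the actual delay equations are not given in the slides), each thread could integrate its own parameterized instance:

// Hypothetical: thread i integrates dx/dt = -a[i] * x(t - tau), with the
// delay crudely approximated by the previous step's value for illustration.
__global__ void integrate(float *x, const float *a, float dt, int steps, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i];
    float delayed = xi;                 // stand-in for x(t - tau)
    for (int s = 0; s < steps; ++s) {
        float next = xi - a[i] * delayed * dt;
        delayed = xi;                   // shift the delayed value forward
        xi = next;
    }
    x[i] = xi;                          // coalesced write back to global memory
}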
16. Bibliography
- Shuai Che, Michael Boyer, David Tarjan, Kevin Skadron: "A Performance Study of General-Purpose Applications on Graphics Processors Using CUDA"
- CUDA textbook: David Kirk (Nvidia) and Prof. Wen-mei Hwu (UIUC)
- GPU Gems 2, Nvidia Corporation