LAPACK on the NVIDIA G80 Processor

About This Presentation

Slides: 25

Provided by: rober79

Learn more at: https://people.eecs.berkeley.edu

Transcript and Presenter's Notes

1
LAPACK on the NVIDIA G80 Processor

Robert Liao
Tracy Wang
CS252 Spring 2007

2
Overview

Traditional GPU Architecture
The NVIDIA G80 Processor
CUDA (Compute Unified Device Architecture)
LAPACK
Performance and Issues

3
A Quick Note on Naming

G80 is the codename for the GPU found in the following graphics cards:
NVIDIA GeForce 8 Series Graphics Cards
NVIDIA Quadro FX 4600
NVIDIA Quadro FX 5600

4
Traditional GPUs
From Intel Corporation
5
Traditional GPUs

GPUs talk Polygons

Pipeline (diagram): From CPU -> Vertex Processor -> Pixel Fragmenting Creation -> Process Fragments -> Merge Output -> Display
6
Traditional GPUs

OpenGL and DirectX abstract this away.

Pipeline (diagram): From CPU -> Vertex Processor -> Pixel Fragmenting Creation -> Process Fragments -> Merge Output -> Display
7
The NVIDIA G80 Architecture

Reconfigurable Processor Pipeline

From NVIDIA
8
G80 History and Specifications

Project Started in Summer of 2002.
128 Compute Cores
1.35 GHz in the GeForce 8800
Floating Point Ops
Stream Processor Architecture
One computing unit streams into another computing unit

9
The CUDA Interface to the G80

Compute Unified Device Architecture
C interface for performing operations on the NVIDIA processor
Keeps traditional C memory semantics within the context of a GPU

10
Working with CUDA

A custom compiler is provided to compile C code that the GPU can understand.
The API functions provide a whole host of ways to interface with the GPU.
CUDA libraries are provided for common tasks.
The CUDA runtime helps manage memory.
No DirectX or OpenGL knowledge needed!

11
Working with CUDA

Running C on the CPU: malloc, free, CPU code
Running C on the GPU: cudaMalloc, cudaFree, GPU code

Pointers on one side stay on one side. This will create issues for existing applications.
12
LAPACK

Linear Algebra PACKage
Implemented in Fortran 77
Interfaces with BLAS (Basic Linear Algebra Subprograms)
Professor James Demmel involved in Project

13
CLAPACK

An f2c'ed (Fortran-to-C translated) version of LAPACK.
Very ugly!
    s_rsle(&io___8);
    do_lio(&c__3, &c__1, (char *)&nm, (ftnlen)sizeof(integer));
    e_rsle();
    if (nm < 1) {
        s_wsfe(&io___10);
        do_fio(&c__1, " NM ", (ftnlen)4);
        do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
        do_fio(&c__1, (char *)&c__1, (ftnlen)sizeof(integer));
        e_wsfe();
        nm = 0;
        fatal = TRUE_;
    } else if (nm > 12) {
        s_wsfe(&io___11);
        do_fio(&c__1, " NM ", (ftnlen)4);
        do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
        do_fio(&c__1, (char *)&c__12, (ftnlen)sizeof(integer));
        e_wsfe();

14
CUBLAS

NVIDIA's CUDA-based implementation of BLAS
Many functions are similar, but argument signatures are slightly different
Adds some other functions as well
cublasAlloc
cublasFree
CUBLAS lives in the GPU world

15
CLAPACK and CUBLAS

Putting them together is not as easy as just linking CLAPACK to CUBLAS.
Matrices and data structures must be moved into GPU memory space.
CLAPACK executes on the CPU.
CUBLAS executes on the GPU.

Call flow (diagram): CLAPACK function -> memory copy CPU->GPU -> CUBLAS -> memory copy GPU->CPU
16
CLAPACK Concentration

General Solve
sgesv
Computes the solution to a linear system of equations A X = B
To solve, A is factored into three matrices: P, L, and U (A = P L U).
P: permutation matrix
L: lower triangular
U: upper triangular
Currently, our results cover the triangular factoring step.
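The factor-then-solve idea behind sgesv can be sketched in plain C. This is a minimal illustration, not the LAPACK or CLAPACK source: the names lu_factor, lu_solve, and the AT macro are hypothetical, the pivot indices are 0-based (LAPACK's are 1-based), and only a single right-hand side is handled. The matrix is stored column-major, as LAPACK expects.

```c
#include <stddef.h>

/* Element (i, j) of a column-major n-by-n matrix (LAPACK's storage order). */
#define AT(a, n, i, j) ((a)[(size_t)(j) * (size_t)(n) + (size_t)(i)])

/* Hypothetical helper: absolute value without pulling in libm. */
static float abs_f(float x) { return x < 0.0f ? -x : x; }

/* Factor A in place as A = P L U with partial pivoting.
 * L (unit diagonal, not stored) ends up below the diagonal, U on and above it.
 * piv[k] is the row that was swapped with row k at step k (0-based).
 * Returns 0 on success, or k + 1 if the pivot at step k is exactly zero. */
int lu_factor(int n, float *a, int *piv)
{
    for (int k = 0; k < n; ++k) {
        /* Choose the largest-magnitude entry in column k, at or below row k. */
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (abs_f(AT(a, n, i, k)) > abs_f(AT(a, n, p, k)))
                p = i;
        piv[k] = p;
        if (AT(a, n, p, k) == 0.0f)
            return k + 1; /* singular to working precision */
        if (p != k) {
            /* Swap full rows k and p. */
            for (int j = 0; j < n; ++j) {
                float t = AT(a, n, k, j);
                AT(a, n, k, j) = AT(a, n, p, j);
                AT(a, n, p, j) = t;
            }
        }
        /* Eliminate below the pivot, storing the multipliers as L. */
        for (int i = k + 1; i < n; ++i) {
            AT(a, n, i, k) /= AT(a, n, k, k);
            for (int j = k + 1; j < n; ++j)
                AT(a, n, i, j) -= AT(a, n, i, k) * AT(a, n, k, j);
        }
    }
    return 0;
}

/* Solve A x = b using the factors from lu_factor; b is overwritten with x. */
void lu_solve(int n, const float *a, const int *piv, float *b)
{
    /* Apply the recorded row swaps to b. */
    for (int k = 0; k < n; ++k) {
        if (piv[k] != k) {
            float t = b[k];
            b[k] = b[piv[k]];
            b[piv[k]] = t;
        }
    }
    /* Forward substitution with unit-diagonal L. */
    for (int i = 1; i < n; ++i)
        for (int j = 0; j < i; ++j)
            b[i] -= AT(a, n, i, j) * b[j];
    /* Back substitution with U. */
    for (int i = n - 1; i >= 0; --i) {
        for (int j = i + 1; j < n; ++j)
            b[i] -= AT(a, n, i, j) * b[j];
        b[i] /= AT(a, n, i, i);
    }
}
```

In this sketch the factoring step (lu_factor) does the O(n^3) work, which is why it is the natural candidate for offloading to the GPU, while lu_solve is only O(n^2) per right-hand side.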

17
Performance Results
18
Performance Results
19
Performance Issues

Much copying must be done from the CPU to the GPU and from the GPU to the CPU to communicate results.
Why not convert all pointers into GPU pointers?
Requires CLAPACK to run in GPU memory.
Could be someone's research paper

20
Other Issues

Floating Point Behaves Differently
Section 5.2 of the CUDA Programming Guide discusses deviations from IEEE-754
No support for denormalized numbers
Underflowed numbers are flushed to zero
We noticed some results appearing as 0.0001 instead of 0, for example
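To make the denormal point concrete, the following C sketch emulates flush-to-zero on the CPU. It is an illustration only: flush_to_zero and is_denormal are hypothetical helper names, and keeping the sign of the flushed zero is an assumption of this sketch, not a statement about what the G80 hardware does.

```c
#include <float.h>
#include <math.h>

/* Returns 1 if x is a denormalized (subnormal) single-precision value. */
int is_denormal(float x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Emulates flush-to-zero: denormalized inputs become zero,
 * everything else passes through unchanged.
 * Assumption: the sign of the input is kept on the flushed zero. */
float flush_to_zero(float x)
{
    if (is_denormal(x))
        return x < 0.0f ? -0.0f : 0.0f;
    return x;
}
```

On an IEEE-754 CPU, FLT_MIN / 2.0f is a small but nonzero denormalized value. Passing it through flush_to_zero yields zero, which is the kind of difference a CPU reference result and a GPU result can show when intermediate values underflow.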

21
Current State

Investigating some interesting memory issues on the GPU side.
Allocations Mysteriously Fail.

22
Conclusions To Date

Small data sets are better left on the CPU.
GPU calculations may not be appropriate for scientific computing, depending on needs.
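Why small problems favor the CPU can be sketched with a rough cost model. Everything here is an assumption for illustration: the function names are hypothetical, the model charges one copy of the matrix in each direction plus a fixed overhead, and it uses the standard estimate of about (2/3) n^3 floating-point operations for LU factorization. The wrapper approach described in these slides copies around every CUBLAS call, so its real overhead would be higher than this model suggests.

```c
/* Approximate floating-point operation count for LU factorization
 * of an n-by-n matrix: (2/3) n^3. */
double lu_flops(int n)
{
    double d = (double)n;
    return (2.0 / 3.0) * d * d * d;
}

/* Estimated CPU time in seconds, given a sustained CPU rate in flop/s. */
double cpu_seconds(int n, double cpu_flops_per_s)
{
    return lu_flops(n) / cpu_flops_per_s;
}

/* Estimated GPU time in seconds: fixed overhead, one copy of the
 * single-precision matrix in each direction, then the computation.
 * All rates are caller-supplied assumptions, not measurements. */
double gpu_seconds(int n, double gpu_flops_per_s,
                   double bus_bytes_per_s, double fixed_overhead_s)
{
    double d = (double)n;
    double bytes_each_way = d * d * (double)sizeof(float);
    double copy_s = 2.0 * bytes_each_way / bus_bytes_per_s;
    return fixed_overhead_s + copy_s + lu_flops(n) / gpu_flops_per_s;
}
```

With any plausible placeholder rates, the copy and overhead terms dominate for small n because they shrink more slowly than the n^3 compute term, so the GPU estimate only drops below the CPU estimate once the matrix is large enough. The rates themselves would have to be measured on the actual hardware.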

23
Future Directions

Moving all of LAPACK into GPU
Resolving the copying issue
Perhaps resolved by unifying the CPU and GPU?
Want to give it a try?
Can't find the Quadro FX 5600 on the market (MSRP $2,999)
GeForce 8 Series cards have the G80 processor
GeForce 8500GT ($99.99)
GeForce 8800GTX ($939.99)

24
Questions
