LAPACK on the NVIDIA G80 Processor - PowerPoint PPT Presentation

Slides: 25
Provided by: rober79
1
LAPACK on the NVIDIA G80 Processor
  • Robert Liao
  • Tracy Wang
  • CS252 Spring 2007

2
Overview
  • Traditional GPU Architecture
  • The NVIDIA G80 Processor
  • CUDA (Compute Unified Device Architecture)
  • LAPACK
  • Performance and Issues

3
A Quick Note on Naming
  • G80 is the codename for the GPU found in the
    following graphics cards:
  • NVIDIA GeForce 8 Series Graphics Cards
  • NVIDIA Quadro FX 4600
  • NVIDIA Quadro FX 5600

4
Traditional GPUs
From Intel Corporation
5
Traditional GPUs
  • GPUs talk Polygons

[Diagram: graphics pipeline with stages labeled From CPU, Vertex Processor, Pixel Fragmenting Creation, Process Fragments, Merge Output, Display]
6
Traditional GPUs
  • OpenGL and DirectX abstract this away.

[Diagram: the same graphics pipeline, with stages labeled From CPU, Vertex Processor, Pixel Fragmenting Creation, Process Fragments, Merge Output, Display]
7
The NVIDIA G80 Architecture
  • Reconfigurable Processor Pipeline

From NVIDIA
8
G80 History and Specifications
  • Project Started in Summer of 2002.
  • 128 Compute Cores
  • 1.35 GHz in the GeForce 8800
  • Floating Point Ops
  • Stream Processor Architecture
  • One Computing Unit Streams into another
    Computing Unit

9
The CUDA Interface to the G80
  • Compute Unified Device Architecture
  • C Interface for Performing Operations on the
    NVIDIA Processor
  • Keeps traditional C memory semantics within the
    context of a GPU

10
Working with CUDA
  • Custom compiler provided to compile C code that
    the GPU can understand.
  • The API functions provide a whole host of ways to
    interface with the GPU.
  • CUDA Libraries are provided for common tasks.
  • The CUDA Runtime helps manage memory.
  • No DirectX or OpenGL knowledge needed!

11
Working with CUDA
  • Running C on the CPU: malloc, free, CPU code
  • Running C on the GPU: cudaMalloc, cudaFree, GPU code

Pointers on one side stay on one side. This will
create issues for existing applications.
12
LAPACK
  • Linear Algebra PACKage
  • Implemented in Fortran 77
  • Interfaces with BLAS (Basic Linear Algebra
    Subprograms)
  • Professor James Demmel involved in Project

13
CLAPACK
  • An f2c-translated version of LAPACK (Fortran 77 converted to C).
  • Very ugly!
  • s_rsle(&io___8);
  • do_lio(&c__3, &c__1, (char *)&nm, (ftnlen)sizeof(integer));
  • e_rsle();
  • if (nm < 1) {
  • s_wsfe(&io___10);
  • do_fio(&c__1, " NM ", (ftnlen)4);
  • do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
  • do_fio(&c__1, (char *)&c__1, (ftnlen)sizeof(integer));
  • e_wsfe();
  • nm = 0;
  • fatal = TRUE_;
  • } else if (nm > 12) {
  • s_wsfe(&io___11);
  • do_fio(&c__1, " NM ", (ftnlen)4);
  • do_fio(&c__1, (char *)&nm, (ftnlen)sizeof(integer));
  • do_fio(&c__1, (char *)&c__12, (ftnlen)sizeof(integer));
  • e_wsfe();

14
CUBLAS
  • NVIDIA's CUDA Based Implementation of BLAS
  • Many functions are similar, but argument
    signatures are slightly different
  • Adds some other functions as well
  • cublasAlloc
  • cublasFree
  • CUBLAS lives in the GPU world

15
CLAPACK and CUBLAS
  • Putting them together is not as easy as just
    linking CLAPACK to CUBLAS.
  • Matrices and data structures must be moved into
    GPU memory space.
  • CLAPACK executes on the CPU.
  • CUBLAS executes on the GPU.

[Diagram: CLAPACK Function -> Memory copy CPU->GPU -> CUBLAS -> Memory copy GPU->CPU]
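
The copy-in, compute, copy-out pattern on this slide can be sketched in plain C. This is a minimal illustration, not the presenters' code: the names dev_alloc, dev_free, copy_to_dev, copy_from_dev, dev_sgemm, and staged_sgemm are hypothetical stand-ins, and the "device" here is ordinary heap memory so the sketch runs without a GPU. In a real port, these roles would presumably be filled by CUBLAS calls such as cublasAlloc, cublasSetMatrix, cublasSgemm, cublasGetMatrix, and cublasFree.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the device side. "Device" memory here is
   ordinary heap memory so the sketch runs without a GPU. */
static float *dev_alloc(size_t count)
{
    return (float *)malloc(count * sizeof(float));
}

static void dev_free(float *d)
{
    free(d);
}

static void copy_to_dev(float *dev, const float *host, size_t count)
{
    memcpy(dev, host, count * sizeof(float));
}

static void copy_from_dev(float *host, const float *dev, size_t count)
{
    memcpy(host, dev, count * sizeof(float));
}

/* Stand-in for the device BLAS call: C = A * B for n x n column-major
   matrices that all live in "device" memory. */
static void dev_sgemm(int n, const float *dA, const float *dB, float *dC)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += dA[i + k * n] * dB[k + j * n];
            dC[i + j * n] = sum;
        }
    }
}

/* Host-callable wrapper: takes host pointers, stages the data through
   device buffers, and copies the result back.
   Returns 0 on success, -1 if a device allocation fails. */
int staged_sgemm(int n, const float *A, const float *B, float *C)
{
    assert(n > 0);
    size_t count = (size_t)n * (size_t)n;

    float *dA = dev_alloc(count);
    float *dB = dev_alloc(count);
    float *dC = dev_alloc(count);
    if (dA == NULL || dB == NULL || dC == NULL) {
        dev_free(dA);
        dev_free(dB);
        dev_free(dC);
        return -1;
    }

    copy_to_dev(dA, A, count);     /* memory copy CPU -> GPU */
    copy_to_dev(dB, B, count);
    dev_sgemm(n, dA, dB, dC);      /* compute on the device side */
    copy_from_dev(C, dC, count);   /* memory copy GPU -> CPU */

    dev_free(dA);
    dev_free(dB);
    dev_free(dC);
    return 0;
}
```

Every call through such a wrapper pays for two input copies and one output copy, which is the overhead the later Performance Issues slide refers to.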
16
CLAPACK Concentration
  • General Solve
  • sgesv
  • Computes the solution to a linear system of equations
    A X = B
  • To solve, A is factored into three matrices: P,
    L, and U.
  • P: Permutation Matrix
  • L: Lower Triangular
  • U: Upper Triangular
  • Currently, our results cover the triangular
    factoring step
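
To make the factoring step concrete, here is a minimal sketch of LU factorization with partial pivoting, so that P A = L U. It is an unblocked textbook version written for illustration, not the presenters' code and not LAPACK's own blocked routine; the function name lu_factor and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Factor the n x n column-major matrix a in place so that P*A = L*U.
   On return the strict lower triangle holds L (its unit diagonal is
   implied), the upper triangle holds U, and piv[k] is the 0-based row
   that was swapped with row k at step k.
   Returns 0 on success, or k+1 if the pivot at step k is exactly zero. */
int lu_factor(int n, float *a, int *piv)
{
    assert(n > 0);
    assert(a != NULL && piv != NULL);

    for (int k = 0; k < n; ++k) {
        /* Partial pivoting: pick the largest magnitude in column k. */
        int p = k;
        float best = a[k + k * n] < 0.0f ? -a[k + k * n] : a[k + k * n];
        for (int i = k + 1; i < n; ++i) {
            float v = a[i + k * n] < 0.0f ? -a[i + k * n] : a[i + k * n];
            if (v > best) {
                best = v;
                p = i;
            }
        }
        piv[k] = p;
        if (best == 0.0f)
            return k + 1;   /* singular to working precision */

        /* Apply the row swap to every column (this is the P in P*A). */
        if (p != k) {
            for (int j = 0; j < n; ++j) {
                float tmp = a[k + j * n];
                a[k + j * n] = a[p + j * n];
                a[p + j * n] = tmp;
            }
        }

        /* Store the multipliers (column k of L) and update the
           trailing submatrix, which becomes part of U. */
        for (int i = k + 1; i < n; ++i) {
            a[i + k * n] /= a[k + k * n];
            for (int j = k + 1; j < n; ++j)
                a[i + j * n] -= a[i + k * n] * a[k + j * n];
        }
    }
    return 0;
}
```

For the 2 x 2 matrix with rows (2, 1) and (4, 3), the sketch swaps the two rows, stores the multiplier 0.5 below the diagonal, and leaves U with rows (4, 3) and (0, -0.5).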

17
Performance Results
18
Performance Results
19
Performance Issues
  • Much copying must be done from the CPU to GPU and
    GPU to CPU to communicate results.
  • Why not convert all pointers into GPU pointers?
  • Requires CLAPACK to run in GPU memory.
  • Could be someone's research paper

20
Other Issues
  • Floating Point Behaves Differently
  • Section 5.2 of the CUDA Programming Guide
    Discusses Deviations from IEEE-754
  • No support for denormalized numbers
  • Underflowed numbers are flushed to zero
  • We noticed some results appearing as 0.0001
    instead of 0, for example
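
The flush-to-zero behavior listed above can be emulated on the CPU when comparing results against a GPU run. The sketch below is an illustration under assumptions: the helper name flush_denormal is hypothetical, and it models only the flushing of denormalized single-precision values, not the other deviations the programming guide describes. It is also not offered as an explanation of the 0.0001 discrepancy mentioned above, which is far larger than any denormal.

```c
#include <assert.h>
#include <float.h>

/* Emulate flush-to-zero for single precision: any nonzero value whose
   magnitude is below FLT_MIN (the smallest normalized float) is a
   denormal and is replaced by zero. Normal values, zeros, infinities,
   and NaNs pass through unchanged. */
float flush_denormal(float x)
{
    float mag = x < 0.0f ? -x : x;
    if (mag != 0.0f && mag < FLT_MIN)
        return 0.0f;
    return x;
}
```

Applying such a helper to CPU reference results before comparing them with GPU output is one way to separate differences caused by flushing from differences caused by rounding.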

21
Current State
  • Investigating some interesting memory issues on
    the GPU side.
  • Allocations Mysteriously Fail.

22
Conclusions To Date
  • Small data sets are better left on the CPU.
  • GPU calculations may not be appropriate for
    scientific computing depending on needs.

23
Future Directions
  • Moving all of LAPACK into GPU
  • Resolving the copying issue
  • Perhaps resolved by unifying the CPU and GPU?
  • Want to give it a try?
  • Can't find Quadro FX 5600 on the market (MSRP $2,999)
  • GeForce 8 Series have the G80 Processor
  • GeForce 8500GT ($99.99)
  • GeForce 8800GTX ($939.99)

24
Questions