Data Analysis and Visualization - PowerPoint PPT Presentation

About This Presentation
Title:

Data Analysis and Visualization

Description:

Data Analysis and Visualization Numerical Simulations Using Programmable GPUs Stan Tomov September 5, 2003 Ising model Percolation model Implementation Performance ... – PowerPoint PPT presentation

Provided by: webEecsU98
Learn more at: https://icl.utk.edu

Transcript and Presenter's Notes

Title: Data Analysis and Visualization


1
Data Analysis and Visualization
Numerical Simulations Using Programmable GPUs
Stan Tomov
September 5, 2003
2
Outline
  • Motivation
  • Literature review
  • The graphics pipeline
  • Programmable GPUs
  • Block diagram of nVidia's GeForce FX
  • Some probability-based simulations (Monte
    Carlo simulations)
  • - Ising model
  • - Percolation model
  • Implementation
  • Performance results and analysis
  • Extensions and future work
  • Conclusions

3
Motivation
The GPUs have
  • High flops count (nVidia lists a theoretical
    speed of 200 Gflops for the NV30)
  • Competitive price/performance (0.1 cents per
    Mflop)
  • High rate of performance increase over time
    (doubling every 6 months)

Table 1. GPU vs CPU in rendering polygons.
The GPU (Quadro2 Pro) is approximately 30 times
faster than the CPU (Pentium III, 1 GHz) in
rendering polygonal data of various sizes.

Explore the possibility of extending GPUs' use to
non-graphics applications
4
Literature review
Using graphics hardware for non-graphics
applications
  • Cellular automata
  • Reaction-diffusion simulation (Mark Harris,
    University of North Carolina)
  • Matrix multiply (E. Larsen and D. McAllister,
    University of North Carolina)
  • Lattice Boltzmann computation (Wei Li, Xiaoming
    Wei, and Arie Kaufman, Stony Brook)
  • CG and multigrid (J. Bolz et al, Caltech, and N.
    Goodnight et al, University of Virginia)
  • Convolution (University of Stuttgart)

Performance results
  • Significant speedups of GPU vs CPU (30 to 60
    times, depending on the configuration) are
    reported when the GPU performs low precision
    computations
  • The fact that the operations are low precision
    is often omitted, which may be confusing
  • - NCSA, University of Illinois, assembled a
    $50,000 supercomputer out of 70 PlayStation 2
    consoles, which could theoretically
    deliver 0.5 trillion operations/second
  • - also, currently 200 GPUs are capable of
    1.2 trillion op/s
  • GPU flops performance is comparable to that of
    CPUs

5
The graphics pipeline
6
Programmable GPUs
(in particular NV30)
  • Support floating point operations
  • Vertex program
  • - Replaces fixed-function pipeline for
    vertices
  • - Manipulates single vertex data
  • - Executes for every vertex
  • Fragment program
  • - Similar to vertex program but for pixels
  • Programming in Cg
  • - High level language
  • - Looks like C
  • - Portable
  • - Cg programs are compiled to assembly code

7
Block diagram of GeForce FX
  • AGP 8x graphics bus bandwidth 2.1 GB/s
  • Local memory bandwidth 16 GB/s
  • Chip officially clocked at 500 MHz
  • Vertex processor - executes vertex shaders
    or emulates fixed transformations and
    lighting (T&L)
  • Pixel processor - executes pixel shaders or
    emulates fixed shaders - 2 int and 1 float ops
    or 2 texture accesses per clock cycle
  • Texture and color interpolators - interpolate
    texture coordinates and color values
  • Performance (on processing 4D vectors)
  • Vertex ops/sec - 1.5 Gops
  • Pixel ops/sec - 8 Gops (int), or 4 Gops
    (float)

Hardware at Digit-Life.com, NVIDIA GeForce FX, or
"Cinema show started", November 18, 2002.
8
Monte Carlo simulations
  • Used in a variety of simulations in physics,
    finance, chemistry, etc.
  • Based on probability and statistics; use random
    numbers
  • A classical example: compute the area of a circle
  • Computation of expected values
  • The number of states N can be very large: on a
    1024 x 1024 lattice of particles, with every
    particle modeled to have k states,
    N = k^(1024 x 1024)
  • Random number generation: we used a linear
    congruential type generator

(1)
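A minimal sketch of the circle-area example, driven by a linear congruential generator of the general form x_{n+1} = (a x_n + c) mod m. The constants a, c, m, the seed, and the names LCG and circle_area are assumptions chosen for illustration, not the values used in the talk.

```python
import math


class LCG:
    """Linear congruential generator: x_{n+1} = (a * x_n + c) mod m.

    The default constants are a commonly quoted 32-bit choice and are an
    assumption made for this sketch.
    """

    def __init__(self, seed=12345, a=1664525, c=1013904223, m=2**32):
        self.a, self.c, self.m = a, c, m
        self.state = seed % m

    def next_uniform(self):
        """Advance the generator and return a float in [0, 1)."""
        self.state = (self.a * self.state + self.c) % self.m
        return self.state / self.m


def circle_area(radius, samples, rng):
    """Estimate the area of a circle by sampling its bounding square."""
    hits = 0
    for _ in range(samples):
        x = (2.0 * rng.next_uniform() - 1.0) * radius
        y = (2.0 * rng.next_uniform() - 1.0) * radius
        if x * x + y * y <= radius * radius:
            hits += 1
    square_area = (2.0 * radius) ** 2
    return square_area * hits / samples


if __name__ == "__main__":
    estimate = circle_area(1.0, 100_000, LCG())
    print("estimate:", estimate, "exact:", math.pi)
```

The estimate is the fraction of random points that fall inside the circle, scaled by the area of the square; the error shrinks roughly like one over the square root of the number of samples.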
9
Ising model
  • Simplified model for magnets (introduced by
    Wilhelm Lenz in 1920 and
    further studied by his student Ernst Ising)
  • Modeled on a 2D lattice with a spin
    (corresponding to the orientation of electrons)
    at every cell, pointing up or down
  • Uses temperature to couple 2 opposing physical
    principles
  • - minimization of the system's energy
  • - entropy maximization
  • Want to compute
  • - expected magnetization
  • - expected energy
  • Evolve the system into higher probability
    states and compute
    expected values as averages over those states
  • - evolving from state to state, based on a
    certain probability decision, is related to
    so-called Markov chains
  • W. Gilks, S. Richardson, and D. Spiegelhalter
    (Editors), Markov Chain Monte Carlo in Practice,
    Chapman & Hall, 1996.

10
Ising model computational procedure
  • Choose an absolute temperature of interest T (in
    Kelvin)
  • Color the lattice in a checkerboard manner
  • Start consecutive black and white sweeps
  • Change the spin at a site based on the procedure
  • 1. Denote the current state as S, the state with
    flipped spin as S'
  • 2. Compute ΔE = E(S') - E(S)
  • 3. If ΔE < 0 accept S',
    else generate a random number R in [0, 1] and
    accept S' if R < P(S')/P(S),
    where P(S) is given by the Boltzmann
    probability distribution function
11
Percolation model
  • First studied by Broadbent and Hammersley in
    1957
  • Used in studies of disordered media (usually
    specified by a probability distribution)
  • Applied in studies of various phenomena such as
    the spread of diseases, flow in porous media,
    forest fire propagation, clustering, etc.
  • Of particular interest are
  • - the media modeling threshold after which there
    exists a spanning cluster
  • - relations between different media models
  • - time to reach a steady state spanning cluster
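As an illustration of the spanning-cluster question, here is a minimal sketch assuming site percolation on a square lattice, where each cell is occupied independently with probability p and a cluster "spans" if it connects the top row to the bottom row. The talk does not specify its media model, so the model choice and the function names are assumptions.

```python
import random
from collections import deque


def make_lattice(n, p, rng):
    """n x n lattice; each site is occupied with probability p."""
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]


def has_spanning_cluster(occ):
    """True if occupied sites connect the top row to the bottom row."""
    n = len(occ)
    seen = [[False] * n for _ in range(n)]
    queue = deque()
    for j in range(n):  # start a breadth-first search from the top row
        if occ[0][j]:
            seen[0][j] = True
            queue.append((0, j))
    while queue:
        i, j = queue.popleft()
        if i == n - 1:
            return True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = i + di, j + dj
            if 0 <= a < n and 0 <= b < n and occ[a][b] and not seen[a][b]:
                seen[a][b] = True
                queue.append((a, b))
    return False


def spanning_fraction(n, p, trials, seed=0):
    """Fraction of random lattices that contain a spanning cluster."""
    rng = random.Random(seed)
    hits = sum(has_spanning_cluster(make_lattice(n, p, rng))
               for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.6, 0.7, 0.8):
        print(p, spanning_fraction(32, p, trials=50))
```

Scanning p and watching where the spanning fraction jumps from near 0 to near 1 gives a rough estimate of the threshold for this particular model.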

12
Implementation
Approaches
  • Pure OpenGL (simulations using the
    fixed-function pipeline)
  • Shaders in assembly
  • Shaders in Cg

Dynamic texturing
  • Create a texture T (think of a 2D lattice)
  • Loop
  • - Render an image using T (in an off-screen
    buffer)
  • - Update T from the resulting image
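The dynamic texturing loop can be mimicked on the CPU to show the data flow. This is only an analogy under stated assumptions: the "texture" is a plain 2D list, render stands in for drawing into an off-screen buffer, and neighbor_average is a hypothetical per-cell function playing the role of a fragment program. No OpenGL calls are involved.

```python
def render(texture, fragment_program):
    """Produce a new image by running fragment_program once per cell."""
    n = len(texture)
    return [[fragment_program(texture, i, j) for j in range(n)]
            for i in range(n)]


def neighbor_average(texture, i, j):
    """Example per-cell program: mean of the four neighbors (periodic)."""
    n = len(texture)
    return 0.25 * (texture[(i - 1) % n][j] + texture[(i + 1) % n][j]
                   + texture[i][(j - 1) % n] + texture[i][(j + 1) % n])


def run(texture, fragment_program, frames):
    """Loop: render an image using T, then update T from that image."""
    for _ in range(frames):
        image = render(texture, fragment_program)  # off-screen render
        texture = image                            # T <- resulting image
    return texture


if __name__ == "__main__":
    T = [[0.0] * 8 for _ in range(8)]
    T[4][4] = 1.0
    T = run(T, neighbor_average, frames=5)
    print("total after 5 frames:", sum(map(sum, T)))
```

Each frame reads only the previous texture and writes a fresh image, so every cell can be computed independently, which is the property the GPU exploits.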

13
Performance results and analysis
  • Time in s. (approximate) for different vector
    flops on the GPU

             256x256   512x512
traffic      0.00063   0.0024
+, -, *, /   0.00010   0.0003
cos, sin     0.00026   0.0010
log, exp     0.00045   0.0015
if, ?        0.00016   0.0008
  • 48 B per node: speed limited by the
    GPU's memory speed (16 GB/s)

=> 3.5 Gflops
=> 20 x faster than the CPU, but the operations are
of low accuracy
  • Time in s. (approximate), including traffic, for
    different vector flops on the CPU

             256x256   512x512   1024x1024
+, -, *, /   0.0011    0.0046    0.017
cos, sin     0.0540    0.0650    0.267
log, exp     0.0609    0.1100    0.426
=> 32 B per node: speed limited by the CPU's
memory speed (4.2 GB/s)
14
Performance results and analysis
  • GPU and CPU (2.8 GHz) performance on the Ising
    model

Lattice size (not necessarily a power of 2)
                 64x64    128x128  256x256  512x512  1024x1024
GPU sec/frame    0.0006   0.0023   0.0081   0.033    0.14
CPU no opt.      0.0009   0.0024   0.0083   0.032    0.13
CPU with O4      0.0008   0.0020   0.0069   0.026    0.10
GPU instr./sec   0.55 G   0.57 G   0.66 G   0.63 G   0.61 G
  • => 2.64 Gflops, i.e. 15% of the GPU's theoretical
    power (too many ifs): for if (flag) the
    exec. time = time to compute the block even if
    flag = 0
  • Performance comparable to that of visualization
    related sample shaders from nVidia
  • Cg vs. assembly
  • - Performance is the same when using runtime Cg
    or the generated assembly code
  • - The assembly code generated is not optimal:
    we found cases where the code could be
    optimized and performance increased
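The Gflops figures quoted on the two performance slides can be reproduced with a short calculation, assuming each vector operation counts as 4 flops (one per component of a 4D vector). That factor of 4 and the helper names are assumptions inferred from the numbers, not something the slides state explicitly.

```python
COMPONENTS = 4  # assumption: one flop per component of a 4D vector


def vector_gflops(width, height, seconds):
    """Gflops for one vector operation applied to every lattice node."""
    return width * height * COMPONENTS / seconds / 1e9


def instr_to_gflops(giga_instr_per_sec):
    """Convert 4D-vector instructions per second (in G) to Gflops."""
    return giga_instr_per_sec * COMPONENTS


if __name__ == "__main__":
    # +, -, *, / on a 512x512 lattice took about 0.0003 s on the GPU.
    print("GPU arithmetic:", vector_gflops(512, 512, 0.0003), "Gflops")
    # Peak Ising rate in the table: 0.66 G instructions per second.
    print("GPU Ising:", instr_to_gflops(0.66), "Gflops")
```

With these inputs the first value comes out close to the 3.5 Gflops quoted for the vector benchmark and the second matches the 2.64 Gflops quoted for the Ising model.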

15
Extensions and future work
  • Code optimization (through optimization of Cg
    generated assembly)
  • More applications
  • - QCD ?
  • - Fluid flow ?
  • Parallel algorithms (or just as a coprocessor)
  • - domain decomposition type in cluster
    environment
  • - Motivation: communication rates CPU <->
    GPU for lattices of different sizes (time in seconds)

                              64x64    128x128  256x256  512x512  Speed
Read bdr (glReadPixels)       0.00016  0.0002   0.0006   0.0024   14 MB/s
Read all (glReadPixels)       0.00040  0.0015   0.0062   0.0250   167 MB/s
Write bdr (glDrawPixels)      0.00022  0.0003   0.0007   0.0024   14 MB/s
Write all (glTexSubImage2D)   0.00020  0.0008   0.0032   0.0120   350 MB/s
Write bdr (glTexSubImage2D)   0.00050  0.0020   0.0071   0.0250   1.3 MB/s
Not a bottleneck in a cluster with a 1 Gbit network
  • Other ideas?
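The speed column of the communication table is consistent with a simple estimate from the 512x512 timings. The sketch assumes 16 bytes per lattice node (4 components of 4 bytes each) and a boundary of about 4n nodes for an n x n lattice; both values are inferred from the numbers and are not stated on the slide.

```python
BYTES_PER_NODE = 16  # assumption: 4 components x 4 bytes per node


def rate_mb_per_s(nodes, seconds):
    """Transfer rate in MB/s for moving `nodes` lattice nodes."""
    return nodes * BYTES_PER_NODE / seconds / 1e6


n = 512
all_nodes = n * n
border_nodes = 4 * n  # assumption: the four edges, corners counted twice

rates = {
    "Read bdr (glReadPixels)": rate_mb_per_s(border_nodes, 0.0024),
    "Read all (glReadPixels)": rate_mb_per_s(all_nodes, 0.0250),
    "Write all (glTexSubImage2D)": rate_mb_per_s(all_nodes, 0.0120),
    "Write bdr (glTexSubImage2D)": rate_mb_per_s(border_nodes, 0.0250),
}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate:.1f} MB/s")
```

Under these assumptions a boundary exchange moves only about 32 KB per sweep, which is small compared with what a 1 Gbit network can carry in the same time.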

16
Conclusions
  • GPUs have a higher rate of performance increase
    over time than CPUs
  • - always appealing as research for the
    future
  • In certain applications GPUs are 30 to 60 times
    faster than CPUs
    for low precision computations (depending on
    configuration)
  • For certain floating point applications GPU
    and CPU performance is comparable
  • - can be used as coprocessor
  • GPUs are often constrained in memory, but
    preliminary results show it is feasible to use
    GPUs in parallel
  • Cg is a convenient tool (but cgc could be
    optimized)
  • It is feasible to use GPUs for numerical
    simulations
  • - we demonstrated it by implementing 2 models
    (with many applications), and
  • - used the implementation in benchmarking
    NV30 and Cg