ME964 High Performance Computing for Engineering Applications - PowerPoint PPT Presentation

1 / 48

About This Presentation

Title:

ME964 High Performance Computing for Engineering Applications

Description:

I will be in Milwaukee on Th, Mik will be the lecturer - tracing the evolution ... On grid computing (the Condor project) On the N-body problem. Midterm exam on 11/20 ... – PowerPoint PPT presentation

Number of Views:47

Avg rating:3.0/5.0

Slides: 49

Provided by: sbel3

Learn more at: https://sbel.wisc.edu

Category:

more less

Transcript and Presenter's Notes

Title: ME964 High Performance Computing for Engineering Applications

1
ME964High Performance Computing for Engineering
Applications

Parallel Programming Patterns
Discussion of Midterm Project
Oct. 21, 2008

2
Before we get started

Last Time
Parallel Programming patterns
The Finding Concurrency Design Space
The Algorithm Structure Design Space
Today
Parallel Programming patterns
Discuss the Supporting Structures Design Space
Discussion of the Midterm Project
Assigned today, due on 11/18 or 11/09 (based on
level of difficulty)
Other issues
I will be in Milwaukee on Th, Mik will be the
lecturer - tracing the evolution of the hardware
support for the graphics pipeline

2
3
Key Parallel Programming Steps

Find the concurrency in the problem
Structure the algorithm so that concurrency can
be exploited
Implement the algorithm in a suitable programming
environment
Execute and tune the performance of the code on a
parallel system

3
4
Whats Comes Next?
Focus on thisfor a while
4
5
Before We Dive In

You hover in this design space when you ponder
whether an MPI or an SPMD or a event driven
simulation environment is what it takes to
optimally design your algorithm
We are not going to focus on this process too
much since its clear that we are stuck with SPMD
All that well be doing is to compare SPMD with
other parallel programming paradigms to
understand when SPMD is good and when it is bad
Also, we will not cover the Implementation
Mechanisms design space since we dont need to
wrestle with how to spawn and join threads, how
to do synchronization, etc.
This would be Chapter 6 of the book

5
6
Supporting Structures Program and Data Models
Program Models
Data Models
SPMD
Shared Data
Master/Worker
Shared Queue
Loop Parallelism
Distributed Array
Fork/Join
These are not necessarily mutually exclusive.
6
7
Program Structuring Patterns

SPMD (Single Program, Multiple Data)
All PEs (Processor Elements) execute the same
program in parallel
Each PE has its own data
Each PE uses a unique ID to access its portion of
data
Different PE can follow different paths through
the same code
This is essentially the CUDA Grid model
Master/Worker
Loop Parallelism
Fork/Join

7
8
Program Structuring Patterns(the rest of the
pack)

Master/Worker
A Master thread sets up a pool of worker threads
and a bag of tasks
Workers execute concurrently, removing tasks
until done
Loop Parallelism
Loop iterations execute in parallel
FORTRAN do-all (truly parallel), do-across (with
dependence)
Very common in OpenMP
Fork/Join
Most general way of creation of threads (the
POSIX standard)
Can be regarded as a very low level approach in
which you use the OS to manage parallelism

8
9
Review Algorithm Structure
Algorithm Design
Organize by Task
Organize by Data
Organize by Data Flow
Linear
Recursive
Linear
Recursive
Regular
Irregular
Task Parallelism
Divide and Conquer
Geometric Decomposition
Recursive Data
Pipeline
Event Driven
9
10
Algorithm Structures vs. Supporting Structures
Task Parallel. Divide/Conquer Geometric Decomp. Recursive Data Pipeline Event-based
SPMD ???? ??? ???? ?? ??? ??
Loop Parallel ???? ?? ???
Master/ Worker ???? ?? ? ? ? ?
Fork/ Join ?? ???? ?? ???? ????
Four smilies is the best (Source Mattson, et al.)
10
11
Supporting Structures vs. Program Models
OpenMP MPI CUDA
SPMD ??? ???? ????
Loop Parallel ???? ?
Master/ Slave ?? ???
Fork/Join ???
11
12
More on SPMD

Dominant coding style of scalable parallel
computing
MPI code is mostly developed in SPMD style
Many OpenMP code is also in SPMD (next to loop
parallelism)
Particularly suitable for algorithms based on
data parallelism, geometric decomposition, divide
and conquer.
Main advantage
Tasks and their interactions visible in one piece
of source code, no need to correlated multiple
sources

SPMD is by far the most commonly used pattern for
structuring parallel programs.
12
13
Typical SPMD Program Phases

Initialize
Establish localized data structure and
communication channels
Obtain a unique identifier
Each thread acquires a unique identifier,
typically in the range from 0 to N-1, where N is
the number of threads.
Both OpenMP and CUDA have built-in support for
this.
Distribute Data
Decompose global data into chunks and localize
them, or
Sharing/replicating major data structure using
thread ID to associate subset of the data to
threads
Run the core computation
More details in next slide
Finalize
Reconcile global data structure, prepare for the
next major iteration

13
14
Core Computation Phase

Thread IDs are used to differentiate behavior of
threads
Use thread ID in loop index calculations to split
loop iterations among threads
Use conditions based on thread ID to branch to
their specific actions

Both can have very different performance results
and code complexity depending on the way they are
done.
14
15
A Simple Example

Assume
The computation being parallelized has 1,000,000
iterations.
Sequential code

Num_steps 1000000 for (i0 ilt num_steps,
i)
15
16
SPMD Code Version 1

Assign a chunk of iterations to each thread
The last thread also finishes up the remainder
iterations

num_steps 1000000 i_start my_id
(num_steps/num_threads) i_end i_start
(num_steps/num_threads) if (my_id
(num_threads-1)) i_end num_steps for (i
i_start i lt i_end i) . Reconciliation of
results across threads if necessary.
16
17
Problems with Version 1

The last thread executes more iterations than
others
The number of extra iterations is up to the total
number of threads 1
This is not a big problem when the number of
threads is small
When there are thousands of threads, this can
create serious load imbalance problems.
Also, the extra if statement is a typical source
of branch divergence in CUDA programs.

17
18
SPMD Code Version 2

Assign one more iterations to some of the threads

int rem num_steps num_threads i_start
my_id (num_steps/num_threads) i_end i_start
(num_steps/num_threads) if (rem ! 0) if
(my_id lt rem) i_start my_id i_end
(my_id 1) else i_start rem i_end
rem .
Less load imbalance More branch divergence.
18
19
SPMD Code Version 3

Use cyclic distribution of iteration

num_steps 1000000 for (i my_id i lt
num_steps i num_threads) .
Less load imbalance Loop branch divergence in
the last Warp Data padding further eliminates
divergence.
19
20
Comparing the Three Versions
Version 1
ID0
ID1
ID3
ID2
Version 2
ID0
ID1
ID3
ID2
Version 3
ID0
ID1
ID3
ID2
Padded version1 1 may be best for some data
access patterns.
20
21
Data Styles

Shared Data
All threads share a major data structure
This is what CUDA supports
Shared Queue
All threads see a thread safe queue that
maintains ordering of data communication
Very relevant in conjunction with the
Master/Worker scenarios
Distributed Array
Decomposed and distributed among threads
Limited support in CUDA Shared Memory
Typically an issue in NUMA layouts

21
22
End Parallel Programming Patterns Begin
Overview Reminder of Semester
22
23
The Reminder of the Semester

There will be one more lecture on CUDA (in
November)
There will be three lectures on MPI
There will be about four to five guest lectures
On massively parallel hardware design
On grid computing (the Condor project)
On the N-body problem
Midterm exam on 11/20
Midterm Project (assigned today, due on 11/17 or
sooner)
Ill explain why sooner shortly
Final Project (due on last Friday, Dec. 19, at
1159 PM)

23
24
Midterm Project

The topic Produce a Collision Detection engine
Deal with spheres or ellipsoids
There will be two levels of difficulty
The entry level
You implement a simple brute force approach
Deal only with spheres of various radii
Due Date November 9, 1159 PM
The advanced level
Apply a sophisticated design to produce a
parallel algorithm that is very efficient
Deal with ellipsoids
Note that spheres are a particular case of
ellipsoids
Due Date November 17, 1159 PM
Most likely make this engine a part of your Final
Project
Independent of the level embraced, youll be
presenting your work on November 18, during
regular class hours
Youll have 10 minutes to make a PowerPoint
presentation. The presentations are part of the
project (to be submitted for grading)

24
25
Final Project

Work on something related to your research
I can assist you with selecting a topic
I will help you with the design
There is a default Final Project, in case you
cant choose

26
The Final Project Further Comments

There is a connection between the Midterm and
Final Projects
It is up to you to decide what you want to work
on for the Final Project
However, if you dont know what to work on, Ill
assign you a Granular Dynamics problem (this is
the default Final Project)
An important part of this problem is the
collision detection stage

Granular Dynamics Problem
26
27
The Final Project Concluding Remarks

To conclude, the default Final Project will look
like this

Final Project Midterm Project Some Extra
Headache
Granular Material Dynamics

Keep in mind
If you plan to do the default final project,
think ahead and implement a collision detection
engine that is suitable to a scenario where the
bodies slightly change their location each time
you have to do the contact detection
This is because in granular flow the particles
move in space
Maybe you can run a cheaper collision detection
after an initial exhaustive one
Youll need to look at problems with millions of
colliding bodies

27
28
End Overview Reminder of Semester Begin
Midterm Project
28
29
The Collision Problem(The Midterm Project)

Problem Statement
A large set of spheres or ellipsoids is given to
you
Provide collision state information for each pair
of bodies that are colliding
The bodies that you are dealing with are of
spherical shape
Spheres are of arbitrary radius R, where Rlow lt R
lt Rhigh
The bound values Rlow and Rhigh are given to you
Their location is given to you
NOTE The ellipsoid case is similar (skipped here
the discussion)
You will have to validate your results against
Bullet, which is an open source game
development tool.

29
30
Proposed Workflow
The proposed workflow below will be captured in a
Visual Studio Solution (with three Projects)
that will be provided to you (zipped file). You
are responsible for the part in red font below.

Generate an initial distribution of the spheres
(part of Project 1)
Load data (part of Project 2)
Run collision detection on the device (part of
Project 2)
Save results in a file (part of Project 2)
Run collision detection using the Bullet library
(part of Project 3)
The utility will tell you if you passed or
failed (part of Project 3)

30
31
Project template

Solution contains three projects
First project is used to generate the input data
(distribution of the spheres along with their
radius)
Second project loads input data and checks for
collisions
This is the project you modify
The main driver calls a function that invokes the
collision detection kernel
Pretty much like youve had to do in your HW
Third project does the Bullet part followed up by
the validation
Uses C function notation
Main file uses C classes and objects
Allows C and C code to work together

31
32
Spheres Possible Contact Cases

Convention
Color code for object A bluish
Color code for object B greenish
Sphere contacts sphere
One contact point
Normal - from center of B to center of A
Sphere interpenetrates sphere
Two contact points
Points on objects furthest in
Normal - from center of B to center of A

A
B
A
B
32
33
Possible Contact Cases (Contd.)

Sphere exactly on top of sphere
Two points returned
Points on surface that intersect the x-axis
Normal is on x axis
Sphere inside sphere
Two points returned
Points on surface that intersect the x-axis
Normal is on x axis
Spheres are not in contact
Nothing to report, ignore

B
A
A
B
B
A
33
34
Bullet Physics Library

Bullet is a cross platform open source physics
library.
Can simulate many types of objects
Rigid bodies (concave, convex, triangular mesh,
etc.)
Cloth, soft bodies
Supports joints and constraints
Geared towards game development.
Speed vs. accuracy comes on the side of speed
Bullets collision detection will be used to
verify your results

34
35
Screenshot, Bullet
35
36
Relevant Variables

All variables retuned per contact by Bullet
btVector3 localPointA
btVector3 localPointB
btVector3 positionWorldOnB
btVector3 positionWorldOnA
btVector3 normalWorldOnB
btScalar distance1
btVector3 lateralFrictionDir1
btVector3 lateralFrictionDir2
int partId0
int partId1
int index0
int index1
btScalar combinedFriction
btScalar combinedRestitution
void userPersistentData
btScalar appliedImpulse
bool lateralFrictionInitialized
int lifeTime

Variables you need to return and calculate
btVector3 positionWorldOnA
btVector3 positionWorldOnB
btVector3 normalWorldOnB
int objectIdA (note this is the Id associated
with A)
int objectIdB (note this is the Id associated
with B)
Additional parameters for Final Project
localPointA
localPointB
btVector3 lateralFrictionDir1
btVector3 lateralFrictionDir2
distance1

36
37
Variable Description

positionWorldOnB (float3) Absolute Collision
location on object A
positionWorldOnA (float3) Absolute Collision
location on object B
normalWorldOnB (float3) Collision Normal Vector
from B to A
objectIdA (int) Object A ID
objectIdB (int) Object B ID (Id starts from 0)
Other parameters that need to be implemented for
Final Project (if you take this route)
localPointA (float3) Collision point w/ respect
to center of object A
localPointB (float3) Collision point w/ respect
to center of object B
distance1 (float) Distance between collision
points
lateralFrictionDir1 (float3) Vector orthogonal
to normal and Dir2
lateralFrictionDir2 (float3) Vector orthogonal
to normal and Dir1

37
38
The Passed/Failed Issue

The comparison is going to be made based on
The contact location expressed in global
reference frame
There will be an epsilon value that controls this
test
The contact normal
Your normal and Bullets normal should be within
10
For comparison purposes, your contacts should be
ordered such that
I still owe you two things
How are body A and body B chosen?
What to do with this collision case

38
39
Common Sense Advice

Donald Knuth (1974) Premature optimization is
the root of all evil
Get it to run first, add the bells and whistles
later on
Embrace a Crawl Walk Run approach

39
40
Collision Detection Some Possible Approaches

Brute force simplest and slowest
Ok to implement, unless you want to make this
your final project, in which case youll have to
implement one of the other three methods below
It relies on an exhaustive search N(N-1)/2 tests
(given N bodies)
Baraff sorting based (the method is sometimes
called culling
Harada rely on voxel and texture, not
recommended
Scott Le Grand spatial subdivision method

40
41
References

The methods of Scott Le Grand and Harada are
explained in great detail in GPU Gems 3 (you can
borrow it from me)
I also have the following references (found them
on the web, contact me if you need them)

Naga, K. G., R. Stephane, C. L. Ming, and M.
Dinesh, 2003 CULLIDE interactive collision
detection between complex models in large
environments using graphics hardware. Proceedings
of the ACM SIGGRAPH/EUROGRAPHICS conference on
Graphics hardware, Eurographics Association.
Baraff, D., 1997 An Introduction to Physically
Based Modeling Rigid Body Simulation
IINonpenetration Constraints. SIGGRAPH Course
Notes.
41
42
Brute Force

Iterate over all combinations of object pairs and
perform distance checks on them
Store pairs that are colliding into list and
compute required information
Iterate over each pair and calculate additional
contact data.

42
43
Baraff Sorting axis
p6
p5
p3
p4
p1
p2
b3
b6
e6
e1
b5
b2
e3
b4
e5
e4
e2
b1

Every time an object p begins on the axis add p
to stack
Every time an object p ends on the axis remove p
from stack
Each time an object is added to stack compare it
to those still on stack
p1 check with p3 and p6
p5 check with p3
An object is colliding with another if they
overlap on all three axes.
Multi GPU relevant yes, it can speed up the
problem by simultaneously culling based on more
axes

43
44
Example Why this is tricky at times
(The issue of false positives)
44
45
Harada , GPU Gems 3

Mostly meant to deal with spheres, and of the
same radii
Voxel collision method
Voxel 3d pixel
Spheres are used to represent other objects
Fill objects space using spheres
Use many 2D textures to store data.
Textures make up 3d grid
Optimize by using texture memory
Easier to coalesce reads and writes

45
46
Harada (Contd.)

Each voxel in texture contains up to 4 object
ids
Stored in RGBA values
Objects and cells need to be sized so that this
is true
Constant radius is preferred but some variation
can exist
Too much variation could cause instances with gt4
objects in one cells space
Test each cell with its 26 neighbors
Iterate over all cells in this way

46
47
Scott Le Grand, GPU Gems 3

Spatial subdivision method
Divide space into three dimensional grid
Collision state of spheres cannot be modified by
different threads
Cells numbered by processing order (1, 2,)
First all 1 cells are processed, then 2 cells
and so on
Insures that two threads do not overwrite
collision data for one sphere
Spheres get assigned a home cell id and
overlapping phantom cell ids
Array gets sorted by cell number and type

47
48
Scott Le Grand (Contd.)

From this list determine which cells have
multiple objects inside
These cells will be checked for collisions
Task is simplified because cell id list is
already sorted by cell number
By checking for a change in cell id the number of
objects in that cell can be determined.
If cell only has phantom objects it will not be
checked
Phantom objects - objects whose centroid is not
in that cell but still intersects it
Cell id must have at least one home id and gt1
objects to be checked
Multi GPU relevant yes, it can increase the size
of the problem being addressed