Title: Shared Memory Parallel Programming
1. Shared Memory Parallel Programming
2. OpenMP Overview
- OpenMP: An API for Writing Multithreaded Applications
- A set of compiler directives and library routines for parallel application programmers
- Greatly simplifies writing multi-threaded (MT) programs in Fortran, C and C++
- Standardizes the last 20 years of SMP practice
[Slide background: a sampler of OpenMP directives and runtime calls, e.g. C$OMP PARALLEL DO, #pragma omp critical, C$OMP ATOMIC, omp_set_lock(lck), call OMP_SET_NUM_THREADS(10), setenv OMP_SCHEDULE dynamic]
The name OpenMP is the property of the OpenMP Architecture Review Board.
3. How to Get Started?
- First thing: figure out what takes the time in your sequential program -> profile it!
- Typically, a few parts (a few loops) take the bulk of the time.
- Parallelize those parts first, worrying about granularity and load balance (a minimal sketch follows this slide).
- Advantage of shared memory: you can do that incrementally.
- Then worry about locality.
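As a concrete illustration of that incremental first step, here is a minimal sketch (the arrays, sizes, and the loop itself are illustrative, not from the slides): profiling points at one hot loop, and a single parallel for directive parallelizes just that loop while the rest of the program stays sequential.

  #include <stdio.h>
  #include <omp.h>

  #define N 1000000

  int main(void)
  {
      static double a[N], b[N], c[N];

      /* Suppose profiling showed this loop dominates the run time.
         Parallelizing only this loop is the incremental first step. */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          c[i] = a[i] + 2.0 * b[i];

      printf("c[0] = %f\n", c[0]);
      return 0;
  }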
4. Factors that Determine Speedup
- Amount of sequential code.
- Characteristics of parallel code:
  - granularity
  - load balancing
  - locality
    - uniprocessor
    - multiprocessor
  - synchronization and communication
5. Major Performance Impact
- Amdahl's law tells us that we need to avoid serial bottlenecks in code if we are to achieve scalable parallelism.
- If 1% of a program is serial, speedup is limited to 100x, no matter how many processors it is computed on (see the formula below).
- We must profile codes very carefully to find out how much of them is sequential.
Time = t_seq + t_par / p   (on p threads)
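The 100x limit quoted above follows directly from Amdahl's law; a short restatement of the standard formula (not on the slide) using the same notation:

  % s = serial fraction of the work, p = number of threads
  \[
    T(p) = t_{\mathrm{seq}} + \frac{t_{\mathrm{par}}}{p},
    \qquad
    \mathrm{Speedup}(p) = \frac{T(1)}{T(p)} = \frac{1}{s + (1-s)/p}
    \longrightarrow \frac{1}{s} \quad (p \to \infty)
  \]
  % With s = 0.01 (1% serial), the asymptotic speedup is 1/0.01 = 100.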
6. Major Performance Impact
- The equation is very coarse:
  - Some code might be replicated on every thread; it is essentially sequential.
  - There are some overheads that are not present in the sequential program.
- A parallel program may use cache better than sequential code:
  - Since there is overall more cache available.
  - Occasionally this leads to superlinear speedup.
Time = t_seq + t_par / p   (on p threads)
7. Uniprocessor Memory Hierarchy
[Diagram: memory hierarchy above the CPU, with access time and size per level]
  L1 cache:  ~2 cycles,   32-128 KB
  L2 cache:  ~20 cycles,  256-512 KB
  memory:    ~100 cycles, 128 MB and up
8. Shared Memory
[Diagram: two CPUs, each with a private L1 cache (~2 cycles) and L2 cache (~20 cycles), both connected to one shared memory (~100 cycles)]
9. Distributed Shared Memory
[Diagram: two nodes, each with a CPU, private L1 cache (~2 cycles), L2 cache (~20 cycles) and local memory; accessing memory on the other node takes 100s of cycles]
10. Recall Locality
- Locality (or re-use): the extent to which a thread continues to use the same data or nearby data.
- Temporal locality: if you have accessed a particular word in memory, you access that word again (before the line gets replaced).
- Spatial locality: if you access a word in a cache line, you access other word(s) in that cache line before it gets replaced (see the loop-order sketch below).
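A common way to picture spatial locality (an illustrative example, not from the slides) is the traversal order of a 2-D array in C: row-major order touches consecutive words of each cache line, while column-major order touches a different line on almost every access.

  #define N 1024
  double a[N][N];

  /* Good spatial locality: the inner loop walks consecutive
     elements of a row, so each cache line is used fully. */
  double sum_row_major(void)
  {
      double sum = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              sum += a[i][j];
      return sum;
  }

  /* Poor spatial locality: the inner loop strides down a column,
     so successive accesses land in different cache lines. */
  double sum_col_major(void)
  {
      double sum = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              sum += a[i][j];
      return sum;
  }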
11. Bottom Line
- To get good performance:
  - You have to have a high hit rate.
  - You have to continue to access data close to the data that you accessed recently.
- Each thread must access its data well.
- This is much more important than for sequential code, since access costs are higher.
- The penalty for getting it wrong can be severe.
12. Scalability
- Scalability: how the performance of the program behaves for large p.
- If performance grows roughly in proportion to p, the code is scalable.
- In practice, some reasonable growth may be sufficient.
- Often extremely difficult to achieve.
13. Granularity
- Granularity: the size of the piece of code that is executed by a single processor.
- May be a statement, a single loop iteration, a set of loop iterations, etc.
- Fine granularity leads to:
  - (positive) ability to use lots of processors
  - (positive) finer-grain load balancing
  - (negative) increased overhead
- The appropriate size may depend on the hardware.
14. Load Balance
- Load imbalance: the difference in execution time between threads between synchronization points.
- Sum up the minimal times, sum up the maximal times; the difference is the total imbalance.
- Unpredictable for some codes.
- For these, we have dynamic and guided schedules.
15. Load in Molecular Dynamics
  for some number of timesteps {
    #pragma omp parallel for
    for( i = 0; i < num_mol; i++ )
      for( j = 0; j < count[i]; j++ )
        force[i] += f( loc[i], loc[index[j]] );

    #pragma omp parallel for
    for( i = 0; i < num_mol; i++ )
      loc[i] = g( loc[i], force[i] );
  }
How much work is there? May have poor load balance if the number of neighbors varies a lot.
16. Better Load Balance
- Rewrite to assign iterations of the first loop nest such that each thread has the same number of neighbors.
- Extra overhead: we would have to compute this repeatedly, as the neighbor list can change during the computation.
- Use an explicit schedule to assign the work.
Is it worth it? Can we express the desired schedule? (A cheaper alternative is sketched below.)
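Before hand-crafting an explicit schedule, it is often enough to let the runtime balance the uneven iterations. A fragment reusing the variable names from the loop on the previous slide (the chunk size of 32 is a guess to be tuned against scheduling overhead):

  /* Hand out chunks of iterations to threads as they become free,
     so molecules with many neighbors do not pile up on one thread. */
  #pragma omp parallel for schedule(dynamic, 32)
  for (int i = 0; i < num_mol; i++)
      for (int j = 0; j < count[i]; j++)
          force[i] += f(loc[i], loc[index[j]]);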
17. Parallel Code Performance Issues
- Concurrency: we want simultaneous execution of as many actions as possible.
- Data locality: a high percentage of memory accesses are local (to the local cache or memory).
- Load balance: the same amount of work is performed on each processor.
- Scalability: the ability to exploit increasing numbers of processors (resources) with good efficiency.
18. Some Conclusions
- You need to parallelize most of the code.
- But also minimize the overheads introduced by parallelization.
- Avoid having threads idle.
- Take care that the parallel code makes good use of the memory hierarchy at all levels.
What size problems are worth parallel execution? How important is ease of code maintenance?
19. Performance / Scalability Hindrances
- Too fine grained
  - Symptom: high overhead
  - Caused by: small parallel/critical regions
- Overly synchronized
  - Symptom: high overhead
  - Caused by: large synchronized sections
  - Are the dependencies real?
- Load imbalance
  - Symptom: large wait times
  - Caused by: uneven work distribution
20. Specific OpenMP Improvements
- Too many parallel regions
  - Reduce overheads and improve cache use by merging them (see the sketch below).
  - May need to use the single directive for sequential parts.
- Too many barriers
  - Can you remove some (by using nowait)?
- Too much work in a critical region
  - Can you remove some, or create multiple critical regions?
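A sketch of merging two parallel regions (the function, arrays, and printf message are illustrative, not from the slides): the serial statement between the two loops stays inside one enlarged region under a single construct, so the thread team is created only once.

  #include <stdio.h>

  void pipeline(double *a, double *b, int n,
                double init(int), double work(double))
  {
      #pragma omp parallel
      {
          #pragma omp for
          for (int i = 0; i < n; i++)
              a[i] = init(i);

          /* Sequential part kept inside the merged region: one thread
             executes it, the others wait at single's implicit barrier. */
          #pragma omp single
          printf("initialization done\n");

          #pragma omp for
          for (int i = 0; i < n; i++)
              b[i] = work(a[i]);
      }
  }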
21. Tuning: Critical Sections
- It often helps to chop up large critical sections into finer, named ones.
Original code:
  #pragma omp critical (foo)
  {
    update( a );
    update( b );
  }
Transformed code:
  #pragma omp critical (foo_a)
  update( a );
  #pragma omp critical (foo_b)
  update( b );
- Still need to avoid the wait at the first critical!
22. Tuning: Locks Instead of Critical
Original code:
  #pragma omp critical
  {
    for( i = 0; i < n; i++ ) {
      a[i] = ...
      b[i] = ...
      c[i] = ...
    }
  }
- Idea: cycle through different parts of the array using locks!
Transformed code:
  jstart = omp_get_thread_num();
  for( k = 0; k < nlocks; k++ ) {
    j = ( jstart + k ) % nlocks;
    omp_set_lock( &lck[j] );
    for( i = lb[j]; i < ub[j]; i++ ) {
      a[i] = ...
      b[i] = ...
      c[i] = ...
    }
    omp_unset_lock( &lck[j] );
  }
- Adapt to your situation (lock setup is sketched below).
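The transformed code assumes the lock array has already been created. A minimal sketch of that setup (the names lck and nlocks come from the slide; NLOCKS and the helper functions are assumptions):

  #include <omp.h>

  #define NLOCKS 16                  /* nlocks: number of array sections */
  omp_lock_t lck[NLOCKS];

  void init_locks(void)
  {
      for (int j = 0; j < NLOCKS; j++)
          omp_init_lock(&lck[j]);    /* must happen before any omp_set_lock */
  }

  void destroy_locks(void)
  {
      for (int j = 0; j < NLOCKS; j++)
          omp_destroy_lock(&lck[j]);
  }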
23. Tuning: Eliminate Implicit Barriers
- Remember: work-sharing constructs have an implicit barrier at the end.
- When consecutive work-sharing constructs modify (and use) different objects, the barrier in the middle can be eliminated (see the sketch below).
- When the same object is modified (or used), the barrier can be safely removed only if the iteration spaces align.
- On most systems this will be OK with OpenMP 3.0.
24. Problems with Data Accesses
- Potential problems:
  - Contention for access to memory
  - Too much remote data
  - Frequent false sharing
  - Poor sequential access pattern
- Some remedies:
  - Can you privatize data to make it local (and remove false sharing)?
  - Can you use OS features to pin data and memory down on a big system?
  - First-touch default placement, dplace, next_touch
- Might need to revisit the parallelization strategy.
25. Parallelism worth it?
- When would parallelizing this loop help?
    DO I = 1, N
      A(I) = 0
    ENDDO
- Some issues to consider:
  - Would it help increase the size of a parallel region?
  - Number of threads/processors being used
  - Bandwidth of the memory system
  - Value of N
    - Very large N, so A is not cache contained
  - Placement of object A
    - If distributed onto different processor caches, or about to be distributed
    - On NUMA systems, when using a first-touch policy for placing objects, to achieve a certain placement for object A (see the sketch below)
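On a first-touch NUMA system, parallelizing even this trivial initialization decides where the pages of A end up. A C sketch of the idea (the array name follows the slide; the size and schedule are assumptions):

  #define N (1 << 22)            /* large enough to span many pages */
  static double A[N];

  /* With first-touch page placement, the thread that first writes a page
     owns it. Initializing A in parallel with the same static schedule as
     the later compute loops puts each page near the thread that uses it. */
  void first_touch_init(void)
  {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < N; i++)
          A[i] = 0.0;
  }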
26. Too fine grained?
- When would parallelizing this loop help?
    DO I = 1, N
      SUM = SUM + A(I) * B(I)
    ENDDO
- Know your compiler!
- Some issues to consider:
  - Number of threads/processors being used
  - How are reductions implemented? (See the sketch below.)
    - Atomic, critical, expanded scalar, logarithmic
  - All the issues from the previous slide about the existing distribution of A and B
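In OpenMP the usual way to write this loop is with a reduction clause; whether the partial sums are combined with atomics, a critical section, expanded scalars, or a logarithmic tree is then left to the compiler and runtime. A C sketch of the equivalent loop (function and parameter names assumed):

  double dot(const double *a, const double *b, int n)
  {
      double sum = 0.0;

      /* Each thread accumulates a private partial sum; the runtime
         combines the partials at the end of the loop. */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++)
          sum += a[i] * b[i];

      return sum;
  }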
27. Tuning: Load Balancing
    do I = 1, N
      do J = I, M
        ...
- A notorious problem for triangular loops like the one above.
- Within a parallel do/for, use the schedule clause (see the sketch below).
- Remember: dynamic is much more expensive than static.
- Chunked static can be very effective for load imbalance.
- When dealing with consecutive dos/fors, nowait can help, but be careful to avoid data races.
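A C sketch of the triangular loop with a chunked static schedule (the work function and the chunk size of 8 are assumptions to be tuned):

  /* Row i does roughly (m - i) inner iterations, so a plain static split
     gives the first threads far more work than the last. Interleaving
     small static chunks spreads long and short rows evenly across threads. */
  double triangular(int n, int m, double (*work)(int, int))
  {
      double total = 0.0;
      #pragma omp parallel for schedule(static, 8) reduction(+:total)
      for (int i = 0; i < n; i++)
          for (int j = i; j < m; j++)
              total += work(i, j);
      return total;
  }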
28. Load Imbalance: Thread Profiler
- Thread Profiler is a performance analysis tool from Intel Corporation.
- There are a few tools out there for use with OpenMP.
29. High Performance Tuning
- To fine-tune for high performance, break all the good software engineering rules (i.e., stop being so portable).
- Step 1:
  - Know your application.
  - For best performance, also know your compiler, performance tool, and hardware.
- The safest pragma to use is parallel do/for.
  - Sections can introduce load imbalance.
- There is usually more than one way to express the desired parallelism.
- So how do you pick which constructs to use? Trade off level of performance vs. portability.
30. Understand the Overheads!
[Table of approximate overhead numbers for OpenMP constructs, just to illustrate.]
31. Performance Optimization Summary
- Getting maximal performance is difficult.
- Might involve changing:
  - Some aspect of the application
  - Parallelization strategy
  - Directives used
  - Compiler flags used
  - Use of OS features
- Requires understanding of the potential problems.
- Not always easy to pinpoint the actual problem.
32. Cache Ping-Ponging: Varying Times for Sequential Regions
[Figure: bar chart of run time per serial region for three runs of the same program (4, 2 and 1 threads); each set of three bars is one serial region.]
- Why does the run time change for the serial regions?
- No reason pinpointed yet.
- Time to think!
  - Thread migration?
  - Data migration?
  - Overhead?