Title: Shared Memory Parallel Programming
1. Shared Memory Parallel Programming
2. OpenMP Overview
- OpenMP: An API for Writing Multithreaded Applications
- A set of compiler directives and library routines for parallel application programmers
- Greatly simplifies writing multi-threaded (MT) programs in Fortran, C and C++
- Standardizes the last 20 years of SMP practice
[Slide background: a sampler of OpenMP directives and runtime calls, e.g. C$OMP PARALLEL DO, #pragma omp critical, C$OMP ATOMIC, omp_set_lock(lck), call OMP_SET_NUM_THREADS(10), setenv OMP_SCHEDULE dynamic]
The name OpenMP is the property of the OpenMP Architecture Review Board.
3. How to Get Started?
- First thing: figure out what takes the time in your sequential program -> profile it!
- Typically, a few parts (a few loops) take the bulk of the time.
- Parallelize those parts first, worrying about granularity and load balance (a minimal sketch follows this slide).
- Advantage of shared memory: you can do that incrementally.
- Then worry about locality.
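As a concrete illustration of that incremental first step, here is a minimal sketch (the arrays, sizes, and the loop itself are illustrative, not from the slides): profiling points at one hot loop, and a single parallel for directive parallelizes just that loop while the rest of the program stays sequential.

  #include <stdio.h>
  #include <omp.h>

  #define N 1000000

  int main(void)
  {
      static double a[N], b[N], c[N];

      /* Suppose profiling showed this loop dominates the run time.
         Parallelizing only this loop is the incremental first step. */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          c[i] = a[i] + 2.0 * b[i];

      printf("c[0] = %f\n", c[0]);
      return 0;
  }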
4. Factors that Determine Speedup
- Amount of sequential code.
- Characteristics of parallel code:
  - granularity
  - load balancing
  - locality
    - uniprocessor
    - multiprocessor
  - synchronization and communication
5. Major Performance Impact
- Amdahl's law tells us that we need to avoid serial bottlenecks in code if we are to achieve scalable parallelism.
- If 1% of a program is serial, speedup is limited to 100x, no matter how many processors it is computed on (see the formula below).
- We must profile codes very carefully to find out how much of them is sequential.
Time = t_seq + t_par / p   (on p threads)
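The 100x limit quoted above follows directly from Amdahl's law; a short restatement of the standard formula (not on the slide) using the same notation:

  % s = serial fraction of the work, p = number of threads
  \[
    T(p) = t_{\mathrm{seq}} + \frac{t_{\mathrm{par}}}{p},
    \qquad
    \mathrm{Speedup}(p) = \frac{T(1)}{T(p)} = \frac{1}{s + (1-s)/p}
    \longrightarrow \frac{1}{s} \quad (p \to \infty)
  \]
  % With s = 0.01 (1% serial), the asymptotic speedup is 1/0.01 = 100.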
6. Major Performance Impact
- The equation is very coarse:
  - Some code might be replicated on every thread; it is essentially sequential.
  - There are some overheads that are not present in the sequential program.
- A parallel program may use cache better than sequential code:
  - Since there is overall more cache available.
  - Occasionally this leads to superlinear speedup.
Time = t_seq + t_par / p   (on p threads)
7. Uniprocessor Memory Hierarchy
[Diagram: memory hierarchy above the CPU, with access time and size per level]
  L1 cache:  ~2 cycles,   32-128 KB
  L2 cache:  ~20 cycles,  256-512 KB
  memory:    ~100 cycles, 128 MB and up
8. Shared Memory
[Diagram: two CPUs, each with a private L1 cache (~2 cycles) and L2 cache (~20 cycles), both connected to one shared memory (~100 cycles)]
9. Distributed Shared Memory
[Diagram: two nodes, each with a CPU, private L1 cache (~2 cycles), L2 cache (~20 cycles) and local memory; accessing memory on the other node takes 100s of cycles]
10. Recall Locality
- Locality (or re-use): the extent to which a thread continues to use the same data or nearby data.
- Temporal locality: if you have accessed a particular word in memory, you access that word again (before the line gets replaced).
- Spatial locality: if you access a word in a cache line, you access other word(s) in that cache line before it gets replaced (see the loop-order sketch below).
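A common way to picture spatial locality (an illustrative example, not from the slides) is the traversal order of a 2-D array in C: row-major order touches consecutive words of each cache line, while column-major order touches a different line on almost every access.

  #define N 1024
  double a[N][N];

  /* Good spatial locality: the inner loop walks consecutive
     elements of a row, so each cache line is used fully. */
  double sum_row_major(void)
  {
      double sum = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              sum += a[i][j];
      return sum;
  }

  /* Poor spatial locality: the inner loop strides down a column,
     so successive accesses land in different cache lines. */
  double sum_col_major(void)
  {
      double sum = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              sum += a[i][j];
      return sum;
  }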
11. Bottom Line
- To get good performance:
  - You have to have a high hit rate.
  - You have to continue to access data close to the data that you accessed recently.
- Each thread must access its data well.
- This is much more important than for sequential code, since access costs are higher.
- The penalty for getting it wrong can be severe.
12. Scalability
- Scalability: how the performance of the program behaves for large p.
- If performance grows roughly in proportion to p, the code is scalable.
- In practice, some reasonable growth may be sufficient.
- Often extremely difficult to achieve.
13. Granularity
- Granularity: the size of the piece of code that is executed by a single processor.
- May be a statement, a single loop iteration, a set of loop iterations, etc.
- Fine granularity leads to:
  - (positive) ability to use lots of processors
  - (positive) finer-grain load balancing
  - (negative) increased overhead
- The appropriate size may depend on the hardware.
14. Load Balance
- Load imbalance: the difference in execution time between threads between synchronization points.
- Sum up the minimal times, sum up the maximal times; the difference is the total imbalance.
- Unpredictable for some codes.
- For these, we have dynamic and guided schedules.
15. Load in Molecular Dynamics
  for some number of timesteps {
    #pragma omp parallel for
    for( i = 0; i < num_mol; i++ )
      for( j = 0; j < count[i]; j++ )
        force[i] += f( loc[i], loc[index[j]] );

    #pragma omp parallel for
    for( i = 0; i < num_mol; i++ )
      loc[i] = g( loc[i], force[i] );
  }
How much work is there? May have poor load balance if the number of neighbors varies a lot.
16. Better Load Balance
- Rewrite to assign iterations of the first loop nest such that each thread has the same number of neighbors.
- Extra overhead: we would have to compute this repeatedly, as the neighbor list can change during the computation.
- Use an explicit schedule to assign the work.
Is it worth it? Can we express the desired schedule? (A cheaper alternative is sketched below.)
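Before hand-crafting an explicit schedule, it is often enough to let the runtime balance the uneven iterations. A fragment reusing the variable names from the loop on the previous slide (the chunk size of 32 is a guess to be tuned against scheduling overhead):

  /* Hand out chunks of iterations to threads as they become free,
     so molecules with many neighbors do not pile up on one thread. */
  #pragma omp parallel for schedule(dynamic, 32)
  for (int i = 0; i < num_mol; i++)
      for (int j = 0; j < count[i]; j++)
          force[i] += f(loc[i], loc[index[j]]);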
17. Parallel Code Performance Issues
- Concurrency: we want simultaneous execution of as many actions as possible.
- Data locality: a high percentage of memory accesses are local (to the local cache or memory).
- Load balance: the same amount of work is performed on each processor.
- Scalability: the ability to exploit increasing numbers of processors (resources) with good efficiency.
18. Some Conclusions
- You need to parallelize most of the code.
- But also minimize the overheads introduced by parallelization.
- Avoid having threads idle.
- Take care that the parallel code makes good use of the memory hierarchy at all levels.
What size problems are worth parallel execution? How important is ease of code maintenance?
19. Performance / Scalability Hindrances
- Too fine grained
  - Symptom: high overhead
  - Caused by: small parallel/critical regions
- Overly synchronized
  - Symptom: high overhead
  - Caused by: large synchronized sections
  - Are the dependencies real?
- Load imbalance
  - Symptom: large wait times
  - Caused by: uneven work distribution
20. Specific OpenMP Improvements
- Too many parallel regions
  - Reduce overheads and improve cache use by merging them (see the sketch below).
  - May need to use the single directive for sequential parts.
- Too many barriers
  - Can you remove some (by using nowait)?
- Too much work in a critical region
  - Can you remove some, or create multiple critical regions?
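A sketch of merging two parallel regions (the function, arrays, and printf message are illustrative, not from the slides): the serial statement between the two loops stays inside one enlarged region under a single construct, so the thread team is created only once.

  #include <stdio.h>

  void pipeline(double *a, double *b, int n,
                double init(int), double work(double))
  {
      #pragma omp parallel
      {
          #pragma omp for
          for (int i = 0; i < n; i++)
              a[i] = init(i);

          /* Sequential part kept inside the merged region: one thread
             executes it, the others wait at single's implicit barrier. */
          #pragma omp single
          printf("initialization done\n");

          #pragma omp for
          for (int i = 0; i < n; i++)
              b[i] = work(a[i]);
      }
  }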
21. Tuning: Critical Sections
- It often helps to chop up large critical sections into finer, named ones.
Original code:
  #pragma omp critical (foo)
  {
    update( a );
    update( b );
  }
Transformed code:
  #pragma omp critical (foo_a)
  update( a );
  #pragma omp critical (foo_b)
  update( b );
- Still need to avoid the wait at the first critical!
22. Tuning: Locks Instead of Critical
Original code:
  #pragma omp critical
  {
    for( i = 0; i < n; i++ ) {
      a[i] = ...
      b[i] = ...
      c[i] = ...
    }
  }
- Idea: cycle through different parts of the array using locks!
Transformed code:
  jstart = omp_get_thread_num();
  for( k = 0; k < nlocks; k++ ) {
    j = ( jstart + k ) % nlocks;
    omp_set_lock( &lck[j] );
    for( i = lb[j]; i < ub[j]; i++ ) {
      a[i] = ...
      b[i] = ...
      c[i] = ...
    }
    omp_unset_lock( &lck[j] );
  }
- Adapt to your situation (lock setup is sketched below).
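The transformed code assumes the lock array has already been created. A minimal sketch of that setup (the names lck and nlocks come from the slide; NLOCKS and the helper functions are assumptions):

  #include <omp.h>

  #define NLOCKS 16                  /* nlocks: number of array sections */
  omp_lock_t lck[NLOCKS];

  void init_locks(void)
  {
      for (int j = 0; j < NLOCKS; j++)
          omp_init_lock(&lck[j]);    /* must happen before any omp_set_lock */
  }

  void destroy_locks(void)
  {
      for (int j = 0; j < NLOCKS; j++)
          omp_destroy_lock(&lck[j]);
  }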
23. Tuning: Eliminate Implicit Barriers
- Remember: work-sharing constructs have an implicit barrier at the end.
- When consecutive work-sharing constructs modify (and use) different objects, the barrier in the middle can be eliminated (see the sketch below).
- When the same object is modified (or used), the barrier can be safely removed only if the iteration spaces align.
- On most systems this will be OK with OpenMP 3.0.
24. Problems with Data Accesses
- Potential problems:
  - Contention for access to memory
  - Too much remote data
  - Frequent false sharing
  - Poor sequential access pattern
- Some remedies:
  - Can you privatize data to make it local (and remove false sharing)?
  - Can you use OS features to pin data and memory down on a big system?
  - First-touch default placement, dplace, next_touch
- Might need to revisit the parallelization strategy.
25. Parallelism worth it?
- When would parallelizing this loop help?
    DO I = 1, N
      A(I) = 0
    ENDDO
- Some issues to consider:
  - Would it help increase the size of a parallel region?
  - Number of threads/processors being used
  - Bandwidth of the memory system
  - Value of N
    - Very large N, so A is not cache contained
  - Placement of object A
    - If distributed onto different processor caches, or about to be distributed
    - On NUMA systems, when using a first-touch policy for placing objects, to achieve a certain placement for object A (see the sketch below)
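On a first-touch NUMA system, parallelizing even this trivial initialization decides where the pages of A end up. A C sketch of the idea (the array name follows the slide; the size and schedule are assumptions):

  #define N (1 << 22)            /* large enough to span many pages */
  static double A[N];

  /* With first-touch page placement, the thread that first writes a page
     owns it. Initializing A in parallel with the same static schedule as
     the later compute loops puts each page near the thread that uses it. */
  void first_touch_init(void)
  {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < N; i++)
          A[i] = 0.0;
  }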
26. Too fine grained?
- When would parallelizing this loop help?
    DO I = 1, N
      SUM = SUM + A(I) * B(I)
    ENDDO
- Know your compiler!
- Some issues to consider:
  - Number of threads/processors being used
  - How are reductions implemented? (See the sketch below.)
    - Atomic, critical, expanded scalar, logarithmic
  - All the issues from the previous slide about the existing distribution of A and B
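In OpenMP the usual way to write this loop is with a reduction clause; whether the partial sums are combined with atomics, a critical section, expanded scalars, or a logarithmic tree is then left to the compiler and runtime. A C sketch of the equivalent loop (function and parameter names assumed):

  double dot(const double *a, const double *b, int n)
  {
      double sum = 0.0;

      /* Each thread accumulates a private partial sum; the runtime
         combines the partials at the end of the loop. */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++)
          sum += a[i] * b[i];

      return sum;
  }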
27. Tuning: Load Balancing
    do I = 1, N
      do J = I, M
        ...
- A notorious problem for triangular loops like the one above.
- Within a parallel do/for, use the schedule clause (see the sketch below).
- Remember: dynamic is much more expensive than static.
- Chunked static can be very effective for load imbalance.
- When dealing with consecutive dos/fors, nowait can help, but be careful to avoid data races.
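A C sketch of the triangular loop with a chunked static schedule (the work function and the chunk size of 8 are assumptions to be tuned):

  /* Row i does roughly (m - i) inner iterations, so a plain static split
     gives the first threads far more work than the last. Interleaving
     small static chunks spreads long and short rows evenly across threads. */
  double triangular(int n, int m, double (*work)(int, int))
  {
      double total = 0.0;
      #pragma omp parallel for schedule(static, 8) reduction(+:total)
      for (int i = 0; i < n; i++)
          for (int j = i; j < m; j++)
              total += work(i, j);
      return total;
  }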
28. Load Imbalance: Thread Profiler
- Thread Profiler is a performance analysis tool from Intel Corporation.
- There are a few tools out there for use with OpenMP.
29. High Performance Tuning
- To fine-tune for high performance, break all the good software engineering rules (i.e., stop being so portable).
- Step 1:
  - Know your application.
  - For best performance, also know your compiler, performance tool, and hardware.
- The safest pragma to use is parallel do/for.
  - Sections can introduce load imbalance.
- There is usually more than one way to express the desired parallelism.
- So how do you pick which constructs to use? Trade off level of performance vs. portability.
30. Understand the Overheads!
[Table of approximate overhead numbers for OpenMP constructs, just to illustrate.]
31. Performance Optimization Summary
- Getting maximal performance is difficult.
- Might involve changing:
  - Some aspect of the application
  - Parallelization strategy
  - Directives used
  - Compiler flags used
  - Use of OS features
- Requires understanding of the potential problems.
- Not always easy to pinpoint the actual problem.
32. Cache Ping-Ponging: Varying Times for Sequential Regions
[Figure: bar chart of run time per serial region for three runs of the same program (4, 2 and 1 threads); each set of three bars is one serial region.]
- Why does the run time change for the serial regions?
- No reason pinpointed yet.
- Time to think!
  - Thread migration?
  - Data migration?
  - Overhead?