Title: EECE 571L: Parallel Programming
EECE 571L Parallel Programming Reconfigurable
UBC EECE571L Prof. Guy Lemieux
- Flynn's Taxonomy
- Types of Parallelism
- Limits: Dependence
- Limits: Amdahl
Classes of Parallel Architecture
- Flynn's Taxonomy
- New one?
- Single Instruction, Single Data (SISD)
Source: El-Rewini and Abd-El-Barr, Advanced Computer Architecture and Parallel Processing
- Single Instruction, Multiple Data (SIMD)
Source: El-Rewini and Abd-El-Barr, Advanced Computer Architecture and Parallel Processing
- Multiple Instruction, Single Data (MISD)
- Exists in concept, not implemented
- Multiple Instruction, Multiple Data (MIMD)
Source: El-Rewini and Abd-El-Barr, Advanced Computer Architecture and Parallel Processing
- MIMD machines can be either tightly or loosely coupled
- Tightly-Coupled
- Share Address Space
- Symmetric Multiprocessors (SMPs)
- Uniform Memory Access (UMA)
- Nonuniform Memory Access (NUMA)
- E.g. Intel Quad Core
- Loosely-Coupled
- Disjoint Address Space
- Distributed SISDs
- Message Passing
- E.g. Network Clusters
- Single Program, Multiple Data (SPMD)
- A special case of MIMD
- MIMD compute nodes can run completely different programs
- 3D physics on node 1
- Graphics rendering on node 2
- SPMD compute nodes run identical programs
- Free-running, out-of-sync programs
- At any point in time, each node may be executing a different instruction
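A minimal sketch of the SPMD idea, under stated assumptions: threads stand in for compute nodes, and the names `program`, `run_spmd`, `rank`, and `size` are hypothetical, chosen only for illustration. Every "node" runs the same function; only its rank differs, which selects its share of the data and can steer it down a different branch.

```python
from concurrent.futures import ThreadPoolExecutor

def program(rank, size, data):
    """The single program that every node runs (hypothetical example)."""
    # Each node selects its own share of the data from its rank.
    my_part = data[rank::size]
    if rank == 0:
        # Same program, different control path: nodes are free-running,
        # so rank 0 may be executing this branch while others are not.
        label = "root"
    else:
        label = "worker"
    return label, sum(my_part)

def run_spmd(size, data):
    # Threads stand in for nodes here; this only illustrates the structure.
    with ThreadPoolExecutor(max_workers=size) as pool:
        futures = [pool.submit(program, rank, size, data) for rank in range(size)]
        return [f.result() for f in futures]

results = run_spmd(3, list(range(10)))
print(results)
```

On a real cluster the same pattern is usually written with a message-passing library, with one process per node instead of one thread.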
Parallelism Levels
- Task
- Thread
- Data
- Loop
- Instruction
- Bit
Task-level Parallelism
- Function level of a program
- Example
- Given data A and B, find func1(A,B) and func2(A,B)
- Two tasks
- Assume two processors available
- func1(A,B) on CPU1
- func2(A,B) on CPU2
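The example above can be sketched as follows. This is an illustration only: the bodies of `func1` and `func2` are made up (the slide does not say what they compute), and a two-worker thread pool stands in for CPU1 and CPU2. In CPython, CPU-bound tasks would need a process pool to actually run on separate cores.

```python
from concurrent.futures import ThreadPoolExecutor

def func1(a, b):
    # Hypothetical task 1: sum of the two inputs.
    return a + b

def func2(a, b):
    # Hypothetical task 2: product of the two inputs.
    return a * b

def run_tasks(a, b):
    # Two workers stand in for CPU1 and CPU2; each receives one whole task.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future1 = pool.submit(func1, a, b)  # "func1(A,B) on CPU1"
        future2 = pool.submit(func2, a, b)  # "func2(A,B) on CPU2"
        return future1.result(), future2.result()

print(run_tasks(3, 4))  # (7, 12)
```

The two tasks share their inputs but neither needs the other's result, which is what allows them to run at the same time.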
Thread-level Parallelism
- Similar to task-level, but finer grain
- Threads can be independent, or can cooperate to achieve a greater goal
Data-level Parallelism
- Distribution of data among processors
- Example
- Given an array of n elements,
- multiply each element by 2
- Divide n by the number of processors p
- Each processor performs the multiplication on n/p elements
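A rough sketch of this example, assuming a thread pool as a stand-in for p processors. The helper names `split`, `double_chunk`, and `parallel_double` are hypothetical. Each worker multiplies its own chunk of about n/p elements by 2, and the chunks are then stitched back together in order.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, p):
    """Cut data into p contiguous chunks of roughly n/p elements each."""
    n = len(data)
    bounds = [(i * n) // p for i in range(p + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(p)]

def double_chunk(chunk):
    # The same operation is applied to every element: multiply by 2.
    return [2 * x for x in chunk]

def parallel_double(data, p):
    with ThreadPoolExecutor(max_workers=p) as pool:
        # map() keeps the chunks in their original order.
        doubled = list(pool.map(double_chunk, split(data, p)))
    return [x for chunk in doubled for x in chunk]

print(parallel_double([1, 2, 3, 4, 5, 6, 7], 3))  # [2, 4, 6, 8, 10, 12, 14]
```

When n is not a multiple of p, the chunk sizes differ by at most one element, so the load stays balanced.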
Loop-level Parallelism
- Exploit concurrency in loops
- Possible examples
- For-loop to calculate the dot product of arrays A and B
- Is this really data parallelism?
- Overlap loop iteration i with iteration i+1 by starting the next iteration as early as possible (but no earlier than any loop-carried dependence allows)
- Is this pipeline parallelism?
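To make the dot-product example concrete, here is a small sketch (function names are hypothetical). The serial loop carries a dependence through the accumulator: every iteration reads and writes `total`. Giving each block of iterations its own partial sum removes that dependence between blocks, so the blocks could run on separate processors; the outer loop below is written serially only to keep the sketch simple.

```python
def dot_serial(a, b):
    total = 0
    for i in range(len(a)):
        # Loop-carried dependence: iteration i needs total from iteration i-1.
        total += a[i] * b[i]
    return total

def dot_chunked(a, b, p):
    """Dot product restructured into p independent blocks of iterations."""
    n = len(a)
    bounds = [(k * n) // p for k in range(p + 1)]
    partial = [0] * p
    for k in range(p):  # each value of k could run on its own processor
        for i in range(bounds[k], bounds[k + 1]):
            partial[k] += a[i] * b[i]  # dependence stays inside block k
    return sum(partial)  # final reduction combines the partial sums

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(dot_serial(a, b), dot_chunked(a, b, 2))  # 70 70
```

With floating-point data the two versions can differ slightly in the last bits, because the additions are performed in a different order.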
Instruction-level Parallelism
- Machine instruction level
- Identify independent instructions within an instruction window
- Superscalar: done at run time by the CPU
- VLIW: done at compile time
- Dynamic optimizations by the run-time software system are also possible (e.g., JIT)
- Example
- ADD R1, R2, R3
- LOAD R4, R2
Bit-level Parallelism
- Example
- 16-bit addition
- Two instructions on an 8-bit ALU
- One instruction on a 16-bit ALU
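A small model of this example, with hypothetical function names. The 8-bit ALU needs two dependent steps (add the low bytes, then add the high bytes plus the carry), while the 16-bit ALU handles all 16 bits in one step.

```python
def add8(x, y, carry_in=0):
    """Model of one 8-bit ALU add: returns (8-bit sum, carry out)."""
    total = (x & 0xFF) + (y & 0xFF) + carry_in
    return total & 0xFF, total >> 8

def add16_on_8bit_alu(x, y):
    # Instruction 1: add the low bytes and keep the carry.
    lo, carry = add8(x & 0xFF, y & 0xFF)
    # Instruction 2: add the high bytes plus the carry from instruction 1.
    hi, _ = add8(x >> 8, y >> 8, carry)
    return (hi << 8) | lo

def add16_on_16bit_alu(x, y):
    # A single instruction: all 16 bits are processed at once.
    return (x + y) & 0xFFFF

print(hex(add16_on_8bit_alu(0x12FF, 0x0001)))   # 0x1300
print(hex(add16_on_16bit_alu(0x12FF, 0x0001)))  # 0x1300
```

The second 8-bit add cannot start until the first has produced its carry, so the narrow ALU pays for its width in extra, serialized instructions.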
- Does the result of the current instruction depend on the previous result?
- Yes: the previous result must be computed first
- No: the instructions can be computed in parallel
Types of Dependencies
RAR: no dependence
- Read after Read
- No dependency
- Example
- R2 <= R1 + 1
- R3 <= R1 + 2
RAW: true dependence
- Read after Write
- Producer/consumer relationship
- Example
- R2 <= R1 + 1
- R3 <= R2 + 2
WAR: false dependence
- Write after Read
- Also known as anti-dependence
- Example
- R2 <= R1 + 1
- R1 <= R3 + 2
- Can these be avoided?
Avoid WAR: false dependence
- Avoid by allocating new storage
- Register renaming
- Separate memory locations
- Example
- R2 <= R1 + 1
- R1' <= R3 + 2
WAW: output dependence
- Write after Write
- What happens if you reorder the output going to a printer?
- Example
- R2 <= R1 + 1
- R2 <= R3 + 2
Avoid WAW: output dependence
- Avoid by optimizing away the earlier computation?
- Avoid by allocating new storage?
- Register renaming
- Separate memory locations
- Example
- R2 <= R1 + 1
- R2' <= R3 + 2
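The four cases (RAR, RAW, WAR, WAW) can be checked mechanically. The sketch below is one possible encoding, not taken from the slides: each instruction is a hypothetical `(destination, sources)` pair, and `classify` reports which dependences the second instruction has on the first. The last two calls show that writing to a renamed register (`R1'`, `R2'`) makes the WAR and WAW dependences disappear.

```python
def classify(first, second):
    """Return the dependences of `second` on `first`.

    Each instruction is (destination register, set of source registers).
    An empty set means the two can be reordered or run in parallel.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 in srcs2:
        kinds.add("RAW")  # second reads what first wrote (true dependence)
    if dest2 in srcs1:
        kinds.add("WAR")  # second overwrites what first read (anti-dependence)
    if dest1 == dest2:
        kinds.add("WAW")  # both write the same register (output dependence)
    return kinds

first = ("R2", {"R1"})                  # R2 is computed from R1
print(classify(first, ("R3", {"R1"})))  # RAR only: set()
print(classify(first, ("R3", {"R2"})))  # {'RAW'}
print(classify(first, ("R1", {"R3"})))  # {'WAR'}
print(classify(first, ("R2", {"R3"})))  # {'WAW'}

# After renaming the destination, the false dependences are gone.
print(classify(first, ("R1'", {"R3"})))  # set()
print(classify(first, ("R2'", {"R3"})))  # set()
```

Renaming cannot remove a RAW dependence, because the consumer genuinely needs the value the producer computes.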
The Ultimate Speed Limit
Amdahl's Law
- Question
- If you improve part of the system, how much faster does the entire system run?
Gene Amdahl: famous computer systems architect at IBM in the 60s and 70s.
- Amdahl's Law gives us the speed limit!
- Given enhancement E, define
- Speedup(E) = PerformanceAfter(E) / PerformanceBefore(E)
- Equivalently, Speedup(E) = ExecutionTimeBefore(E) / ExecutionTimeAfter(E)
Amdahl's Law
- More detail
- Enhancement E
- results in a speedup of S
- to only some fraction F of the program
- ExecutionTimeAfter(E) = [(1-F) + F/S] × ExecutionTimeBefore(E)
- (derivation on next slide)
- Usually expressed as a speedup
- Speedup(E)
- = ExecutionTimeBefore(E) / ExecutionTimeAfter(E)
- = 1 / [(1-F) + F/S]
Amdahl's Law: Derivation
- (1-F) portion untouched
- ExecutionTimeBefore(E) = (1-F) + F = 1
- F portion improved by S times, to F/S
- ExecutionTimeAfter(E) = (1-F) + F/S
- Therefore
- Speedup(E) = Before/After = 1 / [(1-F) + F/S]
- Lesson: when speeding up a computer system, work on the part with the biggest F
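A quick numeric check of the lesson, using the formula Speedup = 1 / ((1-F) + F/S). The function name `amdahl_speedup` and the sample values of F and S are illustrative assumptions, not figures from the course.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the run time is sped up s times."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)

# A modest 2x improvement to the part that takes 90% of the time...
big_f = amdahl_speedup(0.9, 2)
# ...beats a 100x improvement to the part that takes only 10% of the time.
small_f = amdahl_speedup(0.1, 100)

print(round(big_f, 2), round(small_f, 2))  # 1.82 1.11
```

Here a small gain on the large fraction gives about 1.82x overall, while a huge gain on the small fraction gives only about 1.11x.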
Amdahl's Law: Speed Limits!
F is the portion of the program that can be sped up.
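The speed limit follows from letting S grow without bound in Speedup = 1 / ((1-F) + F/S): the F/S term vanishes and the speedup can never exceed 1 / (1-F). The sketch below tabulates that limit for a few sample values of F; the values and the name `speed_limit` are chosen here for illustration.

```python
import math

def speed_limit(f):
    """Upper bound on speedup when a fraction f of the program is sped up."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if f == 1.0:
        return math.inf  # nothing is left untouched, so there is no limit
    return 1.0 / (1.0 - f)

for f in (0.5, 0.9, 0.99):
    print(f"F = {f:4.2f}  ->  limit = {speed_limit(f):.1f}x")
```

Even with 99% of the program improved, the remaining 1% caps the overall speedup at about 100x.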
Amdahl's Law: Summary
- Amdahl's Law
- Designer's Mantra: Make the common case fast
- Applies to all engineering optimizations!
- Corollary
- Rare cases don't matter
- Student's Corollary
- On a test, do the easy stuff for the most marks
Amdahl's Law: Rebuttal?
- Does Amdahl always win?
- Gustafson's Law
- As the number of processors increases, you can scale the problem size
- As the problem size grows, ideally the sequential fraction of the run time will shrink
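A sketch comparing the two views, under stated assumptions. Gustafson's scaled speedup is commonly written as (1-F) + F×N, where F is the parallel fraction of the run time measured on the N-processor system; Amdahl's fixed-size speedup uses 1 / ((1-F) + F/N). The function names and the sample values F = 0.9, N = 100 are illustrative only.

```python
def amdahl_speedup(f, n):
    """Fixed problem size: fraction f of the work is spread over n processors."""
    return 1.0 / ((1.0 - f) + f / n)

def gustafson_speedup(f, n):
    """Scaled problem size: f is the parallel fraction of time on n processors."""
    return (1.0 - f) + f * n

f, n = 0.9, 100
print(round(amdahl_speedup(f, n), 2))     # 9.17
print(round(gustafson_speedup(f, n), 1))  # 90.1
```

The two numbers answer different questions: Amdahl asks how much faster a fixed problem runs, while Gustafson asks how much more work can be done in the same time.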
- Concurrency: try to identify independent elements that can be performed in parallel
- Only parallelize the common case, and make sure it is frequent enough to matter