Title: Using the Iteration Space Visualizer in Loop Parallelization
Using the Iteration Space Visualizer in Loop Parallelization
- Yijun YU
- http://winpar.elis.rug.ac.be/ppt/isv
Overview
- ISV: a 3D Iteration Space Visualizer that shows the dependences in the iteration space
  - iteration: one instance of the loop body
  - space: the grid of all index values
- Detect the parallelism
- Estimate the speedup
- Derive a loop transformation
- Find Statement-level parallelism
- Future development
1. Dependence
Program:
DO I = 1,3
  A(I) = A(I-1)
ENDDO

DOALL I = 1,3
  A(I) = A(I-1)
ENDDO
1.1 Example 1
ISV directive: visualize
1.2 Visualize the Dependence
- A dependence is visualized in an iteration space dependence graph
1.3 Parallelism?
- Stepwise view: sequential execution
- No parallelism found
- However, many programs have parallelism
2. Potential Parallelism
- Time(sequential) = number of iterations
- Dataflow: iterations are executed as soon as their data are ready
- Time(dataflow) = number of iterations on the longest critical path
- The potential parallelism is denoted by speedup = Time(sequential) / Time(dataflow)
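As a quick check of this formula on Example 1 above (the chain A(I) = A(I-1) with 3 iterations): every iteration depends on the previous one, so the critical path contains all 3 iterations and
  Time(sequential) = 3, Time(dataflow) = 3, speedup = 3/3 = 1
which matches the observation in 1.3 that no parallelism is found.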
2.1 Example 2
Dependence analysis: Diophantine equations and the loop bounds (the polytope) determine the dependences in the iteration space
2.2 Irregular dependence
- Dependences have non-uniform distances
- Parallelism analysis: 200 iterations over 15 dataflow steps, a potential speedup of 200/15, roughly 13.3
- Problem: how to exploit it?
3. Visualize parallelism
- Find answers to these questions:
  - What is the dependence pattern?
  - Is there a parallel loop? (How to find it?)
  - What is the maximal parallelism? (How to exploit it?)
  - Is the load of parallel tasks balanced?
3.1 Example 3
3.2 3D Space
3.3 Loop parallelizable?
- The I, J, K loops span a 3D iteration space with 32 iterations
- Simulate sequential execution
- Which loop can be parallel?
3.4 Loop parallelization
- Interactively try the parallelization: check whether loop I is a parallel loop
- The blinking dependence edges prevent the parallelization of the given loop I
3.5 Parallel execution
- Let ISV find the correct parallelization: automatically check for a parallel loop
- Simulate parallel execution
3.6 Dataflow execution
- Sequential execution takes 32 time steps
- Simulate dataflow execution
- Dataflow execution takes only 4 time steps
- Potential speedup = 32/4 = 8
3.7 Graph partitioning
- Iterating through partitions: the connected components of the dependence graph
- All the partitions are load balanced
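As a small illustration (a hypothetical loop, not one of the example programs): in DO I = 1,8 with body A(I) = A(I-2), the dependence graph splits into two connected components, the odd and the even iterations. Each component is an independent parallel task of the same size, so the partitions are load balanced.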
4. Loop Transformation
Potential parallelism → transformation → real parallelism
4.1 Example 4
4.2 The iteration space
- Sequential execution: 25 iterations
4.3 Loop parallelizable?
4.4 Dataflow execution
- In total 9 time steps
- Potential speedup = 25/9, roughly 2.78
- Wavefront effect: all iterations on the same wave lie on the same line
4.5 Zoom-in on the I-space
4.6 Speedup vs. program size
- Zoom-in previews parallelism in part of a loop without modifying the program
- Executing the program for different sizes n estimates a speedup of n^2/(2n-1)
4.7 How to obtain the potential parallelism
- Here we already have these metrics:
  - sequential time steps = N^2
  - dataflow time steps = 2N-1
  - potential speedup = N^2/(2N-1)
- How to obtain the potential speedup of a loop? Transformation.
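The dataflow metrics can also be obtained by a small simulation. The sketch below is a minimal example, assuming a representative wavefront loop A(I,J) = A(I-1,J) + A(I,J-1) rather than the exact loop of Example 4; each iteration is ready one step after its latest predecessor.

PROGRAM DATAFLOW_TIME
  INTEGER, PARAMETER :: N = 5
  INTEGER :: STEP(0:N,0:N), I, J
  STEP = 0                                   ! boundary iterations are ready at step 0
  DO I = 1, N
    DO J = 1, N
      ! iteration (I,J) runs one step after both of its predecessors
      STEP(I,J) = 1 + MAX(STEP(I-1,J), STEP(I,J-1))
    ENDDO
  ENDDO
  PRINT *, 'Time(sequential) =', N*N           ! 25
  PRINT *, 'Time(dataflow)   =', STEP(N,N)     ! 2N-1 = 9
  PRINT *, 'Potential speedup =', REAL(N*N)/REAL(STEP(N,N))
END PROGRAM DATAFLOW_TIME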
4.8 Unimodular transformation (UT)
new loop index = unimodular matrix × old loop index
- A unimodular matrix is a square integer matrix with determinant ±1. It is obtained from the identity matrix by three kinds of basic transformations: reversal, interchange, and skewing.
- The new loop execution order is determined by the transformed index; the iteration space keeps its unit step size.
- Finding a suitable UT reorders the iterations such that the new loop nest has a parallel loop (a sketch follows below).
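As an illustration (a minimal sketch, again assuming the wavefront loop A(I,J) = A(I-1,J) + A(I,J-1) rather than the exact loop of Example 4): the skewing UT with rows (1 1) and (0 1) maps (I,J) to (T1,T2) = (I+J, J). The distance vectors (1,0) and (0,1) are transformed to (1,0) and (1,1), both with a positive first component, so all dependences are carried by the outer T1 loop and the inner T2 loop runs in parallel:

DO T1 = 2, 2*N                               ! wavefront number T1 = I+J
  DOALL T2 = MAX(1, T1-N), MIN(N, T1-1)      ! all iterations on one wavefront
    A(T1-T2, T2) = A(T1-T2-1, T2) + A(T1-T2, T2-1)   ! I = T1-T2, J = T2
  ENDDO
ENDDO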
4.9 Hyperplane transformation
- Interactively define a hyperplane
- Observe that the plane-by-plane iteration matches the dataflow simulation: plane = dataflow
- Based on the plane, ISV calculates a unimodular transformation
4.10 The derived UT
- The transformed iteration space and the generated loop
4.11 Verify the UT
- ISV checks whether the transformation is valid
- Observe that the parallel loop execution of the transformed loop matches the plane execution: parallel = plane
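One common validity criterion (a sketch of the usual check, not necessarily the exact test ISV performs): a UT is valid if every transformed dependence distance vector is lexicographically positive, so that no dependence runs backwards in the new execution order.

! Returns .TRUE. if T*D(:,K) is lexicographically positive for every
! dependence distance vector D(:,K)
LOGICAL FUNCTION UT_VALID(T, D, NDIM, NDEP)
  INTEGER, INTENT(IN) :: NDIM, NDEP
  INTEGER, INTENT(IN) :: T(NDIM,NDIM), D(NDIM,NDEP)
  INTEGER :: TD(NDIM), I, K
  UT_VALID = .TRUE.
  DO K = 1, NDEP
    TD = MATMUL(T, D(:,K))                   ! transformed distance vector
    DO I = 1, NDIM
      IF (TD(I) > 0) EXIT                    ! leading positive entry: OK
      IF (TD(I) < 0 .OR. I == NDIM) THEN     ! negative leading entry, or all zero
        UT_VALID = .FALSE.
        RETURN
      ENDIF
    ENDDO
  ENDDO
END FUNCTION UT_VALID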
5. Statement-level parallelism
- Unimodular transformations work at the iteration level
- The statement dependences within the loop body are hidden in the iteration space graph
- How to exploit parallelism at the statement level? Map statements to iterations (see the sketch below).
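A minimal sketch of the idea, using an assumed two-statement loop body (not the actual Example 5): adding a statement index turns each iteration (I,J) into two points (I,J,0) and (I,J,1), so the dependences between S1 and S2 become ordinary edges in a 3D statement-level iteration space.

DO I = 1, N
  DO J = 1, N
    A(I,J) = B(I-1,J)          ! S1, statement index 0
    B(I,J) = A(I,J-1)          ! S2, statement index 1
  ENDDO
ENDDO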
5.1 Example 5
- SSV: statement space visualization
5.2 Iteration-level parallelism
- The iteration space is 2D
- There are N^2 = 16 iterations
- The dataflow execution has 2N-1 = 7 time steps
- The potential speedup is 16/7, roughly 2.29
5.3 Parallelism in statements
- The (statement) iteration space is 3D
- There are 2N^2 = 32 statement instances
- The dataflow execution still has 2N-1 = 7 time steps
- The potential speedup is 32/7, roughly 4.57
5.4 Comparison
- Statement-level analysis here doubles the potential speedup found at the iteration level
5.5 Define the partition planes
What is validity?
- Show the execution order on top of the dependence arrows (for one plane or all together, depending on the density of the slide).
5.6 Invalid UT
- The invalid unimodular transformation derived from the hyper-plane is refused by ISV
- Alternatively, ISV calculates the unimodular transformation based on the dependence distance vectors available in the dependence graph
6. Pseudo distance method
- The pseudo distance method:
  - extract base vectors from the dependent iterations
  - examine whether the base vectors generate all the distances
  - calculate the unimodular transformation based on the base vectors
Another way to find parallelism automatically
The iteration space is a grid; non-uniform dependences are members of a uniform dependence grid with unknown base vectors. Finding these base vectors allows us to extend existing parallelization to the non-uniform case.
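A minimal sketch of the "do the base vectors generate all distances" test for the 2D case (illustrative only, not ISV's actual implementation, which handles the general n-dimensional integer lattice): a distance D is generated by base vectors B1 and B2 exactly when x1*B1 + x2*B2 = D has an integer solution, which Cramer's rule makes easy to check.

! .TRUE. if D = x1*B1 + x2*B2 for some integers x1, x2
LOGICAL FUNCTION GENERATED(B1, B2, D)
  INTEGER, INTENT(IN) :: B1(2), B2(2), D(2)
  INTEGER :: DET, X1NUM, X2NUM
  GENERATED = .FALSE.
  DET = B1(1)*B2(2) - B1(2)*B2(1)            ! determinant of the base matrix (B1 B2)
  IF (DET == 0) RETURN                       ! base vectors are not independent
  X1NUM = D(1)*B2(2) - D(2)*B2(1)            ! Cramer's rule numerators
  X2NUM = B1(1)*D(2) - B1(2)*D(1)
  GENERATED = (MOD(X1NUM, DET) == 0 .AND. MOD(X2NUM, DET) == 0)
END FUNCTION GENERATED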
6.1 Dependence distance
6.2 The Transformation
- The transforming matrix discovered by the pseudo distance method:
    1  1  0
   -1  0  1
    1  0  0
- The distance vectors are transformed: (1,0,-1) → (0,1,0) and (0,1,1) → (0,0,1)
- The dependent iterations have the same first index, which implies the outermost loop is parallel
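As a check (assuming the row-vector convention d' = d * T, which is consistent with the numbers on this slide):
  (1,0,-1) * T = (1*1 + 0*(-1) + (-1)*1,  1*1 + 0*0 + (-1)*0,  1*0 + 0*1 + (-1)*0) = (0, 1, 0)
  (0,1,1)  * T = (0*1 + 1*(-1) + 1*1,     0*1 + 1*0 + 1*0,     0*0 + 1*1 + 1*0)    = (0, 0, 1)
Both transformed distances have first component 0, so dependent iterations share the same outermost index.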
6.3 Compare the UT matrices
- The transforming matrix discovered by the pseudo distance method:
    1  1  0
   -1  0  1
    1  0  0
- An invalid transforming matrix discovered by the hyper-plane method:
    1  0  0
   -1  1  0
    1  0  1
- The same first column means the transformed outermost loops have the same index
6.4 The transformed space
- The outermost loop is parallel
- There are 8 parallel tasks
- The load of tasks is not balanced
- The longest task takes 7 time steps
7. Non-perfectly nested loop
- What is it?
- The unimodular transformations only work for perfectly nested loops
- For a non-perfectly nested loop, the iteration space is constructed with extended indices
- An N-fold non-perfectly nested loop becomes an (N+1)-fold perfectly nested loop
7.1 Perfectly nested loop?
- Non-perfectly nested loop:

DO I1 = 1,3
  A(I1) = A(I1-1)
  DO I2 = 1,4
    B(I1,I2) = B(I1-1,I2) + B(I1,I2-1)
  ENDDO
ENDDO

- Perfectly nested loop:

DO I1 = 1,3
  DO I2 = 1,5
    DO I3 = 0,1
      IF (I2.EQ.1 .AND. I3.EQ.0) THEN
        A(I1) = A(I1-1)
      ELSE IF (I3.EQ.1) THEN
        B(I1-1,I2) = B(I1-2,I2) + B(I1-1,I2-1)
      ENDIF
    ENDDO
  ENDDO
ENDDO
7.2 Exploit parallelism with UT
8. Applications
9. Future considerations
- Weighted dependence graph
- More semantics on data locality
  - data space graph, data communication graph
  - data reuse iteration space graph
- More loop transformations
  - affine (statement) iteration space mappings
  - automatic statement distribution
- Integration with the Omega library