Coarse-Grain Parallelism
1
Coarse-Grain Parallelism
Chapter 6 of Allen and Kennedy
2
Introduction
  • Previously, our transformations targeted vector
    and superscalar architectures.
  • In this lecture, we worry about transformations
    for symmetric multiprocessor machines.
  • The difference between these transformations
    tends to be one of granularity.

3
Review
  • SMP machines have multiple processors all
    accessing a central memory.
  • The processors are unrelated, and can run
    separate processes.
  • Starting processes and synchronization between
    processes is expensive.

4
Synchronization
  • A basic synchronization element is the barrier.
  • A barrier in a program forces all processes to
    reach a certain point before execution continues.
  • Bus contention can cause slowdowns.
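The slides' examples are in Fortran, but the barrier idea is easy to model in Python. Below is a minimal sketch (not from the slides; the worker function and list names are hypothetical) using `threading.Barrier`: four workers each finish phase 1, wait at the barrier, and only then begin phase 2, so no worker observes a partially completed phase 1.

```python
import threading

N = 4
barrier = threading.Barrier(N)
phase1_done = []
phase2_saw_all_of_phase1 = []

def worker(i):
    phase1_done.append(i)        # phase 1 work
    barrier.wait()               # no thread proceeds until all N arrive
    # every thread has now finished phase 1
    phase2_saw_all_of_phase1.append(len(phase1_done) == N)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(all(phase2_saw_all_of_phase1))  # True
```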

5
Single Loops
  • The analog of scalar expansion is privatization.
  • Temporaries can be given separate namespaces for
    each iteration.

DO I = 1,N
S1    T = A(I)
S2    A(I) = B(I)
S3    B(I) = T
ENDDO

PARALLEL DO I = 1,N PRIVATE(t)
S1    t = A(I)
S2    A(I) = B(I)
S3    B(I) = t
ENDDO

6
Privatization
  • Definition: A scalar variable x in a loop L is
    said to be privatizable if every path from the
    loop entry to a use of x inside the loop passes
    through a definition of x.
  • Privatizability can be stated as a data-flow
    problem.
  • We can also declare a variable x private if its
    SSA graph doesn't contain a phi function at the
    loop entry.
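For straight-line loop bodies (no branches), the data-flow test collapses to a single pass: a scalar is privatizable if every read of it is preceded by a write in the same iteration. The following Python sketch and its `privatizable` helper are illustrative, not from the text; it encodes the swap loop above as a list of (defs, uses) pairs.

```python
def privatizable(body, var):
    """body: list of (defs, uses) set-pairs, one per statement, in order.
    Returns True if var is written before any read (straight-line case)."""
    defined = False
    for defs, uses in body:
        if var in uses and not defined:
            return False          # upwards-exposed use: value flows in from outside
        if var in defs:
            defined = True
    return defined                # must also be defined somewhere in the body

# the swap loop: S1: T = A(I);  S2: A(I) = B(I);  S3: B(I) = T
body = [({"T"}, {"A"}), ({"A"}, {"B"}), ({"B"}, {"T"})]
print(privatizable(body, "T"))    # True: T is written before it is read
```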

7
Array Privatization
  • We need to privatize array variables.
  • For iteration J, the upwards-exposed variables
    are those read in the loop body before being
    defined in that iteration.

DO I = 1,100
S0    T(1) = X
L1    DO J = 2,N
S1       T(J) = T(J-1) + B(I,J)
S2       A(I,J) = T(J)
      ENDDO
ENDDO

So for this fragment, T(1) is the only exposed
variable.
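The upwards-exposed set for the inner J loop can be computed mechanically. This Python sketch (illustrative, with a hypothetical N = 5) walks the J loop recording which elements of T are read before being written; only T(1) qualifies, and since S0 defines T(1) before the J loop, T can be privatized across iterations of I.

```python
N = 5
defined, exposed = set(), set()
for J in range(2, N + 1):
    if (J - 1) not in defined:    # S1 reads T(J-1)
        exposed.add(J - 1)        # read before any write: upwards exposed
    defined.add(J)                # S1 writes T(J)
print(exposed)                    # {1}
```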
8
Array Privatization
  • Using this analysis, we get the following code

PARALLEL DO I = 1,100 PRIVATE(t)
S0    t(1) = X
L1    DO J = 2,N
S1       t(J) = t(J-1) + B(I,J)
S2       A(I,J) = t(J)
      ENDDO
ENDDO
9
Loop Distribution
  • Loop distribution eliminates carried
    dependencies.
  • Consequently, it often creates opportunity for
    outer-loop parallelism.
  • We must add extra barriers to keep dependent
    loops from executing out of order, so the
    overhead may outweigh the parallel savings.
  • Attempt other transformations before attempting
    this one.
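Distribution can be checked on concrete data. This Python sketch (illustrative; arrays modeled as dicts indexed from 1, values hypothetical) splits a loop whose second statement reads A(I-1) into two loops; each resulting loop carries no dependence, and a barrier between them preserves the ordering.

```python
N = 6
B = {i: float(i) for i in range(1, N + 1)}

# original: iteration I reads A(I-1), written by iteration I-1
A = {1: 1.0}; D1 = {}
for I in range(2, N + 1):
    A[I] = B[I] + 3.0
    D1[I] = A[I - 1] * 2.0

# distributed: all writes to A complete (barrier), then all reads of A
A = {1: 1.0}
for I in range(2, N + 1):       # loop 1: fully parallel
    A[I] = B[I] + 3.0
D2 = {}                          # --- barrier ---
for I in range(2, N + 1):       # loop 2: fully parallel
    D2[I] = A[I - 1] * 2.0
print(D1 == D2)                  # True
```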

10
Alignment
  • Many carried dependencies are due to array
    alignment issues.
  • If we can align all references, the carried
    dependencies disappear and parallelism becomes
    possible.

DO I = 2,N
   A(I) = B(I) + C(I)
   D(I) = A(I-1) * 2.0
ENDDO

DO I = 1,N+1
   IF (I .GT. 1) A(I) = B(I) + C(I)
   IF (I .LE. N) D(I+1) = A(I) * 2.0
ENDDO
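The effect of alignment can be verified numerically. In this illustrative Python model (hypothetical data; the bound guards are tightened to `I < N` so the index ranges match the original loop exactly), the aligned loop makes both statements touch A(I) in the same iteration, and both versions produce identical arrays.

```python
N = 6
B = {i: 10.0 * i for i in range(1, N + 1)}
C = {i: float(i) for i in range(1, N + 1)}

# original loop: the read A(I-1) crosses iterations (distance 1)
A1 = {1: 100.0}; D1 = {}
for I in range(2, N + 1):
    A1[I] = B[I] + C[I]
    D1[I] = A1[I - 1] * 2.0

# aligned loop: both statements reference A(I) in the same iteration
A2 = {1: 100.0}; D2 = {}
for I in range(1, N + 1):
    if I > 1:
        A2[I] = B[I] + C[I]
    if I < N:
        D2[I + 1] = A2[I] * 2.0
print(A1 == A2 and D1 == D2)      # True
```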
11
Alignment
  • There are other ways to align the loop

DO I = 2,N
   J = MOD(I+N-4, N-1) + 2
   A(J) = B(J) + C(J)
   D(I) = A(I-1) * 2.0
ENDDO

D(2) = A(1) * 2.0
DO I = 2,N-1
   A(I) = B(I) + C(I)
   D(I+1) = A(I) * 2.0
ENDDO
A(N) = B(N) + C(N)
12
Alignment
  • If an array is involved in a recurrence, then
    alignment isn't possible.
  • If two dependencies between the same statements
    have different dependence distances, then
    alignment doesn't work.
  • We can fix the second case by replicating code:

DO I = 1,N
   A(I+1) = B(I) + C
   ! Replicated statement
   IF (I .EQ. 1) THEN
      t = A(I)
   ELSE
      t = B(I-1) + C
   END IF
   X(I) = A(I+1) + t
ENDDO

DO I = 1,N
   A(I+1) = B(I) + C
   X(I) = A(I+1) + A(I)
ENDDO
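Replication works because A(I), for I > 1, was computed in the previous iteration as B(I-1) + C, so the loop can recompute that value instead of reading it across iterations. This illustrative Python check (hypothetical data) runs both versions and confirms they compute the same X.

```python
N, C = 6, 7.0
B = {i: 3.0 * i for i in range(1, N + 1)}
A0 = 100.0                        # hypothetical initial value of A(1)

# original loop: X(I) reads A(I), written in the previous iteration
A = {1: A0}; X1 = {}
for I in range(1, N + 1):
    A[I + 1] = B[I] + C
    X1[I] = A[I + 1] + A[I]

# replicated version: recompute the value A(I) would hold
A = {1: A0}; X2 = {}
for I in range(1, N + 1):
    A[I + 1] = B[I] + C
    t = A[1] if I == 1 else B[I - 1] + C   # replicated statement
    X2[I] = A[I + 1] + t
print(X1 == X2)                   # True
```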
13
Alignment
Theorem: Alignment, replication, and statement
reordering are sufficient to eliminate all
carried dependencies in a single loop containing
no recurrence, and in which the distance of each
dependence is a constant independent of the loop
index.
  • We can establish this constructively.
  • Let G = (V, E, δ) be a weighted graph, where each
    v ∈ V is a statement and δ(v1, v2) is the
    dependence distance between v1 and v2. Let
    o : V → Z give the offset of each vertex.
  • G is said to be carry-free if
    o(v1) + δ(v1, v2) = o(v2) for every edge.
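The carry-free condition can be stated directly as code. This illustrative Python sketch (toy statements S1, S2; not from the text) checks o(v1) + δ(v1, v2) = o(v2) over every edge of a dependence graph.

```python
def carry_free(delta, o):
    """delta: dict mapping edge (v1, v2) -> dependence distance.
    o: dict mapping vertex -> offset.
    True if every dependence is aligned away by the offsets."""
    return all(o[v1] + d == o[v2] for (v1, v2), d in delta.items())

delta = {("S1", "S2"): 1}         # S2 depends on S1 at distance 1
o = {"S1": 0, "S2": 1}            # offsets chosen by alignment
print(carry_free(delta, o))       # True
```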

14
Alignment Procedure
procedure Align(V, E, δ, o)
   while V is not empty
      remove element v from V
      for each (w, v) ∈ E
         if w ∈ V
            W ← W ∪ {w};  o(w) ← o(v) - δ(w,v)
         else if o(w) ≠ o(v) - δ(w,v)
            create vertex w'
            replace (w,v) with (w',v)
            replicate all edges into w onto w'
            W ← W ∪ {w'};  o(w') ← o(v) - δ(w,v)
      for each (v, w) ∈ E
         if w ∈ V
            W ← W ∪ {w};  o(w) ← o(v) + δ(v,w)
         else if o(w) ≠ o(v) + δ(v,w)
            create vertex v'
            replace (v,w) with (v',w)
            replicate edges into v onto v'
            W ← W ∪ {v'};  o(v') ← o(w) - δ(v,w)
end Align
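The offset-assignment core of the procedure can be sketched for the conflict-free case (consistent distances, so no replication is needed). This simplified Python model is illustrative, not the book's algorithm verbatim: it propagates offsets over the dependence graph so every edge satisfies the carry-free condition.

```python
from collections import deque

def align_offsets(vertices, edges):
    """edges: dict mapping (v1, v2) -> dependence distance delta.
    Assigns offsets o so that o[v1] + delta == o[v2] on every edge.
    Assumes consistent distances (the no-replication case)."""
    o = {}
    for start in vertices:
        if start in o:
            continue
        o[start] = 0
        work = deque([start])
        while work:
            v = work.popleft()
            for (a, b), d in edges.items():
                if a == v and b not in o:       # propagate forward
                    o[b] = o[v] + d; work.append(b)
                elif b == v and a not in o:     # propagate backward
                    o[a] = o[v] - d; work.append(a)
    return o

edges = {("S1", "S2"): 1, ("S2", "S3"): 2}
o = align_offsets(["S1", "S2", "S3"], edges)
print(o)
```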
15
Loop Fusion
  • Loop distribution was a method for separating
    parallel parts of a loop.
  • Our solution attempted to find the maximal loop
    distribution.
  • The maximal distribution often finds
    parallelizable components too small for efficient
    parallelization.
  • Two obvious solutions:
  • Strip mine large loops to create larger
    granularity.
  • Perform maximal distribution, and fuse together
    parallelizable loops.
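Strip mining, the first of the two options, can be sketched quickly. In this illustrative Python model (hypothetical strip size S), a loop of N iterations is split into an outer loop over strips and an inner loop within each strip, so each strip becomes a coarser unit of parallel work.

```python
N, S = 10, 4                               # S: hypothetical strip size

out1 = [i * i for i in range(N)]           # original loop

out2 = [0] * N
for lo in range(0, N, S):                  # outer loop over strips
    for i in range(lo, min(lo + S, N)):    # inner loop within a strip
        out2[i] = i * i

print(out1 == out2)                        # True
```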

16
Fusion Safety
Definition: A loop-independent dependence between
statements S1 and S2 in loops L1 and L2,
respectively, is fusion-preventing if fusing L1
and L2 causes the dependence to be carried by the
combined loop in the opposite direction.
DO I = 1,N
S1    A(I) = B(I) + C
ENDDO
DO I = 1,N
S2    D(I) = A(I+1) + E
ENDDO

DO I = 1,N
S1    A(I) = B(I) + C
S2    D(I) = A(I+1) + E
ENDDO
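The fused loop above actually changes the program's results, which is exactly what makes this dependence fusion-preventing. This illustrative Python model (hypothetical data) runs both versions: in the separate loops every read of A(I+1) sees the value S1 wrote, while in the fused loop A(I+1) has not yet been written when S2 reads it.

```python
N = 5
B = {i: float(i) for i in range(1, N + 2)}
C, E = 3.0, 4.0

A = {i: 0.0 for i in range(1, N + 2)}
for I in range(1, N + 1):            # L1
    A[I] = B[I] + C
D_sep = {}
for I in range(1, N + 1):            # L2: reads A(I+1), already written by L1
    D_sep[I] = A[I + 1] + E

A = {i: 0.0 for i in range(1, N + 2)}
D_fused = {}
for I in range(1, N + 1):            # fused: A(I+1) not yet written when read
    A[I] = B[I] + C
    D_fused[I] = A[I + 1] + E

print(D_sep != D_fused)              # True: fusion changed the results
```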
17
Fusion Safety
  • We shouldn't fuse loops if fusing will violate
    the ordering of the dependence graph.
  • Ordering Constraint: Two loops can't be validly
    fused if there exists a path of loop-independent
    dependencies between them containing a loop or
    statement not being fused with them.

Fusing L1 with L3 violates the ordering
constraint: the fused loop L1,L3 would have to
occur both before and after the node L2.
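The ordering constraint is a reachability check. This illustrative Python sketch (the `path_through_other` helper is hypothetical; it assumes an acyclic dependence graph) tests whether some dependence path between two fusion candidates passes through a vertex outside the group, which is exactly the L1/L3-around-L2 situation above.

```python
def path_through_other(graph, src, dst, group):
    """True if some path src -> dst leaves the candidate fusion group,
    violating the ordering constraint. graph: adjacency-list dict."""
    stack, seen = [(src, False)], set()
    while stack:
        node, left_group = stack.pop()
        if node == dst and left_group:
            return True                       # dst reached via an outside vertex
        for nxt in graph.get(node, []):
            state = (nxt, left_group or nxt not in group)
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return False

# L1 -> L2 -> L3 and L1 -> L3; fusing {L1, L3} traps L2 between them
graph = {"L1": ["L2", "L3"], "L2": ["L3"]}
print(path_through_other(graph, "L1", "L3", {"L1", "L3"}))  # True
```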
18
Fusion Profitability
Parallel loops should generally not be merged
with sequential loops.
Definition: An edge between two statements in
loops L1 and L2, respectively, is said to be
parallelism-inhibiting if, after merging L1 and
L2, the dependence is carried by the combined
loop.
DO I = 1,N
S1    A(I+1) = B(I) + C
ENDDO
DO I = 1,N
S2    D(I) = A(I) + E
ENDDO

DO I = 1,N
S1    A(I+1) = B(I) + C
S2    D(I) = A(I) + E
ENDDO
</DO>
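Here fusion is legal but inhibits parallelism: in the fused loop, S1 writes A(I+1) in iteration I and S2 reads the same element in iteration I+1. This illustrative Python sketch (hypothetical N) computes the dependence distance of the fused loop by mapping each array element to its writing and reading iterations.

```python
N = 5
write_iter = {I + 1: I for I in range(1, N + 1)}  # element -> iteration of S1 write
read_iter = {I: I for I in range(1, N + 1)}       # element -> iteration of S2 read

# distance = reading iteration minus writing iteration, per shared element
distances = {read_iter[e] - write_iter[e]
             for e in write_iter if e in read_iter}
print(distances)   # {1}: carried at distance 1, so iterations aren't independent
```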
19
Typed Fusion
  • We start off by classifying loops into two
    types: parallel and sequential.
  • We next gather together all edges that inhibit
    efficient fusion, and call them bad edges.
  • Given a loop dependence graph (V,E), we want to
    obtain a graph (V',E') by merging vertices of V
    subject to the following constraints:
  • Bad Edge Constraint: vertices joined by a bad
    edge aren't fused.
  • Ordering Constraint: vertices joined by a path
    containing a non-parallel vertex aren't fused.
20
Typed Fusion Procedure
procedure TypedFusion(G, T, type, B, t0)
   initialize all variables to zero
   set count[n] to be the in-degree of node n
   initialize W with all nodes of in-degree zero
   while W isn't empty
      remove element n with type t from W
      if t = t0
         if maxBadPrev[n] = 0 then p ← fused
         else p ← next[maxBadPrev[n]]
         if p ≠ 0 then
            x ← node[p];  num[n] ← num[x]
            update_successors(n, t)
            fuse x and n and call the result n
         else
            create_new_fused_node(n)
            update_successors(n, t)
      else
         create_new_node(n)
         update_successors(n, t)
end TypedFusion
21
Typed Fusion Example
Original loop graph
Graph annotated with (maxBadPrev, p) → num
After fusing parallel loops
After fusing sequential loops
22
Cohort Fusion
  • Given an outer loop containing some number of
    inner loops, we want to be able to run some inner
    loops in parallel.
  • We can do this as follows:
  • Run TypedFusion with B = fusion-preventing
    edges, parallelism-inhibiting edges, and edges
    between a parallel loop and a sequential loop.
  • Put a barrier at the end of each identified
    cohort.
  • Run TypedFusion again to fuse the parallel loops
    in each cohort.