1
Recursion Unrolling for Divide and Conquer
Programs
  • Radu Rugina and Martin Rinard
  • Laboratory for Computer Science
  • Massachusetts Institute of Technology

2
What This Talk Is About
  • Automatic generation of efficient large base
    cases for divide and conquer programs

3
Outline
  1. Motivating Example
  2. Computation Structure
  3. Transformations
  4. Related Work
  5. Conclusion

4
1. Motivating Example
5
Divide and Conquer Matrix Multiply
A × B = R

[ A0 A1 ]   [ B0 B1 ]   [ A0×B0 + A1×B2   A0×B1 + A1×B3 ]
[ A2 A3 ] × [ B2 B3 ] = [ A2×B0 + A3×B2   A2×B1 + A3×B3 ]

  • Divide matrices into sub-matrices A0, A1, A2,
    etc.
  • Use blocked matrix multiply equations

6
Divide and Conquer Matrix Multiply
  • Recursively multiply sub-matrices

7
Divide and Conquer Matrix Multiply
A × B = R
[ a0 ] × [ b0 ] = [ a0 × b0 ]
  • Terminate recursion with a simple base case

8
Divide and Conquer Matrix Multiply
void matmul(int *A, int *B, int *R, int n) {
    if (n == 1) {
        (*R) += (*A) * (*B);
    } else {
        matmul(A, B, R, n/4);
        matmul(A, B+(n/4), R+(n/4), n/4);
        matmul(A+2*(n/4), B, R+2*(n/4), n/4);
        matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
        matmul(A+(n/4), B+2*(n/4), R, n/4);
        matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
        matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
        matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
    }
}
Implements R = A × B
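
To try the procedure out, the sketch below assumes that n is the total number of elements (a power of 4) and that each matrix is stored quadrant by quadrant, which is what the pointer offsets suggest. The driver `matmul_demo` is a hypothetical helper, not part of the talk; it zeroes R first because the base case accumulates with `+=`.

```c
#include <assert.h>

/* R += A * B. Each matrix holds n elements (n a power of 4, assumed),
 * stored as four contiguous quadrants of n/4 elements each:
 * top-left, top-right, bottom-left, bottom-right, recursively. */
void matmul(int *A, int *B, int *R, int n) {
    assert(n >= 1);
    if (n == 1) {
        *R += (*A) * (*B);
    } else {
        int q = n / 4;                               /* elements per quadrant */
        matmul(A,         B,         R,         q);  /* R0 += A0*B0 */
        matmul(A,         B + q,     R + q,     q);  /* R1 += A0*B1 */
        matmul(A + 2 * q, B,         R + 2 * q, q);  /* R2 += A2*B0 */
        matmul(A + 2 * q, B + q,     R + 3 * q, q);  /* R3 += A2*B1 */
        matmul(A + q,     B + 2 * q, R,         q);  /* R0 += A1*B2 */
        matmul(A + q,     B + 3 * q, R + q,     q);  /* R1 += A1*B3 */
        matmul(A + 3 * q, B + 2 * q, R + 2 * q, q);  /* R2 += A3*B2 */
        matmul(A + 3 * q, B + 3 * q, R + 3 * q, q);  /* R3 += A3*B3 */
    }
}

/* Hypothetical driver: multiplies [1 2; 3 4] by [5 6; 7 8] into R. */
void matmul_demo(int R[4]) {
    int A[4] = {1, 2, 3, 4};
    int B[4] = {5, 6, 7, 8};
    int i;
    for (i = 0; i < 4; i++)
        R[i] = 0;                    /* matmul accumulates into R */
    matmul(A, B, R, 4);
    assert(R[0] == 1 * 5 + 2 * 7);   /* A0*B0 + A1*B2 */
    assert(R[3] == 3 * 6 + 4 * 8);   /* A2*B1 + A3*B3 */
}
```

For a 2 x 2 matrix the quadrant layout coincides with row-major order, so the result can be read directly as [19 22; 43 50].
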
9
Divide and Conquer Matrix Multiply
Divide matrices into sub-matrices and recursively
multiply the sub-matrices
10
Divide and Conquer Matrix Multiply
Identify sub-matrices with pointers
11
Divide and Conquer Matrix Multiply
Use a simple algorithm for the base case
12
Divide and Conquer Matrix Multiply
  • Advantage of small base case: simplicity
  • Code is easy to write, maintain, debug, and
    understand

13
Divide and Conquer Matrix Multiply
  • Disadvantage: inefficiency
  • Large control flow overhead
  • Most of the time is spent dividing the matrix
    into sub-matrices

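
The overhead can be made concrete by counting. The sketch below mirrors the recursive structure of matmul (eight sub-problems of n/4 elements each) without touching any data; the struct and the function name `count_matmul` are hypothetical, and n is assumed to be a power of 4.

```c
#include <assert.h>

/* Counters for one run of the recursive structure of matmul. */
struct counts {
    long calls;   /* procedure invocations */
    long tests;   /* evaluations of the n == 1 test */
    long madds;   /* multiply-add operations in the base case */
};

/* Walks the same call tree as matmul and updates the counters. */
void count_matmul(long n, struct counts *c) {
    int i;
    assert(n >= 1);
    c->calls++;
    c->tests++;
    if (n == 1) {
        c->madds++;                     /* the only useful work */
    } else {
        for (i = 0; i < 8; i++)
            count_matmul(n / 4, c);     /* eight recursive multiplies */
    }
}
```

For a 4 x 4 matrix (n = 16) this gives 64 multiply-adds but 73 calls and 73 tests, so every arithmetic operation pays for at least one call and one test.
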
14
Hand Coded Implementation
void serialmul(block *As, block *Bs, block *Rs) {
    int i, j;
    DOUBLE *A = (DOUBLE *) As;
    DOUBLE *B = (DOUBLE *) Bs;
    DOUBLE *R = (DOUBLE *) Rs;
    for (j = 0; j < 16; j += 2) {
        DOUBLE *bp = &B[j];
        for (i = 0; i < 16; i += 2) {
            DOUBLE *ap = &A[i * 16];
            DOUBLE *rp = &R[j + i * 16];
            register DOUBLE s0_0 = rp[0], s0_1 = rp[1];
            register DOUBLE s1_0 = rp[16], s1_1 = rp[17];
            s0_0 += ap[0] * bp[0];     s0_1 += ap[0] * bp[1];
            s1_0 += ap[16] * bp[0];    s1_1 += ap[16] * bp[1];
            s0_0 += ap[1] * bp[16];    s0_1 += ap[1] * bp[17];
            s1_0 += ap[17] * bp[16];   s1_1 += ap[17] * bp[17];
            s0_0 += ap[2] * bp[32];    s0_1 += ap[2] * bp[33];
            s1_0 += ap[18] * bp[32];   s1_1 += ap[18] * bp[33];
            s0_0 += ap[3] * bp[48];    s0_1 += ap[3] * bp[49];
            s1_0 += ap[19] * bp[48];   s1_1 += ap[19] * bp[49];
            s0_0 += ap[4] * bp[64];    s0_1 += ap[4] * bp[65];
            s1_0 += ap[20] * bp[64];   s1_1 += ap[20] * bp[65];
            s0_0 += ap[5] * bp[80];    s0_1 += ap[5] * bp[81];
            s1_0 += ap[21] * bp[80];   s1_1 += ap[21] * bp[81];
            s0_0 += ap[6] * bp[96];    s0_1 += ap[6] * bp[97];
            s1_0 += ap[22] * bp[96];   s1_1 += ap[22] * bp[97];
            s0_0 += ap[7] * bp[112];   s0_1 += ap[7] * bp[113];
            s1_0 += ap[23] * bp[112];  s1_1 += ap[23] * bp[113];
            s0_0 += ap[8] * bp[128];   s0_1 += ap[8] * bp[129];
            s1_0 += ap[24] * bp[128];  s1_1 += ap[24] * bp[129];
            s0_0 += ap[9] * bp[144];   s0_1 += ap[9] * bp[145];
            s1_0 += ap[25] * bp[144];  s1_1 += ap[25] * bp[145];
            s0_0 += ap[10] * bp[160];  s0_1 += ap[10] * bp[161];
            s1_0 += ap[26] * bp[160];  s1_1 += ap[26] * bp[161];
            s0_0 += ap[11] * bp[176];  s0_1 += ap[11] * bp[177];
            s1_0 += ap[27] * bp[176];  s1_1 += ap[27] * bp[177];
            s0_0 += ap[12] * bp[192];  s0_1 += ap[12] * bp[193];
            s1_0 += ap[28] * bp[192];  s1_1 += ap[28] * bp[193];
            s0_0 += ap[13] * bp[208];  s0_1 += ap[13] * bp[209];
            s1_0 += ap[29] * bp[208];  s1_1 += ap[29] * bp[209];
            s0_0 += ap[14] * bp[224];  s0_1 += ap[14] * bp[225];
            s1_0 += ap[30] * bp[224];  s1_1 += ap[30] * bp[225];
            s0_0 += ap[15] * bp[240];  s0_1 += ap[15] * bp[241];
            s1_0 += ap[31] * bp[240];  s1_1 += ap[31] * bp[241];
            rp[0] = s0_0;
            rp[1] = s0_1;
            rp[16] = s1_0;
            rp[17] = s1_1;
        }
    }
}
cilk void matrixmul(long nb, block *A, block *B, block *R) {
    if (nb == 1) {
        serialmul(A, B, R);
    } else if (nb >= 4) {
        spawn matrixmul(nb/4, A, B, R);
        spawn matrixmul(nb/4, A, B+(nb/4), R+(nb/4));
        spawn matrixmul(nb/4, A+2*(nb/4), B+(nb/4), R+3*(nb/4));
        spawn matrixmul(nb/4, A+2*(nb/4), B, R+2*(nb/4));
        sync;
        spawn matrixmul(nb/4, A+(nb/4), B+2*(nb/4), R);
        spawn matrixmul(nb/4, A+(nb/4), B+3*(nb/4), R+(nb/4));
        spawn matrixmul(nb/4, A+3*(nb/4), B+2*(nb/4), R+2*(nb/4));
        spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+3*(nb/4));
        sync;
    }
}
15
Goal
  • The programmer writes simple code with small base
    cases
  • The compiler automatically generates efficient
    code with large base cases

16
2. Computation Structure
17
Running Example: Array Increment
void f(char *p, int n) {
    if (n == 1) {
        /* base case: increment one element */
        (*p) += 1;
    } else {
        f(p, n/2);          /* increment first half */
        f(p+n/2, n/2);      /* increment second half */
    }
}
18
Dynamic Call Tree for n = 4
Execution of f(p, 4)
19
Dynamic Call Tree for n = 4
Execution of f(p, 4)
[Test n=1 | Call f | Call f]
20
Dynamic Call Tree for n = 4
Execution of f(p, 4)
Activation frame on the stack: [Test n=1 | Call f | Call f]
21
Dynamic Call Tree for n = 4
Execution of f(p, 4)
Executed instructions: Test n=1, Call f, Call f
22
Dynamic Call Tree for n = 4
Execution of f(p, 4)
[Test n=1 | Call f | Call f]
23
Dynamic Call Tree for n = 4
Execution of f(p, 4)
n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]
24
Dynamic Call Tree for n = 4
Execution of f(p, 4)
n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]
n=1:  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]
25
Control Flow Overhead
Execution of f(p, 4)
  • Call overhead

n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]
n=1:  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]
26
Control Flow Overhead
Execution of f(p, 4)
  • Call overhead + Test overhead

n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]
n=1:  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]
27
Computation
Execution of f(p, 4)
  • Call overhead + Test overhead
  • Computation

n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]
n=1:  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]
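
The call tree can be reproduced by instrumenting the running example. This is a minimal sketch: the `stats` struct and the name `f_counted` are hypothetical additions, and the logic is otherwise the array-increment procedure from the talk.

```c
#include <assert.h>

/* Tallies of what the call tree executes. */
struct stats {
    int calls;   /* activation frames created */
    int tests;   /* evaluations of n == 1 */
    int incs;    /* element increments (the actual computation) */
};

/* The running example f, with counters added. */
void f_counted(char *p, int n, struct stats *s) {
    assert(n >= 1);
    s->calls++;
    s->tests++;
    if (n == 1) {
        (*p) += 1;                          /* base case */
        s->incs++;
    } else {
        f_counted(p, n / 2, s);             /* first half */
        f_counted(p + n / 2, n / 2, s);     /* second half */
    }
}
```

For n = 4 the counters come out as 7 calls, 7 tests, and 4 increments, matching the seven frames of the tree.
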
28
Large Base Cases = Reduced Overhead
Execution of f(p, 4)
n=4:  [Test n=2 | Call f | Call f]
n=2:  [Test n=2 | Inc *p | Inc *(p+1)]  [Test n=2 | Inc *p | Inc *(p+1)]
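
A sketch of the same procedure with a hand-enlarged base case for n = 2 shows the reduction. The names `stats` and `f_base2` are hypothetical, and the code assumes n is a power of two with n >= 2, since there is no base case for n = 1.

```c
#include <assert.h>

struct stats {
    int calls;   /* activation frames created */
    int tests;   /* evaluations of n == 2 */
    int incs;    /* element increments */
};

/* Array increment with the base case enlarged to two elements. */
void f_base2(char *p, int n, struct stats *s) {
    assert(n >= 2);
    s->calls++;
    s->tests++;
    if (n == 2) {
        (*p) += 1;                  /* base case handles two elements */
        (*(p + 1)) += 1;
        s->incs += 2;
    } else {
        f_base2(p, n / 2, s);
        f_base2(p + n / 2, n / 2, s);
    }
}
```

For n = 4 this performs the same 4 increments with 3 calls and 3 tests instead of 7 each.
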

29
3. Transformations
30
Transformation 1: Recursion Inlining
Start with the original recursive procedure
void f(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f(p, n/2);
        f(p+n/2, n/2);
    }
}
31
Transformation 1: Recursion Inlining
Make two copies of the original procedure
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f1(p, n/2);
        f1(p+n/2, n/2);
    }
}
void f2(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f2(p, n/2);
        f2(p+n/2, n/2);
    }
}
32
Transformation 1: Recursion Inlining
Transform direct recursion to mutual recursion
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f2(p, n/2);
        f2(p+n/2, n/2);
    }
}
void f2(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f1(p, n/2);
        f1(p+n/2, n/2);
    }
}
33
Transformation 1: Recursion Inlining
Inline procedure f2 at call sites in f1
34
Transformation 1: Recursion Inlining
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        if (n/2 == 1) {
            (*p) += 1;
        } else {
            f1(p, n/2/2);
            f1(p+n/2/2, n/2/2);
        }
        if (n/2 == 1) {
            (*(p+n/2)) += 1;
        } else {
            f1(p+n/2, n/2/2);
            f1(p+n/2+n/4, n/2/2);
        }
    }
}
35
Transformation 1: Recursion Inlining
  • Reduced procedure call overhead
  • More code exposed at the intra-procedural level
  • Opportunities to simplify control flow in the
    inlined code

36
Transformation 1: Recursion Inlining
  • Reduced procedure call overhead
  • More code exposed at the intra-procedural level
  • Opportunities to simplify control flow in the
    inlined code: identical condition expressions
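
A quick way to convince oneself that inlining leaves the behavior unchanged is to run both versions side by side. In this sketch `f` is the original procedure, `f1` is the inlined one, and `same_result` is a hypothetical checker that is not part of the talk.

```c
#include <assert.h>
#include <string.h>

/* Original procedure, kept as the reference. */
void f(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f(p, n / 2);
        f(p + n / 2, n / 2);
    }
}

/* f1 after inlining f2 at both call sites. */
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        if (n / 2 == 1) {                   /* body of f2(p, n/2) */
            (*p) += 1;
        } else {
            f1(p, n / 2 / 2);
            f1(p + n / 2 / 2, n / 2 / 2);
        }
        if (n / 2 == 1) {                   /* body of f2(p + n/2, n/2) */
            (*(p + n / 2)) += 1;
        } else {
            f1(p + n / 2, n / 2 / 2);
            f1(p + n / 2 + n / 4, n / 2 / 2);
        }
    }
}

/* Hypothetical check: returns 1 if both versions leave the same array. */
int same_result(int n) {
    char a[64], b[64];
    assert(n >= 1 && n <= 64);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    f(a, n);
    f1(b, n);
    return memcmp(a, b, sizeof a) == 0;
}
```

The two `if (n / 2 == 1)` tests are the identical condition expressions that the next transformation merges.
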

37
Transformation 2: Conditional Fusion
Merge if statements with identical conditions
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n/2 == 1) {
        (*p) += 1;
        (*(p+n/2)) += 1;
    } else {
        f1(p, n/2/2);
        f1(p+n/2/2, n/2/2);
        f1(p+n/2, n/2/2);
        f1(p+n/2+n/4, n/2/2);
    }
}
38
Transformation 2: Conditional Fusion
Merge if statements with identical conditions
  • Reduced branching overhead and bigger basic
    blocks
  • Larger base case for n/2 == 1
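
The fused procedure can be exercised directly. The sketch below pairs it with a hypothetical checker, `increments_exactly`, which assumes n is a power of two no larger than 64 and verifies that each of the first n elements is incremented exactly once.

```c
#include <assert.h>
#include <string.h>

/* f1 after conditional fusion: one test of n/2 == 1 guards both
 * increments, and the four recursive calls share one else branch. */
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n / 2 == 1) {
        (*p) += 1;                  /* larger base case: two elements */
        (*(p + n / 2)) += 1;
    } else {
        f1(p, n / 2 / 2);
        f1(p + n / 2 / 2, n / 2 / 2);
        f1(p + n / 2, n / 2 / 2);
        f1(p + n / 2 + n / 4, n / 2 / 2);
    }
}

/* Hypothetical check for power-of-two n <= 64: returns 1 if the first
 * n elements are each incremented once and the rest are untouched. */
int increments_exactly(int n) {
    char a[64];
    int i;
    assert(n >= 1 && n <= 64);
    memset(a, 0, sizeof a);
    f1(a, n);
    for (i = 0; i < 64; i++)
        if (a[i] != (i < n ? 1 : 0))
            return 0;
    return 1;
}
```
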

39
Unrolling Iterations
Repeatedly apply inlining and conditional fusion
40
Second Unrolling Iteration
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n/2 == 1) {
        (*p) += 1;
        (*(p+n/2)) += 1;
    } else {
        f1(p, n/2/2);
        f1(p+n/2/2, n/2/2);
        f1(p+n/2, n/2/2);
        f1(p+n/2+n/4, n/2/2);
    }
}
void f2(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f2(p, n/2);
        f2(p+n/2, n/2);
    }
}
41
Second Unrolling Iteration
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n/2 == 1) {
        (*p) += 1;
        (*(p+n/2)) += 1;
    } else {
        f2(p, n/2/2);
        f2(p+n/2/2, n/2/2);
        f2(p+n/2, n/2/2);
        f2(p+n/2+n/4, n/2/2);
    }
}
void f2(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f1(p, n/2);
        f1(p+n/2, n/2);
    }
}
42
Result of Second Unrolling Iteration
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n/2 == 1) {
        (*p) += 1;
        (*(p+n/2)) += 1;
    } else if (n/2/2 == 1) {
        (*p) += 1;
        (*(p+n/2/2)) += 1;
        (*(p+n/2)) += 1;
        (*(p+n/2+n/2/2)) += 1;
    } else {
        f1(p, n/2/2/2);
        f1(p+n/2/2/2, n/2/2/2);
        f1(p+n/2/2, n/2/2/2);
        f1(p+n/2/2+n/2/2/2, n/2/2/2);
        f1(p+n/2, n/2/2/2);
        f1(p+n/2+n/2/2/2, n/2/2/2);
        f1(p+n/2+n/2/2, n/2/2/2);
        f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
    }
}
43
Unrolling Iterations
  • The unrolling process stops when the number of
    iterations reaches the desired unrolling factor
  • The unrolled recursive procedure
      • Has base cases for larger problem sizes
      • Divides the given problem into more sub-problems
        of smaller sizes
  • In our example
      • Base cases for n = 1, n = 2, and n = 4
      • Problems are divided into 8 problems of 1/8 size
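
The result of two unrolling iterations can be checked against the original procedure. In this sketch the local names `half`, `quarter`, and `eighth` and the checker `same_result` are hypothetical conveniences; the structure of `f1` follows the slides.

```c
#include <assert.h>
#include <string.h>

/* Original procedure, kept as the reference. */
void f(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f(p, n / 2);
        f(p + n / 2, n / 2);
    }
}

/* f1 after two unrolling iterations: three base cases, eight calls. */
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n / 2 == 1) {
        (*p) += 1;
        (*(p + n / 2)) += 1;
    } else if (n / 2 / 2 == 1) {
        (*p) += 1;
        (*(p + n / 2 / 2)) += 1;
        (*(p + n / 2)) += 1;
        (*(p + n / 2 + n / 2 / 2)) += 1;
    } else {
        int half = n / 2;
        int quarter = n / 2 / 2;
        int eighth = n / 2 / 2 / 2;
        f1(p, eighth);
        f1(p + eighth, eighth);
        f1(p + quarter, eighth);
        f1(p + quarter + eighth, eighth);
        f1(p + half, eighth);
        f1(p + half + eighth, eighth);
        f1(p + half + quarter, eighth);
        f1(p + half + quarter + eighth, eighth);
    }
}

/* Hypothetical check: returns 1 if both versions leave the same array. */
int same_result(int n) {
    char a[64], b[64];
    assert(n >= 1 && n <= 64);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    f(a, n);
    f1(b, n);
    return memcmp(a, b, sizeof a) == 0;
}
```
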

44
Speedup for Matrix Multiply
Matrix of 512 x 512 elements
45
Speedup for Matrix Multiply
Matrix of 512 x 512 elements
46
Speedup for Matrix Multiply
Matrix of 1024 x 1024 elements
47
Efficiency of Unrolled Recursive Part
  • Because the recursive part is also unrolled,
    recursion may not exercise the large base cases
  • Which base case is executed depends on the size
    of the input problem
  • In our example
      • For a problem of size n = 8, the base case for
        n = 1 is executed
      • For a problem of size n = 16, the base case for
        n = 2 is executed
      • The efficient base case for n = 4 is not
        executed in these cases
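
Which base case actually runs can be observed by adding counters to the unrolled procedure. This is an illustrative sketch; the `hits` struct and the name `f1_unrolled` are hypothetical.

```c
#include <assert.h>

/* How many times each base case ran. */
struct hits {
    long n1;   /* branch taken when n == 1 */
    long n2;   /* branch taken when n/2 == 1 */
    long n4;   /* branch taken when n/2/2 == 1 */
};

/* The twice-unrolled procedure (8-way recursion) with counters. */
void f1_unrolled(char *p, int n, struct hits *h) {
    assert(n >= 1);
    if (n == 1) {
        (*p) += 1;
        h->n1++;
    } else if (n / 2 == 1) {
        (*p) += 1;
        (*(p + n / 2)) += 1;
        h->n2++;
    } else if (n / 2 / 2 == 1) {
        (*p) += 1;
        (*(p + n / 2 / 2)) += 1;
        (*(p + n / 2)) += 1;
        (*(p + n / 2 + n / 2 / 2)) += 1;
        h->n4++;
    } else {
        int half = n / 2;
        int quarter = n / 2 / 2;
        int eighth = n / 2 / 2 / 2;
        f1_unrolled(p, eighth, h);
        f1_unrolled(p + eighth, eighth, h);
        f1_unrolled(p + quarter, eighth, h);
        f1_unrolled(p + quarter + eighth, eighth, h);
        f1_unrolled(p + half, eighth, h);
        f1_unrolled(p + half + eighth, eighth, h);
        f1_unrolled(p + half + quarter, eighth, h);
        f1_unrolled(p + half + quarter + eighth, eighth, h);
    }
}
```

With n = 8 all eight leaves land in the n = 1 base case, and with n = 16 all eight land in the n = 2 base case; only sizes such as n = 32 reach the n = 4 base case.
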

48
Solution: Recursion Re-Rolling
  • Roll back the recursive part of the unrolled
    procedure after the large base cases are
    generated
  • Re-Rolling ensures that larger base cases are
    always executed, independent of the input problem
    size
  • The compiler unrolls the recursive part only
    temporarily, to generate the base cases

49
Transformation 3: Recursion Re-Rolling
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n/2 == 1) {
        (*p) += 1;
        (*(p+n/2)) += 1;
    } else if (n/2/2 == 1) {
        (*p) += 1;
        (*(p+n/2/2)) += 1;
        (*(p+n/2)) += 1;
        (*(p+n/2+n/2/2)) += 1;
    } else {
        f1(p, n/2/2/2);
        f1(p+n/2/2/2, n/2/2/2);
        f1(p+n/2/2, n/2/2/2);
        f1(p+n/2/2+n/2/2/2, n/2/2/2);
        f1(p+n/2, n/2/2/2);
        f1(p+n/2+n/2/2/2, n/2/2/2);
        f1(p+n/2+n/2/2, n/2/2/2);
        f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
    }
}
50
Transformation 3: Recursion Re-Rolling
Identify the recursive part
51
Transformation 3: Recursion Re-Rolling
Replace with the recursive part of the original
procedure
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n/2 == 1) {
        (*p) += 1;
        (*(p+n/2)) += 1;
    } else if (n/2/2 == 1) {
        (*p) += 1;
        (*(p+n/2/2)) += 1;
        (*(p+n/2)) += 1;
        (*(p+n/2+n/2/2)) += 1;
    } else {
        f1(p, n/2);
        f1(p+n/2, n/2);
    }
}
52
Final Result
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n/2 == 1) {
        (*p) += 1;
        (*(p+n/2)) += 1;
    } else if (n/2/2 == 1) {
        (*p) += 1;
        (*(p+n/2/2)) += 1;
        (*(p+n/2)) += 1;
        (*(p+n/2+n/2/2)) += 1;
    } else {
        f1(p, n/2);
        f1(p+n/2, n/2);
    }
}
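
The effect of re-rolling can be observed with the same kind of counters. In this sketch the `hits` struct and the name `f1_rerolled` are hypothetical; the procedure body follows the final result from the talk.

```c
#include <assert.h>

/* How many times each base case ran. */
struct hits {
    long n1;   /* branch taken when n == 1 */
    long n2;   /* branch taken when n/2 == 1 */
    long n4;   /* branch taken when n/2/2 == 1 */
};

/* Final procedure: large base cases, original 2-way recursion. */
void f1_rerolled(char *p, int n, struct hits *h) {
    assert(n >= 1);
    if (n == 1) {
        (*p) += 1;
        h->n1++;
    } else if (n / 2 == 1) {
        (*p) += 1;
        (*(p + n / 2)) += 1;
        h->n2++;
    } else if (n / 2 / 2 == 1) {
        (*p) += 1;
        (*(p + n / 2 / 2)) += 1;
        (*(p + n / 2)) += 1;
        (*(p + n / 2 + n / 2 / 2)) += 1;
        h->n4++;
    } else {
        f1_rerolled(p, n / 2, h);           /* re-rolled recursive part */
        f1_rerolled(p + n / 2, n / 2, h);
    }
}
```

For power-of-two sizes of 4 or more, halving always passes through n = 4, so only the largest base case runs: twice for n = 8 and four times for n = 16.
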
53
Speedup for Matrix Multiply
Matrix of 512 x 512 elements
54
Speedup for Matrix Multiply
Matrix of 1024 x 1024 elements
55
Other Optimizations
  • Inlining moves code from the inter-procedural
    level to the intra-procedural level
  • Conditional fusion brings code from the
    inter-basic-block level to the intra-basic-block
    level
  • Together, inlining and conditional fusion give
    subsequent compiler passes the opportunity to
    perform more aggressive optimizations

56
Comparison to Hand Coded Programs
  • Two applications: matrix multiply, LU
    decomposition
  • Three machines: Pentium III, Origin 2000, PowerPC
  • Two different problem sizes
  • Compare automatically unrolled programs to
    optimized, hand coded versions from the Cilk
    benchmarks
  • Best automatically unrolled version performs
      • Between 2.2 and 2.9 times worse for matrix
        multiply
      • As good as hand coded version for LU

57
Related Work
  • Procedure Inlining
      • Scheifler (1977)
      • Richardson, Ganapathi (1989)
      • Chambers, Ungar (1989)
      • Cooper, Hall, Torczon (1991)
      • Appel (1992)
      • Chang, Mahlke, Chen, Hwu (1992)

58
Conclusion
  • Recursion Unrolling
      • Analogous to the loop unrolling transformation
  • Divide and Conquer Programs
      • The programmer writes simple base cases
      • The compiler automatically generates large base
        cases
  • Key Techniques
      • Inlining: conceptually inline recursive calls
      • Conditional Fusion: simplify intra-procedural
        control flow
      • Re-Rolling: ensure that large base cases are
        executed

62
Comparison to Hand Coded Programs
  • Matrix multiply, 512 x 512 elements
      • Best automatically unrolled program: 2.55 sec.
      • Hand coded with three nested loops: 3.46 sec.
      • Hand coded Cilk program: 1.16 sec.
  • Matrix multiply, 1024 x 1024 elements
      • Best automatically unrolled program: 20.47 sec.
      • Hand coded with three nested loops: 27.40 sec.
      • Hand coded Cilk program: 9.19 sec.
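
The ratios implied by these timings are easy to work out. The sketch below stores the six measurements from this slide in a hypothetical `timing` table and computes two derived quantities; the helper names are illustrative only.

```c
#include <assert.h>

/* Running times in seconds, as listed on the slide. */
struct timing {
    const char *size;
    double unrolled;   /* best automatically unrolled program */
    double loops;      /* hand coded with three nested loops */
    double cilk;       /* hand coded Cilk program */
};

static const struct timing timings[2] = {
    {"512 x 512",   2.55,  3.46,  1.16},
    {"1024 x 1024", 20.47, 27.40, 9.19}
};

/* How many times slower the unrolled program is than the Cilk one. */
double slowdown_vs_cilk(const struct timing *t) {
    assert(t->cilk > 0.0);
    return t->unrolled / t->cilk;
}

/* How many times faster the unrolled program is than the nested loops. */
double speedup_vs_loops(const struct timing *t) {
    assert(t->unrolled > 0.0);
    return t->loops / t->unrolled;
}
```

On these numbers the unrolled program is roughly 2.2 times slower than the hand coded Cilk program and roughly 1.3 to 1.4 times faster than the three nested loops, at both problem sizes.
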

63
Correctness
  • Recursion unrolling preserves the semantics of
    the program
  • The unrolled program terminates if and only if
    the original recursive program terminates
  • When both the original and the unrolled program
    terminate, they yield the same result
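
The second property can be spot-checked for the array-increment example by comparing the original procedure with the final unrolled one. This is a test sketch, not a proof: `first_mismatch` is a hypothetical helper, and it only covers sizes n >= 1, for which both versions terminate.

```c
#include <assert.h>
#include <string.h>

/* Original procedure. */
void f(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f(p, n / 2);
        f(p + n / 2, n / 2);
    }
}

/* Final unrolled and re-rolled procedure. */
void f1(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else if (n / 2 == 1) {
        (*p) += 1;
        (*(p + n / 2)) += 1;
    } else if (n / 2 / 2 == 1) {
        (*p) += 1;
        (*(p + n / 2 / 2)) += 1;
        (*(p + n / 2)) += 1;
        (*(p + n / 2 + n / 2 / 2)) += 1;
    } else {
        f1(p, n / 2);
        f1(p + n / 2, n / 2);
    }
}

/* Returns the first size in 1..max_n where the two versions leave
 * different arrays, or 0 if they agree on every size tried. */
int first_mismatch(int max_n) {
    int n;
    assert(max_n >= 1 && max_n <= 64);
    for (n = 1; n <= max_n; n++) {
        char a[64], b[64];
        memset(a, 0, sizeof a);
        memset(b, 0, sizeof b);
        f(a, n);
        f1(b, n);
        if (memcmp(a, b, sizeof a) != 0)
            return n;
    }
    return 0;
}
```

The comparison includes sizes that are not powers of two, where integer division makes both versions skip the same trailing elements.
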

64
Speedup for Matrix Multiply
Pentium III, Matrix of 512 x 512 elements
65
Speedup for Matrix Multiply
Pentium III, Matrix of 1024 x 1024 elements
66
Speedup for Matrix Multiply
Power PC, Matrix of 512 x 512 elements
67
Speedup for Matrix Multiply
Power PC, Matrix of 1024 x 1024 elements
68
Speedup for Matrix Multiply
Origin 2000, Matrix of 512 x 512 elements
69
Speedup for Matrix Multiply
Origin 2000, Matrix of 1024 x 1024 elements
70
Speedup for LU
Pentium III, Matrix of 512 x 512 elements
71
Speedup for LU
Pentium III, Matrix of 1024 x 1024 elements
72
Speedup for LU
Power PC, Matrix of 512 x 512 elements
73
Speedup for LU
Power PC, Matrix of 1024 x 1024 elements
74
Speedup for LU
Origin 2000, Matrix of 1024 x 1024 elements
75
Speedup for LU
Origin 2000, Matrix of 512 x 512 elements