Recursion Unrolling for Divide and Conquer Programs

Slides: 76

Provided by: rad130

Learn more at: http://people.csail.mit.edu

1
Recursion Unrolling for Divide and Conquer
Programs

Radu Rugina and Martin Rinard
Laboratory for Computer Science
Massachusetts Institute of Technology

2
What This Talk Is About

Automatic generation of efficient large base
cases for divide and conquer programs

3
Outline

Motivating Example
Computation Structure
Transformations
Related Work
Conclusion

4
1. Motivating Example
5
Divide and Conquer Matrix Multiply
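A × B = R

\[
\begin{pmatrix} A_0 & A_1 \\ A_2 & A_3 \end{pmatrix}
\times
\begin{pmatrix} B_0 & B_1 \\ B_2 & B_3 \end{pmatrix}
=
\begin{pmatrix}
A_0 B_0 + A_1 B_2 & A_0 B_1 + A_1 B_3 \\
A_2 B_0 + A_3 B_2 & A_2 B_1 + A_3 B_3
\end{pmatrix}
\]

Divide matrices into sub-matrices A0, A1, A2, etc.
Use blocked matrix multiply equations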

6
Divide and Conquer Matrix Multiply

Recursively multiply sub-matrices

7
Divide and Conquer Matrix Multiply
A × B = R

\[ (a_0) \times (b_0) = (a_0 b_0) \]

Terminate recursion with a simple base case

8
Divide and Conquer Matrix Multiply
void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    /* base case: 1 x 1 matrices */
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}
Implements R += A × B
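The pointer offsets in matmul only work if every quadrant occupies a contiguous quarter of its array, so that A+2*(n/4) is the start of A2. The sketch below assumes exactly that: n counts elements (not rows), and quadrants are stored recursively in the order 0, 1, 2, 3. The helpers zindex and matmul_matches_naive are hypothetical additions for illustration and do not appear in the talk.

```c
#include <assert.h>

/* Hypothetical helper (not from the slides): position of element (row, col)
 * of a side x side matrix in the assumed quadrant-recursive layout, where
 * quadrants are stored contiguously in the order
 * top-left, top-right, bottom-left, bottom-right. side is a power of 2. */
int zindex(int row, int col, int side)
{
    int idx = 0;
    for (int bit = side / 2; bit > 0; bit /= 2) {
        idx = idx * 4 + 2 * ((row & bit) != 0) + ((col & bit) != 0);
    }
    return idx;
}

/* R += A * B, following the slide; n is the number of elements per matrix. */
void matmul(const int *A, const int *B, int *R, int n)
{
    if (n == 1) {
        *R += (*A) * (*B);
    } else {
        int q = n / 4;                               /* elements per quadrant */
        matmul(A,         B,         R,         q);  /* R0 += A0*B0 */
        matmul(A,         B + q,     R + q,     q);  /* R1 += A0*B1 */
        matmul(A + 2 * q, B,         R + 2 * q, q);  /* R2 += A2*B0 */
        matmul(A + 2 * q, B + q,     R + 3 * q, q);  /* R3 += A2*B1 */
        matmul(A + q,     B + 2 * q, R,         q);  /* R0 += A1*B2 */
        matmul(A + q,     B + 3 * q, R + q,     q);  /* R1 += A1*B3 */
        matmul(A + 3 * q, B + 2 * q, R + 2 * q, q);  /* R2 += A3*B2 */
        matmul(A + 3 * q, B + 3 * q, R + 3 * q, q);  /* R3 += A3*B3 */
    }
}

/* Returns 1 if matmul agrees with the textbook triple loop on a
 * side x side test matrix (side a power of 2, at most 8). */
int matmul_matches_naive(int side)
{
    int A[64], B[64], R[64] = {0};
    assert(side >= 1 && side <= 8 && (side & (side - 1)) == 0);
    for (int i = 0; i < side; i++)
        for (int j = 0; j < side; j++) {
            A[zindex(i, j, side)] = i + 2 * j + 1;
            B[zindex(i, j, side)] = 3 * i - j;
        }
    matmul(A, B, R, side * side);
    for (int i = 0; i < side; i++)
        for (int j = 0; j < side; j++) {
            int sum = 0;
            for (int k = 0; k < side; k++)
                sum += A[zindex(i, k, side)] * B[zindex(k, j, side)];
            if (R[zindex(i, j, side)] != sum)
                return 0;
        }
    return 1;
}
```

For a 2 x 2 matrix this layout coincides with row-major order, so multiplying {1, 2, 3, 4} by {5, 6, 7, 8} into a zeroed R should leave {19, 22, 43, 50}.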
9
Divide and Conquer Matrix Multiply
Divide matrices into sub-matrices and recursively
multiply sub-matrices
10
Divide and Conquer Matrix Multiply
Identify sub-matrices with pointers
11
Divide and Conquer Matrix Multiply
Use a simple algorithm for the base case
12
Divide and Conquer Matrix Multiply

Advantage of small base case: simplicity
Code is easy to
  Write
  Maintain
  Debug
  Understand

Disadvantage: inefficiency
  Large control flow overhead
  Most of the time is spent dividing the matrix
  into sub-matrices
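
A rough count shows where the overhead comes from. Assume, as the pointer offsets in matmul suggest, that each matrix holds $n = 4^k$ elements. Every call with $n > 1$ makes 8 recursive calls, so depth $i$ of the call tree contains $8^i$ calls, and only the leaves do arithmetic:

\[
\text{calls}(k) = \sum_{i=0}^{k} 8^i = \frac{8^{k+1} - 1}{7},
\qquad
\text{multiply-adds}(k) = 8^k
\]

For a 512 x 512 matrix, $k = 9$: 134,217,728 multiply-adds but 153,391,689 calls, which is more than one procedure call (and one test) per useful arithmetic operation. If the recursion instead stopped at a base case of $4^j$ elements, the call count would drop to

\[
\frac{8^{k-j+1} - 1}{7}
\]

For example, $j = 2$ (4 x 4 blocks) with $k = 9$ gives 2,396,745 calls.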

14
Hand Coded Implementation
void serialmul(block *As, block *Bs, block *Rs) {
  int i, j;
  DOUBLE *A = (DOUBLE *) As;
  DOUBLE *B = (DOUBLE *) Bs;
  DOUBLE *R = (DOUBLE *) Rs;
  for (j = 0; j < 16; j += 2) {
    DOUBLE *bp = &B[j];
    for (i = 0; i < 16; i += 2) {
      DOUBLE *ap = &A[i * 16];
      DOUBLE *rp = &R[j + i * 16];
      register DOUBLE s0_0 = rp[0], s0_1 = rp[1];
      register DOUBLE s1_0 = rp[16], s1_1 = rp[17];
      s0_0 += ap[0] * bp[0];    s0_1 += ap[0] * bp[1];    s1_0 += ap[16] * bp[0];   s1_1 += ap[16] * bp[1];
      s0_0 += ap[1] * bp[16];   s0_1 += ap[1] * bp[17];   s1_0 += ap[17] * bp[16];  s1_1 += ap[17] * bp[17];
      s0_0 += ap[2] * bp[32];   s0_1 += ap[2] * bp[33];   s1_0 += ap[18] * bp[32];  s1_1 += ap[18] * bp[33];
      s0_0 += ap[3] * bp[48];   s0_1 += ap[3] * bp[49];   s1_0 += ap[19] * bp[48];  s1_1 += ap[19] * bp[49];
      s0_0 += ap[4] * bp[64];   s0_1 += ap[4] * bp[65];   s1_0 += ap[20] * bp[64];  s1_1 += ap[20] * bp[65];
      s0_0 += ap[5] * bp[80];   s0_1 += ap[5] * bp[81];   s1_0 += ap[21] * bp[80];  s1_1 += ap[21] * bp[81];
      s0_0 += ap[6] * bp[96];   s0_1 += ap[6] * bp[97];   s1_0 += ap[22] * bp[96];  s1_1 += ap[22] * bp[97];
      s0_0 += ap[7] * bp[112];  s0_1 += ap[7] * bp[113];  s1_0 += ap[23] * bp[112]; s1_1 += ap[23] * bp[113];
      s0_0 += ap[8] * bp[128];  s0_1 += ap[8] * bp[129];  s1_0 += ap[24] * bp[128]; s1_1 += ap[24] * bp[129];
      s0_0 += ap[9] * bp[144];  s0_1 += ap[9] * bp[145];  s1_0 += ap[25] * bp[144]; s1_1 += ap[25] * bp[145];
      s0_0 += ap[10] * bp[160]; s0_1 += ap[10] * bp[161]; s1_0 += ap[26] * bp[160]; s1_1 += ap[26] * bp[161];
      s0_0 += ap[11] * bp[176]; s0_1 += ap[11] * bp[177]; s1_0 += ap[27] * bp[176]; s1_1 += ap[27] * bp[177];
      s0_0 += ap[12] * bp[192]; s0_1 += ap[12] * bp[193]; s1_0 += ap[28] * bp[192]; s1_1 += ap[28] * bp[193];
      s0_0 += ap[13] * bp[208]; s0_1 += ap[13] * bp[209]; s1_0 += ap[29] * bp[208]; s1_1 += ap[29] * bp[209];
      s0_0 += ap[14] * bp[224]; s0_1 += ap[14] * bp[225]; s1_0 += ap[30] * bp[224]; s1_1 += ap[30] * bp[225];
      s0_0 += ap[15] * bp[240]; s0_1 += ap[15] * bp[241]; s1_0 += ap[31] * bp[240]; s1_1 += ap[31] * bp[241];
      rp[0] = s0_0;
      rp[1] = s0_1;
      rp[16] = s1_0;
      rp[17] = s1_1;
    }
  }
}

cilk void matrixmul(long nb, block *A, block *B, block *R) {
  if (nb == 1) {
    serialmul(A, B, R);
  } else if (nb >= 4) {
    spawn matrixmul(nb/4, A, B, R);
    spawn matrixmul(nb/4, A, B+(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B, R+2*(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B+(nb/4), R+3*(nb/4));
    sync;
    spawn matrixmul(nb/4, A+(nb/4), B+2*(nb/4), R);
    spawn matrixmul(nb/4, A+(nb/4), B+3*(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+2*(nb/4), R+2*(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+3*(nb/4));
    sync;
  }
}
15
Goal

The programmer writes simple code with small base
cases
The compiler automatically generates efficient
code with large base cases

16
2. Computation Structure
17
Running Example: Array Increment
void f(char *p, int n) {
  if (n == 1) {
    /* base case: increment one element */
    (*p) += 1;
  } else {
    f(p, n/2);      /* increment first half */
    f(p+n/2, n/2);  /* increment second half */
  }
}
18
Dynamic Call Tree for n = 4
Execution of f(p, 4)
19
Dynamic Call Tree for n = 4
Execution of f(p, 4)
[Test n = 1, Call f, Call f]
20
Dynamic Call Tree for n = 4
Execution of f(p, 4)
[Test n = 1, Call f, Call f]
Activation Frame on the Stack
21
Dynamic Call Tree for n = 4
Execution of f(p, 4)
[Test n = 1, Call f, Call f]
Executed Instructions
22
Dynamic Call Tree for n = 4
Execution of f(p, 4)
[Test n = 1, Call f, Call f]
23
Dynamic Call Tree for n = 4
Execution of f(p, 4)
n = 4:  [Test n = 1, Call f, Call f]
n = 2:  [Test n = 1, Call f, Call f]  [Test n = 1, Call f, Call f]
24
Dynamic Call Tree for n = 4
Execution of f(p, 4)
n = 4:  [Test n = 1, Call f, Call f]
n = 2:  [Test n = 1, Call f, Call f]  [Test n = 1, Call f, Call f]
n = 1:  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]
25
Control Flow Overhead
Execution of f(p, 4)

Call overhead

n = 4:  [Test n = 1, Call f, Call f]
n = 2:  [Test n = 1, Call f, Call f]  [Test n = 1, Call f, Call f]
n = 1:  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]
26
Control Flow Overhead
Execution of f(p, 4)

Call overhead, Test overhead

n = 4:  [Test n = 1, Call f, Call f]
n = 2:  [Test n = 1, Call f, Call f]  [Test n = 1, Call f, Call f]
n = 1:  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]
27
Computation
Execution of f(p, 4)

Call overhead, Test overhead, Computation

n = 4:  [Test n = 1, Call f, Call f]
n = 2:  [Test n = 1, Call f, Call f]  [Test n = 1, Call f, Call f]
n = 1:  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]  [Test n = 1, Inc *p]
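
The split between overhead and computation in the call tree can be measured directly. The sketch below is the array-increment procedure with three counters added; the counters and the run_counted wrapper are illustrative additions, not part of the talk.

```c
#include <assert.h>

/* Counters are illustrative additions; the procedure on the slides has none. */
int calls, tests, increments;

void f_counted(char *p, int n)
{
    calls++;                 /* one activation frame per call */
    tests++;                 /* one evaluation of n == 1 per call */
    if (n == 1) {
        *p += 1;
        increments++;        /* the only useful work */
    } else {
        f_counted(p, n / 2);
        f_counted(p + n / 2, n / 2);
    }
}

/* Resets the counters and runs f_counted on a zeroed buffer
 * (n between 1 and 64). */
void run_counted(int n)
{
    char buf[64] = {0};
    assert(n >= 1 && n <= 64);
    calls = tests = increments = 0;
    f_counted(buf, n);
}
```

For n = 4 this reports 7 calls and 7 tests for only 4 increments, matching the 7 nodes of the call tree.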
28
Large Base Cases: Reduced Overhead
Execution of f(p, 4)
n = 4:  [Test n = 2, Call f, Call f]
n = 2:  [Test n = 2, Inc *p, Inc *(p+1)]  [Test n = 2, Inc *p, Inc *(p+1)]
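
A hand-written version of this larger base case might look like the sketch below. It is only an illustration of the call tree on this slide, not the compiler's output: the counters and the run_base2 wrapper are assumptions, and because the n == 1 case is gone the procedure is only safe for power-of-two sizes of at least 2 (f_base2(p, 1) would recurse forever on n/2 == 0).

```c
#include <assert.h>

/* Illustrative counters, not part of the slides. */
int calls2, tests2, increments2;

void f_base2(char *p, int n)
{
    calls2++;
    tests2++;
    if (n == 2) {            /* enlarged base case: two elements at once */
        *p += 1;
        *(p + 1) += 1;
        increments2 += 2;
    } else {
        f_base2(p, n / 2);
        f_base2(p + n / 2, n / 2);
    }
}

/* Resets the counters and runs f_base2 on a zeroed buffer.
 * n must be a power of 2 between 2 and 64. */
void run_base2(int n)
{
    char buf[64] = {0};
    assert(n >= 2 && n <= 64 && (n & (n - 1)) == 0);
    calls2 = tests2 = increments2 = 0;
    f_base2(buf, n);
}
```

For n = 4 the counts drop to 3 calls and 3 tests for the same 4 increments, compared with 7 calls and 7 tests for the n == 1 base case.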

29
3. Transformations
30
Transformation 1: Recursion Inlining
Start with the original recursive procedure
void f(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);
    f(p+n/2, n/2);
  }
}
31
Transformation 1: Recursion Inlining
Make two copies of the original procedure
void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}
32
Transformation 1: Recursion Inlining
Transform direct recursion to mutual recursion
void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
33
Transformation 1: Recursion Inlining
Inline procedure f2 at call sites in f1
34
Transformation 1: Recursion Inlining
void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      f1(p, n/2/2);
      f1(p+n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p+n/2) += 1;
    } else {
      f1(p+n/2, n/2/2);
      f1(p+n/2+n/4, n/2/2);
    }
  }
}
35
Transformation 1: Recursion Inlining

Reduced procedure call overhead
More code exposed at the intra-procedural level
Opportunities to simplify control flow in the
inlined code

36
Transformation 1: Recursion Inlining

Reduced procedure call overhead
More code exposed at the intra-procedural level
Opportunities to simplify control flow in the
inlined code: identical condition expressions

37
Transformation 2: Conditional Fusion
Merge if statements with identical conditions
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}
38
Transformation 2: Conditional Fusion
Merge if statements with identical conditions

Reduced branching overhead and bigger basic
blocks
Larger base case for n/2 = 1
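
Fusing the two tests is safe here because both evaluate the same expression, n/2 == 1, and nothing between them modifies n. The sketch below places the inlined and the fused forms side by side and compares the buffers they produce; the function names and the same_result helper are illustrative, not taken from the talk.

```c
#include <assert.h>
#include <string.h>

/* After recursion inlining: two separate tests of n/2 == 1. */
void f1_inlined(char *p, int n)
{
    if (n == 1) {
        *p += 1;
    } else {
        if (n / 2 == 1) {
            *p += 1;
        } else {
            f1_inlined(p, n / 2 / 2);
            f1_inlined(p + n / 2 / 2, n / 2 / 2);
        }
        if (n / 2 == 1) {
            *(p + n / 2) += 1;
        } else {
            f1_inlined(p + n / 2, n / 2 / 2);
            f1_inlined(p + n / 2 + n / 4, n / 2 / 2);
        }
    }
}

/* After conditional fusion: one test, larger basic blocks. */
void f1_fused(char *p, int n)
{
    if (n == 1) {
        *p += 1;
    } else if (n / 2 == 1) {
        *p += 1;
        *(p + n / 2) += 1;
    } else {
        f1_fused(p, n / 2 / 2);
        f1_fused(p + n / 2 / 2, n / 2 / 2);
        f1_fused(p + n / 2, n / 2 / 2);
        f1_fused(p + n / 2 + n / 4, n / 2 / 2);
    }
}

/* Returns 1 if both versions leave identical buffers for problem size n
 * (n between 1 and 64). */
int same_result(int n)
{
    char a[64] = {0}, b[64] = {0};
    assert(n >= 1 && n <= 64);
    f1_inlined(a, n);
    f1_fused(b, n);
    return memcmp(a, b, sizeof a) == 0;
}
```

Note that fusion also reorders the recursive calls relative to the inlined version, which is harmless in this example because each call touches a disjoint part of the array.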

39
Unrolling Iterations
Repeatedly apply inlining and conditional fusion
40
Second Unrolling Iteration
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}
41
Second Unrolling Iteration
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f2(p, n/2/2);
    f2(p+n/2/2, n/2/2);
    f2(p+n/2, n/2/2);
    f2(p+n/2+n/4, n/2/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
42
Result of Second Unrolling Iteration
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}
43
Unrolling Iterations

The unrolling process stops when the number of
iterations reaches the desired unrolling factor
The unrolled recursive procedure
  Has base cases for larger problem sizes
  Divides the given problem into more sub-problems
  of smaller sizes
In our example
  Base cases for n = 1, n = 2, and n = 4
  Problems are divided into 8 problems of 1/8 size

44
Speedup for Matrix Multiply
Matrix of 512 x 512 elements
45
Speedup for Matrix Multiply
Matrix of 512 x 512 elements
46
Speedup for Matrix Multiply
Matrix of 1024 x 1024 elements
47
Efficiency of Unrolled Recursive Part

Because the recursive part is also unrolled,
recursion may not exercise the large base cases
Which base case is executed depends on the size
of the input problem
In our example
  For a problem of size n = 8, the base case for n = 1
  is executed
  For a problem of size n = 16, the base case for n = 2
  is executed
  The efficient base case for n = 4 is not executed
  in these cases
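
This effect is easy to observe by counting which base case the twice-unrolled procedure reaches. In the sketch below, the hits array and the run_unrolled wrapper are illustrative additions; the body otherwise follows the result of the second unrolling iteration.

```c
#include <assert.h>

/* hits[k] counts executions of the base case that handles 2^k elements;
 * the counters are illustrative additions, not part of the slides. */
int hits[3];

void f1_unrolled(char *p, int n)
{
    if (n == 1) {
        hits[0]++;
        *p += 1;
    } else if (n / 2 == 1) {
        hits[1]++;
        *p += 1;
        *(p + n / 2) += 1;
    } else if (n / 2 / 2 == 1) {
        hits[2]++;
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
    } else {
        int e = n / 2 / 2 / 2;   /* size of each of the 8 sub-problems */
        f1_unrolled(p, e);
        f1_unrolled(p + e, e);
        f1_unrolled(p + n / 2 / 2, e);
        f1_unrolled(p + n / 2 / 2 + e, e);
        f1_unrolled(p + n / 2, e);
        f1_unrolled(p + n / 2 + e, e);
        f1_unrolled(p + n / 2 + n / 2 / 2, e);
        f1_unrolled(p + n / 2 + n / 2 / 2 + e, e);
    }
}

/* Resets the counters and runs f1_unrolled on a zeroed buffer
 * (n between 1 and 64). */
void run_unrolled(int n)
{
    char buf[64] = {0};
    assert(n >= 1 && n <= 64);
    hits[0] = hits[1] = hits[2] = 0;
    f1_unrolled(buf, n);
}
```

With n = 8 all eight sub-problems land in the n = 1 base case, and with n = 16 all eight land in the n = 2 base case; only sizes such as n = 4 or n = 32 reach the n = 4 base case.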

48
Solution: Recursion Re-Rolling

Roll back the recursive part of the unrolled
procedure after the large base cases are
generated
Re-Rolling ensures that larger base cases are
always executed, independent of the input problem
size
The compiler unrolls the recursive part only
temporarily, to generate the base cases

49
Transformation 3: Recursion Re-Rolling
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}
50
Transformation 3: Recursion Re-Rolling
Identify the recursive part
51
Transformation 3: Recursion Re-Rolling
Replace with the recursive part of the original
procedure
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
52
Final Result
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
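
To check that re-rolling has the intended effect, the final procedure can be instrumented with the same kind of base-case counters. The hits array and the run_rerolled wrapper in this sketch are illustrative additions, not part of the talk.

```c
#include <assert.h>

/* hits[k] counts executions of the base case that handles 2^k elements;
 * the counters are illustrative additions, not part of the slides. */
int hits[3];

void f1_rerolled(char *p, int n)
{
    if (n == 1) {
        hits[0]++;
        *p += 1;
    } else if (n / 2 == 1) {
        hits[1]++;
        *p += 1;
        *(p + n / 2) += 1;
    } else if (n / 2 / 2 == 1) {
        hits[2]++;
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
    } else {
        /* recursive part of the original procedure: halve, do not divide by 8 */
        f1_rerolled(p, n / 2);
        f1_rerolled(p + n / 2, n / 2);
    }
}

/* Resets the counters and runs f1_rerolled on a zeroed buffer
 * (n between 1 and 64). */
void run_rerolled(int n)
{
    char buf[64] = {0};
    assert(n >= 1 && n <= 64);
    hits[0] = hits[1] = hits[2] = 0;
    f1_rerolled(buf, n);
}
```

For every power-of-two size from 4 upward, all the work now lands in the n = 4 base case (n/4 hits), and the smaller base cases are used only when the input itself has 1 or 2 elements.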
53
Speedup for Matrix Multiply
Matrix of 512 x 512 elements
54
Speedup for Matrix Multiply
Matrix of 1024 x 1024 elements
55
Other Optimizations

Inlining moves code from the inter-procedural
level to the intra-procedural level
Conditional fusion brings code from the
inter-basic-block level to the intra-basic-block
level
Together, inlining and conditional fusion give
subsequent compiler passes the opportunity to
perform more aggressive optimizations

56
Comparison to Hand Coded Programs

Two applications: matrix multiply, LU
decomposition
Three machines: Pentium III, Origin 2000, PowerPC
Two different problem sizes
Compare automatically unrolled programs to
optimized, hand coded versions from the Cilk
benchmarks
Best automatically unrolled version performs
  Between 2.2 and 2.9 times worse for matrix
  multiply
  As well as the hand coded version for LU

57
Related Work

Procedure Inlining
Scheifler (1977)
Richardson, Ganapathi (1989)
Chambers, Ungar (1989)
Cooper, Hall, Torczon (1991)
Appel (1992)
Chang, Mahlke, Chen, Hwu (1992)

58
Conclusion

Recursion Unrolling
  Analogous to the loop unrolling transformation
Divide and Conquer Programs
  The programmer writes simple base cases
  The compiler automatically generates large base
  cases
Key Techniques
  Inlining: conceptually inline recursive calls
  Conditional Fusion: simplify intra-procedural
  control flow
  Re-Rolling: ensure that large base cases are
  executed

62
Comparison to Hand Coded Programs

Matrix multiply for 512 x 512 elements
  Best automatically unrolled program: 2.55 sec.
  Hand coded with three nested loops: 3.46 sec.
  Hand coded Cilk program: 1.16 sec.
Matrix multiply for 1024 x 1024 elements
  Best automatically unrolled program: 20.47 sec.
  Hand coded with three nested loops: 27.40 sec.
  Hand coded Cilk program: 9.19 sec.
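
As a quick check on these timings, the ratios work out as follows (rounded to two decimals):

\[
\frac{2.55}{1.16} \approx 2.20, \qquad \frac{3.46}{2.55} \approx 1.36 \qquad (512 \times 512)
\]

\[
\frac{20.47}{9.19} \approx 2.23, \qquad \frac{27.40}{20.47} \approx 1.34 \qquad (1024 \times 1024)
\]

On these figures the best automatically unrolled program is about 2.2 times slower than the hand coded Cilk program and roughly 1.35 times faster than the three nested loops.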

63
Correctness

Recursion unrolling preserves the semantics of
the program
The unrolled program terminates if and only if
the original recursive program terminates
When both the original and the unrolled program
terminate, they yield the same result
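
For the array-increment example, this claim can be spot-checked by running the original and the final unrolled procedure on the same inputs. The sketch below is a test harness under stated assumptions, not a proof: the agree helper is hypothetical, and only sizes 1 to 64 are tried. Size 0 is excluded because neither version terminates there (n/2 stays 0 and never reaches a base case), which is consistent with the termination claim.

```c
#include <assert.h>
#include <string.h>

/* Original recursive procedure. */
void f(char *p, int n)
{
    if (n == 1) {
        *p += 1;
    } else {
        f(p, n / 2);
        f(p + n / 2, n / 2);
    }
}

/* Final result after two unrolling iterations and re-rolling. */
void f1(char *p, int n)
{
    if (n == 1) {
        *p += 1;
    } else if (n / 2 == 1) {
        *p += 1;
        *(p + n / 2) += 1;
    } else if (n / 2 / 2 == 1) {
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
    } else {
        f1(p, n / 2);
        f1(p + n / 2, n / 2);
    }
}

/* Hypothetical helper: returns 1 if f and f1 leave identical buffers
 * for problem size n (n between 1 and 64). */
int agree(int n)
{
    char a[64] = {0}, b[64] = {0};
    assert(n >= 1 && n <= 64);
    f(a, n);
    f1(b, n);
    return memcmp(a, b, sizeof a) == 0;
}
```

The comparison includes sizes that are not powers of two, where integer division makes both versions skip the same elements (for n = 6, both increment indices 0, 1, 3 and 4).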

64
Speedup for Matrix Multiply
Pentium III, Matrix of 512 x 512 elements
65
Speedup for Matrix Multiply
Pentium III, Matrix of 1024 x 1024 elements
66
Speedup for Matrix Multiply
Power PC, Matrix of 512 x 512 elements
67
Speedup for Matrix Multiply
Power PC, Matrix of 1024 x 1024 elements
68
Speedup for Matrix Multiply
Origin 2000, Matrix of 512 x 512 elements
69
Speedup for Matrix Multiply
Origin 2000, Matrix of 1024 x 1024 elements
70
Speedup for LU
Pentium III, Matrix of 512 x 512 elements
71
Speedup for LU
Pentium III, Matrix of 1024 x 1024 elements
72
Speedup for LU
Power PC, Matrix of 512 x 512 elements
73
Speedup for LU
Power PC, Matrix of 1024 x 1024 elements
74
Speedup for LU
Origin 2000, Matrix of 1024 x 1024 elements
75
Speedup for LU
Origin 2000, Matrix of 512 x 512 elements
