Title: Recursion Unrolling for Divide and Conquer Programs
1Recursion Unrolling for Divide and Conquer
Programs
- Radu Rugina and Martin Rinard
- Laboratory for Computer Science
- Massachusetts Institute of Technology
2What This Talk Is About
- Automatic generation of efficient large base
cases for divide and conquer programs
3Outline
- Motivating Example
- Computation Structure
- Transformations
- Related Work
- Conclusion
41. Motivating Example
5Divide and Conquer Matrix Multiply
A × B = R
| A0 A1 |   | B0 B1 |   | A0×B0 + A1×B2   A0×B1 + A1×B3 |
| A2 A3 | × | B2 B3 | = | A2×B0 + A3×B2   A2×B1 + A3×B3 |
- Divide matrices into sub-matrices A0, A1, A2, etc.
- Use blocked matrix multiply equations
6Divide and Conquer Matrix Multiply
- Recursively multiply sub-matrices
7Divide and Conquer Matrix Multiply
A × B = R
[ a0 ] × [ b0 ] = [ a0 × b0 ]
- Terminate recursion with a simple base case
8Divide and Conquer Matrix Multiply
void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}
Implements R += A × B
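The pointer arithmetic in matmul only works if each quadrant occupies a contiguous quarter of the array, recursively. The sketch below makes that layout explicit and checks the recursive multiply against a plain triple loop on a 4 x 4 example. The helper quad_index, the checker quad_matmul_matches_naive, and the constants SIDE and COUNT are hypothetical names introduced for illustration; the talk does not show how the matrices are laid out, so the quadrant-by-quadrant storage is an assumption inferred from the offsets in the code.

```c
#include <assert.h>

enum { SIDE = 4, COUNT = SIDE * SIDE };

/* Hypothetical helper: offset of element (row, col) when a side x side
   matrix (side a power of two) is stored quadrant by quadrant, recursively,
   in the order A0, A1, A2, A3. */
int quad_index(int row, int col, int side) {
    int offset = 0;
    assert(row >= 0 && row < side && col >= 0 && col < side);
    while (side > 1) {
        int half = side / 2;
        int quadrant = 2 * (row / half) + (col / half); /* 0..3 */
        offset += quadrant * half * half;
        row %= half;
        col %= half;
        side = half;
    }
    return offset;
}

/* Recursive multiply as on the slide: n counts elements, R += A x B. */
void matmul(int *A, int *B, int *R, int n) {
    if (n == 1) {
        (*R) += (*A) * (*B);
    } else {
        matmul(A, B, R, n/4);
        matmul(A, B+(n/4), R+(n/4), n/4);
        matmul(A+2*(n/4), B, R+2*(n/4), n/4);
        matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
        matmul(A+(n/4), B+2*(n/4), R, n/4);
        matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
        matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
        matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
    }
}

/* Returns 1 if matmul on the quadrant layout agrees with a triple loop. */
int quad_matmul_matches_naive(void) {
    int a[SIDE][SIDE], b[SIDE][SIDE];
    int qa[COUNT], qb[COUNT], qr[COUNT] = {0};
    for (int i = 0; i < SIDE; i++) {
        for (int j = 0; j < SIDE; j++) {
            a[i][j] = i + 2 * j + 1;
            b[i][j] = 3 * i - j;
            qa[quad_index(i, j, SIDE)] = a[i][j];
            qb[quad_index(i, j, SIDE)] = b[i][j];
        }
    }
    matmul(qa, qb, qr, COUNT);
    for (int i = 0; i < SIDE; i++) {
        for (int j = 0; j < SIDE; j++) {
            int sum = 0;
            for (int k = 0; k < SIDE; k++) sum += a[i][k] * b[k][j];
            if (qr[quad_index(i, j, SIDE)] != sum) return 0;
        }
    }
    return 1;
}
```

Calling quad_matmul_matches_naive() exercises two levels of recursion (16 elements, then 4, then 1), which is enough to touch every one of the eight recursive calls.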
9Divide and Conquer Matrix Multiply
Divide matrices into sub-matrices and recursively
multiply the sub-matrices
10Divide and Conquer Matrix Multiply
Identify sub-matrices with pointers
11Divide and Conquer Matrix Multiply
Use a simple algorithm for the base case
12Divide and Conquer Matrix Multiply
- Advantage of small base case: simplicity
- Code is easy to
- Write
- Maintain
- Debug
- Understand
13Divide and Conquer Matrix Multiply
- Disadvantage: inefficiency
- Large control flow overhead
- Most of the time is spent dividing the matrix
into sub-matrices
14Hand Coded Implementation
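void serialmul(block *As, block *Bs, block *Rs) {
  int i, j;
  DOUBLE *A = (DOUBLE *) As;
  DOUBLE *B = (DOUBLE *) Bs;
  DOUBLE *R = (DOUBLE *) Rs;
  for (j = 0; j < 16; j += 2) {
    DOUBLE *bp = &B[j];
    for (i = 0; i < 16; i += 2) {
      DOUBLE *ap = &A[i * 16];
      DOUBLE *rp = &R[j + i * 16];
      register DOUBLE s0_0 = rp[0], s0_1 = rp[1];
      register DOUBLE s1_0 = rp[16], s1_1 = rp[17];
      s0_0 += ap[0] * bp[0];    s0_1 += ap[0] * bp[1];    s1_0 += ap[16] * bp[0];   s1_1 += ap[16] * bp[1];
      s0_0 += ap[1] * bp[16];   s0_1 += ap[1] * bp[17];   s1_0 += ap[17] * bp[16];  s1_1 += ap[17] * bp[17];
      s0_0 += ap[2] * bp[32];   s0_1 += ap[2] * bp[33];   s1_0 += ap[18] * bp[32];  s1_1 += ap[18] * bp[33];
      s0_0 += ap[3] * bp[48];   s0_1 += ap[3] * bp[49];   s1_0 += ap[19] * bp[48];  s1_1 += ap[19] * bp[49];
      s0_0 += ap[4] * bp[64];   s0_1 += ap[4] * bp[65];   s1_0 += ap[20] * bp[64];  s1_1 += ap[20] * bp[65];
      s0_0 += ap[5] * bp[80];   s0_1 += ap[5] * bp[81];   s1_0 += ap[21] * bp[80];  s1_1 += ap[21] * bp[81];
      s0_0 += ap[6] * bp[96];   s0_1 += ap[6] * bp[97];   s1_0 += ap[22] * bp[96];  s1_1 += ap[22] * bp[97];
      s0_0 += ap[7] * bp[112];  s0_1 += ap[7] * bp[113];  s1_0 += ap[23] * bp[112]; s1_1 += ap[23] * bp[113];
      s0_0 += ap[8] * bp[128];  s0_1 += ap[8] * bp[129];  s1_0 += ap[24] * bp[128]; s1_1 += ap[24] * bp[129];
      s0_0 += ap[9] * bp[144];  s0_1 += ap[9] * bp[145];  s1_0 += ap[25] * bp[144]; s1_1 += ap[25] * bp[145];
      s0_0 += ap[10] * bp[160]; s0_1 += ap[10] * bp[161]; s1_0 += ap[26] * bp[160]; s1_1 += ap[26] * bp[161];
      s0_0 += ap[11] * bp[176]; s0_1 += ap[11] * bp[177]; s1_0 += ap[27] * bp[176]; s1_1 += ap[27] * bp[177];
      s0_0 += ap[12] * bp[192]; s0_1 += ap[12] * bp[193]; s1_0 += ap[28] * bp[192]; s1_1 += ap[28] * bp[193];
      s0_0 += ap[13] * bp[208]; s0_1 += ap[13] * bp[209]; s1_0 += ap[29] * bp[208]; s1_1 += ap[29] * bp[209];
      s0_0 += ap[14] * bp[224]; s0_1 += ap[14] * bp[225]; s1_0 += ap[30] * bp[224]; s1_1 += ap[30] * bp[225];
      s0_0 += ap[15] * bp[240]; s0_1 += ap[15] * bp[241]; s1_0 += ap[31] * bp[240]; s1_1 += ap[31] * bp[241];
      rp[0] = s0_0;
      rp[1] = s0_1;
      rp[16] = s1_0;
      rp[17] = s1_1;
    }
  }
}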
cilk void matrixmul(long nb, block *A, block *B, block *R) {
  if (nb == 1) {
    flops = serialmul(A, B, R);
  } else if (nb >= 4) {
    spawn matrixmul(nb/4, A, B, R);
    spawn matrixmul(nb/4, A, B+(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B, R+2*(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B+(nb/4), R+3*(nb/4));
    sync;
    spawn matrixmul(nb/4, A+(nb/4), B+2*(nb/4), R);
    spawn matrixmul(nb/4, A+(nb/4), B+3*(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+2*(nb/4), R+2*(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+3*(nb/4));
    sync;
  }
}
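To see what the hand-unrolled kernel computes, it may help to roll its 64 multiply-add statements back into a loop. The sketch below is one reading of the pattern on the slide: each (i, j) step updates a 2 x 2 tile of R held in four scalars. The names serialmul_rolled, rolled_kernel_matches_naive, and BS are hypothetical, and plain double stands in for the slide's DOUBLE type.

```c
#include <assert.h>

enum { BS = 16 }; /* block side, as in the hand-coded kernel */

/* R += A * B for row-major BS x BS blocks, one 2 x 2 tile of R at a time. */
void serialmul_rolled(const double *A, const double *B, double *R) {
    for (int j = 0; j < BS; j += 2) {
        const double *bp = &B[j];
        for (int i = 0; i < BS; i += 2) {
            const double *ap = &A[i * BS];
            double *rp = &R[j + i * BS];
            double s0_0 = rp[0], s0_1 = rp[1];
            double s1_0 = rp[BS], s1_1 = rp[BS + 1];
            /* The slide spells this loop out for k = 0..15. */
            for (int k = 0; k < BS; k++) {
                s0_0 += ap[k] * bp[BS * k];
                s0_1 += ap[k] * bp[BS * k + 1];
                s1_0 += ap[BS + k] * bp[BS * k];
                s1_1 += ap[BS + k] * bp[BS * k + 1];
            }
            rp[0] = s0_0;
            rp[1] = s0_1;
            rp[BS] = s1_0;
            rp[BS + 1] = s1_1;
        }
    }
}

/* Returns 1 if the rolled kernel agrees with a triple loop. Inputs are small
   integers stored as doubles, so the comparison is exact. */
int rolled_kernel_matches_naive(void) {
    double a[BS * BS], b[BS * BS], r[BS * BS], expected[BS * BS];
    for (int i = 0; i < BS; i++) {
        for (int j = 0; j < BS; j++) {
            a[i * BS + j] = (double)((i + j) % 5);
            b[i * BS + j] = (double)((2 * i + j) % 7);
            r[i * BS + j] = 1.0;
            expected[i * BS + j] = 1.0;
        }
    }
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                expected[i * BS + j] += a[i * BS + k] * b[k * BS + j];
    serialmul_rolled(a, b, r);
    for (int idx = 0; idx < BS * BS; idx++)
        if (r[idx] != expected[idx]) return 0;
    return 1;
}
```

The hand-coded version trades this four-line loop body for straight-line code, which is the kind of large base case the talk aims to generate automatically.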
15Goal
- The programmer writes simple code with small base cases
- The compiler automatically generates efficient code with large base cases
162. Computation Structure
17Running Example Array Increment
void f(char *p, int n) {
  if (n == 1) {
    /* base case: increment one element */
    (*p) += 1;
  } else {
    f(p, n/2);       /* increment first half */
    f(p+n/2, n/2);   /* increment second half */
  }
}
18Dynamic Call Tree for n = 4
Execution of f(p, 4)
[Figure: dynamic call tree, built up one level per slide. Each node is an activation frame on the stack and lists the instructions executed in that frame.]
n = 4: Test n == 1, Call f, Call f
n = 2: Test n == 1, Call f, Call f | Test n == 1, Call f, Call f
n = 1: Test n == 1, Inc *p | Test n == 1, Inc *p | Test n == 1, Inc *p | Test n == 1, Inc *p
25Control Flow Overhead
Execution of f(p, 4)
- Call overhead + Test overhead
n = 4: Test n == 1, Call f, Call f
n = 2: Test n == 1, Call f, Call f | Test n == 1, Call f, Call f
n = 1: Test n == 1, Inc *p | Test n == 1, Inc *p | Test n == 1, Inc *p | Test n == 1, Inc *p
27Computation
Execution of f(p, 4)
- Call overhead + Test overhead: the Test and Call steps in every node of the call tree (7 tests, 6 calls)
- Computation: the Inc steps in the four leaves (4 increments)
28Large Base Cases = Reduced Overhead
Execution of f(p, 4)
n = 4: Test n == 2, Call f, Call f
n = 2: Test n == 2, Inc *p, Inc *(p+1) | Test n == 2, Inc *p, Inc *(p+1)
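The two call trees can be checked by instrumenting the running example. The sketch below counts tests, calls, and increments for the original base case (n == 1) and for a base case at n == 2. The struct counts, the f_small and f_large variants, and the count_small and count_large wrappers are hypothetical names for illustration, and the sketch assumes n is a power of two.

```c
#include <assert.h>

struct counts { int tests; int calls; int increments; };

/* Original procedure: base case at n == 1. */
void f_small(char *p, int n, struct counts *c) {
    assert(n >= 1);
    c->tests++;
    if (n == 1) {
        *p += 1;
        c->increments++;
    } else {
        c->calls += 2;
        f_small(p, n / 2, c);
        f_small(p + n / 2, n / 2, c);
    }
}

/* Same computation with a larger base case at n == 2. */
void f_large(char *p, int n, struct counts *c) {
    assert(n >= 2);
    c->tests++;
    if (n == 2) {
        *p += 1;
        *(p + 1) += 1;
        c->increments += 2;
    } else {
        c->calls += 2;
        f_large(p, n / 2, c);
        f_large(p + n / 2, n / 2, c);
    }
}

/* Wrappers that run one variant on a zeroed buffer and return the counts. */
struct counts count_small(int n) {
    char data[1024] = {0};
    struct counts c = {0, 0, 0};
    assert(n <= 1024);
    f_small(data, n, &c);
    return c;
}

struct counts count_large(int n) {
    char data[1024] = {0};
    struct counts c = {0, 0, 0};
    assert(n <= 1024);
    f_large(data, n, &c);
    return c;
}
```

For n = 4 the original performs 7 tests and 6 calls for 4 increments, while the version with the larger base case performs 3 tests and 2 calls for the same 4 increments, matching the two trees on the slides.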
293. Transformations
30Transformation 1 Recursion Inlining
Start with the original recursive procedure
void f(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);
    f(p+n/2, n/2);
  }
}
31Transformation 1 Recursion Inlining
Make two copies of the original procedure
void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}
32Transformation 1 Recursion Inlining
Transform direct recursion to mutual recursion
void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
33Transformation 1 Recursion Inlining
Inline procedure f2 at call sites in f1
34Transformation 1 Recursion Inlining
void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      f1(p, n/2/2);
      f1(p+n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p+n/2) += 1;
    } else {
      f1(p+n/2, n/2/2);
      f1(p+n/2+n/4, n/2/2);
    }
  }
}
36Transformation 1 Recursion Inlining
- Reduced procedure call overhead
- More code exposed at the intra-procedural level
- Opportunities to simplify control flow in the inlined code: identical condition expressions
37Transformation 2 Conditional Fusion
Merge if statements with identical conditions
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}
38Transformation 2 Conditional Fusion
Merge if statements with identical conditions
- Reduced branching overhead and bigger basic blocks
- Larger base case for n/2 == 1
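Conditional fusion can be illustrated in isolation. The sketch below uses a made-up pair of functions, unfused and fused, that are not from the talk: two adjacent if statements test the same condition, and merging them is safe here because neither branch modifies n, so the condition cannot change between the two tests. That safety condition is stated as an assumption of this sketch; the slides do not spell it out.

```c
#include <assert.h>

/* Two adjacent ifs with identical conditions, as left behind by inlining. */
int unfused(int n, int x) {
    if (n / 2 == 1) {
        x += 1;
    } else {
        x *= 2;
    }
    if (n / 2 == 1) {
        x += 10;
    } else {
        x -= 3;
    }
    return x;
}

/* After fusion: one test, two larger basic blocks. */
int fused(int n, int x) {
    if (n / 2 == 1) {
        x += 1;
        x += 10;
    } else {
        x *= 2;
        x -= 3;
    }
    return x;
}

/* Returns 1 if both versions agree on a small grid of inputs. */
int fusion_preserves_result(void) {
    for (int n = 0; n <= 16; n++)
        for (int x = -5; x <= 5; x++)
            if (unfused(n, x) != fused(n, x)) return 0;
    return 1;
}
```

In f1 the same pattern appears with the two tests of n/2 == 1, and the merged then-branch becomes the new base case.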
39Unrolling Iterations
Repeatedly apply inlining and conditional fusion
40Second Unrolling Iteration
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}
41Second Unrolling Iteration
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f2(p, n/2/2);
    f2(p+n/2/2, n/2/2);
    f2(p+n/2, n/2/2);
    f2(p+n/2+n/4, n/2/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
42Result of Second Unrolling Iteration
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}
43Unrolling Iterations
- The unrolling process stops when the number of iterations reaches the desired unrolling factor
- The unrolled recursive procedure
- Has base cases for larger problem sizes
- Divides the given problem into more sub-problems of smaller sizes
- In our example
- Base cases for n = 1, n = 2, and n = 4
- Problems are divided into 8 problems of 1/8 size
44Speedup for Matrix Multiply
Matrix of 512 x 512 elements
46Speedup for Matrix Multiply
Matrix of 1024 x 1024 elements
47Efficiency of Unrolled Recursive Part
- Because the recursive part is also unrolled, recursion may not exercise the large base cases
- Which base case is executed depends on the size of the input problem
- In our example
- For a problem of size n = 8, the base case for n = 1 is executed
- For a problem of size n = 16, the base case for n = 2 is executed
- The efficient base case for n = 4 is not executed in these cases
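The dependence on input size can be traced with a few lines of code. The sketch below follows the control flow of the twice-unrolled f1 (base cases when n == 1, n/2 == 1, or n/2/2 == 1, otherwise eight calls on n/2/2/2) and reports the size at which the recursion bottoms out. The function name base_case_reached_unrolled is hypothetical, and the sketch assumes n is a power of two, so every leaf of the call tree reaches the same size.

```c
#include <assert.h>

/* Size of the base case that the twice-unrolled f1 ends up executing
   for a power-of-two problem size n. */
int base_case_reached_unrolled(int n) {
    assert(n >= 1);
    /* Mirror f1's tests: stop at a base case, otherwise recurse on n/8. */
    while (n != 1 && n / 2 != 1 && n / 2 / 2 != 1) {
        n = n / 2 / 2 / 2;
    }
    return n;
}
```

For n = 8 and n = 16 this returns 1 and 2, matching the slide; n = 32 is one of the sizes that does reach the n = 4 base case, and n = 64 falls back to n = 1 again.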
48Solution Recursion Re-Rolling
- Roll back the recursive part of the unrolled procedure after the large base cases are generated
- Re-rolling ensures that larger base cases are always executed, independent of the input problem size
- The compiler unrolls the recursive part only temporarily, to generate the base cases
50Transformation 3 Recursion Re-Rolling
Identify the recursive part: the final else branch of the unrolled f1,
with its eight recursive calls on problems of size n/2/2/2
51Transformation 3 Recursion Re-Rolling
Replace with the recursive part of the original
procedure
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
52Final Result
The re-rolled f1 from the previous slide: base cases for n = 1, n = 2,
and n = 4, followed by the original two-way recursion
f1(p, n/2) and f1(p+n/2, n/2)
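As a rough check of what the final procedure buys, the sketch below counts procedure invocations for the original f and for the re-rolled f1 on the same input. The global counter invocations, the wrapper count_invocations, the names f_original and f1_rerolled, and the MAX_N limit are all hypothetical additions for illustration; the sketch assumes n is a power of two. It measures only call counts, not running time.

```c
#include <assert.h>
#include <string.h>

enum { MAX_N = 1024 };

long invocations = 0; /* hypothetical counter, bumped on every procedure entry */

/* Original running example: base case at n == 1. */
void f_original(char *p, int n) {
    invocations++;
    if (n == 1) {
        *p += 1;
    } else {
        f_original(p, n / 2);
        f_original(p + n / 2, n / 2);
    }
}

/* Re-rolled result: base cases for n = 1, 2, 4, original recursive part. */
void f1_rerolled(char *p, int n) {
    invocations++;
    if (n == 1) {
        *p += 1;
    } else if (n / 2 == 1) {
        *p += 1;
        *(p + n / 2) += 1;
    } else if (n / 2 / 2 == 1) {
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
    } else {
        f1_rerolled(p, n / 2);
        f1_rerolled(p + n / 2, n / 2);
    }
}

/* Runs one variant on n zeroed elements, checks that each element was
   incremented exactly once, and returns the number of invocations. */
long count_invocations(int use_rerolled, int n) {
    char data[MAX_N];
    assert(n >= 1 && n <= MAX_N);
    memset(data, 0, sizeof data);
    invocations = 0;
    if (use_rerolled) {
        f1_rerolled(data, n);
    } else {
        f_original(data, n);
    }
    for (int i = 0; i < n; i++) assert(data[i] == 1);
    return invocations;
}
```

For n = 1024 the original makes 2047 invocations and the re-rolled version 511, because the recursion now stops at 256 leaves of size 4 instead of 1024 leaves of size 1.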
53Speedup for Matrix Multiply
Matrix of 512 x 512 elements
54Speedup for Matrix Multiply
Matrix of 1024 x 1024 elements
55Other Optimizations
- Inlining moves code from the inter-procedural level to the intra-procedural level
- Conditional fusion brings code from the inter-basic-block level to the intra-basic-block level
- Together, inlining and conditional fusion give subsequent compiler passes the opportunity to perform more aggressive optimizations
56Comparison to Hand Coded Programs
- Two applications: matrix multiply, LU decomposition
- Three machines: Pentium III, Origin 2000, PowerPC
- Two different problem sizes
- Compare automatically unrolled programs to optimized, hand coded versions from the Cilk benchmarks
- Best automatically unrolled version performs
- Between 2.2 and 2.9 times worse for matrix multiply
- As good as hand coded version for LU
57Related Work
- Procedure Inlining
- Scheifler (1977)
- Richardson, Ganapathi (1989)
- Chambers, Ungar (1989)
- Cooper, Hall, Torczon (1991)
- Appel (1992)
- Chang, Mahlke, Chen, Hwu (1992)
58Conclusion
- Recursion Unrolling
- analogous to the loop unrolling transformation
- Divide and Conquer Programs
- The programmer writes simple base cases
- The compiler automatically generates large base cases
- Key Techniques
- Inlining: conceptually inline recursive calls
- Conditional Fusion: simplify intra-procedural control flow
- Re-Rolling: ensure that large base cases are executed
62Comparison to Hand Coded Programs
- Matrix multiply for 512 x 512 elements
- Best automatically unrolled program: 2.55 sec.
- Hand coded with three nested loops: 3.46 sec.
- Hand coded Cilk program: 1.16 sec.
- Matrix multiply for 1024 x 1024 elements
- Best automatically unrolled program: 20.47 sec.
- Hand coded with three nested loops: 27.40 sec.
- Hand coded Cilk program: 9.19 sec.
63Correctness
- Recursion unrolling preserves the semantics of the program
- The unrolled program terminates if and only if the original recursive program terminates
- When both the original and the unrolled program terminate, they yield the same result
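The "same result" claim can be spot-checked on the running example. The sketch below runs the original f and the twice-unrolled f1 (before re-rolling) on every size from 1 up to a bound, including sizes that are not powers of two, and compares the resulting arrays byte for byte. The names f1_unrolled, same_result_up_to, and BUF are hypothetical, and a test like this is only evidence for the claim, not a proof.

```c
#include <assert.h>
#include <string.h>

enum { BUF = 256 };

/* Original running example. */
void f(char *p, int n) {
    if (n == 1) {
        *p += 1;
    } else {
        f(p, n / 2);
        f(p + n / 2, n / 2);
    }
}

/* Result of two unrolling iterations, before re-rolling. */
void f1_unrolled(char *p, int n) {
    if (n == 1) {
        *p += 1;
    } else if (n / 2 == 1) {
        *p += 1;
        *(p + n / 2) += 1;
    } else if (n / 2 / 2 == 1) {
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
    } else {
        f1_unrolled(p, n / 2 / 2 / 2);
        f1_unrolled(p + n / 2 / 2 / 2, n / 2 / 2 / 2);
        f1_unrolled(p + n / 2 / 2, n / 2 / 2 / 2);
        f1_unrolled(p + n / 2 / 2 + n / 2 / 2 / 2, n / 2 / 2 / 2);
        f1_unrolled(p + n / 2, n / 2 / 2 / 2);
        f1_unrolled(p + n / 2 + n / 2 / 2 / 2, n / 2 / 2 / 2);
        f1_unrolled(p + n / 2 + n / 2 / 2, n / 2 / 2 / 2);
        f1_unrolled(p + n / 2 + n / 2 / 2 + n / 2 / 2 / 2, n / 2 / 2 / 2);
    }
}

/* Returns 1 if both procedures leave identical arrays for n = 1..max_n. */
int same_result_up_to(int max_n) {
    char expected[BUF], actual[BUF];
    assert(max_n >= 1 && max_n <= BUF);
    for (int n = 1; n <= max_n; n++) {
        memset(expected, 0, sizeof expected);
        memset(actual, 0, sizeof actual);
        f(expected, n);
        f1_unrolled(actual, n);
        if (memcmp(expected, actual, sizeof expected) != 0) return 0;
    }
    return 1;
}
```

Sizes below 1 are left out on purpose: for n = 0 the original never reaches its base case, and by the termination property on the slide the unrolled version does not either.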
64Speedup for Matrix Multiply
Pentium III, Matrix of 512 x 512 elements
65Speedup for Matrix Multiply
Pentium III, Matrix of 1024 x 1024 elements
66Speedup for Matrix Multiply
Power PC, Matrix of 512 x 512 elements
67Speedup for Matrix Multiply
Power PC, Matrix of 1024 x 1024 elements
68Speedup for Matrix Multiply
Origin 2000, Matrix of 512 x 512 elements
69Speedup for Matrix Multiply
Origin 2000, Matrix of 1024 x 1024 elements
70Speedup for LU
Pentium III, Matrix of 512 x 512 elements
71Speedup for LU
Pentium III, Matrix of 1024 x 1024 elements
72Speedup for LU
Power PC, Matrix of 512 x 512 elements
73Speedup for LU
Power PC, Matrix of 1024 x 1024 elements
74Speedup for LU
Origin 2000, Matrix of 1024 x 1024 elements
75Speedup for LU
Origin 2000, Matrix of 512 x 512 elements