Title: Parallelizing C Programs Using Cilk
1. Parallelizing C Programs Using Cilk
2. Cilk Language
- Cilk is a language for multithreaded parallel programming based on C.
- The programmer does not need to worry about scheduling the computation to run efficiently.
- There are three additional keywords: cilk, spawn and sync.
3. Example: Fibonacci
Sequential C version:

    int fib (int n)
    {
        int x, y;
        if (n < 2) return n;
        x = fib (n-1);
        y = fib (n-2);
        return x + y;
    }

Cilk version:

    cilk int fib (int n)
    {
        int x, y;
        if (n < 2) return n;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return x + y;
    }
4. Performance Measures
- $T_P$ is the execution time on $P$ processors.
- $T_1$ is called the work.
- $T_\infty$ is called the span.
- Obvious lower bounds:
  - $T_P \ge T_1 / P$
  - $T_P \ge T_\infty$
- $p = T_1 / T_\infty$ is called the parallelism. Using more than $p$ processors makes little sense.
5. Cilk Compiler
- The file extension should be .cilk.
- Example:
  - > cilkc -O3 fib.cilk -o fib
- To find the 30th Fibonacci number using 4 CPUs:
  - > fib --nproc 4 30
- To collect the timings of each processor and compute the span (not efficient):
  - > cilkc -cilk-profile -cilk-span -O3 fib.cilk -o fib
6. Example: Matrix Multiplication
- Suppose we want to multiply two n by n matrices.
- We can formulate the problem recursively,
- i.e. one n by n matrix multiplication reduces to
- 8 multiplications and 4 additions of (n/2) by (n/2) submatrices.
7. Multiplication Procedure

    Mult(C, A, B, n)
        if (n == 1) C[1,1] = A[1,1] * B[1,1]
        else
        {
            spawn Mult(C11, A11, B11, n/2)
            ...
            spawn Mult(C22, A21, B12, n/2)
            spawn Mult(T11, A12, B21, n/2)
            ...
            spawn Mult(T22, A22, B22, n/2)
            sync
            Add(C, T, n)
        }
8. Addition Procedure

    Add(C, T, n)
        if (n == 1) C[1,1] = C[1,1] + T[1,1]
        else
        {
            spawn Add(C11, T11, n/2)
            ...
            spawn Add(C22, T22, n/2)
            sync
        }

- $T_1$ (work) for addition: $O(n^2)$.
- $T_\infty$ (span) for addition: $O(\log n)$.
9. Complexity of Multiplication
- We know that matrix multiplication is $O(n^3)$, hence $T_1$ (work) for multiplication is $O(n^3)$.
- $T_\infty$ (span) for multiplication: $M_\infty(n) = M_\infty(n/2) + O(\log n) = O(\log^2 n)$.
- $p = T_1 / T_\infty = O(n^3) / O(\log^2 n)$.
- To multiply 1000 by 1000 matrices, $p \approx 10^7$ (a lot of CPUs!).
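The span bound and the estimate for 1000 by 1000 matrices can be checked by unrolling the recurrence and substituting n = 1000. This worked sketch assumes that the constants hidden in the O-notation are 1 and that logarithms are base 2; the slide states neither.

```latex
\begin{align*}
M_\infty(n) &= M_\infty(n/2) + O(\log n) \\
            &= O\bigl(\log n + \log(n/2) + \log(n/4) + \dots + 1\bigr)
             = O(\log^2 n), \\
p = \frac{T_1}{T_\infty} &\approx \frac{n^3}{\log_2^2 n}
   = \frac{1000^3}{(\log_2 1000)^2}
   \approx \frac{10^9}{9.97^2}
   \approx \frac{10^9}{99.3}
   \approx 10^7 .
\end{align*}
```

The $O(\log n)$ term in the recurrence is the span of the Add step that follows the sync.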
10. Discrete Fourier Transform
Sequential version:

    DFT(n, w, p, ...)
        ...
        t = w^2 mod p
        DFT(n/2, t, p, ...)
        DFT(n/2, t, p, ...)

        w1 = 1
        for (i = 0; i < n/2; i++)
        {
            ...
            a[i] ...
            w1 = w1 * w mod p
        }

Cilk version:

    cilk DFT(n, w, p, ...)
        ...
        t = w^2 mod p
        spawn DFT(n/2, t, p, ...)
        spawn DFT(n/2, t, p, ...)
        sync
        spawn ParCom(n, a, p, 1, ...)

    cilk ParCom(n, a, p, m, ...)
        if (n < 512) ...
        spawn ParCom(n/2, a, p, 1, ...)
        m = m * w^(n/2) mod p
        spawn ParCom(n/2, a + n/2, p, m, ...)
        sync
11. Complexity of ParCom
- The sequential combining does n/2 multiplications.
- $T_\infty$ (span) for ParCom:
  - $T_\infty(n) = T_\infty(n/2) + O(\log n)$, so $T_\infty(n) = O(\log^2 n)$.
- Parallelism: $p = O(n / \log^2 n)$ (here $p$ is the parallelism, not the modulus p passed to DFT).
- We ran the FFT on stan, which has 4 CPUs.
- Thus parallelism $p > 4$ does not make sense, so we cut off the parallelism at some level of recursion to speed up the program.
12. Timings
- Sequential FFT: 123789 ms

Processors   Parallel time (ms)   Speedup
4            32837                3.77
3            44315                2.79
2            66262                1.87
1            124006               0.998