Title: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)
- Alistair Dundas
- Department of Computer Science
- University of Massachusetts
2 Outline
- What is Cilk?
- Cilk example: the Fibonacci algorithm.
- The work-first principle.
- Work Stealing.
- The T.H.E. Protocol.
- Empirical results.
- Summary and questions.
3 What Is Cilk?
- Extension of C for parallel programming.
- Designed for SMP machines with support for shared memory.
- Benefits:
  - Provably efficient work-stealing scheduler.
  - Clean programming model.
- Benefits over normal thread programming: discussion topic!
- Specifically: a source-to-source compiler generating C.
4 Example: Fibonacci Algorithm
    #include <stdio.h>
    #include <stdlib.h>

    int fib (int n);

    int main (int argc, char *argv[])
    {
        int n, result;
        n = atoi(argv[1]);
        result = fib(n);
        printf("Result: %d\n", result);
        return 0;
    }

    int fib (int n)
    {
        if (n < 2) return n;
        else {
            int x, y;
            x = fib (n-1);
            y = fib (n-2);
            return (x+y);
        }
    }
5 Example: Fibonacci In Parallel
    #include <stdio.h>
    #include <stdlib.h>

    cilk int fib (int n);

    cilk int main (int argc, char *argv[])
    {
        int n, result;
        n = atoi(argv[1]);
        result = spawn fib(n);
        sync;
        printf("Result: %d\n", result);
        return 0;
    }

    cilk int fib (int n)
    {
        if (n < 2) return n;
        else {
            int x, y;
            x = spawn fib (n-1);
            y = spawn fib (n-2);
            sync;
            return (x+y);
        }
    }
6 Source to Source Compiler
7 The Work First Principle
- Work is the amount of time needed to execute the computation serially.
- Critical path length is the execution time on an infinite number of processors.
- The Work-First Principle: minimize the scheduling overhead borne by the work, even at the expense of increasing the critical path.
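To make the two definitions concrete, here is a minimal sketch in C that counts work and critical path length for the Fibonacci example. It assumes (this is not from the slides) that every call to fib costs exactly one time unit; the names fib_work and fib_span are hypothetical.

```c
#include <assert.h>

/* Assumption for this sketch: each call to fib costs one time unit. */

/* Work: total number of calls made when fib(n) runs serially. */
long fib_work(int n)
{
    assert(n >= 0);
    if (n < 2) return 1;
    return 1 + fib_work(n - 1) + fib_work(n - 2);
}

/* Critical path: the longest chain of dependent calls. The two spawned
 * children could run at the same time on unlimited processors, so only
 * the longer of the two contributes. */
long fib_span(int n)
{
    long a, b;

    assert(n >= 0);
    if (n < 2) return 1;
    a = fib_span(n - 1);
    b = fib_span(n - 2);
    return 1 + (a > b ? a : b);
}
```

Under the unit-cost assumption, fib_work(10) is 177 and fib_span(10) is 10, so the ratio of work to critical path length is about 17.7.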
8 Theory: The Work First Principle
- Where $T_P$ is the time on $P$ processors:
  - $T_P = T_1/P + O(T_\infty)$  (1)
- Making critical path overhead explicit:
  - $T_P \le T_1/P + c_\infty T_\infty$  (2)
- Define average parallelism (max speedup):
  - $P_{\text{AVERAGE}} = T_1/T_\infty$
- Define parallel slackness:
  - $P_{\text{AVERAGE}}/P$
9 The Work First Principle (cont)
- Assumption of parallel slackness:
  - $P_{\text{AVERAGE}}/P \gg c_\infty$
- Combining this with inequality (2), we get:
  - $T_P \approx T_1/P$
- Define work overhead:
  - $c_1 = T_1/T_S$
  - $T_P \approx c_1 T_S/P$
- Conclusion: minimize work overhead.
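The step from inequality (2) to the approximation can be spelled out as follows. This is a sketch using only the quantities defined on these two slides, writing $T_\infty$ for the critical path length, $c_\infty$ for the critical path overhead, and $T_S$ for the running time of the serial C program.

```latex
\begin{align*}
T_P &\le \frac{T_1}{P} + c_\infty T_\infty
     = \frac{T_1}{P}\left(1 + \frac{c_\infty P\, T_\infty}{T_1}\right)
     = \frac{T_1}{P}\left(1 + \frac{c_\infty}{P_{\text{AVERAGE}}/P}\right) \\
    &\approx \frac{T_1}{P}
     \qquad \text{when } P_{\text{AVERAGE}}/P \gg c_\infty, \\
T_P &\approx \frac{c_1 T_S}{P}
     \qquad \text{since } T_1 = c_1 T_S .
\end{align*}
```

With enough slackness the critical path term is negligible, so the only remaining lever on $T_P$ is the work overhead $c_1$.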
10 Work Stealing Algorithm
- Each worker keeps a ready deque (double-ended queue) of procedure instances waiting to run.
- Workers treat the deque as a stack, pushing and popping procedure calls at the end.
- When workers have no more work, they steal from the front of another worker's deque.
- Parents are stolen before children.
- This is implemented using two versions of each procedure: a fast clone and a slow clone.
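The deque discipline above can be sketched in C as follows. This is a single-threaded illustration only, with no synchronization between worker and thief; the type and function names (frame_t, deque_t, deque_push, deque_pop, deque_steal) and the fixed capacity are hypothetical and not taken from the Cilk runtime.

```c
#include <assert.h>
#include <stddef.h>

#define DEQUE_CAPACITY 64   /* hypothetical fixed size for the sketch */

typedef struct {
    int depth;              /* spawn depth: 0 is the oldest ancestor */
} frame_t;

typedef struct {
    frame_t *slots[DEQUE_CAPACITY];
    size_t head;            /* thieves take from here (oldest frames) */
    size_t tail;            /* the owning worker pushes and pops here */
} deque_t;

void deque_init(deque_t *d)
{
    d->head = 0;
    d->tail = 0;
}

/* Worker: push a frame at the end before making a spawned call. */
void deque_push(deque_t *d, frame_t *f)
{
    assert(d->tail < DEQUE_CAPACITY);
    d->slots[d->tail++] = f;
}

/* Worker: pop the most recently pushed frame (stack order).
 * Returns NULL if the deque is empty. */
frame_t *deque_pop(deque_t *d)
{
    if (d->tail == d->head) return NULL;
    return d->slots[--d->tail];
}

/* Thief: take the oldest frame from the front.
 * Returns NULL if the deque is empty. */
frame_t *deque_steal(deque_t *d)
{
    if (d->head == d->tail) return NULL;
    return d->slots[d->head++];
}
```

If frames of depth 0, 1, and 2 are pushed in that order, deque_steal returns the depth-0 frame (the parent) while deque_pop returns the depth-2 frame (the youngest child), which is the "parents are stolen before children" property.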
11 Fast Clone
- Run fast clone when a procedure is spawned.
- Little support for parallelism.
- Whenever a call is made, save complete state, and push on to end of deque.
- When call returns, check to see if procedure was stolen.
  - If stolen, return immediately.
  - If not stolen, carry on execution.
- Since children are never stolen, sync is a no-op.
12 Fast Clone Example
    cilk int fib (int n)
    {
        if (n < 2) return n;
        else {
            int x, y;
            x = spawn fib (n-1);
            y = spawn fib (n-2);
            sync;
            return (x+y);
        }
    }
13 Fast Clone Example
    int fib (int n)
    {
        fib_frame *f;                  /* frame pointer */
        f = alloc(sizeof(*f));         /* allocate frame */
        f->sig = fib_sig;              /* initialize frame */
        if (n < 2) {
            free(f, sizeof(*f));       /* free frame */
            return n;
        }
        else {
14 Fast Clone Example (cont)
            int x, y;
            f->entry = 1;              /* save PC */
            f->n = n;                  /* save live vars */
            *T = f;                    /* store frame pointer */
            push();                    /* push frame */
            x = fib (n-1);             /* do C call */
            if (pop(x) == FAILURE)     /* pop frame */
                return 0;              /* procedure stolen */
            /* < ... >                    second spawn */
            ;                          /* sync is free! */
            free(f, sizeof(*f));       /* free frame */
            return (x+y);
        }
    }
15 Slow Clone
- Slow clone used when a procedure is stolen.
- Similar to fast clone, but also supports concurrent execution.
- It restores program counter and procedure state using the copy stored on the deque.
- Calling sync makes a call to the runtime system to check on the children's status.
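As a rough illustration of how a slow clone might restore its program counter and live variables from a saved frame, here is a sketch in C. It is a simplification under stated assumptions, not the code the Cilk compiler generates: the frame layout is hypothetical (its entry and n fields mirror the fast-clone slides), spawned children are replaced by a plain serial call (fib_serial), the result of any child that already ran is assumed to be stored in the frame, and the runtime call made at sync is reduced to a comment.

```c
#include <assert.h>

/* Hypothetical frame layout for fib. */
typedef struct {
    int entry;   /* saved "program counter": last spawn point reached */
    int n;       /* saved live variables */
    int x, y;
} fib_frame;

/* Stand-in for a spawned child: a plain serial call in this sketch. */
int fib_serial(int n)
{
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

/* Slow-clone sketch: resume fib from the state saved in the frame. */
int fib_slow(fib_frame *f)
{
    int n, x, y;

    assert(f->entry >= 0 && f->entry <= 2);
    n = f->n;                 /* restore live variables from the frame */
    x = f->x;
    y = f->y;

    switch (f->entry) {       /* restore the program counter */
    case 1: goto after_first_spawn;
    case 2: goto after_second_spawn;
    default: break;           /* entry 0: start from the top */
    }

    if (n < 2) return n;
    x = fib_serial(n - 1);    /* first spawn */
after_first_spawn:
    y = fib_serial(n - 2);    /* second spawn */
after_second_spawn:
    /* A real sync would ask the runtime whether the children are done.
     * Here they ran serially, so nothing is outstanding. */
    return x + y;
}
```

For example, a frame saved with entry = 1, n = 10, and x = 34 resumes after the first spawn, computes only the second child, and returns 55.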
16 The T.H.E. Protocol
- Deques held in shared memory.
- Workers operate at the end, thieves at the front.
- We must prevent race conditions where a thief and victim try to access the same procedure frame.
- Locking deques would be expensive for workers.
- The T.H.E. Protocol removes the locking overhead in the common case, where there is no conflict.
17 The T.H.E. Protocol
- Assumes only reads and writes are atomic.
- Head of the deque is H, tail is T, and (T >= H).
- Only a thief can change H.
- Only the worker can change T.
- To steal, thieves must get the lock L.
- At most two processors operating on the deque.
- Three cases of interaction:
  - Two or more items on deque: each gets one.
  - One item on deque: either worker or thief gets the frame, but not both.
  - No items on deque: both worker and thief fail.
18 One item on deque case
- Both thief and worker assume they can get a procedure frame and change H or T.
- If both thief and worker try to take the frame, one or both of them will discover (H > T), depending on instruction order.
- If the thief discovers (H > T), it backs off and restores H.
- If the worker discovers (H > T), it restores T and then tries for the lock. Inside the lock, the procedure can be safely popped if it is still there.
19 Empirical Results
- On an eight-processor Sun SMP, achieved an average speedup of 6.2 over the C elision (the serial, non-threaded C versions).
- Assumptions of work-first seem sound:
  - Applications tested all showed high amounts of average parallelism.
  - Work overhead small for most programs. Least speedup is where overhead is greatest.
20 Summary
- Extension of C for parallel programming.
- Aims to simplify parallelization.
- Main idea is to prevent overhead for workers rather than focus on critical path.
- Empirical results show an average speedup of 6.2 on an 8-processor machine.
21 My Questions
- A Cilk spawn is always just a C call. Who starts the threads, and how many are there?
- Why use Cilk rather than use threads directly?
- What about using Cilk on a Beowulf cluster?
- Are their test programs representative of SMP applications?
22 Other Extensions
- Inlets: a wrapper around spawned procedure returns.
- Abort: terminates work no longer needed (e.g. in parallel search).
- Locking facilities for access to shared data.
23 T.H.E. Protocol
The Worker/Victim:
    pop() {
        T--;
        if (H > T) {
            T++;
            lock(L);
            T--;
            if (H > T) {
                T++;
                unlock(L);
                return FAILURE;
            }
            unlock(L);
        }
        return SUCCESS;
    }
The Thief:
    steal() {
        lock(L);
        H++;
        if (H > T) {
            H--;
            unlock(L);
            return FAILURE;
        }
        unlock(L);
        return SUCCESS;
    }
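The pop and steal routines on this slide can be turned into a compilable C sketch. Everything beyond the slide's H, T, L, FAILURE, and SUCCESS is an assumption made for illustration: the the_deque type and the the_* function names are hypothetical, the lock is a simple spinlock built on C11 atomic_flag, a push operation is added so the deque can be filled, and the indices use C11 atomics with their default sequentially consistent ordering because the protocol depends on the worker's write to T being visible before it reads H. Only the index bookkeeping is modeled; no frames are stored.

```c
#include <assert.h>
#include <stdatomic.h>

enum { FAILURE = 0, SUCCESS = 1 };

typedef struct {
    atomic_long H;    /* head index: changed only by thieves */
    atomic_long T;    /* tail index: changed only by the worker */
    atomic_flag L;    /* lock: always taken by thieves, by the worker
                         only when a conflict is suspected */
} the_deque;

void the_init(the_deque *d)
{
    atomic_init(&d->H, 0);
    atomic_init(&d->T, 0);
    atomic_flag_clear(&d->L);
}

static void the_lock(the_deque *d)
{
    while (atomic_flag_test_and_set(&d->L)) { /* spin */ }
}

static void the_unlock(the_deque *d)
{
    atomic_flag_clear(&d->L);
}

/* Each index has a single writer, so plain atomic load + store suffices. */
static void bump(atomic_long *v, long delta)
{
    atomic_store(v, atomic_load(v) + delta);
}

/* Worker: make one more frame available (added for the sketch). */
void the_push(the_deque *d) { bump(&d->T, +1); }

/* Worker: lock-free in the common case, locks only if H > T is seen. */
int the_pop(the_deque *d)
{
    bump(&d->T, -1);                                  /* claim optimistically */
    if (atomic_load(&d->H) > atomic_load(&d->T)) {    /* possible conflict */
        bump(&d->T, +1);                              /* restore T */
        the_lock(d);
        bump(&d->T, -1);                              /* retry under the lock */
        if (atomic_load(&d->H) > atomic_load(&d->T)) {
            bump(&d->T, +1);
            the_unlock(d);
            return FAILURE;
        }
        the_unlock(d);
    }
    return SUCCESS;
}

/* Thief: always holds the lock while changing H. */
int the_steal(the_deque *d)
{
    the_lock(d);
    bump(&d->H, +1);
    if (atomic_load(&d->H) > atomic_load(&d->T)) {
        bump(&d->H, -1);                              /* back off, restore H */
        the_unlock(d);
        return FAILURE;
    }
    the_unlock(d);
    return SUCCESS;
}
```

Called sequentially from one thread, the sketch reproduces the three cases: on an empty deque both the_pop and the_steal fail and leave H and T unchanged; with one item, whichever call comes first succeeds and the other fails; with two items, one the_steal and one the_pop both succeed.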
24 Fibonacci Illustration