The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

1
The Implementation of the Cilk-5 Multithreaded
Language (Frigo, Leiserson, and Randall)
  • Alistair Dundas
  • Department of Computer Science
  • University of Massachusetts

2
Outline
  • What is Cilk?
  • Cilk example: the Fibonacci algorithm.
  • The work-first principle.
  • Work Stealing.
  • The T.H.E. Protocol.
  • Empirical results.
  • Summary and questions.

3
What Is Cilk?
  • Extension of C for parallel programming.
  • Designed for SMP machines with support for shared
    memory.
  • Benefits
  • Provably efficient work stealing scheduler.
  • Clean programming model.
  • Benefits over normal thread programming:
    discussion topic!
  • Specifically: a source-to-source compiler
    generating C.

4
Example: Fibonacci Algorithm
#include <stdio.h>
#include <stdlib.h>

int fib (int n)
{
    if (n<2) return n;
    else {
        int x, y;
        x = fib (n-1);
        y = fib (n-2);
        return (x+y);
    }
}

int main (int argc, char *argv[])
{
    int n, result;
    n = atoi(argv[1]);
    result = fib(n);
    printf("Result: %d\n", result);
    return 0;
}

5
Example: Fibonacci In Parallel
cilk int fib (int n)
{
    if (n<2) return n;
    else {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
    }
}

cilk int main (int argc, char *argv[])
{
    int n, result;
    n = atoi(argv[1]);
    result = spawn fib(n);
    sync;
    printf("Result: %d\n", result);
    return 0;
}

6
Source to Source Compiler
7
The Work First Principle
  • Work is the amount of time needed to execute the
    computation serially.
  • Critical path length is the execution time on an
    infinite number of processors.
  • The Work-First Principle: minimize scheduling
    overhead borne by work at the expense of
    increasing the critical path.
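
To make the two definitions above concrete, here is a minimal sketch that counts both quantities for the fib example. It assumes a made-up cost model in which every procedure instance costs one time unit; the names fib_work and fib_span are illustrative and do not come from the paper.

```c
#include <assert.h>

/* Hypothetical cost model: each fib instance costs one time unit. */

/* Work: total cost when every instance runs serially. */
long fib_work(int n)
{
    assert(n >= 0);
    if (n < 2) return 1;
    return 1 + fib_work(n - 1) + fib_work(n - 2);
}

/* Critical path: the two spawned children may run in parallel,
   so only the longer of the two contributes. */
long fib_span(int n)
{
    long a, b;
    assert(n >= 0);
    if (n < 2) return 1;
    a = fib_span(n - 1);
    b = fib_span(n - 2);
    return 1 + (a > b ? a : b);
}
```

Under this model fib_work(20) is 21891 while fib_span(20) is only 20, which illustrates why adding a little cost to the critical path is a cheap trade when it keeps the work itself fast.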

8
Theory: The Work First Principle
  • Where $T_P$ is the time on $P$ processors:
  • $T_P = T_1/P + O(T_\infty)$ (1)
  • Making critical path overhead explicit:
  • $T_P \le T_1/P + c_\infty T_\infty$ (2)
  • Define average parallelism (max speedup):
  • $P_{\mathrm{average}} = T_1/T_\infty$
  • Define parallel slackness:
  • $P_{\mathrm{average}}/P$

9
The Work First Principle (cont)
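  • Assumption of parallel slackness:
  • $P_{\mathrm{average}}/P \gg c_\infty$
  • Combining these with the inequality, we get
  • $T_P \approx T_1/P$
  • Define work overhead, where $T_S$ is the running
    time of the serial C program:
  • $c_1 = T_1/T_S$
  • $T_P \approx c_1 T_S/P$
  • Conclusion: minimize work overhead.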
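
The step from inequality (2) to the conclusion can be spelled out as follows. This is a reconstruction from the definitions on these two slides, with $c_\infty$ the critical path overhead and $c_1$ the work overhead; it is not copied from the paper.

```latex
\begin{align*}
T_P &\le \frac{T_1}{P} + c_\infty T_\infty
      && \text{inequality (2)} \\
    &= \frac{T_1}{P}\left(1 + \frac{c_\infty\, P\, T_\infty}{T_1}\right) \\
    &= \frac{T_1}{P}\left(1 + \frac{c_\infty}{P_{\mathrm{average}}/P}\right)
      && \text{since } P_{\mathrm{average}} = T_1/T_\infty \\
    &\approx \frac{T_1}{P}
      && \text{when } P_{\mathrm{average}}/P \gg c_\infty \\
    &= \frac{c_1 T_S}{P}
      && \text{since } c_1 = T_1/T_S
\end{align*}
```

With enough parallel slackness the $c_\infty$ term vanishes, so the only overhead left in the running time is $c_1$.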

10
Work Stealing Algorithm
  • Each worker keeps a ready deque (double ended
    queue) of procedure instances waiting to run.
  • Workers treat the deque as a stack, pushing and
    popping procedure calls on to the end.
  • When workers have no more work, they steal from
    the front of another worker's deque.
  • Parents are stolen before children.
  • This is implemented using two versions of each
    procedure: a fast clone and a slow clone.
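
The deque discipline above can be sketched as a small array-based structure. This is a single-threaded illustration only, with no synchronization at all; the type and function names are hypothetical, and frames are represented by plain integer ids.

```c
#include <assert.h>

#define DEQUE_CAP 64

/* Hypothetical ready deque of frame ids. */
typedef struct {
    int frames[DEQUE_CAP];
    int H;  /* head: index of the oldest frame */
    int T;  /* tail: index one past the newest frame */
} deque;

void deque_init(deque *d) { d->H = 0; d->T = 0; }

/* Worker: push a frame on to the end (tail). */
void deque_push(deque *d, int frame)
{
    assert(d->T < DEQUE_CAP);
    d->frames[d->T++] = frame;
}

/* Worker: pop the newest frame from the end; -1 if empty. */
int deque_pop(deque *d)
{
    if (d->T == d->H) return -1;
    return d->frames[--d->T];
}

/* Thief: steal the oldest frame from the front; -1 if empty. */
int deque_steal(deque *d)
{
    if (d->T == d->H) return -1;
    return d->frames[d->H++];
}
```

If frames 1, 2, 3 are pushed in that order (1 being the outermost parent), a thief takes frame 1 while the worker keeps popping 3 and then 2, which is the "parents are stolen before children" behaviour.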

11
Fast Clone
  • Run fast clone when a procedure is spawned.
  • Little support for parallelism.
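  • Whenever a spawn is made, save the state (program
    counter and live variables), and push the frame
    on to the end of the deque.
  • When the call returns, check to see if the
    procedure was stolen.
  • If stolen, return immediately.
  • If not stolen, carry on execution.
  • Since a fast clone that is still running has not
    been stolen, all of its children have already
    returned, so sync is a no-op.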

12
Fast Clone Example
cilk int fib (int n)
{
    if (n<2) return n;
    else {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
    }
}

13
Fast Clone Example
int fib (int n)
{
    fib_frame *f;                  /* frame pointer */
    f = alloc(sizeof(*f));         /* allocate frame */
    f->sig = fib_sig;              /* initialize frame */
    if (n<2) {
        free(f, sizeof(*f));       /* free frame */
        return n;
    }
    else {

14
Fast Clone Example
        int x, y;
        f->entry = 1;              /* save PC */
        f->n = n;                  /* save live vars */
        *T = f;                    /* store frame pointer */
        push();                    /* push frame */
        x = fib (n-1);             /* do C call */
        if (pop(x) == FAILURE)     /* pop frame */
            return 0;              /* procedure stolen */
        <...>                      /* second spawn */
        ;                          /* sync is free! */
        free(f, sizeof(*f));       /* free frame */
        return (x+y);
    }
}

15
Slow Clone
  • Slow clone used when a procedure is stolen.
  • Similar to fast clone, but also supports
    concurrent execution.
  • It restores program counter and procedure state
    using copy stored on deque.
  • Calling sync makes a call to the runtime system to
    check on the children's status.
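
How a slow clone might restore the program counter and procedure state can be sketched in plain C. The frame layout, the entry numbering, and the name fib_slow are assumptions made for illustration; a serial fib stands in for spawned children, and the runtime call at sync is reduced to a comment.

```c
#include <assert.h>

/* Hypothetical frame layout saved by the fast clone. */
typedef struct {
    int entry;  /* saved PC: 0 = start, 1 = after first spawn,
                   2 = after second spawn */
    int n;      /* saved live variable */
    int x;      /* result of the first child, if already returned */
    int y;      /* result of the second child, if already returned */
} fib_frame;

/* Serial fib stands in for the spawned children in this sketch. */
int fib_serial(int n)
{
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

int fib_slow(fib_frame *f)
{
    int n = f->n;                   /* restore live variable */
    assert(f->entry >= 0 && f->entry <= 2);
    switch (f->entry) {             /* restore program counter */
    case 1: goto after_first_spawn;
    case 2: goto after_second_spawn;
    default: break;
    }
    if (n < 2) return n;
    f->x = fib_serial(n - 1);       /* first "spawn" */
after_first_spawn:
    f->y = fib_serial(n - 2);       /* second "spawn" */
after_second_spawn:
    /* A real slow clone would ask the runtime here whether all
       children have returned before continuing past the sync. */
    return f->x + f->y;
}
```

For example, a frame stolen after the first spawn of fib(10) would carry entry 1 and n 10; once the first child has delivered x = 34, resuming from that frame computes the second child and returns 55.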

16
The T.H.E. Protocol
  • Deques held in shared memory.
  • Workers operate at the end, thieves at the front.
  • We must prevent race conditions where a thief and
    victim try to access the same procedure frame.
  • Locking deques would be expensive for workers.
  • The T.H.E. Protocol removes overhead of the
    common case, where there is no conflict.

17
The T.H.E. Protocol
  • Assumes only reads and writes are atomic.
  • Head of the deque is H, tail is T, and normally
    (T >= H).
  • Only a thief can change H.
  • Only the worker can change T.
  • To steal, thieves must get the lock L.
  • At most two processors operate on the deque at
    once.
  • Three cases of interaction:
  • Two or more items on deque: each gets one.
  • One item on deque: either worker or thief gets
    the frame, but not both.
  • No items on deque: both worker and thief fail.

18
One item on deque case
  • Both thief and worker assume they can get a
    procedure frame and change H or T.
  • If both thief and worker try to take the frame,
    one or both of them will discover (H > T),
    depending on instruction order.
  • If the thief discovers (H > T) it backs off and
    restores H.
  • If the worker discovers (H > T) it restores T,
    and then tries for the lock. Inside the lock, the
    procedure can be safely popped if still there.
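
The three cases can be exercised with a small C11 sketch of the protocol. Several things here are assumptions rather than the paper's code: sequentially consistent C11 atomics stand in for the atomic reads and writes the protocol relies on, the lock L is a simple spinlock, only the indices are modelled (no frames), and all the_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdatomic.h>

enum { FAILURE = 0, SUCCESS = 1 };

/* Hypothetical deque state: indices only, frames omitted. */
typedef struct {
    atomic_long H;   /* head: changed only by thieves */
    atomic_long T;   /* tail: changed only by the worker */
    atomic_flag L;   /* lock L, here a simple spinlock */
} the_deque;

void the_init(the_deque *d)
{
    assert(d != NULL);
    atomic_init(&d->H, 0);
    atomic_init(&d->T, 0);
    atomic_flag_clear(&d->L);
}

static void the_lock(the_deque *d)   { while (atomic_flag_test_and_set(&d->L)) { } }
static void the_unlock(the_deque *d) { atomic_flag_clear(&d->L); }

/* Worker: publish one more frame at the tail. */
void the_push(the_deque *d) { atomic_fetch_add(&d->T, 1); }

/* Worker: common case takes no lock at all. */
int the_pop(the_deque *d)
{
    atomic_fetch_sub(&d->T, 1);              /* optimistically claim the tail frame */
    if (atomic_load(&d->H) > atomic_load(&d->T)) {
        atomic_fetch_add(&d->T, 1);          /* possible conflict: restore T */
        the_lock(d);
        atomic_fetch_sub(&d->T, 1);          /* retry under the lock */
        if (atomic_load(&d->H) > atomic_load(&d->T)) {
            atomic_fetch_add(&d->T, 1);      /* frame is gone */
            the_unlock(d);
            return FAILURE;
        }
        the_unlock(d);
    }
    return SUCCESS;
}

/* Thief: always takes the lock. */
int the_steal(the_deque *d)
{
    the_lock(d);
    atomic_fetch_add(&d->H, 1);              /* optimistically claim the head frame */
    if (atomic_load(&d->H) > atomic_load(&d->T)) {
        atomic_fetch_sub(&d->H, 1);          /* back off and restore H */
        the_unlock(d);
        return FAILURE;
    }
    the_unlock(d);
    return SUCCESS;
}
```

Run from a single thread, two pushes followed by a steal and a pop both succeed (two items, each gets one), and a further pop and steal on the now empty deque both fail. The interesting one-item race only shows up with a real thief thread, which this sketch does not test.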

19
Empirical Results
  • On an eight-processor Sun SMP, achieved an average
    speedup of 6.2 over the C elision (the serial,
    non-threaded C versions).
  • Assumptions of work-first seem sound:
  • Applications tested all showed high amounts of
    average parallelism.
  • Work overhead small for most programs. Least
    speedup is where overhead is greatest.

20
Summary
  • Extension of C for parallel programming.
  • Aims to simplify parallelization.
  • Main idea is to prevent overhead for workers
    rather than focus on critical path.
  • Empirical results show an average speedup of 6.2
    on an 8-processor machine.

21
My Questions
  • A cilk spawn is always just a C call. Who starts
    the threads, and how many are there?
  • Why use Cilk rather than use threads directly?
  • What about using Cilk on a Beowulf cluster?
  • Are their test programs representative of SMP
    applications?

22
Other Extensions
  • Inlets: a wrapper around spawned procedure
    returns.
  • Abort: terminates work no longer needed (e.g. in
    parallel search).
  • Locking facilities for access to shared data.

23
T.H.E. Protocol: The Worker/Victim
push() {
    T++;
}

pop() {
    T--;
    if (H > T) {
        T++;
        lock(L);
        T--;
        if (H > T) {
            T++;
            unlock(L);
            return FAILURE;
        }
        unlock(L);
    }
    return SUCCESS;
}

steal() {
    lock(L);
    H++;
    if (H > T) {
        H--;
        unlock(L);
        return FAILURE;
    }
    unlock(L);
    return SUCCESS;
}
24
Fibonacci Illustration