Title: CILK: An Efficient Multithreaded Runtime System
1CILK An Efficient Multithreaded Runtime System
2People
- Project at MIT, now at UT Austin
- Bobby Blumofe (now UT Austin, Akamai)
- Chris Joerg
- Brad Kuszmaul (now Yale)
- Charles Leiserson (MIT, Akamai)
- Keith Randall (Bell Labs)
- Yuli Zhou (Bell Labs)
3Outline
- Introduction
- Programming environment
- The work-stealing thread scheduler
- Performance of applications
- Modeling performance
- Proven Properties
- Conclusions
4Introduction
- Why multithreading?
- To implement dynamic, asynchronous, concurrent programs.
- The Cilk programmer optimizes
- total work
- critical path
- A Cilk computation is viewed as a dynamic,
directed acyclic graph (dag)
5Introduction ...
6Introduction ...
- Cilk program is a set of procedures
- A procedure is a sequence of threads
- Cilk threads are
- represented by nodes in the dag
- non-blocking: they run to completion, with no waiting or suspension; they are atomic units of execution
7Introduction ...
- Threads can spawn child threads
- downward edges connect a parent to its children
- A child and its parent can run concurrently.
- Non-blocking threads ⇒ a child cannot return a value to its parent.
- The parent spawns a successor that receives values from its children.
8Introduction ...
- A thread and its successor are parts of the same Cilk procedure.
- They are connected by horizontal arcs.
- Children's returned values must be received before the successor begins.
- They constitute data dependencies.
- These are shown as curved arcs.
9Introduction ...
10Introduction Execution Time
- The execution time of a Cilk program using P processors depends on
- Work ($T_1$): the time for the Cilk program to complete on 1 processor.
- Critical path ($T_\infty$): the time to execute the longest directed path in the dag.
- $T_P \ge T_1 / P$ (not true for some searches)
- $T_P \ge T_\infty$
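The two quantities above can be computed directly from a dag. The following is a minimal sketch, assuming a hypothetical four-thread dag in which each thread has a made-up cost; the names `DAG`, `work`, `critical_path`, and `lower_bound` are illustrative and are not part of Cilk.

```python
from functools import lru_cache

# Hypothetical dag: thread name -> (cost, list of threads that depend on it).
DAG = {
    "a": (1, ["b", "c"]),
    "b": (2, ["d"]),
    "c": (4, ["d"]),
    "d": (1, []),
}

def work(dag):
    """T_1: total cost of all threads (time on one processor)."""
    return sum(cost for cost, _ in dag.values())

def critical_path(dag):
    """T_inf: cost of the most expensive directed path in the dag."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        cost, succs = dag[node]
        return cost + max((longest_from(s) for s in succs), default=0)
    return max(longest_from(n) for n in dag)

def lower_bound(dag, p):
    """max(T_1 / P, T_inf): both lower bounds on T_P combined."""
    return max(work(dag) / p, critical_path(dag))

print(work(DAG), critical_path(DAG), lower_bound(DAG, 2))
```

For this dag the work is 8 and the critical path is 6 (a, c, d), so with 2 processors the critical path, not the work, is the binding lower bound.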
11Introduction Scheduling
- Cilk uses run-time scheduling called work stealing.
- It works well on dynamic, asynchronous, MIMD-style programs.
- For fully strict programs, Cilk achieves asymptotic optimality for
- space, time, and communication
12Introduction language
- Cilk is an extension of C
- Cilk programs are
- preprocessed to C
- linked with a runtime library
13Programming Environment
- Declaring a thread
- thread T ( <args> ) { <stmts> }
- T is preprocessed into a C function of 1 argument and return type void.
- The 1 argument is a pointer to a closure.
14Environment Closure
- A closure is a data structure that has
- a pointer to the C function for T
- a slot for each argument (inputs and continuations)
- a join counter: the count of missing argument values
- A closure is ready when its join counter = 0.
- A closure is waiting otherwise.
- They are allocated from a runtime heap
15Environment Continuation
- A Cilk continuation is a data type, denoted by the keyword cont.
- cont int x;
- It is a global reference to an empty slot of a closure.
- It is implemented as 2 items
- a pointer to the closure (what thread)
- an int value: the slot number (what input)
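The closure and continuation described on these slides can be modeled in a few lines. This is a sketch under stated assumptions, not the Cilk runtime's actual layout: the names `Closure`, `Continuation`, `MISSING`, and `fill` are hypothetical, and a Python object reference stands in for the C pointer.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

MISSING = object()  # sentinel marking an empty argument slot (assumption)

@dataclass
class Closure:
    fn: Callable[..., None]   # stands in for the pointer to the C function
    slots: List[Any]          # one slot per argument
    join_counter: int = 0     # number of slots that are still missing

    def __post_init__(self):
        # Count the missing arguments when the closure is created.
        self.join_counter = sum(1 for s in self.slots if s is MISSING)

    def is_ready(self) -> bool:
        return self.join_counter == 0

@dataclass
class Continuation:
    closure: Closure  # what thread
    slot: int         # what input

def fill(k: Continuation, value: Any) -> None:
    """Store value in the slot k refers to and decrement the join counter."""
    c = k.closure
    assert c.slots[k.slot] is MISSING, "slot already filled"
    c.slots[k.slot] = value
    c.join_counter -= 1

# Usage: a closure with one known argument and one missing argument.
c = Closure(fn=lambda a, b: None, slots=[1, MISSING])
print(c.is_ready())            # waiting: one argument is missing
fill(Continuation(c, 1), 5)
print(c.is_ready())            # ready: join counter reached 0
```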
16Environment Closure
17Environment spawn
- To spawn a child, a thread creates its closure
- spawn T ( <args> )
- creates the child's closure
- sets available arguments
- sets join counter
- To specify a missing argument, prefix with a ?
- spawn T (k, ?x)
18Environment spawn_next
- A successor thread is spawned the same way as a child, except that the keyword spawn_next is used.
- spawn_next T (k, ?x)
- Children typically have no missing arguments; successors do.
19Explicit continuation passing
- Nonblocking threads ⇒ a parent cannot block on its children's results.
- It spawns a successor thread.
- This communication paradigm is called explicit continuation passing.
- Cilk provides a primitive to send a value from one closure to another.
20send_argument
- Cilk provides the primitive
- send_argument( k, value )
- sends value to the argument slot of a waiting
closure specified by continuation k.
[Figure: parent, child, and successor threads, with edges labeled spawn, spawn_next, and send_argument]
21Cilk Procedure for computing a Fibonacci number
thread fib ( cont int k, int n )
{
    if ( n < 2 )
        send_argument( k, n );
    else
    {
        cont int x, y;
        spawn_next sum ( k, ?x, ?y );
        spawn fib ( x, n - 1 );
        spawn fib ( y, n - 2 );
    }
}

thread sum ( cont int k, int x, int y )
{
    send_argument( k, x + y );
}
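To see how the Fibonacci procedure executes, the following sketch simulates it on a single processor. It is an illustration under assumptions, not Cilk itself: the helpers `spawn`, `send_argument`, `run_fib`, the `MISSING` sentinel, and the single LIFO `ready` pool are hypothetical stand-ins for the runtime, and a continuation is modeled as a (closure, slot) pair.

```python
MISSING = object()  # marks an argument slot that has not been filled yet

class Closure:
    def __init__(self, fn, args):
        self.fn, self.args = fn, list(args)
        self.join = sum(a is MISSING for a in self.args)  # join counter

ready = []  # single-processor pool of ready closures (LIFO)

def spawn(fn, *args):
    """Create a closure; queue it only if no argument is missing."""
    c = Closure(fn, args)
    if c.join == 0:
        ready.append(c)
    return c

def send_argument(k, value):
    """k is a (closure, slot) pair; fill the slot and ready the closure at 0."""
    c, slot = k
    c.args[slot] = value
    c.join -= 1
    if c.join == 0:
        ready.append(c)

def fib(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        s = spawn(sum_thread, k, MISSING, MISSING)  # spawn_next sum(k, ?x, ?y)
        spawn(fib, (s, 1), n - 1)                   # spawn fib(x, n - 1)
        spawn(fib, (s, 2), n - 2)                   # spawn fib(y, n - 2)

def sum_thread(k, x, y):
    send_argument(k, x + y)

def run_fib(n):
    sink = Closure(lambda v: None, [MISSING])  # receives the final value
    spawn(fib, (sink, 0), n)
    while ready:                 # each thread runs to completion, no blocking
        c = ready.pop()
        c.fn(*c.args)
    return sink.args[0]

print(run_fib(10))
```

No thread ever waits: `fib` returns immediately after spawning, and `sum_thread` only becomes ready once both of its missing arguments have arrived.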
22Nonblocking Threads Advantages
- Shallow call stack (for us: fault tolerance).
- Simplify runtime system
- Completed threads leave C runtime stack empty.
- Portable runtime implementation
23Nonblocking Threads Disadvantages
- Burdens programmer with explicit continuation
passing.
24Work-Stealing Scheduler
- The concept of work-stealing goes back at least as far as 1981.
- Work-stealing
- a process with no work selects a victim from which to get work.
- it gets the shallowest thread in the victim's spawn tree.
- In Cilk, thieves choose victims randomly.
25Thread Level
26Stealing Work The Ready Deque
- Each closure has a level
- level( child ) = level( parent ) + 1
- level( successor ) = level( parent )
- Each processor maintains a ready deque
- Contains ready closures
- The Lth element contains the list of all ready
closures whose level is L.
27Ready deque
- if ( !readyDeque.isEmpty() )
- take deepest thread
- else
- steal shallowest thread from readyDeque of
randomly selected victim
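The scheduling loop above can be sketched as follows. This is a simplified model under assumptions: the class `ReadyDeque` and the function `next_closure` are hypothetical names, closures are plain strings, and the order in which closures are taken within a level is a choice made here rather than something the slides specify.

```python
import random

class ReadyDeque:
    """Ready closures grouped by level; index L holds the level-L closures."""
    def __init__(self):
        self.levels = []

    def push(self, closure, level):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append(closure)

    def is_empty(self):
        return not any(self.levels)

    def pop_deepest(self):
        """Local work: take a closure from the deepest non-empty level."""
        for level in range(len(self.levels) - 1, -1, -1):
            if self.levels[level]:
                return self.levels[level].pop(), level
        return None

    def steal_shallowest(self):
        """Thief's view: take a closure from the shallowest non-empty level."""
        for level, bucket in enumerate(self.levels):
            if bucket:
                return bucket.pop(0), level
        return None

def next_closure(me, deques, rng=random):
    """One scheduling step for processor `me`: (closure, level) or None."""
    if not deques[me].is_empty():
        return deques[me].pop_deepest()
    victims = [i for i in range(len(deques)) if i != me]
    if not victims:
        return None
    return deques[rng.choice(victims)].steal_shallowest()

# Usage: processor 0 has work at three levels, processor 1 has none.
d0, d1 = ReadyDeque(), ReadyDeque()
d0.push("root", 0)
d0.push("child", 1)
d0.push("grandchild", 2)
print(next_closure(0, [d0, d1]))  # owner works on the deepest closure
print(next_closure(1, [d0, d1]))  # thief steals the shallowest closure
```

A failed steal returns None in this sketch; a real scheduler would retry with another randomly chosen victim.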
28Why Steal Shallowest closure?
- Shallow threads probably produce more work and therefore reduce communication.
- Shallow threads are more likely to be on the critical path.
29Readying a Remote Closure
- If a send_argument makes a remote closure ready,
- put the closure on the sending processor's readyDeque
- This requires extra communication.
- It is done to make the scheduler provably good.
- Putting the closure on the local readyDeque (the one where it already resides) works well in practice.
30Performance of Application
- $T_{serial}$ = time for the C program
- $T_1$ = time for the 1-processor Cilk program
- $T_{serial} / T_1$ = efficiency of the Cilk program
- Efficiency is close to 1 for programs with moderately long threads: Cilk overhead is small.
31Performance of Applications
- $T_1 / T_P$ = speedup
- $T_1 / T_\infty$ = average parallelism
- If average parallelism is large (relative to P)
- then speedup is nearly perfect.
- If average parallelism is small
- then speedup is much smaller.
32Performance Data
33Performance of Applications
- Application speedup = efficiency × speedup
- $= ( T_{serial} / T_1 ) \times ( T_1 / T_P ) = T_{serial} / T_P$
34Modeling Performance
- $T_P \ge \max( T_\infty, T_1 / P )$
- A good scheduler should come close to these lower
bounds.
35Modeling Performance
- Empirical data suggests that for Cilk
- $T_P \approx c_1 T_1 / P + c_\infty T_\infty$,
- where $c_1 \approx 1.067$ and $c_\infty \approx 1.042$.
- If $T_1 / T_\infty > 10P$
- then the critical path has almost no effect on $T_P$.
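The empirical model can be evaluated numerically to see why the critical path stops mattering at high parallelism. The sketch below assumes the two fitted constants quoted on this slide; the function names and the example inputs are made up for illustration.

```python
C1 = 1.067     # fitted coefficient of the T_1 / P term (from the slide)
C_INF = 1.042  # fitted coefficient of the T_inf term (from the slide)

def predicted_time(t1, t_inf, p):
    """Model estimate of T_P = c1 * T_1 / P + c_inf * T_inf."""
    return C1 * t1 / p + C_INF * t_inf

def critical_path_share(t1, t_inf, p):
    """Fraction of the predicted T_P contributed by the critical-path term."""
    return C_INF * t_inf / predicted_time(t1, t_inf, p)

# Hypothetical program: T_1 = 1000, T_inf = 10, so average parallelism is 100.
for p in (1, 10, 100):
    print(p, predicted_time(1000, 10, p), critical_path_share(1000, 10, p))
```

With P = 10 the average parallelism equals 10P, and under this model the critical-path term is just under 9% of the predicted time; the share shrinks further as the parallelism grows beyond 10P.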
36Proven Property Time
- Time: including overhead,
- $T_P = O( T_1 / P + T_\infty )$,
- which is asymptotically optimal.
37Conclusions
- We can predict the performance of a Cilk program
by observing machine-independent characteristics
- Work
- Critical path
- when the program is fully-strict.
- Cilk's usefulness is unclear for other kinds of programs (e.g., iterative programs).
38Conclusions ...
- Explicit continuation passing is a nuisance.
- It was subsequently removed (with more clever preprocessing).
39Conclusions ...
- Great system research has a theoretical underpinning.
- Such research identifies important properties
- of the systems themselves, or
- of our ability to reason about them formally.
- Cilk identified 3 significant system properties
- Fully strict programs
- Non-blocking threads
- Randomly choosing a victim.
40END
41The Cost of Spawns
- A spawn is about an order of magnitude more costly than a C function call.
- Spawned threads running on the parent's processor can be implemented more efficiently than remote spawns.
- This usually is the case.
- Compiler techniques can exploit this distinction.
42Communication Efficiency
- A request is an attempt to steal work (the victim may not have work).
- Requests/processor and steals/processor both grow as the critical path grows.
43Proven Properties Space
- In a fully strict program, a thread sends arguments only to its parent's successors.
- For such programs, space, time, and communication bounds are proven.
- Space: $S_P \le S_1 P$.
- There exists a P-processor execution for which this is asymptotically optimal.
44Proven Properties Communication
- Communication: the expected number of bits communicated in a P-processor execution is
- $O( T_\infty P S_{max} )$,
- where $S_{max}$ denotes the size of the program's largest closure.
- There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where $k \ge c \, T_\infty P S_{max}$ for some constant c.