Title: Deterministic Execution of Nondeterministic Shared-Memory Programs
1. Deterministic Execution of Nondeterministic Shared-Memory Programs
- Dan Grossman
- University of Washington
- Dagstuhl Seminar on Design and Validation of Concurrent Systems
- August 2009
2. What if...
- What if you could run the same multithreaded program on the same inputs twice and know you would get the same results?
- What exactly does that mean?
- Why might you want that?
- How can we do that (semi-efficiently)?
- But first:
- Some background on me and the talks I'm not giving
- Key terminology and perspectives
- More important than technical details at this event
3. Biography / group names
- Me:
- Programming-languages person
- Type systems, compilers for memory-safe C dialect, 2000-2004
- 30% → 80% focus on multithreading, 2005-
- Co-advising 3-4 students with computer architect Luis Ceze, 2007-
- Two groups for marketing purposes:
- WASP, wasp.cs.washington.edu
- SAMPA, sampa.cs.washington.edu
4. The talk you won't see
    void transferFrom(int amt, Acct other) {
      atomic {
        other.withdraw(amt);
        this.deposit(amt);
      }
    }
- Transactions are to shared-memory concurrency as garbage collection is to memory management [OOPSLA'07]
- Semantic problems with nontransactional accesses: worse than locks!
- Fix with stronger guarantees and compiler opts [PLDI'07]
- Or static type system, formal semantics, and proof [POPL'08]
- Or more dynamic approach adapting to Haskell [submitted]
- Prototypes for OCaml, Java, Scheme, and Haskell
5. This talk
- Take an arbitrary C/C++ program with POSIX threads
- Locks, barriers, condition variables, data races, whatever
- Compile it funny
- Link it against a funny run-time system
- Get deterministic behavior
- Well, as deterministic as a sequential C program
- Joint work with Luis Ceze, Tom Bergan, Joe Devietti, Owen Anderson
6. Terminology
- Essential perspectives, not just definitions
- Parallelism vs. concurrency
- Or different terms if you prefer
- Sequential semantics vs. determinism vs. nondeterminism
- What is an input?
- Level of abstraction
- Which one do you care about?
7. Concurrency
- Working definition:
- Software is concurrent if a primary intellectual challenge is responding to external events from multiple sources in a timely manner.
- Examples: operating system, shared hashtable, version control
- Key challenge is responsiveness
- Often leads to threads or asynchrony
- Correctness usually requires synchronization (e.g., locks)
8. Parallelism
- Working definition:
- Software is parallel if a primary intellectual challenge is using extra computational resources to do more useful work per unit time.
- Examples: scientific computing, most graphics, a lot of servers
- Key challenge is Amdahl's Law
- No sequential bottlenecks, no imbalanced load
- When pure fork-join isn't correct, need synchronization
9. The confusion
- First, this use of terms isn't standard
- Many systems are both
- And it's really a matter of degree
- Similar lower-level mechanisms, such as threads and locks
- And similar errors (race conditions, deadlocks, etc.)
- Our work determinizes these lower-level mechanisms, so we determinize concurrent and parallel applications
- But purely parallel ones probably benefit less
10. Terminology
- Essential perspectives, not just definitions
- Parallelism vs. concurrency
- Or different terms if you prefer
- Sequential semantics vs. determinism vs. nondeterminism
- What is an input?
- Level of abstraction
- Which one do you care about?
11. Sequential semantics
- Some languages can have results defined purely sequentially, but are designed to have better parallel-performance guarantees (thanks to a cost model)
- Examples: DPJ, Cilk, NESL, ...
- For correctness, reason sequentially
- For performance, reason in parallel
- Really designed for parallelism, not concurrency
- Not our work
12. Sequential isn't always deterministic
- Surprisingly easy to forget this
    int f1() { print("A"); print("B"); return 0; }
    int f2() { print("C"); print("D"); return 0; }
    int g()  { return f1() + f2(); }
- Must g() print ABCD?
- Java: yes
- C/C++: no; CDAB is allowed, but not ACBD, ACDB, etc.
13. Another example
- Dijkstra's guarded-command conditionals:

    if x % 2 == 1 -> y = x - 1
    [] x < 10     -> y = 7
    [] x > 10     -> y = 0
    fi
- We might still expect a particular language implementation (compiler) to be deterministic
- May choose any deterministic result consistent with the nondeterministic semantics
- Presumably doesn't change its choice across executions, but may across compiles (including butterfly effects)
- Our work does this (a minimal illustration below)
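A minimal C++ rendering (our illustration, not from the talk) of one such deterministic implementation choice: test the guards in a fixed textual order and take the first one that holds. Any other fixed rule would be an equally correct refinement of the nondeterministic semantics.

    #include <cstdlib>

    // One deterministic implementation of the guarded-command 'if' above:
    // evaluate guards in a fixed order, take the first true one, and abort
    // when no guard holds (the guarded-command 'if' has no behavior then).
    int choose_y(int x) {
        if (x % 2 == 1) return x - 1;
        if (x < 10)     return 7;
        if (x > 10)     return 0;
        std::abort();
    }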
14. Why helpful?
- So the programmer gets a deterministic executable, but doesn't know which one
- Key degree of freedom for automated performance
- Still helpful for:
- Whole-program testing and debugging
- Automated replicas
- In general, repeatability and reducing possible executions
15. Define deterministic, part 1
- Deterministic: outputs depend only on inputs
- That's right, but it means we must clearly specify what is an input (and an output)
- Can define away anything you want
- Example: all syscall results are inputs, so seeding the pseudorandom number generator with time-of-day is deterministic
- We mean what you think we mean:
- Inputs: command-line, I/O, syscalls
- Not inputs: cache state, hardware timing, thread scheduler, ...
16. Terminology
- Essential perspectives, not just definitions
- Parallelism vs. concurrency
- Or different terms if you prefer
- Sequential semantics vs. determinism vs. nondeterminism
- What is an input?
- Level of abstraction
- Which one do you care about?
17. Define deterministic, part 2
- "Is it deterministic?" depends crucially on your abstraction level
- Another obvious, easy-to-forget thing
- Examples:
- File systems
- Memory allocation (Java vs. C)
- Set implemented as a list
- Quantum mechanics
- Our work: the language level (state of logical memory, program output)
- Application may care only about a higher level (future work)
18. Okay, how?
- Trade-off between complexity and performance
[Figure: trade-off curve, axes labeled COMPLEXITY and PERFORMANCE]
- Performance:
- Overhead (single-thread slowdown)
- Scalability (minimize extra synchronization, waiting)
19. Starting serial
- Determinization is easy!
- Run one thread at a time in round-robin order
- Context-switch after N basic blocks, for deterministic N
- Cannot use a timer; use the compiler and run-time (sketched below)
- Races in the source program are irrelevant; locks still respected
- Example with 3 threads running (time moves with the arrows)
[Figure: threads T1, T2, T3 run one at a time; each turn is 1 quantum, and one pass over all three threads is 1 round]
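A minimal sketch, assuming a hypothetical runtime (the names quantum_tick, wait_for_turn, QUANTUM, etc. are ours), of how compiler-inserted counting replaces a timer: every basic block calls quantum_tick(), and after N calls the thread deterministically hands the turn to the next thread.

    #include <condition_variable>
    #include <mutex>

    constexpr int QUANTUM = 10000;        // N basic blocks per quantum
    static std::mutex m;
    static std::condition_variable cv;
    static int turn = 0;                  // whose quantum it is
    static int num_threads = 3;
    thread_local int my_id = 0;           // assigned deterministically at spawn
    thread_local int ticks_left = QUANTUM;

    // Called once when a thread starts: block until its first turn.
    void wait_for_turn() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return turn == my_id; });
    }

    // The compiler inserts a call to this in every basic block.
    void quantum_tick() {
        if (--ticks_left > 0) return;     // common case: stay in the quantum
        ticks_left = QUANTUM;
        std::unique_lock<std::mutex> lk(m);
        turn = (turn + 1) % num_threads;  // deterministic round-robin hand-off
        cv.notify_all();
        cv.wait(lk, [] { return turn == my_id; });
    }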
20. Parallel quanta
- The quanta in a round can start to run in parallel, provided they stop before any communication occurs (see how next)
- So each round has two stages, parallel then serial (see the sketch below)
[Figure: T1, T2, T3 each run a quantum; two threads "load A" in the parallel stage, which ends with a global barrier; "store B" and "store C" happen in the serial stage, which ends as the next round starts]
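As a structural sketch (hook names and stub bodies are our assumptions, not the actual runtime), each round can be driven by a barrier: all threads run their quanta in parallel until they would communicate, then take turns finishing in a fixed order.

    #include <pthread.h>

    // Hypothetical runtime hooks:
    static void run_quantum_parallel(int tid) { /* run until communication */ }
    static void run_quantum_serial(int tid)   { /* finish the quantum */ }

    static pthread_barrier_t bar;  // initialized once for num_threads threads

    // One round, executed by every thread with its own tid.
    static void one_round(int tid, int num_threads) {
        run_quantum_parallel(tid);         // parallel stage: no communication
        pthread_barrier_wait(&bar);        // deterministic end of the stage
        for (int t = 0; t < num_threads; t++) {
            if (t == tid)
                run_quantum_serial(tid);   // serial stage: fixed thread order
            pthread_barrier_wait(&bar);    // everyone observes the same order
        }
    }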
21. Is that legal?
[Same figure as before: two threads "load A" in the parallel stage; "store B" and "store C" occur in the serial stage]
- Can produce a different result than a serial execution
- In fact, the execution is not necessarily equivalent to any serialization of quanta
- But it doesn't matter as long as we are deterministic! Just need:
- Parallel stages do no communication
- Parallel stages end at deterministic points
22. Performance
[Same figure as before]
- Keys to scalability:
- (1) Run almost everything in the parallel stage
- (2) Keep quanta balanced
- For (2), assume (1) and use rough instruction costs
23. Memory ownership
- To avoid communication during the parallel stage:
- Every memory location is shared or owned by one thread T
- Dynamic table checked and updated during execution (sketched in code below)
- Can read only memory that is shared or owned-by-you
- Can write only memory owned-by-you
- Locks: just like memory locations; blocking ends the quantum
- In our example, perhaps A is shared while B and C are owned by T2
[Same figure as before: T1 and T3 "load A" (shared) in the parallel stage; T2 does "store B" and "store C" (owned by T2)]
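A minimal sketch of the ownership check, assuming a flat table; the granularity, table size, and all names here are our assumptions rather than the actual data structures.

    #include <cstddef>
    #include <cstdint>

    enum : int16_t { SHARED = -1 };       // otherwise: the owning thread's id
    constexpr size_t GRAIN = 64;          // assumed ownership granularity
    constexpr size_t TABLE = 1 << 20;
    static int16_t owner[TABLE];

    static int16_t& owner_of(const void* a) {
        return owner[(reinterpret_cast<uintptr_t>(a) / GRAIN) % TABLE];
    }

    // Inserted before each load in the parallel stage: reading shared or
    // self-owned data is fine; anything else must end the quantum.
    bool may_read(const void* a, int16_t tid) {
        int16_t o = owner_of(a);
        return o == SHARED || o == tid;
    }

    // Inserted before each store: only the owner may write in parallel.
    bool may_write(const void* a, int16_t tid) {
        return owner_of(a) == tid;
    }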
24. Changing ownership
- Policy (sketched in code below):
- For each location (any deterministic granularity is correct):
- First owner is the first thread to allocate the location
- On read in the serial stage, if owned-by-other, set to shared
- On write in the serial stage, set to owned-by-self
- Correctness:
- Ownership immutable in parallel stages (so no communication)
- Serial-stage changes are deterministic
- So many, many policies are correct
- We chose the obvious one for temporal locality and read-sharing
- Must have good locality for scalability!
25. Overhead
- Significant overhead:
- All reads/writes consult ownership information
- All basic blocks subtract from a thread-local quantum counter
- Reduce via:
- Lots of run-time engineering and data structures (not too much magic, but most important)
- Obvious compiler optimizations, like escape analysis and hoisting counter-subtractions
- Specialized compiler optimizations, like Subsequent Access Optimization: don't recheck the same ownership unless a quantum boundary might intervene (illustrated below)
- Correctness of this is a subtle argument and slightly affects the ownership-change policy (deterministically!)
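A schematic illustration of what Subsequent Access Optimization buys on straight-line code; check_read() and quantum_tick() are stand-ins for the compiler-inserted ownership check and a possible quantum boundary (both names are ours).

    static void check_read(const void* a) { /* consult ownership table */ }
    static void quantum_tick()            { /* possible quantum boundary */ }

    int before(int* p) {                  // without the optimization
        check_read(p); int a = *p;
        check_read(p); int b = *p;        // redundant: same location, no boundary
        return a + b;
    }

    int after(int* p) {                   // with the optimization
        check_read(p); int a = *p;
        int b = *p;                       // second check elided
        quantum_tick();                   // a quantum boundary might intervene...
        check_read(p);                    // ...so later accesses recheck
        return a + b + *p;
    }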
26. Brittle
- Change any line of code, command-line argument, environment variable, etc., and you can get a different deterministic program
- We are mostly robust to memory-safety errors, except:
- Bounds errors that corrupt ownership information
- Bounds errors that write to another thread's allegedly-thread-local data
27. Results
- Overhead: varies a lot, but about 3x at 8 threads
- Scalability: varies a lot, but on average with the PARSEC suite (*):
- nondet 8 threads vs. nondet 2 threads: 2.4x (linear would be 4x)
- det 8 threads vs. det 2 threads: 2.0x
- det 8 threads vs. nondet 2 threads: 0.91x (range 0.41x - 2.75x)
- How do you want to spend Moore's Dividend?
- (*) runnable subset: no MPI, no C++ exceptions, no 32-bit assumptions
28. Buffering
- Actually, ownership is only one approach
- A second approach relies on buffering and a commit stage
- Even higher overhead (to consult buffers)
- Even better scalability (block only for synchronization commits)
- And a third, hybrid approach
- Hopefully more details soon (a rough sketch of buffering below)
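A very rough sketch of the buffering idea, assuming byte-granularity private write buffers (the class and its layout are our illustration only): parallel-stage writes go to a per-thread buffer, reads consult the buffer first, and buffers commit one thread at a time in a fixed order.

    #include <cstdint>
    #include <unordered_map>

    struct WriteBuffer {                  // one per thread
        std::unordered_map<uintptr_t, uint8_t> pending;  // addr -> byte

        void store(uintptr_t addr, uint8_t v) { pending[addr] = v; }

        uint8_t load(uintptr_t addr) const {
            auto it = pending.find(addr);
            if (it != pending.end()) return it->second;     // own write wins
            return *reinterpret_cast<const uint8_t*>(addr); // else real memory
        }

        void commit() {                   // serial stage: fixed thread order
            for (auto& [addr, v] : pending)
                *reinterpret_cast<uint8_t*>(addr) = v;
            pending.clear();
        }
    };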
29. Conclusion
- The fundamental assumption that nondeterministic shared-memory programs must be run nondeterministically is false
- A fun problem to throw principled compiler and run-time optimizations at
- Could dramatically change how we test and debug parallel and concurrent programs
- Most-related work:
- Kendo from MIT, done concurrently (in parallel?): requires knowing about data races statically; different approach
- Colleagues: hardware support for ownership [ASPLOS'09]
- Record-and-replay systems: we can replay without the record