Title: Debugging Concurrent Software by Context-Bounded Analysis
1. Debugging Concurrent Software by Context-Bounded Analysis
- Shaz Qadeer
- Microsoft Research
- Joint work with
- Jakob Rehof, Microsoft Research
- Dinghao Wu, Princeton University
2. Concurrent software
[Figure: four threads (Thread 1 to Thread 4) running on two processors (Processor 1, Processor 2)]
- Operating systems, device drivers
- Databases, web servers, browsers, GUIs, ...
- Modern languages: C, Java
3. Concurrency is increasingly important
- Single-chip multiprocessors are an architectural inflexion point
- Software running on these chips will be even more concurrent
- Embedded systems
  - Airplanes, cars, PDAs, cellphones
- Web services
4. Reliable concurrent software?
- Correctness problem
  - Does the program behave correctly for all inputs and all interleavings?
- Bugs due to concurrency are insidious
  - Nondeterministic, timing dependent
  - Difficult to detect, reproduce, eliminate
  - Coverage from testing is very poor
5. Analysis of concurrent programs is difficult (1)
- Finite-data single-procedure program
  - n lines
  - m states for global data variables
- 1 thread
  - n × m states
- K threads
  - n^K × m states
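
The blow-up on this slide can be made concrete with a few lines of C. This is only an illustrative sketch: the function names `states_one_thread` and `states_k_threads` are hypothetical, and the values returned are the upper bounds n × m and n^K × m from the slide, not measured state counts.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound for one thread: n program locations times
   m valuations of the global data variables. */
uint64_t states_one_thread(uint64_t n, uint64_t m) {
    return n * m;
}

/* Upper bound for k threads: every thread has its own program counter,
   so there are n^k combinations of locations, times m global valuations.
   No overflow checking is done, so keep the inputs small. */
uint64_t states_k_threads(uint64_t n, uint64_t m, unsigned k) {
    assert(k >= 1);
    uint64_t locations = 1;
    for (unsigned i = 0; i < k; i++) {
        locations *= n;
    }
    return locations * m;
}
```

For example, with n = 100 and m = 8, one thread gives at most 800 states, while four threads give at most 800,000,000.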
6. Analysis of concurrent programs is difficult (2)
- Finite-data program with procedures
  - n lines
  - m states for global data variables
- 1 thread
  - Infinite number of states
  - Can still decide assertions in O(n × m^3)
  - SLAM, ESP, BLAST implement this algorithm
- K ≥ 2 threads
  - Undecidable! (Ramalingam 00)
7. Context-bounded verification of concurrent software
[Figure: an execution timeline divided into three contexts, separated by two context switches]
Analyze all executions with a small number of context switches!
8. Why context-bounded analysis?
- Many subtle concurrency errors are manifested in executions with a small number of contexts
- Context-bounded analysis can be performed efficiently
9. KISS: A static checker for concurrent software
- An implementation of context-bounded analysis
- Technique to use any sequential checker to perform context-bounded concurrency analysis
- Has found a number of concurrency errors in NT device drivers
10. KISS: A static checker for concurrent software
[Figure: KISS translates concurrent program P into sequential program Q, which is passed to a sequential checker. Outcomes: "No error found", or an error in Q]
Error in Q indicates error in P
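11. KISS: A static checker for concurrent software
[Figure: KISS translates concurrent program P into sequential program Q, which is checked by SDV. Outcomes: "No error found", or an error in Q]
Error in Q indicates error in P
12. KISS: A static checker for concurrent software
[Figure: KISS translates concurrent program P into sequential program Q, which is checked by PREfix. Outcomes: "No error found", or an error in Q]
Error in Q indicates error in P
13. KISS: A static checker for concurrent software
[Figure: KISS translates concurrent program P into sequential program Q, which is checked by ESP. Outcomes: "No error found", or an error in Q]
Error in Q indicates error in P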
14. Inside a static checker for sequential programs
  int x, y, z;
  void foo() {
    if (x > y)
      y = x;
    if (y > z)
      z = y;
    assert(x <= z);
  }
- Symbolically analyze all paths
- Check the assertion for each path
- Interprocedural analysis
- e.g., PREfix, ESP, SLAM, BLAST
15. KISS strategy
- Q encodes executions of P with a small number of context switches
  - Instrumentation introduces lots of extra paths to mimic context switches
- Leverage all-path analysis of sequential checkers
16.
  PnpStop() {
    int t;
    de->stopping = T;
    t = AtomicDecr(&de->count);
    if (t == 0)
      SetEvent(&de->stopEvent);
    WaitEvent(&de->stopEvent);
  }

  DispatchRoutine() {
    int t;
    if (!de->stopping) {
      AtomicIncr(&de->count);
      // do useful work
      t = AtomicDecr(&de->count);
      if (t == 0)
        SetEvent(&de->stopEvent);
    }
  }
17.
  DispatchRoutine() {
    int t;
    if (!de->stopping) {
      AtomicIncr(&de->count);
      // do useful work
      t = AtomicDecr(&de->count);
      if (t == 0)
        SetEvent(&de->stopEvent);
    }
  }

  PnpStop() {
    int t;
    if (*) return;
    de->stopping = T;
    if (*) return;
    t = AtomicDecr(&de->count);
    if (*) return;
    if (t == 0)
      SetEvent(&de->stopEvent);
    if (*) return;
    WaitEvent(&de->stopEvent);
  }
19.
  PnpStop() {
    int t;
    if (*) return;
    de->stopping = T;
    if (*) return;
    t = AtomicDecr(&de->count);
    if (*) return;
    if (t == 0)
      SetEvent(&de->stopEvent);
    if (*) return;
    WaitEvent(&de->stopEvent);
  }

  main() {
    DispatchRoutine();
  }
20.
  PnpStop() {
    int t;
    CODE
    de->stopping = T;
    CODE
    t = AtomicDecr(&de->count);
    CODE
    if (t == 0)
      SetEvent(&de->stopEvent);
    CODE
    WaitEvent(&de->stopEvent);
    CODE
  }

  main() {
    PnpStop();
  }
21. KISS features
- KISS trades off soundness for scalability
- Cost of analyzing a concurrent program P = cost of analyzing a sequential program Q
  - Size of Q is asymptotically the same as the size of P
- Unsoundness is precisely quantifiable
  - For a 2-thread program, explores all executions with up to two context switches
  - For an n-thread program, explores up to 2n-2 context switches
- Allows any sequential checker to analyze concurrency
23. Experimental Evaluation of KISS
24. Driver Stopping Error in Bluetooth Driver (1 KLOC)
  DispatchRoutine() {
    int t;
    if (!de->stopping) {
      AtomicIncr(&de->count);
      assert(!driverStopped);
      // do useful work
      t = AtomicDecr(&de->count);
      if (t == 0)
        SetEvent(&de->stopEvent);
    }
  }

  PnpStop() {
    int t;
    de->stopping = T;
    t = AtomicDecr(&de->count);
    if (t == 0)
      SetEvent(&de->stopEvent);
    WaitEvent(&de->stopEvent);
    driverStopped = T;
  }
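25.
  int t;
  if (!de->stopping)
Assertion fails! A context switch to PnpStop immediately after this check lets PnpStop run to completion, so DispatchRoutine resumes with driverStopped already set.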
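
The failing interleaving can be replayed in plain sequential C by calling PnpStop as an ordinary function at a chosen point inside DispatchRoutine. This is a hand-written sketch, not the instrumentation that KISS generates: the `switchAt` parameter stands in for the nondeterministic choice, events are modeled as booleans, a blocked WaitEvent is modeled by abandoning PnpStop, and the initial value 1 for `count` is an assumption. The name `dispatch_fails_with_switch_at` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the device extension. */
struct DeviceExt {
    bool stopping;
    int count;       /* assumption: starts at 1 */
    bool stopEvent;
};

static struct DeviceExt de;
static bool driverStopped;

/* PnpStop executed to completion as an ordinary function call. */
static void PnpStop(void) {
    de.stopping = true;
    int t = --de.count;                 /* AtomicDecr */
    if (t == 0) de.stopEvent = true;    /* SetEvent */
    if (!de.stopEvent) return;          /* WaitEvent would block: give up */
    driverStopped = true;
}

/* Runs DispatchRoutine with one simulated context switch to PnpStop:
   0 = before the stopping check, 1 = right after the check,
   2 = after the increment, 3 = no switch at all.
   Returns true if assert(!driverStopped) would fail. */
bool dispatch_fails_with_switch_at(int switchAt) {
    assert(switchAt >= 0 && switchAt <= 3);
    de.stopping = false;
    de.count = 1;
    de.stopEvent = false;
    driverStopped = false;
    bool assertionFailed = false;

    if (switchAt == 0) PnpStop();
    if (!de.stopping) {
        if (switchAt == 1) PnpStop();
        de.count++;                                  /* AtomicIncr */
        if (driverStopped) assertionFailed = true;   /* the assertion */
        if (switchAt == 2) PnpStop();
        /* do useful work */
        int t = --de.count;                          /* AtomicDecr */
        if (t == 0) de.stopEvent = true;             /* SetEvent */
    }
    return assertionFailed;
}
```

Only the switch placed right after the check (`switchAt == 1`) makes the assertion fail; a sequential checker that explores all paths of such a program would find it without reasoning about threads.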
26. IRP Cancellation Error in Packet Driver (2.5 KLOC)
  DispatchRoutine(IRP *irp) {
    irp->CancelRoutine = PacketCancelRoutine;
    Enqueue(irp);
    IoMarkIrpPending(irp);
  }

  IoCancelIrp(IRP *irp) {
    IoAcquireCancelSpinLock();
    if (irp->CancelRoutine)
      (irp->CancelRoutine)(irp);
  }

  PacketCancelRoutine(IRP *irp) {
    Dequeue(irp);
    IoCompleteRequest(irp);
    IoReleaseCancelSpinLock();
  }
27.
  irp->CancelRoutine = PacketCancelRoutine;
  Enqueue(irp);
Error: An irp should not be marked pending after it has been completed!
28. Data-race Conditions in DDK Sample Drivers
- Device extension shared among threads
- Data races on device extension fields
- 18 sample DDK drivers
  - Range: 0.5-9.2 KLOC
  - Total: 70 KLOC
- Each field checked separately with a resource limit of 20 minutes and 800 MB
- Two threads, each calling a nondeterministically chosen dispatch routine
29. Total: 30 races
30. DevicePnPState Field in Toaster/toastmon
  ToastMon_DispatchPnp(DEVICE_OBJECT *obj, IRP *irp) {
    IoAcquireRemoveLock();
    case IRP_MN_QUERY_STOP_DEVICE:
      // Race: write access
      deviceExt->DevicePnPState = StopPending;
      break;
    IoReleaseRemoveLock();
  }

  ToastMon_DispatchPower(DEVICE_OBJECT *obj, IRP *irp) {
    // Race: read access
    if (deviceExt->DevicePnPState == Deleted)
  }
31. Acknowledgments
- Tom Ball
- Byron Cook
- John Henry
- Doron Holan
- Vladimir Levin
- Jakob Lichtenberg
- Adrian Oney
- Sriram Rajamani
- Peter Wieland
32. Keep It Simple and Sequential
- Context-bounded analysis by leveraging existing sequential checkers
- Validates the hypothesis that many concurrency errors require few context switches to show up
33. However
- Hard limit on number of explored contexts
  - e.g., two context switches for a concurrent program with two threads
- Case study: concurrent transaction management code written in C (Naik-Rehof 04)
  - Analyzed by the Zing model checker after automatically translating to the Zing input language
  - Found three bugs, each requiring between three and four context switches
34. Is a tuning knob possible?
Given a concurrent boolean program P and a positive integer c, does P go wrong by failing an assertion via an execution with at most c contexts?
Decidable
Given a concurrent boolean program P with unbounded fork-join parallelism and a positive integer c, does P go wrong by failing an assertion via an execution with at most c contexts?
Decidable
35.
[Figure: an execution timeline divided into three contexts, separated by two context switches]
- Problem
  - Unbounded computation possible within each context!
  - Unbounded execution depth and reachable state space
  - Different from bounded-depth model checking
36. Sequential boolean program
Global store g: valuation to global variables
Local store l: valuation to local variables
Stack s: sequence of local stores
State: (g, s)
37. Example
  bool a = F;
  void main() {
  L1: a = T;
  L2: flip(a);
  L3: }
  void flip(bool x) {
  L4: a = !x;
  L5: }

States have the form (a, ⟨x, pc⟩ ...), with one ⟨x, pc⟩ frame per active procedure:
  (F, ⟨_, L1⟩)
  (T, ⟨_, L2⟩)
  (T, ⟨_, L3⟩ ⟨T, L4⟩)
  (F, ⟨_, L3⟩ ⟨T, L5⟩)
  (F, ⟨_, L3⟩)
  (F, ε)
39. Reachability problem for sequential boolean program
Given (g, s), is there s' such that (g, s) →* (error, s')?
40. Aggregate state
Set of stacks: ss
Aggregate state: (g, ss) = { (g, s) | s ∈ ss }
Reach(g, ss) = { (g', s') | there exists s ∈ ss such that (g, s) →* (g', s') }
41. Aggregate transition relation
- Observations
  - There is a unique smallest partition of Reach(g, ss) into aggregate states (g1, ss1) ∪ ... ∪ (gn, ssn)
  - The number of elements in the partition is bounded by the number of global stores
(g, ss) ⇒ (g1, ss1)
. . .
(g, ss) ⇒ (gn, ssn)
42. Theorem (Büchi, Schwoon 00)
- If ss is regular and (g, ss) ⇒ (g', ss'), then ss' is regular.
- If ss is given as a finite automaton A, then a finite automaton A' for ss' can be constructed from A in polynomial time.
43. Algorithm
Problem: Given (g, s), is there s' such that (g, s) →* (error, s')?
Solution: Compute an automaton for ss' such that (g, {s}) ⇒ (error, ss') and check if ss' is nonempty.
44. Concurrent boolean program
Global store g: valuation to global variables
Local store l: valuation to local variables
Stack s: sequence of local stores
State: (g, s1, s2)
45. Reachability problem for concurrent boolean program
Given (g, s1, s2), are there s1' and s2' such that (g, s1, s2) reaches (error, s1', s2') via an execution with at most c contexts?
46. Aggregate transition relation
(g, ss1, ss2) = { (g, s1, s2) | s1 ∈ ss1, s2 ∈ ss2 }
47. Algorithm: 2 threads, c contexts
Compute the set of reachable aggregate states.
Report an error if (g, ss1, ss2) is reachable and g = error, ss1 is nonempty, and ss2 is nonempty.
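
The effect of the context bound c can be seen on a toy example. The sketch below is an explicit-state search over a two-thread program without procedures, so it sidesteps stacks and aggregate states entirely and is not the automata-based algorithm of the slides. The program (two threads each performing a non-atomic increment of a shared variable) and the names `lost_update_within`, `search`, and `step` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Each thread runs:  pc 0: tmp = x;   pc 1: x = tmp + 1;   pc 2: done. */
struct State {
    int x;        /* shared global */
    int pc[2];    /* per-thread program counter */
    int tmp[2];   /* per-thread local */
};

/* Executes one statement of thread t; returns false if t has finished. */
static bool step(struct State *s, int t) {
    if (s->pc[t] == 0) { s->tmp[t] = s->x; s->pc[t] = 1; return true; }
    if (s->pc[t] == 1) { s->x = s->tmp[t] + 1; s->pc[t] = 2; return true; }
    return false;
}

/* Depth-first search over all executions that need at most
   contexts_left additional contexts. `active` is the thread that ran
   last (-1 at the start). Returns true if a lost update is reachable. */
static bool search(struct State s, int active, int contexts_left) {
    if (s.pc[0] == 2 && s.pc[1] == 2) {
        return s.x != 2;   /* error: one increment was lost */
    }
    for (int t = 0; t < 2; t++) {
        int cost = (t == active) ? 0 : 1;   /* switching starts a new context */
        if (cost > contexts_left) continue;
        struct State next = s;
        if (step(&next, t) && search(next, t, contexts_left - cost)) {
            return true;
        }
    }
    return false;
}

/* Is the lost update reachable with at most c contexts? */
bool lost_update_within(int c) {
    assert(c >= 0);
    struct State init = { 0, { 0, 0 }, { 0, 0 } };
    return search(init, -1, c);
}
```

With at most two contexts each thread runs its increment without interruption, so no error is found; with three contexts the search finds the interleaving in which one thread reads, the other thread reads and writes, and the first thread then overwrites the result.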
48. Complexity: 2 threads, c contexts
[Figure: tree of reachable aggregate states rooted at (g, s1, s2), with branches labeled 1 and 2]
Depth of tree = context bound c
Branching factor bounded by G × 2 (G = number of global stores)
Number of edges bounded by (G × 2)^(c+1)
Each edge computable in polynomial time
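
As a small worked check of the bound, the sketch below compares the number of edges in a complete tree of depth c with the stated bound. It reads the branching factor as 2 × G (at most G successor aggregate states for each of the two threads) and the bound as (2 × G)^(c+1); both readings are assumptions about the slide, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Branching factor: each of the 2 threads can lead to at most
   G aggregate states, one per global store. */
static uint64_t branching(uint64_t G) {
    return 2 * G;
}

/* Number of edges in a complete tree of depth c with branching
   factor b = 2G: b + b^2 + ... + b^c. */
uint64_t full_tree_edges(uint64_t G, unsigned c) {
    assert(G >= 1);
    uint64_t b = branching(G), level = 1, edges = 0;
    for (unsigned d = 1; d <= c; d++) {
        level *= b;       /* number of nodes at depth d */
        edges += level;   /* one incoming edge per node */
    }
    return edges;
}

/* The bound (2G)^(c+1). No overflow checking: keep inputs small. */
uint64_t edge_bound(uint64_t G, unsigned c) {
    assert(G >= 1);
    uint64_t b = branching(G), bound = 1;
    for (unsigned d = 0; d <= c; d++) {
        bound *= b;
    }
    return bound;
}
```

For G = 4 and c = 2 the complete tree has 8 + 64 = 72 edges, comfortably below the bound of 8^3 = 512.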
49. Context-bounded analysis of concurrent software
- Many subtle concurrency errors are manifested in executions with few context switches
  - Experience with KISS on Windows drivers
  - Experience with Zing on transaction manager
- Algorithms for context-bounded analysis are more efficient than those for unbounded analysis
  - Reducibility to sequential checking with KISS
  - Decidability of assertion checking for concurrent boolean programs
50. Applications of context-bounded analysis
- Coverage metric for testing concurrent software
- Analysis of computer protocols
  - networking
  - cache-coherence
51. Unbounded fork-join parallelism
- Fork operation: x = fork
- Join operation: join(x)
- Copy thread identifier from one variable to another
52. Algorithm: unbounded fork-join parallelism, c contexts
- At most c threads may perform a transition
- Reduce to previously solved problem with c threads and c contexts
- Nondeterministically pick c forked threads for execution
53.
start : {1, ..., c} → boolean, initialized to λi. (i = 1)
end : {1, ..., c} → boolean, initialized to λi. false
- c statically created threads
- thread i starts execution when start[i] is true
- thread i sets end[i] to true on termination
count : {1, ..., c}, initialized to 1
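
One way the bookkeeping on this slide could be used is sketched below. The slide stops after declaring count, so the treatment of fork and join here is an assumption rather than the construction from the talk: a fork either claims the next of the c static thread slots or returns a dummy identifier for a thread that never runs, and a join may proceed only once the joined thread has set its end flag. The `pick` parameter stands in for the nondeterministic choice, and all names (`fj_init`, `fork_thread`, `finish_thread`, `can_join`, `may_run`, `NO_THREAD`) and the value of `C` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define C 3            /* context bound c (hypothetical value) */
#define NO_THREAD 0    /* id of a forked thread that is never executed */

static bool start_flag[C + 1];   /* start[i]: thread i may begin (1-based) */
static bool end_flag[C + 1];     /* end[i]: thread i has terminated */
static int count;                /* number of thread slots in use */

/* Initial state: only thread 1 (the main thread) is started. */
void fj_init(void) {
    for (int i = 1; i <= C; i++) {
        start_flag[i] = (i == 1);
        end_flag[i] = false;
    }
    count = 1;
}

/* x = fork. If the forked thread is picked for execution and a slot is
   free, start the next static thread; otherwise it never runs. */
int fork_thread(bool pick) {
    if (pick && count < C) {
        count++;
        start_flag[count] = true;
        return count;
    }
    return NO_THREAD;
}

/* Has thread i been started? */
bool may_run(int i) {
    return i >= 1 && i <= C && start_flag[i];
}

/* Called by thread i when it terminates. */
void finish_thread(int i) {
    assert(may_run(i));
    end_flag[i] = true;
}

/* join(x) may proceed only after thread x has terminated. A join on an
   unpicked thread never proceeds, so such executions are not explored. */
bool can_join(int x) {
    return x != NO_THREAD && end_flag[x];
}
```

Under these assumptions at most c threads ever take a step, which matches the observation that an execution with at most c contexts involves at most c threads.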