Title: Runtime Safety Analysis of Concurrent and Distributed Systems
1Runtime Safety Analysis of Concurrent and
Distributed Systems
- Koushik Sen
- University of Illinois at
- Urbana-Champaign, USA
Joint work with Gul Agha, Grigore Rosu and Abhay
Vardhan.
2Increasing Software Reliability
- Current solutions
- Human review of code and testing
- Most used in practice
- Usually ad-hoc, intensive human support
- (Advanced) Static analysis
- Often scales up
- False positives and negatives, annotations
- (Traditional) Formal methods
- Model checking and theorem proving
- General, good confidence, do not always scale up
3Runtime Verification
- Merge testing and temporal logic specification
- Specify safety properties in a suitable temporal logic.
- Monitor safety properties against a run of the program.
- Examples: JPaX (NASA Ames), UPenn's Java MaC
- Disadvantages
- Lack of coverage.
- Not suitable for distributed systems.
[Figure: a run of the program is fed to a naïve observer, which analyzes the observed run.]
4Our Approach
- For Distributed Programs
- Use Distributed Temporal Logic.
- Use KnowledgeVector
- Decentralize monitoring by distributing monitors to all processes.
- For MultiThreaded Programs
- Use smart observers to predict safety violations
- Vector Clock Algorithm for MultiThreaded Programs
- Construct Computation Lattice
- Analyze Lattice level by level
- Use Causality Cone Heuristics to increase
efficiency
5Decentralized Distributed Program Monitoring
(DIANA)
http://osl.cs.uiuc.edu/
6Centralized Monitoring Approach
- Distributed Systems
- Global state is distributed
- To Monitor
- For every state update, send the state to a central monitor
- Central monitor assembles them to form consistent execution traces
- Verify the global safety property on these execution traces
7An Example
- Mobile node a requests certain value from node b
- b computes the value and sends it to a
- Property: no node receives a value from another node to which it had not sent a request
8Centralized Monitoring Example
If a receives a value from b, then b computed the value after receiving a request from a:
valRcv → ⟐(valComputed ∧ ⟐valReq)
- For a large number of nodes, the size of the property in LTL may be large
- A message is sent to the monitor for every state update
[Figure: timelines of processes a and b. a issues valReq and later observes valRcv; b performs valComputed. The subformulas ⟐valReq, valComputed ∧ ⟐valReq, and ⟐(valComputed ∧ ⟐valReq) annotate the trace assembled by the central monitor.]
9Decentralized Approach
- Distribute property
- Properties expressed with respect to a process
- Local properties at every process
- Decentralize Monitoring
- Maintain knowledge of global state at each process
- Update knowledge with incoming messages
- Attach knowledge to outgoing messages
- At each process, check the safety property against local knowledge
10Decentralized Monitoring Example
If a receives a value from b, then b computed the value after receiving a request from a:
valRcv → @a(⟐(valComputed ∧ @b(⟐valReq)))
- Formulas are written w.r.t. processes
- No extra messages
- No need for a global snapshot
[Figure: timelines of processes a and b. a issues valReq and later observes valRcv; b performs valComputed. The subformulas ⟐valReq, @b(⟐valReq), valComputed ∧ @b(⟐valReq), and ⟐(valComputed ∧ @b(⟐valReq)) annotate the timelines.]
11Past time Distributed Temporal Logic (pt-DTL)
- Based on epistemic logic
- Properties with respect to a process, say p
- Interpreted over a sequence of global states that the process p is aware of
- Each process monitors the properties local to it
- No need for extra messages to create a relevant portion of global state
- KnowledgeVector keeps track of the relevant global state that can affect a property.
12Remote Expressions in pt-DTL
- Remote expressions: arbitrary expressions related to the state of a remote process
- Propositions constructed from remote and local expressions
- Example: if my alarm is set, then at some point in the past the difference between my temperature and the temperature at process b exceeded the allowed value
- alarm → ⟐((myTemp - @b temp) > allowed)
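The alarm property can be checked locally with one bit of history. The following is a minimal sketch, assuming the monitoring process is handed the latest value of b's temperature that it knows of; the class name `AlarmMonitor` and its parameters are hypothetical and do not come from DIANA.

```python
class AlarmMonitor:
    """Sketch of a local monitor for: alarm -> once((myTemp - @b temp) > allowed)."""

    def __init__(self, allowed):
        self.allowed = allowed
        # Summary of the past: has the difference ever exceeded `allowed`?
        self.once = False

    def step(self, alarm, my_temp, known_b_temp):
        # known_b_temp is an assumption of this sketch: the most recent value of
        # b's temperature that this process is aware of.
        self.once = self.once or (my_temp - known_b_temp) > self.allowed
        # The implication holds unless the alarm is set without a past excess.
        return (not alarm) or self.once
```

Only the boolean `once` is carried from one state to the next, which is what makes past-time formulas cheap to monitor online.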
13Safety in Airplane Landing
- If my airplane is landing, then the runway that the airport has allocated matches the one that I am planning to use
- landing → (runway = @airport allocRunway)
14Leader Election Example
- If a leader is elected, then if the current process is a leader then, to its knowledge, none of the other processes is a leader
- elected → (state = leader → ⋀ j≠i @j(state ≠ leader))
15pt-DTL syntax and semantics
- Fi ::= true | false | P(Ei) | ¬Fi | Fi ∧ Fi (propositional)
- | ⊙Fi | ⊡Fi | ⟐Fi | Fi S Fi (temporal)
- | @j Fj (epistemic)
- Ei ::= c | vi ∈ Vi | f(Ei) (functional)
- | @j Ej (epistemic)
- c: constant
- vi: variable at process i
- P(Ei): predicate on Ei
- f(Ei): function f applied to Ei
- @j Ej: expression Ej at process j
- ⊙Fi: previously Fi
- ⊡Fi: always in the past Fi
- ⟐Fi: eventually in the past Fi
- Fi S Fi: Fi since Fi
- @j Fj: Fj at process j
16Interpretation of _at_jEj at process i
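[Figure: space-time diagram of processes p1, p2, p3 with local states s11, s12 (p1), s21, s22, s23 (p2), s31, s32, s33 (p3) and messages m1 to m4.]
Since p2 at s23 is aware of s12 of p1, the value of @1E in s23 at p2 equals the value of E in s12 at p1.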
17Monitoring Algorithm
- Requirements
- Should be fast so that online monitoring is possible
- Little memory overhead
- Additional messages sent should be minimal, ideally zero
- KnowledgeVector
- Motivated by Vector Clocks
- Unlike Vector Clocks, size is independent of the number of processes
18KnowledgeVector
- KV is a vector
- One entry for each process appearing in the formula
- KV[j] denotes the entry for process j
- KV[j].seq is the sequence number of the last event seen at process j
- KV[j].values stores the values of j-expressions and j-formulae
19KnowledgeVector Algorithm
- internal event (at process i)
- store eval(Ei, si) and eval(Fi, si) for each @i Ei and @i Fi in KVi[i].values
- send m
- KVi[i].seq ← KVi[i].seq + 1. Send KVi with m as KVm
- receive m
- for all j, if KVm[j].seq > KVi[j].seq then
- KVi[j].seq ← KVm[j].seq
- KVi[j].values ← KVm[j].values
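The three update rules above can be sketched as follows. This is an illustrative sketch, not DIANA's implementation: the names `KVEntry` and `Process` are hypothetical, and for brevity the entry caches the whole local state instead of the evaluated remote expressions and formulae.

```python
from dataclasses import dataclass, field


@dataclass
class KVEntry:
    seq: int = 0                                 # sequence number of the last event seen
    values: dict = field(default_factory=dict)   # cached values for that process


class Process:
    def __init__(self, pid, tracked):
        self.pid = pid
        self.state = {}
        # One entry per process mentioned in the formula, not per process in the system.
        self.kv = {j: KVEntry() for j in tracked}

    def internal(self, **updates):
        # Internal event: refresh the values stored in this process's own entry.
        self.state.update(updates)
        if self.pid in self.kv:
            self.kv[self.pid].values = dict(self.state)

    def send(self, payload):
        # Send: bump the own sequence number and piggyback a copy of the vector.
        if self.pid in self.kv:
            self.kv[self.pid].seq += 1
        kv_m = {j: KVEntry(e.seq, dict(e.values)) for j, e in self.kv.items()}
        return payload, kv_m

    def receive(self, message):
        # Receive: adopt every entry that is strictly newer than the local one.
        payload, kv_m = message
        for j, entry in kv_m.items():
            if j in self.kv and entry.seq > self.kv[j].seq:
                self.kv[j] = KVEntry(entry.seq, dict(entry.values))
        return payload
```

Because knowledge rides on messages the application already sends, a stale message that arrives late is simply ignored by the sequence-number comparison.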
20Example
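[Figure: example execution of processes p1, p2, p3. p1 sets X to 5, 9, and 6; p2 sets Y to 7 and 3. A property relating Y to @1X is monitored at p2 using KV[1].seq and KV[1].values, and a violation is flagged.]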
21DIANA Architecture
[Figure: DIANA architecture, including the pt-DTL monitor.]
22MultiThreaded Program Analysis (JMPaX)
23MultiThreaded Smart Observer
- Ideas
- A single execution trace contains more information than appears at first sight
- Extract other possible runs from a single execution
- Analyze all these runs intelligently.
- A technique between model checking and testing.
[Figure: a run of the program is fed to a smart observer.]
24MultiPathExplorer JMPaX (Java)
- Based on smart observers
- Smartness obtained by proper instrumentation: vector clocks
- Possible global states are generated dynamically and form a lattice
- Analysis is performed on a level-by-level basis in the lattice of global states
25Motivating Example Safe Landing
Safe Landing: land the air/space craft only after approval from ground and only if, since then, the radio signal has not been lost
- Three variables
- Landing: indicates the air/space craft is landing
- Approved: indicates landing has been approved
- Radio: indicates the radio signal is live
↑Landing → [Approved, ↓Radio)
26Code of a Landing Controller
- Two-threaded program to control landing

    int landing = 0, approved = 0, radio = 1;

    void thread1() {
      askLandingApproval();
      if (approved == 1) {
        print("Landing approved"); landing = 1;
        print("Landing started");
      } else print("Landing not approved");
    }

    void askLandingApproval() {
      if (radio == 1) approved = 1; else approved = 0;
    }

    void thread2() {
      while (true) checkRadio();
    }
27Landing Safety Violation
- Suppose the plane has received approval for landing and, just before it started landing, the radio signal went off
- The plane must abort landing!
- A simple observer will most likely not detect the bug.
- JMPaX can construct a possible run in which radio goes off between approval and landing
[Figure: observed execution showing the events approved = 1 and landing = 1.]
28Events in Multithreaded Programs
- Given n threads p1, p2, ..., pn,
- A multithreaded execution is a sequence of events e1 e2 ... er, each of type
- internal, or
- read of a shared variable, or
- write of a shared variable.
- e_i^j represents the j-th event generated by thread pi since the start of its execution.
29Causality in Multithreaded Programs
- Define the partial order ≺ on the set of events as follows:
- e_i^k ≺ e_i^l if k < l
- e ≺ e' if there is some x ∈ S such that e <_x e' and at least one of e, e' is a write.
- e ≺ e'' if e ≺ e' and e' ≺ e''.
30Vector Clocks and Relevant Events
- Consider a subset R of relevant events
- (typically those writing specification variables)
- R-relevant causality is a relation ◁ ⊆ ≺
- ◁ is the projection of ≺ on R × R.
- We provide a technique based on vector clocks that correctly implements the relevant causality relation.
31Vector Clock Algorithm
- Let V_i be an n-dimensional vector of natural numbers for each thread pi.
- Let V_x^a and V_x^w be vectors for each shared variable x.
- if e_i^k is relevant, i.e., if e_i^k ∈ R, then
- V_i[i] ← V_i[i] + 1
- if e_i^k is a read of a variable x then
- V_i ← max{V_i, V_x^w}
- V_x^a ← max{V_x^a, V_i}
- if e_i^k is a write of a variable x then
- V_x^w ← V_x^a ← V_i ← max{V_x^a, V_i}
- if e_i^k is relevant then
- send message ⟨e_i^k, i, V_i⟩ to observer.
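The four rules above translate almost line for line into code. The sketch below is an illustration under assumptions, not JMPaX's instrumentation: the class name `MVC`, the `event` method, and the in-memory `messages` list standing in for the observer are all hypothetical.

```python
class MVC:
    """Sketch of the vector clock rules for n threads and named shared variables."""

    def __init__(self, n):
        self.n = n
        self.V = [[0] * n for _ in range(n)]  # V[i]: clock of thread i
        self.Va = {}                          # access clock of each shared variable
        self.Vw = {}                          # write clock of each shared variable
        self.messages = []                    # stands in for the observer's inbox

    def _clock(self, table, x):
        return table.setdefault(x, [0] * self.n)

    @staticmethod
    def _max(a, b):
        return [max(p, q) for p, q in zip(a, b)]

    def event(self, i, kind, x=None, relevant=False, label=None):
        if relevant:
            self.V[i][i] += 1
        if kind == "read":
            self.V[i] = self._max(self.V[i], self._clock(self.Vw, x))
            self.Va[x] = self._max(self._clock(self.Va, x), self.V[i])
        elif kind == "write":
            merged = self._max(self._clock(self.Va, x), self.V[i])
            self.V[i], self.Va[x], self.Vw[x] = list(merged), list(merged), list(merged)
        if relevant:
            self.messages.append((label, i, list(self.V[i])))


def precedes(m1, m2):
    """For two distinct messages: does the event of m1 causally precede that of m2?"""
    _, i, v1 = m1
    _, _, v2 = m2
    return v1[i] <= v2[i]
```

Reads only pull in the write clock of the variable, so two reads of the same variable never order each other; only read-write and write-write conflicts create causality.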
32Correspondence with Standard Vector Clocks
33Implementing Causality by Vector Clocks
- Theorem: if ⟨e, i, V⟩ and ⟨e', j, V'⟩ are messages sent by our algorithm, then
- e ◁ e' iff V[i] ≤ V'[i]
- If i and j are not given, then
- e ◁ e' iff V < V'
34Example with Two Threads
(initially x = -1)
35Relevant Global State
- The program state after the events e_1^{k1}, e_2^{k2}, ..., e_n^{kn} is called a relevant global multithreaded state, or simply a state.
- A state Σ^{k1 k2 ... kn} is called consistent if and only if it can be seen in some possible run of the system.
36MultiThreaded Run
- e1 e2 ... eR is a multithreaded run iff it generates a sequence of global states Σ^{K0} Σ^{K1} ... Σ^{KR} such that
- each Σ^{Kr} is consistent and
- Σ^{Kr} after event e_r becomes Σ^{K(r+1)} (consecutive states)
37Computation Lattice
- We say Σ ⇝ Σ' when there is some run in which Σ and Σ' are consecutive states
- Consistent global states together with the transitive closure of ⇝ form a lattice
- Multithreaded runs are paths in the lattice
38Example Revisited
39Monitoring Safety Formula
(x > 0) → [(y = 0), (y > z))s
40Safety Violation in a Possible Run
(x > 0) → [(y = 0), (y > z))s
41Past Time Linear Temporal Logic Syntax
- F ::= true | false | a ∈ A | ¬F | F op F (propositional ops)
- | ⊙F | ⟐F | ⊡F | F Ss F | F Sw F (standard ops)
- | ↑F | ↓F | [F,F)s | [F,F)w (monitoring ops)
42Semantics
- Here ρ = s1 s2 ... sn denotes a finite trace and ρ_{n-1} its prefix of length n-1
- ρ ⊨ ⟐F iff ρ ⊨ F or (n > 1 and ρ_{n-1} ⊨ ⟐F)
- ρ ⊨ ⊡F iff ρ ⊨ F and (n > 1 implies ρ_{n-1} ⊨ ⊡F)
- ρ ⊨ F1 Ss F2 iff ρ ⊨ F2 or (n > 1 and ρ ⊨ F1 and ρ_{n-1} ⊨ F1 Ss F2)
- ρ ⊨ F1 Sw F2 iff ρ ⊨ F2 or (ρ ⊨ F1 and (n > 1 implies ρ_{n-1} ⊨ F1 Sw F2))
- ρ ⊨ [F1,F2)s iff ρ ⊭ F2 and (ρ ⊨ F1 or (n > 1 and ρ_{n-1} ⊨ [F1,F2)s))
- ρ ⊨ [F1,F2)w iff ρ ⊭ F2 and (ρ ⊨ F1 or (n > 1 implies ρ_{n-1} ⊨ [F1,F2)w))
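Each clause above refers only to the current state and to the value of the same formula on the previous prefix, so a monitor needs one bit per temporal subformula. As a minimal sketch, here is a checker for the safety formula used in the example slides, (x > 0) → [(y = 0), (y > z))s, over a single run; the function name and the encoding of states as (x, y, z) tuples are assumptions made for illustration.

```python
def first_violation(trace):
    """Return the index of the first state violating
    (x > 0) -> [(y == 0), (y > z))s, or None if the run satisfies it.

    `trace` is a list of (x, y, z) tuples (a hypothetical state encoding).
    """
    # Value of the interval formula on the previous prefix.
    # For the strong version it counts as False before the first state.
    interval = False
    for n, (x, y, z) in enumerate(trace):
        # [F1, F2)s holds now iff F2 is false now and
        # (F1 holds now or the interval already held on the previous prefix).
        interval = (not (y > z)) and (y == 0 or interval)
        if x > 0 and not interval:
            return n
    return None
```

The loop does constant work per state, which is what makes it feasible to run such a monitor on every state of a lattice level.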
43Safety Against All Runs
- Number of possible runs can be exponential
- Traverse the state lattice level by level
- Avoids analyzing an exponential number of runs
- Maintain a queue of events
- Enqueue an event as soon as it arrives
- Construct a new level from the set of states in the previous level and the events in the queue
- Monitor the safety formula against all states in a level using dynamic programming and intelligent merging.
44Algorithm Pseudocode
    for each (e ∈ Q) {
      if exists s ∈ CurrentLevel s.t. isNextState(s, e) then
        NextLevel ← addToSet(NextLevel, createState(s, e))
        if isUnnecessary(s) then
          remove(s, CurrentLevel)
        if isEmpty(CurrentLevel) then
          monitorAll(NextLevel)
          CurrentLevel ← NextLevel; NextLevel ← ∅
          Q ← removeUnnecessaryEvents(CurrentLevel, Q)
    }
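The level-by-level idea can be illustrated offline, once all relevant events and their vector clocks are known. The sketch below is a simplification under stated assumptions and not the JMPaX algorithm: it keeps no event queue and does no pruning of unnecessary states, and the names `enabled`, `predict_violation`, `interval_monitor`, `EVENTS`, and `INIT` are hypothetical. Each event is a tuple (thread, vector clock, variable update), and the sample clocks are hand-written for a two-thread run with initial state x = -1, y = 0, z = 0.

```python
def enabled(cut, event):
    """An event of thread i is enabled at `cut` if it is thread i's next relevant
    event and everything it causally depends on is already inside the cut."""
    i, clock, _ = event
    return clock[i] == cut[i] + 1 and all(
        clock[j] <= cut[j] for j in range(len(cut)) if j != i)


def predict_violation(events, n, init, monitor, mon0):
    """Return True if some run consistent with the clocks violates the property."""
    mon, ok = monitor(mon0, init)
    if not ok:
        return True
    # A level maps a cut (events consumed per thread) to the program state at
    # that cut and the set of monitor states reachable along some run.
    level = {(0,) * n: (dict(init), {mon})}
    while level:
        next_level = {}
        for cut, (state, mons) in level.items():
            for event in events:
                if not enabled(cut, event):
                    continue
                i, _, update = event
                new_cut = cut[:i] + (cut[i] + 1,) + cut[i + 1:]
                new_state = {**state, **update}
                _, new_mons = next_level.setdefault(new_cut, (new_state, set()))
                for m in mons:
                    m2, ok = monitor(m, new_state)
                    if not ok:
                        return True
                    new_mons.add(m2)   # merging: equal monitor states collapse
        level = next_level
    return False


def interval_monitor(prev, s):
    """One step of (x > 0) -> [(y == 0), (y > z))s; returns (new bit, holds now)."""
    now = (not (s["y"] > s["z"])) and (s["y"] == 0 or prev)
    return now, not (s["x"] > 0 and not now)


# Hypothetical observed run: thread 0 writes x then y, thread 1 writes z then x.
INIT = {"x": -1, "y": 0, "z": 0}
EVENTS = [(0, (1, 0), {"x": 0}), (1, (1, 1), {"z": 1}),
          (0, (2, 0), {"y": 1}), (1, (1, 2), {"x": 1})]
```

With these inputs, `predict_violation(EVENTS, 2, INIT, interval_monitor, False)` reports a violation even though the events in their listed order satisfy the formula: the run in which y is written before z is also consistent with the clocks. Keeping a set of monitor states per cut, instead of one entry per path, is what avoids enumerating every run.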
45Complexity
- Time complexity is O(w · 2^m · n)
- w: width of the lattice
- m: size of the formula
- n: length of the run
- Memory used is O(w · 2^m)
- w: width of the lattice
- m: number of temporal operators in the formula
- Further optimizations
- Consider a bounded width w of queue Q
46Computation Lattice Width
- The number of states in a level can be large
- Observe that not all states are equally probable
- Ignore states in the lattice that are formed by events that are far apart
- Distance can be measured
- In terms of real time between the events
- By some notion of distance between vector clocks
- Euclidean distance
47Causality Cone Heuristics
48JMPaX Architecture
49Further Applications
- Security
- Security policies as safety requirements
- Predict safety violations efficiently!
?communicate(A,B,K) → ?(sendKey(S,(A,B),K) ∧ ?requestKey(S,A,B))
50Future Work
- Evaluate JMPaX and DiAna on real, large applications
- Investigate programmer-friendly and more expressive logics
- Extend EAGLE logic (NASA Ames)
- Add more epistemic operators
- Find techniques to increase coverage of analysis
- Use Machine Learning
- Apply Statistical Analysis
- Investigate efficient instrumentation techniques