Title: Failure Handling in a Modal Language
1 Failure Handling in a Modal Language
- Nels Eric Beckman
- Research Talk
- Institute for Software Research
- October 30, 2006
2 Claims Made in this Talk
- ML5 is an elegant language for programming distributed systems.
- In the face of node failure, the meaning of ML5 programs becomes unclear.
- We propose extensions to ML5 that make their meaning clear.
- (In reality, this research is a work in progress.)
3 ML5
- A Programming Language for Distributed Systems
- Based on a Modal Logic
- i.e., a Logic With an Embedded Notion of Place
- Tom Murphy's Thesis Work
- Targeted for Grid Programming
4 ML5, Briefly...
- Allows Hosts to Send Thunks to One Another for Execution
- In practice, code can be more cleanly decomposed.
- Has an Advanced Type System
- Location-specific resources can be typed as such.
5 RPC-Style Distributed Programming
[Animation, slides 5 to 14. Two hosts: one runs fun a, which calls rpc(b,19.x.x.x) with result r; the other runs fun b, which executes return x. Legend: host, active thread, blocked thread, message. Messages shown, in order: "rpc b", then "ret x".]
15 ML5 Illustration
[Animation, slides 15 to 22. Legend: host, location of thread, migration of thread.]
23 Example
- Remotely Finding a List's Sum (RPC)
- Server Code
class ListServ {
  List<Integer> myList = new ...
  List<Integer> getList() { return myList; }
}
24 Example
- Remotely Finding a List's Sum (RPC)
- Client Code
class ListClient {
  ListServerStub myServ = new ...
  public void foo() {
    List<Integer> list = myServ.getList();
    for (Integer item : list)
      count += item.intValue();
    if (count > 40) ...
  }
}
25 Example
- Remotely Finding a List's Sum (RPC)
- To Fix, Should We
- Add a new server operation that returns true if a list's sum is greater than 40?
- Weird if the operation is only used once.
- We wouldn't structure the application this way in a centralized setting.
- Bite the performance bullet and send the whole list?
26 Example
- Remotely Finding a List's Sum (ML5)
- Before
fun foo remote_host remote_list_ref =
  let fun sum a_list = foldl op+ 0 a_list
  in
    if sum ( get[remote_host]( !remote_list_ref ) ) > 40
    then true else false
  end
27 Example
- Remotely Finding a List's Sum (ML5)
- After
fun foo remote_host remote_list_ref =
  let fun sum a_list = foldl op+ 0 a_list
  in
    get[remote_host]( if sum ( !remote_list_ref ) > 40
                      then true else false )
  end
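The difference between the "Before" and "After" versions can be illustrated with a small simulation, written in Python rather than ML5. Everything here is an assumption made for illustration: `RemoteHost`, its `get` method, and the `values_sent` counter are hypothetical names, and a Python callable run against the host's local state stands in for code that ML5 would actually ship to another machine.

```python
class RemoteHost:
    """Toy stand-in for a remote host that owns a list (hypothetical)."""

    def __init__(self, items):
        self._items = list(items)   # state that lives only on this host
        self.values_sent = 0        # counts values crossing the "network"

    def get(self, thunk):
        """Run thunk here, against local state, and ship the result back."""
        result = thunk(self._items)
        # Count what travels back: every element of a list, or one scalar.
        self.values_sent += len(result) if isinstance(result, list) else 1
        return result


def foo_before(host):
    # "Before": fetch the whole list, then sum it locally.
    remote_list = host.get(lambda items: list(items))
    return sum(remote_list) > 40


def foo_after(host):
    # "After": move the comparison to the data; only a boolean comes back.
    return host.get(lambda items: sum(items) > 40)


before_host = RemoteHost(range(1, 11))   # 1 + ... + 10 = 55
after_host = RemoteHost(range(1, 11))
print(foo_before(before_host), before_host.values_sent)
print(foo_after(after_host), after_host.values_sent)
```

Both versions compute the same answer; in this toy model the first sends ten values back and the second sends one.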
28 Types
- ML5 Type System Embeds a Notion of Place
- Some values can be used at any place.
- e.g. Primitive data types, structures
- Some values can only be used at the location where they make sense.
- e.g. File descriptors, reference cells, printers
29 Just a Few Types
- t @ w: The type t is well-typed on host w.
30 Just a Few Types
- get[w', a] e: Evaluate e on host w' and return the result to the current host w. Changes e's type from t @ w' to t @ w.
- Example
  fun foo (x : int ref @ w',
           a : w' addr @ w) =
    get[w', a]( !x + !x )
- The body !x + !x is typed int @ w'; the whole get expression is typed int @ w.
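As a rough illustration of what it means for get to change where a value lives, the sketch below mimics the idea in Python with run-time checks. This is not how ML5 works: ML5 enforces placement statically in its type system, whereas here a misuse only fails when executed. The names `At`, `Ref`, `deref`, `WrongWorld`, and this `get` function are all hypothetical, chosen only for the example.

```python
from dataclasses import dataclass


class WrongWorld(Exception):
    """Raised when a located value is used away from its home world."""


@dataclass(frozen=True)
class At:
    """A value tagged with the world where it makes sense."""
    value: object
    world: str


class Ref:
    """A reference cell; like a file descriptor, it is location-specific."""

    def __init__(self, contents, world):
        self.contents = contents
        self.world = world


def deref(ref, here):
    """Dereferencing is only permitted on the world that owns the cell."""
    if ref.world != here:
        raise WrongWorld(f"cell lives at {ref.world}, not {here}")
    return ref.contents


def get(target, here, body):
    """Evaluate body at `target`; relabel the result as living at `here`."""
    result = body(target)      # runs "at" the target world
    return At(result, here)    # the result is now usable at the caller


x = Ref(21, world="w_prime")
total = get("w_prime", "w", lambda world: deref(x, world) + deref(x, world))
print(total)
```

Calling `deref(x, "w")` directly raises `WrongWorld`, which is the dynamic analogue of the program being rejected by the type checker.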
33 Just a Few Types
- □t: Suspended code that can be evaluated anywhere. Produces a value of type t.
- Example
  (let fun sum il = foldl op+ 0 il
   in
     box (sum [1,2,3,4,5])
   end) : □int @ w
34 Just a Few Types
- ◇t: A value of type t that exists at some other location.
- Example
  here (ref 5) : ◇(ref int) @ w
35 But What About Host Failure?
(* at host 1 *)
get[w_2, a_2](
  (* at host 2 *)
  !int_ref@w_2 +
  get[w_3, a_3](
    (* at host 3 *)
    !int_ref@w_3))
- Host 2 dies!
- Throw an exception?
- Continue on from Host 3?
40 But What About Host Failure?
(* at host 1 *)
get[w_2, a_2](
  (* at host 2 WHICH DOESN'T EXIST! *)
  !int_ref@w_2 +
  get[w_3, a_3](
    (* at host 3 *)
    !int_ref@w_3)
  or_if_i_cant_return (...))
41 What We Want (Intuitively)
callcc x =>
  (* at host 1 *)
  get[w_2, a_2](
    (* at host 2 *)
    !int_ref@h_2 +
    get[w_3, a_3](
      (* at host 3 *)
      !int_ref@h_3)
    or_if_i_cant_return
      (throw (raise NetFail) to x))
- Don't actually throw something through the network.
- Have host one detect the failure.
44 Isn't This Just a Timeout Exception?
- A Good Question
- Why not just have the get operation throw a timeout exception, like in Java?
- e.g.
get[w_2, a_2] ( !int_on_w2 ) handle TimeOut =>
  ( (* do something *) )
45 Answers
- This is actually a little smarter than just a timeout.
- The Implicit Spawn Problem
get[w_2, a_2] ( (* extremely complicated op *) )
  handle TimeOut => ( (* do something *) )
[Figure labels two threads, T2 and T1, on this expression.]
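The implicit spawn problem can be reproduced in miniature with ordinary threads. The sketch below is in Python, not ML5, and every name in it (`complicated_op`, `get_with_timeout`, the `effects` list) is hypothetical. A local thread stands in for the remote host, so this only models the control flow: the caller (T1) gives up and runs its handler, while the worker (T2) keeps going and still performs its side effect.

```python
import threading
import time

effects = []                       # side effects performed "on host 2"


def complicated_op():
    """Stands in for the long-running remote body (thread T2)."""
    time.sleep(0.2)
    effects.append("T2 finished anyway")


def get_with_timeout(op, timeout):
    """Run op in another thread; give up after `timeout` seconds."""
    worker = threading.Thread(target=op)
    worker.start()
    worker.join(timeout)           # returns after the timeout even if op runs on
    if worker.is_alive():
        raise TimeoutError("remote host did not answer in time")


try:
    get_with_timeout(complicated_op, timeout=0.05)
except TimeoutError:
    effects.append("T1 ran the handler")   # T1 carries on locally

time.sleep(0.4)                    # give T2 time to complete
print(effects)
```

After the timeout there are effectively two threads of control where the program text suggests one, which is why a plain timeout exception is not enough.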
48 What We Need
- Share the Fact that Host 1 Has Given Up
- Kill the Thread ASAP
- Make That Thread's Actions Irrelevant
- Each host gets a chance to undo potential effects.
- All with Best Effort
49 One More Wrinkle
[Animation, slides 49 to 51, showing Catom 1 and Catom 2. Captions, in order: "Grab continuation", then "Assign Catom1 to myLeader"; the last frame has no caption.]
52 The Design, In Short
try
  e_1
continuing
  e_2
end
- Execute e_1.
- In the event of node failure, the entire expression will throw an exception on this host.
- On the other hosts, e_2 will be executed, and its value discarded.
56 The Design, In Short
(* host 1 *)
try
  (* set all of my neighbors' myLeader to host 1 *)
continuing
  if !myLeader = host_1
  then myLeader := NONE
  else ()
end
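The intended behavior of try ... continuing ... end can be approximated by a single-process simulation. The Python sketch below is a loose model under strong assumptions, not the ML5 design itself: hosts are plain objects, failure is a flag checked before each visit, and one central loop plays the role that stored continuations on each visited host would play in a real system. `Host`, `try_continuing`, `set_leader`, and `undo_leader` are hypothetical names; `NetFail` is borrowed from the earlier slide.

```python
class NetFail(Exception):
    """Raised on the originating host when a node on the path fails."""


class Host:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.my_leader = None      # per-host state, as in the myLeader example


def try_continuing(origin, path, e_1, e_2):
    """Run e_1 on each host in `path`; on failure, run e_2 on survivors.

    Returns normally only if every host on the path is alive.
    """
    visited = []
    for host in path:
        if not host.alive:
            # Best effort: each surviving visited host runs e_2,
            # and whatever e_2 returns is thrown away.
            for survivor in visited:
                e_2(survivor)
            raise NetFail(f"{host.name} failed; detected at {origin.name}")
        visited.append(host)
        e_1(host)


def set_leader(host):              # e_1: claim this neighbor
    host.my_leader = "host_1"


def undo_leader(host):             # e_2: undo only our own claim
    if host.my_leader == "host_1":
        host.my_leader = None


host_1, host_2, host_3, host_4 = (Host(f"host_{i}") for i in range(1, 5))
host_3.alive = False               # host 3 dies before the thread arrives

try:
    try_continuing(host_1, [host_2, host_3, host_4], set_leader, undo_leader)
    outcome = "ok"
except NetFail:
    outcome = "NetFail"

print(outcome, host_2.my_leader, host_4.my_leader)
```

Host 2 was claimed and then released by the continuing branch, host 4 was never reached, and only the originating host sees the exception.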
57 ML5-C Error Continuations
[Animation, slides 57 to 67, over the expression "try continuing l end". Legend: host, visited host, location of thread, migration of thread. Captions, in order: "Store Cont(stack)"; "Store Cont(?l)" (shown on two separate frames); "Error!"; "Restore Cont." with "l"; "raise Fail) handle..." with "l"; "raise Fail) handle...".]
68 Interesting Note
- In Failure Case, We Have to Reason About Client and Server.
- (The avoidance of this was one of the touted benefits of ML5!)
69 Future Work
- This Work is Not Yet Finished
- More Restrictive Modal Basis
- Only neighbor catoms are accessible
- This would be a lower-level language in some sense.
70 Thanks!
71 Failure Handling is More Natural
- In Claytronics, Failure is Possible at Any Moment.
- Intuitively, it would be nice to say
try {
  // a complex, multi-host operation
} catch (Failure v) {
  // take an alternate
  // course of action.
}
72 So You Want to See the Typing Rules...
- Note: These rules represent just a snapshot of the work.