1. Computing in the Reliable Array of Independent Nodes (RAIN)

Vasken Bohossian, Charles Fan, Paul LeMahieu,
Marc Riedel, Lihao Xu, Jehoshua Bruck
California Institute of Technology

Presented by Marc Riedel
IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems
May 5, 2000
2. RAIN Project
A collaboration between:
- Caltech's Parallel and Distributed Computing Group, www.paradise.caltech.edu
- JPL's Center for Integrated Space Microsystems, www.csmt.jpl.nasa.gov
3. RAIN Platform
A heterogeneous network of nodes and switches.
[Figure: nodes attached through switches to a bus network]
4. RAIN Testbed
www.paradise.caltech.edu
5. Proof of Concept: Video Server
A video client/server runs on every node.
[Figure: nodes A–D attached to two switches]
6. Limited Storage
There is insufficient storage to replicate all the data on each node.
[Figure: nodes A–D attached to two switches]
7. k-of-n Code
An erasure-correcting code: the data can be reconstructed from any k of the n columns.
8. Encoding
Encode the video using a 2-of-4 code.
[Figure: encoded columns stored across nodes A–D]
9. Decoding
Retrieve the data from any two nodes and decode.
[Figure: nodes A–D attached to two switches; the data is reconstructed from two columns]
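To make the 2-of-4 scheme concrete, here is a minimal Python sketch in the style of the B-code shown later in the talk (an illustration, not the RAIN implementation): each of four columns stores one data symbol plus the XOR of two others, and any two columns suffice to recover all four symbols.

    # Column i stores data symbol i plus one XOR parity.  With data
    # symbols indexed a=0, b=1, c=2, d=3, PARITY[i] names the two data
    # symbols XORed into column i, giving parities (d^c, d^a, a^b, b^c).
    PARITY = {0: (3, 2), 1: (3, 0), 2: (0, 1), 3: (1, 2)}

    def xor(x: bytes, y: bytes) -> bytes:
        return bytes(p ^ q for p, q in zip(x, y))

    def encode(data: list[bytes]) -> list[tuple[bytes, bytes]]:
        """Encode 4 equal-length data symbols into 4 columns."""
        return [(data[i], xor(data[p], data[q]))
                for i, (p, q) in sorted(PARITY.items())]

    def decode(columns: dict[int, tuple[bytes, bytes]]) -> list[bytes]:
        """Recover all 4 data symbols from any 2 surviving columns."""
        known = {i: col[0] for i, col in columns.items()}
        while len(known) < 4:
            before = len(known)
            for i, (_, parity) in columns.items():
                p, q = PARITY[i]
                if p in known and q not in known:
                    known[q] = xor(parity, known[p])   # peel one symbol
                elif q in known and p not in known:
                    known[p] = xor(parity, known[q])
            if len(known) == before:
                raise ValueError("need at least 2 of the 4 columns")
        return [known[i] for i in range(4)]

    data = [b"aa", b"bb", b"cc", b"dd"]
    cols = encode(data)
    assert decode({0: cols[0], 3: cols[3]}) == data    # any 2 of 4 work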
10–12. Node Failure
If a node fails, clients dynamically switch to another node.
[Figure: one of nodes A–D fails; its traffic is redirected to a surviving node]
13–15. Link Failure
If a link fails, traffic dynamically switches to another network path.
[Figure: a link fails; traffic is rerouted along another path]
16–18. Switch Failure
If a switch fails, traffic dynamically switches to another network path.
[Figure: one of the two switches fails; traffic is rerouted through the other]
19–20. Node Recovery
The system reconfigures continuously (e.g., for load balancing) as failed nodes recover.
[Figure: a recovered node rejoins and traffic is rebalanced]
21. Features
High availability:
- tolerates multiple node/link/switch failures
- no single point of failure
Efficient use of resources:
- multiple data paths
- redundant storage
- graceful degradation
Dynamic scalability and reconfigurability.
22. RAIN Project Goals
Efficient, reliable distributed computing and storage systems, and their key building blocks.
[Figure: layered stack of Applications, Storage, Communication, Networks]
23. Topics
Today's talk:
- Fault-Tolerant Interconnect Topologies
- Connectivity
- Group Membership
- Distributed Storage
[Figure: the Applications/Storage/Communication/Networks stack, highlighting the topics covered]
24. Interconnect Topologies
Goal: lose at most a constant number of nodes for a given amount of network loss.
[Figure: computing/storage nodes (N) attached to a network]
25–26. Resistance to Partitions
Large partitions are problematic for distributed services and computation.
[Figure: network failures split the computing/storage nodes (N) into partitions]
27. Related Work
Embedding hypercubes, rings, meshes, and trees in fault-tolerant networks:
- Hayes et al., Bruck et al., Boesch et al.
Bus-based networks that are resistant to partitioning:
- Ku and Hayes, 1997, "Connective Fault-Tolerance in Multiple-Bus Systems"
28–30. A Ring of Switches
A naïve solution: degree-2 compute nodes (N) attached to a ring of degree-4 switches (S). Such a ring is easily partitioned by a few switch failures.
[Figure: nodes hanging off a ring of switches; switch failures partition the ring]
31–33. Resistance to Partitioning
A better construction: degree-2 compute nodes, degree-4 switches, with the nodes placed on diagonals of the switch ring. This construction
- tolerates any 3 switch failures (optimal)
- generalizes to arbitrary node/switch degrees.
A brute-force check of the partition-tolerance claim is sketched below.
Details: IPPS '98 paper, www.paradise.caltech.edu
[Figure: compute nodes spanning diagonals of a ring of switches]
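To make the claim concrete, here is a small brute-force checker. It is a sketch under assumptions: the exact diagonal construction is given in the IPPS '98 paper, while the toy topology below (node i attached to switches i and i+2 around an 8-switch ring) and the assumption that compute nodes can relay traffic are illustrative guesses. The checker removes every possible set of 3 switches and reports any set that partitions the surviving nodes.

    from itertools import combinations

    def build(n: int = 8) -> list[tuple]:
        """Toy 'nodes on diagonals' topology: a ring of n switches,
        with compute node i attached to switches i and i+2 (mod n)."""
        edges = [(("s", i), ("s", (i + 1) % n)) for i in range(n)]   # ring links
        edges += [(("n", i), ("s", i)) for i in range(n)]            # node links
        edges += [(("n", i), ("s", (i + 2) % n)) for i in range(n)]
        return edges

    def partitioned(edges, dead_switches) -> bool:
        """True if the live compute nodes split into more than one
        connected component (paths may pass through nodes or switches)."""
        dead = {("s", i) for i in dead_switches}
        adj = {}
        for u, v in edges:
            if u not in dead and v not in dead:
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
        live = [v for v in adj if v[0] == "n"]       # nodes with a live link
        seen, stack = {live[0]}, [live[0]]
        while stack:                                 # flood fill from one node
            for w in adj[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return any(v not in seen for v in live)

    edges = build(8)
    bad = [f for f in combinations(range(8), 3) if partitioned(edges, f)]
    print("partitioning failure sets:", bad if bad else "none")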
35. Point-to-Point Connectivity
Is the path from node A to node B up or down?
[Figure: nodes A and B on opposite sides of a network of nodes]
36. Connectivity
Communication is bi-directional, and each end sees the link as up (U) or down (D). Each node sends out pings; a node may time out, deciding that the link is down.
[Figure: nodes A and B each labeling the channel U or D]
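A minimal sketch of this ping/time-out mechanism (an illustration with assumed names and timeout value, not the RAIN implementation):

    import time

    class LinkMonitor:
        """One endpoint's view of a link: Up until the peer goes silent."""

        def __init__(self, timeout: float = 2.0):   # timeout value assumed
            self.timeout = timeout
            self.last_heard = time.monotonic()
            self.state = "U"

        def on_ping_reply(self) -> None:
            """The peer answered a ping: mark the link Up."""
            self.last_heard = time.monotonic()
            self.state = "U"

        def poll(self) -> str:
            """Decide the link is Down if the peer has been silent
            longer than the timeout."""
            if time.monotonic() - self.last_heard > self.timeout:
                self.state = "D"
            return self.state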
37. Consistent History
[Figure: message timelines at nodes A and B]
38. The Slack
[Figure: the node states recorded at A and B over time, each a sequence of U (up) and D (down); the two histories track each other within a bounded slack]
39. Consistent History
Consistency in error reporting: if A sees a channel error, then B sees a channel error.
Cf. Birman et al., "Reliability Through Consistency."
Details: IPPS '99 paper, www.paradise.caltech.edu
[Figure: nodes A and B each labeling the channel U or D]
40. Group Membership
Maintain a consistent global view of the group from local, point-to-point connectivity information, in the presence of:
- link/node failures
- dynamic reconfiguration
[Figure: each of the four nodes holds the same membership list, ABCD]
41. Related Work
42–48. Group Membership: Token-Ring Protocol
A token-ring based group membership protocol. A token circulates around the ring of nodes (A → B → C → D → A), carrying:
- the group membership list
- a sequence number
The sequence number is incremented on every hop, and each node records the number it last saw; a token such as "4 ABCD" means sequence number 4 with membership ABCD. A sketch of this protocol follows.
[Figure: the token circulating A → B → C → D with sequence numbers 1 through 5]
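Here is a sketch of the token-passing logic as presented on these slides (simplified; the forthcoming publication is the authoritative description, and the names are illustrative). It also covers the failure case shown on the next slides: an inaccessible node is excluded from the membership list and bypassed.

    from dataclasses import dataclass

    @dataclass
    class Token:
        seq: int                               # sequence number
        members: list                          # group membership list

    def pass_token(token: Token, ring: list, holder: str, up) -> str:
        """Advance the token from `holder` to the next reachable node;
        `up(a, b)` is the connectivity predicate from the layer below.
        Inaccessible nodes are excluded from the ring and bypassed."""
        while True:
            nxt = ring[(ring.index(holder) + 1) % len(ring)]
            if nxt == holder or up(holder, nxt):
                break
            ring.remove(nxt)                   # exclude the dead node
        token.seq += 1                         # one hop, one increment
        token.members = list(ring)
        return nxt

    ring = ["A", "B", "C", "D"]
    tok = Token(seq=0, members=list(ring))
    up = lambda a, b: b != "B"                 # suppose node B has failed
    holder = "A"
    for _ in range(3):
        holder = pass_token(tok, ring, holder, up)
    print(tok)                                 # Token(seq=3, members=['A', 'C', 'D'])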
49–56. Group Membership: Node or Link Failure
If a node becomes inaccessible, it is excluded and bypassed: the token skips the failed node, and the membership list it carries is updated accordingly (here, node B fails and the token becomes "6 ACD").
[Figure: node B fails; the token circulates among A, C, and D]
57–64. Group Membership: Failure of the Node Holding the Token
If the node holding the token fails, the token is lost and must be regenerated. Surviving nodes may each regenerate a token (here, "5 ACD" and "6 AD"); the token with the highest sequence number prevails.
[Figure: the token holder fails; surviving nodes regenerate competing tokens, and the one with sequence number 6 wins]
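A sketch of the regeneration rule, reusing the Token class from the sketch above (simplified, with the slide's example values):

    def resolve(tokens: list) -> Token:
        """When competing regenerated tokens meet, the one with the
        highest sequence number prevails."""
        return max(tokens, key=lambda t: t.seq)

    candidates = [Token(5, ["A", "C", "D"]), Token(6, ["A", "D"])]
    print(resolve(candidates))                 # Token(seq=6, members=['A', 'D'])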
65–70. Group Membership: Node Recovery
Recovering nodes are added back to the group: the recovered node rejoins the ring, and the token's membership list grows (here, node C rejoins and the token becomes "7 ADC").
[Figure: node C recovers; the token circulates among A, D, and C with increasing sequence numbers]
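And a sketch of re-admission, continuing the same example (illustrative; reuses the Token class from above):

    def add_member(token: Token, ring: list, node: str) -> None:
        """Re-admit a recovered node into the ring and the group."""
        if node not in ring:
            ring.append(node)                  # rejoin the token ring
        token.members = list(ring)

    ring = ["A", "D"]
    tok = Token(seq=7, members=list(ring))
    add_member(tok, ring, "C")
    print(tok.members)                         # ['A', 'D', 'C']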
71. Group Membership: Features
- Unicast messages only
- Dynamic reconfiguration
- Mean time-to-failure > convergence time
Details: publication forthcoming.
72. Distributed Storage
[Figure: a bit stream (101001001000...) distributed across the nodes]
73. Distributed Storage
Focus: reliability and performance.
74–76. Array Codes
Array codes are ideally suited for distributed storage and have low encoding/decoding complexity. The B-code below stores the data symbols a, b, c, d one per column, along with XOR redundancy symbols; the data can be recovered from any k of the n columns (here, any 2 of 4).

Column:      1      2      3      4
Data:        a      b      c      d
Redundancy:  d⊕c    d⊕a    a⊕b    b⊕c

The B-code and the X-code are:
- optimally redundant
- optimal in encoding/decoding complexity
Details: IEEE Trans. Information Theory, www.paradise.caltech.edu
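This is exactly the layout used in the 2-of-4 sketch earlier in the talk; assuming the encode and decode functions from that sketch are in scope, a quick exhaustive check confirms that every pair of columns recovers the data:

    from itertools import combinations

    data = [b"a", b"b", b"c", b"d"]
    cols = encode(data)
    for i, j in combinations(range(4), 2):     # all 6 pairs of columns
        assert decode({i: cols[i], j: cols[j]}) == data
    print("all 6 column pairs recover the data")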
77. Summary
78. Proof-of-Concept Applications
79. Rainfinity
A start-up based on RAIN technology, focused on:
- availability
- scalability
- performance
www.rainfinity.com
80. Rainfinity
A start-up based on RAIN technology.
Company:
- Founded Sept. 1998
- Released its first product in April 1999
- Received $15 million in funding in Dec. 1999
- Now over 50 employees
www.rainfinity.com
81. Future Research
- Development of APIs
- A fault-tolerant distributed filesystem
- A fault-tolerant MPI/PVM implementation
82. End of Talk
The remaining slides cover material that was cut.
83–86. Erasure-Correcting Codes
Strategy: encode the data with an erasure-correcting code. The k data symbols are encoded into an n-symbol codeword; since any k of the n coordinates suffice to recover the data, up to m = n − k coordinates can be lost. Example: a Reed-Solomon code.
[Figure: a k-bit data word encoded into an n-bit codeword, with up to m coordinates lost]
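As a toy illustration of a Reed-Solomon-style erasure code (a sketch over the small prime field GF(5); real systems use larger fields such as GF(256)): k = 2 data symbols define a degree-1 polynomial, the codeword is that polynomial evaluated at n = 4 points, and any 2 surviving evaluations recover the data by interpolation.

    P = 5                                # field size (prime)

    def rs_encode(d0: int, d1: int) -> list[int]:
        """Evaluate p(x) = d0 + d1*x at x = 1..4 (mod 5)."""
        return [(d0 + d1 * x) % P for x in range(1, 5)]

    def rs_decode(points: dict[int, int]) -> tuple[int, int]:
        """Lagrange-interpolate p from any 2 surviving (x, p(x)) pairs."""
        (x1, y1), (x2, y2) = points.items()
        inv = pow(x1 - x2, -1, P)        # modular inverse in GF(5)
        d1 = (y1 - y2) * inv % P         # slope of the line
        d0 = (y1 - d1 * x1) % P          # intercept = data symbol d0
        return d0, d1

    code = rs_encode(3, 4)               # [2, 1, 0, 4]
    assert rs_decode({2: code[1], 4: code[3]}) == (3, 4)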
87. RAIN Distributed Store
- Encode the data with an (n, k) array code
- Store one symbol per node
[Figure: encoded symbols written to the disks of four nodes]
88–93. RAIN Distributed Retrieve
- Retrieve the encoded data from any k nodes (e.g., the symbols a and d⊕c from two disks)
- Reliability: similar to RAID systems
- Performance: retrieval can be load-balanced across the nodes
[Figure: a client reconstructs the data from k of the n disks, choosing among them for load balancing]
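Tying the pieces together, a sketch of distributed store/retrieve (an illustration, not the RAIN API; it assumes the encode/decode functions from the 2-of-4 sketch above): one encoded column is stored per node, and retrieval picks the k least-loaded live nodes as a simple form of load balancing.

    class Node:
        def __init__(self):
            self.disk = {}                    # object id -> stored column
            self.load = 0
            self.alive = True

    def store(nodes: list, oid: str, data: list) -> None:
        for node, column in zip(nodes, encode(data)):
            node.disk[oid] = column           # one symbol (column) per node

    def retrieve(nodes: list, oid: str, k: int = 2) -> list:
        live = [i for i, n in enumerate(nodes) if n.alive]
        picks = sorted(live, key=lambda i: nodes[i].load)[:k]
        for i in picks:
            nodes[i].load += 1                # simple load balancing
        return decode({i: nodes[i].disk[oid] for i in picks})

    nodes = [Node() for _ in range(4)]
    store(nodes, "video", [b"aa", b"bb", b"cc", b"dd"])
    nodes[1].alive = False                    # tolerate a node failure
    print(retrieve(nodes, "video"))           # [b'aa', b'bb', b'cc', b'dd']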