Title: Extending MRNet with Scalable Failure Recovery Mechanisms
1. Extending MRNet with Scalable Failure Recovery Mechanisms
Dorian Arnold, Paradyn Project
Paradyn Week, April 29-30, 2008, Madison, WI
2. Today's Talk
- Brief TBON/MRNet overview
- State compensation recovery model
- MRNet fault-tolerant extensions
- Evaluation
3. Tree-based Overlay Networks for Scalable Performance
- Scalable data multicast and aggregation
- Flexible topologies
- User-defined filters
- Trade off extra processing nodes for performance
[Figure: tree with the front-end (FE) at the root and back-ends (BE) at the leaves]
4. Tree-based Overlay Networks for Scalable Performance
- Integer maximum computation
- Practically infinite input stream
- Stateful filters for incremental updates
[Figure: integer maximum computed over the tree; back-end inputs 2, 4, 1, 2, 6, 8, 9, 2 reduce to 4, 2, 8, 9, then to 4, 9, and to 9 at the front-end]
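The integer-maximum example can be sketched as a stateful filter that keeps the largest value seen so far and forwards a value upstream only when that maximum changes. The class and function names below are illustrative and are not MRNet's API; the leaf inputs are the ones from the slide's figure, and the binary fan-out is an assumption read off that figure.

```python
class MaxFilter:
    """Stateful filter: remembers the running maximum of everything seen."""

    def __init__(self):
        self.state = None  # filter state: largest value seen so far

    def push(self, value):
        """Merge one input; return the new maximum if it changed, else None."""
        if self.state is None or value > self.state:
            self.state = value
            return value   # incremental update forwarded to the parent
        return None        # nothing new to report


def reduce_level(values, fan_out=2):
    """Feed each group of `fan_out` child values through its own filter."""
    states = []
    for i in range(0, len(values), fan_out):
        f = MaxFilter()
        for v in values[i:i + fan_out]:
            f.push(v)
        states.append(f.state)
    return states


leaf_inputs = [2, 4, 1, 2, 6, 8, 9, 2]  # back-end values from the figure
level1 = reduce_level(leaf_inputs)       # [4, 2, 8, 9]
level2 = reduce_level(level1)            # [4, 9]
root = reduce_level(level2)              # [9]
```

Because each filter only emits when its maximum changes, a long input stream produces few upstream messages once the maximum stabilizes.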
5. To Infinity and Beyond
- TBONs provide very scalable performance
- Several MRNet-based examples today
- Increasing scales demand attention to reliability
  - Today: 10^5
  - Tomorrow: 10^6! 10^7? 10^8?
- Need recovery mechanisms that do not compromise scalability or performance
6. Current Reliability Approaches
- Fail-over
  - Replace failed primary w/ backup replica
  - Quick failure recovery
  - High synchronization/utilization overhead
- Checkpointing (coordinated)
  - Simplicity
  - Coordination overhead
  - Petascale checkpointing (Elnozahy, Plank '04)
  - Need resources dedicated to fault-tolerance
  - May overload network/storage resources
7. State Compensation
- Conceptual and empirical frameworks
- Filter properties and weak consistency
- Relax recovery model constraints
- Limit recovery participants
- Avoid coordination protocols
- Inherent redundancy
- No explicit replication
- Surviving state compensates for lost state
- Rapid recovery
- Minimal application perturbation
8. Failure Model
- Fail-stop (detectable crashes)
- Any TBON process failure
- Multiple, simultaneous failures
- Application process failures
- Restart or sequential checkpointing
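9. TBON Data Aggregation
$f(in, fs_n) \rightarrow \{out, fs_{n+1}\}$
- $in$: input; $fs_n$: filter state; $out$: output; $fs_{n+1}$: updated filter state
- Filter state encapsulates input history (merges new input): $in \sqcup fs_n \rightarrow fs_{n+1}$
- Output is an incremental update: $fs_{n+1} - fs_n \rightarrow out$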
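As a concrete reading of the filter function, the sketch below uses sets, with set union as the merge step and set difference as the incremental update. The function name `union_filter` is hypothetical, and choosing sets is an assumption made purely for illustration.

```python
def union_filter(inp, state):
    """f(in, fs_n) -> (out, fs_n+1) for a set-union reduction."""
    new_state = state | inp     # merge the new input into the filter state
    out = new_state - state     # incremental update: only unseen elements
    return out, new_state


state = frozenset()
out1, state = union_filter(frozenset({1, 2}), state)  # out1 == {1, 2}
out2, state = union_filter(frozenset({2, 3}), state)  # out2 == {3}
out3, state = union_filter(frozenset({1, 3}), state)  # out3 is empty
```

The third call produces no output because its input is already covered by the filter state.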
10. Complicit Reductions
- Idempotent
  - Upper/lower bound computations
  - Set unions
  - Graph merging (e.g., STAT, Paradyn)
  - Equivalence class computations
  - Data classification
  - Anomaly detection, ...
- Non-idempotent
  - Summation, average, ...
  - Counting-based operations → overvaluation
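The distinction matters because compensation may deliver the same state to a parent more than once. A small sketch with made-up values: merging a duplicate into a maximum leaves the result unchanged, while merging it into a sum counts it again.

```python
child_states = [4, 9]   # states already merged into the parent
duplicate = 9           # one child's state delivered a second time

# Idempotent reduction: the duplicate has no effect.
max_once = max(child_states)
max_twice = max(child_states + [duplicate])

# Non-idempotent reduction: the duplicate is counted again (overvaluation).
sum_once = sum(child_states)
sum_twice = sum(child_states + [duplicate])

print(max_once, max_twice)  # 9 9
print(sum_once, sum_twice)  # 13 22
```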
11. Inherent Redundancy
- Integer maximum computation
- Joining children's state forms parent's state
[Figure: the integer-maximum tree annotated with filter states; back-end inputs 2, 4, 1, 2, 6, 8, 9, 2; internal states 4, 2, 8, 9 and 4, 9; front-end state 9]
12. State Composition
- If CPj fails, all state associated with CPj is lost
- TBON Output Theorem: output depends only on channel states and root filter state
- All-encompassing Leaf State Theorem: leaf states subsume their sub-tree's state
- Therefore, leaf states can replace lost channel state without changing the computation's semantics
- Compose state from below the points of failure for compensation
[Figure: communication processes CPi, CPj, CPk, and CPl with their filter state, channel state, and output]
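A minimal sketch of composition for the integer-maximum reduction, assuming a three-level tree in which one internal process fails and its two children are adopted by the root. The process labels and the topology are made up for illustration; only the idea that surviving state from below the failure replaces the lost state comes from the slide.

```python
# Filter states before the failure (maximum seen so far at each process).
leaf_states = {"k": 8, "l": 9, "m": 4, "n": 2}
internal = {
    "j": max(leaf_states["k"], leaf_states["l"]),  # this process will fail
    "i": max(leaf_states["m"], leaf_states["n"]),
}
root_before = max(internal.values())

# Process "j" fails: its filter state and channel state are lost.
del internal["j"]

# Orphans "k" and "l" send their own filter state to the adopting root,
# which joins it with the state of its surviving child.
root_after = max([internal["i"], leaf_states["k"], leaf_states["l"]])
```

Here `root_after` equals `root_before`: the orphans' states subsume everything the failed process knew, so the computed maximum is unaffected.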
13. MRNet Extension I: Event Detection Service
- Detecting component failures
  - Premature connection termination
  - Process failures detected immediately
  - Node failures detected via keep-alive
- Disseminating failure information
  - Failures detected by multiple peers
  - Peers use TBON for rapid propagation
[Figure: tree with the front-end (FE) at the root and back-ends (BE) at the leaves]
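One way to notice premature connection termination is that a read on a stream socket returns end-of-file once the peer has closed its end. The sketch below shows this with Python's standard `socket` module; the helper name `connection_terminated` and the timeout value are illustrative choices, not MRNet code, and the keep-alive path for node failures is not covered.

```python
import socket


def connection_terminated(sock, timeout=0.1):
    """Return True if the peer has closed the connection (read sees EOF)."""
    sock.settimeout(timeout)
    try:
        data = sock.recv(1, socket.MSG_PEEK)  # peek so no data is consumed
    except socket.timeout:
        return False      # nothing to read yet, but the connection is open
    except ConnectionError:
        return True       # reset or aborted connection
    return data == b""    # empty read means the peer closed its end


parent_end, child_end = socket.socketpair()
before = connection_terminated(parent_end)  # False: child still connected
child_end.close()                           # simulate the child process dying
after = connection_terminated(parent_end)   # True: EOF observed
parent_end.close()
```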
14. MRNet Extension I: Event Detection Service
- New child connections
- Dynamic topologies at startup/recovery
- Topologies can expand/change arbitrarily
- How do new/deleted processes impact existing streams?
- How do we notify application processes about topology changes?
15. MRNet Extension II: Tree Reconstruction (Algorithm)
- Orphans independently rank each adopter
  - Fan-out (overloading/imbalances → decreased bandwidth)
  - Depth (more hops → increased latencies)
  - Proximity (virtual topology vs. physical topology)
16. MRNet Extension II: Tree Reconstruction (Algorithm)
- Weighted random sampling to mitigate overloading the same parent
[Formula: a sort key computed from the parent weight and a random float]
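The slide's exact formula is not recoverable from the text. One well-known way to build a sort key from a weight and a random float is the Efraimidis-Spirakis scheme, key = u^(1/w), under which a candidate is chosen with probability proportional to its weight. The sketch below assumes that scheme; the weight function, which combines fan-out and depth, is hypothetical.

```python
import random


def adopter_weight(fan_out, depth):
    """Hypothetical weight: prefer adopters with fewer children and less depth."""
    return 1.0 / ((1 + fan_out) * (1 + depth))


def choose_adopter(candidates, rng):
    """Pick the candidate with the largest key u ** (1 / weight)."""
    best_rank, best_key = None, -1.0
    for rank, (fan_out, depth) in candidates.items():
        u = rng.random()                                   # random float in [0, 1)
        key = u ** (1.0 / adopter_weight(fan_out, depth))  # sort key
        if key > best_key:
            best_rank, best_key = rank, key
    return best_rank


candidates = {10: (2, 1), 11: (8, 1), 12: (2, 3)}  # rank -> (fan_out, depth)
rng = random.Random(0)
picks = [choose_adopter(candidates, rng) for _ in range(1000)]
```

Because the choice is randomized, orphans that see the same candidate list do not all pick the single best-ranked adopter, which spreads the adoption load.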
17. MRNet Extension II: Tree Reconstruction (Update Propagation)
- Failure report (failed rank)
- Recovery report (child rank, parent rank)
- Consistency issues
  - Stale/missing failure report?
    - Retry on connect failure
  - Stale/missing recovery report?
    - Reconstruction algorithm inputs stale topology
  - Multiple identical reports?
    - Failure/recovery reports are idempotent
  - Conflicting/out-of-order recovery reports?
    - Resolve using incarnation (version) number
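A sketch of how a process might apply these reports to its local view of the topology. The class and method names are hypothetical, and carrying the incarnation number inside the recovery report is an assumption; the slide only says that conflicts are resolved using it.

```python
class TopologyView:
    """One process's view of the tree: child rank -> (parent rank, incarnation)."""

    def __init__(self, parents):
        self.parents = {c: (p, 0) for c, p in parents.items()}
        self.failed = set()

    def apply_failure(self, failed_rank):
        """Failure report (failed rank). Applying it twice changes nothing."""
        self.failed.add(failed_rank)
        self.parents.pop(failed_rank, None)

    def apply_recovery(self, child, parent, incarnation):
        """Recovery report (child rank, parent rank) with a version number.

        A report is ignored unless it is newer than what is already known.
        """
        _, known = self.parents.get(child, (None, -1))
        if incarnation > known:
            self.parents[child] = (parent, incarnation)


view = TopologyView({1: 0, 2: 0, 3: 1, 4: 1})
view.apply_failure(1)
view.apply_failure(1)                     # duplicate report: no effect
view.apply_recovery(3, 2, incarnation=2)  # newer report arrives first
view.apply_recovery(3, 0, incarnation=1)  # stale report is ignored
```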
18. MRNet Extension III: State Composition
if child fails
    remove failed child
    resume filtering from non-failed children
endif
if parent fails
    compute new parent list
    while failed to connect to list front
        remove list front
    endwhile
    send filter state to parent
endif
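The recovery steps above can be sketched in runnable form as follows. The `Process` class, the `connect` callable that stands in for opening a connection, and the `sent` log are all hypothetical scaffolding for illustration; the error raised when no candidate is reachable is an added assumption.

```python
class Process:
    """Hypothetical TBON process; `connect` stands in for opening a connection."""

    def __init__(self, rank, children, filter_state, connect):
        self.rank = rank
        self.children = set(children)
        self.filter_state = filter_state
        self.connect = connect  # callable: rank -> bool (True on success)
        self.parent = None
        self.sent = []          # (parent rank, state) pairs, for inspection

    def on_child_failure(self, child):
        # Remove the failed child; filtering continues with the rest.
        self.children.discard(child)

    def on_parent_failure(self, candidates):
        # `candidates` is the ranked list of potential new parents.
        candidates = list(candidates)
        while candidates and not self.connect(candidates[0]):
            candidates.pop(0)   # connect failed: try the next candidate
        if not candidates:
            raise RuntimeError("no reachable parent")
        self.parent = candidates[0]
        # Send filter state so the new parent can compensate for lost state.
        self.sent.append((self.parent, self.filter_state))


alive = {0}  # only rank 0 accepts connections in this example
p = Process(rank=3, children=[7, 8], filter_state=9, connect=lambda r: r in alive)
p.on_child_failure(7)
p.on_parent_failure([1, 2, 0])
```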
19. Evaluation
- Does compensation really work?
- How responsive is failure recovery?
- How do failures affect application performance?
20. Evaluation: Does it Work?
- Integer equivalence over input stream
- Same input stream with/without failures
- Are all input elements produced? YES!
- Is any erroneous output produced? NO!
INSPECTION PASSED!
21. Evaluation: Is it Responsive?
- Recovery latency is a function of fan-out!
- Only orphans actively participate in recovery
- Adopting parents passively participate
- How does fan-out impact recovery latency?
22. Evaluation: Is it Responsive?
- 128^3 = 2,097,152 processes
- LLNL's Thunder
  - 1024x4 processors
  - 1.4 GHz Itanium2
  - Quadrics QsNetII
INSPECTION PASSED!
23. Current Work
- Evaluate application perturbation
- Evaluate tree reconstruction algorithms
- TBON mechanisms for event dissemination
- Other compensation mechanisms
INSPECTION PENDING!
24. References
- Arnold and Miller, "A Scalable Recovery Model for Tree-based Overlay Networks," UW Computer Sciences Technical Report TR-1626, January 2008.
- Roth, Arnold, and Miller, "MRNet: A Software-based Multicast/Reduction Network for Scalable Tools," SC 2003, Phoenix, AZ, November 2003.
- http://www.paradyn.org/mrnet