Title: Application Resilience: Making Progress in Spite of Failure
1. Application Resilience: Making Progress in Spite of Failure
- Nathan A. DeBardeleben and John T. Daly
- High Performance Computing Division
- Los Alamos National Laboratory
- William M. Jones
- Electrical and Computer Engineering Department
- United States Naval Academy
- LA-UR-08-3236
Resilience 2008 Workshop on Resiliency in
High Performance Computing
2. Applications WILL Fail
- In spite of improved fault tolerance
- Failures will inevitably occur
- Hardware failures
- Application and system software bugs
- We are moving to petaflop-scale supercomputers
- More software layers mean more points of failure
- Extreme temperature
- Extreme power
- Extreme scale
- With more computing power comes more potential for wasted money when the machine is not utilized as well as possible
3. Should We Even Try to Avoid Failure?
- Failure: how do we avoid it?
- Dynamic process creation to recover from node failures (fault-tolerant MPI)
- Periodic checkpoints, but how often? (see the sketch after this list)
- System support to advise the application of imminent failure
- Spare processors kept allocated for use after a failure
- Costly. Complex.
- Let us instead ask ourselves a simple question: is my application performing useful work (making progress)?
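The periodic-checkpoint option above is easy to picture as code. Below is a minimal sketch, assuming a hypothetical step() work function and a local state.ckpt file; it is not the specific checkpointing scheme used on any LANL system.

```python
import os
import pickle
import time

CKPT = "state.ckpt"   # hypothetical checkpoint file name
TAU = 3600.0          # compute time between checkpoints, in seconds (tunable)

def load_checkpoint():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0}

def save_checkpoint(state):
    """Write the checkpoint atomically so a crash mid-write is harmless."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def step(state):
    """Placeholder for one unit of real application work."""
    state["iteration"] += 1
    return state

state = load_checkpoint()
last_ckpt = time.monotonic()
while state["iteration"] < 1_000_000:
    state = step(state)
    if time.monotonic() - last_ckpt >= TAU:
        save_checkpoint(state)
        last_ckpt = time.monotonic()
```

The open question on the slide, and in the rest of the talk, is how to choose TAU given the failure behavior of the machine.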
4. Is My Application Making Progress?
- How do we ensure progress is made?
- Application monitoring frameworks
- Intelligent application checkpointing
- Analysis of checkpoint overhead
- So, what's the main problem?
5. Failures May Go Unnoticed
(Figure: timeline in which the application stops making progress and the rest of the run is wasted time.)
6. There Are Many Ways to Monitor Application Progress
- It is a surprisingly hard task to determine whether an application has stopped making progress!
- Maybe it's just waiting on the network or disk
- Maybe it's computing, or maybe it's just spinning in an infinite loop
- Maybe a node is not responding, or maybe another task is just switched in
- Let's take a look at a layered approach to monitoring progress
7. Node-Level System Monitoring
- Daemons
- Heart-beat mechanisms
- Sometimes coupled with useful performance data
- Are we willing to pay for daemon processing time? System noise is already considered too high (a minimal heartbeat sketch follows below)
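As a rough illustration of what such a daemon pair does, here is a minimal heartbeat sketch; the port number, interval, and timeout are hypothetical, and a production monitor on a real cluster would be considerably more careful.

```python
import socket
import time

HEARTBEAT_PORT = 9999        # hypothetical UDP port for heartbeats
INTERVAL = 5.0               # seconds between heartbeats
TIMEOUT = 3 * INTERVAL       # roughly three missed beats -> node is suspect

def emit_heartbeats(monitor_host):
    """Runs on each compute node; cost is one tiny UDP packet per interval."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(socket.gethostname().encode(), (monitor_host, HEARTBEAT_PORT))
        time.sleep(INTERVAL)

def monitor():
    """Runs on a management node; reports nodes whose heartbeat has gone quiet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(1.0)
    last_seen = {}
    while True:
        try:
            data, _ = sock.recvfrom(1024)
            last_seen[data.decode()] = time.monotonic()
        except socket.timeout:
            pass
        now = time.monotonic()
        for node, seen in last_seen.items():
            if now - seen > TIMEOUT:
                print(f"node {node} has not reported for {now - seen:.0f}s")
```

Even this small loop consumes cycles on every node, which is exactly the noise concern raised on the slide.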
8. Subsystem-Level System Monitoring
- Network heartbeat - InfiniBand
- Fault tolerant MPI
- Parallel file system fault tolerance
- Fail over nodes
- Redundancy
- Kernel - power, heat
- Degrade performance but try to recover in some cases
- Helps pinpoint failures to specific subsystems
9. Application-Level System Monitoring
- Who better to know whether an application is making progress than the application itself?
- Source/binary instrumentation to emit heartbeats
- Kernel modifications to look at system call usage: does the application appear to be in a wait loop?
- Watch application output: is it producing any at a regular interval?
- How does one determine these intervals? (a minimal watchdog sketch follows)
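Here is a minimal sketch of the output-watching idea, assuming a hypothetical output file and an expected write interval; the interval is exactly the quantity the slide notes must somehow be characterized per application.

```python
import os
import time

OUTPUT_FILE = "app_output.log"   # hypothetical file the application appends to
EXPECTED_INTERVAL = 600.0        # expected seconds between writes (must be learned per application)

def watch_output():
    """Flag the job as stalled if its output stops advancing for too long."""
    while True:
        try:
            age = time.time() - os.path.getmtime(OUTPUT_FILE)
        except FileNotFoundError:
            age = float("inf")
        if age > 2 * EXPECTED_INTERVAL:
            print(f"no output for {age:.0f}s -- job may have stopped making progress")
        time.sleep(EXPECTED_INTERVAL / 2)
```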
10.
- Suppose you could detect that an error occurred, migrate the job, and restart it from the last checkpoint.
- How quickly would you need to determine that an interrupt occurred?
11. Our Assumptions
- Coupled checkpoint/restart application
- Some tradeoff exists between checkpoint frequency and how far we have to back up after an interrupt
- R = f(detection latency, restart overhead)
12. Analytical Model
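The model slides themselves carry no text in this transcript. As a hedged stand-in for the kind of analysis involved, the sketch below computes the widely used first-order optimum checkpoint interval (Young's approximation) together with Daly's higher-order refinement, where delta is the checkpoint write cost and M the mean time between interrupts; these formulas come from the published literature and are not necessarily the exact model shown on the slide.

```python
import math

def optimum_checkpoint_interval(delta, M, higher_order=True):
    """
    delta: time to write one checkpoint (same units as M)
    M:     mean time between interrupts for the job
    Returns the compute time between checkpoints that approximately
    minimizes expected wall-clock time.
    """
    if not higher_order or delta >= 2 * M:
        # First-order (Young) approximation, or the degenerate large-delta case.
        return math.sqrt(2 * delta * M) if delta < 2 * M else M
    x = delta / (2 * M)
    # Daly's higher-order estimate, valid for delta < 2M.
    return math.sqrt(2 * delta * M) * (1 + math.sqrt(x) / 3 + x / 9) - delta

# Example: a 5-minute checkpoint cost and a 24-hour MTBF (both in minutes)
# suggest checkpointing roughly every two hours.
print(optimum_checkpoint_interval(5.0, 24 * 60))
```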
13. (No transcript)
14. (No transcript)
15. Compare Theory to Simulation
- How closely does real supercomputer usage match the theory?
- Need a simulator: BeoSim
- Need real data: Pink at Los Alamos
16. Workload Distribution
(Figure: workload distribution for the 1926-node cluster; event-driven simulation of 4,000,000 jobs using BeoSim.)
17. BeoSim: A Computational Grid Simulator
- Parallel job scheduling research
- Single and multiple clusters
- Checkpointing studies
- Java front-end, C back-end
- Discrete event simulator
- Single-threaded; parameter studies run in parallel (a minimal sketch of this kind of simulation follows)
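BeoSim itself is not reproduced here. As a hedged sketch of the kind of single-threaded, event-driven experiment the talk describes, the toy simulator below runs one checkpointing job against exponentially distributed interrupts with a detection latency CP_delta; every parameter name is illustrative, and failures during recovery are ignored for simplicity.

```python
import random

def simulate_job(work, tau, delta, restart, cp_delta, mtbf, seed=0):
    """
    work:     total useful compute time the job needs
    tau:      compute time between checkpoints
    delta:    cost of writing one checkpoint
    restart:  cost of restarting from the last checkpoint
    cp_delta: latency to detect that an interrupt occurred
    mtbf:     mean time between interrupts (exponentially distributed)
    Returns total wall-clock time including all overheads.
    """
    rng = random.Random(seed)
    done = 0.0   # useful work completed and safely checkpointed
    wall = 0.0
    while done < work:
        # Exponential interarrivals are memoryless, so redrawing the time to
        # the next failure at each segment boundary is statistically valid.
        next_fail = rng.expovariate(1.0 / mtbf)
        segment = min(tau, work - done) + delta
        if next_fail >= segment:
            # Segment completes and its checkpoint is written.
            wall += segment
            done += min(tau, work - done)
        else:
            # Interrupt: lose the partial segment, pay detection plus restart.
            wall += next_fail + cp_delta + restart
    return wall

# Example (units in hours): 100 h of work, hourly checkpoints costing 5 min,
# 24 h MTBF, comparing a 1-minute and a 60-minute detection latency.
for cpd in (1 / 60, 1.0):
    print(f"CP_delta = {cpd:.3f} h -> wall clock = {simulate_job(100, 1.0, 5/60, 5/60, cpd, 24.0):.1f} h")
```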
18. BeoSim Framework
BeoSim: http://www.parl.clemson.edu/beosim
19. Impact of Increasing Failure Rates
May seem negligible, but what matters is the impact of multiple interrupts on throughput, NOT the total number of failures.
20. Impact on Throughput for ALL Jobs
(Figure: throughput vs. CP_delta, the time to determine that an interrupt occurred, in minutes; shows a significant reduction in queueing delays.)
21. Impact on Execution Time
(Figure: execution time vs. CP_delta, the time to determine that an interrupt occurred, in minutes; the increase is marginal (1.8%) in one case and significant (13.5%) in the other.)
22. Keep in Mind That...
(Figure annotations: 6.5% of total jobs interrupted in one scenario vs. 1.5% in the other; x-axis is CP_delta, the time to determine that an interrupt occurred, in minutes.)
So while the averages are relatively close for both scenarios, an increasing number of jobs are affected as the MTBF decreases, and therefore more resources are tied up in applications that are not making progress.
23. Conclusions
- Simulation seems to match the theoretical approximation relatively closely
- Simple theory, but applied to a complex system not captured by the theory, and it still closely matches
- Could it extend to more complex systems?
- Application monitoring is paramount
- Immediate detection is not necessarily a hard requirement (for this system)
- Helps decision makers
- With $100 million to spend, do I need to pay 5x the cost for a better detection system?
- What's my expected workload? Put it into the simulation!
- Pink is a general-purpose cluster with lots of different jobs of different runtimes and widths. We use averages, which tend to make the results murky.
24. Future Work
- We do not factor in the time to fix a failure; hardware takes time to repair
- Failures are assumed to be completely independent
- Look at different classes of jobs, or look at a system that is less diverse than Pink
- How to come up with the MTBF, and how it affects the optimal checkpointing intervals
- More work on determining the parameter M for systems where we're not running a job across the entire machine
25. Thank You! Questions?