Title: Application Resilience: Making Progress in Spite of Failure
1. Application Resilience: Making Progress in Spite of Failure
- Nathan A. DeBardeleben and John T. Daly
- High Performance Computing Division
- Los Alamos National Laboratory
- William M. Jones
- Electrical and Computer Engineering Department
- United States Naval Academy
- LA-UR-08-3236
Resilience 2008 Workshop on Resiliency in
High Performance Computing
2. Applications WILL Fail
- In spite of improved fault tolerance
- Failures will inevitably occur
- Hardware failures
- Application and system software bugs
- We are moving to petaflop-scale supercomputers
- More software layers mean more points of failure
- Extreme temperature
- Extreme power
- Extreme scale
- With more computing power comes more potential for wasted money when the machine is not utilized as well as possible
3. Should We Even Try to Avoid Failure?
- Failure: how do we avoid it?
- Dynamic process creation to recover from node failures (fault-tolerant MPI)
- Periodic checkpoints, but how often? (see the sketch after this list)
- System support to advise the application of imminent failure
- Spare processors kept allocated for use after a failure
- Costly. Complex.
- Let us instead ask ourselves a simple question: is my application performing useful work (making progress)?
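The periodic-checkpoint option above is easy to picture as code. Below is a minimal sketch, assuming a hypothetical step() work function and a local state.ckpt file; it is not the specific checkpointing scheme used on any LANL system.

```python
import os
import pickle
import time

CKPT = "state.ckpt"   # hypothetical checkpoint file name
TAU = 3600.0          # compute time between checkpoints, in seconds (tunable)

def load_checkpoint():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0}

def save_checkpoint(state):
    """Write the checkpoint atomically so a crash mid-write is harmless."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def step(state):
    """Placeholder for one unit of real application work."""
    state["iteration"] += 1
    return state

state = load_checkpoint()
last_ckpt = time.monotonic()
while state["iteration"] < 1_000_000:
    state = step(state)
    if time.monotonic() - last_ckpt >= TAU:
        save_checkpoint(state)
        last_ckpt = time.monotonic()
```

The open question on the slide, and in the rest of the talk, is how to choose TAU given the failure behavior of the machine.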
4. Is My Application Making Progress?
- How do we ensure progress is made?
- Application monitoring frameworks
- Intelligent application checkpointing
- Analysis of checkpoint overhead
- So, what's the main problem?
5. Failures May Go Unnoticed
(Figure: timeline in which the application stops making progress and the rest of the run is wasted time.)
6. There Are Many Ways to Monitor Application Progress
- It is a surprisingly hard task to determine whether an application has stopped making progress!
- Maybe it's just waiting on the network or disk
- Maybe it's computing, or maybe it's just spinning in an infinite loop
- Maybe a node is not responding, or maybe another task is just switched in
- Let's take a look at a layered approach to monitoring progress
7. Node-Level System Monitoring
- Daemons
- Heart-beat mechanisms
- Sometimes coupled with useful performance data
- Are we willing to pay for daemon processing time? System noise is already considered too high (a minimal heartbeat sketch follows below)
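As a rough illustration of what such a daemon pair does, here is a minimal heartbeat sketch; the port number, interval, and timeout are hypothetical, and a production monitor on a real cluster would be considerably more careful.

```python
import socket
import time

HEARTBEAT_PORT = 9999        # hypothetical UDP port for heartbeats
INTERVAL = 5.0               # seconds between heartbeats
TIMEOUT = 3 * INTERVAL       # roughly three missed beats -> node is suspect

def emit_heartbeats(monitor_host):
    """Runs on each compute node; cost is one tiny UDP packet per interval."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(socket.gethostname().encode(), (monitor_host, HEARTBEAT_PORT))
        time.sleep(INTERVAL)

def monitor():
    """Runs on a management node; reports nodes whose heartbeat has gone quiet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(1.0)
    last_seen = {}
    while True:
        try:
            data, _ = sock.recvfrom(1024)
            last_seen[data.decode()] = time.monotonic()
        except socket.timeout:
            pass
        now = time.monotonic()
        for node, seen in last_seen.items():
            if now - seen > TIMEOUT:
                print(f"node {node} has not reported for {now - seen:.0f}s")
```

Even this small loop consumes cycles on every node, which is exactly the noise concern raised on the slide.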
8. Subsystem-Level System Monitoring
- Network heartbeat - InfiniBand
- Fault tolerant MPI
- Parallel file system fault tolerance
- Fail over nodes
- Redundancy
- Kernel - power, heat
- Degrade performance but try to recover in some cases
- Helps pinpoint failures to specific subsystems
9. Application-Level System Monitoring
- Who better to know whether an application is making progress than the application itself?
- Source/binary instrumentation to emit heartbeats
- Kernel modifications to look at system call usage: does the application appear to be in a wait loop?
- Watch application output: is it producing any at a regular interval?
- How does one determine these intervals? (a minimal watchdog sketch follows)
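Here is a minimal sketch of the output-watching idea, assuming a hypothetical output file and an expected write interval; the interval is exactly the quantity the slide notes must somehow be characterized per application.

```python
import os
import time

OUTPUT_FILE = "app_output.log"   # hypothetical file the application appends to
EXPECTED_INTERVAL = 600.0        # expected seconds between writes (must be learned per application)

def watch_output():
    """Flag the job as stalled if its output stops advancing for too long."""
    while True:
        try:
            age = time.time() - os.path.getmtime(OUTPUT_FILE)
        except FileNotFoundError:
            age = float("inf")
        if age > 2 * EXPECTED_INTERVAL:
            print(f"no output for {age:.0f}s -- job may have stopped making progress")
        time.sleep(EXPECTED_INTERVAL / 2)
```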
10.
- Suppose you could detect that an error occurred, migrate the job, and restart it from the last checkpoint.
- How quickly would you need to determine that an interrupt occurred?
11. Our Assumptions
- Coupled checkpoint/restart application
- Some tradeoff exists between checkpoint frequency and how far we have to back up after an interrupt
- R = f(detection latency, restart overhead)
12. Analytical Model
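The model slides themselves carry no text in this transcript. As a hedged stand-in for the kind of analysis involved, the sketch below computes the widely used first-order optimum checkpoint interval (Young's approximation) together with Daly's higher-order refinement, where delta is the checkpoint write cost and M the mean time between interrupts; these formulas come from the published literature and are not necessarily the exact model shown on the slide.

```python
import math

def optimum_checkpoint_interval(delta, M, higher_order=True):
    """
    delta: time to write one checkpoint (same units as M)
    M:     mean time between interrupts for the job
    Returns the compute time between checkpoints that approximately
    minimizes expected wall-clock time.
    """
    if not higher_order or delta >= 2 * M:
        # First-order (Young) approximation, or the degenerate large-delta case.
        return math.sqrt(2 * delta * M) if delta < 2 * M else M
    x = delta / (2 * M)
    # Daly's higher-order estimate, valid for delta < 2M.
    return math.sqrt(2 * delta * M) * (1 + math.sqrt(x) / 3 + x / 9) - delta

# Example: a 5-minute checkpoint cost and a 24-hour MTBF (both in minutes)
# suggest checkpointing roughly every two hours.
print(optimum_checkpoint_interval(5.0, 24 * 60))
```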
13. (No transcript)
14. (No transcript)
15. Compare Theory to Simulation
- How closely does real supercomputer usage match the theory?
- Need a simulator: BeoSim
- Need real data: Pink at Los Alamos
16. Workload Distribution
(Figure: workload distribution for the 1926-node cluster; event-driven simulation of 4,000,000 jobs using BeoSim.)
17. BeoSim: A Computational Grid Simulator
- Parallel job scheduling research
- Single and multiple clusters
- Checkpointing studies
- Java front-end, C back-end
- Discrete event simulator
- Single-threaded; parameter studies run in parallel (a minimal sketch of this kind of simulation follows)
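BeoSim itself is not reproduced here. As a hedged sketch of the kind of single-threaded, event-driven experiment the talk describes, the toy simulator below runs one checkpointing job against exponentially distributed interrupts with a detection latency CP_delta; every parameter name is illustrative, and failures during recovery are ignored for simplicity.

```python
import random

def simulate_job(work, tau, delta, restart, cp_delta, mtbf, seed=0):
    """
    work:     total useful compute time the job needs
    tau:      compute time between checkpoints
    delta:    cost of writing one checkpoint
    restart:  cost of restarting from the last checkpoint
    cp_delta: latency to detect that an interrupt occurred
    mtbf:     mean time between interrupts (exponentially distributed)
    Returns total wall-clock time including all overheads.
    """
    rng = random.Random(seed)
    done = 0.0   # useful work completed and safely checkpointed
    wall = 0.0
    while done < work:
        # Exponential interarrivals are memoryless, so redrawing the time to
        # the next failure at each segment boundary is statistically valid.
        next_fail = rng.expovariate(1.0 / mtbf)
        segment = min(tau, work - done) + delta
        if next_fail >= segment:
            # Segment completes and its checkpoint is written.
            wall += segment
            done += min(tau, work - done)
        else:
            # Interrupt: lose the partial segment, pay detection plus restart.
            wall += next_fail + cp_delta + restart
    return wall

# Example (units in hours): 100 h of work, hourly checkpoints costing 5 min,
# 24 h MTBF, comparing a 1-minute and a 60-minute detection latency.
for cpd in (1 / 60, 1.0):
    print(f"CP_delta = {cpd:.3f} h -> wall clock = {simulate_job(100, 1.0, 5/60, 5/60, cpd, 24.0):.1f} h")
```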
18. BeoSim Framework
BeoSim: http://www.parl.clemson.edu/beosim
19. Impact of Increasing Failure Rates
May seem negligible, but what matters is the impact of multiple interrupts on throughput, NOT the total number of failures.
20. Impact on Throughput for ALL Jobs
(Figure: throughput vs. CP_delta, the time to determine that an interrupt occurred, in minutes; shows a significant reduction in queueing delays.)
21. Impact on Execution Time
(Figure: execution time vs. CP_delta, the time to determine that an interrupt occurred, in minutes; the increase is marginal (1.8%) in one case and significant (13.5%) in the other.)
22. Keep in Mind That...
(Figure annotations: 6.5% of total jobs interrupted in one scenario vs. 1.5% in the other; x-axis is CP_delta, the time to determine that an interrupt occurred, in minutes.)
So while the averages are relatively close for both scenarios, an increasing number of jobs are affected as the MTBF decreases, and therefore more resources are tied up in applications that are not making progress.
23. Conclusions
- Simulation seems to match the theoretical approximation relatively closely
- Simple theory, but applied to a complex system not captured by the theory, and it still closely matches
- Could it extend to more complex systems?
- Application monitoring is paramount
- Immediate detection is not necessarily a hard requirement (for this system)
- Helps decision makers
- With $100 million to spend, do I need to pay 5x the cost for a better detection system?
- What's my expected workload? Put it into the simulation!
- Pink is a general-purpose cluster with lots of different jobs of different runtimes and widths. We use averages, which tend to make the results murky.
24. Future Work
- We do not factor in the time to fix a failure; hardware takes time to repair
- Failures are assumed to be completely independent
- Look at different classes of jobs, or look at a system that is less diverse than Pink
- How to come up with the MTBF, and how it affects the optimal checkpointing intervals
- More work on determining the parameter M for systems where we're not running a job across the entire machine
25. Thank You! Questions?