1
Application Resilience: Making Progress in Spite
of Failure
  • Nathan A. DeBardeleben and John T. Daly
  • High Performance Computing Division
  • Los Alamos National Laboratory
  • William M. Jones
  • Electrical and Computer Engineering Department
  • United States Naval Academy
  • LA-UR-08-3236

Resilience 2008: Workshop on Resiliency in
High Performance Computing
2
Applications WILL Fail
  • In spite of improved fault tolerance
  • Failures will inevitably occur
  • Hardware failures
  • Application and system software bugs
  • We are moving to petaflop-scale supercomputers
  • More software layers mean more points of failure
  • Extreme temperature
  • Extreme power
  • Extreme scale
  • With more computing power comes more potential
    for wasted money when the machine is not utilized
    well

3
Should We Even Try to Avoid Failure?
  • Failure - how to avoid it?
  • Dynamic process creation to recover from node
    failures
  • Fault Tolerant MPI
  • Periodic checkpoints - but how often? (see the
    note after this list)
  • System support to advise the application of
    imminent failure
  • Save spare processors allocated for use after a
    failure
  • Costly. Complex.
  • Instead, let us ask ourselves a simple question:
    is my application performing useful work (making
    progress)?
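
On the checkpoint-frequency question, a commonly cited
first-order answer comes from Daly's published work on
optimal checkpoint intervals (the formula is that
result, not something shown on this slide):

    \tau_{opt} \approx \sqrt{2 \delta M} - \delta

where \delta is the time to write one checkpoint and M
is the mean time between failures. Checkpointing much
more often than \tau_{opt} wastes time on dumps; much
less often wastes time on rework after an interrupt.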

4
Is My Application Making Progress?
  • How do we ensure progress is made?
  • Application monitoring frameworks
  • Intelligent application checkpointing
  • Analysis of checkpoint overhead
  • So, what's the main problem?

5
Failures May Go Unnoticed
[Figure: timeline of a run in which the application
stops making progress; everything between the silent
failure and its detection is wasted time]
6
There Are Many Ways to Monitor Application
Progress
  • It is a surprisingly hard task to determine if an
    application has stopped making progress!
  • Maybe it's just waiting on the network or disk
  • Maybe it's computing, or maybe it's just spinning
    in an infinite loop
  • Maybe a node is not responding, or maybe another
    task is just switched in
  • Let's take a look at a layered approach to
    monitoring progress

7
Node-Level System Monitoring
  • Daemons
  • Heartbeat mechanisms (see the sketch after this
    list)
  • Sometimes coupled with useful performance data
  • Are we willing to pay for daemon processing time?
    System noise is already considered too high
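
As a minimal sketch of such a daemon, assuming a UDP
heartbeat sent to a hypothetical monitor host (the
address, port, and interval below are illustrative, not
from the talk):

    /* hb_daemon.c - node heartbeat daemon (illustrative sketch).
     * Sends a small UDP datagram to a monitor host at a fixed
     * interval; a monitor that misses several beats flags the node.
     * Build: cc -o hb_daemon hb_daemon.c */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        if (sock < 0) { perror("socket"); return 1; }

        struct sockaddr_in mon;
        memset(&mon, 0, sizeof mon);
        mon.sin_family = AF_INET;
        mon.sin_port = htons(9999);                     /* hypothetical port */
        inet_pton(AF_INET, "192.0.2.1", &mon.sin_addr); /* hypothetical host */

        char host[64] = "unknown";
        gethostname(host, sizeof host);

        for (;;) {
            char msg[128];
            /* Each beat carries the node name and a timestamp. */
            int n = snprintf(msg, sizeof msg, "%s %ld", host, (long)time(NULL));
            sendto(sock, msg, (size_t)n, 0, (struct sockaddr *)&mon, sizeof mon);
            sleep(5); /* beat interval: the overhead the last bullet asks about */
        }
    }

The sleep interval is exactly the tradeoff raised above:
beating faster shortens detection latency but adds to
the system noise the daemon itself causes.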

8
Subsystem-Level System Monitoring
  • Network heartbeats - InfiniBand
  • Fault tolerant MPI
  • Parallel file system fault tolerance
  • Fail over nodes
  • Redundancy
  • Kernel - power, heat
  • Degrade performance but try to recover in some
    cases
  • Helps pinpoint failure to specific subsystems

9
Application-Level System Monitoring
  • Who better to know if an application is making
    progress than the application itself?
  • Source/binary instrumentation to emit heartbeats
    (sketched after this list)
  • Kernel modifications to look for system call
    usage - does the application appear to be in a
    wait loop?
  • Watch application output. Is it producing any at
    a regular interval?
  • How does one determine these intervals?
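
As a minimal sketch of the instrumentation idea,
assuming the application writes a small progress file
that an external watcher polls (the file name and the
stand-in solver loop are illustrative):

    /* progress.c - application-level heartbeat (illustrative sketch).
     * The solver calls progress_beat() once per iteration; a watcher
     * that sees the counter or timestamp stall concludes the run has
     * stopped making progress. */
    #include <stdio.h>
    #include <time.h>

    static const char *PROGRESS_FILE = "progress.heartbeat"; /* hypothetical */

    static void progress_beat(long iteration, double residual)
    {
        FILE *f = fopen(PROGRESS_FILE, "w");
        if (!f) return; /* never let monitoring kill the run itself */
        /* Record what "progress" means here: iteration count, a
         * convergence measure, and a wall-clock stamp. */
        fprintf(f, "%ld %g %ld\n", iteration, residual, (long)time(NULL));
        fclose(f);
    }

    int main(void)
    {
        double residual = 1.0;
        for (long it = 0; residual > 1e-6; ++it) {
            residual *= 0.9;            /* stand-in for one solver step */
            progress_beat(it, residual);
        }
        return 0;
    }

A watcher that sees the counter or timestamp stall for
several intervals can flag the run, which sidesteps
guessing output intervals from the outside.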

10
  • Suppose you could detect that an error occurred,
    migrate the job, and restart it from the last
    checkpoint.
  • How quickly would you need to determine that an
    interrupt occurred?

11
Our Assumptions
  • A coupled checkpoint/restart application
  • Some tradeoff exists between checkpoint frequency
    and how far we have to back up after an interrupt
  • R = f(detection latency + restart overhead)

12
Analytical Model
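
As a hedged sketch of such a model: Daly's published
first-order checkpoint/restart analysis (which this talk
builds on; the symbols below are assumptions, not taken
from the slide) gives the expected wall-clock time of a
run as

    T(\tau) \approx M e^{R/M} \left( e^{(\tau + \delta)/M} - 1 \right) \frac{T_s}{\tau}

where T_s is the failure-free solve time, \tau the
compute time between checkpoints, \delta the checkpoint
write time, R the restart overhead, and M the MTBF. In
this talk's framing, R also absorbs the detection
latency CP_delta, which is why slow failure detection
directly inflates wall-clock time.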
13
(No Transcript)
14
(No Transcript)
15
Compare Theory to Simulation
  • How closely does real supercomputer usage match
    the theory?
  • Need a simulator - BeoSim
  • Need real data - Pink at Los Alamos

16
Workload Distribution
[Figure: workload distribution for the 1926-node
cluster; event-driven simulation of 4,000,000 jobs
using BeoSim]
17
BeoSim: A Computational Grid Simulator
  • Parallel job scheduling research
  • Single and multiple clusters
  • Checkpointing studies
  • Java front-end, C back-end
  • Discrete-event simulator
  • Single-threaded; parameter studies run in parallel
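
As a minimal sketch of the discrete-event idea BeoSim
applies to checkpointing studies (one job, exponentially
distributed failures; every parameter value below is
illustrative, and this is not BeoSim's code):

    /* ckpt_sim.c - toy discrete-event checkpoint/restart simulation.
     * Work only counts once a checkpoint commits it; an interrupt
     * rolls the job back and charges detection + restart time.
     * Build: cc -o ckpt_sim ckpt_sim.c -lm */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static double exp_rand(double mean) /* exponential interarrival time */
    {
        return -mean * log(1.0 - drand48());
    }

    int main(void)
    {
        const double Ts    = 1000.0; /* failure-free solve time (hours) */
        const double tau   = 10.0;   /* compute between checkpoints */
        const double delta = 0.2;    /* checkpoint write time */
        const double R     = 0.5;    /* detection latency + restart */
        const double M     = 50.0;   /* mean time between failures */

        srand48(42);
        double clock = 0.0, done = 0.0, saved = 0.0;
        double next_fail = exp_rand(M);

        while (done < Ts) {
            if (clock + tau + delta <= next_fail) { /* segment completes */
                clock += tau + delta;
                done  += tau;
                saved  = done;            /* checkpoint commits the work */
            } else {                      /* interrupted mid-segment */
                clock = next_fail + R;    /* detect, migrate, restart */
                done  = saved;            /* roll back to last checkpoint */
                next_fail = clock + exp_rand(M);
            }
        }
        printf("wall clock %.1f h for %.1f h of work (%.1f%% overhead)\n",
               clock, Ts, 100.0 * (clock - Ts) / Ts);
        return 0;
    }

Sweeping R here, which lumps together restart overhead
and the detection latency CP_delta, reproduces the
qualitative trend the next slides plot against CP_delta.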
18
BeoSim Framework
BeoSim: http://www.parl.clemson.edu/beosim
19
Impact of Increasing Failure Rates
The effect may seem negligible, but what matters is
multiple interrupts and their impact on throughput -
NOT the total number of failures
20
Impact on Throughput for ALL jobs
[Figure: job throughput vs. CP_delta, the time to
determine that an interrupt occurred (minutes); shows a
significant reduction in queueing delays]
21
Impact on Execution Time
[Figure: impact on execution time vs. CP_delta, the time
to determine that an interrupt occurred (minutes); the
effect is marginal (1.8%) in one scenario and
significant (13.5%) in the other]
22
Keep in Mind That . . .
[Figure: fraction of jobs interrupted vs. CP_delta, the
time to determine that an interrupt occurred (minutes);
6.5% of total jobs interrupted in one scenario vs. 1.5%
in the other]
So while the averages are relatively close for both
scenarios, an increasing number of jobs are affected as
the MTBF decreases, and therefore more resources are
tied up in applications that are not making progress
23
Conclusions
  • Simulation matches the theoretical approximation
    relatively closely
  • The theory is simple, yet it closely matches a
    complex system with effects the theory leaves out
  • Could it extend to more complex systems?
  • Application monitoring is paramount
  • Immediate detection is not necessarily a hard
    requirement (for this system)
  • Helps decision makers
  • With $100 million to spend, do I need to pay 5x
    the cost for a better detection system?
  • What's my expected workload?
  • Put it into the simulation!
  • Pink is a general-purpose cluster - lots of
    different jobs with different runtimes and
    widths. We use averages, which tend to make the
    results murky.

24
Future Work
  • Factor in the time to fix a failure - hardware
    takes time to repair
  • Relax the assumption of completely independent
    failures
  • Look at different classes of jobs, or look at a
    system that is less diverse than Pink
  • How to come up with the MTBF, and how it affects
    the optimal checkpointing intervals
  • More work determining parameter M for systems
    where we're not running a job across the entire
    machine

25
Thank You! Questions?
  • Nathan A. DeBardeleben