Title: Diskless Checkpointing
1. Diskless Checkpointing
2. Motivation
- Checkpointing on Stable Storage
- Disk access is a major bottleneck!
- Incremental Checkpointing
- Copy-on-write
- Compression
- Memory Exclusion
- Diskless Checkpointing
3. Diskless?
- Extra memory is available (e.g. on a network of workstations, NOW)
- Use memory instead of disk
- Good
- Network bandwidth > disk bandwidth
- Bad
- Memory is not stable
4. Bottom Line
- A NOW with n + m processors
- The application runs on exactly n procs,
- and should proceed as long as
- the number of processors in the system is at least n
- the failures occur within certain constraints
[Figure: n application processors, m chkpnt processors, n + m available processors in total]
5. Overview
- Coordinated Chkpnt (Sync-and-Stop)
- To checkpoint,
- Application procs: chkpnt their state in memory
- Chkpnt procs: encode the application chkpnts and store the encodings in memory
- To recover,
- Non-failed procs: roll back
- Replacement processors are chosen
- Replacement procs: calculate the chkpnts of the failed procs from the other chkpnts and the encodings
6. Outline
- Application Processor Chkpnt
- Disk-based
- Diskless
- Incremental
- Forked (or copy-on-write)
- Optimization
- Encoding the chkpnts
- Parity (RAID level 5)
- Mirroring
- 1-Dimensional Parity
- 2-Dimensional Parity
- Reed-Solomon Coding
- Optimization
- Result
7. Application Processor Chkpnt
- Goal
- The processor should be able to roll back to its most recent chkpnt.
- Need to tolerate failures during checkpointing
- Make sure that each coordinated chkpnt remains valid until the next coordinated chkpnt has been completed.
8. Disk-based Chkpnt
- To chkpnt
- Save all values in the stack, heap, and registers to disk
- To recover
- Overwrite the address space with the stored checkpoint
- Space Demands
- 2M on disk
(M = the size of an application processor's address space)
9. Simple Diskless Chkpnt
- To chkpnt
- Wait until the encoding has been calculated,
- then overwrite the old diskless chkpnt in memory
- To recover
- Roll back from the in-memory chkpnt
- Space Demands
- Extra M in memory
(M = the size of an application processor's address space)
10. Incremental Diskless Chkpnt
- To chkpnt
- Initially set all pages R_ONLY
- On a write fault, copy the page and set it RW (see the sketch below)
- To recover
- Restore all RW pages from their copies
- Space Demands
- Extra I in memory
(I = the size of the incremental chkpnt)
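The page-protection scheme above can be sketched with mprotect() and a SIGSEGV handler. This is a minimal, illustrative sketch under simplifying assumptions (a single page-aligned region, names such as take_checkpoint and saved are invented), not the paper's implementation:

```c
/* Minimal sketch of incremental diskless chkpnt via page protection.
 * Assumes `region` is one page-aligned area (e.g. from mmap) and that every
 * SIGSEGV is a write to a protected page inside it. */
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *region;        /* the memory being checkpointed          */
static size_t npages, pagesz;
static char **saved;         /* checkpoint-time copies of dirty pages  */
static int   *dirty;         /* pages written since the last chkpnt    */

/* A write to an R_ONLY page faults: save the old contents, mark the page
 * dirty, and make it RW so the application's write can proceed. */
static void on_write_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char  *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
    size_t i    = (size_t)(page - region) / pagesz;
    if (!dirty[i]) {
        memcpy(saved[i], page, pagesz);   /* copy-on-write of the old page */
        dirty[i] = 1;
    }
    mprotect(page, pagesz, PROT_READ | PROT_WRITE);
}

void init_incremental(char *base, size_t n_pages)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    pagesz = (size_t)sysconf(_SC_PAGESIZE);
    region = base;  npages = n_pages;
    dirty  = calloc(npages, sizeof *dirty);
    saved  = malloc(npages * sizeof *saved);
    for (size_t i = 0; i < npages; i++) saved[i] = malloc(pagesz);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}

/* To chkpnt: forget old dirty marks and set all pages R_ONLY again. */
void take_checkpoint(void)
{
    memset(dirty, 0, npages * sizeof *dirty);
    mprotect(region, npages * pagesz, PROT_READ);
}

/* To recover: restore only the pages modified since the last chkpnt. */
void roll_back(void)
{
    for (size_t i = 0; i < npages; i++)
        if (dirty[i]) memcpy(region + i * pagesz, saved[i], pagesz);
}
```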
11. Forked Diskless Chkpnt
- To chkpnt
- The application clones itself (see the sketch below)
- To recover
- Overwrite the state with the clone's,
- or the clone assumes the role of the application
- Space Demands
- Extra 2I in memory
(I = the size of the incremental chkpnt)
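A minimal sketch of the fork()-based variant, where the kernel's copy-on-write provides the incremental copy. The function names and the SIGUSR1 wake-up protocol are illustrative assumptions, not the paper's code:

```c
/* Minimal sketch of forked (copy-on-write) diskless chkpnt.  The chkpnt is
 * held by a paused clone of the application; the kernel copies only the
 * pages the parent later modifies. */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t chkpnt_clone = -1;            /* process holding the chkpnt    */

static void wake(int sig) { (void)sig; }   /* no-op: just interrupts pause() */

/* To chkpnt: discard the previous clone and fork a fresh one.  The child's
 * address space is a lazy (copy-on-write) snapshot of the parent's. */
void forked_checkpoint(void (*resume_app)(void))
{
    if (chkpnt_clone > 0) {                /* old chkpnt is now obsolete    */
        kill(chkpnt_clone, SIGKILL);
        waitpid(chkpnt_clone, NULL, 0);
    }
    pid_t pid = fork();
    if (pid == 0) {                        /* clone: hold the snapshot      */
        signal(SIGUSR1, wake);
        pause();                           /* sleep until recovery          */
        resume_app();                      /* clone assumes the app role    */
        _exit(0);
    }
    chkpnt_clone = pid;                    /* parent keeps computing        */
}

/* To recover: wake the clone so it continues from the checkpointed state.
 * (Sketch ignores the signal-before-pause race a real system must handle.) */
void forked_rollback(void)
{
    if (chkpnt_clone > 0) kill(chkpnt_clone, SIGUSR1);
}
```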
12. Optimizations
- Breaking the chkpnt into chunks
- Efficient use of memory
- Sending Diffs (Incremental)
- Bitwise XOR of the current copy and the chkpnt copy (see the sketch below)
- Unmodified pages need not be sent
- Compressing Diffs
- Unmodified regions of memory XOR to zero and compress well
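A sketch of the diff idea: XOR a page against its chkpnt copy so that unmodified bytes become zero. The function name is an illustrative assumption:

```c
/* Compute diff = current ^ chkpnt; return 1 if any byte differs.
 * Pages that return 0 need not be sent, and mostly-zero diffs compress
 * well (e.g. with run-length coding). */
#include <stddef.h>

int xor_diff(const unsigned char *current, const unsigned char *chkpnt,
             unsigned char *diff, size_t len)
{
    int modified = 0;
    for (size_t j = 0; j < len; j++) {
        diff[j] = current[j] ^ chkpnt[j];
        if (diff[j]) modified = 1;
    }
    return modified;
}
```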
13. Application Processor Chkpnt (review)
- Simple Diskless Chkpnt: extra M in memory
- Incremental Diskless Chkpnt: extra I in memory
- Forked Diskless Chkpnt: extra 2I in memory, less CPU activity
- Optimizations
- Chkpnt into chunks, diffs, and compressed diffs
14. Encoding the Chkpnts
- Goal
- The extra chkpnt processors should store enough information that the chkpnts of failed processors may be reconstructed.
- Notation
- Number of chkpnt processors: m
- Number of application processors: n
15. Parity (RAID level 5, m = 1)
- To chkpnt,
- the chkpnt processor stores the bitwise parity of the application chkpnts: c_j = a_1,j XOR a_2,j XOR ... XOR a_n,j
- On failure of the i-th proc,
- its chkpnt is recalculated as a_i,j = c_j XOR (the XOR of a_k,j over all surviving k)
- Can tolerate
- Only one processor failure
- Remarks
- The chkpnt processor is a bottleneck of communication and computation
- A sketch of the encoding and recovery follows below
[Figure: example with n = 4, m = 1; a_i,j is the j-th byte of application processor i's chkpnt, c_j the j-th byte of the chkpnt processor's encoding]
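A sketch of the byte-wise parity encoding and of rebuilding one lost chkpnt; the function names are illustrative assumptions:

```c
/* chk[i] is application processor i's chkpnt, all of length len. */
#include <stddef.h>
#include <string.h>

/* Chkpnt processor: parity[j] = chk[0][j] ^ chk[1][j] ^ ... ^ chk[n-1][j] */
void encode_parity(unsigned char **chk, int n, size_t len, unsigned char *parity)
{
    memset(parity, 0, len);
    for (int i = 0; i < n; i++)
        for (size_t j = 0; j < len; j++)
            parity[j] ^= chk[i][j];
}

/* Recover processor `failed`: XOR the parity with every surviving chkpnt. */
void recover_parity(unsigned char **chk, int n, size_t len,
                    const unsigned char *parity, int failed)
{
    memcpy(chk[failed], parity, len);
    for (int i = 0; i < n; i++)
        if (i != failed)
            for (size_t j = 0; j < len; j++)
                chk[failed][j] ^= chk[i][j];
}
```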
16. Mirroring (m = n)
- To chkpnt,
- each application processor sends a copy of its chkpnt to its own chkpnt processor
- On failure of the i-th proc,
- its chkpnt is taken from its chkpnt (mirror) processor
- Can tolerate
- Up to n processor failures
- Except the failure of both an application processor and its chkpnt processor
- Remarks
- Fast, no calculation needed
[Figure: example with n = m = 4; the j-th byte of application processor i is stored unchanged on chkpnt processor i]
17. 1-Dimensional Parity (1 < m < n)
- To chkpnt,
- Application processors are partitioned into m groups.
- The i-th chkpnt processor calculates the parity of the chkpnts in group i
- On failure of the i-th proc,
- Same as in Parity encoding
- Can tolerate
- One processor failure per group
- Remarks
- More efficient in communication and computation
[Figure: example with n = 4, m = 2; each chkpnt processor holds the parity of the j-th bytes of the application processors in its group]
18. 2-Dimensional Parity
- To chkpnt,
- Application processors are arranged logically in a two-dimensional grid
- Each chkpnt processor calculates the parity of one row or one column
- On failure of the i-th proc,
- Same as in Parity encoding
- Can tolerate
- Any two-processor failures
- Remarks
- Each chkpnt must reach both a row and a column chkpnt processor, so multicast helps
[Figure: example with n = 4, m = 4; row and column chkpnt processors each hold the parity of the j-th bytes of their row or column]
19. Reed-Solomon Coding (any m)
- To chkpnt,
- Vandermonde matrix F, with f_i,j = j^(i-1)
- Use matrix-vector multiplication to calculate the chkpnt encodings (see the sketch below)
- To recover,
- Use Gaussian elimination
- Can tolerate
- Any m failures
- Remarks
- Use Galois fields to perform the arithmetic
- Computation overhead
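In LaTeX form, a sketch of the encoding the slide describes, with f_i,j = j^(i-1) and all arithmetic in a Galois field GF(2^w); the field setup must ensure the submatrices needed during recovery are invertible:

```latex
% m checkpoint words c_1..c_m computed from the n application words a_1..a_n
% with the Vandermonde matrix F, f_{i,j} = j^{i-1}, over GF(2^w):
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix}
=
\underbrace{\begin{pmatrix}
1      & 1       & \cdots & 1 \\
1      & 2       & \cdots & n \\
\vdots &         &        & \vdots \\
1      & 2^{m-1} & \cdots & n^{m-1}
\end{pmatrix}}_{F,\;\; f_{i,j} = j^{\,i-1}}
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}
% Recovery: if at most m of the a_j are lost, the equations for the surviving
% a_j together with the stored c_i form a solvable linear system; Gaussian
% elimination over GF(2^w) yields the lost a_j.
```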
20. Optimizations
- Sending and calculating the encoding in RAID level 5-based encodings (e.g. Parity)
- (a) DIRECT: every application processor sends its chkpnt straight to the chkpnt processor, which becomes the bottleneck
- (b) FAN-IN: the parity is combined pairwise in log(n) steps (see the sketch below)
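A sketch of the FAN-IN idea for parity, combining partial XORs pairwise over log2(n) rounds. send_block()/recv_block() stand in for the real message-passing layer, and all names are illustrative assumptions:

```c
/* Binomial-tree fan-in of the parity encoding. */
#include <stddef.h>
#include <string.h>

void send_block(int dest, const unsigned char *buf, size_t len);
void recv_block(int src, unsigned char *buf, size_t len);

/* Called by every application processor with its rank (0..n-1), its chkpnt,
 * and two scratch buffers of the same length.  Rank 0 ends up holding the
 * full parity and forwards it to the chkpnt processor. */
void fanin_parity(int rank, int n, const unsigned char *chk, size_t len,
                  unsigned char *acc, unsigned char *tmp, int chkpnt_proc)
{
    memcpy(acc, chk, len);                         /* keep the chkpnt intact */
    for (int step = 1; step < n; step *= 2) {      /* log2(n) rounds         */
        if (rank % (2 * step) == 0 && rank + step < n) {
            recv_block(rank + step, tmp, len);     /* partner's partial XOR  */
            for (size_t j = 0; j < len; j++) acc[j] ^= tmp[j];
        } else if (rank % (2 * step) == step) {
            send_block(rank - step, acc, len);     /* hand partial parity up */
            return;                                /* this rank is done      */
        }
    }
    if (rank == 0)
        send_block(chkpnt_proc, acc, len);         /* final parity           */
}
```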
21. Encoding the Chkpnts (review)
- Parity (RAID level 5, m = 1)
- Only one failure, bottleneck
- Mirroring (m = n)
- Up to n failures (unless both app and chkpnt procs fail), fast
- 1-Dimensional Parity
- One failure per group, more efficient than Parity
- 2-Dimensional Parity
- Any two failures, comm overhead w/o multicast
- Reed-Solomon Coding
- Any m failures, computation overhead
- DIRECT vs. FAN-IN
22. Testing Applications (1)
- CPU-intensive parallel programs
- Instances that took 1.5-2 hrs on 16 processors
- NBODY: N-body interactions among particles in a system
- Particles are partitioned among processors
- The location field of each particle is updated
- Expectation
- Poor with incremental chkpnt
- Good with diff-based compression
- MAT: FP matrix product of two square matrices (Cannon's alg.)
- All three matrices are partitioned into square blocks among processors
- In each step, each processor adds a block product and passes on the input submatrices
- Expectation
- Incremental chkpnt
- Very poor with diff-based compression
23. Testing Applications (2)
- PSTSWM: nonlinear shallow water equations on a rotating sphere
- Most pages are modified, but only a few bytes per page
- Expectation
- Poor with incremental chkpnt
- Good with diff-based compression
- CELL: parallel cellular automaton simulation program
- Two (sparse) grids of cellular automata (current/next)
- Expectation
- Poor with incremental chkpnt
- Good with compression
- PCG: Ax = b for a large, sparse matrix
- First converted to a small, dense format
- Expectation
- Incremental chkpnt
- Very poor with diff-based compression
24. Diskless Checkpointing
25. Disk-based vs. Diskless Chkpnt
- Where to chkpnt?
- Disk-based: in stable storage
- Diskless: in local memory
- How to recover?
- Disk-based: restore from stable storage
- Diskless: re-calculate
- Remarks
- Disk-based: can tolerate whole-system failure; low BW to stable storage
- Diskless: cannot tolerate whole-system failure; memory is much faster; encoding (communication) overhead
26. Recalculate the lost chkpnt?
- Analogy: error detection/correction in digital communication vs. chkpnt recovery in diskless chkpnt
- 1-bit Parity (m = 1)
- Digital comm: 110010111 (right), 110000111 (detectable), 110010110 (detectable), 110000110 (oops: two flipped bits leave the parity unchanged)
- Chkpnt recovery: 110010111 (chkpnt), 1100X0111 (tolerable), 11001011X (tolerable), 1100X011X (intolerable: two erasures)
- Mirroring (m = n)
- Digital comm: 1100101111001011 (right), 1100101111001010 (detectable), 1100101100111100 (detectable), 1100101011001010 (oops)
- Chkpnt recovery: 1100101111001011 (right), 110010111100101X (tolerable), 11001011XXXXXXXX (tolerable), 1100101X1100101X (intolerable)
- Remarks
- Difference: in the chkpnt system we know exactly which node has failed (erasures rather than errors)
- Some codings can be used to recover from errors in digital communication, too (e.g. Reed-Solomon)
27. Performance
- Criteria
- Latency: time between when a chkpnt is initiated and when it is ready for recovery
- Overhead: increase in execution time with chkpnt
- Applications
- NBODY: N-body interactions
- PSTSWM: simulation of the states on a 3-D system
- CELL: parallel cellular automaton
- Pattern: most pages are modified, but only a few bytes per page
- MAT: FP matrix multiplication (Cannon's)
- PCG: PCG for a sparse matrix
- Pattern: only small parts are updated, but updated in their entirety
28. Implementation
- BASE No chkpnt
- DISK-FORK Disk-based chkpnt w/ fork()
- SIMP Simple diskless
- INC Incremental diskless
- FORK Forked diskless
- INC-FORK Incremental, forked diskless
- C-SIMP, C-INC, C-FORK, C-INC-FORK: the corresponding diskless variants w/ diff-based compression
29. Experiment Framework
- Network of 24 Sun Sparc5 workstations connected to each other by a fast, switched Ethernet (5 MB/s)
- Each workstation has
- 96 MB of physical memory
- 38 MB of local disk storage
- Disks with a bandwidth of 1.7 MB/s are connected via Ethernet, and NFS over Ethernet achieved a bandwidth of 0.13 MB/s
- Measured: latency (time between when a chkpnt is initiated and when it is ready for recovery) and overhead (increase in execution time with chkpnt)
30. (No transcript)
31. (No transcript)
32. Discussion
- Latency: diskless chkpnt has much lower latency than disk-based
- This lowers the expected running time of the application in the presence of failures (small recovery time)
- Overhead: comparable
33. Recommendations
- DISK-FORK
- If chkpnts are small
- If the likelihood of wholesale system failure is high
- C-FORK
- If many pages are modified, but only a few bytes per page
- INC-FORK
- If only a small number of pages are modified
34. Reference
- J. S. Plank, K. Li, and M. A. Puening. "Diskless Checkpointing." IEEE Transactions on Parallel and Distributed Systems, 9(10):972-986, Oct. 1998.