Title: Diskless Checkpointing
1. Diskless Checkpointing
2. Motivation
- Checkpointing on Stable Storage
- Disk access is a major bottleneck!
- Incremental Checkpointing
- Copy-on-write
- Compression
- Memory Exclusion
- Diskless Checkpointing
3. Diskless?
- Extra memory is available (e.g. on a network of workstations, NOW)
- Use memory instead of disk
- Good
- Network bandwidth > disk bandwidth
- Bad
- Memory is not stable
4. Bottom Line
- A NOW with n + m processors
- The application runs on exactly n procs,
- and should proceed as long as
- the number of processors in the system is at least n
- the failures occur within certain constraints
[Figure: n application processors, m chkpnt processors, n + m available processors in total]
5. Overview
- Coordinated Chkpnt (Sync-and-Stop)
- To checkpoint,
- Application procs: chkpnt their state in memory
- Chkpnt procs: encode the application chkpnts and store the encodings in memory
- To recover,
- Non-failed procs: roll back
- Replacement processors are chosen
- Replacement procs: calculate the chkpnts of the failed procs from the other chkpnts and the encodings
6. Outline
- Application Processor Chkpnt
- Disk-based
- Diskless
- Incremental
- Forked (or copy-on-write)
- Optimization
- Encoding the chkpnts
- Parity (RAID level 5)
- Mirroring
- 1-Dimensional Parity
- 2-Dimensional Parity
- Reed-Solomon Coding
- Optimization
- Result
7. Application Processor Chkpnt
- Goal
- The processor should be able to roll back to its most recent chkpnt.
- Need to tolerate failures during checkpointing
- Make sure that each coordinated chkpnt remains valid until the next coordinated chkpnt has been completed.
8. Disk-based Chkpnt
- To chkpnt
- Save all values in the stack, heap, and registers to disk
- To recover
- Overwrite the address space with the stored checkpoint
- Space Demands
- 2M on disk
(M = the size of an application processor's address space)
9. Simple Diskless Chkpnt
- To chkpnt
- Wait until the encoding has been calculated,
- then overwrite the old diskless chkpnt in memory
- To recover
- Roll back from the in-memory chkpnt
- Space Demands
- Extra M in memory
(M = the size of an application processor's address space)
10. Incremental Diskless Chkpnt
- To chkpnt
- Initially set all pages R_ONLY
- On a write fault, copy the page and set it RW (see the sketch below)
- To recover
- Restore all RW pages from their copies
- Space Demands
- Extra I in memory
(I = the size of the incremental chkpnt)
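The page-protection scheme above can be sketched with mprotect() and a SIGSEGV handler. This is a minimal, illustrative sketch under simplifying assumptions (a single page-aligned region, names such as take_checkpoint and saved are invented), not the paper's implementation:

```c
/* Minimal sketch of incremental diskless chkpnt via page protection.
 * Assumes `region` is one page-aligned area (e.g. from mmap) and that every
 * SIGSEGV is a write to a protected page inside it. */
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *region;        /* the memory being checkpointed          */
static size_t npages, pagesz;
static char **saved;         /* checkpoint-time copies of dirty pages  */
static int   *dirty;         /* pages written since the last chkpnt    */

/* A write to an R_ONLY page faults: save the old contents, mark the page
 * dirty, and make it RW so the application's write can proceed. */
static void on_write_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char  *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
    size_t i    = (size_t)(page - region) / pagesz;
    if (!dirty[i]) {
        memcpy(saved[i], page, pagesz);   /* copy-on-write of the old page */
        dirty[i] = 1;
    }
    mprotect(page, pagesz, PROT_READ | PROT_WRITE);
}

void init_incremental(char *base, size_t n_pages)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    pagesz = (size_t)sysconf(_SC_PAGESIZE);
    region = base;  npages = n_pages;
    dirty  = calloc(npages, sizeof *dirty);
    saved  = malloc(npages * sizeof *saved);
    for (size_t i = 0; i < npages; i++) saved[i] = malloc(pagesz);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}

/* To chkpnt: forget old dirty marks and set all pages R_ONLY again. */
void take_checkpoint(void)
{
    memset(dirty, 0, npages * sizeof *dirty);
    mprotect(region, npages * pagesz, PROT_READ);
}

/* To recover: restore only the pages modified since the last chkpnt. */
void roll_back(void)
{
    for (size_t i = 0; i < npages; i++)
        if (dirty[i]) memcpy(region + i * pagesz, saved[i], pagesz);
}
```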
11. Forked Diskless Chkpnt
- To chkpnt
- The application clones itself (see the sketch below)
- To recover
- Overwrite the state with the clone's,
- or the clone assumes the role of the application
- Space Demands
- Extra 2I in memory
(I = the size of the incremental chkpnt)
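A minimal sketch of the fork()-based variant, where the kernel's copy-on-write provides the incremental copy. The function names and the SIGUSR1 wake-up protocol are illustrative assumptions, not the paper's code:

```c
/* Minimal sketch of forked (copy-on-write) diskless chkpnt.  The chkpnt is
 * held by a paused clone of the application; the kernel copies only the
 * pages the parent later modifies. */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t chkpnt_clone = -1;            /* process holding the chkpnt    */

static void wake(int sig) { (void)sig; }   /* no-op: just interrupts pause() */

/* To chkpnt: discard the previous clone and fork a fresh one.  The child's
 * address space is a lazy (copy-on-write) snapshot of the parent's. */
void forked_checkpoint(void (*resume_app)(void))
{
    if (chkpnt_clone > 0) {                /* old chkpnt is now obsolete    */
        kill(chkpnt_clone, SIGKILL);
        waitpid(chkpnt_clone, NULL, 0);
    }
    pid_t pid = fork();
    if (pid == 0) {                        /* clone: hold the snapshot      */
        signal(SIGUSR1, wake);
        pause();                           /* sleep until recovery          */
        resume_app();                      /* clone assumes the app role    */
        _exit(0);
    }
    chkpnt_clone = pid;                    /* parent keeps computing        */
}

/* To recover: wake the clone so it continues from the checkpointed state.
 * (Sketch ignores the signal-before-pause race a real system must handle.) */
void forked_rollback(void)
{
    if (chkpnt_clone > 0) kill(chkpnt_clone, SIGUSR1);
}
```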
12. Optimizations
- Breaking the chkpnt into chunks
- Efficient use of memory
- Sending Diffs (Incremental)
- Bitwise XOR of the current copy and the chkpnt copy (see the sketch below)
- Unmodified pages need not be sent
- Compressing Diffs
- Unmodified regions of memory XOR to zero and compress well
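A sketch of the diff idea: XOR a page against its chkpnt copy so that unmodified bytes become zero. The function name is an illustrative assumption:

```c
/* Compute diff = current ^ chkpnt; return 1 if any byte differs.
 * Pages that return 0 need not be sent, and mostly-zero diffs compress
 * well (e.g. with run-length coding). */
#include <stddef.h>

int xor_diff(const unsigned char *current, const unsigned char *chkpnt,
             unsigned char *diff, size_t len)
{
    int modified = 0;
    for (size_t j = 0; j < len; j++) {
        diff[j] = current[j] ^ chkpnt[j];
        if (diff[j]) modified = 1;
    }
    return modified;
}
```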
13. Application Processor Chkpnt (review)
- Simple Diskless Chkpnt: extra M in memory
- Incremental Diskless Chkpnt: extra I in memory
- Forked Diskless Chkpnt: extra 2I in memory, less CPU activity
- Optimizations
- Chkpnt into chunks, diffs, and compressed diffs
14. Encoding the Chkpnts
- Goal
- The extra chkpnt processors should store enough information that the chkpnts of failed processors may be reconstructed.
- Notation
- Number of chkpnt processors: m
- Number of application processors: n
15. Parity (RAID level 5, m = 1)
- To chkpnt,
- the chkpnt processor stores the bitwise parity of the application chkpnts: c_j = a_1,j XOR a_2,j XOR ... XOR a_n,j
- On failure of the i-th proc,
- its chkpnt is recalculated as a_i,j = c_j XOR (the XOR of a_k,j over all surviving k)
- Can tolerate
- Only one processor failure
- Remarks
- The chkpnt processor is a bottleneck of communication and computation
- A sketch of the encoding and recovery follows below
[Figure: example with n = 4, m = 1; a_i,j is the j-th byte of application processor i's chkpnt, c_j the j-th byte of the chkpnt processor's encoding]
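A sketch of the byte-wise parity encoding and of rebuilding one lost chkpnt; the function names are illustrative assumptions:

```c
/* chk[i] is application processor i's chkpnt, all of length len. */
#include <stddef.h>
#include <string.h>

/* Chkpnt processor: parity[j] = chk[0][j] ^ chk[1][j] ^ ... ^ chk[n-1][j] */
void encode_parity(unsigned char **chk, int n, size_t len, unsigned char *parity)
{
    memset(parity, 0, len);
    for (int i = 0; i < n; i++)
        for (size_t j = 0; j < len; j++)
            parity[j] ^= chk[i][j];
}

/* Recover processor `failed`: XOR the parity with every surviving chkpnt. */
void recover_parity(unsigned char **chk, int n, size_t len,
                    const unsigned char *parity, int failed)
{
    memcpy(chk[failed], parity, len);
    for (int i = 0; i < n; i++)
        if (i != failed)
            for (size_t j = 0; j < len; j++)
                chk[failed][j] ^= chk[i][j];
}
```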
16. Mirroring (m = n)
- To chkpnt,
- each application processor sends a copy of its chkpnt to its own chkpnt processor
- On failure of the i-th proc,
- its chkpnt is taken from its chkpnt (mirror) processor
- Can tolerate
- Up to n processor failures
- Except the failure of both an application processor and its chkpnt processor
- Remarks
- Fast, no calculation needed
[Figure: example with n = m = 4; the j-th byte of application processor i is stored unchanged on chkpnt processor i]
17. 1-Dimensional Parity (1 < m < n)
- To chkpnt,
- Application processors are partitioned into m groups.
- The i-th chkpnt processor calculates the parity of the chkpnts in group i
- On failure of the i-th proc,
- Same as in Parity encoding
- Can tolerate
- One processor failure per group
- Remarks
- More efficient in communication and computation
[Figure: example with n = 4, m = 2; each chkpnt processor holds the parity of the j-th bytes of the application processors in its group]
18. 2-Dimensional Parity
- To chkpnt,
- Application processors are arranged logically in a two-dimensional grid
- Each chkpnt processor calculates the parity of one row or one column
- On failure of the i-th proc,
- Same as in Parity encoding
- Can tolerate
- Any two-processor failures
- Remarks
- Each chkpnt must reach both a row and a column chkpnt processor, so multicast helps
[Figure: example with n = 4, m = 4; row and column chkpnt processors each hold the parity of the j-th bytes of their row or column]
19. Reed-Solomon Coding (any m)
- To chkpnt,
- Vandermonde matrix F, with f_i,j = j^(i-1)
- Use matrix-vector multiplication to calculate the chkpnt encodings (see the sketch below)
- To recover,
- Use Gaussian elimination
- Can tolerate
- Any m failures
- Remarks
- Use Galois fields to perform the arithmetic
- Computation overhead
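In LaTeX form, a sketch of the encoding the slide describes, with f_i,j = j^(i-1) and all arithmetic in a Galois field GF(2^w); the field setup must ensure the submatrices needed during recovery are invertible:

```latex
% m checkpoint words c_1..c_m computed from the n application words a_1..a_n
% with the Vandermonde matrix F, f_{i,j} = j^{i-1}, over GF(2^w):
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix}
=
\underbrace{\begin{pmatrix}
1      & 1       & \cdots & 1 \\
1      & 2       & \cdots & n \\
\vdots &         &        & \vdots \\
1      & 2^{m-1} & \cdots & n^{m-1}
\end{pmatrix}}_{F,\;\; f_{i,j} = j^{\,i-1}}
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}
% Recovery: if at most m of the a_j are lost, the equations for the surviving
% a_j together with the stored c_i form a solvable linear system; Gaussian
% elimination over GF(2^w) yields the lost a_j.
```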
20. Optimizations
- Sending and calculating the encoding in RAID level 5-based encodings (e.g. Parity)
- (a) DIRECT: every application processor sends its chkpnt straight to the chkpnt processor, which becomes the bottleneck
- (b) FAN-IN: the parity is combined pairwise in log(n) steps (see the sketch below)
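A sketch of the FAN-IN idea for parity, combining partial XORs pairwise over log2(n) rounds. send_block()/recv_block() stand in for the real message-passing layer, and all names are illustrative assumptions:

```c
/* Binomial-tree fan-in of the parity encoding. */
#include <stddef.h>
#include <string.h>

void send_block(int dest, const unsigned char *buf, size_t len);
void recv_block(int src, unsigned char *buf, size_t len);

/* Called by every application processor with its rank (0..n-1), its chkpnt,
 * and two scratch buffers of the same length.  Rank 0 ends up holding the
 * full parity and forwards it to the chkpnt processor. */
void fanin_parity(int rank, int n, const unsigned char *chk, size_t len,
                  unsigned char *acc, unsigned char *tmp, int chkpnt_proc)
{
    memcpy(acc, chk, len);                         /* keep the chkpnt intact */
    for (int step = 1; step < n; step *= 2) {      /* log2(n) rounds         */
        if (rank % (2 * step) == 0 && rank + step < n) {
            recv_block(rank + step, tmp, len);     /* partner's partial XOR  */
            for (size_t j = 0; j < len; j++) acc[j] ^= tmp[j];
        } else if (rank % (2 * step) == step) {
            send_block(rank - step, acc, len);     /* hand partial parity up */
            return;                                /* this rank is done      */
        }
    }
    if (rank == 0)
        send_block(chkpnt_proc, acc, len);         /* final parity           */
}
```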
21. Encoding the Chkpnts (review)
- Parity (RAID level 5, m = 1)
- Only one failure, bottleneck
- Mirroring (m = n)
- Up to n failures (unless both app and chkpnt procs fail), fast
- 1-Dimensional Parity
- One failure per group, more efficient than Parity
- 2-Dimensional Parity
- Any two failures, comm overhead w/o multicast
- Reed-Solomon Coding
- Any m failures, computation overhead
- DIRECT vs. FAN-IN
22. Testing Applications (1)
- CPU-intensive parallel programs
- Instances that took 1.5-2 hrs on 16 processors
- NBODY: N-body interactions among particles in a system
- Particles are partitioned among processors
- The location field of each particle is updated
- Expectation
- Poor with incremental chkpnt
- Good with diff-based compression
- MAT: FP matrix product of two square matrices (Cannon's alg.)
- All three matrices are partitioned into square blocks among processors
- In each step, each processor adds a block product and passes on the input submatrices
- Expectation
- Incremental chkpnt
- Very poor with diff-based compression
23. Testing Applications (2)
- PSTSWM: nonlinear shallow water equations on a rotating sphere
- Most pages are modified, but only a few bytes per page
- Expectation
- Poor with incremental chkpnt
- Good with diff-based compression
- CELL: parallel cellular automaton simulation program
- Two (sparse) grids of cellular automata (current/next)
- Expectation
- Poor with incremental chkpnt
- Good with compression
- PCG: Ax = b for a large, sparse matrix
- First converted to a small, dense format
- Expectation
- Incremental chkpnt
- Very poor with diff-based compression
24. Diskless Checkpointing
25. Disk-based vs. Diskless Chkpnt
- Where to chkpnt?
- Disk-based: in stable storage
- Diskless: in local memory
- How to recover?
- Disk-based: restore from stable storage
- Diskless: re-calculate
- Remarks
- Disk-based: can tolerate whole-system failure; low BW to stable storage
- Diskless: cannot tolerate whole-system failure; memory is much faster; encoding (communication) overhead
26. Recalculate the lost chkpnt?
- Analogy: error detection/correction in digital communication vs. chkpnt recovery in diskless chkpnt
- 1-bit Parity (m = 1)
- Digital comm: 110010111 (right), 110000111 (detectable), 110010110 (detectable), 110000110 (oops: two flipped bits leave the parity unchanged)
- Chkpnt recovery: 110010111 (chkpnt), 1100X0111 (tolerable), 11001011X (tolerable), 1100X011X (intolerable: two erasures)
- Mirroring (m = n)
- Digital comm: 1100101111001011 (right), 1100101111001010 (detectable), 1100101100111100 (detectable), 1100101011001010 (oops)
- Chkpnt recovery: 1100101111001011 (right), 110010111100101X (tolerable), 11001011XXXXXXXX (tolerable), 1100101X1100101X (intolerable)
- Remarks
- Difference: in the chkpnt system we know exactly which node has failed (erasures rather than errors)
- Some codings can be used to recover from errors in digital communication, too (e.g. Reed-Solomon)
27. Performance
- Criteria
- Latency: time between when a chkpnt is initiated and when it is ready for recovery
- Overhead: increase in execution time with chkpnt
- Applications
- NBODY: N-body interactions
- PSTSWM: simulation of the states on a 3-D system
- CELL: parallel cellular automaton
- Pattern: most pages are modified, but only a few bytes per page
- MAT: FP matrix multiplication (Cannon's)
- PCG: PCG for a sparse matrix
- Pattern: only small parts are updated, but updated in their entirety
28. Implementation
- BASE No chkpnt
- DISK-FORK Disk-based chkpnt w/ fork()
- SIMP Simple diskless
- INC Incremental diskless
- FORK Forked diskless
- INC-FORK Incremental, forked diskless
- C-SIMP, C-INC, C-FORK, C-INC-FORK: the corresponding diskless variants w/ diff-based compression
29. Experiment Framework
- Network of 24 Sun Sparc5 workstations connected to each other by a fast, switched Ethernet (5 MB/s)
- Each workstation has
- 96 MB of physical memory
- 38 MB of local disk storage
- Disks with a bandwidth of 1.7 MB/s are connected via Ethernet, and NFS over Ethernet achieved a bandwidth of 0.13 MB/s
- Measured: latency (time between when a chkpnt is initiated and when it is ready for recovery) and overhead (increase in execution time with chkpnt)
30. (No transcript)
31. (No transcript)
32. Discussion
- Latency: diskless chkpnt has much lower latency than disk-based
- This lowers the expected running time of the application in the presence of failures (small recovery time)
- Overhead: comparable
33. Recommendations
- DISK-FORK
- If chkpnts are small
- If the likelihood of wholesale system failure is high
- C-FORK
- If many pages are modified, but only a few bytes per page
- INC-FORK
- If only a small number of pages are modified
34. Reference
- J. S. Plank, K. Li, and M. A. Puening. "Diskless Checkpointing." IEEE Transactions on Parallel and Distributed Systems, 9(10):972-986, Oct. 1998.