Uniprocessor Checkpointing - PowerPoint PPT Presentation

Provided by: marq2
Transcript and Presenter's Notes

1
Uniprocessor Checkpointing
  • CS 717 Fall 2001
  • 9/25/01

2
The Need to Save State
  • Many of the FT systems we have discussed need a
    way to restart processes from previous points in
    their computation
  • A checkpoint is just a snapshot of a process
    (or system) at a certain point in time
  • A checkpointing system provides a way to take
    these snapshots, and to restart from them

3
Types of Ckpt Systems
  • Kernel Level
    • OS supports ckpt and recovery
    • Transparent to the application and developer
  • User Level
    • Application linked against (user) library
    • Library functions perform ckpt and recovery
    • Transparent to application
    • Limitations (cannot restore PID, PPID, etc.)
  • Application Level
    • Applications coded to ckpt themselves, and to
      restart from a checkpoint

4
Comparison of Levels
  • Kernel / User (System) Level
    • Easy to add checkpointing to existing code
    • Works with (almost) any program
    • General, coarse approach
  • Application Level
    • Could require a complete rewrite, or extensive
      modifications
    • Specific, fine-grained solutions

5
System Level Checkpointing
  • Libckpt (1994)
  • Plank, Beck, Kingsley (UTK), Li (Princeton)
  • User level library for UNIX

6
Libckpt
  • User Level Checkpoint Library
  • Goals
  • Transparent
  • Requires minimal modifications to code and
    re-linking
  • Low Overhead
  • Automatic optimizations to reduce ckpt file size
  • Allow user directed checkpointing

7
Libckpt Overview
  • Taking the snapshot
    • Suspend the process
    • Write process memory and registers to a file
  • Recovery
    • Reload executable from original file
    • Reconstruct memory and register state from
      checkpoint file
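
A minimal sketch of the "write memory to a file, reconstruct it later" step, assuming a single contiguous region. The function names save_region and restore_region and the file name are hypothetical; libckpt itself writes the stack, heap and registers, which this sketch does not attempt.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write `len` bytes starting at `base` to the checkpoint file `path`.
   Returns 0 on success, -1 on failure. */
int save_region(const char *path, const void *base, size_t len)
{
    assert(path != NULL && base != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(base, 1, len, f);
    int close_failed = (fclose(f) != 0);
    return (written == len && !close_failed) ? 0 : -1;
}

/* Read `len` bytes from the checkpoint file back into `base`.
   Returns 0 on success, -1 if the file is missing or too short. */
int restore_region(const char *path, void *base, size_t len)
{
    assert(path != NULL && base != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(base, 1, len, f);
    fclose(f);
    return (got == len) ? 0 : -1;
}
```

Usage: call save_region() on a buffer, overwrite the buffer, then call restore_region() with the same path and length to get the saved contents back.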

8
Libckpt Operation
  • Application main() is renamed ckpt_target()
  • Library main() checks whether it is in restore
    mode (specified using a command-line option);
    otherwise it reads checkpoint parameters from a file

9
Libckpt Operation (2)
  • main() sets a timer to interrupt application
    every n seconds
  • On signal
    • Uses setjmp to record registers, PC, etc.
    • Writes the stack and heap segments to file
    • Resumes application code

10
Libckpt Operation (3)
  • If the application is started with recover as a
    command-line option
    • Application begins, recovering the text segment
    • Open checkpoint file
    • Recover heap from file
    • Recover stack from file
    • Restore register file (using longjmp)

11
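Virtual Address Space
  • Layout, from highest address to lowest
    • Bottom of Stack
    • Stack (down to SP)
    • Heap (from edata up to sbrk(0))
    • Data (Static) (from etext up to edata)
    • Text (from 0 up to etext)

12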
Checkpoint And Recovery Algorithms
  • main()
    • if (recovery)
      • restore stack
      • restore heap
      • pos = top of stack
      • longjmp(pos, 1) // restore regs.
    • else
      • run usual code
  • signal_handler()
    • jmp_buf pos
    • if (setjmp(pos) == 0)
      • // saved regs. in known position on stack
      • write stack
      • write heap
    • else
      • // process recovered
      • return
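
The two return paths of setjmp() are the core of the algorithm above. The sketch below shows only that control flow inside one running process, under the assumption that the "recovery" is simulated by an immediate longjmp(); it does not write or restore any memory, and all names in it are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf pos;          /* saved registers, PC, stack pointer */
static int states_written;   /* times the checkpoint branch ran */
static int resumes;          /* times the recovered branch ran */

/* Stands in for the recovery path in main(): jump back to the saved
   position. This is valid only while the frame that called setjmp()
   is still active, which is the case here. */
static void simulate_recovery(void)
{
    longjmp(pos, 1);
}

/* Returns the number of resumes observed. */
int checkpoint_then_recover(void)
{
    if (setjmp(pos) == 0) {
        /* Direct return from setjmp: registers are now saved.
           A real handler would write the stack and heap here. */
        states_written++;
        simulate_recovery();          /* does not return */
        assert(0);                    /* unreachable */
    } else {
        /* Return via longjmp: the process has been recovered. */
        resumes++;
    }
    return resumes;
}
```

In the real scheme the longjmp() happens in a fresh process after the stack has been restored from the file, so the jmp_buf must live at a known position on that restored stack.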

13
Illustration
  • main()
  • user_main()
  • fun1()
  • fun2()
  • signal
  • save regs on stack
  • save stack to file
  • save heap to file
  • resume
  • main()
  • restore()
  • restore stack
  • restore heap
  • take jump

14
Optimization: Incremental Checkpointing
  • Observation: between taking two checkpoints, only
    a portion of the memory has actually been changed
  • Optimization: save only what has been changed
    since last ckpt, the rest can be read from
    previous ckpts

15
Taking Incremental Ckpts.
  • After taking a ckpt (and after init.), set
    protection on all pages to read-only
  • A write to a page will cause a protection violation
  • Libckpt catches that signal, sets the page
    protection to read-write, and marks the page as
    dirty
  • When writing checkpoint file, only write dirty
    pages

16
Drawbacks to Incremental Ckpt
  • Required to keep multiple copies of the
    checkpoint file
  • On recovery, will unnecessarily restore old
    copies of data
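
Both drawbacks show up in a toy model of recovery. The sketch assumes a hypothetical layout in which each checkpoint file is an "increment" holding some subset of pages, and recovery replays the files oldest first; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define NPAGES 4
#define PAGE_BYTES 8

/* One checkpoint file in the chain (toy model). */
struct increment {
    int has_page[NPAGES];              /* 1 if this file holds the page */
    char data[NPAGES][PAGE_BYTES];     /* page contents when present */
};

/* Replay the increments oldest first into `memory`.
   Returns the number of page copies performed. */
int recover(char memory[NPAGES][PAGE_BYTES],
            const struct increment *chain, int n)
{
    assert(chain != NULL && n > 0);
    int copies = 0;
    for (int i = 0; i < n; i++) {
        for (int p = 0; p < NPAGES; p++) {
            if (chain[i].has_page[p]) {
                /* An older copy may be overwritten by a later file. */
                memcpy(memory[p], chain[i].data[p], PAGE_BYTES);
                copies++;
            }
        }
    }
    return copies;
}
```

With one full checkpoint followed by two increments that each rewrite the same page, recovery performs 6 page copies to rebuild 4 pages, and all three files must be kept.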

17
Optimization: Asynchronous Checkpointing
  • Observation: the process must be suspended while
    the checkpoint file is written
  • Optimization: a separate thread could write the
    checkpoint file while the main thread is allowed
    to continue

18
Asynchronous Checkpointing
  • Make a copy of the process space
  • 2nd thread writes the copy to disk
  • 1st thread continues without halting

19
Asynchronous Checkpointing(2)
  • Unix fork() provides the necessary behavior
  • When about to take ckpt, process forks
  • OS makes a complete copy of the original process
    space
  • Clone writes ckpt file, then dies
  • Original continues computing
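
A minimal sketch of the fork() approach, assuming the state to save is one buffer. The names fork_checkpoint and wait_for_checkpoint are hypothetical. The child sees memory as it was at the moment of the fork, so the parent may modify the buffer right away.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a clone that writes `len` bytes at `state` to `path`.
   Returns the clone's pid in the parent, or -1 if fork() failed. */
pid_t fork_checkpoint(const char *path, const void *state, size_t len)
{
    assert(path != NULL && state != NULL);
    pid_t pid = fork();
    if (pid != 0)
        return pid;                    /* original: continue computing */

    /* Clone: its copy of `state` is frozen at fork() time. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        _exit(1);
    ssize_t n = write(fd, state, len);
    close(fd);
    _exit(n == (ssize_t)len ? 0 : 1);  /* clone dies after writing */
}

/* Wait for the clone. Returns 0 if the checkpoint file was written. */
int wait_for_checkpoint(pid_t child)
{
    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The original process only pays for the fork() itself; the disk write overlaps with its computation.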

20
Copy-On-Write Checkpointing
  • Like asynchronous checkpointing, but only copy
    page if the two versions are about to differ
  • Some (most?) OSes implement fork() in this manner,
    so the benefit is automatic

21
Checkpoint Compression
  • Use a standard data compression algorithm to
    shrink the size of the checkpoint file
  • Only improves overhead if the speed of
    compression is faster than the speed of disk
    writes, and compression is significant
  • For uniprocessor checkpointing, this is not the
    case
  • Not implemented in libckpt
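
The condition on compression speed can be made concrete with a small cost model. The model is an assumption made for illustration: compression and disk writing run one after the other (no pipelining), rates are in bytes per second, and `ratio` is original size divided by compressed size. The function names are hypothetical.

```c
#include <assert.h>

/* Seconds to write `size` bytes uncompressed. */
double plain_write_time(double size, double disk_rate)
{
    assert(disk_rate > 0.0);
    return size / disk_rate;
}

/* Seconds to compress `size` bytes and then write the smaller result. */
double compressed_write_time(double size, double compress_rate,
                             double disk_rate, double ratio)
{
    assert(compress_rate > 0.0 && disk_rate > 0.0 && ratio >= 1.0);
    return size / compress_rate + (size / ratio) / disk_rate;
}

/* 1 if compressing first is faster than writing the raw checkpoint. */
int compression_pays_off(double size, double compress_rate,
                         double disk_rate, double ratio)
{
    return compressed_write_time(size, compress_rate, disk_rate, ratio)
           < plain_write_time(size, disk_rate);
}
```

Under this model, with a disk rate of 100 and a ratio of 2, compression pays off only when the compressor runs faster than 200, that is, faster than disk_rate * ratio / (ratio - 1).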

22
User Directed Checkpointing
  • As described so far, libckpt is (almost) entirely
    transparent to the programmer
  • Compare to application level checkpoint requiring
    extensive code changes
  • Is there a middle ground?
  • Libckpt allows programmers to annotate
    application code with directives that guide the
    checkpointing

23
Memory Exclusion
  • Certain areas of memory can be excluded from the
    checkpoint
  • Dead memory: will never be read or written
  • Clean memory: values have not changed since
    previous checkpoint
  • Incremental Ckpt provides clean memory opt. at a
    coarse level (page size)
  • Only writing the active areas of the stack and
    heap provides dead memory opt.

24
User Directed Memory Exclusion
  • Libckpt provides the app. programmer with two
    functions
  • exclude_bytes(ptr, length, usage)
  • Specify an area of memory to exclude from future
    checkpoints
  • include_bytes(ptr, length)
  • Add a previously excluded area of memory to
    future checkpoints
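
To show how such a pair of calls could affect checkpoint size, here is a toy bookkeeping model. It is not libckpt's implementation: the toy_ names, the fixed-size table and the usage constants are all hypothetical, and the read-only behavior (written once more, then skipped) is taken from the description of clean memory in these slides.

```c
#include <assert.h>
#include <stddef.h>

enum toy_usage { TOY_READONLY, TOY_DEAD };

struct toy_region {
    void *ptr;
    size_t len;
    enum toy_usage usage;
    int saved_once;    /* read-only regions are written one last time */
    int in_use;
};

#define TOY_MAX_REGIONS 16
static struct toy_region regions[TOY_MAX_REGIONS];

/* Record a region to leave out of future checkpoints. */
int toy_exclude_bytes(void *ptr, size_t len, enum toy_usage usage)
{
    for (int i = 0; i < TOY_MAX_REGIONS; i++) {
        if (!regions[i].in_use) {
            regions[i] = (struct toy_region){ ptr, len, usage, 0, 1 };
            return 0;
        }
    }
    return -1;                         /* table full */
}

/* Put a previously excluded region back into future checkpoints. */
int toy_include_bytes(void *ptr, size_t len)
{
    for (int i = 0; i < TOY_MAX_REGIONS; i++) {
        if (regions[i].in_use && regions[i].ptr == ptr
            && regions[i].len == len) {
            regions[i].in_use = 0;
            return 0;
        }
    }
    return -1;                         /* region was not excluded */
}

/* Bytes a checkpoint taken now would write, out of `total` live bytes. */
size_t toy_checkpoint_size(size_t total)
{
    size_t skipped = 0;
    for (int i = 0; i < TOY_MAX_REGIONS; i++) {
        struct toy_region *r = &regions[i];
        if (!r->in_use)
            continue;
        if (r->usage == TOY_DEAD || r->saved_once)
            skipped += r->len;         /* left out of this checkpoint */
        else
            r->saved_once = 1;         /* read-only: saved this once */
    }
    assert(skipped <= total);
    return total - skipped;
}
```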

25
Clean Memory
  • If mem is clean
  • exclude_bytes(mem, , CKPT_READONLY)
  • Include mem in next checkpoint, but exclude in
    all subsequent
  • Cannot write to mem until after call to
    include_bytes(mem)
  • Restore last saved version of mem

26
Clean Memory Example
  • for ()
    • A = init_A()
    • exclude_bytes(A, , CKPT_READONLY)
    • do_stuff(A) // assuming A does not change
    • include_bytes(A)

27
Dead Memory
  • If mem is dead
  • exclude_bytes(mem, , CKPT_DEAD)
  • Do not checkpoint mem
  • Cannot read mem until after include_bytes(mem)
  • Will not restore mem

28
Dead Memory Example
  • for ()
    • A = init_A()
    • do_stuff(A)
    • exclude_bytes(A, , CKPT_DEAD)
    • do_other_stuff() // assumes will not read A
    • include_bytes(A)

29
Using Memory Exclusion
  • There can be a dramatic reduction in the size of
    the checkpoint file
  • Must be used very carefully
  • Inadvertently excluding a live region from a
    checkpoint could cause erroneous behavior on
    restart

30
Synchronous Checkpointing
  • At different points in the program's execution,
    the amount of live state varies widely
  • The stack might be much smaller (shallower call
    graph)
  • Heap items might have been de-allocated
  • Regions of memory might be dead or clean

31
Synchronous Ckpt (2)
  • If checkpoints are taken at times where there is
    relatively little live state, the checkpoint file
    size (and overhead) will be smaller
  • Allow user to specify where in a program a
    checkpoint should be taken
  • Independent of timers (signals)

32
Sync. Ckpt. Example
  • for ()
    • checkpoint_here()
    • A = malloc()
    • do_stuff(A)
    • free(A)

33
Synchronous Ckpt (3)
  • To avoid checkpointing too frequently, mintime
    parameter specifies the minimal amount of time
    between two checkpoints
  • If checkpoint_here() is called less than mintime
    seconds after the last checkpoint, the call is
    ignored

34
Synchronous Ckpt (4)
  • To ensure that checkpoints are taken frequently
    enough to be of use, maxtime parameter specifies
    the maximum time allowed to elapse between two
    checkpoints
  • If maxtime passes, an asynchronous checkpoint is
    taken

35
Combining Mem. Exclusion and Sync. Checkpointing
  • Original program
    • main()
      • D = malloc
      • f = file
      • while(!done)
        • D = read(f)
        • perform_calc(D)
        • output_result()
  • With memory exclusion and synchronous checkpoints
    • ckpt_target()
      • D = malloc
      • f = file
      • while(!done)
        • D = read(f)
        • perform_calc(D)
        • output_result()
        • exclude_bytes(D, DEAD)
        • checkpoint_here()
        • include_bytes(D)