Title: Movement-Based Checkpointing and Logging for Recovery in Mobile Computing Systems
1Movement-Based Check-pointing and Logging for
Recovery in Mobile Computing Systems
- Sapna E. George, Ing-Ray Chen, Ying Jin
- Dept. of Computer Science
- Virginia Polytechnic Institute and State University
2Outline
- Background
- Problem Definition: Failure Recovery in the Mobile Computing Environment
- Proposed Solution: Movement-Based Check-pointing and Logging
- Performance Analysis
- Analytic Model of the System
- Analysis Results and Conclusions
- Future Work
3Background
4Mobile Computing
- Advances in wireless networking and portable device technologies are revolutionizing computing
- Mobile computing: a type of distributed computing
- Involves hosts that may be mobile
- Host network connectivity is maintained through wireless communications
5Fault-tolerance in Distributed systems
- Check-pointing, Logging, Rollback recovery
- Check-pointing: during failure-free operation
- Save system state to stable storage
- This snapshot is called a checkpoint
- Logging: during failure-free operation
- All non-deterministic events, and the information necessary to replay them, are logged to stable storage
- In addition to checkpoints
6Fault-tolerance in Distributed systems
- Failure Recovery
- Failed process rolls back to the latest checkpoint
- Replays all the logged events in their original order
- Recreates pre-failure state independently
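The checkpoint, log and rollback cycle described on the two slides above can be illustrated with a minimal Python sketch. The class `RecoverableProcess`, its dictionary state and its `(key, delta)` event format are assumptions made for illustration only; they do not come from the slides, and in-memory fields stand in for stable storage.

```python
import copy


class RecoverableProcess:
    """Toy process whose state is a dict of counters (illustrative only)."""

    def __init__(self):
        self.state = {}
        self.checkpoint = {}  # stands in for the checkpoint on stable storage
        self.log = []         # events logged since the last checkpoint

    def apply(self, event):
        # An event is a hypothetical (key, delta) pair.
        key, delta = event
        self.state[key] = self.state.get(key, 0) + delta

    def handle(self, event):
        # Log the event before applying it so that it can be replayed later.
        self.log.append(event)
        self.apply(event)

    def take_checkpoint(self):
        # Save a snapshot; earlier log entries are now covered by it.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def recover(self):
        # Roll back to the latest checkpoint, then replay the log in order.
        self.state = copy.deepcopy(self.checkpoint)
        for event in self.log:
            self.apply(event)
```

Because recovery uses only the process's own checkpoint and log, no other process needs to take part, which is the independence property the slide refers to.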
7Problem Definition
- Failure Recovery in the Mobile Computing
Environment
8Effects of Properties of MC Env.
- Mobility of hosts
- If checkpointing requires coordination, the mobile host (MH) must be searched for and located before control messages can be delivered; this increases communication delay
- Data related to recovery, such as checkpoints and logs, may be distributed over many mobile support stations (MSSs); a mechanism is required for efficient storage, retrieval and management of this dispersed information
9Effects of Properties of MC Env.
- Low bandwidth and unreliable network connectivity
- A recovery mechanism that requires a large number
of messages or large size of messages imposes
undue burden on the wireless resources and
increases the cost of providing fault tolerance.
10Effects of Properties of MC Env.
- Limited battery life of host devices
- Communication is energy intensive.
- Recovery mechanism must keep communication (the
number of messages and the size of messages) to a
minimum.
11Effects of Properties of MC Env.
- Lack of stable storage on host devices
- Devices are vulnerable to physical damage
- Devices are small and are equipped with limited memory
- The MH's disk cannot reliably function as the stable storage required to store recovery information.
12Effects of Properties of MC Env.
- Different types of failures
- Voluntary disconnection and hardware failure must be handled differently
- A disconnected host may reconnect after a while and expect to resume operations
- An MH that is currently unreachable cannot be expected to participate in a checkpointing or recovery operation.
- A scheme that requires synchronization or coordination with other MHs would either block until the MH reconnects or fail.
13The Problem
- Traditional recovery schemes suffer from many shortcomings when applied to the mobile computing environment.
- The failure-prone nature of the environment makes it essential to provide some form of explicit recovery mechanism.
14The Problem
- In general, application recovery mechanisms try to balance:
- Recovery cost (failure-free operational cost)
- Recovery time
- Storage requirements for recovery-related information
15The Problem
- Adaptations of traditional recovery schemes for the mobile computing environment:
- Do not consider mobility in the selection of the checkpointing interval
- Use periodic checkpointing
- Subsequently control the proliferation of recovery information using techniques that merge logs and move the information closer to the MH.
16Proposed Solution
- Movement-Based Check-pointing and Logging
17Assumed Mobile Computing System
- A set of mobile hosts (MHs)
- They maintain network connectivity through a wireless link to a static mobile support station (MSS)
- An MSS handles all communications to and from MHs within its area of influence, known as a cell
- Each MSS is equipped with enough stable storage to store the state and log information
18Assumed Mobile Computing System
- Interactions between the MH and the network infrastructure relevant to failure recovery:
- Handoff: cell boundary crossing
- Disconnection: for power conservation
- Reconnection: possibly in a cell different from the one in which it disconnected
19Assumed Mobile Computation
- A distributed computation: a number of processes executing concurrently on multiple hosts.
- Process states
- Normal: executing application-related computations, receiving user inputs, or sending and receiving messages.
- Save: saves its state as a checkpoint to the stable storage
- Between checkpoints, the process also logs all events (Normal state)
- Recovery: loads checkpoints and applies logs
20Movement-Based Checkpointing and Logging
- The interval between checkpoints is governed by the number of handoffs experienced by the MH and is not fixed
- The MH maintains a handoff counter, which is incremented by 1 every time a handoff occurs.
- When the value of the counter becomes greater than a threshold M, a checkpoint is taken.
- In between checkpoints, all write events related to an MH are also logged to the local MSS.
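The counter rule on this slide can be sketched in a few lines of Python. The class `MobileHost`, its method names and the per-MSS log dictionary are hypothetical. Resetting the counter and discarding older logs when a checkpoint is taken is an assumption, since the slide does not state what happens to either.

```python
class MobileHost:
    """Sketch of movement-based checkpointing (names are illustrative)."""

    def __init__(self, threshold_m, initial_mss):
        self.threshold_m = threshold_m
        self.current_mss = initial_mss
        self.handoff_count = 0
        # Assumption: an initial checkpoint exists at the starting MSS.
        self.checkpoint_mss = initial_mss
        # Write events logged since the last checkpoint, keyed by MSS.
        self.logs = {}

    def log_write(self, event):
        # Write events are logged at the MSS currently serving the MH.
        self.logs.setdefault(self.current_mss, []).append(event)

    def handoff(self, new_mss):
        """Move to a new cell; return True if this triggered a checkpoint."""
        self.current_mss = new_mss
        self.handoff_count += 1
        # As stated on the slide: checkpoint once the counter exceeds M.
        if self.handoff_count > self.threshold_m:
            self.take_checkpoint()
            return True
        return False

    def take_checkpoint(self):
        self.checkpoint_mss = self.current_mss
        self.logs = {}           # assumption: older logs are no longer needed
        self.handoff_count = 0   # assumption: the counter restarts
```

With `threshold_m = 2`, the third handoff after a checkpoint is the first one to trigger a new checkpoint, so logs can be spread over at most three MSSs in this sketch.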
21Movement-Based Checkpointing and Logging
- The threshold M is a configurable parameter. It depends on:
- User: the mobility rate
- Network: the failure rate
- Application: the log arrival rate
22Movement-Based Checkpointing and Logging
- Thus, depending on the variability in the MH's mobility, the time interval between successive checkpoints differs.
- Recovery: the MH recovers independently, without coordination with other MHs
- Upon reconnection, the MH informs the local MSS.
- Local MSS contacts the MSS with the latest checkpoint
- Local MSS contacts all MSSs storing logs
- All data is transferred to the local MSS via the wired network and to the MH via the wireless link
- MH rolls back and applies logs
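The recovery steps above can be sketched as a single function. Everything named here (`recover`, `apply_write`, the `(key, value)` write-event format and the `visit_order` argument) is an illustrative assumption; the slides describe the message flow but not a data format. Replaying the logs in the order in which the MSSs were visited is likewise assumed, so that later writes override earlier ones.

```python
def apply_write(state, event):
    # Hypothetical write event: (key, value) overwrites the stored value.
    key, value = event
    state[key] = value


def recover(checkpoint, logs_by_mss, visit_order, apply_event=apply_write):
    """Rebuild the MH state from the latest checkpoint plus scattered logs.

    checkpoint  -- state fetched from the MSS holding the latest checkpoint
    logs_by_mss -- mapping from MSS id to the log entries stored there
    visit_order -- MSS ids in the order the MH visited them
    """
    state = dict(checkpoint)             # roll back to the checkpoint
    for mss in visit_order:              # gather logs MSS by MSS
        for event in logs_by_mss.get(mss, []):
            apply_event(state, event)    # replay in original order
    return state
```

For example, `recover({"x": 0}, {"mss1": [("x", 1)], "mss2": [("x", 2)]}, ["mss1", "mss2"])` replays the write from `mss1` before the one from `mss2`.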
23Movement-Based Checkpointing and Logging
- The performance of this scheme depends on identifying the optimal movement threshold M per user and application.
- Checkpoints and logs remain within an acceptable range of the MH's current location, which eliminates the need for information consolidation.
- The scheme ensures acceptable recovery time, since M bounds the number of MSSs from which logs must be retrieved.
24Performance Analysis
25Stochastic Petri-Net (SPN) Model
26SPN Model Parameters
27SPN Model Parameters
- Parameter Tk: checkpoint rate of the MH
- Parameter Ti: recovery rate of the MH (the inverse of the recovery time)
- i: number of handoffs experienced by the MH since the last checkpoint and before failure.
28Analytic Model Recovery Time
29Analytic Model Recovery Time
- Treq_rec: time spent on recovery information requests
- Nmss_logs: number of MSSs storing logs
- Dmss: average hop count between MSScp and MSSrec
30Analytic Model Recovery Time
- Tckp_tx - Time spent on transmitting the latest
checkpoint to the MH
- Tlog_tx - Time spent on transmitting the logs to
the MH
- Trec - Time spent to rollback to the last
checkpoint and apply the logs
31Analytic Model Cost of Recovery
- Tr: average recovery time per failure
- Fr: recovery probability
- Tc: cost of recovery
No. of checkpoints before failure
No. of logs before failure
32SPN Evaluation Parameters
- Size of a log entry: 50 B
- Size of a checkpoint: 2000 B
- Bandwidth of wired network: 2 Mbps
- Ratio of wireless to wired network bandwidth (r): 0.1
- Time required to apply a log entry (Telog): 0.0001 s
- Time required to transmit a log entry through the wireless channel (Tlog_w): 0.002 s
- Time required to transmit a checkpoint through the wireless channel (Tckp_w): 0.08 s
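The two wireless transmission times listed above follow from the sizes, the wired bandwidth and the ratio r. The short Python check below reproduces them, assuming 1 Mbps means 10^6 bits per second and 8 bits per byte; the constant and function names are illustrative.

```python
LOG_ENTRY_BYTES = 50
CHECKPOINT_BYTES = 2000
WIRED_BANDWIDTH_BPS = 2_000_000  # 2 Mbps, taken as 2 * 10**6 bits/s
WIRELESS_RATIO = 0.1             # r: wireless bandwidth / wired bandwidth


def wireless_tx_time(size_bytes):
    """Seconds needed to send size_bytes over the wireless channel."""
    wireless_bps = WIRED_BANDWIDTH_BPS * WIRELESS_RATIO  # 200 kbps
    return size_bytes * 8 / wireless_bps


t_log_w = wireless_tx_time(LOG_ENTRY_BYTES)    # about 0.002 s
t_ckp_w = wireless_tx_time(CHECKPOINT_BYTES)   # about 0.08 s
```

Under these assumptions the computed values agree with the Tlog_w and Tckp_w figures on the slide.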
33Performance Analysis
34Recovery Probability vs. Recovery Time
35Recovery Probability vs. Log Arrival Rate
36Recovery Probability vs. Failure Rate
37Recovery Probability and Recovery Time vs. Movement
Threshold
38Determining Optimal Movement Threshold that
Minimizes Recovery Cost Per Failure
39Conclusion Proposed Scheme
- An efficient failure recovery scheme for mobile computing systems based on movement-based checkpointing and logging
- The movement-based checkpointing and logging scheme takes a checkpoint only after the mobile node has made M movements (mobility handoffs).
- The value of M is governed by the failure rate, log arrival rate, and the mobility rate of the application and MH.
- Given the failure, mobility and log arrival rates, identify the optimal movement threshold M that minimizes the cost of recovery per failure.
40Conclusion Practical Application
- Build a table at configuration time covering possible parameter values of the mobility rate and failure rate of the MH and the log arrival rate of the mobile applications, and listing the optimal M value that would minimize the recovery cost per failure.
- At runtime, based on the measured rates, the optimal M may be selected dynamically to minimize the recovery cost per failure.
- When an application deadline for recovering from a failure is given, the optimal M selected must also satisfy the specified recovery probability.
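The selection rule in the last two bullets can be sketched as follows. The function `select_optimal_m` and the tuple layout `(M, cost per failure, recovery probability)` are assumptions for illustration. In practice the cost and probability for each M would come from the analytic model for the measured rates, and none are supplied here.

```python
def select_optimal_m(candidates, required_probability=None):
    """Pick the M with the lowest recovery cost per failure.

    candidates -- iterable of (m, cost_per_failure, recovery_probability)
                  tuples for one combination of measured rates
    required_probability -- minimum recovery probability within the
                  application deadline, or None if no deadline is given
    Returns the chosen M, or None if no candidate meets the requirement.
    """
    feasible = [
        c for c in candidates
        if required_probability is None or c[2] >= required_probability
    ]
    if not feasible:
        return None
    # Among feasible candidates, minimize the cost of recovery per failure.
    return min(feasible, key=lambda c: c[1])[0]
```

A caller would look up the candidate list for the nearest tabulated mobility, failure and log arrival rates and pass it to this function each time the measured rates change.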