Energy-Aware Deterministic Fault Tolerance in Distributed Real-Time Embedded Systems

1
Energy-Aware Deterministic Fault Tolerance in
Distributed Real-Time Embedded Systems

Ying Zhang*
Robert P. Dick†
Krishnendu Chakrabarty*
*Department of Electrical and Computer Engineering
Duke University
†Department of Electrical and Computer Engineering
Northwestern University

2
Motivation

Complex embedded systems
Boeing 777: 1280 embedded processors, 4 million lines of software

Fault tolerance and power management are addressed separately
Can fault tolerance and power management be combined in an integrated fashion?

3
Motivation

Goal: low-power, fault-tolerant real-time systems
Responsive to task deadlines
Fault-tolerant
Energy-efficient
Challenges: tradeoffs
Low energy vs. real-time responsiveness
Fault tolerance vs. real-time responsiveness
Low energy vs. fault tolerance

[Figure: tradeoffs among real-time responsiveness, fault tolerance, and energy efficiency]
4
Outline

Motivation
Background
Constant speed case
Incorporating DVS
Conclusions and future work

5
Checkpointing
2. Background

Checkpointing: intermediate states of a task are saved periodically
Rollback recovery: computation is resumed from the most recent checkpoint rather than from the beginning
Saves re-execution time → desirable for real-time systems
Transient faults: caused by cosmic rays, high-energy particles, crosstalk, etc.

[Figure: task timeline from release time to deadline, showing normal execution and two checkpoints]
6
Dynamic Voltage Scaling (I)

DVS: dynamic voltage scaling
Run-time variation of processor supply voltage
$P \propto C_{out} V_{dd}^2 f$
$f \propto (V_{dd} - V_t)^2 / V_{dd}$
Cubic reductions in power, quadratic reductions in energy
Lower $V_{dd}$ → lower power consumption but longer execution times
May cause tasks to miss their deadlines in real-time systems
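The two proportionalities above can be evaluated numerically. The following is a minimal sketch, not taken from the slides: the function names and the example voltages (1.8 V nominal, 1.2 V scaled, 0.4 V threshold) are illustrative assumptions, and the constant $C_{out}$ cancels out because only ratios are computed.

```python
def relative_frequency(vdd, vt):
    """Clock frequency up to a constant factor: f ~ (Vdd - Vt)^2 / Vdd."""
    if vdd <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return (vdd - vt) ** 2 / vdd


def scaling_factors(vdd, vdd_nominal, vt):
    """Ratios of frequency, power, execution time, and energy
    at supply voltage vdd relative to vdd_nominal."""
    f_ratio = relative_frequency(vdd, vt) / relative_frequency(vdd_nominal, vt)
    power_ratio = (vdd / vdd_nominal) ** 2 * f_ratio   # P ~ Vdd^2 * f
    time_ratio = 1.0 / f_ratio                         # same cycle count, slower clock
    energy_ratio = power_ratio * time_ratio            # E = P * t, reduces to (Vdd ratio)^2
    return {
        "frequency": f_ratio,
        "power": power_ratio,
        "time": time_ratio,
        "energy": energy_ratio,
    }


if __name__ == "__main__":
    # Hypothetical voltages chosen only for illustration.
    for name, value in scaling_factors(vdd=1.2, vdd_nominal=1.8, vt=0.4).items():
        print(f"{name:9s} x{value:.3f}")
```

Under these assumptions, energy for a fixed amount of work falls with the square of the voltage ratio, while execution time grows, which is why a real-time scheduler cannot lower the voltage freely.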

7
Dynamic Voltage Scaling (II)

DVS for fault tolerance
Speed up → increase slack → easier to provide fault tolerance
Tradeoffs
Speed ↑ → Energy ↑, Execution time ↓, Fault-tolerance capability ↑
Speed ↓ → Energy ↓, Execution time ↑, Fault-tolerance capability ↓

[Figure: voltage levels V1 and V2 versus task execution time, with regions of high and low fault tolerance]
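The speed versus fault-tolerance tradeoff can be illustrated with a toy model. This sketch is an assumption-laden illustration rather than the presenters' analysis: every recovery is assumed to cost a fixed number of cycles (`recovery_cycles`, a hypothetical parameter), speed is normalized so that 1.0 is full speed, and energy per cycle is assumed to scale with the square of the speed (that is, supply voltage is idealized as proportional to speed).

```python
import math


def tolerable_faults(work_cycles, recovery_cycles, deadline, speed):
    """Number of recoveries that fit into the slack at a normalized speed.

    Returns None when even the fault-free run misses the deadline.
    """
    slack_cycles = deadline * speed - work_cycles  # slack expressed in cycles
    if slack_cycles < 0:
        return None
    return math.floor(slack_cycles / recovery_cycles)


def relative_energy(work_cycles, speed):
    """Fault-free energy, assuming energy per cycle ~ speed^2 (idealized)."""
    return work_cycles * speed ** 2


if __name__ == "__main__":
    # Hypothetical task: 100 cycles of work, 30 cycles per recovery, deadline 200.
    for speed in (0.4, 0.5, 0.75, 1.0):
        k = tolerable_faults(100, 30, 200, speed)
        print(speed, k, relative_energy(100, speed))
```

In this model a higher speed buys more slack and therefore more tolerable faults, at a quadratically higher energy cost.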
8
Practical Issues in Checkpointing

Stable storage in embedded systems
Storage types: SRAM, DRAM, ROM, and flash memory
DRAM is appropriate for checkpoint saving: access speed and capacity [1]
Checkpoint overhead
Checkpoint size: on the order of KBytes in many embedded systems [1]
Typical time to save a checkpoint to DRAM: on the order of µs
Energy cost: dependent on memory access power
Faults during checkpointing and recovery
Must be taken into account due to the high fault arrival rate
Always roll back and restore state wherever a fault occurs

[1] C.-Y. Lin et al., "A checkpointing tool for Palm operating system," Proc. DSN, pp. 71-76, 2001.
9
System and Fault Model
System model

Communication based on message passing
Each PE has its own processor and DRAM
Checkpoints saved in DRAM
Processors can be constant-speed or
variable-speed (DVS-capable)
Fixed communication speed
System implemented using CAD synthesis tool

Fault model

Targets transient faults: single-event upsets, crosstalk glitches, etc.
Permanent faults are handled through manufacturing and testing techniques
Errors due to transient faults are detected through appropriate schemes
Fault arrival: k-fault-tolerance
Value of k is determined by the fault arrival rate and the task application time
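The slides do not say how k is derived from the fault arrival rate and the application time. One plausible way to do it, shown here purely as an assumption, is to model fault arrivals as a Poisson process and pick the smallest k that covers the fault count with a chosen confidence. The function names and the 0.99 default are hypothetical.

```python
import math


def poisson_cdf(k, mean):
    """P(N <= k) for a Poisson-distributed fault count N with the given mean."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k + 1))


def faults_to_tolerate(fault_rate, duration, confidence=0.99):
    """Smallest k such that at most k faults occur with probability >= confidence.

    Assumption (not from the slides): faults arrive as a Poisson process
    with rate fault_rate over an application run of length duration.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    mean = fault_rate * duration
    k = 0
    while poisson_cdf(k, mean) < confidence:
        k += 1
    return k


if __name__ == "__main__":
    # Hypothetical numbers: one expected fault over the whole run.
    print(faults_to_tolerate(fault_rate=1e-3, duration=1000.0))
```

A higher fault rate or a longer run raises the expected fault count and therefore k, matching the dependence stated on the slide.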

10
Problem formulation
3. Constant speed case

Program modeled by a DAG $G = (V, E)$
Node $v_i = (a_i, d_i, n_i)$
$a_i$: arrival time
$d_i$: deadline
$n_i$: computation time
$c_{ij}$: communication cost between $v_i$ and $v_j$

Problem: given $G = (V, E)$ and a system-level synthesis tool,
determine a checkpointing and voltage scaling scheme (in the presence of faults)
such that (1) all jobs complete on time and (2) energy savings are achieved
11
Checkpointing Consistency
Consistent state: $t_1 < t_A < t_f$
Inconsistent state: $t_A < t_1 < t_f$
Consistent state: if the checkpointed state of a processor reflects a message receipt, then the checkpointed state of the corresponding sender processor should indicate that the message has been sent out.
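The consistency condition stated above can be checked mechanically. The sketch below is an illustration under assumed names (`Message`, `is_consistent`) and assumes that all times are measured on a common clock; it flags a set of checkpoints as inconsistent when some message is recorded as received but not as sent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_time: float
    receive_time: float


def is_consistent(checkpoint_time, messages):
    """checkpoint_time maps each processor name to the time of its checkpoint.

    The global state is inconsistent if any receiver's checkpoint reflects a
    receipt while the sender's checkpoint does not yet reflect the send.
    """
    for m in messages:
        received = m.receive_time <= checkpoint_time[m.receiver]
        sent = m.send_time <= checkpoint_time[m.sender]
        if received and not sent:
            return False
    return True


if __name__ == "__main__":
    msg = Message(sender="P0", receiver="P1", send_time=5.0, receive_time=7.0)
    print(is_consistent({"P0": 6.0, "P1": 8.0}, [msg]))  # send and receipt both recorded
    print(is_consistent({"P0": 4.0, "P1": 8.0}, [msg]))  # receipt recorded, send missing
```

Because a message is always sent before it is received, checkpoints taken by all processors at the same instant can never violate this condition.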
12
Synchronized Checkpointing: Ensuring Consistency

Globally synchronized signal used for triggering local checkpointing: clock or coarse-grained signal
All processors take checkpoints according to a global time
No costly coordination messages
Predictable recovery cost (bounded by checkpointing period T)

[Figure: timelines of processors P0, P1, and P2 with synchronized checkpoints at t0 and t0 + T, and a fault at tf]
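The bound on recovery cost follows from simple arithmetic on the global checkpoint schedule. A minimal sketch, assuming checkpoints are taken exactly at t0, t0 + T, t0 + 2T, and so on, and using hypothetical function names:

```python
import math


def last_checkpoint_before(t_fault, t0, period):
    """Time of the most recent global checkpoint at or before t_fault."""
    if period <= 0:
        raise ValueError("checkpointing period must be positive")
    if t_fault < t0:
        raise ValueError("fault occurs before the first checkpoint")
    return t0 + math.floor((t_fault - t0) / period) * period


def lost_work(t_fault, t0, period):
    """Computation that must be re-executed after rolling back; always < period."""
    return t_fault - last_checkpoint_before(t_fault, t0, period)


if __name__ == "__main__":
    # Hypothetical schedule: checkpoints every 10 time units starting at 0.
    print(last_checkpoint_before(t_fault=37, t0=0, period=10))
    print(lost_work(t_fault=37, t0=0, period=10))
```

Since every processor rolls back to the same checkpoint time, the re-executed work on each processor is less than T regardless of where the fault occurred.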
13
Feasibility Analysis under Constant Speed

Assumption
At most k faults occur in the system (including all PEs) before the program deadline (largest job deadline)
Time to store (retrieve) a checkpoint: $c_w$ ($c_r$)
Synchronized checkpointing with a fixed checkpointing interval
Key observation
Maximum time penalty for one fault: checkpointing interval + $c_w$ + $c_r$
Approach
Perform a topological sort of G
Calculate the worst-case finish time (wcft) for each job $v_i$
Examine deadline constraints: feasible iff $\mathrm{wcft}(v_i) \le d_i,\ \forall v_i \in V$
Complexity: $O(|V| + |E|)$
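The approach above can be sketched as follows. The transcript does not give the exact wcft recurrence, so the one used here is an assumption: each fault adds a fixed penalty (interval + c_w + c_r), the fault budget is split between a job and its ancestors, and each job's computation time is taken to already include fault-free checkpointing overhead. All names are hypothetical. The budget split adds a factor of roughly k squared on top of the single topological pass.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    arrival: float
    deadline: float
    compute: float  # n_i, assumed to include fault-free checkpoint overhead


def topological_order(jobs, edges):
    """Kahn's algorithm; edges maps (u, v) to the communication cost c_uv."""
    indegree = {v: 0 for v in jobs}
    successors = {v: [] for v in jobs}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(v for v in jobs if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(jobs):
        raise ValueError("task graph contains a cycle")
    return order


def worst_case_finish_times(jobs, edges, k, interval, c_w, c_r):
    """wcft of each job when at most k faults hit it or its ancestors."""
    penalty = interval + c_w + c_r  # maximum time lost to a single fault
    preds = {v: [] for v in jobs}
    for (u, v), cost in edges.items():
        preds[v].append((u, cost))
    table = {}  # table[v][j]: worst finish time of v under a budget of j faults
    for v in topological_order(jobs, edges):
        job = jobs[v]
        row = []
        for budget in range(k + 1):
            worst = 0.0
            for upstream in range(budget + 1):  # faults spent on ancestors
                ready = job.arrival
                for u, cost in preds[v]:
                    ready = max(ready, table[u][upstream] + cost)
                local = budget - upstream       # faults spent on this job
                worst = max(worst, ready + job.compute + local * penalty)
            row.append(worst)
        table[v] = row
    return {v: row[k] for v, row in table.items()}


def is_feasible(jobs, edges, k, interval, c_w, c_r):
    finish = worst_case_finish_times(jobs, edges, k, interval, c_w, c_r)
    return all(finish[v] <= jobs[v].deadline for v in jobs)


if __name__ == "__main__":
    # Hypothetical three-job chain v1 -> v2 -> v3 with made-up timing values.
    jobs = {"v1": Job(0, 40, 10), "v2": Job(15, 80, 10), "v3": Job(30, 120, 10)}
    edges = {("v1", "v2"): 2, ("v2", "v3"): 2}
    print(worst_case_finish_times(jobs, edges, k=2, interval=5, c_w=1, c_r=1))
    print(is_feasible(jobs, edges, k=2, interval=5, c_w=1, c_r=1))
```

Comparing a predecessor's worst-case finish time plus communication cost against the job's own arrival time, as the `max` on the `ready` variable does, is what lets arrival-time slack absorb part of the fault penalty.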

14
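Illustrative Example: Program Composed of 3 Jobs
[Figure: jobs v1, v2, and v3 mapped to PE1, PE2, and PE3; each job vi is labeled with ai, di, and ti; communication costs c12 and c23 link v1 to v2 and v2 to v3]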
15
Illustrative Example: Key Points

Incorporate additional timing costs: checkpointing and rollback recovery
Examine the effect of fault occurrences for all predecessor jobs
Compare each predecessor's WCFT with the arrival time of the current job
Divide the problem into three cases

16
Illustrative Example: Case 1
[Figure: timeline with labels a2, wcft1(n1) + c12, n1, and k]
17
Illustrative Example: Case 2
[Figure: timeline with labels wcft1(n1) + c12, a2, n1, and k]
18
Illustrative Example: Case 3
[Figure: timeline with labels wcft1(n1) + c12, a2, k, and n1]
19
Theoretical Basis
20
Simulation Results