Title: Energy-Aware Deterministic Fault Tolerance in Distributed Real-Time Embedded Systems
1Energy-Aware Deterministic Fault Tolerance in
Distributed Real-Time Embedded Systems
- Ying Zhang†
- Robert P. Dick‡
- Krishnendu Chakrabarty†
- †Department of Electrical and Computer Engineering
- Duke University
- ‡Department of Electrical and Computer Engineering
- Northwestern University
2Motivation
- Complex embedded systems
- Boeing 777: 1280 embedded processors, 4 million
lines of software
- Fault tolerance and power management are
addressed separately
- Can fault tolerance and power management be
combined in an integrated fashion?
3Motivation
- Goal: low-power, fault-tolerant real-time systems
- Responsive to task deadlines
- Fault-tolerant
- Energy-efficient
- Challenges: tradeoffs
- Low energy vs. real-time responsiveness
- Fault tolerance vs. real-time responsiveness
- Low energy vs. fault tolerance
[Figure: tradeoffs among real-time responsiveness, fault tolerance, and energy efficiency]
4Outline
- Motivation
- Background
- Constant speed case
- Incorporating DVS
- Conclusions and future work
5Checkpointing
2. Background
- Checkpointing: intermediate states of a task are saved periodically
- Rollback recovery: computation is resumed from the most recent checkpoint rather than from the beginning
- Saves re-execution time → desirable for real-time systems
- Transient faults: caused by cosmic rays, high-energy particles, crosstalk, etc.
[Figure: task timeline from release time to deadline, showing normal execution with two checkpoints]
6Dynamic Voltage Scaling (I)
- DVS: dynamic voltage scaling
- Run-time variation of processor supply voltage
- $P \propto C_{out} V_{dd}^2 f$
- $f \propto (V_{dd} - V_t)^2 / V_{dd}$
- Cubic reductions in power, quadratic reductions in energy
- Lower $V_{dd}$ → lower power consumption and longer execution times
- May cause tasks to miss their deadlines in real-time systems
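The two proportionalities above can be explored numerically. The following is a minimal sketch, not the authors' model: constant factors (including C_out) are dropped, the threshold voltage of 0.4 V is a hypothetical value chosen only for illustration, and the function names are made up for this example.

```python
def relative_frequency(vdd: float, vt: float) -> float:
    """Clock frequency up to a constant factor: f ∝ (Vdd - Vt)^2 / Vdd."""
    return (vdd - vt) ** 2 / vdd


def relative_power(vdd: float, vt: float) -> float:
    """Dynamic power up to a constant factor: P ∝ Vdd^2 * f (C_out held constant)."""
    return vdd ** 2 * relative_frequency(vdd, vt)


def compare(v_low: float, v_high: float, vt: float) -> dict:
    """Ratios (low voltage / high voltage) for a job with a fixed cycle count."""
    f_ratio = relative_frequency(v_low, vt) / relative_frequency(v_high, vt)
    p_ratio = relative_power(v_low, vt) / relative_power(v_high, vt)
    return {
        "frequency": f_ratio,
        "power": p_ratio,
        "time": 1.0 / f_ratio,              # same cycles at a lower clock rate
        "energy": p_ratio * (1.0 / f_ratio),  # energy = power * time ∝ Vdd^2
    }


VT = 0.4  # hypothetical threshold voltage in volts (assumption, not from the slides)
ratios = compare(v_low=1.4, v_high=1.6, vt=VT)
for name, value in ratios.items():
    print(f"{name:9s} ratio at 1.4 V vs 1.6 V: {value:.3f}")
```

Under these idealized assumptions the energy ratio depends only on the squared voltage ratio, while the execution time grows, which is the source of the deadline risk noted above.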
7Dynamic Voltage Scaling (II)
- DVS for fault tolerance
- Speed up → increase slack → easier to provide fault tolerance
- Tradeoffs
- Speed ↑ → Energy ↑, Execution time ↓, Fault-tolerance capability ↑
- Speed ↓ → Energy ↓, Execution time ↑, Fault-tolerance capability ↓
[Figure: voltage levels V1 and V2 versus task execution time, with regions of high and low fault tolerance]
8Practical Issues in Checkpointing
- Stable storage in embedded systems
- Storage types: SRAM, DRAM, ROM, and flash memory
- DRAM is appropriate for checkpoint saving: access speed and capacity [1]
- Checkpoint overhead
- Checkpoint size: on the order of KBytes in many embedded systems [1]
- Typical time to save a checkpoint to DRAM: on the order of µs
- Energy cost: dependent on memory access power
- Faults during checkpointing and recovery
- Must be taken into account due to high fault arrival rate
- Always roll back and restore state wherever a fault occurs
[1] C.-Y. Lin et al., "A checkpointing tool for Palm operating system," Proc. DSN, pp. 71-76, 2001.
9System and Fault Model
System model
- Communication based on message passing
- Each PE has its own processor and DRAM
- Checkpoints saved in DRAM
- Processors can be constant-speed or variable-speed (DVS-capable)
- Fixed communication speed
- System implemented using CAD synthesis tool
Fault model
- Target: transient faults (single-event upsets, crosstalk glitches, etc.)
- Permanent faults handled through manufacturing and testing techniques
- Errors due to transient faults detected through appropriate schemes
- Fault arrival: k-fault-tolerance
- Value of k determined by fault arrival rate and task application time
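The slides do not say how k is derived from the fault arrival rate and the application time. One possible approach, shown here purely as an assumption, is to model fault arrivals as a Poisson process and pick the smallest k whose probability of being exceeded stays below a chosen bound. The function names and the `miss_prob` parameter are hypothetical.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(N <= k) for a Poisson-distributed fault count N with the given mean."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k + 1))


def faults_to_tolerate(rate: float, duration: float, miss_prob: float) -> int:
    """Smallest k such that P(more than k faults in `duration`) <= miss_prob.

    Assumes Poisson fault arrivals (an assumption of this sketch);
    `rate` is faults per time unit and `duration` is in the same time unit.
    """
    mean = rate * duration
    k = 0
    while 1.0 - poisson_cdf(k, mean) > miss_prob:
        k += 1
    return k


# Hypothetical numbers: one expected fault over the run, 0.1% residual risk.
k = faults_to_tolerate(rate=1e-3, duration=1000.0, miss_prob=1e-3)
print("k =", k)
```

A higher fault rate or a longer application time raises the expected fault count and therefore k, which matches the dependency stated in the fault model.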
10Problem formulation
3. Constant speed case
- Program modeled by a DAG G = (V, E)
- Node vi = (ai, di, ni)
- ai: arrival time
- di: deadline
- ni: computation time
- cij: communication cost between vi and vj
Problem: Given G = (V, E) and a system-level synthesis tool,
determine a checkpointing and voltage scaling scheme (in the presence of faults)
such that
(1) all jobs complete on time
(2) energy savings are achieved
11Checkpointing Consistency
Consistent state: t1 < tA < tf
Inconsistent state: tA < t1 < tf
Consistent state: if the checkpointed state of a processor reflects a message receipt, then the checkpointed state of the corresponding sender processor must indicate that the message has been sent.
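The consistency rule can be written as a small check over a set of per-processor checkpoint times. This is an illustrative sketch only: the `Message` record and `is_consistent` function are hypothetical names, and a checkpoint is assumed to "reflect" an event when the event time is at or before the checkpoint time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_time: float
    receive_time: float


def is_consistent(checkpoint_time: dict, messages: list) -> bool:
    """Return True if no checkpoint records a receipt whose send is unrecorded.

    checkpoint_time maps each processor name to the time of its checkpoint.
    """
    for m in messages:
        receipt_recorded = m.receive_time <= checkpoint_time[m.receiver]
        send_recorded = m.send_time <= checkpoint_time[m.sender]
        if receipt_recorded and not send_recorded:
            return False  # receiver remembers a message the sender never sent
    return True


msgs = [Message(sender="P0", receiver="P1", send_time=5.0, receive_time=7.0)]
print(is_consistent({"P0": 4.0, "P1": 8.0}, msgs))  # sender checkpointed too early
print(is_consistent({"P0": 6.0, "P1": 8.0}, msgs))  # send and receipt both recorded
```

If every processor checkpoints at the same instant and a message is never received before it is sent, a recorded receipt always implies a recorded send, so the check passes under this idealized timing model.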
12Synchronized Checkpointing Ensuring Consistency
- Globally synchronized signal used for triggering local checkpointing: clock or coarse-grained signal
- All processors take checkpoints according to a global time
- No costly coordination messages
- Predictable recovery cost (bounded by checkpointing period T)
[Figure: timelines of processors P0, P1, and P2 with times t0, t0 + T, and tf marked; T is the checkpointing period]
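The bounded recovery cost follows from the fact that a fault can never be further than one period away from the previous global checkpoint. A minimal sketch, assuming checkpoints occur exactly at t0, t0 + T, t0 + 2T, and so on, and ignoring the time to store or retrieve a checkpoint (function names are hypothetical):

```python
import math


def last_checkpoint(t0: float, period: float, fault_time: float) -> float:
    """Most recent global checkpoint instant at or before fault_time."""
    if fault_time < t0:
        raise ValueError("fault occurs before the first checkpoint")
    completed_periods = math.floor((fault_time - t0) / period)
    return t0 + completed_periods * period


def lost_work(t0: float, period: float, fault_time: float) -> float:
    """Computation time that must be re-executed after rolling back."""
    return fault_time - last_checkpoint(t0, period, fault_time)


# Hypothetical timing: checkpoints every 10 time units starting at 0.
print(last_checkpoint(0.0, 10.0, 27.0))
print(lost_work(0.0, 10.0, 27.0))
```

Because every processor rolls back to the same global instant, the re-executed work is always less than one period T, regardless of which processor observed the fault.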
13Feasibility Analysis under Constant Speed
- Assumptions
- At most k faults occur in the system (including all PEs) before the program deadline (largest job deadline)
- Time to store (retrieve) a checkpoint: cw (cr)
- Synchronized checkpointing with interval τ
- Key observation
- Maximum time penalty for one fault: τ + cw + cr
- Approach
- Perform a topological sort of G
- Calculate the worst-case finish time (wcft) for each job vi
- Examine deadline constraints: feasible iff wcft(vi) ≤ di, ∀ vi ∈ V
- Complexity: O(VE)
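The approach can be sketched as a single pass over the jobs in topological order. This is a simplified, conservative variant and not the authors' exact analysis: it charges every job the full penalty of all k faults instead of distinguishing cases, it assumes one checkpoint per completed interval of execution, and the names `tau`, `jobs`, and `comm` are hypothetical. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
import math
from graphlib import TopologicalSorter


def worst_case_finish_times(jobs, comm, k, tau, cw, cr):
    """Conservative worst-case finish time for every job.

    jobs: name -> (arrival, deadline, compute_time)
    comm: (pred, succ) -> communication cost
    k: maximum number of faults; tau: checkpoint interval;
    cw / cr: time to store / retrieve a checkpoint.
    """
    preds = {v: set() for v in jobs}
    for (u, v) in comm:
        preds[v].add(u)

    penalty = tau + cw + cr  # assumed maximum delay caused by one fault
    fault_free = {}
    wcft = {}
    for v in TopologicalSorter(preds).static_order():
        arrival, _deadline, n = jobs[v]
        ready = [fault_free[u] + comm[(u, v)] for u in preds[v]]
        start = max([arrival] + ready)
        overhead = math.floor(n / tau) * cw  # assumed: one checkpoint per full interval
        fault_free[v] = start + n + overhead
        wcft[v] = fault_free[v] + k * penalty  # all k faults may delay this job
    return wcft


def is_feasible(jobs, comm, k, tau, cw, cr):
    """Feasible iff every job's worst-case finish time meets its deadline."""
    wcft = worst_case_finish_times(jobs, comm, k, tau, cw, cr)
    return all(wcft[v] <= jobs[v][1] for v in jobs)


# Hypothetical three-job chain v1 -> v2 -> v3.
jobs = {"v1": (0, 40, 10), "v2": (0, 70, 10), "v3": (0, 100, 10)}
comm = {("v1", "v2"): 2, ("v2", "v3"): 3}
print(worst_case_finish_times(jobs, comm, k=2, tau=5, cw=1, cr=1))
print(is_feasible(jobs, comm, k=2, tau=5, cw=1, cr=1))
```

Because the bound is pessimistic, a schedule it accepts is safe under the stated assumptions, while a schedule it rejects might still be feasible under a tighter analysis.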
14Illustrative Example: Program Composed of 3 Jobs
[Figure: jobs v1, v2, and v3 mapped to PE1, PE2, and PE3, labeled with a1-a3, d1-d3, and t1-t3, with communication costs c12 and c23]
15Illustrative Example Key Points
- Incorporate additional timing costs: checkpointing and rollback recovery
- Examine the effect of fault occurrences for all predecessor jobs
- Compare each predecessor's WCFT with the arrival time of the current job
- Divide the problem into three cases
16Illustrative Example Case 1
[Figure labels: a2; wcft1(n1) + c12; n1; k]
17Illustrative Example Case 2
[Figure labels: wcft1(n1) + c12; a2; n1; k]
18Illustrative Example Case 3
[Figure labels: wcft1(n1) + c12; a2; k; n1]
19Theoretical Basis
20Simulation Results
- E3S benchmarks
- Embedded system synthesis benchmarks suite
- Automotive systems, telecommunications and
consumer electronics
214. Incorporating DVS
- System model
- Message-passing system composed of identical PEs
- HW/SW co-synthesis tool: CORDS
- Each processor has l variable speeds f1, f2, ..., fl
- Two-step method
- Step 1: Pre-processing using CORDS (under fault-free conditions and highest processor speed)
- Allocation of PEs and communication links
- Assignment of jobs and communications to PEs/links
- Valid task scheduling
- Step 2: Determine checkpointing interval and speed scaling in the presence of faults
- All jobs meet deadlines
- Energy saving without violating deadlines
22Incorporating DVS (II)
a1? c01 2 a2? c02 5 a3? c03 3
Key point: Speed is scaled down without delaying any successor jobs
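The key point can be illustrated with a simple speed-selection rule: pick the slowest available speed at which the job, even with k faults, still finishes before the latest time that leaves its successors undisturbed. This is a sketch under stated assumptions, not the authors' algorithm: speeds are normalized so that 1.0 is the highest speed, the per-fault `penalty` is treated as independent of speed, and `latest_finish` is assumed to be supplied by the schedule produced in step 1.

```python
def lowest_feasible_speed(speeds, cycles, start, latest_finish, k, penalty):
    """Slowest speed that still meets latest_finish with k faults, or None.

    speeds: available normalized speeds; cycles: work at speed 1.0;
    start: job start time; penalty: assumed worst-case delay per fault.
    """
    for f in sorted(speeds):  # try the slowest (lowest-energy) speed first
        worst_finish = start + cycles / f + k * penalty
        if worst_finish <= latest_finish:
            return f
    return None  # even the highest speed cannot tolerate k faults in time


# Hypothetical job: 10 time units of work at full speed, up to 2 faults.
SPEEDS = [0.5, 0.75, 1.0]
print(lowest_feasible_speed(SPEEDS, cycles=10, start=0, latest_finish=30, k=2, penalty=5))
print(lowest_feasible_speed(SPEEDS, cycles=10, start=0, latest_finish=25, k=2, penalty=5))
```

Tightening `latest_finish` forces a higher speed, and when no speed qualifies the function returns `None`, signalling that the job cannot tolerate k faults within its window.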
23Simulation Setup
- AMD K6 processor
- Operate at 1.4 V, 1.5 V, and 1.6 V
- Consider 4 schemes
- without checkpointing and DVS (S1)
- with checkpointing but without DVS (S2)
- without checkpointing and with DVS (S3)
- with checkpointing and DVS (S4)
24Simulation Results
- Checkpointing enhances fault tolerance capability
- DVS achieves energy savings
- 13.3% energy reduction
- Checkpointing is cost-effective
- Negligible energy cost (< 1%)
255. Conclusions
- Goal: distributed embedded systems with
- Real-time responsiveness
- Fault tolerance
- Low energy consumption
- Contributions: a unified approach for fault tolerance and energy saving
- Fault tolerance achieved through synchronized checkpointing scheme
- Energy saving achieved through DVS without violating deadlines
26Future Work
- More general real-time task model
- More efficient energy saving
- Improvement of checkpointing algorithm
- Further integration with synthesis tools