Title: Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms
1. Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms
- Lin Huang, Feng Yuan and Qiang Xu
- Reliable Computing Laboratory
- Department of Computer Science and Engineering
- The Chinese University of Hong Kong
- DATE'09
2. Lifetime Reliability of Embedded Multiprocessor Platform
- Multiprocessor system-on-a-chip (MPSoC)
- Platform-based design
- Hardware/software co-synthesis
- Reliability issues
- IC product wear-out: lifetime reliability threats
- Time-dependent dielectric breakdown (TDDB), electromigration (EM), stress migration (SM), negative bias temperature instability (NBTI)
- Soft errors
3. Prior Work
- Prior work in reliability-driven task allocation and scheduling
- Assumes a constant failure rate
- Limitation of thermal-aware task scheduling
- Might improve the system's lifetime reliability implicitly
- Not readily applicable, especially for heterogeneous MPSoCs
4. Problem Motivation: Example
- Electromigration
- Suppose [condition not recovered], and all other parameters are the same
- P1 ages much faster than P2, dominating the MPSoC lifetime
5. Problem Formulation
- Task allocation and scheduling
- Output
- Aim to maximize the expected service life (mean time to failure, MTTF) of the MPSoC system under the performance constraint
[Figure: Binding; Scheduling]
6. Lifetime Reliability Estimation
- Electromigration
- Denote by R(t) the reliability of a single processor at time t
- Expected service life
- Weibull distribution
- Formula annotations: computed by existing hard error models; reflects some important factors (e.g., architecture properties)
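The Weibull-based estimate on this slide can be illustrated with a minimal sketch. It assumes the standard two-parameter Weibull form R(t) = exp(-(t/scale)^shape), whose mean has the closed form scale * Gamma(1 + 1/shape). The function names and the numeric values are illustrative assumptions, not taken from the slides.

```python
import math

def weibull_reliability(t, scale, shape):
    """R(t) for a two-parameter Weibull distribution (assumed form)."""
    return math.exp(-((t / scale) ** shape))

def weibull_mttf(scale, shape):
    """Closed-form mean of the Weibull distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)

def numeric_mttf(scale, shape, steps=200_000):
    """Midpoint-rule integral of R(t) over [0, 10 * scale], as a cross-check."""
    h = 10.0 * scale / steps
    return h * sum(weibull_reliability((i + 0.5) * h, scale, shape)
                   for i in range(steps))

# Illustrative values only: scale of 10 time units, shape of 2.
CLOSED_FORM = weibull_mttf(10.0, 2.0)
NUMERIC = numeric_mttf(10.0, 2.0)
```

Both values should agree closely, which is why the expected service life can be obtained from the two Weibull parameters without integrating R(t) numerically.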
7. Main Approach: Simulated Annealing
- Solution representation
- (schedule order sequence; resource assignment sequence)
- For example, (0, 1, 3, 2, 4; P2, P2, P2, P1, P1)
- Schedule order sequence: respects the partial order defined by the task graph
- Every solution corresponds to a feasible schedule
- Schedule reconstruction
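Schedule reconstruction from a (schedule order; resource assignment) pair could look like the sketch below. The task graph edges and execution times are hypothetical, since the slides do not list them, and communication cost between processors is ignored. Each task starts as soon as its predecessors have finished and its processor is free.

```python
def reconstruct_schedule(order, assignment, preds, exec_time):
    """Return (start, finish) dicts for tasks visited in schedule order."""
    proc_free = {}   # processor -> time at which it becomes idle
    start, finish = {}, {}
    for task, proc in zip(order, assignment):
        # A task is ready once all of its predecessors have finished.
        ready = max((finish[p] for p in preds.get(task, ())), default=0.0)
        begin = max(ready, proc_free.get(proc, 0.0))
        start[task] = begin
        finish[task] = begin + exec_time[task]
        proc_free[proc] = finish[task]
    return start, finish

# Hypothetical task graph: 0->2, 0->3, 1->3, 2->4, 3->4.
PREDS = {2: [0], 3: [0, 1], 4: [2, 3]}
EXEC_TIME = {0: 2.0, 1: 3.0, 2: 4.0, 3: 1.0, 4: 2.0}

# The solution from the slide: (0, 1, 3, 2, 4; P2, P2, P2, P1, P1).
START, FINISH = reconstruct_schedule(
    [0, 1, 3, 2, 4], ["P2", "P2", "P2", "P1", "P1"], PREDS, EXEC_TIME)
```

Because the order sequence is a valid topological order, every predecessor's finish time is already known when a task is visited, so one pass is enough.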
8. Main Approach: Simulated Annealing
- Transforms of the directed acyclic graph
- Expanded task graph
- Undirected complement graph
- Lemma: Given a valid schedule order, swapping two adjacent nodes leads to another valid schedule order, provided there is an edge between these two nodes in the complement graph
[Figure: Task Graph, Expanded Task Graph, and Complement Graph over tasks T0 to T4]
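The two graph transforms and the lemma can be sketched as follows. Reading the "expanded task graph" as the transitive closure of the task graph is an assumption, as are the example edges. The complement graph then joins every pair of tasks with no precedence relation in either direction.

```python
from itertools import combinations

def expanded_graph(num_tasks, edges):
    """Transitive closure: reach[u][v] is True if u must precede v."""
    reach = [[False] * num_tasks for _ in range(num_tasks)]
    for u, v in edges:
        reach[u][v] = True
    for k in range(num_tasks):          # Warshall-style closure
        for i in range(num_tasks):
            if reach[i][k]:
                for j in range(num_tasks):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def complement_graph(num_tasks, edges):
    """Undirected edges between tasks that are not ordered by any path."""
    reach = expanded_graph(num_tasks, edges)
    return {frozenset((u, v)) for u, v in combinations(range(num_tasks), 2)
            if not reach[u][v] and not reach[v][u]}

def swap_adjacent(order, i, complement):
    """Swap positions i and i+1 if the complement graph allows it, else None."""
    if frozenset((order[i], order[i + 1])) not in complement:
        return None
    swapped = list(order)
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return swapped

# Hypothetical task graph: 0->2, 0->3, 1->3, 2->4, 3->4.
EDGES = [(0, 2), (0, 3), (1, 3), (2, 4), (3, 4)]
COMPLEMENT = complement_graph(5, EDGES)
```

For this example, tasks 3 and 2 are independent, so they can be swapped in the order (0, 1, 3, 2, 4). Tasks 1 and 3 are linked by a dependency, so that swap is rejected.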
9. Main Approach: Simulated Annealing
- Theorem: Starting from a valid schedule order, we are able to reach any other valid schedule order after a finite number of adjacent swaps
- For example
[Figure: example schedule orders (2, 3, 0, 4, 1), (0, 2, 3, 4, 1), (2, 0, 3, 4, 1), (0, 2, 3, 1, 4), shown with the Task Graph, Expanded Task Graph, and Complement Graph]
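The theorem can be checked by brute force on a small instance. The sketch below uses a hypothetical five-task graph and explores every order reachable through valid adjacent swaps. A swap is kept only if the result still respects every precedence edge. For adjacent nodes in a valid order, this is the same condition as the two nodes being joined in the complement graph.

```python
from collections import deque
from itertools import permutations

def is_valid(order, edges):
    """True if every edge (u, v) has u scheduled before v."""
    pos = {task: i for i, task in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in edges)

def reachable_orders(start, edges):
    """Breadth-first search over valid orders using adjacent swaps only."""
    seen = {tuple(start)}
    queue = deque([tuple(start)])
    while queue:
        order = queue.popleft()
        for i in range(len(order) - 1):
            nxt = list(order)
            nxt[i], nxt[i + 1] = nxt[i + 1], nxt[i]
            nxt = tuple(nxt)
            if nxt not in seen and is_valid(nxt, edges):
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical task graph: 0->2, 0->3, 1->3, 2->4, 3->4.
EDGES = [(0, 2), (0, 3), (1, 3), (2, 4), (3, 4)]
ALL_VALID = {p for p in permutations(range(5)) if is_valid(p, EDGES)}
REACHED = reachable_orders((0, 1, 3, 2, 4), EDGES)
```

The search reaches every valid order of this graph, which is what lets simulated annealing explore the whole space of schedule orders with a single local move.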
10. Main Approach: Simulated Annealing
- Moves
- M1: Swap two adjacent nodes in both the schedule order sequence and the resource assignment sequence, if there is an edge between these two nodes in the complement graph
- M2: Swap two adjacent nodes in the resource assignment sequence
- M3: Change the resource assignment of a task
[Figure: Task Graph, Expanded Task Graph, and Complement Graph over tasks T0 to T4]
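A minimal sketch of the three moves on a (schedule order; resource assignment) solution follows. The complement graph is hard-coded for a hypothetical task graph, and the function names are illustrative.

```python
def move_m1(order, assign, i, complement):
    """M1: swap positions i and i+1 in both sequences, if allowed."""
    if frozenset((order[i], order[i + 1])) not in complement:
        return None  # the two tasks are ordered by a dependency
    new_order, new_assign = list(order), list(assign)
    new_order[i], new_order[i + 1] = new_order[i + 1], new_order[i]
    new_assign[i], new_assign[i + 1] = new_assign[i + 1], new_assign[i]
    return new_order, new_assign

def move_m2(order, assign, i):
    """M2: swap positions i and i+1 in the resource assignment only."""
    new_assign = list(assign)
    new_assign[i], new_assign[i + 1] = new_assign[i + 1], new_assign[i]
    return list(order), new_assign

def move_m3(order, assign, i, processor):
    """M3: reassign the task at position i to another processor."""
    new_assign = list(assign)
    new_assign[i] = processor
    return list(order), new_assign

# Hypothetical complement graph: pairs of tasks with no dependency.
COMPLEMENT = {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))}
ORDER = [0, 1, 3, 2, 4]
ASSIGN = ["P2", "P2", "P2", "P1", "P1"]
```

M1 changes the execution order while each task keeps its processor. M2 and M3 change only which processor runs a task.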
11. Main Approach: Simulated Annealing
- Three moves are defined, so that
- Starting from a valid schedule order A, we are able to reach any other valid schedule order B after a finite number of adjacent swaps
- Cost function
- First term guarantees that a schedule meets all task deadlines
- Second term indicates the system lifetime
[Cost formula annotation: significantly large]
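The exact cost formula did not survive on this slide, so the following is only one plausible shape for it. It uses a deadline-violation term weighted by a very large constant, minus the system MTTF, together with the usual Metropolis acceptance rule of simulated annealing. The constant PENALTY and all names are assumptions.

```python
import math
import random

PENALTY = 1e9  # hypothetical "significantly large" weight

def cost(finish_times, deadlines, mttf):
    """Lower is better: infeasible schedules are pushed far above feasible ones."""
    violated = any(finish_times[task] > limit
                   for task, limit in deadlines.items())
    return PENALTY * (1 if violated else 0) - mttf

def accept(delta, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept worse moves."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)

# Illustrative values: task 4 must finish by time 10.
FEASIBLE = cost({4: 8.0}, {4: 10.0}, mttf=5.0)
INFEASIBLE = cost({4: 12.0}, {4: 10.0}, mttf=50.0)
```

With a large enough penalty, any schedule that misses a deadline costs more than every feasible one, so the search ends up maximizing lifetime among feasible schedules.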
12. Main Approach: Simulated Annealing
- Key problem: computation time
- Sources of time overhead
- The temperature simulator runs EVERY TIME we reach a new solution
- The simulator is called 3105 times
- Every time, the temperature variation is traced for the entire service life
- The service life is in the range of years
- Accurate calculation requires a fine-grained variation trace file
- Temperature can rise or fall significantly within a very short time
- An efficient cost computation strategy is essential!
[Table: SA parameters]
13. Revisit System Lifetime Reliability Estimation: Speedup I
- It would be better if we could compute the MTTF by tracing the temperature variation of only one period
15. Revisit System Lifetime Reliability Estimation: Speedup I
- Given [equation not recovered]
- Aging effect in one period
- Property: the aging effect does not vary from period to period
- This property enables us to trace the temperature variation of only ONE period
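The period-invariance property can be illustrated with a small sketch. It assumes that aging accumulates as the sum of duration / scale(T) over the temperature segments of a period, and that the scale follows an Arrhenius-type temperature dependence. The activation energy and prefactor are hypothetical values chosen only to make the example run.

```python
import math

BOLTZMANN_EV = 8.617e-5    # Boltzmann constant in eV/K
ACTIVATION_EV = 0.9        # hypothetical activation energy
SCALE_PREFACTOR = 1.0e-6   # hypothetical technology-dependent constant

def em_scale(temp_k):
    """Assumed Weibull scale parameter at absolute temperature temp_k."""
    return SCALE_PREFACTOR * math.exp(ACTIVATION_EV / (BOLTZMANN_EV * temp_k))

def aging(segments):
    """Accumulated aging over a list of (duration, temperature in K) segments."""
    return sum(duration / em_scale(temp_k) for duration, temp_k in segments)

# One period of a periodic schedule: busy, hot phase, then idle.
ONE_PERIOD = [(2.0, 340.0), (3.0, 355.0), (5.0, 325.0)]
AGING_ONE = aging(ONE_PERIOD)
AGING_THREE = aging(ONE_PERIOD * 3)
```

Since the same temperature profile repeats, the aging over any number of periods is that number times AGING_ONE, so tracing a single period is enough.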
16. Revisit System Lifetime Reliability Estimation: Speedup I
- The expected service life of one processor is [equation not recovered]
- Provided there are no redundant processors in the system, the expected service life of the entire system is [equation not recovered]
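The formulas on this slide were lost, so the sketch below is a reconstruction under stated assumptions rather than the authors' expressions. It assumes Weibull reliability with a shape parameter shared by all processors and a period much shorter than the lifetime. A processor with aging A per period of length t_p then behaves like a Weibull unit with effective scale t_p / A. A system without redundancy fails at the first processor failure, so the processor reliabilities multiply.

```python
import math

def processor_mttf(period, aging_per_period, shape):
    """Approximate MTTF of one processor from its aging in one period."""
    effective_scale = period / aging_per_period
    return effective_scale * math.gamma(1.0 + 1.0 / shape)

def system_mttf(period, agings, shape):
    """MTTF of a series system (no redundancy) with a common shape parameter."""
    # The product of exp(-(t * a_i / period)^shape) is again a Weibull.
    combined_rate = sum((a / period) ** shape for a in agings) ** (1.0 / shape)
    return math.gamma(1.0 + 1.0 / shape) / combined_rate

# Illustrative values: P1 ages four times faster than P2.
P1 = processor_mttf(1.0, 4.0e-4, 2.0)
P2 = processor_mttf(1.0, 1.0e-4, 2.0)
SYSTEM = system_mttf(1.0, [4.0e-4, 1.0e-4], 2.0)
```

The system value lies slightly below the MTTF of the fastest-aging processor, which matches the earlier observation that one processor can dominate the MPSoC lifetime.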
17. Revisit System Lifetime Reliability Estimation: Speedup II
- Given [equation not recovered]
- Instead of computing the aging effect in every period, we propose to compute the aging effect of several periods at one time
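One way to picture this speedup is to sum the reliability over blocks of several periods instead of period by period. The sketch below does this with a midpoint rule and is not the derivation from the talk. The Weibull form, the parameter values, and the stopping threshold are assumptions.

```python
import math

def mttf_stepwise(period, aging_per_period, shape, periods_per_step):
    """Sum reliability over blocks of several periods (midpoint rule)."""
    step = periods_per_step * period
    aging_per_step = periods_per_step * aging_per_period
    total, j = 0.0, 0
    while True:
        # Reliability at the middle of block j, from the accumulated aging.
        reliability = math.exp(-(((j + 0.5) * aging_per_step) ** shape))
        if reliability < 1e-15:
            break
        total += step * reliability
        j += 1
    return total

# Illustrative values: aging of 1e-4 per period, shape parameter 2.
CLOSED_FORM = (1.0 / 1.0e-4) * math.gamma(1.5)
EVERY_PERIOD = mttf_stepwise(1.0, 1.0e-4, 2.0, periods_per_step=1)
EVERY_100 = mttf_stepwise(1.0, 1.0e-4, 2.0, periods_per_step=100)
```

With these values, handling 100 periods per step needs about one hundredth of the iterations and still matches the closed-form value closely.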
18. Revisit System Lifetime Reliability Estimation: Speedup III
- Accurate calculation requires setting the length of the time intervals to a very small value
- Use the steady temperature rather than the accurate temporal temperature
[Figure: Temperature Variation Example; Task Schedule]
19. Revisit System Lifetime Reliability Estimation: Speedup IV
- Otherwise, the temperature simulator must run every time we reach a new solution
- There can be at most [expression not recovered] kinds of processor usage combinations in task schedules
- For example, with 3 task power types and 4 processors, only 255 pre-computations are needed, each for a steady temperature
- Estimate the processors' temperatures for the various processor usage combinations in the pre-calculation phase only
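The figure of 255 pre-computations is consistent with the following reading, which is an assumption. Each of 4 processors is either idle or running a task of one of 3 power types, and the all-idle case is excluded, giving (3 + 1)^4 - 1 = 255 combinations. The sketch enumerates them and fills a lookup table. The function toy_steady_temperature is a hypothetical stand-in for a real thermal simulator.

```python
from itertools import product

def usage_combinations(num_processors, num_power_types):
    """All per-processor states (0 = idle, 1..k = power type), minus all-idle."""
    states = range(num_power_types + 1)
    return [combo for combo in product(states, repeat=num_processors)
            if any(combo)]

def toy_steady_temperature(combo):
    """Hypothetical thermal model: self-heating plus coupling from neighbors."""
    ambient_k = 318.0
    total = sum(combo)
    return tuple(ambient_k + 4.0 * state + 1.0 * (total - state)
                 for state in combo)

def precompute(combos, steady_temperature):
    """Call the (expensive) temperature estimator once per combination."""
    return {combo: steady_temperature(combo) for combo in combos}

COMBOS = usage_combinations(num_processors=4, num_power_types=3)
TABLE = precompute(COMBOS, toy_steady_temperature)
```

During annealing, each time slot of a candidate schedule can then be resolved with a dictionary lookup instead of a simulator run.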
20. Revisit System Lifetime Reliability Estimation: Speedup IV
- Time slot
- The set of processors in use
- The power consumption of the tasks running on these processors
- Categorize the tasks into types according to power consumption
- E.g., [example not recovered]
[Figure labels: index of the processors in use; task power consumption type]
21. Revisit System Lifetime Reliability Estimation: Speedup IV
- Pre-calculate the steady temperature of each processor in each time slot
- The aging effect in unit time in this case is therefore [equation not recovered]
- The aging effect of P1 in this schedule in a period is [equation not recovered]
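The per-period aging of one processor can then be accumulated slot by slot from pre-calculated steady temperatures. In the sketch below, the table entries, the Arrhenius-type aging rate, and its constants are all hypothetical. Each slot is described by its duration and by the usage combination active during it.

```python
import math

BOLTZMANN_EV = 8.617e-5    # Boltzmann constant in eV/K
ACTIVATION_EV = 0.9        # hypothetical activation energy
SCALE_PREFACTOR = 1.0e-6   # hypothetical technology-dependent constant

def aging_rate(temp_k):
    """Assumed aging effect per unit time at a steady temperature."""
    scale = SCALE_PREFACTOR * math.exp(ACTIVATION_EV / (BOLTZMANN_EV * temp_k))
    return 1.0 / scale

def period_aging(slots, table, proc):
    """Sum duration * aging rate over the time slots of one period."""
    return sum(duration * aging_rate(table[combo][proc])
               for duration, combo in slots)

# Hypothetical pre-calculated steady temperatures (K) for two processors,
# keyed by (power type on P1, power type on P2), with 0 meaning idle.
TABLE = {(1, 0): (330.0, 320.0), (1, 2): (335.0, 345.0), (0, 2): (322.0, 340.0)}
SLOTS = [(2.0, (1, 0)), (3.0, (1, 2)), (1.0, (0, 2))]

AGING_P1 = period_aging(SLOTS, TABLE, proc=0)
AGING_P2 = period_aging(SLOTS, TABLE, proc=1)
```

No thermal simulation runs inside this loop, which is what keeps the cost evaluation cheap for each new solution.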
22. Revisit System Lifetime Reliability Estimation: Summary
- A summary of speedup techniques
- Rewrite the MTTF expression in terms of the aging effect in one period
- Compute the aging effect of several periods at one time
- Approximate the aging effect in one period based on the task changes, using steady temperatures
- Call the temperature estimation simulator in the pre-calculation phase only
- The time consumption of pre-calculation can be reduced even further
23. Experimental Setup
- Random task graphs generated by TGFF
- Task numbers range from 20 to 260
- Hypothetical MPSoC platforms
- Processor core numbers range from 2 to 8
- Homogeneous / heterogeneous
- Take the electromigration model in Goel-IEEEPress07 as an example
- Note that our model also applies to other failure mechanisms
- Compare our method with the thermal-aware task scheduling algorithm proposed in Xie-JVLSISP06
24. Accuracy
- Comparison between the approximated MTTF and the accurate value
25. Lifetime Reliability of Various Platforms with Various Task Graphs
- Difference ratio between the MTTF obtained by simulated annealing and that obtained by thermal-aware scheduling
- DR: Deadline Relaxation
26. Lifetime Reliability of 8-Processor Platforms
27. Efficiency
- The simulated annealing process requires 50-200 s of CPU time on an Intel(R) Core(TM) 2 CPU at 2.13 GHz for each case
- 4 processors, 49 tasks: 84 s
- 8 processors, 101 tasks: 158 s
- The CPU time spent on pre-calculation ranges from 3 s to 160 s
28. Conclusion
- Technology advancement has had an adverse impact on the lifetime reliability of MPSoC embedded systems
- Prior work on task allocation and scheduling does not explicitly take wear-out failures into account
- We propose an analytical model to estimate the lifetime reliability of multiprocessor platforms under periodic tasks
- We present a novel lifetime reliability-aware algorithm based on the simulated annealing technique
- We propose several speedup techniques to simplify the design space exploration process with satisfactory solution quality
- Experimental results demonstrate the effectiveness of the proposed approach