Title: Scheduling Task Dependence Graphs with Variable Task Execution Times onto Heterogeneous Multiprocess
1 Scheduling Task Dependence Graphs with Variable Task Execution Times onto Heterogeneous Multiprocessors
- N. R. Satish
- K. Ravindran
- K. Keutzer
- University of California at Berkeley
2 Outline
- Motivation
- Static Scheduling
- Statistical Variations in real-life applications
- Statistical Scheduling
- Optimization Techniques
- Results
4 Static task allocation and scheduling
- Important step in deploying concurrent applications on multiprocessors
- Key components
- Allocate tasks to processors
- Schedule task executions in time
- Assume knowledge of workload and parallel tasks at compile time
- Relevant when dynamic scheduling is prohibitive
- Viable for design space exploration
[Figure: design flow with blocks Application description, Multiprocessor platform, Task Graph, Platform Constraints, Allocation/Scheduling, Implementation]
5 Limitations of static models
- Static models do not capture variations in task execution times and dependencies
- Dependence on inputs
- Variations in memory access time due to cache effects [1]
- Static scheduling methods rely on corner-case or average-case estimates
- Worst-case estimates are used in hard real-time apps
- Soft real-time applications (video encoding/decoding, networking, gaming) only require statistical guarantees on latency and throughput, which are hard to provide with static methods
[1] P. Moge and A. Kalavade, "A Tool for Performance Estimation of Networked Embedded End-Systems," DAC '98, pp. 257-262, 1998.
6 Proposed solution
- Capture runtime variations using a statistical model for task execution times
- Compute statistical distributions for the metric of interest (schedule length/makespan)
- Compute percentiles to provide statistical guarantees
8 Static Scheduling Model
- Task dependence graph
- G = (V, E)
- DAG with V = set of tasks, E = task dependencies
- w(v,p): execution time of task v on processor p
- each task is executed sequentially without preemption
- c(e,l): communication delay along edge e
- incurred when tasks (u,v) on edge e communicate over communication link l
- Multiprocessor model
- P: set of processors
- Connected by communication links
[Figure: task dependence graph for IPv4 packet forwarding]
[Figure: architecture model for a pipeline of three processors instantiated on Xilinx FPGA; labels P1, P2, P3, M1, M2]
- Restrictions
- Lookups have to be on P2 or P3
- Recv must be on P1
- Send must be on P3
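The model above can be written down as a small data structure. The following is only a sketch: the class name `TaskGraph`, its field names, and all execution times are assumptions made for illustration and are not taken from the slides. Only the placement restrictions (Recv on P1, lookups on P2 or P3, Send on P3) come from the text.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """G = (V, E) with per-processor execution times (hypothetical layout)."""
    tasks: list[str]                          # V
    edges: list[tuple[str, str]]              # E, as (u, v) pairs
    w: dict[tuple[str, str], float]           # w(v, p): (task, processor) -> time
    # c(e, l): (u, v, link) -> communication delay; empty means "ignored"
    c: dict[tuple[str, str, str], float] = field(default_factory=dict)
    # placement restrictions: task -> processors it may run on
    allowed: dict[str, set[str]] = field(default_factory=dict)

    def candidates(self, task: str, processors: list[str]) -> list[str]:
        """Processors on which `task` may be allocated."""
        permitted = self.allowed.get(task)
        return [p for p in processors if permitted is None or p in permitted]

    def predecessors(self, task: str) -> list[str]:
        """Tasks that must finish before `task` can start."""
        return [u for (u, v) in self.edges if v == task]


processors = ["P1", "P2", "P3"]
g = TaskGraph(
    tasks=["Recv", "Lookup 1", "Send"],
    edges=[("Recv", "Lookup 1"), ("Lookup 1", "Send")],
    # Illustrative execution times only.
    w={("Recv", "P1"): 5.0, ("Lookup 1", "P2"): 20.0,
       ("Lookup 1", "P3"): 20.0, ("Send", "P3"): 5.0},
    allowed={"Recv": {"P1"}, "Lookup 1": {"P2", "P3"}, "Send": {"P3"}},
)
print(g.candidates("Lookup 1", processors))
```

Keeping restrictions separate from `w` is one possible design choice; an alternative would be to omit `w` entries for forbidden task-processor pairs.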
9 Optimization Problem Definition
- Find
- Allocation
- Schedule
- Subject to
[Figure: example schedule with makespan = 165; the longest path gives the makespan]
11 Variability in IPv4 packet forwarding
- IPv4 forwarding involves route table lookup
- Longest prefix match lookup on a trie table
- Number of IP address bits required for lookup can vary
- We implement IPv4 on a soft multiprocessor on FPGA: minimal variation due to architecture
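To see why the lookup cost depends on the input, consider a minimal one-bit-per-level trie. This is an illustrative sketch, not the implementation used in the talk: the names `TrieNode`, `insert`, and `lookup`, and the example prefixes, are assumptions. The lookup returns both the matched next hop and the number of address bits it had to examine, which stands in for the variable execution time.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # "0" or "1" -> TrieNode
        self.next_hop = None  # set if a route prefix ends at this node


def insert(root, prefix_bits, next_hop):
    """Add a route for the given prefix (a string of '0'/'1')."""
    node = root
    for b in prefix_bits:
        node = node.children.setdefault(b, TrieNode())
    node.next_hop = next_hop


def lookup(root, addr_bits):
    """Longest prefix match; returns (next_hop, bits_examined)."""
    node, best, examined = root, root.next_hop, 0
    for b in addr_bits:
        node = node.children.get(b)
        if node is None:          # no longer prefix exists: stop early
            break
        examined += 1
        if node.next_hop is not None:
            best = node.next_hop  # remember the longest match so far
    return best, examined


root = TrieNode()
insert(root, "10", "A")      # hypothetical routes
insert(root, "1011", "B")
print(lookup(root, "10110000"))  # ('B', 4)
print(lookup(root, "10000000"))  # ('A', 2)
```

Two addresses hitting the same table walk different depths, so the same Lookup task takes a different amount of work per packet.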
12 Example: H.264 video decoding
- H.264 video is organized into frames and 16x16-pixel macroblocks (MBs)
- H.264 standard contains two types of MBs
- Intra MB: depends on decoded neighboring MBs in the current frame
- Predicted MB: depends on MBs from previously decoded frames
- Both types contain an input-dependent, variable amount of residual data
[Figure: task graphs for decoding in intra (I) and predicted (P) frames; panels "Intra Frame" and "Predicted Frame"; stages Parsing, Decoding, Filtering]
13 Variations in H.264 decoding
- Probabilistic dependencies
- At compile time, we can only get probabilities of a macroblock being an I- or P-macroblock
- Probabilistic execution times
- Execution times of I and P macroblocks depend on residual data present in the macroblock and also on memory access time variations
Prob(MB is of type I) = p => Prob. of existence of each edge = p
[Figure: variation in I- (a) and P- (b) macroblock execution times in H.264 decoding]
15 Statistical Optimization
- Optimization metrics like throughput will be distributions
- Typically optimize for a fixed QoS (soft real-time applications like media and networking)
- For instance, we may want to define the makespan as the 99th percentile of completion time
- We could consider worst-case execution times, but that is too conservative, while average-case could be too optimistic
16 Statistical Models
- Model task execution times using distributions to account for variations
- Simulation based model
- Use discretized bins for storing the pdf
- Task dependencies may be probabilistic
- edges may have probabilities associated with them
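A discretized-bin pdf can be sketched as a normalized histogram, from which a percentile is read off the cumulative mass. The function names, the equal-width binning, and the sample values below are assumptions for illustration; the slides do not specify the bin layout.

```python
def build_pdf(samples, lo, hi, nbins):
    """Store samples as a pdf over `nbins` equal-width bins on [lo, hi]."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for x in samples:
        i = int((x - lo) / width)
        counts[min(max(i, 0), nbins - 1)] += 1   # clamp into range
    return [c / len(samples) for c in counts], width


def percentile_from_pdf(pdf, lo, width, q):
    """Upper edge of the first bin at which cumulative mass reaches q."""
    cum = 0.0
    for i, mass in enumerate(pdf):
        cum += mass
        if cum >= q - 1e-12:      # small tolerance for float rounding
            return lo + (i + 1) * width
    return lo + len(pdf) * width


# Hypothetical profiled execution times of one task.
pdf, width = build_pdf([10, 10, 20, 30], lo=0, hi=40, nbins=4)
print(pdf)                                      # [0.0, 0.5, 0.25, 0.25]
print(percentile_from_pdf(pdf, 0, width, 0.5))  # 20.0
```

Returning the upper bin edge makes the percentile estimate conservative by at most one bin width; finer bins trade memory for accuracy.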
17 Statistical Analysis: Monte Carlo simulations
- Given a schedule, compute the finish time (makespan) distribution
- Add edges corresponding to total ordering of tasks within a processor (ordering edges)
- Longest path problem on the graph with ordering edges
- Monte Carlo analysis
Deterministic worst case = 190; deterministic average case = 125
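The analysis steps above can be sketched as follows. This is a simplified illustration under stated assumptions: communication delays and probabilistic edges are ignored, and the function names, the `sample_time` callback, and the example graph and times are hypothetical. `graphlib` is part of the Python standard library (3.9+).

```python
import random
from graphlib import TopologicalSorter


def makespan_samples(deps, order, sample_time, trials=2000, seed=0):
    """deps: task -> list of predecessors; order: processor -> task sequence."""
    rng = random.Random(seed)
    preds = {v: set(ps) for v, ps in deps.items()}
    for seq in order.values():                 # add ordering edges
        for a, b in zip(seq, seq[1:]):
            preds.setdefault(b, set()).add(a)
            preds.setdefault(a, set())
    topo = list(TopologicalSorter(preds).static_order())
    results = []
    for _ in range(trials):
        finish = {}
        for v in topo:                         # longest path in topological order
            start = max((finish[u] for u in preds.get(v, ())), default=0.0)
            finish[v] = start + sample_time(v, rng)
        results.append(max(finish.values()))
    return results


def percentile(samples, q):
    s = sorted(samples)
    return s[min(len(s) - 1, int(q * len(s)))]


deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
order = {"P1": ["A", "B", "D"], "P2": ["C"]}
bounds = {"A": (8, 12), "B": (15, 25), "C": (20, 40), "D": (4, 6)}  # illustrative
samples = makespan_samples(
    deps, order, lambda v, rng: rng.uniform(*bounds[v]))
print(percentile(samples, 0.99), max(samples))
```

Any per-task sampler can be plugged in through `sample_time`, for example one that draws from a binned pdf instead of a uniform range.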
18 Statistical Analysis chooses better schedules
- Det. avg. case = 125 (opt), det. worst case = 190, 99th percentile = 170
- Det. avg. case = 150, det. worst case = 165 (opt), 99th percentile = 150
- Det. avg. case = 125, det. worst case = 165, 99th percentile = 145 (opt)
Worst- and average-case scheduling can judge sub-optimal solutions to be optimal!
20 Techniques for Statistical Optimization
- Heuristics
- List scheduling
- Clustering
- Force directed scheduling
- Evolutionary Algorithms
- Simulated Annealing
- Hill climbing, tabu search, genetic algorithms, ant-colony optimization
- Exact constraint optimization based techniques
- Based on mathematical programming
21 Static List Scheduling Example (DLS)
while (not all tasks are scheduled)
    Compute a priority level for task-processor pairs
    Select the task-processor pair with highest priority and schedule the task to that processor
static_level(v) = delay of the longest path from v to snk
priority_level(v,p) = static_level(v) - max(earliest_start_time(v,p), earliest_ready_time(p))
[Figure: algorithm execution on the task dependence graph (Recv, Verify TTL, Verify checksum, Lookup 1 to Lookup 7, Update TTL, Update checksum, Send), annotated with task delays and static levels]
Worst-case makespan = 205; 99th percentile of makespan = 185; optimum 99th percentile = 145
Ref: G. C. Sih and E. A. Lee, A Compile-time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures, IEEE Trans. Parallel Distrib. Syst. 4, 6, June 1993, pp 75-87.
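The loop and priority formula above can be sketched in a few lines. This is a simplified illustration, not the full heuristic from the cited paper: communication delays and placement restrictions are ignored, the static level uses the mean execution time over processors (an assumption), and the function names and example times are hypothetical.

```python
def static_levels(succs, est):
    """static_level(v): longest path delay from v to the sink."""
    memo = {}

    def level(v):
        if v not in memo:
            memo[v] = est[v] + max((level(s) for s in succs.get(v, ())),
                                   default=0.0)
        return memo[v]

    return {v: level(v) for v in est}


def dls(tasks, preds, w, processors):
    """Returns (allocation, makespan) for deterministic times w[(v, p)]."""
    succs = {}
    for v in tasks:
        for u in preds.get(v, ()):
            succs.setdefault(u, []).append(v)
    est = {v: sum(w[v, p] for p in processors) / len(processors) for v in tasks}
    sl = static_levels(succs, est)
    finish, alloc = {}, {}
    proc_ready = {p: 0.0 for p in processors}
    while len(finish) < len(tasks):
        ready = [v for v in tasks if v not in finish
                 and all(u in finish for u in preds.get(v, ()))]
        best = None
        for v in ready:
            data_ready = max((finish[u] for u in preds.get(v, ())), default=0.0)
            for p in processors:
                start = max(data_ready, proc_ready[p])
                prio = sl[v] - start           # priority_level(v, p)
                if best is None or prio > best[0]:
                    best = (prio, v, p, start)
        _, v, p, start = best
        alloc[v], finish[v] = p, start + w[v, p]
        proc_ready[p] = finish[v]
    return alloc, max(finish.values())


tasks = ["A", "B", "C", "D"]
preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
times = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 5.0}   # illustrative
procs = ["P1", "P2"]
w = {(v, p): times[v] for v in tasks for p in procs}
alloc, makespan = dls(tasks, preds, w, procs)
print(alloc, makespan)
```

A statistical variant would replace `sl` and the start-time terms with percentile estimates while leaving the selection loop unchanged.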
22 Problem with static list scheduling
- List scheduling works by using task criticalities
- Static analysis wrongly evaluates critical tasks
[Figure: task dependence graph (Recv, Verify TTL, Verify checksum, Lookup 1 to Lookup 7, Update TTL, Update checksum, Send) annotated with static levels as "worst case / at 99th percentile" pairs: 155/135, 150/130, 130/110, 110/90, 90/70, 70/50]
static_level(v) = delay of the longest path from v to snk
- Solution: use statistically analyzed static levels, proc. finish times
- Rest of algorithm is unchanged: obtain makespan 165 instead of 185
priority_level(v,p) = statistical_static_level(v) - max(statistical_earliest_start_time(v,p), statistical_earliest_ready_time(p))
23 Simulated Annealing (SA)
Inputs: initial state s0, initial temperature t0, final temperature t∞
- Key characteristics
- Function Temp: defines the temperature update function
- Function Move: specifies state neighborhoods
- Function Cost: the optimization objective
- Function Prob: transition acceptance probabilities
- Parameters t0, t∞: initial and final temperatures
Flowchart:
Iteration count i = 0
t_i = Temp(t0, i)
s' = Move(s_i)
Δ = Cost(s') - Cost(s_i)
Δ < 0 or Prob(Δ, t_i) > Rand(0,1)? Yes: accept transition s_i = s', update best seen state
Increment i
t_i < t∞? No: repeat from the temperature update. Yes: output best state
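The flowchart corresponds to a loop along the following lines. The slides leave Temp and Prob unspecified, so this sketch assumes geometric cooling for Temp and the Metropolis rule exp(-Δ/t) for Prob; the function name `anneal`, its parameters, and the toy cost function are also assumptions.

```python
import math
import random


def anneal(s0, cost, move, t0=100.0, t_final=0.1, alpha=0.95, seed=0):
    """Generic SA loop; returns (best_state, best_cost)."""
    rng = random.Random(seed)
    s, c = s0, cost(s0)
    best, best_c = s, c
    i, t = 0, t0
    while t >= t_final:                       # stop once t_i < final temperature
        cand = move(s, rng)                   # s' = Move(s_i)
        cand_c = cost(cand)
        delta = cand_c - c                    # Δ = Cost(s') - Cost(s_i)
        # Assumed Prob: Metropolis acceptance exp(-Δ / t).
        if delta < 0 or math.exp(-delta / t) > rng.random():
            s, c = cand, cand_c               # accept transition
            if c < best_c:
                best, best_c = s, c           # update best seen state
        i += 1
        t = t0 * alpha ** i                   # assumed Temp: geometric cooling
    return best, best_c


# Toy usage: minimize (x - 7)^2 over the integers with +/-1 moves.
cost = lambda x: (x - 7) ** 2
move = lambda x, rng: x + rng.choice((-1, 1))
best, best_c = anneal(0, cost, move)
print(best, best_c)
```

For scheduling, `cost` would be the chosen makespan percentile of a schedule and `move` would perturb the allocation and ordering.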
24 SA Strategy for Statistical Scheduling
- State space: set of valid allocations and schedules
- Cost: the required percentile of the makespan of a valid allocation and schedule (statistical analysis / deterministic worst-case analysis)
- Move: follow the approach for static scheduling in Koch et al.
- Move a randomly selected task from one processor to a random position in another processor
- Not all positions are acceptable: ordering or dependence constraints may be violated. If so, we undo the move
- Compute the new schedule
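A move of this kind can be sketched as below. This is an illustrative reading of the bullets, not the procedure from Koch et al.: the state layout (processor -> ordered task list), the function names, and the cycle-based validity check are assumptions. A move is treated as invalid when the dependence edges plus the per-processor ordering edges form a cycle.

```python
import random
from graphlib import TopologicalSorter, CycleError


def is_valid(order, preds):
    """True if dependence edges plus ordering edges form a DAG."""
    graph = {v: set(ps) for v, ps in preds.items()}
    for seq in order.values():
        for a, b in zip(seq, seq[1:]):
            graph.setdefault(b, set()).add(a)
    try:
        TopologicalSorter(graph).prepare()   # raises CycleError on a cycle
        return True
    except CycleError:
        return False


def move(order, preds, rng):
    """Move one random task to a random position on another processor.

    Assumes at least two processors; returns the old state if the move
    violates ordering or dependence constraints (the "undo").
    """
    new = {p: list(seq) for p, seq in order.items()}
    slots = [(p, i) for p, seq in new.items() for i in range(len(seq))]
    src, i = rng.choice(slots)
    task = new[src].pop(i)
    dst = rng.choice([p for p in new if p != src])
    new[dst].insert(rng.randrange(len(new[dst]) + 1), task)
    return new if is_valid(new, preds) else order


preds = {"B": ["A"], "C": ["A"]}                  # hypothetical dependencies
state = {"P1": ["A", "B"], "P2": ["C"]}
rng = random.Random(1)
state = move(state, preds, rng)
print(state)
```

Copying the state before modifying it makes the undo trivial at the cost of extra allocation per move.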
25 Results
- Two sets of experiments
- Practical applications: IPv4 packet forwarding, H.264 video decoding, MPEG2 encoding
- Code profiled for IPv4 on soft MicroBlaze, others on a 2.4 GHz Pentium
- IPv4 task graph is replicated for multiple input ports
- For H.264, we assume knowledge of per-macroblock probabilities of being an I-macroblock
- Random task graphs from Davidovic et al.
- Means taken from benchmarks, standard deviation 0-0.7 x mean
- Compared statistical DLS and statistical SA to worst-case and average-case SA results
26 Results
Statistical scheduling works 10-30% better than static scheduling
- Det. worst-case is about 10-20% off of statistical makespans at the 99th percentile and 30% off at the 90th percentile
- Expected trend, since worst-case estimates become worse predictors of makespan at lower percentiles
- Statistical optimization is customized for a particular percentile value
- More improvement from larger benchmarks
[Figure: percentage makespan difference between det. worst-case and statistical scheduling at different percentiles on realistic apps]
[Figure: average percentage makespan difference between det. worst-case and statistical scheduling at different percentiles for random task graph instances on 4, 6 and 8 processors]
27 Thank you for your attention!
- Questions?