Title: Multi-Objective Scheduling of Streaming Workflows

1 Multi-Objective Scheduling of Streaming Workflows
2nd Scheduling in Aussois Workshop, May 18-21, 2008
Bi-criteria Scheduling of Streaming Workflows
- Naga Vydyanathan (1), Umit V. Catalyurek (2,3),
- Tahsin Kurc (2), P. Sadayappan (1) and Joel Saltz (1,2)
- (1) Dept. of Computer Science and Engineering
- (2) Dept. of Biomedical Informatics
- (3) Dept. of Electrical and Computer Engineering
- The Ohio State University
2 Current and Emerging Applications
Satellite Data Processing
High Energy Physics
Quantum Chemistry
DCE-MRI Analysis
Image Processing
Multimedia
Video Surveillance
Montage
3-9 Challenges
- Complex and diverse processing structures
  (Figure: data analysis applications composed of files and sequential or parallel tasks)
- Varied parallelism
  - Bag-of-tasks applications: task-parallelism
  - Non-streaming workflows: task- and data-parallelism
  - Streaming workflows: task-, data- and pipelined-parallelism
10 Challenges
- Different performance criteria
  - Bag-of-tasks: batch execution time (CCGrid'05, HCW'05, JSSPP'06, HPDC'06)
  - Non-streaming workflows: makespan (ICPP'05, HCW'06, ICPP'06, Cluster'06)
  - Streaming workflows: latency, throughput (SC'02, EuroPar'07, ICPP'08)
- Significant communication/data transfer overheads
11 Scheduling Streaming Workflows
(Figure: taxonomy of data analysis applications — bag-of-tasks applications vs. workflows; workflows split into non-streaming and streaming)
12 Scheduling Streaming Workflows
- Image processing, multimedia, and computer vision applications often act on a stream of input data
- Scheduling challenges
- Multiple performance criteria
- Latency (time to process one data item)
- Throughput (aggregate rate of processing)
- Multiple forms of parallelism
- Pipelined parallelism
- Task parallelism
- Data parallelism
13 An Example Pipelined Schedule
(Figure: Gantt chart of a pipelined schedule for tasks T1-T4 across successive data items, illustrating pipelined, task and data parallelism; Throughput = 0.1, Latency = 37)
14 Optimizing Latency while Meeting Throughput Constraints
- Given
  - A workflow DAG with runtime and data volume estimates
  - A collection of homogeneous processors
  - A throughput constraint
- Goal
  - Generate a schedule that meets the throughput constraint while minimizing workflow latency
This requires leveraging pipelined, task and data parallelism in a coordinated manner.
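The latency objective is the critical-path length of the workflow DAG, counting both task runtimes and inter-task communication costs. A minimal sketch of that computation, with an illustrative dictionary representation (not the authors' actual data structures):

```python
import functools

def workflow_latency(tasks, edges):
    """Critical-path length of a workflow DAG.

    tasks: {task_name: runtime estimate}
    edges: {(src, dst): data transfer cost}
    Latency of one data item = the longest chain of task runtimes plus
    communication costs from any source task to any sink task.
    """
    succ = {}
    for (u, v), cost in edges.items():
        succ.setdefault(u, []).append((v, cost))

    @functools.lru_cache(maxsize=None)
    def longest_from(u):
        # Runtime of u plus its heaviest outgoing (comm + downstream) chain.
        return tasks[u] + max((c + longest_from(v) for v, c in succ.get(u, [])),
                              default=0)

    return max(longest_from(u) for u in tasks)

# A 3-task chain T1 (8) --5--> T2 (10) --2--> T3 (4): latency 8+5+10+2+4 = 29.
print(workflow_latency({"T1": 8, "T2": 10, "T3": 4},
                       {("T1", "T2"): 5, ("T2", "T3"): 2}))
```

Clustering two tasks onto the same processor zeroes the edge cost between them, which is exactly how Phase 3 shortens this critical path.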
15 Pipelined Scheduling Heuristic
- Three-phase approach
- Phase 1: Satisfying the throughput requirement
  - Assumes an unbounded number of processors
  - Employs clustering, replication and duplication to meet the throughput requirement
- Phase 2: Limiting the number of processors used
  - Merges task clusters to reduce the number of processors used until a feasible schedule is obtained
  - Preference given to decisions that minimize latency
- Phase 3: Minimizing the workflow latency
  - Minimizes communication costs along the critical path by duplication and clustering
16 Task Clustering
17 Task Replication
- Throughput = 0.1
- Replication serves to
  - Improve computation throughput
  - Improve communication throughput
(Figure: example DAG with tasks T1-T4; T1 is replicated so the pipeline sustains the required rate)
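The number of copies a task needs follows directly from its weight and the target rate: one copy of a task with execution time w sustains rate 1/w, so the smallest r with r/w >= T suffices. A sketch (the function name is illustrative):

```python
import math

def replicas_needed(exec_time, throughput):
    """Minimum number of task copies so the aggregate processing rate
    r / exec_time meets the required throughput."""
    return max(1, math.ceil(exec_time * throughput))

# With T = 0.1: a task of weight 10 needs 1 copy (rate 1/10 = 0.1 exactly),
# while a task of weight 12 needs 2 copies (one copy only sustains 1/12).
print(replicas_needed(10, 0.1), replicas_needed(12, 0.1))
```

The fractional products exec_time * T (0.8, 1.0, 0.4, ...) are the nr() values used in the worked example later in the deck.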
18 Task Duplication
Sample application DAG; (a) schedule without duplication, (b) schedule with duplication
19 Duplication-based Scheduling of Streaming Workflows
- In the context of streaming workflows, duplication can be used to
  - Avoid bottleneck data transfers without compromising task parallelism
  - Minimize workflow latency
(Figure: example DAG with tasks T1-T4, with T = 0.1 and P = 4; T1 is duplicated so its successors receive their inputs locally)
20 Duplication-based Scheduling of Streaming Workflows
- However,
  - Duplication can require more processors due to redundant computation
    - Depends on the weight of the duplicated task and the throughput constraint
  - Extra communication is needed to broadcast input data to duplicates
    - May increase latency too!
- Selectively duplicate ancestors
- Duplication is done only if
  - There are available processors
  - It proves beneficial in terms of latency
  - It does not involve expensive communications that violate the throughput requirement
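The three conditions above can be phrased as a simple gate. All parameter names below are hypothetical stand-ins for the heuristic's internal estimates, and the third test (broadcast cost must fit in one pipeline cycle, 1/T) is one plausible reading of "expensive communications":

```python
def should_duplicate(free_procs, latency_gain, extra_comm_time, throughput):
    """Accept a candidate duplication only if (1) processors are available,
    (2) the estimated latency improves, and (3) the extra broadcast
    communication still fits in one pipeline cycle (1 / throughput)."""
    return (free_procs > 0
            and latency_gain > 0
            and extra_comm_time <= 1.0 / throughput)

# A duplication that saves latency and whose 4-unit broadcast fits in the
# 10-unit cycle implied by T = 0.1 is accepted.
print(should_duplicate(free_procs=1, latency_gain=5,
                       extra_comm_time=4, throughput=0.1))
```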
21 Estimating Throughput and Latency
- Execution model
  - Realistic k-port communication model
  - Communication-computation overlap
- Throughput estimate: min(CompRate, CommRate)
- Computation rate (CompRate) estimate
  - min over all clusters Ci of Procs(Ci) / exec_time(Ci)
- Communication rate (CommRate) estimate
  - Greedy priority-based scheduling of communications to channels (ports)
  - min over all transfers trj of ParallelTransfers(trj) / min_cycle_time(trj)
- Latency estimate
  - Takes into account both communication and computation dependencies
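The throughput estimate above can be sketched as follows; the tuple-based inputs are a simplification of the scheduler's cluster and transfer records, not the actual API:

```python
def throughput_estimate(clusters, transfers):
    """Estimated pipeline throughput = min(CompRate, CommRate).

    clusters:  list of (procs_assigned, exec_time) per task cluster Ci
    transfers: list of (parallel_transfers, min_cycle_time) per transfer trj
    """
    # Slowest cluster bounds the computation rate.
    comp_rate = min(procs / t for procs, t in clusters)
    # Slowest transfer bounds the communication rate (no transfers: no bound).
    comm_rate = (min(k / cycle for k, cycle in transfers)
                 if transfers else float("inf"))
    return min(comp_rate, comm_rate)

# Two clusters (1 processor each, exec times 8 and 10) and one transfer
# (1 per 5-unit cycle): computation limits the pipeline to 1/10 = 0.1.
print(throughput_estimate([(1, 8), (1, 10)], [(1, 5)]))
```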
22 An Example
(Figure: example DAG with task weights T1 = 8, T2 = 10, T3 = 4, T4 = 5, T5 = 4, T6 = 2 and edge costs)
- P = 4, throughput constraint T = 0.1
- Satisfying the throughput
  - nr(T1) = 0.8, nr(T2) = 1, nr(T3) = 0.4, nr(T4) = 0.5, nr(T5) = 0.4, nr(T6) = 0.2
  - Expensive communications: e(T1,T2), e(T3,T4), e(T3,T5)
  - Cluster T1 and T2
  - Duplicate T3
- Limiting the number of processors
  - P_used = 5
  - Two options to reduce P_used
    - Merging T1, T2 and T6
    - Merging T3, T5 and T6
  - Merge T3, T5 and T6 -> reduces latency
- Minimizing latency
  - Nothing to be done!
23 The Pipelined Schedule
(Figure: Gantt chart of the resulting schedule on processors P1-P4; successive iterations of T1/T2, T3/T4 and T3/T5/T6 are pipelined. Throughput = 0.1, Latency = 28)
24 Performance on Synthetic Benchmarks
(Figures: latency results for (a) CCR = 0.1, (b) CCR = 1, (c) CCR = 10)
- As CCR increases, there are more instances where FCP and EXPERT do not meet the throughput requirement
- The proposed approach always meets the throughput requirement and produces lower latencies
25 Benefit of Task Duplication
(Figures: (a) CCR = 1, (b) CCR = 10)
- As the throughput constraint is relaxed, greater benefit is observed (more processors are available for duplication)
- For a negligible throughput constraint, clustering doesn't have much adverse impact on latency
26 Performance on Applications
(Figures: performance of MPEG video compression on 32 processors, (a) latency ratio and (b) utilization ratio)
- MPEG frames are processed in order of arrival, so no replication is used
- The throughput constraint is assumed to be the reciprocal of the weight of the largest task
- The proposed approach yields similar latency to FCP, but has lower resource utilization
- The proposed approach generates lower latency than EXPERT
27 Throughput Optimization under a Latency Constraint
- Relation between throughput and latency
  - Monotonically increasing, from (L_min, 0) to (L_max, T_max)
- Binary search algorithm on the inverse problem
  - L = latency required
  - If L > L_max, output T_max
  - If L_min < L < L_max, do binary search (starting at T = T_max / 2)
- However, as we use heuristics, the monotonic relation is not guaranteed
- We use look-ahead techniques to avoid local optima
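Assuming the monotonic throughput-latency relation holds, the inverse search can be sketched as below. Here `schedule_latency` stands in for a full run of the latency-minimizing heuristic at a given throughput target, and the look-ahead step that guards against non-monotonic heuristics is omitted:

```python
def max_throughput_under_latency(L, schedule_latency, t_max, tol=1e-4):
    """Binary-search the largest throughput target whose schedule still has
    latency <= L, assuming latency grows with the throughput target."""
    lo, hi, best = 0.0, t_max, None
    while hi - lo > tol:
        t = (lo + hi) / 2.0
        if schedule_latency(t) <= L:   # feasible: push throughput higher
            best, lo = t, t
        else:                          # infeasible: relax the target
            hi = t
    return best

# Toy model where latency rises linearly with the target: the largest
# feasible throughput for L = 30 approaches (30 - 10) / 100 = 0.2.
t = max_throughput_under_latency(30, lambda T: 10 + 100 * T, t_max=1.0)
print(round(t, 3))
```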
28 Throughput Optimization under a Latency Constraint
(Figures: (a) CCR = 0.1, (b) CCR = 1)
- The proposed approach generates schedules with larger throughputs that meet the latency constraints
- It meets latency constraints even when other schemes fail
29 Related Work
- Bag-of-tasks applications
  - H. Casanova, D. Zagorodnov, F. Berman, and A. Legrand. Heuristics for scheduling parameter sweep applications in grid environments. HCW, 2000.
  - A. Giersch, Y. Robert, and F. Vivien. Scheduling tasks sharing files on heterogeneous master-slave platforms. Journal of Systems Architecture, 2006.
  - K. Kaya and C. Aykanat. Iterative-improvement-based heuristics for adaptive scheduling of tasks sharing files on heterogeneous master-slave environments. IEEE TPDS, 2006.
- Non-streaming workflows
  - S. Ramaswamy, S. Sapatnekar, and P. Banerjee. A framework for exploiting task and data parallelism on distributed memory multicomputers. IEEE TPDS, 1997.
  - A. Radulescu and A. van Gemund. A low-cost approach towards mixed task and data parallel scheduling. ICPP, 2001.
  - A. Radulescu, C. Nicolescu, A. J. C. van Gemund, and P. Jonker. CPR: Mixed task and data parallel scheduling for distributed systems. IPDPS, 2001.
  - K. Aida and H. Casanova. Scheduling mixed-parallel applications with advance reservations. HPDC, 2008.
- Streaming workflows
  - F. Guirado, A. Ripoll, C. Roig, and E. Luque. Optimizing latency under throughput requirements for streaming applications on cluster execution. Cluster Computing, 2005.
  - M. Spencer, R. Ferreira, M. Beynon, T. Kurc, U. Catalyurek, A. Sussman, and J. Saltz. Executing multiple pipelined data analysis operations in the grid. SC, 2002.
  - A. Benoit and Y. Robert. Mapping pipeline skeletons onto heterogeneous platforms. Technical Report LIP RR-2006-40, 2006.
  - A. Benoit and Y. Robert. Complexity results for throughput and latency optimization of replicated and data-parallel workflows. Technical Report LIP RR-2007-12, 2007.
  - A. Benoit, H. Kosch, V. Rehn-Sonigo, and Y. Robert. Optimizing latency and reliability of pipeline workflow applications. Technical Report LIP RR-2008-12, 2008.
30 Conclusions and Future Work
- Streaming workflows
  - Coordinated use of task-, data- and pipelined-parallelism
  - Multiple performance objectives (latency and throughput)
  - Consistently meets throughput requirements
  - Lower-latency schedules using fewer resources
  - Larger-throughput schedules while meeting latency requirements
- Future work
  - Scheduling for multi-core clusters
  - Deeper memory hierarchies
  - Power-aware approaches
  - Fault-tolerant approaches
31 Thanks
- Questions?
- Contact info: Umit Catalyurek
- umit_at_bmi.osu.edu
- http://bmi.osu.edu/umit
- OSU Dept. of Biomedical Informatics: http://bmi.osu.edu