Scheduling Generic Parallel Applications: Meta-scheduling

1
Scheduling Generic Parallel Applications
Meta-scheduling

Sathish Vadhiyar
Sources/credits: taken from the papers listed in the References slides

2
Scheduling Architectures

Centralized schedulers
Single-site scheduling: a job does not span across sites
Multi-site scheduling: a job may span across sites
Hierarchical structures: a central scheduler (metascheduler) for global scheduling, and local schedulers on individual sites
Decentralized scheduling: distributed schedulers interact, exchange information, and submit jobs to remote systems
Direct communication: a local scheduler directly contacts remote schedulers and transfers some of its jobs
Communication via a central job pool: jobs that cannot be executed immediately are pushed to a central pool; other local schedulers pull jobs out of the pool
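
The job-pool variant of decentralized scheduling can be sketched roughly as below. This is only an illustration, not code from the cited papers: the names `Job`, `JobPool` and `LocalScheduler`, the processor-count fit test, and the first-fit pool order are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    procs: int  # processors requested


@dataclass
class JobPool:
    """Central pool shared by all local schedulers (hypothetical)."""
    jobs: deque = field(default_factory=deque)

    def push(self, job):
        self.jobs.append(job)

    def pull(self, free_procs):
        """Return the first pooled job that fits in free_procs, or None."""
        for job in self.jobs:
            if job.procs <= free_procs:
                self.jobs.remove(job)
                return job
        return None


@dataclass
class LocalScheduler:
    site: str
    free_procs: int
    running: list = field(default_factory=list)

    def submit(self, job, pool):
        # Run locally if the job fits right now, otherwise push it to the pool.
        if job.procs <= self.free_procs:
            self._start(job)
        else:
            pool.push(job)

    def pull_from_pool(self, pool):
        # A site with free processors pulls work other sites could not start.
        job = pool.pull(self.free_procs)
        if job is not None:
            self._start(job)
        return job

    def _start(self, job):
        self.free_procs -= job.procs
        self.running.append(job.name)


pool = JobPool()
busy = LocalScheduler("A", free_procs=2)
idle = LocalScheduler("B", free_procs=16)
busy.submit(Job("j1", procs=8), pool)  # does not fit at A, goes to the pool
idle.pull_from_pool(pool)              # B pulls j1 out of the pool
```

After the two calls, `j1` is running at site B and the pool is empty; site A never had to know which site took the job.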

3
Various Scheduling Architectures
4
Various Scheduling Architectures
5
Metascheduler across MPPs

Types
Centralized
A meta scheduler and local dispatchers
Jobs submitted to meta scheduler
Hierarchical
Combination of central and local schedulers
Jobs submitted to meta scheduler
Meta scheduler sends job to the site for which
earliest start time is expected
Local schedulers can follow their own policies
Distributed
Each site has a metascheduler and a local
scheduler
Jobs submitted to local metascheduler
Jobs can be transferred to the sites with the lowest load
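
The two site-selection rules on this slide (earliest expected start time for hierarchical, lowest load for distributed) can be written as small helper functions. This is a sketch under assumptions: the function names, the dictionary inputs, and the tie-breaking rule that keeps a job local unless another site is strictly less loaded are not taken from the papers.

```python
def pick_site_hierarchical(expected_start):
    """expected_start maps site -> start time estimated by its local scheduler.

    The metascheduler sends the job to the site with the earliest estimate.
    """
    return min(expected_start, key=expected_start.get)


def pick_site_distributed(load, local_site):
    """load maps site -> current load.

    The local metascheduler keeps the job unless some other site is
    strictly less loaded (assumed tie-breaking rule).
    """
    best = min(load, key=load.get)
    return best if load[best] < load[local_site] else local_site


expected_start = {"A": 120.0, "B": 30.0, "C": 75.0}
load = {"A": 0.9, "B": 0.4, "C": 0.6}

hier_choice = pick_site_hierarchical(expected_start)     # "B"
dist_choice = pick_site_distributed(load, local_site="A")  # "B"
stay_choice = pick_site_distributed(load, local_site="B")  # "B"
```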

6
Evaluation of schemes
Centralized

Global knowledge of all resources, hence optimized schedules
Can act as a bottleneck for a large number of resources and jobs
May take time to transfer jobs from the metascheduler to local schedulers: the metascheduler needs a strategic position

Hierarchical

Medium-level overhead
Suboptimal schedules
Still needs a strategic position for the central scheduler

Distributed

No bottleneck: the workload is evenly distributed
Needs all-to-all connections between MPPs

7
Evaluation of Various Scheduling Architectures

Experiments to evaluate slowdowns in the 3 schemes
Based on an actual trace from a supercomputer centre: a 5000-job set
4 sites were simulated: 2 with the same load as the trace, the other 2 with run times multiplied by 1.7
FCFS with EASY backfilling was used
slowdown = (wait_time + run_time) / run_time
2 more schemes:
Independent: local schedulers act independently, i.e. sites are not connected
United: resources of all sites are combined to form a single site
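
As a quick check of the slowdown metric, here is a small sketch that evaluates it for a couple of made-up jobs. The job values are invented for illustration and are not from the trace used in the paper.

```python
def slowdown(wait_time, run_time):
    """Slowdown of one job: (wait_time + run_time) / run_time."""
    return (wait_time + run_time) / run_time


def average_slowdown(jobs):
    """jobs is a list of (wait_time, run_time) pairs, in seconds."""
    return sum(slowdown(w, r) for w, r in jobs) / len(jobs)


# Hypothetical jobs: one waits three times its run time, one starts at once.
jobs = [(300.0, 100.0), (0.0, 50.0)]
per_job = [slowdown(w, r) for w, r in jobs]  # [4.0, 1.0]
mean = average_slowdown(jobs)                # 2.5
```

A job that never waits has slowdown 1, which is why short jobs are the most sensitive to queueing delay under this metric.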

8
Results
9
Observations
1. Centralized and hierarchical performed slightly better than united
a. Compared to hierarchical, united has to make scheduling decisions for all jobs and all resources, so its overhead and hence wait time are high

b. Comparing united and centralized:
4 categories of jobs, corresponding to the 4 combinations of 2 parameters: execution time (short, long) and number of resources requested (narrow, wide)
Usually there are more long narrow jobs than short wide jobs
Why are centralized and hierarchical better than united?

2. Distributed performed poorly

Short narrow (SN) jobs incurred more slowdown
Short narrow jobs are large in number and are the best candidates for backfilling
Backfilling dynamics are complex
A site with a light average load may not always be the best choice: SN jobs may find the earliest holes in a heavily loaded site.

10
Newly Proposed Models

K-distributed model
Distributed scheme where the local metascheduler distributes each job to the k least loaded sites
When the job starts on a site, a notification is sent to the local metascheduler, which in turn asks the other k-1 schedulers to dequeue it
K-Dual queue model
2 queues are maintained at each site: one for local jobs and the other for remote jobs
Remote jobs are executed only when they don't affect the start times of the local jobs
Local jobs are given priority during backfilling
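
The K-distributed protocol (submit copies to the k least loaded sites, then dequeue the other k-1 copies once one starts) might look like the sketch below. The classes `Site` and `LocalMetascheduler` and their methods are hypothetical names for this example; real schedulers would exchange these notifications over the network.

```python
class Site:
    def __init__(self, name, load):
        self.name = name
        self.load = load
        self.queue = []    # job ids waiting at this site
        self.running = []  # job ids that have started here

    def start(self, job_id):
        self.queue.remove(job_id)
        self.running.append(job_id)


class LocalMetascheduler:
    def __init__(self, sites, k):
        self.sites = sites
        self.k = k
        self.copies = {}  # job id -> sites holding a queued copy

    def submit(self, job_id):
        # Send a copy of the job to each of the k least loaded sites.
        targets = sorted(self.sites, key=lambda s: s.load)[: self.k]
        for site in targets:
            site.queue.append(job_id)
        self.copies[job_id] = targets
        return [site.name for site in targets]

    def notify_started(self, job_id, started_at):
        # Ask the other k-1 schedulers to dequeue their copies.
        for site in self.copies.pop(job_id):
            if site is not started_at and job_id in site.queue:
                site.queue.remove(job_id)


sites = [Site("A", 0.9), Site("B", 0.2), Site("C", 0.5), Site("D", 0.7)]
meta = LocalMetascheduler(sites, k=2)
targets = meta.submit("j1")        # copies go to B and C
site_c = sites[2]
site_c.start("j1")                 # C finds a slot first
meta.notify_started("j1", site_c)  # B drops its copy
```

The point of the redundancy is that the job starts wherever a hole opens first, which the load average alone cannot predict.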

11
Results: Benefits of new schemes
[Chart annotations: 45% improvement, 15% improvement]
12
Results: Usefulness of K-Dual scheme
Grouping jobs submitted at lightly loaded sites and heavily loaded sites
13
Assessment and Enhancement of Meta-Schedulers (Sabin et al.)

Metascheduling working examples (LSF and Moab)
2 different modes:
Standard or centralized (all scheduling decisions
are made in a centralized manner)
Forces local sites to accept advance reservations
from the metascheduler
Delegated
Does not provide a known scheduling policy for
grid jobs

14
Centralized

Metascheduler queries local schedulers to obtain
information regarding current schedule
Metascheduler makes an advance reservation at the best local scheduler
Reservations are honored by local sites, possibly delaying local jobs
Metascheduler tries to find better reservations
for all jobs at periodic intervals
If a better reservation is found, metascheduler
cancels existing reservation and moves job to
another local scheduler
This model requires close interactions between
local and metaschedulers

15
Delegated

Metascheduler determines best site for each
grid job
Delegates scheduling responsibilities to local
schedulers
After the job is sent to the local site, there is
no interaction between meta and local scheduler
Meta scheduler queries the local scheduler for
the metric that serves as basis for site choice
This model is more scalable and allows local
schedulers to retain autonomy

16
Evaluation: System-wide average response time
Centralized outperforms delegated since
centralized revisits its scheduling decisions
17
Evaluation: Average response time of jobs from the least loaded site

Metascheduling has a detrimental effect on users
at the least loaded site
At low loads, centralized is best: jobs submitted at the least loaded site may run faster at another site
This is a case of least loaded sites getting
discouraged from joining the grid!

18
To avoid deterioration at the least loaded sites: Dues-Based Queues

Goal is to improve priority of jobs originating
from lightly loaded sites
For each site pair, the relative resource usage surplus/deficit is maintained
Each site maintains the processor-seconds that it has provided to other sites' jobs, and also the processor-seconds that its jobs consumed at other sites
Site si sets the priority of all of site sj's jobs to dues(sj)
For lightly loaded sites, it is usually a surplus. Hence other sites will have to pay dues to lightly loaded sites by increasing the priorities of jobs submitted at lightly loaded sites

19
Dues Based Queues

s1 runs a 100 processor-second job for s2
dues(s2) = -100, dues(s1) = 100
s2 runs a 300 processor-second job for s1; s2 will be paying the dues to s1
dues(s2) = 200, dues(s1) = -200
Queue order at each site is determined by the dues values of the submitting sites
Can be implemented in centralized: dues-based queuing scheme at the metascheduler
Or delegated: dues-based queues at the local schedulers
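
The worked example on this slide can be replayed with a small ledger. The sketch assumes one reading of the notation: dues(s2) is the value s1 holds for s2, dues(s1) is the value s2 holds for s1, and a higher value means higher priority for that site's jobs. The class `DuesLedger`, its method names, and the third site `s3` are made up for the illustration.

```python
from collections import defaultdict


class DuesLedger:
    """dues[a][b] is the value site a holds for site b (assumed convention)."""

    def __init__(self):
        self.dues = defaultdict(lambda: defaultdict(int))

    def record(self, host, owner, proc_seconds):
        # host ran proc_seconds of work for a job submitted at owner:
        # owner now owes host, so host's jobs gain priority at owner.
        self.dues[host][owner] -= proc_seconds
        self.dues[owner][host] += proc_seconds

    def queue_order(self, site, waiting):
        """waiting is a list of (job_id, submitting_site); highest dues first."""
        return sorted(waiting, key=lambda job: -self.dues[site][job[1]])


ledger = DuesLedger()

ledger.record(host="s1", owner="s2", proc_seconds=100)
step1 = (ledger.dues["s1"]["s2"], ledger.dues["s2"]["s1"])  # (-100, 100)

ledger.record(host="s2", owner="s1", proc_seconds=300)
step2 = (ledger.dues["s1"]["s2"], ledger.dues["s2"]["s1"])  # (200, -200)

# At s1, a job from s2 (dues 200) now goes ahead of a job from s3 (dues 0).
order = ledger.queue_order("s1", [("x", "s3"), ("y", "s2")])
```

The same ledger could sit at the metascheduler (centralized) or be kept per site (delegated); only where `queue_order` is called changes.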

20
Evaluation: System-wide average response time
Dues-based schemes perform worse than the corresponding schemes without dues
21
Evaluation: Average response time of jobs from the least loaded site
Centralized dues performs the best
22
Another method: Local Priority with Job Sharing

Dual queue
Dual queue at local schedulers
Local jobs will have higher priority than remote
jobs
Dual queue with local copy
In dual queue model, remote jobs may suffer
starvation
Jobs from a lightly loaded site that are sent to a remote site may suffer
In this scheme, every job has a copy sent to the originating site's scheduler in addition to one remote site

23
Evaluation: System-wide average response time
Dual queue with local copy performs the best
24
Evaluation: Average response times of jobs from the least loaded site
Dual queue with local copy performs as well as the no-sharing scheme
25
Summary
26
References

Casavant, T. L. and Kuhl, J. G. "A taxonomy of scheduling in general-purpose distributed computing systems". IEEE Transactions on Software Engineering, Volume 14, Issue 2, pp. 141-154, February 1988.
Hamscher, V., Schwiegelshohn, U., Streit, A. and Yahyapour, R. "Evaluation of Job-Scheduling Strategies for Grid Computing". Proceedings of the First IEEE/ACM International Workshop on Grid Computing, Lecture Notes in Computer Science, pp. 191-202, 2000. ISBN 3-540-41403-7.
Subramani, V., Kettimuthu, R., Srinivasan, S. and Sadayappan, P. "Distributed Job Scheduling on Computational Grids using Multiple Simultaneous Requests". Proceedings of the 11th IEEE Symposium on High Performance Distributed Computing (HPDC 2002), July 2002.

27
References

Sabin et al. "Assessment and Enhancement of Meta-Schedulers for Multi-Site Job Scheduling". HPDC 2005.

28
References

Vadhiyar, S., Dongarra, J. and Yarkhan, A. "GrADSolve - RPC for High Performance Computing on the Grid". Euro-Par 2003, 9th International Euro-Par Conference, Proceedings, Springer, LNCS 2790, pp. 394-403, August 26-29, 2003.
Vadhiyar, S. and Dongarra, J. "Metascheduler for the Grid". Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, pp. 343-351, July 2002, Edinburgh, Scotland.
Vadhiyar, S. and Dongarra, J. "GrADSolve - A Grid-based RPC system for Parallel Computing with Application-level Scheduling". Journal of Parallel and Distributed Computing, Volume 64, pp. 774-783, 2004.
Petitet, A., Blackford, S., Dongarra, J., Ellis, B., Fagg, G., Roche, K. and Vadhiyar, S. "Numerical Libraries and The Grid: The GrADS Experiments with ScaLAPACK". Journal of High Performance Applications and Supercomputing, Vol. 15, number 4 (Winter 2001), pp. 359-374.

29
Coallocation in Multicluster Systems

Processor coallocation: allowing jobs to use processors in multiple clusters simultaneously
Jobs consist of one or more components, each of which has to be scheduled on a different cluster
A multi-component job is scheduled across as many different clusters as it has components

30
Queuing Structures

Single central scheduler with one global queue for the entire set of clusters: all clusters submit single- and multi-component jobs to the global queue
Local schedulers with only local queues at the clusters: each cluster submits single- and multi-component jobs to its local queue
A global queue for the system and local queues for the clusters: a cluster submits single-component jobs to its local queue and multi-component jobs to the global queue

31
Scheduling

Scheduling multi-component jobs: WorstFit
Order the job components by decreasing size
Order the clusters according to decreasing number
of idle processors
Traverse one-by-one through both lists trying to
fit job components on clusters
Leaves in each cluster as much room as possible
for subsequent jobs
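
The WorstFit steps listed above translate almost directly into code. In this sketch the function name `worst_fit`, the dictionary input, and the cluster names are assumptions; the rule that each component goes on a different cluster follows the coallocation slide.

```python
def worst_fit(components, idle):
    """Place job components on distinct clusters using WorstFit.

    components: list of component sizes (processors).
    idle: dict mapping cluster name -> number of idle processors.
    Returns a dict cluster -> component size, or None if the job does not fit.
    """
    if len(components) > len(idle):
        return None  # not enough clusters for one component each

    sizes = sorted(components, reverse=True)              # decreasing size
    clusters = sorted(idle, key=idle.get, reverse=True)   # decreasing idle count

    placement = {}
    # Walk both lists together: largest component on the emptiest cluster.
    for size, cluster in zip(sizes, clusters):
        if size > idle[cluster]:
            return None
        placement[cluster] = size
    return placement


idle = {"c1": 10, "c2": 30, "c3": 20, "c4": 5}
fits = worst_fit([8, 16, 4], idle)    # {"c2": 16, "c3": 8, "c1": 4}
too_big = worst_fit([25, 25], idle)   # None: only c2 has 25 idle processors
```

Pairing the lists in order is enough to decide feasibility: if the i-th largest component does not fit on the i-th emptiest cluster, fewer than i clusters can hold a component of that size.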

32
Scheduling

Invoked during job departure
A queue is enabled when the corresponding
scheduler is allowed to start jobs from the
queue. When a queue is enabled, the job at the
head of the queue is scheduled if it fits
When a job departs, all or some of the non-empty
queues are enabled
Enabled queues are repeatedly visited in some
order
Which non-empty queues are enabled, and in what order they are visited, is defined by the scheduling policy

33
Scheduling Policies

GS: global scheduler policy with a single queue
LS: each cluster has only a local queue. At a job departure, in which order should the non-empty queues be enabled?
The local scheduler that has not scheduled a job for the longest time gets the first chance
For systems with both a global queue and local queues:
GP: global priority. Local queues are enabled only when the global queue is empty
LP: local priority. The global queue is enabled only when at least one local queue is empty. In which order should the local queues and the global queue be enabled?
The global queue is enabled first, and then the local queues
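
The enabling rules of the four policies can be summarized in one function that returns the queues to visit at a job departure. This is a sketch under assumptions: the function name, the `last_start` bookkeeping, and applying the "longest time without scheduling goes first" ordering to the local queues under GP and LP as well as LS are choices made for the example, not statements from the paper.

```python
def enabled_queues(policy, global_queue, local_queues, last_start):
    """Return the names of the enabled queues, in visiting order.

    global_queue: list of jobs in the global queue.
    local_queues: dict cluster -> list of jobs in its local queue.
    last_start: dict cluster -> time its scheduler last started a job.
    """
    # Non-empty local queues, least recently served scheduler first.
    nonempty_local = sorted(
        (name for name, queue in local_queues.items() if queue),
        key=lambda name: last_start[name],
    )
    global_nonempty = bool(global_queue)

    if policy == "GS":
        return ["global"] if global_nonempty else []
    if policy == "LS":
        return nonempty_local
    if policy == "GP":
        # Local queues are enabled only when the global queue is empty.
        return ["global"] if global_nonempty else nonempty_local
    if policy == "LP":
        # Global queue is enabled only if at least one local queue is empty,
        # and it is then visited before the local queues.
        any_local_empty = any(not queue for queue in local_queues.values())
        head = ["global"] if global_nonempty and any_local_empty else []
        return head + nonempty_local
    raise ValueError(f"unknown policy: {policy}")


local = {"c1": ["a"], "c2": [], "c3": ["b"]}
last_start = {"c1": 50, "c2": 10, "c3": 20}

ls_order = enabled_queues("LS", [], local, last_start)
gp_order = enabled_queues("GP", ["g"], local, last_start)
lp_order = enabled_queues("LP", ["g"], local, last_start)
```

With these inputs LS visits `c3` before `c1`, GP enables only the global queue, and LP enables the global queue first because `c2` is empty.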

34
Coallocation Rules

no: only single-component jobs are admitted; no coallocation
co: both single- and multi-component jobs; no restriction
rco: restriction on the size of job components
fco: restriction on the size and number of job components

35
Testbed

DAS system in the Netherlands: 5 clusters, 1 with 72 nodes, the others with 32 nodes each
Intra-cluster communication: Myrinet LAN (1200 Mbit/s)
Inter-cluster communication: 100 Mbit/s WAN

36
Evaluation

2 applications
Ensflow: simulating streams and eddies in the ocean
Poisson: solution of the 2-D Poisson equation

Execution times measured on DAS
37
Results
38
Conclusions

co gives the worst performance, due to the simultaneous presence of large single-component jobs and jobs with many components
rco and fco improve performance
LS and LP provide best results for coallocation
cases
Performance of GS is better when there are only
single-component jobs

39
Conclusions

Processor co-allocation is beneficial, at least when the overhead due to wide-area communication is not high
Restrictions to the job component sizes and to
the number of job components improve the
performance of coallocation

40
Reference

Scheduling Policies for Processor Coallocation in
MultiCluster Systems. Bucur and Epema. TPDS. July
2007.

41
Grid Application Development Software (GrADS)
Architecture
[Diagram. Components: User, Grid Routine / Application Manager, Resource Selector, Performance Modeler, MDS, NWS. Labelled data flows: matrix size, block size; resource characteristics, problem characteristics; final schedule (subset of resources).]
42
Performance Modeler
[Diagram. Components: Grid Routine / Application Manager, Performance Modeler, Scheduling Heuristic, Simulation Model. Labelled data flows: all resources, problem parameters; candidate resources; execution cost; final schedule (subset of resources).]
The scheduling heuristic passed only those candidate schedules that had sufficient memory. This is determined by calling a function in the simulation model.
43
Simulation Model

Simulation of the ScaLAPACK right-looking LU factorization
More about the application:
Iterative: each iteration corresponds to a block
Parallel application in which columns are block-cyclically distributed
Right-looking LU is based on Gaussian elimination

44
Operations

The LU application in each iteration involves:
Block factorization: (ibn, ibib) floating point operations
Broadcast for multiply: message size equals approximately n*block_size
Each process does its own multiply:
Remaining columns divided by number of processors

45
Back to the simulation model

double getExecTimeCost(int matrix_size, int block_size, candidate_schedule) {
    for (i = 0; i < number_of_blocks; i++) {
        /* find the proc. belonging to the column.
           Note its speed, its connections to other procs. */
        tfact += /* simulate block factorization.
                    Depends on processor_speed, machine_load,
                    flop_count of factorization */
        tbcast += max(bcast times for each proc.)
                  /* ScaLAPACK follows split-ring broadcast.
                     Simulate broadcast algorithm for each proc.
                     Depends on elements of matrix to be broadcast,
                     connection bandwidth and latency */
        tupdate += max(matrix multiplies across all proc.)
                   /* depends on flop count of matrix multiply,
                      processor speed, load */
    }
    return (tfact + tbcast + tupdate);
}
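
A runnable approximation of this simulation loop is sketched below in Python. Everything numeric in it is an assumption for illustration: the flop counts are rough, the broadcast is modelled as latency plus size over bandwidth per receiver rather than a real split-ring simulation, and the `Proc` fields and the function name `get_exec_time_cost` are invented for the example rather than taken from GrADS.

```python
import math
from dataclasses import dataclass


@dataclass
class Proc:
    flops_per_sec: float  # peak speed
    load: float           # fraction of CPU used by other work, 0 <= load < 1
    bandwidth: float      # bytes/s on the link to this processor (assumed)
    latency: float        # seconds


def effective_speed(proc):
    return proc.flops_per_sec * (1.0 - proc.load)


def get_exec_time_cost(matrix_size, block_size, schedule):
    """Estimate LU execution time on the candidate schedule (list of Proc)."""
    n, nb, nprocs = matrix_size, block_size, len(schedule)
    number_of_blocks = math.ceil(n / nb)
    tfact = tbcast = tupdate = 0.0

    for i in range(number_of_blocks):
        # Block-cyclic column distribution: block i belongs to proc i mod P.
        owner = schedule[i % nprocs]
        rows = n - i * nb
        width = min(nb, rows)

        # Panel factorization on the owner (rough flop count, assumed).
        tfact += rows * width * width / effective_speed(owner)

        # Broadcast of the panel: slowest receiver determines the time.
        msg_bytes = 8 * rows * width
        tbcast += max(
            (p.latency + msg_bytes / p.bandwidth
             for p in schedule if p is not owner),
            default=0.0,
        )

        # Trailing update: remaining columns split evenly across processors.
        remaining_cols = max(n - (i + 1) * nb, 0)
        cols_per_proc = remaining_cols / nprocs
        tupdate += max(
            2.0 * rows * width * cols_per_proc / effective_speed(p)
            for p in schedule
        )

    return tfact + tbcast + tupdate


slow = [Proc(1e9, 0.5, 1e7, 1e-3) for _ in range(4)]
fast = [Proc(1e9, 0.0, 1e7, 1e-3) for _ in range(4)]
cost_slow = get_exec_time_cost(4000, 64, slow)
cost_fast = get_exec_time_cost(4000, 64, fast)
```

A scheduling heuristic would call such a function once per candidate schedule and keep the cheapest one; here the unloaded set of processors yields the lower estimate.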

46
Initial GrADS Architecture
[Diagram. Components: User, Grid Routine / Application Manager, Resource Selector, Performance Modeler, App Launcher, Contract Monitor, Application, MDS, NWS. Labelled data flows: matrix size, block size; resource characteristics, problem characteristics; problem, parameters, app. location, final schedule.]
47
Performance Model Evaluation
48
GrADS Benefits
[Chart. Legend: MSC TORC Cluster; MSC Cluster. Data labels: 8 mscs, 8 torcs; 8 mscs, 8 torcs; 8 mscs, 7 torcs; 8, 8, 8; 7, 7, 5.]
Even though performance worsened when using
multiple clusters, larger problem sizes can be
solved without incurring costly disk accesses
