1
Scheduling Generic Parallel Applications
Meta-scheduling
  • Sathish Vadhiyar
  • Sources/Credits: taken from the papers listed in
    the References slides

2
Scheduling Architectures
  • Centralized schedulers
  • Single-site scheduling: a job does not span
    across sites
  • Multi-site scheduling: the opposite, a job may
    span several sites
  • Hierarchical structures: a central scheduler
    (metascheduler) for global scheduling and local
    scheduling on individual sites
  • Decentralized scheduling: distributed schedulers
    interact, exchange information and submit jobs to
    remote systems
  • Direct communication: a local scheduler directly
    contacts remote schedulers and transfers some of
    its jobs
  • Communication via a central job pool: jobs that
    cannot be executed immediately are pushed to a
    central pool; other local schedulers pull the
    jobs out of the pool (see the sketch after this
    list)
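
A rough illustration (not taken from the referenced papers) of the
central-job-pool variant, with the pool modelled as a shared array: a
loaded scheduler pushes a job it cannot start immediately, and another
scheduler with idle processors pulls a job out. All names, sizes and
job ids in this sketch are made up.

    #include <stdio.h>

    #define POOL_CAP 16

    /* Central job pool shared by the local schedulers. */
    static int pool[POOL_CAP];   /* job ids waiting in the pool */
    static int pool_len = 0;

    /* A local scheduler pushes a job it cannot start immediately. */
    static int push_job(int job_id)
    {
        if (pool_len == POOL_CAP) return -1;
        pool[pool_len++] = job_id;
        return 0;
    }

    /* Another local scheduler with idle processors pulls a job out. */
    static int pull_job(void)
    {
        if (pool_len == 0) return -1;
        return pool[--pool_len];
    }

    int main(void)
    {
        push_job(42);           /* site A cannot run job 42 right now */
        int j = pull_job();     /* site B has idle processors         */
        printf("site B pulled job %d from the central pool\n", j);
        return 0;
    }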

3
Various Scheduling Architectures
4
Various Scheduling Architectures
5
Metascheduler across MPPs
  • Types
  • Centralized
  • A metascheduler and local dispatchers
  • Jobs are submitted to the metascheduler
  • Hierarchical
  • Combination of central and local schedulers
  • Jobs are submitted to the metascheduler
  • The metascheduler sends each job to the site with
    the earliest expected start time
  • Local schedulers can follow their own policies
  • Distributed
  • Each site has a metascheduler and a local
    scheduler
  • Jobs are submitted to the local metascheduler
  • Jobs can be transferred to sites with the lowest
    load

6
Evaluation of schemes
Centralized
  • Global knowledge of all resources, hence
    optimized schedules
  • Can become a bottleneck for a large number of
    resources and jobs
  • May take time to transfer jobs from the
    metascheduler to the local schedulers, so the
    metascheduler needs a strategic position

Hierarchical
  • Medium-level overhead
  • Sub-optimal schedules
  • Still needs a strategic position for the central
    scheduler

Distributed
  • No bottleneck; workload is evenly distributed
  • Needs all-to-all connections between MPPs

7
Evaluation of Various Scheduling Architectures
  • Experiments to evaluate slowdowns under the 3
    schemes
  • Based on an actual trace from a supercomputer
    centre: a 5000-job set
  • 4 sites were simulated: 2 with the same load as
    the trace, the other 2 with run times multiplied
    by 1.7
  • FCFS with EASY backfilling was used
  • slowdown = (wait_time + run_time) / run_time
    (see the sketch after this list)
  • 2 more schemes:
  • Independent: local schedulers act independently,
    i.e. the sites are not connected
  • United: the processors of all sites are combined
    to form a single site
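
A minimal sketch of how this slowdown metric is computed per job and
averaged over a job set; the (wait, run) pairs below are hypothetical
stand-ins for trace data.

    #include <stdio.h>

    /* Slowdown of a single job: (wait_time + run_time) / run_time.
       A job that starts immediately has slowdown 1.0. */
    static double slowdown(double wait_time, double run_time)
    {
        return (wait_time + run_time) / run_time;
    }

    int main(void)
    {
        /* Hypothetical (wait, run) pairs in seconds, standing in
           for a workload trace. */
        double jobs[][2] = { {0.0, 300.0}, {600.0, 120.0}, {60.0, 3600.0} };
        int n = sizeof(jobs) / sizeof(jobs[0]);
        double sum = 0.0;

        for (int i = 0; i < n; i++)
            sum += slowdown(jobs[i][0], jobs[i][1]);

        printf("average slowdown = %.2f over %d jobs\n", sum / n, n);
        return 0;
    }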

8
Results
9
Observations
1. Centralized and hierarchical performed slightly
better than united
  • a. Compared to hierarchical, in united the
    scheduling decisions have to be made for all jobs
    and all resources, so the overhead and hence the
    wait time is high
  • b. Comparing united and centralized:
  • 4 categories of jobs, corresponding to the 4
    combinations of 2 parameters: execution time
    (short, long) and number of resources requested
    (narrow, wide)
  • Usually there is a larger number of long narrow
    jobs than short wide jobs
  • Why are centralized and hierarchical better than
    united?

2. Distributed performed poorly
  • Short narrow jobs incurred more slowdown
  • Short narrow jobs are large in number and the
    best candidates for backfilling (see the sketch
    after this list)
  • Backfilling dynamics are complex
  • A site with a light average load may not always
    be the best choice; short narrow jobs may find
    the earliest holes in a heavily loaded site.
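
A simplified sketch of the EASY-backfilling check that makes short
narrow (SN) jobs such good candidates: a queued job may jump ahead
only if it does not delay the reserved start of the job at the head of
the queue. The shadow time, node counts and candidate jobs below are
hypothetical.

    #include <stdio.h>

    /* EASY backfilling: the head-of-queue job gets a reservation at
       'shadow_time' on 'shadow' nodes; 'extra_nodes' remain idle even
       then. A candidate job may be backfilled if it finishes before the
       shadow time, or if it uses only nodes the head job will not need. */
    static int can_backfill(int req_nodes, double runtime,
                            double now, double shadow_time,
                            int free_nodes, int extra_nodes)
    {
        if (req_nodes > free_nodes)
            return 0;                        /* does not fit at all      */
        if (now + runtime <= shadow_time)
            return 1;                        /* out of the way in time   */
        return req_nodes <= extra_nodes;     /* uses only spare nodes    */
    }

    int main(void)
    {
        /* Short narrow job: 4 nodes for 10 min; head job reserved in 1 h. */
        printf("SN job backfilled: %d\n",
               can_backfill(4, 600.0, 0.0, 3600.0, 16, 2));
        /* Long wide job: 12 nodes for 4 hours on the same machine state. */
        printf("LW job backfilled: %d\n",
               can_backfill(12, 14400.0, 0.0, 3600.0, 16, 2));
        return 0;
    }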

10
Newly Proposed Models
  • K-distributed model
  • Distributed scheme in which the local
    metascheduler distributes each job to the k least
    loaded sites (sketched after this list)
  • When the job starts on one site, a notification
    is sent to the local metascheduler, which in turn
    asks the other k-1 schedulers to dequeue the job
  • K-dual-queue model
  • 2 queues are maintained at each site: one for
    local jobs and the other for remote jobs
  • Remote jobs are executed only when they don't
    affect the start times of the local jobs
  • Local jobs are given priority during backfilling
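
A schematic sketch (not the paper's code) of the K-distributed
submission protocol: the originating metascheduler enqueues a job at
the k least loaded sites, and once one copy starts the other sites are
asked to dequeue theirs. Site ids and load values are made up.

    #include <stdio.h>
    #include <stdlib.h>

    #define NSITES 4
    #define K      2       /* submit each job to the K least loaded sites */

    typedef struct { int id; double load; int queued; } Site;

    static int by_load(const void *a, const void *b)
    {
        double d = ((const Site *)a)->load - ((const Site *)b)->load;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        Site sites[NSITES] = { {0, 0.9, 0}, {1, 0.3, 0},
                               {2, 0.7, 0}, {3, 0.5, 0} };

        /* Submit: enqueue the job at the K least loaded sites. */
        qsort(sites, NSITES, sizeof(Site), by_load);
        for (int i = 0; i < K; i++) {
            sites[i].queued = 1;
            printf("job queued at site %d (load %.1f)\n",
                   sites[i].id, sites[i].load);
        }

        /* Later one copy starts; the originating metascheduler is
           notified and asks the other K-1 sites to dequeue their copies. */
        int started = sites[0].id;
        for (int i = 0; i < K; i++) {
            if (sites[i].id != started && sites[i].queued) {
                sites[i].queued = 0;
                printf("dequeue request sent to site %d\n", sites[i].id);
            }
        }
        printf("job started at site %d\n", started);
        return 0;
    }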

11
Results: Benefits of the new schemes
45% improvement
15% improvement
12
Results: Usefulness of the K-Dual scheme. Grouping
of jobs submitted at lightly loaded sites and
heavily loaded sites
13
Assessment and Enhancement of Meta-Schedulers
(Sabin et al.)
  • Metascheduling: working examples are LSF and Moab
  • 2 different modes
  • Standard or centralized (all scheduling decisions
    are made in a centralized manner)
  • Forces local sites to accept advance reservations
    from the metascheduler
  • Delegated
  • Does not provide a known scheduling policy for
    grid jobs (scheduling is left to the local sites)

14
Centralized
  • The metascheduler queries the local schedulers to
    obtain information regarding their current
    schedules
  • The metascheduler makes an advance reservation at
    the best of the local schedulers
  • Reservations are honored by the local sites,
    possibly delaying local jobs
  • The metascheduler tries to find better
    reservations for all jobs at periodic intervals
  • If a better reservation is found, the
    metascheduler cancels the existing reservation
    and moves the job to another local scheduler
  • This model requires close interaction between the
    local schedulers and the metascheduler (a small
    decision-loop sketch follows this list)
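
A minimal sketch, under simplifying assumptions rather than the
LSF/Moab implementation, of this centralized decision loop: query each
local scheduler for an estimated start time, reserve at the best site,
and at a periodic re-evaluation move the reservation if a better one
has appeared. The estimates and site ids are hypothetical.

    #include <stdio.h>

    #define NSITES 3

    /* Stand-in for querying a local scheduler's current schedule:
       returns the estimated start time (seconds from now) that the
       site can offer the job. */
    static double query_start_time(int site, int round)
    {
        static const double estimate[2][NSITES] = {
            { 1800.0,  600.0, 1200.0 },   /* initial query                */
            { 1800.0,  900.0,  300.0 },   /* later periodic re-evaluation */
        };
        return estimate[round][site];
    }

    /* Pick the site with the earliest estimated start time. */
    static int best_site(int round)
    {
        int best = 0;
        for (int s = 1; s < NSITES; s++)
            if (query_start_time(s, round) < query_start_time(best, round))
                best = s;
        return best;
    }

    int main(void)
    {
        int reserved_at = best_site(0);
        printf("advance reservation made at site %d\n", reserved_at);

        /* Periodic re-evaluation: cancel and move the reservation if a
           better start time is now available at another site. */
        int better = best_site(1);
        if (better != reserved_at) {
            printf("cancelling reservation at site %d, moving to site %d\n",
                   reserved_at, better);
            reserved_at = better;
        }
        return 0;
    }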

15
Delegated
  • The metascheduler determines the best site for
    each grid job
  • It delegates the scheduling responsibility to the
    local schedulers
  • After the job is sent to the local site, there is
    no further interaction between the metascheduler
    and the local scheduler
  • The metascheduler queries the local schedulers
    for the metric that serves as the basis for the
    site choice
  • This model is more scalable and allows local
    schedulers to retain autonomy

16
Evaluation: System-wide average response time
Centralized outperforms delegated since
centralized revisits its scheduling decisions
17
Evaluation: Average response time of jobs from the
least loaded site
  • Metascheduling has a detrimental effect on users
    at the least loaded site
  • At low loads, centralized is best; jobs submitted
    at the least loaded site may run faster at
    another site
  • This is a case of the least loaded sites being
    discouraged from joining the grid!

18
To avoid deterioration at the least loaded sites:
Dues-Based Queues
  • The goal is to improve the priority of jobs
    originating from lightly loaded sites
  • For each site pair, a relative resource-usage
    surplus/deficit is maintained
  • Each site maintains the processor-seconds it has
    provided to other sites' jobs, and also the
    processor-seconds its own jobs have consumed at
    other sites
  • Site s_i sets the priority of all of s_j's jobs
    to dues[s_j]
  • For lightly loaded sites this value is usually a
    surplus; hence the other sites pay their dues to
    the lightly loaded sites by increasing the
    priority of jobs submitted at those sites

19
Dues-Based Queues
  • s1 runs a 100 processor-second job for s2
  • dues[s2] = -100, dues[s1] = +100
  • s2 then runs a 300 processor-second job for s1;
    s2 is thereby paying its dues to s1
  • dues[s2] = +200, dues[s1] = -200
  • Queue order at each site is determined by the
    dues value of the submitting site (a minimal
    bookkeeping sketch follows this list)
  • Can be implemented in the centralized mode:
  • a dues-based queuing scheme at the metascheduler
  • or in the delegated mode:
  • dues-based queues at the local schedulers
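
A minimal sketch of the dues bookkeeping above for the two-site
example, using the same processor-second amounts; ordering a queue
would then sort waiting jobs by the submitting site's dues value,
highest first. The array layout and function names are illustrative.

    #include <stdio.h>

    #define NSITES 2

    /* dues[i][j]: processor-seconds that site i is owed by site j
       (positive) or owes to site j (negative). */
    static double dues[NSITES][NSITES];

    /* Record that 'host' ran a job of 'procsec' processor-seconds
       for 'origin'. */
    static void account(int host, int origin, double procsec)
    {
        dues[host][origin] += procsec;   /* host is owed more by origin  */
        dues[origin][host] -= procsec;   /* origin now owes more to host */
    }

    int main(void)
    {
        /* s1 runs a 100 processor-second job for s2 ... */
        account(0, 1, 100.0);
        printf("dues[s1][s2] = %+.0f, dues[s2][s1] = %+.0f\n",
               dues[0][1], dues[1][0]);          /* +100, -100 */

        /* ... then s2 runs a 300 processor-second job for s1. */
        account(1, 0, 300.0);
        printf("dues[s1][s2] = %+.0f, dues[s2][s1] = %+.0f\n",
               dues[0][1], dues[1][0]);          /* -200, +200 */

        /* A site orders its queue so that jobs from the submitting site
           with the highest dues value run first; lightly loaded sites
           accumulate a surplus, so their jobs gain priority elsewhere. */
        return 0;
    }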

20
Evaluation: System-wide average response time
Dues-based scheme performs worse than the
corresponding schemes
21
Evaluation: Average response time of jobs from the
least loaded site
Centralized dues perform the best
22
Another method: Local Priority with Job Sharing
  • Dual queue
  • Dual queues at the local schedulers
  • Local jobs have a higher priority than remote
    jobs
  • Dual queue with local copy
  • In the dual-queue model, remote jobs may suffer
    starvation
  • Jobs sent from a lightly loaded site to a remote
    site may suffer
  • In this scheme, every job has a copy sent to the
    originating site's scheduler in addition to one
    remote site (a small sketch follows this list)
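
An illustrative sketch (not the paper's implementation) of the
dual-queue selection at one site: local jobs are always preferred over
remote jobs, and in the local-copy variant the copy that starts first
causes the other copy to be dequeued. Job ids and queue sizes are made
up.

    #include <stdio.h>

    #define QCAP 8

    typedef struct { int jobs[QCAP]; int len; } Queue;

    static int dequeue(Queue *q)
    {
        if (q->len == 0) return -1;
        int job = q->jobs[0];
        for (int i = 1; i < q->len; i++) q->jobs[i - 1] = q->jobs[i];
        q->len--;
        return job;
    }

    /* Dual-queue rule (simplified): when processors free up, start a
       local job if one is waiting; remote jobs run only when no local
       job is waiting, so they cannot delay local work. */
    static int pick_next(Queue *local, Queue *remote)
    {
        int job = dequeue(local);
        return (job != -1) ? job : dequeue(remote);
    }

    int main(void)
    {
        Queue local  = { {11, 12}, 2 };  /* jobs submitted at this site     */
        Queue remote = { {93},     1 };  /* copies sent here by other sites */

        for (int job; (job = pick_next(&local, &remote)) != -1; )
            printf("start job %d\n", job);   /* 11, 12, then remote 93 */

        /* In the "local copy" variant, when job 93 starts here the copy
           kept at its originating site would be dequeued, and vice versa. */
        return 0;
    }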

23
Evaluation: System-wide average response time
Dual queue with local copy performs the best
24
Evaluation: Average response times of jobs from the
least loaded site
Dual queue with local copy performs as well as the
no-sharing scheme
25
Summary
26
References
  • T. L. Casavant and J. G. Kuhl. A Taxonomy of
    Scheduling in General-Purpose Distributed
    Computing Systems. IEEE Transactions on Software
    Engineering, Volume 14, Issue 2 (February 1988),
    pages 141-154.
  • Volker Hamscher, Uwe Schwiegelshohn, Achim Streit
    and Ramin Yahyapour. Evaluation of Job-Scheduling
    Strategies for Grid Computing. Proceedings of the
    First IEEE/ACM International Workshop on Grid
    Computing, Lecture Notes in Computer Science,
    pages 191-202, 2000. ISBN 3-540-41403-7.
  • Vijay Subramani, Rajkumar Kettimuthu, Srividya
    Srinivasan and P. Sadayappan. "Distributed Job
    Scheduling on Computational Grids using Multiple
    Simultaneous Requests". Proceedings of the 11th
    IEEE Symposium on High Performance Distributed
    Computing (HPDC 2002), July 2002.

27
References
  • Sabin et al. Assessment and Enhancement of
    Meta-Schedulers for Multi-Site Job Scheduling.
    HPDC 2005.

28
References
  • Vadhiyar, S., Dongarra, J. and Yarkhan, A.
    "GrADSolve - RPC for High Performance Computing
    on the Grid". Euro-Par 2003, 9th International
    Euro-Par Conference, Proceedings, Springer, LNCS
    2790, pp. 394-403, August 26-29, 2003.
  • Vadhiyar, S. and Dongarra, J. "A Metascheduler
    for the Grid". Proceedings of the 11th IEEE
    International Symposium on High Performance
    Distributed Computing, pp. 343-351, July 2002,
    Edinburgh, Scotland.
  • Vadhiyar, S. and Dongarra, J. "GrADSolve - A
    Grid-based RPC System for Parallel Computing with
    Application-level Scheduling". Journal of
    Parallel and Distributed Computing, Volume 64,
    pp. 774-783, 2004.
  • Petitet, A., Blackford, S., Dongarra, J., Ellis,
    B., Fagg, G., Roche, K. and Vadhiyar, S.
    "Numerical Libraries and the Grid: The GrADS
    Experiments with ScaLAPACK". Journal of High
    Performance Applications and Supercomputing,
    Vol. 15, Number 4 (Winter 2001), pp. 359-374.

29
Coallocation in Multicluster Systems
  • Processor coallocation: allowing jobs to use
    processors in multiple clusters simultaneously
  • Jobs consist of one or more components, each of
    which has to be scheduled on a different cluster
  • A multi-component job is thus scheduled across a
    number of clusters equal to its number of
    components

30
Queuing Structures
  • A single central scheduler with one global queue
    for the entire set of clusters: all clusters
    submit single- and multi-component jobs to the
    global queue
  • Local schedulers with only local queues at the
    clusters: each cluster submits single- and
    multi-component jobs to its local queue
  • A global queue for the system and local queues
    for the clusters: a cluster submits
    single-component jobs to its local queue and
    multi-component jobs to the global queue

31
Scheduling
  • Scheduling multi-component jobs: WorstFit
  • Order the job components by decreasing size
  • Order the clusters by decreasing number of idle
    processors
  • Traverse both lists one by one, trying to fit the
    job components onto the clusters (see the sketch
    after this list)
  • This leaves in each cluster as much room as
    possible for subsequent jobs
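
A compact sketch of this WorstFit placement step, under the assumption
that a component fits on a cluster whenever its size does not exceed
the cluster's idle processor count; the cluster sizes and the example
job are made up.

    #include <stdio.h>
    #include <stdlib.h>

    static int desc(const void *a, const void *b)  /* sort ints descending */
    {
        return *(const int *)b - *(const int *)a;
    }

    /* WorstFit: place the largest component on the cluster with the most
       idle processors, the next largest on the next emptiest cluster, and
       so on. Returns 1 if every component fits, 0 otherwise. */
    static int worstfit(int *components, int ncomp, int *idle, int nclusters)
    {
        if (ncomp > nclusters)
            return 0;
        qsort(components, ncomp, sizeof(int), desc);
        qsort(idle, nclusters, sizeof(int), desc);
        for (int i = 0; i < ncomp; i++) {
            if (components[i] > idle[i])
                return 0;                 /* component does not fit       */
            idle[i] -= components[i];     /* leaves maximal room elsewhere */
        }
        return 1;
    }

    int main(void)
    {
        int idle[] = { 12, 30, 20 };      /* idle processors per cluster */
        int job[]  = { 8, 16, 4 };        /* component sizes of one job  */
        int ok = worstfit(job, 3, idle, 3);
        printf("job %s\n", ok ? "placed" : "must wait");
        return 0;
    }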

32
Scheduling
  • Invoked at job departures
  • A queue is enabled when the corresponding
    scheduler is allowed to start jobs from that
    queue. When a queue is enabled, the job at the
    head of the queue is scheduled if it fits
  • When a job departs, all or some of the non-empty
    queues are enabled
  • Enabled queues are repeatedly visited in some
    order
  • Which non-empty queues are enabled, and in what
    order they are visited, is defined by the
    scheduling policy

33
Scheduling Policies
  • GS: global scheduler policy with a single queue
  • LS: each cluster has only local queues. At a job
    departure, in which order should the non-empty
    queues be enabled?
  • The local schedulers that have not scheduled jobs
    for the longest time get the first chance
  • For systems with both a global queue and local
    queues:
  • GP: global priority. Local queues are enabled
    only when the global queue is empty
  • LP: local priority. The global queue is enabled
    only when at least one local queue is empty. In
    which order should the local queues and the
    global queue be enabled?
  • The global queue is enabled first and then the
    local queues (a sketch of the enabling step
    follows this list)
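
A sketch, under simplifying assumptions rather than the paper's
implementation, of which queues GP and LP would enable at a job
departure; queue contents are represented only by their lengths and
all values are made up.

    #include <stdio.h>

    #define NCLUSTERS 3

    /* Queue lengths stand in for the real queues; 0 means empty. */
    static int global_q = 2;
    static int local_q[NCLUSTERS] = { 1, 0, 3 };

    /* Called at a job departure.
       GP: local queues are enabled only when the global queue is empty.
       LP: the global queue is enabled only when at least one local queue
           is empty, and it is visited before the local queues. */
    static void on_departure(int lp_policy)
    {
        int some_local_empty = 0;
        for (int i = 0; i < NCLUSTERS; i++)
            if (local_q[i] == 0)
                some_local_empty = 1;

        int enable_global = lp_policy ? (some_local_empty && global_q > 0)
                                      : (global_q > 0);
        int enable_locals = lp_policy ? 1 : (global_q == 0);

        if (enable_global)
            printf("  enable and visit the global queue\n");
        if (enable_locals)
            for (int i = 0; i < NCLUSTERS; i++)
                if (local_q[i] > 0)
                    printf("  enable and visit local queue %d\n", i);
    }

    int main(void)
    {
        printf("GP policy:\n");
        on_departure(0);
        printf("LP policy:\n");
        on_departure(1);
        return 0;
    }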

34
Coallocation Rules
  • no: only single-component jobs are admitted; no
    coallocation
  • co: both single- and multi-component jobs are
    admitted; no restrictions
  • rco: restriction on the sizes of the job
    components
  • fco: restriction on the sizes and the number of
    job components

35
Testbed
  • DAS system in the Netherlands: 5 clusters, 1 with
    72 nodes, the others with 32 nodes
  • Intra-cluster communication: Myrinet LAN (1200
    Mbit/s)
  • Inter-cluster communication: 100 Mbit/s WAN

36
Evaluation
  • 2 applications:
  • Ensflow: simulating streams and eddies in the
    ocean
  • Poisson: solution of the 2-D Poisson equation

Execution times were measured on the DAS
37
Results
38
Conclusions
  • co gives the worst performance, due to the
    simultaneous presence of large single-component
    jobs and jobs with many components
  • rco and fco improve performance
  • LS and LP give the best results for the
    coallocation cases
  • The performance of GS is better when there are
    only single-component jobs

39
Conclusions
  • Processor coallocation is beneficial, at least
    when the overhead due to wide-area communication
    is not high
  • Restrictions on the job component sizes and on
    the number of job components improve the
    performance of coallocation

40
Reference
  • Scheduling Policies for Processor Coallocation in
    MultiCluster Systems. Bucur and Epema. TPDS. July
    2007.

41
Grid Application Development Software (GrADS)
Architecture
(Architecture diagram: the user supplies the matrix size and block
size to the Grid Routine / Application Manager; the Resource Selector
queries MDS and NWS; resource and problem characteristics are passed
to the Performance Modeler, which returns the final schedule, a subset
of the resources.)
42
Performance Modeler
(Diagram: the Grid Routine / Application Manager passes all resources
and the problem parameters to the Performance Modeler; inside it, a
Scheduling Heuristic exchanges candidate resources and execution costs
with a Simulation Model and returns the final schedule, a subset of
the resources.)
The scheduling heuristic passes on only those candidate schedules that
have sufficient memory; this is determined by calling a function in
the simulation model.
43
Simulation Model
  • Simulation of the ScaLAPACK right-looking LU
    factorization
  • More about the application:
  • Iterative: each iteration corresponds to a block
  • A parallel application in which the columns are
    block-cyclically distributed
  • Right-looking LU based on Gaussian elimination

44
Operations
  • In each iteration, the LU application involves:
  • Block factorization: (ib x n, ib x ib) floating
    point operations
  • Broadcast for the multiply: the message size is
    approximately n x block_size elements
  • Each process does its own multiply
  • The remaining columns are divided over the number
    of processors

45
Back to the simulation model
  double getExecTimeCost(int matrix_size, int block_size,
                         candidate_schedule)
  {
      for (i = 0; i < number_of_blocks; i++) {
          /* Find the proc. owning this column block; note its speed
             and its connections to the other procs. */

          tfact = ... ;   /* simulate the block factorization: depends on
                             processor_speed, machine_load and the flop
                             count of the factorization */

          tbcast = max(bcast times for each proc.);
                          /* ScaLAPACK follows a split-ring broadcast;
                             simulate the broadcast algorithm for each
                             proc.; depends on the number of matrix
                             elements to be broadcast, the connection
                             bandwidth and the latency */

          tupdate = max(matrix multiplies across all procs.);
                          /* depends on the flop count of the matrix
                             multiply, processor speed and load */
      }
      return (tfact + tbcast + tupdate);
  }

46
Initial GrADS Architecture
(Architecture diagram: the user supplies the matrix size and block
size to the Grid Routine / Application Manager; the Resource Selector
queries MDS and NWS for resource characteristics; the Performance
Modeler produces the final schedule from the problem and resource
characteristics; the App Launcher starts the Application with the
problem parameters, application location and final schedule; a
Contract Monitor observes the running application.)
47
Performance Model Evaluation
48
GrADS Benefits
(Chart: problem sizes solved using the combined MSC and TORC clusters,
e.g. 8 mscs + 8 torcs or 8 mscs + 7 torcs, versus the MSC cluster
alone.)
Even though performance worsened when using multiple clusters, larger
problem sizes can be solved without incurring costly disk accesses