Title: Juli Rew
1Scheduler Basics
- Juli Rew
- CISL User Forum
- May 19, 2005
2Overview
- IBM Scheduling
- Life of a Job
- Submit Filter
- Batch Priority Scheduler
- Factors Affecting BPS Job Scheduling
- LoadLeveler
- Load Sharing Facility Scheduling
- LSF Scheduling on Linux Systems
- Differences from IBM Scheduling
3IBM Scheduling Life of a Job
- llsubmit job
- Submit Filter: Requirements Processing
- BPS: Job Ordering
- LoadLeveler: Job Execution
- Build ordered list of jobs
- Job starts
- Requirements not met: reject job (done)
- Requirements problem: staff rejects job (done)
- Job completes (done)
4Submit Filter Features
- Checks the LoadLeveler job script for
- - valid parameters
- - valid queue name
- - consistent combinations of features, e.g., shared/not_shared, tasks_per_node/node options
- Moves jobs with allocation holds to hold queues
- Moves jobs with cutoff projects to standby queue
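The checks listed on this slide can be illustrated with a minimal Python sketch. It is not the NCAR submit filter: the queue list, the function names, and both consistency rules are assumptions made for illustration. It only assumes that LoadLeveler directives appear in the job script as `# @ keyword = value` lines.

```python
import re

# Hypothetical queue list for illustration only; the real filter's list
# is not given in the slides.
VALID_QUEUES = {"com_sp32", "com_pr32", "csl_sp32", "csl_pr32", "debug", "share"}

# Matches directive lines of the form "# @ keyword = value".
DIRECTIVE = re.compile(r"^#\s*@\s*(\w+)\s*=\s*(.+?)\s*$")


def parse_directives(script_text):
    """Collect '# @ keyword = value' lines into a dict (last value wins)."""
    directives = {}
    for line in script_text.splitlines():
        match = DIRECTIVE.match(line)
        if match:
            directives[match.group(1).lower()] = match.group(2)
    return directives


def check_job(script_text):
    """Return a list of problems; an empty list means the job passes."""
    d = parse_directives(script_text)
    problems = []

    # Valid queue name (LoadLeveler calls the queue a "class").
    queue = d.get("class")
    if queue is None:
        problems.append("no queue (class) specified")
    elif queue not in VALID_QUEUES:
        problems.append(f"unknown queue: {queue}")

    # Assumed rule: node_usage, if present, must be shared or not_shared.
    usage = d.get("node_usage")
    if usage is not None and usage not in ("shared", "not_shared"):
        problems.append(f"invalid node_usage: {usage}")

    # Assumed rule: tasks_per_node is only accepted together with node.
    if "tasks_per_node" in d and "node" not in d:
        problems.append("tasks_per_node given without node")

    return problems
```

A script that names a known queue and gives both `node` and `tasks_per_node` returns an empty list. Any non-empty list corresponds to the "Requirements Not Met" rejection path.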
5Batch Priority Job Scheduler Features
- Written at NCAR
- Orders jobs based on policy
- Creates separate facilities (Community, Climate System Laboratory)
- Further separates jobs into proposal groups (NCAR/UNIV, CCSM/oCSL)
- Hands the final order list to LoadLeveler
- Allows for backfilling of jobs to avoid idle resources
6Bluesky Queue Priorities
all_spec    all_spec    all_spec    all_spec
all_sp32    all_sp32    all_sp8     all_sp8
CSL         COM         CSL         COM
NCAR UNIV   NCAR UNIV   CCSM oCSL   CCSM oCSL
csl_sp32    com_sp32    csl_sp8     com_sp8
csl_pr32    com_pr32    csl_pr8     com_pr8
..          ..          ..          ..
csl_sb32    com_sb32    csl_sb8     com_sb8
interactive, debug, share, test (all four columns)
7Prioritization of Jobs by BPS
- all_spec jobs run with the highest priority and can access all nodes
- Below that, all com and csl jobs divided equally
- Round Robin by Group/User
    ------------------
         all_spec
    ------------------
      com        csl
        \        /
         top job
- The 50-50 split is not a hard limit
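The ordering policy on this slide can be sketched in Python as follows. This is a simplified illustration, not the BPS code. The job fields (`id`, `user`, `facility`) are hypothetical, round robin is done by user only, and the choice to start the alternation with com is arbitrary.

```python
from collections import OrderedDict, deque
from itertools import zip_longest


def round_robin_by_user(jobs):
    """Take one job per user in turn, preserving each user's submit order."""
    per_user = OrderedDict()
    for job in jobs:
        per_user.setdefault(job["user"], deque()).append(job)

    ordered = []
    while per_user:
        # Iterate over a copy of the keys so users can be removed safely.
        for user in list(per_user):
            ordered.append(per_user[user].popleft())
            if not per_user[user]:
                del per_user[user]
    return ordered


def order_jobs(jobs):
    """all_spec jobs first, then com and csl jobs interleaved one-for-one."""
    top = [j for j in jobs if j["facility"] == "all_spec"]
    com = round_robin_by_user([j for j in jobs if j["facility"] == "com"])
    csl = round_robin_by_user([j for j in jobs if j["facility"] == "csl"])

    merged = []
    # Alternating gives an even split while both sides have work; when one
    # side runs out, the other simply continues (the split is soft).
    for com_job, csl_job in zip_longest(com, csl):
        if com_job is not None:
            merged.append(com_job)
        if csl_job is not None:
            merged.append(csl_job)
    return top + merged
```

In this sketch a user with many queued jobs cannot crowd out other users in the same facility, and an empty csl queue does not leave com jobs waiting.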
8Other Factors Affecting Job Scheduling
- Backfilling: Jobs that will not interfere with start of highest priority job allowed to slip in
- - Sweet spot: less than 3 hours and small node count
- Allocation Holds: Job flagged if a project/division exceeds its 30-day or 90-day allocation thresholds
- - H1 and H2 jobs reordered at a priority above standby but below non-flagged jobs
- Special Initiatives: Nodes reserved for real-time or other special runs
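The backfilling rule can be illustrated with a small Python sketch. It is a conservative simplification and not the BPS or LoadLeveler algorithm. It assumes every currently idle node will be needed by the highest priority job at a known start time, and the job fields (`id`, `nodes`, `wallclock_hours`) are hypothetical.

```python
def can_backfill(job, free_nodes, now, top_job_start):
    """A job may slip in if it fits in the idle nodes and is guaranteed
    (by its wallclock limit) to end before the top job needs them."""
    fits = job["nodes"] <= free_nodes
    finishes_in_time = now + job["wallclock_hours"] <= top_job_start
    return fits and finishes_in_time


def pick_backfill(queue, free_nodes, now, top_job_start):
    """Walk the queue in priority order and return the ids of jobs to backfill."""
    chosen = []
    for job in queue:
        if can_backfill(job, free_nodes, now, top_job_start):
            chosen.append(job["id"])
            free_nodes -= job["nodes"]  # these nodes are no longer idle
    return chosen
```

Under these assumptions, short jobs with small node counts pass both tests most often, which is the sweet spot noted above.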
9Documentation and Utilities
- batchview command gives snapshot of current ordering
- Basic information on scheduling given at
- http://www.scd.ucar.edu/docs/ibm/ref/llsched.html
10LoadLeveler
- IBM's batch job control system
- Allows jobs to be started, stopped, or cancelled
- Controls allocation of resources (CPU, memory)
- Allows custom scheduler plug-in (e.g., BPS)
- Two mutually exclusive options: LoadLeveler scheduler or custom scheduler
11Load Sharing Facility
- Commercial product from Platform Computing
- Currently being used on major Linux platforms
- Also available for IBM, but still in evaluation
- Ability to do Hierarchical Fair-Share Scheduling with Backfill, based on same facility scheme used in BPS
- Community/CSL facility division implemented implicitly within the scheduler rather than explicitly by queue name
- Can schedule among multiple platforms - "Grid
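The idea behind hierarchical fair-share selection can be sketched in Python. This is a generic simplification and not LSF's priority formula. The tree layout (facilities at the top level, proposal groups beneath them), the share values, and the field names are all assumptions for illustration.

```python
def pick_leaf(node):
    """Descend from the root, at each level choosing the child that is
    furthest below its share (lowest usage-to-share ratio)."""
    while "children" in node:
        node = min(node["children"], key=lambda c: c["usage"] / c["share"])
    return node["name"]


def charge(tree, leaf_name, amount):
    """Add usage to the named leaf and to every ancestor; True if found."""
    if tree["name"] == leaf_name and "children" not in tree:
        tree["usage"] += amount
        return True
    for child in tree.get("children", []):
        if charge(child, leaf_name, amount):
            tree["usage"] += amount
            return True
    return False


# Hypothetical share tree: two facilities with equal shares, each split
# into two proposal groups.
tree = {
    "name": "root", "share": 1, "usage": 40,
    "children": [
        {"name": "Community", "share": 50, "usage": 10, "children": [
            {"name": "NCAR", "share": 1, "usage": 8},
            {"name": "UNIV", "share": 1, "usage": 2},
        ]},
        {"name": "CSL", "share": 50, "usage": 30, "children": [
            {"name": "CCSM", "share": 1, "usage": 20},
            {"name": "oCSL", "share": 1, "usage": 10},
        ]},
    ],
}
```

Because usage is charged at every level of the tree, the facility division is enforced by the share values themselves, with no need to encode it in queue names.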
12Questions?