Title: 6d.1
1Schedulers and Resource Brokers
ITCS 4010 Grid Computing, 2005, UNC-Charlotte, B.
Wilkinson.
2Scheduler
- Job manager submits jobs to scheduler.
- Scheduler assigns work to resources to achieve
specified time requirements.
3Scheduling
- From "Introduction to Grid Computing with
Globus," IBM Redbooks
4Executing GT 4 jobs
- Globus has two modes
- Interactive/interactive-streaming
- Batch
5GT 4 Fork Scheduler
- GT 4 comes with a fork scheduler which attempts
to execute the job immediately.
- Provided for starting and controlling a job on a
local host if the job does not need any special
software loaded or have other requirements.
- Other schedulers have to be added separately,
using an adapter.
6Batch scheduling
- Batch, a term from old computing days, when one
submitted a pack of punched cards as the program
to a computer and one would come back after the
program had been run on the computer, maybe
overnight.
7Relationship between GT4 GRAM and a Local
Scheduler
[Figure: diagram labeled Client, User job, GT4 Java Container, GRAM services, GRAM adapter, Local scheduler (various possible), Compute element, Local job control, Job functions. Credit: I. Foster]
8Scheduler adapters included in GT 4
- PBS (Portable Batch System)
- Condor
- LSF (Load Sharing Facility)
- Third party adapter provided for
- SGE (Sun Grid Engine)
9Meta-schedulers
- Loosely defined as a higher level scheduler that
can schedule jobs between sites.
10(Local) Scheduler Issues
- Distribute job
- Based on load and characteristics of machines,
available disk storage, network characteristics,
etc.
- Both globally and locally.
- Runtime scheduling!
- Arrange data in right place (Staging)
- Data Replication and movement as needed
- Data Error checking
11Scheduler Issues (continued)
- Performance
- Error checking, checkpointing
- Monitoring job, progress monitoring
- QOS (Quality of service)
- Cost (an area considered by Nimrod-G)
- Security
- Need to authenticate and authorize remote user
for job submission
- Fault Tolerance
12Batch Scheduling policies
- First-in, First-out
- Favor certain types of jobs
- Shortest job first
- Smallest (or largest) memory first
- Short (or long) running job first
- Fair sharing or priority to certain users
- Dynamic policies
- Depending upon time of day and load
- Custom, preemptive, process migration
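A few of the static policies above can be sketched as different sort keys over the same queue. This is only an illustration, not how any particular scheduler implements them; the Job fields (arrival, est_runtime, memory_mb) and the sample jobs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: int      # submission order (hypothetical field)
    est_runtime: int  # estimated runtime in minutes (hypothetical field)
    memory_mb: int    # requested memory in MB (hypothetical field)


def fifo(jobs):
    """First-in, first-out: order by submission time."""
    return sorted(jobs, key=lambda j: j.arrival)


def shortest_job_first(jobs):
    """Shortest job first: order by estimated runtime, ties by arrival."""
    return sorted(jobs, key=lambda j: (j.est_runtime, j.arrival))


def smallest_memory_first(jobs):
    """Smallest memory first: order by requested memory, ties by arrival."""
    return sorted(jobs, key=lambda j: (j.memory_mb, j.arrival))


# Made-up queue used only to show that the policies disagree.
queue = [
    Job("sim", 0, 120, 4096),
    Job("render", 1, 15, 8192),
    Job("report", 2, 45, 512),
]

fifo_order = [j.name for j in fifo(queue)]
sjf_order = [j.name for j in shortest_job_first(queue)]
mem_order = [j.name for j in smallest_memory_first(queue)]
```

The same three jobs come out in a different order under each policy, which is why the choice of policy matters to users waiting in the queue.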
13Advance Reservation
- Requesting actions at times in future.
- A service level agreement in which the
conditions of the agreement start at some
agreed-upon time in the future. [2]
[2] The Grid 2: Blueprint for a New Computing
Infrastructure, I. Foster and C. Kesselman,
editors, Morgan Kaufmann, 2004.
14Resource Broker
- A scheduler that optimizes the performance of a
particular resource. Performance may be measured
by such criteria as fairness (to ensure that all
requests for the resources are satisfied) or
utilization (to measure the amount of the
resource used). [2]
15- Scheduler/Resource Broker Examples
- Schedulers/Resource Brokers available that work
with Globus
- Condor/Condor-G
- Sun Grid Engine
- To be covered by James Ruff and to be used in
Assignment 4 this year.
16Condor
- First developed at University of
Wisconsin-Madison in mid 1980s to convert a
collection of distributed workstations and
clusters into a high-throughput computing
facility.
- Key concept: using wasted computer power of idle
workstations.
17Condor
- Converts collections of distributed workstations
and dedicated clusters into a distributed
high-throughput computing facility.
18Features
- Include
- Resource finder
- Batch queue manager
- Scheduler
- Checkpoint/restart
- Process migration
19- Intended to complete job even if
- Machines crash
- Disk space exhausted
- Software not installed
- Machines are needed by others
- Machines are managed by others
- Machines are far away
20Uses
- Consider following scenario
- I have a simulation that takes two hours to run
on my high-end computer.
- I need to run it 1000 times with slightly
different parameters each time.
- If I do this on one computer, it will take at
least 2000 hours (or about 3 months).
From "Condor: What it is and why you should
worry about it," by B. Beckles, University of
Cambridge, Seminar, June 23, 2004
21- Suppose my department has 100 PCs like mine that
are mostly sitting idle overnight (say 8 hours a
day).
- If I could use them when their legitimate users
are not using them, so that I do not
inconvenience them, I could get about 800 CPU
hours/day.
- This is an ideal situation for Condor.
- I could do my simulations in 2.5 days.
From "Condor: What it is and why you should
worry about it," by B. Beckles, University of
Cambridge, Seminar, June 23, 2004
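The arithmetic behind this scenario can be checked in a few lines. The sketch assumes every PC is as fast as mine and that there is no scheduling or transfer overhead; the last step (whole runs per night) is an added refinement, not part of the original estimate.

```python
import math

runs = 1000              # simulations to run
hours_per_run = 2        # runtime of one simulation on one PC
pcs = 100                # idle departmental PCs
idle_hours_per_day = 8   # overnight idle window per PC

total_cpu_hours = runs * hours_per_run                 # 2000 CPU hours
single_machine_days = total_cpu_hours / 24             # roughly 83 days, about 3 months
pool_cpu_hours_per_day = pcs * idle_hours_per_day      # 800 CPU hours/day
pool_days = total_cpu_hours / pool_cpu_hours_per_day   # 2.5 days

# Refinement (assumption): a run must fit inside one idle window,
# so count whole runs per PC per night and round up to whole nights.
runs_per_night = pcs * (idle_hours_per_day // hours_per_run)  # 400 runs
nights_needed = math.ceil(runs / runs_per_night)              # 3 nights
```

Either way the pool finishes in days rather than months, which is the point of the scenario.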
22- The Condor high-throughput computing system
- Condor-G agent for grid computing
23HTCS
- Distributed batch computing system
- Provide large amounts of fault-tolerant computing
power.
- Opportunistic computing.
- Effectively utilizing all resources on the
network
- Scavenger (but polite!)
24Tools
- ClassAds: flexible framework for matching
resource requests and providers.
- Job checkpoint and migration
- Remote system calls.
- Redirect I/O back to local machine
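The matching idea behind ClassAds can be sketched in Python: both the job and the machine advertise attributes plus a requirements test, and a match needs both sides to accept. This is not the ClassAd language or Condor's matchmaker; the dictionaries, attribute names and match function are hypothetical stand-ins.

```python
# Hypothetical machine advertisements: attributes plus the owner's policy.
machine_ads = [
    {"name": "pc01", "os": "LINUX", "memory_mb": 2048, "idle": True,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "pc02", "os": "WINDOWS", "memory_mb": 4096, "idle": True,
     "requirements": lambda job: True},
    {"name": "pc03", "os": "LINUX", "memory_mb": 8192, "idle": False,
     "requirements": lambda job: True},
]


def make_job_ad(owner):
    """Build a hypothetical job advertisement for the given owner."""
    return {
        "owner": owner,
        # The job only accepts idle Linux machines with enough memory.
        "requirements": lambda m: (m["os"] == "LINUX"
                                   and m["memory_mb"] >= 1024
                                   and m["idle"]),
        # Among acceptable machines, prefer the one with the most memory.
        "rank": lambda m: m["memory_mb"],
    }


def match(job, machines):
    """Return the best machine that both sides accept, or None."""
    candidates = [m for m in machines
                  if job["requirements"](m) and m["requirements"](job)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


best_for_alice = match(make_job_ad("alice"), machine_ads)
best_for_guest = match(make_job_ad("guest"), machine_ads)
```

Here alice's job lands on pc01, while the guest job finds no match because the only suitable machine's owner refuses guest jobs. The two-sided check is what lets each owner enforce a local policy.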
25Condor-G
- Globus contributes protocols for secure
inter-domain communications and standardized
access to remote batch systems.
- Condor provides everything else
27Condor Core Components
28- User submits job to agent
- Agent keeps the job and finds resources willing
to run it.
- Agents and resources advertise themselves to a
matchmaker.
- Analogy: E-harmony.com, 29 dimensions of compatibility.
- Agent contacts resource
29- Agent creates a shadow, which provides all details
necessary to run the job.
- Resource creates a sandbox, a safe execution
environment for the job and the resource.
- All independent and individually responsible for
enforcing their owners' policies.
- This led to Condor Pools
31Direct Flocking (Multiple pools)
32Globus
- To develop worldwide Grid, needed uniform
interface for batch execution.
- Grid Resource Allocation and Management protocol
(GRAM).
- Provides abstraction for remote process queuing
and execution (with security and GridFTP).
- Globus provides a server that speaks GRAM and
converts its commands into a form understood by
local schedulers.
33- GRAM does not
- Remember what jobs have been submitted, where
they are, what they are doing.
- Analyze job failure and resubmit
- Provide queuing, prioritization, logging,
accounting.
- Decouple resource allocation and job execution.
34- Agent must direct a particular job, executable
image and all, to a particular queue.
- Gosh, what if there is a backlog and no
reasonably available resources?
35- Condor adapted its standard agent to speak GRAM and
uses its own middleware.
- Gliding
37- Directed Acyclic Graph Manager (DAGMan)
- Meta-scheduler
- Allows one to specify dependencies between Condor
Jobs.
38- Example
- Do not run Job B until Job A has completed
successfully.
- Especially important to jobs working together (as
in Grid computing).
39Directed Acyclic Graph (DAG)
- A data structure used to represent dependencies.
- Directed graph.
- No cycles.
- Each job is a node in the DAG.
- Each node can have any number of parents and
children as long as there are no loops (Acyclic
graph).
40DAG
- Do job A.
- Do jobs B and C after job A finished.
- Do job D after both jobs B and C finished.
41Defining a DAG
- Defined by a .dag file, listing each of the nodes
and their dependencies.
- Each job statement has an abstract job name
(say A) and a file (say a.condor)
- PARENT-CHILD statement describes relationship
between two or more jobs
- Other statements available.
42 diamond.dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D
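To show what these statements imply for execution order, here is a minimal Python sketch that reads only the Job and Parent/Child lines of the diamond example and groups nodes into waves that could be submitted together. It is not DAGMan's parser; parse_dag and submission_waves are hypothetical helpers, and the ordering relies on the standard library's graphlib (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

DIAMOND_DAG = """\
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (submit_files, parents) from Job and Parent/Child lines only."""
    submit_files = {}
    parents = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        keyword = tokens[0].upper()
        if keyword == "JOB":
            name, submit_file = tokens[1], tokens[2]
            submit_files[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "PARENT":
            # Everything between Parent and Child is a parent node.
            split = [t.upper() for t in tokens].index("CHILD")
            for child in tokens[split + 1:]:
                parents.setdefault(child, set()).update(tokens[1:split])
    return submit_files, parents


def submission_waves(parents):
    """Group nodes into waves; each wave only depends on earlier waves."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


submit_files, parents = parse_dag(DIAMOND_DAG)
waves = submission_waves(parents)
```

For the diamond, the waves are A, then B and C together, then D, which matches the description of the DAG on the previous slide.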
43Running a DAG
- DAGMan acts as a scheduler managing the
submission of jobs to Condor based upon DAG
dependencies.
- DAGMan holds and submits jobs to Condor queue at
appropriate times.
44Job Failures
- DAGMan continues until it cannot make progress
and then creates a rescue file holding current
state of DAG.
- When failed job ready to re-run, rescue file used
to restore prior state of DAG.
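The rescue idea can be illustrated with a small simulation: run every node whose parents have finished, stop when nothing more can start, and keep the set of finished nodes as the saved state. This is a sketch of the concept only, not DAGMan's algorithm or its rescue file format; run_dag and the failure of job C are hypothetical.

```python
def run_dag(parents, run_job, done=None):
    """Run nodes whose parents are done until no further progress is possible.

    Returns (done, failed); 'done' plays the role of the rescue state.
    """
    done = set(done or ())
    failed = set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:  # all parents completed successfully
                if run_job(node):
                    done.add(node)
                else:
                    failed.add(node)
                progress = True
    return done, failed


# Diamond DAG: each node maps to the set of its parents.
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First attempt: job C fails (hypothetical outcome), so D can never start.
first_done, first_failed = run_dag(parents, lambda node: node != "C")

# Re-run from the saved state once C is fixed: A and B are not repeated.
rerun_log = []


def fixed_job(node):
    rerun_log.append(node)
    return True


second_done, second_failed = run_dag(parents, fixed_job, done=first_done)
```

On the re-run only C and D execute, because the saved state already records A and B as complete.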
45Summary of Key Condor Features
- High throughput computing using an
opportunistic environment.
- Provides mechanisms for running jobs on remote
machines.
- Matchmaking
- Checkpointing
- DAG scheduling