Title: 6d'1
1Schedulers and Resource Brokers
2Scheduler
- Job manager submits jobs to scheduler.
- Scheduler assigns work to resources to achieve
specified time requirements.
3Scheduling
- From "Introduction to Grid Computing with
Globus," IBM Redbooks
4Why scheduling?
- Efficient use of Grid resources requires powerful
and flexible Grid scheduling - For Grid technology to be successful, there must
automatic features to determine available Grid
resources and to coordinate the allocation of
these resources in accordance with the
requirements, dependencies, and objectives of the
user. - From GGF7-workshop on Grid scheduling
5Scheduling architecture
- Is a current area of study for Grid
- Eventually, there must be a definition of a
scheduling architecture - The cooperation of different scheduling instances
for arbitrary resources available in the grid - The interaction of local resource management
(e.g., PBS, LoadLeveler) and data management
6Service Level Agreements
- Resource SLA, (RSLA) i.e., reservation
- A promise that a resource will be available when
it is needed - The client will utilize the promise in subsequent
SLAs - Task SLA, (TSLA) i.e., execution
- A promise to perform a task
- There may be complex task requirements and may
reference an RSLA implicitly
7SLAs, continued
- Binding SLA (BSLA), i.e., a claim
- Binds a resource capability to a TSLA
- May reference an RSLA, or be implicit
- May be created lazily to provision the task
8Advance Reservation
- Requesting actions at times in future. (A
service level agreement in which the conditions
of the agreement start at some agreed-upon time
in the future 2) - 2 The Grid 2, Blueprint for a New Computing
Infrastructure, I. Foster and C. Kesselman
editors, Morgan Kaufmann, 2004.
9Resource Broker
- A scheduler that optimizers the performance of a
particular resource. Performance may be measured
by such criteria as fairness (to ensure that all
requests for the resources are satisfied) or
utilization (to measure the amount of the
resource used). 2
10Community Scheduling
- Individual users
- Require service
- Have application goals
- Community schedulers
- Broker service
- Aggregate scheduling
- Individual resources
- Provide service
- Have policy autonomy
- Serve higher-level layers
11Scheduling in Globus (not)
- Fully-fledged scheduler/resource broker not in
Globus. - For example, Globus does not currently have
advance reservation. - Scheduler/resource broker need to be provided
separately on top of Globus, using basic services
provided in Globus.
12- Resource Broker Examples
- Condor-G, Nimrod/G, Grid Canada
13Condor
- System first developed at University of
Wisconsin-Madison in mid 1980s to convert a
collection of distributed workstations and
clusters into a high-throughput computing
facility. - Key concept - using wasted computer power of idle
workstations.
14Condor
- Converts collections of distributed workstations
and dedicated clusters into a distributed
high-throughput computing facility.
15Features
- Include
- Resource finder
- Batch queue manager
- Scheduler
- Checkpoint/restart
- Process migration
16- Intended to run job even if
- Machines crash
- Disk space exhausted
- Software not installed
- Machines are needed by others
- Machines are managed by others
- Machines are far away
17Uses
- Consider following scenario
- I have a simulation that takes two hours to run
on my high-end computer - I need to run it 1000 times with slightly
different parameters each time. - If I do this on one computer, it will take at
least 2000 hours (or about 3 months)
From Condor What it is and why you should
worry about it, by B. Beckles, University of
Cambridge, Seminar, June 23, ,2004
18- Suppose my department has 100 PCs like mine that
are mostly sitting idle overnight (say 8 hours a
day) - If I could use them when their legitimate users
are not using them, so that I do not
inconvenience them, I could get about 800 CPU
hours/day. - This is an ideal situation for Condor.
- I could do my simulations in 2.5 days.
From Condor What it is and why you should
worry about it, by B. Beckles, University of
Cambridge, Seminar, June 23, ,2004
19How does Condor work?
- A collection of machines running Condor called a
pool. - Individual pools can be joined together in a
process called flocking.
From Condor What it is and why you should
worry about it, by B. Beckles, University of
Cambridge, Seminar, June 23, ,2004
20Machine Roles
- Machines have one or more of four roles
- Central manager
- Submit machine (Submit host)
- Execution machine (Execute host)
- Checkpoint server
21Central Manager
- Resource broker for a pool. Keeps track of which
machines are available, what jobs are running,
negotiates which machine will run which job, etc. - Only one central manager per pool.
22Submit Machine
- Machine which submits jobs to pool.
- Must be at least one submit machine in a pool,
and usually more than one.
23Execute Machine
- Machine on which jobs can be run.
- Must be at least one execute machine in a pool,
and usually more than one.
24Checkpoint Server
- Machine which stores al checkpoint files produced
by job which checkpoint. - Can only be one checkpoint machine in a pool.
- Optional to have a checkpoint machine.
25Possible Configuration
- A central manager.
- Some machine that can only be submit hosts.
- Some machine that can be only execute hosts.
- Some machines that can be both submit and execute
hosts.
26(No Transcript)
27Submitting a job
- Job submitted to submit host
- Submit host tells the central ,manager about job
using Condors ClassAd Mechanism which may
include - What it requires
- What it desires
- What it prefers, and
- What it will accept
28- 1. Central manager monitoring execute hosts so
knows what is available and what type of machines
each execute host is, and software. - 2. Execute hosts periodically send a ClassAd
describing themselves to the central manager.
29- 3. At times, the central manager enters a
negotiation cycle where it matches waiting jobs
with available execute hosts. - 4. Eventually job is matched with a suitable
execute host (hopefully) .
30- 5. Central manager informs chosen execute host
that is has been claimed and gives it a ticket. - 6. Central manage informs submit host which
execute host to use and gives it a matching
ticket.
31- 7. Submit host contacts execute host presenting
its matching ticket and transfers jobs
executable and date files to execute host if
necessary. (shared file system also possible.) - 8. When job finished, results returned to submit
host (unless shared file system in use between
submit and execute hosts).
32- Connections
- Connection between submit and execute host
usually done with a TCP connection. - If connection dies, job resubmitted to Condor
pool. - Some jobs might access files and resources on
submit host via remote procedure calls.
33Checkpointing
- Certain jobs can checkpoint, both periodically
for safety and when interrupted. - If checkpointed job interrupted, it will resume
at the last checkpointed state when it starts
again. - Generally no change to source code - need to link
Condors Standard Universe support library (see
later).
34Types of Jobs
- Classified according to environment it provides.
Currently seven environments - Standard
- Vanilla
- PVM
- MPI
- Globus
- Java
- Scheduler
35Standard
- For jobs compiled with Condor libraries
- Allows for checking pointing and remote system
calls. - Must be single threaded.
- Not available under Windows.
36Vanilla
- For jobs that cannot be compiled with Condor
libraries, and for shell scripts and Windows
batch files. - No checkpointing or remote system calls.
37- Job Universes continued
- PVM
- For PVM programs.
- MPI
- For MPI programs (MPICH).
- Globus
- For submitting jobs to resources managed by
Globus (version 2.2 and higher).
38- Java
- For Java programs (written for Java Virtual
Interface). - Scheduler
- A universe not normally used by end-user.
Ignores any requirements and runs job on submit
host. Never preempted.
39Directed Acyclic Graph Manager (DAGMan)
- Allows one to specify dependencies between Condor
Jobs. - Example
- Do not run Job B until Job A completed
successfully - Especially important to jobs working together
(as in Grid computing).
40Directed Acyclic Graph(DAG)
- A data structure used to represent dependencies.
- Each job is a node in the DAG.
- Each node can have any number of parents and
childred as long as there are no loops (Acyclic
graph).
41Defining a DAG
- DAG defined by a .dag file, listing each of the
nodes and their dependencies - Example
diamond.dag Job A a.sub Job B b.sub Job C
c.sub Job D d.sub Parent A Child B C Parent B C
Child D
Job A
Job C
Job B
Job D
42Running a DAG
- DASGMan acts as a scheduler managing the
submission of jobs to Condor based upon DAG
dependencies. - DAGMan holds and submits jobs to Condor queue at
appropriate times.
43Job Failures
- DAGMan continues until it cannot make progress
and then creates a rescue file holding current
state of DAG. - When failed job ready to re-run, rescue file used
to restore prior state of DAG.
44ClassAd Matchmaking
- Used to ensure job done according to constraints
of users and owners. - Example of user constraints
- I need a Pentium IV with at least 512 Mbytes of
RAM and speed of at least 3.5 Ghz - Example of machine owner constraints
- Never run jobs owned by Fred
45Condor Submit Description File
Describes job to Condor. Used with Condor _submit
command. Description File Example
- This is a comment, condor submit file
- Universe vanilla
- Executable /home/abw/condor/myProg
- Input myProg.stdin
- Output myProg.stdout
- Error myProg.stderr
- Arguments -arg1 -arg2
- InitialDir /home/abw/condor/assignment4
- Queue
46Submitting Multiple Jobs
- Submit file can specify multiple jobs
- Queue 500 will submit 500 jobs at once
- Condor calls groups of jobs a cluster
- Each job within cluster called a process
- Condor job ID is the cluster number, a period and
process number, for example 26.2 - Single jobs also a cluster but with a single
process (process 0)
47Specifying Requirements
- A C/Java-like Boolean expression that evaluates
to TRUE for a match. - This is a comment, condor submit file
- Universe vanilla
- Executable /home/abw/condor/myProg
- InitialDir /home/abw/condor/assignment4
- Requirements Memory gt 512 Disk gt 10000
- queue 500
48Summary of Key Condor Features
- High throughput computing using an
opportunitistic environment. - Matchmaking
- Checkpointing
- DAG scheduling
49Condor-G
- Grid enabled version of Condor.
- Uses Globus Toolkit for
- Security (GSI)
- managing remote jobs on grid (GRAM)
- file handling and remote I/O (GSI-FTP)
50Remote execution by Condor-G on Globus-managed
resources
FromCondor-G A Computation Management Agent for
Multi-Institutional Grids by J. Frey, T.
Tannenbaum, M. Livny, I. Foster and S. Tuecke.
Figure probably refers to Globus version 2.
51More Information