PPT – 6d'1 PowerPoint presentation | free to view

About This Presentation

Title:

6d'1

Description:

The interaction of local resource management (e.g., PBS, LoadLeveler) ... Each job is ... DAG defined by a .dag file, listing each of the nodes and their ... – PowerPoint PPT presentation

Number of Views:31

Avg rating:3.0/5.0

Slides: 52

Provided by: barry199

Category:

Tags: areas | in | job | listing | local

more less

Transcript and Presenter's Notes

Title: 6d'1

1
Schedulers and Resource Brokers
2
Scheduler

Job manager submits jobs to scheduler.
Scheduler assigns work to resources to achieve
specified time requirements.

3
Scheduling

From "Introduction to Grid Computing with
Globus," IBM Redbooks

4
Why scheduling?

Efficient use of Grid resources requires powerful
and flexible Grid scheduling
For Grid technology to be successful, there must
automatic features to determine available Grid
resources and to coordinate the allocation of
these resources in accordance with the
requirements, dependencies, and objectives of the
user.
From GGF7-workshop on Grid scheduling

5
Scheduling architecture

Is a current area of study for Grid
Eventually, there must be a definition of a
scheduling architecture
The cooperation of different scheduling instances
for arbitrary resources available in the grid
The interaction of local resource management
(e.g., PBS, LoadLeveler) and data management

6
Service Level Agreements

Resource SLA, (RSLA) i.e., reservation
A promise that a resource will be available when
it is needed
The client will utilize the promise in subsequent
SLAs
Task SLA, (TSLA) i.e., execution
A promise to perform a task
There may be complex task requirements and may
reference an RSLA implicitly

7
SLAs, continued

Binding SLA (BSLA), i.e., a claim
Binds a resource capability to a TSLA
May reference an RSLA, or be implicit
May be created lazily to provision the task

8
Advance Reservation

Requesting actions at times in future. (A
service level agreement in which the conditions
of the agreement start at some agreed-upon time
in the future 2)
2 The Grid 2, Blueprint for a New Computing
Infrastructure, I. Foster and C. Kesselman
editors, Morgan Kaufmann, 2004.

9
Resource Broker

A scheduler that optimizers the performance of a
particular resource. Performance may be measured
by such criteria as fairness (to ensure that all
requests for the resources are satisfied) or
utilization (to measure the amount of the
resource used). 2

10
Community Scheduling

Individual users
Require service
Have application goals
Community schedulers
Broker service
Aggregate scheduling
Individual resources
Provide service
Have policy autonomy
Serve higher-level layers

11
Scheduling in Globus (not)

Fully-fledged scheduler/resource broker not in
Globus.
For example, Globus does not currently have
advance reservation.
Scheduler/resource broker need to be provided
separately on top of Globus, using basic services
provided in Globus.

Resource Broker Examples
Condor-G, Nimrod/G, Grid Canada

13
Condor

System first developed at University of
Wisconsin-Madison in mid 1980s to convert a
collection of distributed workstations and
clusters into a high-throughput computing
facility.
Key concept - using wasted computer power of idle
workstations.

14
Condor

Converts collections of distributed workstations
and dedicated clusters into a distributed
high-throughput computing facility.

15
Features

Include
Resource finder
Batch queue manager
Scheduler
Checkpoint/restart
Process migration

Intended to run job even if
Machines crash
Disk space exhausted
Software not installed
Machines are needed by others
Machines are managed by others
Machines are far away

17
Uses

Consider following scenario
I have a simulation that takes two hours to run
on my high-end computer
I need to run it 1000 times with slightly
different parameters each time.
If I do this on one computer, it will take at
least 2000 hours (or about 3 months)

From Condor What it is and why you should
worry about it, by B. Beckles, University of
Cambridge, Seminar, June 23, ,2004
18

Suppose my department has 100 PCs like mine that
are mostly sitting idle overnight (say 8 hours a
day)
If I could use them when their legitimate users
are not using them, so that I do not
inconvenience them, I could get about 800 CPU
hours/day.
This is an ideal situation for Condor.
I could do my simulations in 2.5 days.

From Condor What it is and why you should
worry about it, by B. Beckles, University of
Cambridge, Seminar, June 23, ,2004
19
How does Condor work?

A collection of machines running Condor called a
pool.
Individual pools can be joined together in a
process called flocking.

From Condor What it is and why you should
worry about it, by B. Beckles, University of
Cambridge, Seminar, June 23, ,2004
20
Machine Roles

Machines have one or more of four roles
Central manager
Submit machine (Submit host)
Execution machine (Execute host)
Checkpoint server

21
Central Manager

Resource broker for a pool. Keeps track of which
machines are available, what jobs are running,
negotiates which machine will run which job, etc.
Only one central manager per pool.

22
Submit Machine

Machine which submits jobs to pool.
Must be at least one submit machine in a pool,
and usually more than one.

23
Execute Machine

Machine on which jobs can be run.
Must be at least one execute machine in a pool,
and usually more than one.

24
Checkpoint Server

Machine which stores al checkpoint files produced
by job which checkpoint.
Can only be one checkpoint machine in a pool.
Optional to have a checkpoint machine.

25
Possible Configuration

A central manager.
Some machine that can only be submit hosts.
Some machine that can be only execute hosts.
Some machines that can be both submit and execute
hosts.

26
(No Transcript)
27
Submitting a job

Job submitted to submit host
Submit host tells the central ,manager about job
using Condors ClassAd Mechanism which may
include
What it requires
What it desires
What it prefers, and
What it will accept

1. Central manager monitoring execute hosts so
knows what is available and what type of machines
each execute host is, and software.
2. Execute hosts periodically send a ClassAd
describing themselves to the central manager.

3. At times, the central manager enters a
negotiation cycle where it matches waiting jobs
with available execute hosts.
4. Eventually job is matched with a suitable
execute host (hopefully) .

5. Central manager informs chosen execute host
that is has been claimed and gives it a ticket.
6. Central manage informs submit host which
execute host to use and gives it a matching
ticket.

7. Submit host contacts execute host presenting
its matching ticket and transfers jobs
executable and date files to execute host if
necessary. (shared file system also possible.)
8. When job finished, results returned to submit
host (unless shared file system in use between
submit and execute hosts).

Connections
Connection between submit and execute host
usually done with a TCP connection.
If connection dies, job resubmitted to Condor
pool.
Some jobs might access files and resources on
submit host via remote procedure calls.

33
Checkpointing

Certain jobs can checkpoint, both periodically
for safety and when interrupted.
If checkpointed job interrupted, it will resume
at the last checkpointed state when it starts
again.
Generally no change to source code - need to link
Condors Standard Universe support library (see
later).

34
Types of Jobs

Classified according to environment it provides.
Currently seven environments
Standard
Vanilla
PVM
MPI
Globus
Java
Scheduler

35
Standard

For jobs compiled with Condor libraries
Allows for checking pointing and remote system
calls.
Must be single threaded.
Not available under Windows.

36
Vanilla

For jobs that cannot be compiled with Condor
libraries, and for shell scripts and Windows
batch files.
No checkpointing or remote system calls.

Job Universes continued
PVM
For PVM programs.
MPI
For MPI programs (MPICH).
Globus
For submitting jobs to resources managed by
Globus (version 2.2 and higher).

Java
For Java programs (written for Java Virtual
Interface).
Scheduler
A universe not normally used by end-user.
Ignores any requirements and runs job on submit
host. Never preempted.

39
Directed Acyclic Graph Manager (DAGMan)

Allows one to specify dependencies between Condor
Jobs.
Example
Do not run Job B until Job A completed
successfully
Especially important to jobs working together
(as in Grid computing).

40
Directed Acyclic Graph(DAG)

A data structure used to represent dependencies.
Each job is a node in the DAG.
Each node can have any number of parents and
childred as long as there are no loops (Acyclic
graph).

41
Defining a DAG

DAG defined by a .dag file, listing each of the
nodes and their dependencies
Example

diamond.dag Job A a.sub Job B b.sub Job C
c.sub Job D d.sub Parent A Child B C Parent B C
Child D
Job A
Job C
Job B
Job D
42
Running a DAG

DASGMan acts as a scheduler managing the
submission of jobs to Condor based upon DAG
dependencies.
DAGMan holds and submits jobs to Condor queue at
appropriate times.

43
Job Failures

DAGMan continues until it cannot make progress
and then creates a rescue file holding current
state of DAG.
When failed job ready to re-run, rescue file used
to restore prior state of DAG.

44
ClassAd Matchmaking

Used to ensure job done according to constraints
of users and owners.
Example of user constraints
I need a Pentium IV with at least 512 Mbytes of
RAM and speed of at least 3.5 Ghz
Example of machine owner constraints
Never run jobs owned by Fred

45
Condor Submit Description File
Describes job to Condor. Used with Condor _submit
command. Description File Example

This is a comment, condor submit file
Universe vanilla
Executable /home/abw/condor/myProg
Input myProg.stdin
Output myProg.stdout
Error myProg.stderr
Arguments -arg1 -arg2
InitialDir /home/abw/condor/assignment4
Queue

46
Submitting Multiple Jobs

Submit file can specify multiple jobs
Queue 500 will submit 500 jobs at once
Condor calls groups of jobs a cluster
Each job within cluster called a process
Condor job ID is the cluster number, a period and
process number, for example 26.2
Single jobs also a cluster but with a single
process (process 0)

47
Specifying Requirements

A C/Java-like Boolean expression that evaluates
to TRUE for a match.
This is a comment, condor submit file
Universe vanilla
Executable /home/abw/condor/myProg
InitialDir /home/abw/condor/assignment4
Requirements Memory gt 512 Disk gt 10000
queue 500

48
Summary of Key Condor Features

High throughput computing using an
opportunitistic environment.
Matchmaking
Checkpointing
DAG scheduling

49
Condor-G

Grid enabled version of Condor.
Uses Globus Toolkit for
Security (GSI)
managing remote jobs on grid (GRAM)
file handling and remote I/O (GSI-FTP)

50
Remote execution by Condor-G on Globus-managed
resources
FromCondor-G A Computation Management Agent for
Multi-Institutional Grids by J. Frey, T.
Tannenbaum, M. Livny, I. Foster and S. Tuecke.
Figure probably refers to Globus version 2.
51
More Information