6d.1 - PowerPoint PPT Presentation

About This Presentation

Title:

6d.1

Description:

Condor ... Condor-G ... Each 'job' statement has an abstract job name (say A) and a file (say a.condor) ... – PowerPoint PPT presentation

Slides: 46

Provided by: barry200

Learn more at: http://www.umcs.maine.edu

Category:

Tags: condor

Transcript and Presenter's Notes

Title: 6d.1

1
Schedulers and Resource Brokers
ITCS 4010 Grid Computing, 2005, UNC-Charlotte, B. Wilkinson.
2
Scheduler

Job manager submits jobs to scheduler.
Scheduler assigns work to resources to achieve
specified time requirements.

3
Scheduling

From "Introduction to Grid Computing with
Globus," IBM Redbooks

4
Executing GT 4 jobs

Globus has two modes
Interactive/interactive-streaming
Batch

5
GT 4 Fork Scheduler

GT 4 comes with a fork scheduler, which attempts
to execute the job immediately.
Provided for starting and controlling a job on a
local host when the job does not require any special
software to be loaded or have other special requirements.
Other schedulers have to be added separately,
using an adapter.

6
Batch scheduling

Batch: a term from the old computing days, when one
submitted a pack of punched cards as the program
to a computer and came back after the
program had been run on the computer, maybe
overnight.

7
Relationship between GT4 GRAM and a Local Scheduler

[Figure: diagram with labels Client, User job, GT4 Java Container, GRAM services, GRAM adapter, Job functions, Local job control, Local scheduler (various possible), Compute element. Credit: I. Foster]
8
Scheduler adapters included in GT 4

PBS (Portable Batch System)
Condor
LSF (Load Sharing Facility)
Third party adapter provided for
SGE (Sun Grid Engine)

9
Meta-schedulers

Loosely defined as a higher-level scheduler that
can schedule jobs between sites.

10
(Local) Scheduler Issues

Distribute job
Based on load and characteristics of machines,
available disk storage, network characteristics, ...
Both globally and locally.
Runtime scheduling!
Arrange data in right place (Staging)
Data Replication and movement as needed
Data Error checking

11
Scheduler Issues (continued)

Performance
Error checking, checkpointing
Monitoring job, progress monitoring
QOS (Quality of service)
Cost (an area considered by Nimrod-G)
Security
Need to authenticate and authorize remote user
for job submission
Fault Tolerance

12
Batch Scheduling policies

First-in, First-out
Favor certain types of jobs
Shortest job first
Smallest (or largest) memory first
Short (or long) running job first
Fair sharing or priority to certain users
Dynamic policies
Depending upon time of day and load
Custom, preemptive, process migration
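
Several of the static policies above amount to choosing a different sort key for the queue. The following is a minimal Python sketch under that assumption. The `Job` fields, the job names, and the numbers are hypothetical and are not taken from any particular scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    arrival: int        # submission order (hypothetical field)
    est_runtime: float  # estimated run time in hours (hypothetical field)
    memory_mb: int      # requested memory in MB (hypothetical field)


def fifo(jobs):
    """First-in, first-out: order by submission."""
    return sorted(jobs, key=lambda j: j.arrival)


def shortest_job_first(jobs):
    """Shortest estimated run time first; ties broken by arrival."""
    return sorted(jobs, key=lambda j: (j.est_runtime, j.arrival))


def smallest_memory_first(jobs):
    """Smallest memory request first; ties broken by arrival."""
    return sorted(jobs, key=lambda j: (j.memory_mb, j.arrival))


queue = [
    Job("sim", 0, 2.0, 4096),
    Job("render", 1, 0.5, 8192),
    Job("report", 2, 1.0, 512),
]

fifo_order = [j.name for j in fifo(queue)]
sjf_order = [j.name for j in shortest_job_first(queue)]
mem_order = [j.name for j in smallest_memory_first(queue)]
print(fifo_order, sjf_order, mem_order)
```

Dynamic and preemptive policies need more state than a sort key (time of day, current load, running jobs) and are not modelled here.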

13
Advance Reservation

Requesting actions at times in future.
"A service level agreement in which the
conditions of the agreement start at some
agreed-upon time in the future" [2]
[2] The Grid 2: Blueprint for a New Computing
Infrastructure, I. Foster and C. Kesselman,
editors, Morgan Kaufmann, 2004.

14
Resource Broker

"A scheduler that optimizes the performance of a
particular resource. Performance may be measured
by such criteria as fairness (to ensure that all
requests for the resources are satisfied) or
utilization (to measure the amount of the
resource used)." [2]

Scheduler/Resource Broker Examples
Schedulers/Resource Brokers available that work
with Globus
Condor/Condor-G
Sun Grid Engine
To be covered by James Ruff and to be used in
Assignment 4 this year.

16
Condor

First developed at University of
Wisconsin-Madison in mid 1980s to convert a
collection of distributed workstations and
clusters into a high-throughput computing
facility.
Key concept - using wasted computer power of idle
workstations.

17
Condor

Converts collections of distributed workstations
and dedicated clusters into a distributed
high-throughput computing facility.

18
Features

Include
Resource finder
Batch queue manager
Scheduler
Checkpoint/restart
Process migration

Intended to complete job even if
Machines crash
Disk space exhausted
Software not installed
Machines are needed by others
Machines are managed by others
Machines are far away

20
Uses

Consider following scenario
I have a simulation that takes two hours to run
on my high-end computer
I need to run it 1000 times with slightly
different parameters each time.
If I do this on one computer, it will take at
least 2000 hours (or about 3 months)

From "Condor: What it is and why you should
worry about it," by B. Beckles, University of
Cambridge, Seminar, June 23, 2004
21

Suppose my department has 100 PCs like mine that
are mostly sitting idle overnight (say 8 hours a
day).
If I could use them when their legitimate users
are not using them, so that I do not
inconvenience them, I could get about 800 CPU
hours/day.
This is an ideal situation for Condor.
I could do my simulations in 2.5 days.
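
The arithmetic in this scenario can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes the idle PCs are as fast as the original machine and that there is no scheduling or data-transfer overhead. The variable names are illustrative.

```python
RUNS = 1000              # simulations to run
HOURS_PER_RUN = 2        # hours per simulation on one machine
PCS = 100                # idle department PCs
IDLE_HOURS_PER_DAY = 8   # hours each PC is idle per day

total_cpu_hours = RUNS * HOURS_PER_RUN                 # work to be done
single_machine_days = total_cpu_hours / 24             # one machine, running nonstop
pool_cpu_hours_per_day = PCS * IDLE_HOURS_PER_DAY      # capacity of the idle pool
pool_days = total_cpu_hours / pool_cpu_hours_per_day   # days using only idle time

print(total_cpu_hours, round(single_machine_days, 1),
      pool_cpu_hours_per_day, pool_days)
```

With these figures, one machine needs roughly 83 days of continuous running (about 3 months), while the pool needs 2.5 days of overnight idle time.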

From "Condor: What it is and why you should
worry about it," by B. Beckles, University of
Cambridge, Seminar, June 23, 2004
22

The Condor high-throughput computing system
Condor-G agent for grid computing

23
HTCS

Distributed batch computing system
Provide large amounts of fault-tolerant computing
power.
Opportunistic computing.
Effectively utilizing all resources on the
network
Scavenger (but polite!)

24
Tools

ClassAds: flexible framework for matching
resource requests and providers.
Job checkpoint and migration.
Remote system calls: redirect I/O back to the
local machine.

25
Condor-G

Globus contributes protocols for secure
inter-domain communications and standardized
access to remote batch systems.
Condor provides everything else

26
(No Transcript)
27
Condor Core Components
28

User submits job to agent
Agent keeps the job and finds resources willing
to run it.
Agents and resources advertise themselves to a
matchmaker.
(Compare E-harmony.com: 29 dimensions of compatibility.)
Agent contacts resource
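
The matchmaking step can be illustrated with a small Python sketch in which both sides advertise attributes plus a requirements test, and a match needs both tests to pass. This is a toy model only: the dict layout, attribute names, and `matchmake` function are hypothetical, and real ClassAds use their own expression language rather than Python lambdas.

```python
def matches(job_ad, machine_ad):
    """A match is two-sided: each ad's requirements must accept the other ad."""
    return (job_ad["requirements"](machine_ad)
            and machine_ad["requirements"](job_ad))


def matchmake(job_ads, machine_ads):
    """Pair each job with the first free machine that matches (greedy sketch)."""
    pairs = []
    free = list(machine_ads)
    for job in job_ads:
        for machine in free:
            if matches(job, machine):
                pairs.append((job["name"], machine["name"]))
                free.remove(machine)
                break
    return pairs


machines = [
    # The owner of pc1 refuses jobs from the (hypothetical) user "guest".
    {"name": "pc1", "os": "LINUX", "memory_mb": 512,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "pc2", "os": "LINUX", "memory_mb": 2048,
     "requirements": lambda job: True},
]
jobs = [
    {"name": "sim-1", "owner": "alice",
     "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 1024},
    {"name": "sim-2", "owner": "guest",
     "requirements": lambda m: m["os"] == "LINUX"},
]

pairs = matchmake(jobs, machines)
print(pairs)
```

Here `sim-1` is paired with `pc2`, and `sim-2` stays unmatched because the only remaining machine's owner policy rejects it. The matchmaker only introduces the two parties; the agent then contacts the resource itself.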

Agent creates a shadow, which provides all details
necessary to run the job.
Resource creates a sandbox, a safe execution
environment for the job and the resource.
All are independent and individually responsible for
enforcing their owners' policies.
This led to Condor Pools

30
(No Transcript)
31
Direct Flocking (Multiple pools)
32
Globus

To develop worldwide Grid, needed uniform
interface for batch execution.
Grid Resource Access and Management protocol
(GRAM).
Provides abstraction for remote process queuing
and execution (with security and GridFTP).
Globus provides a server that speaks GRAM,
converts its commands into a form understood by
local schedulers

GRAM does not
Remember what jobs have been submitted, where
they are, what they are doing.
Analyze job failure and resubmit
Provide queuing, prioritization, logging,
accounting.
Decouple resource allocation and job execution.

Agent must direct a particular job, executable
image and all, to a particular queue.
Gosh, what if there is a backlog and no
reasonably available resources?

Condor adapted standard agent to speak GRAM and
uses own middleware.
Gliding

36
(No Transcript)
37

Directed Acyclic Graph Manager (DAGMan)
Meta-scheduler
Allows one to specify dependencies between Condor
Jobs.

Example
Do not run Job B until Job A completed
successfully
Especially important to jobs working together (as
in Grid computing).

39
Directed Acyclic Graph (DAG)

A data structure used to represent dependencies.
Directed graph.
No cycles.
Each job is a node in the DAG.
Each node can have any number of parents and
children as long as there are no loops (Acyclic
graph).

40
DAG
Do job A. Do jobs B and C after job A has
finished. Do job D after both jobs B and C have
finished.
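
The ordering constraint in this four-job example can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is only an illustration of dependency ordering and cycle detection on a DAG, not of how DAGMan is implemented. The `parents` mapping and the `has_cycle` helper are names chosen here.

```python
from graphlib import CycleError, TopologicalSorter

# Each job maps to the set of jobs it must wait for (its parents).
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

sorter = TopologicalSorter(parents)
sorter.prepare()  # raises CycleError if the graph contains a loop

waves = []  # jobs that may run at the same time, in order
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # all jobs whose parents are done
    waves.append(ready)
    sorter.done(*ready)

print(waves)


def has_cycle(graph):
    """Return True if the dependency graph is not acyclic."""
    try:
        TopologicalSorter(graph).prepare()
        return False
    except CycleError:
        return True
```

For the graph above the waves are A alone, then B and C together, then D. A mutual dependency such as `{"A": {"B"}, "B": {"A"}}` is rejected by `has_cycle`.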
41
Defining a DAG

Defined by a .dag file, listing each of the nodes
and their dependencies.
Each job statement has an abstract job name
(say A) and a file (say a.condor)
PARENT-CHILD statement describes relationship
between two or more jobs
Other statements available.

Example

diamond.dag

Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
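
To make the structure of these statements concrete, here is a toy Python parser for just the two statement types used in the diamond example. It is a sketch for illustration, not DAGMan's parser: the `parse_dag` function is hypothetical, and the other statements mentioned above are simply ignored.

```python
def parse_dag(text):
    """Parse 'Job <name> <file>' and 'Parent ... Child ...' lines.

    Returns (jobs, parents): jobs maps a job name to its submit file,
    parents maps a job name to the set of jobs it depends on.
    """
    jobs = {}
    parents = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].upper()
        if keyword == "JOB":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "PARENT":
            # Everything before CHILD is a parent, everything after is a child.
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return jobs, parents


DIAMOND = """\
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

jobs, parents = parse_dag(DIAMOND)
print(jobs)
print(parents)
```

The result records that B and C each wait for A, and that D waits for both B and C.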
43
Running a DAG

DAGMan acts as a scheduler managing the
submission of jobs to Condor based upon DAG
dependencies.
DAGMan holds and submits jobs to Condor queue at
appropriate times.

44
Job Failures

DAGMan continues until it cannot make progress
and then creates a rescue file holding current
state of DAG.
When failed job ready to re-run, rescue file used
to restore prior state of DAG.
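
The "continue until no progress, then resume later" behaviour can be sketched in Python as follows. This is a simplified model under stated assumptions: the saved state is just the set of completed jobs, jobs run one at a time, and `run_dag` and `run_job` are hypothetical names. The actual rescue file format is not described in these slides and is not modelled.

```python
def run_dag(parents, run_job, done=None):
    """Run every job whose parents are all done; stop when stuck.

    parents: job name -> set of parent job names
    run_job: callable returning True on success, False on failure
    done:    jobs already completed in an earlier attempt (saved state)
    Returns (done, failed).
    """
    done = set(done or ())
    failed = set()
    progress = True
    while progress:
        progress = False
        for job in sorted(parents):
            if job in done or job in failed:
                continue
            if parents[job] <= done:  # all parents completed successfully
                if run_job(job):
                    done.add(job)
                else:
                    failed.add(job)
                progress = True
    return done, failed


parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First attempt: job C fails, so D can never start.
done1, failed1 = run_dag(parents, lambda job: job != "C")

# Later attempt: restore the saved state and re-run only what is left.
rerun_log = []


def always_succeeds(job):
    rerun_log.append(job)
    return True


done2, failed2 = run_dag(parents, always_succeeds, done=done1)
print(done1, failed1, rerun_log, done2)
```

In the first attempt B still runs even though C has failed, because B does not depend on C. In the second attempt only C and D are executed, since A and B are restored as already complete.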

45
Summary of Key Condor Features