6d'1 - PowerPoint PPT Presentation

Provided by: barry199
1
Schedulers and Resource Brokers
2
Scheduler
  • Job manager submits jobs to scheduler.
  • Scheduler assigns work to resources to achieve
    specified time requirements.

3
Scheduling
  • From "Introduction to Grid Computing with
    Globus," IBM Redbooks

4
Why scheduling?
  • Efficient use of Grid resources requires powerful
    and flexible Grid scheduling
  • For Grid technology to be successful, there must
    be automatic features to determine available Grid
    resources and to coordinate the allocation of
    these resources in accordance with the
    requirements, dependencies, and objectives of the
    user.
  • From GGF7-workshop on Grid scheduling

5
Scheduling architecture
  • Is a current area of study for Grid
  • Eventually, there must be a definition of a
    scheduling architecture
  • The cooperation of different scheduling instances
    for arbitrary resources available in the grid
  • The interaction of local resource management
    (e.g., PBS, LoadLeveler) and data management

6
Service Level Agreements
  • Resource SLA (RSLA), i.e., reservation
  • A promise that a resource will be available when
    it is needed
  • The client will utilize the promise in subsequent
    SLAs
  • Task SLA (TSLA), i.e., execution
  • A promise to perform a task
  • May have complex task requirements and may
    reference an RSLA implicitly

7
SLAs, continued
  • Binding SLA (BSLA), i.e., a claim
  • Binds a resource capability to a TSLA
  • May reference an RSLA, or be implicit
  • May be created lazily to provision the task

8
Advance Reservation
  • Requesting actions at times in the future. (A
    service level agreement in which the conditions
    of the agreement start at some agreed-upon time
    in the future [2])
  • [2] The Grid 2: Blueprint for a New Computing
    Infrastructure, I. Foster and C. Kesselman,
    editors, Morgan Kaufmann, 2004.

9
Resource Broker
  • A scheduler that optimizes the performance of a
    particular resource. Performance may be measured
    by such criteria as fairness (to ensure that all
    requests for the resources are satisfied) or
    utilization (to measure the amount of the
    resource used). [2]

10
Community Scheduling
  • Individual users
  • Require service
  • Have application goals
  • Community schedulers
  • Broker service
  • Aggregate scheduling
  • Individual resources
  • Provide service
  • Have policy autonomy
  • Serve higher-level layers

11
Scheduling in Globus (not)
  • Fully-fledged scheduler/resource broker not in
    Globus.
  • For example, Globus does not currently have
    advance reservation.
  • Scheduler/resource broker needs to be provided
    separately on top of Globus, using basic services
    provided in Globus.

12
  • Resource Broker Examples
  • Condor-G, Nimrod/G, Grid Canada

13
Condor
  • System first developed at the University of
    Wisconsin-Madison in the mid-1980s to convert a
    collection of distributed workstations and
    clusters into a high-throughput computing
    facility.
  • Key concept - using wasted computer power of idle
    workstations.

14
Condor
  • Converts collections of distributed workstations
    and dedicated clusters into a distributed
    high-throughput computing facility.

15
Features
  • Include
  • Resource finder
  • Batch queue manager
  • Scheduler
  • Checkpoint/restart
  • Process migration

16
  • Intended to run job even if
  • Machines crash
  • Disk space exhausted
  • Software not installed
  • Machines are needed by others
  • Machines are managed by others
  • Machines are far away

17
Uses
  • Consider following scenario
  • I have a simulation that takes two hours to run
    on my high-end computer
  • I need to run it 1000 times with slightly
    different parameters each time.
  • If I do this on one computer, it will take at
    least 2000 hours (or about 3 months)

From "Condor: What it is and why you should
worry about it," by B. Beckles, University of
Cambridge, Seminar, June 23, 2004
18
  • Suppose my department has 100 PCs like mine that
    are mostly sitting idle overnight (say 8 hours a
    day)
  • If I could use them when their legitimate users
    are not using them, so that I do not
    inconvenience them, I could get about 800 CPU
    hours/day.
  • This is an ideal situation for Condor.
  • I could do my simulations in 2.5 days.
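
The arithmetic behind this scenario can be checked with a short sketch. The function and variable names are illustrative only, and the calculation assumes that every idle PC is as fast as the user's own machine and that scheduling overhead is negligible.

```python
def days_to_finish(runs, hours_per_run, machines, idle_hours_per_day):
    """Return (total CPU hours, CPU hours gained per day, days needed)."""
    total_cpu_hours = runs * hours_per_run
    cpu_hours_per_day = machines * idle_hours_per_day
    return total_cpu_hours, cpu_hours_per_day, total_cpu_hours / cpu_hours_per_day


# Figures taken from the scenario: 1000 runs of 2 hours each,
# 100 PCs that sit idle for about 8 hours a day.
total, per_day, days = days_to_finish(runs=1000, hours_per_run=2,
                                      machines=100, idle_hours_per_day=8)
print(total, per_day, days)
```

With the figures from the slides this gives 2000 CPU hours of work, 800 CPU hours harvested per day, and 2.5 days to finish.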

19
How does Condor work?
  • A collection of machines running Condor called a
    pool.
  • Individual pools can be joined together in a
    process called flocking.

20
Machine Roles
  • Machines have one or more of four roles
  • Central manager
  • Submit machine (Submit host)
  • Execution machine (Execute host)
  • Checkpoint server

21
Central Manager
  • Resource broker for a pool. Keeps track of which
    machines are available, what jobs are running,
    negotiates which machine will run which job, etc.
  • Only one central manager per pool.

22
Submit Machine
  • Machine which submits jobs to pool.
  • Must be at least one submit machine in a pool,
    and usually more than one.

23
Execute Machine
  • Machine on which jobs can be run.
  • Must be at least one execute machine in a pool,
    and usually more than one.

24
Checkpoint Server
  • Machine which stores all checkpoint files produced
    by jobs which checkpoint.
  • Can only be one checkpoint machine in a pool.
  • Optional to have a checkpoint machine.

25
Possible Configuration
  • A central manager.
  • Some machines that can only be submit hosts.
  • Some machines that can only be execute hosts.
  • Some machines that can be both submit and execute
    hosts.
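
The role rules from the preceding slides (exactly one central manager, at least one submit and one execute machine, at most one checkpoint server) can be expressed as a small consistency check. This is a sketch only: the role strings and machine names are hypothetical, and Condor itself does not expose a function like this.

```python
from collections import Counter


def check_pool(roles_by_machine):
    """Return a list of problems with a pool layout (empty if it is valid)."""
    counts = Counter(role
                     for roles in roles_by_machine.values()
                     for role in roles)
    problems = []
    if counts["central_manager"] != 1:
        problems.append("need exactly one central manager")
    if counts["submit"] < 1:
        problems.append("need at least one submit machine")
    if counts["execute"] < 1:
        problems.append("need at least one execute machine")
    if counts["checkpoint_server"] > 1:
        problems.append("at most one checkpoint server allowed")
    return problems


# A layout like the "possible configuration" slide: one central manager,
# a submit-only host, an execute-only host, and a host that does both.
pool = {
    "cm":    {"central_manager"},
    "desk1": {"submit"},
    "node1": {"execute"},
    "lab1":  {"submit", "execute"},
}
pool_problems = check_pool(pool)
broken_problems = check_pool({"node1": {"execute"}})
```

The first layout passes with no problems, while the second is flagged for lacking a central manager and a submit machine.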

27
Submitting a job
  • Job submitted to submit host
  • Submit host tells the central manager about the
    job using Condor's ClassAd mechanism, which may
    include
  • What it requires
  • What it desires
  • What it prefers, and
  • What it will accept

28
  • 1. Central manager monitors execute hosts, so it
    knows what is available, what type of machine
    each execute host is, and what software it has.
  • 2. Execute hosts periodically send a ClassAd
    describing themselves to the central manager.

29
  • 3. At times, the central manager enters a
    negotiation cycle where it matches waiting jobs
    with available execute hosts.
  • 4. Eventually job is matched with a suitable
    execute host (hopefully).

30
  • 5. Central manager informs chosen execute host
    that it has been claimed and gives it a ticket.
  • 6. Central manager informs submit host which
    execute host to use and gives it a matching
    ticket.

31
  • 7. Submit host contacts execute host, presenting
    its matching ticket, and transfers the job's
    executable and data files to execute host if
    necessary. (Shared file system also possible.)
  • 8. When job finished, results returned to submit
    host (unless shared file system in use between
    submit and execute hosts).
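
Steps 3 to 6 above can be pictured with a toy negotiation cycle. This is a simplified sketch and not Condor's actual matchmaking code: the dictionary-based ads, the `requirements` callables, and the random ticket are all assumptions made for illustration.

```python
import secrets


def negotiation_cycle(waiting_jobs, machine_ads):
    """Match each waiting job to the first free machine that satisfies it.

    Returns a list of (job_name, machine_name, ticket) tuples. The ticket
    stands in for the matching tickets handed out in steps 5 and 6.
    """
    matches = []
    free = dict(machine_ads)  # copy so the caller's ads are not modified
    for job in waiting_jobs:
        for name, ad in list(free.items()):
            if job["requirements"](ad):
                ticket = secrets.token_hex(8)
                matches.append((job["name"], name, ticket))
                del free[name]  # a claimed machine is no longer available
                break
    return matches


# Ads the execute hosts would have sent to the central manager (step 2).
machine_ads = {
    "exec1": {"Memory": 256, "OpSys": "LINUX"},
    "exec2": {"Memory": 1024, "OpSys": "LINUX"},
}
waiting_jobs = [
    {"name": "job1", "requirements": lambda ad: ad["Memory"] > 512},
    {"name": "job2", "requirements": lambda ad: ad["OpSys"] == "LINUX"},
    {"name": "job3", "requirements": lambda ad: ad["Memory"] > 2048},
]
matches = negotiation_cycle(waiting_jobs, machine_ads)
```

Here job1 is matched to exec2 and job2 to exec1, while job3 finds no suitable machine and would simply wait for a later cycle.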

32
  • Connections
  • Connection between submit and execute host
    usually done with a TCP connection.
  • If connection dies, job resubmitted to Condor
    pool.
  • Some jobs might access files and resources on
    submit host via remote procedure calls.

33
Checkpointing
  • Certain jobs can checkpoint, both periodically
    for safety and when interrupted.
  • If checkpointed job interrupted, it will resume
    at the last checkpointed state when it starts
    again.
  • Generally no change to source code - need to link
    Condor's Standard Universe support library (see
    later).
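
The resume-from-last-checkpoint behaviour can be illustrated with a small application-level sketch. Note that this is a different technique from Condor's: the Standard Universe saves the state of the whole process through the linked library, whereas the code below saves only a hand-picked state dictionary to a JSON file. The file name and function are hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, interrupt_at=None):
    """Sum the integers 0..total_steps-1, saving state after every step.

    If interrupt_at is given, stop at that step to simulate the job being
    interrupted. Returns the final sum, or None if the run was interrupted.
    """
    state = {"next_step": 0, "running_sum": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last checkpointed state
    while state["next_step"] < total_steps:
        if interrupt_at is not None and state["next_step"] == interrupt_at:
            return None  # interrupted; the checkpoint file holds the progress
        state["running_sum"] += state["next_step"]
        state["next_step"] += 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state["running_sum"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run_with_checkpoints(10, path, interrupt_at=6)  # interrupted run
second = run_with_checkpoints(10, path)                 # resumes at step 6
```

The second call does not repeat steps 0 to 5; it picks up from the saved state and still arrives at the full result.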

34
Types of Jobs
  • Classified according to the environment Condor
    provides. Currently seven environments (universes)
  • Standard
  • Vanilla
  • PVM
  • MPI
  • Globus
  • Java
  • Scheduler

35
Standard
  • For jobs compiled with Condor libraries
  • Allows for checkpointing and remote system
    calls.
  • Must be single threaded.
  • Not available under Windows.

36
Vanilla
  • For jobs that cannot be compiled with Condor
    libraries, and for shell scripts and Windows
    batch files.
  • No checkpointing or remote system calls.

37
  • Job Universes continued
  • PVM
  • For PVM programs.
  • MPI
  • For MPI programs (MPICH).
  • Globus
  • For submitting jobs to resources managed by
    Globus (version 2.2 and higher).

38
  • Java
  • For Java programs (written for the Java Virtual
    Machine).
  • Scheduler
  • A universe not normally used by end-user.
    Ignores any requirements and runs job on submit
    host. Never preempted.

39
Directed Acyclic Graph Manager (DAGMan)
  • Allows one to specify dependencies between Condor
    Jobs.
  • Example
  • Do not run Job B until Job A completed
    successfully
  • Especially important to jobs working together
    (as in Grid computing).

40
Directed Acyclic Graph(DAG)
  • A data structure used to represent dependencies.
  • Each job is a node in the DAG.
  • Each node can have any number of parents and
    children as long as there are no loops (acyclic
    graph).

41
Defining a DAG
  • DAG defined by a .dag file, listing each of the
    nodes and their dependencies
  • Example

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

(Figure: the DAG with nodes Job A, Job B, Job C, and Job D.)
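
To see how the dependencies in such a file determine a valid execution order, the following sketch parses the diamond example and sorts it topologically. It is not DAGMan's parser: it understands only the `Job` and `Parent ... Child ...` lines shown on the slide, and it relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

DIAMOND_DAG = """\
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (submit_files, parents) parsed from .dag-style text."""
    submit_files = {}  # node name -> submit description file
    parents = {}       # node name -> set of parent node names
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].lower()
        if keyword == "job":
            submit_files[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif keyword == "parent":
            split = [w.lower() for w in words].index("child")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return submit_files, parents


submit_files, parents = parse_dag(DIAMOND_DAG)
# static_order() yields every node after all of its parents.
order = list(TopologicalSorter(parents).static_order())
```

Any order produced this way starts with A and ends with D; B and C may appear in either order because neither depends on the other.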
42
Running a DAG
  • DAGMan acts as a scheduler managing the
    submission of jobs to Condor based upon DAG
    dependencies.
  • DAGMan holds and submits jobs to Condor queue at
    appropriate times.

43
Job Failures
  • DAGMan continues until it cannot make progress
    and then creates a rescue file holding current
    state of DAG.
  • When failed job ready to re-run, rescue file used
    to restore prior state of DAG.
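
The "continue until no progress, then save state" behaviour can be sketched as a loop over ready nodes. This is a toy model under stated assumptions, not DAGMan's implementation: jobs run synchronously through a hypothetical `run_job` callback, and the rescue state is just the set of completed nodes rather than a rescue file.

```python
def run_dag(parents, run_job, done=()):
    """Run nodes whose parents have all succeeded until no progress is possible.

    parents maps node -> set of parent nodes; run_job(node) returns True on
    success. done lists nodes already completed (e.g. from an earlier run).
    Returns (done, failed): the completed nodes and the nodes that failed.
    """
    done = set(done)
    failed = set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:  # every parent finished successfully
                if run_job(node):
                    done.add(node)
                else:
                    failed.add(node)
                progress = True
    return done, failed


parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First attempt: job C fails, so D can never become ready.
rescue_state, failed = run_dag(parents, run_job=lambda node: node != "C")

# Later, once C is fixed, restart from the saved state instead of from scratch.
rerun = []
final, _ = run_dag(parents, run_job=lambda n: rerun.append(n) or True,
                   done=rescue_state)
```

On the second run only C and D are executed, because A and B are already recorded as complete in the saved state.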

44
ClassAd Matchmaking
  • Used to ensure job done according to constraints
    of users and owners.
  • Example of user constraints
  • I need a Pentium IV with at least 512 Mbytes of
    RAM and speed of at least 3.5 GHz
  • Example of machine owner constraints
  • Never run jobs owned by Fred
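
Matchmaking is two-sided: the job's constraints are checked against the machine, and the machine owner's constraints are checked against the job. A minimal sketch of that idea follows. The attribute names (`Cpu`, `MemoryMB`, `SpeedGHz`, `Owner`) and the use of Python callables are assumptions for illustration; real ClassAds use their own expression language and attribute names.

```python
def ads_match(job_ad, machine_ad):
    """A match requires each side's requirements to accept the other ad."""
    return (job_ad["Requirements"](machine_ad)
            and machine_ad["Requirements"](job_ad))


job_ad = {
    "Owner": "alice",
    # User constraint from the slide: Pentium IV, >= 512 MB RAM, >= 3.5 GHz.
    "Requirements": lambda m: (m["Cpu"] == "Pentium IV"
                               and m["MemoryMB"] >= 512
                               and m["SpeedGHz"] >= 3.5),
}
machine_ad = {
    "Cpu": "Pentium IV",
    "MemoryMB": 1024,
    "SpeedGHz": 3.6,
    # Owner constraint from the slide: never run jobs owned by Fred.
    "Requirements": lambda j: j["Owner"] != "fred",
}
fred_ad = dict(job_ad, Owner="fred")      # same job, submitted by Fred
slow_machine = dict(machine_ad, SpeedGHz=2.4)

alice_ok = ads_match(job_ad, machine_ad)
fred_ok = ads_match(fred_ad, machine_ad)
slow_ok = ads_match(job_ad, slow_machine)
```

Alice's job matches the machine, Fred's identical job is rejected by the owner's constraint, and the slower machine is rejected by the user's constraint.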

45
Condor Submit Description File
Describes job to Condor. Used with the condor_submit
command. Description file example:
    # This is a comment, condor submit file
    Universe = vanilla
    Executable = /home/abw/condor/myProg
    Input = myProg.stdin
    Output = myProg.stdout
    Error = myProg.stderr
    Arguments = -arg1 -arg2
    InitialDir = /home/abw/condor/assignment4
    Queue

46
Submitting Multiple Jobs
  • Submit file can specify multiple jobs
  • Queue 500 will submit 500 jobs at once
  • Condor calls groups of jobs a cluster
  • Each job within cluster called a process
  • Condor job ID is the cluster number, a period and
    process number, for example 26.2
  • Single jobs also a cluster but with a single
    process (process 0)
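
The numbering scheme can be made concrete with a few lines of code. The helper below is purely illustrative (Condor assigns these IDs itself); it only shows how a cluster number and zero-based process numbers combine into job IDs.

```python
def job_ids(cluster, count):
    """Return the IDs for `count` processes queued in one cluster."""
    return [f"{cluster}.{process}" for process in range(count)]


ids = job_ids(cluster=26, count=500)   # as if submitted with "Queue 500"
single = job_ids(cluster=27, count=1)  # a single job is process 0
print(ids[2], ids[-1], single[0])
```

For cluster 26 the third job is 26.2 and the last of the 500 is 26.499; a single-job cluster contains only process 0.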

47
Specifying Requirements
  • A C/Java-like Boolean expression that evaluates
    to TRUE for a match.
    # This is a comment, condor submit file
    Universe = vanilla
    Executable = /home/abw/condor/myProg
    InitialDir = /home/abw/condor/assignment4
    Requirements = Memory > 512 && Disk > 10000
    queue 500

48
Summary of Key Condor Features
  • High throughput computing using an
    opportunistic environment.
  • Matchmaking
  • Checkpointing
  • DAG scheduling

49
Condor-G
  • Grid enabled version of Condor.
  • Uses Globus Toolkit for
  • Security (GSI)
  • managing remote jobs on grid (GRAM)
  • file handling and remote I/O (GSI-FTP)

50
Remote execution by Condor-G on Globus-managed
resources
From "Condor-G: A Computation Management Agent for
Multi-Institutional Grids" by J. Frey, T.
Tannenbaum, M. Livny, I. Foster and S. Tuecke.
Figure probably refers to Globus version 2.
51
More Information
  • www.cs.wisc.edu/condor