Title: CONDOR
Author: preethi
Last modified by: preethi
Created: 3/19/2003 12:11:46 AM
Presentation format: On-screen Show
Company: CCM - University of ...



1
CONDOR
  • CISC 879 Parallel Computation
  • Spring 2003
  • Preethi Natarajan

2
Outline
  • Condor Goals and Overview
  • Components
  • Matchmaking - ClassAds
  • RPC in Condor
  • Checkpoint/Restart
  • Glance at APIs

3
Condor Objectives
  • Condor's goal is to hunt for idle resources that
    can be exploited by user applications
  • Performance vs. throughput:
  • High Performance Computing: CPU cycles/second
    under ideal circumstances. "How fast can I run
    simulation X on this machine?"
  • High Throughput Computing: CPU cycles/day (week,
    month, year?) under non-ideal circumstances. "How
    many times can I run simulation X in the next
    month using all available machines?"
  • How much computing power is available to me?
  • Condor converts collections of distributively
    owned workstations (on different platforms) and
    dedicated clusters into a distributed
    high-throughput computing facility

4
Condor - Overview
  • Customers advertise their job requirements to
    Condor as Resource Requests
  • Resource owners advertise their resource
    descriptions as Resource Offers
  • Condor provides:
  • Matchmaking between jobs and resources
  • Notification of matches
  • Transparent access to a job's files during
    execution
  • Opportunistic scheduling: schedule resources
    when there is an opportunity
  • Checkpointing (saving) of job state when the
    current resource needs to be preempted
  • Restart of the job from the checkpointed state on
    another available resource

[Figure: Condor Central Manager; site at which the job is submitted; resource found appropriate for the job]
5
Condor Components
  • CUSTOMER AGENT
  • Submits Resource Requests (job requirements) into
    an application queue ordered by a priority scheme
  • Its implementation is the scheduling daemon,
    schedd
  • RESOURCE AGENT
  • Periodically extracts the resource's state
    information and updates its Resource Offers
  • Its implementation is the startd

[Figure: Customer Agent (schedd, job submission) sending Resource Requests and Resource Agent (startd) sending Resource Offers to the Collector, Negotiator, and Accountant, which notify the match]
6
Condor Components (Cont.)
  • CENTRAL MANAGER
  • Is the kernel of the Condor pool
  • Collector: periodically collects
  • Resource Offers from startds
  • Resource Requests from schedds
  • Negotiator:
  • Matchmaking between Resource Requests and Offers
  • Notification about the match to both entities of
    the matched pair
  • The Claiming Protocol is then followed between the
    respective Customer and Resource Agents
  • Accountant: logs resource usage by jobs

7
ClassAds
  • A ClassAd (Classified Advertisement) is a flexible
    and extensible data model used to represent:
  • Resource Offers - resource services available
  • Resource Requests - job requirements
  • Access Policies - constraints and requirements on
    resource allocations
  • A ClassAd is a mapping from attribute names to
    expressions; the model defines semantics for
    evaluating the attributes
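
The "mapping from attribute names to expressions" idea can be sketched in a few lines of Python. This is only an illustration, not Condor's actual ClassAd language: an expression is modeled here as either a constant or a callable taking the two ads, and all attribute names and values (Arch, Memory, ImageSize, Owner) are made up for the example.

```python
# Illustrative only: a ClassAd modeled as a dict from attribute names to
# expressions. Real ClassAds use their own expression language; here an
# expression is either a constant or a callable (my_ad, other_ad) -> value.

def evaluate(ad, name, other):
    """Evaluate attribute `name` of `ad` against a candidate `other` ad."""
    expr = ad.get(name)
    if callable(expr):
        return expr(ad, other)
    return expr  # a constant, or None when the attribute is undefined


# Hypothetical resource offer.
machine_ad = {
    "Type": "Machine",
    "Arch": "INTEL",
    "Memory": 256,
    "Requirements": lambda my, other: other.get("ImageSize", 0) <= my["Memory"],
}

# Hypothetical resource request.
job_ad = {
    "Type": "Job",
    "Owner": "alice",
    "ImageSize": 64,
    "Requirements": lambda my, other: other.get("Arch") == "INTEL",
    "Rank": lambda my, other: other.get("Memory", 0),
}
```

Because both offers and requests share one representation, the same `evaluate` helper can check either side's Requirements against the other ad.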

8
ClassAds - Access Policies
  • Resource access policy specifies
  • Who may use resource
  • How they may use resource
  • When they may use resource

Policy Specification Example
  • Access Policy Specification in Condor is done
    using the following ClassAd Attributes

Expression Type   Evaluation Semantics for an Application
Requirements      True => application may use the resource
Rank              Larger value => application is preferred more highly over others
Suspend           True => suspend the active application
Continue          True => unsuspend the active application
Vacate            True => active application is notified to stop using the resource
Kill              True => active application should be stopped immediately
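
As a rough illustration of how these expressions might drive an active application, the sketch below maps already-evaluated policy values to an action. The function name, the state names, and especially the precedence (Kill, then Vacate, then Suspend or Continue) are assumptions of this sketch and are not taken from the slides.

```python
# Illustrative only: the precedence order below is an assumption.

def next_action(state, policy):
    """Pick an action for an active application.

    state  -- "running" or "suspended"
    policy -- dict mapping expression names to their evaluated truth values
    """
    if policy.get("Kill"):
        return "kill"        # stop immediately
    if policy.get("Vacate"):
        return "vacate"      # notify the application to stop using the resource
    if state == "running" and policy.get("Suspend"):
        return "suspend"
    if state == "suspended" and policy.get("Continue"):
        return "continue"    # unsuspend
    return "none"
```

For example, `next_action("running", {"Suspend": True})` yields `"suspend"`, while adding `"Vacate": True` to the same policy yields `"vacate"` under the assumed precedence.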
9
Matchmaking
  • ClassAd Specification
  • ClassAds describe Resource Requests and
    Resource Offers with attributes like Type, Rank,
    Requirements, Vacate, etc.
  • Advertising Protocol
  • Each entity periodically communicates its ClassAd
    and contact address to the Central Manager
    (Matchmaker)
  • Matchmaking Algorithm
  • Matches are based on the Requirements specified in
    the Resource Requests and Offers
  • The match with the highest Rank is selected
  • Past resource usage (the log) is used for fair
    scheduling
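
A minimal sketch of the matchmaking step, under simplifying assumptions: Requirements and Rank are plain Python callables over the other party's ad, only the request's Rank is consulted, and the fair-scheduling use of past usage is left out. The ads and their attribute names are hypothetical.

```python
def matchmake(request, offers):
    """Return the offer that passes both sides' Requirements and that the
    request ranks highest, or None when nothing matches."""
    def satisfied(ad, other):
        return bool(ad["Requirements"](other))

    candidates = [o for o in offers
                  if satisfied(request, o) and satisfied(o, request)]
    if not candidates:
        return None
    return max(candidates, key=lambda offer: request["Rank"](offer))


# Hypothetical resource request: needs an INTEL machine, prefers more memory.
job = {
    "Owner": "alice",
    "ImageSize": 64,
    "Requirements": lambda m: m["Arch"] == "INTEL" and m["Memory"] >= 64,
    "Rank": lambda m: m["Memory"],
}

# Hypothetical resource offers.
machines = [
    {"Name": "m1", "Arch": "INTEL", "Memory": 128,
     "Requirements": lambda j: True},
    {"Name": "m2", "Arch": "INTEL", "Memory": 512,
     "Requirements": lambda j: j["Owner"] != "alice"},   # owner policy
    {"Name": "m3", "Arch": "SPARC", "Memory": 1024,
     "Requirements": lambda j: True},
    {"Name": "m4", "Arch": "INTEL", "Memory": 256,
     "Requirements": lambda j: True},
]

best = matchmake(job, machines)
```

Here m2 is excluded by its own access policy and m3 by the job's Requirements, so the job's Rank chooses between m1 and m4.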

10
Matchmaking (Cont.)
  • Matchmaking Protocol
  • The match is notified to the two matched parties
    at their contact addresses, along with the
    matched ClassAd
  • (Possibly) authentication via hand-off of a
    session key
  • Claiming Protocol
  • The match was only a mutual introduction of the
    two parties
  • The Customer contacts the Resource directly to
    negotiate resource allocation
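
The split between notification and claiming can be sketched as two small classes. The class and method names are hypothetical, the session-key hand-off is omitted, and the claim-time re-check of the resource's Requirements is an assumption of this sketch rather than something stated on the slide.

```python
class ResourceAgent:
    """Hypothetical stand-in for the resource side of a match."""

    def __init__(self, ad):
        self.ad = ad
        self.claimed_by = None

    def request_claim(self, job_ad):
        # Assumption: the resource re-checks its own Requirements at claim
        # time, since its state may have changed after it last advertised.
        if self.claimed_by is None and self.ad["Requirements"](job_ad):
            self.claimed_by = job_ad["Owner"]
            return True
        return False


class CustomerAgent:
    """Hypothetical stand-in for the customer side of a match."""

    def __init__(self, job_ad):
        self.job_ad = job_ad
        self.matched_resource = None

    def notify_match(self, resource):
        # The matchmaker only introduces the parties; nothing is allocated yet.
        self.matched_resource = resource

    def claim(self):
        # The customer contacts the resource directly, not via the matchmaker.
        return self.matched_resource.request_claim(self.job_ad)
```

A second customer introduced to the same resource would have its claim refused once the first claim has succeeded, which is why the match alone is only an introduction.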

11
After Match Notification
  • The schedd on the initiating (submit) machine
    first spawns a shadow process, which acts as
    the shadow of the job that will be executed on
    the remote machine
  • The shadow negotiates with the startd of the
    remote machine to run the job
  • If successful, the startd on the remote machine
    spawns a starter, which:
  • Starts the remote job by spawning its process
  • Manages the execution of the remote job by
    communicating with the shadow

12
Exploiting RPC
  • The remote machine agrees to run the submit
    machine's job on its workstation, but the job's
    files are physically located at the submit machine
  • open(), read(), and write() calls in the job's code
    are executed at the submit machine as RPCs
  • condor_syscall_lib has to be linked into these jobs
  • If the files can be accessed via NFS/AFS, that is
    preferred over RPC when it is more efficient. The
    open() routine in condor_syscall_lib talks
    with the shadow at the submit machine and makes
    these decisions

[Figure: on the Remote Machine, the Starter process for the remote job spawns the remote job's process, which calls open(jobfile1); on the Submit Machine, the Shadow process for the job and the Local File System; jobfile1 is accessed via NFS/AFS or RPC]
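
The forwarding of file calls to the submit machine can be sketched as follows. This is an in-process toy, not condor_syscall_lib: the class names are hypothetical, the "RPC" is an ordinary method call where a real system would use a network connection, and the NFS/AFS shortcut is not modeled.

```python
import os
import tempfile


class Shadow:
    """Stands in for the shadow process on the submit machine."""

    def __init__(self):
        self._files = {}   # remote handle -> open file object
        self._next = 0     # hypothetical handle numbering

    def handle(self, call, *args):
        # Dispatch one request by name, as an RPC server would.
        return getattr(self, "_do_" + call)(*args)

    def _do_open(self, path, mode):
        handle, self._next = self._next, self._next + 1
        self._files[handle] = open(path, mode)
        return handle

    def _do_read(self, handle):
        return self._files[handle].read()

    def _do_write(self, handle, data):
        return self._files[handle].write(data)

    def _do_close(self, handle):
        self._files.pop(handle).close()


class RemoteSyscalls:
    """Stands in for the stubs linked into the job on the remote machine."""

    def __init__(self, shadow):
        self._shadow = shadow   # a network connection in a real system

    def open(self, path, mode="r"):
        return self._shadow.handle("open", path, mode)

    def read(self, handle):
        return self._shadow.handle("read", handle)

    def write(self, handle, data):
        return self._shadow.handle("write", handle, data)

    def close(self, handle):
        return self._shadow.handle("close", handle)


def demo():
    """Run a tiny 'job' whose file I/O is all served by the shadow."""
    with tempfile.TemporaryDirectory() as submit_dir:
        path = os.path.join(submit_dir, "jobfile1")
        io = RemoteSyscalls(Shadow())
        out = io.open(path, "w")
        io.write(out, "result=42\n")
        io.close(out)
        src = io.open(path)
        text = io.read(src)
        io.close(src)
        return text
```

The job-side code only ever holds opaque handles; every actual file operation happens where the files live, which is the point of executing the calls at the submit machine.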
13
Checkpoint
  • To checkpoint an executing program is to take a
    snapshot of its current state in such a way that
    the program can be restarted from that state at a
    later time, possibly on a different resource
  • Provides:
  • Preemptive-resume scheduling
  • Fault tolerance, when checkpointing is done
    periodically
  • In Condor, checkpointing of running jobs is
    optional. If it is needed, the job should be
    linked with condor_syscall_lib

14
Checkpointing in Condor
  • Implemented in condor_syscall_lib as a signal
    handler
  • When Condor sends a signal to checkpoint, the
    handler saves process state information in a
    checkpoint file
  • From the core: contents of the process's u-area,
    data, and stack segments
  • From the executable: symbol and debugging info,
    initialized data, text
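
The idea of saving enough state to resume later can be illustrated at application level. This sketch is an analogy only: Condor's handler works on the process image (core and executable), whereas here the "state" is three Python integers written with pickle, the signal is replaced by an explicit function call, and all names are hypothetical.

```python
import os
import pickle
import tempfile


class SummingJob:
    """Toy job: adds 0..limit-1. Its entire state is three integers."""

    def __init__(self, limit):
        self.limit = limit
        self.i = 0
        self.total = 0

    def run(self, max_steps=None):
        steps = 0
        while self.i < self.limit and (max_steps is None or steps < max_steps):
            self.total += self.i
            self.i += 1
            steps += 1
        return self.total


def save_checkpoint(job, path):
    """Write the job's state to a checkpoint file."""
    with open(path, "wb") as f:
        pickle.dump(vars(job), f)


def restart_from(path):
    """Rebuild a job from a checkpoint file."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    job = SummingJob(state["limit"])
    vars(job).update(state)
    return job


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.ckpt")
        job = SummingJob(10)
        job.run(max_steps=4)          # preempted after four steps
        save_checkpoint(job, path)    # what the signal handler would do
        resumed = restart_from(path)  # possibly on a different machine
        return resumed.i, resumed.run()
```

The resumed job continues from step 4 and reaches the same total as an uninterrupted run, which is the property a checkpoint has to preserve.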

15
Checkpointing and Restart
  • The shadow sends the latest checkpoint file to the
    new starter during restart
  • The starter reads the job state from the
    checkpoint file and execution continues
  • The starter periodically sends a checkpoint signal
    to the executing job
  • condor_syscall_lib makes the job dump core and
    saves the job state in the checkpoint file
  • The checkpoint file is temporarily stored at the
    remote machine
  • The starter transfers the latest checkpoint file
    to the shadow when the job is vacated

[Figure: on the Remote Machine, the Starter process sends a checkpoint signal to the remote job, and code in condor_syscall_lib saves process state information to a checkpoint file; the checkpoint file is transferred to the Shadow process on the Submit Machine (Local File System) when the job is vacated, and back when the job is restarted]
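
The movement of the checkpoint between starter and shadow can be sketched as below. This is a toy model under stated assumptions: the periodic checkpoint signal is replaced by a step counter, the checkpoint "file" is a JSON string held in memory, and the class and method names are hypothetical.

```python
import json


class Shadow:
    """Submit-machine side: keeps the latest checkpoint it was sent."""

    def __init__(self):
        self.latest_checkpoint = None

    def store(self, blob):
        self.latest_checkpoint = blob


class Starter:
    """Remote-machine side: runs the job and checkpoints it periodically."""

    def __init__(self, shadow):
        self.shadow = shadow
        self.local_checkpoint = None   # kept on the remote machine
        # On restart, resume from whatever checkpoint the shadow holds.
        blob = shadow.latest_checkpoint
        self.state = json.loads(blob) if blob else {"progress": 0}

    def run(self, steps, checkpoint_every):
        for _ in range(steps):
            self.state["progress"] += 1
            if self.state["progress"] % checkpoint_every == 0:
                self.local_checkpoint = json.dumps(self.state)

    def vacate(self):
        # Hand the latest checkpoint back to the shadow when evicted.
        if self.local_checkpoint is not None:
            self.shadow.store(self.local_checkpoint)


def demo():
    shadow = Shadow()
    first = Starter(shadow)
    first.run(steps=7, checkpoint_every=3)   # checkpoints at 3 and 6
    first.vacate()
    second = Starter(shadow)                 # restart on another machine
    resumed_at = second.state["progress"]
    second.run(steps=4, checkpoint_every=3)
    return resumed_at, second.state["progress"]
```

In this model the work done after the last checkpoint (the seventh step) is repeated after restart, which is the cost of checkpointing only periodically.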
16
CONDOR APIs - Glance
  • Compile as a Condor job
  • gcc -c hello.c -o hello.o
  • condor_compile gcc hello.o -o hello
  • Submit a Condor job
  • cat > submit.hello
  • Executable = hello
  • Universe = standard
  • Output = hello.out
  • Log = hello.log
  • Queue
  • condor_submit submit.hello (creates the Job ClassAd)

17
CONDOR APIs (Cont.)
  • condor_master: starts the other daemons
  • condor_vacate: vacates jobs running on the
    specified hosts
  • condor_status: displays the status of the Condor pool
  • condor_rm: removes a Condor job from the queue
  • More commands at
    http://www.cs.wisc.edu/condor/manual/v6.4/

18
REFERENCES
  • Condor Project Home Page:
    http://www.cs.wisc.edu/condor/
  • Research Publications on Condor:
    http://www.cs.wisc.edu/condor/publications.html