Title: CONDOR
1. CONDOR
- CISC 879 Parallel Computation
- Spring 2003
- Preethi Natarajan
2. Outline
- Condor Goals and Overview
- Components
- Matchmaking - ClassAds
- RPC in Condor
- Checkpoint/Restart
- Glance at APIs
3. Condor Objectives
- Condor's goal is to hunt for idle resources that can be exploited by user applications
- Performance vs. Throughput
  - High Performance Computing: CPU cycles/second under ideal circumstances. "How fast can I run simulation X on this machine?"
  - High Throughput Computing: CPU cycles/day (week, month, year?) under non-ideal circumstances. "How many times can I run simulation X in the next month using all available machines?"
- How much computing power is available to me?
- Condor converts collections of distributively owned workstations (different platforms) and dedicated clusters into a distributed high-throughput computing facility
4. Condor - Overview
- Customers advertise their job requirements to Condor: Resource Requests
- Resource owners advertise their resource descriptions: Resource Offers
- Condor provides
  - Matchmaking between jobs and resources
  - Notification of matches
  - Transparent access to the job's files during execution
  - Opportunistic scheduling: schedule resources when there is an opportunity
  - Checkpoint (save) job state when the current resource needs to be preempted
  - Restart the job from the checkpointed state on another available resource
[Figure labels: site at which the job is submitted; Condor Central Manager; resource found appropriate for the job]
5. Condor Components
- CUSTOMER AGENT
  - Submits Resource Requests (job requirements) in an application queue ordered by a priority scheme
  - Implementation is called the scheduling daemon (schedd)
- RESOURCE AGENT
  - Periodically extracts the resource's state information and updates its Resource Offers
  - Implementation is called the startd
[Figure labels: Job submission; Customer Agent (schedd); Resource Requests; Collector, Negotiator, Accountant; Notify Match; Resource Offers; Resource Agent (startd)]
6. Condor Components (Cont.)
- CENTRAL MANAGER
  - Is the kernel of the Condor pool
  - Collector: periodically collects
    - Resource Offers from startds
    - Resource Requests from schedds
  - Negotiator
    - Matchmaking between Resource Requests and Offers
    - Notification about the match to the entities of the matched pair
    - Claiming Protocol followed between the respective Customer and Resource Agents
  - Accountant: logs resource usage by jobs
7. ClassAds
- A Classified Advertisement (ClassAd) is a flexible and extensible data model used to represent
  - Resource Offers: resource services available
  - Resource Requests: job requirements
  - Access Policies: constraints on resource allocations and requirements
- A ClassAd is a mapping from attribute names to expressions, and it defines the semantics for evaluating the attributes
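The "mapping from attribute names to expressions" idea can be illustrated with a small Python sketch. This is not Condor's ClassAd syntax or API: expressions are modelled here as constants or as callables, and the `evaluate` helper, the attribute values, and the owner name are all hypothetical.

```python
# Hypothetical sketch: a ClassAd as a mapping from attribute names to
# expressions. An expression is either a constant or a callable that
# receives (my_ad, other_ad). This is NOT Condor's ClassAd language.

def evaluate(ad, name, other=None):
    """Evaluate attribute `name` of `ad`, optionally against `other`."""
    expr = ad.get(name)
    if callable(expr):
        return expr(ad, other or {})
    return expr  # a constant, or None when the attribute is undefined

# Resource Offer: services available on a made-up workstation.
machine_ad = {
    "Type": "Machine",
    "Arch": "INTEL",
    "Memory": 256,  # MB, illustrative value
    "Requirements": lambda my, other: evaluate(other, "ImageSize") <= my["Memory"],
}

# Resource Request: requirements of a made-up job.
job_ad = {
    "Type": "Job",
    "Owner": "alice",
    "ImageSize": 64,  # MB, illustrative value
    "Requirements": lambda my, other: evaluate(other, "Arch") == "INTEL",
}

machine_accepts_job = evaluate(machine_ad, "Requirements", job_ad)
job_accepts_machine = evaluate(job_ad, "Requirements", machine_ad)
print(machine_accepts_job, job_accepts_machine)
```

Both ads use the same representation, so one evaluator serves offers and requests alike; each side's Requirements expression is evaluated against the other side's attributes.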
8. ClassAds - Access Policies
- Resource access policy specifies
  - Who may use the resource
  - How they may use the resource
  - When they may use the resource
- Access policy specification in Condor is done using the following ClassAd attributes
Policy Specification Example
Expression Type   Evaluation Semantics for an Application
Requirements      True => application may use the resource
Rank              Larger value => application is more highly preferred over others
Suspend           True => suspend the active application
Continue          True => unsuspend the active application
Vacate            True => active application is notified to stop using the resource
Kill              True => active application should be stopped immediately
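As a rough illustration of how these attributes could drive a resource, the Python sketch below turns a set of policy expressions into an action for the active application. The thresholds, the machine state fields (`keyboard_idle_s`, `suspended_s`, `vacating_s`), and the order in which the expressions are checked are assumptions made for the example, not Condor's actual policy or state machine.

```python
# Hypothetical resource policy. Each attribute is an expression over the
# machine's current state; field names and thresholds are illustrative.
policy = {
    "Requirements": lambda m: m["keyboard_idle_s"] > 15 * 60,
    "Suspend": lambda m: m["keyboard_idle_s"] < 60,
    "Continue": lambda m: m["keyboard_idle_s"] > 5 * 60,
    "Vacate": lambda m: m["suspended_s"] > 10 * 60,
    "Kill": lambda m: m["vacating_s"] > 60,
}

def next_action(policy, machine, job_state):
    """Pick an action for the active application.

    job_state is "running", "suspended", or "vacating". The checking
    order below is an assumption of this sketch.
    """
    if job_state == "vacating":
        return "kill" if policy["Kill"](machine) else "none"
    if job_state == "suspended":
        if policy["Vacate"](machine):
            return "vacate"
        return "continue" if policy["Continue"](machine) else "none"
    if job_state == "running":
        return "suspend" if policy["Suspend"](machine) else "none"
    raise ValueError(f"unknown job state: {job_state}")

# The owner has just touched the keyboard while a job is running.
owner_back = {"keyboard_idle_s": 10, "suspended_s": 0, "vacating_s": 0}
print(next_action(policy, owner_back, "running"))
```

Keeping the policy as data (a mapping of expressions) means a resource owner could change when jobs are suspended or vacated without touching the code that acts on the result.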
9. Matchmaking
- ClassAd Specification
  - ClassAds describe Resource Requests and Resource Offers with attributes like Type, Rank, Requirements, Vacate, etc.
- Advertising Protocol
  - Each entity periodically communicates its ClassAd and contact address to the Central Manager (Matchmaker)
- Matchmaking Algorithm
  - Matches are based on the Requirements specified in the Resource Requests and Offers
  - The match with the highest Rank is selected
  - Past resource usage (log) is used for fair scheduling
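The matchmaking algorithm can be sketched in Python as follows. The sketch assumes that both sides' Requirements must hold, that the request's Rank picks among compatible offers, and that "fair scheduling" means serving owners with less logged usage first; the `matchmake` function, the ad fields, and the usage numbers are hypothetical.

```python
def matchmake(requests, offers, past_usage):
    """Pair each request with the compatible offer it ranks highest.

    Requests from owners with less logged usage are served first
    (an assumed fairness rule for this sketch).
    """
    ordered = sorted(requests, key=lambda r: past_usage.get(r["Owner"], 0))
    free = list(offers)
    matches = []
    for req in ordered:
        # A match needs BOTH Requirements expressions to be satisfied.
        candidates = [
            off for off in free
            if req["Requirements"](off) and off["Requirements"](req)
        ]
        if not candidates:
            continue  # request stays unmatched in this cycle
        best = max(candidates, key=lambda off: req["Rank"](off))
        free.remove(best)
        matches.append((req["Name"], best["Name"]))
    return matches

offers = [
    {"Name": "slow", "Mips": 100, "Memory": 128,
     "Requirements": lambda job: job["ImageSize"] <= 128},
    {"Name": "fast", "Mips": 500, "Memory": 512,
     "Requirements": lambda job: job["ImageSize"] <= 512},
]
requests = [
    {"Name": "job1", "Owner": "alice", "ImageSize": 32,
     "Requirements": lambda m: m["Memory"] >= 64,
     "Rank": lambda m: m["Mips"]},
    {"Name": "job2", "Owner": "bob", "ImageSize": 32,
     "Requirements": lambda m: m["Memory"] >= 64,
     "Rank": lambda m: m["Mips"]},
]
past_usage = {"alice": 100, "bob": 5}  # made-up accountant log totals

matches = matchmake(requests, offers, past_usage)
print(matches)
```

Here bob has used less of the pool than alice, so his job is considered first and gets the machine it ranks highest; alice's job is matched with the remaining compatible machine.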
10. Matchmaking (Cont.)
- Matchmaking Protocol
  - The match is notified to the two matched parties at their contact addresses, along with the matched ClassAd
  - (Possible) authentication via hand-off of a session key
- Claiming Protocol
  - The match was a mutual introduction of the two parties
  - The Customer contacts the Resource directly to negotiate resource allocation
11. After Match Notification
- The schedd on the initiating (submit) machine first spawns a shadow process. The shadow process acts as the shadow of the job that will be executed on the remote machine
- The shadow negotiates with the startd of the remote machine to run the job
- If successful, the startd on the remote machine spawns a starter, which
  - Starts the remote job by spawning its process
  - Manages the execution of the remote job by communicating with the shadow
12. Exploiting RPC
- The remote machine agrees to run the submit machine's job on its workstation, but the job's files are physically located on the submit machine
- open(), read(), write() calls in the job's code are executed at the submit machine as RPCs
- condor_syscall_lib has to be linked into these jobs
- If the files can be accessed via NFS/AFS, that is preferred over RPC when it is more efficient. The open() routine in condor_syscall_lib talks with the shadow at the submit machine and makes these decisions
[Figure: remote machine and submit machine. The starter process spawns the remote job's process; the job's call to open(jobfile1) accesses jobfile1 via NFS/AFS or via RPC to the shadow process for the job. Other label: Local File System]
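The remote system call idea can be sketched in Python with two toy classes. `Shadow` and `SyscallStub` are hypothetical names, the "RPC" is a plain method call standing in for a network connection, and the stub decides between direct access and RPC from a list of shared mount points, which simplifies the consultation with the shadow described above.

```python
import os
import tempfile

class Shadow:
    """Runs on the submit machine; executes file calls on the job's behalf."""

    def __init__(self):
        self._files = {}
        self._next_fd = 0

    def handle(self, call, *args):
        if call == "open":
            path, mode = args
            fd = self._next_fd
            self._files[fd] = open(path, mode)
            self._next_fd += 1
            return fd
        if call == "read":
            fd, n = args
            return self._files[fd].read(n)
        if call == "write":
            fd, data = args
            return self._files[fd].write(data)
        if call == "close":
            (fd,) = args
            self._files.pop(fd).close()
            return None
        raise ValueError(f"unknown call: {call}")

class SyscallStub:
    """Stands in for the library linked into the job on the remote machine."""

    def __init__(self, shadow, shared_mounts=()):
        self.shadow = shadow  # would be a network connection in reality
        self.shared_mounts = tuple(shared_mounts)

    def open(self, path, mode="r"):
        # Prefer direct access when the file sits on a shared file system.
        if any(path.startswith(m) for m in self.shared_mounts):
            return ("local", open(path, mode))
        return ("rpc", self.shadow.handle("open", path, mode))

    def read(self, handle, n=-1):
        kind, h = handle
        return h.read(n) if kind == "local" else self.shadow.handle("read", h, n)

    def close(self, handle):
        kind, h = handle
        if kind == "local":
            h.close()
        else:
            self.shadow.handle("close", h)

with tempfile.TemporaryDirectory() as submit_dir:
    path = os.path.join(submit_dir, "jobfile1")
    with open(path, "w") as f:
        f.write("input data")
    stub = SyscallStub(Shadow())  # no shared mounts, so calls go via "RPC"
    handle = stub.open(path)
    data = stub.read(handle)
    stub.close(handle)
print(handle[0], data)
```

The job's code only ever calls `open`, `read`, and `close` on the stub, so it does not need to know whether the file was reached directly or through the shadow.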
13. Checkpoint
- To checkpoint an executing program is to take a snapshot of its current state in such a way that the program can be restarted from that state at a later time, possibly on a different resource
- Provides
  - Preemptive-resume scheduling
  - Fault tolerance, when checkpointing is done periodically
- In Condor, checkpointing running jobs is optional. If it is needed, the program should be linked with condor_syscall_lib
14. Checkpointing in Condor
- Implemented in condor_syscall_lib as a signal handler
- When Condor sends a signal to checkpoint, the handler saves process state information in a checkpoint file
  - From core: contents of the process's uarea, data, and stack segments
  - From executable: symbol and debugging info, initialized data, text
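A much simplified, application-level analogue of a checkpoint signal handler is sketched below in Python. Condor's library saves the whole process image; this sketch only pickles an explicit `state` dictionary. It assumes a POSIX system with `SIGUSR1` and Python 3.8+ for `signal.raise_signal`; the signal choice, the file name, and the toy job are hypothetical.

```python
import os
import pickle
import signal
import tempfile

# Hypothetical checkpoint location for this toy job.
CKPT_PATH = os.path.join(tempfile.gettempdir(), f"toy_job_{os.getpid()}.ckpt")

state = {"i": 0, "total": 0}  # everything the toy job needs to resume

def checkpoint_handler(signum, frame):
    """Save the job state: write a temp file, then rename it into place."""
    tmp_path = CKPT_PATH + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, CKPT_PATH)  # atomic, so readers never see half a file

signal.signal(signal.SIGUSR1, checkpoint_handler)

def run(state, steps):
    """Toy computation: sum the integers 0..steps-1."""
    while state["i"] < steps:
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] == 3:
            # Stand-in for Condor sending the checkpoint signal to the job.
            signal.raise_signal(signal.SIGUSR1)
    return state["total"]

result = run(state, 5)

with open(CKPT_PATH, "rb") as f:
    saved = pickle.load(f)
os.remove(CKPT_PATH)
print(result, saved)
```

Writing to a temporary file and renaming it keeps the previous checkpoint usable if the job is interrupted while a new one is being written.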
15. Checkpointing and Restart
- The shadow sends the latest checkpoint file to the new starter during restart
- The starter reads the job state from the checkpoint file and execution continues
- The starter periodically sends a checkpoint signal to the executing job
- condor_syscall_lib makes the job dump core and saves the job state in the checkpoint file
- The checkpoint file is temporarily stored at the remote machine
- The starter transfers the latest checkpoint file to the shadow when the job is vacated
[Figure: remote machine and submit machine. The starter process sends the checkpoint signal to the remote job; code in condor_syscall_lib saves process state information in the checkpoint file; the checkpoint file is transferred to the shadow process for the job when the job is vacated, and back to a starter when the job is restarted. Other label: Local File System]
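The vacate and restart flow can be walked through with a toy Python example in which three temporary directories stand in for the submit machine and two remote machines. The job checkpoints after every step here, a simplification of the periodic signal described above, and `run_job`, the directory names, and the JSON checkpoint format are all hypothetical.

```python
import json
import os
import shutil
import tempfile

def run_job(state, steps, ckpt_path, vacate_at=None):
    """Toy job: sums 0..steps-1 and checkpoints after every step.

    Returns None if the job is vacated before finishing.
    """
    while state["i"] < steps:
        if vacate_at is not None and state["i"] == vacate_at:
            return None  # resource was reclaimed by its owner
        state["total"] += state["i"]
        state["i"] += 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)
    return state["total"]

with tempfile.TemporaryDirectory() as root:
    submit = os.path.join(root, "submit")      # where the shadow lives
    remote_a = os.path.join(root, "remote_a")  # first execution site
    remote_b = os.path.join(root, "remote_b")  # site used after restart
    for d in (submit, remote_a, remote_b):
        os.mkdir(d)

    # 1. Job runs on remote_a; checkpoints are stored there temporarily.
    first = run_job({"i": 0, "total": 0}, 6,
                    os.path.join(remote_a, "job.ckpt"), vacate_at=4)

    # 2. Job is vacated: starter transfers the latest checkpoint to the shadow.
    shutil.copy(os.path.join(remote_a, "job.ckpt"),
                os.path.join(submit, "job.ckpt"))

    # 3. Restart: shadow sends the checkpoint to the new starter on remote_b.
    shutil.copy(os.path.join(submit, "job.ckpt"),
                os.path.join(remote_b, "job.ckpt"))
    with open(os.path.join(remote_b, "job.ckpt")) as f:
        restored = json.load(f)
    resumed_from = restored["i"]

    # 4. Execution continues from the restored state.
    final = run_job(restored, 6, os.path.join(remote_b, "job.ckpt"))

print(first, resumed_from, final)
```

The restarted job resumes from step 4 instead of step 0 and finishes with the same total an uninterrupted run would produce.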
16. CONDOR APIs - Glance
- Compile as a Condor job
  - gcc -c hello.c -o hello.o
  - condor_compile gcc hello.o -o hello
- Submit a Condor job
  - cat > submit.hello
    - Executable = hello
    - Universe = standard
    - Output = hello.out
    - Log = hello.log
    - Queue
  - condor_submit submit.hello (creates the Job ClassAd)
17. CONDOR APIs (Cont.)
- condor_master: starts other daemons
- condor_vacate: vacate jobs running on specified hosts
- condor_status: display status of the Condor pool
- condor_rm: remove a Condor job from the queue
- More commands at http://www.cs.wisc.edu/condor/manual/v6.4/
18. REFERENCES
- Condor Project Home Page: http://www.cs.wisc.edu/condor/
- Research Publications on Condor: http://www.cs.wisc.edu/condor/publications.html