Title: Condor Tutorial DoD UGC 2000
1Condor Tutorial DoD UGC 2000
2Introductions
3Tutorial Outline
- Overview: What is Condor?
- Condor Architecture
- Starting to Use Condor
- Adding more Resources
- ClassAds 101
- More advanced User Jobs
4Tutorial Outline
- Overview: What is Condor?
- What does Condor do?
- What is Condor good for?
- What kind of results can I expect?
5What is Condor?
- A system for High-Throughput Computing
- Lots of jobs over a long period of time, not a
short burst of high-performance
- Condor manages both resources (machines) and
resource requests (jobs)
6What is Condor? (cont'd)
- Condor has several unique mechanisms
- ClassAds
- Transparent checkpoint/restart
- Remote System Calls
7What's Condor Good For?
- Managing a large number of jobs
- You specify the jobs in a file and submit them to
Condor, which runs them all and sends you email
when they complete
- Mechanisms to help you manage huge numbers of
jobs (1000s), all the data, etc.
- Condor can handle inter-job dependencies (DAGMan)
8What's Condor Good For? (cont'd)
- Robustness
- Checkpointing allows guaranteed forward progress
of your jobs, even jobs that run for weeks before
completion
- If an execute machine crashes, you only lose work
done since the last checkpoint
- Condor maintains a persistent job queue; if the
submit machine crashes, Condor will recover
9What's Condor Good For? (cont'd)
- Giving your job the agility to access more
computing resources
- Checkpointing allows your job to run on
opportunistic resources (not dedicated)
- Checkpointing also provides migration: if a
machine is no longer available, move!
- With remote system calls, run on systems which do
not share a filesystem
- You don't even need an
account on a machine where your job executes
10Condor will also ...
- implement your policy on when the jobs can run
on your workstation
- implement your policy on the execution order of
the jobs
- keep a log of your job activities
11What kind of results can I expect?
- Statistics from the UW-Madison Condor Pool
- condor-view\index.html
12Tutorial Outline
- Condor Architecture
- What's a Condor Pool?
- The daemons and their roles
- What happens when I submit a job?
- What happens when my job runs?
13What is a Condor Pool?
- Pool can be a single machine, or a group of
machines
- Determined by a central manager: the
matchmaker and centralized information repository
- Each machine runs various daemons to provide
different services, either to the users who
submit jobs, the machine owners, or the pool
itself
14In Brief: The Condor Daemons
15condor_master
- Starts up all other Condor daemons
- If there are any problems and a daemon exits, it
restarts the daemon and sends email to the
administrator
- Checks the time stamps on the binaries it is
configured to spawn, and if new binaries appear,
the master will gracefully shut down the currently
running version and start the new version
16condor_master (cont'd)
- Provides access to many remote administration
commands
- condor_reconfig
- condor_restart, condor_off, condor_on
- Default server for many other commands
- condor_config_val, etc.
- Periodically runs condor_preen to clean up any
files Condor might have left on the machine (the
rest of the daemons clean up after themselves, as
well)
17condor_startd
- Represents a machine to the Condor pool
- Enforces the wishes of the machine owner (the
owner's policy)
- Responsible for starting, suspending, and
stopping jobs
- Spawns the appropriate condor_starter, depending
on the type of job
- Provides other administrative commands (for
example, condor_vacate)
18condor_starter
- Spawned by the condor_startd to handle all the
details of starting and managing the job (for
example, transferring the job's binary to the
executing machine or sending back exit status)
- On SMP machines, you get one condor_starter per
CPU
- For PVM jobs, the starter also spawns a PVM
daemon (condor_pvmd)
19condor_schedd
- Represents users to the Condor pool
- Maintains persistent queue of jobs
- Responsible for contacting available machines and
spawning waiting jobs
- Services most user commands
- condor_submit
- condor_rm
- condor_q
20condor_shadow
- Represents the job on the submit machine
- Services requests from standard jobs for
remote system calls, including all file I/O
- Is responsible for making decisions on behalf of
the job (for example, where to store the
checkpoint file)
- There will be one condor_shadow process running
on your submit machine for each currently running
Condor job
21condor_shadow (cont'd)
- The shadow doesn't put much load on your submit
machine
- Almost always blocked waiting for requests from
the job or doing I/O
- Relatively small memory footprint
- Still, you can limit the impact of the shadows on
a given submit machine
- They can be started by Condor with a nice-level
that you configure (renice)
- Can put a limit on the total number of shadows
running on a machine
22condor_collector
- Collects information from all other Condor
daemons in the pool
- Each daemon sends a periodic update called a
ClassAd to the collector
- Services queries for information
- Queries from other Condor daemons
- Queries from users (condor_status)
23condor_negotiator
- Performs matchmaking in Condor
- Gets information from the collector about all
available machines and all idle jobs
- Tries to match jobs with machines that will serve
them
- Both the job and the machine must satisfy each
other's requirements (this is called 2-way
matching)
24Layout of a Personal Condor
25Layout of a Condor Pool
ClassAd Communication Pathway
26What happens when you submit a job to Condor?
- condor_submit contacts condor_schedd and adds job
to the job queue
- condor_schedd sends ClassAd to the
condor_collector requesting a machine
- condor_negotiator matches the request with an
available machine
- condor_schedd claims the machine and spawns a
condor_shadow
27What happens when you submit a job to Condor?
(part 2)
- condor_shadow contacts condor_startd and requests
the appropriate condor_starter
- condor_starter actually spawns application, and
connects it to the condor_shadow
- condor_startd monitors the machine and waits for
commands
- Either the application completes, or the
condor_startd forces it to either suspend or
vacate
28Job Startup
[Diagram labels: Submit, Schedd, Shadow, Startd, Starter, Customer Job, Condor Syscall Lib]
29Tutorial Outline
- Starting to Use Condor
- Installation
- Preparing your job for Condor
- Submitting your job to Condor
- Monitoring your jobs
30Installing Condor
- Download Condor for your operating system
- http://www.cs.wisc.edu/condor
- Stable vs. Developer Releases
31What Kind of Job Do You Have?
- You must know some things about your job to
decide if and how it will work with Condor
- What kind of I/O does it do?
- Does it use TCP/IP? (network sockets)
- Can the job be resumed?
- About how long do you expect the job to run?
- Is the job multi-process (fork(), pvm_addhost(),
etc.)?
32What Kind of I/O Does Your Job Do?
- Interactive TTY
- Batch TTY (just reads from STDIN and writes to
STDOUT or STDERR, but you can redirect to/from
files)
- X Windows
- NFS, AFS, or another network file system
- Local file system
- TCP/IP
33What Does Condor Support?
- Condor can support various combinations of these
features in different Universes
- Different Universes provide different
functionality for your job
- Vanilla
- Standard
- Scheduler
- PVM (and soon MPI)
- Globus (more on this later)
34Condor Universes
- A Universe specifies a Condor runtime
environment
- STANDARD
- Supports Checkpointing
- Supports Remote System Calls
- Has some limitations (no fork(), socket(), etc.)
- VANILLA
- Any Unix executable (shell scripts, etc)
- No Condor Checkpointing or Remote I/O
35Condor Universes (cont'd)
- PVM (Parallel Virtual Machine) and MPI
- Allows you to run parallel jobs in Condor (more
on this later)
- SCHEDULER
- Special kind of Condor job: the job is run on the
submit machine, not a remote execute machine
- Job is automatically restarted if the
condor_schedd is shut down
- Used to schedule jobs (e.g. DAGMan)
36In Brief: Submitting Jobs to Condor
- Choosing a Universe for your job (already
covered this)
- Preparing your job
- Making it batch-ready
- Re-linking if checkpointing and remote system
calls are desired (condor_compile)
- Creating a submit description file
- Running condor_submit
- Sends your request to the User Agent
(condor_schedd)
37Preparing Your Job
- Making your job batch-ready
- Must be able to run in the background: no
interactive input, windows, GUI, etc.
- Can still use STDIN, STDOUT, and STDERR (the
keyboard and the screen), but files are used for
these instead of the actual devices
- If your job expects input from the keyboard, you
have to put the input you want into a file
38Preparing Your Job (cont'd)
- If you are going to use the standard universe
with checkpointing and remote system calls, you
must re-link your job with Condor's special
libraries
- To do this, you use condor_compile
- Place condor_compile in front of the command
you normally use to link your job
condor_compile gcc -o myjob myjob.c
39Creating a Submit Description File
- A plain ASCII text file
- Tells Condor about your job
- Which executable, universe, input, output and
error files to use, command-line arguments,
environment variables, any special requirements
or preferences (more on this later)
- Can describe many jobs at once (a cluster), each
with different input, arguments, output, etc.
40Example Submit Description File
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = standard
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Log        = my_job.log
Arguments  = -arg1 -arg2
InitialDir = /home/wright/condor/run_1
Queue
41Example Submit Description File Described
- Submits a single job to the standard universe,
specifies files for STDIN, STDOUT and STDERR,
creates a UserLog, defines command line arguments,
and specifies the directory the job should be run
in
- Equivalent to (for outside of Condor)
cd /home/wright/condor/run_1
/home/wright/condor/my_job.condor -arg1 -arg2 \
    > my_job.stdout 2> my_job.stderr \
    < my_job.stdin
42Clusters and Processes
- If your submit file describes multiple jobs, we
call this a cluster
- Each job within a cluster is called a process
or proc
- If you only specify one job, you still get a
cluster, but it has only one process
- A Condor Job ID is the cluster number, a
period, and the process number (for example, 23.5)
- Process numbers always start at 0
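The job ID format above can be sketched in a few lines of Python. This is only an illustration of the "cluster.process" naming; the helper names parse_job_id and job_ids are hypothetical and are not part of Condor.

```python
from typing import List, NamedTuple


class JobId(NamedTuple):
    cluster: int
    proc: int


def parse_job_id(text: str) -> JobId:
    """Split a job ID such as "23.5" into its cluster and process numbers."""
    # Exactly one period is expected; anything else raises ValueError.
    cluster_part, proc_part = text.split(".")
    return JobId(int(cluster_part), int(proc_part))


def job_ids(cluster: int, count: int) -> List[str]:
    """List the IDs of `count` jobs in one cluster; process numbers start at 0."""
    return [f"{cluster}.{proc}" for proc in range(count)]


if __name__ == "__main__":
    print(parse_job_id("23.5"))
    print(job_ids(23, 3))
```

A submit file that queues a single job still produces a cluster, so job_ids(23, 1) gives just one ID ending in ".0".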
43Example Submit Description File for a Cluster
# Example condor_submit input file that defines
# a whole cluster of jobs at once
Universe   = standard
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Log        = my_job.log
Arguments  = -arg1 -arg2
InitialDir = /home/wright/condor/run_$(Process)
Queue 500
44Example Submit Description File for a Cluster -
Described
- Now, the initial directory for each job is
specified with the $(Process) macro, and instead
of submitting a single job, we use Queue 500 to
submit 500 jobs at once
- $(Process) will be expanded to the process number
for each job in the cluster (from 0 up to 499 in
this case), so we'll have run_0, run_1, ...,
run_499 directories
- All the input/output files will be in different
directories!
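As a rough model of what the $(Process) macro does for a "Queue 500" statement, the Python sketch below performs the same text substitution once per process number. It only mimics the expansion for illustration; the function name expand_process_macro is hypothetical, and Condor does this itself when it reads the submit file.

```python
from typing import List


def expand_process_macro(template: str, queue_count: int) -> List[str]:
    """Return one expanded value per job, for process numbers 0..queue_count-1."""
    return [template.replace("$(Process)", str(proc))
            for proc in range(queue_count)]


if __name__ == "__main__":
    dirs = expand_process_macro("/home/wright/condor/run_$(Process)", 500)
    print(dirs[0])    # first job's initial directory
    print(dirs[-1])   # last job's initial directory
    print(len(dirs))  # number of jobs in the cluster
```

Because the file names (my_job.stdin and so on) are relative to each job's initial directory, every job reads and writes its own copies.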
45Running condor_submit
- You give condor_submit the name of the submit
file you have created
- condor_submit parses the file and creates a
ClassAd that describes your job(s)
- Creates the files you specified for STDOUT and
STDERR
- Sends your job's ClassAd(s) and executable to the
condor_schedd, which stores the job in its queue
46In Brief: Monitoring Your Jobs
- Using condor_q
- Using a User Log file
- Using condor_status
- Using condor_rm
- Getting email from Condor
- Once they complete, you can use condor_history to
examine them
47Using condor_q
- To view the jobs you have submitted, you use
condor_q
- Displays the status of your job, how much compute
time it has accumulated, etc.
- Many different options
- A single job, a single cluster, all jobs that
match a certain constraint, or all jobs
- Can view remote job queues (either individual
queues, or -global)
48Using a User Log file
- A UserLog must be specified in your submit file
- Log = filename
- You get a log entry for everything that happens
to your job
- When it was submitted, when it starts executing,
if it is checkpointed or vacated, if there are
any problems, etc.
- Very useful! Highly recommended!
49Uses for the User Log
- Event triggers for meta-schedulers
- Like DAGMan
- Visualize job progress
- Condor UserLog Viewer
50Using condor_status
- To view the status of the whole Condor pool, you
use condor_status
- Can use the -run option to see which machines
are running jobs, as well as
- The user who submitted each job
- The machine they submitted from
- Can also view the status of various submitters
with -submitter <name>
51Using condor_rm
- If you want to remove a job from the Condor
queue, you use condor_rm
- You can only remove jobs that you own (you can't
run condor_rm on someone else's jobs unless you
are root)
- You can give specific job IDs (cluster or
cluster.proc), or you can remove all of your jobs
with the -a option.
52Getting Email from Condor
- By default, Condor will send you email when your
job completes
- If you don't want this email, put this in your
submit file
- notification = never
- If you want email every time something happens to
your job (checkpoint, exit, etc), use this
- notification = always
53Getting Email from Condor (cont'd)
- If you only want email if your job exits with an
error, use this
- notification = error
- By default, the email is sent to your account on
the host you submitted from. If you want the
email to go to a different address, use this
- notify_user = email@address.here
54Using condor_history
- Once your job completes, it will no longer show
up in condor_q
- Now, you must use condor_history to view the
job's ClassAd
- The status field (ST) will have either a "C"
for completed, or an "X" if the job was removed
with condor_rm
55Any questions?
- Nothing is too basic
- If I was unclear, you probably are not the only
person who doesn't understand, and the rest of
the day will be even more confusing
56Tutorial Outline
- Adding more Resources
- Personal Condor
- Build a pool
- Flocking: Linking pools together
- Globus Universe
- Opening the window to the Grid
- Glide-In
57What is Personal Condor?
- Condor re-focused and re-packaged to emphasize
- Use by an individual
- without scores of machines local to their
organization
- without sysadmin experience
- without sysadmin authority
- Some Specific Condor Mechanisms
- Flocking
- Globus Job Universe
- GlideIn
58A Pool of One
- Typical Condor Installation
- Typical Personal Condor Installation
59Personal Condor?! What's the benefit of a Condor
Pool with just one user and one machine?
60Your Personal Condor will ...
- keep an eye on your jobs and will keep you
posted on their progress
- implement your policy on when the jobs can run
on your workstation
- implement your policy on the execution order of
the jobs
- add fault tolerance to your jobs
- keep a log of your job activities
62There's more! Build a Personal Cluster
- Install Condor on any other machines you have
- Point them at your Personal Condor installation
machine
64There's more! Take advantage of your friends
- Get permission from friendly Condor pools to
access their resources
- Configure your Personal Condor to flock to
these pools
66How Flocking Works
- Put in condor_config
- FLOCK_HOSTS = Pool-Foo, Pool-Bar
[Diagram labels: Submit Machine, Schedd, Central Manager (CONDOR_HOST), Collector, Negotiator, Pool-Foo Central Manager, Pool-Bar Central Manager]
67Flocking Pros & Cons
- Pros
- Flock hosts are tried in the order specified
until jobs are satisfied
- Property of the Schedd, not the CM
- Different users can Flock to different pools
- User priority system is flocking-aware
- Cons
- Not yet well-integrated with the tools
- Job Rank does not work across pools
68There's More! Access the National Technology Grid
- National Technology Grid
- Collaborative effort underway by many Government
labs, National Science Foundation, and several
Universities
- Centralized virtual machine room for account
creation, etc.
- Globus Toolkit
- Grid Certificate (X.509)
- Globus Gatekeepers
69Submit a job in Condor to run on Globus
- In submit description file given to
condor_submit, specify
- Universe = Globus
- Which Globus Gatekeeper to use
- Location of file containing your Globus
certificate
70Condor Submit Machine
Schedd
71Condor's Globus Universe Pros/Cons
- Pros
- Persistent queue for jobs destined to run on
Globus resources
- Jobs submitted to run on Globus resources are
managed
- details on progress, order of execution, DAGMan, ...
- Cons
- Must specify a particular gateway
- What about Standard Universe features?
- Checkpoint/Restart, Remote system calls, etc.
72One Solution: Condor GlideIn
- Submit your jobs as regular Condor jobs
(Standard, Vanilla, or PVM Universe)
- Expand your Personal Condor pool by submitting
GlideIn Jobs
- A Globus Universe job which consists of the
Condor Daemons (master, startd, starter).
- When your GlideIn Jobs run on the Globus
resource, that resource will join your Personal
Condor Pool!
- Then your regular Condor jobs get matched and run
on Globus resources as usual.
74Some Problems and some Solutions
- What if all your jobs completed before a given
GlideIn job starts running?
- Solution: If the Condor daemons which have
glided-in are not matched with a job in 10
minutes, they terminate.
75Some Problems and some Solutions, cont.
- My Personal Condor is flocking with a bunch of
Solaris machines, and also doing a GlideIn to a
Silicon Graphics O2K. I do not want to
statically partition my jobs.
- Solution: In your submit file, say
Executable = myjob.$$(OpSys).$$(Arch)
The $$(xxx) notation
is replaced with attributes from the machine
ClassAd which was matched with your job.
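A small Python sketch of the idea behind the $$(xxx) notation: each $$(Name) in the submit-file value is replaced with the Name attribute of the matched machine's ClassAd. This only models the substitution for illustration. The function name substitute_machine_attrs is hypothetical, and the sample attribute values are borrowed from the machine ClassAd example elsewhere in this tutorial.

```python
import re
from typing import Dict


def substitute_machine_attrs(value: str, machine_ad: Dict[str, str]) -> str:
    """Replace every $$(Name) in `value` with machine_ad["Name"]."""
    def lookup(match: "re.Match[str]") -> str:
        # A missing attribute raises KeyError here; this sketch does not
        # try to model what Condor does in that case.
        return str(machine_ad[match.group(1)])

    return re.sub(r"\$\$\((\w+)\)", lookup, value)


if __name__ == "__main__":
    # Assumed sample ad; a real ad carries many more attributes.
    machine_ad = {"OpSys": "SOLARIS26", "Arch": "INTEL"}
    print(substitute_machine_attrs("myjob.$$(OpSys).$$(Arch)", machine_ad))
```

With one binary per platform named this way, the same submit file can serve every machine type the job might land on.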
76In Review
- With Personal Condor you can
- manage your compute job workload
- access local machines
- access remote Condor Pools via flocking
- access remote compute resources on the Grid via
Globus Universe
- carve out your own personal Condor Pool from
the Grid with GlideIn technology.
77Current Status
- Initial Personal Condor implementation exists
- Includes Flocking, Globus Uni., GlideIn
- Powered several demonstrations at SC99
(SuperComputing 1999 Conference)
- In use by some collaborators
78Current Status, cont.
- Work currently in progress
- Enhance robustness, scalability
- Add features missing from Condor's Globus
Universe
- Streamline the process for the user
- Personal Condor distribution
- Simplified Installation Procedure
- Grid-Access Only mode
79Tutorial Outline
- ClassAds 101
- What are Classified Advertisements?
- ClassAd Matching
- ClassAds in Condor
- Requirements expression
- Rank expression
- Undefined Values
80In Brief: Classified Advertisements
- ClassAds
- Language for expressing attributes
- Semantics for evaluating them
- Intuitively, a ClassAd is a set of named
expressions
- Each named expression is an attribute
- Expressions are similar to C
- Constants, attribute references, operators
81Classified Advertisements Example
- MyType = "Machine"
- TargetType = "Job"
- Name = "froth.cs.wisc.edu"
- StartdIpAddr = "<128.105.73.44:33846>"
- Arch = "INTEL"
- OpSys = "SOLARIS26"
- VirtualMemory = 225312
- Disk = 35957
- KFlops = 21058
- Mips = 103
- LoadAvg = 0.011719
- KeyboardIdle = 12
- Cpus = 1
- Memory = 128
- Requirements = LoadAvg < 0.300000 &&
KeyboardIdle > 15 * 60
- Rank = 0
82Classified Advertisements Matching
- ClassAds are always considered in pairs
- Does ClassAd A match ClassAd B (and vice versa)?
- This is called 2-way matching
- If the same attribute appears in both ClassAds,
you can specify which attribute you mean by
putting MY. or TARGET. in front of the
attribute name
83Classified Advertisements Examples
- ClassAd B
- MyType = "ApartmentRenter"
- TargetType = "Apartment"
- UnderGrad = False
- RentOffer = 900
- Rank = 1/(TARGET.RentOffer + 100.0) +
50*HeatIncluded
- Requirements = OnBusLine &&
SquareArea > 2700
- ClassAd A
- MyType = "Apartment"
- TargetType = "ApartmentRenter"
- SquareArea = 3500
- RentOffer = 1000
- HeatIncluded = False
- OnBusLine = True
- Rank = UnderGrad==False +
TARGET.RentOffer
- Requirements = MY.RentOffer -
TARGET.RentOffer < 150
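Two-way matching on these two ads can be sketched in Python as follows. This is a simplified model and not Condor's matchmaker: each Requirements expression is written as a function of (my, target), the operators are read as "OnBusLine and SquareArea > 2700" for the renter and "MY.RentOffer - TARGET.RentOffer < 150" for the apartment, and Rank is ignored.

```python
def apartment_requirements(my, target):
    # Apartment (ClassAd A): the renter's offer must be less than 150
    # below the apartment's own RentOffer.
    return my["RentOffer"] - target["RentOffer"] < 150


def renter_requirements(my, target):
    # Renter (ClassAd B): the apartment must be on a bus line and have
    # a SquareArea above 2700.
    return target["OnBusLine"] and target["SquareArea"] > 2700


ad_a = {
    "MyType": "Apartment", "TargetType": "ApartmentRenter",
    "SquareArea": 3500, "RentOffer": 1000,
    "HeatIncluded": False, "OnBusLine": True,
    "Requirements": apartment_requirements,
}

ad_b = {
    "MyType": "ApartmentRenter", "TargetType": "Apartment",
    "UnderGrad": False, "RentOffer": 900,
    "Requirements": renter_requirements,
}


def two_way_match(a, b):
    """Both ads must satisfy each other's Requirements."""
    return bool(a["Requirements"](a, b) and b["Requirements"](b, a))


if __name__ == "__main__":
    print(two_way_match(ad_a, ad_b))
```

Here the apartment accepts the renter (1000 - 900 is below 150) and the renter accepts the apartment, so the pair matches. Lowering the renter's RentOffer to 800 would break the apartment's side of the match.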
84ClassAds in the Condor System
- ClassAds allow Condor to be a general system
- Constraints and ranks on matches expressed by the
entities themselves
- Only priority logic integrated into the
Match-Maker
- All principal entities in the Condor system are
represented by ClassAds
- Machines, Jobs, Submitters
85Example ClassAd for a Machine
- Friends = Owner == "tannenba" || Owner == "wright"
- Family = Owner == "aunt" || Owner == "uncle"
- Enemies = Owner == "rival" || Owner == "riffraff"
- Requirements = (!Enemies) &&
(Family || (LoadAvg < 0.3 && KeyboardIdle > 15*60))
- Rank = Friends + Family*10
86ClassAd Requirements Described
- Machine will never start a job submitted by
"rival" or "riffraff"
- If someone from Family ("aunt" or "uncle")
submits a job, it will always run, regardless of
keyboard activity or load average
- If anyone else submits a job, it will only run
here if the keyboard has been idle for more than
15 minutes and the load average is less than 0.3
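The three rules above translate directly into a small Python function. This is only a sketch of the policy as described on this slide, not of how the condor_startd evaluates ClassAds; the function name machine_requirements and the choice of seconds for keyboard idle time are assumptions.

```python
FAMILY = {"aunt", "uncle"}
ENEMIES = {"rival", "riffraff"}


def machine_requirements(owner: str, load_avg: float,
                         keyboard_idle_secs: int) -> bool:
    """Return True if this machine is willing to start a job from `owner`."""
    if owner in ENEMIES:
        return False   # never run jobs from enemies
    if owner in FAMILY:
        return True    # family always runs, whatever the load or keyboard state
    # Everyone else: keyboard idle for more than 15 minutes and load below 0.3.
    return load_avg < 0.3 and keyboard_idle_secs > 15 * 60


if __name__ == "__main__":
    print(machine_requirements("rival", 0.0, 3600))
    print(machine_requirements("aunt", 2.5, 0))
    print(machine_requirements("smith", 0.1, 12))
```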
87ClassAd Rank Described
- If the machine is running a job submitted by
owner "smith", it will give this a Rank of 0,
since "smith" is neither Friend nor Family.
- If "wright" or "tannenba" submits a job, it will
be ranked at 1 (since Friends will evaluate to 1
and Family is 0)
- If "aunt" or "uncle" submits a job, it will have a
rank of 10
- A job owned by a Friend would be preempted for a
job owned by Family.
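The ranking described above is plain arithmetic on two true/false values, which the following Python sketch reproduces. It assumes Rank is Friends plus ten times Family, with true counted as 1 and false as 0; the function name machine_rank is hypothetical.

```python
FRIENDS = {"tannenba", "wright"}
FAMILY = {"aunt", "uncle"}


def machine_rank(owner: str) -> int:
    """Rank the machine gives to a job from `owner` (higher is preferred)."""
    friends = int(owner in FRIENDS)   # 1 if the owner is a friend, else 0
    family = int(owner in FAMILY)     # 1 if the owner is family, else 0
    return friends + family * 10


if __name__ == "__main__":
    for owner in ("smith", "wright", "aunt"):
        print(owner, machine_rank(owner))
```

Since a Family job ranks above a Friend job, the machine prefers it, which is why the Friend's job would be preempted.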
88Example ClassAd for a Job
- Requirements =
Arch == "INTEL" && OpSys == "LINUX" && Memory >= 20
- Rank =
(Memory > 32) *
( (Memory * 100) + (IsDedicated * 10000) +
Mips )
89ClassAd Rank and Requirements Described
- The job must run on an Intel CPU, running Linux,
with at least 20 megs of RAM
- All machines with 32 megs of RAM or less are
Ranked at 0
- Machines with more than 32 megs of RAM are ranked
according to how much RAM they have, if the
machine is dedicated (which counts a lot to this
job!), and how fast the machine is, as measured
in Million Instructions Per Second
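The job's Rank can be worked through numerically. The Python sketch below assumes the expression is (Memory > 32) * (Memory * 100 + IsDedicated * 10000 + Mips), with true/false counted as 1/0; the function name job_rank is hypothetical, and the Memory and Mips figures in the usage lines are taken from the machine ClassAd example elsewhere in this tutorial.

```python
def job_rank(memory_mb: int, is_dedicated: bool, mips: int) -> int:
    """Rank the job assigns to a machine (higher is preferred)."""
    enough_memory = int(memory_mb > 32)   # 0 wipes out the whole rank
    return enough_memory * (memory_mb * 100
                            + int(is_dedicated) * 10000
                            + mips)


if __name__ == "__main__":
    print(job_rank(32, True, 500))     # 32 megs or less always ranks 0
    print(job_rank(128, False, 103))   # 128*100 + 103
    print(job_rank(128, True, 103))    # dedicated adds 10000
```

The 10000 bonus for a dedicated machine is worth as much as 100 extra megabytes of memory, which is why dedication "counts a lot" for this job.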
90Finding and Using the ClassAd Attributes in your
Pool
- Condor defines a number of attributes we haven't
mentioned.
- You can see all the attributes for a machine
with
- condor_status -long <hostname>
- You can see all the attributes for a job with
- condor_q -long <job-number>
91Customized ClassAd Attributes
- Custom attributes can be added to either Machine
or Job Ads
- These attributes can then be used in Requirements
and/or Rank expressions
- Useful for steering to specific resources
92Undefined Values
- Suppose that a Job requires
- Requirements = (Color != "Red")
- Color = "Blue" -> MATCH
- Color = "Red" -> no match
- Color is undefined -> no match
- Avoid this behavior with =!=
- Requirements = (Color =!= "Red")
- Color = "Blue" -> MATCH
- Color = "Red" -> no match
- Color is undefined -> MATCH
- (The same holds for == and =?=)
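The difference between the two operators can be modelled with a three-valued comparison. In the Python sketch below, UNDEFINED is a stand-in object for a missing attribute, not_equal plays the role of != and is_not_identical plays the role of =!=. These names are hypothetical, and the sketch only mirrors the match/no-match table above rather than the full ClassAd evaluation rules.

```python
UNDEFINED = object()   # stand-in for an attribute missing from the ad


def not_equal(a, b):
    """Model of != : the result is UNDEFINED if either side is undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b


def is_not_identical(a, b):
    """Model of =!= : always True or False, even with undefined operands."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is not b
    return a != b


def matches(requirements_value) -> bool:
    """A match needs Requirements to be exactly True; UNDEFINED is not enough."""
    return requirements_value is True


if __name__ == "__main__":
    for color in ("Blue", "Red", UNDEFINED):
        label = "undefined" if color is UNDEFINED else color
        print(label,
              matches(not_equal(color, "Red")),
              matches(is_not_identical(color, "Red")))
```

The two operators agree whenever Color is defined; they differ only in the undefined case, where != gives no match and =!= gives a match.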
93Tutorial Outline
- More advanced User Jobs
- DAGMan
- Opportunistic PVM
- MW
94Thank you!
- Check us out on the Web
- http://www.cs.wisc.edu/condor
- Email
- condor-admin@cs.wisc.edu