Title: Introduction to Condor and CamGrid
1 Introduction to Condor and CamGrid
Mark Calleja, Cambridge eScience Centre
www.escience.cam.ac.uk
2 The problem
- I've got some computational jobs to run. Where
can I run them?
- Well, if they're few (how many's that?), small
(minimal memory footprint) and short (less than
an hour?), I may be tempted to use my desktop.
3 My problem's bigger than that!
- Well, if you need many gigabytes (terabytes?) of
memory, or your application can make use of many
(> 10?) processors simultaneously, e.g. using MPI
or OpenMP, then maybe you need an HPC facility.
4 Actually, it's somewhere in between
- Many (most?) scientific jobs won't require an HPC
facility, but we may have many such independent
jobs that need to be run.
- So, we need access to lots of machines like my
desktop, or maybe slightly more powerful.
- Enter Condor...
5 Condor
- From the University of Wisconsin, since 1985.
- Allows a heterogeneous mix of machines to be
linked into a pool, making them accessible for
useful work.
- These can range from lowly desktops to bigger
server types, even whole clusters.
- A pool is defined by one special machine, the
central manager, and the other machines which
allow it to coordinate them.
- Separate pools can cooperate, or "flock" (more
anon).
- Machines in a pool can decide how much CPU time
they'll give to Condor, e.g. only when idle, or
outside office hours, etc.
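How much CPU time a machine donates is expressed in its local Condor configuration. As a minimal sketch (the thresholds are illustrative, not CamGrid's actual policy), a desktop that only runs jobs while its owner is away might set:

  # condor_config: only start jobs once the keyboard/mouse have
  # been idle for 15 minutes and the machine is lightly loaded
  START   = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3
  # suspend them again as soon as the owner returns
  SUSPEND = KeyboardIdle < $(MINUTE)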
6 Great! How do I use it?
- First, you need to be given access to a submit
node, but that's the only account you'll need.
- This is an example of the grid computing
paradigm.
- Condor has some useful features beyond just
making many machines available simultaneously,
e.g.
- MPI support (with some effort)
- Failure resilience
- Process checkpoint / restart / migration
- Workflow support
- Condor's ideal for performing parameter sweeps.
7 How do I start?
- First, make the application batch ready. This
means:
- No interactive input (but you can redirect
stdin/stdout from/to files)
- No GUIs
- Ideally statically linked, but not necessary
- Pick your universe, i.e. the Condor environment
to use.
- Create a submit description file. This gives
Condor clues how/where the job can run.
- Submit your job!
8 Condor's universes
- Controls how Condor handles jobs.
- Choices include:
- Vanilla
- Standard
- Grid
- Java
- Parallel
- VM
- Today we'll just deal with the vanilla (simple,
takes your application as is) and the standard
(allows for checkpointing) universes.
9 The job description file
- This is a plain ASCII text file.
- It tells Condor about your job, i.e. which
executable, universe, input, output and error
files to use, command-line arguments, environment
variables, and any special requirements or
preferences.
- It can describe many jobs at once (a "cluster"),
each with different input, arguments, output,
etc.
- Suppose I have an application called a.out
which takes the command-line argument
-a <number> and reads input data from the files
inp1 and inp2. Output files are returned
automatically.
- Furthermore, this application must run on a
32-bit Linux box. Then a suitable job description
file would look like...
10 Simple condor_submit input file

  # Lines beginning with # are comments.
  # NOTE: the words on the left side are not
  # case sensitive, but filenames are!
  Universe                = vanilla
  Executable              = a.out
  Should_transfer_files   = YES
  When_to_transfer_output = ON_EXIT_OR_EVICT
  Transfer_input_files    = inp1, inp2
  Requirements            = OpSys == "LINUX" && Arch == "INTEL"
  Arguments               = -a 0
  Log                     = log.txt
  Input                   = input.txt
  Output                  = output.txt
  Error                   = error.txt
  Notify_user             = mark@cam.ac.uk
  Queue
11 Submitting the job
- If we've created the submit description file
"job", then we can submit it to our Condor pool
with:

  woolly% condor_submit job
  Submitting job(s).
  Logging submit event(s).
  1 job(s) submitted to cluster 684.

- We can keep track of our job with condor_q:

  woolly% condor_q
  -- Submitter: woolly.escience.cam.ac.uk : <172.24.116.7:9683>
  ID     OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
  684.0  mcal00  3/3  14:56   0+00:00:01 R  0   0.0  a.out
12 Submitting LOTS of jobs
- I have 600 jobs to run. Also, I can build 32-
and 64-bit versions of my applications, say
a.out.INTEL and a.out.X86_64.
- First, prepare 600 subdirectories with all
relevant input files (easier for sorting files).
The corresponding output files will go in these
directories.
- Call these directories dir_0, dir_1, ..., dir_599.
- Also, I require 1GB of memory for each job, and
I'd prefer to have the fastest machines.
- If you submit lots of jobs, make sure you have
the available disk space to receive the output!
- My job description file can now look like...
13 Job description file for 600 jobs

  Universe                = vanilla
  Executable              = a.out.$$(Arch)
  Should_transfer_files   = YES
  When_to_transfer_output = ON_EXIT_OR_EVICT
  Transfer_input_files    = inp1, inp2
  Requirements            = Memory > 1000 && OpSys == "LINUX" \
                            && (Arch == "INTEL" || Arch == "X86_64")
  # I'd prefer the fastest processors
  Rank                    = Kflops
  Arguments               = -a $(Process)
  InitialDir              = dir_$(Process)
  Log                     = log.txt
  Input                   = input.txt
  Output                  = output.txt
  Error                   = error.txt
  Queue 600
14 The Standard universe
- The Vanilla universe is easy to get started with,
but has some limitations:
- If the remote machine disappears (host or network
problems) then the job is restarted from scratch.
- If I don't have an account on the execute host,
and the odds are that I won't, then I can't
monitor the output files as they're being
generated (actually, we have a local solution).
- The Standard universe aims to address these
shortcomings. However, we must be able to link
our application's code with Condor's libraries:

  condor_compile gcc -o myjob myjob.c

  or even

  condor_compile make -f MyMakeFile
15 The Standard universe (2)
- Not many compilers are supported, but gcc, g++,
g77 and now gfortran (F95 support!) work.
- However, your job must be well behaved, i.e. it
can't fork, use kernel threads or certain IPC
tools (pipes, shared memory).
- If it passes these tests, and most scientific
codes do, then use "universe = standard" in the
submit script and don't include anything about
file transfer.
- I/O calls on the execute host will now be echoed
back to the submit machine, so you'll see all
output files as they're created.
- Also, Condor will periodically save the state of
your job, so if there's an outage it will restart
your job from the last saved image, and not from
scratch (cf. the Vanilla universe).
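Putting that together, a minimal standard-universe version of the earlier submit file might look like the sketch below (assuming a.out has been relinked with condor_compile); note the absence of any file-transfer commands:

  Universe   = standard
  Executable = a.out
  Arguments  = -a 0
  Log        = log.txt
  Input      = input.txt
  Output     = output.txt
  Error      = error.txt
  Queue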
16 DAGMan: Condor's workflow manager
- Directed Acyclic Graph Manager.
- DAGMan allows you to specify the dependencies
between your Condor jobs, so it can manage them
automatically for you.
- Allows complicated workflows to be built up (you
can even embed DAGs within DAGs).
- E.g., "Don't run job B until job A has
completed successfully."
- Each node is a Condor job.
- A node can have any number of parent or child
nodes, as long as there are no loops!
17 Defining a DAG
- A DAG is defined by a .dag file, listing each of
its nodes and their dependencies:

  # diamond.dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

- Each node will run the Condor job specified by
its accompanying Condor submit file.
- One can also have pre- and post-jobs to run on
the submit machine before or after any node (see
examples).
- To start your DAG, just run condor_submit_dag
with your .dag file, and Condor will start a
personal DAGMan daemon with which to begin
running your jobs:

  woolly% condor_submit_dag diamond.dag
18 DAGMan continued
- DAGMan submits jobs to the Condor queue at the
appropriate times.
- In case of a job failure, DAGMan continues until
it can no longer make progress, and then creates
a "rescue" file with the current state of the
DAG.
- Once the failed job is ready to be re-run, the
rescue file can be used to restore the prior
state of the DAG.
- DAGMan has other useful features:
- nodes can have PRE & POST scripts
- failed nodes can be automatically re-tried a
configurable number of times
- job submission can be throttled to limit the
number of active jobs in the queue.
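As a sketch (the numbers are illustrative), retries are declared per node in the .dag file, while throttling is set when the DAG is submitted:

  # in diamond.dag: re-run node B up to 3 times on failure
  RETRY B 3

  # keep at most 20 jobs from this DAG in the queue at once
  woolly% condor_submit_dag -maxjobs 20 diamond.dag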
19 Some generally useful commands
- condor_status      View pool status
- condor_q           View job queue
- condor_submit      Submit new jobs
- condor_rm          Remove jobs
- condor_history     Completed job info
- condor_submit_dag  Submit new DAG
- condor_checkpoint  Force a checkpoint
- condor_compile     Link against the Condor library
- These commands can take several arguments.
Run them with a -h argument to see the options.
20 What we haven't covered
- Parrot: a user-space file system from the Condor
project.
- Condor-C: Condor's delegated job submission
mechanism.
- Parallel/MPI jobs: Condor's way of dealing with
multi-process jobs, either spanning multiple
machines, or multi-processors/cores on a single
machine, or both.
- Other universes, e.g. Virtual/Java/Grid/Scheduler.
- Checkpointing Vanilla universe jobs: can be
done, but requires the user to do it, e.g. either
by using DAGMan or your own shell script run
recursively.
- Lots of bells and whistles, e.g. submit script
options.
21 CamGrid
- Based around the Condor middleware that you've
just heard about.
- Started in Jan 2005 by five groups (now up to
eleven groups and 13 pools).
- Each group sets up and runs its own pool, and
flocks to/from other pools.
- Hence a decentralised, federated model.
- Strengths:
- No single point of failure
- Sysadmin tasks shared out
- Weaknesses:
- Debugging can be complicated, especially
networking issues.
22 Condor flocking
- Condor attempts to run a submitted job in its
local pool.
- However, queues can be configured to try sending
jobs to other pools: "flocking".
- The user-priority system is flocking-aware:
- A pool's local users can have priority over
remote users flocking in.
- This is how CamGrid works: each group/department
maintains its own pool and flocks with the
others.
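As a sketch (the host names are hypothetical), flocking is configured by naming the other pools' central managers in each pool's Condor configuration:

  # condor_config on our submit machines: where jobs may flock to
  FLOCK_TO   = condor.otherdept.cam.ac.uk
  # condor_config in the remote pool: who may flock in
  FLOCK_FROM = condor.ourdept.cam.ac.uk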
23 Actually, CamGrid currently has 13 pools.
24 Participating departments/groups
- Cambridge eScience Centre
- Dept. of Earth Science (2)
- High Energy Physics
- School of Biological Sciences
- National Institute for Environmental eScience (2)
- Chemical Informatics
- Semiconductors
- Astrophysics
- Dept. of Oncology
- Dept. of Materials Science and Metallurgy
- Biological and Soft Systems
25 CamGrid's vanilla-universe file viewer
- Condor's Vanilla universe is nice and easy to
use, but comes at the cost of no real-time
visibility as output files get generated on
execute machines.
- We have our own web-based solution for CamGrid.
- First, ask me for a password (tell me what
username you submit jobs as, preferably your
CRSid).
- Then, use these details on the form at the bottom
of:
- http://www.escience.cam.ac.uk/projects/camgrid/condor_tools.html
- Uses cookies for session information (1-hour
sessions).
- Has a UK eScience CA certificate for
tempo.escience.cam.ac.uk
28 Some details
- First point of contact for help is your local CO.
- ucam-camgrid-users mailing list.
- Currently have 1,000 cores/processors, mostly
4-core Dell 1950s (8GB memory), like the HPCF.
- Pretty much all Linux, and mostly x86_64.
- We run the latest Condor stable version, currently
7.0.5, but we'll upgrade to 7.2.2 when it appears
(which will provide Standard universe support for
gfortran).
- Can run MPI jobs, but only within some individual
pools, and then preferably as multi-core SMP jobs
on individual machines.
- The Condor manual is a great learning resource,
and we keep an online copy with an added search
facility at:
- http://holbein.escience.cam.ac.uk/condor_manual/
29 That's 808 years, back in the reign of John I,
just before the Magna Carta
30 It's still only March!
56 refereed publications to date (Science, Phys.
Rev. Lett., ...)
31 Links
- CamGrid: www.escience.cam.ac.uk/projects/camgrid/
- Condor: www.cs.wisc.edu/condor/
- Email: mc321@cam.ac.uk
- Questions?
32 Examples URL
- Please point your browsers at:
- http://www.escience.cam.ac.uk/projects/camgrid/workshop/
- CamGrid Vanilla/Parallel file viewer:
- Your password for this session is the same as
your username, but with the "i" changed to "1",
e.g.
- username: trainXX
- password: tra1nXX