Title: Introduction to Condor and CamGrid
1 Introduction to Condor and CamGrid
Mark Calleja, Cambridge eScience Centre
www.escience.cam.ac.uk
2 The problem
- I've got some computational jobs to run. Where
can I run them?
- Well, if they're few (how many's that?), small
(minimal memory footprint) and short (less than
an hour?), I may be tempted to use my desktop.
3 My problem's bigger than that!
- Well, if you need many gigabytes (terabytes?) of
memory, or your application can make use of many
(> 10?) processors simultaneously, e.g. using MPI
or OpenMP, then maybe you need an HPC facility.
4 Actually, it's somewhere in between
- Many (most?) scientific jobs won't require an HPC
facility, but we may have many such independent
jobs that need to be run.
- So, we need access to lots of machines like my
desktop, or maybe slightly more powerful.
- Enter Condor...
5 Condor
- From the University of Wisconsin, since 1985.
- Allows a heterogeneous mix of machines to be
linked into a pool, making them accessible for
useful work.
- These can range from lowly desktops to bigger
server types, even whole clusters.
- A pool is defined by one special machine, the
central manager, and the other machines which
allow it to coordinate them.
- Separate pools can cooperate, or "flock" (more
anon).
- Machines in a pool can decide how much CPU time
they'll give to Condor, e.g. only when idle, or
outside office hours, etc.
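How much CPU time a machine donates is expressed in its local Condor configuration. As a minimal sketch (the thresholds are illustrative, not CamGrid's actual policy), a desktop that only runs jobs while its owner is away might set:

  # condor_config: only start jobs once the keyboard/mouse have
  # been idle for 15 minutes and the machine is lightly loaded
  START   = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3
  # suspend them again as soon as the owner returns
  SUSPEND = KeyboardIdle < $(MINUTE)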
6 Great! How do I use it?
- First, you need to be given access to a submit
node, but that's the only account you'll need.
- This is an example of the grid computing
paradigm.
- Condor has some useful features beyond just
making many machines available simultaneously,
e.g.
- MPI support (with some effort)
- Failure resilience
- Process checkpoint / restart / migration
- Workflow support
- Condor's ideal for performing parameter sweeps.
7 How do I start?
- First, make the application batch ready. This
means:
- No interactive input (but you can redirect
stdin/stdout from/to files)
- No GUIs
- Ideally statically linked, but not necessary
- Pick your universe, i.e. the Condor environment
to use.
- Create a submit description file. This gives
Condor clues how/where the job can run.
- Submit your job!
8 Condor's universes
- Controls how Condor handles jobs.
- Choices include:
- Vanilla
- Standard
- Grid
- Java
- Parallel
- VM
- Today we'll just deal with the vanilla (simple,
takes your application as is) and the standard
(allows for checkpointing) universes.
9 The job description file
- This is a plain ASCII text file.
- It tells Condor about your job, i.e. which
executable, universe, input, output and error
files to use, command-line arguments, environment
variables, and any special requirements or
preferences.
- It can describe many jobs at once (a "cluster"),
each with different input, arguments, output,
etc.
- Suppose I have an application called a.out
which takes the command-line argument
-a <number> and reads input data from the files
inp1 and inp2. Output files are returned
automatically.
- Furthermore, this application must run on a
32-bit Linux box. Then a suitable job description
file would look like...
10 Simple condor_submit input file

  # Lines beginning with # are comments.
  # NOTE: the words on the left side are not
  # case sensitive, but filenames are!
  Universe                = vanilla
  Executable              = a.out
  Should_transfer_files   = YES
  When_to_transfer_output = ON_EXIT_OR_EVICT
  Transfer_input_files    = inp1, inp2
  Requirements            = OpSys == "LINUX" && Arch == "INTEL"
  Arguments               = -a 0
  Log                     = log.txt
  Input                   = input.txt
  Output                  = output.txt
  Error                   = error.txt
  Notify_user             = mark@cam.ac.uk
  Queue
11 Submitting the job
- If we've created the submit description file
"job", then we can submit it to our Condor pool
with:

  woolly% condor_submit job
  Submitting job(s).
  Logging submit event(s).
  1 job(s) submitted to cluster 684.

- We can keep track of our job with condor_q:

  woolly% condor_q
  -- Submitter: woolly.escience.cam.ac.uk : <172.24.116.7:9683>
  ID     OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
  684.0  mcal00  3/3  14:56   0+00:00:01 R  0   0.0  a.out
12 Submitting LOTS of jobs
- I have 600 jobs to run. Also, I can build 32-
and 64-bit versions of my applications, say
a.out.INTEL and a.out.X86_64.
- First, prepare 600 subdirectories with all
relevant input files (easier for sorting files).
The corresponding output files will go in these
directories.
- Call these directories dir_0, dir_1, ..., dir_599.
- Also, I require 1GB of memory for each job, and
I'd prefer to have the fastest machines.
- If you submit lots of jobs, make sure you have
the available disk space to receive the output!
- My job description file can now look like...
13 Job description file for 600 jobs

  Universe                = vanilla
  Executable              = a.out.$$(Arch)
  Should_transfer_files   = YES
  When_to_transfer_output = ON_EXIT_OR_EVICT
  Transfer_input_files    = inp1, inp2
  Requirements            = Memory > 1000 && OpSys == "LINUX" \
                            && (Arch == "INTEL" || Arch == "X86_64")
  # I'd prefer the fastest processors
  Rank                    = Kflops
  Arguments               = -a $(Process)
  InitialDir              = dir_$(Process)
  Log                     = log.txt
  Input                   = input.txt
  Output                  = output.txt
  Error                   = error.txt
  Queue 600
14 The Standard universe
- The Vanilla universe is easy to get started with,
but has some limitations:
- If the remote machine disappears (host or network
problems) then the job is restarted from scratch.
- If I don't have an account on the execute host,
and the odds are that I won't, then I can't
monitor the output files as they're being
generated (actually, we have a local solution).
- The Standard universe aims to address these
shortcomings. However, we must be able to link
our application's code with Condor's libraries:

  condor_compile gcc -o myjob myjob.c

  or even

  condor_compile make -f MyMakeFile
15 The Standard universe (2)
- Not many compilers are supported, but gcc, g++,
g77 and now gfortran (F95 support!) work.
- However, your job must be well behaved, i.e. it
can't fork, use kernel threads or certain IPC
tools (pipes, shared memory).
- If it passes these tests, and most scientific
codes do, then use "universe = standard" in the
submit script and don't include anything about
file transfer.
- I/O calls on the execute host will now be echoed
back to the submit machine, so you'll see all
output files as they're created.
- Also, Condor will periodically save the state of
your job, so if there's an outage it will restart
your job from the last saved image, and not from
scratch (cf. the Vanilla universe).
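Putting that together, a minimal standard-universe version of the earlier submit file might look like the sketch below (assuming a.out has been relinked with condor_compile); note the absence of any file-transfer commands:

  Universe   = standard
  Executable = a.out
  Arguments  = -a 0
  Log        = log.txt
  Input      = input.txt
  Output     = output.txt
  Error      = error.txt
  Queue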
16 DAGMan: Condor's workflow manager
- Directed Acyclic Graph Manager.
- DAGMan allows you to specify the dependencies
between your Condor jobs, so it can manage them
automatically for you.
- Allows complicated workflows to be built up (you
can even embed DAGs within DAGs).
- E.g., "Don't run job B until job A has
completed successfully."
- Each node is a Condor job.
- A node can have any number of parent or child
nodes, as long as there are no loops!
17 Defining a DAG
- A DAG is defined by a .dag file, listing each of
its nodes and their dependencies:

  # diamond.dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

- Each node will run the Condor job specified by
its accompanying Condor submit file.
- One can also have pre- and post-jobs to run on
the submit machine before or after any node (see
examples).
- To start your DAG, just run condor_submit_dag
with your .dag file, and Condor will start a
personal DAGMan daemon with which to begin
running your jobs:

  woolly% condor_submit_dag diamond.dag
18 DAGMan continued
- DAGMan submits jobs to the Condor queue at the
appropriate times.
- In case of a job failure, DAGMan continues until
it can no longer make progress, and then creates
a "rescue" file with the current state of the
DAG.
- Once the failed job is ready to be re-run, the
rescue file can be used to restore the prior
state of the DAG.
- DAGMan has other useful features:
- nodes can have PRE & POST scripts
- failed nodes can be automatically re-tried a
configurable number of times
- job submission can be throttled to limit the
number of active jobs in the queue.
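As a sketch (the numbers are illustrative), retries are declared per node in the .dag file, while throttling is set when the DAG is submitted:

  # in diamond.dag: re-run node B up to 3 times on failure
  RETRY B 3

  # keep at most 20 jobs from this DAG in the queue at once
  woolly% condor_submit_dag -maxjobs 20 diamond.dag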
19 Some generally useful commands
- condor_status      View pool status
- condor_q           View job queue
- condor_submit      Submit new jobs
- condor_rm          Remove jobs
- condor_history     Completed job info
- condor_submit_dag  Submit new DAG
- condor_checkpoint  Force a checkpoint
- condor_compile     Link against the Condor library
- These commands can take several arguments.
Run them with a -h argument to see the options.
20 What we haven't covered
- Parrot: a user-space file system from the Condor
project.
- Condor-C: Condor's delegated job submission
mechanism.
- Parallel/MPI jobs: Condor's way of dealing with
multi-process jobs, either spanning multiple
machines, or multi-processors/cores on a single
machine, or both.
- Other universes, e.g. Virtual/Java/Grid/Scheduler.
- Checkpointing Vanilla universe jobs: can be
done, but requires the user to do it, e.g. either
by using DAGMan or your own shell script run
recursively.
- Lots of bells and whistles, e.g. submit script
options.
21 CamGrid
- Based around the Condor middleware that you've
just heard about.
- Started in Jan 2005 by five groups (now up to
eleven groups and 13 pools).
- Each group sets up and runs its own pool, and
flocks to/from other pools.
- Hence a decentralised, federated model.
- Strengths:
- No single point of failure
- Sysadmin tasks shared out
- Weaknesses:
- Debugging can be complicated, especially
networking issues.
22 Condor flocking
- Condor attempts to run a submitted job in its
local pool.
- However, queues can be configured to try sending
jobs to other pools: "flocking".
- The user-priority system is flocking-aware:
- A pool's local users can have priority over
remote users flocking in.
- This is how CamGrid works: each group/department
maintains its own pool and flocks with the
others.
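As a sketch (the host names are hypothetical), flocking is configured by naming the other pools' central managers in each pool's Condor configuration:

  # condor_config on our submit machines: where jobs may flock to
  FLOCK_TO   = condor.otherdept.cam.ac.uk
  # condor_config in the remote pool: who may flock in
  FLOCK_FROM = condor.ourdept.cam.ac.uk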
23 Actually, CamGrid currently has 13 pools.
24 Participating departments/groups
- Cambridge eScience Centre
- Dept. of Earth Science (2)
- High Energy Physics
- School of Biological Sciences
- National Institute for Environmental eScience (2)
- Chemical Informatics
- Semiconductors
- Astrophysics
- Dept. of Oncology
- Dept. of Materials Science and Metallurgy
- Biological and Soft Systems
25 CamGrid's vanilla-universe file viewer
- Condor's Vanilla universe is nice and easy to
use, but comes at the cost of no real-time
visibility as output files get generated on
execute machines.
- We have our own web-based solution for CamGrid.
- First, ask me for a password (tell me what
username you submit jobs as, preferably your
CRSid).
- Then, use these details on the form at the bottom
of:
- http://www.escience.cam.ac.uk/projects/camgrid/condor_tools.html
- Uses cookies for session information (1-hour
sessions).
- Has a UK eScience CA certificate for
tempo.escience.cam.ac.uk
28 Some details
- First point of contact for help is your local CO.
- ucam-camgrid-users mailing list.
- Currently have 1,000 cores/processors, mostly
4-core Dell 1950s (8GB memory), like the HPCF.
- Pretty much all Linux, and mostly x86_64.
- We run the latest Condor stable version, currently
7.0.5, but we'll upgrade to 7.2.2 when it appears
(which will provide Standard universe support for
gfortran).
- Can run MPI jobs, but only within some individual
pools, and then preferably as multi-core SMP jobs
on individual machines.
- The Condor manual is a great learning resource,
and we keep an online copy with an added search
facility at:
- http://holbein.escience.cam.ac.uk/condor_manual/
29 That's 808 years, back in the reign of John I,
just before the Magna Carta
30 It's still only March!
56 refereed publications to date (Science, Phys.
Rev. Lett., ...)
31 Links
- CamGrid: www.escience.cam.ac.uk/projects/camgrid/
- Condor: www.cs.wisc.edu/condor/
- Email: mc321@cam.ac.uk
- Questions?
32 Examples URL
- Please point your browsers at:
- http://www.escience.cam.ac.uk/projects/camgrid/workshop/
- CamGrid Vanilla/Parallel file viewer:
- Your password for this session is the same as
your username, but with the "i" changed to "1",
e.g.
- username: trainXX
- password: tra1nXX