Condor Tutorial DoD UGC 2000
Provided by: Miron1
1
Condor Tutorial DoD UGC 2000
2
Introductions
3
Tutorial Outline
  • Overview: What is Condor?
  • Condor Architecture
  • Starting to Use Condor
  • Adding more Resources
  • ClassAds 101
  • More advanced User Jobs

4
Tutorial Outline
  • Overview: What is Condor?
  • What does Condor do?
  • What is Condor good for?
  • What kind of results can I expect?

5
What is Condor?
  • A system for High-Throughput Computing
  • Lots of jobs over a long period of time, not a
    short burst of high-performance
  • Condor manages both resources (machines) and
    resource requests (jobs)

6
What is Condor? (cont'd)
  • Condor has several unique mechanisms
  • ClassAds
  • Transparent checkpoint/restart
  • Remote System Calls

7
What's Condor Good For?
  • Managing a large number of jobs
  • You specify the jobs in a file and submit them to
    Condor, which runs them all and sends you email
    when they complete
  • Mechanisms to help you manage huge numbers of
    jobs (1000s), all the data, etc.
  • Condor can handle inter-job dependencies (DAGMan)

8
What's Condor Good For? (cont'd)
  • Robustness
  • Checkpointing allows guaranteed forward progress
    of your jobs, even jobs that run for weeks before
    completion
  • If an execute machine crashes, you only lose work
    done since the last checkpoint
  • Condor maintains a persistent job queue - if the
    submit machine crashes, Condor will recover

9
What's Condor Good For? (cont'd)
  • Giving your job the agility to access more
    computing resources
  • Checkpointing allows your job to run on
    opportunistic resources (not dedicated)
  • Checkpointing also provides migration - if a
    machine is no longer available, move!
  • With remote system calls, run on systems which do
    not share a filesystem. You don't even need an
    account on a machine where your job executes

10
Condor will also ...
  • implement your policy on when the jobs can run
    on your workstation
  • implement your policy on the execution order of
    the jobs
  • keep a log of your job activities

11
What kind of results can I expect?
  • Statistics from the UW-Madison Condor Pool
  • condor-view\index.html

12
Tutorial Outline
  • Condor Architecture
  • What's a Condor Pool?
  • The daemons and their roles
  • What happens when I submit a job?
  • What happens when my job runs?

13
What is a Condor Pool?
  • Pool can be a single machine, or a group of
    machines
  • Determined by a central manager - the
    matchmaker and centralized information repository
  • Each machine runs various daemons to provide
    different services, either to the users who
    submit jobs, the machine owners, or the pool
    itself

14
In Brief: The Condor Daemons
15
condor_master
  • Starts up all other Condor daemons
  • If there are any problems and a daemon exits, it
    restarts the daemon and sends email to the
    administrator
  • Checks the time stamps on the binaries it is
    configured to spawn, and if new binaries appear,
    the master will gracefully shut down the currently
    running version and start the new version

16
condor_master (cont'd)
  • Provides access to many remote administration
    commands
  • condor_reconfig
  • condor_restart, condor_off, condor_on
  • Default server for many other commands
  • condor_config_val, etc.
  • Periodically runs condor_preen to clean up any
    files Condor might have left on the machine (the
    rest of the daemons clean up after themselves, as
    well)

17
condor_startd
  • Represents a machine to the Condor pool
  • Enforces the wishes of the machine owner (the
    owner's policy)
  • Responsible for starting, suspending, and
    stopping jobs
  • Spawns the appropriate condor_starter, depending
    on the type of job
  • Provides other administrative commands (for
    example, condor_vacate)

18
condor_starter
  • Spawned by the condor_startd to handle all the
    details of starting and managing the job (for
    example, transferring the job's binary to the
    executing machine or sending back exit status)
  • On SMP machines, you get one condor_starter per
    CPU
  • For PVM jobs, the starter also spawns a PVM
    daemon (condor_pvmd)

19
condor_schedd
  • Represents users to the Condor pool
  • Maintains persistent queue of jobs
  • Responsible for contacting available machines and
    spawning waiting jobs
  • Services most user commands
  • condor_submit
  • condor_rm
  • condor_q

20
condor_shadow
  • Represents the job on the submit machine
  • Services requests from standard jobs for
    remote system calls, including all file I/O
  • Is responsible for making decisions on behalf of
    the job (for example, where to store the
    checkpoint file)
  • There will be one condor_shadow process running
    on your submit machine for each currently running
    Condor job

21
condor_shadow (cont'd)
  • The shadow doesn't put much load on your submit
    machine
  • Almost always blocked waiting for requests from
    the job or doing I/O
  • Relatively small memory footprint
  • Still, you can limit the impact of the shadows on
    a given submit machine
  • They can be started by Condor with a nice-level
    that you configure (renice)
  • Can put a limit on the total number of shadows
    running on a machine

22
condor_collector
  • Collects information from all other Condor
    daemons in the pool
  • Each daemon sends a periodic update called a
    ClassAd to the collector
  • Services queries for information
  • Queries from other Condor daemons
  • Queries from users (condor_status)

23
condor_negotiator
  • Performs matchmaking in Condor
  • Gets information from the collector about all
    available machines and all idle jobs
  • Tries to match jobs with machines that will serve
    them
  • Both the job and the machine must satisfy each
    other's requirements (this is called 2-way
    matching)

24
Layout of a Personal Condor
25
Layout of a Condor Pool
ClassAd Communication Pathway
26
What happens when you submit a job to Condor?
  • condor_submit contacts condor_schedd and adds job
    to the job queue
  • condor_schedd sends ClassAd to the
    condor_collector requesting a machine
  • condor_negotiator matches the request with an
    available machine
  • condor_schedd claims the machine and spawns a
    condor_shadow

27
What happens when you submit a job to Condor?
(part 2)
  • condor_shadow contacts condor_startd and requests
    the appropriate condor_starter
  • condor_starter actually spawns the application, and
    connects it to the condor_shadow
  • condor_startd monitors the machine and waits for
    commands
  • Either the application completes, or the
    condor_startd forces it to either suspend or
    vacate

28
Job Startup
(Diagram labels: Submit, Schedd, Shadow, Startd, Starter, Customer Job, Condor Syscall Lib)
29
Tutorial Outline
  • Starting to Use Condor
  • Installation
  • Preparing your job for Condor
  • Submitting your job to Condor
  • Monitoring your jobs

30
Installing Condor
  • Download Condor for your operating system
  • http://www.cs.wisc.edu/condor
  • Stable vs. Developer Releases

31
What Kind of Job Do You Have?
  • You must know some things about your job to
    decide if and how it will work with Condor
  • What kind of I/O does it do?
  • Does it use TCP/IP? (network sockets)
  • Can the job be resumed?
  • About how long do you expect the job to run?
  • Is the job multi-process? (fork(), pvm_addhost(),
    etc.)

32
What Kind of I/O Does Your Job Do?
  • Interactive TTY
  • Batch TTY (just reads from STDIN and writes to
    STDOUT or STDERR, but you can redirect to/from
    files)
  • X Windows
  • NFS, AFS, or another network file system
  • Local file system
  • TCP/IP

33
What Does Condor Support?
  • Condor can support various combinations of these
    features in different Universes
  • Different Universes provide different
    functionality for your job
  • Vanilla
  • Standard
  • Scheduler
  • PVM (and soon MPI)
  • Globus (more on this later)

34
Condor Universes
  • A Universe specifies a Condor runtime
    environment
  • STANDARD
  • Supports Checkpointing
  • Supports Remote System Calls
  • Has some limitations (no fork(), socket(), etc.)
  • VANILLA
  • Any Unix executable (shell scripts, etc)
  • No Condor Checkpointing or Remote I/O

35
Condor Universes (cont'd)
  • PVM (Parallel Virtual Machine) and MPI
  • Allows you to run parallel jobs in Condor (more
    on this later)
  • SCHEDULER
  • Special kind of Condor job: the job is run on the
    submit machine, not a remote execute machine
  • Job is automatically restarted if the
    condor_schedd is shut down
  • Used to schedule jobs (e.g. DAGMan)

36
In Brief: Submitting Jobs to Condor
  • Choosing a Universe for your job (already
    covered this)
  • Preparing your job
  • Making it batch-ready
  • Re-linking if checkpointing and remote system
    calls are desired (condor_compile)
  • Creating a submit description file
  • Running condor_submit
  • Sends your request to the User Agent
    (condor_schedd)

37
Preparing Your Job
  • Making your job batch-ready
  • Must be able to run in the background: no
    interactive input, windows, GUI, etc.
  • Can still use STDIN, STDOUT, and STDERR (the
    keyboard and the screen), but files are used for
    these instead of the actual devices
  • If your job expects input from the keyboard, you
    have to put the input you want into a file

38
Preparing Your Job (cont'd)
  • If you are going to use the standard universe
    with checkpointing and remote system calls, you
    must re-link your job with Condor's special
    libraries
  • To do this, you use condor_compile
  • Place condor_compile in front of the command
    you normally use to link your job

condor_compile gcc -o myjob myjob.c
39
Creating a Submit Description File
  • A plain ASCII text file
  • Tells Condor about your job
  • Which executable, universe, input, output and
    error files to use, command-line arguments,
    environment variables, any special requirements
    or preferences (more on this later)
  • Can describe many jobs at once (a cluster), each
    with different input, arguments, output, etc.

40
Example Submit Description File
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = standard
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Log        = my_job.log
Arguments  = -arg1 -arg2
InitialDir = /home/wright/condor/run_1
Queue
41
Example Submit Description File Described
  • Submits a single job to the standard universe,
    specifies files for STDIN, STDOUT and STDERR,
    creates a UserLog, defines command line arguments,
    and specifies the directory the job should be run
    in
  • Equivalent to (outside of Condor):

cd /home/wright/condor/run_1
/home/wright/condor/my_job.condor -arg1 -arg2 \
    > my_job.stdout 2> my_job.stderr \
    < my_job.stdin
42
Clusters and Processes
  • If your submit file describes multiple jobs, we
    call this a cluster
  • Each job within a cluster is called a process
    or proc
  • If you only specify one job, you still get a
    cluster, but it has only one process
  • A Condor Job ID is the cluster number, a
    period, and the process number (for example, 23.5)
  • Process numbers always start at 0

43
Example Submit Description File for a Cluster
# Example condor_submit input file that defines
# a whole cluster of jobs at once
Universe   = standard
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Log        = my_job.log
Arguments  = -arg1 -arg2
InitialDir = /home/wright/condor/run_$(Process)
Queue 500
44
Example Submit Description File for a Cluster -
Described
  • Now, the initial directory for each job is
    specified with the $(Process) macro, and instead
    of submitting a single job, we use Queue 500 to
    submit 500 jobs at once
  • $(Process) will be expanded to the process number
    for each job in the cluster (from 0 up to 499 in
    this case), so we'll have run_0, run_1, ...,
    run_499 directories
  • All the input/output files will be in different
    directories!
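
The per-process expansion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how condor_submit is implemented; it assumes the macro is written `$(Process)`, and the helper names `expand_process_macro` and `job_ids` and the cluster number 23 are made up for the example.

```python
def expand_process_macro(template: str, count: int) -> list[str]:
    """Return one expanded string per process, numbering from 0."""
    return [template.replace("$(Process)", str(proc)) for proc in range(count)]


def job_ids(cluster: int, count: int) -> list[str]:
    """Build job IDs of the form cluster.process for every process."""
    return [f"{cluster}.{proc}" for proc in range(count)]


# Mirror "Queue 500" with the InitialDir line from the example.
dirs = expand_process_macro("/home/wright/condor/run_$(Process)", 500)
ids = job_ids(23, 500)  # 23 is a hypothetical cluster number
print(dirs[0], dirs[-1], ids[0], ids[-1])
```

Each job gets its own directory, so identical file names such as my_job.stdout never collide.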

45
Running condor_submit
  • You give condor_submit the name of the submit
    file you have created
  • condor_submit parses the file and creates a
    ClassAd that describes your job(s)
  • Creates the files you specified for STDOUT and
    STDERR
  • Sends your job's ClassAd(s) and executable to the
    condor_schedd, which stores the job in its queue
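
To make the "parses the file and creates a ClassAd" step more concrete, here is a rough Python sketch. It is not condor_submit's actual parser: it assumes lines of the form `Name = value`, `#` comments, and a `Queue [N]` statement, and it ignores macros and everything else the real tool handles. The function name `parse_submit_description` is hypothetical.

```python
def parse_submit_description(text: str) -> list[dict[str, str]]:
    """Turn 'Name = value' lines into one dict (a toy job ad) per queued job."""
    settings: dict[str, str] = {}
    jobs: list[dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        if words[0].lower() == "queue":
            count = int(words[1]) if len(words) > 1 else 1
            for _ in range(count):
                ad = dict(settings)           # snapshot of the settings so far
                ad["process"] = str(len(jobs))  # process numbers start at 0
                jobs.append(ad)
            continue
        name, _, value = line.partition("=")
        settings[name.strip().lower()] = value.strip()  # names are case-insensitive
    return jobs


example = """\
# a tiny submit description
Universe   = standard
Executable = /home/wright/condor/my_job.condor
Arguments  = -arg1 -arg2
Queue 3
"""
jobs = parse_submit_description(example)
print(len(jobs), jobs[0]["universe"], jobs[2]["process"])
```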

46
In Brief: Monitoring Your Jobs
  • Using condor_q
  • Using a User Log file
  • Using condor_status
  • Using condor_rm
  • Getting email from Condor
  • Once they complete, you can use condor_history to
    examine them

47
Using condor_q
  • To view the jobs you have submitted, you use
    condor_q
  • Displays the status of your job, how much compute
    time it has accumulated, etc.
  • Many different options
  • A single job, a single cluster, all jobs that
    match a certain constraint, or all jobs
  • Can view remote job queues (either individual
    queues, or -global)

48
Using a User Log file
  • A UserLog must be specified in your submit file
  • Log = filename
  • You get a log entry for everything that happens
    to your job
  • When it was submitted, when it starts executing,
    if it is checkpointed or vacated, if there are
    any problems, etc.
  • Very useful! Highly recommended!

49
Uses for the User Log
  • Event triggers for meta-schedulers
  • Like DAGMan
  • Visualize job progress
  • Condor UserLog Viewer

50
Using condor_status
  • To view the status of the whole Condor pool, you
    use condor_status
  • Can use the -run option to see which machines
    are running jobs, as well as
  • The user who submitted each job
  • The machine they submitted from
  • Can also view the status of various submitters
    with -submitter <name>

51
Using condor_rm
  • If you want to remove a job from the Condor
    queue, you use condor_rm
  • You can only remove jobs that you own (you can't
    run condor_rm on someone else's jobs unless you
    are root)
  • You can give specific job IDs (cluster or
    cluster.proc), or you can remove all of your jobs
    with the -a option.

52
Getting Email from Condor
  • By default, Condor will send you email when your
    job completes
  • If you don't want this email, put this in your
    submit file
  • notification = never
  • If you want email every time something happens to
    your job (checkpoint, exit, etc), use this
  • notification = always

53
Getting Email from Condor (cont'd)
  • If you only want email if your job exits with an
    error, use this
  • notification = error
  • By default, the email is sent to your account on
    the host you submitted from. If you want the
    email to go to a different address, use this
  • notify_user = email@address.here

54
Using condor_history
  • Once your job completes, it will no longer show
    up in condor_q
  • Now, you must use condor_history to view the
    job's ClassAd
  • The status field (ST) will have either a C
    for completed, or an X if the job was removed
    with condor_rm

55
Any questions?
  • Nothing is too basic
  • If I was unclear, you probably are not the only
    person who doesn't understand, and the rest of
    the day will be even more confusing

56
Tutorial Outline
  • Adding more Resources
  • Personal Condor
  • Build a pool
  • Flocking: Linking pools together
  • Globus Universe
  • Opening the window to the Grid
  • Glide-In

57
What is Personal Condor?
  • Condor re-focused and re-packaged to emphasize
  • Use by an individual
  • without scores of machines local to their
    organization
  • without sysadmin experience
  • without sysadmin authority
  • Some Specific Condor Mechanisms
  • Flocking
  • Globus Job Universe
  • GlideIn

58
A Pool of One
  • Typical Condor Installation
  • Typical Personal Condor Installation

59
Personal Condor?! What's the benefit of a Condor
Pool with just one user and one machine?
60
Your Personal Condor will ...
  • keep an eye on your jobs and will keep you
    posted on their progress
  • implement your policy on when the jobs can run
    on your workstation
  • implement your policy on the execution order of
    the jobs
  • add fault tolerance to your jobs
  • keep a log of your job activities

62
There's more! Build a Personal Cluster
  • Install Condor on any other machines you have
  • Point them at your Personal Condor installation
    machine

64
There's more! Take advantage of your friends
  • Get permission from friendly Condor pools to
    access their resources
  • Configure your Personal Condor to flock to
    these pools

66
How Flocking Works
  • Put in condor_config
  • FLOCK_HOSTS = Pool-Foo, Pool-Bar

(Diagram labels: Collector, Negotiator, Submit Machine, Schedd, Central Manager (CONDOR_HOST), Pool-Foo Central Manager, Pool-Bar Central Manager)
67
Flocking Pros & Cons
  • Pros
  • Flock hosts are tried in the order specified
    until jobs are satisfied
  • Property of the Schedd, not the CM
  • Different users can Flock to different pools
  • User priority system is flocking-aware
  • Cons
  • Not yet well-integrated with the tools
  • Job Rank does not work across pools

68
There's More! Access the National Technology Grid
  • National Technology Grid
  • Collaborative effort underway by many Government
    labs, National Science Foundation, and several
    Universities
  • Centralized virtual machine room for account
    creation, etc.
  • Globus Toolkit
  • Grid Certificate (X.509)
  • Globus Gatekeepers

69
Submit a job in Condor to run on Globus
  • In submit description file given to
    condor_submit, specify
  • Universe = Globus
  • Which Globus Gatekeeper to use
  • Location of file containing your Globus
    certificate

70
Condor Submit Machine
Schedd
71
Condor's Globus Universe Pros/Cons
  • Pros
  • Persistent queue for jobs destined to run on
    Globus resources
  • Jobs submitted to run on Globus resources are
    managed
  • details on progress, order of execution, DAGMan,
  • Cons
  • Must specify a particular gateway
  • What about Standard Universe features?
  • Checkpoint/Restart, Remote system calls, etc.

72
One Solution: Condor GlideIn
  • Submit your jobs as regular Condor jobs
    (Standard, Vanilla, or PVM Universe)
  • Expand your Personal Condor pool by submitting
    GlideIn Jobs
  • A Globus Universe job which consists of the
    Condor Daemons (master, startd, starter).
  • When your GlideIn Jobs run on the Globus
    resource, that resource will join your Personal
    Condor Pool!
  • Then your regular Condor jobs get matched and run
    on Globus resources as usual.

74
Some Problems and some Solutions
  • What if all your jobs completed before a given
    GlideIn job starts running?
  • Solution: If the Condor daemons which have
    glided-in are not matched with a job in 10
    minutes, they terminate.

75
Some Problems and some Solutions, cont.
  • My Personal Condor is flocking with a bunch of
    Solaris machines, and also doing a GlideIn to a
    Silicon Graphics O2K. I do not want to
    statically partition my jobs.

  • Solution: In your submit file, say
    Executable = myjob.$$(OpSys).$$(Arch)
  • The $$(xxx) notation is replaced with attributes
    from the machine ClassAd which was matched with
    your job.
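
As a rough illustration of that substitution, the following Python sketch fills placeholders from a matched machine ad. It assumes the placeholder syntax is `$$(Name)`; the function `substitute_machine_attrs` and the sample attribute values for the second machine are hypothetical, and this is not Condor's own code.

```python
import re


def substitute_machine_attrs(template: str, machine_ad: dict) -> str:
    """Replace each $$(Name) with the matched machine ad's value for Name."""
    def lookup(match: re.Match) -> str:
        return str(machine_ad[match.group(1)])

    return re.sub(r"\$\$\((\w+)\)", lookup, template)


# Two machine ads that the same job might be matched with.
intel_solaris = {"OpSys": "SOLARIS26", "Arch": "INTEL"}
other_machine = {"OpSys": "IRIX65", "Arch": "SGI"}  # illustrative values only

exe_a = substitute_machine_attrs("myjob.$$(OpSys).$$(Arch)", intel_solaris)
exe_b = substitute_machine_attrs("myjob.$$(OpSys).$$(Arch)", other_machine)
print(exe_a, exe_b)
```

One submit file can therefore serve several platforms, as long as a matching binary exists for each name the substitution produces.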
76
In Review
  • With Personal Condor you can
  • manage your compute job workload
  • access local machines
  • access remote Condor Pools via flocking
  • access remote compute resources on the Grid via
    Globus Universe
  • carve out your own personal Condor Pool from
    the Grid with GlideIn technology.

77
Current Status
  • Initial Personal Condor implementation exists
  • Includes Flocking, Globus Uni., GlideIn
  • Powered several demonstrations at SC99
    (SuperComputing 1999 Conference)
  • In use by some collaborators

78
Current Status, cont.
  • Work currently in progress
  • Enhance robustness, scalability
  • Add features missing from Condor's Globus
    Universe
  • Streamline the process for the user
  • Personal Condor distribution
  • Simplified Installation Procedure
  • Grid-Access Only mode

79
Tutorial Outline
  • ClassAds 101
  • What are Classified Advertisements?
  • ClassAd Matching
  • ClassAds in Condor
  • Requirements expression
  • Rank expression
  • Undefined Values

80
In Brief: Classified Advertisements
  • ClassAds
  • Language for expressing attributes
  • Semantics for evaluating them
  • Intuitively, a ClassAd is a set of named
    expressions
  • Each named expression is an attribute
  • Expressions are similar to C
  • Constants, attribute references, operators

81
Classified Advertisements Example
  • MyType = "Machine"
  • TargetType = "Job"
  • Name = "froth.cs.wisc.edu"
  • StartdIpAddr = "<128.105.73.44:33846>"
  • Arch = "INTEL"
  • OpSys = "SOLARIS26"
  • VirtualMemory = 225312
  • Disk = 35957
  • KFlops = 21058
  • Mips = 103
  • LoadAvg = 0.011719
  • KeyboardIdle = 12
  • Cpus = 1
  • Memory = 128
  • Requirements = LoadAvg < 0.300000 &&
    KeyboardIdle > 15 * 60
  • Rank = 0

82
Classified Advertisements Matching
  • ClassAds are always considered in pairs
  • Does ClassAd A match ClassAd B (and vice versa)?
  • This is called 2-way matching
  • If the same attribute appears in both ClassAds,
    you can specify which attribute you mean by
    putting MY. or TARGET. in front of the
    attribute name

83
Classified Advertisements Examples
  • ClassAd B
  • MyType = "ApartmentRenter"
  • TargetType = "Apartment"
  • UnderGrad = False
  • RentOffer = 900
  • Rank = 1/(TARGET.RentOffer + 100.0) +
    50*HeatIncluded
  • Requirements = OnBusLine &&
    SquareArea > 2700
  • ClassAd A
  • MyType = "Apartment"
  • TargetType = "ApartmentRenter"
  • SquareArea = 3500
  • RentOffer = 1000
  • HeatIncluded = False
  • OnBusLine = True
  • Rank = UnderGrad==False +
    TARGET.RentOffer
  • Requirements = MY.RentOffer -
    TARGET.RentOffer < 150
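
The two-way check on these two ads can be walked through with a small Python sketch. It is a toy model rather than the ClassAd evaluator: ads are plain dicts, each Requirements expression is a Python function of (my, target), and it assumes the renter's requirement is "OnBusLine and SquareArea greater than 2700" and the apartment's is "MY.RentOffer minus TARGET.RentOffer less than 150". The name `two_way_match` is made up.

```python
apartment = {  # ClassAd A
    "SquareArea": 3500,
    "RentOffer": 1000,
    "HeatIncluded": False,
    "OnBusLine": True,
    # MY.RentOffer - TARGET.RentOffer < 150
    "Requirements": lambda my, target: my["RentOffer"] - target["RentOffer"] < 150,
}

renter = {  # ClassAd B
    "UnderGrad": False,
    "RentOffer": 900,
    # OnBusLine and SquareArea come from the other ad (the apartment)
    "Requirements": lambda my, target: target["OnBusLine"] and target["SquareArea"] > 2700,
}


def two_way_match(a: dict, b: dict) -> bool:
    """Both ads must accept each other for the pair to match."""
    return bool(a["Requirements"](a, b)) and bool(b["Requirements"](b, a))


# A second renter who offers too little for this apartment.
low_offer_renter = dict(renter, RentOffer=800)

print(two_way_match(apartment, renter))
print(two_way_match(apartment, low_offer_renter))
```

With the values above, the renter's offer is 100 below the asking rent, which is inside the 150 tolerance, and the apartment satisfies the renter's size and bus-line conditions, so both sides accept.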

84
ClassAds in the Condor System
  • ClassAds allow Condor to be a general system
  • Constraints and ranks on matches expressed by the
    entities themselves
  • Only priority logic integrated into the
    Match-Maker
  • All principal entities in the Condor system are
    represented by ClassAds
  • Machines, Jobs, Submitters

85
Example ClassAd for a Machine
  • Friends = Owner == "tannenba" || Owner == "wright"
  • Family = Owner == "aunt" || Owner == "uncle"
  • Enemies = Owner == "rival" || Owner == "riffraff"
  • Requirements = (!Enemies) &&
    (Family || (LoadAvg < 0.3 && KeyboardIdle > 15*60))
  • Rank = Friends + Family*10

86
ClassAd Requirements Described
  • Machine will never start a job submitted by
    rival or riffraff
  • If someone from Family (aunt or uncle)
    submits a job, it will always run, regardless of
    keyboard activity or load average
  • If anyone else submits a job, it will only run
    here if the keyboard has been idle for more than
    15 minutes and the load average is less than 0.3

87
ClassAd Rank Described
  • If the machine is running a job submitted by
    owner smith, it will give this a Rank of 0,
    since smith is neither Friend nor Family.
  • If wright or tannenba submits a job, it will
    be ranked at 1 (since Friend will evaluate to 1
    and Family is 0)
  • If aunt or uncle submit a job, it will have a
    rank of 10
  • A job owned by a Friend would be preempted for a
    job owned by Family.
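
The machine policy described on the last two slides can be restated as a short Python sketch. It only mirrors the described behavior (who may run, and at what rank); it is not Condor code, and the function name `machine_policy` and its argument names are invented for the example.

```python
def machine_policy(owner: str, load_avg: float, keyboard_idle_secs: int):
    """Return (requirements, rank) for a job submitted by `owner`."""
    friends = owner in ("tannenba", "wright")
    family = owner in ("aunt", "uncle")
    enemies = owner in ("rival", "riffraff")

    # Enemies never run; Family always runs; everyone else needs
    # a quiet machine (low load, keyboard idle for over 15 minutes).
    machine_is_quiet = load_avg < 0.3 and keyboard_idle_secs > 15 * 60
    requirements = (not enemies) and (family or machine_is_quiet)

    # Friends count 1, Family counts 10, anyone else 0.
    rank = int(friends) + int(family) * 10
    return requirements, rank


for who in ("rival", "aunt", "wright", "smith"):
    print(who, machine_policy(who, load_avg=0.1, keyboard_idle_secs=20 * 60))
```

Because Family ranks higher than Friends, a running job from a Friend is the one that would give way when a Family job arrives.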

88
Example ClassAd for a Job
  • Requirements =
    Arch == "INTEL" && OpSys == "LINUX" && Memory >= 20
  • Rank =
    (Memory > 32) *
    ( (Memory * 100) + (IsDedicated * 10000) + Mips )

89
ClassAd Rank and Requirements Described
  • The job must run on an Intel CPU, running Linux,
    with at least 20 megs of RAM
  • All machines with 32 megs of RAM or less are
    Ranked at 0
  • Machines with more than 32 megs of RAM are ranked
    according to how much RAM they have, if the
    machine is dedicated (which counts a lot to this
    job!), and how fast the machine is, as measured
    in Million Instructions Per Second
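
To see how such a Rank orders machines, here is a small Python sketch of the job's side. It assumes the weights 100 per megabyte of RAM and 10000 for a dedicated machine, plus the Mips value, and treats "at least 20 megs" as greater than or equal to 20. The function names and the sample machines are hypothetical.

```python
def job_requirements(machine: dict) -> bool:
    """Intel CPU, Linux, and at least 20 MB of RAM."""
    return (machine["Arch"] == "INTEL"
            and machine["OpSys"] == "LINUX"
            and machine["Memory"] >= 20)


def job_rank(machine: dict) -> int:
    """Zero for 32 MB or less; otherwise weight RAM, dedication, and speed."""
    has_enough_ram = int(machine["Memory"] > 32)
    score = (machine["Memory"] * 100
             + int(machine["IsDedicated"]) * 10000
             + machine["Mips"])
    return has_enough_ram * score


small = {"Arch": "INTEL", "OpSys": "LINUX", "Memory": 32, "IsDedicated": True, "Mips": 103}
desktop = {"Arch": "INTEL", "OpSys": "LINUX", "Memory": 64, "IsDedicated": False, "Mips": 103}
dedicated = dict(desktop, IsDedicated=True)

# Prefer the acceptable machine with the highest rank.
candidates = [m for m in (small, desktop, dedicated) if job_requirements(m)]
best = max(candidates, key=job_rank)
print(job_rank(small), job_rank(desktop), job_rank(dedicated))
```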

90
Finding and Using the ClassAd Attributes in your
Pool
  • Condor defines a number of attributes we haven't
    mentioned.
  • You can see all the attributes for a machine
    with
  • condor_status -long <hostname>
  • You can see all the attributes for a job with
  • condor_q -long <job-number>

91
Customized ClassAd Attributes
  • Custom attributes can be added to either Machine
    or Job Ads
  • These attributes can then be used in Requirements
    and/or Rank expressions
  • Useful for steering to specific resources

92
Undefined Values
  • Suppose that a Job requires
  • Requirements = (Color != "Red")
  • Color = "Blue" -> MATCH
  • Color = "Red" -> no match
  • Color is undefined -> no match
  • Avoid this behavior with =!=
  • Requirements = (Color =!= "Red")
  • Color = "Blue" -> MATCH
  • Color = "Red" -> no match
  • Color is undefined -> MATCH
  • (The same holds for == and =?=)
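
The difference between the two comparisons can be modeled with three-valued logic in Python. This is a sketch under assumptions, not the ClassAd implementation: it assumes the operators in question are `!=` and the meta-operator `=!=`, uses `None` to stand for an undefined value, and treats a match as requiring the expression to be exactly True. The function names are invented.

```python
UNDEFINED = None  # stand-in for a missing attribute


def not_equal(a, b):
    """Model of `!=`: the result is undefined if either side is undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b


def meta_not_equal(a, b):
    """Model of `=!=`: always yields True or False, never undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        # Different unless both sides are undefined.
        return not (a is UNDEFINED and b is UNDEFINED)
    return a != b


def is_match(requirements_value) -> bool:
    """Only an expression that evaluates to True counts as a match."""
    return requirements_value is True


blue, red, no_color = {"Color": "Blue"}, {"Color": "Red"}, {}

for ad in (blue, red, no_color):
    color = ad.get("Color", UNDEFINED)
    print(ad, is_match(not_equal(color, "Red")), is_match(meta_not_equal(color, "Red")))
```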

93
Tutorial Outline
  • More advanced User Jobs
  • DAGMan
  • Opportunistic PVM
  • MW

94
Thank you!
  • Check us out on the Web
  • http://www.cs.wisc.edu/condor
  • Email
  • condor-admin@cs.wisc.edu