Title: What
1What's new in Condor? Condor Week 2006
2So Todd, where is v6.8? Well, v6.7 has been a challenge
3inint
5Around since the 80s
80s Mullet Boy
7 100 people surveyed! Favorite ility?
Deployability!
9Existing Ports
- Digital UNIX 4.0 on Alpha
- AIX 5.2 ("clipped") on PowerPC
- Tru64 5.1 ("clipped") on Alpha
- HP UNIX 10.20 on PA-RISC
- HP UNIX 11.00 ("clipped", using hpux10.20 32-bit) on PA-RISC
- Irix 6.5 ("clipped") on SGI
- Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 ("clipped") on Alpha
- Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 on Intel x86
- Linux 2.4.x (glibc 2.2) - Red Hat 8 on Intel x86
- Linux 2.4.x (glibc 2.3) - Red Hat 9 on Intel x86
- Enterprise Server 8.1 on Intel Itanium
- Solaris 8 on Sparc
- Solaris 9 on Sparc
- Microsoft Windows 2000 or XP ("clipped") on Intel x86
CondorWeek 2005
10New Ports
- Introduced in v6.6.x
  - MacOSX ("clipped") on PowerPC
  - Debian Linux 3.1 on Intel x86
  - Fedora Core 1 on Intel x86
  - Red Hat Enterprise Linux 3 on Intel x86
  - SuSE Linux Enterprise Server 8.1 on Intel Itanium
- Introduced in v6.7.x
  - AIX 5.1 ("clipped") on PowerPC
  - Fedora Core 2 on x86
  - Fedora Core 3 on x86
  - SuSE 8.0 ("clipped") on AMD64
  - Solaris 10 ("clipped") on Sparc
  - Scientific Linux (Release 303) on x86
- Still to be introduced in v6.7.x (before v6.8.0)
  - HPUX 11i 64-bit on PA-RISC
  - RHEL 4 on x86
  - native 64-bit AMD Linux
Sigh
Psilord: the Condor porting doctor. Talk to him in person tomorrow.
11Porting Table
- See
  - http://www.cs.wisc.edu/condor/porting/port_table.html
- Highlights
  - Almost every 32-bit Linux flavor as full
  - Every other Unix, MacOS and Windows available as clipped
  - Solaris 10 and HP-UX 11.x now clipped
  - FreeBSD 4 contribution from Yahoo!, added 5 and 6
  - x86_64 Linux full running in the lab
12Backfill Jobs
- Execute machines will run a locally staged executable when otherwise idle.
- Currently designed for BOINC.
  # Turn on backfill functionality, and use BOINC
  ENABLE_BACKFILL = TRUE
  BACKFILL_SYSTEM = BOINC
  # Spawn a backfill job if we've been Unclaimed for more than 5 minutes
  START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
  # Evict a backfill job if the machine is busy (based on keyboard activity or cpu load)
  EVICT_BACKFILL = $(MachineBusy)
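The backfill policy on this slide boils down to two boolean tests against machine state. A minimal Python sketch of that reading follows; `MachineState`, `should_start_backfill` and `should_evict_backfill` are hypothetical names used only for illustration and are not part of Condor, and the meaning given to the state timer is an assumption.

```python
from dataclasses import dataclass

MINUTE = 60  # seconds; stands in for the MINUTE macro used in the config


@dataclass
class MachineState:
    state_timer: int    # assumed: seconds spent in the current (Unclaimed) state
    machine_busy: bool  # assumed: True on keyboard activity or high CPU load


def should_start_backfill(m: MachineState) -> bool:
    # Mirrors the START_BACKFILL test: Unclaimed for more than 5 minutes.
    return m.state_timer > 5 * MINUTE


def should_evict_backfill(m: MachineState) -> bool:
    # Mirrors the EVICT_BACKFILL test: evict as soon as the machine is busy.
    return m.machine_busy
```

For example, a machine that has been idle for 301 seconds would start a backfill job, and the job would be evicted the moment `machine_busy` turns true.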
13Joining Condor's Einstein@Home Compute Team
- If you're running BOINC backfill jobs in Condor and want to use your cycles to help another UW project, please join the Einstein@Home computation
- Join the Condor Backfill team
  - http://einstein.phys.uwm.edu/team_display.php?teamid=5994
  - http://einstein.phys.uwm.edu/create_account_form.php?teamid=5994
14More deployability
- Personal Condor Support on Win32
- LocalSystem not required
- MSI installer on Win32 (thanks Micron!)
- New tools
- Safe, dynamic Condor service deployment.
- More info @ Research BOF 9am Rm219
- condor_cold_start and
- condor_cold_stop
15 100 people surveyed! Favorite ility?
Availability!
17 Condor with Firewalls and NATs: GCB in v6.8.0!
[Diagram labels: Server app (listen, accept), Client app (connect), a GCB layer over TCP/IP on each side, Relay point]
18Job Progress continues if connection is interrupted
- Now for Vanilla, Java, and Grid universe jobs, Condor supports reestablishment of the connection between the submitting and executing machines.
  - If there is a network outage between the execute and submit machines
  - If the submit machine restarts
  - Grid Universe was tricky
- To take advantage of this feature, put the following line into the job's submit description file:
  - JobLeaseDuration = <N seconds>
- For example:
  - job_lease_duration = 1200
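One way to picture a job lease: the execute side keeps the job running as long as the lease has been renewed recently enough, and only gives up once the lease has gone unrenewed for longer than its duration. The slides do not describe how Condor implements this, so the `JobLease` class below is a hypothetical sketch of the general idea, not Condor's code.

```python
from dataclasses import dataclass


@dataclass
class JobLease:
    duration: int        # lease length in seconds, e.g. 1200 as in the example
    last_renewal: float  # timestamp of the last successful contact

    def renew(self, now: float) -> None:
        # Called whenever the submit side re-establishes contact.
        self.last_renewal = now

    def expired(self, now: float) -> bool:
        # The job may keep running until the lease has lapsed.
        return now - self.last_renewal > self.duration
```

With a 1200-second lease, an outage shorter than 20 minutes leaves the lease valid, so the job's progress is not thrown away.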
19Job Progress continues if submit machine fails
- Condor can now support a submit machine "hot spare" (schedd failover)
- If your submit machine A is down for longer than N minutes, a second machine B can take over
- Requires shared filesystem between machines A and B
20Central Manager Failover
- Condor Central Manager has two services
- condor_collector
- Now a list of collectors is supported
- condor_negotiator (matchmaker)
- If it fails, an election process picks another to take over
- Accounting state is periodically replicated
- Contributed technology from Technion
21Reliability, cont.
- Time shifts
- Quill
- Closing windows of vulnerability
22 100 people surveyed! Favorite ility?
Lightweight? X
25 100 people surveyed! Favorite ility?
Functionality!
27Security
- Common Authentication Methods between Condor on Unix and Win32
  - Kerberos 1.4
    - Additional hopeful benefit: Authentication against MS Active Directory!
  - SSL
  - Password (shared secret)
- Starter only runs known executables
- More powerful, unified map file(s)
- GSI credentials delegated
28With Condor on Win32, it would be nice if
- My jobs could access my files just like the condor_shadow can
- I didn't have to tie my execute machines to a single account
- I didn't have to run condor_store_cred from every machine where my credential is needed
- (thank you Optena)
29The Windows CredD
- A centralized repository for user passwords
[Diagram labels: credd holding stored passwords (myp4sswd, y0urs); "store password" arrow carrying <password>]
- C:\> condor_store_cred add
- Account: gquinn@CROW
- Enter password:
- Operation succeeded.
30The Windows CredD
[Diagram labels: schedd, shadow, stored passwords (myp4sswd, y0urs); "fetch password" arrow carrying <password>]
Submit machines can use the CredD to impersonate
the user in the shadow
31The Windows CredD
[Diagram labels: starter, condor_exec.exe, stored passwords (myp4sswd, y0urs); "fetch password" arrow carrying <password>]
Execute machines can use the CredD to run jobs as
the submitting user!
32Running Jobs as Submitting User
- In submit file:
  Run_job_as_owner = true
- In config file on submit and execute nodes:
  CREDD_HOST = vault.cs.wisc.edu
  STARTER_ALLOW_RUNAS_OWNER = True
  CREDD_CACHE_LOCALLY = True
33Some Condor APIs
- Command Line tools
- condor_submit, condor_q, etc
- -format, -constraint, -xml
- Condor Perl Module
- Chirp
- Checkpoint Library API
- MW --- improved!
- DRMAA (Works w/ Win32, on SourceForge)
- Condor Grid ASCII Protocol (GAHP)
- Web Service Interface
34DRMAA
- Distributed Resource Management Application API (DRMAA)
- GGF Working Group
- An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems
- An API with C and Java bindings
  - not a protocol
- Scope
  - Does: job submission, monitoring, control, final status
  - Does not: file staging, reservations, security,
35Condor GAHP
- The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout
- Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events
36GAHP, cont
- Example
- R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
- S: GRAM_PING 100 vulture.cs.wisc.edu/fork
- R: E
- S: RESULTS
- R: E
- S: COMMANDS
- R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
- S: VERSION
- R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
- S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
- R: S
- S: GRAM_PING 100 vulture.cs.wisc.edu/fork
- R: S
- S: RESULTS
- R: S 0
- S: RESULTS
- R: S 1
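The backslash-escaped spaces in the version string of the example (NCSA\ CoG\ Gahpd) suggest that a space separates fields and a backslash protects a literal space. A small Python sketch of a line splitter under that assumption follows; `split_gahp_line` is a hypothetical helper, not part of any Condor or Globus library.

```python
from typing import List


def split_gahp_line(line: str) -> List[str]:
    """Split one ASCII protocol line into fields.

    Assumed rules: fields are separated by single spaces, and a backslash
    makes the next character (typically a space) part of the current field.
    """
    fields: List[str] = []
    current: List[str] = []
    escaped = False
    for ch in line.rstrip("\r\n"):
        if escaped:
            current.append(ch)   # keep the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True       # next character is taken literally
        elif ch == " ":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields
```

Applied to a request such as `GRAM_PING 100 vulture.cs.wisc.edu/fork`, this yields the command name, the request ID and the resource contact as three separate fields.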
37Web Service Interfaces
- SOAP over http or https to the Condor daemons
- Use any language or platform (where you can find
a decent SOAP library)
- Functionality Exposed
- in current release
- Submit jobs
- Retrieve job output
- Remove/hold/release jobs
- Query machine status (fetch ads from collector)
- Query job status (fetch ads from the schedd)
38Getting machine status via SOAP (in Java with Axis)
  locator = new CondorCollectorLocator();
  collector = locator.getcondorCollector(new URL("http://machine:port"));
  ads = collector.queryStartdAds("Memory>512");
Because we give you WSDL information, you don't have to write any of these functions.
39More Functionality changes..
- FINALLY, clean/consistent cross-platform quoting rules for arguments and environment variables (see condor_submit man page)
- Schedd can run HawkEye modules, just like the Startd
  - Enables monitoring on the submit machine
- condor_history now faster than a snail, and cleans up droppings.
- DeferralTime, DeferralWindow
  - Coordinated starts
- BIND_ALL_INTERFACES in config file
- WANT_REMOTE_IO in job ClassAd
40ClassAd Functions in Condor!
- Conditionals
  - IfThenElse(condition,then,else)
- String functions
  - Strcat(), strcmp(), toUpper(), etc.
- StringList functions
  - Example of a string list (CSV style)
  - Mylist = "Joe, Jon, Jeff, Jim, Jake"
  - StrListContains(), StrListAppend(), StrListRemove(), etc.
- Others
  - Regular expressions, arithmetic, etc.
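To make the CSV-style string list concrete, here is a Python stand-in for the three StringList operations named on the slide. These are illustrative analogues only: the function names are hypothetical, and details such as case sensitivity or whitespace handling in the real ClassAd functions are not given in the slides and are assumed here.

```python
from typing import List


def parse_string_list(s: str) -> List[str]:
    # Assumed: items are comma-separated and surrounding whitespace is ignored.
    return [item.strip() for item in s.split(",") if item.strip()]


def str_list_contains(s: str, item: str) -> bool:
    return item in parse_string_list(s)


def str_list_append(s: str, item: str) -> str:
    return ", ".join(parse_string_list(s) + [item])


def str_list_remove(s: str, item: str) -> str:
    return ", ".join(x for x in parse_string_list(s) if x != item)
```

With the list from the slide, `str_list_contains("Joe, Jon, Jeff, Jim, Jake", "Jeff")` is true, and removing "Jon" leaves "Joe, Jeff, Jim, Jake".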
41Accounting Groups and Group Quota Support
- Account Group (w/ CORE Feature Animation)
- Account Group Quota (inspiration: CDF @ Fermi)
- Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them
- Could use Machine Rank
  - but this ties to specific machines
- Or could use new group support
  - Each group can be given a quota in config file
  - Job ads can specify group membership
  - Group quotas are satisfied first
  - Accounting by user and by group
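The "group quotas are satisfied first" rule can be illustrated with the 500-node sample problem. The sketch below is a deliberately simplified, hypothetical allocator (`allocate` is not a Condor function): it gives each group up to its quota first, then hands leftover nodes out in plain dictionary order, whereas the real negotiator would presumably apply user priorities to the remainder.

```python
from typing import Dict


def allocate(total_nodes: int,
             quotas: Dict[str, int],
             demand: Dict[str, int]) -> Dict[str, int]:
    alloc: Dict[str, int] = {}
    free = total_nodes
    # Pass 1: every group receives up to its quota (or its demand, if smaller).
    for group, want in demand.items():
        got = min(want, quotas.get(group, 0), free)
        alloc[group] = got
        free -= got
    # Pass 2: leftover nodes go to groups that still want more (simplified order).
    for group, want in demand.items():
        extra = min(want - alloc[group], free)
        alloc[group] += extra
        free -= extra
    return alloc
```

If Physics asks for 600 nodes and Chemistry for 150 with a Chemistry quota of 100, Chemistry is guaranteed its 100 nodes regardless of which machines they are, and Physics gets the remaining 400.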
42 100 people surveyed! Favorite ility?
Universability!
44Grid Universe
- With new Grid Universe, always specify a gridtype. So the old "globus" Universe is now declared as:
  - universe = grid
  - gridtype = gt2
- Other gridtypes?
- GT2 (Globus Toolkit 2)
- GT3 (Globus Toolkit 3.2)
- GT4 (Globus Toolkit 3.9.5)
- UNICORE
- Nordugrid
- PBS (OpenPBS, PBSPro technology from INFN)
- LSF (Platform LSF technology from INFN)
- CONDOR (thanks gLite!)
Condor-G
Condor-C
45Other Grid Universe improvements
- Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI
  - http://grid.ncsa.uiuc.edu/myproxy
  - (both GT2 and GT4)
- GT4: we start a GridFTP server behind the scenes
  - GridFTP server bundled w/ Condor nowadays
- Some functionality present in Condor-G added to Condor-C
  - Forwarding of refreshed credentials (EGEE)
  - GSI authentication support
- Cleaner ClassAd representation (URL)
46Parallel Universe
- Replaces the MPI universe
- Allows running arbitrary programs that need to gang-schedule multiple machines
  - MPICH, LAM,
  - FT-MPICH (Seoul National Univ)
- Great for testing environments
47Hey Jobs! We're watching you!
- Local Universe
  - Just like Scheduler Universe, but there is a condor_starter
  - All advantages of the starter
[Diagram labels: Submit side (schedd, starter, job); Execute side (startd, starter, job)]
Hey, job, behave or else!
48 100 people surveyed! Favorite ility?
Scalability!
50Faster Negotiation
- SIGNIFICANT_ATTRIBUTES determined automatically
  - Job attributes AutoClusterId and AutoClusterAttributes
  - Rounding of Attributes
- Schedd uses non-blocking TCP connects to the startd
- Negotiator caching
- Collector Forks for queries
- More coming
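The slides do not spell out how automatic clustering speeds up negotiation. One plausible reading is that jobs whose significant attributes are identical can be treated as one group, so matchmaking work is done once per group rather than once per job. The Python sketch below illustrates only that grouping idea; `auto_cluster` and the attribute names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple


def auto_cluster(jobs: Sequence[Dict[str, Any]],
                 significant_attributes: Sequence[str]) -> Dict[Tuple, List[int]]:
    """Group job IDs by the values of their significant attributes."""
    clusters: Dict[Tuple, List[int]] = defaultdict(list)
    for job in jobs:
        # Jobs with the same key look identical to the matchmaker (assumed).
        key = tuple(job.get(attr) for attr in significant_attributes)
        clusters[key].append(job["id"])
    return dict(clusters)
```

Under this reading, rounding attribute values before grouping would make more jobs share a key, which is presumably why "Rounding of Attributes" appears on the same slide.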
51Scalability, cont.
- Knobs
- GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE,
- GRIDMANAGER_MAX_PENDING_SUBMIT_PER_RESOURCE,
- GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE
- One instance of gridmanager handles multiple jobs (all from a given user)
- One instance of condor_dagman can run multiple dags
- Is the Shadow next?
- Buffered I/O read on schedd restart (thanks Yahoo!)
52Quill
- Job ClassAds information mirrored into an RDBMS
- Both active jobs and historical jobs
- Benefits BOTH scalability and accessibility
[Diagram labels: Master, Startd, Quill, Schedd, Job Queue log, RDBMS, Queue History Tables]
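To show why mirroring job ClassAds into a database helps accessibility, here is a small Python sketch that copies a few job records into a table and answers a query with SQL instead of asking the schedd. SQLite stands in for the RDBMS, and the table layout and function names are hypothetical; they are not Quill's actual schema.

```python
import sqlite3
from typing import Dict, Iterable


def mirror_jobs(conn: sqlite3.Connection, job_ads: Iterable[Dict]) -> None:
    # Hypothetical schema: one row per job, keyed by cluster and proc ID.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "cluster_id INTEGER, proc_id INTEGER, owner TEXT, status TEXT, "
        "PRIMARY KEY (cluster_id, proc_id))"
    )
    # INSERT OR REPLACE keeps the mirror current when a job's ad changes.
    conn.executemany(
        "INSERT OR REPLACE INTO jobs VALUES (:cluster_id, :proc_id, :owner, :status)",
        list(job_ads),
    )
    conn.commit()


def count_by_status(conn: sqlite3.Connection, owner: str) -> Dict[str, int]:
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM jobs WHERE owner = ? GROUP BY status",
        (owner,),
    )
    return dict(rows.fetchall())
```

Queries like `count_by_status(conn, "alice")` run entirely against the database, which is one way the mirror could take load off the schedd.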
53Version 6.9.x
54What's brewing for after v6.8.0?
- More data, data, data
  - Stork distributed now in v6.7.x, incl DAGMan support; next it is NeST's turn.
  - NeST to manage Condor spool files, ckpt servers
  - GridFTP used to move the bits
- Quill and CondorDB goodness
- Virtual Machines (and the future of Standard Universe)
  - Research BOF w/ Jaeyoung Moon, rm219 9am
55SOAP API
- First focus will be to finish interfaces used by all command-line tools
  - condor_userprio, condor_cod,
- Explore message-based security
  - Ian Alderman's work w/ signed ClassAd attributes
56Privilege Separation
- No more root in the Condor daemons!
- Instead, a small component will be responsible for privileged operations
  - Initial exploratory work w/ GNU userv (Cambridge)
  - Now focusing on integration w/ glexec (gLite / nikhef)
57The Year of the Schedd
- Schedd is juggling too many tasks
- Break it down into smaller pieces, more modular
- Scalability
- All non-blocking I/O
- Hierarchy of schedds
- Schedd-on-the-side
- Scheduler booster
- Transform/delegate job classads to different grids
- A job router for a grid
58Thank you!