Slides: 59
Provided by: tannenba
1
What's new in Condor? Condor Week 2006
2
So Todd, where is v6.8? Well, v6.7 has been a challenge
3
4
(No Transcript)
5
Around since the 80s
6
Around since the 80s
80s Mullet Boy
7
100 people surveyed! Favorite "ility"?
8
100 people surveyed! Favorite "ility"?
Deployability!
9
Existing Ports
  • Digital UNIX 4.0 on Alpha
  • AIX 5.2 (clipped) on PowerPC
  • Tru64 5.1 (clipped) on Alpha
  • HP UNIX 10.20 on PA-RISC
  • HP UNIX 11.00 (clipped using hpux10.20 32 bit) on PA-RISC
  • Irix 6.5 (clipped) on SGI
  • Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) on Alpha
  • Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 on Intel x86
  • Linux 2.4.x (glibc 2.2) - Red Hat 8 on Intel x86
  • Linux 2.4.x (glibc 2.3) - Red Hat 9 on Intel x86
  • Enterprise Server 8.1 on Intel Itanium
  • Solaris 8 on Sparc
  • Solaris 9 on Sparc
  • Microsoft Windows 2000 or XP (clipped) on Intel x86

CondorWeek 2005
10
New Ports
  • Introduced in v6.6.x:
      • MacOSX ("clipped") on PowerPC
      • Debian Linux 3.1 on Intel x86
      • Fedora Core 1 on Intel x86
      • Red Hat Enterprise Linux 3 on Intel x86
      • SuSE Linux Enterprise Server 8.1 on Intel Itanium
  • Introduced in v6.7.x:
      • AIX 5.1 ("clipped") on PowerPC
      • Fedora Core 2 on x86
      • Fedora Core 3 on x86
      • SuSE 8.0 ("clipped") on AMD64
      • Solaris 10 ("clipped") on Sparc
      • Scientific Linux (Release 303) on x86
  • Still to be introduced in v6.7.x (before v6.8.0):
      • HPUX 11i 64-bit on PA-RISC
      • RHEL 4 on x86
      • Native 64-bit AMD Linux

Sigh
CondorWeek 2005
Psilord: the Condor porting doctor. Talk to
him in person tomorrow.
11
Porting Table
  • See
    http://www.cs.wisc.edu/condor/porting/port_table.html
  • Highlights:
      • Almost every 32-bit Linux flavor as full
      • Every other Unix, MacOS and Windows available as
        clipped
      • Solaris 10 and HP-UX 11.x now clipped
      • FreeBSD 4 contribution from Yahoo!, added 5 and 6
      • X86_64 Linux full running in the lab

12
Backfill Jobs
  • Execute machines will run a locally staged
    executable when otherwise idle.
  • Currently designed for BOINC.

# Turn on backfill functionality, and use BOINC
ENABLE_BACKFILL = TRUE
BACKFILL_SYSTEM = BOINC
# Spawn a backfill job if we've been Unclaimed for more than 5 minutes
START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
# Evict a backfill job if the machine is busy (based on keyboard
# activity or cpu load)
EVICT_BACKFILL = $(MachineBusy)
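
The start and evict rules on this slide can be read as a small decision function. The following Python sketch is only an illustration of that reading, not Condor code: the `MachineState` class, its field names, and `backfill_action` are hypothetical, and the 5-minute threshold is taken from the slide.

```python
from dataclasses import dataclass

# Threshold from the slide: Unclaimed for more than 5 minutes.
FIVE_MINUTES = 5 * 60  # seconds


@dataclass
class MachineState:
    """Hypothetical snapshot of an execute machine (names are illustrative)."""
    unclaimed_seconds: int   # how long the machine has been Unclaimed
    busy: bool               # keyboard activity or CPU load detected
    backfill_running: bool   # is a backfill job currently running?


def backfill_action(m: MachineState) -> str:
    """Return 'start', 'evict', or 'none' for one policy evaluation."""
    if m.backfill_running:
        # Mirrors the evict rule: evict when the machine is busy.
        return "evict" if m.busy else "none"
    # Mirrors the start rule: spawn after more than 5 minutes Unclaimed.
    if m.unclaimed_seconds > FIVE_MINUTES:
        return "start"
    return "none"
```

For example, `backfill_action(MachineState(600, False, False))` yields `"start"`, while a busy machine that is already running a backfill job yields `"evict"`.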
13
Joining Condor's Einstein@Home Compute Team
  • If you're running BOINC backfill jobs in Condor
    and want to use your cycles to help another UW
    project, please join the Einstein@Home
    computation
  • Join the Condor Backfill team:
  • http://einstein.phys.uwm.edu/team_display.php?teamid=5994
  • http://einstein.phys.uwm.edu/create_account_form.php?teamid=5994

14
More deployability
  • Personal Condor Support on Win32
  • LocalSystem not required
  • MSI installer on Win32 (thanks Micron!)
  • New tools: condor_cold_start and condor_cold_stop
  • Safe, dynamic Condor service deployment.
  • More info @ Research BOF, 9am, Rm 219

15
100 people surveyed! Favorite "ility"?
16
100 people surveyed! Favorite "ility"?
Availability!
17

Condor with Firewalls and NATs: GCB in v6.8.0!
[Diagram: a server app (listen, accept) and a client app (connect),
each on a GCB layer over TCP/IP, with a relay point]
18
Job Progress continues if connection is
interrupted
  • Now for Vanilla, Java, and Grid universe jobs,
    Condor supports reestablishment of the connection
    between the submitting and executing machines.
  • If network outage between execute and submit
    machine
  • If submit machine restarts
  • Grid Universe was tricky
  • To take advantage of this feature, put the
    following line into the job's submit
    description file:
  • JobLeaseDuration = <N seconds>
  • For example:
  • job_lease_duration = 1200

19
Job Progress continues if submit machine fails
  • Condor can now support a submit machine hot
    spare (schedd failover)
  • If your submit machine A is down for longer than
    N minutes, a second machine B can take over
  • Requires shared filesystem between machines A and
    B

20
Central Manager Failover
  • Condor Central Manager has two services:
  • condor_collector
      • Now a list of collectors is supported
  • condor_negotiator (matchmaker)
      • If it fails, an election process picks another to take over
      • Accounting state is periodically replicated
  • Contributed technology from Technion

21
Reliability, cont.
  • Time shifts
  • Quill
  • Closing windows of vulnerability

22
100 people surveyed! Favorite "ility"?
23
100 people surveyed! Favorite "ility"?
Lightweight?
24
100 people surveyed! Favorite "ility"?
X
Lightweight?
25
100 people surveyed! Favorite "ility"?
26
100 people surveyed! Favorite "ility"?
Functionality!
27
Security
  • Common authentication methods between Condor on
    Unix and Win32:
      • Kerberos 1.4 (additional hopeful benefit:
        authentication against MS Active Directory!)
      • SSL
      • Password (shared secret)
  • Starter only runs known executables
  • More powerful, unified map file(s)
  • GSI credentials delegated

28
With Condor on Win32, it would be nice if...
  • My jobs could access my files just like the
    condor_shadow can
  • I didn't have to tie my execute machines to a
    single account
  • I didn't have to run condor_store_cred from every
    machine where my credential is needed
  • (thank you Optena)

29
The Windows CredD
  • A centralized repository for user passwords

  C:\>condor_store_cred add
  Account: gquinn@CROW
  Enter password:
  Operation succeeded.

[Diagram: condor_store_cred sends "store password" with <password>
to the credd, which holds user passwords (myp4sswd, y0urs)]
30
The Windows CredD
[Diagram: the schedd sends "fetch password" to the CredD (holding
myp4sswd, y0urs) and receives <password> for the shadow]
Submit machines can use the CredD to impersonate
the user in the shadow
31
The Windows CredD
[Diagram: the starter sends "fetch password" to the CredD (holding
myp4sswd, y0urs) and receives <password> to run condor_exec.exe]
Execute machines can use the CredD to run jobs as
the submitting user!
32
Running Jobs as Submitting User
  • In submit file:
  • Run_job_as_owner = true
  • In config file on submit and execute nodes:

CREDD_HOST = vault.cs.wisc.edu
STARTER_ALLOW_RUNAS_OWNER = True
CREDD_CACHE_LOCALLY = True
33
Some Condor APIs
  • Command line tools
      • condor_submit, condor_q, etc.
      • -format, -constraint, -xml
  • Condor Perl Module
  • Chirp
  • Checkpoint Library API
  • MW --- improved!
  • DRMAA (Works w/ Win32, on SourceForge)
  • Condor Grid ASCII Protocol (GAHP)
  • Web Service Interface

34
DRMAA
  • Distributed Resource Management Application API
    (DRMAA)
  • GGF Working Group
  • An API specification for the submission and
    control of jobs to one or more Distributed
    Resource Management (DRM) systems
  • An API with C and Java bindings
  • not a protocol
  • Scope
      • Does: job submission, monitoring, control, final status
      • Does not: file staging, reservations, security, ...

35
Condor GAHP
  • The Condor GAHP is a relatively low-level
    protocol based on simple ASCII messages through
    stdin and stdout
  • Supports a rich feature set including two-phase
    commits, transactions, and optional asynchronous
    notification of events

36
GAHP, cont
  • Example:

  R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
  S: GRAM_PING 100 vulture.cs.wisc.edu/fork
  R: E
  S: RESULTS
  R: E
  S: COMMANDS
  R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
  S: VERSION
  R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
  S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
  R: S
  S: GRAM_PING 100 vulture.cs.wisc.edu/fork
  R: S
  S: RESULTS
  R: S 0
  S: RESULTS
  R: S 1
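
To make the transcript above easier to read: the first column appears to mark whether a line was sent to (S) or received from (R) the GAHP server, and each reply starts with S for success or E for error. The Python sketch below shows one way a client might split such a reply into fields while keeping backslash-escaped spaces (as in `NCSA\ CoG\ Gahpd`) inside a single field. The function names are hypothetical, and the escaping rule is inferred only from the example on the slide.

```python
from typing import List, Tuple


def split_gahp_line(line: str) -> List[str]:
    """Split on spaces, treating a backslash as an escape for the next char."""
    fields: List[str] = []
    current: List[str] = []
    escaped = False
    for ch in line:
        if escaped:
            current.append(ch)      # escaped character is kept literally
            escaped = False
        elif ch == "\\":
            escaped = True          # next character is part of the field
        elif ch == " ":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_response(line: str) -> Tuple[bool, List[str]]:
    """Return (success, remaining fields) for a reply line such as 'S 0'."""
    fields = split_gahp_line(line.strip())
    status = fields[0]
    if status not in ("S", "E"):
        raise ValueError("unexpected reply status: %r" % status)
    return status == "S", fields[1:]
```

With this sketch, `parse_response("E")` reports failure with no extra fields, which matches the early `GRAM_PING` reply in the transcript, sent before `INITIALIZE_FROM_FILE` had been issued.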

37
Web Service Interfaces
  • SOAP over http or https to the Condor daemons
  • Use any language or platform (where you can find
    a decent SOAP library)
  • Functionality exposed in current release:
  • Submit jobs
  • Retrieve job output
  • Remove/hold/release jobs
  • Query machine status (fetch ads from collector)
  • Query job status (fetch ads from the schedd)

38
Getting machine status via SOAP (in Java with
Axis)

  locator = new CondorCollectorLocator();
  collector = locator.getcondorCollector(new URL("http://machine:port"));
  ads = collector.queryStartdAds("Memory>512");

Because we give you WSDL information, you
don't have to write any of these functions.
39
More Functionality changes..
  • FINALLY, clean/consistent cross-platform quoting
    rules for arguments and environment variables
    (see condor_submit man page)
  • Schedd can run HawkEye modules, just like the
    Startd
  • Enables monitoring on the submit machine
  • condor_history now faster than a snail, and
    cleans up droppings.
  • DeferralTime, DeferralWindow
  • Coordinated starts
  • BIND_ALL_INTERFACES in config file
  • WANT_REMOTE_IO in job ClassAd

40
ClassAd Functions in Condor!
  • Conditionals
  • IfThenElse(condition,then,else)
  • String functions
  • Strcat(), strcmp(), toUpper(), etc.
  • StringList functions
  • Example of a string list (CSV style):
  • Mylist = "Joe, Jon, Jeff, Jim, Jake"
  • StrListContains(), StrListAppend(),
    StrListRemove(), etc.
  • Others
  • Regular expressions, arithmetic, etc
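
As a rough illustration of what the StringList functions do, here is a Python sketch that mimics them on a CSV-style string. These helpers are hypothetical stand-ins named after the functions on the slide; the exact ClassAd semantics (whitespace handling, case sensitivity) are assumptions, not taken from the Condor implementation.

```python
from typing import List


def _items(string_list: str) -> List[str]:
    """Split a CSV-style string list, trimming whitespace around each item."""
    return [s.strip() for s in string_list.split(",") if s.strip()]


def str_list_contains(string_list: str, item: str) -> bool:
    """True if item is a whole member of the list (assumed case-sensitive)."""
    return item in _items(string_list)


def str_list_append(string_list: str, item: str) -> str:
    """Return a new string list with item added at the end."""
    return ", ".join(_items(string_list) + [item])


def str_list_remove(string_list: str, item: str) -> str:
    """Return a new string list with every occurrence of item removed."""
    return ", ".join(x for x in _items(string_list) if x != item)


mylist = "Joe, Jon, Jeff, Jim, Jake"
```

Membership is tested against whole items, so `str_list_contains(mylist, "Jeff")` is true while `str_list_contains(mylist, "Je")` is false.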

41
Accounting Groups and Group Quota Support
  • Account Group (w/ CORE Feature Animation)
  • Account Group Quota (inspiration: CDF @ Fermi)
  • Sample problem: cluster w/ 500 nodes, Chemistry
    Dept purchased 100 of them, Chemistry users must
    always be able to use them
  • Could use Machine Rank
  • but this ties to specific machines
  • Or could use new group support
  • Each group can be given a quota in config file
  • Job ads can specify group membership
  • Group quotas are satisfied first
  • Accounting by user and by group
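
The "group quotas are satisfied first" idea can be sketched with the sample problem from this slide. The Python below is a deliberately simplified model, not the negotiator's algorithm: the `allocate` function and the group names are hypothetical, and leftover slots are handed out in plain iteration order, whereas a real pool would presumably use priorities.

```python
from typing import Dict


def allocate(total_slots: int,
             quotas: Dict[str, int],
             demand: Dict[str, int]) -> Dict[str, int]:
    """Give each group up to its quota first, then share what is left."""
    alloc: Dict[str, int] = {}
    remaining = total_slots
    # Pass 1: satisfy quotas (a group without a quota gets 0 here).
    for group, want in demand.items():
        give = min(want, quotas.get(group, 0), remaining)
        alloc[group] = give
        remaining -= give
    # Pass 2: hand leftover slots to groups that still want more.
    for group, want in demand.items():
        extra = min(want - alloc[group], remaining)
        alloc[group] += extra
        remaining -= extra
    return alloc


# Sample problem: 500 nodes, Chemistry purchased 100 of them.
result = allocate(500, {"chemistry": 100}, {"physics": 600, "chemistry": 150})
```

Here Chemistry receives its 100 slots even though Physics asks first and wants more than the whole cluster, and no job is tied to specific machines.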

42
100 people surveyed! Favorite "ility"?
43
100 people surveyed! Favorite "ility"?
Universability!
44
Grid Universe
  • With new Grid Universe, always specify a
    gridtype. So the old globus Universe is now
    declared as:
  • universe = grid
  • gridtype = gt2
  • Other gridtypes?
  • GT2 (Globus Toolkit 2)
  • GT3 (Globus Toolkit 3.2)
  • GT4 (Globus Toolkit 3.9.5)
  • UNICORE
  • Nordugrid
  • PBS (OpenPBS, PBSPro technology from INFN)
  • LSF (Platform LSF technology from INFN)
  • CONDOR (thanks gLite!)

Condor-G
Condor-C
45
Other Grid Universe improvements
  • Condor-G has support for credential refresh via
    the MyProxy Online Credential Management in NMI
  • http://grid.ncsa.uiuc.edu/myproxy
  • (both GT2 and GT4)
  • GT4: we start a GridFTP server behind the scenes
  • GridFTP server bundled w/ Condor nowadays
  • Some functionality present in Condor-G added to
    Condor-C
  • Forwarding of refreshed credentials (EGEE)
  • GSI authentication support
  • Cleaner ClassAd representation (URL)

46
Parallel Universe
  • Replaces the MPI universe
  • Allows running arbitrary programs that need to
    gang-schedule multiple machines
  • MPICH, LAM, ...
  • FT-MPICH (Seoul National Univ)
  • Great for testing environments

47
Hey Jobs! We're watching you!
  • Local Universe
  • Just like Scheduler Universe, but there is a
    condor_starter
  • All advantages of the starter

[Diagram labels: Submit (schedd, starter, job); Execute (startd,
starter, job)]
Hey, job, behave or else!
48
100 people surveyed! Favorite "ility"?
49
100 people surveyed! Favorite "ility"?
Scalability!
50
Faster Negotiation
  • SIGNIFICANT_ATTRIBUTES determined automatically
  • Job attributes AutoClusterId and
    AutoClusterAttributes
  • Rounding of Attributes
  • Schedd uses non-blocking TCP connects to the
    startd
  • Negotiator caching
  • Collector Forks for queries
  • More coming
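
The auto-clustering idea behind the first bullets can be illustrated as grouping jobs whose significant attributes have identical values, so that matchmaking work can be shared within a group. The Python sketch below is an assumption-laden illustration: `assign_autoclusters` and the sample attribute names are hypothetical, and how Condor actually derives the significant attributes is not shown.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def assign_autoclusters(jobs: Sequence[Dict[str, Hashable]],
                        significant_attrs: Sequence[str]) -> List[int]:
    """Return one cluster id per job; equal significant values share an id."""
    ids: Dict[Tuple, int] = {}
    result: List[int] = []
    for job in jobs:
        # Only the significant attributes take part in the key.
        key = tuple(job.get(attr) for attr in significant_attrs)
        # New keys get the next id (1, 2, 3, ...); known keys reuse theirs.
        cluster_id = ids.setdefault(key, len(ids) + 1)
        result.append(cluster_id)
    return result


jobs = [
    {"Owner": "alice", "ImageSize": 2000, "Cmd": "a.out"},
    {"Owner": "alice", "ImageSize": 2000, "Cmd": "b.out"},
    {"Owner": "bob", "ImageSize": 2000, "Cmd": "a.out"},
]
cluster_ids = assign_autoclusters(jobs, ["Owner", "ImageSize"])
```

The first two jobs differ only in `Cmd`, which is not significant in this example, so they land in the same cluster. Rounding attribute values before building the key would merge more jobs into fewer clusters.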

51
Scalability, cont.
  • Knobs
  • GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE,
  • GRIDMANAGER_MAX_PENDING_SUBMIT_PER_RESOURCE,
  • GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE
  • One instance of gridmanager handles multiple jobs
    (all from a given user)
  • One instance of condor_dagman can run multiple
    dags
  • Is the Shadow next?
  • Buffered I/O read on schedd restart (thanks
    Yahoo!)

52
Quill
  • Job ClassAds information mirrored into an RDBMS
  • Both active jobs and historical jobs
  • Benefits BOTH scalability and accessibility

[Diagram labels: Master, Startd, Quill, Schedd, Job Queue log,
RDBMS, Queue and History Tables]
53
Version 6.9.x
54
What's brewing for after v6.8.0?
  • More data, data, data
  • Stork distributed now in v6.7.x, incl. DAGMan support;
    next it is NeST's turn.
  • NeST: manage Condor spool files, ckpt servers
  • GridFTP used to move the bits
  • Quill and CondorDB goodness
  • Virtual Machines (and the future of Standard
    Universe)
  • Research BOF w/ Jaeyoung Moon, rm219 9am

55
SOAP API
  • First focus will be to finish interfaces used by
    all command-line tools
  • condor_userprio, condor_cod, ...
  • Explore message-based security
  • Ian Alderman's work w/ signed ClassAd attributes

56
Privilege Separation
  • No more root in the Condor daemons!
  • Instead, a small component will be responsible
    for privileged operations
  • Initial exploratory work w/ GNU userv (Cambridge)
  • Now focusing on integration w/ glexec (gLite /
    nikhef)

57
The Year of the Schedd
  • Schedd is juggling too many tasks
  • Break it down into smaller pieces, more modular
  • Scalability
  • All non-blocking I/O
  • Hierarchy of schedds
  • Schedd-on-the-side
  • Scheduler booster
  • Transform delegate job classads to different
    grids
  • A job router for a grid

58
Thank you!