Title: Condor Usage at Brookhaven National Lab

Slide 1: Condor Usage at Brookhaven National Lab
Alexander Withers (talk given by Tony Chan)
RHIC Computing Facility
Condor Week - March 15, 2005
Slide 2: About Brookhaven National Lab
- One of a handful of laboratories supported and managed by the U.S. government through the DOE.
- Multi-disciplinary lab with 2,700 employees; Physics is the largest department.
- The Physics Dept. has its own computing division (30 FTEs) to support physics (HEP) projects.
- RHIC (nuclear physics) and ATLAS (HEP) are the largest projects currently being supported.
Slide 3: Computing Facility Resources
- Full-service facility: central/distributed storage, large Linux Farm, robotic system for data storage, data backup, etc.
- 6 PB permanent tape storage capacity.
- 500 TB central/distributed disk storage capacity.
- 1.4 million SpecInt2000 aggregate computing power in the Linux Farm.
Slide 4: History of Condor at Brookhaven
- First looked at Condor in 2003 as a replacement for LSF and in-house batch software.
- Installed 6.4.7 in August 2003.
- Upgraded to 6.6.0 in February 2004.
- Upgraded to 6.6.6 (with the 6.7.0 startd binary) in August 2004.
- User base grew from 12 (April 2004) to 50 (March 2005).
Slide 5: The Rise in Condor Usage [chart]
Slide 6: The Rise in Condor Usage [chart]
Slide 7: Condor Cluster Usage [chart]
Slide 8: BNL's modified Condorview [screenshot]
Slide 9: Overview of Computing Resources
- Total of 2,750 CPUs (growing to 3,400 in 2005).
- Two central managers, with one acting as a backup.
- Three specialized submit machines, each handling 600 simultaneous jobs on average.
- 131 of the execute nodes can also act as submission nodes.
- One monitoring/Condorview server.
Slide 10: Overview of Computing Resources, cont.
- Six GLOBUS gateway machines for remote job submission.
- Most machines run SL 3.0.2 on the x86 platform; some still use RH 7.3.
- Running 6.6.6 with the 6.7.0 startd binary to take advantage of the multiple-VM feature.
Slide 11: Overview of Configuration
- Computing resources are divided into 6 pools.
- Two configuration models:
  - Split a pool's resources into two parts and restrict which jobs can run in each part.
  - A more complex version of the Bologna Batch System.
- A pool uses one or both of these models.
- Some pools employ user-priority preemption.
- Use a drop-queue method to fill fast machines first (illustrated in the sketch after this list).
- Have tools to easily reconfigure nodes.
- All jobs use the vanilla universe (no checkpointing).
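A common way to express the "fill fast machines first" idea in Condor is a submit-side rank on machine speed; this only illustrates the effect, not BNL's actual drop-queue implementation, which the slides do not describe in detail:

    # Submit-description sketch: prefer faster machines so they fill first.
    # Mips is a standard machine ClassAd benchmark attribute; the
    # executable name is illustrative.
    universe   = vanilla
    executable = analysis.sh
    rank       = Mips
    queue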
Slide 12: Two Part Model
- Nodes are assigned one of two tasks, analysis or reconstruction, irrespective of Condor.
- Within Condor, a node advertises itself as either an analysis node or a reconstruction node.
- A job must advertise itself in the same manner to match with an appropriate node (see the sketch below).
- Only certain users may run reconstruction jobs, but anyone can run an analysis job.
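A minimal sketch of how this mutual advertising can be expressed with ClassAds; the attribute names NodeType and JobType (and the executable) are illustrative assumptions, not necessarily what BNL used:

    # condor_config.local on a reconstruction node (sketch)
    NodeType = "reconstruction"              # custom machine attribute
    STARTD_EXPRS = NodeType                  # publish it in the machine ad
    # Only jobs that declare themselves reconstruction jobs may start here;
    # the per-user restriction would also be enforced here (e.g., on Owner).
    START = (TARGET.JobType =?= "reconstruction")

    # Submit-description file for a matching reconstruction job (sketch)
    universe     = vanilla
    executable   = reco.sh
    +JobType     = "reconstruction"          # custom job attribute
    requirements = (NodeType =?= "reconstruction")
    queue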
Slide 13: Analysis/Reconstruction
[Diagram: node groups ordered from fast (Group 5) down to slow (Group 1), each with two VMs (vm1, vm2). Notes: no suspension; no preemption; a job will start if a CPU is free. Example shown: a reconstruction job requesting "group < 2".]
Slide 14: A More Complex Version of the Bologna Model
- Two-CPU nodes, each with 8 VMs.
- 2 VMs per CPU.
- Only two jobs running at a time.
- Four job categories, each with its own priority.
- A high-priority VM will suspend a random VM of lower priority (see the sketch below).
- The randomness prevents the same VM from always being suspended.
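A rough sketch of how such a tiered-VM policy can be written in a 6.x-era configuration; the tier-to-VM mapping follows the diagram on the next slide, but the expressions and values are illustrative assumptions, and BNL's random choice of suspension victim is only noted in a comment:

    # condor_config.local sketch for an 8-VM node with priority tiers
    NUM_VIRTUAL_MACHINES = 8
    # Publish each VM's state so sibling VMs can see it (as vm7_Activity, etc.)
    STARTD_VM_EXPRS = State, Activity
    # Suspend jobs in the lower tiers (vm1-vm4 here) whenever a high-priority
    # VM (vm7/vm8) is busy. BNL additionally picked the victim VM at random,
    # which this sketch does not express.
    SUSPEND = (VirtualMachineID <= 4) && \
              ((vm7_Activity =?= "Busy") || (vm8_Activity =?= "Busy"))
    CONTINUE = (vm7_Activity =!= "Busy") && (vm8_Activity =!= "Busy")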
Slide 15: Analysis/Reconstruction
[Diagram: per-node VM priority tiers: High (vm7/vm8), Med (vm5/vm6), Low (vm3/vm4), and MC (vm1/vm2), across node groups from fast (Group 5) down to slow (Group 1). Notes: low-priority VMs are suspended; no preemption; a job will start if a CPU is free or the job is of higher priority. Example shown: a reconstruction job requesting group 3 at medium priority (vm5/vm6).]
Slide 16: Issues We've Had to Deal With
- Tuned parameters to alleviate scalability problems:
  - MATCH_TIMEOUT
  - MAX_CLAIM_ALIVES_MISSED
- Panasas (a proprietary file system) creates kernel threads with whitespace in the process name, which broke an fscanf in procapi.C. Panasas fixed the bug.
- High-volume users can dominate the pool; partially solved with PREEMPTION_REQUIREMENTS (see the sketch below).
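For context, these knobs live in the central configuration; the values below are illustrative rather than BNL's, and the PREEMPTION_REQUIREMENTS expression is the standard manual-style example that only lets a user with sufficiently better priority take over another user's claim:

    # condor_config tuning sketch (values illustrative)
    MATCH_TIMEOUT = 300              # startd: max seconds to sit in Matched state
    MAX_CLAIM_ALIVES_MISSED = 6      # startd: keep-alives from the schedd that
                                     # may be missed before the claim is dropped
    # negotiator: preempt a running user's job only when the incoming
    # user's priority is significantly better
    PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2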
Slide 17: Issues We've Had to Deal With, cont.
- DAGMan problems (latency, termination) -> changed from DAGMan to plain Condor.
- Created our own ClassAds and JobAds to build batch queues and handy management tools (i.e., our version of condor_off).
- Modified Condorview to meet our accounting and monitoring requirements.
Slide 18: Issues Not Yet Resolved
- Need a job ClassAd attribute that gives the user's primary group -> better control over cluster usage.
- Transfer output files for debugging when a job is evicted.
- Need an option to force the schedd to release its claim after each job.
- Allow the schedd to set a mandatory periodic_remove policy -> avoids manual cleanup (see the sketch below).
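Until such a schedd-side policy exists, periodic_remove can only be set per job in the submit file; a minimal sketch with an assumed 48-hour runtime cap:

    # Submit-description sketch: remove a job stuck running for > 48 hours
    # (172800 s; the cap is an assumption). JobStatus == 2 means Running.
    periodic_remove = (JobStatus == 2) && (CurrentTime - EnterCurrentStatus > 172800)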
Slide 19: Issues Not Yet Resolved, cont.
- The shadow seems to make a large number of NIS calls. Possible problem with caching -> address shadows in the vanilla universe?
- Need Kerberos support to comply with security mandates.
- Interested in Computing on Demand (COD), but lack of functionality prevents more usage.
- Need more (and effective) cluster management tools beyond condor_off.
Slide 20: Near-Term Plans Summary
- Waiting for the 6.8.x series (late 2005?) to upgrade.
- Scalability concerns as usage rises.
- High availability becomes more critical as usage rises.
- Integration of BNL Condor pools with external pools, but concerned about security.
- Need some of the functionality listed above for a meaningful upgrade and to improve cluster-management capability.