Title: Condor Usage at Brookhaven National Lab

Slide 1: Condor Usage at Brookhaven National Lab
Alexander Withers (talk given by Tony Chan)
RHIC Computing Facility
Condor Week - March 15, 2005
Slide 2: About Brookhaven National Lab
- One of a handful of laboratories supported and managed by the U.S. government through the DOE.
- Multi-disciplinary lab with 2,700 employees; Physics is the largest department.
- The Physics Dept. has its own computing division (30 FTEs) to support physics (HEP) projects.
- RHIC (nuclear physics) and ATLAS (HEP) are the largest projects currently being supported.
Slide 3: Computing Facility Resources
- Full-service facility: central/distributed storage, large Linux Farm, robotic system for data storage, data backup, etc.
- 6 PB permanent tape storage capacity.
- 500 TB central/distributed disk storage capacity.
- 1.4 million SpecInt2000 aggregate computing power in the Linux Farm.
Slide 4: History of Condor at Brookhaven
- First looked at Condor in 2003 as a replacement for LSF and in-house batch software.
- Installed 6.4.7 in August 2003.
- Upgraded to 6.6.0 in February 2004.
- Upgraded to 6.6.6 (with the 6.7.0 startd binary) in August 2004.
- User base grew from 12 (April 2004) to 50 (March 2005).
Slide 5: The Rise in Condor Usage [chart]
Slide 6: The Rise in Condor Usage [chart]
Slide 7: Condor Cluster Usage [chart]
Slide 8: BNL's modified Condorview [screenshot]
Slide 9: Overview of Computing Resources
- Total of 2,750 CPUs (growing to 3,400 in 2005).
- Two central managers, with one acting as a backup.
- Three specialized submit machines, each handling 600 simultaneous jobs on average.
- 131 of the execute nodes can also act as submission nodes.
- One monitoring/Condorview server.
Slide 10: Overview of Computing Resources, cont.
- Six GLOBUS gateway machines for remote job submission.
- Most machines run SL 3.0.2 on the x86 platform; some still use RH 7.3.
- Running 6.6.6 with the 6.7.0 startd binary to take advantage of the multiple-VM feature.
Slide 11: Overview of Configuration
- Computing resources are divided into 6 pools.
- Two configuration models:
  - Split a pool's resources into two parts and restrict which jobs can run in each part.
  - A more complex version of the Bologna Batch System.
- A pool uses one or both of these models.
- Some pools employ user-priority preemption.
- Use a drop-queue method to fill fast machines first (illustrated in the sketch after this list).
- Have tools to easily reconfigure nodes.
- All jobs use the vanilla universe (no checkpointing).
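A common way to express the "fill fast machines first" idea in Condor is a submit-side rank on machine speed; this only illustrates the effect, not BNL's actual drop-queue implementation, which the slides do not describe in detail:

    # Submit-description sketch: prefer faster machines so they fill first.
    # Mips is a standard machine ClassAd benchmark attribute; the
    # executable name is illustrative.
    universe   = vanilla
    executable = analysis.sh
    rank       = Mips
    queue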
Slide 12: Two Part Model
- Nodes are assigned one of two tasks, analysis or reconstruction, irrespective of Condor.
- Within Condor, a node advertises itself as either an analysis node or a reconstruction node.
- A job must advertise itself in the same manner to match with an appropriate node (see the sketch below).
- Only certain users may run reconstruction jobs, but anyone can run an analysis job.
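A minimal sketch of how this mutual advertising can be expressed with ClassAds; the attribute names NodeType and JobType (and the executable) are illustrative assumptions, not necessarily what BNL used:

    # condor_config.local on a reconstruction node (sketch)
    NodeType = "reconstruction"              # custom machine attribute
    STARTD_EXPRS = NodeType                  # publish it in the machine ad
    # Only jobs that declare themselves reconstruction jobs may start here;
    # the per-user restriction would also be enforced here (e.g., on Owner).
    START = (TARGET.JobType =?= "reconstruction")

    # Submit-description file for a matching reconstruction job (sketch)
    universe     = vanilla
    executable   = reco.sh
    +JobType     = "reconstruction"          # custom job attribute
    requirements = (NodeType =?= "reconstruction")
    queue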
Slide 13: Analysis/Reconstruction
[Diagram: node groups ordered from fast (Group 5) down to slow (Group 1), each with two VMs (vm1, vm2). Notes: no suspension; no preemption; a job will start if a CPU is free. Example shown: a reconstruction job requesting "group < 2".]
Slide 14: A More Complex Version of the Bologna Model
- Two-CPU nodes, each with 8 VMs.
- 2 VMs per CPU.
- Only two jobs running at a time.
- Four job categories, each with its own priority.
- A high-priority VM will suspend a random VM of lower priority (see the sketch below).
- The randomness prevents the same VM from always being suspended.
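A rough sketch of how such a tiered-VM policy can be written in a 6.x-era configuration; the tier-to-VM mapping follows the diagram on the next slide, but the expressions and values are illustrative assumptions, and BNL's random choice of suspension victim is only noted in a comment:

    # condor_config.local sketch for an 8-VM node with priority tiers
    NUM_VIRTUAL_MACHINES = 8
    # Publish each VM's state so sibling VMs can see it (as vm7_Activity, etc.)
    STARTD_VM_EXPRS = State, Activity
    # Suspend jobs in the lower tiers (vm1-vm4 here) whenever a high-priority
    # VM (vm7/vm8) is busy. BNL additionally picked the victim VM at random,
    # which this sketch does not express.
    SUSPEND = (VirtualMachineID <= 4) && \
              ((vm7_Activity =?= "Busy") || (vm8_Activity =?= "Busy"))
    CONTINUE = (vm7_Activity =!= "Busy") && (vm8_Activity =!= "Busy")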
Slide 15: Analysis/Reconstruction
[Diagram: per-node VM priority tiers: High (vm7/vm8), Med (vm5/vm6), Low (vm3/vm4), and MC (vm1/vm2), across node groups from fast (Group 5) down to slow (Group 1). Notes: low-priority VMs are suspended; no preemption; a job will start if a CPU is free or the job is of higher priority. Example shown: a reconstruction job requesting group 3 at medium priority (vm5/vm6).]
Slide 16: Issues We've Had to Deal With
- Tuned parameters to alleviate scalability problems:
  - MATCH_TIMEOUT
  - MAX_CLAIM_ALIVES_MISSED
- Panasas (a proprietary file system) creates kernel threads with whitespace in the process name, which broke an fscanf in procapi.C. Panasas fixed the bug.
- High-volume users can dominate the pool; partially solved with PREEMPTION_REQUIREMENTS (see the sketch below).
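For context, these knobs live in the central configuration; the values below are illustrative rather than BNL's, and the PREEMPTION_REQUIREMENTS expression is the standard manual-style example that only lets a user with sufficiently better priority take over another user's claim:

    # condor_config tuning sketch (values illustrative)
    MATCH_TIMEOUT = 300              # startd: max seconds to sit in Matched state
    MAX_CLAIM_ALIVES_MISSED = 6      # startd: keep-alives from the schedd that
                                     # may be missed before the claim is dropped
    # negotiator: preempt a running user's job only when the incoming
    # user's priority is significantly better
    PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2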
Slide 17: Issues We've Had to Deal With, cont.
- DAGMan problems (latency, termination) -> changed from DAGMan to plain Condor.
- Created our own ClassAds and JobAds to build batch queues and handy management tools (i.e., our version of condor_off).
- Modified Condorview to meet our accounting and monitoring requirements.
Slide 18: Issues Not Yet Resolved
- Need a job ClassAd attribute that gives the user's primary group -> better control over cluster usage.
- Transfer output files for debugging when a job is evicted.
- Need an option to force the schedd to release its claim after each job.
- Allow the schedd to set a mandatory periodic_remove policy -> avoids manual cleanup (see the sketch below).
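Until such a schedd-side policy exists, periodic_remove can only be set per job in the submit file; a minimal sketch with an assumed 48-hour runtime cap:

    # Submit-description sketch: remove a job stuck running for > 48 hours
    # (172800 s; the cap is an assumption). JobStatus == 2 means Running.
    periodic_remove = (JobStatus == 2) && (CurrentTime - EnterCurrentStatus > 172800)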
Slide 19: Issues Not Yet Resolved, cont.
- The shadow seems to make a large number of NIS calls. Possible problem with caching -> address shadows in the vanilla universe?
- Need Kerberos support to comply with security mandates.
- Interested in Computing on Demand (COD), but lack of functionality prevents more usage.
- Need more (and effective) cluster management tools beyond condor_off.
Slide 20: Near-Term Plans Summary
- Waiting for the 6.8.x series (late 2005?) to upgrade.
- Scalability concerns as usage rises.
- High availability becomes more critical as usage rises.
- Integration of BNL Condor pools with external pools, but concerned about security.
- Need some of the functionality listed above for a meaningful upgrade and to improve cluster-management capability.