Title: High-Throughput Computing With Condor
1. High-Throughput Computing With Condor
2. Who Are We?
3. The Condor Project (Established '85)
- Distributed systems CS research performed by a team that faces:
  - software engineering challenges in a Unix/Linux/NT environment,
  - active interaction with users and collaborators,
  - daily maintenance and support challenges of a distributed production environment,
  - and educating and training students.
- Funding: NSF, NASA, DoE, DoD, IBM, Intel, Microsoft, and the UW Graduate School.
4. The Condor System
5. The Condor System
- Unix and NT
- Operational since 1986
- More than 1,300 CPUs at UW-Madison
- Available on the web
- More than 150 clusters worldwide in academia and industry
6. What is Condor?
- Condor converts collections of distributively owned workstations and dedicated clusters into a high-throughput computing facility.
- Condor uses matchmaking to make sure that everyone is happy.
7. What is High-Throughput Computing?
- High-performance: CPU cycles/second under ideal circumstances.
  - "How fast can I run simulation X on this machine?"
- High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances.
  - "How many times can I run simulation X in the next month using all available machines?"
8. What is High-Throughput Computing?
- Condor does whatever it takes to run your jobs, even if some machines:
  - Crash! (or are disconnected)
  - Run out of disk space
  - Don't have your software installed
  - Are frequently needed by others
  - Are far away and administered by someone else
9. What is Matchmaking?
- Condor uses matchmaking to make sure that work gets done within the constraints of both users and owners.
- Users (jobs) have constraints:
  - "I need an Alpha with 256 MB of RAM."
- Owners (machines) have constraints:
  - "Only run jobs when I am away from my desk, and never run jobs owned by Bob."
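The two sides of this match can be written down directly. A minimal sketch, assuming classic Condor ClassAd syntax; the attribute values (Alpha, 256 MB, user "bob") simply mirror the examples on this slide:

```
# Job side: "requirements" line in the user's submit description file
requirements = (Arch == "ALPHA") && (Memory >= 256)

# Machine side: owner policy in that machine's condor_config
START = (KeyboardIdle > 15 * $(MINUTE)) && (Owner != "bob")
```

The matchmaker only pairs a job with a machine when both expressions evaluate to true against the other party's ClassAd.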
10. What can Condor do for me?
- Condor can:
  - do your housekeeping.
  - improve reliability.
  - give performance feedback.
  - increase your throughput!
11. Some Numbers: UW-CS Pool
- 6/98-6/00: 4,000,000 hours (450 years)
- Real Users: 1,700,000 hours (260 years)
  - CS-Optimization: 610,000 hours
  - CS-Architecture: 350,000 hours
  - Physics: 245,000 hours
  - Statistics: 80,000 hours
  - Engine Research Center: 38,000 hours
  - Math: 90,000 hours
  - Civil Engineering: 27,000 hours
  - Business: 970 hours
- External Users: 165,000 hours (19 years)
  - MIT: 76,000 hours
  - Cornell: 38,000 hours
  - UCSD: 38,000 hours
  - CalTech: 18,000 hours
12. Condor Physics
13. Current CMS Activity
- Simulation (CMSIM) for CalTech
  - provided >135,000 CPU hours to date
  - peak day: 4,000 CPU hours
  - via the NCSA Alliance, Condor has allocated 1,000,000 hours total to CalTech
- Simulation and Reconstruction (CMSIM and ORCA) for the HEP group at UW-Madison
14. INFN Condor Pool - Italy
- Italian National Institute for Research in Nuclear and Subnuclear Physics
- 19 locations, each running a Condor pool
  - from as few as 1 CPU to more than 100 CPUs
  - each locally controlled
  - each flocks jobs to other pools when available
15. Particle Physics Data Grid
- The PPDG Project is...
  - a software engineering effort to design, implement, experiment with, evaluate, and prototype HEP-specific data-transfer and caching software tools for Grid environments
- For example...
16. Condor PPDG Work
- Condor Data Manager
  - technology to automate and coordinate data movement from a variety of long-term repositories to available Condor computing resources and back again
  - keeping the pipeline full!
  - SRB (SDSC), SAM (Fermi), PPDG HRM
17. PPDG Collaborators
18. National Grid Efforts
- GriPhyN (Grid Physics Network)
- National Technology Grid - NCSA Alliance (NSF-PACI)
- Information Power Grid - IPG (NASA)
- close collaboration with the Globus project
19. I have 600 simulations to run. How can Condor help me?
20. My Application
- Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y, and 3 values of z (20 × 10 × 3 = 600)
- F takes on average 3 hours to compute on a typical workstation (total: 1,800 hours)
- F requires a moderate amount of memory (128 MB)
- F performs moderate I/O: (x,y,z) is 5 MB and F(x,y,z) is 50 MB
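The parameter sweep described here is easy to enumerate in a few lines. A sketch; the concrete parameter values are placeholders, since the slide only specifies the counts:

```python
# Enumerate the 20 x 10 x 3 = 600 (x, y, z) combinations of the sweep.
# The value ranges are placeholders; the slide only gives the counts.
from itertools import product

xs = range(20)  # 20 values of x
ys = range(10)  # 10 values of y
zs = range(3)   # 3 values of z

combos = list(product(xs, ys, zs))
print(len(combos))      # 600 runs of F
print(len(combos) * 3)  # 1800 CPU-hours at ~3 hours per run
```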
21. Step I - get organized!
- Write a script that creates 600 input files, one for each of the (x,y,z) combinations
- Write a script that will collect the data from the 600 output files
- Turn your workstation into a Personal Condor
- Submit a cluster of 600 jobs to your Personal Condor
- Go on a long vacation (2.5 months)
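Submitting the cluster of 600 jobs takes one submit description file with a single `queue` statement. A sketch, assuming classic Condor submit syntax; the executable and file names are illustrative:

```
# F.submit - one cluster of 600 jobs; $(Process) takes the values 0..599
universe     = vanilla
executable   = F
input        = in.$(Process)
output       = out.$(Process)
error        = err.$(Process)
log          = F.log
requirements = Memory >= 128
queue 600
```

Running `condor_submit F.submit` queues all 600 jobs at once, and `condor_q` shows their progress.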
23. Step II - build your personal Grid
- Install Condor on the desktop machine next door, and on the machines in the classroom.
- Install Condor on the department's Linux cluster or the O2K in the basement.
- Configure these machines to be part of your Condor pool.
- Go on a shorter vacation ...
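Joining a machine to your pool is mostly a matter of pointing it at your central manager. A minimal sketch of the relevant condor_config setting; the host name is illustrative:

```
# In each new machine's condor_config: name the pool's central manager
CONDOR_HOST = mydesktop.cs.wisc.edu
```

After restarting the Condor daemons, running `condor_status` anywhere in the pool should list the newly added machines.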
25. Step III - take advantage of your friends
- Get permission from friendly Condor pools to access their resources
- Configure your personal Condor to "flock" to these pools
- Reconsider your vacation plans ...
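Flocking is configured on both sides of the agreement. A sketch of the knobs involved, with illustrative host names; the friendly pools must also grant access in their own configuration:

```
# In your personal Condor's config: pools to flock to, in order of preference
FLOCK_TO = condor.friend-one.edu, condor.friend-two.edu

# In each friendly pool's config: machines allowed to flock in
FLOCK_FROM = mydesktop.cs.wisc.edu
```

When your own pool has no free machines, jobs overflow to the listed pools in order.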
27. Think BIG. Go to the Grid.
28. Upgrade to Condor-G
- A Grid-enabled version of Condor that uses the inter-domain services of Globus to bring Grid resources into the domain of your Personal Condor
- Easy to use on different platforms
- Robust
- Supports SMPs and dedicated schedulers
29. Step IV - Go for the Grid
- Get access (account(s) and certificate(s)) to a Computational Grid
- Submit 599 "Grid Universe" Condor glide-in jobs to your personal Condor
- Take the rest of the afternoon off ...
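A Grid Universe job of this kind is routed through Globus by your Personal Condor. A hedged sketch of a Condor-G submit file, assuming the classic globus-universe syntax of that era; the gatekeeper contact string and file names are illustrative:

```
# Sketch of a Condor-G submit file routing work through a Globus gatekeeper
universe        = globus
globusscheduler = gatekeeper.grid-site.edu/jobmanager-lsf
executable      = F
output          = out.$(Process)
log             = grid.log
queue
```

The glided-in resources then join your Personal Condor like any other pool machine.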
31. What Have We Done with the Grid Already?
- NUG30
  - quadratic assignment problem
  - 30 facilities, 30 locations
  - minimize cost of transferring materials between them
  - posed as a challenge in 1968; long unsolved
  - but with a good pruning algorithm and high-throughput computing...
32. NUG30 Personal Condor Grid
- For the run we will be flocking to:
  - the main Condor pool at Wisconsin (600 processors)
  - the Condor pool at Georgia Tech (190 Linux boxes)
  - the Condor pool at UNM (40 processors)
  - the Condor pool at Columbia (16 processors)
  - the Condor pool at Northwestern (12 processors)
  - the Condor pool at NCSA (65 processors)
  - the Condor pool at INFN (200 processors)
- We will be using glide_in to access the Origin 2000 (through LSF) at NCSA.
- We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.
33. NUG30 - Solved!!!
- Sender: goux_at_dantec.ece.nwu.edu
- Subject: Re: Let the festivities begin.
- Hi dear Condor Team,
- you all have been amazing. NUG30 required 10.9 years of Condor time. In just seven days!
- More stats tomorrow!!! We are off celebrating!
- condor rules!
- cheers,
- JP.
34. Conclusion
- Computing power is everywhere; we try to make it usable by anyone.
35. Need more info?
- Condor Web Page: http://www.cs.wisc.edu/condor
- Peter Couvares (pfc_at_cs.wisc.edu)