High-Throughput Computing With Condor
1
High-Throughput Computing With Condor
2
Who Are We?
3
The Condor Project (Established 1985)
  • Distributed systems CS research performed by a
    team that faces
  • software engineering challenges in a
    Unix/Linux/NT environment,
  • active interaction with users and collaborators,
  • daily maintenance and support challenges of a
    distributed production environment,
  • and educating and training students.
  • Funding: NSF, NASA, DoE, DoD, IBM, Intel,
    Microsoft, and the UW Graduate School

4
The Condor System
5
The Condor System
  • Unix and NT
  • Operational since 1986
  • More than 1300 CPUs at UW-Madison
  • Available on the web
  • More than 150 clusters worldwide in academia and
    industry

6
What is Condor?
  • Condor converts collections of distributively
    owned workstations and dedicated clusters into a
    high-throughput computing facility.
  • Condor uses matchmaking to make sure that
    everyone is happy.

7
What is High-Throughput Computing?
  • High-performance: CPU cycles/second under ideal
    circumstances.
  • "How fast can I run simulation X on this
    machine?"
  • High-throughput: CPU cycles/day (week, month,
    year?) under non-ideal circumstances.
  • "How many times can I run simulation X in the
    next month using all available machines?"

8
What is High-Throughput Computing?
  • Condor does whatever it takes to run your jobs,
    even if some machines
  • Crash! (or are disconnected)
  • Run out of disk space
  • Don't have your software installed
  • Are frequently needed by others
  • Are far away and administered by someone else

9
What is Matchmaking?
  • Condor uses Matchmaking to make sure that work
    gets done within the constraints of both users
    and owners.
  • Users (jobs) have constraints:
  • "I need an Alpha with 256 MB RAM."
  • Owners (machines) have constraints:
  • "Only run jobs when I am away from my desk, and
    never run jobs owned by Bob."
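As a hedged sketch of how the two sides of a match are written down: Condor expresses both job and owner constraints as ClassAd expressions. Arch, Memory, KeyboardIdle, and RemoteUser are standard Condor ClassAd attributes, but the exact values and file layout below are illustrative, not taken from the talk:

```
# Job side -- a submit description file fragment (illustrative):
executable   = sim
requirements = (Arch == "ALPHA") && (Memory >= 256)
queue

# Owner side -- a condor_config policy on the machine (illustrative):
# run jobs only after 15 minutes of keyboard idle, and never for Bob
START = (KeyboardIdle > 15 * 60) && (RemoteUser != "bob@cs.wisc.edu")
```

The matchmaker pairs a job with a machine only when the job's requirements and the machine's START policy are both satisfied.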

10
What can Condor do for me?
  • Condor can
  • do your housekeeping.
  • improve reliability.
  • give performance feedback.
  • increase your throughput!

11
Some Numbers: UW-CS Pool
  • 6/98-6/00: 4,000,000 hours (450 years)
  • Real Users: 1,700,000 hours (260 years)
  • CS-Optimization: 610,000 hours
  • CS-Architecture: 350,000 hours
  • Physics: 245,000 hours
  • Statistics: 80,000 hours
  • Engine Research Center: 38,000 hours
  • Math: 90,000 hours
  • Civil Engineering: 27,000 hours
  • Business: 970 hours
  • External Users: 165,000 hours (19 years)
  • MIT: 76,000 hours
  • Cornell: 38,000 hours
  • UCSD: 38,000 hours
  • CalTech: 18,000 hours

12
Condor Physics
13
Current CMS Activity
  • Simulation (CMSIM) for CalTech
  • provided >135,000 CPU hours to date
  • peak day: 4,000 CPU hours
  • via the NCSA Alliance, Condor has allocated 1,000,000
    hours total to CalTech
  • Simulation and Reconstruction (CMSIM and ORCA) for
    the HEP group at UW-Madison

14
INFN Condor Pool - Italy
  • Italian National Institute for Research in
    Nuclear and Subnuclear Physics
  • 19 locations, each running a Condor pool
  • ranging from 1 CPU to more than 100 CPUs
  • each locally controlled
  • each "flocks" jobs to other pools when available

15
Particle Physics Data Grid
  • The PPDG Project is...
  • a software engineering effort to design,
    implement, experiment, evaluate, and prototype
    HEP-specific data-transfer and caching software
    tools for Grid environments
  • For example...

16
Condor PPDG Work
  • Condor Data Manager
  • technology to automate and coordinate data movement
    from a variety of long-term repositories to
    available Condor computing resources, and back again
  • keeping the pipeline full!
  • SRB (SDSC), SAM (Fermi), PPDG HRM

17
PPDG Collaborators
18
National Grid Efforts
  • GriPhyN (Grid Physics Network)
  • National Technology Grid - NCSA Alliance
    (NSF-PACI)
  • Information Power Grid - IPG (NASA)
  • close collaboration with the Globus project

19
I have 600 simulations to run. How can Condor help me?
20
My Application
  • Simulate the behavior of F(x,y,z) for 20 values
    of x, 10 values of y, and 3 values of z
    (20 × 10 × 3 = 600)
  • F takes on average 3 hours to compute on a
    typical workstation (total: 1,800 hours)
  • F requires a moderate amount of memory (128 MB)
  • F performs moderate I/O: (x,y,z) is 5 MB and
    F(x,y,z) is 50 MB

21
Step I - get organized!
  • Write a script that creates 600 input files, one
    for each (x,y,z) combination
  • Write a script that will collect the data from
    the 600 output files
  • Turn your workstation into a Personal Condor
  • Submit a cluster of 600 jobs to your Personal
    Condor
  • Go on a long vacation (2.5 months)
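The input-generation script from Step I can be sketched in a few lines of Python. The directory name, file naming scheme (inputs/in.N), and parameter ranges below are hypothetical placeholders; only the count (20 × 10 × 3 = 600) comes from the slides:

```python
import itertools
import os

# Hypothetical parameter ranges: 20 x-values, 10 y-values, 3 z-values.
xs = range(20)
ys = range(10)
zs = range(3)

os.makedirs("inputs", exist_ok=True)

# One input file per (x, y, z) combination: 20 * 10 * 3 = 600 files.
for i, (x, y, z) in enumerate(itertools.product(xs, ys, zs)):
    with open(os.path.join("inputs", f"in.{i}"), "w") as f:
        f.write(f"{x} {y} {z}\n")
```

A single submit description file with `queue 600` would then submit the whole cluster, and at roughly 3 hours per job on one workstation, 1,800 CPU-hours is indeed about 2.5 months of wall-clock time, hence the vacation.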

22
(No Transcript)
23
Step II - build your personal Grid
  • Install Condor on the desktop machine next door
    and on the machines in the classroom.
  • Install Condor on the department's Linux cluster
    or the O2K in the basement.
  • Configure these machines to be part of your
    Condor pool.
  • Go on a shorter vacation ...

24
(No Transcript)
25
Step III - take advantage of your friends
  • Get permission from friendly Condor pools to
    access their resources
  • Configure your personal Condor to "flock" to
    these pools
  • Reconsider your vacation plans ...
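Flocking is enabled in the Condor configuration on both sides of the arrangement. FLOCK_TO and FLOCK_FROM are real Condor configuration settings, but the host names in this sketch are placeholders:

```
# On your submit machine's condor_config (host name is a placeholder):
FLOCK_TO = condor.friendly-pool.example.edu

# On the friendly pool's central manager, permitting your machine:
FLOCK_FROM = my-workstation.example.edu
```

With this in place, jobs that cannot be matched in your own pool are automatically offered to the friendly pool, with no change to how you submit them.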

26
(No Transcript)
27
Think BIG. Go to the Grid.
28
Upgrade to Condor-G
  • A Grid-enabled version of Condor that uses the
    inter-domain services of Globus to bring Grid
    resources into the domain of your Personal Condor
  • Easy to use on different platforms
  • Robust
  • Supports SMPs and dedicated schedulers

29
Step IV - Go for the Grid
  • Get access (accounts and certificates) to a
    Computational Grid
  • Submit 599 Grid-Universe Condor glide-in jobs
    to your personal Condor
  • Take the rest of the afternoon off ...
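A Condor-G submission of that era targeted Grid resources through the Globus universe. The fragment below is an illustrative sketch: `universe = globus` and `globusscheduler` are the historical Condor-G submit-file syntax, while the gatekeeper host and jobmanager name are placeholders:

```
# Hypothetical Condor-G submit fragment (gatekeeper is a placeholder):
universe        = globus
globusscheduler = gatekeeper.example-grid.org/jobmanager-lsf
executable      = sim
queue
```

Glide-in then starts Condor daemons on the Grid-allocated nodes, so they temporarily join your Personal Condor pool and can run ordinary Condor jobs.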

30
(No Transcript)
31
What Have We Done with the Grid Already?
  • NUG30
  • quadratic assignment problem
  • 30 facilities, 30 locations
  • minimize cost of transferring materials between
    them
  • posed as a challenge in 1968; long unsolved
  • but solvable with a good pruning algorithm and
    high-throughput computing...

32
NUG30 Personal Condor Grid
  • For the run we will be flocking to
  • the main Condor pool at Wisconsin (600
    processors)
  • the Condor pool at Georgia Tech (190 Linux
    boxes)
  • the Condor pool at UNM (40 processors)
  • the Condor pool at Columbia (16 processors)
  • the Condor pool at Northwestern (12
    processors)
  • the Condor pool at NCSA (65 processors)
  • the Condor pool at INFN (200 processors)
  • We will be using glide_in to access the Origin
    2000 (through LSF) at NCSA.
  • We will use "hobble_in" to access the Chiba City
    Linux cluster and Origin 2000 here at Argonne.

33
NUG30 - Solved!!!
  • Sender: goux@dantec.ece.nwu.edu  Subject: Re: Let
    the festivities begin.
  • Hi dear Condor Team,
  • you all have been amazing. NUG30 required 10.9
    years of Condor Time. In just seven days!
  • More stats tomorrow!!! We are off celebrating!
  • condor rules!
  • cheers,
  • JP.

34
Conclusion
  • Computing power is everywhere; we try to make
    it usable by anyone.

35
Need more info?
  • Condor Web Page (http://www.cs.wisc.edu/condor)
  • Peter Couvares (pfc@cs.wisc.edu)