Condor Parallel Universe

Provided by: Miron1

1
Condor Parallel Universe
2
Overview
  • Task vs. Job Parallelism
  • New Condor support for Task-Parallelism
  • Other goodies

3
The Talk in one Slide
  • Parallel Universe can run any task-parallel job
  • Not just MPICH 1.2.4
  • Not just MPI

4
Job vs Task Parallelism
  • Condor historically focused on Job Parallelism
  • Job parallelism either manually or via DAGMan
  • Rest of talk on task parallelism
  • Can also get task parallelism via PVM or MW

5
Parallel Universe
  • Adaptation of MPI universe
  • Modifications based on experience with MPI
  • User feedback
  • But, more than just MPI

6
MPI lifecycle without Condor
  • LAM version
  • lamboot: lamboot -ssi boot ssh machine_file
  • mpirun: mpirun -np 8 exe arg1 arg2...
  • lamhalt: lamhalt

7
Scheduling
  • Need Dedicated Scheduler
  • "Dedicated" has a specific Condor meaning
  • Nodes running MPI require a dedicated scheduler
  • A given machine can have many opportunistic
    schedulers
  • ... but only one dedicated scheduler

8
DedicatedScheduler surprises
  • DedicatedScheduler co-opts normal negotiation
    cycle
  • Preemption and scheduling work differently than
    under opportunistic scheduling
  • DedicatedScheduler schedules First-Fit, sorted by
    UserJobPrio
  • condor_q -analyze mystery!

9
Job startup
  • Same file transfer, etc. as Vanilla
  • One shadow, many starters
  • Starter runs sshd on all machines, does key
    exchange
  • Starter runs the exe on first machine
  • (head node, Rank0)

10
Your script Here
  • Script on the head node has contact file
  • We provide samples for LAM, MPICH
  • We try to mimic by-hand startup
  • Use condor_ssh to start remote jobs
  • When the script exits, Condor cleans up

11
Parallel Example
[Diagram: a submit machine running the Schedd, and three execute machines, each running a Startd, an Sshd, and a Job]

12
Example submit file
  • Universe = Parallel
  • # executable is a script
  • executable = script
  • # the real binary
  • transfer_input_files = executable
  • arguments = arg1 arg2 arg3
  • machine_count = 8
  • output = out.$(Cluster).$(NODE)
  • queue
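
The output setting on this slide names one file per node. A minimal sketch of how those names might expand, assuming nodes are numbered from 0 up to machine_count minus 1; the helper output_names and the cluster ID 42 are hypothetical and used only for illustration:

```shell
#!/bin/sh
# Sketch only: output_names is a hypothetical helper, not a Condor command.
# It prints the per-node output file names for a given cluster ID and
# machine count, assuming node numbers start at 0.
output_names() {
    cluster=$1
    count=$2
    node=0
    while [ "$node" -lt "$count" ]; do
        echo "out.$cluster.$node"
        node=$((node + 1))
    done
}

# Hypothetical cluster ID 42 with machine_count = 8, as on the slide.
output_names 42 8
```

Each node writes to its own file, so output from different ranks does not interleave.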

13
Example Script
  • chmod 755 simple
  • lamboot -ssi boot rsh $MACHINE_FILE
  • mpirun -np $NO_MACHINES simple
  • lamhalt
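
The boot, run, halt ordering of this script can be sketched with stand-in functions so that it runs without LAM installed. The names boot_ring, run_job, halt_ring and lifecycle are hypothetical placeholders for lamboot, mpirun and lamhalt; they only record what would happen. The sketch assumes the script should always halt the ring and then pass the job's exit status on:

```shell
#!/bin/sh
# Sketch only: these functions stand in for lamboot, mpirun and lamhalt.
log=""
boot_ring() { log="$log boot"; }
# run_job NAME [STATUS]: record the run and return STATUS (default 0).
run_job()   { log="$log run:$1"; return "${2:-0}"; }
halt_ring() { log="$log halt"; }

# Run the three phases in order. The ring is halted even when the job
# fails, and the job's exit status is returned to the caller.
lifecycle() {
    boot_ring || return 1
    run_job "$1" "$2"
    status=$?
    halt_ring
    return "$status"
}

lifecycle simple 0
echo "phases:$log (exit $?)"
```

Returning the job's status rather than the halt step's status lets whatever launched the script see whether the parallel run itself succeeded.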

14
Example submit file 2
  • Universe = Parallel
  • Requirements = (Hostname == "somemachine")
  • queue
  • Requirements = (Hostname != "somemachine")
  • queue 7

15
Example Script 2
  • mach1=`sed -n 1p $MACHINE_FILE`
  • mach2=`sed -n 2p $MACHINE_FILE`
  • ./server &
  • ssh $mach1 client_app &
  • ssh $mach2 client_app &
  • wait
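
The pattern in this script (pick hosts out of the machine file, start the remote pieces in the background, then wait) can be tried locally. This is a sketch under stated assumptions: the machine file holds one hostname per line, and nth_machine and run_on are hypothetical helpers. run_on only prints what it would run, where a real script would call ssh:

```shell
#!/bin/sh
# Sketch only: nth_machine and run_on are hypothetical helpers.

# Print line N (1-based) of a machine file with one hostname per line.
nth_machine() {
    sed -n "${2}p" "$1"
}

# Stand-in for "ssh <host> <command>"; it only reports the command.
run_on() {
    host=$1
    shift
    echo "would run on $host: $*"
}

# Build a throwaway machine file so the sketch runs anywhere.
machine_file=$(mktemp)
printf '%s\n' node-a node-b node-c > "$machine_file"

mach1=$(nth_machine "$machine_file" 1)
mach2=$(nth_machine "$machine_file" 2)

# Start both clients in the background, then block until both finish.
run_on "$mach1" client_app &
run_on "$mach2" client_app &
wait

rm -f "$machine_file"
```

Because the clients run in the background, the script does not exit until wait has seen all of them finish.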

16
Summary
  • With the Parallel Universe in Condor 6.8 comes:
  • Support for most MPI implementations (some
    scripting required)
  • Somewhat better MPI scheduling
  • Better node placement via condor matchmaking

17
Questions?
  • Thank you!