Introduction to Condor

About This Presentation

Title:

Introduction to Condor

Description:

http://www.cs.wisc.edu/condor. 23-June-2002. Introduction to Condor. Adopted by the 'real world' (Galileo, Maxtor, Micron, Oracle, Tigr, ...). PowerPoint PPT presentation

Slides: 111

Provided by: Alai79

Transcript and Presenter's Notes

Title: Introduction to Condor

1
Introduction to Condor
2
Thank you for having me!
I am
Alain Roy
Computer Science Ph.D. in Quality of Service,
with Globus Project
Working with the Condor Project

3
Condor Tutorials

Today (Sunday) 1000-1230
A general introduction to Condor
Monday 1700-1900
Using and administering Condor
Tuesday 1700-1900
Using Condor on the Grid

4
A General Introduction to Condor
5
The Condor Project (Established 1985)

Distributed Computing research performed by a
team of about 30 faculty, full time staff, and
students who
face software engineering challenges in a Unix
and Windows environment,
are involved in national and international
collaborations,
actively interact with users,
maintain and support a distributed production
environment,
and educate and train students.

6
A Multifaceted Project

Harnessing clusters, opportunistic and dedicated
(Condor)
Job management for Grid applications (Condor-G,
DaPSched)
Fabric management for Grid resources (Condor,
GlideIns, NeST)
Distributed I/O technology (PFS, Kangaroo, NeST)
Job-flow management (DAGMan, Condor)
Distributed monitoring and management (HawkEye)
Technology for Distributed Systems (ClassAD, MW)

7
Harnessing Computers

We have more than 300 pools with more than 8500
CPUs worldwide.
We have more than 1800 CPUs in 10 pools on our
campus.
Established a complete production environment
for the UW CMS group
Adopted by the 'real world' (Galileo, Maxtor,
Micron, Oracle, Tigr, ...)

8
The Grid

Close collaboration and coordination with the
Globus Project: joint development, adoption of
common protocols, technology exchange, ...
Partner in major national Grid R&D2 (Research,
Development and Deployment) efforts (GriPhyN,
iVDGL, IPG, TeraGrid)
Close collaboration with Grid projects in Europe
(EDG, GridLab, e-Science)

9
User/Application
Grid
Fabric (processing, storage, communication)
10
User/Application
Grid
Fabric (processing, storage, communication)
11
distributed I/O

Close collaboration with the Scientific Data
Management Group at LBL.
Provide management services for distributed data
storage resources
Provide management and scheduling services for
Data Placement jobs (DaPs)
Effective, secure and flexible remote I/O
capabilities
Exception handling

12
job flow management

Adoption of Directed Acyclic Graphs (DAGs) as a
common job flow abstraction.
Adoption of the DAGMan as an effective solution
to job flow management.

13
For the Rest of Today

Condor
Condor and the Grid
Related Technologies
DAGMan
ClassAds
Master-Worker
NeST
DaP Scheduler
Hawkeye
Today: Just the Big Picture

14
What is Condor?

Condor converts collections of distributively
owned workstations and dedicated clusters into a
distributed high-throughput computing facility.
Run lots of jobs over a long period of time,
not a short burst of high-performance computing
Condor manages both machines and jobs with
ClassAd Matchmaking to keep everyone happy

15
Condor Takes Care of You

Condor does whatever it takes to run your jobs,
even if some machines
Crash (or are disconnected)
Run out of disk space
Don't have your software installed
Are frequently needed by others
Are far away and managed by someone else

16
What is Unique about Condor?

ClassAds
Transparent checkpoint/restart
Remote system calls
Works in heterogeneous clusters
Clusters can be
Dedicated
Opportunistic

17
What's Condor Good For?

Managing a large number of jobs
You specify the jobs in a file and submit them to
Condor, which runs them all and sends you email
when they complete
Mechanisms to help you manage huge numbers of
jobs (1000s), all the data, etc.
Condor can handle inter-job dependencies (DAGMan)

18
What's Condor Good For? (cont'd)

Robustness
Checkpointing allows guaranteed forward progress
of your jobs, even jobs that run for weeks before
completion
If an execute machine crashes, you only lose work
done since the last checkpoint
Condor maintains a persistent job queue - if the
submit machine crashes, Condor will recover
(Story)

19
What's Condor Good For? (cont'd)

Giving your job the agility to access more
computing resources
Checkpointing allows your job to run on
opportunistic resources (not dedicated)
Checkpointing also provides migration - if a
machine is no longer available, move!
With remote system calls, run on systems which do
not share a filesystem. You don't even need an
account on a machine where your job executes

20
Other Condor features

Implement your policy on when the jobs can run on
your workstation
Implement your policy on the execution order of
the jobs
Keep a log of your job activities

21
A Condor Pool In Action
22
A Bit of Condor Philosophy

Condor brings more computing to everyone
A small-time scientist can make an opportunistic
pool with 10 machines, and get 10 times as much
computing done.
A large collaboration can use Condor to control
its dedicated pool with hundreds of machines.

23
The Condor Idea

Computing power is everywhere, we try to make
it usable by anyone.

24
Meet Frieda.
She is a scientist. But she has a big problem.
25
Frieda's Application

Simulate the behavior of F(x,y,z) for 20 values
of x, 10 values of y and 3 values of z
(20 × 10 × 3 = 600 combinations)
F takes on the average 3 hours to compute on a
typical workstation (total 1800 hours)
F requires a moderate (128MB) amount of memory
F performs moderate I/O - (x,y,z) is 5 MB and
F(x,y,z) is 50 MB
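
The totals on this slide can be checked with a short sketch. The value ranges below are placeholders, since the slide gives only the counts (20, 10 and 3); the per-run figures come straight from the bullets above.

```python
from itertools import product

# Placeholder value ranges: the slide gives only the counts (20, 10, 3).
xs = range(20)
ys = range(10)
zs = range(3)

HOURS_PER_RUN = 3   # average compute time for one F(x, y, z)
INPUT_MB = 5        # size of one (x, y, z) input
OUTPUT_MB = 50      # size of one F(x, y, z) result

# One simulation per (x, y, z) combination.
combinations = list(product(xs, ys, zs))
total_hours = len(combinations) * HOURS_PER_RUN
total_input_mb = len(combinations) * INPUT_MB
total_output_mb = len(combinations) * OUTPUT_MB

print(len(combinations), "jobs,", total_hours, "CPU hours")
print(total_input_mb, "MB in,", total_output_mb, "MB out")
```

At 1800 CPU hours, a single workstation running around the clock would need 75 days.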

26
I have 600 simulations to run. Where can I get
help?
27
Install a Personal Condor!
28
Installing Condor

Download Condor for your operating system
Available as a free download from
http://www.cs.wisc.edu/condor
Not labelled as Personal Condor, just Condor.
Available for most Unix platforms and Windows NT

29
So Frieda Installs Personal Condor on her machine

What do we mean by a Personal Condor?
Condor on your own workstation, no root access
required, no system administrator intervention
needed: easy to set up.
So after installation, Frieda submits her jobs to
her Personal Condor

30
Personal Condor?! What's the benefit of a Condor
Pool with just one user and one machine?
31
Your Personal Condor will ...

Keep an eye on your jobs and will keep you posted
on their progress
Keep a log of your job activities
Add fault tolerance to your jobs
Implement your policy on when the jobs can run on
your workstation

32
Frieda is happy until... She realizes she needs to
run a post-analysis on each job, after it
completes.
33
Condor DAGMan

Directed Acyclic Graph Manager
DAGMan allows you to specify the dependencies
between your Condor jobs, so it can manage them
automatically for you.
(e.g., "Don't run job B until job A has
completed successfully.")

34
What is a DAG?

A DAG is the data structure used by DAGMan to
represent these dependencies.
Each job is a node in the DAG.
Each node can have any number of parent or
children nodes as long as there are no loops!
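
A minimal sketch of the "no loops" rule, using Python's standard `graphlib` module rather than DAGMan itself. The diamond shape (A before B and C, both before D) and the helper name `run_order` are assumptions for illustration.

```python
from graphlib import TopologicalSorter, CycleError

# Each key is a job; its value is the set of parent jobs it waits for.
# The diamond shape (A -> B, C -> D) is an assumption for illustration.
diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

def run_order(dag):
    """Return one valid execution order, or raise CycleError on a loop."""
    return list(TopologicalSorter(dag).static_order())

order = run_order(diamond)
print(order)

# Making A wait for D closes a loop, so this is no longer a DAG.
looped = dict(diamond, A={"D"})
try:
    run_order(looped)
    has_loop = False
except CycleError:
    has_loop = True
print("loop detected:", has_loop)
```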

35
Running a DAG

DAGMan acts as a meta-scheduler, managing the
submission of your jobs to Condor based on the
DAG dependencies.

[Figure: DAGMan reads a .dag file with jobs A, B, C and D; job A is in the Condor job queue.]
36
Running a DAG (cont'd)

DAGMan holds jobs and submits them to Condor at
the appropriate times.

[Figure: DAGMan with the DAG of jobs A, B, C and D; jobs B and C are in the Condor job queue.]
37
Running a DAG (cont'd)

In case of a job failure, DAGMan continues until
it can no longer make progress, and then creates
a rescue file with the current state of the DAG.

[Figure: DAGMan with the DAG of jobs A, B and D plus one job marked X (failed); DAGMan writes a rescue file.]
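
The behaviour described above can be sketched in a few lines. This is an illustrative model, not DAGMan's implementation: the function name `run_dag`, the diamond-shaped DAG and the choice of C as the failing job are all assumptions, and the set of finished jobs stands in for the rescue file.

```python
def run_dag(dag, run_job, done=None):
    """Run every job whose parents are done; stop when no progress is possible.

    dag maps a job name to the set of its parents; run_job(name) returns
    True on success. Returns (done, failed). The `done` set plays the role
    of the rescue state and can be passed back in to resume.
    """
    done = set(done or ())
    failed = set()
    progress = True
    while progress:
        progress = False
        for job, parents in dag.items():
            # Skip finished jobs, failed jobs, and jobs still waiting on parents.
            if job in done or job in failed or not parents <= done:
                continue
            if run_job(job):
                done.add(job)
            else:
                failed.add(job)
            progress = True
    return done, failed

dag = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First run: job C fails, so D can never start.
done, failed = run_dag(dag, lambda job: job != "C")
# Second run: resume from the saved state once C has been fixed.
final_done, final_failed = run_dag(dag, lambda job: True, done=done)
```

Note that B still completes in the first run even though C failed, which is the "continues until it can no longer make progress" part.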
38
Recovering a DAG

Once the failed job is ready to be re-run, the
rescue file can be used to restore the prior
state of the DAG.

[Figure: DAGMan restores the DAG of jobs A, B, C and D from the rescue file; job C is in the Condor job queue.]
39
Recovering a DAG (cont'd)

Once that job completes, DAGMan will continue the
DAG as if the failure never happened.

[Figure: DAGMan with the DAG of jobs A, B, C and D; job D is in the Condor job queue.]
40
Finishing a DAG

Once the DAG is complete, the DAGMan job itself
is finished, and exits.

[Figure: DAGMan, the DAG of jobs A, B, C and D, and the Condor job queue.]
41
Frieda wants more

She decides to use the graduate students'
computers when the students aren't using them, and get done sooner.
In exchange, they can use the Condor pool too.

42
Frieda's Condor pool
Frieda's Computer: Central Manager
Graduate Students' Desktop Computers
43
Frieda's Pool is Flexible

Since Frieda is a professor, her jobs are
preferred.
Frieda doesn't always have jobs, so now the
graduate students have access to more computing
power.
Frieda's pool has enabled more work to be done by
everyone.

44
How does this work?

Frieda submits a job. Condor makes a ClassAd and
gives it to the Central Manager:
Owner = "Frieda"
MemoryUsed = 40M
ImageSize = 20M
Requirements = (Opsys == "Linux" && Memory >
MemoryUsed)
Central Manager collects machine ClassAds:
Memory = 128M
Requirements = (ImageSize < 50M)
Rank = (Owner == "Frieda")
Central Manager finds best match
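
As a rough sketch of what "finds best match" involves, the Python below checks both sides' Requirements and then lets the machine's Rank choose among compatible jobs. It is a toy model, not Condor's matchmaker: the second job (owner "student"), the machine's OpSys value and all function names are hypothetical, and sizes are plain numbers in MB.

```python
# All sizes are in MB. Attribute names follow the slide; the "student" job,
# the machine's OpSys value and the function names are hypothetical.
jobs = [
    {"Owner": "student", "MemoryUsed": 40, "ImageSize": 20},
    {"Owner": "Frieda", "MemoryUsed": 40, "ImageSize": 20},
]
machine = {"OpSys": "Linux", "Memory": 128}

def job_requirements(job, machine):
    # Job side: OpSys is Linux and Memory > MemoryUsed
    return machine["OpSys"] == "Linux" and machine["Memory"] > job["MemoryUsed"]

def machine_requirements(machine, job):
    # Machine side: ImageSize < 50M
    return job["ImageSize"] < 50

def machine_rank(machine, job):
    # Machine side: Rank is (Owner is Frieda); true counts as 1, false as 0
    return int(job["Owner"] == "Frieda")

def best_job_for(machine, jobs):
    """Return the compatible job that the machine ranks highest, or None."""
    compatible = [j for j in jobs
                  if job_requirements(j, machine) and machine_requirements(machine, j)]
    return max(compatible, key=lambda j: machine_rank(machine, j), default=None)

chosen = best_job_for(machine, jobs)
too_small = best_job_for({"OpSys": "Linux", "Memory": 32}, jobs)
```

Both jobs satisfy the requirements here, so Rank decides and Frieda's job is chosen; a machine with only 32 MB matches nothing.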

45
After a match is found

Central Manager tells both parties about the
match
Frieda's computer and the remote computer
cooperate to run Frieda's job.

46
Lots of flexibility

Machines can
Only run jobs when I have been idle for at least
15 minutes, or always run them.
Kick off jobs when someone starts using the
computer, or never kick them off.
Jobs can
Require or prefer certain machines
Use checkpointing, remote I/O, etc

47
Happy Day! Frieda's organization purchased a
Beowulf Cluster!

Other scientists in her department have realized
the power of Condor and want to share it.
The Beowulf cluster and the graduate student
computers can be part of a single Condor pool.

48
Frieda's Condor pool
Graduate Students' Desktop Computers
Frieda's Computer: Central Manager
Beowulf Cluster
49
Frieda's Big Condor Pool

Jobs can prefer to run in the Beowulf cluster by
using Rank.
Jobs can run just on appropriate machines based
on
Memory, disk space, software, etc.
The Beowulf cluster is dedicated.
The student computers are still useful.
Everyone's computing power is increased.

50
Frieda collaborates

She wants to share her Condor pool with
scientists from another lab.

51
Condor Flocking

Condor pools can work cooperatively

52
Flocking

Flocking is Condor-specific: you can just link
Condor pools together
Jobs usually prefer running in their native
pool, before running in alternate pools.
What if you want to connect to a non-Condor pool?

53
Condor-G

Condor-G lets you submit jobs to Grid resources.
Uses Globus job submission mechanisms
You get Condor's benefits
Fault tolerance, monitoring, etc.
You get the Grid's benefits
Use any Grid resources

54
Condor as a Grid Resource

Condor can be a backend for Globus
Submit Globus jobs to Condor resource
The Globus jobs run in the Condor pool

55
Condor Summary

Condor is useful, even on a single machine or a
small pool.
Condor can bring computing power to people that
can't afford a real cluster.
Condor can work with dedicated clusters
Condor works with the Grid
Questions so far?

56
ClassAds

Condor uses ClassAds internally to pair jobs with
machines.
Normally, you don't need to know the details when
you use Condor
We saw sample ClassAds earlier.
If you like, you can also use ClassAds in your
own projects.

57
What Are ClassAds?

A ClassAd maps attributes to expressions
Expressions can be:
Constants: strings, numbers, etc.
Expressions: other.Memory > 600M
Lists: { "roy", "pfc", "melski" }
Other ClassAds
Powerful tool for grid computing
Semi-structured (you pick your structure)
Matchmaking

58
ClassAd Example

Type = "Job"
Owner = "roy"
Universe = "Standard"
Requirements = (other.OpSys == "Linux" &&
other.DiskSpace > 140M)
Rank = (other.DiskSpace > 300M ? 10 : 1)
ClusterID = 12314
JobID = 0
Env
Real ClassAds have more fields than will fit on
this slide.

59
ClassAd Matchmaking

Type = "Job"
Owner = "roy"
Requirements = (other.OpSys == "Linux" &&
other.DiskSpace > 140M)
Rank = (other.DiskSpace > 300M ? 10 : 1)

Type = "Machine"
OpSys = "Linux"
DiskSpace = 500M
AllowedUsers = { "roy", "melski", "pfc" }
Requirements = IsMember(other.Owner,
AllowedUsers)
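
The two ads above match only when each side's Requirements accepts the other. A small sketch of that symmetric test follows; it models ads as Python dictionaries with lambdas for expressions and is not the ClassAd library's API. Disk sizes are plain numbers in MB, and the `matches` helper and the extra owner "eve" are hypothetical.

```python
# Expressions are modeled as functions of (me, other); sizes are in MB.
job_ad = {
    "Type": "Job",
    "Owner": "roy",
    "Requirements": lambda me, other: (other["OpSys"] == "Linux"
                                       and other["DiskSpace"] > 140),
    "Rank": lambda me, other: 10 if other["DiskSpace"] > 300 else 1,
}
machine_ad = {
    "Type": "Machine",
    "OpSys": "Linux",
    "DiskSpace": 500,
    "AllowedUsers": ["roy", "melski", "pfc"],
    "Requirements": lambda me, other: other["Owner"] in me["AllowedUsers"],
}

def matches(a, b):
    """Two ads match only if each one's Requirements accepts the other."""
    return a["Requirements"](a, b) and b["Requirements"](b, a)

ok = matches(job_ad, machine_ad)
rank = job_ad["Rank"](job_ad, machine_ad)

# Hypothetical job from a user who is not in AllowedUsers.
stranger_ad = dict(job_ad, Owner="eve")
rejected = not matches(stranger_ad, machine_ad)
```

With 500 MB of disk the machine both satisfies the job's Requirements and earns the higher Rank of 10.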

60
ClassAds Are Open Source

Library GNU Public License (LGPL)
Complete source code included
Library code
Test program
Available from
http://www.cs.wisc.edu/condor/classad
Version 0.9.3

61
Who Uses ClassAds?

Condor
European Data Grid
NeST
Web site
You?

62
ClassAd User: Condor

ClassAds describe jobs and machines
Matchmaking figures out what jobs run on which
machines
DAGMan will soon internally represent DAGs as
ClassAds

63
ClassAd User: EU Datagrid

JDL: ClassAd schema to describe jobs/machines
ResourceBroker matches jobs to machines

64
ClassAd User: NeST

NeST is a storage appliance
NeST uses ClassAd collections for persistent
storage of
User Information
File meta-data
Disk Information
Lots (storage space allocations)

65
ClassAd User: Web Site

Web-based application in Germany
User actions (transitions) are constrained
Constraints expressed through ClassAds

66
ClassAd Summary

ClassAds are flexible
Matchmaking is powerful
You can use ClassAds independently of Condor
http://www.cs.wisc.edu/condor/classad/

67
MW: Master-Worker

Master-Worker Style Parallel Applications
Large problem partitioned into small pieces
(tasks)
The master manages tasks and resources (worker
pool)
Each worker gets a task, executes it, sends the
result back, and repeats until all tasks are done
Examples: ray-tracing, optimization problems,
etc.
On Condor (PVM, Globus, ...)
Many opportunities!
Issues (in a Distributed Opportunistic
Environment)
Resource management, communication, portability
Fault-tolerance, dealing with runtime pool
changes.

68
MW to Simplify the Work!

An OO framework with simple interfaces
3 classes to extend, a few virtual functions to
fill
Scientists can focus on their algorithms.
Lots of Functionality
Handles all the issues in a meta-computing
environment
Provides sufficient info. to make smart
decisions.
Many Choices without Changing User Code
Multiple resource managers: Condor, PVM, ...
Multiple communication interfaces: PVM, File,
Socket, ...

69
MW's Layered Architecture
[Figure: layers from top to bottom: application classes (MW App.); API; MW abstract classes (MW); Infrastructure Providers Interface (IPI), with the Resource Mgr and the Communication Layer; underlying infrastructure.]
70
MW's Runtime Structure
[Figure: a Master Process holding the ToDo tasks, the Running tasks and the Workers list, connected to several Worker Processes.]

User code adds tasks to the master's Todo list
Each task is sent to a worker (Todo -> Running)
The task is executed by the worker
The result is sent back to the master
User code processes the result (can add/remove
tasks).
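
The five steps above can be sketched as a single loop. This is a sequential toy in Python, not the MW framework (which is an object-oriented library with its own classes): `run_master`, `execute` and `handle_result` are hypothetical names, and a function call stands in for shipping a task to a remote worker.

```python
from collections import deque

def run_master(tasks, execute, handle_result):
    """Drain a ToDo list one task at a time, following the steps above.

    execute(task) stands in for a remote worker; handle_result(task, result)
    is the user code and returns a list of new tasks (possibly empty).
    """
    todo = deque(tasks)
    running = []
    results = []
    while todo:
        task = todo.popleft()            # ToDo -> Running
        running.append(task)
        result = execute(task)           # the worker executes the task
        running.remove(task)             # the result is back at the master
        results.append(result)
        todo.extend(handle_result(task, result))  # user code may add tasks
    return results

# Hypothetical workload: square each number; a result above 50 spawns
# one follow-up task with the value 1.
def follow_up(task, result):
    return [1] if result > 50 else []

squares = run_master([2, 8, 3], lambda n: n * n, follow_up)
print(squares)
```

A real master would keep many tasks in the Running list at once and would move a task back to ToDo if its worker disappeared.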

71
MW Summary

It's simple
simple API, minimal user code.
It's powerful
works on meta-computing platforms.
It's inexpensive
On top of Condor, it can exploit 100s of
machines.
It solves hard problems!
Nug30, STORM, ...

72
MW Success Stories

Nug30 solved in 7 days by MW-QAP
Quadratic assignment problem outstanding for 30
years
Utilized 2500 machines from 10 sites
NCSA, ANL, UWisc, Gatech, INFN@Italy, ...
1009 workers at peak, 11 CPU years
http://www-unix.mcs.anl.gov/metaneos/nug30/
STORM (flight scheduling)
Stochastic programming problem (1000M row X
13000M col)
2K times larger than the best sequential program
can do
556 workers at peak, 1 CPU year
http://www.cs.wisc.edu/swright/stochastic/atr/

73
MW Information

http://www.cs.wisc.edu/condor/mw/

74
Questions So Far?
75
NeST

Traditional file servers have not evolved
NeST is a second-generation file server
Flexible storage appliance for the grid
Provides local and remote access to data
Easy management of storage resources
User-level software turns machines into storage appliances
Deployable and portable

76
Research Meets Production

NeST exists at an exciting intersection
Freedom to pursue academic curiosities
Opportunities to discover real user concerns

77
Very exciting intersection
78
NeST Supports Lots

A lot is a guaranteed storage allocation.
When you run your large analysis on a Grid, will
you have sufficient storage for your results?
Lots ensure you have storage space.
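
The idea of a lot, space reserved before the job runs, can be sketched as follows. This is a toy model and not the NeST interface: the class `StorageAppliance`, its methods and the sizes are all hypothetical.

```python
class StorageAppliance:
    """Toy model of guaranteed storage allocations ("lots"); not the NeST API."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.lots = {}                   # owner -> reserved MB

    def free_mb(self):
        """Space not yet promised to anyone."""
        return self.capacity_mb - sum(self.lots.values())

    def reserve_lot(self, owner, size_mb):
        """Reserve space up front; refuse if the guarantee cannot be met."""
        if size_mb > self.free_mb():
            return False
        self.lots[owner] = self.lots.get(owner, 0) + size_mb
        return True

server = StorageAppliance(capacity_mb=1000)
first = server.reserve_lot("frieda", 600)    # fits
second = server.reserve_lot("student", 600)  # would overcommit, so refused
```

The refusal happens at reservation time, so a job that holds a lot cannot run out of space for its results later.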

79
NeST Supports Multiple Protocols

Interoperability between admin domains
NeST currently speaks
Grid FTP and FTP
HTTP
NFS (beta)
Chirp
Designed for integration of new protocols

80
Design structure
[Figure: layered design: physical network layer; protocol handlers (Chirp, FTP, GridFTP, NFS, HTTP); common protocol layer; Storage Mgr; physical storage layer.]
81
Why not JBOS?

"Just a bunch of servers" has limitations
NeST advantages over JBOS
Single config and admin interface
Optimizations across multiple protocols
e.g. cache aware scheduling
Management and control of protocols
e.g. prefer local users to remote users

82
Three-Way Matching
[Figure: a three-way match among a Job Ad (refers to NearestStorage), a Machine Ad (knows where NearestStorage is) and a Storage Ad, bringing together the Job, the Machine and NeST.]
Three-way ClassAds

Job ClassAd:
Type = "job"
TargetType = "machine"
Cmd = "sim.exe"
Owner = "thain"
Requirements = (OpSys == "linux") && NearestStorage.HasCMSData

Machine ClassAd:
Type = "machine"
TargetType = "job"
OpSys = "linux"
Requirements = (Owner == "thain")
NearestStorage = (Name == "turkey") && (Type == "Storage")
84
NeST Information

http://www.cs.wisc.edu/condor/nest
Version 0.9 now available (linux only, no NFS)
Solaris and NFS coming soon
Requests welcome

85
DaP Scheduler

Intelligent scheduling of data transfers

86
Applications Demand Storage

Database systems
Multimedia applications
Scientific applications
High Energy Physics, Computational Genomics
Currently terabytes, soon petabytes of data

87
Is Remote access good enough?

Huge amounts of data (mostly in tapes)
Large number of users
Distance / Low Bandwidth
Different platforms
Scalability and efficiency concerns
=> A middleware is required

88
Two approaches

Move job/application to the data
Less common
Insufficient computational power on storage site
Not efficient
Does not scale
Move data to the job/application

89
Move data to the Job
[Figure: data moves from a remote staging area over the WAN to a local storage area (e.g. local disk, NeST server), then over the LAN to the compute cluster.]
90
Main Issues

1. Insufficient local storage area
2. CPU should not wait much for I/O
3. Crash Recovery
4. Different Platforms and Protocols
5. Make it simple

91
Data Placement Scheduler (DaPS)

Intelligently Manages and Schedules Data
Placement (DaP) activities/jobs
DaPS is to DaP jobs what Condor is to
computational jobs
Just submit a bunch of DaP jobs and then relax.

92
Supported Protocols

Currently supported
FTP
GridFTP
NeST (chirp)
SRB (Storage Resource Broker)
Very soon
SRM (Storage Resource Manager)
GDMP (Grid Data Management Pilot)

93
Case Study: DAGMan
.dag File
94
Current DAG structure

All jobs are assumed to be computational jobs

Job A
Job C
Job B
Job D
95
Current DAG structure

If data transfer to/from remote sites is
required, this is performed via pre- and
post-scripts attached to each job.

Job A
PRE Job B POST
Job C
Job D
96
New DAG structure

Add DaP jobs to the DAG structure

PRE Job B POST
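
One way to picture the new structure: the data transfers that used to hide in PRE and POST scripts become DAG nodes of their own, and each node is routed to the scheduler for its kind. The sketch below is an assumption-laden illustration, not DAGMan's file format or behaviour: the node names, the `kinds` table and the two Python lists standing in for the job queues are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical node names: the PRE and POST steps around job B become
# data placement (DaP) nodes of their own.
kinds = {"stage_in_B": "dap", "B": "compute", "stage_out_B": "dap"}
parents = {"stage_in_B": set(), "B": {"stage_in_B"}, "stage_out_B": {"B"}}

# Two lists stand in for the DaPS job queue and the Condor job queue.
queues = {"dap": [], "compute": []}
for node in TopologicalSorter(parents).static_order():
    queues[kinds[node]].append(node)   # route each node to its scheduler

print(queues)
```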
97
New DAGMan Architecture
[Figure: a .dag file feeds DAGMan, which places jobs in two queues: the DaPS Job Queue and the Condor Job Queue. Node labels shown: A, B, C, D, X, Y.]
98
DaP Conclusion

More intelligent management of remote data
transfer and staging
increase local storage utilization
maximize CPU throughput

99
Questions So Far?
100
Hawkeye

Sys admins first need information about what is
happening on the machines they are responsible
for.
Both current and past
Information must be consolidated and easily
accessible
Information must be dynamic

101
[Figure: several HawkEye Monitoring Agents reporting to one HawkEye Manager.]
102
HawkEye Monitoring Agent
[Figure: inside a HawkEye Monitoring Agent, Hawkeye_Startup_Agent and Hawkeye_Monitor gather data from /proc and kstat and send ClassAd updates to the HawkEye Manager.]
103
Monitor Agent, cont.

Updates are sent periodically
Information does not get stale
Updates also serve as a heartbeat monitor
Know when a machine is down
Out of the box, the update ClassAd has many
attributes about the machine of interest for
system administration
Current prototype: about 200 attributes
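
Using periodic updates as a heartbeat amounts to a timeout check on the last update time. A minimal sketch, with hypothetical machine names, timestamps and function name (this is not HawkEye code):

```python
def machines_down(last_update, now, timeout):
    """Return machines whose last update is older than `timeout` seconds."""
    return sorted(name for name, seen in last_update.items()
                  if now - seen > timeout)

# Hypothetical timestamps, in seconds.
last_update = {"node1": 100.0, "node2": 40.0}
down = machines_down(last_update, now=120.0, timeout=60.0)
print(down)
```

Here node1 reported 20 seconds ago and is considered alive, while node2 has been silent for 80 seconds and is flagged.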

104
Custom Attributes
[Figure: inside a HawkEye Monitoring Agent, Hawkeye_Startup_Agent and Hawkeye_Monitor report to the HawkEye Manager. Data sources: /proc, kstat, and data from the hawkeye_update_attribute command-line tool.]
Create your own HawkEye plugins, or share plugins
with others
105
Role of HawkEye Manager

Store all incoming ClassAds in an indexed resident
data structure
Fast response to client tool queries about
current state
"Show me all machines with a load average > 10"
Periodically store ClassAd attributes into a
Round Robin Database
Store information over time
"Show me a graph with the load average for this
machine over the past week"
Speak to clients via CEDAR, HTTP
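
The two storage roles above can be sketched side by side: a dictionary of the latest ads answers "current state" queries, and a fixed-size buffer that overwrites its oldest sample keeps history, which is the round-robin idea. The class name, attribute name `LoadAvg` and the sample values are hypothetical; this is not the Manager's actual data structure.

```python
from collections import deque

class RoundRobinSeries:
    """Fixed-size history: when full, the oldest sample is overwritten."""

    def __init__(self, slots):
        self.samples = deque(maxlen=slots)

    def record(self, value):
        self.samples.append(value)

    def history(self):
        return list(self.samples)

# History over time for one machine, keeping only the last 3 samples.
load = RoundRobinSeries(slots=3)
for value in [0.5, 1.0, 12.0, 11.0]:
    load.record(value)

# Current state: the latest ad per machine, queried by attribute.
latest_ads = {"node1": {"LoadAvg": 11.0}, "node2": {"LoadAvg": 0.3}}
busy = sorted(name for name, ad in latest_ads.items() if ad["LoadAvg"] > 10)
```

The history never grows beyond its slot count, so storage stays constant no matter how long monitoring runs.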

106
Web client
http://www.cs.wisc.edu/roy/hawkeye/

Command-line, GUI, Web-based

107
Running tasks on behalf of the sys admin

Submit your sys admin tasks to HawkEye
Tasks are stored in a persistent queue by the
Manager
Tasks can leave the queue upon completion, or
repeat after specified intervals
Tasks can have complex interdependencies via
DAGMan
Records are kept on which task ran where
Sounds like Condor, eh?
Yes, but simpler

108
Run Tasks in response to monitoring information

ClassAd Requirements Attribute
Example: Send email if a machine is low on disk
space or low on swap space
Submit an email task with an attribute
Requirements = free_disk < 5 || free_swap < 5
Example with task interdependency: if load average
is high and OS is Linux and console is Idle, submit
a task which runs top; if top sees Netscape,
submit a task to kill Netscape
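
The first example boils down to evaluating a Requirements expression against each machine's latest ad and running the task wherever it is true. A minimal sketch, with hypothetical machine names and values (the slide does not give units for the thresholds, so none are assumed here):

```python
def low_on_space(ad):
    # Requirements: free_disk < 5 or free_swap < 5 (units unspecified on the slide)
    return ad["free_disk"] < 5 or ad["free_swap"] < 5

# Hypothetical latest ads, one per machine.
machine_ads = {
    "node1": {"free_disk": 40, "free_swap": 60},
    "node2": {"free_disk": 3, "free_swap": 60},
    "node3": {"free_disk": 40, "free_swap": 2},
}

# Machines where the email task's Requirements evaluate to true.
to_notify = sorted(name for name, ad in machine_ads.items() if low_on_space(ad))
print(to_notify)
```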

109
Today's Summary

Condor works on many levels
Small pools can make a big difference
Big pools are for the really big problems
Condor works in the Grid
Condor is assisted by a host of technologies
ClassAds, Checkpointing, Remote I/O, DAGMan,
Master-Worker, NeST, DaPScheduler, Hawkeye