Title: Overview
1Overview
- Grid Computing: What is it?
- Condor: What is it?
- The Condor Project at Boise State University College of Engineering
- Guidelines for Implementing a Grid Project
2Acknowledgements
- Brook Gore, Senior Fellow, Micron Technology
- Elisa Barney Smith, Associate Professor, ECE, College of Engineering, Boise State University
- Lynn Russell, Former Dean of the College of Engineering, Boise State University
3What is Grid Computing?
- Grid computing is both batch processing and service delivery through distributed, networked computers.
- It can use dedicated resources or make opportunistic use of available computer cycles (scavenging).
- Grid computing can be thought of as distributed, large-scale cluster computing.
- Sometimes it is characterized as a form of networked, distributed parallel computing.
4Why Should We Be Interested in Grid Computing?
- John Patrick, IBM's vice-president for Internet strategies, says, "The next big thing will be grid computing."
- Grid computing is very cost-effective when otherwise unused computer cycles are used.
- Grid computing opens up solutions to problems that can't be approached without an enormous amount of computing power.
5What is Condor?
- Condor converts a collection of distributed computers into a high-throughput computing facility.
- Condor provides:
- queuing mechanism
- scheduling policy
- priority scheme
- resource classification
- client services
6What is Condor? (Continued)
- Condor is supported on many platforms:
- MacOSX
- AIX
- Sun
- HPUX
- Irix
- Linux
- Windows NT, 2000, XP
7Condor Components
- The Central Manager
- 1. Oversees all resources of the pool.
- 2. Schedules jobs.
- 3. Manages the queue.
- The pool machines, or clients, can be configured in one of three ways:
- to only run jobs.
- to only submit jobs.
- to both run and submit jobs.
8A typical Condor Pool
[Diagram of a typical Condor pool, with these labels:]
- Central Manager: monitors the status of execute hosts and assigns jobs to them; matches jobs from submit hosts to appropriate execute hosts.
- Execute hosts
- Submit hosts
- Some machines are both submit and execute hosts.
- Checkpoint server: stores checkpoint files from jobs that checkpoint.
9Central Manager
[Diagram of the Central Manager, an Execute Host, and a Submit Host, with these labels:]
- The Execute Host tells the Central Manager about itself. The Central Manager tells it when to accept a job from the Submit Host.
- The Submit Host tells the Central Manager about a job. The Central Manager tells it which Execute Host to send the job to.
- Condor daemons (normally listening on ports 9614 and 9618).
- ClassAds are passed by the execute host and asked for by the submit host.
- The Submit Host sends the job to the Execute Host; the Execute Host sends results to the Submit Host.
- Condor daemons run on both the Execute Host and the Submit Host.
- A Condor daemon spawns the job and signals it when to abort, suspend, or checkpoint.
- condor_shadow process
- User's job
- User's executable code
10The ClassAd Mechanism
- Each execute host in the pool reports its ClassAds to the Condor Master.
- ClassAds are configurable to a certain degree.
- Each job that is submitted has a list of ClassAd requirements.
- For a job to run, its ClassAd requirements must match a ClassAd reported in the Condor Master's list.
11What is ClassAd Matchmaking?
- Condor uses ClassAd matchmaking to make sure that work gets done within the constraints of both the users and the owners of the workstations.
- Users (jobs) have constraints:
- "I need an Opteron with 512 MB RAM."
- Owners (machines) have constraints:
- "Only run jobs when I am away from my desk, and never run jobs owned by Bob."
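The two-sided match described above can be sketched in Python. This is only an illustration, not Condor's ClassAd language: ads are plain dictionaries, requirements are Python callables, and the attribute names `Owner` and `OwnerAway` are hypothetical stand-ins for "who submitted the job" and "the owner is away from the desk".

```python
# Sketch of two-sided matchmaking: a job and a machine match only if
# each side's requirements accept the other side's ad.
# Ads are plain dicts here; real ClassAds use Condor's own expression language.

def matches(job_ad, machine_ad):
    """Return True when both requirement expressions are satisfied."""
    job_ok = job_ad["Requirements"](machine_ad)      # user's constraints
    machine_ok = machine_ad["Requirements"](job_ad)  # owner's constraints
    return bool(job_ok and machine_ok)

# User constraint: "I need an Opteron with 512 MB RAM."
job = {
    "Owner": "alice",  # hypothetical user name
    "Requirements": lambda m: m["Arch"] == "OPTERON" and m["Memory"] >= 512,
}

# Owner constraint: "Only run jobs when I am away, and never run Bob's jobs."
machine = {
    "Arch": "OPTERON",   # illustrative value
    "Memory": 1024,      # MB
    "OwnerAway": True,   # hypothetical attribute for "away from my desk"
}
machine["Requirements"] = lambda j: machine["OwnerAway"] and j["Owner"] != "bob"

bob_job = dict(job, Owner="bob")

print(matches(job, machine))      # alice's job against this machine
print(matches(bob_job, machine))  # bob's job against this machine
```

The point of the sketch is that neither side decides alone: alice's job is accepted, while the same job submitted by bob is refused by the machine's own requirements.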
12ClassAds
- MyType = "Machine"
- TargetType = "Job"
- Name = "MEC408-15"
- Machine = "MEC408-15"
- Rank = 0.000000
- CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
- VirtualMachineID = 1
- VirtualMemory = 754084
- Disk = 65736860
- CondorLoadAvg = 0.000000
- LoadAvg = 0.020000
- KeyboardIdle = 0
- ConsoleIdle = 0
- Memory = 384
- Cpus = 1
- StartdIpAddr = "<132.178.151.18:1029>"
- Arch = "INTEL"
- OpSys = "WINNT51"
13The Condor Job Format
- A minimal Condor job is made up of:
- 1) a submit file
- 2) an executable file or a batch file
- Additional data files can be part of a more
complicated job stream.
14A Simple Condor Submit File, printname.sub
- universe = vanilla
- requirements = (OpSys == "WINNT51")
- executable = printname.bat
- output = printname.out
- error = printname.err
- log = printname.log
- queue
15A Simple Condor Job, printname.bat
- echo Howdy!
- echo Output from "net name"
- net name
- echo That's all folk
16Output from the Simple Job
- Howdy!
- Output from "net name"
- Name
- -----------------------------------
- BSU200108
- The command completed successfully.
- That's all folk
17What Can Happen During a Job
- A job can fail to be matched through ClassAds to an execute host and remain idle in the queue.
- The job can run to completion, and its data is returned to the submitting host.
- If there is keyboard or mouse activity on the execute host, one of three things will happen to the job:
- 1) it is terminated and requeued
- 2) it is suspended and held in memory
- 3) it continues in the background
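The three owner-activity outcomes can be pictured as a small decision function. The Python below is only an illustration: the policy names (`"requeue"`, `"suspend"`, `"continue"`) and the `JobState` labels are hypothetical names for the cases listed above, not Condor configuration values.

```python
from enum import Enum

class JobState(Enum):
    IDLE = "idle in queue"
    RUNNING = "running"
    SUSPENDED = "suspended in memory"

def on_owner_activity(policy):
    """Return the job's next state when keyboard or mouse activity is seen."""
    if policy == "requeue":    # 1) terminated and requeued
        return JobState.IDLE
    if policy == "suspend":    # 2) suspended and held in memory
        return JobState.SUSPENDED
    if policy == "continue":   # 3) continues in the background
        return JobState.RUNNING
    raise ValueError(f"unknown policy: {policy}")

for policy in ("requeue", "suspend", "continue"):
    print(policy, "->", on_owner_activity(policy).value)
```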
18Basic Condor Commands
- All commands are run from the command prompt on Windows, Unix, Linux, etc.
- condor_status: reports on the status of the pool.
- condor_submit <file.sub>: submits a job to the pool.
- condor_q: reports on queued or running jobs.
- condor_rm <job number>: removes a queued or running job.
- condor_q -analyze: analyzes a queued job's ClassAd requirements against the machines in the pool.
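These commands can also be driven from a script. The Python sketch below only builds the command lines listed above and runs one if the Condor tools happen to be on the `PATH`. The helper names `condor_argv` and `run` are hypothetical; nothing here is part of Condor itself.

```python
import shutil
import subprocess

def condor_argv(command, *args):
    """Build an argument list such as ['condor_q', '-analyze', '194.0']."""
    return [f"condor_{command}", *map(str, args)]

def run(command, *args):
    """Run a Condor command and return its output, or None if not installed."""
    argv = condor_argv(command, *args)
    if shutil.which(argv[0]) is None:
        return None  # Condor tools are not available on this machine
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout

print(condor_argv("submit", "printname.sub"))
print(condor_argv("q", "-analyze", "194.0"))
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems with file names.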
19condor_status
- amcdonal@coengrid amcdonal condor_status
- Name        OpSys    Arch   State      Activity  LoadAv  Mem
- Coengrid    LINUX    INTEL  Owner      Idle      0.380   501
- BSU101190   WINNT50  INTEL  Unclaimed  Idle      0.000   384
- MEC202P-03  WINNT50  INTEL  Unclaimed  Idle      2.180   384
- MEC202P-05  WINNT50  INTEL  Unclaimed  Idle      2.030   384
- raidman     WINNT50  INTEL  Unclaimed  Idle      0.000   312
- BSU101194   WINNT51  INTEL  Unclaimed  Idle      0.010   384
- BSU104889   WINNT51  INTEL  Unclaimed  Idle      0.010   1024
- BSU200108   WINNT51  INTEL  Unclaimed  Idle      0.000   256
- ET238-BSU   WINNT51  INTEL  Unclaimed  Idle      0.010   256
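A listing like this is easy to post-process. As a rough sketch, the Python below parses whitespace-separated rows in the column order shown on this slide and picks out unclaimed `WINNT51` machines. The sample text copies three rows from the slide. The helper name `parse_status` is hypothetical, and real `condor_status` output can contain extra header and summary lines that this sketch simply skips or does not handle.

```python
# Parse condor_status-style rows (columns as shown on the slide).
COLUMNS = ["Name", "OpSys", "Arch", "State", "Activity", "LoadAv", "Mem"]

SAMPLE = """\
Name       OpSys   Arch  State     Activity LoadAv Mem
Coengrid   LINUX   INTEL Owner     Idle     0.380  501
BSU101190  WINNT50 INTEL Unclaimed Idle     0.000  384
BSU104889  WINNT51 INTEL Unclaimed Idle     0.010  1024
"""

def parse_status(text):
    """Turn each data row into a dict keyed by column name."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or fields == COLUMNS:
            continue  # skip blank lines, the header, and unexpected lines
        row = dict(zip(COLUMNS, fields))
        row["LoadAv"] = float(row["LoadAv"])
        row["Mem"] = int(row["Mem"])  # memory in MB
        rows.append(row)
    return rows

machines = parse_status(SAMPLE)
candidates = [m["Name"] for m in machines
              if m["State"] == "Unclaimed" and m["OpSys"] == "WINNT51"]
print(candidates)
```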
20condor_submit
- amcdonal@coen condor_submit printname.sub
- Submitting job(s)...
- Logging submit event(s)...
- 1 job(s) submitted to cluster 194.0
21condor_q command
- amcdonal@coen condor_q
- -- Submitter: coengrid.boisestate.edu : <132.178.144.76:32773> : coengrid.boisestate.edu
- ID      OWNER     SUBMITTED    RUN_TIME     ST  PRI  SIZE  CMD
- 96.0    jjensen   3/23 13:48   17+23:28:48  R   0    13.8  gaussa_X.bat 0
- 96.1    jjensen   3/23 13:48   17+08:41:46  R   0    13.8  gaussa_X.bat 1
- 96.2    jjensen   3/23 13:48   17+03:49:33  R   0    13.8  gaussa_X.bat 2
- 96.3    jjensen   3/23 13:48   17+04:40:34  R   0    13.8  gaussa_X.bat 3
- 194.0   amcdonal  4/11 19:08    0+00:00:01  I   0    0.0   printname.bat 60
22condor_q -analyze
- amcdonal@coen condor_q -analyze 194.0
- -- Submitter: coengrid.boisestate.edu : <132.178.144.76:32773> : coengrid.boisestate.edu
- ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
- 194.000: Run analysis summary. Of 87 machines,
- 86 are rejected by your job's requirements
- 1 reject your job because of their own requirements
- 0 match, but are serving users with a better priority in the pool
- 0 match, but prefer another specific job despite its worse user-priority
- 0 match, but will not currently preempt their existing job
- 0 are available to run your job
- No successful match recorded.
- Last failed match: Mon Apr 11 19:08:04 2005
- Reason for last match failure: no match found
23condor_rm
- amcdonal@coen condor_rm 194.0
- Cluster 194 has been marked for removal.
24The multiple job facilities of Condor
- The previous job was a single batch job.
- The power of Condor is in its multiple-job facilities.
- These can be simultaneous submissions of similar jobs.
- They can also be parallel processing of a single job that has been decomposed into smaller parts.
- Let's look at parallel processing of a single job.
25A Parallel Job
- Dr. Elisa Barney Smith in the Department of ECE at Boise State University has a large, easily decomposable problem that must be run on 20 to 30 cases.
- This is ideal for Condor.
- Prior to the development of the Condor Grid, she manually distributed her work across all the available machines she could find.
- With the Condor Grid, she is able to submit one job that will break each case into 178 parts and run them separately.
- At the end, the results are assembled programmatically.
26Submit File Constructs
- There are several important constructs in submit files for parallel processing.
- The queue command allows multiple jobs to be run.
- The $(Process) variable increments with each job that is queued.
- The arguments command allows specific instructions to be passed to each queued job.
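To make the interaction of `queue` and `$(Process)` concrete, here is a small Python sketch that imitates the expansion: `queue N` creates N jobs, and `$(Process)` takes the values 0 through N-1. It is not how Condor is implemented. The function name `expand_queue` is hypothetical, and it handles only the `$(Process)` macro.

```python
# Imitates how "queue N" creates N jobs, with $(Process) running 0..N-1.

def expand_queue(arguments_template, count):
    """Return the argument string each queued job would receive."""
    return [arguments_template.replace("$(Process)", str(process))
            for process in range(count)]

args = expand_queue("$(Process)", 178)
print(len(args))          # number of jobs queued
print(args[0], args[-1])  # first and last process numbers
```

Each job can then use its own number to pick the part of the problem it works on.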
27The Parallel Job Condor Submit File
- universe = vanilla
- executable = sqr_gaussa_W.bat
- arguments = $(Process)
- transfer_input_files = sqr_gaussa_W.m, sqr_gaussa_W.txt
- requirements = (OpSys == "Linux")
- queue 178
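A submit file is a list of `name = value` lines ended by a `queue` statement, so it is straightforward to generate one from a script when many variations are needed. The Python below is a minimal sketch of that idea using the values from this slide. The helper name `submit_text` is hypothetical, and the output would still be passed to `condor_submit` by hand.

```python
# Build the text of a Condor submit file from a dictionary of settings.

def submit_text(settings, count=1):
    """Return submit-file text: one 'name = value' line per setting, then queue."""
    lines = [f"{name} = {value}" for name, value in settings.items()]
    lines.append("queue" if count == 1 else f"queue {count}")
    return "\n".join(lines) + "\n"

settings = {
    "universe": "vanilla",
    "executable": "sqr_gaussa_W.bat",
    "arguments": "$(Process)",
    "transfer_input_files": "sqr_gaussa_W.m, sqr_gaussa_W.txt",
    "requirements": '(OpSys == "Linux")',
}

print(submit_text(settings, count=178))
```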
28The sqr_gaussa_W.bat file
- #!/bin/csh -v
- setenv PATH /usr/coen/matlab/bin:$PATH
- setenv LM_LICENSE_FILE /usr/coen/matlab/etc/license.dat
- setenv PROCESS_NUM "$1"
- matlab -nosplash -r sqr_gaussa_W -logfile $1.log
29The Parallel Job Matlab Code
- a = getenv('PROCESS_NUM');
- process = str2num(a);
- offset = 256*process;
- [char,psf_type,psf] = textread(datafile,offset)
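The indexing idea in the MATLAB fragment above can be restated in Python: each job reads its process number from the environment and works on its own block of the input. This is a sketch under assumptions. The step of 256 comes from the slide, but treating the offset as a count of lines to skip, the helper name `block_for`, and the stand-in data are illustrative guesses rather than the author's code.

```python
import os

BLOCK = 256  # offset step taken from the slide

def block_for(process, lines):
    """Return the slice of lines that job number `process` would handle."""
    offset = BLOCK * process  # same arithmetic as the MATLAB code
    return lines[offset:offset + BLOCK]

# PROCESS_NUM is set by the wrapper script; default to 0 when run by hand.
process = int(os.environ.get("PROCESS_NUM", "0"))

# Stand-in data: 178 blocks of 256 numbered records.
lines = [f"record {i}" for i in range(BLOCK * 178)]
mine = block_for(process, lines)
print(process, mine[0], mine[-1])
```

Because every job computes a different offset from the same input file, the 178 jobs cover the whole data set without overlapping.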
30Resources for Starting A Grid Project
- http://www.cs.wisc.edu/condor
- 1) Condor download site.
- 2) Tools for managing and using Condor.
- 3) Technical support forum.
- 4) Documentation.
- http://www-128.ibm.com/developerworks/library/gr-design.html
- 1) Guidelines for evaluating the suitability of a project for a grid.
- 2) 32 design elements are considered.
- http://www-128.ibm.com/developerworks/library/gr-enable.html
- 1) Contains six strategies for grid application enablement.
- 2) Maps progressive steps of grid application development.