1. Scheduling and Resource Management in Distributed Systems
Rajesh Rajamani, raj_at_cs.wisc.edu
http://www.cs.wisc.edu/condor
May 2001
2. Outline
- High-throughput computing and Condor
- Resource Management in distributed systems
- Matchmaking
- Current research/Misc.
3. Power of Computing Environments
- Power = Work / Time
- High Performance Computing
  - Fixed amount of work: how much time?
  - Traditional performance metrics: FLOPS, MIPS
  - Response time/latency oriented
- High Throughput Computing
  - Fixed amount of time: how much work?
  - Application-specific performance metrics
  - Throughput oriented
4. In other words
- HPC: enormous amounts of computing power over relatively short periods of time
  - Good for applications under sharp time constraints
- HTC: large amounts of computing power over lengthy periods
  - What if you want to simulate 1000 applications on your latest DSP chip design over the next 3 months?
5. The Condor Project
- Goal: to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources
6. More about Condor
- Started in the late 1980s
- Principal Investigator: Prof. Miron Livny
- Latest released version: 6.3.0
- Supports 14 different platforms (OS + Arch), including Linux, Solaris and WinNT
- Currently employs over 20 students and 5 staff
- We write code, debug, port, publish papers and YES, we also provide support!
7. Distributed ownership of resources
- Underutilized: 70% of CPU cycles in a cluster go to waste
- Fragmented: resources are owned by different people
- Use these resources to provide HTC, BUT without impacting the QoS available to the owner
- Achieved by allowing the owner to set an access policy using control expressions
8. Access policy
- Current state of the resource (e.g., keyboard idle for 15 minutes or load average less than 0.2)
- Characteristics of the request (e.g., run only jobs of research associates)
- Time of day/night that jobs can be run
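The three kinds of policy input above can be sketched as a single predicate. This is only an illustration in Python, not Condor's control-expression syntax: the class names, the `owner_role` label, and the 8:00 to 18:00 working-hours window are assumptions made for the example; the 15-minute and 0.2 thresholds come from the slide.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    keyboard_idle_s: float  # seconds since the last keyboard activity
    load_avg: float         # current load average
    hour: int               # local hour of day, 0-23


@dataclass
class Request:
    owner: str
    owner_role: str         # hypothetical label, e.g. "research_associate"


def owner_allows(state: MachineState, req: Request) -> bool:
    """Hypothetical owner policy combining the three inputs from the slide."""
    # Current state of the resource: keyboard idle 15 min OR load below 0.2.
    resource_quiet = state.keyboard_idle_s >= 15 * 60 or state.load_avg < 0.2
    # Characteristics of the request: only jobs of research associates.
    requester_ok = req.owner_role == "research_associate"
    # Time of day: outside an assumed 8:00-18:00 working window.
    off_hours = state.hour < 8 or state.hour >= 18
    return resource_quiet and requester_ok and off_hours


night_idle = MachineState(keyboard_idle_s=20 * 60, load_avg=0.5, hour=22)
busy_noon = MachineState(keyboard_idle_s=30, load_avg=1.4, hour=12)
ra = Request(owner="raman", owner_role="research_associate")
```

Here `owner_allows(night_idle, ra)` accepts the job, while `owner_allows(busy_noon, ra)` rejects it on both the state and the time-of-day checks.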
9. What happens when you submit a job
(Diagram: Submitting machine, Central Manager, Available resource)
1. User submits a job on the submitting machine
2. Submitting machine sends the ClassAd of the job to the Central Manager; resources announce their properties periodically
3. Matchmaker notifies both parties of a match
4. Parties negotiate
10. Important Mechanisms
Mechanism       For
Matchmaking     Resource management
Checkpointing   Saving the state of a job
Bypass          Remote system calls
DAGMan          Automatic job submission based on a dependency graph
Master-Worker   Exploiting task-level parallelism
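To make "job submission based on a dependency graph" concrete, here is a minimal Python sketch that groups jobs into waves, where each wave can be submitted once the previous one has finished. It is not DAGMan's implementation or input format; the function name `submission_waves` and the job names A to D are made up for the example. It relies on `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def submission_waves(deps):
    """Group jobs into waves; deps maps each job to the set of jobs it waits for."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose parents are all done
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave completed
    return waves


# Hypothetical diamond-shaped workflow: B and C need A, D needs both B and C.
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
waves = submission_waves(deps)  # [['A'], ['B', 'C'], ['D']]
```

A real workflow manager would submit each ready job as soon as its own parents finish rather than waiting for a whole wave; the wave view is just the simplest way to show the ordering constraint.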
11. Condor Architecture
- Manager
  - Collector: database of resources
  - Negotiator: matchmaker
  - Accountant: priority maintenance
- Startds (represent owners of resources)
  - Implement the owner's access control policy
- Schedds (represent customers of the system)
  - Maintain persistent queues of resource requests
12. Condor Architecture, cont.
13. Power of Condor
- Solved the NUG30 quadratic assignment problem, posed in 1968, over a period of 6.9 days, delivering over 96,000 CPU hours by commandeering an average of 650 machines!
- Compare this with the RSA-155 problem, posed in 1977 and solved in the late 90s using 300 computers over a period of 7 months. With the same amount of resources as were used to solve NUG30, this could have been done in 2 weeks!
- "It (Chorus production) was done in parallel on machines in the computer center running XXX, and on the office machines under Condor. The latter did about 90% of the work!" (Helge Meinhard, EP division, CERN)
14. Resource management using Matchmaking
- Opportunistic resource exploitation
  - Resource availability is unpredictable
  - Exploit resources as soon as they are available
  - Matchmaking performed continuously
- As against a centralized scheduler, which would have to deal with:
  - Heterogeneity of resources
  - Distributed ownership: widely varying allocation policies
  - Dynamic nature of the cluster
15. Classified Advertisements
- A simple language used by resource providers and customers to express their properties/requirements to the Collector
- Uses a semi-structured data model => no specific schema is required by the matchmaker, allowing it to work naturally in a heterogeneous environment
- Language folds the query language into the data model: constraints may be expressed as attributes of the ClassAd
- Ads should conform to the advertising protocol
16. Matchmaking with ClassAds
- 4 steps to managing resources:
  1. Parties requiring matchmaking advertise their characteristics, preferences, constraints, etc.
  2. Advertisements are matched by a Matchmaker
  3. Matched entities are notified
  4. Matched entities establish an allocation through a claiming process, which could include authentication, constraint verification, negotiation of terms, etc.
- Method is symmetric
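The four steps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Condor's protocol: the `Party` and `Matchmaker` classes, the `inbox` notification list, and the use of Python callables as constraints are all invented for the example. The point it illustrates is the symmetry (both constraints must hold) and that claiming re-checks the constraints against the parties' current ads instead of the advertised snapshots.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Ad = Dict[str, object]


@dataclass
class Party:
    name: str
    ad: Ad                                  # current state of this party
    constraint: Callable[[Ad, Ad], bool]    # (own ad, other ad) -> bool
    inbox: List[str] = field(default_factory=list)


class Matchmaker:
    def __init__(self) -> None:
        self.adverts = []                   # (party, snapshot of its ad)

    def advertise(self, party: Party) -> None:           # step 1
        self.adverts.append((party, dict(party.ad)))

    def match(self):                                      # steps 2 and 3
        matches = []
        for i, (a, a_ad) in enumerate(self.adverts):
            for b, b_ad in self.adverts[i + 1:]:
                # Symmetric: each side's constraint is evaluated on the pair.
                if a.constraint(a_ad, b_ad) is True and b.constraint(b_ad, a_ad) is True:
                    a.inbox.append(b.name)                # notify both parties
                    b.inbox.append(a.name)
                    matches.append((a, b))
        return matches


def claim(a: Party, b: Party) -> bool:                    # step 4
    """Re-verify both constraints against the parties' current ads."""
    return a.constraint(a.ad, b.ad) is True and b.constraint(b.ad, a.ad) is True


machine = Party("crow", {"Type": "Machine", "Memory": 64},
                lambda me, other: other.get("Type") == "Job")
job = Party("sim", {"Type": "Job", "Memory": 31},
            lambda me, other: other.get("Type") == "Machine"
            and other.get("Memory", 0) >= me["Memory"])

mm = Matchmaker()
mm.advertise(machine)
mm.advertise(job)
matches = mm.match()
claimed_fresh = claim(machine, job)
machine.ad["Memory"] = 16          # state changes after the ad was sent
claimed_stale = claim(machine, job)
```

With these toy ads the matchmaker pairs the two parties, the first claim succeeds, and the second claim fails because the machine's current state no longer satisfies the job's constraint.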
17. ClassAd example
- Sample ClassAd of a job
  [
    Type = "Job";
    Owner = "run_sim";
    Constraint = other.Type == "Machine" && Arch == "INTEL" && OpSys == "Solaris251" && other.Memory >= Memory
  ]
- Sample ClassAd of a workstation
  [
    Type = "Machine";
    OpSys = "Linux";
    Arch = "INTEL";
    Memory = 256; // MB
    Constraint = true
  ]
18. Example ClassAd (workstation)
  [
    Type = "Machine";
    Activity = "Idle";
    Name = "crow.cs.wisc.edu";
    Arch = "INTEL";
    OpSys = "Solaris251";
    Kflops = 21893;
    Memory = 64;
    Disk = 323496; // KB
    DayTime = 36107;
19. Example ClassAd (contd.)
    ResearchGrp = { "miron", "thain", "john" };
    Untrusted = { "bgates", "lalooyadav", "thief" };
    Rank = member(other.Owner, ResearchGrp) * 10;
    Constraint = !member(other.Owner, Untrusted) && Rank >= 10 ? true : false // to prevent malicious users
  ]
20. Example ClassAd (submitted job)
  [
    Type = "Job";
    QDate = 886799469;
    Owner = "raman";
    Cmd = "run_sim";
    Iwd = "/usr/raman/sim2";
    Memory = 31;
    Rank = Kflops/1e3 + other.Memory/32;
    Constraint = other.Type == "Machine" && OpSys == "Solaris251" && Disk >= 10000 && other.Memory >= self.Memory
  ]
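As a worked example, the workstation and job ads from slides 18 to 20 can be transcribed into Python and evaluated by hand. This is a sketch, not a ClassAd evaluator, and it rests on assumptions: the comparisons are read as `>=`, the constraint terms are joined with logical AND, the Rank terms are added, and unqualified names in the job's expressions (OpSys, Disk, Kflops) are looked up in the other ad. The function names are invented for the example.

```python
# Workstation ad from slides 18-19 (attribute values as given on the slides).
machine = {
    "Type": "Machine", "Activity": "Idle", "Name": "crow.cs.wisc.edu",
    "Arch": "INTEL", "OpSys": "Solaris251", "Kflops": 21893,
    "Memory": 64, "Disk": 323496, "DayTime": 36107,
    "ResearchGrp": {"miron", "thain", "john"},
    "Untrusted": {"bgates", "lalooyadav", "thief"},
}

# Job ad from slide 20.
job = {
    "Type": "Job", "QDate": 886799469, "Owner": "raman", "Cmd": "run_sim",
    "Iwd": "/usr/raman/sim2", "Memory": 31,
}


def machine_rank(me, other):
    # Rank = member(other.Owner, ResearchGrp) * 10
    return int(other["Owner"] in me["ResearchGrp"]) * 10


def machine_constraint(me, other):
    # Constraint = !member(other.Owner, Untrusted) && Rank >= 10 ? true : false
    return other["Owner"] not in me["Untrusted"] and machine_rank(me, other) >= 10


def job_rank(me, other):
    # Rank = Kflops/1e3 + other.Memory/32 (Kflops assumed to come from the machine)
    return other["Kflops"] / 1e3 + other["Memory"] / 32


def job_constraint(me, other):
    return (other["Type"] == "Machine" and other["OpSys"] == "Solaris251"
            and other["Disk"] >= 10000 and other["Memory"] >= me["Memory"])


job_ok = job_constraint(job, machine)          # job accepts the machine
job_score = job_rank(job, machine)             # 21.893 + 2.0
machine_ok = machine_constraint(machine, job)  # machine's view of owner "raman"
machine_ok_for_miron = machine_constraint(machine, {**job, "Owner": "miron"})
```

Under this reading the job accepts the workstation with a rank of about 23.893, but the workstation does not accept a job owned by raman, because raman is not in ResearchGrp and the workstation's Rank is therefore 0. The same job owned by miron would satisfy both constraints.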
21. Matchmaking
- Evaluates expressions in an environment that allows each ClassAd to access attributes of the other
  - e.g., other.Memory >= self.Memory
- A reference to a non-existent attribute evaluates to undefined
- Considers pairs of ads incompatible unless their Constraint expressions both evaluate to true
- Rank is then used to choose among compatible matches
- Both parties are notified about the match; this could include generating and handing off a session key for authentication and security
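The rules above (undefined for missing attributes, compatibility only when both constraints are exactly true, Rank as tie-breaker) can be sketched as follows. This is an illustrative Python model, not the ClassAd evaluator: the `UNDEFINED` sentinel and the helpers `lookup`, `ge`, `compatible` and `best_match` are invented here, and only one comparison operator is modeled, whereas the real language defines how undefined behaves for every operator.

```python
class Undefined:
    """Sentinel for a reference to an attribute that the ad does not define."""

    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = Undefined()


def lookup(ad, name):
    """Attribute reference: a missing attribute evaluates to UNDEFINED."""
    return ad.get(name, UNDEFINED)


def ge(a, b):
    """a >= b, propagating UNDEFINED instead of raising an error."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a >= b


def compatible(ad_a, con_a, ad_b, con_b):
    """Both constraints must evaluate to exactly True; UNDEFINED does not count."""
    return con_a(ad_a, ad_b) is True and con_b(ad_b, ad_a) is True


def best_match(job, job_con, job_rank, machines):
    """Pick the compatible machine ad with the highest job-side rank, or None."""
    candidates = [m for m, m_con in machines
                  if compatible(job, job_con, m, m_con)]
    return max(candidates, key=lambda m: job_rank(job, m), default=None)


job = {"Memory": 31}
job_con = lambda me, other: ge(lookup(other, "Memory"), lookup(me, "Memory"))
job_rank = lambda me, other: other["Memory"]   # prefer more memory
accept_all = lambda me, other: True

machines = [
    ({"Name": "a", "Memory": 64}, accept_all),
    ({"Name": "b", "Memory": 128}, accept_all),
    ({"Name": "c"}, accept_all),               # advertises no Memory attribute
]
best = best_match(job, job_con, job_rank, machines)
```

Machine c is skipped because the job's constraint evaluates to undefined for it, and b is chosen over a because it ranks higher.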
22. Separation of Matching and Claiming
- Weak consistency requirements: claiming allows provider and customer to verify their constraints with respect to their current state
- Claiming protocol could use cryptographic techniques (authentication)
- Principals involved in a match are themselves responsible for establishing, maintaining and servicing a match
23. Work outside the Condor kernel: new challenges
- Multilateral matchmaking: Gangmatching
- I/O regulation and disk allocation: Kangaroo
- User interfaces: ClassadView
- Grid applications: Globus
- Security
24. Summary
- Matchmaking provides a scalable and robust resource management solution for HTC environments
- ClassAds are used by workstations and jobs
- Matchmaker forms the match and informs the parties, who in turn invoke the claiming protocol
- The parties are responsible for establishing, maintaining and servicing a match
- Questions?
25. Gangmatch request
  [
    Type = "Job";
    Owner = "raj";
    Cmd = "run_sim";
    Ports = {
      [
        Label = "cpu";
        ImageSize = 28M;
        // Rank and constraints
      ],
      [
        Label = "License";
        Host = cpu.Name;
        // Rank and constraints
      ]
    }
  ]
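One way to picture a multi-port request like this is as a small search: bind each port in order to a distinct offer, letting later ports refer to earlier bindings. The Python below is a naive backtracking sketch under stated assumptions, not the gangmatching algorithm itself. The function `gangmatch`, the port and offer dictionaries, and the reading of `Host = cpu.Name` as "the license must be valid on the machine bound to the cpu port" are all assumptions for illustration; Rank and ImageSize are left out.

```python
def gangmatch(ports, offers, bindings=None):
    """Bind every port, in order, to a distinct offer.

    Returns {label: offer} or None. Port labels are assumed to be unique.
    """
    bindings = bindings or {}
    if len(bindings) == len(ports):
        return bindings                      # every port is bound
    port = ports[len(bindings)]
    used = [id(o) for o in bindings.values()]
    for offer in offers:
        if id(offer) in used:
            continue                         # each offer serves at most one port
        if port["constraint"](offer, bindings):
            result = gangmatch(ports, offers,
                               {**bindings, port["label"]: offer})
            if result is not None:
                return result                # first complete gang wins
    return None                              # backtrack: no offer fits this port


ports = [
    {"label": "cpu",
     "constraint": lambda offer, bound: offer.get("Type") == "Machine"},
    {"label": "License",
     # Assumed meaning of Host = cpu.Name: license tied to the chosen machine.
     "constraint": lambda offer, bound: offer.get("Type") == "License"
     and offer.get("Host") == bound["cpu"]["Name"]},
]

offers = [
    {"Type": "Machine", "Name": "m1"},
    {"Type": "Machine", "Name": "m2"},
    {"Type": "License", "Host": "m2"},       # hypothetical license valid on m2
]
gang = gangmatch(ports, offers)
```

The search first tries m1 for the cpu port, finds no license valid there, backtracks, and ends with the cpu port bound to m2 and the License port bound to the m2 license.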