Title: Grid Scheduling through Service-Level Agreement
1Grid Scheduling through Service-Level Agreement
- Karl Czajkowski
- The Globus Project
- http://www.globus.org/
2Overview
- Introduction to Grid Environments
- The Resource Management Problem
- Cross-domain applications
- Resource owner goals vs. application goals
- An Open Architecture to Manage Resources
- Service-Level Agreement (SLA)
- GRAM and Managed Services
- Related and Ongoing Work
3Grid Resource Environment
[Figure: dispersed users and resources (R, some marked ?) connected by a network and grouped into VO-A and VO-B]
- Distributed users and resources
- Variable resource status
- Variable grouping and connectivity
- Decentralized scheduling/policy
4Social/Policy Conflicts
- Application Goals
- Users' deadlines and availability goals
- Applications need coordinated resources
- Localized Resource Owner Goals
- Policies towards users
- Optimization goals
- Community Goals Emerge As
- An aggregate user/application?
- A virtual resource? Both!
5Data-Intensive Example
- Concurrent resource requirements
- Large scale storage, computing, network, graphics
- Datapath involves autonomous domains
6Early Co-Allocation in Grids
- SF-Express (1997-8)
- Real-time simulation
- 12 supercomputers, 1400 processors
- Required advance reservation
- Brokered by telephone!
- Globus DUROC software to sync startup
- Over 45 minutes to recover from failure
- In use today in MPICH-G2 (MPI library)
7Traditional Scheduling
- Closed-System Model
- Presumption of global owner/authority
- Sandboxed applications with no interactions
- Toss job over the fence and wait
- Utilization as Primary Metric
- Deep batch queues allow tighter packing
- No incentives for matching user schedule
- Sub-cultures Counter Site Policies
- Users learn tricks for gaming their site
8An Open Negotiation Model
- Resources in a Global Context
- Advertisement and negotiation
- Normalized remote client interface
- Resource maintains autonomy
- Users or Agents Bridge Resources
- Drive task submission and provisioning
- Coordinate acts across domains
- Community-based Mediation
- Coordination for collective interest
9Community Scheduling Example
- Individual users
- Require service
- Have application goals
- Community schedulers
- Broker service
- Aggregate scheduling
- Individual resources
- Provide service
- Have policy autonomy
- Serve above clients
10Negotiation Phases
- Discovery
- What resources are relevant to interest?
- Finds service providers
- Monitoring
- What's happening to them now?
- Compare service providers
- Service-Level Agreement
- Will they provide what I need?
- The core Resource Management problem
- Process can iterate due to adaptation
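The three phases can be read as a loop. The following is a minimal Python sketch of that loop, not Globus code: the `Provider` class, its fields, and the `negotiate` function are hypothetical names, and monitoring is reduced to reading a free-CPU count.

```python
from dataclasses import dataclass

@dataclass
class Provider:
    """Hypothetical service provider record (not a Globus type)."""
    name: str
    kind: str        # what the provider offers, e.g. "cpu"
    free_cpus: int   # current status, as seen by monitoring

    def request_sla(self, cpus: int) -> bool:
        """Accept the agreement only if capacity is still available."""
        if cpus <= self.free_cpus:
            self.free_cpus -= cpus
            return True
        return False

def negotiate(providers, kind, cpus):
    # Discovery: which providers are relevant to this interest?
    relevant = [p for p in providers if p.kind == kind]
    # Monitoring: compare providers by their current status.
    ranked = sorted(relevant, key=lambda p: p.free_cpus, reverse=True)
    # Service-level agreement: ask each candidate in turn; a refusal
    # moves the client on to the next one (the "iterate" step).
    for p in ranked:
        if p.request_sla(cpus):
            return p.name
    return None

providers = [
    Provider("site-a", "cpu", 16),
    Provider("site-b", "cpu", 64),
    Provider("archive", "disk", 0),
]
chosen = negotiate(providers, "cpu", 32)
```

Here `chosen` names the provider that accepted; a result of `None` would mean the client has to adapt its request and start over.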
11Service-Level Agreement
- Three kinds of SLA
- Task submission (do something)
- Resource reservation (pre-agreement)
- Lazy task/resource Binding (apply resv.)
- Simple protocol for negotiating SLAs
- Basic 2-party negotiation
- Support for basic offer/accept pattern
- Optional counter-offer patterns
- Variable commitment phase for stricter promises
- Client may maintain multiple 2-party SLAs
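The offer/accept pattern with an optional counter-offer might look like the sketch below. It is an illustration under stated assumptions, not the protocol's wire format: `Offer`, `Provider.respond`, and the single counter-offer round are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Offer:
    cpus: int
    hours: int

class Provider:
    """Hypothetical provider that can grant at most `max_cpus`."""
    def __init__(self, max_cpus: int):
        self.max_cpus = max_cpus

    def respond(self, offer: Offer) -> Tuple[str, Offer]:
        if offer.cpus <= self.max_cpus:
            return ("accept", offer)
        # Counter-offer: same duration, fewer CPUs.
        return ("counter", Offer(self.max_cpus, offer.hours))

def negotiate(provider: Provider, offer: Offer, min_cpus: int) -> Optional[Offer]:
    """Basic offer/accept, with one optional counter-offer round."""
    verdict, terms = provider.respond(offer)
    if verdict == "accept":
        return terms
    if terms.cpus >= min_cpus:
        # The client accepts the counter-offer by re-submitting it.
        verdict, terms = provider.respond(terms)
        return terms if verdict == "accept" else None
    return None
```

A client holding several two-party SLAs would simply call `negotiate` once per provider and keep the agreed terms for each.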
12Many Types of Service
- Must support service heterogeneity
- Resources
- Hardware: disks, CPU, memory, networks, display
- Logical: accounts, services
- Capabilities: space, throughput
- Tasks
- Data: stored file, data read/write
- Compute: execution, suspended/swapped job
- SLAs bear embedded term languages
- Isolate domain-specific details
13Domain Extension: File Transfer
- Single goal
- Reliable deadline transfer
- Specialized scheduler
- Brokers basic services
- Synthesizes new service
- Fault-handling logic
- Distributed resources
- Storage space
- Storage bandwidth
- Network bandwidth
14Technical Challenges
- Complex Security Requirements
- Global Scalability
- Similar ideals to Internet
- Interoperable infrastructure
- Policy-configurable for social needs
- Permanence or Evolve in Place
- Cannot take World off-line for service
- Over time upgrade, extend, adapt
- Accept heterogeneity
15GRAM Architecture
[Figure: GRAM architecture. Labels: Application, Planner, Information Service (Discover, Monitor), SLA implementation, Domain-specific SLA, Concrete SLA, Incremental SLAs, GRAM2 (three instances), Local resource managers, Job, CPU, Disk]
16WS-Agreement
- New standardization effort
- Generalizes GRAM ideas
- Service-oriented architecture
- Resource becomes Service Provider
- Tasks become Negotiated Services
- SLAs presented as Agreement services
- Still supports extensible domain terms
17WS-Agreement Entities
18WS-Agreement Adds Management
19Virtualized Providers
20Agreement-based Jobs
- Agreement represents queue entry
- Commitment with job parameters etc.
- Agreement Provider
- i.e. Job scheduler/Queuing system
- Management interface to service provider
- Service Provider
- i.e. scheduled resource (compute nodes)
- Service is the Job computation
21Advance Reservation for Jobs
- Schedule-based commitment of service
- Requires schedule-based SLA terms
- Optional Pre-Agreement (RSLA)
- Agreement to facilitate future Job Agreement
- Characterizes virtual resource needed for Job
- May not need full job terms
- Job Agreement almost as usual
- May exploit Pre-Agreement
- Reference existing promise of resource schedule
- May get schedule commitment in one shot
- Directly include schedule terms
- (Can think of as atomic advance reserve/claim)
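One way to picture the two paths (pre-agreement followed by a job agreement, or a one-shot job agreement with schedule terms) is a provider-side reservation table. The sketch below is an assumption-laden illustration: `ReservationBook`, its methods, and the integer time units are hypothetical and do not come from GRAM or WS-Agreement.

```python
class ReservationBook:
    """Hypothetical provider-side schedule for a fixed pool of nodes."""

    def __init__(self, total_nodes):
        self.total_nodes = total_nodes
        self.reservations = {}   # id -> (start, end, nodes)
        self._next_id = 1

    def _peak_usage(self, start, end):
        # Usage only rises at reservation start times, so checking the
        # window start and each start inside the window finds the peak.
        overlapping = [(s, e, n) for s, e, n in self.reservations.values()
                       if s < end and start < e]
        points = {start} | {s for s, _, _ in overlapping if start <= s < end}
        return max(sum(n for s, e, n in overlapping if s <= t < e)
                   for t in points)

    def reserve(self, start, end, nodes):
        """Pre-agreement: promise `nodes` for the half-open window [start, end)."""
        if self._peak_usage(start, end) + nodes > self.total_nodes:
            return None
        rid = self._next_id
        self._next_id += 1
        self.reservations[rid] = (start, end, nodes)
        return rid

    def submit_job(self, nodes, reservation_id=None, start=None, end=None):
        """Job agreement: reference an existing promise, or reserve in one shot."""
        if reservation_id is None:
            reservation_id = self.reserve(start, end, nodes)
        promise = self.reservations.get(reservation_id)
        if promise is None or nodes > promise[2]:
            return None
        s, e, _ = promise
        return {"reservation": reservation_id, "start": s, "end": e, "nodes": nodes}
```

Calling `reserve` and then `submit_job(..., reservation_id=...)` corresponds to the pre-agreement path; calling `submit_job` with `start` and `end` corresponds to the atomic reserve/claim.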
22Need for Complex Description
- 128 physical nodes
- Physical topology
- Interconnect
- RAM, disk size
- Subject of RSLA
- Single MPI job
- Subject of TSLA
- May reference RSLAs
- Quality requirements
- Real-time parameters
- CPU, disk performance
- Subject of BSLA
23MDS Resource Models (History)
24Future Models
- Service behavioral descriptions
- Unified service term model
- Capture user/application requirements
- Capture provider capabilities
- Core meta-language
- Facilitates planner/decision designs
- Extends with domain concepts
- Extensible negotiability mark-up
- Capture range of negotiability for variable terms
- Capture importance of terms (required/optional)
- Capture cost of options (fees/penalties)
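To make the negotiability mark-up concrete, a term could carry its negotiable range, its importance, and its cost. The following sketch is purely illustrative: the `Term` fields and the `evaluate` function are hypothetical and do not reflect any defined term language.

```python
from dataclasses import dataclass

@dataclass
class Term:
    """Hypothetical negotiable term; the field names are illustrative."""
    name: str
    preferred: float
    lowest: float            # bottom of the negotiable range
    required: bool = True    # importance: required vs. optional
    unit_cost: float = 0.0   # fee per unit granted

def evaluate(terms, available):
    """Return (granted, total_fee), or None if a required term cannot be met."""
    granted, fee = {}, 0.0
    for t in terms:
        value = min(t.preferred, available.get(t.name, 0.0))
        if value < t.lowest:
            if t.required:
                return None
            continue  # an optional term is dropped from the agreement
        granted[t.name] = value
        fee += value * t.unit_cost
    return granted, fee

terms = [
    Term("cpus", preferred=64, lowest=32, required=True, unit_cost=0.5),
    Term("display", preferred=2, lowest=1, required=False, unit_cost=10.0),
]
```

With these terms, a provider offering 48 CPUs and no display still yields an agreement, while one offering only 16 CPUs does not, because the required term falls below its negotiable range.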
25SLA Types in Depth
- Resource SLA (RSLA), i.e. reservation
- A promise of resource availability
- Client must utilize promise in subsequent SLAs
- Task SLA (TSLA), i.e. execution
- A promise to perform a task
- Complex task requirements
- May reference an RSLA (implicit binding)
- Binding SLA (BSLA), i.e. claim
- Binds a resource capability to a TSLA
- May reference an RSLA (otherwise obtain implicitly)
- May be created lazily to provision the task
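The relationship among the three SLA types can be sketched as a small data model. This is an illustration only: the class fields and the `bind` helper are hypothetical, and the "implicit" reservation stands in for whatever the provider would obtain on the client's behalf.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RSLA:
    """Reservation: a promise of resource availability."""
    resource: str
    amount: int

@dataclass
class TSLA:
    """Execution: a promise to perform a task."""
    task: str
    rsla: Optional[RSLA] = None   # optional reference (implicit binding)

@dataclass
class BSLA:
    """Claim: binds a resource capability to a task."""
    tsla: TSLA
    rsla: RSLA
    amount: int

def bind(tsla: TSLA, amount: int, rsla: Optional[RSLA] = None) -> BSLA:
    """Create a BSLA, obtaining the reservation implicitly if none is given."""
    if rsla is not None:
        source = rsla
    elif tsla.rsla is not None:
        source = tsla.rsla
    else:
        source = RSLA("implicit", amount)  # assumption: provider reserves for us
    if amount > source.amount:
        raise ValueError("binding exceeds the reserved amount")
    return BSLA(tsla, source, amount)
```

Because `bind` can be called at any point after the TSLA exists, the binding can be created lazily, just before the task needs the resource.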
26Resource Lifecycle
- S0: Start with no SLAs
- S1: Create SLAs
- TSLA or RSLA
- S2: Bind task/resource
- Explicit BSLA
- Implicit provider schedule
- S3: Active task
- Resource consumption
- Backtrack to S0
- On task completion
- On expiration
- On failure
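The lifecycle reads naturally as a small state machine. In the sketch below the event names (`create_sla`, `bind`, `start`, `complete`, `expire`, `fail`) are hypothetical, and as a simplification the backtracking events are accepted from any state.

```python
from enum import Enum

class State(Enum):
    S0 = "no SLAs"
    S1 = "SLAs created"
    S2 = "task/resource bound"
    S3 = "task active"

# Forward transitions, keyed by (current state, event).
TRANSITIONS = {
    (State.S0, "create_sla"): State.S1,   # TSLA or RSLA
    (State.S1, "bind"): State.S2,         # explicit BSLA or implicit schedule
    (State.S2, "start"): State.S3,        # resource consumption begins
}

# Events that send the resource back to S0.
BACKTRACK = {"complete", "expire", "fail"}

def step(state, event):
    """Return the next state, or raise ValueError for an illegal event."""
    if event in BACKTRACK:
        return State.S0
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in {state.name}") from None

state = State.S0
for event in ["create_sla", "bind", "start", "complete"]:
    state = step(state, event)
```

After the loop the resource is back in `State.S0`, ready for a new round of SLAs.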
27Incremental Negotiation
- RSLA: reserve resources for future use
- TSLA: submit task to scheduler
- BSLA: bind reservation to task
- Resources change state due to SLAs and scheduler decisions
28Linking SLAs for Complex Case
[Figure: linked SLAs along a time axis. TSLA1: account tmpuser1. RSLA1: 50 GB in /scratch filesystem. BSLA1: 30 GB for /scratch/tmpuser1/foo/ files. TSLA2: complex job. Other labels: TSLA3, TSLA4, RSLA2 (Net), Stage in, Stage out, BSLA2]
- Dependent SLAs nest intrinsically
- BSLA2 defined in terms of RSLA2 and TSLA4
- Chained SLAs simplify negotiation
- Optionally link destruction/reclamation
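Nesting and linked destruction can be sketched as a dependency graph in which destroying an SLA also reclaims everything defined in terms of it. The `SLA` class below is hypothetical; only the names RSLA2, TSLA4, and BSLA2 are taken from the slide.

```python
class SLA:
    """Hypothetical SLA node that records which SLAs it depends on."""

    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)
        self.dependents = []
        self.active = True
        for parent in self.depends_on:
            parent.dependents.append(self)

    def destroy(self):
        """Destroy this SLA and, because destruction is linked, its dependents."""
        if not self.active:
            return []
        self.active = False
        removed = [self.name]
        for child in self.dependents:
            removed.extend(child.destroy())
        return removed

# BSLA2 is defined in terms of RSLA2 and TSLA4.
rsla2 = SLA("RSLA2")
tsla4 = SLA("TSLA4")
bsla2 = SLA("BSLA2", depends_on=[rsla2, tsla4])

removed = rsla2.destroy()
```

Destroying `rsla2` takes `bsla2` with it but leaves `tsla4` active, since the task agreement does not depend on the reservation.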
29Related Work
- Academic Contemporaries
- Condor Matchmaking
- Economy-based Scheduling
- Work-flow Planning
- Commercial Scheduler Examples
- Many examples for traditional sites
- Several generalized for the enterprise
- Platform Computing
- LSF scaled to lots of jobs
- MultiCluster for site-to-site resource sharing
- IBM eWLM
- Goal-based provisioning of transactional flows
30Condor Matchmaking
- At heart a scheduling algorithm
- Heuristics for pairing job with resource
- Match symmetric Classified Ads
- Great for bulk/commodity matching
- Closed system view
- Subsumes resource through lease
- Sandboxed job environment
- Favor vertical integration over generality
- Tuned high-throughput system
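The idea of symmetric matching can be illustrated with a toy matcher in which both the job ad and the machine ad carry requirements that must hold against the other side. This is only a sketch: it uses Python dictionaries and lambdas rather than ClassAd syntax, and the greedy pairing is an assumed heuristic, not Condor's actual negotiation cycle.

```python
def matches(job, machine):
    """Symmetric match: each ad's requirements must hold against the other."""
    return job["requirements"](machine) and machine["requirements"](job)

def matchmake(jobs, machines):
    """Greedy pairing: each job takes its highest-ranked compatible free machine."""
    free = list(machines)
    pairs = []
    for job in jobs:
        candidates = [m for m in free if matches(job, m)]
        if candidates:
            best = max(candidates, key=job["rank"])
            free.remove(best)
            pairs.append((job["name"], best["name"]))
    return pairs

machines = [
    # m1's owner policy refuses jobs from the "guest" account.
    {"name": "m1", "memory": 512, "requirements": lambda j: j["owner"] != "guest"},
    {"name": "m2", "memory": 2048, "requirements": lambda j: True},
]
jobs = [
    {"name": "j1", "owner": "alice",
     "requirements": lambda m: m["memory"] >= 256, "rank": lambda m: m["memory"]},
    {"name": "j2", "owner": "guest",
     "requirements": lambda m: m["memory"] >= 256, "rank": lambda m: m["memory"]},
]
pairs = matchmake(jobs, machines)
```

With the jobs in this order, `j1` takes `m2` and `j2` is left unmatched because `m1` refuses guests; reversing the job order pairs both, which shows how much the pairing heuristic matters.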
31Condor on GRAM
- Condor already uses GRAM two ways
- GRAM treats Condor as local scheduler
- Condor uses GRAM to access resource
- Condor maps to SLA architecture
- Advertise resource ClassAd
- Submit job ClassAd (as TSLA)
- Matchmaker is a Community Scheduler
- Need SLA scalability to be practical
32Future Work
- SLA interaction with policy
- SLA negotiation subject to policy
- One SLA affects another, e.g. RSLA subdivision
- One client more important than another
- SLA implemented by low-level policies
- Domain-specific SLA maps to resource SLAs
- Resource SLAs map to resource control mechanisms
- Resource characterization
- Advertisement of resource options, cost
- Interoperable capability languages
33Conclusion
- Generic SLA management
- Compositional for complex scenarios
- Extensible for unique requirements
- Requires work on Grid service modeling
- To describe jobs, resource requirements, etc.
- Enhancement to proven architectures
- Encompasses GRAM+GARA
- Evolution of the Globus Toolkit RM
- GRAM evolving since 1997
- WS-Agreement standard in progress