Title: Resource Management of Large-Scale Applications on a Grid
1 Resource Management of Large-Scale Applications on a Grid
- Laukik Chitnis and Sanjay Ranka
- (with Paul Avery, Jang-uk In and Rick Cavanaugh)
- Department of CISE
- University of Florida, Gainesville
- ranka_at_cise.ufl.edu
- 352 392 6838
- (http://www.cise.ufl.edu/ranka/)
2 Overview
- High End Grid Applications and Infrastructure at University of Florida
- Resource Management for Grids
- Sphinx Middleware for Resource Provisioning
- Grid Monitoring for better meta-scheduling
- Provisioning Algorithm Research for multi-core
and grid environments
3 The Evolution of High-End Applications (and their system characteristics)
- Geographically distributed datasets
- High speed storage
- Gigabit networks
- Data Intensive Applications
- Large clusters
- Supercomputers
[Timeline axis: 1980, 1990, 2000]
4 Some Representative Applications
- HEP, Medicine, Astronomy, Distributed Data Mining
5 Representative Application: High Energy Physics
1000
20 countries
1-10 petabytes
1-
6 Representative Application: Tele-Radiation Therapy
RCET Center for Radiation Oncology
7 Representative Application: Distributed Intrusion Detection
NSF ITR Project: Middleware for Distributed Data Mining (PI: Ranka, joint with Kumar and Grossman)
8 Grid Infrastructure
- Florida Lambda Rail and UF
9 Campus Grid (University of Florida)
NSF Major Research Instrumentation Project (PI: Ranka, Avery et al.)
- 20 Gigabit/sec Network
- 20 Terabytes
- 2-3 Teraflops
- 10 Scientific and Engineering Applications
Gigabit Ethernet Based Cluster
Infiniband based Cluster
10 Grid Services
- The software part of the infrastructure!
11 Services offered in a Grid
Resource Management Services
Monitoring and Information Services
Data Management Services
Note that all the other services use security
services
12 Resource Management Services
- Provide a uniform, standard interface to remote resources including CPU, Storage and Bandwidth
- Main component is the remote job manager
- Ex: GRAM (Globus Resource Allocation Manager)
13 Resource Management on a Grid
[Diagram: The Grid, with GRAM in front of Site 1, Site 2, Site 3, ... Site n, whose local schedulers are LSF, Condor, PBS, and fork]
Narration: note the different local schedulers
14 Scheduling your Application
15 Scheduling your Application
- An application can be run on a grid site as a job
- The modules in grid architecture (such as GRAM) allow uniform access to the grid sites for your job
- But
- Most applications can be parallelized
- And these separate parts of it can be scheduled to run simultaneously on different sites
- Thus utilizing the power of the grid
16 Modeling an Application Workflow
- Many workflows can be modeled as a Directed Acyclic Graph
- The amount of resource required (in units of time) is known to a degree of certainty
- There is a small probability of failure in execution (in a grid environment this could happen due to resources no longer being available)
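As a minimal sketch of the model above, a workflow can be held as a DAG whose nodes carry an estimated run time and a failure probability. The task names, run times, and failure probabilities below are hypothetical, and the success probability assumes independent failures with no retries.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: task -> (estimated run time in minutes, failure probability)
tasks = {
    "stage_in": (5, 0.01),
    "simulate": (60, 0.05),
    "reconstruct": (30, 0.02),
    "analyze": (15, 0.01),
}
# task -> set of tasks it depends on (the edges of the DAG)
depends_on = {
    "simulate": {"stage_in"},
    "reconstruct": {"simulate"},
    "analyze": {"reconstruct", "stage_in"},
}

def execution_order(depends_on):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(depends_on).static_order())

def success_probability(tasks):
    """Probability that every task succeeds, assuming independent failures."""
    p = 1.0
    for _, p_fail in tasks.values():
        p *= 1.0 - p_fail
    return p

order = execution_order(depends_on)
p_ok = success_probability(tasks)
print(order, round(p_ok, 4))
```

Even with small per-task failure rates, the chance that the whole workflow runs cleanly drops quickly as tasks are added, which is why failure is part of the model.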
17 Workflow Resource Provisioning
Executing multiple workflows over distributed and adaptive (faulty) resources while managing policies
[Figure labels: Applications, Policies, Resources; Large, Precedence, Time Constraints, Data Intensive, Access Control, Priority, Quota, Multiple Ownership, Multi-core, Heterogeneous, Faulty, Distributed]
18 A Real Life Example from High Energy Physics
- Merge two grids into a single multi-VO Inter-Grid
- How to ensure that
- neither VO is harmed?
- both VOs actually benefit?
- there are answers to questions like
- With what probability will my job be scheduled and complete before my conference deadline?
- Clear need for a scheduling middleware!
19 Typical scenario
[Diagram: a VDT Client must choose among three VDT Servers]
20 Typical scenario
[Diagram: the same scenario, a VDT Client choosing among three VDT Servers]
21 Some Requirements for Effective Grid Scheduling
- Information requirements
- Past & future dependencies of the application
- Persistent storage of workflows
- Resource usage estimation
- Policies
- Expected to vary slowly over time
- Global views of job descriptions
- Request Tracking and Usage Statistics
- State information important
- Resource Properties and Status
- Expected to vary slowly with time
- Grid weather
- Latency of measurement important
- Replica management
- System requirements
- Distributed, fault-tolerant scheduling
- Customisability
- Interoperability with other scheduling systems
- Quality of Service
22 Incorporate Requirements into a Framework
[Diagram: a VDT Client must choose among three VDT Servers]
- Assume the GriPhyN Virtual Data Toolkit
- Client (request/job submission)
- Globus clients
- Condor-G/DAGMan
- Chimera Virtual Data System
- Server (resource gatekeeper)
- MonALISA Monitoring Service
- Globus services
- RLS (Replica Location Service)
23 Incorporate Requirements into a Framework
- Framework design principles
- Information driven
- Flexible client-server model
- General, but pragmatic and simple
- Avoid adding middleware requirements on grid
resources
[Diagram: VDT Client, Recommendation Engine, VDT Servers]
- Assume the Virtual Data Toolkit
- Client (request/job submission)
- Clarens Web Service
- Globus clients
- Condor-G/DAGMan
- Chimera Virtual Data System
- Server (resource gatekeeper)
- MonALISA Monitoring Service
- Globus services
- RLS (Replica Location Service)
24 Related Provisioning Software
25 Innovative Workflow Scheduling Middleware
- Modular system
- Automated scheduling procedure based on modular services
- Robust and recoverable system
- Database infrastructure
- Fault-tolerant and recoverable from internal failure
- Platform independent, interoperable system
- XML-based communication protocols
- SOAP, XML-RPC
- Supports heterogeneous service environment
- 60 Java Classes
- 24,000 lines of Java code
- 50 test scripts, 1500 lines of script code
26 The Sphinx Workflow Execution Framework
[Architecture diagram; component labels: VDT Client, Sphinx Server, Sphinx Client, Chimera Virtual Data System, Clarens WS Backbone, Request Processing, Condor-G/DAGMan, Data Warehouse, Data Management, VDT Server Site, Globus Resource, Information Gathering, Replica Location Service, MonALISA Monitoring Service]
27 Sphinx Workflow Scheduling Server
- Functions as the Nerve Centre
- Data Warehouse
- Policies, Account Information, Grid Weather, Resource Properties and Status, Request Tracking, Workflows, etc.
- Control Process
- Finite State Machine
- Different modules modify jobs, graphs, workflows, etc. and change their state
- Flexible
- Extensible
[Diagram: Sphinx Server with a Message Interface, Control Process, Data Warehouse, and modules: Graph Reducer, Job Predictor, Graph Predictor, Job Admission Control, Graph Admission Control, Graph Data Planner, Job Execution Planner, Graph Tracker, Data Management, Information Gatherer]
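The finite-state-machine control process can be sketched as a transition table in which each module picks up work in one state and hands it on in the next. The module names come from the slide, but the state names and the order of the modules are assumptions made for illustration, not the actual Sphinx state model.

```python
from dataclasses import dataclass, field

# Hypothetical job states; each module handles jobs in one state
# and moves them to the next.
TRANSITIONS = {
    "submitted": ("Graph Reducer", "reduced"),
    "reduced": ("Job Predictor", "predicted"),
    "predicted": ("Job Admission Control", "admitted"),
    "admitted": ("Job Execution Planner", "planned"),
    "planned": ("Graph Tracker", "finished"),
}

@dataclass
class Job:
    name: str
    state: str = "submitted"
    history: list = field(default_factory=list)

def step(job):
    """Advance a job by one transition; return False once no module applies."""
    if job.state not in TRANSITIONS:
        return False
    module, next_state = TRANSITIONS[job.state]
    job.history.append(module)
    job.state = next_state
    return True

def run_control_process(jobs):
    """Keep cycling over the jobs until none of them can advance."""
    progressed = True
    while progressed:
        progressed = False
        for job in jobs:
            if step(job):
                progressed = True
    return jobs

jobs = run_control_process([Job("job-1"), Job("job-2")])
print([(job.name, job.state) for job in jobs])
```

Keeping the transitions in a table is one way to get the flexibility and extensibility the slide mentions: adding a module means adding a row, not rewriting the loop.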
28 SPHINX
- Scheduling in Parallel for Heterogeneous Independent NetworXs
29 Policy Based Scheduling
- Sphinx provides soft QoS through time dependent, global views of
- Submissions (workflows, jobs, allocation, etc.)
- Policies
- Resources
- Uses Linear Programming Methods
- Satisfy Constraints
- Policies, User-requirements, etc.
- Optimize an objective function
- Estimate probabilities to meet deadlines within policy constraints
- J. In, P. Avery, R. Cavanaugh, and S. Ranka, "Policy Based Scheduling for Simple Quality of Service in Grid Computing", in Proceedings of the 18th IEEE IPDPS, Santa Fe, New Mexico, April 2004
[Diagram: Policy Space with axes Submissions, Resources, Time]
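The idea of satisfying policy constraints while optimizing an objective can be shown on a toy problem. The sketch below is not the linear program used in Sphinx: it swaps in an exhaustive search over a tiny assignment problem, and the jobs, sites, quotas, and time estimates are all hypothetical.

```python
from itertools import product

# Hypothetical inputs: CPU-hours each job needs, the quota left for this user
# at each site under the policy, and estimated completion time (hours).
demand = {"j1": 4, "j2": 3, "j3": 2}
quota = {"siteA": 5, "siteB": 6}
est_time = {
    ("j1", "siteA"): 2.0, ("j1", "siteB"): 3.0,
    ("j2", "siteA"): 1.5, ("j2", "siteB"): 2.5,
    ("j3", "siteA"): 1.0, ("j3", "siteB"): 1.2,
}

def within_policy(assignment):
    """Constraint: total demand placed on a site must not exceed its quota."""
    used = {site: 0 for site in quota}
    for job, site in assignment.items():
        used[site] += demand[job]
    return all(used[site] <= quota[site] for site in quota)

def best_assignment():
    """Objective: minimise the summed estimated completion time."""
    jobs, sites = list(demand), list(quota)
    best, best_cost = None, float("inf")
    for choice in product(sites, repeat=len(jobs)):
        assignment = dict(zip(jobs, choice))
        if not within_policy(assignment):
            continue
        cost = sum(est_time[(job, site)] for job, site in assignment.items())
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost

assignment, cost = best_assignment()
print(assignment, cost)
```

Every job is fastest on siteA, but the quota there cannot hold all three, so the largest job is pushed to siteB. Exhaustive search only works at this toy size, which is where an LP formulation becomes necessary.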
30 Ability to tolerate task failures
Jang-uk In, Sanjay Ranka et al., "SPHINX: A fault-tolerant system for scheduling in dynamic grid environments", in Proceedings of the 19th IEEE IPDPS, Denver, Colorado, April 2005
- Significant Impact of using feedback information
31 Grid Enabled Analysis
32 Distributed Services for Grid Enabled Data Analysis
[Diagram: distributed sites running Clarens, Globus, GridFTP, and MonALISA services]
33 Evaluation of Information gathered from grid monitoring systems
34 Limitation of Existing Monitoring Systems for the Grid
- Information aggregated across multiple users is not very useful in effective resource allocation.
- An end-to-end parameter such as Average Job Delay (the average queuing delay experienced by a job of a given user at an execution site) is a better estimate for comparing the resource availability and response time for a given user.
- It is also not very susceptible to monitoring latencies.
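Average Job Delay as defined above can be computed directly from job records. This is a rough sketch: the record layout (user, site, submit time, start time) and the sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical job records: (user, site, submit_time, start_time) in seconds.
records = [
    ("alice", "siteA", 0, 120),
    ("alice", "siteA", 60, 300),
    ("alice", "siteB", 0, 30),
    ("bob", "siteA", 0, 10),
    ("bob", "siteB", 0, 600),
]

def average_job_delay(records):
    """Mean queuing delay (start - submit) for each (user, site) pair."""
    totals = defaultdict(lambda: [0.0, 0])
    for user, site, submitted, started in records:
        entry = totals[(user, site)]
        entry[0] += started - submitted
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

ajd = average_job_delay(records)
for (user, site), delay in sorted(ajd.items()):
    print(user, site, delay)
```

In this made-up data the two users see opposite pictures of the same sites, which a figure aggregated across users would hide.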
35 Effective DAG Scheduling
- The completion time based algorithm here uses the Average Job Delay parameter for scheduling
- As seen in the adjoining figure, it outperforms the algorithms tested with other monitored parameters.
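The slides do not spell out the completion time based algorithm. One plausible reading, sketched here as an assumption, is to estimate completion time at each site as Average Job Delay plus expected run time and pick the minimum. The site names and numbers are hypothetical.

```python
def pick_site(avg_job_delay, run_time):
    """Return (site, estimate) minimising queuing delay plus run time."""
    estimates = {site: avg_job_delay[site] + run_time[site] for site in run_time}
    site = min(estimates, key=estimates.get)
    return site, estimates[site]

# Hypothetical per-site values for one user, in seconds.
avg_job_delay = {"siteA": 180.0, "siteB": 30.0, "siteC": 400.0}
run_time = {"siteA": 100.0, "siteB": 300.0, "siteC": 50.0}

site, estimate = pick_site(avg_job_delay, run_time)
print(site, estimate)
```

Neither the site with the shortest queue nor the fastest site wins here; the combined estimate is what decides.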
36 Work in Progress: Modeling Workflow Cost and developing efficient provisioning algorithms
- 1. Developing an objective measure of completion time
- Integrating performance and reliability of workflow execution: P(Time to complete > T) < epsilon
- 2. Relating this measure to the properties of the longest path of the DAG based on the mean and uncertainty of time required for underlying tasks due to
- 1) variable time requirements due to different parameter values
- 2) failure due to change of the underlying resources etc.
- 3. Developing novel scheduling and replication techniques to optimize allocation based on these metrics.
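The measure P(Time to complete > T) < epsilon can be illustrated on a single path. The sketch below assumes that task times on the longest path are independent and that their sum is roughly normal; it also ignores the chance that another path becomes critical. The task means, standard deviations, deadline, and epsilon are hypothetical.

```python
import math

def path_stats(path):
    """Mean and standard deviation of the summed task times on a path."""
    mean = sum(m for m, _ in path)
    variance = sum(s * s for _, s in path)
    return mean, math.sqrt(variance)

def prob_exceeds(path, deadline):
    """P(time to complete > deadline) under a normal approximation."""
    mean, std = path_stats(path)
    z = (deadline - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical longest path: (mean, standard deviation) per task, in hours.
longest_path = [(4.0, 1.2), (3.0, 1.6), (3.0, 0.0)]
deadline, epsilon = 12.0, 0.2

p_late = prob_exceeds(longest_path, deadline)
meets_target = p_late < epsilon
print(round(p_late, 4), meets_target)
```

The same check could drive a replication decision: if the probability of lateness is above epsilon, replicating the most uncertain task is one candidate response.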
37 Work in Progress: Provisioning algorithms for multiple workflows (Yield Management)
[Figure: Multiple Workflows; Dag 1 to Dag 5 arranged across Level 1 to Level 4, shown in two panels]
- Quality of Service guarantees for each workflow
- Controlled (a cluster of multi-core processors) versus uncontrolled (grid of multiple clusters owned by multiple units) environment
38 CHEPREO - Grid Education and Networking
- E/O Center in Miami area
- Tutorial for Large Scale Application Development
39 Grid Education
- Developing a Grid tutorial as part of CHEPREO
- Grid basics
- Components of a Grid
- Grid Services: OGSA
- OSG summer workshop
- South Padre Island, Texas, July 11-15, 2005
- http://osg.ivdgl.org/twiki/bin/view/SummerGridWorkshop/
- Lectures and Hands-on sessions
- Building and Maintaining a Grid
40 Acknowledgements
- CHEPREO project, NSF
- GriPhyN/iVDGL, NSF
- Data Mining Middleware, NSF
- Intel Corporation
41 Thank You
- May the Force be with you!
42 Additional slides
43 Effect of latency on Average Job Delay
- Latency is simulated in the system by purposely retrieving old values for the parameter while making scheduling decisions
- The correlation indices with added latencies are comparable, though lower as expected, to the correlation indices of the un-delayed Average Job Delay parameter. The amount of correlation is still quite high.
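The latency experiment can be mimicked in a few lines: feed the scheduler readings that are a fixed number of steps old and compare how well fresh and stale values correlate with what jobs actually experienced. The two series below are made-up illustrative inputs, not measurements from the study, and Pearson correlation is an assumed choice of correlation index.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def with_latency(series, lag):
    """Values a scheduler would see if every reading were `lag` steps old."""
    return [series[max(i - lag, 0)] for i in range(len(series))]

# Hypothetical data: monitored Average Job Delay per interval and the
# completion time of a job submitted in that interval (seconds).
ajd = [10, 12, 15, 20, 26, 30, 33, 35, 36, 36]
completion = [110, 113, 114, 121, 125, 131, 132, 136, 135, 137]

r_fresh = pearson(ajd, completion)
r_stale = pearson(with_latency(ajd, 2), completion)
print(round(r_fresh, 3), round(r_stale, 3))
```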
44 SPHINX Scheduling Latency
Average scheduling latency for various numbers of DAGs (20, 40, 80 and 100) with different arrival rates per minute.
45 Demonstration at Supercomputing Conference: Distributed Data Analysis in a Grid Environment
The architecture has been implemented and demonstrated at SC03 (Arizona, USA, 2003) and SC04.
46 Scheduling DAGs: Dynamic Critical Path Algorithm
- The DCP algorithm executes the following steps iteratively
- Compute the earliest possible start time (AEST) and the latest possible start time (ALST) for all tasks on each processor.
- Select a task which has the smallest difference between its ALST and AEST and has no unscheduled parent task. If there are tasks with the same differences, select the one with a smaller AEST.
- Select a processor which gives the earliest start time for the selected task
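One iteration of the steps above can be sketched as follows. This is a simplified illustration, not the full DCP algorithm: it ignores communication costs, computes AEST and ALST once rather than per processor, and does not recompute them after each placement. The DAG, run times, and processor availability are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: task -> run time, and task -> set of parent tasks.
run_time = {"a": 2, "b": 3, "c": 1, "d": 2}
parents = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

def start_times(run_time, parents):
    """Earliest (AEST) and latest (ALST) start times for every task."""
    order = list(TopologicalSorter(parents).static_order())
    aest = {}
    for task in order:
        aest[task] = max((aest[p] + run_time[p] for p in parents[task]), default=0)
    makespan = max(aest[t] + run_time[t] for t in order)  # critical path length
    children = {t: [c for c in order if t in parents[c]] for t in order}
    alst = {}
    for task in reversed(order):
        latest_finish = min((alst[c] for c in children[task]), default=makespan)
        alst[task] = latest_finish - run_time[task]
    return aest, alst

def select_task(aest, alst, parents, scheduled):
    """Ready task with the smallest ALST - AEST; ties go to the smaller AEST."""
    ready = [t for t in aest if t not in scheduled and parents[t] <= scheduled]
    return min(ready, key=lambda t: (alst[t] - aest[t], aest[t]))

def select_processor(task, aest, free_at):
    """Processor on which the task can start earliest."""
    return min(free_at, key=lambda p: max(free_at[p], aest[task]))

aest, alst = start_times(run_time, parents)
first = select_task(aest, alst, parents, scheduled=set())
second = select_task(aest, alst, parents, scheduled={"a"})
processor = select_processor(second, aest, free_at={"p0": 4, "p1": 1})
print(first, second, processor)
```

Tasks whose ALST equals their AEST have no slack and lie on the critical path, so the selection rule schedules critical-path tasks first.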
47 Scheduling DAGs: ILP - Novel algorithm to support heterogeneity (work supported by Intel Corporation)
- There are two novel features
- Assign multiple independent tasks simultaneously: the cost of a task assigned depends on the processor available, and many tasks commence with a small difference in start time.
- Iteratively refine the scheduling: each iteration uses the cost of the critical path based on the assignment in the previous iteration.
48 Comparison of different algorithms
[Figures: (a) Number of processors: 30, Number of tasks: 2000; (b) Number of processors: 30]
49 Time for Scheduling