Title: Scheduling and Resource Management for Next-generation Clusters
1. Scheduling and Resource Management for Next-generation Clusters
- Yanyong Zhang
- Penn State University
- www.cse.psu.edu/yyzhang
2. What is a Cluster?
- Cost effective
- Easily scalable
- Highly available
- Readily upgradeable
3. Scientific and Engineering Applications
- HPTi won a 5-year, $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm)
- Sandia's expansion of their Alpha-based C-plant system.
- Maui HPCC LosLobos Linux super-cluster. (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)
- A strong performance/price ratio is demonstrated in simulations of wind instruments using a cluster of 20 machines. (http://www.swiss.ai.mit.edu/pas/p/sc95.html)
- The PC-cluster-based parallel simulation environment and its technologies will have a positive impact on networking research nationwide. (http://www.osc.edu/press/releases/2001/approved.shtml)
4. Commercial Applications
- Business applications
  - Transaction processing (IBM DB2, Oracle)
  - Decision support systems (IBM DB2, Oracle)
- Internet applications
  - Web serving / searching (Google.com)
  - Infowares (Yahoo.com, AOL.com)
  - Email, eChat, ePhone, eBook, eBank, eSociety, eAnything
  - Computing portals
5. Resource Management
- Each application is demanding
- Several applications/users can be present at the same time
Resource management and quality-of-service become important.
6. System Model
- Each node is independent
- Maximum MPL (multiprogramming level) per node
- Arrival queue
7. Two Phases in Resource Management
- Allocation issues
  - Admission control
  - Arrival queue principle
- Scheduling issues (CPU scheduling)
  - Resource isolation
  - Co-allocation
8. Co-allocation / Co-scheduling
[Figure: timeline (t0, t1) showing processes P0 and P1 being co-scheduled across nodes]
9. Outline
- From the OS's perspective
  - Contribution 1: boosting CPU utilization at supercomputing centers
  - Contribution 2: providing quick responses for commercial workloads
  - Contribution 3: scheduling multiple classes of applications
- From the application's perspective
  - Contribution 4: optimizing clustered DB2
10. Contribution 1: Boosting CPU Utilization at Supercomputing Centers
11. Objective
Response time = wait time + execution time
  wait time = wait in the arrival queue + wait in the ready/blocked queue
Goal: minimize response time.
12. Existing Techniques
- Backfilling (BF)
- Gang Scheduling (GS)
- Migration (M)
[Figure: space (# of CPUs = 14) vs. time chart with jobs of sizes 2, 8, 8, 3, 2, 6, 2]
13. Proposed Scheme
- MBGS = GS + BF + M (sketched below)
- Use GS as the basic framework
- At each row of the GS matrix, apply the BF technique
- Whenever the GS matrix is re-calculated, consider M (migration).
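To make the combination concrete, below is a minimal, self-contained C sketch of how the three pieces interlock. It is an illustration only, not the scheduler used in the study: the GS matrix is reduced to a per-row count of free CPUs, migration is modeled as compaction that recovers fragmented CPUs, and backfilling simply starts any waiting job that fits (a real backfill would also check reservations).

/* mbgs_sketch.c -- illustrative only; the data structures are
 * simplified assumptions, not the study's implementation. */
#include <stdio.h>

#define ROWS 4          /* MPL: number of time slices in the GS matrix */
#define CPUS 14         /* CPUs per row (columns)                      */

int free_cpus[ROWS];    /* free CPUs left in each row                  */
int frag_cpus[ROWS];    /* CPUs currently lost to fragmentation        */

/* M: compacting a row turns fragmented CPUs back into usable ones. */
void migrate_to_compact(int row) {
    free_cpus[row] += frag_cpus[row];
    frag_cpus[row] = 0;
}

/* BF within one GS row: start any waiting job that fits right now.
 * (A real backfill would also verify no reserved job gets delayed.) */
int backfill_row(int row, int widths[], int n) {
    int started = 0;
    for (int i = 0; i < n; i++) {
        if (widths[i] > 0 && widths[i] <= free_cpus[row]) {
            free_cpus[row] -= widths[i];
            widths[i] = 0;                 /* mark job as started */
            started++;
        }
    }
    return started;
}

int main(void) {
    int waiting[] = { 8, 3, 6, 2 };        /* widths of queued jobs */
    for (int r = 0; r < ROWS; r++) {
        free_cpus[r] = CPUS - 4;           /* 4 CPUs fragmented per row */
        frag_cpus[r] = 4;
    }
    for (int r = 0; r < ROWS; r++) {       /* GS matrix is the framework */
        migrate_to_compact(r);             /* M on every matrix rebuild  */
        int s = backfill_row(r, waiting, 4);   /* BF inside the row      */
        printf("row %d: started %d job(s), %d CPUs free\n",
               r, s, free_cpus[r]);
    }
    return 0;
}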
14. How Does MBGS Perform?
15. Outline
- From the OS's perspective
  - Contribution 1: boosting CPU utilization at supercomputing centers
  - Contribution 2: providing quick responses for commercial workloads
  - Contribution 3: scheduling multiple classes of applications
- From the application's perspective
  - Contribution 4: optimizing clustered DB2
16. Contribution 2: Reducing Response Times for Commercial Applications
17. Objective
Response time = wait time + execution time
  wait time = wait in the arrival queue + wait in the ready/blocked queue
- Minimize wait time
- Minimize response time
18. Previous Work I: Gang Scheduling (GS)
(1) Time quanta are on the order of MINUTES!
(2) GS is not responsive enough!
19. Previous Work II: Dynamic Co-scheduling
[Figure: four nodes P0-P3 running processes A-D: it's A's turn, C just finished I/O, B just got a message, and everybody else is blocked]
The scheduler on each node makes independent decisions based on local events, without global synchronization.
20. Dynamic Co-scheduling Heuristics
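The heuristics themselves were presumably shown graphically on this slide. As a stand-in, here is a self-contained C sketch of the flavor of one of them, Spin-Block (SB, cited in the related work): on a receive, spin for a bounded time hoping the sender is co-scheduled right now, and block (yield the CPU) if nothing arrives. The msg_arrived() stub and the 200 µs spin limit are illustrative assumptions, not values from the study.

/* spin_block_sketch.c -- illustration of the Spin-Block heuristic. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Monotonic clock in microseconds. */
static long now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000L + ts.tv_nsec / 1000L;
}

/* Stub for "a message for this process has arrived". */
static int msg_arrived(void) { return rand() % 100000 == 0; }

/* Spin-Block receive: spin for at most spin_us, then report that the
 * caller should block and give the CPU to another process. */
static int sb_receive(long spin_us) {
    long start = now_us();
    while (now_us() - start < spin_us)
        if (msg_arrived())
            return 1;          /* sender was co-scheduled: fast path */
    return 0;                  /* spin limit hit: block               */
}

int main(void) {
    srand((unsigned)time(NULL));
    puts(sb_receive(200) ? "message arrived while spinning"
                         : "spin timed out -> block and yield the CPU");
    return 0;
}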
21. Simulation Study
- A detailed simulator at microsecond granularity
- System parameters
  - System configurations (maximum MPL, whether to partition or not)
  - System overheads (context switch overheads, interrupt costs, costs associated with manipulating queues)
22. Simulation Study (Cont'd)
- Application parameters
  - Injection load
  - Characteristics (CPU intensive, I/O intensive, communication intensive, or somewhere in the middle)
23. Impact of Load
24. Impact of Workload Characteristics
[Charts: communication-intensive and I/O-intensive workloads]
25. Periodic Boost Heuristics
- S1: compute phase
- S2: S1 + unconsumed message
- S3: in receive, message arrived
- S4: in receive, no message
- A: S3 -> {S2, S1}
- B: S3 -> S2 -> S1
- C: {S3, S2, S1}
- D: {S3, S2} -> S1
- E: S2 -> S3 -> S1
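One way to read the orders above is as rules for which local process a periodic kernel activity should boost. The following self-contained C sketch (an illustration, not the kernel code from the study) applies variant B's strict preference S3 > S2 > S1 and never boosts a process in S4.

/* periodic_boost_sketch.c -- illustrative selection rule only. */
#include <stdio.h>

enum state { S1_COMPUTE = 1,      /* compute phase                    */
             S2_UNCONSUMED = 2,   /* computing, unconsumed message    */
             S3_MSG_ARRIVED = 3,  /* in receive, message has arrived  */
             S4_NO_MSG = 4 };     /* in receive, no message yet       */

/* Variant B: S3 preferred over S2 over S1; S4 gains nothing from a boost. */
static int boost_rank(enum state s) {
    switch (s) {
    case S3_MSG_ARRIVED: return 3;
    case S2_UNCONSUMED:  return 2;
    case S1_COMPUTE:     return 1;
    default:             return 0;
    }
}

/* Pick the local process to boost this period, or -1 if none qualifies. */
int pick_process_to_boost(const enum state procs[], int n) {
    int best = -1, best_rank = 0;
    for (int i = 0; i < n; i++)
        if (boost_rank(procs[i]) > best_rank) {
            best_rank = boost_rank(procs[i]);
            best = i;
        }
    return best;
}

int main(void) {
    enum state procs[] = { S1_COMPUTE, S4_NO_MSG, S2_UNCONSUMED };
    printf("boost process %d\n", pick_process_to_boost(procs, 3));
    return 0;
}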
26. Analytical Modeling Study
- With dynamic arrivals, the exact state space is impossible to handle.
27. Analysis Description
Original state space (impossible to handle!): the vector of n_k, the number of jobs on node k, over all nodes.
Assumption: the state of each processor is stochastically independent of, and identical to, the state of the other processors.
28. Analysis Description (Cont'd)
- Derive the state transition rates using a continuous-time Markov model and build the generator matrix Q.
- Obtain the invariant probability vector π by solving πQ = 0 and πe = 1.
- Use fixed-point iteration to get the solution.
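Written out, the iteration has the following shape (the coupling quantity x and the map f are placeholders reconstructed from the slide, not the model's exact notation):

% Sketch of the per-node fixed-point scheme; x and f are placeholders
% for whatever statistics couple the otherwise independent nodes.
\begin{align*}
  \pi^{(k)}\,Q\bigl(x^{(k)}\bigr) = 0,\quad \pi^{(k)} e = 1
      &\qquad\text{(solve one node's CTMC)}\\
  x^{(k+1)} = f\bigl(\pi^{(k)}\bigr)
      &\qquad\text{(recompute the coupling quantity)}\\
  \text{repeat until}\quad \bigl\lVert x^{(k+1)} - x^{(k)} \bigr\rVert < \varepsilon
      &\qquad\text{(fixed point reached)}
\end{align*}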
29. SB (Spin-Block) Example
30. Results
- Optimal PB (Periodic Boost) frequency
- Optimal spin time for SB
31. Results: Optimal Quantum Length
[Charts: CPU-intensive, communication-intensive, and I/O-intensive workloads]
32. Outline
- From the OS's perspective
  - Contribution 1: boosting CPU utilization at supercomputing centers
  - Contribution 2: providing quick responses for commercial workloads
  - Contribution 3: scheduling multiple classes of applications
- From the application's perspective
  - Contribution 4: optimizing clustered DB2
33. Contribution 3: Scheduling Multiple Classes of Applications
- Interactive
- Real-time
- Batch
34. Objective
[Figure: best-effort jobs (λBE) and real-time jobs (λRT) arriving at the cluster]
- How long did it take me to finish? -> response time (BE jobs)
- How many deadlines have been missed? -> deadline miss rate (RT jobs)
35. Fairness Ratio (x:y)
The cluster resource is divided between the two classes in proportions x/(x+y) and y/(x+y).
36. How to Adhere to the Fairness Ratio?
37. BE Response Time
[Charts: BE response time for λRT : λBE = 2:1, 1:9, and 9:1]
38. RT Deadline Miss Rate
[Charts: RT deadline miss rate for λRT : λBE = 1:9, 2:1, and 9:1]
39. Outline
- From the OS's perspective
  - Contribution 1: boosting CPU utilization at supercomputing centers
  - Contribution 2: providing quick responses for commercial workloads
  - Contribution 3: scheduling multiple classes of applications
- From the application's perspective
  - Characterizing decision-support workloads on the clustered database server
  - Resource management for transaction-processing workloads on the clustered database server
40. Experiment Setup
- IBM DB2 Universal Database for Linux, EEE, Version 7.2
- An 8-node Linux/Pentium cluster (dual-CPU nodes), with 256 MB RAM and an 18 GB disk on each node.
- TPC-H workload. Queries are run sequentially (Q1-Q20). Completion time for each query is measured.
41. Platform
[Figure: a client issues "select * from T" to the clustered DB2 server]
42. Methodology
- Identify the components with high system overhead.
- For each such component, characterize the request distribution.
- Come up with ways to optimize.
- Quantify the potential benefits of the optimization.
43. Sampling OS Statistics
- Sample the statistics provided by /proc/stat, /proc/net/dev, and the per-process stat files (a sketch follows):
  - User/system CPU time
  - # of page faults
  - # of blocks read/written
  - # of reads/writes
  - # of packets sent/received
  - CPU utilization during I/O
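A minimal sketch of the sampling loop for /proc/stat on Linux follows; the "cpu" field layout is the standard one, the 1-second interval is an arbitrary choice, and /proc/net/dev and the per-process stat files would be diffed the same way.

/* proc_sample_sketch.c -- diff two /proc/stat samples (Linux). */
#include <stdio.h>
#include <unistd.h>

/* Read the aggregate "cpu" line: user, nice, system, idle jiffies. */
static int read_cpu(long *user, long *nice, long *sys, long *idle) {
    FILE *f = fopen("/proc/stat", "r");
    if (!f) return -1;
    int ok = fscanf(f, "cpu %ld %ld %ld %ld", user, nice, sys, idle) == 4;
    fclose(f);
    return ok ? 0 : -1;
}

int main(void) {
    long u0, n0, s0, i0, u1, n1, s1, i1;
    if (read_cpu(&u0, &n0, &s0, &i0)) return 1;
    sleep(1);                              /* sampling interval */
    if (read_cpu(&u1, &n1, &s1, &i1)) return 1;
    printf("deltas over 1s: user=%ld nice=%ld sys=%ld idle=%ld (jiffies)\n",
           u1 - u0, n1 - n0, s1 - s0, i1 - i0);
    return 0;
}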
44. Kernel Instrumentation
- Instrument each system call in the kernel.
[Figure: timeline of a system call, with timestamps at entry, exit, block, unblock, and resume-execution points]
45. Operating System Profile
- A considerable part of the execution time is taken by the pread system call.
- There is good overlap of computation with I/O for some queries.
- More reads than writes.
46. TPC-H pread Overhead
Query   % of exe time   Query   % of exe time
Q6          20.0        Q13         10.0
Q14         19.0        Q3           9.6
Q19         16.9        Q4           9.1
Q12         15.4        Q18          9.0
Q15         13.4        Q20          7.9
Q7          12.1        Q2           5.2
Q17         10.8        Q9           5.2
Q8          10.5        Q5           4.6
Q10         10.3        Q16          4.1
Q1          10.0        Q11          3.5
pread overhead = # of preads x overhead per pread.
47. pread Optimization
pread(dest, chunk):
    for each page in the chunk:
        if the page is not in the cache:
            bring it in from disk
        copy the page into dest
- Optimization
  - Re-mapping the buffer
  - Copy-on-write (a user-level illustration follows)
30 µs
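The actual remapping happens inside the kernel, but the same copy-on-write idea can be illustrated from user space: mapping the file MAP_PRIVATE lets reads share page-cache pages and defers any copy until a write, instead of paying a per-page copy inside every pread(). A self-contained C sketch (the file name and the page-touch loop are purely illustrative):

/* cow_read_sketch.c -- user-level illustration of copy-on-write reads. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    /* MAP_PRIVATE = copy-on-write: reads share page-cache pages, so
     * there is no per-page memcpy as in the pread() path shown above. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)   /* touch each page */
        sum += p[i];
    printf("touched %ld pages (checksum %ld)\n",
           (long)((st.st_size + 4095) / 4096), sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}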
48. Copy-on-write
Query   % reduction   Query   % reduction
Q1         98.9       Q11        96.1
Q2         85.7       Q12        87.1
Q3         96.0       Q13       100.0
Q4         80.9       Q14        96.1
Q5        100.0       Q15        96.8
Q6        100.0       Q16        70.7
Q7         79.7       Q17        94.5
Q8         79.3       Q18       100.0
Q9         88.7       Q19        95.7
Q10        77.8       Q20        94.4
reduction = 1 - (# of copy-on-writes / # of preads)
49. Operating System Profile
- Socket calls are the next dominant system calls.
50. Message Characteristics
[Charts for Q11 and Q16: distributions of message size (bytes), message inter-injection time (milliseconds), and message destination]
51. Observations on Messages
- Only a small set of message sizes is used.
- Many messages are sent in a short period.
- The message destination distribution is uniform.
- Many messages are point-to-point implementations of multicast/broadcast messages.
- Multicast can reduce the # of messages.
52. Potential Reduction in Messages
Reduction (%) in total, small, and large messages per query:
Query   Total   Small   Large   Query   Total   Small   Large
Q1      44.7    71.4    38.7    Q11      9.6    28.6     0.1
Q2      20.4    58.7     0.2    Q12      8.3     7.8     2.9
Q3      48.2    64.3    38.0    Q13     24.5    75.2     0.1
Q4      22.6    58.6     0.1    Q14     27.9    80.4     0.7
Q5       8.0     7.1     8.4    Q15     46.6    56.5     0.7
Q6      76.4    78.6    45.5    Q16     59.1    63.0    56.9
Q7      57.5    71.4    56.2    Q17     41.5    66.7    27.3
Q8      29.1    75.5     4.8    Q18     11.4    32.3     0.0
Q9      66.8    78.5    61.1    Q19     26.7    79.4     0.2
Q10     25.0    73.6     0.1    Q20     21.1    62.8     0.1
53. Online Algorithm
Original send:
    Send(msg, dest):
        send msg to node dest

Modified send:
    Send(msg, dest):
        if (msg == buffered_msg and dest not in dest_set)
            dest_set = dest_set U { dest }
        else
            buffer the msg

Background flush:
    Send_bg():
        foreach buffered_msg:
            if (it has been buffered longer than threshold)
                send a multicast msg to the nodes in dest_set
54. Impact of Threshold
[Charts for Q7 and Q16: behavior as a function of the threshold (milliseconds)]
55. Outline
- From the OS's perspective
  - Contribution 1: boosting CPU utilization at supercomputing centers
  - Contribution 2: providing quick responses for commercial workloads
  - Contribution 3: scheduling multiple classes of applications
- From the application's perspective
  - Characterizing decision-support workloads on the clustered database server
  - Resource management for clustered database applications
56. Ongoing/Near-term Work
- What is the optimal number of jobs that should be admitted?
- Can we dynamically pause some processes based on resource requirements and resource availability?
- Which dynamic co-scheduling scheme works best here?
- How do we exploit application-level information in scheduling?
57. Future Work
- Some next-generation applications
  - Real-time medical imaging and collaborative surgery
- Application requirements
  - Vast processing power, disk capacity, and network bandwidth
  - Absolute availability
  - Deterministic performance
58. Future Work
- Requirements
- performance
- more users
- responsive
- Quality-of-service
- availability
- security
- power consumption
- pricing model
59. Future Work
- What does it take to get there?
- Hardware innovations
- Resource management and isolation
- Good scalability
- High availability
- Deterministic Performance
60. Future Work
- Not only high performance
- Energy consumption
- Security
- Pricing for service
- User satisfaction
- System management
- Ease of use
61. Related Work
- Parallel job scheduling
  - Gang Scheduling (Ousterhout82)
  - Backfilling (Lifka95, Feitelson98)
  - Migration (Epema96)
- Dynamic co-scheduling
  - Spin Block (Arpaci-Dusseau98, Anglano00)
  - Periodic Boost (Nagar99)
  - Demand-based Coscheduling (Sobalvarro97)
62. Related Work (Cont'd)
- Real-time scheduling
  - Earliest Deadline First
  - Rate Monotonic
  - Least Laxity First
- Single-node multi-class scheduling
  - Hierarchical scheduling (Goyal96)
  - Proportional share (Waldspurger95)
- Commercial clustered servers (Pai98, reserve)
63. Related Work (Cont'd)
- Commercial workloads (CAECW, Barford99, Kant99)
- Database characterization (Keeton99, Ailamaki99, Rosenblum97)
- OS support for databases (Stonebraker81, Gray78, Christmann87)
- Reducing copies in I/O (Pai00, Druschel93, Thadani95)
64. Publications
- IEEE Transactions on Parallel and Distributed Systems
- International Parallel and Distributed Processing Symposium (IPDPS 2000)
- ACM International Conference on Supercomputing (ICS 2000)
- International Euro-Par Conference (Euro-Par 2000)
- ACM Symposium on Parallel Algorithms and Architectures (SPAA 2001)
- Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2001)
- Workshop on Computer Architecture Evaluation Using Commercial Workloads (CAECW 2002)
65. Publications I: Batch Applications
- Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and Migration. 7th Workshop on Job Scheduling Strategies for Parallel Processing.
- Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. The Impact of Migration on Parallel Job Scheduling for Distributed Systems. Proceedings of the 6th International Euro-Par Conference, Lecture Notes in Computer Science 1900, pages 242-251, Munich, Aug/Sep 2000.
- Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques. International Parallel and Distributed Processing Symposium (IPDPS 2000), pages 133-142, May 2000.
- Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. A Comparative Analysis of Space- and Time-Sharing Techniques for Parallel Job Scheduling in Large Scale Parallel Systems. Submitted to IEEE Transactions on Parallel and Distributed Systems.
66. Publications II: Interactive Applications
- M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Penn State CSE Tech Report CSE-01-004.
- Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms. To appear in IEEE Transactions on Parallel and Distributed Systems.
- Y. Zhang, A. Sivasubramaniam, H. Franke, J. Moreira. A Simulation-based Performance Study of Cluster Scheduling Mechanisms. 14th ACM International Conference on Supercomputing (ICS 2000), pages 100-109, May 2000.
- M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Submitted to ACM Transactions on Modeling and Computer Simulation (TOMACS).
67. Publications III: Multi-class Applications
- Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. 13th Annual ACM Symposium on Parallel Algorithms and Architectures.
- Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. Submitted to IEEE Transactions on Parallel and Distributed Systems.
68. Publications IV: Database
- Y. Zhang, J. Zhang, A. Sivasubramaniam, C. Liu, H. Franke. Decision-Support Workload Characteristics on a Clustered Database Server from the OS Perspective. Penn State Technical Report CSE-01-003.
69. Thank You!
70. Applications
- Numerous scientific and engineering apps
  - Parametric simulations
- Business applications
  - E-commerce applications (Amazon.com, eBay.com)
  - Database applications (IBM DB2, Oracle)
- Internet applications
  - Web serving / searching (Google.com)
  - Infowares (Yahoo.com, AOL.com)
  - ASPs (application service providers)
  - Email, eChat, ePhone, eBook, eBank, eSociety, eAnything
  - Computing portals
- Mission-critical applications
  - Command-and-control systems, banks, nuclear reactor control, "star wars", and handling life-threatening situations
71. Admission Control
[Figure: the arrival queue feeding jobs to the CPUs, with admission control in between]
72. Arrival Queue
- First come first serve, shortest job first, first fit, best fit, worst fit
- Or is it possible to violate the priority principle based on resource availability? And how?
[Figure: Job 1 and Job 2 waiting to enter the system]
73. Resource Isolation
74. Previous Work I: Space Sharing
- Easy to implement
- Severe fragmentation
[Figure: space-shared allocation of jobs of sizes 2, 2, 3, 3, 6, 6]
75. Previous Work II: Backfilling (BF)
[Figure: arrival queue and a space (# of CPUs = 14) vs. time chart with jobs of sizes 8, 2, 8, 3, 6]
76. Previous Work III: Gang Scheduling (GS)
[Figure: GS matrix packing jobs of sizes 5, 3, 5, 5, 2, 6, 2, 6, 2, 3 into time slices]
77. Previous Work IV: Migration (M)
- Jobs can be migrated during execution.
78. A Question
- Can we combine BF, GS, and M to deliver even better performance?
79. Basic GS
- At each scheduling event, the matrix is re-calculated
- Two optimizations to the matrix
  - CollapseMatrix
  - FillMatrix
80. GS + BF
81. Backfilling (BF)
- Conservative BF
- Φ model of overestimation (sketched below)
- Characteristics of the Φ model
  - Φ is the fraction of jobs that run for at least their estimated time
  - The remaining jobs execute for only a fraction of their estimated time
  - That fraction is uniformly distributed
82. One Problem
- How to estimate job execution time in a time-shared environment?
- Upper bound = MPL x user-submitted estimate
Use the upper bound to backfill!
83. Why Migration?
84. What to Migrate?
85. MBGS = GS + BF + M
87. Model Validation
88. Future Work
- Can we piggyback some information from the source node to help scheduling on the destination node?
- Can the system dynamically figure out where it is operating and choose the best scheme accordingly? (global optimization)
89. Objective
- Minimize the interference between jobs of different classes.
- Schedule jobs in each class efficiently:
  - Minimize BE response times
  - Minimize RT deadline miss rates
90. Previous Work
- Parallel BE job scheduling
  - Gang scheduling
  - Dynamic co-scheduling
- RT scheduling
  - Earliest Deadline First
  - Rate Monotonic
  - Least Laxity First, etc.
- Single-node multi-class scheduling
  - Hierarchical schedulers
  - Proportional-share schedulers, etc.
91. Proposed Work
- Look at multiple classes of parallel applications coexisting on a time-shared cluster!
92. One-level Gang Scheduler (1GS)
93. 1GS Optimizations
Optimization principle: a class can use the other class's slots if that class cannot utilize them.
94. Two-level Dynamic Coscheduler with Rigid TDM (2DCS-TDM)
[Figure: rigid time-division multiplexing between the classes at x:y = 2:1]
- Globally decide when to schedule which class.
- Locally decide the schedule within each class.
95. 2DCS-TDM Optimizations
Optimization principle: a class can borrow time from the other class when no job in that class can run!
96. Two-level Dynamic Coscheduler with Proportional Share Scheduling (2DCS-PS)
- Locally decide when to schedule which class.
- Locally decide the schedule within each class.
[Figure: proportional sharing between the classes at x:y = 2:1]
97. 2DCS-PS Optimizations
Optimization principle: if no job in one class can be run, only jobs in the other class will be scheduled! (A small proportional-share sketch follows.)
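As a concrete and deliberately simplified illustration of proportional sharing between the two classes, here is a self-contained C sketch using stride-scheduling-style bookkeeping at x:y = 2:1, with the optimization above: if one class has nothing runnable, the other class takes the quantum. This illustrates the principle only; it is not necessarily the mechanism inside 2DCS-PS.

/* ps_sketch.c -- proportional share between an RT and a BE class. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    double weight;      /* share of the CPU (x or y)          */
    double pass;        /* stride-style virtual time          */
    bool runnable;      /* does the class have runnable jobs? */
} Class;

/* Pick the class for the next quantum: lower pass wins; an idle class
 * forfeits its slot to the other class (the optimization above). */
Class *pick(Class *a, Class *b) {
    if (!a->runnable) return b->runnable ? b : NULL;
    if (!b->runnable) return a;
    return (a->pass <= b->pass) ? a : b;
}

int main(void) {
    Class rt = { "RT", 2.0, 0.0, true };   /* x : y = 2 : 1 */
    Class be = { "BE", 1.0, 0.0, true };
    for (int q = 0; q < 6; q++) {
        Class *c = pick(&rt, &be);
        if (!c) continue;                  /* nothing runnable at all */
        c->pass += 1.0 / c->weight;        /* stride = 1 / weight     */
        printf("quantum %d -> %s\n", q, c->name);
    }
    return 0;                              /* RT gets 4 of 6 quanta   */
}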
98. Future Work
- How do the admission control algorithms affect performance?
- Can we provide a deterministic admission control algorithm, since in some cases being deterministic is very important?
- Can we provide tunable admission control, so that users can specify how much performance they are willing to give up to trade off against the admission rate?
- How do the system parameters affect performance?
99. Case Study
100. I/O Characteristics (Q6)
101. Potential Benefit
- Maximum benefit when:
  - Pages are read-only.
  - The page cache hit ratio is high.
  - On a page cache miss, there is good overlap with useful computation.
102. Conclusions
- Resource management is, and will continue to be, important for clusters.
- Important first steps have been taken in three types of environments.
- There is still a long way to go for next-generation applications.