Scheduling and Resource Management for Next-generation Clusters

1
Scheduling and Resource Management for
Next-generation Clusters
  • Yanyong Zhang
  • Penn State University
  • www.cse.psu.edu/yyzhang

2
What is a Cluster?
  • Cost effective
  • Easily scalable
  • Highly available
  • Readily upgradeable

3
Scientific / Engineering Applications
  • HPTi won a 5-year, $15M procurement to provide
    systems for weather modeling (NOAA).
    (http://www.noaanews.noaa.gov/stories/s419.htm)
  • Sandia's expansion of their Alpha-based C-plant
    system.
  • Maui HPCC LosLobos Linux Super-cluster
    (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)
  • A favorable performance-price ratio is
    demonstrated in simulations of wind instruments
    using a cluster of 20.
    (http://www.swiss.ai.mit.edu/pas/p/sc95.html)
  • The PC-cluster-based parallel simulation
    environment and the technologies will have a
    positive impact on networking research nationwide.
    (http://www.osc.edu/press/releases/2001/approved.shtml)

4
Commercial Applications
  • Business applications
  • Transaction Processing (IBM DB2, Oracle, …)
  • Decision Support Systems (IBM DB2, Oracle, …)
  • Internet applications
  • Web serving / searching (Google.com, …)
  • Infowares (Yahoo.com, AOL.com)
  • Email, eChat, ePhone, eBook, eBank, eSociety,
    eAnything
  • Computing portal

5
Resource Management
  • Each application is demanding
  • Several applications/users can be present at the
    same time

Resource management and quality-of-service
become important.
6
System Model
[Figure: independent nodes with multiprogramming levels (MPLs) 4, 4, 3, fed by an arrival queue]
  • Each node is independent
  • Maximum MPL per node
  • Arrival queue

7
Two Phases in Resource Management
  • Allocation Issues
  • Admission Control
  • Arrival Queue Principle
  • Scheduling Issues (CPU Scheduling)
  • Resource Isolation
  • Co-allocation

8
Co-allocation / Co-scheduling
[Figure: space-time diagram of co-scheduled processes P0 and P1 across time slices t0 and t1]
9
Outline
  • From the OS's perspective
  • Contribution 1: boosting the CPU utilization at
    supercomputing centers
  • Contribution 2: providing quick responses for
    commercial workloads
  • Contribution 3: scheduling multiple classes of
    applications
  • From the application's perspective
  • Contribution 4: optimizing clustered DB2

10
Contribution 1: Boosting CPU Utilization at
Supercomputing Centers
11
Objective
[Figure: response time = wait time + execute time; wait time spans the arrival queue and the ready/blocked queues]
Goal: minimize response time.
12
Existing Techniques
  • Back Filling (BF)
  • Gang Scheduling (GS)
  • Migration (M)

[Figure: space-time chart of jobs (sizes 2, 8, 8, 3, 2, 6, 2) packed onto 14 CPUs]
13
Proposed Scheme
  • MBGS = GS + BF + M
  • Use GS as the basic framework
  • At each row of the GS matrix, apply the BF
    technique
  • Whenever the GS matrix is re-calculated, consider
    migration (M)
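A minimal sketch of the per-row backfilling step described above (data layout and names are illustrative, not the dissertation's implementation): each row of the GS matrix is one time slice, and a waiting job is placed into any row with enough free CPUs.

```python
# Illustrative sketch: backfilling within the rows of a gang-scheduling matrix.
TOTAL_CPUS = 14

def backfill_rows(matrix, waiting):
    """matrix: list of rows, each row a list of (job_id, cpus) entries.
    waiting: list of (job_id, cpus) jobs from the arrival queue.
    Greedily places each waiting job into the first row with enough free CPUs."""
    placed = []
    for job_id, cpus in waiting:
        for row in matrix:
            free = TOTAL_CPUS - sum(c for _, c in row)
            if cpus <= free:
                row.append((job_id, cpus))
                placed.append(job_id)
                break
    return placed

# Two time slices: one 8-CPU job in the first, jobs of 8 and 2 CPUs in the second.
matrix = [[("J1", 8)], [("J2", 8), ("J3", 2)]]
placed = backfill_rows(matrix, [("J4", 6), ("J5", 8)])
```

Here J4 (6 CPUs) fits beside J1 in the first row, while J5 (8 CPUs) fits nowhere and stays queued.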

14
How Does MBGS Perform?
15
Outline
  • From the OS's perspective
  • Contribution 1: boosting the CPU utilization at
    supercomputing centers
  • Contribution 2: providing quick responses for
    commercial workloads
  • Contribution 3: scheduling multiple classes of
    applications
  • From the application's perspective
  • Contribution 4: optimizing clustered DB2

16
Contribution 2: Reducing Response Times for
Commercial Applications
17
Objective
[Figure: response time = wait time + execute time; wait time spans the arrival queue and the ready/blocked queues]
  • Minimize wait time
  • Minimize response time

18
Previous Work I: Gang Scheduling (GS)
[Figure: GS time quanta are on the order of minutes]
GS is not responsive enough!
19
Previous Work II: Dynamic Co-scheduling
[Figure: four nodes P0-P3 running processes A-D; it's A's turn, C just finished I/O, B just received a message, everybody else is blocked]
The scheduler on each node makes independent
decisions based on local events, without global
synchronization.
20
Dynamic Co-scheduling Heuristics
21
Simulation Study
  • A detailed simulator at a microsecond granularity
  • System parameters
  • System configurations (maximum MPL, to partition
    or not)
  • System overheads (context switch overheads,
    interrupt costs, costs associated with
    manipulating queues)

22
Simulation Study (Contd)
  • Application parameters
  • Injection load
  • Characteristics (CPU-intensive, I/O-intensive,
    communication-intensive, or somewhere in the
    middle)

23
Impact of Load
24
Impact of Workload Characteristics
[Figure: results for communication-intensive and I/O-intensive workloads]
25
Periodic Boost Heuristics
  • S1: compute phase
  • S2: S1 + unconsumed msg.
  • S3: recv., msg. arrived
  • S4: recv., no msg.
  • A: S3 > S2, S1
  • B: S3 > S2 > S1
  • C: S3, S2, S1
  • D: S3, S2 > S1
  • E: S2 > S3 > S1
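One periodic-boost scan under, for example, ordering B (prefer S3, then S2, then S1; S4 is never boosted) can be sketched as follows. The state encoding and function names are illustrative, not the actual kernel implementation.

```python
# Illustrative sketch of one periodic-boost scan under heuristic "B":
# among the local processes, boost the one in the most favored state.
PRIORITY_B = ["S3", "S2", "S1"]  # S4 (blocked in recv, no message) is skipped

def pick_process(processes, order=PRIORITY_B):
    """processes: list of (pid, state) pairs on this node.
    Returns the pid to boost next, or None if nothing is boostable."""
    for state in order:
        for pid, s in processes:
            if s == state:
                return pid
    return None
```

For instance, with processes in states S1, S4, and S2, the scan skips S3 (absent), finds the S2 process, and boosts it.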

26
Analytical Modeling Study
  • With dynamic arrivals, the full state space is
    impossible to handle.
27
Analysis Description
Original state space: the number of jobs on each
node k (impossible to handle!)
Assumption: the state of each processor is
stochastically independent and identical to the
state of the other processors.
28
Analysis Description (Cont)
  • Derive the state transition rates using a
    continuous-time Markov model; build the
    generator matrix Q.
  • Get the invariant probability vector π by
    solving πQ = 0 and πe = 1.
  • Use fixed-point iteration to get the solution.
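The invariant-vector step can be illustrated on a toy two-state generator matrix (the actual Q in the analysis is built from the scheduler's transition rates): replace one balance equation with the normalization constraint and solve the linear system.

```python
import numpy as np

# Toy 2-state continuous-time Markov chain generator (rows sum to zero).
# Illustrative values only; the thesis derives Q from scheduler transition rates.
Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])

# Solve pi @ Q = 0 subject to sum(pi) = 1:
# take Q^T pi^T = 0 and overwrite the last equation with the normalization.
A = Q.T.copy()
A[-1, :] = 1.0          # last row now encodes sum(pi) = 1
b = np.zeros(2)
b[-1] = 1.0
pi = np.linalg.solve(A, b)
```

For these rates the balance condition gives pi = (2/3, 1/3): state 0 is left at rate 1 and entered at rate 2, so it holds twice the probability mass.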
29
Spin-Block (SB) Example
[Figure: SB state-transition diagram (rate r2)]
30
Results
Optimal PB Frequency
Optimal Spin Time for SB
31
Results Optimal Quantum Length
CPU Intensive
Comm Intensive
I/O Intensive
32
Outline
  • From the OS's perspective
  • Contribution 1: boosting the CPU utilization at
    supercomputing centers
  • Contribution 2: providing quick responses for
    commercial workloads
  • Contribution 3: scheduling multiple classes of
    applications
  • From the application's perspective
  • Contribution 4: optimizing clustered DB2

33
Contribution 3: Scheduling Multiple Classes of
Applications
[Figure: interactive, real-time, and batch applications sharing a cluster]
34
Objective
[Figure: best-effort (λBE) and real-time (λRT) job streams entering the cluster]
  • BE: How long did it take to finish? -> Response time
  • RT: How many deadlines have been missed? -> Miss rate
35
Fairness Ratio (x:y)
[Figure: the cluster resource is split into shares x/(x+y) and y/(x+y)]
36
How to Adhere to Fairness Ratio?
37
BE Response Time
[Figure: BE response time for λRT:λBE = 2:1, 1:9, and 9:1]
38
RT Deadline Miss Rate
[Figure: RT deadline miss rate for λRT:λBE = 1:9, 2:1, and 9:1]
39
Outline
  • From the OS's perspective
  • Contribution 1: boosting the CPU utilization at
    supercomputing centers
  • Contribution 2: providing quick responses for
    commercial workloads
  • Contribution 3: scheduling multiple classes of
    applications
  • From the application's perspective
  • Characterizing decision-support workloads on the
    clustered database server
  • Resource management for transaction-processing
    workloads on the clustered database server

40
Experiment Setup
  • IBM DB2 Universal Database for Linux, EEE,
    Version 7.2
  • A Linux/Pentium cluster of 8 dual-CPU nodes,
    each with 256 MB RAM and an 18 GB disk
  • TPC-H workload; queries are run sequentially
    (Q1 to Q20) and the completion time of each
    query is measured

41
Platform
[Figure: a client issuing a query (SELECT … FROM T) against the clustered database server]
42
Methodology
  • Identify the components with high system
    overhead.
  • For each such component, characterize the
    request distribution.
  • Come up with ways to optimize it.
  • Quantify the potential benefits of the
    optimization.

43
Sampling OS Statistics
  • Sample the statistics provided by /proc/stat,
    /proc/net/dev, and /proc/<pid>/stat
  • User/system CPU time
  • # of page faults
  • # of blocks read/written
  • # of reads/writes
  • # of packets sent/received
  • CPU utilization during I/O
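The sampling described above can be sketched as a parser for the aggregate "cpu" line of a Linux stat file; the field layout follows proc(5), and the counter values in the example are made up for illustration.

```python
# Illustrative sketch: parse the aggregate "cpu" line of /proc/stat.
# Fields (per proc(5)): cpu user nice system idle iowait irq softirq ...
def parse_cpu_line(line):
    """Return user (incl. nice), system, and idle jiffies from a 'cpu' line.
    A sampler would read this periodically and difference successive values."""
    fields = line.split()
    user, nice, system, idle = (int(x) for x in fields[1:5])
    return {"user": user + nice, "system": system, "idle": idle}

# Example line with made-up counter values.
sample = parse_cpu_line("cpu  4705 150 1120 16250856 0 0 0 0 0 0")
```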

44
Kernel Instrumentation
  • Instrument each system call in the kernel.

[Figure: timeline of a system call entering the kernel, blocking, unblocking, resuming execution, and exiting]
45
Operating System Profile
  • A considerable part of the execution time is
    taken by the pread system call.
  • There is good overlap of computation with I/O
    for some queries.
  • There are more reads than writes.

46
TPC-H pread Overhead
Query  % of exe time   Query  % of exe time
Q6     20.0            Q13    10.0
Q14    19.0            Q3      9.6
Q19    16.9            Q4      9.1
Q12    15.4            Q18     9.0
Q15    13.4            Q20     7.9
Q7     12.1            Q2      5.2
Q17    10.8            Q9      5.2
Q8     10.5            Q5      4.6
Q10    10.3            Q16     4.1
Q1     10.0            Q11     3.5
pread overhead = # of preads × overhead per pread.
47
pread Optimization
pread(dest, chunk):
    for each page in the chunk:
        if the page is not in the cache:
            bring it in from disk
        copy the page into dest    (≈ 30 µs)
  • Optimization
  • Re-mapping the buffer
  • Copy-on-write
48
Copy-on-write
Query  % reduction   Query  % reduction
Q1      98.9         Q11     96.1
Q2      85.7         Q12     87.1
Q3      96.0         Q13    100.0
Q4      80.9         Q14     96.1
Q5     100.0         Q15     96.8
Q6     100.0         Q16     70.7
Q7      79.7         Q17     94.5
Q8      79.3         Q18    100.0
Q9      88.7         Q19     95.7
Q10     77.8         Q20     94.4
reduction = 1 - (# of copy-on-writes / # of preads)
49
Operating System Profile
  • Socket calls are the next dominant system calls.

50
Message Characteristics
[Figure: for Q11 and Q16, distributions of message size (bytes), message inter-injection time (ms), and message destination]
51
Observations on Messages
  • Only a small set of message sizes is used.
  • Many messages are sent in a short period.
  • The message destination distribution is uniform.
  • Many messages are point-to-point implementations
    of multicast/broadcast messages.
  • Multicast can reduce the # of messages.

52
Potential Reduction in Messages (%)
query  total  small  large   query  total  small  large
Q1     44.7   71.4   38.7    Q11     9.6   28.6    0.1
Q2     20.4   58.7    0.2    Q12     8.3    7.8    2.9
Q3     48.2   64.3   38.0    Q13    24.5   75.2    0.1
Q4     22.6   58.6    0.1    Q14    27.9   80.4    0.7
Q5      8.0    7.1    8.4    Q15    46.6   56.5    0.7
Q6     76.4   78.6   45.5    Q16    59.1   63.0   56.9
Q7     57.5   71.4   56.2    Q17    41.5   66.7   27.3
Q8     29.1   75.5    4.8    Q18    11.4   32.3    0.0
Q9     66.8   78.5   61.1    Q19    26.7   79.4    0.2
Q10    25.0   73.6    0.1    Q20    21.1   62.8    0.1
53
Online Algorithm
// original:
Send(msg, dest):
    send msg to node dest

// with aggregation:
Send(msg, dest):
    if (msg == buffered_msg && dest ∉ dest_set)
        dest_set = dest_set ∪ {dest}
    else
        buffer the msg

Send_bg():    // background
    foreach buffered_msg:
        if (it has been buffered longer than threshold)
            send multicast msg to nodes in dest_set
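A runnable sketch of this buffering idea (class and parameter names are illustrative, not the actual implementation): identical messages to multiple destinations are held briefly, then sent once as a multicast.

```python
import time

# Illustrative sketch of message aggregation: buffer identical messages,
# collect their destinations, and multicast once the threshold expires.
class Aggregator:
    def __init__(self, threshold, transport):
        self.threshold = threshold    # seconds a message may stay buffered
        self.buffered = {}            # msg -> (dest_set, first_buffer_time)
        self.transport = transport    # callable(msg, dest_set): the multicast

    def send(self, msg, dest):
        if msg in self.buffered:
            self.buffered[msg][0].add(dest)      # same msg: just add the dest
        else:
            self.buffered[msg] = ({dest}, time.monotonic())

    def flush(self):
        """Background step (Send_bg): multicast anything buffered too long."""
        now = time.monotonic()
        for msg in list(self.buffered):
            dests, t0 = self.buffered[msg]
            if now - t0 >= self.threshold:
                self.transport(msg, dests)
                del self.buffered[msg]

sent = []
agg = Aggregator(0.0, lambda m, d: sent.append((m, frozenset(d))))
agg.send("ping", 1); agg.send("ping", 2); agg.send("pong", 3)
agg.flush()   # "ping" goes out once, to destinations {1, 2}
```

A larger threshold trades latency for fewer messages, which is exactly the tradeoff the threshold plots explore.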
54
Impact of Threshold
[Figure: message reduction for Q7 and Q16 as a function of the buffering threshold (ms)]
55
Outline
  • From the OS's perspective
  • Contribution 1: boosting the CPU utilization at
    supercomputing centers
  • Contribution 2: providing quick responses for
    commercial workloads
  • Contribution 3: scheduling multiple classes of
    applications
  • From the application's perspective
  • Characterizing decision-support workloads on the
    clustered database server
  • Resource management for clustered database
    applications

56
Ongoing/Near-term Work
  • What is the optimal number of jobs that should
    be admitted?
  • Can we dynamically pause some processes based on
    resource requirements and resource availability?
  • Which dynamic co-scheduling scheme works best
    here?
  • How do we exploit application-level information
    in scheduling?

57
Future Work
  • Some next-generation applications
  • Real-time medical imaging and collaborative
    surgery
  • Application requirements
  • Vast processing power, disk capacity, and
    network bandwidth
  • Absolute availability
  • Deterministic performance

58
Future Work
  • E-business on demand
  • Requirements
  • performance
  • more users
  • responsive
  • Quality-of-service
  • availability
  • security
  • power consumption
  • pricing model

59
Future Work
  • What does it take to get there?
  • Hardware innovations
  • Resource management and isolation
  • Good scalability
  • High availability
  • Deterministic Performance

60
Future Work
  • Not only high performance
  • Energy consumption
  • Security
  • Pricing for service
  • User satisfaction
  • System management
  • Ease of use

61
Related Work
  • Parallel job scheduling
  • Gang scheduling (Ousterhout82)
  • Backfilling (Lifka95, Feitelson98)
  • Migration (Epima96)
  • Dynamic co-scheduling
  • Spin Block (Arpaci-Dusseau98, Anglano00)
  • Periodic Boost (Nagar99)
  • Demand-based Coscheduling (Sobalvarro97)

62
Related Work (Contd)
  • Real-time scheduling
  • Earliest Deadline First
  • Rate Monotonic
  • Least Laxity First
  • Single-node multi-class scheduling
  • Hierarchical scheduling (Goyal96)
  • Proportional share (Waldspurger95)
  • Commercial clustered servers (Pai98, reserve)

63
Related Work (Contd)
  • Commercial Workloads (CAECW, Barford99,
    Kant99)
  • Database Characterizing (Keeton99,
    Ailamaki99, Rosenblum97)
  • OS support for database (Stonebraker81,
    Gray78, Christmann87)
  • Reducing copies in IO (Pai00, Druschel93,
    Thadani95)

64
Publications
  • IEEE Transactions on Parallel and Distributed
    Systems.
  • International Parallel and Distributed Processing
    Symposium (IPDPS 2000)
  • ACM International Conference on Supercomputing
    (ICS 2000)
  • International Euro-par Conference (Europar 2000)
  • ACM Symposium on Parallel Algorithms and
    Architectures (SPAA 2001)
  • Workshop on Job Scheduling Strategies for
    Parallel Processing (JSSPP 2001)
  • Workshop on Computer Architecture Evaluation
    Using Commercial Workloads (CAECW 2002)

65
Publications I: Batch Applications
  • Y. Zhang, H. Franke, J. Moreira, A.
    Sivasubramaniam. An Integrated Approach to
    Parallel Scheduling Using Gang-Scheduling,
    Backfilling and Migration. 7th Workshop on Job
    Scheduling Strategies for Parallel Processing.
  • Y. Zhang, H. Franke, J. Moreira, A.
    Sivasubramaniam. The Impact of Migration on
    Parallel Job Scheduling for Distributed Systems.
    Proceedings of the 6th International Euro-Par
    Conference, Lecture Notes in Computer Science
    1900, pages 242-251, Munich, Aug/Sep 2000.
  • Y. Zhang, H. Franke, J. Moreira, A.
    Sivasubramaniam. Improving Parallel Job
    Scheduling by combining Gang Scheduling and
    Backfilling Techniques. International Parallel
    and Distributed Processing Symposium
    (IPDPS'2000), pages 133-142, May 2000.
  • Y. Zhang, H. Franke, J. Moreira, A.
    Sivasubramaniam. A Comparative Analysis of Space-
    and Time-Sharing Techniques for Parallel Job
    Scheduling in Large Scale Parallel Systems.
    Submitted to IEEE Transactions on Parallel and
    Distributed Systems.

66
Publications II: Interactive Applications
  • M. Squillante, Y. Zhang, A. Sivasubramaniam, N.
    Gautam, H. Franke, J. Moreira. Analytic Modeling
    and Analysis of Dynamic Coscheduling for a Wide
    Spectrum of Parallel and Distributed
    Environments. Penn State CSE tech report
    CSE-01-004.
  • Y. Zhang, A. Sivasubramaniam, J. Moreira, H.
    Franke. Impact of Workload and System Parameters
    on Next Generation Cluster Scheduling Mechanisms.
    To appear in IEEE Transactions on Parallel and
    Distributed Systems.
  • Y. Zhang, A. Sivasubramaniam, H. Franke, J.
    Moreira. A Simulation-based Performance Study of
    Cluster Scheduling Mechanisms. 14th ACM
    International Conference on Supercomputing
    (ICS'2000), pages 100-109, May 2000.
  • M. Squillante, Y. Zhang, A. Sivasubramaniam, N.
    Gautam, H. Franke, J. Moreira. Analytic Modeling
    and Analysis of Dynamic Coscheduling for a Wide
    Spectrum of Parallel and Distributed
    Environments. Submitted to ACM Transactions on
    Modeling and Computer Simulation (TOMACS).

67
Publications III: Multi-class Applications
  • Y. Zhang, A. Sivasubramaniam. Scheduling
    Best-Effort and Real-Time Pipelined Applications
    on Time-Shared Clusters. 13th Annual ACM
    Symposium on Parallel Algorithms and
    Architectures.
  • Y. Zhang, A. Sivasubramaniam. Scheduling
    Best-Effort and Real-Time Pipelined Applications
    on Time-Shared Clusters. Submitted to IEEE
    Transactions on Parallel and Distributed Systems.

68
Publications IV: Database
  • Y. Zhang, J. Zhang, A. Sivasubramaniam, C. Liu,
    H. Franke. Decision-Support Workload
    Characteristics on a Clustered Database Server
    from the OS Perspective. Penn State Technical
    Report CSE-01-003.

69
Thank You !
70
Applications
  • Numerous scientific/engineering apps
  • Parametric simulations
  • Business applications
  • E-commerce applications (Amazon.com, eBay.com, …)
  • Database applications (IBM DB2, Oracle, …)
  • Internet applications
  • Web serving / searching (Google.com, …)
  • Infowares (Yahoo.com, AOL.com)
  • ASPs (application service providers)
  • Email, eChat, ePhone, eBook, eBank, eSociety,
    eAnything
  • Computing portals
  • Mission-critical applications
  • Command and control systems, banks, nuclear
    reactor control, "star wars", and handling
    life-threatening situations

71
Admission Control
  • Example 1
  • Example 2

[Figure: two admission-control examples, each with an arrival queue feeding the CPUs]
72
Arrival Queue
  • First-come-first-serve, shortest-job-first,
    first-fit, best-fit, worst-fit
  • Or is it possible to violate the priority
    principle based on resource availability? And
    how?

[Figure: Job 1 and Job 2 waiting to enter the system]
73
Resource Isolation
74
Previous Work I: Space Sharing
  • Easy to implement
  • Severe fragmentation

[Figure: jobs of sizes 2, 3, and 6 space-partitioned across the CPUs, leaving fragments]
75
Previous Work II: Backfilling (BF)
[Figure: space-time chart on 14 CPUs; a small job (size 2) from the arrival queue is backfilled ahead of larger jobs (sizes 8, 8, 3, 6)]
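The backfilling decision can be stated as a small feasibility check (an illustrative, EASY-style reservation check with assumed parameter names, not the dissertation's code): a later job may jump ahead only if it cannot delay the job at the head of the queue.

```python
# Illustrative backfilling check: a candidate job may start now only if it
# either finishes before the head-of-queue job's reserved start time, or
# uses CPUs that remain free even after the head job starts.
def can_backfill(job_cpus, job_runtime, free_cpus, shadow_time, extra_cpus, now):
    """shadow_time: when the head-of-queue job is guaranteed to start.
    extra_cpus: CPUs still free once the head job has started."""
    if job_cpus > free_cpus:
        return False                    # does not fit right now
    if now + job_runtime <= shadow_time:
        return True                     # finishes before the reservation
    return job_cpus <= extra_cpus       # or never conflicts with the head job
```

For example, a 2-CPU, 5-unit job backfills when 4 CPUs are free and the reservation is 10 units away, but the same job with a 20-unit runtime does not (unless it fits in the spare CPUs).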
76
Previous Work III: Gang Scheduling (GS)
[Figure: GS matrix whose rows are time slices and whose entries are jobs of sizes 5, 3, 2, and 6]
77
Previous Work IV: Migration (M)
  • Jobs can be migrated during execution.

78
A Question
  • Can we combine BF, GS, and M to deliver even
    better performance ?

79
Basic GS
  • At each scheduling event, the matrix is
    re-calculated
  • Two optimizations to the matrix
  • CollapseMatrix
  • FillMatrix

80
GS + BF
81
Backfilling (BF)
  • Conservative BF
  • Φ model of overestimation
  • Characteristics of the Φ model
  • Φ is the fraction of jobs that run for at least
    their estimated time
  • The rest of the jobs execute for some fraction
    of their estimated time
  • That fraction is uniformly distributed
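The overestimation model above can be sketched as a sampling routine (the parameter name `phi` is an assumption for the fraction of jobs that use their full estimate):

```python
import random

# Illustrative sampler for the overestimation model: with probability phi a
# job runs for its full estimated time; otherwise it runs for a uniformly
# distributed fraction of the estimate.
def actual_runtime(estimate, phi, rng=random):
    if rng.random() < phi:
        return estimate
    return rng.uniform(0.0, 1.0) * estimate
```

With phi = 1 every job uses its full estimate (no overestimation); with phi = 0 every job finishes early by a uniformly random amount.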

82
One Problem
  • How do we estimate job execution time in a
    time-shared environment?
  • Upper bound = MPL × user-submitted estimate

Using the upper bound to backfill!
83
Why Migration?
84
What to migrate?
85
MBGS = GS + BF + M
86
87
Model Validation
88
Future Work
  • Can we piggyback some information from the
    source node to help the scheduling on the
    destination node?
  • Can the system dynamically figure out where it
    is operating and choose the best scheme
    accordingly? (global optimization)

89
Objective
  • To minimize the interference between jobs of
    different classes.
  • To schedule jobs in each class efficiently.
  • Minimize BE response times
  • Minimize RT deadline miss rates

90
Previous Work
  • Parallel BE job scheduling
  • Gang scheduling
  • dynamic co-scheduling.
  • RT scheduling
  • Earliest Deadline First,
  • Rate Monotonic,
  • Least Laxity First etc
  • Single-node multi-class scheduling
  • hierarchical schedulers,
  • proportional share schedulers, etc.
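The first of the real-time policies listed above, Earliest Deadline First, can be stated in a few lines (illustrative sketch with an assumed job representation):

```python
# Earliest Deadline First: always run the ready job whose absolute deadline
# is nearest. Jobs are represented here as (job_id, absolute_deadline) pairs.
def edf_pick(ready_jobs):
    return min(ready_jobs, key=lambda j: j[1])[0] if ready_jobs else None
```

Rate Monotonic differs only in the key: it prioritizes by fixed period rather than by the current deadline.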

91
Proposed Work
  • Look at coexisting multiple-class parallel
    applications on a time-shared cluster!

92
One-level Gang Scheduler (1GS)
93
1GS Optimizations
Optimization principle: a class can use slots of
the other class if that class cannot utilize them.
94
Two-level Dynamic Coscheduler with Rigid TDM
(2DCS-TDM)
[Figure: time-division multiplexing between the classes with x:y = 2:1]
  • Globally decide when to schedule which class.
  • Locally decide the schedule within each class.
95
2DCS-TDM optimizations
Optimization principle: a class can borrow time
from the other class when no job in that class can
be run!
96
Two-level Dynamic Coscheduler with Proportional
Share Scheduling (2DCS-PS)
  • Locally decide when to schedule which class.
  • Locally decide the schedule within each class.

[Figure: proportional share between the classes with x:y = 2:1]
97
2DCS-PS Optimizations
Optimization principle: if no job in one class can
be run, only jobs in the other class will be
scheduled!
98
Future Work
  • How can the admission control algorithms affect
    the performance?
  • Can we provide a deterministic admission control
    algorithm, since in some cases being
    deterministic is very important?
  • Can we provide a tunable admission control, so
    that users can specify how much performance
    they are willing to trade for admission rate?
  • How do the system parameters affect the
    performance?

99
Case Study
100
I/O Characteristics (Q6)
101
Potential Benefit
  • Maximum benefit when
  • Pages are read-only
  • The page cache hit ratio is high
  • On a page cache miss, there is good overlap
    with useful computation
102
Conclusions
  • Resource management is and continues to be
    important for clusters.
  • Important starting steps taken in three types of
    environments.
  • Still a long way to go for next generation
    applications.