Scheduling on Clusters, Scheduling on Grids - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Scheduling on Clusters,Scheduling on Grids
  • Jennifer M. Schopf
  • Northwestern University

2
Outline
  • Stochastic Scheduling
  • Thesis work
  • UC San Diego: Fran Berman
  • Data Replica Selection
  • Work in progress
  • Northwestern University: Jeff Mezger, Christopher Beckmann
  • Argonne National Lab: Sudharshan Vazhkudai and Mohamed Kerasha

3
Stochastic Scheduling Overview: The Problem
  • Clusters of workstations can provide the
    resources required to execute a scientific
    application efficiently
  • Cannot achieve good performance for any single
    application when resources are shared

4
Our Solution
  • Scheduling techniques can be developed to make
    use of the dynamic performance characteristics of
    shared resources
  • The approach
  • Structural performance models
  • Stochastic values and predictions
  • Stochastic scheduling techniques

5
Why use shared clusters of workstations?
  • More resources at a low cost
  • Multiple distributed resources
  • Processors
  • Storage and Data sources
  • Memory
  • Cooperative execution of a single application
  • Focus on performance: speed and capacity

6
How can these resources be used effectively?
  • Efficient scheduling
  • Selection of resources
  • Mapping of tasks to resources
  • Allocating data
  • Accurate prediction of performance
  • Good performance prediction modeling techniques

7
[Diagram: Point Value Parameters and Stochastic Value Parameters feed Structural Prediction Models, which produce a Stochastic Prediction used for Stochastic Scheduling]
8
Modeling performance on shared clusters
  • Challenges
  • Heterogeneous setting
  • Multiple languages and programming styles
  • Multiple implementations of application
  • Non-dedicated and possibly highly contended
    resources

9
Modeling approach must incorporate
  • Different models for different machine types and
    different application types
  • Different models for the same application, or
    parts of the application
  • Extensible and flexible models to adjust to
    changing resources

10
Structural Modeling
  • Method to construct flexible, extensible
    performance models for distributed parallel
    applications
  • Application performance is decomposed according
    to the structure of the application
  • Each sub-task can have its own model
  • Result is a performance equation
  • Parameters are application and system
    characteristics
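As a concrete illustration, the structural idea can be sketched in Python. The function names and numbers below are my own, not the talk's actual models: each sub-task gets its own model, and the application's structure dictates how the sub-models combine into one performance equation.

```python
# A minimal sketch of a structural performance model. The phase models
# and parameter values are illustrative assumptions only.

def compute_time(data_units, time_per_unit):
    # Model for the local computation sub-task.
    return data_units * time_per_unit

def comm_time(message_bytes, bandwidth_bytes_per_sec, latency_sec):
    # Model for the boundary-exchange sub-task.
    return latency_sec + message_bytes / bandwidth_bytes_per_sec

def predict_iteration(data_units, time_per_unit,
                      message_bytes, bandwidth_bytes_per_sec, latency_sec):
    # Structural model: the application alternates a compute phase and a
    # communication phase, so the performance equation is their sum.
    return (compute_time(data_units, time_per_unit)
            + comm_time(message_bytes, bandwidth_bytes_per_sec, latency_sec))

# e.g. 1000 data units at 2 ms each, an 8000-byte exchange at 1e6 bytes/sec:
t = predict_iteration(1000, 0.002, 8000, 1e6, 0.001)
```

Because each sub-task has its own model, a different `comm_time` (say, for a different network or machine type) can be swapped in without touching the rest of the equation, which is what makes the approach extensible.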

11
Successive Over-Relaxation (SOR)
  • Iterative solution to Laplace's equation
  • Typical stencil application
  • Divided into a red phase and a black phase
  • 2-D grid of data divided into strips

12
SOR
13
Models
14
Dedicated SOR Experiments
  • Platform: 2 Sparc 2s, 1 Sparc 5, 1 Sparc 10
  • 10 Mbit Ethernet connection
  • Quiescent machines and network
  • Prediction within 3% before memory spill

15
Non-dedicated SOR results
  • Available CPU on workstations varied from 0.43 to 0.53

16
Platforms with Scattered Range of CPU Availability
17
Improving structural models
  • Available CPU has a range of 0.48 ± 0.05
  • Prediction should also have a range

18
Using Additional Information
  • Point value
  • Bandwidth reported as 7 Mbits/sec
  • Single Value
  • Often a best guess, estimate under ideal
    circumstances, or a value accurate only for a
    given time frame
  • Stochastic value
  • Bandwidth reported as 7 ± 2 Mbits/sec
  • A set of possible values weighted by
    probabilities
  • Represents a range of likely behavior

19
Stochastic Structural Models
  • Goal: Extend structural models so that the resulting predictions are distributions
  • Structural model is an equation so
  • Need to represent stochastic information
  • Normal distribution
  • Interval
  • Histogram
  • Need to be able to mathematically combine the
    stochastic values in a timely manner

20
Using Normal Distributions
  • A distribution is a set of values with associated
    probabilities
  • General distributions have no unifying
    characteristics (and no associated tractable
    arithmetic)
  • Can often summarize with a well-known family of
    distributions

21
Normal distributions
  • Symmetric and bell-shaped
  • Summarized by a mean and a standard deviation
  • A range of ±2 standard deviations captures 95% of the values
  • Assume that stochastic data can be adequately represented by a normal distribution
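The "tractable arithmetic" that normals buy can be sketched as follows (the helper names are mine): a stochastic value is a (mean, sd) pair, and sums and constant multiples of independent normals are again normal.

```python
import math

# Sketch: stochastic values as normal distributions (mean, sd).
# Assumes the values being combined are independent, so variances add.

def add(a, b):
    """Sum of two independent normal stochastic values."""
    (ma, sa), (mb, sb) = a, b
    return (ma + mb, math.sqrt(sa ** 2 + sb ** 2))

def scale(a, k):
    """A normal stochastic value multiplied by a constant k."""
    m, s = a
    return (k * m, abs(k) * s)

# Two phase times, e.g. 7 +/- 2 sec and 3 +/- 1 sec (sd as the spread):
total = add((7.0, 2.0), (3.0, 1.0))  # mean 10.0, sd sqrt(5)
```

Note the key property: means add but standard deviations do not; the combined sd is sqrt(2² + 1²) ≈ 2.24, not 3, which keeps combined predictions tighter than naive interval addition would suggest.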

22
Dedicated Execution Time
23
Practical issues when using stochastic data
  • Who/what can supply stochastic data?
  • User
  • Data from past runs
  • On-line measurement tools
  • Network Weather Service time series data
  • Time frame
  • Given a time series, how much data should we
    consider?

24
Accuracy of stochastic results
  • Result of a stochastic prediction will also be a
    range of values
  • Need to consider how to achieve a tight (sharp)
    interval
  • What to do if the interval isn't tight

25
How can I use these predictions in scheduling?
[Diagram: Point Value Parameters and Stochastic Value Parameters feed Structural Prediction Models, which produce a Stochastic Prediction used for Stochastic Scheduling]
26
Using stochastic predictions
  • Simplest scheduling situation Given a data
    parallel application, adjust amount of data
    assigned to each processor to minimize execution
    time

27
Delay in 1 can cause delay in all
28
Stochastic Scheduling
  • Examine
  • Stochastic data represented as normal
    distributions
  • Data parallel codes
  • Fixed set of shared resources
  • Question: How should data be distributed to minimize execution time?
  • Approach: Adjust the data allocation so that a high-variance machine receives less work, in order to minimize the effects of contention

29
Time Balancing
  • Minimize execution time by assigning data so that
    each processor finishes at roughly the same time
  • Di = data assigned to processor i
  • Ui = time per unit of data on processor i
  • Ci = time to distribute the data to processor i
  • Di·Ui + Ci = Dj·Uj + Cj for all i, j
  • Σ Di = Dtotal
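Because these equations are linear, the time-balancing system has a closed form: choose the common finish time T so that the data sums to Dtotal, then read off each Di. A sketch with my own variable names:

```python
def time_balance(u, c, d_total):
    """Solve Di*u[i] + c[i] = T for all i, with sum(Di) = d_total.
    u[i]: time per unit of data on processor i
    c[i]: time to distribute the data to processor i
    Returns the (possibly fractional) per-processor data amounts Di."""
    inv_sum = sum(1.0 / ui for ui in u)
    # Common finish time T implied by the total-data constraint:
    t = (d_total + sum(ci / ui for ci, ui in zip(c, u))) / inv_sum
    return [(t - ci) / ui for ci, ui in zip(c, u)]

# A processor that is twice as fast (u=1 vs u=2) gets twice the data:
d = time_balance([1.0, 2.0], [0.0, 0.0], 30)  # -> [20.0, 10.0]
```

A nonzero distribution cost Ci simply shifts data away from that processor: with equal speeds but c = [0, 2], ten units split as [6, 4] so both still finish at T = 6.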

30
Stochastic Time Balancing
  • Adapt time to compute a unit of data (ui) to
    reflect stochastic information
  • Larger ui means smaller Di (less data)
  • If we have normal distributions
  • The 95% confidence interval corresponds to [m - 2sd, m + 2sd]
  • If we set u = m + 2sd
  • We get a 95% conservative schedule

31
Stochastic Time Balancing (cont)
  • Set of equations is now
  • Di(mi + 2sdi) + Ci = Dj(mj + 2sdj) + Cj for all i, j
  • Σ Di = Dtotal

32
How do policies compare in a production environment?
  • 4 contended Sparcs over 10 Mbit shared Ethernet

33
Set of Schedules
34
Tuning factor
  • Tuning factor is the knob to turn to decide how conservative a schedule should be
  • For example, a multiplier that determines the number of standard deviations to add to the mean
  • Let ui = mi + sdi·TF
  • Solve
  • Di(mi + sdi·TF) + Ci = Dj(mj + sdj·TF) + Cj
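Putting the tuning factor into the time-balancing solver gives a one-parameter family of schedules (a sketch with my own function name): TF = 0 reproduces mean-based scheduling, and TF = 2 reproduces the 95% conservative schedule from the earlier slides.

```python
def stochastic_time_balance(means, sds, c, d_total, tf):
    """Time balancing with stochastic unit costs: ui = means[i] + sds[i]*tf.
    A higher tf penalizes high-variance machines, so they receive less data."""
    u = [m + s * tf for m, s in zip(means, sds)]
    inv_sum = sum(1.0 / ui for ui in u)
    t = (d_total + sum(ci / ui for ci, ui in zip(c, u))) / inv_sum
    return [(t - ci) / ui for ci, ui in zip(c, u)]

# Two machines with equal mean cost; the high-variance one loses data as tf grows:
even = stochastic_time_balance([1.0, 1.0], [0.0, 0.5], [0.0, 0.0], 30, 0.0)  # [15.0, 15.0]
safe = stochastic_time_balance([1.0, 1.0], [0.0, 0.5], [0.0, 0.0], 30, 2.0)  # [20.0, 10.0]
```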

35
Extensible approach
  • Don't have to use mean and standard deviation
  • TF can be defined in a variety of ways

36
Defining our stochastic scheduling policy goals
  • Decrease execution time
  • Predictable performance
  • Avoid spikes in execution behavior
  • More conservative when in doubt

37
System of benefits and penalties
  • Based on Sih and Lee's approach to scheduling
  • Benefit (give a less conservative schedule to)
  • Platforms with fewer varying machines
  • Low variance machines, especially those with
    lower power

38
Partial ordering
39
Algorithm for TF
40
Scheduling Experiments
  • Platform:
  • 4 contended PCs running Linux
  • 100 Mbit shared Ethernet connection
  • 3 policies run back to back
  • Mean: Ui based on runtime mean prediction
  • VTF: Ui based on mean and heuristic TF evaluation
  • 95TF: Ui based on 95% confidence interval

41
Metrics
  • Window: Which of each window of three runs has the fastest execution time?
  • Compare: How often was one policy better than, worse than, or split when compared with the policy run just before and just after?
  • What's the right metric?
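The Window metric can be sketched as follows (the data layout is my assumption, not the talk's code): for each window of three back-to-back runs, one per policy, credit the policy with the fastest execution time.

```python
from collections import Counter

def window_metric(windows):
    """windows: list of dicts mapping policy name -> execution time,
    one dict per window of three back-to-back runs.
    Returns how many windows each policy won."""
    wins = Counter()
    for w in windows:
        wins[min(w, key=w.get)] += 1  # fastest policy wins the window
    return wins

wins = window_metric([
    {"Mean": 12.1, "VTF": 11.0, "95TF": 11.5},
    {"Mean": 10.2, "VTF": 10.8, "95TF": 10.5},
])
# VTF wins the first window, Mean the second
```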

42
SOR scheduling 1
  • Window: Mean 9, VTF 27, 95TF 22 (of 57)
  • Compare (vs. adjacent runs):
        Policy   Better   Mixed   Worse
        Mean        3       4      12
        VTF        10       7       3
        95TF        6       9       4

43
CPU performance
44
SOR scheduling 2
  • Window: Mean 8, VTF 39, 95TF 11 (of 57)
  • Compare (vs. adjacent runs):
        Policy   Better   Mixed   Worse
        Mean        3       7       9
        VTF        15       2       3
        95TF        3       8       8

45
CPU
46
Experimental Conclusions
  • Stochastic information was more beneficial when
    there was a higher variability in available CPU
  • Almost always we saw a reduction in variation in
    actual execution times
  • It is unclear at this point which heuristic scheduling policy is better in which situation

47
Summary
  • The ability to parameterize structural models
    with stochastic values in order to meet the
    prediction needs of shared clusters of
    workstations
  • A stochastic scheduling policy that can make use
    of stochastic predictions to achieve better
    execution times and more predictable application
    behavior

48
The Grid
  • What is a Grid?
  • Shared resources
  • Coordinated problem solving
  • Grid Problems
  • Multiple sites (multiple institutions)
  • Autonomy
  • Heterogeneity
  • Focus on the user

49
Scheduling on the Grid
  • Select resources
  • Machines
  • Network
  • Storage Devices (Data replica)
  • Move
  • Data to compute
  • Compute to Data
  • Both

50
Moving Compute to Data
  • Find the fastest data source, move compute
  • Pro
  • Often most efficient (bandwidth is a scarce commodity)
  • Cons
  • Executables are picky (compilers, libraries, etc.)
  • Authentication/authorization/accounting
  • Many minimum parameters: memory, disk, bandwidth, etc.
  • Socio-political

51
Move Data to Compute
  • Select best compute resource, copy data
  • Pro
  • Common model today
  • Sure that application will run
  • Con
  • May not be most efficient (Best compute may be
    far from data)
  • Need to pick best data source for a given compute
    source

52
Data Replication
  • Extremely large data sets
  • Distributed storage sites
  • One file may be available from a number of
    different sources
  • Question: where is the best source for me to copy it from?

53
High Energy Physics Example
Image courtesy H. Newman, Caltech and C.
Kesselman, ISI
54
Data Replica Selection
  • Given a logical file name, the Replica Catalog returns a set of physical file names
  • Which replica should we copy the data from?
  • (What happens if that connection is interrupted?)

55
What do we need to do?
  • Need information
  • Need to make a decision
  • These are intertwined

56
Where does the info come from?
[Diagram: resources (R) register with VO-specific aggregate directories (D)]
57
What info is available?
  • Anything in the MDS
  • Anything we write GRISs for
  • GridFTP
  • NWS

58
How do we make a decision?
  • Depends on the data
  • Sudharshan
  • GridFTP data, NWS data on bandwidth
  • NWS predictors
  • Christopher and Jeff
  • Data transfer data
  • Case-based reasoning techniques

59
Stochastic Predictions
  • Given variance information about
  • Bandwidth
  • Disk access times
  • Example
  • Data transfer from A will take 5-7 minutes
  • Data transfer from B will take 3-9 minutes
  • Which to pick?
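Under the normal-distribution assumption used earlier, the two intervals can be compared quantitatively rather than by mean alone. A sketch (the helper names are mine; I treat each quoted range as a 95% interval, i.e. mean ± 2 sd):

```python
import math

def from_range(lo, hi):
    """Normal (mean, sd) whose mean +/- 2sd interval is [lo, hi]."""
    return ((lo + hi) / 2.0, (hi - lo) / 4.0)

def prob_within(dist, deadline):
    """P(transfer time <= deadline) for a normal (mean, sd), via the
    normal CDF expressed with the error function."""
    m, s = dist
    return 0.5 * (1.0 + math.erf((deadline - m) / (s * math.sqrt(2.0))))

a = from_range(5.0, 7.0)   # source A: 5-7 minutes
b = from_range(3.0, 9.0)   # source B: 3-9 minutes
# Both have mean 6 minutes, but A is far more likely to finish by minute 7,
# so a risk-averse scheduler picks A.
```

The example shows why variance matters: a point-value scheduler sees two identical 6-minute estimates and cannot distinguish the sources at all.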

60
The Bigger Picture
  • We'd like a framework for data replica selection
  • Choose any info available
  • Choose your own algorithm
  • We'd like to combine data replica selection with CPU selection
  • We'd like to make life easier for the application scientist

61
Collaborators
  • The AppLeS group Fran Berman (UCSD), Rich
    Wolski (Univ Tennessee, UCSB)
  • The Northwestern University Parallel Distributed
    (Beer) Lab Team
  • Argonne DSL Students

62
Contact
  • jms@cs.nwu.edu
  • http://www.cs.nwu.edu/jms
  • Funding:
  • NASA GSRP grant NGT-1-52133
  • DARPA Contract N66001-97-C-8521
  • NSF CAREER Grant ACI-0093300