Load Balancing, p1

Transcript and Presenter's Notes

1
CS 294-8: Distributed Load Balancing
http://www.cs.berkeley.edu/~yelick/294
2
Load Balancing
  • Problem: distribute items into buckets
  • Data to memory locations
  • Files to disks
  • Tasks to processors
  • Web pages to caches
  • Goal: even distribution
  • Slides stolen from Karger at MIT
    http://theory.lcs.mit.edu/~karger

3
Load Balancing
  • Enormous and diverse literature on load balancing
  • Computer Science: systems
  • operating systems
  • parallel computing
  • distributed computing
  • Computer Science: theory
  • Operations research (IEOR)
  • Application domains

4
Agenda
  • Overview
  • Load Balancing Data
  • Load Balancing Computation
  • (if there is time)

5
The Web
[Diagram: servers (MIT, CNN, UCB, CMU, USC) serving browsers (clients)]
6
Hot Spots
[Diagram: browsers (clients) and servers (IRAM, OceanStore, UCB, BANE, Telegraph); requests concentrate on a hot server]
7
Temporary Loads
  • For permanent loads, use bigger server
  • Must also deal with flash crowds
  • IBM chess match
  • Florida election tally
  • Inefficient to design for max load
  • Rarely attained
  • Much capacity wasted
  • Better to offload peak load elsewhere

8
Proxy Caches Balance Load
[Diagram: browsers (clients) reaching servers (OceanStore, Telegraph, MIT, CNN, UCB, BANE, IRAM, CMU, USC) through proxy caches]
9
Proxy Caching
  • Old: server hit once for each browser
  • New: server hit once for each page
  • Adapts to changing access patterns

10
Proxy Caching
  • Every server can also be a cache
  • Incentives
  • Provides a social good
  • Reduces load at sites you want to contact
  • Costs you little, if done right
  • Few accesses
  • Small amount of storage (times many servers)

11
Who Caches What?
  • Each cache should hold few items
  • Otherwise swamped by clients
  • Each item should be in few caches
  • Otherwise server swamped by caches
  • And cache invalidates/updates expensive
  • Browser must know right cache
  • Could ask for server to redirect
  • But server gets swamped by redirects

12
Hashing
  • Simple and powerful load balancing
  • Constant time to find bucket for item
  • Example: map to n buckets. Pick a, b
  • y = ax + b (mod n)
  • Intuition: hash maps each item to one random
    bucket
  • No bucket gets many items

13
Problem: Adding Caches
  • Suppose a new cache arrives
  • How to work it into the hash function?
  • Natural change
  • y = ax + b (mod n+1)
  • Problem: changes the bucket for every item
  • Every cache will be flushed
  • Server swamped with new requests
  • Goal: when a bucket is added, few items move (see
    the sketch below)

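A minimal Python sketch (not part of the original deck) of this failure mode; the constants and item keys are made up, but it shows why re-hashing with mod n+1 reshuffles nearly every item.

```python
import random

def make_hash(n, a, b):
    """Map an integer item key to one of n buckets: y = (a*x + b) mod n."""
    return lambda x: (a * x + b) % n

a, b = random.randrange(1, 1_000_003), random.randrange(1_000_003)
old_bucket = make_hash(10, a, b)   # 10 caches
new_bucket = make_hash(11, a, b)   # an 11th cache arrives

items = range(100_000)
moved = sum(old_bucket(x) != new_bucket(x) for x in items)
print(f"{moved / len(items):.1%} of items changed buckets")  # roughly 90%
```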
14
Problem: Inconsistent Views
  • Each client knows about a different set of
    caches: its view
  • View affects choice of cache for item
  • With many views, each cache will be asked for the
    item
  • Item in all caches swamps server
  • Goal: item in few caches despite views

15
Problem: Inconsistent Views
[Diagram: my view numbers the caches 0, 1, 2, 3; item UCB hashes to ax + b (mod 4) = 2]
16
Problem: Inconsistent Views
[Diagram: Joe's view numbers the same caches 0, 3, 1, 2; ax + b (mod 4) = 2 picks a different physical cache]
17
Problem: Inconsistent Views
[Diagram: Sue's view numbers the caches 2, 0, 3, 1; again a different cache is number 2]
18
Problem: Inconsistent Views
[Diagram: Mike's view numbers the caches 1, 2, 0, 3; yet another cache is number 2]
19
Problem: Inconsistent Views
[Diagram: across the four views, item UCB ends up in four different caches, each labeled 2 in some view]
20
Consistent Hashing
  • A new kind of hash function
  • Maps any item to a bucket in my view
  • Computable in constant time, locally
  • 1 standard hash function evaluation
  • Adding a bucket to a view takes log time
  • Logarithmic number of standard hash function
    evaluations
  • Handles incremental and inconsistent views

21
Single View Properties
  • Balance: all buckets get roughly the same number of
    items
  • Smooth: when the kth bucket is added, only a 1/k
    fraction of items move
  • And only from O(log n) servers
  • Minimum needed to preserve balance

22
Multiple View Properties
  • Consider n views, each of an arbitrary constant
    fraction of the buckets
  • Load: the number of items a bucket gets from all
    views is O(log n) times the average
  • Despite views, load balanced
  • Spread: over all views, each item appears in
    O(log n) buckets
  • Despite views, few caches for each item

23
Implementation
  • Use standard hash function H to map items and
    caches to unit circle
  • If H maps to 0..M, divide by M

[Diagram: unit circle with items 1–5 and caches A, B hashed onto it]
  • Map each item to the closest cache (going
    clockwise)
  • A holds 1,2,3
  • B holds 4, 5

24
Implementation
  • To add a new cache
  • Hash the cache id
  • Move items that should be assigned to it
  • Items do not move between A and B
  • A holds 3
  • B holds 4, 5
  • C holds 1, 2

25
Implementation
  • Cache points stored in a pre-computed binary
    tree
  • Lookup for cached item requires
  • Hash of item key (e.g., URL)
  • BST lookup of successor
  • Consistent hashing with n caches requires O(log
    n) time per lookup
  • An alternative that breaks the unit circle into
    equal-length intervals can make this constant time
    (see the sketch below)

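A rough Python sketch of the structure described on slides 23-25, plus the removal step from slide 30. It is an illustration under assumptions, not the original implementation: a sorted list with bisect stands in for the BST (same O(log n) successor search), hashlib.md5 stands in for the standard hash function H, and the cache ids and URL are made up.

```python
import bisect
import hashlib

M = 2 ** 128   # md5 maps keys to 0..M, so divide by M to land on the unit circle

def point(key: str) -> float:
    """Hash a key (cache id or item URL) to a point on the unit circle."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) / M

class ConsistentHash:
    def __init__(self):
        self._points = []     # sorted cache points on the circle
        self._cache_at = {}   # point -> cache id

    def add_cache(self, cache_id: str):
        p = point(cache_id)
        bisect.insort(self._points, p)
        self._cache_at[p] = cache_id

    def remove_cache(self, cache_id: str):
        """Fault tolerance: drop a dead cache; its items fall to the successor."""
        p = point(cache_id)
        self._points.remove(p)
        del self._cache_at[p]

    def lookup(self, item: str) -> str:
        """Map an item to the closest cache going clockwise (its successor)."""
        i = bisect.bisect_right(self._points, point(item))
        if i == len(self._points):
            i = 0             # wrap around the circle
        return self._cache_at[self._points[i]]

ring = ConsistentHash()
ring.add_cache("A")
ring.add_cache("B")
print(ring.lookup("http://www.cnn.com/"))  # A or B
ring.add_cache("C")   # only items between C and its predecessor move
```

Real deployments typically hash each cache to several points on the circle (virtual copies), which is how the balance and spread bounds quoted on slides 21-22 are obtained.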
26
Balance
  • Cache points uniformly distributed by H
  • Each cache owns equal portion of the unit
    circle
  • Item position random by H
  • So each cache gets about the same number of items

27
Smoothness
  • To add kth cache, hash it to circle
  • Captures items between it and nearest cache
  • 1/k fraction of total items
  • Only from 1 other bucket
  • O(log n) to find it, as with lookup

28
Low Spread
  • Some views might not see the nearest cache to an
    item, and will hash it elsewhere
  • But every view will have a bucket near the item
    (on the circle) by random placement
  • So only buckets near the item will ever have to
    hold it
  • Only a few buckets are near the item by random
    placement

29
Low Load
  • Cache only gets item I if no other cache is
    closer to I
  • Under any view, some cache is close to I by
    random placement of caches
  • So a cache only gets items close to it
  • But any given item is unlikely to be close to it
  • So the cache doesn't get many items

30
Fault Tolerance
  • Suppose contacted cache is down
  • Delete it from the view's cache set (BST) and find
    the next closest cache on the circle
  • Just a small change in view
  • Even with many failures, uniform load and other
    properties still hold

31
Experimental Setup
  • Cache Resolver System
  • Cache machines for content
  • Users' browsers that direct requests toward
    virtual caches
  • Resolution units (DNS) that use consistent
    hashing to map virtual caches to physical
    machines
  • Surge web load generator from BU
  • Two modes
  • Common mode (fixed cache for set of clients)
  • Cache resolver mode using consistent hashing

32
Performance
[Graph omitted in the transcript]
33
Summary of Consistent Hashing
  • Trivial to implement
  • Fast to compute
  • Uniformly distributes items
  • Can cheaply add/remove cache
  • Even with multiple views
  • No cache gets too many items
  • Each item in only a few caches

34
Consistent Hashing for Caching
  • Works well
  • Client maps known caches to unit circle
  • When item arrives, hash to cache
  • Server gets O(log n) requests for its own pages
  • Each server can also be a cache
  • Gets small number of requests for others' pages
  • Robust to failures
  • Caches can come and go
  • Different browsers can know different caches

35
Refinement: BW Adaptation
  • Browser bandwidth to machines may vary
  • If bandwidth to server is high, unwilling to use
    lower bandwidth cache
  • Consistently hash item only to caches with
    bandwidth as good as server
  • Theorem: all previous properties still hold
  • Uniform cache loads
  • Low server loads (few caches per item)

36
Refinement: Hot Pages
  • What if one page gets popular?
  • Cache responsible for it gets swamped
  • Use a tree of caches
  • Cache at root gets swamped
  • Use a different tree for each page
  • Build using consistent hashing
  • Balances load for hot pages and hot servers

37
Cache Tree Result
  • Using cache trees of log depth, for any set of
    page accesses, can adaptively balance load such
    that every server gets at most log times the
    average load of the system (browser/server ratio)
  • Modulo some theory caveats

38
Agenda
  • Overview
  • Load Balancing Data
  • Load Balancing Computation
  • (if there is time)

39
Load Balancing Spectrum
  • Task costs
  • Do all tasks have equal costs?
  • If not, when are the costs known?
  • Task dependencies
  • Can all tasks be run in any order (including
    parallel)?
  • If not, when are the dependencies known?
  • Locality
  • Is it important for some tasks to be scheduled on
    the same processor (or nearby)?
  • When is the information about communication
    known?
  • Heterogeneity
  • Are all the machines equally fast?
  • If not, when do we know their performance?

40
Task cost spectrum
41
Task Dependency Spectrum
42
Task Locality Spectrum
43
Machine Heterogeneity Spectrum
  • Easy: All nodes (e.g., processors) are equally
    powerful
  • Harder: Nodes differ, but resources are fixed
  • Different physical characteristics
  • Hardest: Nodes change dynamically
  • Other loads on the system (dynamic)
  • Data layout (inner vs. outer track on disks)

44
Spectrum of Solutions
  • When is load balancing information known?
  • Static scheduling. All information is available
    to scheduling algorithm, which runs before any
    real computation starts. (offline algorithms)
  • Semi-static scheduling. Information may be known
    at program startup, or the beginning of each
    timestep, or at other points. Offline algorithms
    may be used.
  • Dynamic scheduling. Information is not known
    until mid-execution. (online algorithms)

45
Approaches
  • Static load balancing
  • Semi-static load balancing
  • Self-scheduling
  • Distributed task queues
  • Diffusion-based load balancing
  • DAG scheduling
  • Note: these are not all-inclusive, but represent
    some of the problems for which good solutions
    exist.

46
Static Load Balancing
  • Static load balancing is used when all information
    is available in advance, e.g.,
  • dense matrix algorithms, such as LU factorization
  • done using a blocked/cyclic layout (see the sketch
    below)
  • blocked for locality, cyclic for load balance
  • most computations on a regular mesh, e.g., FFT
  • done using a cyclic + transpose + blocked layout for 1D
  • similar for higher dimensions, i.e., with
    transpose
  • explicit methods and iterative methods on an
    unstructured mesh
  • use graph partitioning
  • assumes graph does not change over time (or at
    least within a timestep during iterative solve)

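A small sketch (not from the slides) of the 1D blocked-cyclic idea: rows are dealt out in blocks, round-robin over processors, so each processor holds contiguous blocks (locality) spread across the matrix (load balance as the active part shrinks). The block size and processor count below are only illustrative.

```python
def owner(row: int, block: int, p: int) -> int:
    """Processor that owns a given row in a 1D blocked-cyclic layout."""
    return (row // block) % p

n, block, p = 16, 2, 4
print([owner(i, block, p) for i in range(n)])
# [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]
```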
47
Semi-Static Load Balancing
  • If domain changes slowly over time and locality
    is important
  • Often used in
  • particle simulations, particle-in-cell (PIC)
    methods
  • poor locality may be more of a problem than load
    imbalance as particles move from one grid
    partition to another
  • tree-structured computations (Barnes-Hut, etc.)
  • grid computations with dynamically changing grid,
    which changes slowly

48
Self-Scheduling
  • Self scheduling
  • Keep a pool of tasks that are available to run
  • When a processor completes its current task, it
    looks at the pool (see the sketch below)
  • If the computation of one task generates more
    tasks, add them to the pool
  • Originally used for
  • Scheduling loops by the compiler (really the
    runtime system)
  • Original paper by Tang and Yew, ICPP 1986

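A minimal sketch (not from the slides) of centralized self-scheduling on a shared-memory machine, assuming Python threads and a shared queue; the task contents and the spawning rule are made up.

```python
import queue
import threading

pool = queue.Queue()
for t in range(20):
    pool.put(t)                        # initial batch of independent tasks

def worker():
    while True:
        try:
            task = pool.get_nowait()   # grab the next available task
        except queue.Empty:
            return                     # pool drained: this worker stops
        if task % 7 == 0 and task > 0:
            pool.put(task + 100)       # a task may generate more tasks

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```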
49
When to Use Self-Scheduling
  • Useful when
  • A batch (or set) of tasks without dependencies
  • can also be used with dependencies, but most
    analysis has only been done for task sets without
    dependencies
  • The cost of each task is unknown
  • Locality is not important
  • Using a shared memory multiprocessor, so a
    centralized solution is fine

50
Variations on Self-Scheduling
  • Typically, we don't want to grab the smallest unit
    of parallel work.
  • Instead, choose a chunk of tasks of size K.
  • If K is large, access overhead for task queue is
    small
  • If K is small, we are likely to have even finish
    times (load balance)
  • Variations
  • Use a fixed chunk size
  • Guided self-scheduling
  • Tapering
  • Weighted Factoring
  • Note: there are more

51
V1: Fixed Chunk Size
  • Kruskal and Weiss give a technique for computing
    the optimal chunk size
  • Requires a lot of information about the problem
    characteristics
  • e.g., task costs, number
  • Results in an off-line algorithm. Not very
    useful in practice.
  • For use in a compiler, for example, the compiler
    would have to estimate the cost of each task
  • All tasks must be known in advance

52
V2: Guided Self-Scheduling
  • Idea: use larger chunks at the beginning to avoid
    excessive overhead and smaller chunks near the
    end to even out the finish times.
  • The chunk size Ki at the ith access to the task
    pool is given by Ki = ceiling(Ri / p), where Ri is
    the total number of tasks remaining and p is the
    number of processors
  • See Polychronopoulos, Guided Self-Scheduling: A
    Practical Scheduling Scheme for Parallel
    Supercomputers, IEEE Transactions on Computers,
    Dec. 1987.

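The chunk-size rule from this slide, Ki = ceiling(Ri / p), played out in a short Python sketch for 100 tasks and 4 processors (the numbers are only illustrative): chunks start large and shrink toward the end.

```python
from math import ceil

def gss_chunks(total_tasks: int, p: int):
    """Return the sequence of chunk sizes handed out by GSS."""
    remaining, chunks = total_tasks, []
    while remaining > 0:
        k = ceil(remaining / p)   # Ki = ceiling(Ri / p)
        chunks.append(k)
        remaining -= k
    return chunks

print(gss_chunks(100, 4))
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
```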
53
V3: Tapering
  • Idea: the chunk size Ki is a function of not
    only the remaining work, but also the task cost
    variance
  • variance is estimated using history information
  • high variance => a smaller chunk size should be used
  • low variance => larger chunks are OK
  • See S. Lucco, Adaptive Parallel Programs, PhD
    Thesis, UCB, CSD-95-864, 1994.
  • Gives analysis (based on workload distribution)
  • Also gives experimental results -- tapering
    always works at least as well as GSS, although
    difference is often small

54
V4: Weighted Factoring
  • Idea: similar to self-scheduling, but divide task
    cost by computational power of requesting node
  • Useful for heterogeneous systems
  • Also useful for shared-resource NOWs (networks of
    workstations), e.g., built using all the machines
    in a building
  • as with Tapering, historical information is used
    to predict future speed
  • speed may depend on the other loads currently
    on a given processor
  • See Hummel, Schmidt, Uma, and Wein, SPAA 96
  • includes experimental data and analysis

55
V5: Distributed Task Queues
  • The obvious extension of self-scheduling to
    distributed memory is
  • a distributed task queue (or bag)
  • When are these a good idea?
  • Distributed memory multiprocessors
  • Or, shared memory with significant
    synchronization overhead
  • Locality is not (very) important
  • Tasks may be
  • known in advance, e.g., a bag of independent ones
  • or generated on the fly, with dependencies between
    them
  • The costs of tasks are not known in advance

56
Theory of Distributed Queues
  • Main result: A simple randomized algorithm is
    optimal with high probability
  • Karp and Zhang 88 show this for a tree of unit
    cost (equal size) tasks
  • Chakrabarti et al 94 show this for a tree of
    variable cost tasks
  • using randomized pushing of tasks
  • Blumofe and Leiserson 94 show this for a fixed
    task tree of variable cost tasks
  • uses task pulling (stealing), which is better for
    locality
  • Also have (loose) bounds on the total memory
    required

57
Engineering Distributed Queues
  • A lot of papers on engineering these systems on
    various machines, and their applications
  • If nothing is known about task costs when created
  • organize local tasks as a stack (push/pop from the
    top)
  • steal from the stack bottom (as if it were a
    queue); see the sketch below
  • If something is known about task costs and
    communication costs, it can be used as hints. (See
    Wen, UCB PhD, 1996.)
  • Goldstein, Rogers, Grunwald, and others
    (independent work) have all shown
  • advantages of integrating into the language
    framework
  • very lightweight thread creation

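A minimal sketch (not from the slides) of the local-stack, steal-from-the-bottom discipline, assuming Python; a real distributed queue would also need synchronization (locks or a lock-free deque) and remote communication, which are omitted here.

```python
from collections import deque

class WorkStealingQueue:
    def __init__(self):
        self._tasks = deque()      # bottom ... top

    def push(self, task):
        self._tasks.append(task)   # owner pushes new tasks on top

    def pop(self):
        """Owner pops its most recent (top) task: good for locality."""
        return self._tasks.pop() if self._tasks else None

    def steal(self):
        """A thief takes the oldest (bottom) task, as if it were a queue."""
        return self._tasks.popleft() if self._tasks else None

q = WorkStealingQueue()
for t in ("t1", "t2", "t3"):
    q.push(t)
print(q.pop())    # t3 -- owner works on its newest task
print(q.steal())  # t1 -- a thief takes from the bottom
```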
58
Diffusion-Based Load Balancing
  • In the randomized schemes, the machine is treated
    as fully-connected.
  • Diffusion-based load balancing takes topology
    into account
  • Locality properties better than prior work
  • Load balancing somewhat slower than randomized
  • Cost of tasks must be known at creation time
  • No dependencies between tasks

59
Diffusion-based load balancing
  • The machine is modeled as a graph
  • At each step, we compute the weight of the tasks
    remaining on each processor
  • This is simply the number of tasks if they are
    unit-cost tasks
  • Each processor compares its weight with its
    neighbors' and performs some averaging (see the
    sketch below)
  • See Ghosh et al, SPAA96 for a second order
    diffusive load balancing algorithm
  • takes into account amount of work sent last time
  • avoids some oscillation of first order schemes
  • Note: locality is not directly addressed

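A first-order diffusion sketch (not from the slides), assuming Python, unit-cost tasks, and a made-up 4-node ring topology; each step a processor exchanges a fraction alpha of the load difference with every neighbor in the machine graph.

```python
def diffuse(load, neighbors, alpha=0.25, steps=10):
    """load: work per processor; neighbors: adjacency lists of the machine graph."""
    for _ in range(steps):
        new = list(load)
        for u, nbrs in enumerate(neighbors):
            for v in nbrs:
                new[u] += alpha * (load[v] - load[u])  # send or receive work
        load = new
    return load

# A 4-node ring with all the work initially on processor 0.
ring = [[1, 3], [0, 2], [1, 3], [2, 0]]
print(diffuse([100, 0, 0, 0], ring))  # loads head toward 25 on every node
```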
60
DAG Scheduling
  • Some problems involve a DAG of tasks
  • nodes represent computation (may be weighted)
  • edges represent orderings and usually
    communication (may also be weighted)
  • Two application domains
  • Digital Signal Processing computations
  • Sparse direct solvers (mainly Cholesky, since it
    doesn't require pivoting)
  • The basic strategy: partition the DAG to minimize
    communication and keep all processors busy
  • NP-complete
  • See Gerasoulis and Yang, IEEE Transactions on
    PDS, June 93.

61
Heterogeneous Machines
  • Diffusion-based load balancing for heterogeneous
    environment
  • Fizzano, Karger, Stein, Wein
  • Graduated declustering
  • Remzi Arpaci-Dusseau et al
  • And more