Title: Load Balancing, p1
1 CS 294-8: Distributed Load Balancing
http://www.cs.berkeley.edu/yelick/294
2 Load Balancing
- Problem: distribute items into buckets
- Data to memory locations
- Files to disks
- Tasks to processors
- Web pages to caches
- Goal: even distribution
- Slides stolen from Karger at MIT
http://theory.lcs.mit.edu/karger
3 Load Balancing
- Enormous and diverse literature on load balancing
- Computer Science systems
- operating systems
- parallel computing
- distributed computing
- Computer Science theory
- Operations research (IEOR)
- Application domains
4 Agenda
- Overview
- Load Balancing Data
- Load Balancing Computation
- (if there is time)
5 The Web
[Figure: browsers (clients) sending requests to web servers such as MIT, CNN, UCB, CMU, and USC]
6 Hot Spots
[Figure: many browsers (clients) converging on one server; servers shown include UCB, IRAM, OceanStore, BANE, and Telegraph]
7 Temporary Loads
- For permanent loads, use a bigger server
- Must also deal with flash crowds
- IBM chess match
- Florida election tally
- Inefficient to design for max load
- Rarely attained
- Much capacity wasted
- Better to offload peak load elsewhere
8 Proxy Caches Balance Load
[Figure: browsers (clients) reaching servers (MIT, CNN, UCB, CMU, USC, IRAM, OceanStore, BANE, Telegraph) through intermediate proxy caches]
9 Proxy Caching
- Old: server hit once for each browser
- New: server hit once for each page
- Adapts to changing access patterns
10 Proxy Caching
- Every server can also be a cache
- Incentives
- Provides a social good
- Reduces load at sites you want to contact
- Costs you little, if done right
- Few accesses
- Small amount of storage (times many servers)
11 Who Caches What?
- Each cache should hold few items
- Otherwise swamped by clients
- Each item should be in few caches
- Otherwise server swamped by caches
- And cache invalidates/updates expensive
- Browser must know the right cache
- Could ask the server to redirect
- But the server gets swamped by redirects
12 Hashing
- Simple and powerful load balancing
- Constant time to find bucket for item
- Example: map to n buckets. Pick a, b (see the sketch after this slide)
- y = ax + b (mod n)
- Intuition: the hash maps each item to one random bucket
- No bucket gets many items
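A minimal sketch of this bucket-assignment scheme; the constants a and b, the bucket count, and the use of Python's built-in hash() are illustrative choices, not from the slides:

```python
# Affine hashing y = ax + b (mod n): constant time, roughly even spread.
def make_hash(a: int, b: int, n: int):
    """Return a function mapping an item to one of n buckets."""
    def h(item) -> int:
        x = hash(item)          # any integer key derived from the item
        return (a * x + b) % n  # affine map into n buckets
    return h

h = make_hash(a=31, b=17, n=8)
buckets = {}
for url in ["cnn.com", "mit.edu", "berkeley.edu", "usc.edu"]:
    buckets.setdefault(h(url), []).append(url)
print(buckets)  # each bucket ends up with few items
```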
13 Problem: Adding Caches
- Suppose a new cache arrives
- How do we work it into the hash function?
- Natural change:
- y = ax + b (mod n+1)
- Problem: changes the bucket for nearly every item (see the sketch after this slide)
- Every cache will be flushed
- Server swamped with new requests
- Goal: when adding a bucket, few items move
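A small sketch of why the natural fix is bad: re-hashing mod n+1 moves nearly all items, so every cache is effectively flushed. The constants are illustrative.

```python
# Count how many items change bucket when the modulus grows from n to n+1.
a, b, n = 31, 17, 8
items = range(10_000)
old = [(a * x + b) % n for x in items]
new = [(a * x + b) % (n + 1) for x in items]
moved = sum(o != m for o, m in zip(old, new))
print(f"{moved / len(items):.0%} of items changed bucket")  # about n/(n+1), i.e. ~89% here
```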
14 Problem: Inconsistent Views
- Each client knows about a different set of caches: its view
- The view affects the choice of cache for an item
- With many views, each cache will be asked for the item
- Item in all caches swamps the server
- Goal: item in few caches despite views
15 Problem: Inconsistent Views
[Figure: my view of caches 0-3; UCB hashes to ax + b (mod 4) = 2]
16 Problem: Inconsistent Views
[Figure: Joe's view numbers the same caches differently; UCB still hashes to cache 2]
17 Problem: Inconsistent Views
[Figure: Sue's view numbers the caches differently again; UCB still hashes to cache 2]
18 Problem: Inconsistent Views
[Figure: Mike's view, yet another numbering; UCB still hashes to cache 2]
19 Problem: Inconsistent Views
[Figure: since "cache 2" is a different physical cache in each view, UCB ends up in every cache]
20 Consistent Hashing
- A new kind of hash function
- Maps any item to a bucket in my view
- Computable in constant time, locally
- Uses 1 standard hash function
- Adding a bucket to the view takes log time
- A logarithmic number of standard hash function evaluations
- Handles incremental and inconsistent views
21 Single View Properties
- Balance: all buckets get roughly the same number of items
- Smooth: when the kth bucket is added, only a 1/k fraction of items move
- And only from O(log n) servers
- Minimum needed to preserve balance
22 Multiple View Properties
- Consider n views, each of an arbitrary constant fraction of the buckets
- Load: the number of items a bucket gets from all views is O(log n) times the average
- Despite views, load is balanced
- Spread: over all views, each item appears in O(log n) buckets
- Despite views, few caches hold each item
23 Implementation
- Use a standard hash function H to map items and caches to the unit circle
- If H maps to 0..M, divide by M
[Figure: unit circle with caches A and B and items 1-5 placed by H]
- Map each item to the closest cache (going clockwise)
- A holds 1, 2, 3
- B holds 4, 5
24 Implementation
- To add a new cache
- Hash the cache id
- Move items that should be assigned to it
- Items do not move between A and B
- A holds 3
- B holds 4, 5
- C holds 1, 2
25 Implementation
- Cache points stored in a pre-computed binary tree
- Lookup for a cached item requires
- Hash of the item key (e.g., URL)
- BST lookup of the successor
- Consistent hashing with n caches requires O(log n) time
- An alternative that breaks the unit circle into equal-length intervals can make this constant time
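A minimal sketch of this implementation, assuming MD5 as the standard hash H and Python's bisect over a sorted list as the successor search (standing in for the BST). The class and method names are illustrative, not from the slides:

```python
import bisect
import hashlib

def H(key: str) -> float:
    """Map a string onto the unit circle [0, 1)."""
    M = 2 ** 128
    return int(hashlib.md5(key.encode()).hexdigest(), 16) / M

class ConsistentHash:
    def __init__(self):
        self._points = []   # sorted cache positions on the circle
        self._caches = {}   # position -> cache id

    def add_cache(self, cache_id: str) -> None:
        p = H(cache_id)
        bisect.insort(self._points, p)   # O(log n) search (plus the list insert)
        self._caches[p] = cache_id

    def remove_cache(self, cache_id: str) -> None:
        p = H(cache_id)
        self._points.remove(p)
        del self._caches[p]

    def lookup(self, item: str) -> str:
        """Return the first cache clockwise from the item's position."""
        p = H(item)
        i = bisect.bisect_right(self._points, p) % len(self._points)
        return self._caches[self._points[i]]

ring = ConsistentHash()
ring.add_cache("cacheA")
ring.add_cache("cacheB")
print(ring.lookup("http://www.cnn.com/index.html"))
ring.add_cache("cacheC")   # only items between cacheC and its predecessor move
```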
26 Balance
- Cache points are uniformly distributed by H
- Each cache owns a roughly equal portion of the unit circle
- Item positions are also random under H
- So each cache gets about the same number of items
27 Smoothness
- To add kth cache, hash it to circle
- Captures items between it and nearest cache
- 1/k fraction of total items
- Only from 1 other bucket
- O(log n) time to find it, as with lookup
28 Low Spread
- Some views might not see the nearest cache to an item and will hash it elsewhere
- But every view will have a bucket near the item (on the circle), by random placement
- So only buckets near the item will ever have to hold it
- And only a few buckets are near the item, by random placement
29 Low Load
- A cache only gets item I if no other cache is closer to I
- Under any view, some cache is close to I, by random placement of caches
- So a cache only gets items close to it
- But any given item is unlikely to be close
- So a cache doesn't get many items
30 Fault Tolerance
- Suppose the contacted cache is down
- Delete it from the cache-set view (BST) and find the next closest cache in the interval (see the sketch after this slide)
- Just a small change in the view
- Even with many failures, uniform load and the other properties still hold
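A small usage sketch of this failover, building on the ConsistentHash sketch from slide 25. It assumes at least one cache is alive, and is_alive is an assumed health check (e.g., a timed-out request), not part of the slides:

```python
def lookup_with_failover(ring, item, is_alive):
    """Skip over failed caches: drop them from the local view and look up again."""
    cache = ring.lookup(item)
    while not is_alive(cache):       # assumed health check
        ring.remove_cache(cache)     # small change to this client's view only
        cache = ring.lookup(item)    # next closest cache clockwise
    return cache
```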
31 Experimental Setup
- Cache Resolver system
- Cache machines for content
- Users' browsers that direct requests toward virtual caches
- Resolution units (DNS) that use consistent hashing to map virtual caches to physical machines
- Surge web load generator from BU
- Two modes
- Common mode (fixed cache for a set of clients)
- Cache Resolver mode using consistent hashing
32 Performance
33 Summary of Consistent Hashing
- Trivial to implement
- Fast to compute
- Uniformly distributes items
- Can cheaply add/remove cache
- Even with multiple views
- No cache gets too many items
- Each item in only a few caches
34 Consistent Hashing for Caching
- Works well
- Client maps known caches to unit circle
- When item arrives, hash to cache
- Server gets O(log n) requests for its own pages
- Each server can also be a cache
- Gets a small number of requests for others' pages
- Robust to failures
- Caches can come and go
- Different browsers can know different caches
35 Refinement: Bandwidth Adaptation
- Browser bandwidth to machines may vary
- If bandwidth to the server is high, a browser is unwilling to use a lower-bandwidth cache
- Consistently hash the item only to caches with bandwidth as good as the server's (see the sketch after this slide)
- Theorem: all previous properties still hold
- Uniform cache loads
- Low server loads (few caches per item)
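A minimal sketch of this refinement, reusing the ConsistentHash sketch from slide 25; bandwidth_to is an assumed measurement function (e.g., estimated from history), not a real API:

```python
def pick_cache(item, server, caches, bandwidth_to):
    """Hash the item only over caches at least as fast (for this browser) as the server."""
    eligible = [c for c in caches if bandwidth_to(c) >= bandwidth_to(server)]
    if not eligible:
        return server                # no cache is as fast: contact the server directly
    ring = ConsistentHash()
    for c in eligible:
        ring.add_cache(c)
    return ring.lookup(item)
```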
36 Refinement: Hot Pages
- What if one page gets popular?
- The cache responsible for it gets swamped
- Use a tree of caches?
- Then the cache at the root gets swamped
- Use a different tree for each page
- Build using consistent hashing
- Balances load for hot pages and hot servers
37 Cache Tree Result
- Using cache trees of log depth, for any set of page accesses, load can be adaptively balanced so that every server gets at most log times the average load of the system (browser/server ratio)
- Modulo some theory caveats
38 Agenda
- Overview
- Load Balancing Data
- Load Balancing Computation
- (if there is time)
39 Load Balancing Spectrum
- Task costs
- Do all tasks have equal costs?
- If not, when are the costs known?
- Task dependencies
- Can all tasks be run in any order (including in parallel)?
- If not, when are the dependencies known?
- Locality
- Is it important for some tasks to be scheduled on the same processor (or nearby)?
- When is the information about communication known?
- Heterogeneity
- Are all the machines equally fast?
- If not, when do we know their performance?
40 Task Cost Spectrum
41 Task Dependency Spectrum
42 Task Locality Spectrum
43 Machine Heterogeneity Spectrum
- Easy: all nodes (e.g., processors) are equally powerful
- Harder: nodes differ, but resources are fixed
- Different physical characteristics
- Hardest: nodes change dynamically
- Other loads on the system (dynamic)
- Data layout (inner vs. outer track on disks)
44 Spectrum of Solutions
- When is the load balancing information known?
- Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (offline algorithms)
- Semi-static scheduling. Information may be known at program startup, or at the beginning of each timestep, or at other points. Offline algorithms may be used.
- Dynamic scheduling. Information is not known until mid-execution. (online algorithms)
45 Approaches
- Static load balancing
- Semi-static load balancing
- Self-scheduling
- Distributed task queues
- Diffusion-based load balancing
- DAG scheduling
- Note: these are not all-inclusive, but represent some of the problems for which good solutions exist.
46 Static Load Balancing
- Static load balancing is used when all information is available in advance, e.g.,
- dense matrix algorithms, such as LU factorization
- done using a blocked/cyclic layout (see the sketch after this slide)
- blocked for locality, cyclic for load balance
- most computations on a regular mesh, e.g., FFT
- done using a cyclic + transpose + blocked layout for 1D
- similar for higher dimensions, i.e., with transposes
- explicit methods and iterative methods on an unstructured mesh
- use graph partitioning
- assumes the graph does not change over time (or at least not within a timestep during an iterative solve)
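A minimal sketch of the owner maps behind these layouts, using matrix columns as the unit of work; n, p, and the block size b are illustrative:

```python
def blocked_owner(j: int, n: int, p: int) -> int:
    """Columns 0..n-1 split into p contiguous blocks: good locality, poor balance for LU."""
    return j // -(-n // p)              # ceil(n / p) columns per processor

def cyclic_owner(j: int, p: int) -> int:
    """Columns dealt out round-robin: good load balance as the active submatrix shrinks."""
    return j % p

def block_cyclic_owner(j: int, p: int, b: int) -> int:
    """Blocks of b columns dealt out round-robin: the usual compromise."""
    return (j // b) % p

n, p, b = 16, 4, 2
print([blocked_owner(j, n, p) for j in range(n)])      # [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3]
print([cyclic_owner(j, p) for j in range(n)])          # [0,1,2,3, 0,1,2,3, ...]
print([block_cyclic_owner(j, p, b) for j in range(n)]) # [0,0, 1,1, 2,2, 3,3, 0,0, ...]
```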
47 Semi-Static Load Balancing
- Used if the domain changes slowly over time and locality is important
- Often used in
- particle simulations, particle-in-cell (PIC) methods
- poor locality may be more of a problem than load imbalance as particles move from one grid partition to another
- tree-structured computations (Barnes-Hut, etc.)
- grid computations with a dynamically changing grid that changes slowly
48 Self-Scheduling
- Self-scheduling (see the sketch after this slide):
- Keep a pool of tasks that are available to run
- When a processor completes its current task, it looks at the pool
- If the computation of one task generates more tasks, add them to the pool
- Originally used for
- Scheduling loops by the compiler (really the runtime system)
- Original paper by Tang and Yew, ICPP 1986
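A minimal sketch of such a shared task pool for a shared-memory machine; run_task is an assumed user function that returns an iterable of newly generated tasks (possibly empty):

```python
import queue
import threading

def self_schedule(tasks, num_workers, run_task):
    """Run tasks from a central pool; run_task(t) returns any new tasks it creates."""
    pool = queue.Queue()
    for t in tasks:
        pool.put(t)

    def worker():
        while True:
            t = pool.get()
            if t is None:                # sentinel: no more work
                pool.task_done()
                return
            for new_t in run_task(t):    # a task may generate more tasks
                pool.put(new_t)
            pool.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    pool.join()                          # wait until every task has been processed
    for _ in threads:
        pool.put(None)                   # release the workers
    for th in threads:
        th.join()
```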
49 When to Use Self-Scheduling
- Useful when
- There is a batch (or set) of tasks without dependencies
- can also be used with dependencies, but most analysis has only been done for task sets without dependencies
- The cost of each task is unknown
- Locality is not important
- Using a shared memory multiprocessor, so a centralized solution is fine
50 Variations on Self-Scheduling
- Typically, you don't want to grab the smallest unit of parallel work.
- Instead, choose a chunk of tasks of size K.
- If K is large, the access overhead for the task queue is small
- If K is small, we are likely to have even finish times (load balance)
- Variations
- Use a fixed chunk size
- Guided self-scheduling
- Tapering
- Weighted factoring
- Note: there are more
51 V1: Fixed Chunk Size
- Kruskal and Weiss give a technique for computing the optimal chunk size
- Requires a lot of information about the problem characteristics
- e.g., task costs and their number
- Results in an off-line algorithm; not very useful in practice
- For use in a compiler, for example, the compiler would have to estimate the cost of each task
- All tasks must be known in advance
52 V2: Guided Self-Scheduling
- Idea: use larger chunks at the beginning to avoid excessive overhead, and smaller chunks near the end to even out the finish times.
- The chunk size Ki at the ith access to the task pool is given by
- Ki = ceiling(Ri/p)
- where Ri is the total number of tasks remaining and
- p is the number of processors
- (a small sketch follows this slide)
- See Polychronopoulos, "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers", IEEE Transactions on Computers, Dec. 1987.
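A small sketch of the resulting chunk-size schedule; the task count and processor count are illustrative:

```python
import math

def gss_chunks(num_tasks: int, p: int):
    """Yield the chunk taken at each visit to the task pool: ceil(remaining / p)."""
    remaining = num_tasks
    while remaining > 0:
        k = math.ceil(remaining / p)
        yield k
        remaining -= k

print(list(gss_chunks(100, 4)))
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]: big chunks early, size-1 chunks at the end
```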
53 V3: Tapering
- Idea: the chunk size Ki is a function of not only the remaining work, but also the task cost variance
- variance is estimated using history information
- high variance => a small chunk size should be used
- low variance => larger chunks are OK
- See S. Lucco, "Adaptive Parallel Programs", PhD Thesis, UCB, CSD-95-864, 1994.
- Gives analysis (based on workload distribution)
- Also gives experimental results -- tapering always works at least as well as GSS, although the difference is often small
54 V4: Weighted Factoring
- Idea: similar to self-scheduling, but divide the task cost by the computational power of the requesting node (see the sketch after this slide)
- Useful for heterogeneous systems
- Also useful for shared-resource NOWs, e.g., built using all the machines in a building
- as with tapering, historical information is used to predict future speed
- speed may depend on the other loads currently on a given processor
- See Hummel, Schmidt, Uma, and Wein, SPAA 96
- includes experimental data and analysis
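A minimal sketch of the general idea only (chunks scaled by each node's estimated speed), not the exact formula from the SPAA 96 paper; the speeds and the reserve factor are illustrative assumptions:

```python
import math

def weighted_chunk(remaining: int, speeds: dict, node: str) -> int:
    """Chunk for `node`, proportional to its share of the total estimated speed."""
    share = speeds[node] / sum(speeds.values())
    return max(1, math.ceil(remaining * share / 2))  # /2 holds some work back for balance

speeds = {"fast": 4.0, "slow": 1.0}         # e.g., predicted from recent history
print(weighted_chunk(100, speeds, "fast"))  # 40
print(weighted_chunk(100, speeds, "slow"))  # 10
```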
55 V5: Distributed Task Queues
- The obvious extension of self-scheduling to distributed memory is
- a distributed task queue (or bag)
- When are these a good idea?
- Distributed memory multiprocessors
- Or shared memory with significant synchronization overhead
- Locality is not (very) important
- Tasks may be
- known in advance, e.g., a bag of independent ones
- or created on the fly because of dependencies
- The costs of tasks are not known in advance
56 Theory of Distributed Queues
- Main result: a simple randomized algorithm is optimal with high probability
- Karp and Zhang 88 show this for a tree of unit-cost (equal size) tasks
- Chakrabarti et al 94 show this for a tree of variable-cost tasks
- using randomized pushing of tasks
- Blumofe and Leiserson 94 show this for a fixed task tree of variable-cost tasks
- uses task pulling (stealing), which is better for locality
- Also gives (loose) bounds on the total memory required
57 Engineering Distributed Queues
- A lot of papers on engineering these systems on various machines, and on their applications
- If nothing is known about task costs when tasks are created
- organize local tasks as a stack (push/pop from the top)
- steal from the bottom of the stack (as if it were a queue); see the sketch after this slide
- If something is known about task costs and communication costs, it can be used as hints (see Wen, UCB PhD, 1996)
- Goldstein, Rogers, Grunwald, and others (independent work) have all shown
- advantages of integrating into the language framework
- very lightweight thread creation
58 Diffusion-Based Load Balancing
- In the randomized schemes, the machine is treated as fully connected
- Diffusion-based load balancing takes topology into account
- Locality properties are better than in the prior work
- Load balancing is somewhat slower than the randomized schemes
- The cost of tasks must be known at creation time
- No dependencies between tasks
59 Diffusion-Based Load Balancing
- The machine is modeled as a graph
- At each step, we compute the weight of tasks remaining on each processor
- This is simply the number of tasks if they are unit cost
- Each processor compares its weight with its neighbors' and performs some averaging (see the sketch after this slide)
- See Ghosh et al, SPAA 96 for a second-order diffusive load balancing algorithm
- takes into account the amount of work sent last time
- avoids some oscillation of first-order schemes
- Note: locality is not directly addressed
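A minimal sketch of first-order diffusive averaging on a processor graph; the ring topology, the diffusion coefficient alpha, and the unit-cost task counts are illustrative:

```python
def diffuse(weights, neighbors, alpha=0.25, steps=50):
    """weights[i]: load on processor i; neighbors[i]: processors adjacent to i."""
    w = list(weights)
    for _ in range(steps):
        delta = [0.0] * len(w)
        for i in range(len(w)):
            for j in neighbors[i]:
                delta[i] += alpha * (w[j] - w[i])  # averaging with each neighbor
        w = [wi + di for wi, di in zip(w, delta)]
    return w

# 4 processors in a ring, all work initially on processor 0
print(diffuse([100, 0, 0, 0], {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}))
# loads converge toward 25 on every processor
```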
60 DAG Scheduling
- Some problems involve a DAG of tasks
- nodes represent computation (may be weighted)
- edges represent orderings and usually communication (may also be weighted)
- Two application domains
- Digital Signal Processing computations
- Sparse direct solvers (mainly Cholesky, since it doesn't require pivoting)
- The basic strategy: partition the DAG to minimize communication and keep all processors busy
- NP-complete
- See Gerasoulis and Yang, IEEE Transactions on PDS, June 93.
61 Heterogeneous Machines
- Diffusion-based load balancing for heterogeneous environments
- Fizzano, Karger, Stein, Wein
- Graduated declustering
- Remzi Arpaci-Dusseau et al
- And more