Title: Load Balancing Part 2: Static Load Balancing
1. Load Balancing Part 2: Static Load Balancing
- Kathy Yelick
- yelick_at_cs.berkeley.edu
- www.cs.berkeley.edu/yelick/cs194f07
2. Load Balancing Overview
- Load balancing differs with properties of the tasks (chunks of work)
- Task costs
  - Do all tasks have equal costs?
  - If not, when are the costs known: before starting, when the task is created, or only when the task ends?
- Task dependencies
  - Can all tasks be run in any order (including in parallel)?
  - If not, when are the dependencies known: before starting, when the task is created, or only when the task ends?
- Locality
  - Is it important for some tasks to be scheduled on the same processor (or nearby) to reduce communication cost?
  - When is the information about communication known?
3. Task Cost Spectrum
4. Task Dependency Spectrum
5. Task Locality Spectrum (Communication)
6. Spectrum of Solutions
- A key question is when certain information about the load balancing problem is known
- The many combinations of answers lead to a spectrum of solutions:
- Static scheduling: all information is available to the scheduling algorithm, which runs before any real computation starts
  - Off-line algorithms make decisions before execution time
- Semi-static scheduling: information may be known at program startup, at the beginning of each timestep, or at other well-defined points
  - Off-line algorithms may be used between major steps
- Dynamic scheduling: information is not known until mid-execution
  - On-line algorithms make decisions mid-execution
7. Solutions for Specific Problems
- For the solutions we have so far, locality is not considered, i.e., the techniques do not optimize for it
- Loops with independent iterations
- Divide-and-conquer problems with little/no communication (a bound may be communicated in branch-and-bound search)
- Computationally intensive tasks like matrix multiply
                          | Equal cost tasks | Unequal, but known cost | Unpredictable cost
Unordered bag of tasks    | Trivial          | Bin packing             | Self-scheduling
Task tree (unknown shape) | Work stealing    | Work stealing           | Work stealing
Task graph (DAG)          | ?                | ?                       | ?
8. Solutions for Specific Problems
- If locality is important then we may need other solutions
- Two cases
  - Task bag (independent tasks) that need to communicate -> run on the same processor (serialize) or nearby
  - Task graph (dependencies) -> if two dependent tasks need to share data, try to schedule them on the same processor
                          | Equal cost tasks | Unequal, but known cost | Unpredictable cost
Unordered bag of tasks    | Minimize surface-to-volume ratio: array decomposition or graph partition (all three cost cases)
Task tree                 | Treat as a general DAG if locality is really critical (all three cost cases)
Task graph (DAG)          | General scheduling problem (all three cost cases)
9. Regular Meshes (e.g., Game of Life)
- Independent tasks (a bag, not a DAG or tree)
- Load balancing -> equal size partitions
- Locality -> minimize the perimeter using low aspect ratio partitions
  - Will hopefully also reduce cache misses and false sharing
- 2D block partition: 2n(p^(1/2) - 1) edge crossings
- 1D strip partition: n(p - 1) edge crossings
- (A short derivation of these two counts follows below.)
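To make the locality comparison concrete, here is the standard counting argument behind the two edge-crossing figures for an n x n mesh split into p parts (a derivation added for completeness, not text from the slide):

    % 1D strips: p - 1 internal boundaries, each crossed by n mesh edges:
    \text{strips:}\quad n\,(p-1) \ \text{edge crossings}
    % 2D blocks: a sqrt(p)-by-sqrt(p) arrangement of blocks has sqrt(p) - 1
    % internal boundary lines in each of the two directions, each crossed
    % by n mesh edges:
    \text{blocks:}\quad 2n\,(\sqrt{p}-1) \ \text{edge crossings}

For large p the block partition therefore cuts roughly 2n*sqrt(p) edges versus n*p for strips, which is why the low aspect ratio partition communicates less.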
10. Irregular Communication Patterns
- A task interaction graph shows which tasks communicate/share data with others
- May be weighted by the volume of data shared
  - If the data is constant, it may be replicated and doesn't count
- The task interaction graph for the Game of Life is a regular 2D mesh
- For animations, simulations of complex structures, etc., unstructured meshes are used instead
11. Definition of Graph Partitioning
- Given a graph G = (N, E, W_N, W_E)
  - N = nodes (or vertices)
  - W_N = node weights
  - E = edges
  - W_E = edge weights
- Example: N = {tasks}, W_N = task costs, edge (j,k) in E means task j sends W_E(j,k) words to task k
- Choose a partition N = N1 U N2 U ... U NP such that
  - The sum of the node weights in each Nj is about the same
  - The sum of the weights of all edges connecting different pairs Nj and Nk is minimized
- Example: balance the work load while minimizing communication
- Special case N = N1 U N2: graph bisection
(Figure: example graph with eight weighted nodes and weighted edges)
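Stated as an optimization problem, the definition above can be written as follows (a standard formalization consistent with the slide, not text taken from it):

    % Minimize the total weight of edges cut by the partition, subject to
    % (approximately) balanced node weight in every part.
    \min_{N = N_1 \cup \cdots \cup N_P}\;
      \sum_{\substack{(u,v)\in E \\ u \in N_j,\ v \in N_k,\ j \ne k}} W_E(u,v)
    \quad\text{subject to}\quad
      \sum_{v \in N_j} W_N(v) \;\approx\; \frac{1}{P}\sum_{v \in N} W_N(v)
      \ \text{ for all } j.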
12. Definition of Graph Partitioning (continued)
- Same definition and example graph as the previous slide, with the cut edges connecting different pairs Nj and Nk shown in black in the figure
13. Applications
- Telephone network design
  - The original application; algorithm due to Kernighan
- Load balancing while minimizing communication
- Sparse matrix times vector multiplication
  - Solving PDEs
  - N = {1, ..., n}, (j,k) in E if A(j,k) is nonzero
  - W_N(j) = number of nonzeros in row j, W_E(j,k) = 1
- VLSI layout
  - N = {units on chip}, E = {wires}, W_E(j,k) = wire length
- Sparse Gaussian elimination
  - Used to reorder rows and columns to increase parallelism and to decrease fill-in
- Data mining and clustering
- Physical mapping of DNA
14. Sparse Matrix Vector Multiplication: y = y + A*x

    declare A_local, A_remote(1:num_procs), x_local, x_remote, y_local
    y_local = y_local + A_local * x_local
    for all procs P that need part of x_local
        send(needed part of x_local, P)
    for all procs P owning needed part of x_remote
        receive(x_remote, P)
        y_local = y_local + A_remote(P) * x_remote
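A serial sketch of the owner-computes idea behind the pseudocode above (my own illustration, not the lecture's code): rows of A and entries of x and y are block-partitioned over owners, and any column index outside a row's owner block is a "remote" access that would require receiving that x entry from another processor. The small CSR matrix is made-up data.

    #include <stdio.h>

    #define N 6
    #define NNZ 14
    #define NPROCS 2            /* rows 0-2 owned by proc 0, rows 3-5 by proc 1 */

    int main(void) {
        /* CSR storage of a 6x6 sparse matrix (illustration data) */
        int    rowptr[N + 1] = {0, 2, 5, 7, 9, 12, 14};
        int    colind[NNZ]   = {0, 1, 0, 1, 3, 2, 4, 1, 3, 2, 3, 4, 4, 5};
        double val[NNZ]      = {4, 1, 1, 3, 2, 5, 1, 2, 4, 1, 1, 6, 2, 3};
        double x[N] = {1, 2, 3, 4, 5, 6}, y[N] = {0};

        int rows_per_proc = N / NPROCS;
        for (int p = 0; p < NPROCS; p++) {
            int remote_accesses = 0;
            for (int i = p * rows_per_proc; i < (p + 1) * rows_per_proc; i++) {
                for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
                    int j = colind[k];
                    /* x[j] is local if its owner is p, remote otherwise */
                    if (j / rows_per_proc != p) remote_accesses++;
                    y[i] += val[k] * x[j];   /* serially we just read it */
                }
            }
            printf("proc %d: %d remote x accesses (words it would receive)\n",
                   p, remote_accesses);
        }
        for (int i = 0; i < N; i++) printf("y[%d] = %.1f\n", i, y[i]);
        return 0;
    }

The remote-access count is exactly the number of cut edges touching that processor's rows, which is what graph partitioning tries to minimize.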
15. Cost of Graph Partitioning
- Many possible partitionings to search
- Just to divide into 2 parts there are
  - n choose n/2 = about sqrt(2/(pi*n)) * 2^n possibilities
- Choosing the optimal partitioning is NP-complete
  - (NP-complete: we can prove it is as hard as other well-known hard problems in a class called Nondeterministic Polynomial time)
  - The only known exact algorithms have cost exponential in n
- We need good heuristics
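The central binomial coefficient estimate quoted above follows from Stirling's formula, n! ~ sqrt(2*pi*n)*(n/e)^n (a standard step, added here for completeness):

    \binom{n}{n/2} \;=\; \frac{n!}{\left((n/2)!\right)^{2}}
    \;\approx\; \frac{\sqrt{2\pi n}\,(n/e)^{n}}{\pi n\,\left(n/(2e)\right)^{n}}
    \;=\; \sqrt{\frac{2}{\pi n}}\; 2^{\,n}.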
16. Overview of Heuristics
17. First Heuristic: Repeated Graph Bisection
- To partition N into 2^k parts
  - bisect the graph recursively k times
- Henceforth we discuss mostly graph bisection
18. Edge Separators vs. Vertex Separators
- Edge separator: Es (subset of E) separates G if removing Es from E leaves two equal-sized, disconnected components of N: N1 and N2
- Vertex separator: Ns (subset of N) separates G if removing Ns and all incident edges leaves two equal-sized, disconnected components of N: N1 and N2
- Making an Ns from an Es: pick one endpoint of each edge in Es
  - |Ns| <= |Es|
- Making an Es from an Ns: pick all edges incident on Ns
  - |Es| <= d * |Ns|, where d is the maximum degree of the graph
- We will find edge or vertex separators, as convenient
(Figure: G = (N, E) with nodes N and edges E; Es = green edges or blue edges; Ns = red vertices)
19. Overview of Bisection Heuristics
- Partitioning with nodal coordinates
  - Each node has x, y, z coordinates -> partition space
- Partitioning without nodal coordinates
  - E.g., a sparse matrix of Web documents
  - A(j,k) = number of times keyword j appears in URL k
- Multilevel acceleration (advanced topic)
  - Approximate the problem by a coarse graph, and do so recursively
20. Partitioning with Nodal Coordinates, i.e., nodes at points in (x,y) or (x,y,z) space
21. Nodal Coordinates: How Well Can We Do?
- A planar graph can be drawn in the plane without edge crossings
- Example: an m x m grid of m^2 nodes has a vertex separator Ns with |Ns| = m = sqrt(|N|) (see the last slide for m = 5)
- Theorem (Lipton, Tarjan, 1979): if G is planar, there exists Ns such that
  - N = N1 U Ns U N2 is a partition,
  - |N1| <= 2/3 |N| and |N2| <= 2/3 |N|
  - |Ns| <= sqrt(8 |N|)
- The theorem motivates the intuition behind the following algorithms
22. Nodal Coordinates: Inertial Partitioning
- For a graph in 2D, choose a line with half the nodes on one side and half on the other
- In 3D, choose a plane; we consider 2D for simplicity
- Choose a line L, and then choose a line L' perpendicular to it, with half the nodes on either side
23. Inertial Partitioning: Choosing L
- Clearly prefer the L on the left below
- Mathematically, choose L to be a total least squares fit of the nodes
  - Minimize the sum of squares of distances to L (green lines on the last slide)
  - Equivalent to choosing L as the axis of rotation that minimizes the moment of inertia of the nodes (unit weights) - the source of the name
(Figure: two candidate lines L, each splitting the nodes into N1 and N2)
24. Inertial Partitioning: Choosing L (continued)
- (a,b) is the unit vector perpendicular to L, and (xbar, ybar) is a point on L

    \sum_j (\text{length of } j\text{-th green line})^2
      = \sum_j \left[ (x_j - \bar{x})^2 + (y_j - \bar{y})^2
          - \left(-b(x_j - \bar{x}) + a(y_j - \bar{y})\right)^2 \right]
      \qquad \text{(Pythagorean theorem)}

      = a^2 \sum_j (x_j - \bar{x})^2
        + 2ab \sum_j (x_j - \bar{x})(y_j - \bar{y})
        + b^2 \sum_j (y_j - \bar{y})^2
      = a^2 X_1 + 2ab\, X_2 + b^2 X_3
      = \begin{bmatrix} a & b \end{bmatrix}
        \begin{bmatrix} X_1 & X_2 \\ X_2 & X_3 \end{bmatrix}
        \begin{bmatrix} a \\ b \end{bmatrix}

- Minimized by choosing (xbar, ybar) = (sum_j x_j, sum_j y_j) / n = center of mass, and (a,b) = eigenvector of the smallest eigenvalue of [X1 X2; X2 X3]
- (A small code sketch of this procedure follows below.)
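A minimal sketch of inertial partitioning in 2D (my own code, with made-up coordinates): compute the center of mass, form the 2x2 matrix [X1 X2; X2 X3], take the eigenvector of its smallest eigenvalue as the normal (a,b) to L, and split the nodes at the median of their coordinates along L (the cut line L' of the previous slide).

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define N 8

    static int cmp_double(const void *p, const void *q) {
        double a = *(const double *)p, b = *(const double *)q;
        return (a > b) - (a < b);
    }

    int main(void) {
        double x[N] = {0, 1, 2, 3, 0.5, 1.5, 2.5, 3.5};   /* illustration data */
        double y[N] = {0, 0.2, 0.1, 0.3, 1.0, 1.2, 0.9, 1.1};

        /* center of mass */
        double xbar = 0, ybar = 0;
        for (int j = 0; j < N; j++) { xbar += x[j]; ybar += y[j]; }
        xbar /= N; ybar /= N;

        /* X1 = sum (xj-xbar)^2, X2 = sum (xj-xbar)(yj-ybar), X3 = sum (yj-ybar)^2 */
        double X1 = 0, X2 = 0, X3 = 0;
        for (int j = 0; j < N; j++) {
            double dx = x[j] - xbar, dy = y[j] - ybar;
            X1 += dx * dx; X2 += dx * dy; X3 += dy * dy;
        }

        /* smallest eigenvalue of the symmetric 2x2 matrix [X1 X2; X2 X3] */
        double tr = X1 + X3, det = X1 * X3 - X2 * X2;
        double lam = 0.5 * (tr - sqrt(tr * tr - 4.0 * det));
        /* eigenvector (a,b) satisfies (X1 - lam)*a + X2*b = 0 */
        double a = -X2, b = X1 - lam;
        double len = sqrt(a * a + b * b);
        a /= len; b /= len;

        /* L's direction is perpendicular to (a,b); order nodes by their
           coordinate along L and cut at the median */
        double proj[N], sorted[N];
        for (int j = 0; j < N; j++)
            sorted[j] = proj[j] = -b * (x[j] - xbar) + a * (y[j] - ybar);
        qsort(sorted, N, sizeof(double), cmp_double);
        double median = 0.5 * (sorted[N / 2 - 1] + sorted[N / 2]);

        for (int j = 0; j < N; j++)
            printf("node %d -> part %d\n", j, proj[j] <= median ? 1 : 2);
        return 0;
    }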
25. Nodal Coordinates: Random Spheres
- Generalize the nearest-neighbor idea of a planar graph to higher dimensions
  - Any graph can fit in 3D with edge crossings
  - Capture the intuition of planar graphs being connected to nearest neighbors, but in higher than 2 dimensions
- For intuition, consider the graph defined by a regular 3D mesh
  - An n by n by n mesh of |N| = n^3 nodes
  - Edges to the 6 nearest neighbors
  - Partition by taking a plane parallel to 2 axes
  - Cuts n^2 = |N|^(2/3) = O(|E|^(2/3)) edges
- For general graphs
  - Need a notion of "well-shaped", like a mesh
26. Random Spheres: Well-Shaped Graphs
- Approach due to Miller, Teng, Thurston, Vavasis
- Def: a k-ply neighborhood system in d dimensions is a set {D1, ..., Dn} of closed disks in R^d such that no point in R^d is strictly interior to more than k disks
- Def: an (α, k) overlap graph is a graph defined in terms of α > 1 and a k-ply neighborhood system {D1, ..., Dn}: there is a node for each Dj, and an edge from j to i if expanding the radius of the smaller of Dj and Di by a factor of α causes the two disks to overlap
- Example: an n-by-n mesh is a (1,1) overlap graph
- Example: any planar graph is an (α, k) overlap graph for some α, k
(Figure: a 2D mesh as a (1,1) overlap graph)
27. Generalizing Lipton/Tarjan to Higher Dimensions
- Theorem (Miller, Teng, Thurston, Vavasis, 1993): let G = (N, E) be an (α, k) overlap graph in d dimensions with |N| = n. Then there is a vertex separator Ns such that
  - N = N1 U Ns U N2, and
  - N1 and N2 each has at most n(d+1)/(d+2) nodes
  - Ns has at most O(α k^(1/d) n^((d-1)/d)) nodes
- When d = 2, this is the same as Lipton/Tarjan
- Algorithm
  - Choose a sphere S in R^d
  - The edges that S cuts form an edge separator Es
  - Build Ns from Es
  - Choose S randomly, so that it satisfies the theorem with high probability
28. Stereographic Projection
- Stereographic projection from the plane to the sphere
- For d = 2, draw a line from p to the North Pole; the projection p' of p is where the line and the sphere intersect
- Similar in higher dimensions

    p = (x, y) \;\longrightarrow\;
    p' = \frac{(2x,\; 2y,\; x^2 + y^2 - 1)}{x^2 + y^2 + 1}
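A tiny sketch of this map (my own code, simply evaluating the formula above for a point in the plane):

    #include <stdio.h>

    /* Stereographic projection of a plane point (x, y) onto the unit sphere,
       following the formula on the slide. */
    static void project(double x, double y, double out[3]) {
        double s = x * x + y * y + 1.0;   /* common denominator */
        out[0] = 2.0 * x / s;
        out[1] = 2.0 * y / s;
        out[2] = (x * x + y * y - 1.0) / s;
    }

    int main(void) {
        double p[3];
        project(1.0, 2.0, p);   /* result lies on the unit sphere */
        printf("(%g, %g, %g)\n", p[0], p[1], p[2]);
        return 0;
    }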
29. Choosing a Random Sphere
- Do stereographic projection from R^d to a sphere S in R^(d+1)
- Find a centerpoint of the projected points
  - Any plane through the centerpoint divides the points evenly
  - There is a linear programming algorithm, plus cheaper heuristics
- Conformally map the points on the sphere
  - Rotate the points around the origin so the centerpoint is at (0, 0, r) for some r
  - Dilate the points (unproject, multiply by sqrt((1-r)/(1+r)), project)
  - This maps the centerpoint to the origin (0, ..., 0) and spreads the points around S
- Pick a random plane through the origin
  - The intersection of the plane and the sphere S is a circle
- Unproject the circle
  - This yields the desired circle C in R^d
- Create Ns: j belongs to Ns if α·Dj intersects C
30-35. Random Sphere Algorithm (Gilbert)
(Sequence of figures illustrating the steps of the algorithm on an example mesh)
36. Nodal Coordinates: Summary
- Other variations on these algorithms exist
- The algorithms are efficient
- They rely on graphs having nodes connected (mostly) to nearest neighbors in space
  - The algorithm does not depend on where the actual edges are!
  - Common when the graph arises from a physical model
- Ignores edges, but can be used as a good starting guess for subsequent partitioners that do examine edges
- Can do poorly if the graph connectivity is not spatial
- Details at
  - www.cs.berkeley.edu/demmel/cs267/lecture18/lecture18.html
  - www.cs.ucsb.edu/gilbert
  - www.cs.bu.edu/steng
37. Partitioning without Nodal Coordinates (e.g., in the WWW, nodes are web pages)
38. Coordinate-Free: Breadth First Search (BFS)
- Given G = (N, E) and a root node r in N, BFS produces
  - A subgraph T of G (same nodes, subset of edges)
  - T is a tree rooted at r
  - Each node is assigned a level = distance from r
(Figure: BFS levels 0 through 4 of an example graph, showing the split into N1 and N2 and the tree, horizontal, and inter-level edges)
39. Partitioning via Breadth First Search
- BFS identifies 3 kinds of edges
  - Tree edges - part of T
  - Horizontal edges - connect nodes at the same level
  - Inter-level edges - connect nodes at adjacent levels
- No edges connect nodes in levels differing by more than 1 (why?)
- BFS partitioning heuristic
  - N = N1 U N2, where
    - N1 = nodes at level <= L
    - N2 = nodes at level > L
  - Choose L so |N1| is close to |N2|
- (A small sketch of this heuristic follows below.)
(Figure: BFS partition of a 2D mesh using the center as root; N1 = levels 0, 1, 2, 3; N2 = levels 4, 5, 6)
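A minimal sketch of the BFS heuristic (my own code, on a made-up 8-node graph): run BFS from a root, record each node's level, then pick the smallest cutoff level L such that the nodes at levels <= L account for at least half of the graph.

    #include <stdio.h>

    #define N 8

    int main(void) {
        /* adjacency lists of an undirected example graph; -1 terminates a list */
        int adj[N][4] = {
            {1, 2, -1, -1}, {0, 3, -1, -1}, {0, 3, 4, -1}, {1, 2, 5, -1},
            {2, 5, 6, -1},  {3, 4, 7, -1},  {4, 7, -1, -1}, {5, 6, -1, -1},
        };
        int level[N], queue[N], head = 0, tail = 0;
        for (int i = 0; i < N; i++) level[i] = -1;

        /* BFS from root 0, assigning level = distance from the root */
        level[0] = 0;
        queue[tail++] = 0;
        while (head < tail) {
            int u = queue[head++];
            for (int k = 0; k < 4 && adj[u][k] >= 0; k++) {
                int v = adj[u][k];
                if (level[v] < 0) { level[v] = level[u] + 1; queue[tail++] = v; }
            }
        }

        /* count nodes per level, then choose the smallest L with |N1| >= N/2 */
        int count[N] = {0}, maxlevel = 0;
        for (int i = 0; i < N; i++) {
            count[level[i]]++;
            if (level[i] > maxlevel) maxlevel = level[i];
        }
        int L = 0, in_n1 = 0;
        for (int l = 0; l <= maxlevel; l++) {
            in_n1 += count[l];
            if (in_n1 >= N / 2) { L = l; break; }
        }

        printf("cutoff level L = %d\n", L);
        for (int i = 0; i < N; i++)
            printf("node %d (level %d) -> %s\n", i, level[i],
                   level[i] <= L ? "N1" : "N2");
        return 0;
    }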
40. Coordinate-Free: Kernighan/Lin
- Take an initial partition and iteratively improve it
  - Kernighan/Lin (1970): cost O(|N|^3), but easy to understand
  - Fiduccia/Mattheyses (1982): cost O(|E|), much better, but more complicated
- Given G = (N, E, W_E) and a partitioning N = A U B, where |A| = |B|
  - T = cost(A,B) = sum of W(e) over edges e connecting nodes in A and B
  - Find subsets X of A and Y of B with |X| = |Y|
  - Swapping X and Y should decrease the cost:
    - newA = (A - X) U Y and newB = (B - Y) U X
    - newT = cost(newA, newB) < cost(A, B)
- Need to compute newT efficiently for many possible X and Y, and choose the smallest
41. Kernighan/Lin: Preliminary Definitions
- T = cost(A, B), newT = cost(newA, newB)
- Need an efficient formula for newT; will use (a small code sketch of D and gain follows this slide)
  - E(a) = external cost of a in A = sum of W(a,b) for b in B
  - I(a) = internal cost of a in A = sum of W(a,a') for other a' in A
  - D(a) = cost of a in A = E(a) - I(a)
  - E(b), I(b) and D(b) defined analogously for b in B
- Consider swapping X = {a} and Y = {b}
  - newA = (A - {a}) U {b}, newB = (B - {b}) U {a}
  - newT = T - ( D(a) + D(b) - 2*w(a,b) ) = T - gain(a,b)
  - gain(a,b) measures the improvement gained by swapping a and b
- Update formulas
  - newD(a') = D(a') + 2*w(a',a) - 2*w(a',b) for a' in A, a' != a
  - newD(b') = D(b') + 2*w(b',b) - 2*w(b',a) for b' in B, b' != b
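A minimal sketch of the quantities defined above (my own code; the 6-node graph, its weights, and its A/B split are made-up illustration data): compute D(v) for every node and scan all (a, b) pairs for the swap with the largest gain, as one Kernighan/Lin pass would do before marking the pair.

    #include <stdio.h>

    #define N 6
    /* w[i][j] = edge weight between nodes i and j (0 = no edge) */
    static const double w[N][N] = {
        {0, 1, 2, 0, 0, 0},
        {1, 0, 1, 3, 0, 0},
        {2, 1, 0, 0, 1, 0},
        {0, 3, 0, 0, 1, 2},
        {0, 0, 1, 1, 0, 1},
        {0, 0, 0, 2, 1, 0},
    };
    /* part[v] = 0 if v is in A, 1 if v is in B */
    static const int part[N] = {0, 0, 0, 1, 1, 1};

    /* D(v) = external cost - internal cost of v w.r.t. the current partition */
    static double D(int v) {
        double ext = 0.0, in = 0.0;
        for (int u = 0; u < N; u++) {
            if (u == v) continue;
            if (part[u] == part[v]) in += w[v][u];
            else                    ext += w[v][u];
        }
        return ext - in;
    }

    /* gain(a,b) = D(a) + D(b) - 2*w(a,b): reduction in cut cost if a and b swap */
    static double gain(int a, int b) {
        return D(a) + D(b) - 2.0 * w[a][b];
    }

    int main(void) {
        int best_a = -1, best_b = -1;
        double best = -1e30;
        for (int a = 0; a < N; a++) {
            if (part[a] != 0) continue;
            for (int b = 0; b < N; b++) {
                if (part[b] != 1) continue;
                double g = gain(a, b);
                if (g > best) { best = g; best_a = a; best_b = b; }
            }
        }
        printf("best swap: a=%d, b=%d, gain=%.1f\n", best_a, best_b, best);
        return 0;
    }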
42. Kernighan/Lin Algorithm

    Compute T = cost(A,B) for initial A, B                  ... cost = O(|N|^2)
    Repeat
        ... One pass greedily computes |N|/2 possible (X,Y) pairs to swap, picks best
        Compute costs D(n) for all n in N                   ... cost = O(|N|^2)
        Unmark all nodes in N                               ... cost = O(|N|)
        While there are unmarked nodes                      ... |N|/2 iterations
            Find an unmarked pair (a,b) maximizing gain(a,b)     ... cost = O(|N|^2)
            Mark a and b (but do not swap them)             ... cost = O(1)
            Update D(n) for all unmarked n,
                as though a and b had been swapped          ... cost = O(|N|)
        Endwhile
        ... At this point we have computed a sequence of pairs (a1,b1), ..., (ak,bk)
        ... and gains gain(1), ..., gain(k), where k = |N|/2,
        ... numbered in the order in which we marked them
        Pick m maximizing Gain = sum of gain(i) for i = 1 to m   ... cost = O(|N|)
        ... Gain is the reduction in cost from swapping (a1,b1) through (am,bm)
        If Gain > 0 then ... it is worth swapping
            Update newA = A - {a1,...,am} U {b1,...,bm}     ... cost = O(|N|)
            Update newB = B - {b1,...,bm} U {a1,...,am}     ... cost = O(|N|)
            Update T = T - Gain                             ... cost = O(1)
        endif
    Until Gain <= 0
43. Comments on Kernighan/Lin Algorithm
- The most expensive line is the search for the best unmarked pair: O(|N|^2) per iteration, so O(|N|^3) per pass
- Some gain(i) may be negative, but if later gains are large, the final Gain may still be positive
  - Can escape local minima where swapping no single pair helps
- How many times do we Repeat?
  - K/L tested on very small graphs (|N| <= 360) and got convergence after 2-4 sweeps
  - For random graphs (of theoretical interest) the probability of convergence in one step appears to drop like 2^(-|N|/30)
44. Coordinate-Free: Spectral Bisection
- Based on theory of Fiedler (1970s), popularized by Pothen, Simon, Liou (1990)
- Motivation, by analogy to a vibrating string
- Implementation via the Lanczos algorithm
- To optimize sparse-matrix-vector multiply, we graph partition
- To graph partition, we find an eigenvector of a matrix associated with the graph
- To find an eigenvector, we do sparse-matrix-vector multiply
- No free lunch ...
45. (Figure slide; no text)
46. What About DAG Scheduling?
- Each node weight corresponds to task execution time
(Figure: example DAG with node weights 10, 5, 12, 4, 4, 7, 8, 0)
47. List Scheduling
(Figure: intuition for the heuristics on the next slide - "min min", "max min", and "sufferage", which uses the difference between the 2nd smallest and smallest completion times)
48. List Scheduling
- MinMin (aggressively pick the task that can be done soonest)
  - for each task T, pick the host H that achieves the smallest completion time (CT) for task T
  - pick the task with the smallest such CT
  - schedule T on H
- MaxMin (pick the largest tasks first)
  - for each task T, pick the host H that achieves the smallest CT for task T
  - pick the task with the largest such CT
  - schedule T on H
- Sufferage (pick the task that would suffer the most if not picked)
  - for each task T, pick the host H that achieves the smallest CT for task T
  - for each task T, also find the host H' that achieves the second smallest CT, call it CT'
  - pick the task with the largest (CT' - CT) value
  - schedule T on H
- (A small sketch of MinMin follows below.)
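A minimal sketch of MinMin (my own code; it uses the completion-time matrix from the example slides that follow, with rows as machines and columns as tasks):

    #include <stdio.h>

    #define NTASKS 3
    #define NHOSTS 3

    int main(void) {
        /* exec[h][t] = execution time of task t on host h (from the slides) */
        double exec[NHOSTS][NTASKS] = {
            {10, 24, 23},
            {16,  8, 30},
            {70, 12, 27},
        };
        double ready[NHOSTS] = {0, 0, 0};   /* when each host becomes free */
        int done[NTASKS] = {0, 0, 0};
        double makespan = 0;

        for (int step = 0; step < NTASKS; step++) {
            int best_t = -1, best_h = -1;
            double best_ct = 1e30;
            /* For each unscheduled task, find its minimum completion time,
               then keep the task whose minimum CT is smallest overall. */
            for (int t = 0; t < NTASKS; t++) {
                if (done[t]) continue;
                int h_min = 0;
                double ct_min = 1e30;
                for (int h = 0; h < NHOSTS; h++) {
                    double ct = ready[h] + exec[h][t];
                    if (ct < ct_min) { ct_min = ct; h_min = h; }
                }
                if (ct_min < best_ct) { best_ct = ct_min; best_t = t; best_h = h_min; }
            }
            done[best_t] = 1;
            ready[best_h] = best_ct;
            if (best_ct > makespan) makespan = best_ct;
            printf("schedule T%d on H%d, completes at %.0f\n",
                   best_t + 1, best_h + 1, best_ct);
        }
        printf("makespan = %.0f\n", makespan);
        return 0;
    }

Running this reproduces the walkthrough on the next slides: T2 on H2, then T1 on H1, then T3 on H3, with a makespan of 27.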
49-55. Example (MinMin)

Completion-time matrix (rows = machines H1-H3, columns = tasks T1-T3):

          T1   T2   T3
    H1    10   24   23
    H2    16    8   30
    H3    70   12   27

- MinMin algorithm
  - Smallest CT per task: T1 = 10, T2 = 8, T3 = 23
  - Pick T2, schedule it on H2
  - Update the matrix:

          T1   T3
    H1    10   23
    H2    24   38
    H3    70   27

  - Smallest CT per task: T1 = 10, T3 = 23
  - Pick T1, schedule it on H1
  - Update the matrix:

          T3
    H1    33
    H2    38
    H3    27

  - Pick T3, schedule it on H3
  - Makespan = 27 seconds
56. Example (MaxMin)

Completion-time matrix (rows = machines H1-H3, columns = tasks T1-T3):

          T1   T2   T3
    H1    10   24   23
    H2    16    8   30
    H3    70   12   27

- MaxMin algorithm
  - Smallest CT per task: T1 = 10, T2 = 8, T3 = 23
  - Pick T3 (the largest of these), schedule it on H1
  - Update the matrix:

          T1   T2
    H1    33   47
    H2    24    8
    H3    70   12

  - Smallest CT per task: T1 = 24, T2 = 8
  - Pick T1, schedule it on H2
  - Update the matrix:

          T2
    H1    47
    H2    32
    H3    12

  - Pick T2, schedule it on H3
  - Makespan = 24 seconds
57. Resulting Schedules
- MinMin: machine 1 runs Task 1, machine 2 runs Task 2, machine 3 runs Task 3
- MaxMin: machine 1 runs Task 3, machine 2 runs Task 1, machine 3 runs Task 2
58. DAGs?
- While independent tasks occur in real applications, the most general model of computation is a Directed Acyclic Graph (DAG)
  - A set of weighted nodes
  - A set of edges
- Representative of tasks that have dependencies among each other
59-62. Example of DAG
(Figure: example DAG with node weights 10, 5, 12, 4, 4, 7, 8, 0)
- DAG length = 5: the number of nodes on the longest path
- DAG levels = 5: sets of tasks that can be done concurrently
- DAG width = 3: the size of the largest level; no more than 3 processors are useful for running this DAG
- Critical path = 34: the sum of the weights along the heaviest path; it is a lower bound on the DAG execution time
63. Where do DAGs come from?
- Some applications are naturally structured as DAGs
  - Example: image processing - apply a bunch of filters, whose outputs feed into each other's inputs
- But otherwise, DAGs emerge from the code to parallelize
- Example: linear system back-solve
  - Ax = b, where A is lower triangular

    for (i = 0; i < n; i++) {
        x[i] = b[i] / a[i][i];              // Task T(i,i)
        for (j = i + 1; j < n; j++) {
            b[j] = b[j] - a[j][i] * x[i];   // Task T(i,j)
        }
    }

- This leads to a DAG (see the next slide)
64-66. Where do DAGs come from? (continued)
(Figure: the DAG of tasks for the back-solve with n = 5 - nodes T1,1 through T5,5, where each T(i,i) feeds the tasks T(i,j) for j > i, and each T(i,j) feeds T(i+1,j))
- For n = 5 this DAG has 9 levels, width 4, and length 9
67. DAG Scheduling Problem
- Question
  - I have a bunch of processors
    - Let's assume that they're identical
  - I have a DAG
  - Which processor does which task so that the DAG execution time is minimized?
- The solution is called a schedule
  - A list of assignments of tasks to processors
    - P1 does T1 and T4
    - P2 does T2 and T7
    - etc.
- Goal: find the optimal schedule
  - NP-hard
  - List scheduling is often used
68. Critical Path
- The critical path gives a lower bound on the execution time
- Therefore, it's intuitively a good idea to perform all tasks on the critical path as fast as possible
  - Running them slowly is certain to decrease performance
- Therefore, people have developed DAG scheduling techniques that account for the critical path when scheduling tasks
- Let's look at one possibility
69-73. Scheduling for Critical Path
- First step: compute the weight of the critical path
  (Figure: example DAG with 8 tasks T1-T8; CP = 33)
- Second step: for each task, compute the weight of the heaviest path from that task to the exit node
  (Figure: the same DAG with each task annotated with this weight, e.g. 33 for the entry task T1 and 0 for the exit task T8)
- Third step: for each task, compute CP minus the obtained weight
  (Figure: the entry task T1 gets 0 and the exit task T8 gets 33)
- Fourth step: sort the tasks in increasing order of that value: T1, T2, T4, T5, T3, T7, T6, T8
- Fifth step: assign the tasks to processors P1 and P2, considering them in that order
- (A small sketch of the priority computation follows below.)
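A minimal sketch of the priority computation described above (my own code; the 5-task DAG and its weights are made-up illustration data, not the DAG from the slides): for each task, compute the weight of the heaviest path to the exit node ("bottom level"), then order tasks by CP minus that value, smallest first.

    #include <stdio.h>

    #define N 5

    int main(void) {
        /* weight[i] = execution time of task i */
        double weight[N] = {10, 4, 8, 2, 0};
        /* succ[i][j] = 1 if there is an edge from task i to task j.
           Tasks are topologically ordered (edges go from lower to higher
           indices) and task N-1 is the unique exit node. */
        int succ[N][N] = {
            {0, 1, 1, 0, 0},
            {0, 0, 0, 1, 0},
            {0, 0, 0, 1, 1},
            {0, 0, 0, 0, 1},
            {0, 0, 0, 0, 0},
        };

        /* bottom[i] = weight of the heaviest path from i to the exit,
           including i's own weight; computed in reverse topological order */
        double bottom[N];
        for (int i = N - 1; i >= 0; i--) {
            double best = 0.0;
            for (int j = i + 1; j < N; j++)
                if (succ[i][j] && bottom[j] > best) best = bottom[j];
            bottom[i] = weight[i] + best;
        }
        /* assuming task 0 is the unique entry node, its bottom level = CP */
        double cp = bottom[0];

        printf("critical path = %.0f\n", cp);
        printf("task : CP - bottom level (smaller = higher priority)\n");
        for (int i = 0; i < N; i++)
            printf("  T%d : %.0f\n", i + 1, cp - bottom[i]);
        return 0;
    }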
74. More Complex Cases
- There can be communications among tasks
  - Denoted by edge weights
  - Network transfer times add to the computation time
- The underlying platform can be heterogeneous
  - Makes the scheduling process much more complicated
75. Conclusion
- Scheduling is the land of heuristics
  - Come up with intuitive reasoning for what a good schedule may look like
  - Validate it via simulation (analytical results are typically not possible)
  - Announce it as yet another scheduling heuristic (MCP, ETF, DSC, DLS, ...)
- It's a good idea to know what types of heuristics are out there
  - People in the field often implement nothing beyond a greedy algorithm, which may be extremely harmful in many cases