Title: A Stable Broadcast Algorithm
1. A Stable Broadcast Algorithm
- Kei Takahashi, Hideo Saito
- Takeshi Shibata, Kenjiro Taura
- (The University of Tokyo, Japan)
CCGrid 2008 - Lyon, France
2. Broadcasting Large Messages
- Distribute the same, large data to many nodes
- e.g., content delivery
- Widely used in parallel processing
[Figure: the same data is delivered to every node]
3. Problem of Broadcast
- Usually, in a broadcast transfer, the source can deliver much less data to each node than in a single transfer from the source
[Figure: a single transfer from S to D achieves 100, while a naive broadcast from S delivers only 25 to each of four destinations]
4. Problem of Slow Nodes
- Pipelined transfers improve performance
- Even in a pipelined transfer, nodes with small bandwidth (slow nodes) may degrade the receiving bandwidth of all other nodes
[Figure: a pipeline over links of bandwidth 100 is throttled once it passes through a slow node with bandwidth 10]
5. Contributions
- Propose the idea of a Stable Broadcast
- In a stable broadcast:
- Slow nodes never degrade the receiving bandwidth of other nodes
- All nodes receive the maximum possible amount of data
6. Contributions (cont.)
- Propose a stable broadcast algorithm for tree topologies
- Proved to be stable in a theoretical model
- Improves performance on general graph networks
- In a real-machine experiment, our algorithm achieved 2.5 times the aggregate bandwidth of the previous algorithm (FPFR)
7. Agenda
- Introduction
- Problem Settings
- Related Work
- Proposed Algorithm
- Evaluation
- Conclusion
8. Problem Settings
- Target: broadcast of large messages
- Only computational nodes handle messages
9. Problem Settings (cont.)
- Only bandwidth matters for large messages
- (Transfer time) ≈ (Message size) / (Bandwidth) ≫ (Latency) (see the worked example below)
- Bandwidth is limited only by link capacities
- Assume that nodes and switches have sufficient processing throughput
[Figure: example with a 1 GB message, 1 Gbps bandwidth, and 50 ms latency]
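A hedged worked example with the figure's values (1 GB message, 1 Gbps bandwidth, 50 ms latency; the arithmetic is ours):

\[ T_{\text{transfer}} \approx \frac{\text{Message Size}}{\text{Bandwidth}} = \frac{8\,\text{Gbit}}{1\,\text{Gbps}} = 8\,\text{s} \gg 50\,\text{ms} = \text{Latency} \]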
10. Problem Settings (cont.)
- Bandwidth-annotated topologies are given in advance
- Bandwidth and topologies can be inferred rapidly:
- Shirai et al. A Fast Topology Inference - A Building Block for Network-aware Parallel Computing. (HPDC 2007)
- Naganuma et al. Improving Efficiency of Network Bandwidth Estimation Using Topology Information. (SACSIS 2008, Tsukuba, Japan)
[Figure: example topology annotated with link bandwidths of 80, 10, 30, 100, and 40]
11. Evaluation of Broadcast
- Previous algorithms evaluated a broadcast by its completion time
- However, completion time cannot capture the effect of slowly receiving nodes
- It is desirable that each node receives as much data as possible
- Aggregate bandwidth is a more reasonable evaluation criterion in many cases (formalized below)
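In symbols (notation ours, not from the slides): if b_d denotes the average receiving bandwidth of destination d during the broadcast, the aggregate bandwidth over the destination set D is

\[ B_{\text{agg}} = \sum_{d \in D} b_d. \]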
12. Definition of Stable Broadcast
- All nodes receive the maximum possible bandwidth
- The receiving bandwidth of each node is not reduced by adding other nodes to the broadcast (stated formally after the figure below)
[Figure: comparison of a single transfer (120 to D2) with a broadcast to D0-D3 over links of bandwidth 10, 120, 100, and 100]
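One hedged way to state the definition (notation ours): writing b_d^single for the bandwidth destination d would obtain from a dedicated single transfer out of the source, a broadcast is stable when

\[ b_d^{\text{broadcast}} = b_d^{\text{single}} \quad \text{for every destination } d, \]

so adding further destinations never lowers any node's receiving bandwidth.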
13. Properties of Stable Broadcast
- Maximize aggregate bandwidth
- Minimize completion time
14. Agenda
- Introduction
- Problem Settings
- Related Work
- Proposed Algorithm
- Evaluation
- Conclusion
15. Single-Tree Algorithms
- Flat tree
- The outgoing link from the source becomes a bottleneck
- Random pipeline
- Some links are used many times and become bottlenecks
- Depth-first pipeline
- Each link is used only once, but fast nodes suffer from slow nodes
- Dijkstra
- Fast nodes do not suffer from slow nodes, but some links are used many times
[Figure: example trees built by Flat Tree, Random Pipeline, Dijkstra, and Depth-First (FPFR)]
16. FPFR Algorithm
- FPFR (Fast Parallel File Replication) improves the aggregate bandwidth over algorithms that use only one tree
- Idea:
- (1) Construct multiple spanning trees
- (2) Use these trees in parallel
Izmailov et al. Fast Parallel File Replication in Data Grid. (GGF-10, March 2004)
17. Tree Constructions in FPFR
- Iteratively construct spanning trees (see the sketch after the figure below)
- Create a spanning tree (Tn) by tracing every destination
- Set the throughput (Vn) to the bottleneck bandwidth in Tn
- Subtract Vn from the remaining bandwidth of each link
[Figure: the first spanning tree (T1), its bottleneck link, and the tree throughputs V1 and V2]
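A minimal Python sketch of this loop (our reading of the slide; the identifiers and the depth-first trace helper are illustrative, not taken from the FPFR paper):

    def trace_spanning_tree(capacity, source, destinations):
        # Depth-first trace from the source over links that still have capacity.
        # Returns the traced edges, or None if some destination is unreachable.
        visited, edges, stack = {source}, [], [source]
        while stack:
            u = stack.pop()
            for (a, b), c in capacity.items():
                if a == u and b not in visited and c > 0:
                    visited.add(b)
                    edges.append((a, b))
                    stack.append(b)
        if not all(d in visited for d in destinations):
            return None
        return edges  # a full implementation would prune branches that reach no destination

    def fpfr_trees(capacity, source, destinations):
        # capacity: dict {(u, v): remaining bandwidth} for directed links.
        trees = []
        while True:
            tree = trace_spanning_tree(capacity, source, destinations)
            if tree is None:                    # no further spanning tree exists
                break
            v = min(capacity[e] for e in tree)  # bottleneck bandwidth = tree throughput Vn
            for e in tree:
                capacity[e] -= v                # subtract Vn from every link of the tree
            trees.append((tree, v))
        return trees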
18. Data Transfer with FPFR
- Each tree sends a different fraction of the data in parallel
- The proportion of data sent through each tree may be optimized by linear programming (Balanced Multicasting); a simpler proportional split is sketched below
[Figure: T1 sends the former part of the message and T2 sends the latter part, at rates V1 and V2]
den Burger et al. Balanced Multicasting: High-throughput Communication for Grid Applications. (SC 2005)
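A small sketch of the simpler proportional split, in which each tree carries a contiguous share of the message proportional to its throughput (the Balanced Multicasting work instead derives the shares with linear programming; the function is illustrative):

    def split_message(message_size, throughputs):
        # Assign each tree a contiguous byte range proportional to its throughput Vn.
        total = sum(throughputs)
        ranges, start = [], 0
        for v in throughputs:
            length = message_size * v // total
            ranges.append((start, start + length))
            start += length
        if ranges:
            # hand any rounding remainder to the last tree
            ranges[-1] = (ranges[-1][0], message_size)
        return ranges

    # Example: split_message(1000, [60, 40]) -> [(0, 600), (600, 1000)]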
19. Problems of FPFR
- In FPFR, slow nodes degrade the receiving bandwidth of other nodes
- For tree topologies, FPFR outputs only one depth-first pipeline, which cannot exploit the network's potential performance
[Figure: a bottleneck link in the single FPFR pipeline slows all downstream nodes]
20. Agenda
- Introduction
- Problem Settings
- Related Work
- Our Algorithm
- Evaluation
- Conclusion
21. Our Algorithm
- Modifies the FPFR algorithm
- Creates both spanning trees and partial trees
- Stable for tree topologies whose links have the same bandwidth in both directions
22. Tree Constructions
- Iteratively construct trees (see the sketch after the figure below)
- Create a tree Tn by tracing every reachable destination
- Set the throughput Vn to the bottleneck in Tn
- Subtract Vn from the remaining capacities
[Figure: the first tree T1 is a spanning tree over S, A, B, C with throughput V1; the second and third trees, T2 and T3, are partial trees with throughputs V2 and V3]
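A hedged sketch of the modified loop (identifiers are ours): it uses the same kind of depth-first trace as the FPFR sketch above, but keeps going with partial trees once a full spanning tree can no longer be built, stopping only when no destination is reachable.

    def trace_reachable(capacity, source, destinations):
        # Depth-first trace over links that still have capacity; also reports
        # which destinations the traced tree actually covers.
        visited, edges, stack = {source}, [], [source]
        while stack:
            u = stack.pop()
            for (a, b), c in capacity.items():
                if a == u and b not in visited and c > 0:
                    visited.add(b)
                    edges.append((a, b))
                    stack.append(b)
        covered = [d for d in destinations if d in visited]
        return edges, covered

    def stable_broadcast_trees(capacity, source, destinations):
        trees = []
        while True:
            edges, covered = trace_reachable(capacity, source, destinations)
            if not edges or not covered:         # no destination reachable any more
                break
            v = min(capacity[e] for e in edges)  # bottleneck = throughput Vn of this tree
            for e in edges:
                capacity[e] -= v
            # a spanning tree if covered == destinations, otherwise a partial tree
            trees.append((edges, covered, v))
        return trees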
23. Data Transfer
- Send data proportional to each tree's throughput Vn
- Example (see the sketch after the figure below):
- Stage 1: use T1, T2, and T3
- Stage 2: use T1 and T2 to send the data previously sent by T3
- Stage 3: use T1 to send the data previously sent by T2
[Figure: the staged schedule over trees T1 (V1), T2 (V2), and T3 (V3) for source S and nodes A, B, C]
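A toy sketch of the staged plan above, assuming the trees are ordered from widest coverage (T1) to narrowest (T3); in each later stage the narrowest remaining tree is dropped and the wider trees re-send the data it had delivered only to its own subset (the ordering assumption is ours):

    def stage_plan(tree_names):
        # tree_names ordered from widest to narrowest coverage, e.g. ["T1", "T2", "T3"].
        return [tree_names[:k] for k in range(len(tree_names), 0, -1)]

    # Example: stage_plan(["T1", "T2", "T3"]) -> [['T1', 'T2', 'T3'], ['T1', 'T2'], ['T1']]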
24. Properties of Our Algorithm
- Our algorithm is stable for tree topologies (whose links have the same capacities in both directions)
- Every node receives the maximum bandwidth
- For any topology, it achieves greater aggregate bandwidth than the baseline algorithm (FPFR)
- Partial trees fully utilize the link capacities
- The computational cost of creating a broadcast plan is small
25. Agenda
- Introduction
- Problem Settings
- Related Work
- Proposed Algorithm
- Evaluation
- Conclusion
26. (1) Simulations
- Simulated 5 broadcast algorithms on a real topology
- Compared the aggregate bandwidth of each method
- Many bandwidth distributions
- Broadcasts to 10, 50, and 100 nodes
- 10 different conditions (source, destinations)
27. Compared Algorithms
- Random
- Flat Tree
- Depth-First (FPFR)
- Dijkstra
- Ours
28. Results of Simulations
- Mixed two kinds of links (100 and 1000)
- Vertical axis: speedup over FlatTree
- 40 times more than Random and 3 times more than Depth-First (FPFR) with 100 nodes
29. Results of Simulations (cont.)
- Tested 8 bandwidth distributions:
- Uniform distribution (500-1000)
- Uniform distribution (100-1000)
- Mixed 100 and 1000 links
- Uniform distribution (500-100) between switches
- (for each distribution, tested two conditions: link bandwidths in the two directions are the same, or different)
- Our method achieved the largest aggregate bandwidth in 7 of 8 cases
- The improvement is especially large when the bandwidth variance is large
- With the uniform distribution (100-1000) and different bandwidths in the two directions, Dijkstra achieved 2% more aggregate bandwidth
30. (2) Real-Machine Experiment
- Performed broadcasts across 4 clusters
- Number of destinations: 10, 47, and 105 nodes
- Link bandwidths: 10 Mbps - 1 Gbps
- Compared the aggregate bandwidth of 4 algorithms:
- Our algorithm
- Depth-First (FPFR)
- Dijkstra
- Random (best among 100 trials)
31. Theoretical Maximum Aggregate Bandwidth
- We also calculated the theoretical maximum aggregate bandwidth
- The total of the receiving bandwidths when a separate direct transfer is made from the source to each destination (illustrated below)
[Figure: separate direct transfers from the source give D0-D3 bandwidths of 10, 120, 100, and 100]
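A hedged illustration of how this baseline can be computed; the values below are read off the figure, and their assignment to D0-D3 is our assumption:

    # Bandwidth each destination would receive in its own separate direct transfer
    # from the source (illustrative values; units are not given on the slide).
    direct_bandwidth = {"D0": 10, "D1": 120, "D2": 100, "D3": 100}
    theoretical_max = sum(direct_bandwidth.values())  # 10 + 120 + 100 + 100 = 330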
32. Evaluation of Aggregate Bandwidth
- For the 105-node broadcast, 2.5 times the aggregate bandwidth of the baseline algorithm, Depth-First (FPFR)
- However, our algorithm stayed at 50-70% of the theoretical maximum aggregate bandwidth
- Computational nodes cannot fully utilize the up/down links of the network
33. Evaluation of Stability
- Compared the aggregate bandwidth of 9 nodes before/after adding one slow node
- Unlike Depth-First (FPFR), in our algorithm the existing nodes do not suffer from the added slow node
- Achieved 1.6 times the bandwidth of Dijkstra
34. Agenda
- Introduction
- Problem Settings
- Related Work
- Our Algorithm
- Evaluation
- Conclusion
35. Conclusion
- Introduced the notion of a Stable Broadcast
- Slow nodes never degrade the receiving bandwidth of fast nodes
- Proposed a stable broadcast algorithm for tree topologies
- Proved stable in a theoretical model
- 2.5 times the aggregate bandwidth in real-machine experiments
- Confirmed speedups in simulations under many different conditions
36. Future Work
- An algorithm that maximizes aggregate bandwidth in general graph topologies
- An algorithm that changes the relay schedule by detecting bandwidth fluctuations
38. All the graphs
39. Broadcast with BitTorrent
- BitTorrent gradually improves the transfer schedule by adaptively choosing parent nodes
- Since the relaying structure created by BitTorrent has many branches, some links may become bottlenecks
[Figure: snapshot of a BitTorrent transfer tree with a bottleneck link]
Wei et al. Scheduling Independent Tasks Sharing Large Data Distributed with BitTorrent. (GRID 2005)
40. Simulation 1
- Uniform distribution (100-1000) between switches
- Vertical axis: speedup over FlatTree
- 36 times more than FlatTree and 1.2 times more than Depth-First (FPFR) for the 100-node broadcast
41. Topology-unaware Pipeline
- Trace all the destinations from the source
- Some links are used by many transfers and become bottlenecks
42. Depth-first Pipeline
- Construct a depth-first pipeline using topology information (see the sketch below)
- Avoid link sharing by using each link only once
- Minimizes the completion time in a tree topology
- Slow nodes degrade the performance of other nodes
Shirai et al. A Fast Topology Inference - A
building block for network-aware parallel
computing. (HPDC 2007)
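A minimal sketch of turning a known tree topology into a depth-first pipeline order; each node forwards the data to the next node in the chain, so every link is traversed at most once in each direction (the code is illustrative, not the cited implementation):

    def depth_first_order(children, root):
        # children: dict mapping each node to its child nodes in the physical tree.
        order, stack = [], [root]
        while stack:
            u = stack.pop()
            order.append(u)
            stack.extend(reversed(children.get(u, [])))
        return order  # forward data along this chain: order[0] -> order[1] -> ...

    # Example: depth_first_order({"S": ["A", "B"], "A": ["C"]}, "S") -> ['S', 'A', 'C', 'B']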
43. Dijkstra Algorithm
- Construct a relaying structure in a greedy manner (see the sketch below)
- Add, one by one, the node reachable with the maximum bandwidth
- The effect of slow nodes is small
- Some links may be used by many transfers and become bottlenecks
Wang et al. A novel data grid coherence protocol using pipeline-based aggressive copy method. (GPC 2007, pages 484-495)
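A sketch in the spirit of this greedy construction: starting from the source, repeatedly attach the node that can be reached from the already-connected set with the largest bottleneck bandwidth (a Prim-style widest-path tree; identifiers are illustrative):

    def widest_path_tree(capacity, source, nodes):
        # capacity: dict {(u, v): link bandwidth}; returns parent pointers of the tree.
        width = {source: float("inf")}
        parent = {source: None}
        connected, remaining = {source}, set(nodes) - {source}
        while remaining:
            best = None
            for u in connected:
                for v in remaining:
                    if (u, v) in capacity:
                        w = min(width[u], capacity[(u, v)])
                        if best is None or w > best[0]:
                            best = (w, u, v)
            if best is None:        # the remaining nodes are unreachable
                break
            w, u, v = best
            width[v], parent[v] = w, u
            connected.add(v)
            remaining.remove(v)
        return parent

    # Example: widest_path_tree({("S", "A"): 100, ("S", "B"): 10, ("A", "B"): 100}, "S", ["A", "B"])
    # -> {'S': None, 'A': 'S', 'B': 'A'}   (B attaches via A, avoiding the slow S->B link)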