Title: Shortest Path Algorithms
1. Shortest Path Algorithms
2. Talk Outline
- Background for the problem
- Algorithms
- Code Listings
- Numerical Results
- Conclusions and Issues
3. Talk Outline
- Background for the problem
- Algorithms
- Code Listings
- Numerical Results
- Conclusions and Issues
4. Weighted Directed Graph
[Figure: example weighted directed graph with numbered vertices and edge weights]
5. Distance Matrix
[Figure: distance matrix for the example graph]
6. Shortest Path Problem
- Given the adjacency matrix A.
- Compute the distance matrix D.
7. Talk Outline
- Background for the problem
- Algorithms
- Code Listings
- Numerical Results
- Conclusions and Issues
8. Floyd's Algorithm

for (k = 0; k < n; k++)
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
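The triple loop above is complete as it stands; a minimal self-contained C sketch follows. The toy 4-vertex graph, the INF sentinel for "no edge", and the function name are illustrative assumptions, not from the slides.

```c
#define N 4           /* toy problem size, for illustration only */
#define INF 1000000   /* stands in for "no edge"; an assumed sentinel */

/* Floyd's algorithm: D starts as the adjacency matrix (0 on the
   diagonal, INF where there is no edge) and ends as the distance
   matrix. The update is exactly
   D[i][j] = min(D[i][j], D[i][k] + D[k][j]). */
void floyd(int D[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (D[i][k] + D[k][j] < D[i][j])
                    D[i][j] = D[i][k] + D[k][j];
}
```

Writing the min as an explicit comparison avoids evaluating the sum twice; the behavior is the same as the slide's one-line update.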
9. Floyd's Algorithm

D[i][j] = min(D[i][j], D[i][k] + D[k][j])

[Figure: the update step, showing entries d_ij, d_ik, and d_kj at the intersections of rows i and k with columns j and k]
10. Floyd Parallel 1
- Give each processor a contiguous set of rows of A and D (row-wise partition).
- Can use at most N processors.
11. Floyd Parallel 1

for (k = 0; k < n; k++)
  for (i = i_local_start; i < i_local_end + 1; i++)
    for (j = 0; j < n; j++)
      D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
12. Floyd Parallel 1 (P3's view)

for (k = 0; k < n; k++)
  for (i = i_local_start; i < i_local_end + 1; i++)
    for (j = 0; j < n; j++)
      D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

[Figure: the matrix with the kth row and P3's rows ("my rows") highlighted]
13. Floyd Parallel 1 Costs
- Computation:
  - T = (N^3/P) tc
- Communication (broadcasts of the kth row):
  - T = N log(P) (a + bN)
- Overall:
  - T = (N^3/P) tc + N log(P) (a + bN)
14. Floyd Parallel 2
- Give each processor a contiguous block of A and D (row/column partition).
- Can use up to N^2 processors.
15. Floyd Parallel 2

for (k = 0; k < n; k++)
  for (i = i_local_start; i < i_local_end + 1; i++)
    for (j = j_local_start; j < j_local_end; j++)
      D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
16. Floyd Parallel 2 (P14's view)

for (k = 0; k < n; k++)
  for (i = i_local_start; i < i_local_end + 1; i++)
    for (j = j_local_start; j < j_local_end; j++)
      D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

[Figure: the matrix with P14's block ("my block") highlighted]
17. Floyd Parallel 2 (P14's view)

for (k = 0; k < n; k++)
  for (i = i_local_start; i < i_local_end + 1; i++)
    for (j = j_local_start; j < j_local_end; j++)
      D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

[Figure: the matrix with the kth row, the kth column, and P14's block highlighted]
18. Floyd Parallel 2 Costs
- Computation:
  - T = (N^3/P) tc
- Communication (broadcasts of the kth row and kth column):
  - T = 2N log(sqrt(P)) (a + bN/sqrt(P))
    = N log(P) (a + bN/sqrt(P))
- Overall:
  - T = (N^3/P) tc + N log(P) (a + bN/sqrt(P))
19. Dijkstra's Algorithm: find shortest paths from vertex s to all others (row s of D).

d[s] = 0; if (i != s) d[i] = inf;
T = V;  /* set of all vertices */
for (k = 0; k < n; k++) {
  find i in T with min d[i]
  for each edge (i, j) with j in T
    if (d[j] > d[i] + a[i][j]) d[j] = d[i] + a[i][j]
  T = T - {i}
}
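The pseudocode above can be turned directly into runnable C. The adjacency-matrix representation, the INF sentinel, and the 5-vertex problem size below are illustrative assumptions.

```c
#define NV 5          /* toy vertex count, for illustration only */
#define INF 1000000   /* stands in for "no edge"; an assumed sentinel */

/* Single-source Dijkstra over adjacency matrix a (INF = no edge),
   following the slide's pseudocode: repeatedly remove the closest
   unvisited vertex i from T, then relax its outgoing edges. */
void dijkstra(const int a[NV][NV], int s, int d[NV])
{
    int inT[NV];                          /* 1 while vertex is still in T */
    for (int v = 0; v < NV; v++) { d[v] = INF; inT[v] = 1; }
    d[s] = 0;
    for (int k = 0; k < NV; k++) {
        int i = -1;
        for (int v = 0; v < NV; v++)      /* find i in T with min d[i] */
            if (inT[v] && (i < 0 || d[v] < d[i])) i = v;
        inT[i] = 0;                       /* T = T - {i} */
        for (int j = 0; j < NV; j++)      /* relax edges (i, j), j in T */
            if (inT[j] && d[i] + a[i][j] < d[j])
                d[j] = d[i] + a[i][j];
    }
}
```

The linear scan for the minimum gives the O(N^2) per-source cost assumed by the cost models later in the talk.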
20. Dijkstra's Algorithm

D[j] = min(D[j], D[i] + A[i][j])

[Figure: the relaxation step, showing vertices s, i, and j with labels d_i, d_j, and a_ij]
21. Dijkstra Parallel
- Give each processor all of A and have it run serial Dijkstra to compute a contiguous set of rows of D.
- Can use at most N processors. Local memory must hold all of A.

[Figure: A replicated on every processor; D partitioned by rows]
22. Dijkstra's Algorithm, Parallel

for (s = local_firstrow; s < local_lastrow; s++) {
  D[s][s] = 0; if (i != s) D[s][i] = inf;
  T = V;  /* set of all vertices */
  for (k = 0; k < n; k++) {
    find i in T with min D[s][i]
    for each edge (i, j) with j in T
      if (D[s][j] > D[s][i] + A[i][j])
        D[s][j] = D[s][i] + A[i][j]
    T = T - {i}
  }
}
23. Dijkstra Parallel Costs
- From the literature, Dijkstra's algorithm is slower than Floyd's by a factor F of about 1.6.
- Computation:
  - T = (N^3/P) F tc
- No communication.
24. Cost Summary
[Table: cost models for Floyd 1, Floyd 2, and Dijkstra]
25. Talk Outline
- Background for the problem
- Algorithms
- Code Listings
- Numerical Results
- Conclusions and Issues
26. Talk Outline
- Background for the problem
- Algorithms
- Code Listings
- Numerical Results
- Conclusions and Issues
27. Run Times on Bluemarlin
- On each run, the maximum time over all the processors was recorded as the time for that run.
- Three runs were made, and the median run time is recorded in the following tables.
28. Serial Run Times
- In serial, the Floyd code was faster than the Dijkstra code.
- Floyd's speed advantage was smaller than the reported value: our F was about 1.15.
29. Parallel Run Times
30. N = 720, Run Times and Speed-up
- Floyd 2 is fastest up to P = 4; Dijkstra is fastest thereafter.
- Dijkstra has the best speed-ups.
31. Models of Performance
- Ping-pong test code was used to generate data.
- Least-squares fit:
  - a = 0.002 sec/message
  - b = 8x10^-7 sec/double
32. Dijkstra Model Comparison
- Model:
  - T = (N^3/P) F tc
- No communication, so no a or b terms.
33. Floyd Model Comparison
- Poor match between model and results.
- The model appears to overestimate the cost of communication.
34. Talk Outline
- Background for the problem
- Algorithms
- Code Listings
- Numerical Results
- Conclusions and Issues
35. The actual run times for the codes confirmed expectations.
- Floyd was faster than Dijkstra in serial.
- With an increasing number of processors, Dijkstra eventually becomes faster because no communication occurs.
- Speed-ups were good for Floyd 1, better for Floyd 2, and best (near perfect) for Dijkstra.
- The worst speed-up was with Floyd 1 and N = 576, where the speed-up was 9.3 for the 16-processor run.
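The speed-up figures above follow from the standard definitions; this minimal sketch makes them explicit (the 9.3 figure is from the slide, the helper names are mine).

```c
/* Speed-up S = T_serial / T_parallel;
   parallel efficiency E = S / P.     */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

double efficiency(double s, double p)
{
    return s / p;
}
```

A speed-up of 9.3 on 16 processors corresponds to roughly 58% parallel efficiency, which is why Floyd 1's scaling is described as the worst of the three codes.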
36. The model gave a poor quantitative prediction for the parallel run times of Floyd.
- Communication consists only of broadcasts.
- The a and b terms were computed from MPI_Send and MPI_Recv code.
- Plugging these a and b into
  - T = N log(P) (a + bN)
- must not give a good prediction for broadcast time.
- Actual broadcast times seem to be quite a bit smaller.
37. Reconciling Model to Result
- It is clear that the cost of the broadcasts is being overestimated.
- The model fits much better if the latency factor a is reduced by a factor of 10.
38. Reconciling Model to Result
- It is clear that the cost of the broadcasts is being overestimated.
- The model fits much better if the latency factor a is reduced by a factor of 10.
- Better still if the bandwidth parameter b is also reduced by a factor of 10.
39. Future Work?
- Time the broadcast and see how it matches the model (a poor match is expected).
- Adjust a and b to fit.
- Find another model with a better match.