Title: A Parallel Implementation of UPGMA
1A Parallel Implementation of UPGMA
- Feng Yue
- Department of Computer Science
- University of South Carolina
- yue2_at_cse.sc.edu
2Phylogeny
- A phylogeny is a reconstruction of the
evolutionary history of a collection of
organisms.
- It usually takes the form of a tree.
- Modern organisms are placed at the leaves.
- Edges denote evolutionary relationships.
3Phylogeny
[Figure: phylogenetic tree with leaves Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life website, University of Arizona.]
412 Species of Campanulaceae
5Phylogenetic Data
- All kinds of data have been used
- behavioral, morphological, metabolic, etc.
- Predominant choice is molecular data.
- Two main kinds of molecular data
- Sequence data (DNA sequence on genes)
- Gene-order data (gene sequence on chromosomes)
6Sequence Data
- Typically the DNA sequence of a few genes.
- Characters are individual positions in the string
and can assume four states (A, C, G, T).
- Evolves through point mutations, insertions
(incl. duplications), and deletions.
7DNA Sequence Evolution
8Phylogeny Problem
[Figure: five taxa U, V, W, X, Y; sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT; candidate tree with leaves in the order X, U, Y, V, W]
9Gene-Order Data
- The ordered sequence of genes on one or more
chromosomes.
- The entire gene order is a single character, which
can assume a huge number of states.
- Evolves through inversions, insertions (incl.
duplications), and deletions; also transpositions
(in mitochondria) and translocations (between
chromosomes).
10Gene-Order Guillardia Chloroplast
12Gene-Order Data Attributes
- Advantages
- Low error rate (depends on recognizing
homologies).
- No gene tree / species tree problem.
- Evolutionary events are rare and unlikely to cause
"silent" changes, so the data can reach back hundreds
of millions of years.
- Problems
- Mathematics much more complex than for sequence
data.
- Models of evolution not well characterized.
- Very limited data (mostly organelles).
- Possibly insufficient discrimination among
recently evolved organisms.
13Phylogenetic Reconstruction
- Three categories of methods
- Distance-based methods, such as neighbor-joining
and UPGMA.
- Parsimony-based methods (such as implemented in
PAUP, Phylip, Mega, TNT, etc.)
- Likelihood-based methods (including Bayesian
methods, such as implemented in PAUP, Phylip,
FastDNAML, MrBayes, GAML, etc.)
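Distance-based methods start from a matrix of pairwise distances. As a rough illustration only, the sketch below builds such a matrix from the five sequences shown on the Phylogeny Problem slide using the plain Hamming distance. The Hamming distance is an assumption made here for simplicity (the later example in these slides uses PAM distances), and the labels a to e are placeholders, not the slide's taxon names.

```python
from itertools import combinations

# Sequences from the Phylogeny Problem slide. The labels a-e are
# hypothetical placeholders; the slide's taxon-to-sequence pairing
# is not assumed here.
seqs = {
    "a": "TAGCCCA",
    "b": "TAGACTT",
    "c": "TGCACAA",
    "d": "TGCGCTT",
    "e": "AGGGCAT",
}

def hamming(s, t):
    """Number of positions at which two aligned sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(s, t))

def distance_matrix(named_seqs):
    """Symmetric dict-of-dicts of pairwise Hamming distances."""
    names = list(named_seqs)
    d = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d[n][m] = d[m][n] = hamming(named_seqs[n], named_seqs[m])
    return d

D = distance_matrix(seqs)
```

A matrix like `D` is the input that neighbor-joining or UPGMA would then cluster.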
140. Distance Matrix
Neighbor Joining Method Example
Distance matrix
151. First Step
PAM distance 3.3 (Human - Monkey) is the minimum.
So we'll join Human and Monkey to MonHum and
we'll calculate the new distances.
[Figure: tree after the first join. Monkey and Human are joined under Mon-Hum; Spinach, Mosquito, and Rice are not yet joined.]
162. Calculation of New Distances
After we have joined two species in a subtree, we
have to compute the distances from every other
node to the new subtree. We do this with a simple
average of distances:
Dist(Spinach, MonHum)
= (Dist(Spinach, Monkey) + Dist(Spinach, Human))/2
= (90.8 + 86.3)/2 = 88.55
[Figure: Spinach and the Mon-Hum subtree (Monkey, Human)]
173. Next Cycle
[Figure: Mosquito joined to Mon-Hum (Monkey, Human) as Mos-(Mon-Hum); Spinach and Rice are not yet joined.]
184. Penultimate Cycle
[Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum)]
195. Last Joining
[Figure: final join (Spin-Rice)-(Mos-(Mon-Hum)) over the leaves Human, Monkey, Mosquito, Spinach, and Rice]
20Unrooted Neighbor Joining tree
[Figure: unrooted tree with leaves Human, Spinach, Monkey, Mosquito, and Rice]
21UPGMA
- UPGMA (Unweighted Pair Group Method with
Arithmetic mean)
- Cluster analysis
- Start from a set of nodes in a graph G and a
matrix D of pairwise distances between nodes.
- The goal is to construct a tree
- length of arcs = distance
- leaves = original nodes of graph G
- internal nodes = created as clusters of
children nodes.
22UPGMA Example
      1   2   3   4   5   6   7   8   9
1     0   2   8  10  12
2     2   0  14  16  24
3     8  14   0   4   6
4    10  16   4   0   6
5    12  24   6   6   0
6
7
8
9
Ht
- Original 5 x 5 matrix
- Final 9 x 9
- Nodes 1, 2 -> 6
- Nodes 3, 4 -> 7
- Nodes 5, 7 -> 8
- Nodes 6, 8 -> 9 (root)
- Bold = active
- Asterisks = not active
23Combine nodes 1 and 2 to produce node 6
      1   2   3   4   5   6   7   8   9
1     0   2   8  10  12
2     2   0  14  16  24
3     8  14   0   4   6  11
4    10  16   4   0   6  13
5    12  24   6   6   0  18
6            11  13  18   0
7
8
9
Ht    0   0               1
- Minimum = 2, from (1, 2)
- Generate node 6
- Height = 2/2 = 1.
- Set nodes 1, 2 not active
- Calculate the distance
- d3,6 = (d3,1 + d3,2)/2
- = (8 + 14)/2 = 11
- Update the distance matrix
24Combine nodes 3 and 4 to produce node 7
      1   2   3   4   5   6   7   8   9
1     0   2   8  10  12
2     2   0  14  16  24
3     8  14   0   4   6  11
4    10  16   4   0   6  13
5    12  24   6   6   0  18   6
6            11  13  18   0  12
7                     6  12   0
8
9
Ht    0   0   0   0       1   2
- Min = 4, from (3, 4)
- Generate node 7.
- The height of node 7 is 4/2 = 2.
- Set nodes 3, 4 not active
- Calculate the distance.
- d5,7 = (d5,3 + d5,4)/2
- = (6 + 6)/2 = 6
- Update the distance matrix
25Combine nodes 5 and 7 to produce node 8
      1   2   3   4   5   6   7   8   9
1     0   2   8  10  12
2     2   0  14  16  24
3     8  14   0   4   6  11
4    10  16   4   0   6  13
5    12  24   6   6   0  18   6
6            11  13  18   0  12  14
7                     6  12   0
8                        14
9
Ht    0   0   0   0   0   1   2   3
- Min = 6, from (5, 7)
- Generate node 8.
- The height of node 8 is
- 6/2 = 3.
- Set nodes 5, 7 not active
- Calculate the distance, weighting each child by
its number of leaves (node 5 has 1, node 7 has 2).
- d6,8 = (1 x d6,5 + 2 x d6,7)/3
- = (18 + 24)/3 = 14
- Update the distance matrix
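The value 14 in row 8 is what a size-weighted average of the two children gives, which is the same as averaging over all leaf pairs between the two clusters. A minimal sketch of that check follows; the function name `merged_distance` is hypothetical and the numbers are taken from the matrices on these slides.

```python
def merged_distance(d_im, d_jm, size_i, size_j):
    """Distance from cluster m to the union of clusters i and j,
    weighting each child by its number of leaves."""
    return (size_i * d_im + size_j * d_jm) / (size_i + size_j)

# Node 8 = node 5 (1 leaf) + node 7 (2 leaves: 3 and 4).
# d(6,5) = 18 and d(6,7) = 12 come from the matrix above.
d_6_8 = merged_distance(d_im=18, d_jm=12, size_i=1, size_j=2)

# Same value from the original 5 x 5 matrix: the mean over all leaf
# pairs between cluster 6 = {1, 2} and cluster 8 = {3, 4, 5}.
leaf_pairs = {(1, 3): 8, (1, 4): 10, (1, 5): 12,
              (2, 3): 14, (2, 4): 16, (2, 5): 24}
d_6_8_pairs = sum(leaf_pairs.values()) / len(leaf_pairs)
```

When both children are single leaves, as in the first two merges, the weights are equal and the formula reduces to the simple average used on the earlier slides.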
26Combine nodes 6 and 8 to produce node 9
      1   2   3   4   5   6   7   8   9
1     0   2   8  10  12
2     2   0  14  16  24
3     8  14   0   4   6  11
4    10  16   4   0   6  13
5    12  24   6   6   0  18   6
6            11  13  18   0  12  14
7                     6  12   0
8                        14
9
Ht    0   0   0   0   0   1   2   3   7
- Min = 14, from (6, 8)
- Generate node 9.
- The height of node 9 is
- 14/2 = 7.
27Distance Tree
28UPGMA Algorithm Summarized
- 1. Initialization
- (a). Assign each sequence i to its own cluster
Ci. - (b). Define one leaf of T for each sequence, and
place at height zero. - 2. Iteration
- (a) Find the minimal distance in the matrix,
determine the two clusters i, j. - (b) Define a new cluster k by CkCi U Cj
- (c) Define a node k with children nodes i and j,
and place it at height Dij/2. - (d) Add k to the current clusters and remove i
and j. - 3. Termination
- When only two clusters I, j remain, place the
root and height Dij/2.
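The steps above can be sketched as a short serial routine. This is an illustrative sketch, not the implementation from the thesis: the function name `upgma`, the dictionary layout, and the tie-breaking rule (first minimum found) are assumptions. Nodes are numbered from 1 as in the worked example.

```python
def upgma(dist):
    """Run UPGMA on a symmetric distance matrix (list of lists).

    Returns a list of (child_i, child_j, new_node, height) tuples.
    """
    n = len(dist)
    # d[(i, j)] with i < j holds the distance between two active clusters.
    d = {(i + 1, j + 1): float(dist[i][j])
         for i in range(n) for j in range(i + 1, n)}
    size = {i + 1: 1 for i in range(n)}   # leaves per active cluster
    merges = []
    next_id = n + 1
    while len(size) > 1:
        # 2(a): closest pair of active clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        k = next_id
        next_id += 1
        # 2(b), 2(c): new node k at height Dij / 2.
        merges.append((i, j, k, dij / 2))
        # Distance from k to every other cluster: size-weighted average.
        for m in size:
            if m in (i, j):
                continue
            d_im = d[(min(i, m), max(i, m))]
            d_jm = d[(min(j, m), max(j, m))]
            d[(m, k)] = (size[i] * d_im + size[j] * d_jm) / (size[i] + size[j])
        # 2(d): add k, remove i and j.
        d = {pair: v for pair, v in d.items()
             if i not in pair and j not in pair}
        size[k] = size[i] + size[j]
        del size[i], size[j]
    return merges

example = [
    [0, 2, 8, 10, 12],
    [2, 0, 14, 16, 24],
    [8, 14, 0, 4, 6],
    [10, 16, 4, 0, 6],
    [12, 24, 6, 6, 0],
]
merges = upgma(example)
```

On the 5 x 5 example matrix this reproduces the merges of the worked example: nodes 6, 7, 8, and 9 at heights 1, 2, 3, and 7.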
29Three Stages of MPI Execution
- When the user gives the command to execute the
program, a copy of the program is sent to each
processor.
- Each processor executes its own copy of the
executable program.
- Different processors may execute different
statements by branching, within the program,
based on their process rank.
- Use MPI_Barrier() to synchronize.
30Parallel Implementation
- Process 0 is the master; the rest of the processes
are the workers.
- Process 0 reads the data from an external file and
distributes it evenly to the workers.
- Each worker finds its local minimum and sends it
back to the master process.
- The master finds the global minimum, updates the
matrix, and sends the new row to the proper worker.
- The procedure repeats n-1 times until we get the
final matrix.
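The master/worker loop can be imitated in a single process to check the bookkeeping before any MPI code is written. The sketch below is a serial simulation under stated assumptions, not the MPI program itself: rows are dealt to workers cyclically (row r goes to worker r mod n_workers, which matches the row assignments in the walkthrough that follows), new-row distances use a size-weighted average, and the names `local_min` and `simulate` are hypothetical.

```python
INF = float("inf")

def local_min(rows, tri, active):
    """Worker step: smallest active entry tri[r][c] (c < r) in its rows."""
    best = (INF, None, None)
    for r in rows:
        if r not in active:
            continue
        for c in sorted(active):
            if c < r and tri[r][c] < best[0]:
                best = (tri[r][c], r, c)
    return best

def simulate(dist, n_workers=2):
    """Serial stand-in for the master/worker loop; nodes numbered from 0."""
    n = len(dist)
    # Lower triangle only: tri[r][c] is defined for c < r.
    tri = {r: {c: float(dist[r][c]) for c in range(r)} for r in range(n)}
    owner = {r: r % n_workers for r in range(n)}   # cyclic row distribution
    size = {r: 1 for r in range(n)}
    active = set(range(n))
    heights, trace = {}, []
    for step in range(n - 1):
        # Each worker reports its local minimum; the master takes the global one.
        reports = [local_min([r for r in tri if owner[r] == w], tri, active)
                   for w in range(n_workers)]
        dmin, r, c = min(reports, key=lambda rep: rep[0])
        trace.append((dmin, r, c))
        # The master builds the new row k and hands it to one worker.
        k = n + step
        tri[k] = {}
        for m in active - {r, c}:
            d_rm = tri[max(r, m)][min(r, m)]
            d_cm = tri[max(c, m)][min(c, m)]
            tri[k][m] = (size[r] * d_rm + size[c] * d_cm) / (size[r] + size[c])
        size[k] = size[r] + size[c]
        owner[k] = k % n_workers
        active = (active - {r, c}) | {k}
        heights[k] = dmin / 2
    return heights, trace

example = [
    [0, 2, 8, 10, 12],
    [2, 0, 14, 16, 24],
    [8, 14, 0, 4, 6],
    [10, 16, 4, 0, 6],
    [12, 24, 6, 6, 0],
]
heights, trace = simulate(example)
```

In a real MPI program each `local_min` call would run on its own process and the master would collect the reports over the network. Here worker index 0 plays the role of process 1 and worker index 1 the role of process 2.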
31Loop 0, P0
0 1 2 3 4 5 6 7 8
0
1 2
2 8 14
3 10 16 4
4 12 24 6 6
5
6
7
8
- Original matrix 5 x 5
- Complete matrix 9 x 9
- 3 processors: process 0 is the master, processes 1
and 2 are the workers.
- Asterisk indicates rows and columns that are not
active.
- Bold means active.
- Only work on the lower triangle of the matrix.
32Loop 0, P1
- This is the matrix processor 1 receives from the
master processor. It gets three rows: 0, 2, 4. The
local minimum is 6, coming from (4,2).
0 1 2 3 4 5 6 7 8
0
1
2 8 14
3
4 12 24 6 6
5
6
7
8
33Loop 0, P2
- This is the matrix processor 2 receives from the
master processor. It gets two rows: 1 and 3. The
local minimum is 2, from (1,0).
0 1 2 3 4 5 6 7 8
0
1 2
2
3 10 16 4
4
5
6
7
8
34Loop 1, P0
      0   1   2   3   4   5   6   7   8
0
1     2
2     8  14
3    10  16   4
4    12  24   6   6
5            11  13  18
6
7
8
Ht.   0   0   0   0   0   1
- Collect local mins 2, 6
- Global min = 2, from (0, 1)
- Combine nodes 1 and 0 to get node 5.
- Update matrix
- Update the non-active list to 1, 0.
- Send row 5 to P2
35Loop 1, p1
- Data do not change, but the non-active list has
changed.
- Rows 0, 1 and columns 0, 1 are no longer active.
- The local minimum is 6, coming from (4,2).
0 1 2 3 4 5 6 7 8
0
1
2 8 14
3
4 12 24 6 6
5
6
7
8
36Loop 1, p2
- P2 receives a new row 5 from the master
processor.
- Rows 0, 1 and columns 0, 1 are no longer active.
The local minimum is 4, from (3,2).
      0   1   2   3   4   5   6   7   8
0
1     2
2
3    10  16   4
4
5            11  13  18
6
7
8
37Loop 2, P0
      0   1   2   3   4   5   6   7   8
0
1     2
2     8  14
3    10  16   4
4    12  24   6   6
5            11  13  18
6                     6  12
7
8
Ht.   0   0   0   0   0   1   2
- Collected local mins 4, 6
- Calculate global min = 4, from (3,2)
- Nodes 3 and 2 merge into node 6.
- Update the non-active list to 1, 0, 3, 2.
- Update matrix
- Send row 6 to P1
38Loop 2, p1
- P1 receives a new row, no. 6, from the master
processor.
- Now the non-active list is 0, 1, 2, 3.
- The local minimum is 6, coming from (6,4).
      0   1   2   3   4   5   6   7   8
0
1
2     8  14
3
4    12  24   6   6
5
6                     6  12
7
8
39Loop 2, p2
- The data in P2 do not change.
- It only needs to work on row 5 (rows 1, 3 are not
active).
- The local minimum is 18, from (5,4).
      0   1   2   3   4   5   6   7   8
0
1     2
2
3    10  16   4
4
5            11  13  18
6
7
8
40Loop 3, P0
      0   1   2   3   4   5   6   7   8
0
1     2
2     8  14
3    10  16   4
4    12  24   6   6
5            11  13  18
6                     6  12
7                        14
8
Ht.   0   0   0   0   0   1   2   3
- Collected local mins 6, 18.
- Calculate global min = 6, from (6,4).
- Combine nodes 6 and 4 to get node 7.
- Update the non-active list to 1, 0, 3, 2, 6, 4.
- Send the new row no. 7 to P2.
41Loop 3, p1
- The data in P1 do not change.
- It should work on rows 0, 2, 4, 6, but none of them
are active.
- Therefore the local minimum is INFINITY.
      0   1   2   3   4   5   6   7   8
0
1
2     8  14
3
4    12  24   6   6
5
6                     6  12
7
8
42Loop 3, p2
- P2 gets a new row, no. 7, from the master processor.
- Non-active list is 0, 1, 2, 3, 4, 6.
- The local minimum is 14, from (7,5).
      0   1   2   3   4   5   6   7   8
0
1     2
2
3    10  16   4
4
5            11  13  18
6
7                        14
8
43Matrix on process 0
      0   1   2   3   4   5   6   7   8
0
1     2
2     8  14
3    10  16   4
4    12  24   6   6
5            11  13  18
6                     6  12
7                        14
8
Ht.   0   0   0   0   0   1   2   3   7
- Local mins 14, infinity.
- Global min = 14
- Combine nodes 7 and 5 to get node 8, the root.
- Update the non-active list to 1, 0, 3, 2, 6, 4,
7, 5.
- Write down the height as 14/2 = 7.0
44Distance Tree
45Performance Speedup
- Only working on the lower triangular part of the
matrix.
- How to break up the triangular matrix?
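One way to look at the question of splitting a triangular matrix is to count how many entries each worker would hold. The sketch below compares two schemes; both scheme names and the function `row_loads` are illustrative assumptions. The cyclic scheme is the one that matches the row assignments in the walkthrough (rows 0, 2, 4 on one worker and rows 1, 3 on the other).

```python
def row_loads(n_rows, n_workers, scheme):
    """Entries of the strict lower triangle held by each worker.

    Row r holds r entries (columns 0 .. r-1).
    scheme is "cyclic" (row r -> worker r mod p) or
    "block" (contiguous bands of rows).
    """
    loads = [0] * n_workers
    block = -(-n_rows // n_workers)   # ceiling division: rows per band
    for r in range(n_rows):
        w = r % n_workers if scheme == "cyclic" else r // block
        loads[w] += r
    return loads

block_loads = row_loads(800, 4, "block")
cyclic_loads = row_loads(800, 4, "cyclic")
```

For an 800-row triangle on 4 workers, contiguous bands leave the last worker with about seven times the entries of the first, while dealing rows out cyclically keeps the loads within one percent of each other.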
46Performance Speedup
- Check active elements by triple loops
- for i = 1 to 1000
- for j = 1 to 1000
- for active_live = 1 to 1000
- check active or not
- next
- next
- next
- Check active elements by linked list.
- linkList a stores the active columns (the same on
every processor)
- linkList b stores the active rows (different on
each processor)
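The difference between the two approaches can be sketched as follows. This is a simplified illustration, not the thesis code: Python lists stand in for the linked lists, indices are kept in ascending order, and the function names are hypothetical.

```python
def scan_naive(n, inactive):
    """Triple loop: visit every (i, j) with j < i and search the
    inactive list for each pair."""
    live, checks = [], 0
    for i in range(n):
        for j in range(i):
            dead = False
            for k in inactive:
                checks += 1
                if k == i or k == j:
                    dead = True
                    break
            if not dead:
                live.append((i, j))
    return live, checks

def scan_lists(active_rows, active_cols):
    """Walk only the stored active rows and columns (both ascending)."""
    live = []
    for i in active_rows:
        for j in active_cols:
            if j >= i:        # stay inside the lower triangle
                break
            live.append((i, j))
    return live

# State after nodes 0, 1, 2, 3 have been merged, with rows 0..6 present.
naive_live, naive_checks = scan_naive(7, inactive=[0, 1, 2, 3])
list_live = scan_lists(active_rows=[4, 5, 6], active_cols=[4, 5, 6])

# A worker that owns rows 0, 2, 4, 6 keeps only its own active rows.
worker_live = scan_lists(active_rows=[4, 6], active_cols=[4, 5, 6])
```

Both scans find the same live entries, but the list-based scan never touches a dead row or column, so its cost shrinks as clusters are merged. A true linked list would additionally make removing a merged index a constant-time operation once the node is located.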
47Performance Analysis
800 x 800 matrix
Worker proc   Clock time   Total user time   Total system time   Speedup
1                   9.69              7.11                0.64
2                   6.74              7.26                0.86      1.43
4                   6.58              7.44                1.26      1.47
8                  7.725              8.46                3.38      1.25
16                  10.7              9.55                7.47       N/A
24                 14.83             12.38               12.04       N/A
48Performance Analysis
1600 x 1600 matrix
Worker proc   Clock time   Total user time   Total system time   Speedup
1                  62.10             52.12                2.45
2                  39.34             52.73                3.53      1.58
4                  28.79             55.03                4.96      2.16
8                  24.43             57.98                9.98      2.54
16                 26.27             64.29               19.07      2.36
24                 26.32             68.85               26.95      2.36
49Performance Analysis
3200 x 3200 matrix
Worker proc   Clock time   Total user time   Total system time   Speedup
1                 497.91            464.24                7.72
2                 268.96            451.77                9.34      1.85
4                 163.46            467.83               13.32      3.04
8                 114.52            489.94               26.11      4.35
16                 87.73            499.34               47.27      5.67
24                105.22            527.97               70.6       4.73
51Clock time Vs. Number of Processors
Matrix size   1-2   2-4   4-8   8-16   16-24
800x800        30     2   Neg    Neg     Neg
1600x1600      37    27    15      3     Neg
3200x3200      48    39    30     23     Neg
- The bigger the problem size, the better the
performance.
- Communication time.
52Speedup
- Speedup = T(1) / T(n)
- Speedup is gained only from the part of a program
that can be parallelized.
- The maximum speedup is limited by the serial
part of the program.
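The definition can be applied directly to the clock times in the 3200 x 3200 table. The sketch below also includes the standard Amdahl's law bound as one way to express the limit set by the serial part; that formula is not taken from these slides, and the serial fractions passed to it in the example are hypothetical, not measured.

```python
def speedup(t1, tn):
    """Speedup = T(1) / T(n)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Speedup per process."""
    return speedup(t1, tn) / n

def amdahl_bound(serial_fraction, n):
    """Amdahl's law: upper bound on speedup with n processes when a
    fixed fraction of the run time cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# Clock times from the 3200 x 3200 table, keyed by worker processes.
clock_3200 = {1: 497.91, 2: 268.96, 4: 163.46, 8: 114.52, 16: 87.73, 24: 105.22}
speedups = {n: speedup(clock_3200[1], t) for n, t in clock_3200.items()}
best_n = max(speedups, key=speedups.get)

# Hypothetical serial fraction of 10%: the bound approaches 10
# however many processes are added.
bound_16 = amdahl_bound(0.10, 16)
```

The computed ratios agree with the Speedup column to within rounding. Amdahl's bound only accounts for the serial part; the drop from 16 to 24 workers in the table also reflects growing communication cost, which the bound does not model.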
54Conclusion
- The parallelization strategy appears to be highly
effective for large problems (a speedup of 5.67 with
16 worker processes on the 3200 x 3200 matrix).
- The performance is closely related to the size
of the problem.
- Different problem sizes have different optimal
numbers of processors.
55Reference
- The bioinformatics background comes from
http://www.compbio.unm.edu/poincare.pdf
http://www.cs.utexas.edu/users/tandy
- http://sansan.phy.ncu.edu.tw/hclee/lec/Lecture3_Phylogeny.ppt
- The parallel implementation of UPGMA is
abstracted from my master's thesis under Dr.
Buell's supervision.