Title: Parallel Bayesian Phylogenetic Inference
1. Parallel Bayesian Phylogenetic Inference
- Xizhou Feng
- Directed by Dr. Duncan Buell
- Department of Computer Science and Engineering
- University of South Carolina, Columbia
- 2002-10-11
2. Topics
- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical result
- Future research
3. Background
- Darwin: species are related through a history of common descent, and
- The history can be organized as a tree structure (phylogeny)
  - Modern species are put on the leaf nodes
  - Ancient species are put on the internal nodes
  - The time of divergence is described by the length of the branches
- A clade is a group of organisms whose members share homologous features derived from a common ancestor
4. Phylogenetic tree
[Figure: an example phylogenetic tree, labeling internal nodes (ancestral species), leaf nodes (current species), branches, branch lengths, and a clade]
5. Applications
- The phylogenetic tree is
  - fundamental to understanding evolution and diversity
  - a principle for organizing biological data
  - central to organism comparison
- Practical examples
  - Resolving the quarrel over bacteria-to-human gene transfers (Nature, 2001)
  - Tracing routes of infectious disease transmission
  - Identifying new pathogens
  - Mapping the phylogenetic distribution of biochemical pathways
6. Using DNA data for phylogenetic inference
7. Objectives of phylogenetic inference
[Figure: from input (DNA sequence data) to output (an inferred tree)]
- Major objectives include
  - Estimating the tree topology
  - Estimating the branch lengths, and
  - Describing the credibility of the result
8. Phylogenetic inference methods
- Algorithmic methods
  - Define a sequence of specific steps that leads to the determination of a tree
  - e.g. UPGMA (unweighted pair group method using arithmetic average) and Neighbor-Joining
- Optimality criterion-based methods
  - 1) Define a criterion; 2) search for the tree with the best value
  - Maximum parsimony (minimize the total tree length)
  - Maximum likelihood (maximize the likelihood)
  - Maximum posterior probability (the tree with the highest probability of being the true tree)
9. Commonly used phylogeny methods
- Character data
  - Bayesian methods (statistically supported): stochastic search (GA, SA, MCMC)
  - Maximum likelihood (statistically supported): optimization methods (stepwise addition, global rearrangement, star decomposition)
  - Maximum parsimony (statistically supported): greedy search; exact search (exhaustive, branch and bound)
  - Divide and conquer: DCM, HGT, Quartet
- Distance matrix
  - Algorithmic methods: Neighbor-Joining, UPGMA, Fitch-Margoliash
10. Aspects of phylogenetic methods
- Accuracy
  - Is the constructed tree the true tree?
  - If not, what percentage of the edges is wrong?
- Complexity
  - Neighbor-Joining: O(n³)
  - Maximum parsimony (provably NP-hard)
  - Maximum likelihood (conjectured NP-hard)
- Scalability
  - Good for small trees; how about large trees?
- Robustness
  - If the model, the assumptions, or the data are not exactly correct, what happens to the result?
- Convergence rate
  - How long a sequence is needed to recover the true tree?
- Statistical support
  - With what probability is the computed tree the true tree?
11. The computational challenge
- > 1.7 million known species
- The number of trees increases exponentially as new species are added
- The complexity of evolution
- Data collection and the computational system
Computing the tree of life. Source: http://www.npaci.edu/envision/v16.3/hillis.html
12. Topics
- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical result
- Future research
13. Bayesian Inference (1)
- Both the observed data and the parameters of the models are random variables
- Set up the joint distribution over the parameters: the topology T, the branch lengths v, and the parameters of the models θ
- When the data D are known, Bayes' theorem gives

  P(T, v, θ | D) = P(D | T, v, θ) P(T, v, θ) / P(D)

  with P(D | T, v, θ) the likelihood, P(T, v, θ) the prior probability, P(T, v, θ | D) the posterior probability, and P(D) the unconditional probability of the data
14. Bayesian Inference (2)
- Having the posterior probability distribution P(T, v, θ | D), we can compute the marginal probability of T as

  P(T | D) = ∫∫ P(T, v, θ | D) dv dθ

- P(T | D) can be interpreted as the probability that the tree is correct
- We need to do at least two things
  - Approximate the posterior probability distribution
  - Evaluate the integral for P(T | D)
- Both can be done via the Markov chain Monte Carlo method
15. Markov chain Monte Carlo (MCMC)
- The basic idea of MCMC is
  - To construct a Markov chain such that
    - the parameters are the state space, and
    - the stationary distribution is the posterior probability distribution of the parameters
  - Simulate the chain
  - Treat the realization as a sample from the posterior probability distribution
- MCMC sampling amounts to a continuing stochastic search
16. Markov chain
- A Markov chain is a sequence of random variables X0, X1, X2, ... whose transition kernel T(Xt, Xt+1) is determined only by the value of Xt (t > 0)
- The stationary distribution π(x) = Σx' π(x') T(x', x) is invariant
- Ergodic property: p_n(x) converges to π(x) as n → ∞
- A homogeneous Markov chain is ergodic if min(T(x, x') / π(x')) > 0
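The invariance condition can be checked numerically on a toy kernel. A minimal Python sketch (the 3-state kernel below is a made-up example, not from the talk): iterating the kernel from an arbitrary start converges to π, and applying the kernel once more leaves π unchanged.

```python
# Verify pi(x) = sum_x' pi(x') T(x', x) for a small hand-built
# 3-state transition kernel (hypothetical example).
T = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]

def step(dist, T):
    """One application of the kernel: dist'[j] = sum_i dist[i] * T[i][j]."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

# By ergodicity, p_n converges to the stationary distribution pi
dist = [1.0, 0.0, 0.0]
for _ in range(200):
    dist = step(dist, T)
pi = dist

# pi is invariant: one more application of T leaves it (numerically) unchanged
pi_next = step(pi, T)
```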
17. Metropolis-Hastings algorithm (1)
- The cornerstone of all MCMC methods: Metropolis et al. (1953)
- Hastings proposed a generalized version in 1970
- The key point is how to define the acceptance probability
- Metropolis: α(x, x') = min(1, π(x') / π(x))
- Hastings: α(x, x') = min(1, [π(x') q(x', x)] / [π(x) q(x, x')])
- The proposal probability q(x, x') can take any form such that the resulting chain converges to π
18. Metropolis-Hastings algorithm (2)
- Initialize x0, set t = 0
- Repeat
  - Sample x' from the proposal T(x_t, x')
  - Draw U ~ Uniform[0, 1]
  - Update: if U < α(x_t, x'), set x_{t+1} = x'; otherwise set x_{t+1} = x_t; then t = t + 1
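The loop above can be sketched in plain Python. This is a generic illustration, not the talk's implementation: the target (a standard normal) and the Gaussian random-walk proposal are my choices; because the proposal is symmetric, the Hastings correction cancels.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step_size=1.0, seed=0):
    """Metropolis-Hastings with a symmetric Gaussian random-walk proposal.

    log_target: log of the (unnormalized) target density pi(x).
    Returns the sampled states x_0 .. x_{n_steps}.
    """
    rng = random.Random(seed)
    x = x0
    samples = [x]
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step_size)       # sample x' from T(x_t, .)
        log_r = log_target(x_new) - log_target(x)   # symmetric proposal: q cancels
        if log_r >= 0 or rng.random() < math.exp(log_r):
            x = x_new                               # accept with prob min(1, r)
        samples.append(x)
    return samples

# Example target: standard normal, pi(x) proportional to exp(-x^2 / 2)
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=5000)
mean = sum(samples) / len(samples)
```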
19. Problems of the MH algorithm and improvements
- Problems
  - The mixing rate is slow when
    - a small step gives slow movement
    - a larger step gives low acceptance
  - The chain gets stuck at local optima
  - The dimension of the state space may vary
- Improvements
  - Metropolis-coupled MCMC
  - Multipoint MCMC
  - Population-based MCMC
  - Reversible-jump MCMC
20. Metropolis-coupled MCMC (Geyer 1991)
- Run several MCMC chains with different distributions πi(x) (i = 1..m) in parallel
- π1(x) is used for sampling
- πi(x) (i = 2..m) are used to improve mixing
- For example, πi(x) = π(x)^(1/(1 + λ(i-1)))
- After each iteration, attempt to swap the states of two chains using a Metropolis-Hastings step with acceptance probability

  α = min(1, [πi(x_j) πj(x_i)] / [πi(x_i) πj(x_j)])
21. Illustration of Metropolis-coupled MCMC
[Figure: four chains π1(x)..π4(x) at temperatures T1 = 0, T2 = 2, T3 = 4, T4 = 8, with state swaps between chains]
Metropolis-coupled MCMC is also called parallel tempering.
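The swap step of Metropolis-coupled MCMC can be made concrete. A sketch (all numbers hypothetical) of the swap acceptance probability between two chains, where chain i targets π(x)^βi with βi = 1/(1 + λ(i-1)):

```python
import math

def swap_accept_prob(log_pi_xi, log_pi_xj, beta_i, beta_j):
    """Acceptance probability for swapping the states of chains i and j.

    Chain k targets pi_k(x) = pi(x)^beta_k, so the ratio
    [pi_i(x_j) pi_j(x_i)] / [pi_i(x_i) pi_j(x_j)] reduces to
    exp((beta_i - beta_j) * (log pi(x_j) - log pi(x_i))).
    """
    log_r = (beta_i - beta_j) * (log_pi_xj - log_pi_xi)
    return min(1.0, math.exp(min(0.0, log_r)))  # cap the exponent to avoid overflow

# Hypothetical example: lambda = 0.5, chains 1 and 2
lam = 0.5
beta = [1.0 / (1.0 + lam * (i - 1)) for i in (1, 2)]  # cold chain beta = 1.0
# The hot chain holds the better state (-8 > -10), so the swap is always accepted
p = swap_accept_prob(log_pi_xi=-10.0, log_pi_xj=-8.0,
                     beta_i=beta[0], beta_j=beta[1])
```

A swap that would hand the cold chain a worse state is accepted only with probability exp(log_r) < 1, which is what lets the heated chains explore while the cold chain still samples π.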
22. Multiple-Try Metropolis (Liu et al. 2000)
23. Population-based MCMC
- Metropolis-coupled MCMC uses only minimal interaction between multiple chains; why not more active interaction?
- Evolutionary Monte Carlo (Liang et al. 2000)
  - Combines a genetic algorithm with MCMC
  - Used to simulate protein folding
- Conjugate Gradient Monte Carlo (Liu et al. 2000)
  - Uses local optimization for adaptation
  - An improvement of ADS (Adaptive Direction Sampling)
24. Topics
- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Choose the evolutionary model
- Compute the likelihood
- Design proposal mechanisms
- Parallel implementation
- Numerical result
- Future research
25. DNA substitution rate matrix
[Figure: the four nucleotides grouped as purines and pyrimidines, with transitions within a group and transversions between groups]
- Considering the inference of unrooted trees and the computational complications, some simplified models are used (see next slide)
26. The GTR family of substitution models
- GTR: the general time-reversible model, corresponding to a symmetric rate matrix
- Restricting GTR step by step yields the family:
  - GTR -> TN93: three substitution types (1 transversion, 2 transitions)
  - GTR -> SYM: equal base frequencies
  - TN93 -> HKY85 / F84: two substitution types (transition vs. transversion)
  - SYM -> K3ST: three substitution types (1 transversion, 2 transitions)
  - HKY85 / F84 -> K2P: equal base frequencies
  - K3ST -> K2P: two substitution types (transition vs. transversion)
  - HKY85 / F84 -> F81: single substitution type
  - K2P -> JC69: single substitution type
  - F81 -> JC69: equal base frequencies
27. More complex models
- Substitution rates vary across sites
  - Invariable-sites models; gamma distribution
  - Correlation in the rates at adjacent sites
- Codon models: a 61 x 61 instantaneous rate matrix
- Secondary structure models
28. Computing the conditional probability of a branch
- Given the substitution rate matrix Q, how do we compute p(b|a,t), the probability that a is substituted by b after time t?
- P(t) = exp(Qt); writing Q = U diag(λ1, ..., λS) U⁻¹, with λi the eigenvalues of Q, gives P(t) = U diag(e^(λ1 t), ..., e^(λS t)) U⁻¹
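For the simplest model the matrix exponential has a closed form, which gives a compact check of p(b|a,t). A sketch using the Jukes-Cantor (JC69) solution (a worked example of mine; the general case uses the eigendecomposition of Q as above):

```python
import math

def jc69_transition_prob(a, b, t):
    """p(b|a,t) under JC69, with an overall rate of 1 substitution per unit time.

    Closed form of exp(Qt) for the JC69 rate matrix:
      same state:      1/4 + 3/4 * exp(-4t/3)
      different state: 1/4 - 1/4 * exp(-4t/3)
    """
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

bases = "ACGT"
t = 0.1
# Each row of P(t) is a probability distribution over the four bases
row_sums = [sum(jc69_transition_prob(a, b, t) for b in bases) for a in bases]
```

At t = 0 the matrix is the identity, and as t grows every entry decays toward the equilibrium frequency 1/4.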
29. Likelihood of a phylogenetic tree at one site
[Figure: an example tree with internal nodes x4 and x5]
- When x4 and x5 are known, the site likelihood is the product of the substitution probabilities along the branches
- When x4 and x5 are unknown, sum that product over all possible states of x4 and x5
30. Likelihood calculation (Felsenstein 1981)
- Given a rooted tree with n leaf nodes (species), where each leaf node is represented by a sequence xi of length N, the likelihood of the rooted tree is the product of the per-site likelihoods:

  L = P(x1, ..., xn | T, θ) = Π_{u=1..N} P(x1u, ..., xnu | T, θ)
31. Likelihood calculation (2)
- Felsenstein's algorithm for likelihood calculation (1981)
- Initialization: set k = 2n - 1
- Recursion: compute L_k(a) for all states a as follows
  - If k is a leaf node
    - Set L_k(a) = 1 if a = x_k
    - Set L_k(a) = 0 if a ≠ x_k
  - If k is not a leaf node
    - Compute L_i(a), L_j(a) for all a at its children i and j
    - Set L_k(a) = (Σ_b p(b|a, t_i) L_i(b)) (Σ_c p(c|a, t_j) L_j(c))
- Termination: the likelihood at site u is L_u = Σ_a π_a L_{2n-1}(a)
Note: algorithm modified from Durbin et al. (1998)
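The recursion above can be sketched for one site on a toy tree. A minimal pure-Python version (JC69 transition probabilities for concreteness; the three-leaf tree and its branch lengths are hypothetical), which can be cross-checked by brute-force summation over the internal states:

```python
import math

BASES = "ACGT"

def p_sub(a, b, t):
    """JC69 transition probability p(b|a,t), used here for concreteness."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def conditional_likelihood(node):
    """Felsenstein pruning: L_k(a) = P(observed leaves below k | state a at k).

    A node is either ('leaf', base) or ('internal', [(child, branch_len), ...]).
    """
    kind, data = node
    if kind == "leaf":
        return {a: 1.0 if a == data else 0.0 for a in BASES}
    L = {a: 1.0 for a in BASES}
    for child, t in data:
        Lc = conditional_likelihood(child)
        for a in BASES:
            L[a] *= sum(p_sub(a, b, t) * Lc[b] for b in BASES)
    return L

# Hypothetical rooted tree: the root has a leaf child 'A' (t = 0.1) and an
# internal child (t = 0.15) whose leaves are 'C' (t = 0.2) and 'A' (t = 0.3)
tree = ("internal", [
    (("leaf", "A"), 0.1),
    (("internal", [(("leaf", "C"), 0.2), (("leaf", "A"), 0.3)]), 0.15),
])

L_root = conditional_likelihood(tree)
# Termination: site likelihood = sum_a pi_a L_root(a); pi is uniform under JC69
site_likelihood = sum(0.25 * L_root[a] for a in BASES)
```

The point of the recursion is that each conditional-likelihood vector is computed once per node, instead of enumerating all joint assignments of internal states.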
32. Likelihood calculation (3)
[Figure: a table of conditional likelihoods with rows Taxa-1 .. Taxa-n and columns Site 1 .. Site m]
- The likelihood calculation requires filling an N x M x S x R table
  - N: number of sequences; M: number of sites
  - S: number of character states; R: number of rate categories
33. Local update of the likelihood
- If we change the topology and branch lengths of the tree only locally, we only need to refresh the table at the affected nodes
- In the following example, only the nodes marked in red need to recompute their conditional likelihoods
[Figure: original tree vs. proposed tree, with the affected nodes highlighted]
34. Proposal mechanism for trees
- Stochastic nearest-neighbor interchange (NNI)
- Larget et al. (1999); Huelsenbeck (2000)
35. Proposal mechanisms for parameters
Larget et al. (1999)
- Independent parameters
  - e.g. the transition/transversion ratio κ
- A set of parameters constrained to sum to a constant
  - e.g. the base frequency distribution
  - Draw a sample from the Dirichlet distribution
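For sum-constrained parameters like the base frequencies, a Dirichlet draw can be built from gamma variates with the Python standard library. A minimal sketch (the concentration parameters below are arbitrary illustration values):

```python
import random

def dirichlet_sample(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(42)
# e.g. propose new base frequencies (pi_A, pi_C, pi_G, pi_T); larger
# concentration parameters keep the proposal closer to uniform frequencies
freqs = dirichlet_sample([10.0, 10.0, 10.0, 10.0], rng)
```

The draw automatically satisfies the sum-to-one constraint, so no separate normalization step is needed in the proposal.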
36. Bayesian phylogenetic inference
[Figure: overview of the inference workflow. The DNA data, the evolutionary model, and the prior probability determine the likelihood and the posterior probability; starting from an initial tree, MCMC repeatedly generates proposals, producing a sequence of samples that approximates the posterior distribution over phylogenetic trees]
37. Topics
- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
  - Challenges of serial computation
  - Difficulties
    - MCMC is a serial algorithm
    - Multiple chains need to be synchronized
  - Choosing an appropriate grid topology
  - Synchronizing using random numbers
- Numerical result
- Future research
38. Computational challenge
- Computing the global likelihood needs O(NMRS²) multiplications
- A local update of the topology and branch lengths needs O(MRS² log(N))
- Updating a model parameter needs O(NMRS²)
- A local update needs all the required data in memory
- Given N = 1000 species, each sequence with M = 5000 sites, R = 5 rate categories, and the DNA nucleotide model (S = 4)
  - running 5 chains, each 100 million generations long,
  - needs about 400 days (assuming 1 global update per 99 local updates)
  - and O(N x M x R x S x L x 2 x 2 x 8) = 32 gigabytes of memory
- So until more advanced algorithms are developed, parallel computation is the direct solution
  - Using 32 processors with 1 gigabyte of memory each, we can compute the problem in about 2 weeks
39. Characteristics of good parallel algorithms
- Balanced workload
  - Concurrency: identification, management, and granularity
- Reduced communication
  - Communication-to-computation ratio
  - Frequency, volume, balance
- Reduced extra work
  - Computing the assignment
  - Redundant work
40. Single-chain MCMC algorithm
41. Multiple-chain MCMC algorithm
[Figure: flowchart of four chains. Each chain i generates an initial state Si(0) at t = 0, then repeatedly proposes and updates Si(t); after each generation, two chains are chosen, R and U are computed, and if U < R the states of the two selected chains are swapped; then t = t + 1 for every chain]
42. Task concurrency
- In one likelihood evaluation, the task can be divided into n_chain x n_site x n_rate concurrent tasks
43. Task decomposition and assignment
[Figure: concurrent tasks mapped onto the processors' virtual topology]
- We organize the processor pool into an nx x ny x nz 3D grid
- Each grid row runs n_chain/ny chains, each chain has n_site/nx sites, and each site has n_rate/nz rate categories (my current implementation doesn't consider the rate dimension, so the topology is a 2D grid)
- Choose nx, ny, and nz such that each processor has an equal load
44. Parallel configuration modes
- Chain-level parallel configuration (Mode I)
  - Each chain runs on one row
  - Number of chains = number of rows
- Site-level parallel configuration (Mode II)
  - All chains run on the same row
  - Different subsequences run on different columns
  - Number of subsequences = number of columns
- Hybrid parallel configuration (Mode III)
  - Combines the first two configurations
  - Several chains may run on the same row, and
  - the whole sequence is divided into segments
45. Orchestration
- The data can be distributed at the chain level and the subsequence level, so spatial locality is always preserved. At the same time, each process needs only 1/(nx x ny) of the original memory
- In each step, each processor needs 1 collective communication along its row, and at most 2 (or 1) direct communications if it holds one of the chains selected for swapping
- Processors in the same row can easily be synchronized because they use the same random numbers
46. Communication between different rows
- Processors in different rows need a trick to synchronize without communication
- The trick is to use two random number generators
  - The first one is used to generate the proposals; processors in different rows use different seeds for this generator
  - The second one is used to synchronize processors in different rows, using the same seed, so every processor knows what is going on in the whole grid
- To reduce the size of the messages, the selected processors swap in two steps
  - Swap the current likelihood and temperature
  - Compute the acceptance ratio r and draw a random number U; if U < r, then swap the tree and parameters, else continue to the next step
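The two-generator trick can be demonstrated directly: every processor seeds its synchronization generator identically, so all of them draw the same chain pair and the same swap decision without exchanging a single message. A sketch with Python's `random.Random` standing in for each processor's generators (the seeds are illustrative):

```python
import random

N_ROWS = 4  # one chain row per "processor" in this sketch

# Rand_1 (proposals): a different seed per row, so chains explore differently
proposal_rngs = [random.Random(1000 + row) for row in range(N_ROWS)]
# Rand_2 (synchronization): the SAME seed on every row
sync_rngs = [random.Random(7) for _ in range(N_ROWS)]

# Without any communication, every row draws the same pair of chains to swap...
pairs = []
for rng in sync_rngs:
    i = rng.randrange(N_ROWS)
    j = rng.randrange(N_ROWS)
    pairs.append((i, j))

# ...and the same uniform for the accept/reject decision
us = [rng.random() for rng in sync_rngs]

all_rows_agree = len(set(pairs)) == 1 and len(set(us)) == 1
```

Because the synchronization streams stay in lockstep only if every row consumes them identically, each row must make exactly the same sequence of Rand_2 draws each generation, whether or not it is involved in the swap.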
47. Symmetric parallel algorithm (1)
- 1. Build the grid topology
- 2. Initialize
  - Set seeds for the random number generators Rand_1 (for proposals) and Rand_2 (for synchronization)
  - For each chain
    - t = 0
    - Build a random tree T0
    - Choose the initial parameters θ0
    - Set the initial state to x0
- 3. While (t < max_generation) do
  - 3.1 In-chain update
    - For each chain
      - Choose the proposal type using Rand_2
      - Generate the proposal x'
      - Evaluate the log likelihood locally
      - Collect the sum of the local log likelihoods
    - For each chain
      - Compute r using the Hastings transition kernel
      - Draw a sample u from the uniform distribution using Rand_1
      - If u < r, set x = x'
48. Symmetric parallel algorithm (2)
- 3.2 Inter-chain swap
  - Draw two chain indexes I and J using Rand_2
  - If I = J, go to 3.3
  - Map I and J to the processor coordinates row_I and row_J
  - If row_I = row_J
    - Compute the acceptance ratio r
    - Draw u from the uniform distribution using Rand_2
    - If u < r, swap the states x and the temperatures of chain I and chain J
    - Go to 3.3
  - If row_I != row_J
    - Swap the log likelihoods and temperatures of chain I and chain J between the processors on rows row_I and row_J
    - Compute the acceptance ratio r
    - Draw u from the uniform distribution using Rand_2
    - If u < r, swap the states x of chain I and chain J between the processors on rows row_I and row_J
- 3.3 t = t + 1
- 3.4 If t mod sample_cycle = 0, output the state x
49. Load balancing
- For simplicity, from now on we consider only a 2D grid (x-y)
- Along the x-axis, the processors are synchronized, since they evaluate the same topology, the same branch lengths, and the same local update step
- But along the y-axis, different chains will propose different topologies, so in the local likelihood evaluation imbalance will occur
- In the parameter update step we need a global likelihood evaluation, so balancing is not a problem there
50. Asymmetric parallel algorithm
- The symmetric algorithm assumes each processor has the same performance, and deals with the imbalance between different rows by waiting and synchronization
- An improvement is to use an asymmetric algorithm. The basic idea is
  - Introduce a processor as the mediator
  - Each row sends its state to the mediator after it finishes its in-chain update
  - The mediator keeps the current states of the different rows and decides which two chains should swap
  - The selected chains get their new states from the mediator
- The asymmetric algorithm can solve the imbalance problem, but assumes that different chains don't need to synchronize
51. Topics
- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical result
- Future research
52. The likelihood evolution of a single chain
53. The likelihood evolution of 4 chains
54. Numerical results, Case 1: run time
Dataset: 7 x 2439; number of chains: 4; generations: 10000; Parallel Mode I
55. Numerical results, Case 1: speedup
Dataset: 7 x 2439; number of chains: 4; generations: 10000; Parallel Mode I
56. Speedup obtained by Mode II
Note: uses the same dataset as the previous case, but parallel only at the sequence level.
57. Comparison of the two parallel modes
58. Conclusion
- Bayesian phylogenetic inference
- Using MCMC to sample the posterior probability
- Efficient parallel strategies
- Stochastic algorithms vs. NP problems
59. Future research directions
- Improved parallel strategies
- The connection between Bayesian methods and the bootstrap
- Performance of different Bayesian methods
- Alternative MCMC methods
- MCMC for other computational biology problems
- The power of stochastic algorithms
- Supertree construction