Parallel Bayesian Phylogenetic Inference - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Parallel Bayesian Phylogenetic Inference


1
Parallel Bayesian Phylogenetic Inference
  • Xizhou Feng
  • Directed by Dr. Duncan Buell
  • Department of Computer Science and Engineering
  • University of South Carolina, Columbia
  • 2002-10-11

2
Topics
  • Background
  • Bayesian phylogenetic inference and MCMC
  • Serial implementation
  • Parallel implementation
  • Numerical result
  • Future research

3
Background
  • Darwin: species are related through a history of common descent, and
  • this history can be organized as a tree structure (a phylogeny).
  • Modern species are placed at the leaf nodes.
  • Ancestral species are placed at the internal nodes.
  • The time since divergence is described by the branch lengths.
  • A clade is a group of organisms whose members share homologous features
    derived from a common ancestor.

4
Phylogenetic tree
[Figure: a phylogenetic tree, labeling internal nodes (ancestral species), leaf nodes (current species), a clade, branches, and branch lengths]
5
Applications
  • A phylogenetic tree is
  • fundamental to understanding evolution and diversity,
  • a principle for organizing biological data, and
  • central to comparisons among organisms.
  • Practical examples
  • Resolving the dispute over bacteria-to-human gene transfers (Nature, 2001)
  • Tracing routes of infectious disease transmission
  • Identifying new pathogens
  • Phylogenetic distribution of biochemical pathways

6
Use DNA data for phylogenetic inference
7
Objectives of phylogenetic inference
[Figure: input sequence data and output phylogenetic tree]
  • Major objectives include
  • Estimating the tree topology,
  • Estimating the branch lengths, and
  • Describing the credibility of the result.

8
Phylogenetic inference methods
  • Algorithmic methods
  • Define a sequence of specific steps that lead to the determination of a tree
  • e.g. UPGMA (unweighted pair group method with arithmetic mean) and Neighbor-Joining
  • Optimality criterion-based methods
  • 1) Define a criterion, 2) search for the tree with the best value
  • Maximum parsimony (minimize the total tree length)
  • Maximum likelihood (maximize the likelihood)
  • Maximum posterior probability (the tree with the highest probability of being the true tree)

9
Commonly used phylogeny methods
[Diagram: a chart relating data sets, algorithms, statistical support, and search strategies]
  • Data sets: distance matrices or character data
  • Algorithmic methods: UPGMA, Neighbor-Joining, Fitch-Margoliash
  • Statistically supported, criterion-based methods: maximum parsimony, maximum likelihood, Bayesian methods
  • Search strategies: stochastic search (GA, SA, MCMC), optimization methods, divide and conquer (DCM, HGT, quartet methods), greedy search (stepwise addition, global rearrangement, star decomposition), exact search (exhaustive, branch and bound)
10
Aspects of phylogenetic methods
  • Accuracy
  • Is the constructed tree the true tree?
  • If not, what percentage of the edges are wrong?
  • Complexity
  • Neighbor-Joining: O(n^3)
  • Maximum parsimony: provably NP-hard
  • Maximum likelihood: conjectured NP-hard
  • Scalability
  • Good for small trees; how does the method behave on large trees?
  • Robustness
  • If the model, the assumptions, or the data are not exactly correct, what happens to the result?
  • Convergence rate
  • How long a sequence is needed to recover the true tree?
  • Statistical support
  • With what probability is the computed tree the true tree?

11
The computational challenge
  • >1.7 million known species
  • The number of possible trees increases exponentially as new species are added
  • The complexity of evolution
  • Data collection and the computational system

Computing the tree of life. Source:
http://www.npaci.edu/envision/v16.3/hillis.html
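
To make this growth concrete (a standard combinatorial fact, not from the original slide), the number of distinct unrooted binary trees on n labeled taxa has a closed form:

  \[
    \#\{\text{unrooted binary trees on } n \text{ taxa}\} \;=\; (2n-5)!! \;=\; 3 \cdot 5 \cdot 7 \cdots (2n-5), \qquad n \ge 3.
  \]
  % For example, n = 10 taxa already gives 15!! = 2,027,025 distinct unrooted trees.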
12
Topics
  • Background
  • Bayesian phylogenetic inference and MCMC
  • Serial implementation
  • Parallel implementation
  • Numerical result
  • Future research

13
Bayesian Inference-1
  • Both the observed data and the parameters of the models are random variables
  • Set up the joint distribution P(D, T, B, θ), where T is the topology, B the branch lengths, and θ the model parameters
  • When the data D are known, Bayes' theorem gives
    P(T, B, θ | D) = P(D | T, B, θ) P(T, B, θ) / P(D)
    (posterior probability = likelihood × prior probability / unconditional probability of the data)
14
Bayesian Inference-2
  • Having the posterior probability distribution, we can compute the marginal probability of a topology T as
    P(T | D) = ∫∫ P(T, B, θ | D) dB dθ
  • P(T | D) can be interpreted as the probability that the tree T is correct
  • We need to do at least two things
  • Approximate the posterior probability distribution
  • Evaluate the integral for P(T | D)
  • Both can be done via the Markov chain Monte Carlo method

15
Markov chain Monte Carlo (MCMC)
  • The basic idea of MCMC is
  • To construct a Markov chain such that
  • the parameters form the state space, and
  • the stationary distribution is the posterior probability distribution of the parameters
  • Simulate the chain
  • Treat the realization as a sample from the posterior probability distribution
  • MCMC sampling is, in effect, a continuing search of the state space

16
Markov chain
  • A Markov chain is a sequence of random variables X_0, X_1, X_2, ... whose transition kernel T(X_t, X_{t+1}) depends only on the value of X_t (t > 0).
  • Stationary distribution: π(x') = Σ_x π(x) T(x, x'), i.e. π is invariant under the kernel
  • Ergodic property: p_n(x) converges to π(x) as n → ∞
  • A homogeneous Markov chain is ergodic if min_{x,x'} T(x, x') / π(x') > 0

17
Metropolis-Hastings algorithm-1
  • The cornerstone of all MCMC methods, Metropolis (1953)
  • Hastings proposed a generalized version in 1970
  • The key point is how to define the acceptance probability
  • Metropolis (symmetric proposal): α(x, x') = min(1, π(x') / π(x))
  • Hastings: α(x, x') = min(1, [π(x') q(x', x)] / [π(x) q(x, x')])

The proposal probability q(x, x') can take any form such that the resulting chain is irreducible and has π as its stationary distribution.
18
Metropolis-Hastings algorithm-2
  • Initialize x_0, set t = 0
  • Repeat
  • Sample x' from the proposal q(x_t, x')
  • Draw U ~ Uniform(0, 1)
  • Update: if U < α(x_t, x') set x_{t+1} = x', otherwise set x_{t+1} = x_t; then t = t + 1
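
As a minimal illustration (not part of the original slides), this loop can be written in Python; log_pi, propose, and log_q are placeholders for the target log density, the proposal sampler, and the proposal log density of whatever model is being sampled.

  import math, random

  def metropolis_hastings(log_pi, propose, log_q, x0, n_iter):
      """Generic Metropolis-Hastings sampler (sketch).
      log_pi(x): log target density; propose(x): draws x' given x;
      log_q(a, b): log proposal density of moving from a to b."""
      x = x0
      samples = []
      for _ in range(n_iter):
          x_new = propose(x)
          # Hastings ratio: pi(x') q(x' -> x) / (pi(x) q(x -> x'))
          log_r = (log_pi(x_new) - log_pi(x)
                   + log_q(x_new, x) - log_q(x, x_new))
          if math.log(random.random()) < min(0.0, log_r):
              x = x_new              # accept the proposal
          samples.append(x)          # otherwise keep the current state
      return samples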

19
Problems of the MH algorithm and improvements
  • Problems
  • Mixing is slow when
  • small steps → slow movement
  • large steps → low acceptance
  • The chain gets stuck at local optima
  • The dimension of the state space may vary
  • Improvements
  • Metropolis-coupled MCMC
  • Multipoint MCMC
  • Population-based MCMC
  • Reversible-jump MCMC

20
Metropolis-coupled MCMC (Geyer 1991)
  • Run several MCMC chains with different distributions π_i(x) (i = 1..m) in parallel
  • π_1(x) is used for sampling
  • π_i(x) (i = 2..m) are used to improve mixing
  • For example π_i(x) = π(x)^(1 / (1 + λ(i - 1)))
  • After each iteration, attempt to swap the states of two chains i and j using a Metropolis-Hastings step with acceptance probability
    min(1, [π_i(x_j) π_j(x_i)] / [π_i(x_i) π_j(x_j)])
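
A minimal sketch of this swap step in Python follows, assuming chain i targets pi(x)^beta_i with beta_i = 1/(1 + λ(i-1)); the names are illustrative, not from the slides.

  import math, random

  def try_swap(states, log_pi, betas):
      """Attempt one Metropolis-coupled swap between two randomly chosen chains.
      states[i] is the state of chain i, which targets pi(x) ** betas[i]."""
      i, j = random.sample(range(len(states)), 2)
      # log of  pi_i(x_j) pi_j(x_i) / (pi_i(x_i) pi_j(x_j))
      log_r = (betas[i] - betas[j]) * (log_pi(states[j]) - log_pi(states[i]))
      if math.log(random.random()) < min(0.0, log_r):
          states[i], states[j] = states[j], states[i]   # exchange the two states
      return states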

21
Illustration of Metropolis-coupled MCMC
[Figure: four chains π_1(x)..π_4(x) run at increasing temperatures (T_1 = 0, T_2 = 2, T_3 = 4, T_4 = 8) with occasional state swaps between them]
Metropolis-coupled MCMC is also called parallel tempering.
22
Multiple Try Metropolis (Liu et al 2000)
23
Population-based MCMC
  • Metropolis-coupled MCMC uses only minimal interaction between multiple chains; why not use more active interaction?
  • Evolutionary Monte Carlo (Liang et al. 2000)
  • Combines a genetic algorithm with MCMC
  • Has been used to simulate protein folding
  • Conjugate Gradient Monte Carlo (Liu et al. 2000)
  • Uses local optimization for adaptation
  • An improvement of ADS (Adaptive Direction Sampling)

24
Topics
  • Background
  • Bayesian phylogenetic inference and MCMC
  • Serial implementation
  • Choose the evolutionary model
  • Compute the likelihood
  • Design proposal mechanisms
  • Parallel implementation
  • Numerical result
  • Future research

25
DNA substitution rate matrix
[Figure: 4 x 4 DNA substitution rate matrix; substitutions within the purines (A, G) or within the pyrimidines (C, T) are transitions, substitutions between the two groups are transversions]
  • Considering that we infer unrooted trees, and to limit the computational complications, some simplified models are used (see next slide)

26
GTR-family of substitution models
  • GTR: general time-reversible model, corresponding to a symmetric rate matrix
[Diagram: hierarchy of nested models obtained by restricting GTR]
  • TN93: three substitution types (1 transversion, 2 transitions); SYM: equal base frequencies
  • HKY85, F84: two substitution types (transition vs. transversion); K3ST: three substitution types with equal base frequencies
  • F81: single substitution type; K2P: two substitution types with equal base frequencies
  • JC69: single substitution type with equal base frequencies
27
More complex models
  • Substitution rates vary across sites
  • Invariable-sites models, gamma-distributed rates
  • Correlation in the rates at adjacent sites
  • Codon models: 61 x 61 instantaneous rate matrix
  • Secondary structure models

28
Computing the conditional probability along a branch
  • Given the substitution rate matrix Q, how do we compute p(b | a, t), the probability that state a is replaced by state b after time t?
  • P(t) = e^{Qt}; with the eigendecomposition Q = U Λ U^{-1} (Λ holding the eigenvalues of Q), P(t) = U e^{Λt} U^{-1}
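
A small NumPy sketch of this computation for the simple JC69 model is shown below; the model choice and function names are illustrative, and the same eigendecomposition route applies to any reversible Q.

  import numpy as np

  def jc69_rate_matrix(mu=1.0):
      """JC69 rate matrix: all off-diagonal substitution rates are equal."""
      Q = np.full((4, 4), mu / 4.0)
      np.fill_diagonal(Q, 0.0)
      np.fill_diagonal(Q, -Q.sum(axis=1))   # each row sums to zero
      return Q

  def transition_probs(Q, t):
      """P(t) = exp(Q t) computed as U diag(exp(lambda t)) U^-1."""
      lam, U = np.linalg.eig(Q)
      return (U @ np.diag(np.exp(lam * t)) @ np.linalg.inv(U)).real

  P = transition_probs(jc69_rate_matrix(), t=0.1)   # P[a, b] = p(b | a, t)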
29
Likelihood of a phylogeny tree for one site
When the internal states x_4 and x_5 are known, the site likelihood is the product of the prior probability of the root state and the transition probabilities p(child | parent, t) along each branch.
When x_4 and x_5 are unknown, the same product is summed over all possible states of x_4 and x_5.
30
Likelihood calculation (Felsenstein 1981)
  • Given a rooted tree with n leaf nodes (species), where each leaf node carries a sequence x_i of length N, the likelihood of the rooted tree is the product over the N sites of the per-site probabilities:
    L(T) = Π_{u=1..N} P(x^u | T, t)

31
Likelihood calculation-2
  • Felsenstein's algorithm for likelihood calculation (1981)
  • Initialization
  • Set k = 2n - 1 (the root)
  • Recursion: compute P(L_k | a) for all states a as follows
  • If k is a leaf node
  • Set P(L_k | a) = 1 if a = x_k^u
  • Set P(L_k | a) = 0 if a ≠ x_k^u
  • If k is not a leaf node
  • Compute P(L_i | a) and P(L_j | a) for all a at its child nodes i, j,
  • and set P(L_k | a) = [Σ_b p(b | a, t_i) P(L_i | b)] [Σ_c p(c | a, t_j) P(L_j | c)]
  • Termination: the likelihood at site u is P(x^u | T, t) = Σ_a P(L_{2n-1} | a) π(a)

Note: algorithm adapted from Durbin et al. (1998)
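
A compact Python sketch of this recursion for a single site is shown below; the nested-tuple tree encoding and the helper transition_probs (the P(t) function sketched on slide 28) are illustrative assumptions, not the thesis implementation.

  import numpy as np

  STATE = {"A": 0, "C": 1, "G": 2, "T": 3}

  def cond_likelihood(node, P_of):
      """Felsenstein pruning, one site.  A node is either a base letter (leaf)
      or a tuple (left, t_left, right, t_right).  Returns the vector P(L_k | a)."""
      if not isinstance(node, tuple):           # leaf: indicator on the observed base
          L = np.zeros(4)
          L[STATE[node]] = 1.0
          return L
      left, t_l, right, t_r = node
      L_i = cond_likelihood(left, P_of)
      L_j = cond_likelihood(right, P_of)
      # P(L_k | a) = [sum_b p(b|a,t_l) L_i(b)] * [sum_c p(c|a,t_r) L_j(c)]
      return (P_of(t_l) @ L_i) * (P_of(t_r) @ L_j)

  def site_likelihood(tree, P_of, pi):
      """Termination step: weight the root vector by the base frequencies pi."""
      return float(pi @ cond_likelihood(tree, P_of))

  # Example: three taxa A, C, G on a small rooted tree, uniform base frequencies.
  # lik = site_likelihood((("A", 0.1, "C", 0.1), 0.2, "G", 0.3),
  #                       lambda t: transition_probs(jc69_rate_matrix(), t),
  #                       np.full(4, 0.25))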
32
Likelihood calculation-3
[Figure: alignment table with taxa 1..n as rows and sites 1..m as columns]
  • The likelihood calculation requires filling an N x M x S x R table
  • N = number of sequences, M = number of sites
  • S = number of character states, R = number of rate categories

33
Locally updating the likelihood
  • If we change the topology and branch lengths of the tree only locally, we only need to refresh the table at the affected nodes.
  • In the following example, only the nodes marked in red need their conditional likelihoods recomputed.

[Figure: original tree vs. proposed tree, with the affected nodes marked in red]
34
Proposal mechanism for trees
  • Stochastic nearest-neighbor interchange (NNI)
  • Larget et al. (1999), Huelsenbeck (2000)

35
Proposal mechanisms for parameters
Larget et al. (1999)
  • Independent parameters
  • e.g. the transition/transversion ratio κ
  • A set of parameters constrained to sum to a constant
  • e.g. the base frequency distribution
  • Draw a sample from the Dirichlet distribution
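
A small Python sketch of such a Dirichlet proposal for the base frequencies, including its Hastings correction, is shown below; the concentration constant and function names are illustrative assumptions, not values from the slides.

  import numpy as np
  from scipy.stats import dirichlet

  def propose_base_freqs(freqs, concentration=300.0, rng=None):
      """Propose new base frequencies from a Dirichlet centred on the current ones.
      Returns the proposal and log q(new -> old) - log q(old -> new)."""
      rng = rng or np.random.default_rng()
      alpha_fwd = concentration * np.asarray(freqs)
      new = rng.dirichlet(alpha_fwd)
      alpha_rev = concentration * new
      log_hastings = (dirichlet.logpdf(freqs, alpha_rev)
                      - dirichlet.logpdf(new, alpha_fwd))
      return new, log_hastings

  # Example: new_freqs, log_h = propose_base_freqs(np.array([0.3, 0.2, 0.3, 0.2]))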

36
Bayesian phylogenetic inference
[Flowchart: DNA data plus an evolutionary model give the likelihood; combined with the prior probability this defines the posterior probability of the phylogenetic tree; starting from an initial tree, MCMC repeatedly applies proposals and yields a sequence of samples that approximates the posterior distribution, from which the inference is made]
37
Topics
  • Background
  • Bayesian phylogenetic inference and MCMC
  • Serial implementation
  • Parallel implementation
  • Challenges of the serial computation
  • Difficulties
  • MCMC is a serial algorithm
  • Multiple chains need to be synchronized
  • Choosing an appropriate grid topology
  • Synchronizing using random numbers
  • Numerical result
  • Future research

38
Computational challenge
  • Computing the global likelihood needs O(NMRS^2) multiplications
  • Locally updating the topology and branch lengths needs O(MRS^2 log(N))
  • Updating a model parameter needs O(NMRS^2)
  • A local update needs all required data in memory
  • Given N = 1000 species, each sequence with M = 5000 sites, R = 5 rate categories, and the DNA nucleotide model with S = 4 states
  • Running 5 chains, each with a length of 100 million generations,
  • needs about 400 days (assuming 1 global update for every 99 local updates)
  • and O(N x M x R x S x L x 2 x 2 x 8) = 32 gigabytes of memory
  • => So until more advanced algorithms are developed, parallel computation is the direct solution.
  • Using 32 processors, each with 1 gigabyte of memory, we can compute the problem in 2 weeks

39
Characteristic of good parallel algorithms
  • Balancing the workload
  • Concurrency: identification, management, and granularity
  • Reducing communication
  • Communication-to-computation ratio
  • Frequency, volume, balance
  • Reducing extra work
  • Computing the assignment
  • Redundant work

40
Single-chain MCMC algorithm
41
Multiple-chain MCMC algorithm
[Flowchart: chains 1-4 run side by side. Each chain i generates an initial state S_i(0) at t = 0 and then repeatedly proposes and updates S_i(t). After each iteration, two chains are chosen for a swap attempt: compute the ratio R and draw U; if U < R, the states of the two selected chains are swapped; then every chain sets t = t + 1 and continues.]
42
Task Concurrency
  • In one likelihood evaluation, the task can be
    divided into n_chain x n_site x n_rate concurrent
    tasks

43
Task decomposition and assignment
[Figure: mapping of concurrent tasks onto the processors' virtual topology]
  • We organize the processor pool into an nx x ny x nz 3D grid
  • Each processor is assigned n_chain / ny chains, n_site / nx sites per chain, and n_rate / nz rate categories per site. (The current implementation doesn't consider the rate dimension, so the topology is a 2D grid.)
  • Choose nx, ny, and nz such that each processor has an equal load
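
A minimal mpi4py sketch of setting up this virtual topology (2D case, ignoring the rate dimension as in the current implementation) is shown below; the dimension choices and communicator names are illustrative assumptions.

  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  ny = 4                                   # assumed: one row per chain
  nx = comm.Get_size() // ny               # columns = subsequence blocks

  # 2D Cartesian virtual topology: rows hold chains, columns hold site blocks.
  cart = comm.Create_cart(dims=[ny, nx], periods=[False, False])
  my_row, my_col = cart.Get_coords(cart.Get_rank())

  row_comm = cart.Sub([False, True])       # processors sharing one chain
  col_comm = cart.Sub([True, False])       # processors sharing one site block

  # Collective along the row: sum the partial log likelihoods of one chain.
  local_loglik = 0.0                       # would be computed from the local site block
  chain_loglik = row_comm.allreduce(local_loglik, op=MPI.SUM)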

44
Parallel configuration modes
  • Chain-level parallel configuration (Mode I)
  • Each chain runs on one row
  • Number of chains = number of rows
  • Site-level parallel configuration (Mode II)
  • All chains run on the same row
  • Different subsequences run in different columns
  • Number of subsequences = number of columns
  • Hybrid parallel configuration (Mode III)
  • Combines the first two configurations
  • Several chains may run on the same row, and
  • the whole sequence is divided into segments

45
Orchestration
  • The data can be distributed at the chain level and the subsequence level, so spatial locality is always preserved. At the same time, each process needs only 1/(nx x ny) of the original memory.
  • At each step, each processor needs 1 collective communication along its row and at most 2 (or 1) direct communications if it holds one of the chains selected for swapping.
  • Processors in the same row can easily be synchronized because they use the same random numbers.

46
Communication between different rows
  • Processors in different rows need a trick to stay synchronized without communication.
  • The trick is to use two random number generators:
  • the first one is used to generate the proposals; processors in different rows use different seeds for this generator.
  • The second one is used to synchronize processors in different rows and uses the same seed everywhere, so every processor knows what is going on in the whole grid.
  • To reduce the message size, the selected processors swap state in two steps:
  • Swap the current likelihood and temperature.
  • Compute the acceptance ratio r and draw a random number U; if U < r, swap the tree and parameters, else continue to the next step.
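
A minimal Python sketch of the two-generator scheme follows; the seed values and helper names are illustrative assumptions.

  import numpy as np

  def make_generators(my_row, proposal_seed=12345, shared_seed=67890):
      """Rand_1 differs per row (drives proposals); Rand_2 is identical on every
      processor, so all rows draw the same values without communicating."""
      rand_proposal = np.random.default_rng(proposal_seed + my_row)   # Rand_1
      rand_sync = np.random.default_rng(shared_seed)                  # Rand_2
      return rand_proposal, rand_sync

  def pick_swap_pair(rand_sync, n_chain):
      """Called by every processor with the same Rand_2 state, so every processor
      selects the same pair (I, J) and the same uniform draw U for the swap test."""
      i, j = rand_sync.choice(n_chain, size=2, replace=False)
      u = rand_sync.random()
      return int(i), int(j), u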

47
Symmetric parallel algorithm-1
  • Build the grid topology
  • Initialize
  • Set the seeds for the random number generators Rand_1 (for proposals) and Rand_2 (for synchronization)
  • For each chain
  • t = 0
  • Build a random tree T_0
  • Choose initial parameters θ_0
  • Set the initial state to x_0
  • While (t < max_generation) do
  • 3.1 In-chain update
  • For each chain
  • Choose the proposal type using Rand_2
  • Generate the proposal x'
  • Evaluate the log likelihood locally
  • Collect the sum of the local log likelihoods
  • For each chain
  • Compute r using the Hastings transition kernel
  • Draw a sample u from the uniform distribution using Rand_1
  • If u < r set x = x'

48
Symmetric parallel algorithm-2
  • 3.2 Inter-chain swap
  • Draw two chain indexes I and J using Rand_2
  • If I = J go to 3.3
  • Map I, J to the processor coordinates row_I and row_J
  • If row_I = row_J
  • Compute the acceptance ratio r
  • Draw u from the uniform distribution using Rand_2
  • If u < r swap the state x and the temperature of chain I and chain J
  • Go to 3.3
  • If row_I != row_J
  • Swap the log likelihood and temperature of chain I and chain J between the processors on rows row_I and row_J
  • Compute the acceptance ratio r
  • Draw u from the uniform distribution using Rand_2
  • If u < r swap the state x of chain I and chain J between the processors on rows row_I and row_J
  • 3.3 t = t + 1
  • 3.4 If t mod sample_cycle = 0, output the state x
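
The cross-row branch of step 3.2 can be sketched with mpi4py as follows; the heating convention (chain i targets the posterior raised to 1/temp_i) and all names are illustrative assumptions consistent with the two-step exchange described on slide 46.

  from mpi4py import MPI
  import math

  def cross_row_swap(comm, partner, my_loglik, my_temp, my_state, rand_sync):
      """Two-step state swap with the processor in the same column of the partner row."""
      # Step 1: exchange only the scalars needed for the acceptance test.
      their_loglik, their_temp = comm.sendrecv((my_loglik, my_temp),
                                               dest=partner, source=partner)
      # Both sides compute the same ratio and draw the same U from Rand_2,
      # so they agree on the outcome without an extra message.
      log_r = (1.0 / my_temp - 1.0 / their_temp) * (their_loglik - my_loglik)
      if math.log(rand_sync.random()) < min(0.0, log_r):
          # Step 2: exchange the full state (tree and parameters) only if accepted.
          my_state = comm.sendrecv(my_state, dest=partner, source=partner)
      return my_state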

49
Load Balancing
  • For simplicity, from now on we consider only the 2D grid (x-y).
  • Along the x-axis, the processors are synchronized since they evaluate the same topology, the same branch lengths, and the same local update step.
  • Along the y-axis, different chains propose different topologies, so the local likelihood evaluations take different amounts of work and imbalance occurs.
  • In the parameter update step we need a global likelihood evaluation, so balancing is not a problem there.

50
Asymmetric parallel algorithm
  • The symmetric algorithm assumes each processor has the same performance and deals with the imbalance between different rows by waiting and synchronization.
  • An improvement is to use an asymmetric algorithm. The basic idea is:
  • Introduce a processor that acts as a mediator
  • Each row sends its state to the mediator after it finishes the in-chain update
  • The mediator keeps the current states of the different rows and decides which two chains should swap
  • The selected chains get their new states from the mediator
  • The asymmetric algorithm can solve the imbalance problem, but it assumes that different chains don't need to synchronize.

51
Topics
  • Background
  • Bayesian phylogenetic inference and MCMC
  • Serial implementation
  • Parallel implementation
  • Numerical result
  • Future research

52
The likelihood evolution of a single chain
53
The likelihood evolution of 4 chains
54
Numerical results, Case 1: run time
Dataset: 7 x 2439; number of chains = 4; generations = 10,000; Parallel Mode I
55
Numerical results, Case 1: speedup
Dataset: 7 x 2439; number of chains = 4; generations = 10,000; Parallel Mode I
56
Speedup obtained by Mode II
Note: uses the same dataset as the previous case, but parallelized only at the sequence level.
57
Comparison of two parallel modes
58
Conclusion
  • Bayesian phylogenetic inference
  • Uses MCMC to sample from the posterior probability distribution
  • Efficient parallel strategies
  • Stochastic algorithms vs. NP-hard problems

59
Future research directions
  • Improved parallel strategies
  • The connection between Bayesian inference and the bootstrap
  • Performance of different Bayesian methods
  • Alternative MCMC methods
  • MCMC for other computational biology problems
  • The power of stochastic algorithms
  • Supertree construction