PAML: Phylogenetic Analysis by Maximum Likelihood - PowerPoint PPT Presentation

1
PAML: Phylogenetic Analysis by Maximum Likelihood
Ziheng Yang, Department of Biology, University College London
http://abacus.gene.ucl.ac.uk/
2
Plan
  • Overview of PAML, things it can do, and
    especially things that other programs don't do.
  • An example (of detecting amino acids under
    positive selection)
  • The trouble with sliding windows

3
PAML programs, currently in ver 3.15
  • PAML programs are written in ANSI C. Executables
    are provided for MS Windows and Mac OS X. Source
    code can be compiled for Unix and other
    platforms.
  • Free for academics (and everybody else).
  • Sequential, not parallelized.
  • Old-style command-line programs, with no GUI, no
    menu, no mice.
  • Yang's theorem: Every version of PAML has bugs.

4
PAML programs
Program    Description
baseml     ML under nucleotide-based models
basemlg    Continuous-gamma, for bases (Yang 1993)
codeml     ML for amino acids & codons
evolver    The world's best named simulation program
yn00       dN and dS estimation using YN2000
chi2       $\chi^2$ critical values and p values
pamp       Parsimony calculations (Yang and Kumar 1996)
mcmctree   Species divergence times, soft bounds, relaxed clocks (Yang & Rannala 2006)
5
PAML docs & examples
  • Documentation is in doc/: pamlDOC.pdf, pamlFAQ.pdf,
    pamlHistory.txt
  • examples/ are provided with README files
  • Apologies for poor support. Bug reports can
    come to my mailbox. Questions should go to the PAML
    discussion group: http://www.rannala.org/gsf

6
Major weaknesses
  • Poor tree search
  • Poor user interface

Major strength
  • Many models implemented in the likelihood
    framework.

7
Uses of PAML (i)
Maximum likelihood parameter estimation and
likelihood ratio tests of hypotheses under a
number of substitution models based on
nucleotides, amino acids, and codons (such as the
molecular clock, rate variation among
sites). Most of the nucleotide-based models are
available in PAUP. Most of the models are available
in MrBayes?
8
Uses of PAML (ii)
Likelihood (empirical Bayes) reconstruction of
ancestral nucleotide, amino acid, or codon
sequences. This is the same as parsimony
reconstruction except that it accounts for
different branch lengths and different rates of
change between states.
Yang, Z., S. Kumar, and M. Nei. 1995. Genetics 141:1641-1650.
Koshi, J. M., and R. A. Goldstein. 1996. J. Mol. Evol. 42:313-320.
Pupko, T., I. Pe'er, R. Shamir, and D. Graur. 2000. Mol. Biol. Evol. 17:890-896.
9
Uses of PAML (iii)
  • Combined analysis of heterogeneous data sets.
  • MrBayes has implemented more powerful models of
    this kind (Nylander, et al. 2004. Syst. Biol.
    53:47-67).
  • These should make the following debates
    unnecessary:
      • combined analysis (total evidence) vs. separate
        analysis
      • supertree vs. supermatrix

Yang, Z. 1996. Maximum-likelihood models for
combined analyses of multiple sequence data. J.
Mol. Evol. 42:587-596.
Pupko, T., D. Huchon, Y. Cao et al. 2002. Combining multiple data sets in
a likelihood analysis: which models are the best?
Mol. Biol. Evol. 19:2294-2307.
10
Uses of PAML (iv)
Likelihood ratio test of the clock and likelihood
estimation of species divergences under clock and
relaxed-clock models (baseml & codeml).
Bayesian estimation of species divergence times using soft
bounds and relaxed molecular clocks (mcmctree),
similar to Jeff Thorne's multidivtime.
Rambaut, A., and L. Bromham. 1998. Mol. Biol. Evol. 15:442-448.
Yoder, A. D., and Z. Yang. 2000. Mol. Biol. Evol. 17:1081-1090.
Yang, Z., and A. D. Yoder. 2003. Syst. Biol. 52:705-716.
Yang, Z., and B. Rannala. 2006. Mol. Biol. Evol. 23:212-226.
Rannala, B. and Z. Yang. in preparation.
11
Uses of PAML (iv): Codon substitution models &
detection of selection in protein-coding genes
(codeml)
  • Branch models to test positive selection on
    lineages on the tree (Yang 1998. Mol. Biol.
    Evol. 15:568-573)
  • Site models to test positive selection affecting
    individual sites (Nielsen & Yang. 1998. Genetics
    148:929-936; Yang, et al. 2000. Genetics
    155:431-449)
  • Branch-site models to detect positive selection
    at a few sites on a particular lineage (Yang &
    Nielsen. 2002. Mol. Biol. Evol. 19:908-917; Yang,
    et al. 2005. Mol. Biol. Evol. 22:1107-1118;
    Zhang, J., R. Nielsen, and Z. Yang. 2005. Mol.
    Biol. Evol. 22:2472-2479)

12
MacCallum, C., and E. Hill. 2006. Being positive
about selection. PLoS Biol 4:e87.
PLoS Biol is receiving and rejecting too many
manuscripts that use the MK test and paml/codeml
to detect positive selection. Their main
criterion right now is that the ms. should
include experimental verification to justify
publication in such high-profile journals.
13
LRT of amino acid sites under positive selection
H0: there are no sites at which $\omega > 1$
H1: there are such sites
Compare $2\Delta\ell = 2(\ell_1 - \ell_0)$ with a $\chi^2$ distribution
(Nielsen & Yang 1998. Genetics 148:929-936; Yang,
Nielsen, Goldman & Pedersen 2000. Genetics
155:431-449)
14
Models M1a & M2a
  • M1a (neutral)
      Site class        0                 1
      Proportion        $p_0$             $p_1$
      $\omega$ ratio    $\omega_0 < 1$    $\omega_1 = 1$
  • M2a (selection)
      Site class        0                 1                 2
      Proportion        $p_0$             $p_1$             $p_2$
      $\omega$ ratio    $\omega_0 < 1$    $\omega_1 = 1$    $\omega_2 > 1$

Modified from Nielsen & Yang (1998), where $\omega_0 = 0$
is fixed
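
Under site models of this kind, each codon site is assumed to fall into one of the classes with the stated proportions, so the likelihood at a site is a mixture over classes. A sketch in notation assumed here rather than taken from the slide ($x_h$ for the data at site $h$, $K$ for the number of classes, $\ell$ for the log likelihood):

```latex
\[
  f(x_h) = \sum_{k=0}^{K-1} p_k \, f(x_h \mid \omega_k),
  \qquad
  \ell = \sum_{h} \log f(x_h),
  \qquad
  \sum_{k=0}^{K-1} p_k = 1 ,
\]
% K = 2 for M1a (free parameters: p_0, \omega_0)
% K = 3 for M2a (free parameters: p_0, p_1, \omega_0, \omega_2)
```

Counting free parameters this way, M2a has two more than M1a ($p_2$, equivalently $p_1$, and $\omega_2$), which is where the two degrees of freedom in a comparison of the two models come from.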
15
Human MHC Class I data: 192 alleles, 270 codons
Model             $\ell$       Parameter estimates
M1a (neutral)     -7,490.99    $p_0 = 0.830$, $\omega_0 = 0.041$
                               $p_1 = 0.170$, $\omega_1 = 1$
M2a (selection)   -7,231.15    $p_0 = 0.776$, $\omega_0 = 0.058$
                               $p_1 = 0.140$, $\omega_1 = 1$
                               $p_2 = 0.084$, $\omega_2 = 5.389$
Likelihood ratio test of positive selection:
$2\Delta\ell = 2 \times 259.84 = 519.68$, $P < 0.000$, d.f. = 2
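
The arithmetic of this test can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the code of PAML's chi2 program: the function names are made up for illustration, and the tail probability uses the closed form that holds only for an even number of degrees of freedom (which covers d.f. = 2 here).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For even df the tail has the closed form
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even d.f.")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return the LRT statistic 2*(lnL1 - lnL0) and its chi-square p value."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf_even_df(stat, df)

# Log likelihoods for M1a (null) and M2a (alternative) from the MHC table.
stat, p = likelihood_ratio_test(-7490.99, -7231.15, df=2)
print(f"2*delta_lnL = {stat:.2f}, P = {p:.3g}")
```

The p value is far too small to show at three decimal places, which is why the slide reports it as P < 0.000.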
16
Posterior probabilities for MHC (M2a)
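
Posterior probabilities of this kind come from Bayes' rule applied site by site: the prior is the class proportion and the likelihood is the probability of the data at that site given the class's $\omega$. The sketch below shows the simplest (naive empirical Bayes) form of the calculation; the slide does not say which variant produced the figure. The function name and the per-site likelihood values are hypothetical; only the proportions are taken from the M2a estimates for the MHC data.

```python
def site_class_posteriors(proportions, site_likelihoods):
    """Posterior probability of each site class for one site.

    proportions[k]      : estimated proportion p_k of class k (the prior)
    site_likelihoods[k] : f(x_h | omega_k), likelihood of the site under class k
    """
    joint = [p * f for p, f in zip(proportions, site_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# M2a proportions p0, p1, p2 estimated for the MHC data.
proportions = [0.776, 0.140, 0.084]
# Hypothetical conditional likelihoods for a single codon site.
site_likelihoods = [1.0e-9, 4.0e-8, 9.0e-7]

post = site_class_posteriors(proportions, site_likelihoods)
print([round(q, 3) for q in post])
```

A site is reported as being under positive selection when the posterior for the class with $\omega_2 > 1$ (the last entry here) is high.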
17
25 sites identified under M2a
18
There are a few wrong ways of detecting
selection, one of which is sliding windows.
19
Sliding window analysis (mouse-rat BRCA1)
20
Sliding window analysis (fake data)
21
Two trends in sliding window analysis
  • Both dS and dN fluctuate smoothly (because
    consecutive windows overlap)
  • dS fluctuates more than dN (because there are
    fewer silent than replacement sites)

Sliding windows may be useful for displaying
trends that are known to exist, but are misleading
if used to detect trends.
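
The first trend is easy to reproduce with fake data. The sketch below is an illustration only, not the analysis behind the slides: it draws independent random values per codon (so there is no real trend), averages them in overlapping windows, and compares the lag-1 autocorrelation before and after. The window width, step, sequence length, and seed are arbitrary choices.

```python
import random
import statistics

def window_means(values, width, step):
    """Mean of each window of `width` values, moving `step` positions at a time."""
    return [statistics.fmean(values[i:i + width])
            for i in range(0, len(values) - width + 1, step)]

def lag1_autocorrelation(series):
    """Sample correlation between consecutive elements of a series."""
    mean = statistics.fmean(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((a - mean) ** 2 for a in series)
    return num / den

rng = random.Random(1)
per_codon = [rng.random() for _ in range(2000)]   # independent noise, no trend
smoothed = window_means(per_codon, width=50, step=5)

raw_r = lag1_autocorrelation(per_codon)
win_r = lag1_autocorrelation(smoothed)
print(f"lag-1 autocorrelation, raw values:   {raw_r:.2f}")
print(f"lag-1 autocorrelation, window means: {win_r:.2f}")
```

Adjacent windows share 45 of their 50 positions, so their means are strongly correlated even though the underlying values are not; the resulting smooth peaks and valleys are produced by the overlap alone.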
22
  • Orthodox statistical analysis
      • formulate a biological hypothesis
      • design the experiment & collect data
      • test whether the data are compatible with the
        hypothesis
  • The more-common way of data analysis in biology
      • a large amount of data, no a priori hypothesis
      • filter and plot data to identify unexpected
        patterns
      • test the patterns using statistical tests

23
Acknowledgment (sliding-window analysis)
Karl Schmid