Molecular Clocks - PowerPoint PPT Presentation

About This Presentation
Title:

Molecular Clocks

Slides: 45
Provided by: csCmuEdu9
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: Molecular Clocks


1
Molecular Clocks
  • Rose Hoberman

2
The Holy Grail
  • Fossil evidence is sparse and imprecise (or
    nonexistent)

Predict divergence times by comparing molecular
data
3
  • Given
  • a phylogenetic tree
  • branch lengths (rate × time)
  • a time estimate for one (or more) node

110 MYA
  • Can we date other nodes in the tree?
  • Yes... if the rate of molecular change is
    constant across all branches

4
Rate Constancy?
Page & Holmes p240
5
Protein Variability
  • Protein structures and functions differ
  • Proportion of neutral sites differs
  • Rate constancy does not hold across different
    protein types
  • However...
  • Each protein does appear to have a characteristic
    rate of evolution

6
Evidence for Rate Constancy in Hemoglobin
Large carnivorous marsupial
Page and Holmes p229
7
The Molecular Clock Hypothesis
  • Amount of genetic difference between sequences is
    a function of time since separation.
  • Rate of molecular change is constant (enough) to
    predict times of divergence

8
Outline
  • Methods for estimating time under a molecular
    clock
  • Estimating genetic distance
  • Determining and using calibration points
  • Sources of error
  • Rate heterogeneity
  • reasons for variation
  • how it's taken into account when estimating times
  • Reliability of time estimates
  • Estimating gene duplication times

9
Measuring Evolutionary time with a molecular clock
  • Estimate genetic distance
  • d = number of amino acid replacements
  • Use paleontological data to determine date of
    common ancestor
  • T = time since divergence
  • Estimate calibration rate (number of genetic
    changes expected per unit time)
  • r = d / 2T
  • Calculate time of divergence for novel sequences
  • T_ij = d_ij / 2r
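
The two formulas on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the distances are hypothetical, and the 110 MY calibration age is borrowed from the earlier example tree purely as a placeholder.

```python
def calibrate_rate(d, T):
    """Per-lineage rate r = d / (2T).

    The factor 2 is there because both lineages have been
    accumulating changes since they split T time units ago.
    """
    return d / (2.0 * T)


def divergence_time(d_ij, r):
    """Time since sequences i and j diverged: T_ij = d_ij / (2r)."""
    return d_ij / (2.0 * r)


# Hypothetical calibration pair: distance 0.22 substitutions/site,
# fossil-dated split at 110 MY.
r = calibrate_rate(0.22, 110.0)

# Hypothetical undated pair with distance 0.08 substitutions/site.
t_new = divergence_time(0.08, r)

print(f"rate = {r:.4f} per site per MY, estimated split = {t_new:.1f} MY")
```

The estimate is only as good as the assumption that the same rate r applies to the undated pair.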

10
Estimating Genetic Differences
  • If all nucleotides are equally likely, observed
    difference would plateau at 0.75
  • Simply counting differences underestimates
    distances
  • Fails to account for multiple hits
  • (Page & Holmes p148)
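
A minimal sketch of a multiple-hit correction, assuming the Jukes-Cantor model (equal base frequencies, one substitution rate). The slide does not name a model; Jukes-Cantor is used here only because it is the simplest one that produces the 0.75 plateau. Function names are illustrative.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of differing sites between aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)


def jukes_cantor(p):
    """Corrected distance d = -(3/4) ln(1 - 4p/3), substitutions per site."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated under this model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The correction grows rapidly as p approaches the 0.75 plateau.
for p in (0.05, 0.30, 0.60, 0.74):
    print(f"observed p = {p:.2f}   corrected d = {jukes_cantor(p):.3f}")
```

For small p the corrected and observed values nearly coincide; near saturation a tiny change in p implies a large change in d, which is why distant comparisons are so noisy.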

11
Estimating Genetic Distance with a Substitution
Model
  • accounts for relative frequency of different
    types of substitutions
  • allows variation in substitution rates between
    sites
  • given learned parameter values
  • nucleotide frequencies
  • transition/transversion bias
  • alpha parameter of gamma distribution
  • can infer branch length from differences

12
Distances from Gamma-Distributed Rates
  • rate variation among sites
  • fast/variable sites
  • 3rd codon positions
  • codons on surface of globular protein
  • slow/invariant sites
  • Tryptophan (1 codon) structurally required
  • 1st or 2nd codon position when disulfide bond
    needed
  • alpha parameter of gamma distribution describes
    degree of variation of rates across positions
  • modeling rate variation changes branch length/
    sequence differences curve

13
Gamma Corrected Distances
  • high rate sites saturate quickly
  • sequence difference rises much more slowly as the
    low-rate sites gradually accumulate
    differences
  • Felsenstein Inferring Phylogenies p219
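
To see how the alpha parameter changes the distance curve, the sketch below compares the plain Jukes-Cantor distance with its gamma-corrected form, d = (3/4) α [(1 − 4p/3)^(−1/α) − 1]. Using Jukes-Cantor as the base model is an assumption made for brevity; the alpha values are arbitrary.

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance with a single rate for all sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_gamma_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates across sites.

    Small alpha means strong rate variation; as alpha grows the
    result approaches the uniform-rate distance.
    """
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


p = 0.30  # observed proportion of differing sites
print(f"uniform rates : d = {jc_distance(p):.3f}")
for alpha in (0.5, 1.0, 5.0):
    print(f"alpha = {alpha:<4}  : d = {jc_gamma_distance(p, alpha):.3f}")
```

The same observed difference implies a longer branch when rates vary strongly among sites, because the fast sites have already saturated.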

14
The Sloppy Clock
  • Ticks are stochastic, not deterministic
  • Mutations happen randomly according to a Poisson
    distribution.
  • Many divergence times can result in the same
    number of mutations
  • Actually over-dispersed Poisson
  • Correlations due to structural constraints

15
Poisson Variance (Assuming a Perfect Molecular
Clock)
  • If a mutation occurs every MY (on average)
  • Poisson variance
  • 95% of lineages 15 MY old have 8-22 substitutions
  • 8 substitutions also could be 5 MYA
  • Molecular Systematics p532
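
The numbers on this slide can be checked with a short calculation. The sketch assumes a rate of one substitution per MY, which is what the 15 MY example implies. The exact upper bound depends on how the 95% interval is defined, so the result may differ from the quoted range by one.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_cdf(k, lam):
    """P(X <= k)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


def central_interval(lam, coverage=0.95):
    """Smallest counts at which the CDF reaches each tail cut-off."""
    tail = (1.0 - coverage) / 2.0
    lo = 0
    while poisson_cdf(lo, lam) < tail:
        lo += 1
    hi = lo
    while poisson_cdf(hi, lam) < 1.0 - tail:
        hi += 1
    return lo, hi


rate = 1.0  # assumed: one substitution per MY
lo, hi = central_interval(rate * 15)
p_8_or_more_at_5my = 1.0 - poisson_cdf(7, rate * 5)

print(f"15 MY lineage: central 95% range = {lo}-{hi} substitutions")
print(f"P(8 or more substitutions after only 5 MY) = {p_8_or_more_at_5my:.3f}")
```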

16
Need for Calibrations
  • Changes = rate × time
  • Can explain any observed branch length
  • Fast rate, short time
  • Slow rate, long time
  • Suppose 16 changes along a branch
  • Could be 2 × 8 or 8 × 2
  • No way to distinguish
  • If told time = 8, then rate = 2
  • Assume rate = 2 along all branches
  • Can infer all times

17
Estimating Calibration Rate
  • Calculate separate rate for each data set
    (species/genes) using known date of divergence
    (from fossil, biogeography)
  • One calibration point
  • Rate = d/2T
  • More than one calibration point
  • use regression
  • use generative model that constrains time
    estimates (more later)

18
Calibration Complexities
  • Cannot date fossils perfectly
  • Fossils usually not direct ancestors
  • branched off tree before (after?) splitting
    event.
  • Impossible to pinpoint the age of last common
    ancestor of a group of living species

19
Linear Regression
  • Fix intercept at (0,0)
  • Fit line between divergence estimates and
    calibration times
  • Calculate regression and prediction confidence
    limits
  • Molecular Systematics p536
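
A minimal sketch of the fit, assuming ordinary least squares with the intercept fixed at zero, for which the slope is Σ(t·d) / Σ(t²). The calibration values are made up for illustration, and the confidence and prediction limits mentioned above are not computed here.

```python
def fit_through_origin(times, distances):
    """Least-squares slope of distance on time with intercept fixed at 0."""
    sxy = sum(t * d for t, d in zip(times, distances))
    sxx = sum(t * t for t in times)
    return sxy / sxx


def predict_time(distance, slope):
    """Invert the fitted line to date an uncalibrated divergence."""
    return distance / slope


# Hypothetical calibration points: (time in MY, pairwise distance).
cal_times = [10.0, 25.0, 60.0, 110.0]
cal_dists = [0.021, 0.048, 0.125, 0.215]

slope = fit_through_origin(cal_times, cal_dists)
print(f"slope = {slope:.5f} distance units per MY")
print(f"a distance of 0.10 dates to about {predict_time(0.10, slope):.1f} MY")
```

Because the slope is a pairwise distance per unit time, it corresponds to 2r in the notation used earlier.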

20
Molecular Dating: Sources of Error
  • Both X and Y values only estimates
  • substitution model could be incorrect
  • tree could be incorrect
  • errors in orthology assignment
  • Poisson variance is large
  • Pairwise divergences correlated (Molecular
    Systematics p534?)
  • inflates correlation between divergence and time
  • Sometimes calibrations correlated
  • if using derived calibration points
  • Error in inferring slope
  • Confidence interval for predictions much larger
    than confidence interval for slope

21
Rate Heterogeneity
  • Rate of molecular evolution can differ between
  • nucleotide positions
  • genes
  • genomic regions
  • genomes (nuclear vs organelle)
  • species
  • over time
  • If not considered, introduces bias into
    time estimates

22
Rate Heterogeneity among Lineages
Cause              Reason
Repair equipment   e.g. RNA viruses have error-prone polymerases
Metabolic rate     More free radicals
Generation time    Copies DNA more frequently
Population size    Affects mutation fixation rate
23
Local Clocks?
  • Closely related species often share similar
    properties, likely to have similar rates
  • For example
  • murid rodents on average 2-6 times faster than
    apes and humans (Graur Li p150)
  • mouse and rat rates are nearly equal (Graur Li
    p146)

24
Rate Changes within a Lineage
Cause                                     Reason
Population size changes                   Genetic drift more likely to fix neutral alleles in small population
Strength of selection changes over time   New role/environment; gene duplication; change in another gene
25
Working Around Rate Heterogeneity
  • Identify lineages that deviate and remove them
  • Quantify degree of rate variation to put limits
    on possible divergence dates
  • requires several calibration dates, not always
    available
  • gives very conservative estimates of molecular
    dates
  • Explicitly model rate variation

26
Search for Genes with Uniform Rate across Taxa
  • Many clock tests
  • Relative rates tests
  • compares rates of sister nodes using an outgroup
  • Tajima test
  • Number of sites in which character shared by
    outgroup and only one of two ingroups should be
    equal for both ingroups
  • Branch length test
  • deviation of distance from root to leaf compared
    to average distance
  • Likelihood ratio test
  • identifies deviance from clock but not the
    deviant sequences
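
The Tajima test described above can be sketched as follows. It counts sites where exactly one ingroup sequence differs from a state shared by the other ingroup and the outgroup, then compares the two counts with a one-degree-of-freedom chi-square statistic, (m1 − m2)² / (m1 + m2). Function names and sequences are illustrative.

```python
def tajima_counts(seq1, seq2, outgroup):
    """Count sites unique to each ingroup lineage.

    m1: seq1 differs, seq2 matches the outgroup.
    m2: seq2 differs, seq1 matches the outgroup.
    Sites where all three differ, or all agree, are ignored.
    """
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a != b:
            if b == o:
                m1 += 1
            elif a == o:
                m2 += 1
    return m1, m2


def tajima_chi2(m1, m2):
    """Chi-square statistic (1 df); under a clock m1 and m2 should be equal."""
    if m1 + m2 == 0:
        return 0.0
    return (m1 - m2) ** 2 / (m1 + m2)


m1, m2 = tajima_counts("AAGT", "ACGA", "ACGT")
print(f"m1 = {m1}, m2 = {m2}, chi-square = {tajima_chi2(m1, m2):.2f}")
# Values above about 3.84 would reject equal rates at the 5% level.
```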

27
Likelihood Ratio Test
  • estimate a phylogeny under molecular clock and
    without it
  • e.g. root-to-tip distances must be equal
  • 2 × (difference in log-likelihood) ~ χ² with n-2
    degrees of freedom (n = number of sequences)
  • asymptotically
  • when models are nested
  • when nested parameters aren't set to boundary
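
A sketch of the test, assuming the two maximized log-likelihoods have already been obtained from a phylogenetics program. The chi-square tail probability is computed with the standard library only, using the recurrence Q(a+1, z) = Q(a, z) + z^a e^(−z) / Γ(a+1) for the regularized incomplete gamma function. All numbers are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-z), 1.0             # closed form for df = 2
    else:
        q, a = math.erfc(math.sqrt(z)), 0.5  # closed form for df = 1
    while a < df / 2.0:                      # step up two df at a time
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def clock_lrt(lnl_clock, lnl_free, n_taxa):
    """Likelihood ratio test of the clock model against the unconstrained one."""
    stat = 2.0 * (lnl_free - lnl_clock)
    df = n_taxa - 2
    return stat, df, chi2_sf(stat, df)


# Hypothetical log-likelihoods for a 6-taxon data set.
stat, df, p_value = clock_lrt(lnl_clock=-1234.5, lnl_free=-1228.0, n_taxa=6)
print(f"2*delta lnL = {stat:.1f}, df = {df}, p = {p_value:.4f}")
```

A small p-value says the clock is rejected somewhere in the tree, but not which lineages are responsible.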

28
Relative Rates Tests
  • Tests whether distance between two taxa and an
    outgroup are equal (or average rate of two clades
    vs an outgroup)
  • need to compute expected variance
  • many triples to consider, and not independent
  • Lacks power, esp
  • short sequences
  • low rates of change
  • Given length and number of variable sites in
    typical sequences used for dating, (Bromham et al
    2000) says
  • unlikely to detect moderate variation between
    lineages (1.5-4x)
  • likely to result in substantial error in date
    estimates

29
Modeling Rate Variation: Relaxing the Molecular
Clock
  • Learn rates and times, not just branch
    lengths
  • Assume root-to-tip times equal
  • Allow different rates on different branches
  • Rates of descendants correlate with that of
    common ancestor
  • Restricts choice of rates, but still too much
    flexibility to choose rates well

30
Relaxing the Molecular Clock
  • Likelihood analysis
  • Assign each branch a rate parameter
  • explosion of parameters, not realistic
  • User can partition branches based on domain
    knowledge
  • Rates of partitions are independent
  • Nonparametric methods
  • smooth rates along tree
  • Bayesian approach
  • stochastic model of evolutionary change
  • prior distribution of rates
  • Bayes theorem
  • MCMC

31
Parsimonious Approaches
  • Sanderson 1997, 2002
  • infer branch lengths via parsimony
  • fit divergence times to minimize difference
    between rates in successive branches
  • (unique solution?)
  • Cutler 2000
  • infer branch lengths via parsimony
  • rates drawn from a normal distribution (negative
    rates set to zero)
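
The rate-smoothing idea can be illustrated on a three-taxon tree ((A,B),C) with a calibrated root. This is a deliberately simplified sketch, not the published algorithm: it takes branch lengths as given, scores a candidate age for the inner node by the squared differences between rates on adjacent branches, and finds the best age by grid search. All names and values are hypothetical.

```python
def rate_penalty(t_inner, root_age, b_a, b_b, b_inner, b_c):
    """Sum of squared rate differences between adjacent branches."""
    r_a = b_a / t_inner                       # inner node -> A
    r_b = b_b / t_inner                       # inner node -> B
    r_inner = b_inner / (root_age - t_inner)  # root -> inner node
    r_c = b_c / root_age                      # root -> C
    return ((r_a - r_inner) ** 2
            + (r_b - r_inner) ** 2
            + (r_inner - r_c) ** 2)


def best_inner_age(root_age, b_a, b_b, b_inner, b_c, steps=1000):
    """Grid search for the inner-node age with the smoothest rates."""
    candidates = [root_age * i / steps for i in range(1, steps)]
    return min(candidates,
               key=lambda t: rate_penalty(t, root_age, b_a, b_b, b_inner, b_c))


# Branch lengths generated by a perfect clock (rate 0.01 per MY,
# inner node at 40 MY, root at 100 MY), so the search should recover 40.
best = best_inner_age(root_age=100.0, b_a=0.4, b_b=0.4, b_inner=0.6, b_c=1.0)
print(f"estimated age of the A/B split: {best:.1f} MY")
```

With non-clocklike branch lengths the minimum is no longer zero, and the estimated age becomes a compromise between the rates on neighbouring branches.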

32
Bayesian Approaches: Learn rates, times, and
substitution parameters simultaneously
  • Devise model of relationship between rates
  • Thorne/Kishino et al
  • Assigns new rates to descendant lineages from a
    lognormal distribution with mean equal to
    ancestral rate and variance increasing with
    branch length
  • Huelsenbeck et al
  • Poisson process generates random rate changes
    along tree
  • new rate is current rate × gamma-distributed
    random variable
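
The lognormal rate model can be sketched as a one-step simulation. The assumptions here are that the variance of the log rate is nu × branch duration, and that the log-scale mean is shifted by half that variance so the expected descendant rate equals the ancestral rate. The parameter name nu and all values are illustrative, not taken from the cited papers' software.

```python
import math
import random


def descendant_rate(ancestral_rate, branch_time, nu, rng):
    """Draw a descendant rate from a lognormal centred on the ancestral rate."""
    var = nu * branch_time                       # grows with branch duration
    mu = math.log(ancestral_rate) - var / 2.0    # keeps E[rate] = ancestral rate
    return rng.lognormvariate(mu, math.sqrt(var))


rng = random.Random(0)
samples = [descendant_rate(0.01, 1.0, 0.1, rng) for _ in range(20000)]
mean_rate = sum(samples) / len(samples)

print(f"ancestral rate 0.0100, mean simulated descendant rate {mean_rate:.4f}")
print(f"range of simulated rates: {min(samples):.4f} to {max(samples):.4f}")
```

Longer branches or larger nu spread the descendant rates more widely, which is how the model lets close relatives share similar rates while distant ones drift apart.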

33
Comparison of Likelihood and Bayesian Approaches for
Estimating Divergence Times (Yang & Yoder 2003)
  • Analyzed two mitochondrial genes
  • each codon position treated separately
  • tested different model assumptions
  • used
  • 7 calibration points
  • Neither model reliable when
  • using only one codon position
  • using a single model for all positions
  • Results similar for both methods
  • using the most complex model
  • use separate parameters for each codon position
    (could use codon model?)

34
Sources of Error/Variance
  • Lack of rate constancy (due to lineage,
    population size or selection effects)
  • Wrong assumptions in evolutionary model
  • Errors in orthology assignment
  • Incorrect tree
  • Stochastic variability
  • Imprecision of calibration points
  • Imprecision of regression
  • Human sloppiness in analysis
  • self-fulfilling prophecies

35
Reading the entrails of chickens (Graur and
Martin 2004)
  • single calibration point
  • error bars removed from calibration points
  • standard error bars instead of 95% confidence
    intervals
  • secondary/tertiary calibration points treated as
    reliable and precise
  • based on incorrect initial estimates
  • variance increases with distance from
    original estimate
  • few proteins used

36
Multiple Gene Loci
  • Trying to estimate time of divergence from one
    protein is like trying to estimate the average
    height of humans by measuring one human
  • --Molecular Systematics p539
  • Use multiple genes!
  • (and multiple calibration points)

37
Even so... Be Very Wary of
Molecular Times
  • Point estimates are absurd
  • Sample errors often based only on the
    difference between estimates in the
    same study
  • Even estimates with confidence intervals
    unlikely to really capture all sources of variance

38
McLysaght, Hokamp, Wolfe 2002: Dating Human Gene
Duplications
  • 758 Trees generated (ML method using PAM
    matrix)
  • 602 Alpha parameter for gamma distribution
    learned
  • (Gu and Zhang 1997) faster than ML, more accurate
    than parsimony
  • Thrown out if variance > mean. Why would this
    happen?
  • May be problematic to apply this model for gene
    family evolution because of the possible
    functional divergence among paralogous genes
  • 481 NJ trees built from Gamma-corrected
    distances
  • Family kept only if worm/fly group together
  • 191 Two-cluster test of rate constancy
    (Takezaki et al 1995)

39
Blanc, Hokamp, Wolfe: Dating Arabidopsis
Duplications
  • Create nucleotide alignments
  • Estimate level of synonymous substitutions
    (Yang's ML method)
  • per site? per synonymous site?
  • Ks values > 10 ignored (Yang; Anisimova)
  • Why used different method than for human?
  • How reliable is ranking of Ks values? How much
    variance expected?

40
Ks > 10 unreliable?
  • Yang (abstract) calculates effect of evolutionary
    rate on accuracy of phylogenetic reconstruction
  • Anisimova calculates accuracy and power of LRT in
    detecting adaptive molecular evolution
  • Neither seems to give any cutoff regarding dS >
    10.

41
Future Improvements
  • Calculate accurate confidence intervals
    taking into account multiple sources
    of variance
  • Novel models that account for variation in rates
    between taxa
  • Build explicit models that predict rates based on
    an understanding of the underlying processes that
    generate differences in substitution rates

42
General References
  • Reviews/Critiques
  • Bromham and Penny. The modern molecular clock.
    Nature Reviews Genetics, 2003.
  • Graur and Martin. Reading the entrails of
    chickens...the illusion of precision. Trends in
    Genetics, 2004.
  • Textbooks
  • Molecular Systematics. 2nd edition. Edited by
    Hillis, Moritz, and Mable.
  • Inferring Phylogenies. Felsenstein.
  • Molecular Evolution, a phylogenetic approach.
    Page and Holmes.

43
Rate Heterogeneity References
  • Dealing with Rate Heterogeneity
  • Yang and Yoder. Comparison of likelihood and
    bayesian methods for estimating divergence
    times... Syst. Biol, 2003.
  • Kishino, Thorne, and Bruno. Performance of a
    divergence time estimation method under a
    probabilistic model of rate evolution. Mol. Biol.
    Evol, 2001.
  • Huelsenbeck, Larget, and Swofford. A compound
    poisson process for relaxing the molecular clock.
    Genetics, 2000.
  • Testing for Rate heterogeneity
  • Takezaki, Rzhetsky and Nei. Phylogenetic test of
    the molecular clock and linearized trees. Mol.
    Bio. Evol., 1995.
  • Bromham, Penny, Rambaut, and Hendy. The power of
    relative rates tests depends on the data. J Mol
    Evol, 2000.

44
Dating Duplications References
  • Dating duplications
  • McLysaght, Hokamp, and Wolfe. Extensive genomic
    duplication during early chordate evolution.
    Nature Genetics, 2002.
  • Blanc, Hokamp, and Wolfe. Recent polyploidy
    superimposed on older large-scale duplications in
    the Arabidopsis genome. Genome Research, 2003.
  • Reference used for dating duplications in above
    papers
  • Gu and Zhang. A simple method for estimating the
    parameter of substitution rate variation among
    sites. Mol. Biol. Evol., 1997.
  • Yang Z. On the best evolutionary rate for
    phylogenetic analysis. Syst. Biol, 1998.
  • Anisimova, Bielawski, Yang. Accuracy and power of
    the likelihood ratio test in detecting adaptive
    molecular evolution. Mol. Biol. Evol., 2001.

45
Relative vs Absolute Rates
  • M. Systematics p540
  • Differences in rates of divergence among
    lineages detract only from methods of analysis
    that require clocklike behavior of molecules, and
    alternative methods of analysis exist for all
    applications of molecular systematics except for
    the absolute estimation of time.
  • t1 2 t2 still requires clocklike behavior?

46
Synonymous vs Nonsynonymous Distance
  • Syn sites are sites where a nt change does not
    cause an AA change
  • only 25% of sites, so become saturated more
    quickly
  • Between proteins
  • more variation in non-synonymous rates
  • Within same protein
  • more variation in synonymous rates
  • Which are used? What is effect?

47
Two-cluster Test: Takezaki, Rzhetsky and Nei
(1995?)
  • estimate tree
  • for each nonroot interior node
  • calculate average rate for both descendant
    clades
  • test equality of rates (using variance
    covariance of branch lengths); doesn't appear to
    correct for multiple testing
  • move up from leaves, eliminating a cluster if not
    equal
  • finally, linear tree created
  • reestimate branch lengths under clock constraint

48
Neutral Hypothesis
  • Most mutations have no influence on fitness of
    the organism
  • Advantageous mutations rare
  • Deleterious mutations rapidly removed
  • Greatest proportion of mutations have no effect
    on protein function
  • Rate of change is thus affected only by mutation
    rate, and so should be relatively constant within
    a species
  • Variation in rate among genes b/c differences in
    selective constraints

49
Mutation Rate in Nuclear Genes of Mammals (Yang &
Nielsen 1997)
Gene                 dS (P)   dS (R)   dN (P)   dN (R)
Acid phosphatase     0.354    0.680    0.028    0.049
Myelin Proteolipid   0.033    0.117    0.009    0.000
Interleukin 6        0.100    0.566    0.191    0.373
IGF binding 1        0.307    0.667    0.109    0.084
Thrombomodulin       0.414    1.337    0.092    0.108
Average              0.190    0.525    0.039    0.066
50
Perfect Molecular Clock
  • Change is a linear function of time (substitutions
    are Poisson)
  • Rates constant (positions/lineages)
  • Tree perfect
  • Molecular distance estimated perfectly
  • Calibration dates without error
  • Regression (time vs substitutions) without error

51
Yang, effect of evol. rate (abstract)
  • Yang calculates effect of evolutionary rate on
    accuracy of phylogenetic reconstruction
  • simulation study
  • branch length = expected total number of nt
    substitutions per site (not synonymous?)
  • estimates proportion of correctly recovered
    branch partitions
  • optimum levels of sequence divergence were even
    higher than previously suggested for saturation
    of substitutions, indicating that the problem of
    saturation may have been exaggerated

52
Bayesian parametric estimation
  • Density function for x, given the training data
    set
  • From the definition of conditional probability
    densities
  • The first factor is independent of X(n) since it
    is just our assumed form for the parameterized
    density.
  • Therefore
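
The equations for this slide did not survive in the transcript. What follows is a plausible reconstruction of the standard argument, not the slide's own formulas. It assumes θ denotes the parameter vector and X^(n) the training set of n samples; the three lines correspond, in order, to the density being sought, the conditional-probability factorization, and the result after dropping X^(n) from the first factor.

```latex
\begin{align}
p\bigl(x \mid X^{(n)}\bigr)
  &= \int p\bigl(x, \theta \mid X^{(n)}\bigr)\, d\theta \\
p\bigl(x, \theta \mid X^{(n)}\bigr)
  &= p\bigl(x \mid \theta, X^{(n)}\bigr)\, p\bigl(\theta \mid X^{(n)}\bigr) \\
p\bigl(x \mid X^{(n)}\bigr)
  &= \int p(x \mid \theta)\, p\bigl(\theta \mid X^{(n)}\bigr)\, d\theta
\end{align}
```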

53
Bayesian parametric estimation
  • Instead of choosing a specific value θ, the
    Bayesian approach performs a weighted average
    over all values of θ
  • If the weighting factor p(θ | X(n)),
    which is the posterior of θ, peaks very sharply
    about some value θ*, we obtain
    p(x | X(n)) ≈ p(x | θ*).
  • Thus the optimal estimator is the most likely
    value of θ given the data and the prior of
    θ.

54
The Holy Grail
  • Fossil evidence is sparse and imprecise (or
    nonexistent)

Predict divergence times by comparing molecular
data