Maximum-likelihood estimation of substitution
heterogeneity through clustering
Zhang Zhang and Jeffrey P. Townsend
Department of Ecology and Evolutionary Biology,
Yale University, New Haven, Connecticut, United
States of America

Summary
Detecting substitution heterogeneity is of great
importance for locating divergent and conserved
regions along DNA and protein sequences.
Heterogeneous regions correspond to nonuniform
features and imply different evolutionary
processes arising from different functional
constraints or selection. Here we propose a
maximum-likelihood method to detect substitution
heterogeneity in sequences. The method uses
divide and conquer to cluster sequences into
heterogeneous regions that differ in
substitution. To determine whether a cluster
deviates significantly from the rest of the
sequence, the method adopts several criteria: the
Likelihood Ratio Test (LRT), the Akaike
Information Criterion (AIC) and its variant
(AICc), and the Bayesian Information Criterion
(BIC). The method requires no a priori knowledge
of cluster size or the number of clusters. Its
accuracy is supported by application to several
real data sets, and it can be applied to
comparative evolutionary studies. In addition, we
recommend that criteria that account for sample
size (AICc and BIC) be used in all circumstances.
  • Introduction
  • Substitutions along sequences are concentrated in
    some regions and/or relatively sparse in others.
  • Why should we care?
  • Substitution heterogeneity provides
      • clues to sequence function
      • implications of sequence structure
      • evidence of interesting evolutionary phenomena
  • What is the question?
  • Substitutions are not uniform.
  • Positive selection often acts on small
    regions along a sequence.
  • Heterogeneity of substitution within genes is
    not well accounted for by existing methods that
    average selective pressure across all sites under
    the assumption that all sites evolve at the same
    rate, or estimate selective pressure at each site
    under a given statistical distribution.
  • There is no statistical method to detect
    clustering within discrete linear sequences when
    neither cluster size nor the number of clusters
    is specified a priori.
  • How can we do it?
  • Here we propose a method based on maximum
    likelihood estimation (MLE) to detect regional
    clusters that differ in substitution.
  • Model selection
  • To examine whether a clustered model best fits a
    sequence, our method adopts several different
    criteria for model selection.
  • The Likelihood Ratio Test (LRT) is one of the
    most popular strategies for model selection.
    Twice the difference between the maximized
    log-likelihoods of the clustering ($\ln L_c$) and
    null ($\ln L_0$) models, $2(\ln L_c - \ln L_0)$,
    should be asymptotically distributed as $\chi^2$
    with two degrees of freedom. A p-value below the
    significance level (usually 0.05) indicates that
    the clustering model fits the data significantly
    better than the null model; otherwise the null
    model is retained.
  • The Akaike Information Criterion (AIC) estimates
    the Kullback-Leibler distance between the true
    model and an examined model and quantifies the
    information lost by approximating the true model:
    $\mathrm{AIC} = -2 \ln L + 2k$, where $L$ is the
    maximized likelihood value and $k$ is the number
    of parameters. We define $k_0$ and $k_c$ as the
    number of parameters under the null hypothesis
    and the clustering model, respectively, and thus
    $k_0 = 0$ and $k_c = 2$.
  • AICc, a modification of AIC, accounts for
    sample size (n) as well as k and L. The
    correction matters most when clustering short
    sequences into heterogeneous regions, which
    probably involves more bias.
  • The Bayesian Information Criterion (BIC), like
    AICc, is a function of the maximized
    log-likelihood, the number of parameters and the
    sample size.
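The four criteria above can be sketched in Python using their standard textbook forms. The function names are hypothetical, and which quantity serves as the sample size n (for example, the number of sites in the segment) is an assumption left to the caller.

```python
import math


def lrt_p_value(lnL0, lnLc):
    """p-value of the likelihood ratio test with two degrees of freedom.

    For a chi-square distribution with 2 degrees of freedom the survival
    function has the closed form exp(-x / 2), so no statistics library is needed.
    """
    stat = 2.0 * (lnLc - lnL0)
    return math.exp(-max(stat, 0.0) / 2.0)


def aic(lnL, k):
    """Akaike Information Criterion: -2 ln L + 2k."""
    return -2.0 * lnL + 2.0 * k


def aicc(lnL, k, n):
    """AIC with the standard small-sample correction (requires n > k + 1)."""
    return aic(lnL, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(lnL, k, n):
    """Bayesian Information Criterion: -2 ln L + k ln n."""
    return -2.0 * lnL + k * math.log(n)


# Illustrative numbers only (not taken from the poster).
lnL0, lnLc, n = -60.0, -52.0, 200
clustering_preferred = (
    lrt_p_value(lnL0, lnLc) < 0.05
    and aicc(lnLc, 2, n) < aicc(lnL0, 0, n)
    and bic(lnLc, 2, n) < bic(lnL0, 0, n)
)
```

For AIC, AICc and BIC the model with the smaller value is preferred, so the clustering model is accepted only when its score falls below that of the null model.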
  • Algorithm implementation
  • The new method uses a divide-and-conquer
    approach to detect all clusters along the
    sequences. After locating the first cluster, the
    method partitions the sequence into three
    sub-sequences (the region before the cluster, the
    cluster itself and the region after it) and
    repeats the same analysis on each of them, until
    no segment of the sequence shows further
    clustering (see Figure 2).
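The recursion described above can be sketched as follows. This is a minimal illustration, not the MLCluster implementation: it assumes a 0/1 site vector, independent Bernoulli sites with one rate inside the cluster and one shared rate outside it, an exhaustive search over cluster boundaries, BIC as the acceptance rule, and the segment length as the sample size. All function names are hypothetical.

```python
import math


def log_lik(variants, sites):
    """Bernoulli log-likelihood at the maximum-likelihood rate variants/sites."""
    if sites == 0 or variants == 0 or variants == sites:
        return 0.0
    p = variants / sites
    return variants * math.log(p) + (sites - variants) * math.log(1.0 - p)


def best_cluster(x):
    """Return (lnLc, cs, ce) for the best single cluster in x (0-based, inclusive)."""
    N, n = len(x), sum(x)
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)
    best = None
    for cs in range(N):
        for ce in range(cs, N):
            m = ce - cs + 1
            if m == N:  # the cluster must be a proper sub-region
                continue
            nc = prefix[ce + 1] - prefix[cs]
            lnL = log_lik(nc, m) + log_lik(n - nc, N - m)
            if best is None or lnL > best[0]:
                best = (lnL, cs, ce)
    return best


def find_clusters(x, offset=0):
    """Recursively collect accepted clusters as (start, end) in original coordinates."""
    N = len(x)
    if N < 2:
        return []
    lnL0 = log_lik(sum(x), N)
    lnLc, cs, ce = best_cluster(x)
    bic_null = -2.0 * lnL0  # k0 = 0 parameters
    bic_cluster = -2.0 * lnLc + 2 * math.log(N)  # kc = 2 parameters
    if bic_cluster >= bic_null:
        return []  # this segment fails to demonstrate clustering
    # Divide: repeat the analysis on the three sub-sequences.
    left = find_clusters(x[:cs], offset)
    middle = find_clusters(x[cs:ce + 1], offset + cs)
    right = find_clusters(x[ce + 1:], offset + ce + 1)
    return left + [(offset + cs, offset + ce)] + middle + right


# A dense block of variant sites flanked by invariant regions.
sites = [0] * 40 + [1] * 20 + [0] * 40
clusters = find_clusters(sites)
```

The exhaustive search costs on the order of N² likelihood evaluations per segment, which is acceptable for gene-length sequences. On very short segments any criterion can accept spurious clusters, so a minimum segment length may be worth adding in practice.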
  • Results
  • We used our method to detect clusters along the
    Drosophila Adh gene in five species of the
    Drosophila melanogaster species subgroup (D.
    melanogaster, D. sechellia, D. simulans, D.
    yakuba and D. erecta) and identified
    heterogeneous clusters that differ in
    substitution (see Table 1 and Figure 3).
  • Materials and methods
  • Alignment
  • Suppose that two or more aligned sequences have
    N sites and that each site is coded 0 if it is
    identical across the sequences and 1 if it is
    variant. The aligned sequences can then be
    denoted as a binary sequence
    $X = (x_1, x_2, \ldots, x_N)$ with
    $x_i \in \{0, 1\}$.
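The 0/1 encoding of an alignment can be sketched in Python. The function name `encode_alignment` is hypothetical, and treating a gap character as just another symbol is an assumption, since the text does not say how gaps are handled.

```python
def encode_alignment(sequences):
    """Encode aligned sequences as a list of 0 (identical site) and 1 (variant site)."""
    if len(sequences) < 2:
        raise ValueError("at least two aligned sequences are required")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*sequences) walks the alignment column by column.
    return [0 if len(set(column)) == 1 else 1 for column in zip(*sequences)]


# Toy alignment for illustration only.
x = encode_alignment(["ATGCA", "ATGTA", "ACGCA"])
N = len(x)   # number of sites
n = sum(x)   # number of variant sites
```

Here the second and fourth columns differ between sequences, so they are the only sites coded 1.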
  • Clustering model
  • The null hypothesis assumes no heterogeneous
    cluster among sequences. Consequently, the
    likelihood of the null model (without clusters)
    is calculated as
  • where n is the number of variant sites.
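One form consistent with these definitions, assuming each of the N sites is an independent Bernoulli trial with a single variant probability estimated as n/N (an assumption made here for illustration), would be:

```latex
L_0 = \left(\frac{n}{N}\right)^{n} \left(1 - \frac{n}{N}\right)^{N - n}
```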
  • Under the clustering model, the entire
    sequence is partitioned into three regions and
    the central region is treated as the
    heterogeneous cluster (see Figure 1).
  • Figure 1. Illustration of a cluster within
    sequence X. Let $c_s$ and $c_e$ be the start and
    end positions of the cluster, and let $n_s$,
    $n_c$ and $n_e$ be the numbers of variant sites
    in the beginning, central and ending regions,
    respectively, where $n = n_s + n_c + n_e$.
  • The likelihood of the clustering model is
    formulated as
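A sketch of such a likelihood, assuming independent Bernoulli sites with one variant rate inside the cluster and a single shared rate for the two flanking regions (both assumptions), is given below. The symbol m for the cluster length is introduced here for convenience, and N is the total number of sites.

```latex
m = c_e - c_s + 1

L_c = \left(\frac{n_c}{m}\right)^{n_c}
      \left(1 - \frac{n_c}{m}\right)^{m - n_c}
      \left(\frac{n_s + n_e}{N - m}\right)^{n_s + n_e}
      \left(1 - \frac{n_s + n_e}{N - m}\right)^{(N - m) - (n_s + n_e)}
```

When the rate inside the cluster equals the rate outside it, this expression reduces to a single-rate likelihood over all N sites.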

Acknowledgments
We thank Zheng Wang, Francesc López, Aleksandra
Adomas, Gina Wilpiszeski and Andrea Hodgins-Davis
for valuable discussions. This work is supported
by a grant from the National Institute of General
Medical Sciences at the U.S. National Institutes
of Health (GM068087).

Further information
Please contact Zhang.Zhang@yale.edu and
Jeffrey.Townsend@yale.edu. The proposed method has
been implemented in the program MLCluster, which
is freely available at
www.yale.edu/townsend/software.html. A PDF
version of the poster can be obtained at
www.yale.edu/townsend/Poster/MLCluster.pdf.