Title: Maximum-Likelihood Estimation of Substitution Heterogeneity through Clustering
Zhang Zhang and Jeffrey P. Townsend
Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America
Summary
Detecting substitution heterogeneity is of great importance for locating divergent and conserved regions along DNA and protein sequences. Heterogeneous regions correspond to nonuniform features and imply different evolutionary processes arising from different functional constraints or selection pressures. Here we propose a maximum-likelihood method to detect substitution heterogeneity in sequences. The method uses a divide-and-conquer approach to cluster sequences into heterogeneous regions that differ in substitution. To determine whether a cluster deviates significantly from the rest of the sequence, the method adopts several criteria: the Likelihood Ratio Test (LRT), the Akaike Information Criterion (AIC) and its corrected variant (AICc), and the Bayesian Information Criterion (BIC). The method requires no a priori knowledge about the clustering or the number of clusters. Its accuracy is supported by application to several real data sets, and it can be applied to comparative evolutionary studies. In addition, we recommend that criteria that account for sample size (AICc and BIC) be used in all circumstances.
- Introduction
- Substitutions along sequences are concentrated in some regions and/or relatively sparse in others.
- Why should we care
- Substitution heterogeneity provides
- clues to sequence function
- implications for sequence structure
- evidence of interesting evolutionary phenomena
- What is the question
- Substitutions are not uniform.
- Positive selection often acts on small regions along a sequence.
- Heterogeneity of substitution within genes is not well accounted for by existing methods, which either average selective pressure across all sites under the assumption that all sites evolve at the same rate, or estimate selective pressure at each site under a given statistical distribution.
- There is no statistical method to detect clustering within discrete linear sequences when there is no a priori specification of cluster size or the number of clusters.
- How we can do it
- Here we propose a method based on maximum-likelihood estimation (MLE) to detect regional clusters that differ in substitution.
- Model selection
- To examine whether a clustering model fits a sequence better than the null model, our method adopts several different criteria for model selection.
- The Likelihood Ratio Test (LRT) is one of the most popular strategies for model selection. Twice the difference between the maximized log-likelihoods of the clustering model ($\ln L_c$) and the null model ($\ln L_0$), that is $2(\ln L_c - \ln L_0)$, should be asymptotically distributed as $\chi^2$ with two degrees of freedom. A p-value below the significance level (usually 0.05) indicates that the clustering model fits the data significantly better than the null model; otherwise the null model is retained.
- The Akaike Information Criterion (AIC) estimates the Kullback-Leibler distance between the true model and an examined model and quantifies the information lost by approximating the true model: $\mathrm{AIC} = -2 \ln L + 2k$, where $L$ is the maximized likelihood value and $k$ is the number of parameters. We define $k_0$ and $k_c$ as the number of parameters under the null hypothesis and the clustering model, respectively; thus $k_0 = 0$ and $k_c = 2$.
- AICc, a modification of AIC, accounts for sample size ($n$) as well as $k$ and $L$. This matters especially when clustering short sequences into heterogeneous regions, which is likely to involve more bias.
- The Bayesian Information Criterion (BIC), like AICc, is a function of the maximized log-likelihood, the number of parameters and the sample size.
- Algorithm implementation
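As an implementation aid, the four model-selection criteria can be sketched in a few lines of Python. This is a minimal sketch, not the MLCluster code: the function names are hypothetical, the AICc and BIC expressions are the standard textbook forms (AICc = AIC + 2k(k+1)/(n-k-1) and BIC = -2 lnL + k ln n), and the defaults k0 = 0 and kc = 2 follow the parameter counts given in the model-selection section. The LRT p-value uses the closed-form survival function of a chi-square distribution with two degrees of freedom, exp(-x/2).

```python
import math


def lrt_p_value(lnL0, lnLc):
    """P-value of the likelihood ratio test with two degrees of freedom."""
    stat = max(0.0, 2.0 * (lnLc - lnL0))
    # For a chi-square distribution with 2 degrees of freedom the survival
    # function is exp(-x / 2), so no statistics library is needed.
    return math.exp(-stat / 2.0)


def aic(lnL, k):
    """Akaike Information Criterion."""
    return -2.0 * lnL + 2.0 * k


def aicc(lnL, k, n):
    """AIC with the small-sample correction; requires n > k + 1."""
    return aic(lnL, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(lnL, k, n):
    """Bayesian Information Criterion."""
    return -2.0 * lnL + k * math.log(n)


def prefers_clustering(lnL0, lnLc, n, k0=0, kc=2, alpha=0.05):
    """Report, per criterion, whether the clustering model is preferred."""
    return {
        "LRT": lrt_p_value(lnL0, lnLc) < alpha,
        "AIC": aic(lnLc, kc) < aic(lnL0, k0),
        "AICc": aicc(lnLc, kc, n) < aicc(lnL0, k0, n),
        "BIC": bic(lnLc, kc, n) < bic(lnL0, k0, n),
    }
```

For example, `prefers_clustering(-50.0, -46.0, n=100)` accepts the clustering model under LRT, AIC and AICc but not under BIC, which illustrates how the criteria can disagree near the decision boundary.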
- The new method uses a divide-and-conquer approach to detect all possible clusters among sequences. After locating the first cluster, the method partitions the sequence into three sub-sequences and then repeats the same analysis on each of them, until every segment of the sequence has failed to demonstrate clustering (see Figure 2).
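The recursion described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the MLCluster implementation: the function names and the `min_len` guard are hypothetical, the candidate cluster is found by exhaustive search over all windows, the two flanking regions are assumed to share one variant-site probability, AICc is used as the acceptance criterion, and the sample size is taken to be the segment length.

```python
import math


def _xlogy(count, prob):
    """count * log(prob), with the convention 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(prob)


def _binom_lnl(n, N):
    """Maximized Bernoulli log-likelihood of n variant sites among N sites."""
    if N == 0:
        return 0.0
    p = n / N
    return _xlogy(n, p) + _xlogy(N - n, 1.0 - p)


def _aicc(lnL, k, n):
    """AIC with small-sample correction (assumed acceptance criterion)."""
    return -2.0 * lnL + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)


def best_cluster(x):
    """Exhaustively find the window maximizing the clustering log-likelihood."""
    N, n = len(x), sum(x)
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)  # prefix sums give window counts in O(1)
    best = (-math.inf, None, None)
    for s in range(N):
        for e in range(s, N):
            length = e - s + 1
            if length == N:
                continue  # the whole segment is the null model, not a cluster
            nc = prefix[e + 1] - prefix[s]
            # Assumption: both flanks are pooled into one background region.
            lnl = _binom_lnl(nc, length) + _binom_lnl(n - nc, N - length)
            if lnl > best[0]:
                best = (lnl, s, e)
    return best


def find_clusters(x, offset=0, min_len=4):
    """Return accepted clusters as 0-based inclusive (start, end) pairs."""
    N = len(x)
    if N < min_len:  # hypothetical guard; also keeps the AICc denominator positive
        return []
    lnl0 = _binom_lnl(sum(x), N)
    lnlc, s, e = best_cluster(x)
    if s is None or _aicc(lnlc, 2, N) >= _aicc(lnl0, 0, N):
        return []  # this segment fails to demonstrate clustering
    # Repeat the analysis on the three sub-sequences.
    left = find_clusters(x[:s], offset, min_len)
    centre = find_clusters(x[s:e + 1], offset + s, min_len)
    right = find_clusters(x[e + 1:], offset + e + 1, min_len)
    return left + [(offset + s, offset + e)] + centre + right
```

The exhaustive search costs quadratic time in the segment length, which is acceptable for gene-length sequences but is a design choice of this sketch rather than something stated in the text.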
- Results
- We used our method to detect clusters along the Drosophila Adh gene in five species of the Drosophila melanogaster species subgroup (D. melanogaster, D. sechellia, D. simulans, D. yakuba and D. erecta) and identified heterogeneous clusters that differ in substitution (see Table 1 and Figure 3).
- Materials and methods
- Alignment
- Suppose that two or more aligned sequences have $N$ sites and that each site is coded as 0 if it is identical across the sequences and 1 if it is variant. The aligned sequences can therefore be denoted as
$$X = x_1 x_2 \cdots x_N, \qquad x_i \in \{0, 1\}.$$
- Clustering model
- The null hypothesis assumes no heterogeneous cluster among sequences. Consequently, the likelihood of the null model (without clusters) is calculated as
$$L_0 = \left(\frac{n}{N}\right)^{n} \left(1 - \frac{n}{N}\right)^{N - n},$$
where $n$ is the number of variant sites.
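The site coding and the null likelihood can be illustrated with a short Python sketch. It assumes that the null likelihood is the binomial likelihood evaluated at the observed proportion of variant sites, n/N, and that any difference within an alignment column (including a gap character) marks the site as variant. The function names are hypothetical, and the alignment is assumed to be nonempty.

```python
import math


def encode_alignment(sequences):
    """Code each alignment column as 0 (identical) or 1 (variant)."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [0 if len(set(column)) == 1 else 1 for column in zip(*sequences)]


def xlogy(count, prob):
    """count * log(prob), with the convention 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(prob)


def null_log_likelihood(x):
    """ln L0 = n ln(n/N) + (N - n) ln(1 - n/N) for a 0/1 sequence x."""
    N, n = len(x), sum(x)
    p = n / N
    return xlogy(n, p) + xlogy(N - n, 1.0 - p)
```

Working on the log scale avoids numerical underflow for long sequences, where the raw likelihood would be vanishingly small.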
- Under the clustering model, the entire sequence is partitioned into three regions, and the central region is considered the heterogeneous cluster (see Figure 1).
- Figure 1. Illustration of a cluster in sequence $X$. Here $c_s$ and $c_e$ are the start and end positions of the cluster, respectively, and $n_s$, $n_c$ and $n_e$ are the numbers of variant sites in the beginning, central and ending regions, respectively, where $n = n_s + n_c + n_e$.
- The likelihood of the clustering model is formulated as
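One form of this likelihood that is consistent with the binomial treatment of the null model is given below. It is a sketch under assumptions, not necessarily the authors' exact expression. It assumes that the variant-site probability inside the cluster is estimated by its observed proportion and that the two flanking regions share a single background probability. The symbols $\ell_c$, $p_c$ and $p_f$ are introduced here for illustration only.

```latex
\begin{align*}
\ell_c &= c_e - c_s + 1, \qquad
p_c = \frac{n_c}{\ell_c}, \qquad
p_f = \frac{n_s + n_e}{N - \ell_c}, \\
L_c &= p_c^{\,n_c} \, (1 - p_c)^{\ell_c - n_c} \;
       p_f^{\,n_s + n_e} \, (1 - p_f)^{\,N - \ell_c - (n_s + n_e)} .
\end{align*}
```

Under these assumptions, setting $p_c = p_f = n/N$ recovers the binomial null likelihood with $n = n_s + n_c + n_e$ variant sites among $N$, so the null model is nested within the clustering model. A variant with separate probabilities for the beginning and ending regions would replace the pooled flank term with one binomial term per flank.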
Acknowledgments
We thank Zheng Wang, Francesc López, Aleksandra Adomas, Gina Wilpiszeski and Andrea Hodgins-Davis for valuable discussions. This work is supported by a grant from the National Institute of General Medical Sciences at the U.S. National Institutes of Health (GM068087).
Further information
Please contact Zhang.Zhang_at_yale.edu and Jeffrey.Townsend_at_yale.edu. The proposed method has been implemented in the program MLCluster, which is freely available at www.yale.edu/townsend/software.html. A PDF version of the poster can be obtained at www.yale.edu/townsend/Poster/MLCluster.pdf.