Statistical Tests - PowerPoint PPT Presentation

About This Presentation

Title:

Statistical Tests

Description:

Statistical Tests for Gene Clusters Spanning Three Genomic Regions Narayanan Raghupathy, Rose Hoberman, and Dannie Durand Carnegie Mellon University, Pittsburgh, PA – PowerPoint PPT presentation

Transcript and Presenter's Notes

Statistical Tests for Gene Clusters Spanning Three Genomic Regions
Narayanan Raghupathy, Rose Hoberman, and Dannie Durand
Carnegie Mellon University, Pittsburgh, PA

Gene clusters: evidence of common ancestry?
Many analyses use gene clusters (distinct chromosomal regions that share homologous gene pairs, but for which neither gene order nor gene content is preserved) as evidence of shared ancestry. However, it is necessary to first rule out the possibility that the regions are unrelated and simply share homologous genes by chance.
[Figure: a gene cluster in two windows, W1 and W2]
Contributed equally
Are W1 and W2 homologous regions?
Statistical Tests
We propose a novel test that takes into account both the genes conserved in all three regions (x123) and the genes conserved in only pairs of regions (x12, x13, and x23). We use a combinatorial approach to obtain, for each genome model, an expression for the probability P(X ≥ x) under the null hypothesis of random gene order. (Equations omitted for brevity.) Here X = (X123, X12, X13, X23) denotes the random variables drawn from the distribution given by the null hypothesis. The expression X ≥ x is shorthand for X123 ≥ x123, X12 ≥ x12, X13 ≥ x13, and X23 ≥ x23; that is, each of the quantities is at least as large as the observed quantity. Using these expressions, we computed cluster probabilities in Mathematica for typical genome parameters and window sizes. These simulations were used to investigate the following questions.
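
The tail probability described above can also be approximated by simulation. The sketch below is a Monte Carlo estimate, not the combinatorial expressions from the poster (which are omitted there). It assumes the orthology model, with n123 genes shared by all three genomes plus singletons, and it treats a window of r consecutive genes under random gene order as a uniformly random r-subset of its genome. The function names, the integer gene identifiers, and the trial counts are illustrative assumptions.

```python
import random


def sample_window(n_shared, n_single, r, rng):
    """Return the shared genes that fall in one random window.

    Assumption: ids 0..n_shared-1 are genes with a homolog in every genome
    (the same id in each genome); higher ids are singletons. Under random
    gene order, r consecutive genes form a uniform random r-subset.
    """
    genes = rng.sample(range(n_shared + n_single), r)
    return {g for g in genes if g < n_shared}


def estimate_tail(observed, n123, singletons, r, trials=20000, seed=0):
    """Estimate P(X123 >= x123, X12 >= x12, X13 >= x13, X23 >= x23)."""
    x123, x12, x13, x23 = observed
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        w1, w2, w3 = [sample_window(n123, s, r, rng) for s in singletons]
        triple = len(w1 & w2 & w3)
        # Pair counts exclude genes that are present in all three windows.
        pair12 = len(w1 & w2) - triple
        pair13 = len(w1 & w3) - triple
        pair23 = len(w2 & w3) - triple
        if triple >= x123 and pair12 >= x12 and pair13 >= x13 and pair23 >= x23:
            hits += 1
    return hits / trials


# Illustrative call: every pair of windows shares at least one gene.
p = estimate_tail((0, 1, 1, 1), n123=5000, singletons=(0, 0, 0), r=100, trials=2000)
```

Simulation resolves only probabilities that are large relative to 1/trials, so very significant clusters still call for exact expressions.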
Gene content overlap models
The significance of a cluster depends not only on the properties of the windows, but also on the size of the genomes and the number of genes in common between the genomes. We design statistical tests for genome models that are appropriate for two common types of comparative genomics problems. The first model is designed for analyses of conserved linkage of genes in three regions from three distinct genomes. The second model is for detection of segments duplicated by a whole genome duplication (WGD), via comparison with the genome of a related, pre-duplication species. We again use a Venn diagram representation to illustrate the extent of gene content overlap among the genomes.

When comparing two regions, the number of shared genes, x, is a natural test statistic: the more genes that are shared, the less likely it is that they are shared by chance. In contrast, when comparing three regions, there are many quantities that provide evidence of homology. In the cluster at left, x123 = 1, x12 = 3, x13 = 1, and x23 = 1.
Previous attempts to test the significance of clusters spanning three or more regions have either used multiple pairwise comparisons (reviewed by Simillion et al. [2]) or considered only the genes shared among all regions (x123) [1]. How best to combine evidence from different subsets of regions remains an unsolved problem.

Current statistical approaches primarily focus on
comparisons of two regions only. With the rapid
rate of whole genome sequencing, analysis of gene
clusters that span three or more chromosomal
regions is of increasing interest. However, the
statistical questions are more difficult.

the number of genes shared among all three
regions (x123)
the number of genes shared between exactly two
regions (x12, x13, x23)
the number of genes unique to one window (x1,
x2, x3)
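
The quantities listed above can be computed directly from the gene content of three windows. The sketch below assumes, purely for illustration, that each gene is represented by a homolog-family identifier so that homologous genes in different windows compare equal; the function name and the example identifiers are hypothetical.

```python
def cluster_counts(w1, w2, w3):
    """Count gene families by the exact set of windows they occur in."""
    s1, s2, s3 = set(w1), set(w2), set(w3)
    return {
        "x123": len(s1 & s2 & s3),       # shared among all three windows
        "x12": len((s1 & s2) - s3),      # shared by exactly W1 and W2
        "x13": len((s1 & s3) - s2),      # shared by exactly W1 and W3
        "x23": len((s2 & s3) - s1),      # shared by exactly W2 and W3
        "x1": len(s1 - s2 - s3),         # unique to W1
        "x2": len(s2 - s1 - s3),         # unique to W2
        "x3": len(s3 - s1 - s2),         # unique to W3
    }


# Made-up windows chosen to give x123 = 1, x12 = 3, x13 = 1, x23 = 1.
w1 = ["a", "b", "c", "d", "e", "u1"]
w2 = ["a", "b", "c", "d", "f", "u2"]
w3 = ["a", "e", "f", "u3"]
counts = cluster_counts(w1, w2, w3)
```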

[Figure: a gene cluster in three windows, W1, W2, and W3]
Given a third region W3, are W1 and W2 homologous?
Orthology model: n123 genes are shared among all three genomes. The remaining genes in each genome (n1, n2, n3) are singletons, genes that do not have homologs in any of the other genomes.
To design statistical tests for three regions, we need to model:

the number of genes shared among the three
regions
the extent of gene content overlap among the
genomes

Our goals
Develop genome models appropriate for common
comparative genomics problems.
Develop statistical tests for clusters spanning
three regions, for each model.
Study the relative importance of the above
quantities to cluster significance.
Investigate how the genome model affects cluster
significance.
Compare our proposed tests to previous
statistical approaches.

Hypothesis Testing Approach
Our statistical approach tests the hypothesis that a gene cluster is evidence of shared ancestry against a null hypothesis of random gene order. We try to rule out the null hypothesis by showing that the probability of the observed cluster is small under the null hypothesis. Given a set of three windows, each containing r consecutive genes, we wish to determine whether the windows share more homologous genes than expected by chance. A gene cluster spanning two regions can be characterized by the following quantities:

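Duplication model: the post-duplication genome has undergone a whole genome duplication (WGD), and the pre-duplication genome is a related genome that diverged from a common ancestor before the WGD.
n1,2 genes appear twice in the post-duplication genome and once in the pre-duplication genome. These are the genes that are retained in duplicate.
n1,1 genes appear once in the post-duplication genome and once in the pre-duplication genome. These are the genes that were preferentially lost.
n0,1 genes appear once in the post-duplication genome but do not appear in the pre-duplication genome.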

We present the first attempt to evaluate the significance of clusters spanning exactly three regions, taking into account both the genes conserved in all regions and the genes conserved in only pairs of regions.
Are pairwise statistical tests sufficient?
The most common strategy for testing the significance of multiple regions is to conduct multiple pairwise comparisons (reviewed in [2]). For example, if region W1 is significantly similar to region W2, and W2 is significantly similar to region W3, then homology among all three regions is inferred, even if W1 and W3 share few or no genes.
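
As a point of reference for the pairwise strategy, here is a minimal sketch of one simple two-region null model. It assumes two genomes of n genes each, with every gene having exactly one homolog in the other genome (no singletons), and windows of r genes; under random gene order the number of shared genes is then hypergeometric. The function names and the significance threshold `alpha` are hypothetical, and this is not necessarily the exact test used in the cited work.

```python
from math import comb


def pairwise_tail(x, n, r):
    """P(X >= x) for the number of genes shared by two random windows.

    Assumes each window is a uniform random r-subset of n genes and that
    all n genes have one homolog in the other genome (requires r <= n).
    """
    total = comb(n, r)
    favourable = sum(comb(r, k) * comb(n - r, r - k) for k in range(x, r + 1))
    return favourable / total


def min_shared_for_significance(alpha, n, r):
    """Smallest x whose tail probability is at most alpha, or None."""
    for x in range(r + 1):
        if pairwise_tail(x, n, r) <= alpha:
            return x
    return None


# Illustrative parameters matching the genome and window sizes used here.
p_four = pairwise_tail(4, 5000, 100)
p_seven = pairwise_tail(7, 5000, 100)
```

A pairwise analysis applies such a test to each pair of windows separately, which is why genes shared by all three windows receive no extra weight.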
How do retained duplicates after WGD affect
cluster significance?
Does the proportion of singleton genes in the
genome matter?

the number of shared genes (x)
the number of genes unique to each window

Following a WGD, in many cases there is no immediate selective advantage for retaining a gene in duplicate, so one of the duplicates is often lost. Therefore, paralogous regions may share few paralogous genes. Thus, these duplicated regions are often detected by comparison to a related pre-duplication genome. We computed cluster probabilities for the duplication model using the following parameters: n1,1 = 3600, n1,2 = 450, and n0,1 = 500. This is consistent with a recent study of pre- and post-duplication yeast species [3], in which only 16% of duplicates were retained following WGD in S. cerevisiae.
We illustrate these by a Venn diagram representation of a gene cluster, where each circle represents a window, and the number of shared genes (x) is given in the intersection.

The pairwise approach allows the use of existing statistical methods, which are designed for comparing two regions. However, it

requires at least two of the three pairwise comparisons to be independently significant
does not consider the greater impact of genes shared among all three regions.

Genomes under comparison often contain singletons, genes that do not have homologs in any of the other genomes (n1, n2, n3 in the orthology model).

As the proportion of singletons in the genomes
increases, cluster significance increases
substantially. This is because as fewer homologs
are shared between the genomes, it is more
surprising to find them clustered together.
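
The effect of singletons can be illustrated in the simpler two-region setting. The sketch below assumes two genomes that each contain `n_shared` genes with one homolog in the other genome plus `n_single` singletons, and windows of r genes that are uniform random r-subsets under random gene order. It is a pairwise analogue only, not the three-region expressions from the poster, and the function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """Probability of k successes in draws without replacement."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def pairwise_tail_with_singletons(x, n_shared, n_single, r):
    """P(X >= x) for genes shared by two random windows of r genes."""
    total = n_shared + n_single
    prob = 0.0
    for k1 in range(x, min(r, n_shared) + 1):
        # k1 = number of non-singleton genes that land in the first window.
        p_k1 = hypergeom_pmf(k1, total, n_shared, r)
        # The second window must contain at least x of their k1 homologs.
        tail = sum(hypergeom_pmf(j, total, k1, r) for j in range(x, k1 + 1))
        prob += p_k1 * tail
    return prob


# Hold genome size at 5000 and vary the number of singletons.
results = {s: pairwise_tail_with_singletons(3, 5000 - s, s, 100)
           for s in (0, 2500, 4000)}
```

With genome size held fixed, the probability of seeing at least three shared genes by chance shrinks as the number of singletons grows, which is the trend described above.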
[Figure: two clusters, each spanning windows Wpost1, Wpre, and Wpost2]
How much more does a gene shared by all three
regions contribute to significance?
We compared the pairwise probabilities to our three-way probabilities for various cluster parameter values. The figure below shows that, even when x123 = 0, pairwise tests underestimate the significance when compared to our three-way test, which considers all three regions jointly.
Which cluster is less likely to occur by chance,
when genes are arranged randomly?
[Figure parameters: n1 = n2 = n3 = s, n123 + s = 5000, r = 100]
[Figure parameters: n123 = 5000, n1 = n2 = n3 = 0, r = 100, x123 = 0]
a) Wpre shares three genes with Wpost1, and three other genes with Wpost2

b) Wpre shares only two genes each with Wpost1 and Wpost2, but Wpost1 and Wpost2 share an additional gene

For example, given a significance threshold of
, the pairwise approach requires two of
the three regions to share at least seven genes.
In contrast, using our three-way test a cluster
is significant when each pair of regions shares
only four genes.
Which cluster is less likely to occur by chance, if 84% of duplicates were lost following WGD?
[Figure parameters: n1,1 = 3600, n1,2 = 450, n0,1 = 500, r = 100]

a) Two genes are shared by all three windows (x123 = 2, x12 = x13 = x23 = 0)

b) Two distinct genes are shared by each pair of windows (x123 = 0, x12 = x13 = x23 = 2)
The figure at right shows that the two scenarios
shown above are actually quite close in
significance, even though the second scenario
shares fewer homologous matches. Current
approaches typically compare the pre-duplication
region independently with each of the
post-duplication regions, and thus ignore the
values of x23 and x123. These methods could fail
to detect clearly significant clusters.
[Figure axis: x12 = x13 = x23; parameters: n123 = 5000, n1 = n2 = n3 = 0, r = 100]
Our results suggest that pairwise tests are not
always sufficient and multi-region tests will be
able to identify more distantly related
homologous regions.
In both cases, each pair of windows shares two
genes. However, the total number of genes shared
in (b) is twice as large as in (a). Nonetheless,
as the figure at right shows, the scenario shown
in cluster (a) is much less likely to occur by
chance under the orthology model. This
illustrates the importance of x123 to cluster
significance.
[Figure axis: x12 = x13]
References
[1] D. Durand and D. Sankoff, J. Comput. Biol., 10, 2003.
[2] C. Simillion et al., Bioessays, 26, 2004.
[3] K. P. Byrne and K. H. Wolfe, Genome Res., 10, 2005.
