Title: The Statistical Significance of Max-gap Clusters
1The Statistical Significance of Max-gap Clusters
- Rose Hoberman
- David Sankoff
- Dannie Durand
2Glycolysis Pathway
Glycolysis Clusters
Clostridium acetobutylicum
3Gene Clustering for Functional Inference in
Bacterial Genomes
- The Use of Gene Clusters to Infer Functional Coupling, Overbeek et al., PNAS 96:2896-2901, 1999.
4original genome
large scale duplication or speciation event
rearrangement, mutation
Gene content and order are preserved
Similarity in gene content
Neither content nor order is strictly preserved
7Evolution of gene order conservation in prokaryotes, Tamames, Genome Biology 2, 2001
Gene insertion/loss
Local rearrangement
9Two Possible Questions
- Given a set of genes that we believe are functionally related, determine if they cluster together spatially more than we would expect by chance (reference set scenario)
- Identify all significantly conserved gene clusters as a starting point for making functional inferences (whole genome comparison)
12Reference Set Scenario
- Model of a genome
  - G = 1, ..., n, an ordered set of n unique genes
  - assume genes do not overlap
  - chromosome breaks ignored
- Reference set scenario
  - m genes of interest (in red) are pre-specified
  - want to find clusters of (a subset of) these genes
13Whole Genome Scenario
Given two genomes G = 1, ..., n and H = 1, ..., n
Find all significant clusters of at least k homologs in close proximity in both genomes.
14Outline
- What formalisms do we need to address these questions?
  - Definitions: formulate a cluster definition
  - Algorithms: identify clusters in real data
  - Statistics: assess the significance of one or more clusters
- Reference set scenario
- Whole genome comparison
- Conclusion
15Why develop a formal statistical model?
- Understand trends and verify that they match our expectations
- Choose parameters effectively
- Statistical tests for data analysis
Typically researchers use randomization tests to estimate statistical significance.
16Cluster Definitions
- An intuitive notion of a cluster is a group of genes
  - occurring in close proximity
  - neither gene content nor order is strictly conserved
- Algorithms and statistics require a formal definition.
  - What properties are desirable?
  - Do existing definitions have these properties?
17size 3 genes
- Possible Cluster Parameters
- size: number of red genes in the cluster
- Example: cluster size = 3
18length 6
- length: number of genes between first and last red genes
- Example: cluster length = 6
20density 6/11
- density: proportion of red genes (size/length)
- Example: density = 3/6 = 0.5
22gap 4 genes
- compactness: maximum gap between adjacent red genes
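The four parameters can be computed directly from the positions of the red genes. The sketch below is a minimal illustration, not code from the talk. It assumes that length counts every gene from the first to the last red gene inclusive, and that a gap is the number of non-red genes between two adjacent red genes. The function name and the example positions are hypothetical.

```python
def cluster_parameters(red_positions):
    """Return size, length, density and max gap for one cluster.

    red_positions: 0-based genome indices of the red genes in the cluster.
    Assumptions: length includes both end genes; a gap is the number of
    non-red genes between two adjacent red genes.
    """
    pos = sorted(red_positions)
    size = len(pos)
    length = pos[-1] - pos[0] + 1
    gaps = [b - a - 1 for a, b in zip(pos, pos[1:])]
    return {
        "size": size,
        "length": length,
        "density": size / length,
        "max_gap": max(gaps, default=0),
    }


# Hypothetical cluster: red genes at positions 2, 4 and 7.
params = cluster_parameters([2, 4, 7])
```

For these positions the size is 3, the length is 6 and the density is 0.5, the same values as the examples on the slides; the largest gap is 2.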
23Max-Gap Cluster
gap ≤ g
- Commonly used in analysis of genomic data
- Desirable properties
  - Ensures minimum local density
  - Extensible: doesn't artificially limit cluster length
  - Disjoint: clusters will not overlap
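In a single genome, max-gap clusters of red genes can be found with one left-to-right scan. The following is a minimal sketch under an assumed gap convention (a gap is the number of non-red genes between adjacent red genes); the function name and the example genome are hypothetical.

```python
def max_gap_clusters(is_red, g):
    """Split the red genes of one genome into maximal max-gap clusters.

    is_red: list of booleans, one per gene in genome order.
    g: maximum number of non-red genes allowed between adjacent red genes
       (assumed gap convention).
    Returns a list of clusters, each a list of 0-based positions.
    """
    clusters = []
    current = []
    for position, red in enumerate(is_red):
        if not red:
            continue
        # Start a new cluster when the gap to the previous red gene exceeds g.
        if current and position - current[-1] - 1 > g:
            clusters.append(current)
            current = []
        current.append(position)
    if current:
        clusters.append(current)
    return clusters


# Hypothetical genome with red genes at positions 0, 2, 6, 7 and 9.
genome = [True, False, True, False, False, False, True, True, False, True]
clusters = max_gap_clusters(genome, g=2)
```

Each red gene lands in exactly one run, so the clusters are disjoint, and a run keeps growing for as long as the next red gene is within g, so cluster length is not capped.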
24Outline
- Formalisms
- Reference set scenario
- Whole genome comparison
- Conclusion
25Formalisms
- Definitions: formulate a cluster definition
- Algorithms: identify clusters in real data
- Statistics: assess the significance of a cluster
26A Statistical Model
- Given
  - a genome G = 1, ..., n of n unique genes
  - a set of m reference genes
  - a maximum-gap size g
- Null hypothesis
  - Random gene order
- Alternate hypotheses
  - Evolutionary history
  - Functional selection
27Statistics of Max-Gap Gene Clusters
- We provide analytical and dynamic programming solutions to determine cluster significance exactly for the reference set scenario
- Hoberman, Sankoff and Durand. In Proceedings of the RECOMB Satellite Workshop on Comparative Genomics, J. Lagergren, ed., Lecture Notes in Bioinformatics, Springer Verlag, in press.
- Hoberman, Sankoff and Durand. Submitted to RECOMB 2005.
28Test Statistic: Complete Clusters
- The probability of observing all m reference
genes in a max-gap cluster in G
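Under the null hypothesis of random gene order, the m reference genes occupy a uniformly random set of m of the n positions, so this probability is a ratio of counts. The sketch below is a simple counting approach written for illustration; it is not necessarily the analytical or dynamic programming solution of the cited papers. It assumes a linear genome and that a gap is the number of genes lying between two adjacent reference genes. The function name is hypothetical.

```python
from math import comb


def prob_complete_cluster(n, m, g):
    """P(all m reference genes form one max-gap cluster) under random order.

    Counts the m-subsets of n positions whose adjacent members are separated
    by at most g other genes, then divides by C(n, m).
    """
    if m <= 1:
        return 1.0
    max_slack = n - m  # largest total gap that still fits in the genome
    # ways[s] = number of ways to choose the gaps so far (each 0..g) with sum s
    ways = [1]
    for _ in range(m - 1):
        new = [0] * min(len(ways) + g, max_slack + 1)
        for s, count in enumerate(ways):
            for gap in range(min(g, len(new) - 1 - s) + 1):
                new[s + gap] += count
        ways = new
    # A cluster with total gap s spans m + s genes and can start at
    # n - (m + s) + 1 different positions.
    favourable = sum(count * (max_slack - s + 1) for s, count in enumerate(ways))
    return favourable / comb(n, m)
```

As a small check, with n = 3, m = 2 and g = 0 only the two adjacent pairs out of three possible pairs qualify, so the function returns 2/3.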
29Test Statistic: Incomplete Clusters
- The probability of observing at least h of the m
reference genes in a max-gap cluster in G
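This probability can also be approximated with a randomization test. The sketch below is a Monte Carlo estimate, not the exact solution referred to in the talk. It assumes a linear genome, m of at least 1, and that a gap is the number of genes between two adjacent reference genes; the function names, the number of trials and the seed are hypothetical choices.

```python
import random


def largest_cluster_size(positions, g):
    """Size of the largest max-gap cluster among sorted reference positions."""
    best = run = 1
    for a, b in zip(positions, positions[1:]):
        # Extend the current run if the gap is small enough, else restart.
        run = run + 1 if b - a - 1 <= g else 1
        best = max(best, run)
    return best


def estimate_prob_incomplete(n, m, g, h, trials=10_000, seed=0):
    """Monte Carlo estimate of P(some max-gap cluster holds >= h of m genes)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Random gene order: the reference genes fall on a random position set.
        positions = sorted(rng.sample(range(n), m))
        if largest_cluster_size(positions, g) >= h:
            hits += 1
    return hits / trials
```

The estimate is only as precise as the number of trials allows, which is one reason an exact solution is preferable when very small significance levels are of interest.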
30 Cluster significance
n = 1000, m = 50
n = 500, h = m/2
- n: number of genes in each genome
- m: number of genes shared between the two genomes
- g: maximum allowed gap size
- h: size of cluster (e.g. number of red genes)
31Significant Parameter Values (α = 0.0001)
n = 500
33Outline
- Formalisms
- Reference set scenario
- Whole genome comparison
- Conclusion
34Formalisms
- Definitions: formulate a cluster definition
- Algorithms: identify clusters in real data
- Statistics: assess the significance of one or more clusters
35Whole genome comparison
g ≤ 10 (in each genome)
- Find all sets of genes that form max-gap clusters
in both genomes.
36Properties of Max-Gap Clusters for Whole Genome
Comparison
- Clusters are locally dense in both genomes
- Clusters are still guaranteed to be disjoint.
- The definition is symmetric with respect to genome
Most existing cluster algorithms are not
symmetric!
37Algorithms Finding Max-Gap Clusters
- If g = 2
  - There is no valid max-gap cluster of size two or three
  - There is a valid max-gap cluster of size four
38Algorithms Finding Max-Gap Clusters
- A consequence of this is that a greedy iterative approach will not find all max-gap clusters
- Specifically, larger clusters that don't contain smaller ones will not be found
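A brute-force check on a toy pair of genomes shows how a larger cluster can exist without any smaller cluster inside it. This is a minimal sketch, not the divide-and-conquer algorithm and not the example from the slides: the two genomes are hypothetical, a gap is assumed to be the number of genes between two adjacent cluster members, and the enumeration is exponential in the number of shared genes.

```python
from itertools import combinations


def is_chain(genes, order, g):
    """True if the genes satisfy the max-gap constraint in the given order."""
    index = {gene: i for i, gene in enumerate(order)}
    pos = sorted(index[gene] for gene in genes)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def clusters_in_both(genome_g, genome_h, g, size):
    """All gene sets of the given size that satisfy the gap constraint in both."""
    shared = sorted(set(genome_g) & set(genome_h))
    return [
        set(c)
        for c in combinations(shared, size)
        if is_chain(c, genome_g, g) and is_chain(c, genome_h, g)
    ]


# Hypothetical genomes over the same four genes.
G = ["a", "b", "c", "d"]
H = ["c", "a", "d", "b"]
found = {k: clusters_in_both(G, H, g=0, size=k) for k in (2, 3, 4)}
```

With g = 0, no pair or triple is adjacent in both genomes, yet all four genes together satisfy the constraint in both. A procedure that grows clusters one gene at a time from smaller valid clusters would therefore never reach the set of four.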
39Algorithms Finding Max-Gap Clusters
- There is an efficient divide-and-conquer algorithm to find all max-gap clusters (Bergeron et al., 2002)
- Since algorithms are generally not stated formally in application papers, we don't know whether people are actually getting what they think they're getting
40Formalisms
- Definitions: formulate a cluster definition
- Algorithms: identify clusters in real data
- Statistics: assess the significance of one or more clusters
Work in Progress
42Statistics: Whole genome comparison
g ≤ 10 (in each genome)
- What is the probability that at least k genes form a max-gap cluster in both genomes?
- Assuming identical gene content, the probability of finding a max-gap cluster of size at least k is always one!
43An Example
Example: g = 1
45An Example
Example: g = 1
A cluster of size k does not necessarily contain a cluster of size k-1
49An Example
Example: g = 1
- When gene content is identical, there will always be a cluster of size n
- Therefore, for all k, there will always be a cluster of size at least k
- Therefore, the probability of finding a cluster of size at least k is always one!
50Relaxing the Assumption of Identical Gene Content
- Assume only m of the n genes in each genome are shared
- If the longest run of non-shared genes is less than g then we are still guaranteed to find a complete cluster
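The reason is that two adjacent shared genes are separated only by non-shared genes, so every gap inside the complete cluster is a run of non-shared genes. The sketch below checks the stated sufficient condition in both genomes; the function names and the boolean-list representation are hypothetical.

```python
def longest_nonshared_run(is_shared):
    """Length of the longest run of consecutive non-shared genes."""
    longest = run = 0
    for shared in is_shared:
        run = 0 if shared else run + 1
        longest = max(longest, run)
    return longest


def complete_cluster_guaranteed(shared_in_g, shared_in_h, g):
    """Sufficient condition from the text: every non-shared run is shorter than g.

    shared_in_g, shared_in_h: one boolean per gene, in genome order,
    marking whether the gene is shared with the other genome.
    """
    return (
        longest_nonshared_run(shared_in_g) < g
        and longest_nonshared_run(shared_in_h) < g
    )
```

For example, `complete_cluster_guaranteed([True, False, False, True], [True, False, True, True], 3)` is true because the longest non-shared runs are 2 and 1, while the same call with g = 2 is false.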
51- More generally
- Simulations of randomly ordered genomes show that
large clusters may be very likely to occur merely
by chance
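A simulation of this kind can be sketched in a few lines. The code below estimates the probability that all m shared genes form a single max-gap cluster in both genomes. It is an illustration under stated assumptions, not the simulation used in the talk: the shared genes are placed on independent, uniformly random position sets in the two genomes, a gap is the number of genes between two adjacent shared genes, and the function names, trial count and seed are hypothetical.

```python
import random


def all_shared_genes_cluster(n, m, g, rng):
    """Place m shared genes at random in one genome; True if all gaps <= g."""
    pos = sorted(rng.sample(range(n), m))
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def estimate_complete_cluster_prob(n, m, g, trials=2000, seed=0):
    """Estimate P(all m shared genes form a max-gap cluster in both genomes)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # The two genomes are shuffled independently of each other.
        in_first = all_shared_genes_cluster(n, m, g, rng)
        in_second = all_shared_genes_cluster(n, m, g, rng)
        if in_first and in_second:
            hits += 1
    return hits / trials
```

When m equals n the estimate is exactly 1.0 for any g, which matches the observation that identical gene content always yields a cluster of size n.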
52Unexpected Statistical Trends
- There can be a significant probability of finding a cluster that includes all homologous gene pairs
- The significance of a cluster of size k can be less than that of a cluster of size k-1
- Probabilities are not monotonic
- Large clusters may not be significant
n = 1000, m = 250, g = 20
Probability of a cluster of size 250: 50%
53Outline
- Formalisms
- Reference set scenario
- Whole genome comparison
- Conclusion
54Clusters Are Used in Many Other Applications
55Max-Gap Clusters are Especially Common
56- Formal statistical models allow us to
  - understand trends and verify that they match our expectations
  - choose parameters effectively
  - conduct statistical tests for data analysis
- Formal statistical models require
  - a formal cluster definition
  - a search procedure to find clusters
- These issues are more complicated than they might seem!
57Summary
- Results: statistical tests of significance for max-gap clusters
  - Reference set scenario
  - Genome comparison (work in progress)
- We need to
  - explicitly consider the cluster properties we would like our definitions to satisfy
  - rigorously evaluate whether our definition meets these requirements
  - carefully prove that our search procedures match our stated definitions