1
The Statistical Significance of Max-gap Clusters

Rose Hoberman
David Sankoff
Dannie Durand

2
Glycolysis Pathway
Glycolysis Clusters
Clostridium acetobutylicum
3
Gene Clustering for Functional Inference in
Bacterial Genomes

The Use of Gene Clusters to Infer Functional
Coupling, Overbeek et al., PNAS 96 2896-2901,
1999.

4
original genome
large scale duplication or speciation event
rearrangement, mutation
Gene content and order are preserved
Similarity in gene content
Neither content nor order is strictly preserved
7
Evolution of gene order conservation in
prokaryotes. Tamames, Genome Biology 2, 2001.
Gene insertion/loss
Local rearrangement
9
Two Possible Questions

Given a set of genes that we believe are
functionally related, determine if they cluster
together spatially more than we would expect by
chance
Identify all significantly conserved gene
clusters as a starting point for making
functional inferences

Reference set scenario
Whole genome comparison
10
Reference Set Scenario
12
Reference Set Scenario

Model of a genome:
G = (1, ..., n), an ordered set of n unique genes
assume genes do not overlap
chromosome breaks ignored
Reference gene scenario:
m genes of interest (in red) are pre-specified
want to find clusters of (a subset of) these
genes

13
Whole Genome Scenario
Given two genomes G = (1, ..., n) and H = (1, ..., n),
find all significant clusters of at least k
homologs in close proximity in both genomes.
14
Outline

What formalisms do we need to address these
questions?
Definitions: formulate a cluster definition
Algorithms: identify clusters in real data
Statistics: assess the significance of one or
more clusters
Reference set scenario
Whole genome comparison
Conclusion

15
Why develop a formal statistical model?

Understand trends and verify that they match our
expectations
Choose parameters effectively
Statistical tests for data analysis

Typically researchers use randomization tests to
estimate statistical significance
16
Cluster Definitions

An intuitive notion of a cluster is a group of
genes
occurring in close proximity
neither gene content nor order is strictly
conserved
Algorithms and statistics require a formal
definition.
What properties are desirable?
Do existing definitions have these properties?

17
size = 3 genes

Possible Cluster Parameters
size: number of red genes in the cluster
Example: cluster size = 3

18
length = 6

Possible Cluster Parameters
size: number of red genes in the cluster
Example: cluster size = 3
length: number of genes between first and last
red genes
Example: cluster length = 6

20
density = 6/11

Possible Cluster Parameters
size: number of red genes in the cluster
Example: cluster size = 3
length: number of genes between first and last
red genes
Example: cluster length = 6
density: proportion of red genes (size/length)
Example: density = 0.5

22
gap = 4 genes

Possible Cluster Parameters
size: number of red genes in the cluster
Example: cluster size = 3
length: number of genes between first and last
red genes
Example: cluster length = 6
density: proportion of red genes (size/length)
compactness: maximum gap between adjacent red
genes
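The four parameters listed above can be computed directly from the positions of the red genes. The sketch below is illustrative only: the function name and the use of integer gene positions are assumptions, and length is counted inclusively (first red gene through last red gene), which is the reading that reproduces the example values of size 3, length 6 and density 0.5.

```python
def cluster_parameters(red_positions):
    """Summarise one candidate cluster given the positions of its red genes.

    Assumption: positions are integer indices along a single chromosome.
    """
    ordered = sorted(red_positions)
    size = len(ordered)
    # Inclusive span from the first to the last red gene.
    length = ordered[-1] - ordered[0] + 1
    # A gap is the number of genes strictly between two adjacent red genes.
    gaps = [b - a - 1 for a, b in zip(ordered, ordered[1:])]
    return {
        "size": size,
        "length": length,
        "density": size / length,
        "max_gap": max(gaps, default=0),
    }


# Hypothetical example: red genes at positions 2, 4 and 7.
example = cluster_parameters([2, 4, 7])
```

Here `example` has size 3, length 6, density 0.5 and a maximum gap of 2.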

23
Max-Gap Cluster
gap ≤ g

Commonly used in analysis of genomic data
Desirable properties:
Ensures minimum local density
Extensible: doesn't artificially limit cluster
length
Disjoint: clusters will not overlap
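For a single genome, the max-gap definition can be sketched as one pass over the sorted red-gene positions, starting a new cluster whenever the gap exceeds g. The function name is hypothetical, and the gap is assumed to be the number of genes strictly between two adjacent red genes.

```python
def max_gap_clusters(red_positions, g):
    """Partition red-gene positions into maximal runs whose adjacent gaps are <= g.

    Assumption: gap = number of genes strictly between two adjacent red genes.
    """
    clusters = []
    for p in sorted(red_positions):
        if clusters and p - clusters[-1][-1] - 1 <= g:
            clusters[-1].append(p)   # extend the current cluster
        else:
            clusters.append([p])     # gap too large: start a new cluster
    return clusters


# Hypothetical example with g = 1.
clusters = max_gap_clusters([1, 2, 4, 9, 10, 15], g=1)
```

Because every position lands in exactly one run, the resulting clusters are disjoint, and a run keeps growing for as long as each successive gap stays within g, so no limit on cluster length is imposed.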

24
Outline

Formalisms
Reference set scenario
Whole genome comparison
Conclusion

25
Formalisms

Definitions: formulate a cluster definition
Algorithms: identify clusters in real data
Statistics: assess the significance of a cluster

26
A Statistical Model

Given:
a genome G = (1, ..., n) of n unique genes
a set of m reference genes
a maximum-gap size g
Null hypothesis:
Random gene order
Alternate hypotheses:
Evolutionary history
Functional selection

27
Statistics of Max-Gap Gene Clusters

We provide
analytical and dynamic programming solutions
to determine cluster significance exactly
for the reference set scenario.
Hoberman, Sankoff and Durand. In Proceedings of
the RECOMB Satellite Workshop on Comparative
Genomics, J. Lagergren, ed., Lecture Notes in
Bioinformatics, Springer Verlag, in press.
Hoberman, Sankoff and Durand. Submitted to RECOMB
2005.

28
Test Statistic: Complete Clusters

The probability of observing all m reference
genes in a max-gap cluster in G
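Under the null hypothesis of random gene order, the m reference genes occupy a uniformly random set of m positions out of n, so the complete-cluster probability can be obtained by counting. The following is a minimal counting sketch under that assumption, not necessarily the analytical or dynamic programming solution of the cited papers; the function name is hypothetical, and a gap is taken to be the number of genes strictly between adjacent reference genes.

```python
from fractions import Fraction
from math import comb


def prob_complete_cluster(n, m, g):
    """Exact probability that all m reference genes form one max-gap cluster.

    Assumes a uniformly random choice of m positions out of n.
    """
    # ways[s] = number of ways to pick the m-1 distances between adjacent
    # reference genes (each between 1 and g+1) so that they sum to span s.
    ways = {0: 1}
    for _ in range(m - 1):
        nxt = {}
        for s, count in ways.items():
            for d in range(1, g + 2):
                if s + d < n:  # the cluster must still fit in the genome
                    nxt[s + d] = nxt.get(s + d, 0) + count
        ways = nxt
    # A cluster spanning s can start at any of n - s positions.
    favourable = sum(count * (n - s) for s, count in ways.items())
    return Fraction(favourable, comb(n, m))


p_small = prob_complete_cluster(5, 3, 1)
```

For n = 5, m = 3, g = 1, eight of the ten possible placements keep every gap within one gene, so `p_small` is 4/5.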

29
Test Statistic: Incomplete Clusters

The probability of observing at least h of the m
reference genes in a max-gap cluster in G
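This statistic can also be approximated with a randomization test of the kind researchers typically use. The sketch below is a hedged Monte Carlo estimate, not the exact method of the talk: function names, the trial count and the seed are illustrative assumptions, and random gene order is modelled by drawing m distinct positions uniformly from n.

```python
import random


def largest_cluster_size(positions, g):
    """Size of the largest max-gap cluster among the given gene positions."""
    ordered = sorted(positions)
    best = run = 1
    for prev, cur in zip(ordered, ordered[1:]):
        # Extend the current run if the gap is small enough, else restart.
        run = run + 1 if cur - prev - 1 <= g else 1
        best = max(best, run)
    return best


def estimate_p_value(n, m, g, h, trials=10_000, seed=0):
    """Estimate P(at least h of m reference genes form a max-gap cluster)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        positions = rng.sample(range(n), m)  # random gene order (null model)
        if largest_cluster_size(positions, g) >= h:
            hits += 1
    return hits / trials
```

A call such as `estimate_p_value(500, 50, 5, 25)` returns an estimate whose precision is limited by the number of trials, which is one reason exact solutions are attractive for very small significance levels.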

30
Cluster significance
n = 1000, m = 50
n = 500, h = m/2

n: number of genes in each genome
m: number of genes shared between the two
genomes
g: maximum allowed gap size
h: size of cluster (e.g. number of red genes)

31
Significant Parameter Values (α = 0.0001)
n = 500
33
Outline

Formalisms
Reference set scenario
Whole genome comparison
Conclusion

34
Formalisms

Definitions: formulate a cluster definition
Algorithms: identify clusters in real data
Statistics: assess the significance of one or
more clusters

35
Whole genome comparison
g ≤ 10
g ≤ 10

Find all sets of genes that form max-gap clusters
in both genomes.

36
Properties of Max-Gap Clusters for Whole Genome
Comparison

Clusters are locally dense in both genomes
Clusters are still guaranteed to be disjoint.
The definition is symmetric with respect to genome

Most existing cluster algorithms are not
symmetric!
37
Algorithms Finding Max-Gap Clusters

If g = 2:
There is no valid max-gap cluster of size two or
three
There is a valid max-gap cluster of size four
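A small constructed instance shows how this can happen. The positions below are a hypothetical example, not the figure from the slide: four shared genes sit three positions apart in each genome (two unshared genes between neighbours), but in a different order. The brute-force check only tests whether a gene set satisfies the gap constraint in both genomes; it does not test maximality.

```python
from itertools import combinations


def is_chain(genes, pos, g):
    """True if adjacent genes of the set are separated by at most g genes."""
    p = sorted(pos[x] for x in genes)
    return all(b - a - 1 <= g for a, b in zip(p, p[1:]))


def common_chains(pos_g, pos_h, g, k):
    """All size-k gene sets that satisfy the gap constraint in both genomes."""
    shared = sorted(set(pos_g) & set(pos_h))
    return [set(s) for s in combinations(shared, k)
            if is_chain(s, pos_g, g) and is_chain(s, pos_h, g)]


# Hypothetical positions: gene order 1 2 3 4 in G and 2 4 1 3 in H.
pos_g = {1: 0, 2: 3, 3: 6, 4: 9}
pos_h = {2: 0, 4: 3, 1: 6, 3: 9}

by_size = {k: common_chains(pos_g, pos_h, 2, k) for k in (2, 3, 4)}
```

With g = 2, `by_size[2]` and `by_size[3]` are empty while `by_size[4]` contains the full set {1, 2, 3, 4}: every pair that is close in one genome is far apart in the other, yet all four genes together form a chain in both.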

38
Algorithms Finding Max-Gap Clusters

A consequence of this is that a greedy iterative
approach will not find all max-gap clusters.
Specifically, larger clusters that don't contain
smaller ones will not be found.

39
Algorithms Finding Max-Gap Clusters

There is an efficient divide-and-conquer
algorithm to find all max-gap clusters (Bergeron
et al., 2002).
Since algorithms are generally not stated
formally in application papers, we don't know
whether people are actually getting what they
think they're getting.
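The divide-and-conquer idea can be sketched as follows: split the current gene set wherever a gap exceeds g in either genome, and recurse on the pieces until a set survives in both. This is a simple, unoptimised illustration under the assumption that each genome is given as a gene-to-position mapping; it is not the efficient algorithm of Bergeron et al., and all names are hypothetical.

```python
def split_runs(genes, pos, g):
    """Split a non-empty gene list into runs with adjacent gaps <= g in one genome."""
    ordered = sorted(genes, key=pos.__getitem__)
    runs, current = [], [ordered[0]]
    for prev, cur in zip(ordered, ordered[1:]):
        if pos[cur] - pos[prev] - 1 > g:
            runs.append(current)
            current = []
        current.append(cur)
    runs.append(current)
    return runs


def max_gap_clusters(pos_g, pos_h, g, min_size=2):
    """Maximal gene sets that satisfy the gap constraint in both genomes."""
    stack = [sorted(set(pos_g) & set(pos_h))]
    clusters = []
    while stack:
        genes = stack.pop()
        if not genes:
            continue
        for pos in (pos_g, pos_h):
            runs = split_runs(genes, pos, g)
            if len(runs) > 1:       # the set breaks apart in this genome
                stack.extend(runs)  # examine each piece separately
                break
        else:                       # no split in either genome
            if len(genes) >= min_size:
                clusters.append(set(genes))
    return clusters


# Hypothetical example: adjacent pairs a,b and c,d are far apart in G only.
found = max_gap_clusters({"a": 0, "b": 1, "c": 10, "d": 11},
                         {"a": 0, "b": 1, "c": 2, "d": 3}, g=1)
```

Because the procedure works top-down from the full shared gene set, it can report a large cluster even when no smaller cluster exists inside it, which a bottom-up greedy extension would miss.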

40
Formalisms

Definitions: formulate a cluster definition
Algorithms: identify clusters in real data
Statistics: assess the significance of one or
more clusters

Work in Progress
42
Statistics: Whole genome comparison
g ≤ 10
g ≤ 10

What is the probability that at least k genes
form a max-gap cluster in both genomes?
Assuming identical gene content, the probability
of finding a max-gap cluster of size at least k
is always one!

43
An Example

Example: g = 1
45
An Example
Example: g = 1
A cluster of size k does not necessarily contain
a cluster of size k-1
49
An Example
Example: g = 1

When gene content is identical, there will always
be a cluster of size n
Therefore, for all k, there will always be a
cluster of size at least k
Therefore, the probability of finding a cluster
of size at least k is always one!

50
Relaxing the Assumption of Identical Gene Content

Assume only m of the n genes in each genome are
shared
If the longest run of non-shared genes is less
than g then we are still guaranteed to find a
complete cluster

More generally
Simulations of randomly ordered genomes show that
large clusters may be very likely to occur merely
by chance
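A simulation of this kind can be sketched as below. It is a hedged illustration, not the authors' simulation code: the function names, trial count and seed are assumptions, each genome is modelled by placing the m shared genes at random distinct positions among n, and clusters are found by repeatedly splitting gene sets at gaps larger than g.

```python
import random


def split_runs(genes, pos, g):
    """Split a non-empty gene list into runs with adjacent gaps <= g in one genome."""
    ordered = sorted(genes, key=pos.__getitem__)
    runs, current = [], [ordered[0]]
    for prev, cur in zip(ordered, ordered[1:]):
        if pos[cur] - pos[prev] - 1 > g:
            runs.append(current)
            current = []
        current.append(cur)
    runs.append(current)
    return runs


def max_gap_clusters(pos_g, pos_h, g):
    """Maximal gene sets that satisfy the gap constraint in both genomes."""
    stack, clusters = [sorted(set(pos_g) & set(pos_h))], []
    while stack:
        genes = stack.pop()
        if not genes:
            continue
        for pos in (pos_g, pos_h):
            runs = split_runs(genes, pos, g)
            if len(runs) > 1:
                stack.extend(runs)
                break
        else:
            clusters.append(set(genes))
    return clusters


def estimate_cluster_probability(n, m, g, k, trials=200, seed=0):
    """Estimate P(some max-gap cluster has size >= k) for two random genomes."""
    rng = random.Random(seed)
    shared = list(range(m))
    hits = 0
    for _ in range(trials):
        # Independent random placement of the shared genes in each genome.
        pos_g = dict(zip(shared, rng.sample(range(n), m)))
        pos_h = dict(zip(shared, rng.sample(range(n), m)))
        if any(len(c) >= k for c in max_gap_clusters(pos_g, pos_h, g)):
            hits += 1
    return hits / trials
```

With identical gene content (m equal to n) every trial yields a cluster containing all n genes, so the estimate is 1 for any k up to n; for m smaller than n the estimate depends on n, m and g and is subject to sampling error.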

52
Unexpected Statistical Trends

There can be a significant probability of finding
a cluster that includes all homologous gene pairs
The significance of a cluster of size k can be
less than that of a cluster of size k-1
Probabilities are not monotonic
Large clusters may not be significant

n = 1000, m = 250, g = 20
Probability of a cluster of size 250: 50%
53
Outline

Formalisms
Reference set scenario
Whole genome comparison
Conclusion

54
Clusters Are Used in Many Other Applications
55
Max-Gap Clusters are Especially Common
56

Formal statistical models allow us to:
understand trends and verify that they match our
expectations
choose parameters effectively
conduct statistical tests for data analysis
Formal statistical models require:
a formal cluster definition
a search procedure to find clusters
These issues are more complicated than they might
seem!

57
Summary

Results: statistical tests of significance for
max-gap clusters
Reference set scenario
Genome comparison (work in progress)
We need to:
explicitly consider the cluster properties we
would like our definitions to satisfy
rigorously evaluate whether our definition meets
these requirements
carefully prove that our search procedures match
our stated definitions