1
The Statistical Significance of Max-gap Clusters
  • Rose Hoberman
  • David Sankoff
  • Dannie Durand

2
Glycolysis Pathway
Glycolysis Clusters
Clostridium acetobutylicum
3
Gene Clustering for Functional Inference in Bacterial Genomes
  • The Use of Gene Clusters to Infer Functional Coupling. Overbeek et al., PNAS 96:2896-2901, 1999.

4
Original genome
Large-scale duplication or speciation event
Rearrangement, mutation
Gene content and order are preserved
Similarity in gene content
Neither content nor order is strictly preserved
5
Evolution of gene order conservation in prokaryotes. Tamames, Genome Biology 2, 2001.
6
Evolution of gene order conservation in prokaryotes. Tamames, Genome Biology 2, 2001.
Gene insertion/loss
7
Evolution of gene order conservation in prokaryotes. Tamames, Genome Biology 2, 2001.
Gene insertion/loss
Local rearrangement
8
Two Possible Questions
  • Given a set of genes that we believe are
    functionally related, determine if they cluster
    together spatially more than we would expect by
    chance
  • Identify all significantly conserved gene
    clusters as a starting point for making
    functional inferences

9
Two Possible Questions
  • Given a set of genes that we believe are
    functionally related, determine if they cluster
    together spatially more than we would expect by
    chance
  • Identify all significantly conserved gene
    clusters as a starting point for making
    functional inferences

Reference set scenario
Whole genome comparison
10
Reference Set Scenario
11
Reference Set Scenario
  • Model of a genome
  • G = 1, ..., n: an ordered set of n unique genes
  • assume genes do not overlap
  • chromosome breaks ignored

12
Reference Set Scenario
  • Model of a genome
  • G = 1, ..., n: an ordered set of n unique genes
  • assume genes do not overlap
  • chromosome breaks ignored
  • Reference gene scenario
  • m genes of interest (in red) are pre-specified
  • want to find clusters of (a subset of) these
    genes

13
Whole Genome Scenario
Given two genomes G = 1, ..., n and H = 1, ..., n,
find all significant clusters of at least k
homologs in close proximity in both genomes.
14
Outline
  • What formalisms do we need to address these
    questions?
  • Definitions: formulate a cluster definition
  • Algorithms: identify clusters in real data
  • Statistics: assess the significance of one or
    more clusters
  • Reference set scenario
  • Whole genome comparison
  • Conclusion

15
Why develop a formal statistical model?
  • Understand trends and verify that they match our
    expectations
  • Choose parameters effectively
  • Statistical tests for data analysis

Typically researchers use randomization tests to
estimate statistical significance
16
Cluster Definitions
  • An intuitive notion of a cluster is a group of
    genes
  • occurring in close proximity
  • neither gene content nor order is strictly
    conserved
  • Algorithms and statistics require a formal
    definition.
  • What properties are desirable?
  • Do existing definitions have these properties?

17
size = 3 genes
  • Possible Cluster Parameters
  • size: number of red genes in the cluster
  • Example: cluster size = 3

18
length = 6
  • Possible Cluster Parameters
  • size: number of red genes in the cluster
  • Example: cluster size = 3
  • length: number of genes between first and last
    red genes
  • Example: cluster length = 6

19
length = 6
  • Possible Cluster Parameters
  • size: number of red genes in the cluster
  • Example: cluster size = 3
  • length: number of genes between first and last
    red genes
  • Example: cluster length = 6

20
density = 6/11
  • Possible Cluster Parameters
  • size: number of red genes in the cluster
  • Example: cluster size = 3
  • length: number of genes between first and last
    red genes
  • Example: cluster length = 6
  • density: proportion of red genes (size/length)
  • Example: density = 0.5

21
density = 6/11
  • Possible Cluster Parameters
  • size: number of red genes in the cluster
  • Example: cluster size = 3
  • length: number of genes between first and last
    red genes
  • Example: cluster length = 6
  • density: proportion of red genes (size/length)
  • Example: density = 0.5

22
gap = 4 genes
  • Possible Cluster Parameters
  • size: number of red genes in the cluster
  • Example: cluster size = 3
  • length: number of genes between first and last
    red genes
  • Example: cluster length = 6
  • density: proportion of red genes (size/length)
  • compactness: maximum gap between adjacent red
    genes
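
The four parameters listed above can be computed directly from a window of a genome. The following is a minimal sketch, not code from the presentation. It assumes that length counts the genes from the first through the last red gene inclusive (the convention under which size 3 and length 6 give density 0.5), and that a gap is the number of non-red genes between two adjacent red genes. The function name and the example window are hypothetical.

```python
def cluster_parameters(is_red):
    """Return size, length, density and maximum gap for the red genes in a window.

    is_red: list of booleans, one per gene in genome order (True = red gene).
    """
    positions = [i for i, red in enumerate(is_red) if red]
    if not positions:
        raise ValueError("window contains no red genes")
    size = len(positions)
    # Assumed convention: length spans the first through the last red gene, inclusive.
    length = positions[-1] - positions[0] + 1
    density = size / length
    # Gap: number of non-red genes between two adjacent red genes.
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    max_gap = max(gaps, default=0)
    return {"size": size, "length": length, "density": density, "max_gap": max_gap}


# Hypothetical window: three red genes spread over six positions.
example = [True, False, False, True, False, True]
params = cluster_parameters(example)
```

For this window the sketch reports size 3, length 6, density 0.5 and a maximum gap of 2.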

23
Max-Gap Cluster
gap ≤ g
  • Commonly used in analysis of genomic data
  • Desirable properties
  • Ensures minimum local density
  • Extensible: doesn't artificially limit cluster
    length
  • Disjoint: clusters will not overlap
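
In the reference set scenario, max-gap clusters can be found with a single left-to-right scan. The sketch below is an illustration under the assumption that a gap is the number of genes lying between two adjacent red genes; the function name and the example positions are hypothetical.

```python
def max_gap_clusters(red_positions, g):
    """Partition red-gene positions into maximal runs whose adjacent gaps are <= g.

    red_positions: indices of the reference (red) genes in the genome.
    g: maximum number of intervening genes allowed between adjacent red genes.
    """
    clusters = []
    for pos in sorted(red_positions):
        # Extend the current cluster while the gap to the previous red gene is small enough.
        if clusters and pos - clusters[-1][-1] - 1 <= g:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Hypothetical red-gene positions in a genome, with g = 2.
clusters = max_gap_clusters([2, 3, 6, 12, 14, 30], g=2)
```

The scan reflects the properties listed above: a cluster keeps growing as long as each new gap is at most g (extensible), and every red gene ends up in exactly one cluster (disjoint).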

24
Outline
  • Formalisms
  • Reference set scenario
  • Whole genome comparison
  • Conclusion

25
Formalisms
  • Definitions: formulate a cluster definition
  • Algorithms: identify clusters in real data
  • Statistics: assess the significance of a cluster

26
A Statistical Model
  • Given
  • a genome G = 1, ..., n of unique genes
  • a set of m reference genes
  • a maximum-gap size g
  • Null hypothesis
  • Random gene order
  • Alternate hypotheses
  • Evolutionary history
  • Functional selection

27
Statistics of Max-Gap Gene Clusters
  • We provide
  • analytical and dynamic programming solutions
  • to determine cluster significance exactly
  • for the reference set scenario
  • Hoberman, Sankoff and Durand. In Proceedings of
    the RECOMB Satellite Workshop on Comparative
    Genomics, J. Lagergren, ed., Lecture Notes in
    Bioinformatics, Springer Verlag, in press.
  • Hoberman, Sankoff and Durand. Submitted to RECOMB
    2005.

28
Test Statistic: Complete Clusters
  • The probability of observing all m reference
    genes in a max-gap cluster in G
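
For small inputs, this probability can be computed by counting. The sketch below is not the analytical or dynamic programming solution of the cited papers; it is a simple counting argument under the assumption that, with random gene order, the m reference genes occupy a uniformly random set of m positions out of n. All function names are hypothetical, and a brute-force enumeration is included only as a cross-check.

```python
from itertools import combinations
from math import comb


def count_complete_clusters(n, m, g):
    """Count m-subsets of positions 0..n-1 whose adjacent gaps are all <= g."""
    if m < 1 or m > n:
        return 0
    # ways[s] = number of gap patterns for the genes placed so far with total span s.
    ways = [0] * n
    ways[0] = 1  # a single gene has span 0
    for _ in range(m - 1):
        nxt = [0] * n
        for s, w in enumerate(ways):
            if w:
                for step in range(1, g + 2):  # step = gap + 1
                    if s + step < n:
                        nxt[s + step] += w
        ways = nxt
    # A pattern with span s can start at n - s different positions.
    return sum(w * (n - s) for s, w in enumerate(ways))


def prob_complete_cluster(n, m, g):
    """Probability that all m reference genes form one max-gap cluster."""
    return count_complete_clusters(n, m, g) / comb(n, m)


def brute_force_count(n, m, g):
    """Cross-check by enumerating every m-subset (feasible only for small n)."""
    return sum(
        all(b - a - 1 <= g for a, b in zip(c, c[1:]))
        for c in combinations(range(n), m)
    )
```

A call such as `prob_complete_cluster(1000, 50, 10)` then gives the chance of a complete cluster under this null model for the chosen n, m and g.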

29
Test Statistic: Incomplete Clusters
  • The probability of observing at least h of the m
    reference genes in a max-gap cluster in G
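
This statistic can also be estimated with a randomization test of the kind mentioned earlier. The sketch below is a Monte Carlo estimate, not the exact solution from the cited papers. It assumes that random gene order places the m reference genes at a uniformly random set of positions; the function names, the number of trials and the seed are hypothetical choices.

```python
import random


def largest_cluster_size(red_positions, g):
    """Size of the largest max-gap cluster among the given red-gene positions."""
    best = run = 0
    prev = None
    for pos in sorted(red_positions):
        # Continue the current run if the gap to the previous red gene is <= g.
        run = run + 1 if prev is not None and pos - prev - 1 <= g else 1
        best = max(best, run)
        prev = pos
    return best


def estimate_incomplete_cluster_prob(n, m, g, h, trials=10000, seed=0):
    """Estimate P(at least h of m reference genes fall in one max-gap cluster)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        red = rng.sample(range(n), m)  # random placement of the reference genes
        if largest_cluster_size(red, g) >= h:
            hits += 1
    return hits / trials
```

The estimate is only as precise as the number of trials allows, which is one reason an exact method is preferable at very small significance levels.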

30
Cluster significance
n = 1000, m = 50
n = 500, h = m/2
  • n: number of genes in each genome
  • m: number of genes shared between the two
    genomes
  • g: maximum allowed gap size
  • h: size of cluster (e.g. number of red genes)

31
Significant Parameter Values (α = 0.0001)
n = 500
32
Significant Parameter Values (α = 0.0001)
n = 500
33
Outline
  • Formalisms
  • Reference set scenario
  • Whole genome comparison
  • Conclusion

34
Formalisms
  • Definitions: formulate a cluster definition
  • Algorithms: identify clusters in real data
  • Statistics: assess the significance of one or
    more clusters

35
Whole genome comparison
g ≤ 10
g ≤ 10
  • Find all sets of genes that form max-gap clusters
    in both genomes.

36
Properties of Max-Gap Clusters for Whole Genome
Comparison
  • Clusters are locally dense in both genomes
  • Clusters are still guaranteed to be disjoint.
  • The definition is symmetric with respect to genome

Most existing cluster algorithms are not
symmetric!
37
Algorithms Finding Max-Gap Clusters
  • If g = 2
  • There is no valid max-gap cluster of size two or
    three
  • There is a valid max-gap cluster of size four

38
Algorithms Finding Max-Gap Clusters
  • A consequence of this is that a greedy iterative
    approach will not find all max-gap clusters
  • Specifically, larger clusters that don't contain
    smaller ones will not be found

39
Algorithms Finding Max-Gap Clusters
  • There is an efficient divide-and-conquer
    algorithm to find all max-gap clusters (Bergeron
    et al, 2002)
  • Since algorithms are generally not stated
    formally in application papers, we don't know
    whether people are actually getting what they
    think they're getting
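
A top-down search avoids the problem with greedy growth. The sketch below is a simplified recursive-splitting procedure in the spirit of divide and conquer; it is not the efficient algorithm of Bergeron et al. and can take quadratic time. It assumes each genome is a list of unique gene names and that a gap counts all intervening genes, shared or not. Function names are hypothetical.

```python
def split_by_gap(genes, position, g):
    """Split genes into runs whose adjacent gaps in one genome are all <= g."""
    ordered = sorted(genes, key=position.__getitem__)
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if position[cur] - position[prev] - 1 <= g:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


def max_gap_clusters_two_genomes(G, H, g):
    """Return maximal gene sets that satisfy the gap constraint in both genomes."""
    pos_g = {gene: i for i, gene in enumerate(G)}
    pos_h = {gene: i for i, gene in enumerate(H)}
    shared = [gene for gene in G if gene in pos_h]
    clusters = []
    stack = [shared] if shared else []
    while stack:
        genes = stack.pop()
        parts = split_by_gap(genes, pos_g, g)
        if len(parts) == 1:
            parts = split_by_gap(genes, pos_h, g)
            if len(parts) == 1:
                # No gap larger than g in either genome: this set is a cluster.
                clusters.append(set(genes))
                continue
        # A cluster can never straddle a gap larger than g, so recurse on the parts.
        stack.extend(parts)
    return clusters
```

Because the search starts from all shared genes and only splits where a gap exceeds g, it reports large clusters even when they contain no smaller valid cluster. Single shared genes are reported as clusters of size one.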

40
Formalisms
  • Definitions: formulate a cluster definition
  • Algorithms: identify clusters in real data
  • Statistics: assess the significance of one or
    more clusters

Work in Progress
41
Statistics: Whole genome comparison
g ≤ 10
g ≤ 10
  • What is the probability that at least k genes
    form a max-gap cluster in both genomes?

42
Statistics: Whole genome comparison
g ≤ 10
g ≤ 10
  • What is the probability that at least k genes
    form a max-gap cluster in both genomes?
  • Assuming identical gene content, the probability
    of finding a max-gap cluster of size at least k
    is always one!

43
An Example

Example: g = 1
44
An Example
Example: g = 1
45
An Example
Example: g = 1
A cluster of size k does not necessarily contain
a cluster of size k-1
46
An Example
Example: g = 1
47
An Example
Example: g = 1
  • When gene content is identical, there will always
    be a cluster of size n

48
An Example
Example: g = 1
  • When gene content is identical, there will always
    be a cluster of size n
  • Therefore, for all k, there will always be a
    cluster of size at least k

49
An Example
Example: g = 1
  • When gene content is identical, there will always
    be a cluster of size n
  • Therefore, for all k, there will always be a
    cluster of size at least k
  • Therefore, the probability of finding a cluster
    of size at least k is always one!

50
Relaxing the Assumption of Identical Gene Content
  • Assume only m of the n genes in each genome are
    shared
  • If the longest run of non-shared genes is less
    than g then we are still guaranteed to find a
    complete cluster

51
  • More generally
  • Simulations of randomly ordered genomes show that
    large clusters may be very likely to occur merely
    by chance
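
A simulation of this kind can be sketched as follows. This is an illustrative setup, not the authors' simulation code. It assumes two genomes of n genes each, of which m are shared, with both gene orders shuffled uniformly at random. The largest cluster is found by a simple recursive-splitting search; all names, the number of trials and the seed are hypothetical.

```python
import random


def largest_shared_cluster(G, H, g):
    """Size of the largest gene set with all adjacent gaps <= g in both genomes."""
    pos = [{gene: i for i, gene in enumerate(genome)} for genome in (G, H)]
    stack = [[gene for gene in G if gene in pos[1]]]
    best = 0
    while stack:
        genes = stack.pop()
        if len(genes) <= best:
            continue  # cannot improve on the best cluster found so far
        for p in pos:
            ordered = sorted(genes, key=p.__getitem__)
            parts = [[ordered[0]]]
            for prev, cur in zip(ordered, ordered[1:]):
                if p[cur] - p[prev] - 1 <= g:
                    parts[-1].append(cur)
                else:
                    parts.append([cur])
            if len(parts) > 1:
                stack.extend(parts)  # split at gaps larger than g and recurse
                break
        else:
            best = len(genes)  # no split in either genome: a valid cluster
    return best


def simulate(n, m, g, trials=200, seed=0):
    """Largest cluster size in each of `trials` pairs of randomly ordered genomes."""
    rng = random.Random(seed)
    G = list(range(n))                              # genes 0..m-1 are shared
    H = list(range(m)) + list(range(n, 2 * n - m))  # the rest are unique to H
    sizes = []
    for _ in range(trials):
        rng.shuffle(G)
        rng.shuffle(H)
        sizes.append(largest_shared_cluster(G, H, g))
    return sizes
```

The fraction of trials in which the largest cluster reaches a given size k is then a Monte Carlo estimate of the chance of seeing such a cluster in randomly ordered genomes. With m equal to n, every trial returns n, in line with the identical gene content argument.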

52
Unexpected Statistical Trends
  • There can be a significant probability of finding
    a cluster that includes all homologous gene pairs
  • The significance of a cluster of size k can be
    less than that of a cluster of size k-1
  • Probabilities are not monotonic
  • Large clusters may not be significant

n = 1000, m = 250, g = 20
Probability of a cluster of size 250 = 50%
53
Outline
  • Formalisms
  • Reference set scenario
  • Whole genome comparison
  • Conclusion

54
Clusters Are Used in Many Other Applications
55
Max-Gap Clusters are Especially Common
56
  • Formal statistical models allow us to
  • understand trends and verify that they match our
    expectations,
  • choose parameters effectively
  • conduct statistical tests for data analysis
  • Formal statistical models require
  • a formal cluster definition
  • a search procedure to find clusters
  • These issues are more complicated than they might
    seem!

57
Summary
  • Results: statistical tests of significance for
    max-gap clusters
  • Reference set scenario
  • Genome comparison (work in progress)
  • We need to
  • explicitly consider the cluster properties we
    would like our definitions to satisfy
  • rigorously evaluate whether our definition meets
    these requirements
  • carefully prove that our search procedures match
    our stated definitions

58
  • Thank You