Title: Identifying conserved spatial patterns in genomes
1Identifying conserved spatial patterns in genomes
David Sankoff Dept. of Math and Statistics
University of Ottawa
Dannie Durand Depts. of Biological Sciences
and Computer Science, CMU
Student Seminar Series Jan 20, 2006
2The complete genetic material of an organism or
species
The Genome
3Key genomic component: genes
A gene is a DNA subsequence
ACCCTTAGCTAGACCTTTAGGAGG...
- Genes encode proteins,
- the building blocks of the cell
4Comparing Genomes
75 Million years
            Human   Mouse   Fly    Rice  E. coli  Chlamydia
Chromosomes 23      20      4      12    1        1
Genes       20-25k  20-25k  13.6k  40k   3200     936
5Accidental duplication of chromosome 21 causes
Down Syndrome
Human Chromosome 21 is broken into at least three
pieces in mouse
6Outline
- Evolution of genome organization
- Why identify related genomic regions?
- How do we find them?
- Identification: formal cluster definition
- Validation: testing cluster significance
7A simple model of a chromosome
8What are the processes of genomic change?
9A single species
10Speciation
- Initially the two populations have identical
genomes
- The populations evolve independently
- Eventually, there will be two new species with
similar but distinct genomes
11Types of Genomic Rearrangements
Inversions
Duplications/Insertions
Loss
[Figure: numbered genes in Species 2 illustrating each rearrangement type]
12Types of Genomic Rearrangements
Chromosomal fissions and fusions
[Figure: numbered genes on the chromosomes of Species 2]
13Genome Comparison
[Figure: numbered genes on the chromosomes of Species 1 and Species 2]
Our goal: identify chromosomal regions that
descended from the same region in the genome of
the common ancestor
14Outline
- Evolution of genome organization
- Why identify related genomic regions?
- How do we find them?
- Identification: formal cluster definition
- Validation: testing cluster significance
15Genome Annotation Problem
Given the set of genes in the genome, label each
with its function
Gene
ACCCTTAGCTAGACCTTTAGGAGGTGCAGGA
Cellular Pathway Glucose Metabolism
Protein
16There are many aspects of gene function
- Gene: trpA
- Biochemical Function: cleaves a double bond
- Cellular Process: amino-acid biosynthesis
- Protein-protein interactions: binds trpB
17There are many aspects of gene function
- Gene: a typical gene
- Biochemical Function: ?
- Biological Process: ?
- Protein-protein interactions: ?
40-60% of genes in most genomes have unknown
function
Comparisons of spatial organization within
genomes can yield gene function predictions
18In bacteria, genes in the same pathway often
occur together in the genome
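Tryptophan Synthesis Pathway
[Figure: the pathway and the trp genes in two bacterial genomes]
Pathway compounds, in order: Chorismate, Anthranilate,
N-5-P-ribosyl-anthranilate,
1-(2-Carboxy-phenylamino)-deoxy-ribulose-5P,
3-Indole Glycerol-P, Tryptophan
E. coli genes: trpCF, trpD, trpB, trpA, trpE
Bacillus subtilis genes: trpD, trpC, trpB, trpA, trpE, trpF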
19Conserved spatial organization between distantly
related species suggests functional associations
between the genes
[Figure: genes A-G in the compared genomes]
A: Glucose metabolism
B: Glucose metabolism
C: ?
D: Tryptophan synthesis
E: ?
F: ?
G: Tryptophan synthesis
20Conserved spatial organization between distantly
related species suggests functional associations
between the genes
[Figure: genes A-G in the compared genomes]
A: Glucose metabolism
B: Glucose metabolism
C: Prediction: Glucose metabolism
D: Tryptophan synthesis
E: ?
F: Prediction: Tryptophan synthesis
G: Tryptophan synthesis
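The prediction step on this slide can be sketched in Python. This is only an illustration of the idea, not the method of any particular tool: the function name `predict_functions`, the grouping of genes A-G into clusters, and the rule "transfer a function only when every annotated gene in the cluster agrees" are all assumptions made for the example.

```python
def predict_functions(clusters, annotations):
    """Transfer a function to unannotated genes in a conserved cluster.

    A prediction is made only when all annotated genes in the cluster
    share a single function (an assumed, deliberately cautious rule).
    """
    predictions = {}
    for cluster in clusters:
        known = {annotations[gene] for gene in cluster if annotations.get(gene)}
        if len(known) == 1:
            (function,) = known
            for gene in cluster:
                if not annotations.get(gene):
                    predictions[gene] = function
    return predictions


# Annotations from the slide; None marks a gene of unknown function.
annotations = {
    "A": "Glucose metabolism",
    "B": "Glucose metabolism",
    "C": None,
    "D": "Tryptophan synthesis",
    "E": None,
    "F": None,
    "G": "Tryptophan synthesis",
}

# Hypothetical conserved clusters, chosen to match the slide's predictions.
clusters = [{"A", "B", "C"}, {"D", "F", "G"}, {"E"}]

print(predict_functions(clusters, annotations))
```

Gene E stays unannotated because its cluster contains no gene of known function.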
21Outline
- Evolution of genome organization
- Why identify related genomic regions?
- How do we find them?
- Identification: formal cluster definition
- Validation: testing cluster significance
22Closely related genomes
[Figure: numbered genes on the chromosomes of Species 1 and Species 2]
Related regions, regions that descended from the
same region in the genome of the common ancestor,
are easy to identify
23A hundred million years...
24More Diverged Genomes
[Figure: numbered genes on the chromosomes of two more diverged genomes]
- Related regions are harder to detect, but there
is still spatial evidence of common ancestry
- Similar gene content
- Neither gene content nor order is perfectly
preserved
25The signature of diverged regions
[Figure: numbered genes on the chromosomes of two diverged genomes]
- Gene clusters
- Similar gene content
- Neither gene content nor order is perfectly
preserved
26A Framework for Identifying Gene Clusters
- Find corresponding genes (given as input)
- Formally define a gene cluster (review the most common definition)
- Devise an algorithm to identify clusters
- Statistically verify clusters (my work)
27Clusters are signatures of distantly related
regions.
- Without functional constraints...
- After sufficient time has passed, gene order will
become randomized
- Uniform random data tends to be clumpy
- Some genes will end up proximal in both genomes
simply by chance
Not all clusters have biological significance.
28Cluster Validation via Hypothesis Testing
- Null hypothesis: random gene order
- Reject gene clusters that could have arisen under
the null model
- Clusters that cannot be rejected are likely to be
functionally constrained
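One way to make this test concrete is a Monte Carlo simulation of the null model. This is a randomization sketch, not the analytical calculation the talk develops. It assumes one-to-one gene matches, positions drawn uniformly at random, and clusters taken as maximal gene sets in which adjacent genes are separated by at most g intervening genes in both genomes. It estimates the chance of seeing at least one cluster of size h or more, which is not the same quantity as the probability of a cluster of exactly size h. All function names are hypothetical.

```python
import random


def split_runs(genes, pos, g):
    """Split genes into maximal runs whose adjacent gaps in one genome are <= g."""
    ordered = sorted(genes, key=pos.get)
    runs = [[ordered[0]]]
    for prev, gene in zip(ordered, ordered[1:]):
        if pos[gene] - pos[prev] - 1 > g:
            runs.append([])
        runs[-1].append(gene)
    return runs


def largest_cluster(pos1, pos2, g):
    """Size of the largest gene set that forms a chain in both genomes."""
    best, pending = 0, [list(pos1)]
    while pending:
        genes = pending.pop()
        for pos in (pos1, pos2):
            runs = split_runs(genes, pos, g)
            if len(runs) > 1:      # not a chain in this genome: refine and retry
                pending.extend(runs)
                break
        else:                      # a chain in both genomes
            best = max(best, len(genes))
    return best


def estimate_p_value(h, n, m, g, trials=200, seed=0):
    """Fraction of random genome pairs containing a cluster of size >= h."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Null model: m matched genes at random positions in genomes of n genes.
        pos1 = dict(enumerate(rng.sample(range(n), m)))
        pos2 = dict(enumerate(rng.sample(range(n), m)))
        hits += largest_cluster(pos1, pos2, g) >= h
    return hits / trials


for h in (2, 3, 4):
    print(h, estimate_p_value(h, n=200, m=50, g=2))
```

A cluster size whose estimated frequency under the null model falls below the chosen significance threshold would be treated as unlikely to have arisen by chance.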
29Outline
- Evolution of genome organization
- Why find related genomic regions?
- How do we find them?
- Identification: max-gap cluster definition
- Validation: testing cluster significance
30A max-gap chain
g? 2
gap? 3
- The distance or gap between genes is equal to
the number of intervening genes
- A set of genes in a genome forms a max-gap chain if
- the gap between adjacent genes is never greater
than g (a user-specified parameter)
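The chain definition above translates directly into a short check. The sketch below assumes genes are identified by their integer index along a single linear genome; the function names and the example positions are hypothetical.

```python
def gaps(positions):
    """Number of intervening genes between each pair of adjacent chain members."""
    ordered = sorted(positions)
    return [b - a - 1 for a, b in zip(ordered, ordered[1:])]


def is_max_gap_chain(positions, g):
    """True if no gap between adjacent genes exceeds g."""
    return all(gap <= g for gap in gaps(positions))


# Genes of interest sit at these (hypothetical) indices along one genome.
positions = [2, 3, 6, 9]
print(gaps(positions))                 # gaps between adjacent genes
print(is_max_gap_chain(positions, 2))  # tested with g = 2
print(is_max_gap_chain(positions, 1))  # tested with g = 1
```

With these positions the gaps are 0, 2 and 2, so the set is a chain for g = 2 but not for g = 1.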
31Max-Gap cluster definition
g? 2
gap? 3
- A set of genes forms a max-gap cluster of two
genomes if
- the genes form a max-gap chain in each genome
- the cluster is maximal (i.e. not contained within
a larger cluster)
32Max-Gap cluster definition
gap? 3
g? 2
g? 3
- A set of genes forms a max-gap cluster of two
genomes if
- the genes form a max-gap chain in each genome
- the cluster is maximal (i.e. not contained within
a larger cluster)
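The definition can be turned into a simple search by repeated splitting: break the gene set into runs that are chains in one genome, then re-check each run in the other genome, and repeat until a set is a chain in both. The sketch below assumes one-to-one gene matches and takes positions as indices along each genome; it is a naive illustration of this splitting idea, not an optimized or published implementation, and its names are hypothetical.

```python
def split_into_chains(genes, pos, g):
    """Split genes into maximal runs whose adjacent gaps in one genome are <= g."""
    ordered = sorted(genes, key=lambda gene: pos[gene])
    runs, current = [], [ordered[0]]
    for prev, gene in zip(ordered, ordered[1:]):
        if pos[gene] - pos[prev] - 1 > g:
            runs.append(current)
            current = []
        current.append(gene)
    runs.append(current)
    return runs


def max_gap_clusters(pos1, pos2, g):
    """Return the maximal gene sets that form a max-gap chain in both genomes."""
    clusters = []
    pending = [list(pos1)] if pos1 else []
    while pending:
        genes = pending.pop()
        runs1 = split_into_chains(genes, pos1, g)
        if len(runs1) > 1:
            pending.extend(runs1)
            continue
        runs2 = split_into_chains(genes, pos2, g)
        if len(runs2) > 1:
            pending.extend(runs2)
            continue
        # A chain in both genomes. A cluster is never split apart by the steps
        # above, so a set that survives them is maximal.
        clusters.append(set(genes))
    return clusters


# Hypothetical example: gene -> position in each genome, with g = 0.
pos1 = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 10}
pos2 = {"b": 0, "d": 1, "a": 2, "c": 3, "e": 4}
print(max_gap_clusters(pos1, pos2, g=0))
```

Here genes a-d form one cluster even though their order is scrambled, while e is left on its own because it is far away in the first genome.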
33The max-gap definition is the most widely used
cluster definition in genomic analyses
- Allows extensive rearrangement of gene order
- Allows limited gene insertion and losses
There is no formal statistical model for max-gap
clusters
34Outline
- Evolution of genome organization
- Why find related genomic regions?
- How do we find them?
- Identification: max-gap cluster definition
- Validation: testing cluster significance
35The Questions
Suppose two whole genomes were compared, and this
max-gap cluster was identified
- Is this cluster biologically meaningful?
- Could it have occurred in a comparison of random
genomes?
36The Inputs
h = 4
n: number of genes in each genome
m: number of matching gene pairs
g: the maximum gap allowed in a cluster
h: number of matching genes in the cluster
37The Problem
h = 4
- What is the probability of observing a max-gap
cluster
- containing exactly h matching gene pairs
- assuming the genomes are randomly ordered
38Probability of a cluster of size h
m genes
m-h genes
h genes
Basic approach: enumerate all ways to
- Place m-h remaining genes so they do not extend
the cluster
- Create chains of h genes in both genomes
- Normalize to get a probability
39Probability of observing a cluster of size h
P(cluster of size h) =
  [ (number of ways to place h genes so they form a
     chain in both genomes)
  x (number of ways to place m-h remaining genes so
     they do not extend the cluster) ]
  / (all configurations of m gene pairs in two
     genomes of size n)
40Total number of configurations of m gene pairs
in two genomes of size n
m genes
42Number of ways to place h genes in two genomes so
they form a cluster
h genes
m genes
m-h genes
Select h spots in each genome, so they form a
max-gap chain
Choose h genes to compose the cluster
Assign each gene to a selected spot in each genome
43The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 places left
1 2 3 4 5 ... n-L+1 ... n
The maximum length of the chain is L = (h-1)g + h
44The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
45The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Chains near the end of the genome
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
1 2 3 4 5 ... n-L+1 ... n
46Number of ways to position h genes in a genome
of n genes so they form a max-gap chain
Starting positions near end
Starting positions
Ways to place remaining h-1 genes
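The chain count built up over the last few slides can be checked numerically. The closed form below is derived here from the slides' description (sum, over all choices of the h-1 gaps, the number of starting positions that keep the chain inside the genome) and is an assumption of this sketch rather than a formula quoted from the talk. It is only claimed for n >= L = (h-1)g + h, and the brute-force count is included to test it on small cases. Function names are hypothetical.

```python
from itertools import combinations


def count_chains_brute_force(n, h, g):
    """Count h-subsets of positions 1..n whose adjacent gaps are all <= g."""
    return sum(
        all(b - a - 1 <= g for a, b in zip(spots, spots[1:]))
        for spots in combinations(range(1, n + 1), h)
    )


def count_chains_closed_form(n, h, g):
    """Closed-form count, assumed valid only when n >= L = (h-1)g + h."""
    max_length = (h - 1) * g + h
    if n < max_length:
        raise ValueError("closed form assumes n >= (h-1)g + h")
    # Each of the h-1 gaps takes a value in 0..g, giving (g+1)^(h-1) gap
    # choices; a chain of length l can start at n - l + 1 positions.
    # Summing over all gap choices gives
    #   (g+1)^(h-1) * (n - h + 1 - (h-1)g/2),
    # written below with integer arithmetic.
    return (g + 1) ** (h - 1) * (2 * (n - h + 1) - (h - 1) * g) // 2


print(count_chains_brute_force(10, 3, 2))
print(count_chains_closed_form(10, 3, 2))
```

The two functions should agree whenever the genome is at least as long as the longest possible chain; for shorter genomes only the brute-force count applies.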
47Probability of a cluster of size h
m-h genes
h genes
Basic approach: enumerate all ways to
- Place m-h remaining genes so they do not extend
the cluster
- Create chains of h genes in both genomes
49Counting the number of ways to place m-h genes
outside the cluster
g = 1
h = 3
- Approach
- design a rule specifying where the genes can be
placed so that the cluster is not extended
- count the positions
50Counting the number of ways to place m-h genes
outside the cluster
gaps 1
g = 1
- Rule 1: A gene can go anywhere except in the
cluster (the white box).
Too lenient
51Counting the number of ways to place m-h genes
outside the cluster
g = 1
- Rule 2: Every gene must be at least g+1 positions
from the cluster (outside the grey box).
Too strict
52Counting the number of ways to place m-h genes
outside the cluster
gap > 1
g = 1
h = 3
- Rule 2: Every gene must be at least g+1 positions
from the cluster (outside the grey box).
Too strict
53Counting the number of ways to place m-h genes
outside the cluster
g = 1
gap > 1
- Rule 3: At most one member of each gene pair can
be in the grey box.
Too lenient
54Counting the number of ways to place m-h genes
outside the cluster
gaps 1
g = 1
- Rule 3: At most one member of each gene pair can
be in the grey box.
Too lenient
55Counting the number of ways to place m-h genes
outside the cluster
g = 1
- Acceptable positions for a gene depend on the
positions of the remaining genes
- Use strict and lenient rules to calculate upper
and lower bounds on G
56Estimating G
- Upper bound
- Erroneously enumerates this configuration
- Lower bound
- Fails to enumerate this configuration
57Probability of observing a cluster of size h
number of ways to place h genes so they form a
chain in both genomes
number of ways to place m-h remaining genes so
they do not extend the cluster
Hoberman, Sankoff, Durand. Journal of
Computational Biology, 2005
58What can we learn from this statistical result?
- Are we less likely to observe a large cluster
(containing more gene pairs) than a small
cluster?
- How large does a cluster have to be before we are
surprised to observe it?
- How do we choose the maximum allowed gap value?
Larger values will yield more clusters, but
more of these will be false positives
59Whole-genome comparison cluster statistics
n = 1000, m = 250, g = 20
With a significance threshold of 10^-4, any
cluster containing 8 or more genes is significant.
[Figure: plot with x-axis h (cluster size)]
60Conclusion
- Statistical analysis of max-gap gene clusters
- Provides a principled approach for choosing a gap
size that will yield significant clusters
- Allows statistically significant max-gap clusters
to be identified
- Provides insight on criteria for cluster
definitions
61Odd properties of max-gap clusters
- A larger cluster may be less significant
- Moving a gene further away may make a cluster
more likely
62Acknowledgements
- Barbara Lazarus Women@IT Fellowship
- The Sloan Foundation
- The Durand Lab
63Thanks
64Questions?
68Cluster Significance Related Work
- Randomization tests
- Requires complete genome (confusing!)
- Not useful for choosing parameter values
- Very simple models
- Excessively strict simplifying assumptions
- Overly conservative cluster definitions
- A few more general statistical approaches
- Not applicable to max-gap clusters
69Groups find very different clusters when
analyzing the same data
70Generative Models of Genome Rearrangement
- Construct a probabilistic model specifying rates
for each type of genomic rearrangement
- Reject regions that are unlikely to have evolved
via the model
- Challenges
- Relative rates of rearrangement processes are not
known
- requires identification of clusters
- Rates may differ significantly
- within regions of the genome
- between species
- over time (e.g. depending on population sizes)
71Advantages of an analytical approach
- Analyzing incomplete datasets
- Principled parameter selection
- Efficiency?
- Accuracy?
- Understanding statistical trends
- Insight into tradeoffs between definitions
72- plot graph with fixed cluster size and varying
maximum gap sizes
- is it monotonic?
- is a function of density and size monotonic?
73- not capturing
- difference in density between max-gap clusters
- partially conserved order
74Identifying gene clusters
- Formally define a gene cluster
- Devise an algorithm to identify clusters
- Verify that clusters indicate common ancestry
...modeling
...algorithms
...statistics
76- These are criteria: size and density
- Hard to capture
- The one I've chosen is widely used, but as we will
see at the end of the talk, it has some problems
77Genome
The complete set of genetic material of an
organism or species
Chromosome
A double-stranded molecule of DNA
Gene
CCCCGCCCCCCGCCCCCCCCCTCGTCTTCAGACCCTTAGCTAGACCTTTA
GGAGGATTAAAAATGAGGGAGAGGGGC
GGGGCGGGGGGCGGGGGGGGGAGCAGAAGTCTGGGAATCGATCTGGAAAT
CCTCCTAATTTTTACTCCCTCTCCCCG
A protein coding sequence
78Genome
The complete set of genetic material of an
organism or species
TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
Genes protein coding sequences
Large stretches of DNA with unknown function.
as an ordered list of genes
Regions where proteins bind to turn genes on and
off
79Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Example: h = 4 and g = 1
1 2 3 4 5 6 ... n-L+1 ... n
The maximum length of a chain is L = (h-1)g + h
80Ways to place the remaining h-1 genes when the
gaps and length are constrained
1 2 3 4 5 ... n-L+1 ... n
l < L
- Gaps are constrained
- And sum of gaps is constrained
A known solution
81g2
g3
gm-1
g1
l
A known solution
82Counting chains at the end of the genome
- Gaps are constrained
- And sum of gaps is constrained
l w-1
l h
83Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Chains near the end of the genome
Ways to place the remaining h-1 genes, so no gap
exceeds g
1 2 3 4 5 ... n-L+1 ... n
1
.
L-h
84Number of ways to position h genes in a genome
of n genes so they form a max-gap chain
Starting positions near end
Starting positions
Ways to place remaining h-1 genes
85Whole-genome comparison cluster statistics
n = 1000, m = 250
g = 10
g = 20
Cluster Probability
h (cluster size)
87Constructive Approach
Number of configurations that contain a cluster
of exactly size h
number of ways to place m-h remaining genes so
they do not extend the cluster
number of ways to position h genes so they form a
chain in both genomes
number of ways to position h genes so they form a
chain in a single genome
88Constructive Approach
Number of configurations that contain a cluster
of exactly size h
number of ways to place m-h remaining genes so
they do not extend the cluster
number of ways to position h genes so they form a
cluster in both genomes
89Building Phylogenetic Trees
Genes may be laterally transferred between
distantly related species
AAACATTTT E. coli
GTCGGTTGG E. coli
AAACATTTA Salmonella
AAACGTTTC Chlamydia
GTCGGTTGC Thermococcus
GTCAGTTGC Methanococcus
- Trees are often constructed based on a single
gene
- species with the fewest differences between their
gene sequences are grouped together in the tree
- The history of a gene may not indicate the
history of the species
- Construct trees based on evidence
from the whole genome
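The "fewest differences" criterion can be illustrated with the six sequences on this slide. The sketch below counts mismatched positions (Hamming distance) between every pair and lists the pairs from most to least similar. It assumes the sequences are already aligned and of equal length, and the labels distinguishing the two E. coli sequences are added here for the example; real tree building uses more elaborate distance and clustering methods.

```python
from itertools import combinations

# Sequences as shown on the slide; the "(1)" and "(2)" labels are assumptions.
sequences = {
    "E. coli (1)": "AAACATTTT",
    "E. coli (2)": "GTCGGTTGG",
    "Salmonella": "AAACATTTA",
    "Chlamydia": "AAACGTTTC",
    "Thermococcus": "GTCGGTTGC",
    "Methanococcus": "GTCAGTTGC",
}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def ranked_pairs(seqs):
    """All pairs of sequences, sorted from fewest to most differences."""
    pairs = [
        (hamming(seqs[a], seqs[b]), a, b) for a, b in combinations(seqs, 2)
    ]
    return sorted(pairs)


for distance, a, b in ranked_pairs(sequences):
    print(distance, a, b)
```

The two E. coli sequences land far apart from each other while each sits close to sequences from other species, which is the pattern the slide attributes to lateral transfer.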
90An Essential Task for Spatial Comparative Genomics
Identify gene clusters, groups of genes that are
derived from the same chromosomal region in an
ancestral genome
[Figure: numbered genes on the chromosomes of two genomes]
91Phylogenetic Trees
Human
Chimp
Mouse
Rat
Dog
Possum
[Figure: phylogenetic tree with a time axis from 100 to 0 million years ago]
- Describe evolutionary relationships between
species
- each internal node represents the most recent
common ancestor of the descendants
- edge lengths correspond to time estimates.
92Building Phylogenetic Trees
Human
AAACATTTTA
Opposable thumbs
Chimp
AAATATTTA
Mouse
AACATTTTG
Single pair of incisors
Rat
AACATTTCG
Flesh shearing teeth
Dog
ATCAGTTGC
No placenta Opposable thumbs
TGCACTTGT
Opossum
- Trees can be built from
- Physiological features
- Gene sequences
- Spatial genome organization
Species with the fewest differences between their
gene sequences are grouped together
93Whole-genome phylogenies based on spatial
organization
- Find gene clusters
- Determine the minimum number of rearrangements
between genome pairs
- Use rearrangement distances to build phylogenies
Guillaume Bourque et al. Genome Res. 2004; 14:
507-516
94Conserved spatial organization between distantly
related species suggests functional associations
between the genes
Snel, Bork, Huynen. PNAS 2002
[Figure: genes A-E and an unknown gene (?) shown in several genomes]
96Statistical Testing Provides Additional Evidence
for Common Ancestry
- How can we verify that a gene cluster indicates
common ancestry?
- True histories are rarely known
- Experimental verification is often not possible
- Rates and patterns of large-scale rearrangement
processes are not well understood
97Constructive Approach: Enumerating configurations
that contain a cluster of exactly h gene pairs
h genes
m genes
m-h genes
- Select h spots in each genome, so that they form
a max-gap chain
- Choose h genes to compose the cluster
- Assign each gene to a selected spot in each
genome
- Choose the location of the remaining m-h genes so
they don't extend the cluster
98Where are the gene clusters?
- Intuitive notions of what clusters look like
- Similar gene content
- Neither gene content nor order is perfectly
preserved
- Need more rigorous criteria
99Ways to place the remaining h-1 genes when the
gaps and length are constrained
1 2 3 4 5 ... n-L+1 ... n
l < L
A known solution
but not closed form
100Ways to place the remaining h-1 genes when the
gaps and length are constrained
1 2 3 4 5 ... n-L+1 ... n
1
.
L-h
101Future Work
- Evaluate
- Developed statistical tests for max-gap clusters
identified by whole-genome comparison using a
combinatorial approach
- Results raise concerns about current methods used
in comparative genomics studies
102What characteristics should we use to evaluate a
cluster?
- Extent of gene loss/insertion
- Density? (constrained by definition to 1/g)
- Number of insertions/deletions between matches
(constrained to g)
- Size of fragment
- Number of matching genes (unconstrained)
- Degree of rearrangement
- Number of order violations (unconstrained)
103Assumptions
- A single, linear chromosome
- The mapping between genes is one-to-one
104Evaluate clusters based on size
gap?gt 3
size 4
- The size of a cluster is the number of matching
gene pairs it contains
106Existing Algorithms Impose Order Constraints
g = 2
- Typical approaches to finding max-gap clusters
use a greedy, agglomerative algorithm
- initialize a cluster as a single matching gene
pair
- search for a gene in proximity in both genomes
- either extend the cluster and repeat, or
terminate and choose a new seed
107Algorithms and Definition Mismatch
g = 2
A max-gap cluster of size four
- Agglomerative algorithms will not find highly
disordered max-gap clusters
- A divide-and-conquer algorithm has been developed
(Bergeron et al, 2002)
- this work is not known by the biological community
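The mismatch can be reproduced with a small greedy sketch. The version below is one plausible reading of the seed-and-extend procedure described on the previous slide, not a specific published tool: it adds a gene only if it lies within gap g of some current cluster member in both genomes. The gene names, positions, and the choice g = 0 are hypothetical.

```python
def is_chain(genes, pos, g):
    """True if the genes form a max-gap chain in the genome described by pos."""
    ordered = sorted(pos[gene] for gene in genes)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))


def greedy_clusters(pos1, pos2, g):
    """Seed-and-extend clustering: grow each cluster one gene at a time."""
    unassigned = set(pos1)
    clusters = []
    while unassigned:
        seed = min(unassigned, key=lambda gene: pos1[gene])
        unassigned.remove(seed)
        cluster = {seed}
        extended = True
        while extended:
            extended = False
            for gene in sorted(unassigned, key=lambda x: pos1[x]):
                # Assumed rule: the gene must be near a member in BOTH genomes.
                near1 = any(abs(pos1[gene] - pos1[c]) - 1 <= g for c in cluster)
                near2 = any(abs(pos2[gene] - pos2[c]) - 1 <= g for c in cluster)
                if near1 and near2:
                    cluster.add(gene)
                    unassigned.remove(gene)
                    extended = True
                    break
        clusters.append(cluster)
    return clusters


# Four genes, contiguous in both genomes but in scrambled order (g = 0).
pos1 = {"a": 0, "b": 1, "c": 2, "d": 3}
pos2 = {"b": 0, "d": 1, "a": 2, "c": 3}

print(is_chain(pos1, pos1, 0), is_chain(pos1, pos2, 0))
print(greedy_clusters(pos1, pos2, 0))
```

All four genes form a chain in both genomes, so together they satisfy the max-gap definition, yet the greedy search never finds a single gene that is close to the growing cluster in both genomes at once and returns only singletons.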
108Future Work
- Generalize the model
- Remove the assumption that gene correspondences
are one-to-one
- Evaluate clusters based on
- density, e.g. size and total gaps
- the degree to which order is conserved
- Take phylogenetic distance into account
- for more closely related species, random gene
order is not a reasonable null hypothesis
109In bacteria, genes in the same pathway often
occur together in the genome
E. coli
Tryptophan Synthesis Pathway
trpCF
trpD
trpB
trpA
trpE
Chorismate
trpC
trpB
trpA
trpE
trpD
trpF
Anthranilate
Bacillus Subtilis
trpD trpE
N-5-Phosphoribosyl-anthranilate
Enol-1-o-carboxy phenylamino-1-deoxyribulose
phosphate
trpD trpE
Indole-3-glycerol phosphate
trpCF
trpA trpB
L-Tryptophan
trpA trpB
110Speciation
An ancestral species a uniform population
111Speciation
- Initially the two populations have identical
genomes
- The populations evolve independently
- Eventually, there will be two species with
similar but distinct genomes
112Time passes,
more rearrangements accumulate
113Common blocks are now harder to detect, but there
is still evidence of common ancestry
[Figure: numbered genes on the chromosomes of two diverged genomes]
- Gene clusters
- Similar gene content
- Neither gene content nor order is perfectly
preserved
114Gene Clusters
[Figure: numbered genes on the chromosomes of two diverged genomes]
- Intuitive notions of what clusters look like
- Similar gene content
- Neither gene content nor order is perfectly
preserved
- Need more rigorous criteria
115Genome
The genetic material of an organism or
species. Specifies the complete blueprint for the
organism.
Chromosome
A long double-stranded molecule of DNA
TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
Gene
A DNA sequence that encodes a protein. Proteins
are the building blocks of cells.
116- Benoit's outline
- example and a little motivation
- here are the issues, in order to solve this we
need to
- need to cluster
- ways to cluster exist but we don't know how good
they are
- want to have a statistical way of measuring it
- cluster def
117What are the processes of genomic change?
- Small-scale point mutations
- Change gene sequences
- Large-scale genomic rearrangements
- Change gene content and order
118In bacteria, genes in the same pathway often
occur together in the genome
Tryptophan Synthesis Pathway
Enol-1-o-carboxy phenylamino-1-deoxyribulose
phosphate
N-5-Phosphoribosyl-anthranilate
Indole-3-glycerol phosphate
Chorismate
Tryptophan
Anthranilate
trpD trpE
trpCF
trpA trpB
trpA trpB
trpD trpE
E. coli
trpCF
trpD
trpB
trpA
trpE
Bacillus Subtilis
trpD
trpC
trpB
trpA
trpE
trpF
119Human genome
Human Chromosome 21 is broken into at least three
pieces in mouse
Accidental duplication of chromosome 21 causes
Down Syndrome
Human genome
X is scrambled but conserved
Mouse genome as scrambled human genome
Guillaume Bourque et al. Genome Res. 2004; 14:
507-516
120Other applications
- build evolutionary trees based on rearrangements
- detect ancient whole genome duplications
- identify operons
- estimate rearrangement frequencies
- ...
121Common Blocks
regions that descended from the same region in
the genome of the common ancestor
[Figure: numbered genes on the chromosomes of Species 1 and Species 2]
122Common Blocks
are harder to detect between more distantly
related organisms, but there is still evidence of
common ancestry
[Figure: numbered genes on the chromosomes of Species 1 and Species 2]
- Similar gene content
- Neither gene content nor order is perfectly
preserved
123[Figure: numbered genes on the chromosomes of two genomes]
- Gene clusters
- Similar gene content
- Neither gene content nor order is perfectly
preserved
124Inputs
- Two genomes (i.e., ordered lists of genes)
- A mapping of corresponding genes
125Hypothesis Testing
- Null hypothesis: random gene order
- Alternate hypothesis: shared ancestry
- Reject clusters that could have arisen under the
null model
126Number of ways to position h genes in a genome
of n genes so they form a max-gap chain
Probability that h randomly placed genes will
form a chain in a genome of n genes
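The single-genome probability on this slide is the number of chain placements divided by the number of all placements. The sketch below assumes that "randomly placed" means every set of h positions among the n is equally likely, so the denominator is C(n, h). The chain count uses a small dynamic program over the position of the rightmost gene, which also works when the genome is shorter than the longest possible chain. Function names and the example parameters are hypothetical.

```python
from math import comb


def count_chains_dp(n, h, g):
    """Count sets of h positions in a genome of n genes with all gaps <= g."""
    # ways[p] = number of chains of the current size whose rightmost gene is at p.
    ways = [1] * n
    for _ in range(h - 1):
        extended = [0] * n
        for p in range(n):
            # The previous gene may sit at most g + 1 positions to the left.
            extended[p] = sum(ways[max(0, p - g - 1):p])
        ways = extended
    return sum(ways)


def chain_probability(n, h, g):
    """Probability that h uniformly placed genes form a max-gap chain."""
    return count_chains_dp(n, h, g) / comb(n, h)


# Example with n = 1000 and an illustrative gap of g = 20.
for h in (2, 4, 6, 8):
    print(h, chain_probability(1000, h, 20))
```

The probability falls quickly as h grows, because each additional gene must land within g + 1 positions of its neighbor.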
127Probability of h randomly placed genes forming a
chain
n = 1000 (total genes in genome)
[Figure: plot with x-axis h (size of the chain)]
128Number of ways to place h genes in two genomes so
they form a cluster
h genes
m genes
m-h genes
Choose h genes to compose the cluster
Assign each gene to a selected spot in each genome
Select h spots in a genome, so they form a
max-gap chain
129Calculating the Numerator: Enumerate the
configurations that contain a cluster of exactly
h gene pairs
h genes
m genes
m-h genes
Assign each gene to a selected spot in each genome
Choose the location of the remaining m-h genes so
they don't extend the cluster
Choose h genes to compose the cluster
Select h spots in a genome, so they form a
max-gap chain
133Comparing Genomes
75 Million years
           Chromosomes  Millions of nucleotides  Genes
Human      23           2900                     20-25k
Mouse      20           2500                     20-25k
Fly        4            180                      13.6k
Rice       12           430                      40k
E. coli    1            4.7                      3200
Chlamydia  1            1                        936
134Comparing Genomes
75 Million years
           Chromosomes  Genes
Human      23           20-25k
Mouse      20           20-25k
Fly        4            13.6k
Rice       12           40k
E. coli    1            3200
Chlamydia  1            936