Identifying conserved spatial patterns in genomes

1
Identifying conserved spatial patterns in genomes
  • Rose Hoberman
  • David Sankoff, Dept. of Math and Statistics, University of Ottawa
  • Dannie Durand, Depts. of Biological Sciences and Computer Science, CMU

Student Seminar Series, Jan 20, 2006
2
The Genome
The complete genetic material of an organism or
species
3
Key genomic component: genes
A gene is a DNA subsequence
ACCCTTAGCTAGACCTTTAGGAGG...
  • Genes encode proteins, the building blocks of the cell

4
Comparing Genomes
75 Million years
              Human    Mouse    Fly     Rice   E. coli   Chlamydia
Chromosomes   23       20       4       12     1         1
Genes         20-25k   20-25k   13.6k   40k    3200      936
5
Accidental duplication of chromosome 21 causes
Down Syndrome
Human Chromosome 21 is broken into at least three
pieces in mouse
6
Outline
  • Evolution of genome organization
  • Why identify related genomic regions?
  • How do we find them?
  • Identification: Formal cluster definition
  • Validation: Testing cluster significance

7
A simple model of a chromosome
  • an ordered list of genes

8
What are the processes of genomic change?
9
A single species
10
Speciation
  1. Initially the two populations have identical
    genomes
  2. The populations evolve independently
  3. Eventually, there will be two new species with
    similar but distinct genomes
11
Types of Genomic Rearrangements
Inversions
Duplications/Insertions
Loss
[Figure: numbered genes along a chromosome of Species 2, illustrating each rearrangement type]
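Under the simple model of a chromosome as an ordered list of genes, these rearrangement types can be sketched as list operations. This is only an illustration, not code from the talk: the function names and the example gene numbers are hypothetical, and gene orientation is ignored.

```python
def invert(chromosome, start, end):
    """Reverse the order of the genes in chromosome[start:end]."""
    return chromosome[:start] + chromosome[start:end][::-1] + chromosome[end:]


def insert_gene(chromosome, position, gene):
    """Insert a new (or duplicated) gene before the given position."""
    return chromosome[:position] + [gene] + chromosome[position:]


def lose_gene(chromosome, position):
    """Delete the gene at the given position."""
    return chromosome[:position] + chromosome[position + 1:]


# Hypothetical ancestral chromosome: genes numbered 1..7.
ancestor = [1, 2, 3, 4, 5, 6, 7]
print(invert(ancestor, 2, 5))        # genes 3..5 now appear in reverse order
print(insert_gene(ancestor, 3, 20))  # gene 20 appears between genes 3 and 4
print(lose_gene(ancestor, 5))        # gene 6 is gone
```

Each function returns a new list, so the ancestral order stays available for comparison with the rearranged one.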
12
Types of Genomic Rearrangements
Chromosomal fissions and fusions
[Figure: numbered genes along the chromosomes of Species 2]
13
Genome Comparison
[Figure: numbered genes along the chromosomes of Species 1 and Species 2]
Our goal: identify chromosomal regions that
descended from the same region in the genome of
the common ancestor
14
Outline
  • Evolution of genome organization
  • Why identify related genomic regions?
  • How do we find them?
  • Identification: Formal cluster definition
  • Validation: Testing cluster significance

15
Genome Annotation Problem
Given the set of genes in the genome, label each
with its function
Gene
ACCCTTAGCTAGACCTTTAGGAGGTGCAGGA
Cellular Pathway: Glucose Metabolism
Protein
16
There are many aspects of gene function
  • Gene: trpA
  • Biochemical Function: cleaves a double bond
  • Cellular Process: amino-acid biosynthesis
  • Protein-protein interactions: binds trpB

17
There are many aspects of gene function
  • Gene: a typical gene
  • Biochemical Function: ?
  • Biological Process: ?
  • Protein-protein interactions: ?

40-60% of genes in most genomes have unknown
function
Comparisons of spatial organization within
genomes can yield gene function predictions
18
In bacteria, genes in the same pathway often
occur together in the genome
Tryptophan Synthesis Pathway
[Figure: pathway from Chorismate through Anthranilate,
N-5-P-ribosyl-anthranilate,
1-2 Carboxy-phenylaminodeoxy-ribulose-5P, and
3-Indole Glycerol-P to Tryptophan]
E. coli genes: trpCF, trpD, trpB, trpA, trpE
Bacillus subtilis genes: trpD, trpC, trpB, trpA, trpE, trpF
19
Conserved spatial organization between distantly
related species suggests functional associations
between the genes
[Figure: genes A-G arranged along the genomes being compared]
  • A: Glucose metabolism
  • B: Glucose metabolism
  • C: ?
  • D: Tryptophan synthesis
  • E: ?
  • F: ?
  • G: Tryptophan synthesis
20
Conserved spatial organization between distantly
related species suggests functional associations
between the genes
[Figure: genes A-G arranged along the genomes being compared]
  • A: Glucose metabolism
  • B: Glucose metabolism
  • C: Prediction: Glucose metabolism
  • D: Tryptophan synthesis
  • E: ?
  • F: Prediction: Tryptophan synthesis
  • G: Tryptophan synthesis
21
Outline
  • Evolution of genome organization
  • Why identify related genomic regions?
  • How do we find them?
  • Identification: Formal cluster definition
  • Validation: Testing cluster significance

22
Closely related genomes
[Figure: numbered genes along the chromosomes of Species 1 and Species 2]
Related regions, regions that descended from the
same region in the genome of the common ancestor,
are easy to identify
23
A hundred million years...
24
More Diverged Genomes
[Figure: numbered genes along the chromosomes of two more diverged genomes]
  • Related regions are harder to detect, but there
    is still spatial evidence of common ancestry
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved

25
The signature of diverged regions
[Figure: numbered genes along two diverged genomes, with gene clusters marked]
  • Gene clusters
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved

26
A Framework for Identifying Gene Clusters
  1. Find corresponding genes (given as input)
  2. Formally define a gene cluster (review the most common definition)
  3. Devise an algorithm to identify clusters
  4. Statistically verify clusters (my work)

27
Clusters are signatures of distantly related
regions.
  • Without functional constraints...
  • After sufficient time has passed, gene order will
    become randomized
  • Uniform random data tends to be clumpy
  • some genes will end up proximal in both genomes
    simply by chance

Not all clusters have biological significance.
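The claim that random gene orders still place some genes close together in both genomes can be checked with a small simulation. The sketch below is an illustration only: the function name, the genome size of 1000, and the gap of 2 are assumptions, and both genomes are assumed to contain the same n genes in independent, uniformly random orders.

```python
import random


def count_chance_pairs(n, g, rng):
    """Count gene pairs with at most g intervening genes in BOTH random genomes."""
    order1 = list(range(n))
    order2 = list(range(n))
    rng.shuffle(order1)
    rng.shuffle(order2)
    pos2 = {gene: i for i, gene in enumerate(order2)}
    count = 0
    for i, a in enumerate(order1):
        # Genes within gap g to the right of `a` in genome 1.
        for b in order1[i + 1 : i + g + 2]:
            # Keep the pair only if it is also within gap g in genome 2.
            if abs(pos2[a] - pos2[b]) - 1 <= g:
                count += 1
    return count


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = [count_chance_pairs(1000, 2, rng) for _ in range(20)]
print("average number of chance pairs:", sum(trials) / len(trials))
```

Any pairs counted here are proximal in both genomes purely by chance, which is why a significance test is needed before a cluster is interpreted biologically.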
28
Cluster Validation via Hypothesis Testing
  • Null hypothesis: random gene order
  • Reject gene clusters that could have arisen under
    the null model
  • Clusters that cannot be rejected are likely to be
    functionally constrained

29
Outline
  • Evolution of genome organization
  • Why find related genomic regions?
  • How do we find them?
  • Identification: max-gap cluster definition
  • Validation: Testing cluster significance

30
A max-gap chain
[Figure: example chain with labels for g (2) and a gap (3)]
  • The distance or gap between genes is equal to
    the number of intervening genes
  • A set of genes in a genome form a max-gap chain
    if
  • the gap between adjacent genes is never greater
    than g (a user-specified parameter)
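The chain condition can be written as a short check. This is a minimal sketch, assuming a genome is a list of distinct gene names and that every gene in the set occurs in the genome; the function name and the example genome are hypothetical.

```python
def is_max_gap_chain(genome, genes, g):
    """Return True if `genes` form a max-gap chain in `genome`.

    The gap between two genes is the number of genes lying between them.
    """
    index = {gene: i for i, gene in enumerate(genome)}
    positions = sorted(index[gene] for gene in genes)  # KeyError if a gene is absent
    return all(right - left - 1 <= g
               for left, right in zip(positions, positions[1:]))


# Hypothetical genome: one gene between a and b, three genes between b and c.
genome = ["a", "u", "b", "v", "w", "x", "c"]
print(is_max_gap_chain(genome, {"a", "b"}, g=2))
print(is_max_gap_chain(genome, {"a", "b", "c"}, g=2))
```

With g = 2 the pair {a, b} is a chain, while adding c breaks the chain because the gap between b and c is 3.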

31
Max-Gap cluster definition
[Figure: example cluster with labels for g (2) and a gap (3)]
  • A set of genes forms a max-gap cluster of two
    genomes if
  • the genes form a max-gap chain in each genome
  • the cluster is maximal (i.e. not contained within
    a larger cluster)
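For very small inputs the definition can be applied literally by testing every subset of the shared genes. The sketch below is a brute-force illustration, not an efficient algorithm and not the method used in the talk; it assumes a one-to-one gene mapping given by identical names, and the function names and example genomes are hypothetical.

```python
from itertools import combinations


def is_chain(genome, genes, g):
    """True if no two adjacent members of `genes` are more than g genes apart."""
    index = {gene: i for i, gene in enumerate(genome)}
    positions = sorted(index[gene] for gene in genes)
    return all(b - a - 1 <= g for a, b in zip(positions, positions[1:]))


def max_gap_clusters(genome1, genome2, g):
    """Enumerate maximal gene sets that are max-gap chains in both genomes."""
    shared = sorted(set(genome1) & set(genome2))
    chains = [set(subset)
              for size in range(1, len(shared) + 1)
              for subset in combinations(shared, size)
              if is_chain(genome1, subset, g) and is_chain(genome2, subset, g)]
    # Keep only sets that are not strictly contained in another qualifying set.
    return [c for c in chains if not any(c < other for other in chains)]


genome1 = ["a", "b", "c", "d", "x", "e"]
genome2 = ["c", "a", "d", "b", "e", "y"]
for cluster in max_gap_clusters(genome1, genome2, g=0):
    print(sorted(cluster))
```

The running time is exponential in the number of shared genes, so this is only useful for checking the definition on toy examples. In this example the genes a, b, c, d form one cluster even though their order is scrambled between the two genomes.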

32
Max-Gap cluster definition
[Figure: example cluster with labels for g (2 and 3) and a gap (3)]
  • A set of genes forms a max-gap cluster of two
    genomes if
  • the genes form a max-gap chain in each genome
  • the cluster is maximal (i.e. not contained within
    a larger cluster)

33
The max-gap definition is the most widely used
cluster definition in genomic analyses
  • Allows extensive rearrangement of gene order
  • Allows limited gene insertions and losses

There is no formal statistical model for max-gap
clusters
34
Outline
  • Evolution of genome organization
  • Why find related genomic regions?
  • How do we find them?
  • Identification: max-gap cluster definition
  • Validation: Testing cluster significance

35
The Questions
Suppose two whole genomes were compared, and this
max-gap cluster was identified
  • Is this cluster biologically meaningful?
  • Could it have occurred in a comparison of random
    genomes?

36
The Inputs
h = 4
  • n: number of genes in each genome
  • m: number of matching gene pairs
  • g: the maximum gap allowed in a cluster
  • h: number of matching genes in the cluster
37
The Problem
h = 4
  • What is the probability of observing a max-gap
    cluster
  • containing exactly h matching gene pairs
  • assuming the genomes are randomly ordered?

38
Probability of a cluster of size h
m genes
m-h genes
h genes
Basic approach:
  1. Enumerate all ways to create chains of h genes in
    both genomes
  2. Enumerate all ways to place the m-h remaining genes
    so they do not extend the cluster
  3. Normalize to get a probability

39
Probability of observing a cluster of size h
P(cluster of size h) =
  (number of ways to place h genes so they form a
   chain in both genomes)
  x (number of ways to place m-h remaining genes so
   they do not extend the cluster)
  / (all configurations of m gene pairs in two
   genomes of size n)
40
Total number of configurations of m gene pairs
in two genomes of size n
m genes
41
Probability of observing a cluster of size h
P(cluster of size h) =
  (number of ways to place h genes so they form a
   chain in both genomes)
  x (number of ways to place m-h remaining genes so
   they do not extend the cluster)
  / (all configurations of m gene pairs in two
   genomes of size n)
42
Number of ways to place h genes in two genomes so
they form a cluster
h genes
m genes
m-h genes
  1. Select h spots in each genome, so they form a
    max-gap chain
  2. Choose h genes to compose the cluster
  3. Assign each gene to a selected spot in each genome
43
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 places left
1 2 3 4 5 ... n-L+1 ... n
The maximum length of the chain is L = (h-1)g + h
44
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
45
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Chains near the end of the genome
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
1 2 3 4 5 ... n-L+1 ... n
46
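Number of ways to position h genes in a genome
of n genes so they form a max-gap chain:
  (n-L+1) x (g+1)^(h-1) + (chains starting near the end of the genome)
  • n-L+1: starting positions
  • (g+1)^(h-1): ways to place remaining h-1 genes
  • last term: starting positions near end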
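The count can be checked numerically. In the sketch below, the first term is the product described on the preceding slides (starting positions times gap choices); the closed form used for the chains near the end, (h-1) g (g+1)^(h-1) / 2, is my own derivation under the assumption n >= L and is not taken from the slides, so the code compares it against brute-force enumeration. All function names are hypothetical.

```python
from itertools import combinations


def count_chains(n, h, g):
    """Closed-form count of max-gap chains of h genes among n slots (assumes n >= L)."""
    L = (h - 1) * g + h  # maximum chain length
    if n < L:
        raise ValueError("this closed form assumes n >= (h-1)*g + h")
    # Starts that leave room for any choice of the h-1 gaps (each from 0 to g).
    interior = (n - L + 1) * (g + 1) ** (h - 1)
    # Starts in the last L-1 slots, where long chains no longer fit (derived, see lead-in).
    near_end = (h - 1) * g * (g + 1) ** (h - 1) // 2
    return interior + near_end


def count_chains_brute_force(n, h, g):
    """Count h-subsets of n slots whose adjacent members are at most g slots apart."""
    return sum(
        all(b - a - 1 <= g for a, b in zip(slots, slots[1:]))
        for slots in combinations(range(n), h)
    )


print(count_chains(10, 3, 2), count_chains_brute_force(10, 3, 2))
```

The brute-force version is only practical for small n, but agreement on small cases gives some confidence in the closed form before using it with realistic genome sizes.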

47
Probability of a cluster of size h
m-h genes
h genes
Basic approach: Enumerate all ways to
  1. Create chains of h genes in both genomes
  2. Place m-h remaining genes so they do not extend
    the cluster


48
Probability of observing a cluster of size h
P(cluster of size h) =
  (number of ways to place h genes so they form a
   chain in both genomes)
  x (number of ways to place m-h remaining genes so
   they do not extend the cluster)
  / (all configurations of m gene pairs in two
   genomes of size n)
49
Counting the number of ways to place m-h genes
outside the cluster
g = 1
h = 3
  • Approach
  • design a rule specifying where the genes can be
    placed so that the cluster is not extended
  • count the positions

50
Counting the number of ways to place m-h genes
outside the cluster
gap <= 1
g = 1
  • Rule 1: A gene can go anywhere except in the
    cluster (the white box).

Too lenient
51
Counting the number of ways to place m-h genes
outside the cluster
g = 1
  • Rule 2: Every gene must be at least g+1 positions
    from the cluster (outside the grey box).

Too strict
52
Counting the number of ways to place m-h genes
outside the cluster
gap > 1
g = 1
h = 3
  • Rule 2: Every gene must be at least g+1 positions
    from the cluster (outside the grey box).

Too strict
53
Counting the number of ways to place m-h genes
outside the cluster
g = 1
gap > 1
  • Rule 3: At most one member of each gene pair can
    be in the grey box.

Too lenient
54
Counting the number of ways to place m-h genes
outside the cluster
gap <= 1
g = 1
  • Rule 3: At most one member of each gene pair can
    be in the grey box.

Too lenient
55
Counting the number of ways to place m-h genes
outside the cluster
g = 1
  • Acceptable positions for a gene depend on the
    positions of the remaining genes
  • Use strict and lenient rules to calculate upper
    and lower bounds on G, the number of ways to place
    the m-h remaining genes

56
Estimating G
  • Upper bound
  • Erroneously enumerates this configuration
  • Lower bound
  • Fails to enumerate this configuration

57
Probability of observing a cluster of size h
  • number of ways to place h genes so they form a
    chain in both genomes
  • number of ways to place m-h remaining genes so
    they do not extend the cluster

Hoberman, Sankoff, Durand. Journal of
Computational Biology, 2005
58
What can we learn from this statistical result?
  • Are we less likely to observe a large cluster
    (containing more gene pairs) than a small
    cluster?
  • How large does a cluster have to be before we are
    surprised to observe it?
  • How do we choose the maximum allowed gap value?
    Larger values will
  • yield more clusters
  • more of these will be false positives

59
Whole-genome comparison cluster statistics
n = 1000, m = 250
g = 20
With a significance threshold of 10^-4, any
cluster containing 8 or more genes is significant.
h (cluster size)
60
Conclusion
  • Statistical analysis of max-gap gene clusters
  • Provides a principled approach for choosing a gap
    size that will yield significant clusters
  • Allows statistically significant max-gap clusters
    to be identified
  • Provides insight on criteria for cluster
    definitions

61
Odd properties of max-gap clusters
  1. A larger cluster may be less significant
  2. Moving a gene further away may make a cluster
    more likely

62
Acknowledgements
  • Barbara Lazarus Women@IT Fellowship
  • The Sloan Foundation
  • The Durand Lab

63
Thanks
64
Questions?
68
Cluster Significance: Related Work
  • Randomization tests
  • Requires complete genome (confusing!)
  • Not useful for choosing parameter values
  • Very simple models
  • Excessively strict simplifying assumptions
  • Overly conservative cluster definitions
  • A few more general statistical approaches
  • Not applicable to max-gap clusters

69
Groups find very different clusters when
analyzing the same data
70
Generative Models of Genome Rearrangement
  • Construct a probabilistic model specifying rates
    for each type of genomic rearrangement
  • Reject regions that are unlikely to have evolved
    via the model
  • Challenges
  • Relative rates of rearrangement processes are not
    known
  • requires identification of clusters
  • Rates may differ significantly
  • within regions of the genome
  • between species
  • over time (e.g. depending on population sizes)

71
Advantages of an analytical approach
  • Analyzing incomplete datasets
  • Principled parameter selection
  • Efficiency?
  • Accuracy?
  • Understanding statistical trends
  • Insight into tradeoffs between definitions

72
  • plot graph with fixed cluster size and varying
    maximum gap sizes
  • is it monotonic?
  • is a function of density and size monotonic?

73
  • not capturing
  • difference in density between max-gap clusters
  • partially conserved order

74
Identifying gene clusters
  1. Formally define a gene cluster
  2. Devise an algorithm to identify clusters
  3. Verify that clusters indicate common ancestry

...modeling
...algorithms
...statistics
75
Identifying gene clusters
  1. Formally define a gene cluster
  2. Devise an algorithm to identify clusters
  3. Verify that clusters indicate common ancestry

...modeling
...algorithms
...statistics
76
  • These are criteria: size and density
  • Hard to capture
  • The one I've chosen is widely used, but as we will
    see at the end of the talk, it has some problems

77
Genome
The complete set of genetic material of an
organism or species
Chromosome
A double-stranded molecule of DNA
Gene
CCCCGCCCCCCGCCCCCCCCCTCGTCTTCAGACCCTTAGCTAGACCTTTA
GGAGGATTAAAAATGAGGGAGAGGGGC
GGGGCGGGGGGCGGGGGGGGGAGCAGAAGTCTGGGAATCGATCTGGAAAT
CCTCCTAATTTTTACTCCCTCTCCCCG
A protein coding sequence
78
Genome
The complete set of genetic material of an
organism or species


TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
Genes: protein coding sequences
Large stretches of DNA with unknown function.
as an ordered list of genes
Regions where proteins bind to turn genes on and
off
79
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Example: h = 4 and g = 1
1 2 3 4 5 6 ... n-L+1 ... n
The maximum length of a chain: L = (h-1)g + h
80
Ways to place the remaining h-1 genes when the
gaps and length are constrained
1 2 3 4 5 ... n-L+1 ... n
l < L
  • Gaps are constrained
  • And sum of gaps is constrained

A known solution
81
[Figure: a chain of total length l with gaps g_1, g_2, g_3, ..., g_(m-1)]
A known solution
82
Counting chains at the end of the genome
  • Gaps are constrained
  • And sum of gaps is constrained

l w-1
l h
83
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Chains near the end of the genome
Ways to place the remaining h-1 genes, so no gap
exceeds g
1 2 3 4 5 ... n-L+1 ... n
1 ... L-h
84
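Number of ways to position h genes in a genome
of n genes so they form a max-gap chain:
  (n-L+1) x (g+1)^(h-1) + (chains starting near the end of the genome)
  • n-L+1: starting positions
  • (g+1)^(h-1): ways to place remaining h-1 genes
  • last term: starting positions near end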

85
Whole-genome comparison cluster statistics
n = 1000, m = 250
g = 10
g = 20
Cluster Probability
h (cluster size)
87
Constructive Approach
Number of configurations that contain a cluster
of exactly size h
number of ways to place m-h remaining genes so
they do not extend the cluster
number of ways to position h genes so they form a
chain in both genomes
number of ways to position h genes so they form a
chain in a single genome
88
Constructive Approach
Number of configurations that contain a cluster
of exactly size h
number of ways to place m-h remaining genes so
they do not extend the cluster
number of ways to position h genes so they form a
cluster in both genomes
89
Building Phylogenetic Trees
Genes may be laterally transferred between
distantly related species
AAACATTTT E. coli
GTCGGTTGG E. coli
AAACATTTA Salmonella
AAACGTTTC Chlamydia
GTCGGTTGC Thermococcus
GTCAGTTGC Methanococcus
  • Trees are often constructed based on a single
    gene
  • species with the fewest differences between their
    gene sequences are grouped together in the tree
  • The history of a gene may not indicate the
    history of the species
  • Construct trees based on evidence
  • from the whole genome
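The grouping rule on this slide (fewest differences between gene sequences) can be illustrated with the six sequences shown. This is a sketch only: it uses a plain Hamming distance, which assumes the sequences are already aligned and of equal length, and the labels "(1)" and "(2)" for the two E. coli sequences are mine.

```python
from itertools import combinations

# Sequences as listed on the slide; the "(1)"/"(2)" suffixes are added labels.
sequences = {
    "E. coli (1)": "AAACATTTT",
    "Salmonella": "AAACATTTA",
    "Chlamydia": "AAACGTTTC",
    "E. coli (2)": "GTCGGTTGG",
    "Thermococcus": "GTCGGTTGC",
    "Methanococcus": "GTCAGTTGC",
}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# List species pairs from most to least similar.
pairs = sorted(
    combinations(sequences, 2),
    key=lambda pair: hamming(sequences[pair[0]], sequences[pair[1]]),
)
for first, second in pairs[:4]:
    print(first, "/", second, hamming(sequences[first], sequences[second]))
```

The second E. coli sequence is far closer to the Thermococcus and Methanococcus sequences than to the first E. coli sequence, which is the situation the slide attributes to lateral transfer: a tree built from that gene alone would misplace E. coli.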

90
An Essential Task for Spatial Comparative Genomics
Identify gene clusters, groups of genes that are
derived from the same chromosomal region in an
ancestral genome
[Figure: numbered genes along the chromosomes being compared]
91
Phylogenetic Trees
[Figure: phylogenetic tree of Human, Chimp, Mouse, Rat, Dog, and Possum; time axis from 100 to 0 million years ago]
  • Describe evolutionary relationships between
    species
  • each internal node represents the most recent
    common ancestor of the descendants
  • edge lengths correspond to time estimates.

92
Building Phylogenetic Trees
  • Human: AAACATTTTA
  • Chimp: AAATATTTA
  • Mouse: AACATTTTG
  • Rat: AACATTTCG
  • Dog: ATCAGTTGC
  • Opossum: TGCACTTGT

Physiological features shown: opposable thumbs (human, chimp), single pair of incisors (mouse, rat), flesh shearing teeth (dog), no placenta and opposable thumbs (opossum)
  • Trees can be built from
  • Physiological features
  • Gene sequences
  • Spatial genome organization

Species with the fewest differences between their
gene sequences are grouped together
93
Whole-genome phylogenies based on spatial
organization
  1. Find gene clusters
  2. Determine the minimum number of rearrangements
    between genome pairs
  3. Use rearrangement distances to build phylogenies

Guillaume Bourque et al. Genome Res. 2004; 14: 507-516
94
Conserved spatial organization between distantly
related species suggests functional associations
between the genes
Snel, Bork, Huynen. PNAS 2002
[Figure: genes A-E arranged along several genomes, with genes of unknown function marked "?"]
96
Statistical Testing Provides Additional Evidence
for Common Ancestry
  • How can we verify that a gene cluster indicates
    common ancestry?
  • True histories are rarely known
  • Experimental verification is often not possible
  • Rates and patterns of large-scale rearrangement
    processes are not well understood

97
Constructive Approach: Enumerating configurations
that contain a cluster of exactly h gene pairs
h genes
m genes
m-h genes
  1. Select h spots in each genome, so that they form
    a max-gap chain
  2. Choose h genes to compose the cluster
  3. Assign each gene to a selected spot in each
    genome
  4. Choose the location of the remaining m-h genes so
    they don't extend the cluster

98
Where are the gene clusters?
  • Intuitive notions of what clusters look like
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved
  • Need more rigorous criteria

99
Ways to place the remaining h-1 genes when the
gaps and length are constrained
1 2 3 4 5 ... n-L+1 ... n
l < L
A known solution
but not closed form
100
Ways to place the remaining h-1 genes when the
gaps and length are constrained
1 2 3 4 5 ... n-L+1 ... n
1 ... L-h
101
Future Work
  • Evaluate
  • Developed statistical tests for max-gap clusters
    identified by whole-genome comparison using a
    combinatoric approach
  • Results raise concerns about current methods used
    in comparative genomics studies

102
What characteristics should we use to evaluate a
cluster?
  • Extent of gene loss/insertion
  • Density? (constrained by def to 1/g)
  • Number of insertions/deletions between matches
    (constrained to g)
  • Size of fragment
  • Number of matching genes (unconstrained)
  • Degree of rearrangement
  • Number of order violations (unconstrained)

103
Assumptions
  • A single, linear chromosome
  • The mapping between genes is one-to-one

104
Evaluate clusters based on size
gap > 3
size = 4
  • The size of a cluster is the number of matching
    gene pairs it contains

106
Existing Algorithms Impose Order Constraints
g = 2
  • Typical approaches to finding max-gap clusters
    use a greedy, agglomerative algorithm
  • initialize a cluster as a single matching gene
    pair
  • search for a gene in proximity in both genomes
  • either extend the cluster and repeat, or
    terminate and choose a new seed
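The greedy, agglomerative strategy described above can be sketched as follows. This is one plausible reading of the description, not a specific published implementation: the function name is hypothetical, "in proximity" is taken to mean within gap g of some gene already in the cluster, and genes are assumed to be matched one-to-one by name.

```python
def greedy_cluster(genome1, genome2, seed, g):
    """Grow a cluster from `seed`, adding one gene at a time.

    A gene is added only if it lies within gap g of some cluster member
    in BOTH genomes; growth stops when no such gene exists.
    """
    pos1 = {gene: i for i, gene in enumerate(genome1)}
    pos2 = {gene: i for i, gene in enumerate(genome2)}
    shared = set(pos1) & set(pos2)
    cluster = {seed}
    extended = True
    while extended:
        extended = False
        for gene in sorted(shared - cluster):
            near1 = any(abs(pos1[gene] - pos1[c]) - 1 <= g for c in cluster)
            near2 = any(abs(pos2[gene] - pos2[c]) - 1 <= g for c in cluster)
            if near1 and near2:
                cluster.add(gene)
                extended = True
                break  # rescan, since the cluster boundary has moved
    return cluster


# The greedy search succeeds when order is mostly conserved...
print(sorted(greedy_cluster([1, 2, 3, 9, 4], [3, 1, 8, 2, 4], seed=1, g=1)))
# ...but stalls at every seed on a highly disordered example with g = 0.
for seed in "abcd":
    print(seed, sorted(greedy_cluster(list("abcd"), list("cadb"), seed, g=0)))
```

In the second example the four genes a, b, c, d are contiguous in both genomes, so together they satisfy the max-gap chain condition with g = 0, yet no single gene is adjacent to any seed in both genomes and the greedy search never gets past the seed.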

107
Algorithms and Definition Mismatch
g = 2
A max-gap cluster of size four
  • Agglomerative algorithms will not find highly
    disordered max-gap clusters
  • A divide-and-conquer algorithm has been developed
    (Bergeron et al, 2002)
  • this work is not known by the biological community

108
Future Work
  • Generalize the model
  • Remove the assumption that gene correspondences
    are one-to-one
  • Evaluate clusters based on
  • density, e.g. size and total gaps
  • the degree to which order is conserved
  • Take phylogenetic distance into account
  • for more closely related species, random gene
    order is not a reasonable null hypothesis

109
In bacteria, genes in the same pathway often
occur together in the genome
Tryptophan Synthesis Pathway
[Figure: pathway from Chorismate through Anthranilate,
N-5-Phosphoribosyl-anthranilate,
Enol-1-o-carboxy phenylamino-1-deoxyribulose phosphate, and
Indole-3-glycerol phosphate to L-Tryptophan, with steps labeled
trpD trpE, trpCF, and trpA trpB]
E. coli genes: trpCF, trpD, trpB, trpA, trpE
Bacillus subtilis genes: trpC, trpB, trpA, trpE, trpD, trpF
110
Speciation
An ancestral species: a uniform population
111
Speciation
  1. Initially the two populations have identical
    genomes
  2. The populations evolve independently
  3. Eventually, there will be two species with
    similar but distinct genomes

112
Time passes,
more rearrangements accumulate
113
Common blocks are now harder to detect, but there
is still evidence of common ancestry
[Figure: numbered genes along the chromosomes of two diverged genomes]
  • Gene clusters
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved

114
Gene Clusters
[Figure: numbered genes along two diverged genomes, with gene clusters marked]
  • Intuitive notions of what clusters look like
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved
  • Need more rigorous criteria

115
Genome
The genetic material of an organism or
species. Specifies the complete blueprint for the
organism
Chromosome
A long double-stranded molecule of DNA
TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
Gene
A DNA sequence that encodes a protein. Proteins
are the building blocks of cells
116
  • Benoit's outline
  • example and a little motivation
  • here are the issues; in order to solve this we
    need to cluster
  • ways to cluster exist but we don't know how good
    they are
  • want to have a statistical way of measuring it
  • cluster def

117
What are the processes of genomic change?
  • Small-scale point mutations
  • Change gene sequences
  • Large-scale genomic rearrangements
  • Change gene content and order

118
In bacteria, genes in the same pathway often
occur together in the genome
Tryptophan Synthesis Pathway
[Figure: pathway from Chorismate through Anthranilate,
N-5-Phosphoribosyl-anthranilate,
Enol-1-o-carboxy phenylamino-1-deoxyribulose phosphate, and
Indole-3-glycerol phosphate to Tryptophan, with steps labeled
trpD trpE, trpCF, and trpA trpB]
E. coli genes: trpCF, trpD, trpB, trpA, trpE
Bacillus subtilis genes: trpD, trpC, trpB, trpA, trpE, trpF
119
Human genome
Human Chromosome 21 is broken into at least three
pieces in mouse
Accidental duplication of chromosome 21 causes
Down Syndrome
Human genome
X is scrambled but conserved
Mouse genome as scrambled human genome
Guillaume Bourque et al. Genome Res. 2004; 14: 507-516
120
Other applications
  • build evolutionary trees based on rearrangements
  • detect ancient whole genome duplications
  • identify operons
  • estimate rearrangement frequencies
  • ...

121
Common Blocks
regions that descended from the same region in
the genome of the common ancestor
[Figure: numbered genes along the chromosomes of Species 1 and Species 2]
122
Common Blocks
are harder to detect between more distantly
related organisms, but there is still evidence of
common ancestry
[Figure: numbered genes along the chromosomes of Species 1 and Species 2]
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved

123
[Figure: numbered genes along the chromosomes of two diverged genomes]
  • Gene clusters
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved

124
Inputs
  1. Two genomes (i.e., ordered lists of genes)
  2. A mapping of corresponding genes

125
Hypothesis Testing
  • Null hypothesis: random gene order
  • Alternate hypothesis: shared ancestry
  • Reject clusters that could have arisen under the
    null model

126
Number of ways to position h genes in a genome
of n genes so they form a max-gap chain
Probability that h randomly placed genes will
form a chain in a genome of n genes
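Dividing the chain count by the number of ways to choose h of the n positions gives this probability. The sketch below assumes the h genes are placed uniformly at random without overlap and that n >= (h-1)g + h; the closed-form chain count is my own derivation rather than a formula quoted from the slides, and the value g = 20 in the usage loop is an assumed parameter.

```python
from math import comb


def chain_probability(n, h, g):
    """Probability that h randomly placed genes form a max-gap chain among n slots."""
    L = (h - 1) * g + h  # maximum chain length
    if n < L:
        raise ValueError("this sketch assumes n >= (h-1)*g + h")
    # Chains whose start leaves room for any gap choice, plus chains near the end.
    chains = ((n - L + 1) * (g + 1) ** (h - 1)
              + (h - 1) * g * (g + 1) ** (h - 1) // 2)
    return chains / comb(n, h)


for h in (2, 4, 6, 8):
    print(h, chain_probability(1000, h, g=20))
```

For fixed n and g the probability falls quickly as h grows, which is the trend that makes larger chains more surprising under the random-order null hypothesis.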

127
Probability of h randomly placed genes forming a
chain
n = 1000 (total genes in genome)
h (size of the chain)
128
Number of ways to place h genes in two genomes so
they form a cluster
h genes
m genes
m-h genes
Choose h genes to compose the cluster
Assign each gene to a selected spot in each genome
Select h spots in a genome, so they form a
max-gap chain
129
Calculating the Numerator: Enumerate the
configurations that contain a cluster of exactly
h gene pairs
h genes
m genes
m-h genes
  1. Select h spots in a genome, so they form a
    max-gap chain
  2. Choose h genes to compose the cluster
  3. Assign each gene to a selected spot in each genome
  4. Choose the location of the remaining m-h genes so
    they don't extend the cluster
130
Closely related genomes
[Figure: numbered genes along the chromosomes of Species 1 and Species 2]
Related regions, regions that descended from the
same region in the genome of the common ancestor,
are easy to identify
131
More Diverged Genomes
[Figure: numbered genes along the chromosomes of two more diverged genomes]
  • Related regions are harder to detect, but there
    is still spatial evidence of common ancestry
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved

132
Genome Comparison
[Figure: numbered genes along the chromosomes of Species 1 and Species 2]
Our goal: identify chromosomal regions that
descended from the same region in the genome of
the common ancestor
133
Comparing Genomes
75 Million years
            Chromosomes   Millions of nucleotides   Genes
Human       23            2900                      20-25k
Mouse       20            2500                      20-25k
Fly         4             180                       13.6k
Rice        12            430                       40k
E. coli     1             4.7                       3200
Chlamydia   1             1                         936
134
Comparing Genomes
75 Million years
            Chromosomes   Genes
Human       23            20-25k
Mouse       20            20-25k
Fly         4             13.6k
Rice        12            40k
E. coli     1             3200
Chlamydia   1             936