Title: Terminology
2Terminology
- Homologous: related through common ancestry
- Orthologous: related through speciation
- Paralogous: related through duplication
[Figure: numbered genes in Species 1 and Species 2, with orthologs and paralogs labeled]
3Identification of Orthologous Genes
- The identification of orthologous genes is a prerequisite for a marker-based approach
- Orthology identification
  - is often difficult to determine from gene sequence alone
  - is an important unsolved research problem
  - can be improved by incorporating genomic context
4An example: Which gene is the true ortholog?
[Figure: a query gene in Species 1 compared with candidate genes in Species 2, ordered from most similar to least similar; labels give each match's rank (1st of 4 through 4th of 4, and 1st of 1)]
5- Problem: for more diverged genomes, unambiguous orthologs will be sparse and clusters will be more rearranged
- Solution: identify orthologs and gene clusters simultaneously
Identify homologous genes
Find gene clusters
Similar genomic context
6Two approaches
- Minimize rearrangements
- Maximize conserved structure
7- Work that combines sequence similarity and genomic context
  - Bansal, Bioinformatics 99
  - Kellis et al, J Comp Biol 04
  - Bourque et al, RECOMB Comp Genomics 05
  - Chen et al, ACM/IEEE Trans Comput Biol and Bioinf 05
- Limitations
  - No flexible cluster definitions
  - No statistical approaches
  - Little real evaluation
8Why we need more flexible cluster definitions and thus statistics
- Give an example from yeast where longest subsequence or Blins boxes fail?
- Also they have to use ad hoc filters?
9Solution
- Use max-gap clusters
- Show how it works on the example.
10- However, which cluster is more conserved?
- Show two clusters, one larger but with more gaps, the other smaller and denser?
- Need a way to rank them.
- Use p-value as measure of degree of conservation.
11- Discuss algorithmic challenges? Why does monotonicity help us here?
12- Evaluation in progress
- Say what data set I'm using?
15Basic Genome Model
- a sequence of unique genes
- distance between genes is equal to the number of intervening genes
- gene orientation unknown
- a single, linear chromosome
16Inputs
- Two genomes (i.e., ordered lists of genes)
- A mapping of corresponding genes
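A minimal sketch of the genome model and inputs described on the two slides above, assuming each genome is stored as a Python list of unique gene identifiers and the mapping as a dictionary. The gene names and the helper `gene_distance` are hypothetical and chosen only for illustration.

```python
def gene_distance(genome, gene_a, gene_b):
    """Distance in the basic genome model: the number of intervening genes."""
    i, j = genome.index(gene_a), genome.index(gene_b)
    return abs(i - j) - 1


# Hypothetical toy inputs: two gene orders and a mapping of corresponding genes.
genome_1 = ["a1", "a2", "a3", "a4", "a5"]
genome_2 = ["b3", "b1", "b5", "b2"]
homologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5"}

# a1 and a4 are separated by a2 and a3.
print(gene_distance(genome_1, "a1", "a4"))
# The homologs of a1 and a2 are separated by one gene (b5) in genome 2.
print(gene_distance(genome_2, homologs["a1"], homologs["a2"]))
```

Under this convention adjacent genes are at distance 0, which matches the "number of intervening genes" wording; orientation is deliberately not represented.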
17Whole Genome Comparison m n
Two genomes of n genes with m homologous gene pairs
g ≤ 3
g ≤ 3
- What is the probability of observing a maximal max-gap cluster of size exactly h, if the genes in both genomes are randomly ordered?
- A cluster is maximal if it is not a subset of a larger cluster
19Processes of genomic change
- Small-scale point mutations
- Change gene sequences
- Large-scale genomic rearrangements
- Change gene content and order
20Processes of genomic change
[Figure: double-stranded DNA sequence with noncoding DNA, genes, and regulatory regions labeled]
- Small-scale point mutations
- Change gene sequences
- Large-scale genomic rearrangements
- Change gene content and order
21Noncoding DNA
[Figure: double-stranded DNA sequence with regulatory regions and genes labeled]
22Building Phylogenetic Trees
Genes may be laterally transferred between distantly related species
AAACATTTT E. coli
GTCGGTTGG E. coli
AAACATTTA Salmonella
AAACGTTTC Chlamydia
GTCGGTTGC Thermococcus
GTCAGTTGC Methanococcus
- Trees are often constructed based on a single gene
  - species with the fewest differences between their gene sequences are grouped together in the tree
- The history of a gene may not indicate the history of the species
- Construct trees based on evidence from the whole genome
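As a rough illustration of grouping species by the fewest sequence differences, the sketch below counts mismatches between the short sequences shown on this slide. Treating the six sequences as two separate genes (here called `gene_x` and `gene_y`, both hypothetical names) is an assumption based on E. coli appearing twice; real analyses would use full alignments and a proper tree-building method.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


# Sequences copied from the slide; the split into two genes is assumed.
gene_x = {"E. coli": "AAACATTTT", "Salmonella": "AAACATTTA", "Chlamydia": "AAACGTTTC"}
gene_y = {"E. coli": "GTCGGTTGG", "Thermococcus": "GTCGGTTGC", "Methanococcus": "GTCAGTTGC"}


def closest_to(reference, sequences):
    """Return the species with the fewest differences from the reference."""
    distances = {
        species: hamming(sequences[reference], seq)
        for species, seq in sequences.items()
        if species != reference
    }
    return min(distances, key=distances.get), distances


print(closest_to("E. coli", gene_x))
print(closest_to("E. coli", gene_y))
```

The two genes pair E. coli with different nearest neighbours, which is the kind of conflict between gene history and species history that the bullets describe.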
23Whole-genome phylogenies based on spatial organization
- Find gene clusters
- Determine the minimum number of rearrangements between genome pairs
- Use rearrangement distances to build phylogenies
- Effects of cluster choice?
Guillaume Bourque et al. Genome Res. 2004; 14: 507-516
24Temporal coherence
before
time
now
- Divergence times of homologous pairs within a block should agree
25Genomic Change
Ancestral chromosome
Large-scale duplication
chromosome 2
chromosome 1
Sequence Mutation Chromosomal Rearrangements
26Groups find very different clusters when analyzing the same data
27Identifying gene clusters
- Formally define a gene cluster
- Devise an algorithm to identify clusters
- Verify that clusters indicate common ancestry
...modeling
...algorithms
...statistics
28Order and Orientation
density 6/8
density 6/8
- Local rearrangements will cause both gene order and orientation to diverge
- Overly stringent order constraints could lead to false negatives
- Partial conservation of order and orientation provides additional evidence of regional homology
29Symmetry
A
B
A
B
?
clusters found
clusters found
- Many existing cluster algorithms are not symmetric with respect to chromosome
30Cluster definitions in the literature
Descriptive
- r-windows (many references)
- connected components (Pevzner and Tesler 03)
- common intervals (Uno and Yagiura 00)
- max-gap (many references)
Constructive
- LineUp (Hampson et al 03)
- CloseUp (Hampson et al 05)
- FISH (Calabrese et al 03)
- AdHoRe (Vandepoele et al 02)
- Gene teams (Bergeron et al 02)
- greedy max-gap (Hokamp 01)
Require search algorithms
Harder to reason about formally
31size = 5, length = 12
density = 5/12
gap = 3 genes
- Cluster Parameters
  - size: number of homologous pairs in the cluster
  - length: total number of genes in the cluster
  - density: proportion of homologous pairs (size/length)
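The cluster parameters defined above can be computed directly from the positions of the homologous genes on one chromosome. This is a small sketch; the function name `cluster_parameters` and the example positions are hypothetical, picked so that they reproduce the size 5, length 12 labels on the slide.

```python
def cluster_parameters(marked_positions):
    """Size, length, density and largest gap of a cluster on one chromosome.

    marked_positions: indices of the genes that have a homolog in the cluster.
    """
    pos = sorted(marked_positions)
    size = len(pos)
    length = pos[-1] - pos[0] + 1          # all genes spanned by the cluster
    gaps = [b - a - 1 for a, b in zip(pos, pos[1:])]  # intervening genes
    return {
        "size": size,
        "length": length,
        "density": size / length,
        "max_gap": max(gaps, default=0),
    }


# Hypothetical positions: 5 homologous genes spread over 12 genes.
params = cluster_parameters([0, 1, 5, 8, 11])
print(params)
```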
32gap ≤ g
gap ≤ g
gap ≤ g
max-gap clusters
- cluster grows to its natural size
- cluster of size m may be of length m to g(m-1) + m
- maximal length grows as size grows
length ≤ r
r-windows
- cluster size is constrained
- cluster of size m may be of length m to r
- maximal length is fixed, regardless of cluster size
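To make the contrast between the two definitions concrete, here is a sketch that applies both to the same set of marked gene positions on a single chromosome. The helper names `max_gap_chains` and `best_r_window` and the example positions are assumptions for illustration; this is the one-chromosome view only, not a pairwise cluster finder.

```python
def max_gap_chains(positions, g):
    """Split marked positions into maximal runs whose adjacent gaps are <= g."""
    chains = []
    for p in sorted(positions):
        # Extend the current chain if at most g genes intervene.
        if chains and p - chains[-1][-1] - 1 <= g:
            chains[-1].append(p)
        else:
            chains.append([p])
    return chains


def best_r_window(positions, r):
    """Largest number of marked genes found in any window of r genes."""
    pos = sorted(positions)
    # An optimal window can always be shifted to start at a marked gene.
    return max(sum(1 for q in pos if p <= q < p + r) for p in pos)


marked = [1, 3, 6, 9, 20, 22]   # hypothetical positions of marked genes
print(max_gap_chains(marked, g=2))
print(best_r_window(marked, r=6))
```

With g = 2 the first chain keeps growing to four genes spanning nine positions, while a window of r = 6 can never hold more than three of them, which mirrors the "natural size" versus "fixed length" bullets.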
33A tradeoff: local vs global density
- max-gap
  - constrains local density
  - only weakly constrains global density (≥ 1/(g+1))
- r-window
  - constrains global density
  - only weakly constrains local density (maximum possible gap is r - size)
34Even when global density is high, a region may not be locally dense
[Figure label: density = 12/18]
35Formalizing these intuitive notions
Chromosome 17
10 genes duplicated out of 100
29 genes
Chromosome 3
McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.
- Similar but not identical gene content
  - density constraints
- Gene order is not perfectly preserved
  - order violation constraints
The degree of constraint varies widely among definitions
36The Max-Gap Definition is the Most Widely Used in Genomic Analyses
- Blanc et al 2003, recent polyploidy in Arabidopsis
- Venter et al 2001, sequence of the human genome
- Overbeek et al 1999, inferring functional coupling of genes in bacteria
- Vandepoele et al 2002, duplications in Arabidopsis through comparison with rice
- Vision et al 2000, duplications in Eukaryotes
- Lawrence and Roth 1996, identification of horizontal transfers
- Tamames 2001, evolution of gene order conservation in prokaryotes
- Wolfe and Shields 1997, ancient yeast duplication
- McLysaght 2002, genomic duplication during early chordate evolution
- Coghlan and Wolfe 2002, comparing rates of rearrangements
- Seoighe and Wolfe 1998, genome rearrangements after duplication in yeast
- Chen et al 2004, operon prediction in newly sequenced bacteria
- Blanchette et al 1999, breakpoints as phylogenetic features
- ...
37r-window statistics for a two-way cluster
size m = 4, length r = 7
r-m
r-m
m
- Given: length, the total number of genes in each window
- Test statistic: size, the number of homologous pairs in the cluster
- P-value: the probability of two arbitrary windows of r genes containing m genes in common, under the null hypothesis
- Durand and Sankoff, J Comp Biology, 2003
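One simple way to put numbers on this test is a hypergeometric calculation. The sketch below assumes a simplified null model that is not taken from the cited paper: both windows are random sets of r genes drawn from the same universe of n genes, with every gene having exactly one counterpart. The function names are hypothetical, and the p-value is taken as the chance of sharing at least m genes.

```python
from math import comb


def shared_exactly(n, r, k):
    """P(two random r-gene windows share exactly k genes), hypergeometric."""
    return comb(r, k) * comb(n - r, r - k) / comb(n, r)


def window_pvalue(n, r, m):
    """P(two random r-gene windows share at least m genes)."""
    return sum(shared_exactly(n, r, k) for k in range(m, r + 1))


# Parameters of the slide's picture (m = 4, r = 7) in a hypothetical
# genome of n = 1000 genes.
print(window_pvalue(n=1000, r=7, m=4))
```

The cited work treats the null hypothesis more carefully, so this should be read only as an order-of-magnitude illustration of how size and window length enter the test.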
38The max-gap definition is the most widely used cluster definition in genomic analyses
- Allows extensive rearrangement of gene order
- Allows limited gene insertions and losses
- Allows the cluster to grow to its natural size
There is no formal statistical model for max-gap clusters
39Whole Genome Comparison of Human with Human
McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.
Could this pattern have occurred by chance?
40McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.
Chromosome 17
Clusters with similarity to human chromosome 17
- Are larger clusters more likely to occur by chance?
- Are there other duplicated segments that their method did not detect?
41Max-Gap Cluster Statistics
- Reference set
- complete clusters
- complete clusters with length restriction
- incomplete clusters
- Whole genome comparison
- upper bound
- lower bound
- Hoberman, Sankoff, and Durand. Journal of Computational Biology 2005.
- Hoberman, Sankoff, and Durand. RECOMB Comparative Genomics 2004.
42Reference set, complete clusters
Given: a genome G = 1, ..., n of unique genes
a set of m genes of interest (in blue)
m = 5
- Do all m blue genes form a significant max-gap cluster?
43Reference set, complete clusters
g = 2
m = 5
- Test statistic: the maximum gap observed between adjacent blue genes
- P-value: the probability of observing a maximum gap ≤ g, under the null hypothesis
44Compute probabilities by counting
Probability = (permutations where the maximum gap ≤ g) / (all possible unlabeled permutations)
The problem is how to count the numerator
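For very small genomes the counting can be done by brute force, which is useful as a check on any closed-form expression. The sketch below assumes the null hypothesis places the m genes of interest uniformly at random among n positions, and enumerates every placement; the function names are hypothetical and the approach is only feasible for toy sizes.

```python
from itertools import combinations
from math import comb


def max_gap(positions):
    """Largest number of intervening genes between adjacent marked genes."""
    return max((b - a - 1 for a, b in zip(positions, positions[1:])), default=0)


def count_complete_clusters(n, m, g):
    """Number of placements of m genes among n with every gap at most g."""
    return sum(
        1 for placement in combinations(range(n), m) if max_gap(placement) <= g
    )


def complete_cluster_probability(n, m, g):
    """Fraction of all C(n, m) placements that form one max-gap cluster."""
    return count_complete_clusters(n, m, g) / comb(n, m)


print(count_complete_clusters(10, 3, 1), comb(10, 3))
print(complete_cluster_probability(10, 3, 1))
```

Enumeration grows as C(n, m), so it cannot replace an analytical count for realistic genome sizes.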
45Adding edge effects
Hoberman, Sankoff, Durand. JCB 2005.
- I used this equation to calculate probabilities for various parameter values
46Probability of h randomly placed genes forming a chain
n = 1000 (total genes in genome)
h (size of the chain)
47Probability of a complete cluster
n = 500
48Using statistics to choose parameter values
Significant Parameter Values (α = 0.001)
n = 500
49Max-Gap Cluster Statistics
- Reference set
- complete clusters
- complete clusters with length restriction
- incomplete clusters
- Whole genome comparison
- upper bound
- lower bound
- Hoberman, Sankoff, and Durand. Journal of Computational Biology 2005.
- Hoberman, Sankoff, and Durand. RECOMB Comparative Genomics 2004.
50Larger clusters do not always imply greater significance
- A max-gap cluster containing many genes may be more likely to occur by chance than one containing few genes
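A toy calculation can show this effect in the simplest setting, the complete-cluster case on a single genome. The sketch below counts placements of m genes among n positions in which every adjacent gap is at most g, using a small dynamic program over the sum of position differences. The function names and the tiny parameters (n = 10, g = 1) are assumptions for illustration, and the edge handling here is the plain linear-chromosome count rather than the equation from the cited papers.

```python
from math import comb


def count_chains(n, m, g):
    """Placements of m genes among n positions with all adjacent gaps <= g."""
    # ways[s]: number of difference sequences (each in 1..g+1) summing to s,
    # where s is the span between the first and last gene.
    ways = {0: 1}
    for _ in range(m - 1):
        nxt = {}
        for s, w in ways.items():
            for d in range(1, g + 2):
                if s + d < n:               # the chain must fit in the genome
                    nxt[s + d] = nxt.get(s + d, 0) + w
        ways = nxt
    # A chain with span s can start at any of n - s positions.
    return sum(w * (n - s) for s, w in ways.items())


def chain_probability(n, m, g):
    return count_chains(n, m, g) / comb(n, m)


for m in range(2, 11):
    print(m, round(chain_probability(10, m, 1), 3))
```

In this toy case the probability falls from m = 2 to m = 4 and then rises again, reaching 1 when all ten genes are marked, so a larger cluster is not automatically the more surprising one.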
51Why max-gap and r-windows?
Chromosome 17
10 genes duplicated out of 100
29 genes
Chromosome 3
McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.
- Allow gene insertions and deletions (but differ on exactly how many)
- Do not enforce arbitrary constraints on gene order
- Descriptive definitions that can be reasoned about formally
- Commonly used in genomics studies
52Greedy Algorithms Impose Order Constraints
g = 2
- A greedy, agglomerative algorithm
  - initializes a cluster as a single homologous pair
  - searches for a gene in proximity on both chromosomes
  - either extends the cluster and repeats, or terminates
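The three steps above can be sketched as follows. This is one possible reading, not any published tool: a homologous pair is a tuple (position on chromosome 1, position on chromosome 2), and a candidate is "in proximity" if it lies within g intervening genes of some pair already in the cluster on both chromosomes. The function names and the four-gene example are hypothetical.

```python
def greedy_cluster(pairs, seed, g):
    """Grow a cluster from a seed pair by repeatedly adding a nearby pair."""
    cluster = [seed]
    remaining = [p for p in pairs if p != seed]
    extended = True
    while extended:
        extended = False
        for cand in remaining:
            near = any(
                abs(cand[0] - member[0]) <= g + 1
                and abs(cand[1] - member[1]) <= g + 1
                for member in cluster
            )
            if near:
                cluster.append(cand)
                remaining.remove(cand)
                extended = True
                break                      # rescan from the updated cluster
    return sorted(cluster)


def is_max_gap_chain(positions, g):
    """True if adjacent positions are separated by at most g genes."""
    pos = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


# Four homologous pairs: adjacent on each chromosome, but in scrambled order.
pairs = [(1, 3), (2, 1), (3, 4), (4, 2)]

print([greedy_cluster(pairs, seed, g=0) for seed in pairs])
print(is_max_gap_chain([p[0] for p in pairs], 0),
      is_max_gap_chain([p[1] for p in pairs], 0))
```

With g = 0 the four pairs satisfy the max-gap condition on both chromosomes, yet the greedy search never gets past its seed because no two pairs are neighbours on both chromosomes at once. That is the implicit order constraint the slide title refers to.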
53Identification of homologous chromosomal segments is a key task in comparative genomics
- Genome evolution
- Reconstruct history of chromosomal rearrangements
- Infer ancestral genetic map
- Phylogeny reconstruction
- Identify ancient whole genome duplications
Ancestral chromosome
Whole genome duplication
chr 1
chr 2
54Odd properties of max-gap clusters
- A larger cluster may be less significant
- Moving a gene further away may make a cluster more likely
55r-window statistics for pairwise comparison
size m = 4, length r = 7
r-m
m
r-m
- Test statistic: size, the number of homologous pairs in the cluster
- P-value: what is the probability of two arbitrary windows of r genes containing m genes in common, under the null hypothesis
- Durand and Sankoff, J Comp Biology, 2003
56Where are the gene clusters?
- Intuitive notions of what clusters look like
- Similar but not identical gene content
- Gene order is not perfectly preserved
- Need a more rigorous definition
57What properties???
3 genes
K. lactis Chromosome 5
S. cerevisiae Chromosome 16
8 homologous genes out of 17
- Allow gene insertions and deletions
Figure from the Yeast Genome Browser, Byrne and Wolfe, Genome Res. 2005
58What properties???
- Allow gene insertions and deletions
- Do not enforce arbitrary constraints on gene order
- Formal definitions can be reasoned about formally
- Commonly used in genomics studies
Figure from McLysaght et al Nature Genetics, 2002.
59Cluster definitions in the literature
Descriptive
- r-windows (many references)
- connected components (Pevzner and Tesler 03)
- common intervals (Uno and Yagiura 00)
- max-gap (many references)
Constructive
- LineUp (Hampson et al 03)
- CloseUp (Hampson et al 05)
- FISH (Calabrese et al 03)
- AdHoRe (Vandepoele et al 02)
- Gene teams (Bergeron et al 02)
- greedy max-gap (Hokamp 01)
Require search algorithms
Harder to reason about formally
60Definitional Discordance
- Little consensus or even comparison
- Properties of clusters differ widely depending on definition
- Different sets of clusters are found when analyzing the same datasets
- Differences among definitions not well understood
- Hoberman and Durand, RECOMB Comparative Genomics 2005
- Durand and Hoberman, Trends in Genetics 2005/6?
61Statistical Testing Provides Additional Evidence for Common Ancestry
- How can we verify that a gene cluster indicates common ancestry?
- True histories are rarely known
- Experimental verification is not possible
- Generative models are not used since rates and patterns of large-scale rearrangement processes are not well understood
62Genomic Change
Ancestral genome
[Figure: double-stranded DNA sequence of the ancestral genome]
63If n = 5000 and r = 100, with a significance level of
P12
P13
P23
- independent pairwise tests
  - P(X12 < x12) < α and P23 < α
  - third pair is ignored
  - two pairs must share at least 6 genes
- product of two pairwise tests
  - P12 P23 < α
  - two pairs must share at least 5 genes
  - third pair is ignored
- three-way test
  - P123 < α
  - all three pairs must share at least 4 genes
64Identification of homologous chromosomal segments is a key task in comparative genomics
- Genome evolution
- Reconstruct history of chromosomal rearrangements
- Infer ancestral genetic map
- Phylogeny reconstruction
- Identify ancient whole genome duplications
Ancestral chromosome
Whole genome duplication
chr 1
chr 2