Terminology - PowerPoint PPT Presentation

Slides: 65
Provided by: Rose257
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes

Title: Terminology


1
(No Transcript)
2
Terminology
  • Homologous: related through common ancestry
  • Orthologous: related through speciation
  • Paralogous: related through duplication

[Figure: genes of Species 1 and Species 2, with orthologs and paralogs labeled]
3
Identification of Orthologous Genes
  • The identification of orthologous genes is a
    prerequisite for a marker-based approach
  • Orthology identification
  • is often difficult to determine from gene
    sequence alone
  • is an important unsolved research problem
  • can be improved by incorporating genomic context

4
An example: Which gene is the true ortholog?
[Figure: a query gene in Species 1 and genes in Species 2 ranked from most similar to least similar; candidate matches are labeled 1st of 4 through 4th of 4, and the remaining genes are labeled 1st of 1]
5
  • Problem: for more diverged genomes, unambiguous
    orthologs will be sparse and
    clusters will be more rearranged
  • Solution: identify orthologs and gene clusters
    simultaneously

[Figure: diagram linking "Identify homologous genes", "Find gene clusters", and "Similar genomic context"]
6
Two approaches
  • Minimize rearrangements
  • Maximize conserved structure

7
  • Work that combines sequence similarity and
    genomic context
  • Bansal, Bioinformatics 99
  • Kellis et al, J Comp Biol 04
  • Bourque et al, RECOMB Comp Genomics 05
  • Chen et al, ACM/IEEE Trans Comput Biol and Bioinf
    05
  • Limitations
  • No flexible cluster definitions
  • No statistical approaches
  • Little real evaluation

8
Why we need more flexible cluster defs and thus
statistics
  • Give an example from yeast where longest
    subsequence or Blins boxes fail?
  • Also they have to use ad hoc filters?

9
Solution
  • Use max-gap clusters
  • Show how it works on the example.

10
  • However, which cluster is more conserved?
  • Show two clusters, one larger but more gaps, the
    other smaller and denser?
  • Need a way to rank them.
  • Use p-value as measure of degree of conservation.

11
  • Discuss algorithmic challenges? Why monotonicity
    helps us here?

12
  • Evaluation in progress
  • Say what data set I'm using?

13
(No Transcript)
14
(No Transcript)
15
Basic Genome Model
  • a sequence of unique genes
  • distance between genes is equal to the number of
    intervening genes
  • gene orientation unknown
  • a single, linear chromosome

16
Inputs
  1. Two genomes (i.e., ordered lists of genes)
  2. A mapping of corresponding genes
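
The genome model and the two inputs can be sketched in a few lines of Python. The gene names, the `homolog` mapping, and the helper `gene_distance` are hypothetical and serve only as an illustration of the model (unique genes, one linear chromosome, distance measured in intervening genes).

```python
# Hypothetical toy data: each genome is an ordered list of unique genes.
genome_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
genome_b = ["b1", "b2", "b3", "b4", "b5"]

# Hypothetical mapping of corresponding (homologous) genes.
homolog = {"a1": "b2", "a3": "b1", "a6": "b5"}


def gene_distance(genome, gene_x, gene_y):
    """Distance between two genes = number of intervening genes."""
    i, j = genome.index(gene_x), genome.index(gene_y)
    if i == j:
        return 0
    return abs(i - j) - 1
```

Under this model, adjacent genes are at distance 0, and orientation plays no role.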

17
Whole Genome Comparison m n
Two genomes of n genes with m homologous
gene pairs
[Figure: the two genomes, each labeled g ≤ 3]
  • What is the probability of observing a
    maximal max-gap cluster of size exactly h, if the
    genes in both genomes are randomly ordered?
  • A cluster is maximal if it is not a subset of
    a larger cluster

18
(No Transcript)
19
Processes of genomic change
  • Small-scale point mutations: change gene sequences
  • Large-scale genomic rearrangements: change gene
    content and order

20
Processes of genomic change
[Figure: double-stranded DNA sequence annotated with noncoding DNA, genes, and regulatory regions]
  • Small-scale point mutations: change gene sequences
  • Large-scale genomic rearrangements: change gene
    content and order

21
[Figure: double-stranded DNA sequence annotated with noncoding DNA, regulatory regions, and genes]
22
Building Phylogenetic Trees
Genes may be laterally transferred between
distantly related species
AAACATTTT E. coli
GTCGGTTGG E. coli
AAACATTTA Salmonella
AAACGTTTC Chlamydia
GTCGGTTGC Thermococcus
GTCAGTTGC Methanococcus
  • Trees are often constructed based on a single
    gene
  • species with the fewest differences between their
    gene sequences are grouped together in the tree
  • The history of a gene may not indicate the
    history of the species
  • Construct trees based on evidence from the whole
    genome
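
The grouping rule above (fewest differences between gene sequences) can be tried directly on the six sequences shown on this slide. This is a minimal sketch, not a tree-building method; the "(gene 1)" and "(gene 2)" suffixes and the helper names `differences` and `closest` are hypothetical additions.

```python
from itertools import combinations

# Sequences as listed on the slide. The "(gene 1)"/"(gene 2)" suffixes are
# hypothetical labels added to tell the two E. coli genes apart.
sequences = {
    "E. coli (gene 1)": "AAACATTTT",
    "E. coli (gene 2)": "GTCGGTTGG",
    "Salmonella": "AAACATTTA",
    "Chlamydia": "AAACGTTTC",
    "Thermococcus": "GTCGGTTGC",
    "Methanococcus": "GTCAGTTGC",
}


def differences(seq_x, seq_y):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(seq_x, seq_y))


def closest(name):
    """Name of the sequence with the fewest differences from `name`."""
    others = (other for other in sequences if other != name)
    return min(
        others,
        key=lambda other: differences(sequences[name], sequences[other]),
    )


if __name__ == "__main__":
    # Print every pairwise difference count.
    for x, y in combinations(sequences, 2):
        print(f"{x:18} {y:18} {differences(sequences[x], sequences[y])}")
```

With these sequences, one E. coli gene is closest to Salmonella and the other is closest to Thermococcus, which is the situation the slide warns about: the history of a single gene may not indicate the history of the species.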

23
Whole-genome phylogenies based on spatial
organization
  1. Find gene clusters
  2. Determine the minimum number of rearrangements
    between genome pairs
  3. Use rearrangement distances to build phylogenies
  4. Effects of cluster choice?

Bourque et al., Genome Res. 2004, 14: 507-516
24
Temporal coherence
[Figure: timeline running from "before" to "now"]
  • Divergence times of homologous pairs within a
    block should agree

25
Genomic Change
[Figure: an ancestral chromosome undergoes large-scale duplication into chromosome 1 and chromosome 2, followed by sequence mutation and chromosomal rearrangements]
26
Groups find very different clusters when
analyzing the same data
27
Identifying gene clusters
  1. Formally define a gene cluster (modeling)
  2. Devise an algorithm to identify clusters (algorithms)
  3. Verify that clusters indicate common ancestry
    (statistics)
28
Order and Orientation
[Figure: example regions, each labeled density 6/8]
  • Local rearrangements will cause both gene order
    and orientation to diverge
  • Overly stringent order constraints could lead to
    false negatives
  • Partial conservation of order and orientation
    provides additional evidence of regional homology

29
Symmetry
[Figure: two panels showing the clusters found on chromosomes A and B]
  • Many existing cluster algorithms are not
    symmetric with respect to the chromosomes

30
Cluster definitions in the literature
Descriptive
  • r-windows (many references)
  • connected components (Pevzner and Tesler 03)
  • common intervals (Uno and Yagiura 00)
  • max-gap (many references)

Constructive
  • LineUp (Hampson et al 03)
  • CloseUp (Hampson et al 05)
  • FISH (Calabrese et al 03)
  • AdHoRe (Vandepoele et al 02)
  • Gene teams (Bergeron et al 02)
  • greedy max-gap (Hokamp 01)

Require search algorithms
Harder to reason about formally
31
[Figure: example cluster with size 5, length 12, density 5/12, gap 3 genes]
  • Cluster Parameters
  • size: number of homologous pairs in the cluster
  • length: total number of genes in the cluster
  • density: proportion of homologous pairs
    (size/length)
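
The cluster parameters defined above can be computed from the positions of the homologous genes on one chromosome. The sketch below is illustrative; the function name `cluster_parameters` and the example positions are hypothetical, with the positions chosen so that the result matches the numbers quoted on the slide (size 5, length 12, density 5/12, largest gap 3).

```python
from fractions import Fraction


def cluster_parameters(positions):
    """Size, length, density, and largest gap of a cluster.

    `positions` are the chromosome indices of the homologous genes in the
    cluster; every gene between the first and last index counts toward length.
    """
    if not positions:
        raise ValueError("a cluster needs at least one gene")
    ordered = sorted(positions)
    size = len(ordered)
    length = ordered[-1] - ordered[0] + 1
    # A gap is the number of intervening genes between adjacent cluster genes.
    gaps = [b - a - 1 for a, b in zip(ordered, ordered[1:])]
    return {
        "size": size,
        "length": length,
        "density": Fraction(size, length),
        "max_gap": max(gaps, default=0),
    }


# Hypothetical positions chosen to reproduce the slide's numbers.
example = cluster_parameters([0, 2, 3, 7, 11])
```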

32
[Figure: a max-gap cluster in which every gap ≤ g]
max-gap clusters
  • cluster grows to its natural size
  • cluster of size m may be of length m to
    g(m - 1) + m
  • maximal length grows as size grows

[Figure: an r-window, with length ≤ r]
r-windows
  • cluster size is constrained
  • cluster of size m may be of length m to r
  • maximal length is fixed, regardless of cluster
    size

33
A tradeoff local vs global density
max-gap
  • constrains local density
  • only weakly constrains global density (it stays
    above 1/(g+1))
r-window
  • constrains global density
  • only weakly constrains local density (maximum
    possible gap is r - size)
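
The bounds behind this tradeoff are simple arithmetic and can be checked numerically. In this sketch the function names are hypothetical; the formulas are the ones stated on the slides: a max-gap cluster of size m has length between m and m + g(m - 1), so its density is m / (m + g(m - 1)) in the sparsest case, and an r-window holding `size` genes can contain a gap of up to r - size.

```python
from fractions import Fraction


def max_gap_length_range(m, g):
    """Shortest and longest possible length of a max-gap cluster of size m."""
    return m, m + g * (m - 1)


def max_gap_min_density(m, g):
    """Lowest global density a max-gap cluster of size m can have."""
    return Fraction(m, m + g * (m - 1))


def r_window_largest_gap(r, size):
    """Largest gap that can occur inside an r-window holding `size` genes."""
    return r - size
```

For example, with g = 3 the sparsest cluster of size 5 has density 5/17, and the minimum density approaches 1/(g+1) = 1/4 from above as m grows.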

34
Even when global density is high (density 12/18), a region may not be locally dense
35
Formalizing these intuitive notions
[Figure: duplicated region on human chromosomes 17 and 3, labeled "10 genes duplicated out of 100" and "29 genes". McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.]
  • Similar but not identical gene content: density
    constraints
  • Gene order is not perfectly preserved: order
    violation constraints

The degree of constraint varies widely among
definitions
36
The Max-Gap Definition is the Most Widely Used in
Genomic Analyses
  • Blanc et al 2003, recent polyploidy in Arabidopsis
  • Venter et al 2001, sequence of the human genome
  • Overbeek et al 1999, inferring functional coupling of genes in bacteria
  • Vandepoele et al 2002, duplications in Arabidopsis through comparison with rice
  • Vision et al 2000, duplications in Eukaryotes
  • Lawrence and Roth 1996, identification of horizontal transfers
  • Tamames 2001, evolution of gene order conservation in prokaryotes
  • Wolfe and Shields 1997, ancient yeast duplication
  • McLysaght 2002, genomic duplication during early chordate evolution
  • Coghlan and Wolfe 2002, comparing rates of rearrangements
  • Seoighe and Wolfe 1998, genome rearrangements after duplication in yeast
  • Chen et al 2004, operon prediction in newly sequenced bacteria
  • Blanchette et al 1999, breakpoints as phylogenetic features
  • ...
37
r-window statistics for a two-way cluster
[Figure: two windows of length r = 7 sharing m = 4 genes, with r - m unshared genes in each window]
  • Given: length, the total number of genes in each
    window
  • Test statistic: size, the number of homologous
    pairs in the cluster
  • P-value: the probability of two arbitrary windows
    of r genes containing m genes in common, under
    the null hypothesis
  • Durand and Sankoff, J Comp Biology, 2003
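
To give a feel for this kind of test, the sketch below computes a window-overlap p-value under a deliberately simplified null model. It is not the statistic derived by Durand and Sankoff. The assumptions, all hypothetical, are that both genomes contain the same n genes, that every gene has exactly one homolog, and that the content of each window is a uniformly random set of r genes, so the number of shared genes follows a hypergeometric distribution. The function name `window_overlap_pvalue` is also hypothetical.

```python
from fractions import Fraction
from math import comb


def window_overlap_pvalue(n, r, m):
    """P(two random r-gene windows share at least m genes).

    Simplified null model (an assumption of this sketch): each window is a
    uniformly random r-subset of the same n genes.
    """
    if not 0 <= r <= n:
        raise ValueError("window length r must satisfy 0 <= r <= n")
    total = comb(n, r)
    # k shared genes: choose k from the first window, r - k from the rest.
    favorable = sum(comb(r, k) * comb(n - r, r - k) for k in range(m, r + 1))
    return Fraction(favorable, total)
```

Using exact fractions avoids floating-point underflow when the p-value is very small; call `float(...)` on the result for display.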

38
The max-gap definition is the most widely used
cluster definition in genomic analyses
  • Allows extensive rearrangement of gene order
  • Allows limited gene insertion and losses
  • Allows the cluster to grow to its natural size

There is no formal statistical model for max-gap
clusters
39
Whole Genome Comparison of Human with Human
McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.
Could this pattern have occurred by chance?
40
McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.
Chromosome 17
Clusters with similarity to human chromosome 17
  1. Are larger clusters more likely to occur by
    chance?
  2. Are there other duplicated segments that their
    method did not detect?

41
Max-Gap Cluster Statistics
  • Reference set
  • complete clusters
  • complete clusters with length restriction
  • incomplete clusters
  • Whole genome comparison
  • upper bound
  • lower bound
  • Hoberman, Sankoff, and Durand. Journal of
    Computational Biology 2005.
  • Hoberman, Sankoff, and Durand. RECOMB Comparative
    Genomics 2004.

42
Reference set, complete clusters
Given a genome G = 1, ..., n of unique genes and
a set of m genes of interest (in blue)
m = 5
  • Do all m blue genes form a significant max-gap
    cluster?

43
Reference set, complete clusters
g = 2, m = 5
  • Test statistic: the maximum gap observed between
    adjacent blue genes
  • P-value: the probability of observing a maximum
    gap ≤ g, under the null hypothesis

44
Compute probabilities by counting
Probability = (permutations where the maximum gap ≤ g) / (all possible unlabeled permutations)
The problem is how to count the numerator
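
For very small genomes the counting can be done by brute force, which is useful as a sanity check on any closed-form expression. This sketch enumerates every placement of the m genes of interest among n positions; it is not the counting method of the cited papers, and the names `max_gap` and `complete_cluster_probability` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def max_gap(positions):
    """Largest number of intervening genes between adjacent chosen genes."""
    ordered = sorted(positions)
    return max((b - a - 1 for a, b in zip(ordered, ordered[1:])), default=0)


def complete_cluster_probability(n, m, g):
    """P(max gap <= g) when m genes are placed uniformly among n positions."""
    total = 0
    favorable = 0
    for placement in combinations(range(n), m):
        total += 1
        if max_gap(placement) <= g:
            favorable += 1
    return Fraction(favorable, total)
```

The enumeration visits every one of the C(n, m) placements, so it is only practical for toy values of n and m.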
45
Adding edge effects
Hoberman, Sankoff, Durand. JCB 2005.
  • I used this equation to calculate probabilities
  • for various parameter values ?

46
Probability of h randomly placed genes forming a
chain
[Plot: probability versus h (size of the chain), for n = 1000 (total genes in genome)]
47
Probability of a complete cluster
[Plot: n = 500]
48
Using statistics to choose parameter values
[Plot: Significant Parameter Values (α = 0.001), n = 500]
49
Max-Gap Cluster Statistics
  • Reference set
  • complete clusters
  • complete clusters with length restriction
  • incomplete clusters
  • Whole genome comparison
  • upper bound
  • lower bound
  • Hoberman, Sankoff, and Durand. Journal of
    Computational Biology 2005.
  • Hoberman, Sankoff, and Durand. RECOMB Comparative
    Genomics 2004.

50
Larger clusters do not always imply
greater significance
  • A max-gap cluster containing many genes may be
    more likely to occur by chance than one
    containing few genes

51
Why max-gap and r-windows?
[Figure: duplicated region on human chromosomes 17 and 3, labeled "10 genes duplicated out of 100" and "29 genes". McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.]
  • Allow gene insertions and deletions (but differ
    on exactly how many)
  • Do not enforce arbitrary constraints on gene
    order
  • Descriptive definitions that can be reasoned
    about formally
  • Commonly used in genomics studies

52
Greedy Algorithms Impose Order Constraints
g = 2
  • A greedy, agglomerative algorithm
  • initializes a cluster as a single homologous pair
  • searches for a gene in proximity on both
    chromosomes
  • either extends the cluster and repeats, or
    terminates
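
The steps above can be sketched as follows. This is one possible reading, not the algorithm of any particular paper: the slide does not say how "proximity" is measured or which candidate is picked, so the sketch assumes that a homologous pair is close enough when at most g genes separate it from some cluster member on each chromosome, and that the lowest-positioned candidate is taken first. The names `within_gap` and `greedy_cluster` are hypothetical.

```python
def within_gap(position, members, g):
    """True if at most g genes lie between `position` and some member."""
    return any(abs(position - other) - 1 <= g for other in members)


def greedy_cluster(pairs, seed, g):
    """Grow a cluster from `seed`, a (position on A, position on B) pair."""
    # Initialize the cluster as a single homologous pair.
    cluster = [seed]
    remaining = sorted(set(pairs) - {seed})
    while True:
        on_a = [a for a, _ in cluster]
        on_b = [b for _, b in cluster]
        # Search for a pair that is in proximity on both chromosomes.
        extension = next(
            (p for p in remaining
             if within_gap(p[0], on_a, g) and within_gap(p[1], on_b, g)),
            None,
        )
        if extension is None:
            # Nothing close enough: terminate.
            return sorted(cluster)
        # Otherwise extend the cluster and repeat.
        cluster.append(extension)
        remaining.remove(extension)
```

For example, with the hypothetical pairs `[(0, 0), (2, 1), (3, 5), (10, 10)]` and seed `(0, 0)`, g = 2 yields `[(0, 0), (2, 1)]`, while g = 3 also admits `(3, 5)`.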

53
Identification of homologous chromosomal segments
is a key task in comparative genomics
  • Genome evolution
  • Reconstruct history of chromosomal rearrangements
  • Infer ancestral genetic map
  • Phylogeny reconstruction
  • Identify ancient whole genome duplications

[Figure: an ancestral chromosome undergoes whole genome duplication, giving chr 1 and chr 2]
54
Odd properties of max-gap clusters
  1. A larger cluster may be less significant
  2. Moving a gene further away may make a cluster
    more likely

55
r-window statistics for pairwise comparison
[Figure: two windows of length r = 7 sharing m = 4 genes, with r - m unshared genes in each window]
  • Test statistic: size, the number of homologous
    pairs in the cluster
  • P-value: the probability of two arbitrary
    windows of r genes containing m genes in common,
    under the null hypothesis
  • Durand and Sankoff, J Comp Biology, 2003

56
Where are the gene clusters?
  • Intuitive notions of what clusters look like
  • Similar but not identical gene content
  • Gene order is not perfectly preserved
  • Need a more rigorous definition

57
What properties???
[Figure: K. lactis chromosome 5 compared with S. cerevisiae chromosome 16, labeled "3 genes" and "8 homologous genes out of 17". Figure from the Yeast Genome Browser, Byrne and Wolfe, Genome Res. 2005]
  1. Allow gene insertions and deletions
58
What properties???
  • Allow gene insertions and deletions
  • Do not enforce arbitrary constraints on gene
    order
  • Formal definitions can be reasoned about
    formally
  • Commonly used in genomics studies

Figure from McLysaght et al Nature Genetics, 2002.
59
Cluster definitions in the literature
Descriptive
  • r-windows (many references)
  • connected components (Pevzner and Tesler 03)
  • common intervals (Uno and Yagiura 00)
  • max-gap (many references)

Constructive
  • LineUp (Hampson et al 03)
  • CloseUp (Hampson et al 05)
  • FISH (Calabrese et al 03)
  • AdHoRe (Vandepoele et al 02)
  • Gene teams (Bergeron et al 02)
  • greedy max-gap (Hokamp 01)

Require search algorithms
Harder to reason about formally
60
Definitional Discordance
  • Little consensus or even comparison
  • Properties of clusters differ widely depending on
    definition
  • Different sets of clusters are found when
    analyzing the same datasets
  • Differences among definitions not well understood
  • Hoberman and Durand, RECOMB Comparative Genomics
    2005
  • Durand and Hoberman, Trends in Genetics 2005/6?

61
Statistical Testing Provides Additional Evidence
for Common Ancestry
  • How can we verify that a gene cluster indicates
    common ancestry?
  • True histories are rarely known
  • Experimental verification is not possible
  • Generative models are not used since rates and
    patterns of large-scale rearrangement processes
    are not well understood

62
Genomic Change
Ancestral genome
[Figure: the ancestral genome drawn as a long double-stranded DNA sequence]
63
If n = 5000 and r = 100, with a significance level
of
[Figure: pairwise p-values P12, P13, and P23]

independent pairwise tests
  • P(X12 < x12) < α and P23 < α
  • third pair is ignored
  • two pairs must share at least 6 genes

product of two pairwise tests
  • P12 · P23 < α
  • two pairs must share at least 5 genes
  • third pair is ignored

three-way test
  • P123 < α
  • all three pairs must share at least 4 genes

64
Identification of homologous chromosomal segments
is a key task in comparative genomics
  • Genome evolution
  • Reconstruct history of chromosomal rearrangements
  • Infer ancestral genetic map
  • Phylogeny reconstruction
  • Identify ancient whole genome duplications

[Figure: an ancestral chromosome undergoes whole genome duplication, giving chr 1 and chr 2]