Evolution by duplication

6.095/6.895 - Computational Biology: Genomes, Networks, Evolution. Lecture 18, Nov 10, 2005. Evolution by duplication: Somewhere, something went wrong

Provided by: Mano74
1
6.095/6.895 - Computational Biology: Genomes, Networks, Evolution
Lecture 18
Nov 10, 2005
Evolution by duplication
Somewhere, something went wrong
2
Challenges in Computational Biology
  • Genome Assembly
  • Gene Finding
  • Regulatory motif discovery
  • Sequence alignment
  • Comparative Genomics
  • Database lookup
  • Evolutionary Theory
  • RNA folding
  • Gene expression analysis
  • Cluster discovery
  • Gibbs sampling
  • Protein network analysis
  • Regulatory network inference
  • Emerging network properties
[Diagram labels: DNA (TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT), RNA transcript]
3
Open questions (?)
  • Panda
    • Bear or raccoon?
  • Out of Africa
    • Mitochondrial evolution story?
  • Human evolution
    • Did we ever meet Neanderthals?
  • Primate evolution
    • Are we chimp-like or gorilla-like?
  • Vertebrate evolution
    • How did complex body plans arise?
  • Recent evolution
    • What genes are under selection?

4
What we have learned
  • Phylogenetic trees
    • Distance-based methods: UPGMA, Neighbor-Joining
    • Alignment-based methods: Parsimony (set-based, dynamic programming)
  • Evolution by nucleotide mutation
    • Probability of back-mutation: Markov chain
    • Models of evolution: Jukes-Cantor, Kimura 2-parameter model
  • Evolution by rearrangements
    • Sorting by reversals: signed / unsigned version, approximation
      algorithms
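
As a reminder of how the Jukes-Cantor model corrects for back-mutation, here is a minimal Python sketch. It assumes the standard Jukes-Cantor distance formula d = -(3/4) ln(1 - (4/3) p), where p is the observed fraction of differing sites; the function name is chosen here for illustration.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site from the observed fraction p of differing sites.

    Assumes the Jukes-Cantor model (all substitutions equally likely).
    The correction is undefined once p reaches 0.75 (saturation).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

The corrected distance is always at least as large as p, because some sites have mutated more than once or mutated back.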

5
Today's goals: Evolution by Duplication
  • Detecting gene duplication
    • Orthologs and paralogs
    • Gene trees and species trees
    • Reconciliation
  • Detecting genome duplication
    • Evidence across species
    • Evidence in a single species
  • Duplicate gene evolution
    • Detect accelerated divergence
    • Measuring positive selection
    • Gene conversion

6
Determining orthologs and paralogs
7
Orthologs and paralogs
human
mouse
rat
dog
rabbit
paralogs
orthologs
  • Orthologs arise by speciation
    • typically keep same function
  • Paralogs arise by duplication
    • typically take on new functions

8
Why are orthologs and paralogs important?
  • Comparative genomics relies on correct orthology
  • Signal discovery by orthologous conservation

9
Challenges in genome-wide orthology
  • Tens of thousands of genes
    • Many paralogous families precede species divergence
    • Single phylogeny is impossible: not enough traits
  • Abundant duplication and loss
    • Protein family expansions
    • Gene conversion, loss, inactivation
  • Spurious matches
    • Common domains in unrelated proteins
    • Similarity not always due to common ancestry
  • Noisy data
    • Varying rates of mutation (gene and species)
    • Pseudogenes, incorrect/incomplete gene models

10
Current methods for ortholog finding
  • Pair-wise sequence comparison
    • Best bi-directional BLAST hits
    • Focuses on one-to-one orthologs (no duplications)
  • Hit clustering methods
    • Detect clusters in graph of pair-wise hits
    • Difficult to separate large connected components
  • Synteny methods
    • Detect conserved regions, stretches of nearby hits
    • Genome alignment methods focus on best hits
  • Phylogenetic methods
    • Phylogeny of family clusters orthologs near each other
    • Traditionally applied to specific families (not
      genome-wide)
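
The best bi-directional hit idea can be sketched in a few lines of Python. This is an illustration only: the score dictionaries stand in for parsed BLAST output, and the gene names and scores below are made up.

```python
def best_bidirectional_hits(scores_ab, scores_ba):
    """Return sorted (a, b) pairs that are each other's best hit.

    scores_ab[a][b]: score of gene a (species A) searched against gene b (species B).
    scores_ba[b][a]: score of the reverse search.
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: h2 and m2 are paralogs of h1 and m1.
scores_ab = {"h1": {"m1": 950, "m2": 400}, "h2": {"m1": 420, "m2": 380}}
scores_ba = {"m1": {"h1": 940, "h2": 410}, "m2": {"h1": 390, "h2": 395}}
pairs = best_bidirectional_hits(scores_ab, scores_ba)
```

In this toy input only the pair (h1, m1) is reciprocal. The paralogs h2 and m2 are left unassigned, which illustrates why the method focuses on one-to-one orthologs.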

11
Algorithm: SynPhyl
  • Combine synteny and phylogeny to find orthologs

Initial gene family construction
Build phylogenetic trees within families
Reconcile gene trees to determine orthology
12
Building Meaningful Gene Families
13
Step 1. Initial gene family construction
  • Challenge: How to keep cluster sizes balanced
  • Limitations of traditional clustering methods
    • UPGMA, k-means, graph-partitioning lead to imbalance
    • Bi-partitioning methods lead to arbitrary midway splitting

14
Step 1. Initial gene family construction
  • (1) Initial cluster seeds from unambiguous matches
    • Syntenic orthologs
    • Multi-species significant BBH (best bi-directional hits)

Initial gene clusters
15
Step 2. Cluster extension
  • (1) Initial cluster seeds from unambiguous matches
  • (2) Cluster extension
    • Pull unassigned genes to existing clusters
    • Ensure distance of new gene within cluster
      distribution
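
One way to read the cluster-extension step is sketched below in Python. This is not the SynPhyl implementation: the acceptance rule (mean distance to the seed members no larger than the within-cluster mean plus z standard deviations) and the cutoff z are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev


def extend_clusters(clusters, unassigned, dist, z=2.0):
    """Add each unassigned gene to its nearest seed cluster if it fits.

    clusters: dict cluster id -> list of seed genes (at least two per cluster)
    dist: function (gene, gene) -> distance
    z: hypothetical cutoff, in standard deviations of within-cluster distances
    """
    # Acceptance limit per cluster, computed from the seed members only.
    limits = {}
    for cid, genes in clusters.items():
        within = [dist(a, b) for a, b in combinations(genes, 2)]
        limits[cid] = mean(within) + z * pstdev(within)

    def mean_dist(gene, cid):
        return mean(dist(gene, member) for member in clusters[cid])

    extended = {cid: list(genes) for cid, genes in clusters.items()}
    for gene in unassigned:
        nearest = min(clusters, key=lambda cid: mean_dist(gene, cid))
        if mean_dist(gene, nearest) <= limits[nearest]:
            extended[nearest].append(gene)
    return extended
```

Genes that do not fit any cluster's distance distribution stay unassigned, so a distant gene is not forced into the nearest cluster.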

Unassigned genes
Initial gene clusters
16
Step 3. Phylogenetic reconstruction
  • (1) Initial cluster seeds from unambiguous matches
  • (2) Cluster extension
  • (3) Phylogenetic reconstruction
    • Phylogeny for each cluster
    • Align each cluster (MUSCLE protein alignment)
    • Neighbor-Joining: fast, distance-based (JTT model)
    • Bootstrapping: used for confidence measure, propagates
    • Use phylogeny to further separate clusters
    • Reconciliation
  • Four mammals
    • 78,744 genes
    • 17,586 trees
    • Largest: 103 genes
  • Ten fungi
    • 54,890 genes
    • 5,537 trees
    • Largest: 164 genes

Extended gene clusters
17
Bootstrap confidence scores
Repeat 100 times
Sample with replacement
Gene cluster
Alignment
Tree
  • Bootstrapping
    • Sample columns from the alignment randomly
    • Build trees based on these columns (NJ, ML, MP)
  • For every internal branch
    • Count how many topologies agree with inferred split
    • Percentage is the bootstrap confidence score
  • Building a final tree
    • Full tree, using all the data
    • Consensus tree
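
The bootstrap procedure can be sketched in Python as follows. The tree builder is passed in as a function that returns the set of splits of the tree it infers. `toy_four_taxon_split` is a stand-in written for this example (it is not NJ, ML, or MP), and all names here are illustrative.

```python
import random
from collections import Counter


def resample_columns(alignment, rng):
    """Sample alignment columns with replacement, keeping the original length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    cols = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][i] for i in cols) for n in names}


def bootstrap_support(alignment, build_splits, replicates=100, seed=0):
    """Percentage of bootstrap replicates that recover each split of the full-data tree."""
    rng = random.Random(seed)
    reference = build_splits(alignment)  # splits of the tree built from all the data
    counts = Counter()
    for _ in range(replicates):
        replicate_splits = build_splits(resample_columns(alignment, rng))
        counts.update(s for s in reference if s in replicate_splits)
    return {s: 100.0 * counts[s] / replicates for s in reference}


def toy_four_taxon_split(alignment):
    """Stand-in tree builder: pick the four-taxon pairing with the fewest mismatches."""
    a, b, c, d = sorted(alignment)

    def hamming(x, y):
        return sum(p != q for p, q in zip(alignment[x], alignment[y]))

    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    best = min(pairings, key=lambda pr: hamming(*pr[0]) + hamming(*pr[1]))
    return {frozenset(best[0])}
```

A call such as `bootstrap_support(alignment, toy_four_taxon_split)` returns one confidence score per internal split. A real pipeline would swap in an actual tree-building method.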

18
Phylogenetic Tree Reconciliation: Gene Tree → Species Tree
19
Gene Tree / Species Tree reconciliation
Known species tree
  • G1: Each species contains each subfamily
    • Easy to infer duplication events
  • G2: Loss events in each family hide complex ancestry
    • Reconciliation with species tree recovers the events

20
Reconciliation to determine orthology
  • Reconcile each gene tree to the species tree
    • Each node in gene tree maps to node in species tree
  • Read off orthology and paralogy
    • Infer gene duplication and loss events

Gene tree
Species tree
d1
h1
m2
r2
dog
m1
r1
chimp
rat
human
mouse
21
Reconciliation algorithm
  • For every node g, decide duplication or speciation
    • Map left child a to species tree → M(a). Map right child b to
      species tree → M(b)
    • M(g) is least common ancestor of M(a) and M(b)
  • After mapping
    • g is a duplication node if M(g) = M(a) or M(g) = M(b)
    • g is a speciation node if M(g) is distinct from the mappings of
      its children
  • Post-processing: count loss edges
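
The mapping rule above can be sketched in Python. Trees are written as nested 2-tuples with strings as leaves, which is a representation chosen for this example, and the gene and species names in the usage note are hypothetical. Loss counting is left out.

```python
def parents_and_depths(species_tree):
    """Parent and depth of every species-tree node (leaf string or nested tuple)."""
    parent, depth = {species_tree: None}, {species_tree: 0}
    stack = [species_tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            for child in node:
                parent[child], depth[child] = node, depth[node] + 1
                stack.append(child)
    return parent, depth


def lca(a, b, parent, depth):
    """Least common ancestor of species-tree nodes a and b."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def reconcile(gene_tree, species_of, species_tree):
    """Label every internal gene-tree node as 'duplication' or 'speciation'."""
    parent, depth = parents_and_depths(species_tree)
    events = {}

    def visit(g):
        if not isinstance(g, tuple):
            return species_of[g]          # a leaf maps to its own species
        left, right = g                   # binary gene tree assumed
        m_a, m_b = visit(left), visit(right)
        m_g = lca(m_a, m_b, parent, depth)
        events[g] = "duplication" if m_g in (m_a, m_b) else "speciation"
        return m_g

    visit(gene_tree)
    return events
```

For example, with species tree `("dog", ("human", ("mouse", "rat")))` and gene tree `("d1", ("h1", (("m1", "r1"), ("m2", "r2"))))`, the node joining the two rodent subtrees maps to the same species node as both of its children, so it is labeled a duplication. The other internal nodes are speciations.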

Limitation: Reconciliation assumes correct
species tree. Generally NOT the case.
22
Mammalian tree: Abundance of alternate tree
topologies
  • Most trees are incorrect
    • Count most frequent subtrees of size four
    • Correct species tree a minority (<20%)
  • Reason: Long branch attraction
    • Due to rapidly evolving rodent lineage
    • Common phylogenetic reconstruction problem
  • What happens to reconciliation?

23
Reconciliation with erroneous trees
Gene tree
Species tree
D H M R
D H M R
  • With erroneous trees
    • Direct reconciliation leads to spurious
      duplications and losses

24
Towards better reconciliation methods
Gene Tree
Species Tree
dog
m1
r1
d2
h2
d1
h1
m3
r3
human
m2
r2
rat
mouse
  • Full solution: Maximize joint likelihood
    • Incorporate cost of reconciliation in tree building
    • Tradeoff: nucleotide mutations vs. gene duplication/loss
  • One solution: Partitioning by Reconciliation
    • Key insight: most errors are on older branches,
      irrelevant to orthology
    • Use species tree to partition gene tree
    • Allow re-rooting of each partition based on species tree
    • → Apply reconciliation algorithm to each partition

25
Step 4. Partitioning by reconciliation
  • (1) Initial cluster seeds
  • (2) Cluster extension
  • (3) Phylogenetic reconstruction
  • (4) Partitioning by reconciliation
[Flowchart labels: Gene Clusters, Phylogeny, Unrooted Trees, Partition, Partitioned Trees, Select root, Rooted Trees, Reconciliation; bootstrapping loop, repeated 100 times; output: ortholog assignments with confidence score]
26
Putting it all together: SynPhyl
[Flowchart labels: Gene Annotations, Genome synteny, Initial clustering, Gene Family Clusters, Phylogeny, Unrooted Trees, Partition, Partitioned Trees, Select root, Rooted Trees, Reconciliation; bootstrapping loop, repeated 100 times; output: Ortholog and Paralog Database, orthology assigned with confidence scores]
27
Benchmarks and Results
28
Results: Mammalian comparisons
  • Compare human, mouse, rat, dog complete genomes
    • Coverage: 75,753 genes
    • Number of groups: 18,446 (of which 13,741 have
      all four species)
    • One-to-one orthologs in four species: 12,359
  • Contribution of phylogenetic reconstruction
    • More one-to-one orthologs: 11,619 → 12,359
    • Large families split into small groups: 17,586 →
      18,446

[Figure: Count of ortholog groups by species]
29
Higher resolution: resolving fine-grain
correspondence
30
Higher sensitivity: recognize subtle duplication
events
  • Additional duplicates found for ENSEMBL 1-to-1
    orthologs
    • Hundreds of additional duplicates detected
    • Confirmed by branch lengths and topology

31
SynPhyl comparison to direct reconciliation
SynPhyl reconciliation
Direct reconciliation
Total count of losses: 18,352 / 11,750
Total count of duplications: 10,114 / 8,942
More gene trees reconcile to species tree
Gene duplications and losses dramatically decreased
32
Result: Genome-wide correspondence of multiple
species
C. albicans
C. dubliniensis
C. tropicalis
L. elongisporus
C. guilliermondii
C. lusitaniae
33
Summary / Contributions
  • SynPhyl: new tool for genome-wide orthology
    • Uses synteny, phylogeny, and known species tree
    • Automatically determines orthologs and paralogs
    • Returns ortholog assignments, trees for each family
  • Algorithmic highlights
    • Initial clustering constrained by synteny
    • Fine-grain correspondence uses phylogeny
    • Partition by reconciliation constrained by species trees
  • Advantages of the algorithm
    • Practical, fast (< ½ day on a PC)
    • Uses information available: phylogeny, synteny
    • Confidence metric: bootstrap values propagate to orthology
    • Phylogeny ensures consistent orthologs (no over-collapsing)
  • Performance
    • Successfully applied to mammals, fungi
    • Fine-grain resolution: phylogeny disambiguates large families

34
Outline
  • Detecting gene duplication
    • Orthologs and paralogs
    • Gene trees and species trees
    • Reconciliation
  • Detecting genome duplication
    • Evidence across species
    • Evidence in a single species
  • Duplicate gene evolution
    • Detect accelerated divergence
    • Measuring positive selection
    • Gene conversion

35
Genome Duplication
36
A range of evolutionary distances
5 Myr
S.cerevisiae
20 Myr
S.paradoxus
S.mikatae
S.bayanus
Ability to ask different set of questions
37
Gene correspondence
[Figure: S. cerevisiae chromosomes I-XVI (12Mb) plotted against K. waltii scaffolds (10.5Mb)]
38
Gene correspondence
[Figure: S. cerevisiae chromosomes I-XVI plotted against K. waltii scaffolds]
39
Signatures of evolutionary events
Gene interleaving is evidence of complete
duplication
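
A simplified check for the interleaving signature can be sketched in Python. It assumes each K. waltii gene has been matched to a position in one or both of two candidate S. cerevisiae sister regions. The inputs and the function name are hypothetical, and inverted regions are not handled.

```python
def is_interleaved(kwal_order, hits_a, hits_b):
    """True if two sister regions together tile a K. waltii region in conserved order.

    kwal_order: K. waltii genes in chromosomal order
    hits_a, hits_b: dict K. waltii gene -> position of its match in sister region A / B
    """
    # Every K. waltii gene must be matched in at least one sister region.
    if any(g not in hits_a and g not in hits_b for g in kwal_order):
        return False
    # Within each sister region, matches must appear in the same order as in K. waltii.
    for hits in (hits_a, hits_b):
        positions = [hits[g] for g in kwal_order if g in hits]
        if not positions or positions != sorted(positions):
            return False
    return True
```

Genes retained in both copies (present in both `hits_a` and `hits_b`) are allowed, which matches the picture of duplication followed by loss of one copy for most genes.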
40
Duplicate mapping tiles K. waltii
[Figure: S. cerevisiae (S. cer.) regions mapped onto K. waltii chromosomes 1-8]
41
Duplicate mapping of centromeres
Recognize sister regions solely based on gene
order
42
Conclusion: Whole Genome Duplication has happened
43
Whole Genome Duplications are everywhere!
  • Yeast Duplication
    • Most genes 1-to-1 mapping
    • Gene interleaving evidence of duplication
    • Complete tiling of the genome
  • Vertebrate Duplication in Fish
    • Fish: Gene order not conserved, only chromosomes
    • Mammals: Gene order conserved, not chromosomes
  • Two rounds of WGD in base of vertebrate lineage
    • Build clusters of related genes (use Ciona as outgroup)
    • Count duplications by reconciliation
    • Find regions of duplicate overlap → 4-way synteny

44
Genome duplication evidence in a single species
45
Evidence of duplication using a single genome?
Scer chr 4
Scer chr 12
Wolfe 97
  • Genomic evidence
    • Conserved order of paralogous genes
    • Same transcriptional orientation
  • However
    • Interspersed with single-copy genes

Interpretation: Genome duplication followed by
gene loss
46
Whole genome duplication is controversial
  • There was a whole-genome duplication. (Wolfe, Nature, 1997)
  • There was no whole-genome duplication. (Dujon, FEBS, 2000)
  • At least some chrom. dup. occurred independently. (Langkjaer, JMB, 2000)
  • Dynamic equilibrium of duplications and loss. (Llorente, FEBS, 2000)
  • Recent evidence supports single event. (Wong, PNAS, 2002)
  • Continuous block duplications and deletions. (Dujon, Yeast, 2003)
  • Dup. precedes divergence from Kluyveromyces. (Piskur, Nature, 2003)
  • Telomere-mediated duplication events. (Coissac, Mol Bio Evo, 1997)
  • Multiple closely spaced events. (Friedman, Genome Res, 2003)
  • Spontaneous duplication of large chromosomal segments. (Koszul, EMBO, 2004)
  • Insufficient evidence
    • Only 50% of genome in duplicate regions
    • Only 8% of genes present in two copies
    • Extensive redundancy outside duplicate regions
  • Evidence against WGD
    • Divergence-based dating shows multiple times
    • Other species have similar level of redundancy
  • Alternative evolutionary scenario proposed
    • Independent segmental duplications
    • Also consistent with the evidence

Evidence remains inconclusive
47
Conclusion: Whole Genome Duplication has happened
48
Outline
  • Detecting gene duplication
    • Orthologs and paralogs
    • Gene trees and species trees
    • Reconciliation
  • Detecting genome duplication
    • Evidence across species
    • Evidence in a single species
  • Duplicate gene evolution
    • Detect accelerated divergence
    • Measuring positive selection
    • Gene conversion

49
Post-duplication evolution
50
Whole-genome duplication results in 500 new genes
Number of genes
5,000
time
Today
100Myrs
Evidence of accelerated gene evolution
51
Fate of duplicated genes
  • 457 genes kept in two copies, result of selection
  • Involved in sugar metabolism and fermentation

WGD
S. cerevisiae copy 1
S. cerevisiae copy 2
K. waltii
Evidence of accelerated protein divergence?
52
Measuring accelerated divergence
[Figure: codon paths from TTT (F, Phe) to GTA (V, Val), passing through either TTA (L, Leu) or GTT (V, Val)]
→ Two shortest paths possible
  • Protein divergence
    • Count amino-acid changes
    • Use BLOSUM substitution matrix
  • Nucleotide divergence
    • Count nucleotide substitutions
    • Correct for back-mutations
    • Use transition/transversion evolutionary model
  • dN / dS
    • Two types of nucleotide substitutions
    • S, synonymous: Preserve amino-acid translation
    • N, non-synonymous: Change amino-acid
    • Count synonymous / non-synonymous sites
    • Depends on path taken between two codons
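
The path dependence can be made concrete with a short Python sketch. It enumerates the shortest mutational paths between two codons and averages the synonymous and non-synonymous changes over them, using the standard genetic code. Averaging over paths with equal weight is one common convention and is assumed here. Paths through stop codons are not excluded, and the site counts needed for a full dN/dS estimate are not computed.

```python
from itertools import permutations

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def shortest_paths(codon1, codon2):
    """All shortest single-nucleotide-step paths from codon1 to codon2."""
    diff = [i for i in range(3) if codon1[i] != codon2[i]]
    paths = []
    for order in permutations(diff):
        path, current = [codon1], codon1
        for i in order:
            current = current[:i] + codon2[i] + current[i + 1:]
            path.append(current)
        paths.append(path)
    return paths


def count_changes(codon1, codon2):
    """Average (synonymous, non-synonymous) changes over all shortest paths."""
    paths = shortest_paths(codon1, codon2)
    syn = nonsyn = 0
    for path in paths:
        for a, b in zip(path, path[1:]):
            if CODON_TABLE[a] == CODON_TABLE[b]:
                syn += 1
            else:
                nonsyn += 1
    return syn / len(paths), nonsyn / len(paths)
```

For TTT → GTA, the path through GTT has one non-synonymous and one synonymous step, while the path through TTA has two non-synonymous steps. `count_changes("TTT", "GTA")` therefore averages to 0.5 synonymous and 1.5 non-synonymous changes.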

53
Scenarios for rapid gene evolution
One copy faster
Scer - copy1
Scer - copy2
Kwal
Ohno, 1970
Both copies faster
Scer - copy1
Scer - copy2
Lynch, 2000
Kwal
20% of duplicated genes show acceleration
95% of cases: Only one copy faster
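
Whether one copy evolves faster can be checked from three pairwise distances, using the unique branch lengths of a three-leaf tree. The Python sketch below assumes additive distances between the two S. cerevisiae copies and the K. waltii ortholog. The function names and the example numbers are illustrative, not the lecture's data.

```python
def three_point_branches(d12, d1o, d2o):
    """Branch lengths (copy1, copy2, outgroup) of the three-leaf tree.

    d12: distance between the two paralogous copies
    d1o, d2o: distance from each copy to the outgroup ortholog
    """
    b1 = (d12 + d1o - d2o) / 2
    b2 = (d12 + d2o - d1o) / 2
    bo = (d1o + d2o - d12) / 2
    return b1, b2, bo


def acceleration(d12, d1o, d2o):
    """Ratio of the faster copy's branch length to the slower copy's."""
    b1, b2, _ = three_point_branches(d12, d1o, d2o)
    fast, slow = max(b1, b2), min(b1, b2)
    return fast / slow if slow > 0 else float("inf")
```

With hypothetical distances d12 = 0.5, d1o = 0.6, d2o = 0.3, the branches are about 0.4 and 0.1 for the two copies, so copy 1 shows roughly 4-fold acceleration. The slower copy is the candidate for the ancestral function.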
54
Emerging gene functions after duplication
  • Origin of replication → silencing

4-fold acceleration
Scer - Sir3 (silencing)
Scer - Orc1 (origin of replication)
Kwal - Orc1
  • Translation initiation → anti-viral defense

3-fold acceleration
Scer - Ski7 (anti-viral defense)
Scer - Hbs1 (translation initiation)
Kwal - Hbs1
Asymmetric divergence → recognize ancestral /
derived
55
Distinct functional properties
                 Ancestral function   Derived function
Gene deletion    Lethal (20%)         Never lethal
Gain new function and lose ancestral function
56
Distinct functional properties
                 Ancestral function   Derived function
Gene deletion    Lethal (20%)         Never lethal
Expression       Abundant             Specific (stress, starvation)
Localization     General              Specific (mitochondrion, spores)
Gain new function and lose ancestral function
57
Gene conversion
58
Decelerated evolution
Scer copy1
Kwal
Scer copy2
  • 60 gene pairs (13% of 457 pairs)
  • 98% protein identity (all pairs: 55%)
  • 90% identity in 4-fold degenerate sites (all
    pairs: 41%)
  • Not recent duplication
  • Gene order argues ancestral WGD pairs

Gene conversion?
59
Evidence of gene conversion
YBL072C S. cerevisiae
YER102W S. cerevisiae
K. waltii
  • Tree root reveals time of duplication
    • No acceleration in the K. waltii branch
    • The two genes have recently replaced each other
  • Branching order reveals gene conversion
    • Paralogs are closer to each other than to their
      ortholog
    • Both S. cerevisiae and S. bayanus show gene
      conversion

Periodic gene conversion
60
Summary
  • Detecting gene duplication
    • Orthologs and paralogs
    • Gene trees and species trees
    • Reconciliation
  • Detecting genome duplication
    • Evidence across species
    • Evidence in a single species
  • Duplicate gene evolution
    • Detect accelerated divergence
    • Measuring positive selection
    • Gene conversion