Title: Evolution by duplication
6.095/6.895 - Computational Biology: Genomes, Networks, Evolution
Lecture 18
Nov 10, 2005
Evolution by duplication
Somewhere, something went wrong
Challenges in Computational Biology
- Genome assembly
- Gene finding
- Regulatory motif discovery
- Sequence alignment
- Comparative genomics
- Database lookup
- Evolutionary theory
- RNA folding
- Gene expression analysis
- Cluster discovery
- Gibbs sampling
- Protein network analysis
- Regulatory network inference
- Emerging network properties
[Figure labels: DNA; RNA transcript; example sequences TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT]
Open questions (?)
- Panda
  - Bear or raccoon?
- Out of Africa
  - Mitochondrial evolution story?
- Human evolution
  - Did we ever meet Neanderthals?
- Primate evolution
  - Are we chimp-like or gorilla-like?
- Vertebrate evolution
  - How did complex body plans arise?
- Recent evolution
  - What genes are under selection?
What we have learned
- Phylogenetic trees
  - Distance-based methods
    - UPGMA, Neighbor-Joining
  - Alignment-based methods
    - Parsimony: set-based, dynamic programming
- Evolution by nucleotide mutation
  - Probability of back-mutation
    - Markov chain
  - Models of evolution
    - Jukes-Cantor
    - Kimura 2-parameter model
- Evolution by rearrangements
  - Sorting by reversals
    - Signed / unsigned version, approximation algorithms
Today's goals: Evolution by Duplication
- Detecting gene duplication
  - Orthologs and paralogs
  - Gene trees and species trees
  - Reconciliation
- Detecting genome duplication
  - Evidence across species
  - Evidence in a single species
- Duplicate gene evolution
  - Detect accelerated divergence
  - Measuring positive selection
  - Gene conversion
Determining orthologs and paralogs
Orthologs and paralogs
[Figure labels: human, mouse, rat, dog, rabbit; paralogs; orthologs]
- Orthologs arise by speciation
  - Typically keep same function
- Paralogs arise by duplication
  - Typically take on new functions
Why are orthologs and paralogs important?
- Comparative genomics relies on correct orthology
- Signal discovery by orthologous conservation
Challenges in genome-wide orthology
- Tens of thousands of genes
  - Many paralogous families precede species divergence
  - Single phylogeny is impossible: not enough traits
- Abundant duplication and loss
  - Protein family expansions
  - Gene conversion, loss, inactivation
- Spurious matches
  - Common domains in unrelated proteins
  - Similarity not always due to common ancestry
- Noisy data
  - Varying rates of mutation (across genes and species)
  - Pseudogenes, incorrect/incomplete gene models
Current methods for ortholog finding
- Pair-wise sequence comparison
  - Best bi-directional BLAST hits
  - Focuses on one-to-one orthologs (no duplications)
- Hit clustering methods
  - Detect clusters in graph of pair-wise hits
  - Difficult to separate large connected components
- Synteny methods
  - Detect conserved regions, stretches of nearby hits
  - Genome alignment methods focus on best hits
- Phylogenetic methods
  - Phylogeny of family clusters orthologs near each other
  - Traditionally applied to specific families (not genome-wide)
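The best bi-directional hit idea can be sketched in a few lines. This is a minimal illustration, assuming hits are already available as (query, subject, score) tuples; the gene names and scores are made up, and the function names are hypothetical rather than part of any BLAST tool.

```python
def best_hits(hits):
    """Return {query: best-scoring subject} from (query, subject, score) tuples."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical toy scores: h* are human genes, m* are mouse genes.
human_vs_mouse = [("h1", "m1", 250.0), ("h1", "m2", 90.0),
                  ("h2", "m2", 180.0), ("h3", "m2", 200.0)]
mouse_vs_human = [("m1", "h1", 245.0), ("m2", "h3", 198.0), ("m2", "h2", 175.0)]

pairs = bidirectional_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)  # [('h1', 'm1'), ('h3', 'm2')]
```

Note how h2 is left unpaired even though it matches m2 well: the method reports only one-to-one pairs, which is why it misses duplications.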
Algorithm: SynPhyl
- Combine synteny and phylogeny to find orthologs
Initial gene family construction
Build phylogenetic trees within families
Reconcile gene trees to determine orthology
Building Meaningful Gene Families
Step 1. Initial gene family construction
- Challenge: How to keep cluster sizes balanced
- Limitations of traditional clustering methods
  - UPGMA, k-means, graph-partitioning lead to imbalance
  - Bi-partitioning methods lead to arbitrary midway splitting
Step 1. Initial gene family construction
- (1) Initial cluster seeds from unambiguous matches
  - Syntenic orthologs
  - Multi-species significant BBH
[Figure label: Initial gene clusters]
Step 2. Cluster extension
- (1) Initial cluster seeds from unambiguous matches
- (2) Cluster extension
  - Pull unassigned genes to existing clusters
  - Ensure distance of new gene within cluster distribution
[Figure labels: Unassigned genes; Initial gene clusters]
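A rough sketch of the cluster extension step follows. The slides do not say how "within cluster distribution" is tested, so the rule here is an assumption: a gene joins its closest seed cluster only if its mean distance to the members is at most the mean plus k standard deviations of the within-cluster pairwise distances. The gene names, the 1-D positions used as a stand-in distance, and the value of k are all hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def extend_clusters(clusters, unassigned, distance, k=2.0):
    """Pull each unassigned gene into its closest seed cluster if it fits that cluster's spread.

    Assumed acceptance rule (not from the slides):
    mean distance to members <= mean + k * stdev of within-cluster distances.
    """
    extended = {name: list(members) for name, members in clusters.items()}
    leftover = []
    for gene in unassigned:
        # Closest seed cluster by mean distance to its members.
        name, d_gene = min(
            ((n, mean(distance(gene, m) for m in members))
             for n, members in clusters.items()),
            key=lambda item: item[1],
        )
        within = [distance(a, b) for a, b in combinations(clusters[name], 2)]
        if d_gene <= mean(within) + k * pstdev(within):
            extended[name].append(gene)
        else:
            leftover.append(gene)
    return extended, leftover


# Toy data: 1-D positions stand in for sequence distances.
pos = {"a1": 0.0, "a2": 1.0, "a3": 2.0, "b1": 10.0, "b2": 11.0, "b3": 12.0,
       "x": 1.5, "y": 6.0}


def toy_distance(g, h):
    return abs(pos[g] - pos[h])


seeds = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]}
extended, leftover = extend_clusters(seeds, ["x", "y"], toy_distance)
print(extended["A"], leftover)  # ['a1', 'a2', 'a3', 'x'] ['y']
```

Gene y sits midway between both clusters and is left unassigned rather than forced into one of them.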
Step 3. Phylogenetic reconstruction
- (1) Initial cluster seeds from unambiguous matches
- (2) Cluster extension
- (3) Phylogenetic reconstruction
  - Phylogeny for each cluster
    - Align each cluster (MUSCLE protein alignment)
    - Neighbor-Joining: fast, distance-based (JTT model)
    - Bootstrapping used for confidence measure, propagates to orthology
  - Use phylogeny to further separate clusters
    - Reconciliation
- Four mammals
  - 78,744 genes
  - 17,586 trees
  - Largest: 103 genes
- Ten fungi
  - 54,890 genes
  - 5,537 trees
  - Largest: 164 genes
[Figure label: Extended gene clusters]
Bootstrap confidence scores
[Figure: Gene cluster → Alignment → Tree; sample with replacement, repeat 100 times]
- Bootstrapping
  - Sample columns from the alignment randomly
  - Build trees based on these columns (NJ, ML, MP)
- For every internal branch
  - Count how many topologies agree with inferred split
  - Percentage is the bootstrap confidence score
- Building a final tree
  - Full tree, using all the data
  - Consensus tree
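The bootstrap loop can be illustrated on four taxa, where the tree has a single internal branch. In this sketch the tree builder is a deliberately simple stand-in (pick the split ab|cd that minimizes d(a,b) + d(c,d) on Hamming distances), not NJ, ML, or MP; the sequences are invented and the function names are hypothetical.

```python
import random


def resample_columns(alignment, rng):
    """Sample alignment columns with replacement, keeping the alignment length."""
    n = len(next(iter(alignment.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def hamming(s, t):
    return sum(x != y for x, y in zip(s, t))


def quartet_split(alignment):
    """Stand-in tree builder for four taxa: choose the split minimizing d(a,b) + d(c,d)."""
    a, b, c, d = sorted(alignment)
    options = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def cost(split):
        (p, q), (r, s) = split
        return hamming(alignment[p], alignment[q]) + hamming(alignment[r], alignment[s])

    return min(options, key=cost)


def bootstrap_support(alignment, replicates=100, seed=0):
    """Return the split inferred from all columns and its bootstrap percentage."""
    rng = random.Random(seed)
    inferred = quartet_split(alignment)  # tree built from all the data
    agree = sum(quartet_split(resample_columns(alignment, rng)) == inferred
                for _ in range(replicates))
    return inferred, 100.0 * agree / replicates


# Invented toy alignment.
toy = {"dog": "ACGTACGTACGT", "human": "ACGTACGTACGA",
       "mouse": "ACTTACCTAGGT", "rat": "ACTTACCTAGGA"}
inferred, support = bootstrap_support(toy)
print(inferred, support)
```

For larger trees the same counting is repeated for every internal branch, comparing the split it induces against the splits in each replicate tree.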
Phylogenetic Tree Reconciliation: Gene Tree ↔ Species Tree
Gene Tree / Species Tree reconciliation
[Figure label: Known species tree]
- G1: Each species contains each subfamily
  - Easy to infer duplication events
- G2: Loss events in each family hide complex ancestry
  - Reconciliation with species tree recovers the events
Reconciliation to determine orthology
- Reconcile each gene tree to the species tree
  - Each node in gene tree maps to node in species tree
- Read off orthology and paralogy
  - Infer gene duplication and loss events
[Figure: gene tree with leaves d1, h1, m1, r1, m2, r2, and species tree with dog, chimp, human, mouse, rat]
Reconciliation algorithm
- For every node g, decide duplication or speciation
  - Map left child to species tree → M(a). Map right child to species tree → M(b)
  - M(g) is least common ancestor of M(a) and M(b)
- After mapping
  - g is a duplication node if M(g) = M(a) or M(g) = M(b)
  - g is a speciation node if M(g) is distinct from the mappings of its children
- Post-processing: count loss edges
Limitation: Reconciliation assumes correct species tree. Generally NOT the case
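The mapping rule above can be sketched directly. This is a minimal version, assuming binary trees written as nested 2-tuples with leaf names as strings; the toy trees and the gene-to-species naming rule are invented for illustration, and the loss-counting post-processing step is omitted.

```python
def build_index(species_tree):
    """Return {node: (parent, depth)} for a species tree of nested 2-tuples."""
    index = {}
    stack = [(species_tree, None, 0)]
    while stack:
        node, parent, depth = stack.pop()
        index[node] = (parent, depth)
        if isinstance(node, tuple):
            for child in node:
                stack.append((child, node, depth + 1))
    return index


def lca(index, u, v):
    """Least common ancestor: repeatedly move the deeper node up."""
    while u != v:
        if index[u][1] >= index[v][1]:
            u = index[u][0]
        else:
            v = index[v][0]
    return u


def reconcile(gene_tree, species_of, index, events=None):
    """Compute M(g) bottom-up and label every internal gene-tree node."""
    if events is None:
        events = []
    if not isinstance(gene_tree, tuple):  # leaf: maps to its own species
        return species_of[gene_tree], events
    m_a, _ = reconcile(gene_tree[0], species_of, index, events)
    m_b, _ = reconcile(gene_tree[1], species_of, index, events)
    m_g = lca(index, m_a, m_b)
    kind = "duplication" if (m_g == m_a or m_g == m_b) else "speciation"
    events.append((gene_tree, m_g, kind))
    return m_g, events


# Invented toy example: one duplication in the mouse/rat ancestor.
species_tree = ("dog", ("human", ("mouse", "rat")))
gene_tree = ("d1", ("h1", (("m1", "r1"), ("m2", "r2"))))
first_letter = {"d": "dog", "h": "human", "m": "mouse", "r": "rat"}
species_of = {g: first_letter[g[0]] for g in ("d1", "h1", "m1", "r1", "m2", "r2")}

_, events = reconcile(gene_tree, species_of, build_index(species_tree))
kinds = [kind for _, _, kind in events]
print(kinds)
# ['speciation', 'speciation', 'duplication', 'speciation', 'speciation']
```

The node joining (m1, r1) with (m2, r2) maps to the same species-tree node as both of its children, so it is the one labeled as a duplication.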
Mammalian tree: Abundance of alternate tree topologies
- Most trees are incorrect
  - Count most frequent subtrees of size four
  - Correct species tree a minority: <20%
- Reason: Long branch attraction
  - Due to rapidly evolving rodent lineage
  - Common phylogenetic reconstruction problem
- What happens to reconciliation?
Reconciliation with erroneous trees
[Figure: gene tree and species tree, each over D, H, M, R]
- With erroneous trees
  - Direct reconciliation leads to spurious duplications and losses
Towards better reconciliation methods
[Figure: gene tree with leaves d1, h1, m1, r1, d2, h2, m2, r2, m3, r3, and species tree with dog, human, mouse, rat]
- Full solution: Maximize joint likelihood
  - Incorporate cost of reconciliation in tree building
  - Tradeoff: nucleotide mutations vs. gene duplication/loss
- One solution: Partitioning by Reconciliation
  - Key insight: most errors are on older branches, irrelevant to orthology
  - Use species tree to partition gene tree
  - Allow re-rooting of each partition based on species tree
  - → Apply reconciliation algorithm to each partition
Step 4: Partitioning by reconciliation
(1) Initial cluster seeds (2) Cluster extension (3) Phylogenetic reconstruction (4) Partitioning by reconciliation
[Flow diagram: Gene Clusters → Phylogeny → Unrooted Trees → Partition → Partitioned Trees (unrooted) → Select root → Rooted Trees → Reconciliation → Ortholog assignments with confidence score. Bootstrapping loop: repeat 100 times]
Putting it all together: SynPhyl
[Flow diagram: Gene Annotations and Genome synteny → Initial clustering → Gene Family Clusters → Phylogeny → Unrooted Trees → Partition → Partitioned Trees (unrooted) → Select root → Rooted Trees → Reconciliation → Assign orthology with confidence scores → Ortholog and Paralog Database. Bootstrapping loop: repeat 100 times]
Benchmarks and Results
Results: Mammalian comparisons
- Compare human, mouse, rat, dog complete genomes
  - Coverage: 75,753 genes
  - Number of groups: 18,446 (of which 13,741 have all four species)
  - One-to-one orthologs in four species: 12,359
- Contribution of phylogenetic reconstruction
  - More one-to-one orthologs: 11,619 → 12,359
  - Large families split into small groups: 17,586 → 18,446 groups
[Figure: Count of ortholog groups by species]
Higher resolution: resolving fine-grain correspondence
Higher sensitivity: recognize subtle duplication events
- Additional duplicates found for ENSEMBL 1-to-1 orthologs
  - Hundreds of additional duplicates detected
  - Confirmed by branch lengths and topology
SynPhyl comparison to direct reconciliation
                              Direct reconciliation   SynPhyl reconciliation
Total count of losses         18,352                  11,750
Total count of duplications   10,114                  8,942
More gene trees reconcile to species tree
Gene duplications and losses dramatically decreased
Result: Genome-wide correspondence of multiple species
- C. albicans
- C. dubliniensis
- C. tropicalis
- L. elongisporus
- C. guilliermondii
- C. lusitaniae
Summary / Contributions
- SynPhyl: new tool for genome-wide orthology
  - Uses synteny, phylogeny, and known species tree
  - Automatically determines orthologs and paralogs
  - Returns ortholog assignments, trees for each family
- Algorithmic highlights
  - Initial clustering constrained by synteny
  - Fine-grain correspondence uses phylogeny
  - Partition by reconciliation constrained by species trees
- Advantages of the algorithm
  - Practical, fast (< ½ day on a PC)
  - Uses information available: phylogeny, synteny
  - Confidence metric: bootstrap values propagate to orthology
  - Phylogeny ensures consistent orthologs (no over-collapsing)
- Performance
  - Successfully applied to mammals, fungi
  - Fine-grain resolution: phylogeny disambiguates large families
Outline
- Detecting gene duplication
  - Orthologs and paralogs
  - Gene trees and species trees
  - Reconciliation
- Detecting genome duplication
  - Evidence across species
  - Evidence in a single species
- Duplicate gene evolution
  - Detect accelerated divergence
  - Measuring positive selection
  - Gene conversion
Genome Duplication
A range of evolutionary distances
[Figure: S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus, with divergence times labeled 5 Myr and 20 Myr]
Ability to ask a different set of questions
Gene correspondence
[Figure: S. cerevisiae chromosomes I to XVI (12 Mb) plotted against K. waltii scaffolds (10.5 Mb)]
Gene correspondence
[Figure: S. cerevisiae chromosomes I to XVI plotted against K. waltii scaffolds]
Signatures of evolutionary events
Gene interleaving is evidence of complete duplication
Duplicate mapping tiles K. waltii
[Figure: S. cer. regions mapped onto K. waltii chromosomes Chr 1 to Chr 8]
Duplicate mapping of centromeres
Recognize sister regions solely based on gene order
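The interleaving signature can be checked with a small sketch, under assumptions that are not from the slides: each K. waltii gene in a window comes with its S. cerevisiae matches as (chromosome, position) pairs, and a window counts as doubly mapped when exactly two S. cerevisiae regions, each in conserved gene order, together cover every K. waltii gene. Gene names, chromosome labels, and positions are invented.

```python
def sister_tracks(kwal_order, scer_hits):
    """Group S. cerevisiae matches of consecutive K. waltii genes by chromosome."""
    tracks = {}
    for index, gene in enumerate(kwal_order):
        for chrom, position in scer_hits.get(gene, []):
            tracks.setdefault(chrom, []).append((index, position))
    return tracks


def is_colinear(track):
    """Gene order is conserved if positions rise or fall monotonically."""
    positions = [position for _, position in track]
    return positions == sorted(positions) or positions == sorted(positions, reverse=True)


def is_interleaved(kwal_order, scer_hits):
    """True if two colinear S. cerevisiae regions jointly tile the K. waltii window."""
    tracks = sister_tracks(kwal_order, scer_hits)
    if len(tracks) != 2 or not all(is_colinear(t) for t in tracks.values()):
        return False
    covered = {index for track in tracks.values() for index, _ in track}
    return covered == set(range(len(kwal_order)))


# Invented window: k3 kept both copies, every other gene lost one copy.
kwal_order = ["k1", "k2", "k3", "k4", "k5", "k6"]
scer_hits = {"k1": [("IV", 10)], "k2": [("XII", 50)],
             "k3": [("IV", 11), ("XII", 51)], "k4": [("XII", 52)],
             "k5": [("IV", 12)], "k6": [("IV", 13)]}
print(is_interleaved(kwal_order, scer_hits))  # True
```

Neither sister region alone covers the window, but the two together do, which is the pattern expected after duplication followed by gene loss.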
Conclusion: Whole Genome Duplication has happened
Whole Genome Duplications are everywhere!
- Yeast: Duplication
  - Most genes 1-to-1 mapping
  - Gene interleaving: evidence of duplication
  - Complete tiling of the genome
- Vertebrate: Duplication in Fish
  - Fish: Gene order not conserved, only chromosomes
  - Mammals: Gene order conserved, not chromosomes
- Two rounds of WGD in base of vertebrate lineage
  - Build clusters of related genes (use Ciona as outgroup)
  - Count duplications by reconciliation
  - Find regions of duplicate overlap → 4-way synteny
Genome duplication: evidence in a single species
Evidence of duplication using a single genome?
[Figure: Scer chr 4 and Scer chr 12 (Wolfe 97)]
- Genomic evidence
  - Conserved order of paralogous genes
  - Same transcriptional orientation
- However
  - Interspersed with single-copy genes
Interpretation: Genome duplication followed by gene loss
Whole genome duplication is controversial
- There was a whole-genome duplication. (Wolfe, Nature 97)
- There was no whole-genome duplication. (Dujon, FEBS 2000)
- At least some chrom dup. occurred independently. (Langkjaer, JMB, 2000)
- Dynamic equilibrium of duplications and loss. (Llorente, FEBS, 2000)
- Recent evidence supports single event. (Wong, PNAS 02)
- Continuous block duplications and deletions. (Dujon, Yeast 2003)
- Dup. precedes divergence from Kluyveromyces. (Piskur, Nature, 2003)
- Telomere-mediated duplication events. (Coissac, Mol Bio Evo 1997)
- Multiple closely spaced events. (Friedman, Genome Res, 2003)
- Spontaneous duplication of large chromosomal segments. (Koszul, EMBO 04)

- Insufficient evidence
  - Only 50% of genome in duplicate regions
  - Only 8% of genes present in two copies
  - Extensive redundancy outside duplicate regions
- Evidence against WGD
  - Divergence-based dating shows multiple times
  - Other species have similar level of redundancy
- Alternative evolutionary scenario proposed
  - Independent segmental duplications
  - Also consistent with the evidence
Evidence remains inconclusive
Conclusion: Whole Genome Duplication has happened
Outline
- Detecting gene duplication
  - Orthologs and paralogs
  - Gene trees and species trees
  - Reconciliation
- Detecting genome duplication
  - Evidence across species
  - Evidence in a single species
- Duplicate gene evolution
  - Detect accelerated divergence
  - Measuring positive selection
  - Gene conversion
Post-duplication evolution
Whole-genome duplication results in 500 new genes
[Figure: number of genes (5,000) over time, from 100 Myrs ago to today]
Evidence of accelerated gene evolution
Fate of duplicated genes
- 457 genes kept in two copies, result of selection
  - Involved in sugar metabolism and fermentation
[Figure: S. cerevisiae copy 1, S. cerevisiae copy 2, and K. waltii, with the WGD marked]
Evidence of accelerated protein divergence?
Measuring accelerated divergence
[Figure: codon change TTT (F, Phe) → GTA (V, Val), passing through either TTA (L, Leu) or GTT (V, Val) → two shortest paths possible]
- Protein divergence
  - Count amino-acid changes
  - Use BLOSUM substitution matrix
- Nucleotide divergence
  - Count nucleotide substitutions
  - Correct for back-mutations
  - Use transition/transversion evolutionary model
- dN / dS
  - Two types of nucleotide substitutions
    - S: synonymous. Preserve amino-acid translation
    - N: non-synonymous. Change amino-acid
  - Count synonymous / non-synonymous sites
  - Depends on path taken between two codons
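The path dependence can be made concrete with the TTT → GTA example. The sketch below averages synonymous and non-synonymous changes over all shortest mutation paths between two codons, using the standard genetic code. It counts changes only: normalizing by the number of synonymous and non-synonymous sites to obtain dN/dS, and excluding paths through stop codons, are left out as simplifications.

```python
from itertools import permutations, product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_changes(codon1, codon2):
    """Average (synonymous, non-synonymous) changes over all shortest paths."""
    differing = [i for i in range(3) if codon1[i] != codon2[i]]
    paths = list(permutations(differing))  # each ordering of the changes is one path
    syn = nonsyn = 0
    for order in paths:
        current = codon1
        for i in order:
            nxt = current[:i] + codon2[i] + current[i + 1:]
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
    return syn / len(paths), nonsyn / len(paths)


print(count_changes("TTT", "GTA"))  # (0.5, 1.5)
```

Going through GTT gives one non-synonymous and one synonymous change (Phe → Val → Val), while going through TTA gives two non-synonymous changes (Phe → Leu → Val), hence the averages of 0.5 and 1.5.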
Scenarios for rapid gene evolution
- One copy faster (Ohno, 1970)
- Both copies faster (Lynch, 2000)
[Figure: trees of Scer copy1, Scer copy2, and Kwal for each scenario]
20% of duplicated genes show acceleration
95% of cases: Only one copy faster
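One way to tell the two scenarios apart is to compare branch lengths in the three-gene tree. The sketch below derives the branch lengths from pairwise distances and then applies an assumed rule that is not from the slides: a copy counts as accelerated when its branch is more than `factor` times the K. waltii branch. The distances and the default factor are hypothetical.

```python
def branch_lengths(d12, d1k, d2k):
    """Branch lengths of the three-taxon tree (copy1, copy2, Kwal) from pairwise distances."""
    b1 = (d12 + d1k - d2k) / 2
    b2 = (d12 + d2k - d1k) / 2
    bk = (d1k + d2k - d12) / 2
    return b1, b2, bk


def classify(d12, d1k, d2k, factor=1.5):
    """Label the scenario; 'accelerated' means branch > factor * Kwal branch (assumed rule)."""
    b1, b2, bk = branch_lengths(d12, d1k, d2k)
    fast = [name for name, b in (("copy1", b1), ("copy2", b2)) if b > factor * bk]
    labels = {0: "no acceleration", 1: "one copy faster", 2: "both copies faster"}
    return labels[len(fast)], fast


# Hypothetical distances: copy1 has drifted far from both other genes.
print(classify(d12=0.5, d1k=0.6, d2k=0.3))  # ('one copy faster', ['copy1'])
```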
Emerging gene functions after duplication
- Origin of replication → silencing (4-fold acceleration)
  - Scer Sir3 (silencing)
  - Scer Orc1 (origin of replication)
  - Kwal Orc1
- Translation initiation → anti-viral defense (3-fold acceleration)
  - Scer Ski7 (anti-viral defense)
  - Scer Hbs1 (translation initiation)
  - Kwal Hbs1
Asymmetric divergence → recognize ancestral / derived
Distinct functional properties
                Ancestral function   Derived function
Gene deletion   Lethal (20%)         Never lethal
Expression      Abundant             Specific (stress, starvation)
Localization    General              Specific (mitochondrion, spores)
Gain new function and lose ancestral function
Gene conversion
Decelerated evolution
[Figure: tree of Scer copy1, Scer copy2, and Kwal]
- 60 gene pairs (13% of 457 pairs)
  - 98% protein identity (all pairs: 55%)
  - 90% identity in 4-fold degenerate sites (all pairs: 41%)
- Not recent duplication
  - Gene order argues ancestral WGD pairs
Gene conversion?
Evidence of gene conversion
[Figure: tree of YBL072C (S. cerevisiae), YER102W (S. cerevisiae), and their K. waltii ortholog]
- Tree root reveals time of duplication
  - No acceleration in the K. waltii branch
  - The two genes have recently replaced each other
- Branching order reveals gene conversion
  - Paralogs are closer to each other than to their ortholog
  - Both S. cerevisiae and S. bayanus show gene conversion
Periodic gene conversion
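The branching-order test can be sketched with plain sequence identity. This minimal check flags a pair as a gene conversion candidate when the two paralogs are more similar to each other than either is to the ortholog. The sequences are invented and assumed to be aligned to equal length. A recent duplication would produce the same pattern, so the gene-order argument for an ancestral WGD pair is still needed to rule that out.

```python
def identity(s, t):
    """Fraction of identical positions between two aligned sequences, skipping gaps."""
    pairs = [(a, b) for a, b in zip(s, t) if a != "-" and b != "-"]
    return sum(a == b for a, b in pairs) / len(pairs)


def conversion_candidate(copy1, copy2, ortholog):
    """True if the paralogs are closer to each other than to their ortholog."""
    between = identity(copy1, copy2)
    return between > identity(copy1, ortholog) and between > identity(copy2, ortholog)


# Invented aligned sequences.
scer_copy1 = "ATGGCTAAAGGT"
scer_copy2 = "ATGGCTAAAGGC"
kwal = "ATGGCAAAGGGT"
print(conversion_candidate(scer_copy1, scer_copy2, kwal))  # True
```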
Summary
- Detecting gene duplication
  - Orthologs and paralogs
  - Gene trees and species trees
  - Reconciliation
- Detecting genome duplication
  - Evidence across species
  - Evidence in a single species
- Duplicate gene evolution
  - Detect accelerated divergence
  - Measuring positive selection
  - Gene conversion