Title: Computational Analyses of Eukaryotic Gene Evolution
1Computational Analyses of Eukaryotic Gene Evolution
- Sourav Chatterji
- souravc_at_cs.berkeley.edu
- May 18, 2006
2- White Blood Cells From Cancer-resistant
- Mice Cure Cancers In Ordinary Mice
- (Science Daily News. May 9, 2006)
3- The cancer-resistant mice all stem from a single mouse discovered in 1999.
- The original studies showed that the resistance is inherited.
- About half of the progeny inherit the resistance.
- Cui and his colleagues believe that the resistance results from a mutation in a single gene and are attempting to find it, but that has proved frustrating.
- Mouse genome published in 2002.
4Understanding the biology of genes
- Draft Human Genome (2001).
- 30,000-40,000 genes
- Finished Human Genome (2004).
- 20,000-25,000 genes
- A Third Approach to Gene Prediction Suggests
Thousands of Additional Human Transcribed
Regions (Glusman et al. 2006).
5Understanding the biology of genes
- Why is it a hard problem?
- Pseudogenes.
- Long Introns.
- Conserved Non Coding Sequences (CNSes).
- Determination of Gene Boundaries.
- Alternative Splicing.
6Understanding the biology of genes
- "What's needed is knowledge about why genes have the characteristics they do" (Pennisi, 2003).
- Studying gene evolution should help.
Illustration: Terry E Smith
7State of Sequencing Projects
- 25 Mammalian Genomes.
- Sequencing around H.sapiens.
- 12 Fly Genomes.
- Sequencing around D. melanogaster.
- 5 Worm Genomes.
- Sequencing around C. elegans.
Source: NHGRI
10Outline
- Reference Based Annotation.
- The GeneMapper Algorithm.
- Annotation of Fruitfly Genomes.
- Evolution of Gene Structure.
12Source: http://rana.lbl.gov/drosophila/
13Species         ESTs    mRNAs  RefSeq
D. melanogaster   383407  19931  19967
D. simulans       5013    80     None
D. yakuba         11015   808    None
D. erecta         None    6      None
D. ananassae      None    11     None
D. pseudoobscura  35042   40     None
D. mojavensis     361     2      None
D. virilis        663     41     None
D. grimshawi      None    None   None
Source: UCSC browser
14Reference Based Annotation
- How to accurately annotate the newly sequenced genomes?
- Transfer annotations from a well-studied reference genome.
- Implicitly creates a data set for studying the evolution of genes.
15Protein Alignment Approach
Reference Protein
Genomic Sequence
- Procrustes (Gelfand et al. 1996)
- GeneWise (Birney et al. 2004)
- Integral part of the ENSEMBL gene annotation pipeline.
- Not aware of exon/intron boundaries.
- Accuracy decreases when sequence identity is low.
16Similarity Based Approach
Reference Gene
Target Sequence
- Projector (Meyer and Durbin 2004)
- Predicts the global gene structure using a pair HMM.
- Uses heuristics to decrease the search space.
- GeneMapper
- Uses a bottom-up algorithm for predicting the gene structure.
- Not suitable if the exon/intron structure of the gene has diverged a lot.
17Outline
- Reference Based Annotation.
- The GeneMapper Algorithm.
- Annotation of Fruitfly Genomes.
- Evolution of Gene Structure.
18The GeneMapper Algorithm
- Bottom-Up Algorithm
- Predict the ortholog of each reference exon in the target sequence.
- Join exon predictions together to predict gene structure.
- Multiple Species GeneMapper
- Uses all available information if the gene has to be mapped into multiple target species.
- Uses a profile-based approach to get more accurate annotations in evolutionarily distant species.
20Mapping Exons Accurately
- Predicting the ortholog of a reference exon in the target sequence
- Accurately model the evolution of exons.
- Use StrataSplice to model splice sites.
21Mapping Exons Accurately
- Uses a version of the Smith-Waterman algorithm.
- Exact optimization is feasible.
- Green edges model the evolution of codons.
- Uses 64 × 64 COD distance matrices.
- Red edges allow for frameshifts.
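The codon and frameshift edges described above can be illustrated with a small local-alignment sketch. This is not GeneMapper's actual aligner: the function name `codon_local_align`, the toy per-nucleotide codon score (standing in for the 64 × 64 COD matrices), and the gap and frameshift penalties are all assumptions made for illustration.

```python
def codon_local_align(ref, tgt, codon_gap=-4, frameshift=-6):
    """Best local alignment score between two nucleotide strings.

    Moves that consume three nucleotides from both sequences play the role
    of the codon ("green") edges; moves that consume a single nucleotide
    from one sequence play the role of the frameshift ("red") edges.
    All scores are toy values, not the COD matrices from the slides.
    """
    def codon_score(a, b):
        # Toy substitute for a 64 x 64 codon matrix:
        # +2 per identical nucleotide, -1 per mismatch.
        return sum(2 if x == y else -1 for x, y in zip(a, b))

    n, m = len(ref), len(tgt)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = [0]  # local alignment may restart anywhere
            if i >= 3 and j >= 3:  # codon substitution edge
                candidates.append(
                    H[i - 3][j - 3] + codon_score(ref[i - 3:i], tgt[j - 3:j])
                )
            if i >= 3:  # whole-codon deletion in the target
                candidates.append(H[i - 3][j] + codon_gap)
            if j >= 3:  # whole-codon insertion in the target
                candidates.append(H[i][j - 3] + codon_gap)
            if i >= 1:  # single-nucleotide frameshift edge
                candidates.append(H[i - 1][j] + frameshift)
            if j >= 1:  # single-nucleotide frameshift edge
                candidates.append(H[i][j - 1] + frameshift)
            H[i][j] = max(candidates)
            best = max(best, H[i][j])
    return best
```

Because the table has only (n + 1) × (m + 1) cells and a constant number of edges per cell, the optimum is found exactly, which is the sense in which exact optimization is feasible for a single exon.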
22Multiple Species GeneMapper
- Generates a gene profile of orthologous genes.
- A more complete characterization than a single gene.
- Each column contains an alignment of orthologous codons.
- Special columns of width 1 are allowed to account for frameshifts and sequencing errors.
23Multiple Species GeneMapper
- Iteratively builds a gene profile to capture the characteristics of the gene.
- The profile helps us map exons more accurately to evolutionarily distant species.
- ExonAligner is modified to align the profile with the target sequence.
- Uses different models for conserved and non-conserved codons.
24Exploiting Phylogeny: Species Hopping
- Map the gene into the closest species.
- Add the prediction to the profile.
- Use the profile to map the gene into the next closest species.
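The species-hopping steps above amount to a simple loop over target species ordered by evolutionary distance. The sketch below is only an illustration under stated assumptions: `species_hop` and `toy_map_gene` are hypothetical names, the profile is modeled as a plain list of sequences, and the toy mapper (shared 4-mers) stands in for the real profile-to-sequence alignment.

```python
def species_hop(reference_gene, targets_by_distance, map_gene):
    """Map a gene into each target, closest species first.

    targets_by_distance: list of (species_name, sequence), already sorted
    from closest to most distant (the ordering is assumed to be given).
    map_gene: callable(profile, sequence) returning a prediction or None.
    """
    profile = [reference_gene]
    predictions = {}
    for species, sequence in targets_by_distance:
        prediction = map_gene(profile, sequence)
        predictions[species] = prediction
        if prediction is not None:
            # Add the prediction so it can help with more distant species.
            profile.append(prediction)
    return predictions, profile


def toy_map_gene(profile, target_sequence, k=4):
    """Hypothetical stand-in mapper: succeed if the target shares any
    k-mer with any sequence already in the profile."""
    kmers = {g[i:i + k] for g in profile for i in range(len(g) - k + 1)}
    hit = any(
        target_sequence[i:i + k] in kmers
        for i in range(len(target_sequence) - k + 1)
    )
    return target_sequence if hit else None


targets = [("close", "ATGGCTAAA"), ("far", "ATCGCTAAA")]
predictions, profile = species_hop("ATGGCCAAA", targets, toy_map_gene)
```

In this toy run the distant sequence shares no 4-mer with the reference alone, but it is mapped once the closer species' prediction has been added to the profile, which is the intuition behind hopping.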
29GeneMapper Performance
            GeneWise  Projector  GeneMapper
Gene Sn.    61.3      59.9       81.7
Gene Sp.    60.8      59.5       81.7
Exon Sn.    92.8      94.2       97.2
Exon Sp.    93.4      90.5       97.8
Nucl Sn.    99.86     99.78      99.88
Nucl Sp.    99.91     99.70      99.94
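The slides do not define Sn. and Sp.; the sketch below assumes the usual gene-prediction convention, in which sensitivity is the percentage of annotated features that were predicted exactly and specificity is the percentage of predicted features that are annotated. The function name and the representation of exons as (start, end) tuples are illustrative choices.

```python
def sensitivity_specificity(predicted, annotated):
    """Exon-level Sn and Sp as percentages.

    Assumed convention: Sn = TP / (TP + FN), Sp = TP / (TP + FP),
    where a true positive is an exon with exactly matching boundaries.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sn = 100.0 * true_positives / len(annotated) if annotated else 0.0
    sp = 100.0 * true_positives / len(predicted) if predicted else 0.0
    return sn, sp


predicted_exons = {(1, 10), (20, 30), (40, 50)}
annotated_exons = {(1, 10), (20, 30), (45, 50), (60, 70)}
sn, sp = sensitivity_specificity(predicted_exons, annotated_exons)
```

The same function could be applied at the gene level by treating each gene as a tuple of its exons, so that a gene counts as correct only when every exon boundary matches.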
31Sources of Errors
- Highly Divergent Exons
- Exon Splitting
- Improved handling in latest version.
- Assembly and Sequencing Errors
32Outline
- Reference Based Annotation.
- The GeneMapper Algorithm.
- Annotation of Fruitfly Genomes.
- Evolution of Gene Structure.
33Annotating the Fruitfly Genomes
34The Fruitfly Genomes
36The Annotation pipeline
- Construct whole genome homology maps of nine fruitfly genomes by using Mercator (Dewey and Pachter, unpublished).
- Extend the map using extrapolation.
- Use the map as a guide to transfer D. melanogaster RefSeq annotations by using GeneMapper.
37Generating Homology Maps
(Waterston et al., 2002)
38Issues With Homology Maps
- Gene duplications
- Distinguishing between paralogs and orthologs.
- Incomplete coverage.
- Low sequence identity.
- Insufficient Anchor Coverage
- Micro-rearrangements.
40Fruitfly Annotations
41Fruitfly Annotations
- Annotations of the 11 fruitfly genomes
- http://bio.math.berkeley.edu/genemapper/CAF1_genes_v0.2
- Browsable on the UCSC browser
- http://bio.math.berkeley.edu/genemapper/fruitfly.html
- Gene Alignments
- http://bio.math.berkeley.edu/genemapper/CAF1_aln/
42Annotation Statistics
Species           Transcripts  Unique  Complete
D. melanogaster   19697        13488   N/A
D. simulans       18274        12353   17074
D. yakuba         18551        12594   17614
D. erecta         18700        12682   18203
D. ananassae      17398        11561   15858
D. pseudoobscura  16651        10867   14595
D. mojavensis     15908        10214   13192
D. virilis        16032        10305   13451
D. grimshawi      15700        10063   13107
43Outline
- Reference Based Annotation.
- The GeneMapper Algorithm.
- Annotation of Fruitfly Genomes.
- Evolution of Gene Structure.
44Exon Intron Gene Structure
- Phenomena due to exon-intron structure
- Alternative Splicing.
- Regulation through UTRs.
- Nonsense Mediated Decay.
- Diversity in Gene Structure
- Prokaryotic genes are intronless.
- Tremendous Diversity within Eukaryotes.
- Introns might have been responsible for the formation of the nucleus (Martin and Koonin, 2006).
45Evolution of Gene Structure
- Introns-early theory
- Introns lost in prokaryotic evolution.
- Introns-late theory
- Spliceosomal introns invented during eukaryotic evolution.
- Reality is probably in the middle.
- Most introns are fairly new.
- Eukaryotic ancestor had spliceosomal mechanisms.
46Evolution of Gene Structure
- Conservation of intron positions among various eukaryotic clades (Rogozin et al. 2003).
- Both loss and gain of introns in various eukaryotic lineages (Rogozin et al. 2003).
- Introns preferentially lost near the 3' ends of genes (Roy and Gilbert, 2005).
- Excess of 5' introns in eukaryotic genomes (Lin and Zhang, 2005).
47Mechanisms of Intron Gain
- Duplication Theory (Tarrio et al. 1998)
- New introns are formed by duplication of existing introns.
- Transposon Theory (Crick 1979)
- Novel introns arise by insertion of transposons.
48Mechanisms of Intron Loss
- Recombination Theory (Bernstein et al., 1983)
- Recombination of a reverse-transcribed mRNA transcript with the genome results in the loss of introns.
- Predicts intron loss at the 3' end.
- Deletion Theory (Kent and Zahler, 2000)
- Intron loss by genomic deletion.
- Predicts inexact intron loss.
49Finding Intron Gain/Loss
- Two criteria for detecting credible intron gain (Logsdon et al. 1998)
- Strong phylogenetic support.
- Source of intron DNA should be identifiable.
- The second criterion is hard to satisfy, as there are few constraints on intron evolution.
50Finding Intron Gain/Loss
The Melanogaster Group
[Figures: intron presence/absence patterns on the phylogeny. One pattern is labeled Present, Absent, Present; the other is labeled Absent, Absent, Present.]
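One way to read presence/absence patterns like these is a parsimony call over a target species, a sister species, and an outgroup. The sketch below is an assumption about how such patterns could be interpreted, not the procedure used in the talk; the function name and the three-species setup are hypothetical.

```python
def classify_intron_event(target, sister, outgroup):
    """Parsimony call for one intron position.

    Each argument is True if the intron is present in that species.
    Assumed tree: (target, sister) are closest relatives; outgroup
    branches off earlier.
    """
    if target == sister == outgroup:
        return "no change"
    if sister == outgroup:
        # The target alone differs, so one event on the target branch
        # explains the pattern.
        return "loss in target" if sister else "gain in target"
    # Otherwise the change lies on another branch or cannot be polarized.
    return "not attributable to target"
```

For example, `classify_intron_event(False, True, True)` reports a loss in the target, while `classify_intron_event(True, False, False)` reports a gain.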
55Finding Intron Gain/Loss
- Determination of Intron Gain
- GeneMapper allows for a single inserted intron in an exon.
- Determination of Intron Loss
- Minimum intron length for maintaining splicing reaction.
56Finding Intron Gain/Loss
- Used GeneMapper to search for lost/inserted introns in all FlyBase genes.
- Found 231 inserted introns.
- Found 105 deleted introns.
57Lengths of Inserted Introns
58Phases of Inserted Introns
59Mechanisms of Intron Loss
- Recombination Hypothesis
- Predicts that intron loss should occur at the 3' end.
- Testing the hypothesis
- 53 of the 105 lost introns are either the last or penultimate intron at the 3' end.
- 82 of 105 lost introns are in the 3' third.
- 94 of 105 lost introns are in the 3' half.
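The counts above can be turned into percentages and checked against a simple null model. The sketch below assumes, purely for illustration, that under the null each lost intron is equally likely to fall in either half of its gene; the talk does not state which null model, if any, was used, and this one ignores the real positional distribution of introns.

```python
from math import comb


def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


lost_total = 105
counts = {
    "last or penultimate intron": 53,
    "3' third": 82,
    "3' half": 94,
}

# Share of lost introns in each category, as a percentage.
percentages = {
    label: round(100 * count / lost_total, 1) for label, count in counts.items()
}

# Probability of seeing at least 94 of 105 losses in the 3' half
# if each half were equally likely (assumed null model).
p_half = binomial_upper_tail(counts["3' half"], lost_total, 0.5)
```

Under that assumed null the tail probability is vanishingly small, which is consistent with the 3' bias predicted by the recombination hypothesis.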
60Mechanisms of Intron Loss
- Deletion Hypothesis
- Predicts inexact intron loss.
- Testing the hypothesis
- Look at gene alignments around fusion events.
- There do not appear to be many intron losses through inexact deletions.
61Mechanisms of Intron Gain
- Duplication Hypothesis
- Introns formed by duplication of existing introns.
- Testing the hypothesis
- Use BLAST to search for matches with existing Melanogaster introns.
- No duplications found.
62Mechanisms of Intron Gain
- Transposon Hypothesis
- Testing the hypothesis
- Searched for TEs using RepeatMasker.
- 3 of the 231 new introns were found to be transposons.
- Although some introns are formed by insertion of TEs into genes, this mechanism appears to operate very rarely.
63Conclusions
- Large scale sequencing projects provide an unprecedented opportunity to study genome evolution.
- We have developed computational tools to study genome evolution.
- These tools can be used to prove or disprove theories about gene evolution.