Title: Interpreting the human genome
1Interpreting the human genome
CSAIL: MIT Computer Science and Artificial Intelligence Lab
Broad Institute of MIT and Harvard for Genomics in Medicine
2The age of comparative genomics
3Resolving power in mammals, flies, fungi
[Figure: yeast phylogeny. Labels: 9 Yeasts (pre- and post-duplication), 8 Candida (diploid and haploid), several branches marked P.]
12 flies
10 mammals
17 yeasts
- Neutral 2.57 subs/site
- (opp 0.62 32sps 4.87)
- Coding 1.16 subs/site
- Detect 6-mer at FP 10^-6
- Neutral 4.13 subs/site
- Coding 1.65 subs/site
- Detect 6-mer at 10^-11
- Neutral 15.5 subs/site
- (Yeast 6.5 Candida 6.5)
- Coding 7.91 subs/site
- Detect 3-mer at 10^-21
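One way to read the "Detect k-mer" figures is as the chance that a neutral k-mer survives unchanged across the whole tree. The sketch below assumes a simple Poisson substitution model, under which a single neutral site shows no substitution with probability exp(-T) for total branch length T. That model is an assumption made here for illustration; the slide does not say how its figures were derived. Under it, the three neutral branch lengths give values within about an order of magnitude of the slide's figures.

```python
import math

def chance_conservation(neutral_subs_per_site: float, k: int) -> float:
    """Probability that k independent neutral sites all stay unchanged.

    Assumption: substitutions follow a Poisson process, so one site is
    unchanged across the tree with probability exp(-T).
    """
    return math.exp(-neutral_subs_per_site * k)

# (neutral branch length in subs/site, motif length), taken from the slide
cases = [(2.57, 6), (4.13, 6), (15.5, 3)]

for branch_length, k in cases:
    p = chance_conservation(branch_length, k)
    print(f"T = {branch_length:5.2f}  k = {k}  P(conserved by chance) = {p:.1e}")
```

The point of the calculation is that resolving power grows exponentially with total branch length, which is why a 3-mer in the yeasts is easier to detect than a 6-mer in the other two clades.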
4Comparative Genomics 101: Conservation → Function
- Conserved elements are typically functional (and vice versa)
- For example, exons are deeply conserved to mouse, chicken, fish
- Some conserved elements are still uncharacterized
- How do we make sense of them?
- How do we distinguish each type of functional element?
- Answer: evolutionary signatures (Comp. Genomics 201)
- Tell me how you evolve, I'll tell you who you are
- Patterns of change → selective pressures → specific function
5Overview of this talk
- 1. Genome interpretation
- Evolutionary signatures of genes
- Revisiting the human and fly genomes
- Unusual gene structures
- 2. Gene regulation
- Regulatory motif discovery
- microRNA regulation
- Enhancer identification
- 3. Genome evolution
- Phylogenomics
- The two forces of gene evolution
- Accurate gene trees in complete genomes
6The goal: All the genes and nothing but the genes
Yeast genes
Fly genes
Human genes
Kellis et al. Nature, 2003
Mike Lin 2006
Mike Lin 2006
7Gene identification
Study known genes
Derive conservation rules
Discover new genes
- Evolutionary signatures
- Tell me how you evolve, I'll tell you who you are
- Each type of functional element evolves in its own specific ways
8Distinguishing genes from non-coding regions
Splice
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
- Protein-coding genes have specific evolutionary constraints
- Gaps are multiples of three (preserve amino acid translation)
- Mutations are largely 3-periodic (silent codon substitutions)
- Specific triplets exchanged more frequently (conservative substs.)
- Conservation boundaries are sharp (pinpoint individual splicing signals)
- Encode as evolutionary signatures
- Computational test for each of them
- Combine and score systematically
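Two of the constraints above translate directly into small computational tests. The following is a minimal sketch, not the method used in the talk: the function names and the toy sequences are hypothetical, and a real test would run over a multi-species alignment rather than one pair.

```python
def gap_runs(aligned_seq: str) -> list[int]:
    """Lengths of each run of '-' in one aligned sequence."""
    runs, current = [], 0
    for ch in aligned_seq:
        if ch == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def frame_preserving_fraction(aligned_seq: str) -> float:
    """Fraction of gap runs whose length is a multiple of three."""
    runs = gap_runs(aligned_seq)
    if not runs:
        return 1.0
    return sum(1 for r in runs if r % 3 == 0) / len(runs)

def mismatches_by_codon_position(ref: str, other: str, frame: int = 0) -> list[int]:
    """Count substitutions at codon positions 1, 2, 3 of the reference.

    Columns where either sequence has a gap are not counted.
    """
    counts = [0, 0, 0]
    ref_pos = 0  # index into the ungapped reference
    for a, b in zip(ref, other):
        if a == "-":
            continue
        if b != "-" and a != b:
            counts[(ref_pos - frame) % 3] += 1
        ref_pos += 1
    return counts

# Hypothetical toy data: every change falls on a third codon position.
print(mismatches_by_codon_position("ATGGCTAAAGGT", "ATGGCCAAGGGA"))  # [0, 0, 3]
print(frame_preserving_fraction("ATG---GCT-AA"))                      # 0.5
```

In a coding region one would expect the third-position count to dominate and nearly all gap runs to be multiples of three; in a non-coding region neither pattern should hold.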
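9Signature 1: Reading frame conservation
[Figure: reading frame conservation (RFC) scores in percent, extracted as alternating pairs. One value of each pair is always 100; the other reads 60, 55, 90, 40, 60, 100, 20, 30, 40. The final pair is garbled in extraction ("?100" and "?60").]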
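A simplified version of the RFC score can be sketched as follows. It assumes a single pairwise alignment and reports the percentage of aligned nucleotides that fall in the best-supported relative frame; the windowing and multi-species combination that a full implementation would need are left out, and the function name and test sequences are hypothetical.

```python
def rfc_percent(ref: str, informant: str) -> float:
    """Percent of gap-free aligned columns in the most common relative frame.

    Both inputs are rows of one alignment ('-' marks a gap) and must have
    the same length.
    """
    if len(ref) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    ref_idx = inf_idx = 0          # positions in the ungapped sequences
    offset_counts = [0, 0, 0]      # columns seen at relative frame 0, 1, 2
    for a, b in zip(ref, informant):
        if a != "-" and b != "-":
            offset_counts[(inf_idx - ref_idx) % 3] += 1
        if a != "-":
            ref_idx += 1
        if b != "-":
            inf_idx += 1
    total = sum(offset_counts)
    return 100.0 * max(offset_counts) / total if total else 0.0

# A 3-nt deletion keeps the frame; a 1-nt deletion breaks it downstream.
print(rfc_percent("ATGGCTAAAGGTTGA", "ATGGCT---GGTTGA"))  # 100.0
print(rfc_percent("ATGGCTAAAGGTTGA", "ATGGCT-AAGGTTGA"))
```

The second call scores well below 100 because every column after the single-nucleotide gap sits in a different relative frame from the columns before it.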
10Results in yeast
                          Accept   Reject
4000 named genes          99.9%    0.1%
300 intergenic regions    1%       99%
2000 hypothetical ORFs    1500     500
High sensitivity and specificity
Revisit yeast annotation with SGD
11Signature 2: Distinct patterns of codon substitution
[Figure: two codon substitution matrices, one for genes and one for intergenic regions. Axes: codon observed in species 1 vs. codon observed in species 2.]
- Codon substitution patterns specific to genes
- Genetic code dictates substitution patterns
- Amino acid properties dictate substitution patterns
12Codon Substitution Matrix (CSM)
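[Figure: codon substitution matrix with axes human and mouse. Codon groups are labeled by amino acid class: aliphatic, negative, polar, positive, aromatic, polar.]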
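One plausible way to turn a codon substitution matrix into a score is a log-odds ratio between a matrix trained on coding alignments and one trained on non-coding alignments. The sketch below assumes that construction; the pseudocount, the function names, and the tiny training pairs are all hypothetical, and the slides do not specify the exact scoring formula.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_pairs(seq1: str, seq2: str, frame: int = 0):
    """Yield aligned codon pairs from two gap-free sequences of A, C, G, T."""
    for i in range(frame, min(len(seq1), len(seq2)) - 2, 3):
        yield seq1[i:i + 3], seq2[i:i + 3]

def train_csm(alignments, pseudocount: float = 1.0) -> dict:
    """Estimate P(codon in species 1, codon in species 2) from aligned pairs."""
    counts = Counter()
    for s1, s2 in alignments:
        counts.update(codon_pairs(s1, s2))
    total = sum(counts.values()) + pseudocount * len(CODONS) ** 2
    return {(a, b): (counts[(a, b)] + pseudocount) / total
            for a in CODONS for b in CODONS}

def csm_score(seq1: str, seq2: str, coding: dict, noncoding: dict, frame: int = 0) -> float:
    """Sum of log-odds over codon pairs; positive means protein-like."""
    return sum(math.log(coding[p] / noncoding[p])
               for p in codon_pairs(seq1, seq2, frame))

# Hypothetical training data: synonymous changes vs. arbitrary changes.
coding = train_csm([("ATGGCTAAAGGT", "ATGGCCAAGGGA")])
noncoding = train_csm([("ATGGCTAAAGGT", "ATCGATAAAGTT")])

print(csm_score("GCTAAA", "GCCAAG", coding, noncoding) > 0)  # True
print(csm_score("GCTGGT", "GATGTT", coding, noncoding) < 0)  # True
```

With realistic training sets, each matrix would be estimated from thousands of aligned codons, and the same score could be evaluated in all three frames to find the one with protein-like substitutions.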
13Signatures 3, 4, 5, 6, 7, etc
[Figure: a real exon with its acceptor site and donor site, ESEs within the exon, and ISEs in the flanking introns.]
- Mutation patterns of splicing signals
- Real splice acceptor/donor evolve in specific ways
- Evolution of other motifs associated with splicing
- Exonic/Intronic Splicing Enhancers/Silencers (ESE, ISE, ESS, ISS)
- Density of motif clouds surrounding real exons
- Sharp conservation boundaries
- Relative conservation exon vs. surrounding regions
- Length of longest open reading frame
- Frequency of stop codons in each frame / each species
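The last two signatures in the list are simple to compute per species. A minimal sketch, assuming forward-strand analysis of one aligned row at a time (gaps are stripped first); the function names and the example sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def stop_counts_by_frame(seq: str) -> list[int]:
    """Number of stop codons in each of the three forward frames."""
    seq = seq.replace("-", "").upper()
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(c in STOPS for c in codons))
    return counts

def longest_open_stretch(seq: str) -> int:
    """Length in nt of the longest stop-free run of codons in any forward frame."""
    seq = seq.replace("-", "").upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                run = 0
            else:
                run += 3
                best = max(best, run)
    return best

example = "ATGAAATGACCC"  # hypothetical sequence
print(stop_counts_by_frame(example))  # [1, 1, 0]
print(longest_open_stretch(example))  # 9
```

Applying `stop_counts_by_frame` to every row of an alignment gives the "each frame / each species" table: a real exon should show one frame that stays stop-free in all species.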
14Putting it all together: probabilistic framework
- Hidden Markov Models (HMMs)
- Generative model, learn emission, transition probabilities
- Easy to train, hard to integrate long-range signals
- Conditional Random Fields (CRFs)
- Discriminative dual of HMMs, learn weights on features
- Easy to integrate diverse signals, gradient ascent for training
15From HMMs to CRFs
[Figure: HMM as a chain. Hidden sequence (e.g. fair, loaded): y_{i-1}, y_i, y_{i+1}, linked by state transitions. Each hidden state emits one symbol of the observed sequence (e.g. heads, tails): x_{i-1}, x_i, x_{i+1}.]
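The fair/loaded coin example named on this slide can be decoded with the Viterbi algorithm. The sketch below is illustrative only: the start, transition, and emission probabilities are assumed values chosen for the example, not numbers from the talk.

```python
import math

STATES = ("fair", "loaded")
# Assumed parameters, for illustration only.
START = {"fair": 0.5, "loaded": 0.5}
TRANS = {"fair": {"fair": 0.9, "loaded": 0.1},
         "loaded": {"fair": 0.1, "loaded": 0.9}}
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "loaded": {"H": 0.9, "T": 0.1}}

def viterbi(observations: str) -> list[str]:
    """Most probable hidden state path for a string of 'H'/'T' symbols."""
    if not observations:
        return []
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = (prev[best_prev] + math.log(TRANS[best_prev][s])
                        + math.log(EMIT[s][x]))
            pointers[s] = best_prev
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("HHHHHHHH"))  # all 'loaded' under the assumed parameters
print(viterbi("TTTTTTTT"))  # all 'fair' under the assumed parameters
```

For gene finding, the hidden states would be labels such as exon, intron, and intergenic, and the observations would be alignment columns rather than coin flips.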
16From HMMs to CRFs
[Figure: CRF as a chain. Hidden sequence: y_{i-1}, y_i, y_{i+1}. Feature functions F(i-1), F(i), F(i+1) connect the hidden states to the whole observed sequence X.]
17From HMMs to CRFs
Generative model (HMM): transition and emission probabilities
Discriminative model (CRF): for example, features can simply be e_i and a_ij, or pretty much anything
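The contrast between the two models can be written out in the standard textbook forms. The notation below is an assumption of this note rather than the slide's own: a denotes transition probabilities, e emission probabilities, f_j feature functions, and w_j their weights.

```latex
% HMM (generative): joint probability of hidden path y and observations x
P(x, y) = \prod_{i} a_{y_{i-1} y_i} \, e_{y_i}(x_i)

% Linear-chain CRF (discriminative): conditional probability of y given x
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i} \sum_{j} w_j \, f_j(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i} \sum_{j} w_j \, f_j(y'_{i-1}, y'_i, x, i) \Big)
```

Choosing indicator features for each transition and emission, with weights log a and log e, gives the same conditional distribution over paths as the HMM. Because each f_j may look at the whole of x, a CRF can also use signals such as RFC or CSM scores computed over arbitrary windows.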
18Training a Conditional Random Field model
- Find optimal feature weights (that's the hard part)
- Training by gradient ascent, numerical methods
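The shape of gradient-ascent training can be shown on the simplest log-linear model: a two-class classifier with no chain structure. This is a deliberate simplification, not a CRF trainer. A real CRF computes the expected feature counts with the forward-backward algorithm, but the update has the same form: observed features minus expected features. The feature values below (a bias term, an RFC-like fraction, a CSM-like score) and the labels are hypothetical.

```python
import math

def predict(weights, features) -> float:
    """P(label = 1 | features) under a log-linear model."""
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(weights, data) -> float:
    total = 0.0
    for features, label in data:
        p = predict(weights, features)
        total += math.log(p if label == 1 else 1.0 - p)
    return total

def train(data, n_features: int, step: float = 0.1, iterations: int = 200):
    """Batch gradient ascent on the conditional log-likelihood."""
    weights = [0.0] * n_features
    for _ in range(iterations):
        gradient = [0.0] * n_features
        for features, label in data:
            error = label - predict(weights, features)  # observed minus expected
            for j, f in enumerate(features):
                gradient[j] += error * f
        weights = [w + step * g for w, g in zip(weights, gradient)]
    return weights

# Hypothetical training set: (bias, RFC-like fraction, CSM-like score), label
data = [
    ((1.0, 0.95, 1.2), 1), ((1.0, 0.90, 0.8), 1), ((1.0, 0.99, 1.5), 1),
    ((1.0, 0.40, -0.9), 0), ((1.0, 0.55, -0.4), 0), ((1.0, 0.30, -1.1), 0),
]
weights = train(data, n_features=3)
print(log_likelihood([0.0, 0.0, 0.0], data), "->", log_likelihood(weights, data))
```

The log-likelihood is concave in the weights for this model family, which is why plain gradient ascent or a standard numerical optimizer is sufficient.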
19Discriminative framework shows continued increase in power
- Reading frame conservation (RFC) score: panels for 2, 3, 5, and 12 species
- Codon substitution matrix (CSM) score: panels for 2 and 12 species
[Figure: score distributions by number of species. Numeric labels in extraction order: 70, 80, 30, 20, 90, 95, 10, 5; their pairing with panels is not recoverable.]
20Running on real genomes
- Obtain optimal weights (from training set)
- Experimentally-defined, genetics, curation, cDNA
- Apply CRF systematically to new genome
- Revisit existing genomes
- Annotate new genomes
21- Part 1. Genome interpretation
- Evolutionary signatures of genes
- Revisiting the human and fly genomes
- Unusual gene structures
22Initial results for the whole human genome
Human
Dog
Mouse
Rat
1,065 fully rejected
454 novel (2591 exons)
7,717 refined
9,862 fully confirmed
1,919 not aligned
- Fully rejected genes: weak/no evidence
- New exons: existing and novel experimental evidence
- Need large-scale functional annotation for novel genes
23Revisiting Drosophila annotation
Species: D. melanogaster, D. simulans, D. erecta, D. persimilis (...)
579 fully rejected
1,454 exons (800 genes)
668 exons in 443 genes
10,845 fully confirmed
2,499 not aligned
- Fully rejected genes: weak/no evidence
- New exons: existing and novel experimental evidence
- Large-scale functional annotation for novel genes
24Systematic application leads to
- Exon-level changes
- Ex 1: New genes
- Ex 2: New exons
- Ex 3: Dubious genes
- More subtle changes
- Ex 4: Start/end adjustments
- Ex 5: Wrong reading frame
- Ex 6: Splice site adjustments
- Ex 7: Sequencing errors fixed
- Unusual gene structures
- W1: Stop-codon read-through
- W2: uORFs, dicistronic
- W3: Internal frame-shifts
[Figure: thumbnails of the two signatures. Reading Frame Conservation; Codon Substitution Matrix (genes vs. intergenic; codon observed in species 1 vs. species 2).]
25Example 1: Known genes stand out
Sharp conservation boundaries. Known exons stand out. High sensitivity and specificity.
[Figure legend: conserved, substitution, insertion, frameshift, gap]
26Example 2: Novel multi-exon gene
- 1,454 novel exons outside known genes
- Many cluster in new multi-exon genes
- Others are isolated high-confidence exons
27Example 2b: Novel exons inside known genes
- (sorry, this example is from human, mouse, dog, rat)
- 668 cases in fly
- New candidate alternatively spliced gene forms
- New protein domains
28Novel genes and exons
- 1,454 novel exons outside existing genes
- 60% cluster in 300 multi-exon genes
- 40% isolated exons
- 668 novel exons inside existing genes
- Alternative splicing: Many with cDNA support
- Nested genes: Few known examples
- Human curation
- Collaboration with FlyBase
- Hundreds of changes in release 5.1, more in 5.2
- Systematic experimentation
- Sue Celniker and Berkeley Genome Project
- Thousands of new genes in the pipeline
29Example 3: Dubious single-exon gene
- Only evidence was an open reading frame
- Comparative information much stronger
30The 579 Dubious Genes
- Classification approach: Yes / No answer
- Closely related species: both genes and intergenic aligned
- Show very different patterns of mutation
- Comparative analysis provides negative evidence
- Alignment is unambiguous, orthologous, spans entire gene
- Sequence shows mutations and indels in every species
- Weak or missing experimental evidence
- 100 of these independently rejected by FlyBase
- These are missing from systematic clone collections
- Only 34 (6%) have assigned names (vs. 36% of all fly genes)
32Example 4: Start codon adjustment
- Codon substitution patterns suggest new start in 200 genes
- Score each substitution using Codon Substitution Matrix (CSM)
[Figure: alignment around the gene start. Legend: poor CSM score = atypical substitution; high CSM score = protein-like substitution. Two ATGs are marked: the annotated start codon and the conserved start codon.]
33Example 5: Gene annotated on wrong reading frame
- cDNA evidence supports overlapping reading frames, both open
- Annotation traditionally selects longer one
- Conservation enables distinguishing the two
Shorter ORF is the correct one
mRNA supports both ORFs
Annotated ORF (345 nt)
Real ORF (315 nt)
Conservation only supports shorter ORF
CG7738-RA is incorrect
34Example 6: Incorrect splice causes wrong frame
- Second exon annotated in the wrong frame
- Due to splice site boundary error
- Correction is supported by cDNA evidence
First exon: correct frame
2nd exon: incorrect frame
Fix exon boundary
35Example 7: Detect seq. errors / strain mutations
- Insertion/deletion causes frameshift
- Conservation signature shifts from frame 1 to frame 2
- All other species disagree with D. melanogaster indel
- Sequencing error or species-specific mutation
chr3R:6,953,865-6,953,927 (Ugt86Dd)
dm      CAGTACATATTTGTGGAGAGTTACTTGAAAG-CTTGGCAGCTAAGGGTCATCAGGTGACCGTTA
droSec  CAGTACATATTTTTGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droSim  CAGTACATATTTATGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droYak  CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCCAAGGGTCACCAGGTGACCGTTA
droEre  CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCTAGGGGTCACCAGGTGACTGTTA
droAna  CAGTACATCTTTGTGGAGACCTATCTGAAGGCTTTGGCCGACAAAGGTCACCAGGTGACTGTTA
droWil  CAATACATATTCATTGAGGCGTATCTAAAGGCATTGGCTGCCAAAGGACATCAGTTAACTGTGA
droMoj  CAGTACATATTCGCCGAGGCGTATTTGAAGGCGCTAGCAGCCCGGGGCCATGAGGTCACCGTGA
droVir  CAGTATATATTTGCCGAGTCGTATTTGAAGGCCTTGGCAGCGCGGGGTCATGAGGTGACAGTGA
frame   01201201201201201201201201201201 2012012012012012012012012012012
Conservation in correct frame
Conservation in 2nd frame
Frame-shift (sequencing error / recent mutation)
36Example 8: Dubious gene is a miRNA transcript
- Evolutionary signatures reveal specific function
38Revisiting fly genome annotation
- Power of evolutionary signatures
- New genes and exons, dubious genes and exons
- Adjust gene boundaries: ATG, frame, splice site, seq errors
- Signatures more powerful than primary signals
- Recognize unusual gene structures → read-through, uORFs, editing
- Towards a revised genome annotation
- → Curation: FlyBase integrates prediction with cDNA, protein, literature
- → Experimentation: BDGP large-scale functional validation of novel exons
39Overview
- Part 1. Genome interpretation
- Evolutionary signatures of genes
- Revisiting the human and fly genomes
- Unusual gene structures
- Part 2. Gene regulation
- Regulatory motif discovery
- microRNA regulation
- Enhancer identification
- Part 3. Genome evolution
- Phylogenomics
- The two forces of gene evolution
- Accurate gene trees in complete genomes
40Who's actually doing the work
Ameya Deoras: Spectral genomics
Mike Lin: Gene identification
Alex Stark: Fly motifs and miRNAs
Josh Grochow: Network motif discovery
Pouya Kheradpour: Human enhancers
Erez Lieberman: Motif evolution
Matt Rasmussen: Phylogenomics
Aviva Presser: Network evolution
41compbio.mit.edu
42CSAIL and Biology