Interpreting the human genome

Manolis Kellis, CSAIL (MIT Computer Science and Artificial Intelligence Lab), Broad Institute of MIT and Harvard for Genomics in Medicine

Slides: 43
Provided by: manolis
Learn more at: https://compbio.mit.edu

Transcript and Presenter's Notes

1
Interpreting the human genome
  • Manolis Kellis

CSAIL: MIT Computer Science and Artificial Intelligence Lab
Broad Institute of MIT and Harvard for Genomics in Medicine
2
The age of comparative genomics
3
Resolving power in mammals, flies, fungi
[Figure: phylogenetic trees of the sequenced species. 17 yeasts: 9 yeasts (pre- and post-duplication) and 8 Candida (haploid and diploid); 12 flies; 10 mammals]
10 mammals
  • Neutral: 2.57 subs/site (opp 0.62, 32 sps 4.87)
  • Coding: 1.16 subs/site
  • Detect 6-mer at FP 10^-6
12 flies
  • Neutral: 4.13 subs/site
  • Coding: 1.65 subs/site
  • Detect 6-mer at 10^-11
17 yeasts
  • Neutral: 15.5 subs/site (Yeast 6.5, Candida 6.5)
  • Coding: 7.91 subs/site
  • Detect 3-mer at 10^-21

4
Comparative Genomics 101: Conservation ≈ Function
  • Conserved elements are typically functional (and vice versa)
  • For example, exons are deeply conserved to mouse, chicken, fish
  • Some conserved elements are still uncharacterized
  • How do we make sense of them?
  • How do we distinguish each type of functional element?
  • Answer: evolutionary signatures (Comp. Genomics 201)
  • Tell me how you evolve, I'll tell you who you are
  • Patterns of change → selective pressures → specific function

5
Overview of this talk
  • 1. Genome interpretation
  • Evolutionary signatures of genes
  • Revisiting the human and fly genomes
  • Unusual gene structures
  • 2. Gene regulation
  • Regulatory motif discovery
  • microRNA regulation
  • Enhancer identification
  • 3. Genome evolution
  • Phylogenomics
  • The two forces of gene evolution
  • Accurate gene trees in complete genomes

6
The goal: All the genes and nothing but the genes
Yeast genes (Kellis et al., Nature, 2003)
Fly genes (Mike Lin, 2006)
Human genes (Mike Lin, 2006)
7
Gene identification
Study known genes → Derive conservation rules → Discover new genes
  • Evolutionary signatures
  • Tell me how you evolve, I'll tell you who you are
  • Each type of functional element evolves in its own specific way

8
Distinguishing genes from non-coding regions
Splice
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT

  • Protein-coding genes have specific evolutionary
    constraints
  • Gaps are multiples of three (preserve amino acid
    translation)
  • Mutations are largely 3-periodic (silent codon
    substitutions)
  • Specific triplets exchanged more frequently
    (conservative substs.)
  • Conservation boundaries are sharp (pinpoint
    individual splicing signals)
  • Encode as evolutionary signatures
  • Computational test for each of them
  • Combine and score systematically
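
The first two constraints above lend themselves to very small computational tests. The following is a minimal sketch, not the method used in the talk: the function names and the toy sequences are hypothetical, and the input is assumed to be rows of an existing alignment with '-' for gaps.

```python
import re

def gap_run_lengths(aligned_seq):
    """Lengths of each run of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]

def fraction_gaps_multiple_of_three(rows):
    """Fraction of gap runs, over all rows, whose length is divisible by 3."""
    lengths = [n for row in rows for n in gap_run_lengths(row)]
    if not lengths:
        return None  # no gaps at all, so the test is uninformative
    return sum(n % 3 == 0 for n in lengths) / len(lengths)

def mismatch_counts_by_codon_position(ref, other):
    """Count mismatches at codon positions 0, 1, 2 of the reference.

    Columns where either sequence has a gap are skipped. The reference is
    assumed to start in frame (its first base is codon position 0).
    """
    counts = [0, 0, 0]
    ref_pos = 0
    for r, o in zip(ref, other):
        if r == "-":
            continue
        if o != "-" and r != o:
            counts[ref_pos % 3] += 1
        ref_pos += 1
    return counts

# Toy, made-up alignment: one 3-nt gap, substitutions only at third positions.
toy_ref = "ATGGCTAAA---GGTCTGTAA"
toy_other = "ATGGCCAAGCCAGGACTTTAA"
print(fraction_gaps_multiple_of_three([toy_ref, toy_other]))
print(mismatch_counts_by_codon_position(toy_ref, toy_other))
```

In a protein-coding region one would expect the gap fraction to be close to 1 and the mismatch counts to be concentrated at the third codon position; in non-coding sequence neither pattern is expected.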

9
Signature 1: Reading frame conservation (RFC)
[Figure: two columns of RFC scores. First column: 100 in every row (summary value 100). Second column: 60, 55, 90, 40, 60, 100, 20, 30, 40 (summary value 60).]
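
A simplified sketch of a reading frame conservation score follows. The slides do not spell out the exact procedure, so this is an assumption: it scores a whole pairwise alignment at once (no windows), counting the percentage of ungapped columns in which the second sequence is in the same codon phase as the first, maximized over the three possible phase offsets. The function name and toy sequences are hypothetical.

```python
def reading_frame_conservation(ref, other):
    """Percent of ungapped columns where `other` keeps the codon phase of `ref`.

    agree[k] counts columns that are in phase if `other` is shifted by k;
    the score uses the best of the three offsets.
    """
    agree = [0, 0, 0]
    aligned = 0
    ref_pos = other_pos = 0
    for r, o in zip(ref, other):
        if r != "-" and o != "-":
            aligned += 1
            agree[(ref_pos - other_pos) % 3] += 1
        if r != "-":
            ref_pos += 1
        if o != "-":
            other_pos += 1
    if aligned == 0:
        return 0.0
    return 100.0 * max(agree) / aligned

# Toy, made-up examples: a 3-nt deletion keeps the frame, a 1-nt deletion breaks it.
toy_ref = "ATGGCTAAAGGTCTGTAA"
in_frame_deletion = "ATGGCT---GGTCTGTAA"
frameshift_deletion = "ATGGCT-AAGGTCTGTAA"
print(reading_frame_conservation(toy_ref, in_frame_deletion))
print(reading_frame_conservation(toy_ref, frameshift_deletion))
```

The in-frame deletion leaves the score at 100, while the single-base deletion puts every downstream column out of phase, which is the behaviour the signature relies on.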
10
Results in yeast
                          Accept   Reject
4000 named genes          99.9%    0.1%
300 intergenic regions    1%       99%
2000 hypothetical ORFs    1500     500
High sensitivity and specificity
Revisit yeast annotation with SGD
11
Signature 2: Distinct patterns of codon substitution
[Figure: two codon substitution matrices, one for genes and one for intergenic regions. Axes: codon observed in species 1 vs. codon observed in species 2]
  • Codon substitution patterns specific to genes
  • Genetic code dictates substitution patterns
  • Amino acid properties dictate substitution
    patterns

12
Codon Substitution Matrix (CSM)
[Figure: human vs. mouse codon substitution matrix, with codons grouped by amino acid class: aliphatic, negative, polar, positive, aromatic]
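
One way a codon substitution matrix could be turned into a score is a log-odds ratio between substitution frequencies observed in genes and in intergenic regions. The sketch below assumes that formulation (the slides do not state the scoring rule), uses gap-free in-frame sequence pairs, and trains on made-up toy data; all names are hypothetical.

```python
import math
from collections import Counter

def codon_pairs(seq1, seq2):
    """Yield aligned codon pairs from two gap-free, in-frame sequences."""
    n = min(len(seq1), len(seq2)) // 3 * 3
    for i in range(0, n, 3):
        yield seq1[i:i + 3], seq2[i:i + 3]

def count_pairs(alignments):
    """Count codon pairs over a list of (seq1, seq2) training alignments."""
    counts = Counter()
    for s1, s2 in alignments:
        counts.update(codon_pairs(s1, s2))
    return counts

def csm_score(seq1, seq2, coding_counts, noncoding_counts, pseudocount=1.0):
    """Sum of log-odds (coding vs. non-coding) over aligned codon pairs."""
    n_pairs = 64 * 64  # all possible codon pairs, for pseudocount smoothing
    coding_total = sum(coding_counts.values()) + pseudocount * n_pairs
    noncoding_total = sum(noncoding_counts.values()) + pseudocount * n_pairs
    score = 0.0
    for pair in codon_pairs(seq1, seq2):
        p_coding = (coding_counts[pair] + pseudocount) / coding_total
        p_noncoding = (noncoding_counts[pair] + pseudocount) / noncoding_total
        score += math.log(p_coding / p_noncoding)
    return score

# Toy training data (made up): silent changes in "genes", arbitrary ones elsewhere.
coding_counts = count_pairs([("GCTAAAGGT", "GCCAAGGGA")])
noncoding_counts = count_pairs([("GCTAAAGGT", "GATACAGCT")])
print(csm_score("GCT", "GCC", coding_counts, noncoding_counts))  # positive: gene-like
print(csm_score("GCT", "GAT", coding_counts, noncoding_counts))  # negative: intergenic-like
```

A real matrix would be estimated from many known genes and intergenic regions rather than one alignment each.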
13
Signatures 3, 4, 5, 6, 7, etc.
[Figure: a real exon with its acceptor site and donor site, ESEs within the exon and ISEs in the flanking introns]
  • Mutation patterns of splicing signals
  • Real splice acceptor/donor evolve in specific
    ways
  • Evolution of other motifs associated with
    splicing
  • Exonic/Intronic Splicing Enhancers/Silencers
    (ESE, ISE)
  • Density of motif clouds surrounding real exons
  • Sharp conservation boundaries
  • Relative conservation exon vs. surrounding
    regions
  • Length of longest open reading frame
  • Frequency of stop codons in each frame / each
    species
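
The last two signals in this list are simple to compute per species and per frame. A minimal sketch, with hypothetical function names and a made-up sequence; note that it measures the longest stop-free stretch in a frame and does not require a start codon, which is a simplification of "open reading frame".

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons_in_frame(seq, frame):
    """Complete codons of `seq` read in frame 0, 1 or 2."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def stop_codon_frequency(seq, frame):
    """Fraction of codons in the given frame that are stop codons."""
    codons = codons_in_frame(seq, frame)
    if not codons:
        return 0.0
    return sum(c in STOP_CODONS for c in codons) / len(codons)

def longest_open_stretch(seq, frame):
    """Length in nucleotides of the longest run of codons without a stop."""
    best = current = 0
    for codon in codons_in_frame(seq, frame):
        if codon in STOP_CODONS:
            current = 0
        else:
            current += 3
            best = max(best, current)
    return best

toy_seq = "ATGAAATAGGCTGCTGCTGCTTGA"  # made-up example
for frame in range(3):
    print(frame, stop_codon_frequency(toy_seq, frame), longest_open_stretch(toy_seq, frame))
```

Running the same functions on each species' ungapped sequence gives one value per frame per species, which can then be fed into whatever classifier combines the signatures.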

14
Putting it all together: probabilistic framework
  • Hidden Markov Models (HMMs)
  • Generative model, learn emission, transition
    probabilities
  • Easy to train, hard to integrate long-range
    signals
  • Conditional Random Fields (CRFs)
  • Discriminative dual of HMMs, learn weights on
    features
  • Easy to integrate diverse signals, gradient
    ascent for training

15
From HMMs to CRFs
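[Figure: HMM. Hidden sequence (e.g. fair, loaded): y_{i-1}, y_i, y_{i+1}, linked by state transitions. Observed sequence (e.g. heads, tails): x_{i-1}, x_i, x_{i+1}, each produced by an emission from the hidden state above it.]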
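
The fair/loaded coin HMM in this figure can be decoded with the Viterbi algorithm. The sketch below is illustrative only: the start, transition and emission probabilities are made-up values, not taken from the talk.

```python
import math

STATES = ("fair", "loaded")
# All probabilities below are assumed values for illustration.
START = {"fair": 0.5, "loaded": 0.5}
TRANS = {"fair": {"fair": 0.9, "loaded": 0.1},
         "loaded": {"fair": 0.1, "loaded": 0.9}}
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "loaded": {"H": 0.9, "T": 0.1}}

def viterbi(observations):
    """Most probable hidden state path for a non-empty string of 'H'/'T'."""
    # scores[s]: best log-probability of any path ending in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
              for s in STATES}
    backpointers = []
    for x in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][x]))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("HHHHHHHH"))
print(viterbi("HTHTHTHT"))
```

In the gene-finding setting the hidden states would be labels such as exon and intron, and the observations would be alignment columns rather than coin flips.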
16
From HMMs to CRFs
[Figure: CRF. Hidden sequence: y_{i-1}, y_i, y_{i+1}. Feature functions F(i-1), F(i), F(i+1) connect the hidden states to the entire observed sequence X.]
17
From HMMs to CRFs
Generative model (HMM): transition and emission probabilities
Discriminative model (CRF): weights on features
For example, features can simply be e_i and a_ij
Or pretty much anything
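
The equations on this slide did not survive in the transcript. As a reminder of the standard textbook forms (not a reconstruction of the slide itself), with a for transition probabilities, e for emission probabilities, f_j for feature functions and w_j for their weights:

```latex
% HMM (generative): joint probability of labels y and observations x
P(x, y) = \prod_{i} a_{y_{i-1} y_i} \, e_{y_i}(x_i)

% Linear-chain CRF (discriminative): conditional probability of y given x
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i} \sum_{j} w_j \, f_j(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i} \sum_{j} w_j \, f_j(y'_{i-1}, y'_i, x, i) \Big)
```

Choosing indicator features for each transition and emission, with weights log a and log e, recovers the HMM scoring as a special case; any other function of the observed sequence can be added as a further feature.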
18
Training a Conditional Random Field model
  • Find optimal feature weights (that's the hard
    part)
  • Training by gradient ascent, numerical methods
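
As an illustration of training by gradient ascent, here is a toy linear-chain CRF. It is a sketch under strong assumptions: the labels, the "hi"/"lo" observations (standing in for a discretized conservation signal) and the training data are made up, and the partition function is computed by enumerating every labeling, which only works for very short sequences. Real implementations use forward-backward instead.

```python
import itertools
import math

LABELS = ("coding", "noncoding")

def sequence_features(xs, ys):
    """Feature counts for a labeling: emission and transition indicators."""
    total = {}
    prev = None
    for x_i, y_i in zip(xs, ys):
        names = [("emit", y_i, x_i)]
        if prev is not None:
            names.append(("trans", prev, y_i))
        for name in names:
            total[name] = total.get(name, 0.0) + 1.0
        prev = y_i
    return total

def score(weights, feats):
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def log_likelihood_and_gradient(weights, data):
    """Conditional log-likelihood; gradient = observed - expected feature counts."""
    loglik, grad = 0.0, {}
    for xs, ys in data:
        labelings = list(itertools.product(LABELS, repeat=len(xs)))
        feats = [sequence_features(xs, cand) for cand in labelings]
        scores = [score(weights, f) for f in feats]
        log_z = math.log(sum(math.exp(s) for s in scores))
        observed = sequence_features(xs, ys)
        loglik += score(weights, observed) - log_z
        for name, value in observed.items():
            grad[name] = grad.get(name, 0.0) + value
        for f, s in zip(feats, scores):
            prob = math.exp(s - log_z)
            for name, value in f.items():
                grad[name] = grad.get(name, 0.0) - prob * value
    return loglik, grad

def train(data, steps=300, learning_rate=0.1):
    """Plain gradient ascent on the conditional log-likelihood."""
    weights = {}
    for _ in range(steps):
        _, grad = log_likelihood_and_gradient(weights, data)
        for name, g in grad.items():
            weights[name] = weights.get(name, 0.0) + learning_rate * g
    return weights

def predict(weights, xs):
    """Highest-scoring labeling, again by brute-force enumeration."""
    return max(itertools.product(LABELS, repeat=len(xs)),
               key=lambda ys: score(weights, sequence_features(xs, ys)))

DATA = [
    (("hi", "hi", "lo", "lo"), ("coding", "coding", "noncoding", "noncoding")),
    (("lo", "hi", "hi", "hi"), ("noncoding", "coding", "coding", "coding")),
]
weights = train(DATA)
print(predict(weights, ("hi", "hi", "lo", "lo")))
```

The gradient has the usual interpretation: each weight moves up when its feature is seen more often in the training labels than the current model expects, and down otherwise.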

19
Discriminative framework shows continued increase in power
  • Reading frame conservation (RFC) score
[Figure: RFC score panels for 2, 3, 5 and 12 species]
  • Codon substitution matrix (CSM) score
[Figure: CSM score panels for 2 species and 12 species; numeric labels surviving from the panels: 70, 80, 30, 20, 90, 95, 10, 5]
20
Running on real genomes
  • Obtain optimal weights (from training set)
  • Experimentally-defined, genetics, curation, cDNA
  • Apply CRF systematically to new genome
  • Revisit existing genomes
  • Annotate new genomes

21
  • Part 1. Genome interpretation
  • Evolutionary signatures of genes
  • Revisiting the human and fly genomes
  • Unusual gene structures

22
Initial results for the whole human genome
Species compared: Human, Dog, Mouse, Rat
1,065 fully rejected
454 novel (2,591 exons)
7,717 refined
9,862 fully confirmed
1,919 not aligned
  • Fully rejected genes: weak/no evidence
  • New exons: existing/novel experimental evidence
  • Need large-scale functional annotation for novel
    genes

23
Revisiting Drosophila annotation
Species shown: D. melanogaster, D. simulans, D. erecta, D. persimilis
579 fully rejected
1,454 exons (800 genes)
668 exons in 443 genes
10,845 fully confirmed
2,499 not aligned
  • Fully rejected genes: weak/no evidence
  • New exons: existing/novel experimental evidence
  • Large-scale functional annotation for novel genes

24
Systematic application leads to
[Figure: the two signatures, Reading Frame Conservation and Codon Substitution Matrix (genes vs. intergenic; codon observed in species 1 vs. codon observed in species 2)]
  • Exon-level changes
  • Ex 1: New genes
  • Ex 2: New exons
  • Ex 3: Dubious genes
  • More subtle changes
  • Ex 4: Start/end adjustments
  • Ex 5: Wrong reading frame
  • Ex 6: Splice site adjustments
  • Ex 7: Sequencing errors fixed
  • Unusual gene structures
  • W1: Stop-codon read-through
  • W2: uORFs, dicistronic
  • W3: Internal frame-shifts

25
Example 1: Known genes stand out
Sharp conservation boundaries. Known exons stand out. High sensitivity and specificity.
[Figure legend: conserved, substitution, insertion, frameshift, gap]
26
Example 2: Novel multi-exon gene
  • 1,454 novel exons outside known genes
  • Many cluster in new multi-exon genes
  • Others are isolated high-confidence exons

27
Example 2b: Novel exons inside known genes
  • (sorry, this example is from human, mouse, dog,
    rat)
  • 668 cases in fly
  • New candidate alternatively spliced gene forms
  • New protein domains

28
Novel genes and exons
  • 1,454 novel exons outside existing genes
  • 60% cluster in 300 multi-exon genes
  • 40% isolated exons
  • 668 novel exons inside existing genes
  • Alternative splicing: many with cDNA support
  • Nested genes: few known examples
  • Human curation
  • Collaboration with FlyBase
  • Hundreds of changes in release 5.1, more in 5.2
  • Systematic experimentation
  • Sue Celniker and Berkeley Genome Project
  • Thousands of new genes in the pipeline

29
Example 3: Dubious single-exon gene
  • Only evidence was an open reading frame
  • Comparative information much stronger

30
579 Dubious Genes
  • Classification approach: Yes / No answer
  • Closely related species: both genes and
    intergenic aligned
  • Show very different patterns of mutation
  • Comparative analysis provides negative evidence
  • Alignment is unambiguous, orthologous, spans
    entire gene
  • Sequence shows mutations and indels in every
    species
  • Weak or missing experimental evidence
  • 100 of these independently rejected by FlyBase
  • These are missing from systematic clone
    collections
  • Only 34 (6%) have assigned names (vs. 36% of all
    fly genes)

31
Systematic application leads to
[Figure: the two signatures, Reading Frame Conservation and Codon Substitution Matrix (genes vs. intergenic; codon observed in species 1 vs. codon observed in species 2)]
  • Exon-level changes
  • Ex 1: New genes
  • Ex 2: New exons
  • Ex 3: Dubious genes
  • More subtle changes
  • Ex 4: Start/end adjustments
  • Ex 5: Wrong reading frame
  • Ex 6: Splice site adjustments
  • Ex 7: Sequencing errors fixed
  • Unusual gene structures
  • W1: Stop-codon read-through
  • W2: uORFs, dicistronic
  • W3: Internal frame-shifts

32
Example 4: Start codon adjustment
  • Codon substitution patterns suggest new start in
    200 genes
  • Score each substitution using Codon Substitution
    Matrix (CSM)

[Figure legend: poor CSM score = atypical substitution; high CSM score = protein-like substitution. Two ATGs are marked: the annotated start codon and the conserved start codon.]
33
Example 5: Gene annotated on wrong reading frame
  • cDNA evidence supports overlapping reading
    frames, both open
  • Annotation traditionally selects longer one
  • Conservation enables distinguishing the two

Shorter ORF is the correct one
mRNA supports both ORFs
Annotated ORF (345nt)
Real ORF (315nt)
Conservation only supports shorter ORF
CG7738-RA is incorrect
34
Example 6: Incorrect splice causes wrong frame
  • Second exon annotated in the wrong frame
  • Due to splice site boundary error
  • Correction is supported by cDNA evidence

First exon correct frame
2nd exon incorrect frame
Fix exon boundary
35
Example 7: Detect sequencing errors / strain mutations
  • Insertion/deletion causes frameshift
  • Conservation signature shifts from frame1 to
    frame2
  • All other species disagree with D. melanogaster
    indel
  • Sequencing error or species-specific mutation

chr3R:6,953,865-6,953,927 (Ugt86Dd)
dm     CAGTACATATTTGTGGAGAGTTACTTGAAAG-CTTGGCAGCTAAGGGTCATCAGGTGACCGTTA
droSec CAGTACATATTTTTGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droSim CAGTACATATTTATGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droYak CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCCAAGGGTCACCAGGTGACCGTTA
droEre CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCTAGGGGTCACCAGGTGACTGTTA
droAna CAGTACATCTTTGTGGAGACCTATCTGAAGGCTTTGGCCGACAAAGGTCACCAGGTGACTGTTA
droWil CAATACATATTCATTGAGGCGTATCTAAAGGCATTGGCTGCCAAAGGACATCAGTTAACTGTGA
droMoj CAGTACATATTCGCCGAGGCGTATTTGAAGGCGCTAGCAGCCCGGGGCCATGAGGTCACCGTGA
droVir CAGTATATATTTGCCGAGTCGTATTTGAAGGCCTTGGCAGCGCGGGGTCATGAGGTGACAGTGA
       01201201201201201201201201201201 2012012012012012012012012012012



Conservation in correct frame
Conservation in 2nd frame
Frame-shift (sequencing error / recent mutation)
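
The frame-shift in this example can be located mechanically by tracking the codon phase of one sequence relative to another along the alignment. A minimal sketch, using the dm and droSec rows from this slide; the function names are hypothetical and this is not necessarily how the original analysis was done.

```python
def frame_offsets(ref, query):
    """(column, offset) for every column where both sequences have a base.

    offset is the codon-phase difference of `query` relative to `ref`;
    0 means the two sequences are in the same frame at that column.
    """
    offsets = []
    ref_pos = query_pos = 0
    for column, (r, q) in enumerate(zip(ref, query)):
        if r != "-" and q != "-":
            offsets.append((column, (ref_pos - query_pos) % 3))
        ref_pos += r != "-"
        query_pos += q != "-"
    return offsets

def frame_shifts(ref, query):
    """Columns (0-based) at which the phase offset changes, with the new offset."""
    shifts = []
    previous = None
    for column, offset in frame_offsets(ref, query):
        if previous is not None and offset != previous:
            shifts.append((column, offset))
        previous = offset
    return shifts

# Rows copied from the alignment on this slide.
DROSEC = "CAGTACATATTTTTGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA"
DMEL = "CAGTACATATTTGTGGAGAGTTACTTGAAAG-CTTGGCAGCTAAGGGTCATCAGGTGACCGTTA"
print(frame_shifts(DROSEC, DMEL))  # prints [(32, 1)]
```

Repeating the comparison against each of the other species shows whether the shift is private to D. melanogaster, which is the pattern the slide attributes to a sequencing error or a recent mutation.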
36
Example 8: Dubious gene is a miRNA transcript
  • Evolutionary signatures reveal specific function

37
Systematic application leads to
[Figure: the two signatures, Reading Frame Conservation and Codon Substitution Matrix (genes vs. intergenic; codon observed in species 1 vs. codon observed in species 2)]
  • Exon-level changes
  • Ex 1: New genes
  • Ex 2: New exons
  • Ex 3: Dubious genes
  • More subtle changes
  • Ex 4: Start/end adjustments
  • Ex 5: Wrong reading frame
  • Ex 6: Splice site adjustments
  • Ex 7: Sequencing errors fixed
  • Unusual gene structures
  • W1: Stop-codon read-through
  • W2: uORFs, dicistronic
  • W3: Internal frame-shifts

38
Revisiting fly genome annotation
Species shown: D. melanogaster, D. simulans, D. erecta, D. persimilis
579 fully rejected
1,454 exons (800 genes)
668 exons in 443 genes
10,845 fully confirmed
2,499 not aligned
  • Power of evolutionary signatures
  • New genes and exons, dubious genes and exons
  • Adjust gene boundaries: ATG, frame, splice site,
    seq errors
  • Signatures more powerful than primary signals
  • Recognize unusual gene structures: read-through,
    uORFs, editing
  • Towards a revised genome annotation
  • Curation: FlyBase integrates prediction with
    cDNA, protein, literature
  • Experimentation: BDGP large-scale functional
    validation of novel exons

39
Overview
  • Part 1. Genome interpretation
  • Evolutionary signatures of genes
  • Revisiting the human and fly genomes
  • Unusual gene structures
  • Part 2. Gene regulation
  • Regulatory motif discovery
  • microRNA regulation
  • Enhancer identification
  • Part 3. Genome evolution
  • Phylogenomics
  • The two forces of gene evolution
  • Accurate gene trees in complete genomes

40
Who's actually doing the work
Ameya Deoras: Spectral genomics
Mike Lin: Gene identification
Alex Stark: Fly motifs and miRNAs
Josh Grochow: Network motif discovery
Pouya Kheradpour: Human enhancers
Erez Lieberman: Motif evolution
Matt Rasmussen: Phylogenomics
Aviva Presser: Network evolution
41
compbio.mit.edu
42
CSAIL and Biology