Title: Comparative genomics of 12 Drosophila genomes
1Comparative genomics of 12 Drosophila genomes
- Bill Gelbart
- (in the role of Manolis Kellis)
Broad Institute of MIT and Harvard
MIT Computer Science and Artificial Intelligence
Laboratory
2Fly comparative genomics
Complexity of bilateral animal
- Developmental principles: signaling pathways, kernels
- Enhancers, splicing, microRNAs
Genetics of a small eukaryote
- 100 years of classical genetics
- Systematic experiments, RNAi, in-situ visualization
Comparative genomics power
- Richer than any other eukaryote
- 2.1 substitutions per site
- Multiple sets of close relatives
3Fly comparative genomics
Matt Rasmussen Poster 267
- Extensive conservation of gene order
- Genome-wide alignments spanning entire genus
4Fly comparative genomics
Gene identification
Regulatory motif discovery
microRNA regulation
5- Part 1. Gene Identification
- Evolutionary signatures of genes
- Revisiting the fly genome
- Unusual gene structures
6Distinguishing genes from non-coding regions
Splice
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
- Protein-coding genes have specific evolutionary constraints
- Gaps are multiples of three (preserve amino acid translation)
- Mutations are largely 3-periodic (silent codon substitutions)
- Specific triplets exchanged more frequently (conservative substs.)
- Conservation boundaries are sharp (pinpoint individual splicing signals)
- → Encode as evolutionary signatures
- Computational test for each of them
- Combine and score systematically
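The "gaps are multiples of three" constraint can be turned into a simple computational test. The sketch below is only an illustration under stated assumptions: the function names and the toy aligned sequences are hypothetical and are not taken from the talk or from any tool it describes.

```python
import re

def gap_lengths(aligned_seq):
    """Return the length of every run of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]

def fraction_frame_preserving(aligned_seq):
    """Fraction of gaps whose length is a multiple of three.

    Values near 1.0 are what a protein-coding region is expected to show;
    a sequence with no gaps is treated as fully frame-preserving.
    """
    gaps = gap_lengths(aligned_seq)
    if not gaps:
        return 1.0
    return sum(1 for g in gaps if g % 3 == 0) / len(gaps)

# Toy examples (hypothetical data, not from the slides)
coding_like = "ATGGCT---GAAACC------TGA"
noncoding_like = "ATGGCT-GAAACC----TGA"
print(fraction_frame_preserving(coding_like))     # gaps of length 3 and 6
print(fraction_frame_preserving(noncoding_like))  # gaps of length 1 and 4
```

A real reading frame conservation score would also track which frame each aligned position falls in; this sketch only checks gap lengths.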
7Gene identification
Study known genes
Derive conservation rules
Discover new genes
8Signature 1 Reading frame conservation
9Signature 2 Codon substitution patterns
Genes
Codon observed in species 1
Codon observed in species 2
- Codon substitution patterns specific to genes
- Genetic code dictates substitution patterns
- Amino acid properties dictate substitution
patterns
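To make "genetic code dictates substitution patterns" concrete, here is a minimal sketch that classifies a codon pair observed in two species using the standard genetic code. The function names and category labels are assumptions chosen for illustration; the talk's Codon Substitution Matrix itself is a scored table learned from known genes and is not reproduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify(codon_a, codon_b):
    """Label a codon pair from two aligned species (hypothetical labels)."""
    if codon_a == codon_b:
        return "identical"
    aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
    if aa_a == aa_b:
        return "synonymous"      # silent change, typical inside genes
    if "*" in (aa_a, aa_b):
        return "nonsense"        # gains or loses a stop codon
    return "nonsynonymous"

def substitution_counts(seq_a, seq_b):
    """Count each category over two gap-free, in-frame aligned sequences."""
    counts = {}
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        label = classify(seq_a[i:i + 3], seq_b[i:i + 3])
        counts[label] = counts.get(label, 0) + 1
    return counts

print(substitution_counts("ATGGAAGCTTGG", "ATGGAGGCTTGA"))
```

In coding regions one would expect the synonymous category to dominate the observed differences, whereas non-coding sequence read in an arbitrary frame shows no such bias.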
10Codon Substitution Matrix (CSM)
11Signature 3 Spectral analysis of DNA sequence
Genes
Intergenic
- Reveal frequency spectra due to non-uniform codon usage
- Extend to multiple genomes
- Initial results (single genome, single spectrum)
- Fly: Sensitivity 85%, Specificity 87%, Accuracy 93%
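One common way to expose the 3-periodicity that non-uniform codon usage produces is to measure the Fourier power of base-indicator sequences at frequency 1/3. The sketch below assumes that approach; the slide does not say exactly which spectrum was used, and the function names and test sequences are hypothetical.

```python
import cmath

def period3_power(seq, base):
    """Power at frequency 1/3 of the 0/1 indicator sequence for one base."""
    n = len(seq)
    total = sum(cmath.exp(-2j * cmath.pi * k / 3)
                for k, b in enumerate(seq) if b == base)
    return abs(total) ** 2 / n

def period3_signal(seq):
    """Sum of the period-3 power over the four bases."""
    return sum(period3_power(seq, b) for b in "ACGT")

# Toy sequences: a perfectly 3-periodic repeat vs. a 4-periodic repeat
print(period3_signal("GCA" * 40))
print(period3_signal("ACGT" * 30))
```

On real sequence the statistic would be computed in sliding windows and compared between annotated genes and intergenic regions.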
12Signatures 4, 5, 6, 7, etc
real exon
ISEs
ISEs
donor site
acceptor site
ESEs
- Mutation patterns of splicing signals
- Real splice acceptor/donor evolve in specific ways
- Evolution of other motifs associated with splicing
- Exonic/Intronic Splicing Enhancers/Silencers (ESE, ISE)
- Density of motif clouds surrounding real exons
- Sharp conservation boundaries
- Relative conservation of exon vs. surrounding regions
- Length of longest open reading frame
- Frequency of stop codons in each frame / each species
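The last two signals in this list (stop codon frequency per frame and longest open reading frame) are simple to compute for a single sequence. This is a minimal sketch with hypothetical function names; it treats an "open reading frame" as any stop-free run of codons and does not require a start codon, which is an assumption made here for brevity.

```python
STOPS = {"TAA", "TAG", "TGA"}

def stop_counts_by_frame(seq):
    """Number of stop codons in each of the three forward frames."""
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(c in STOPS for c in codons))
    return counts

def longest_open_run(seq):
    """Length in nucleotides of the longest stop-free run in any frame."""
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                run = 0
            else:
                run += 3
                best = max(best, run)
    return best

seq = "ATGGCTGAAACCTAAGG"  # toy sequence, not from the slides
print(stop_counts_by_frame(seq), longest_open_run(seq))
```

Applied per species, the same counts give the "each frame / each species" table the slide refers to.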
13Putting it all together CONGO gene finder
- CONGO gene finder based on Conditional Random Fields (CRFs)
- Hidden Markov Models (HMMs)
- Generative model, learn emission, transition probabilities
- Easy to train, hard to integrate long-range signals
- Conditional Random Fields (CRFs)
- Discriminative dual of HMMs, learn weights on features
- Easy to integrate diverse signals, gradient ascent for training
- Features of the model
- Train each evolutionary signature as a feature
- Train single-species signals
- Apply it systematically to revisit the Drosophila genome
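To illustrate what "learn weights on features" means in a linear-chain model, the sketch below scores each position as a weighted sum of feature values and decodes the best label path with the Viterbi algorithm. Everything here is hypothetical: the feature names (`rfc`, `csm`), the weights, the two-label state space, and the transition scores are made up for illustration and are not CONGO's actual model. Training (the gradient ascent step) is not shown.

```python
# Hypothetical learned weights on two evolutionary-signature features
WEIGHTS = {"rfc": 2.0, "csm": 1.5}

def position_score(features):
    """Per-label score at one position: weighted sum of feature values."""
    s = sum(WEIGHTS[name] * value for name, value in features.items())
    return {"coding": s, "noncoding": -s}

def viterbi(position_scores, transition):
    """Highest-scoring label path for a linear chain.

    position_scores: list of {label: score}; transition: {(prev, cur): score}.
    """
    labels = list(position_scores[0])
    best = {label: position_scores[0][label] for label in labels}
    back = []
    for scores in position_scores[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transition[(p, cur)])
            new_best[cur] = best[prev] + transition[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    path = [max(labels, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy input: feature values at four alignment positions (made up)
features = [{"rfc": -1, "csm": -1}, {"rfc": 1, "csm": 1},
            {"rfc": 1, "csm": 0.5}, {"rfc": -1, "csm": -0.5}]
switch_penalty = {(a, b): (0.0 if a == b else -1.0)
                  for a in ("coding", "noncoding") for b in ("coding", "noncoding")}
print(viterbi([position_score(f) for f in features], switch_penalty))
```

Because each signal enters only as a feature value, adding another signature means adding one more weight rather than redesigning emission probabilities.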
14- Part 1. Gene Identification
- Evolutionary signatures of genes
- Revisiting the fly genome
- Unusual gene structures
15Revisiting Drosophila annotation
D. melanog.
D. simulans
D. erecta
D. persimilis
579 fully rejected
1,454 exons (800 genes)
668 exons in 443 genes
10,845 fully confirmed
2,499 not aligned
- Fully rejected genes: weak/no evidence
- New exons: existing and novel experimental evidence
- Large-scale functional annotation for novel genes
16Example 1 Known genes stand out
Sharp conservation boundaries. Known exons
stand out. High sensitivity and specificity.
conserved
substitution
insertion
frameshift
gap
17Example 2 Novel multi-exon gene
- 1,454 novel exons
- outside known genes
- Many cluster in new multi-exon genes
- Others are isolated high-confidence exons
18Novel genes and exons
- 1,454 novel exons outside existing genes
- 60% cluster in 300 multi-exon genes
- 40% isolated exons
- 668 novel exons inside existing genes
- Alternative splicing: many with cDNA support
- Nested genes: few known examples
- Human curation
- Collaboration with FlyBase
- Hundreds of changes in release 5.1, more in 5.2
- Systematic experimentation
- Sue Celniker and Berkeley Genome Project
- Thousands of new genes in the pipeline
19Example 3 Dubious single-exon gene
- Only evidence was an open reading frame
- Comparative information much stronger
20 579 Dubious Genes
- Classification approach: Yes / No answer
- Closely related species: both genes and intergenic aligned
- Show very different patterns of mutation
- Comparative analysis provides negative evidence
- Alignment is unambiguous, orthologous, spans entire gene
- Sequence shows mutations and indels in every species
- Weak or missing evidence in D. melanogaster
- 100 of these independently rejected by FlyBase
- These are missing from systematic clone collections
- Only 34 (6%) have assigned names (vs. 36% of all fly genes)
21Example 4 Start codon adjustment
- Codon substitution patterns suggest new start in 200 genes
- Score each substitution using Codon Substitution Matrix (CSM)
poor CSM score, atypical substitution
high CSM score, protein-like substitution
annotated start codon
conserved start codon
22Example 5 Gene annotated on wrong reading frame
- cDNA evidence supports overlapping reading frames, both open
- Annotation traditionally selects longer one
- Conservation enables distinguishing the two
Shorter ORF is the correct one
mRNA supports both ORFs
Annotated ORF (345nt)
Real ORF (315nt)
Conservation only supports shorter ORF
CG7738-RA is incorrect
23Example 6 Incorrect splice causes wrong frame
- Second exon annotated in the wrong frame
- Due to splice site boundary error
- Correction is supported by cDNA evidence
First exon correct frame
2nd exon incorrect frame
Fix exon boundary
24Fly Gene Reannotation
novel genes
novel exons
dubious genes
- FlyBase
- manual curation
- official annotations
25- Part 1. Genome interpretation
- Evolutionary signatures of genes
- Revisiting the fly genome
- Unusual gene structures
26Distinguishing protein-coding regions, in
absence of traditional signals
Traditional stop codons Gene-like conservation
stops sharp
droMel TGCGATCGCTGCCGAAGGCCAATGGCACGTAAGCAGG-----------------TCCAGGACTGGA-----GCAGAGAA
droSim TGCGATCGCTGCCGAAAGCCAATGGCACGTAAGCAGG-----------------CCCAGGACTAGG-----GCAGAGAA
droYak TGCGATCGCTGCCGAAGGCCAATGGCACGTAAGCAGGGT-----A-----AAAGACCAGGACCAGA-----GCAGCGGA
droEre TGCGATCGCTGCCGAAGGCCAATGGCACGTAAGCAGGGT-----G-----AAAGGCCAGGACCAGA-----GCAGCGGC
droAna TGCGGTCCCTGCCGAAGGCCAATGGCACGTAAAGTGTGTTGCCGA-----AGCGTCCGGAATCGGAA-TCTGAATTTGA
droPse TGCGGTCGCTGCCCAAGGCCAATGGAACGTAA-----ATTGCC-----------ACAAGGATGAATA-TCATCAAAGG-
droVir TGCGGTCGCTGCCAAAGGCCAACGGCACGTAAATGGGCCACGCCACACA-----------ACCCAAACTCATCATCTAT
droMoj TGCGACCGCTGCCAAAGGCCAATGGCACGTAAAGAGGGAAACAAACAAACAAACAGAAAAACAAAAACTCAAC----AT
droGri TGCGACCGCTGCCCAAGGCCAATGGCACGTAAA-TGGCATCCCACACAAAAAACAACAAAACAAAAAGAAAAATGGGTT
27Unusual genes 1 Stop codon read-through
- Method 1 (single exons)
- 112 events, 95 extending known genes → Manual curation: 82
- Enriched in neuronal function
- Method 2 (after splicing)
- 256 events, looser cutoff, large overlap, needs manual curation
- Enriched in transcription factors
Protein-coding conservation
Continued protein-coding conservation
No more conservation
Stop codon readthrough
2nd stop codon
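A read-through candidate is defined by the region between the annotated stop codon and the next in-frame stop. The sketch below locates that second stop and reports how many codons the extension would add. The function names and the toy sequence are hypothetical; deciding whether the extension shows protein-coding conservation is a separate step not shown here.

```python
STOPS = {"TAA", "TAG", "TGA"}

def next_inframe_stop(seq, stop_start):
    """Index of the first in-frame stop codon after the annotated stop.

    stop_start is the 0-based index of the annotated stop codon.
    Returns None if no further in-frame stop is found.
    """
    if seq[stop_start:stop_start + 3] not in STOPS:
        raise ValueError("stop_start does not point at a stop codon")
    for i in range(stop_start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i
    return None

def readthrough_extension_codons(seq, stop_start):
    """Number of sense codons between the annotated stop and the 2nd stop."""
    second = next_inframe_stop(seq, stop_start)
    if second is None:
        return None
    return (second - stop_start - 3) // 3

seq = "ATGGCTTGAGGTGCATAA"  # ATG GCT TGA GGT GCA TAA (toy example)
print(next_inframe_stop(seq, 6), readthrough_extension_codons(seq, 6))
```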
28Mechanisms for stop-codon read-through
- Sequence-dependent inhibition of eRF binding
Interference from RNA secondary structure
Steric interactions from P-site tRNA
GAGGUGU
GAG GUG AGU UGA CACGAUGGAGAUC
1,2,3 nucleotides
29micr
midlife-crisis, stem cell maintenance
166 AA
dm     AAAATTGACCCAGTTCCCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGAACCAGTGTG---CTTAC
droSim AAAATTGACCCAGTTCCCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGAACCAGTGTG---CTTAC
droYak AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGAACCAGTGTG---CTTAC
droEre AAAATTGACCCAGTTCACCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGAACCAGTGTG---CTTAC
droAna AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACTCTGCCGCGCACCAGTGTGGCCCTCAC
droPse AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGCACCAGCGTA---CTGAC
droVir AAAATTGACCCAGTTCGCCGACGCTGCTAAAGATGTTCGAGACCACGCTGACTCTGCCGCGCACCAGCGTG---CTAAC
droMoj AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCTTGCCGCGCACCAGCGTG---CTAAC
droGri AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACAATGCCGCGGAGCAGCGTG---CTAAC
frame  12012012012012012012012012012012012012012012012012012012012012012012012   01201
30DopR
dopamine-receptor neurotransmitter
38 AA
24 AA
TGGTGG---CGGTGGCCGCCGTGACCGAATCTCATGGAACATATCGCGTGGTCA   AGACA---GCATTTTGGTCAAATCGCA
TGGTAG---CGGTGGCCGCCGTGACCGAATCTCATGGAACATACCGCGTGGTCA   AGACA---GCCTTTTGGTCAAATCGCA
TGGTGG---CGGTGGCCGCCGTGACCGAATCTCATGGAACATACCGCGTGGTCA   AGACA---GCCTTTTGGTCAAATCGCA
TGGTGG---CGGTGGCCGCCGTGACCGAATCTCATGGAACATACCGCGTGGTCA   AGACA---GCCTTTTGGTCAAATCGCA
TGGTGG---CCGTGGGTGCCGTGACCGAATCTCATGGAACGTACCGCGTGGTCA   AGGCATCGGCCTTTTGGTCAAATTGCA
TGGTGA---CCGTAGGCGCCGTGACCGAATCTCATGGAACATAGCGCGTGGTCA   ---TAGCCGCCTTTTGGTCAAATCGCA
TTGTTGTTGTTGGGGCCGCCGTGACCGAATCTCATGGAACGTATCTTGTGGTCA   ---CAACAACGTTTTGGTCAAATTGCG
TGGTCGTTGTGG---CCGCCGCGACCGAATCTCATGGAACGTACCTTGTGGTCA   ---CCGGTGCGTTTTGGTCAAATCGCA
TGGTTGTTGTGG---CCGCCGCGACCGAATCTCATGGAACGTAACGGTTGGTCA   ---CAGCAGCGTTTTGGTCAAATTGCA
31Sequence context of read-through events
61 UGA
95 top candidates
15 UAG
14 UAA
5 mixed
32Unusual genes 2 Polycistronic messages / uORFs
- Method
- High-scoring ORFs with cDNA evidence
- Disjoint from the annotated ORF
- Results
- 217 cases
Protein-coding conservation in the 5'-UTR
33Unusual genes 3 Frame-shift in the middle of
exons
- Method
- Exons changing high-scoring frame
- Far from splice junctions
- Results
- 68 cases in 44 genes
Frame 1 is high-scoring
Frame 2 is high-scoring
34Part 1 summary Gene identification
- Signatures specific to protein-coding genes
- Reading frame conservation
- Codon substitution biases
- Splicing-associated motifs
- Combine in a discriminative framework
- Support Vector Machine, Conditional Random Fields
- Very high accuracy in yeast, fly, human
- Signatures more informative than individual genome
- Can doubt primary sequence of any one species
- Identify and correct sequencing errors
- New biological mechanisms: stop codon skipping
35Scaling gene identification to 12 species
- 12 species pose new challenges
- Varying coverage, sequencing errors, misassemblies
- Duplication, loss, divergence, misalignments
- Current methods difficult to scale
- Michael Brent writes: "No method for improving gene prediction accuracy by using multi-genome alignments has yet been found despite several serious efforts" (Genome Research 15: 1777-1786; see poster 40)
- Does performance continue to improve with more species?
36Discriminative framework enables continued
increase in power
- Reading frame conservation (RFC) score
- Codon substitution matrix (CSM) score
[Figure: RFC and CSM performance for 2, 3, 5, and 12 species; plotted values 70, 80, 90, 95 and 30, 20, 10, 5 (axis labels not recoverable)]
37Fly comparative genomics
Gene identification
Regulatory motif discovery
microRNA regulation
38Overview
- Part 2. Motif identification
- Evolutionary signatures for motif discovery
- Functional roles of novel motifs
- Scaling motif discovery to 12 species
39Known motifs are preferentially conserved
Experimentally validated region, where
dmel AATGATTTGCCAGCTAGCCAACTCTCTAATTAGCGACTAAGTCCAAGTCAC . . .
dmel AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dsim AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dyak AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAG
dere AATGGTTTGC----------------CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTCC-----------AAGTCAG
dana AATGATTTCCATTTCTCCCCACCCCCCACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAGACTCTAAGTCGG
dpse AAT--------TTTC-----------------------AGCCGTCTAATTAGTGGTGTTCTC------GGTTCTCAAT---
engrailed TAATTA (Ades & Sauer, 1994)
- Enough to discover motifs? Not really
- Conservation not limited to exact binding site → Additional bases would be found
- Weakly constrained positions can diverge → Real motifs will be missed
- Experimental validation typically not available → How do we discover motifs de novo?
- → Use basic property of regulatory motifs: multiple functional (selected) instances
40Known motifs are frequently conserved
D. mel.
- Across the fly genome, the engrailed motif
- appears 8599 times
- is conserved 1534 times
- Statistical significance
- 5 flies: conservation rate of random control motifs 2.8%
- Engrailed enrichment: 6.8-fold (Binomial P-value: 35 stdev)
Motif Conservation Score (MCS)
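A binomial score of this kind can be written as a z-score: how many standard deviations the observed number of conserved instances lies above what random control motifs would give. The sketch below is one plausible formulation, not the talk's exact definition; how the reported MCS is normalized is not stated on the slide, so the values printed here are not expected to reproduce it. Function names are hypothetical.

```python
import math

def conservation_rate(conserved, total):
    """Fraction of motif instances that are conserved."""
    return conserved / total

def binomial_z(conserved, total, p_random):
    """Standard deviations above the expectation under a binomial null.

    p_random is the conservation rate observed for random control motifs.
    """
    expected = total * p_random
    sd = math.sqrt(total * p_random * (1.0 - p_random))
    return (conserved - expected) / sd

# Counts for the engrailed motif as given on the slide
print(round(conservation_rate(1534, 8599), 3))
print(binomial_z(1534, 8599, 0.028) > 0)
```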
41Systematically evaluate candidate patterns
[Figure: example candidate pattern with a central gap, written with the letters G, T, C, R, Y, S, A, W]
- Enumerate
- Length between 6 and 15 nt, allow central gap
- 11 letter alphabet (A C G T, 2-fold codes, N)
- Score
- Compute binomial score (conserved vs. total)
- Select MCS > 6.0 → specificity 97%
- Collapsing
- Sequence similarity
- Overlapping occurrences
All potential motifs
Evaluate MCS
Collapse motif variants
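The 11-letter alphabet (A, C, G, T, the six 2-fold codes, and N) follows the standard IUPAC nucleotide codes, so a candidate consensus can be matched against sequence by expanding each letter into a character class. The sketch below shows only this matching step, with hypothetical function names; enumeration of all candidates, conservation scoring, and collapsing of variants are not shown.

```python
import re

# IUPAC codes restricted to the 11-letter alphabet named on the slide
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]",
    "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Compile a consensus into a regex; the lookahead allows overlaps."""
    body = "".join(IUPAC[letter] for letter in consensus)
    return re.compile("(?=(" + body + "))")

def find_instances(consensus, seq):
    """0-based start positions of every (possibly overlapping) match."""
    return [m.start() for m in consensus_to_regex(consensus).finditer(seq)]

print(find_instances("TAATTA", "CTCTAATTAGCG"))
print(find_instances("WATTRATTK", "AATTGATTG"))
```

Counting total instances this way in one genome, and then checking which of them are also present at the aligned position in the other species, yields the conserved vs. total counts that the binomial score needs.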
42Overview
- Part 2. Motif identification
- Evolutionary signatures for motif discovery
- Functional roles of novel motifs
- Scaling motif discovery to 12 species
43Evidence for promoter motifs
Consensus MCS
1 CTAATTAAA 65.6
2 TTKCAATTAA 57.3
3 WATTRATTK 54.9
4 AAATTTATGCK 54.4
5 GCAATAAA 51
6 DTAATTTRYNR 46.7
7 TGATTAAT 45.7
8 YMATTAAAA 43.1
9 AAACNNGTT 41.2
10 RATTKAATT 40
11 GCACGTGT 39.5
12 AACASCTG 38.8
13 AATTRMATTA 38.2
14 TATGCWAAT 37.8
15 TAATTATG 37.5
16 CATNAATCA 36.9
17 TTACATAA 36.9
18 RTAAATCAA 36.3
19 AATKNMATTT 36
20 ATGTCAAHT 35.6
21 ATAAAYAAA 35.5
22 YYAATCAAA 33.9
23 WTTTTATG 33.8
24 TTTYMATTA 33.6
25 TGTMAATA 33.2
26 TAAYGAG 33.1
27 AAAKTGA 32.9
28 AAANNAAA 32.9
29 RTAAWTTAT 32.9
30 TTATTTAYR 32.9
44Evidence for promoter motifs
Consensus MCS Matches to known TFs
1 CTAATTAAA 65.6 engrailed (en)
2 TTKCAATTAA 57.3 reversed-polarity (repo)
3 WATTRATTK 54.9 araucan (ara)
4 AAATTTATGCK 54.4 paired (prd)
5 GCAATAAA 51 ventral veins lacking (vvl)
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx)
7 TGATTAAT 45.7 apterous (ap)
8 YMATTAAAA 43.1 abdominal A (abd-A)
9 AAACNNGTT 41.2
10 RATTKAATT 40
11 GCACGTGT 39.5 fushi tarazu (ftz)
12 AACASCTG 38.8 broad-Z3 (br-Z3)
13 AATTRMATTA 38.2
14 TATGCWAAT 37.8
15 TAATTATG 37.5 Antennapedia (Antp)
16 CATNAATCA 36.9
17 TTACATAA 36.9
18 RTAAATCAA 36.3
19 AATKNMATTT 36
20 ATGTCAAHT 35.6
21 ATAAAYAAA 35.5
22 YYAATCAAA 33.9
23 WTTTTATG 33.8 Abdominal B (Abd-B)
24 TTTYMATTA 33.6 extradenticle (exd)
25 TGTMAATA 33.2
26 TAAYGAG 33.1
27 AAAKTGA 32.9
28 AAANNAAA 32.9
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n)
30 TTATTTAYR 32.9 Deformed (Dfd)
45Novel motifs show expression enrichment
- Discovered independently of any expression clusters
- Validate against known expression datasets
46Evidence for promoter motifs
Consensus MCS Matches to known Expression enrichment
1 CTAATTAAA 65.6 engrailed (en)
2 TTKCAATTAA 57.3 reversed-polarity (repo)
3 WATTRATTK 54.9 araucan (ara)
4 AAATTTATGCK 54.4 paired (prd)
5 GCAATAAA 51 ventral veins lacking (vvl)
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx)
7 TGATTAAT 45.7 apterous (ap)
8 YMATTAAAA 43.1 abdominal A (abd-A)
9 AAACNNGTT 41.2
10 RATTKAATT 40
11 GCACGTGT 39.5 fushi tarazu (ftz)
12 AACASCTG 38.8 broad-Z3 (br-Z3)
13 AATTRMATTA 38.2
14 TATGCWAAT 37.8
15 TAATTATG 37.5 Antennapedia (Antp)
16 CATNAATCA 36.9
17 TTACATAA 36.9
18 RTAAATCAA 36.3
19 AATKNMATTT 36
20 ATGTCAAHT 35.6
21 ATAAAYAAA 35.5
22 YYAATCAAA 33.9
23 WTTTTATG 33.8 Abdominal B (Abd-B)
24 TTTYMATTA 33.6 extradenticle (exd)
25 TGTMAATA 33.2
26 TAAYGAG 33.1
27 AAAKTGA 32.9
28 AAANNAAA 32.9
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n)
30 TTATTTAYR 32.9 Deformed (Dfd)
47Evidence for promoter motifs
Consensus MCS Matches to known Expression enrichment Promoters Enhancers
1 CTAATTAAA 65.6 engrailed (en) 25.4 2
2 TTKCAATTAA 57.3 reversed-polarity (repo) 5.8 4.2
3 WATTRATTK 54.9 araucan (ara) 11.7 2.6
4 AAATTTATGCK 54.4 paired (prd) 4.5 16.5
5 GCAATAAA 51 ventral veins lacking (vvl) 13.2 0.3
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx) 16 3.3
7 TGATTAAT 45.7 apterous (ap) 7.1 1.7
8 YMATTAAAA 43.1 abdominal A (abd-A) 7 2.2
9 AAACNNGTT 41.2 20.1 4.3
10 RATTKAATT 40 3.9 0.7
11 GCACGTGT 39.5 fushi tarazu (ftz) 17.9
12 AACASCTG 38.8 broad-Z3 (br-Z3) 10.7
13 AATTRMATTA 38.2 19.5 1.2
14 TATGCWAAT 37.8 5.8 2
15 TAATTATG 37.5 Antennapedia (Antp) 14.1 5.4
16 CATNAATCA 36.9 1.8 1.7
17 TTACATAA 36.9 5.4
18 RTAAATCAA 36.3 3.2 2.8
19 AATKNMATTT 36 3.6 0
20 ATGTCAAHT 35.6 2.4 4.6
21 ATAAAYAAA 35.5 57.2 -0.5
22 YYAATCAAA 33.9 5.3 0.6
23 WTTTTATG 33.8 Abdominal B (Abd-B) 6.3 6
24 TTTYMATTA 33.6 extradenticle (exd) 6.7 1.7
25 TGTMAATA 33.2 8.9 1.6
26 TAAYGAG 33.1 4.7 2.7
27 AAAKTGA 32.9 7.6 0.3
28 AAANNAAA 32.9 449.7 0.8
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n) 11 0.8
30 TTATTTAYR 32.9 Deformed (Dfd) 30.7
48Overview
- Part 2. Motif identification
- Evolutionary signatures for motif discovery
- Functional roles of novel motifs
- Scaling motif discovery to 12 species
49The slide for the skeptics
dmel CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
dsim CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
dsec CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
dyak CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
dere CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTC-CAAGTC
dana CACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAG
footprint
dmel CAGCT--AGCC-AACTCTCTAATTA-------------------------------------------------GCGACTA---AGTC-CAAGTC
dsim CAGCT--AGCC-AACTCTCTAATTA-------------------------------------------------GCGACTA---AGTC-CAAGTC
dsec CAGCT--AGCC-AACTCTCTAATTA-------------------------------------------------GCGACTA---AGTC-CAAGTC
dyak CAGC--TAGCC-AACTCTCTAATTA-------------------------------------------------GCGACTA---AGTC-CAAGTC
dere CAGCGGTCGCCAAACTCTCTAATTA-------------------------------------------------GCGACCA---AGTC-CAAGTC
dana CACTAGTTCCTAGGCACTCTAATTA-------------------------------------------------GCAAGTT---AGTCTCTAGAG
dpse ------------AGCCGTCTAATTA-------------------------------------------------GTG--GT---GTTCTCGGTTC
dper ------------AGCCGTCTAATTA-------------------------------------------------GTG--GT---GTTCTCGGTTC
dwil AAAAT------ATGCTCTCTAATTAATGTGATGGGATGGGA---------------TTCGAGACATTCGTGTTAGAGACTTTAGAGTCGCAAGTG
dmoj -------AGCTCGACT-TTTAATTAATATGACGGGCGAC----TTCACAGTGAGAATGGGAGAGAGAG----AGACAGTGAGAGAGAGTGA----
dvir -------AGCCCGACT--TTAATTAATATGATGAGCGCCAAAGTTCAGAGGGCCAGTTCAAAGCAGAC----AGACA------------------
dgri ------GAGGCAGAC--TTTAATTAATATGATGAGCG--TTTGGTCGGAGTGCCAAAGTTGAACAGGCCTGAAAACAGCTAAAGCTGCTCG----
This is not typical. Motifs not always aligned.
50New challenges in using 12 genomes
engrailed (TAATTA)
engrailed
engrailed
- Sequencing / assembly / alignment artifacts
- Contig breaks, misassemblies, misalignments
- Evolutionary variation
- Individual binding sites can move / mutate
- Some motifs conserved only in subset of species
51Discovery power increases with more species
            engrailed motif                        Random motifs
            Total    Conserved    Conservation     conservation     Enrichment
5/5 Flies   8599     1534         17.8%            2.6%             6.8x
8/8 Flies   8529     770          9%               0.7%             12.9x
Real ½
Random ¼
Enrich 2x
- From 5 to 8 flies: conservation enrichment increases
- Real sites: ½ as many conserved
- Random motifs: ¼ as many conserved
- Enrichment doubles
- → Undertake motif discovery with 12 species
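The arithmetic behind "enrichment doubles" can be checked directly from the table values. This is just a worked calculation on the numbers shown on the slide; the variable and function names are illustrative.

```python
# (total instances, conserved instances, random-motif conservation rate)
ROWS = {
    "5/5 flies": (8599, 1534, 0.026),
    "8/8 flies": (8529, 770, 0.007),
}

def enrichment(total, conserved, random_rate):
    """Conservation rate of the real motif divided by the random rate."""
    return (conserved / total) / random_rate

e5 = enrichment(*ROWS["5/5 flies"])
e8 = enrichment(*ROWS["8/8 flies"])
print(round(e5, 1), round(e8, 1), round(e8 / e5, 2))
```

Requiring conservation in more species removes real sites as well as random ones, but random matches drop faster, which is why the ratio rises.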
52Signal increases linearly with branch length
- MCS of top motifs increases
- Several combinations of species allowed
53Fly comparative genomics
Gene identification
Regulatory motif discovery
microRNA regulation
54Motif discovery in 3'-UTRs
TSS
3'-UTR
Stop
ATG
- 168 promoter motifs
- match known TF motifs
- show expression enrichment
- enriched in known enhancers
- 196 motifs in 3'-UTR
- Strand specific
- Abundance of 8-mers
- Role in microRNA regulation
55Directionality of 3'-UTR motifs
3'-UTR motifs
Promoter motifs
Stop
motif
motif
ATG
3'-UTR motifs likely to act post-transcriptionally
56 3'-UTR motif properties
(2) Length distribution
- Enriched in motifs of length 8
Features reminiscent of microRNA genes
57Properties of microRNA genes (miRNAs)
Have we discovered target motifs for miRNAs ?
58Top 3'-UTR motifs match known miRNAs
Specifically match 5' positions of known microRNA genes.
Complementarity can allow mismatches, in specific positions.
59Central 8-mer positions dictate miRNA binding
Evolution
Mismatches excluded from central positions
Experiment
Mutations in central positions affect binding
more strongly
Central 8-mer positions important for miRNA
binding
60 8-mer conservation rates pinpoint miRNA start
MCS of 8-mer starting at that position
- 8-mers starting at positions 0, -1, 1 all have strong MCS
- True across all miRNA genes (animation)
- Can use as signature for miRNA starting position
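To relate a miRNA's start position to a 3'-UTR motif, one can take the 8-mer beginning at a given offset of the mature miRNA and reverse-complement it to get the site expected in the target transcript. The sketch below assumes perfect Watson-Crick complementarity and a DNA-alphabet miRNA sequence; the function names are hypothetical. Offset -1 would need flanking genomic sequence and is not handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA-alphabet sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def target_8mer(mirna_dna, offset=0):
    """3'-UTR 8-mer complementary to miRNA positions offset..offset+7."""
    if offset < 0:
        raise ValueError("negative offsets need flanking genomic sequence")
    kmer = mirna_dna[offset:offset + 8]
    if len(kmer) < 8:
        raise ValueError("miRNA too short for an 8-mer at this offset")
    return revcomp(kmer)

mir33 = "GTGCATTGTAGTCGCATTG"  # dme-miR-33 as written in this deck
print(target_8mer(mir33, 0), target_8mer(mir33, 1))
```

Scoring the conservation of each such 8-mer across offsets gives the profile whose peak the slide uses to pinpoint the miRNA start.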
61Use 8-mer conservation profile to adjust
microRNA annotation
dme-miR-33 (old)  AGGTGCATTGTAGTCGCATTG
hsa-miR-33          GTGCATTGTAGTTGCATTG
dme-miR-33 (new)    GTGCATTGTAGTCGCATTG
- Conservation profile matches human microRNA ortholog
62Using 8-mers to discover novel microRNAs
- Find all matching conserved 8-mers genome-wide
- Restrict to those that are flanked by a conserved 20-mer
- Fold surrounding genomic regions in all 9 genomes
- Restrict to those that have miRNA-like hairpin structure
55 known miRNAs → 43 8-mers → 41 hairpins (38 known miRNAs)
Top 3'-UTR motifs (MCS > 30) → 3234 8-mers → 185 hairpins (50 known miRNAs)
63Discovery of 50 novel miRNA genes
64Ability to determine functional miRNA targets
Real motif instances (conserved above expected)
Motif instances within noise (expected for random
motifs)
65Regulatory motif discovery in the fly
Systematic discovery of regulatory motifs in the
fly
- Frequently occurring, strongly conserved short
regulatory signals
TSS
3'-UTR
Stop
ATG
- promoter motifs
- match known TF motifs
- show expression enrichment
- enriched in enhancers, promoters
- motifs in 3'-UTR
- Strand specific
- Abundance of 8-mers
- microRNA-associated
microRNA regulation
ATATGCAA
conserved 8-mers
50 known + 50 candidate new miRNA genes
66Summary of results
- 1. Gene identification
- Evolutionary signatures for gene identification
- 1400 novel, 600 refined, 579 dubious genes
- Unusual gene structures
- 2. Regulatory motif discovery
- Genome-wide motif discovery
- Novel motifs: tissues, promoters, enhancers
- Discovery power scales with branch length
- 3. microRNA regulation
- Motif-centric discovery, precise start positions
- New microRNA genes, refined existing genes
- 12 genomes identify individual motif targets
67Students and Collaborators
Alex Stark Regulatory motifs
Bill Gelbart Harvard MCB
Mike Lin Gene identification
FlyBase curators
Matt Rasmussen Whole-genome phylogeny
Sue Celniker Berkeley Drosophila Genome Center
Bing Ren UC San Diego
Martha Bulyk Harvard Medical School