New Methods for Comparative genomics

Provided by: markdougl
Transcript and Presenter's Notes

Title: New Methods for Comparative genomics


1
New Methods for Comparative genomics
-Mark Yandell HHMI/BDGP
2
New Methods for Comparative genomics
-Mark Yandell HHMI/BDGP
  • CGL: a software library for comparative genomics
  • Explore recent history (5-70 million years)
  • Explore ancient history (70-1000 million years)

3
Part I: CGL, a software library for comparative genomics
4
>264
MSFNNALSGVNAAQKDLNVTANNIANVNTTGFKESRAEFADVYAN
SIFVNAKTQVGNGVATGAVAQQFHQGALQFTNNALDLSIQGNGFFVTSD
GLTNLDRTFTRAGAFKLNENSYMVNNQGNYLQGYEINTDGTPKAVSINA
TKPIQIPDRAGEPKMTELVEASFNLSIESKTKPTSPAAFDPTNSATFAH
STSVTIYDSLGAPHVITKYFVRHEDPAAPGTPLTPGVKMTFTSGKLDPT
LTVPVDPIKTVALGTTAGIINNGADPTQTLEIRLGDVTQYSSPFNVTKLT
QDGATVGNLTKVEITPDGIVSATYSNATTLKVAMVALAKFANSQGLTQV
GDTSWRQSLLSGDALPGTPNSGTLGSIKSSALEQSNVDLTSQLVNLITA
QRNFQANSRSLEVNSSLQQTILQI
>264
ATGAAAGTTAGTTTTGAAAGAATAATTCCAAGTGAAAAAAGCTCT
TTCCGCACACTGCATAATAACTCTCCTATTTCTGAATTTAAATGGGAGT
ATCATTATCATCCGGAAATAGAACTGGTATGTGTAATTTCGGGAAGTGG
CACACGGCATGTAGGCTACCATAAAAGCAATTATACAAACGGAGATCTT
GTGTTAATAGGTTCAAACATTCCACATTCCGGATTTGGACTGAATTCTG
TTGATCCGCATGAAGAAATAGTACTTCAGTTCAGGGAAGAGATTTTGCAT
TTTCCACAACAGGAAGTTGAAACAAGAGCCGTGAAAGATCTACTGGAAC
GCTCTAAATATGGTATTCTGTATAGTACAGCTACAAAAAAGCTGCTCAT
GCCGAAACTAAAAAAGCTTCTGGAATCCGAAGGCTACAAAAGATACTTA
CTACTTCTGGAGATTCTCTTCGAACTTTCTTTGTGCGAGGAATATGAAT
TGTTGAACAAAGAAATTATGCCTTATACCATAATCTCTAAAAATAAAACA
AGACTGGAAAATATCTTTACCTATGTGGAACATCATTACGATAAGGAAA
TAAATATAGAGGATGTTGCAAAGCTGGCTAATCTTACTCTTCCTGCATT
TTGTAATTTTTTTAAAAAAGCAACACAGATTACCTTTACAGAATTTGTC
AACCGTTACCGTATTAATAAAGCCTGCCTTCTGATGACTCAGGATAAAA
CAATATCCGAATGCAGCTACAGTTGTGGCTTTAACAATGTTACTTATTT
CAACAGAATGTTTAAAAAATATACCAATAAAACGCCATCAGAATTT
5
(No Transcript)
6
?
?
DEW RQWKMDKQV WDAER
DDW RGWKMDKQV WDMER
?
Gene structure
Sequence similarity
7
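Aligning two sequences implicitly aligns two genes
Gene 1
  coding sequence fragments: gacgagtggcgacaatggaaaatggacaaacaagtg, gacgagtgg, tgggacgctgagcga
  intron fragments: agtttaaagc, atgatatatatatatatatcg
  splice sites and stop codon: gt ... ag, gt ... ag, taa
  peptides: DEW, RQWKMDKQV, WDAER
Gene 2
  peptides: DDW, RGWKMDKQV, WDMER
  splice sites and stop codon: gt ... ag, gt ... ag, tga
  intron fragments: agaaatttcg, atgcgcgcgcgcgcgcg
  coding sequence fragments: gatgggtggcgcgggtggaaaatggacaaacaggtg, gatgattgg, tgggacatggagcga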
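The idea on this slide can be sketched in a few lines of Python. This is only an illustration, not CGL code: the function names are hypothetical, the per-exon peptides are the ones shown on the slide, and the sketch assumes an ungapped protein alignment with no codon split across an exon boundary.

```python
from itertools import chain


def residue_to_exon(exon_peptides):
    """Return, for each residue of the protein, the 1-based exon that encodes it."""
    return list(chain.from_iterable(
        [number] * len(peptide)
        for number, peptide in enumerate(exon_peptides, start=1)
    ))


def aligned_exon_pairs(exons_a, exons_b):
    """Exon pairs brought together by an ungapped alignment of the two proteins.

    Assumption: both proteins have the same length and align column by column.
    """
    map_a = residue_to_exon(exons_a)
    map_b = residue_to_exon(exons_b)
    if len(map_a) != len(map_b):
        raise ValueError("this sketch handles only ungapped, equal-length proteins")
    return sorted(set(zip(map_a, map_b)))


# Per-exon peptides as shown on the slide
gene_1 = ["DEW", "RQWKMDKQV", "WDAER"]
gene_2 = ["DDW", "RGWKMDKQV", "WDMER"]
print(aligned_exon_pairs(gene_1, gene_2))
```

Because every aligned residue carries its exon number with it, the protein alignment pairs exon 1 with exon 1, exon 2 with exon 2, and so on, without any separate alignment of the gene models.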
8
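Using BLASTP alignments to make sense of genomic annotations
Annotation A
  coding sequence fragments: gacgagtgg, gacgagtggcgacaatggaaaatggacaaacaagtg, tgggacgctgagcga
  intron fragments: agtttaaagc, atgatatatatatatatatcg
  splice sites and stop codon: gt ... ag, gt ... ag, taa
  peptides: DEW, RQWKMDKQV, WDAER
?
Annotation B
  peptides: DDW, RGWKMDKQV, WDMER
  splice sites and stop codon: gt ... ag, gt ... ag, tga
  intron fragments: agaaatttcg, atgcgcgcgcgcgcgcg
  coding sequence fragments: tgggacatggagcga, gatgggtggcgcgggtggaaaatggacaaacaggtg, gatgattgg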
9
Using genomic annotations to make sense of BLASTP alignments
Score = 42.0 bits (97), Expect(2) = 2e-25
Identities = 23/64 (35%), Positives = 40/64 (61%), Gaps = 2/64 (3%)
Frame = -3
Query 1   MFQNDVSSPRELQLMAAKVEKELGPVDILVNNASLMPMTSTP-SLKSDEIDTILQLNL-G 59
              DVS   E   M  KV    GP DIL NNA     T  P      E D     NL G
Sbjct 29  VVKADVSNREEVREMVKKVIDKFGPIDILINNAGILGKTKDPLEVTDEEWDRVISVNLKG 88
Score = 37.7 bits (86), Expect(2) = 2e-25
Identities = 17/43 (39%), Positives = 29/43 (66%)
Frame = -1
Query 60  VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDI 92
           TGA  G GRAI  ELAK G       IN SGAE   K
Sbjct 89  ITGASRGIGRAIAIELAKRGVNVV---INYSGAEEEAKKTEEL 128
10
Using genomic annotations to make sense of BLASTP alignments
(Same two BLASTP HSPs as on slide 9, with six positions labeled "splice junction".)
11
Annotation A
  coding sequence fragments: gacgagtgg, gacgagtggcgacaatggaaaatggacaaacaagtg, tgggacgctgagcga
  intron fragments: agtttaaagc, atgatatatatatatatatcg
  splice sites and stop codon: gt ... ag, gt ... ag, taa
  peptides: DEW, RQWKMDKQV, WDAER
?
Some un-annotated genome
12
Using genomic annotations to make sense of TBLASTX alignments
Score = 112 bits (240), Expect(2) = 2e-25
Identities = 58/99 (58%), Positives = 61/99 (61%)
Frame = -3 / +1
Query 461   VTSATLPLTSFSLSPSANRSNRCLRKMASRTRGRRSRKTMTFRLKMARVVSSLRCACAA 282
             TSATLPLTSFSL P ANRS RC   MASR RG  SRK   FRLK ARVVSSLR AC A
Sbjct 45613 LTSATLPLTSFSLRPRANRSRRCFSPMASRIRGSSSRKVRIFRLKRARVVSSLRRACEA 45792

Query 281   ACVAVGVAEAVGPVTVTPAGTAVVVRVMDSRKPAVSTT 165
            A VAVG                 V RVMDS KPAVSTT
Sbjct 45793 AWVAVGAVGVAEDEGTVAGPAGAVMRVMDSMKPAVSTT 45909
13
Using genomic annotations to make sense of TBLASTX alignments
(Same TBLASTX alignment as on slide 12, with the following region labels.)
1st Coding Exon
5'-UTR
1st Intron
2nd Coding Exon
14
Using genomic annotations to make sense of TBLASTX alignments
(Same TBLASTX alignment as on slide 12, with the following region labels.)
1st Coding Exon
5'-UTR
?
1st Intron
2nd Coding Exon
?
15
Using TBLASTN alignments to identify orthologous exons and introns in an un-annotated genome
Score = 53.5 bits (127), Expect(2) = 3e-10
Identities = 29/58 (50%), Positives = 30/58 (51%)
Frame = -2
Query 1    MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVMVIT 58
           MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEV V
Sbjct 2097 MSEVVRNTLVKLQAIIALIGLAAITPLLILVALLGRLIAKLCWCSAPKSIAGEVAVVS 2270
splice junction
1st coding exon
orthologous intron begins here (2252)
Score = 105 bits (261), Expect(2) = 3e-10
Identities = 52/52 (100%), Positives = 52/52 (100%)
Frame = -1
Query 49   SIAGEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK 108
               EVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK
Sbjct 3991 NVMPEVMVITGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQIQDIYKVRAK 4146
splice junction
2nd coding exon
orthologous intron ends here (4003)
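The two coordinates marked on this slide can be reproduced from the HSP start positions alone. The sketch below is an illustration rather than CGL code: `codon_span` is a hypothetical helper, and it assumes an ungapped, forward-strand HSP in which each query residue advances the subject by exactly three nucleotides. Under that assumption, 2252 is the last base of the codon aligned to query residue 52, and 4003 is the first base of the codon aligned to query residue 53.

```python
def codon_span(hsp_query_start, hsp_sbjct_start, residue):
    """Subject coordinates (first, last) of the codon aligned to a query residue.

    Assumes an ungapped, forward-strand TBLASTN HSP, so subject coordinates
    grow by three nucleotides per query residue.
    """
    if residue < hsp_query_start:
        raise ValueError("residue lies before the start of the HSP")
    first = hsp_sbjct_start + 3 * (residue - hsp_query_start)
    return first, first + 2


# First HSP on the slide: Query 1 aligned at Sbjct 2097
last_codon_exon_1 = codon_span(1, 2097, 52)
# Second HSP on the slide: Query 49 aligned at Sbjct 3991
first_codon_exon_2 = codon_span(49, 3991, 53)

# Bases strictly between the two flanking codons
intron_length = first_codon_exon_2[0] - last_codon_exon_1[1] - 1

print(last_codon_exon_1)   # (2250, 2252)
print(first_codon_exon_2)  # (4003, 4005)
print(intron_length)       # 1750
```

The same helper also recovers the end of the first HSP: residue 58 maps to 2268-2270, matching the subject end coordinate printed in the alignment.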
16
Using TBLASTN alignments to identify orthologous exons and introns in an un-annotated genome
(Same two TBLASTN HSPs as on slide 15.)
splice junction
1st coding exon
orthologous intron begins here (2252)
splice junction
2nd coding exon
orthologous intron ends here (4003)
17
Part II: Exploring recent history with CGL
D. melanogaster Intron length distribution
18
[Figure: phylogeny of D. virilis, D. pseudoobscura, D. ananassae, D. yakuba, D. simulans, and D. melanogaster, with divergence times of 62.9, 54.9, 44.2, 12.8, and 5.4; axis: time (millions of years)]
Tamura et al. (2003) Temporal Patterns of Fruit Fly (Drosophila) Evolution Revealed by Mutation Clocks. MBE
19
How do intron lengths change over time?
Compare intron lengths
20
agtttaaagc
atgatatattatatatatcg
Query 2  gacgagtgg gaccagtggcgacaatggaaaatggacaaacaagtg tgggacgctgagcga 54
Sbjct 29 gatgattgg gatgggtgg---gggtggaaaatggacaaacaggtg tgggacatggagcga 84
agaaatttcg
atgcgcgcgcgcgcgcg
Compare intron lengths
[Plot axes: log10 intron length in D. simulans; log10 intron length in D. melanogaster]
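A minimal sketch of the comparison plotted here, assuming each gene model is available as a list of exon coordinates and that the two genes have already been matched intron for intron. The function names and the example coordinates are hypothetical and are not taken from CGL or from the talk.

```python
import math


def intron_lengths(exons):
    """Lengths of the introns between consecutive exons.

    `exons` is a list of (start, end) pairs, 1-based and inclusive,
    all on the same strand of the same sequence.
    """
    ordered = sorted(exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def log10_length_pairs(exons_a, exons_b):
    """Pair up corresponding intron lengths from two genes on a log10 scale."""
    lengths_a = intron_lengths(exons_a)
    lengths_b = intron_lengths(exons_b)
    if len(lengths_a) != len(lengths_b):
        raise ValueError("genes must have the same number of introns")
    return [
        (math.log10(a), math.log10(b))
        for a, b in zip(lengths_a, lengths_b)
        if a > 0 and b > 0  # log10 is undefined for zero-length gaps
    ]


# Hypothetical exon coordinates for one orthologous gene pair
dmel_exons = [(1, 9), (110, 136), (1137, 1151)]
dsim_exons = [(1, 9), (20, 46), (147, 161)]

for x, y in log10_length_pairs(dmel_exons, dsim_exons):
    print(round(x, 3), round(y, 3))
```

Each printed pair is one point on a plot like the one on this slide; points on the diagonal are introns whose length is the same in both species.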
21
Why do intron lengths change over time?
[Plot axes: D. pseudoobscura intron length; D. melanogaster intron length]
22
Does selection on the protein influence intron lengths?
Annotation: atggatagtagc, gataagcagaag; peptides MDSR, DKQV; Li (annotated)
Contig or read: atggaaagtagc, gataaacaaaag; peptides MESR, DKQV; Lj (inferred)
23
(No Transcript)
24
A new evolutionary clock
25
Region appears to be un-duplicated in dpse;
might be duplicated in dyak
P(6) vs. dpse << 0.001; P(6) vs. dyak 0.001
26
(No Transcript)
27
Part III: Exploring ancient history with CGL
[Figure: phylogeny of C. elegans, C. intestinalis, H. sapiens, M. musculus, A. gambiae, and D. melanogaster; time axis (myr) marked at 70, 250, and 1000]
28
  • protein similarities
  • similarities in the intron-exon structures of
    genes

29
MIKEVFRPDKFGMDL
MILEVFRPDKFGMDL
MIKDVFRPDKFGIDL
30
The evolutionary trajectory of protein similarities
Best HSP, reciprocal best blastp hits
[Plot: series for Human vs. mouse, Dmel vs. agam, Human vs. ciona, and Human vs. plant; x-axis: fraction of identical amino acids, from 1.0 to 0.0]
31
The evolutionary trajectory of protein similarities
Asymptotic distribution?
[Plot: series labeled 70 million years, 250 million years, 600 million years, and 3 billion years; x-axis: fraction of identical amino acids, from 1.0 to 0.0; additional label: time]
32
[Plot: series for D. melanogaster vs. A. gambiae and H. sapiens vs. C. intestinalis; x-axis: fraction of identical amino acids, from 1.0 to 0.0; shown with a tree of C. intestinalis, H. sapiens, M. musculus, A. gambiae, D. melanogaster, and C. elegans]
33
?
[Plot: x-axis: fraction of identical amino acids, from 1.0 to 0.0]
34
Organism A
Organism B
BLASTP
All proteins
All proteins
Every reciprocal best hit
For every HSP
Global similarity between the two organisms' proteomes
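The flow on this slide (all-against-all BLASTP, keep every reciprocal best hit, then score each HSP) can be sketched as follows. This is an illustration under stated assumptions, not the CGL implementation: hits are assumed to arrive as (query, subject, bit score) tuples, the protein identifiers are made up, and the slide does not say how per-HSP scores are combined into the global measure, so the sketch stops at the per-HSP fraction of identical residues.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query_id, subject_id, bit_score) tuples.
    """
    best = {}
    for query, subject, bit_score in hits:
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def fraction_identical(query_aln, sbjct_aln):
    """Fraction of alignment columns holding the same residue in both rows."""
    if not query_aln or len(query_aln) != len(sbjct_aln):
        raise ValueError("aligned rows must be non-empty and of equal length")
    matches = sum(q == s and q != "-" for q, s in zip(query_aln, sbjct_aln))
    return matches / len(query_aln)


# Hypothetical hit tables for organisms A and B
a_vs_b = [("A1", "B1", 105.0), ("A1", "B2", 53.5), ("A2", "B2", 42.0), ("A3", "B2", 37.7)]
b_vs_a = [("B1", "A1", 105.0), ("B2", "A3", 40.0), ("B2", "A2", 42.0)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
print(fraction_identical("MIKEVFRPDKFGMDL", "MILEVFRPDKFGMDL"))
```

In the example, A3 also hits B2, but B2's best hit is A2, so only the pairs (A1, B1) and (A2, B2) survive the reciprocal filter. The two peptides passed to `fraction_identical` are taken from slide 29 and differ at one of fifteen positions.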
35
[Figure: tree labeled "standard" (C. elegans, C. intestinalis, H. sapiens, M. musculus, A. gambiae, D. melanogaster; time axis in myr marked at 70, 250, and 1000) shown with a tree labeled H for the same species, with node values of 100]
36
MIKEVFRPDKFGMDL
MILEVFRPDKFGMDL
MIKDVFRPDKFGIDL
37
Organism A
Organism B
BLASTP
All proteins
All proteins
Every reciprocal best hit
XXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXXX
For every HSP
qintron-intron
I

Global similarity between the two organisms' intron-exon structures
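The transcript does not define how I is computed, so the following is not the statistic from the talk. As one hypothetical way to compare intron-exon structures within an aligned pair of proteins, the sketch projects each gene's intron positions onto alignment columns and reports the Jaccard index of the two sets. The function names are assumptions, and intron positions are given as the 1-based residue after which an intron falls.

```python
def intron_columns(aligned_seq, introns_after_residue):
    """Alignment columns (0-based) of the residues that are followed by an intron.

    `aligned_seq` is one row of a pairwise alignment, with "-" for gaps.
    `introns_after_residue` holds 1-based residue numbers in the ungapped protein.
    """
    wanted = set(introns_after_residue)
    columns = set()
    residue = 0
    for column, symbol in enumerate(aligned_seq):
        if symbol != "-":
            residue += 1
            if residue in wanted:
                columns.add(column)
    return columns


def shared_intron_fraction(aln_a, introns_a, aln_b, introns_b):
    """Jaccard index of intron positions in alignment coordinates."""
    columns_a = intron_columns(aln_a, introns_a)
    columns_b = intron_columns(aln_b, introns_b)
    union = columns_a | columns_b
    if not union:
        return 1.0  # two intronless genes have identical (empty) structures
    return len(columns_a & columns_b) / len(union)


# Peptides from slide 7: both genes have introns after residues 3 and 12
print(shared_intron_fraction("DEWRQWKMDKQVWDAER", [3, 12],
                             "DDWRGWKMDKQVWDMER", [3, 12]))
```

For the two genes from slide 7 every intron position is shared, so the score is 1.0; dropping one intron from either gene lowers it to 0.5.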
38
Good trees can be made from both measures. These
measures provide a rapid and robust means to
calculate phylogenetic trees using the
combined data from many annotated genomes
simultaneously.
[Figure: two trees, labeled I and H, each of C. elegans, C. intestinalis, H. sapiens, M. musculus, A. gambiae, and D. melanogaster, with node values of 100]
Trees based on I provide an independent check on
trees made from amino-acid similarities. Note
that deep nodes are better resolved in the I
tree.
39
Like intron lengths, gene structures seem to
evolve independently of protein sequence: hold H
constant and you still get the same I tree.
[Figure: two trees, each of A. gambiae, D. melanogaster, C. elegans, M. musculus, H. sapiens, and C. intestinalis; labels: I, and H 1.5 50 similarity]
40
New Methods for Comparative genomics
  • Part I: CGL, a software library for comparative
    genomics
  • Genome annotations make possible many new kinds of
    genome analyses.
  • We have developed a (soon to be) publicly
    available software library called CGL ("seagull")
    that greatly facilitates such analyses.
  • Part II: Using CGL to explore recent history
    (1-70 million years)
  • Intron length can be used as an evolutionary
    clock.
  • Seems to evolve independently of protein
    sequence.
  • Can be used to confirm and calibrate existing
    protein clock approaches.
  • Can be used to identify recent gene duplication
    events.
  • Can be used to strengthen and extend many
    standard protein-based approaches, e.g. quartet
    analysis.
  • Part III: Using CGL to explore ancient history
    (70-1000 million years)
  • Proteins may saturate completely after 1
    billion years or so.
  • Resolving deep evolutionary questions w/ proteins
    means computing consensus trees; we offer one
    approach to this problem using H.

41
Acknowledgements
Chris Mungall, Simon Prochnik, Chris Smith, Josh Kaminker,
George Hartzell, Suzi Lewis, Gerald M. Rubin