Title: Tale 1: To Identify an Unknown Gene
4Observation
Photos courtesy of www.webshots.com and Peter
Smallwood
5Observation
6Observation
7Observation
8Experiment
9Filters: Information reducers. Squirrel filter
10Filters: Information reducers. Molecule filter
11Filters: Information reducers. Sequence filter
How organism is made; how organism works
12From Sequence to OrganismHow does Nature do it?
13From Sequence to OrganismHow does Nature do it?
Genetic code
Rules of folding
14From Sequence to OrganismHow does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
- Transcriptional initiation
- Transcriptional termination / polyA tailing
- Splicing
- Translational initiation
Rules of transcriptional and post-transcriptional
control
15From Sequence to OrganismHow does Nature do it?
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding
Functional protein
DNA
16From Sequence to OrganismHow does Nature do it?
Natural filters/transformations
Functional protein
DNA
17From Sequence to OrganismHow can WE do it?
Simulation of Nature
Whether 'tis nobler in the mind to suffer the
slings and arrows of outrageous fortune...
We must give our military every tool and weapon
it needs to prevail...
???
18From Sequence to OrganismHow can WE do it?
Surrogate Processes
Whether 'tis nobler in the mind to suffer the
slings and arrows of outrageous fortune...
Utterance of Wm Shakespeare
Utterance of George W Bush
We must give our military every tool and weapon
it needs to prevail...
Words/sentence, choice of words, sentence structure
19From Sequence to OrganismHow can WE do it?
Surrogate filters
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding
20From Sequence to OrganismHow can WE do it?
- Surrogate filters
- Gene finders
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding
21From Sequence to OrganismHow can WE do it?
- Surrogate filters
- Gene finders
- Similarity finders
- Feature finders
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding
22From Sequence to OrganismHow can WE do it?
- Surrogate filters
- Gene finders
- Similarity finders
- Feature finders
- Pattern finders
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding
23From Sequence to OrganismHow can WE do it?
- Surrogate filters
- Gene finders
- Similarity finders
- Feature finders
- Pattern finders
- 2nd Most powerful tool
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding
24Surrogate Filters
You do it
25Surrogate FiltersGene finders
Class 1 Start/Stop codon search (Map, Frames,
OrfFinder)
Look for stop codons (TAA,TAG,TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAA
TGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
26Surrogate FiltersGene finders
Class 1 Start/Stop codon search (Map, Frames,
OrfFinder)
Look for stop codons (TAA,TAG,TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAA
TGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATT
TGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
27Surrogate FiltersGene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Pro: Quick, simple
Con: Useless for eukaryotic genomic sequences (introns)
     Inaccurate (start codon problem)
     Inaccurate (doubtful short open reading frames)
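A minimal sketch of a Class 1 search, assuming a plain ATG start and the three stop codons named on the slide. The function names and the `min_codons` cutoff are illustrative choices, not part of Map, Frames, or OrfFinder. It scans all three frames on both strands and reports each ATG-to-stop stretch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def open_reading_frames(seq, min_codons=3):
    """List (strand, frame, start, end) for every ATG...stop stretch.

    start and end are 0-based offsets on the strand being scanned,
    with end pointing just past the stop codon.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the frame
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        found.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ATG
    return found


if __name__ == "__main__":
    print(open_reading_frames("CCATGAAATTTGGGTAACC"))
```

The cons listed above show up directly in the sketch: it takes the first ATG it sees as the start, it knows nothing about introns, and a small `min_codons` lets doubtful short frames through.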
28Surrogate FiltersGene finders
Do it
Class 1 Start/Stop codon search (Map, Frames,
OrfFinder)
1. Go to http://www.vcu.edu/elhaij/BioInf
2. Open 2nd and 3rd browsers (Ctrl-N in Netscape). Go to same site (copy and paste URL)
3. In 1st browser, go to Program List. Click on Gene Finders, then scroll down. Open OrfFinder
4. In 2nd browser, open sample sequence
29Surrogate FiltersGene finders
Do it
Class 1 Start/Stop codon search (Map, Frames,
OrfFinder)
5. Paste sample sequence into window
6. Choose Bacterial Code in Genetic codes
window
7. Click on OrfFind
31Surrogate FiltersGene finders
Class 2 Codon bias recognition (TestCode)
Are codons equally used?
The code is degenerate
32Surrogate FiltersGene finders
Class 2: Codon bias recognition (TestCode)
Most frequently used codons
Codon bias universal?
Yes/No (basis for determining foreign genes)
Codon usage is biased
33Surrogate FiltersGene finders
Class 2: Codon bias recognition (TestCode)
Pro: Quick, simple, available through GCG
     Better than Class 1 in excluding false open reading frames
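To see codon bias in a sequence, one can tabulate how often each codon is used relative to its synonyms. The sketch below does only that, using the standard genetic code. It is not the TestCode algorithm, and the function name `codon_usage` is an assumption made for illustration.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codon_usage(coding_seq):
    """Fraction of each codon among the codons for the same amino acid.

    Assumes an in-frame coding sequence containing only A, C, G, T.
    """
    seq = coding_seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {codon: n / per_amino_acid[CODON_TABLE[codon]]
            for codon, n in counts.items()}


if __name__ == "__main__":
    # Three CTG and one CTA: leucine usage is skewed toward CTG.
    print(codon_usage("CTGCTGCTGCTAAAA"))
```

If the code were not degenerate there would be nothing to measure; with unbiased usage, every fraction for a four-codon amino acid would sit near 0.25.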
34Surrogate FiltersGene finders
Class 3: Markov Model-based recognition
Principle
Step 1: Create model through extensive training set
        Training set: proven or suspected genes
        Organism-specific
Step 2: Assess candidate genes through filter of model
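Step 1 can be sketched as counting: for every three-base context in the training set, tally which base follows it, then turn the tallies into probabilities. The function name `train_markov` and the pseudocount are assumptions for illustration; GeneMark itself is more elaborate (for example, it takes codon position into account).

```python
from collections import defaultdict


def train_markov(training_seqs, order=3, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training genes.

    Returns {context: {"A": p, "C": p, "G": p, "T": p}} for every
    context seen in the training set.
    """
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    for seq in training_seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, nxt = seq[i - order:i], seq[i]
            if set(context + nxt) <= set("ACGT"):   # skip ambiguous bases
                counts[context][nxt] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values()) + 4 * pseudocount
        model[context] = {base: (n + pseudocount) / total
                          for base, n in row.items()}
    return model


if __name__ == "__main__":
    model = train_markov(["AAAGAAAGAAAC"], pseudocount=0)
    print(model["AAA"])   # G follows AAA twice, C once
```

A training set this small is only for show. The need for an extensive, organism-specific training set comes from having 64 contexts, each needing enough observations for its four probabilities to be trustworthy.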
35Surrogate FiltersGene finders
Class 3: Markov Model-based recognition
Step 1: Create model through extensive training set
AAA AAC AAG AAT ACA . . . TTG TTT
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC
AGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTAC
CATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTC
TGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCT
CCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACT
TCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAAC
AAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACT
TTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTA
GTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTG
AGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTA
ATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGT
GGAGATGATTCGGTAGCTTT
38Surrogate FiltersGene finders
Class 3: Markov Model-based recognition
Step 2: Assess candidate genes
3rd order Markov model
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
AAAGCAA
39Surrogate FiltersGene finders
Class 3: Markov Model-based recognition
Step 2: Assess candidate genes
3rd order Markov model
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
0.12 x 0.15
AAAGCAA
40Surrogate FiltersGene finders
Class 3: Markov Model-based recognition
Step 2: Assess candidate genes
3rd order Markov model
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
0.12 x 0.15 . . .
AAAGCTA
So far, not a good candidate!
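The running product on the slide can be reproduced in a few lines. The sketch below hard-codes only the three table rows needed for the candidate's first five bases (AAAGC) and stops when it reaches a context whose row is elided on the slide. The function name `score_candidate` is an assumption for illustration.

```python
import math

# Rows copied from the 3rd order Markov table above (columns A, C, G, T).
MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAC": {"A": 0.30, "C": 0.20, "G": 0.15, "T": 0.35},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}


def score_candidate(candidate, model, order=3):
    """Multiply P(base | previous three bases) along the candidate.

    Returns (product, factors). Scoring stops at the first context that
    is missing from `model` (the rows the slide leaves out).
    """
    factors = []
    for i in range(order, len(candidate)):
        context = candidate[i - order:i]
        if context not in model:
            break
        factors.append(model[context][candidate[i]])
    return math.prod(factors), factors


if __name__ == "__main__":
    product, factors = score_candidate("AAAGC", MODEL)
    print(factors, product)
    # With no bias at all, each step would contribute 0.25.
    print("uniform background:", 0.25 ** len(factors))
```

Both factors fall below the 0.25 expected by chance, which is why the candidate looks poor so far. In practice one sums logarithms rather than multiplying, since a product over a whole gene underflows quickly.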
41Surrogate FiltersGene finders
Class 3 Markov Model-based recognition
Step 2 Assess candidate genes
42Surrogate FiltersGene finders
Class 3 Markov Model-based recognition
Step 2 Assess candidate genes
3rd order Markov model
Conform to standard model
Challenge accepted beliefs
Predicted genes
43Surrogate FiltersGene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Pro: Nearly the most accurate method known
Con: Needs big training set
     May miss genes of foreign origin
     Will miss very small genes
44Surrogate FiltersGene finders
Do it
Class 3 Hidden Markov Model (HMM)-based
recognition
1. Go to course web page (3rd browser)
2. Go to Program List. Click on Gene Finders, then GeneMark
3. Click on "here" in Gene Prediction in Bacteria and Archaea
4. Paste in sample sequence
45Surrogate FiltersGene finders
Do it
Class 3 Hidden Markov Model (HMM)-based
recognition
5. Choose Nostoc PCC 7120 as species
6. Check Generate PDF graphics (screen) and Print GeneMark 2.4 predictions
7. Click Start GeneMark.hmm
47Surrogate FiltersScenario I Case of the Hidden
Heterocyst
48Case of the Hidden Heterocyst
NH3
N2
O2
Matveyev and Elhai (unpublished)
49Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
1. Use transposon mutagenesis
50Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
Transposon
1. Use transposon mutagenesis
to find a mutant defective in heterocyst
differentiation
51Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC
AGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTAC
CATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTC
TGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCT
CCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACT
TCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAAC
AAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACT
TTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTA
GTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTG
AGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTA
ATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGT
GGA
1. Use transposon mutagenesis
to find a mutant defective in heterocyst
differentiation
2. Sequence out from transposon
52Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
Do it
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC
AGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTAC
CATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTC
TGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCT
CCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACT
TCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAAC
AAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACT
TTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTA
GTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTG
AGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTA
ATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGT
GGA
1. Use transposon mutagenesis
to find a mutant defective in heterocyst
differentiation
2. Sequence out from transposon
3. Find gene boundaries
4. Identify gene
53Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
1. Go to course web page (http://www.vcu.edu/elhaij/BioInf)
2. Open Nostoc sequence
3. Do what you need to do to find the gene
55Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful
>Translation 358..513 (direct), 51 amino acids
VQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQKSVEEALENVK
or was it?
Check predicted protein against databases
56Surrogate FiltersSimilarity finders
Do it
- Blast
  - BlastP: Protein sequence to search protein database
  - BlastN: Nucleotide sequence to search nucleotide database
  - BlastX: Nucleotide sequence (translated) to search protein database
  - TBlastN: Protein sequence to search (translated) nucleotide database
  - Blast2Seq: Compare two sequences you specify
- Pfam (Protein motif families): Finds conserved motifs similar to protein sequence
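The choice among the Blast flavors depends only on what kind of sequence you have and what kind of database you want to search. That can be written as a lookup table. The function name `choose_blast` and the type labels are assumptions for illustration, not part of any Blast interface.

```python
# (query type, database type) -> Blast flavor, as listed above.
BLAST_FLAVORS = {
    ("protein", "protein"): "BlastP",
    ("nucleotide", "nucleotide"): "BlastN",
    ("nucleotide", "protein"): "BlastX",    # query is translated first
    ("protein", "nucleotide"): "TBlastN",   # database is translated
}


def choose_blast(query_type, database_type):
    """Return the Blast flavor for a query/database combination."""
    try:
        return BLAST_FLAVORS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'") from None


if __name__ == "__main__":
    # A predicted protein checked against a protein database:
    print(choose_blast("protein", "protein"))
```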
61Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful
>Translation 397..639 (direct), 51 amino acids
VQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQSTVDAAVENIKGA
62Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
What happened?
- GeneMark is correct: Conservation of noncoding regions
- GeneMark is wrong: Fooled by weird aa sequence or start codon
63Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
What happened?
- GeneMark is correct: Conservation of noncoding regions
- GeneMark is wrong: Fooled by weird aa sequence or start codon
Moral: Automated gene finders are wonderful, but common sense is better
Don't trust automated annotation
64Surrogate FiltersFeature finders
- Hidden Markov model-based methods
  - Good for contiguous features (e.g. signal sequences)
  - Problems with features having gaps (e.g. promoters)
- Ad hoc methods
  - Feature-specific rules (e.g. tandem repeats, terminators)
Position-dependent frequency tables
Position-specific scoring matrix (PSSM)
Weight table
65Surrogate FiltersFeature finders
Position-dependent frequency tables
Consensus TATAAA
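A position-dependent frequency table is just a per-column tally over aligned examples of the feature. The sketch below builds one and reads off the consensus. The five example sites are hypothetical, chosen so that the consensus comes out as TATAAA, and the function names are assumptions for illustration.

```python
def frequency_table(sites):
    """Per-position base frequencies for equal-length, aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("sites must be aligned and of equal length")
    table = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        table.append({base: column.count(base) / len(sites)
                      for base in "ACGT"})
    return table


def consensus(table):
    """Most frequent base at each position (first in A, C, G, T on ties)."""
    return "".join(max(column, key=column.get) for column in table)


if __name__ == "__main__":
    # Hypothetical TATA-box-like sites, for illustration only.
    sites = ["TATAAA", "TATAAT", "TATATA", "CATAAA", "TTTAAA"]
    table = frequency_table(sites)
    for pos, column in enumerate(table, start=1):
        print(pos, column)
    print(consensus(table))
```

The table keeps information the consensus throws away: position 1 here is T only 80% of the time, so a site starting with C is penalized but not excluded.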
66Surrogate FiltersFeature finders
Position-dependent frequency tables
67Surrogate FiltersFeature finders
Position-Specific Scoring Matrix in action
atpI ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ TTCACACAGGAAACAGCTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGGGAAATGGCTCAA
sucA GATGCTTAAGGGATCACGATGCAGAAC
trpE CAAAATTAGAGAATAACAATGCAAACA
Experimentally proven start sites
68Surrogate FiltersFeature finders
Position-Specific Scoring Matrix in action
?
Unknown start site
aceB ACTATGGAGCATCTGCACATGAAAACC
atpI ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ TTCACACAGGAAACAGCTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGGGAAATGGCTCAA
sucA GATGCTTAAGGGATCACGATGCAGAAC
trpE CAAAATTAGAGAATAACAATGCAAACA
Experimentally proven start sites
70Surrogate FiltersFeature finders
Position-Specific Scoring Matrix in action
atpI ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ TTCACACAGGAAACAG....CTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA GATGCTTAAGGGATCA....CGATGCAGAAC
trpE CAAAATTAGAGAATA...ACAATGCAAACA
71Surrogate FiltersFeature finders
Position-Specific Scoring Matrix in action
aceB ACCACATAACTATGGAGCATCTGCACATGAAAACC
atpI ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ TTCACACAGGAAACAG....CTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA GATGCTTAAGGGATCA....CGATGCAGAAC
trpE CAAAATTAGAGAATA...ACAATGCAAACA
72Surrogate FiltersFeature finders
Position-Specific Scoring Matrix in action
aceB ACCACATAACTATGGAGCATCT.GCACATGAAAACC
atpI ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ TTCACACAGGAAACAG....CTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA GATGCTTAAGGGATCA....CGATGCAGAAC
trpE CAAAATTAGAGAATA...ACAATGCAAACA
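A position-specific scoring matrix can be sketched from the nine proven start sites in their ungapped form (27 bases each, start codon at positions 19 to 21). Each cell is a log-odds score against a uniform background, and the matrix is then slid along the longer aceB sequence. The pseudocount, the uniform background, and the function names are assumptions for illustration. As noted earlier, an ungapped matrix cannot represent the variable spacing that the gapped alignment introduces.

```python
import math

# Proven start sites from the slides, aligned on the start codon (no gaps).
START_SITES = {
    "atpI": "ACCTCGAAGGGAGCAGGAGTGAAAAAC",
    "bioB": "ACGTTTTGGAGAAGCCCCATGGCTCAC",
    "glnA": "ATCCAGGAGAGTTAAAGTATGTCCGCT",
    "glnH": "TAGAAAAAAGGAAATGCTATGAAGTCT",
    "lacZ": "TTCACACAGGAAACAGCTATGACCATG",
    "rpsJ": "AATTGGAGCTCTGGTCTCATGCAGAAC",
    "serC": "GCAACGTGGTGAGGGGAAATGGCTCAA",
    "sucA": "GATGCTTAAGGGATCACGATGCAGAAC",
    "trpE": "CAAAATTAGAGAATAACAATGCAAACA",
}
ACEB = "ACCACATAACTATGGAGCATCTGCACATGAAAACC"   # unknown start site


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds score for each base at each aligned position."""
    n = len(sites)
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2((column.count(base) + pseudocount)
                            / (n + 4 * pseudocount) / background)
            for base in "ACGT"
        })
    return pssm


def score_windows(pssm, seq):
    """Score every window of the matrix width along seq: (offset, score)."""
    width = len(pssm)
    return [(offset, sum(pssm[j][seq[offset + j]] for j in range(width)))
            for offset in range(len(seq) - width + 1)]


PSSM = build_pssm(list(START_SITES.values()))

if __name__ == "__main__":
    for offset, score in score_windows(PSSM, ACEB):
        print(offset, round(score, 2))
```

The highest-scoring offset is the placement of aceB that best resembles the proven sites. With only nine training sites the scores are rough, so treat the ranking as a hint rather than a verdict.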
73Surrogate FiltersPattern finders
New pattern discovery (Meme, Gibbs sampler,
BioProspector)
74Surrogate FiltersPattern finders
How do pattern finders work?
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t      GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14           CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1              GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1     CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from
a sequence
75Surrogate FiltersPattern finders
How do pattern finders work?
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t      GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14           CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1              GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1     CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from
a sequence
Step 2. Find best matches to pattern in all
sequences
76Surrogate FiltersPattern finders
How do pattern finders work?
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t      GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14           CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1              GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1     CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from
a sequence
Step 2. Find best matches to pattern in all
sequences
Step 3. Construct position-dependent frequency
table based on matches
77Surrogate FiltersPattern finders
How do pattern finders work?
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t      GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14           CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1              GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1     CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from
a sequence
Step 2. Find best matches to pattern in all
sequences
Step 3. Construct position-dependent frequency
table based on matches
Step 4. Calculate relative probability of matches
from frequency table
78Surrogate FiltersPattern finders
How do pattern finders work?
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t      GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14           CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1              GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1     CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from
a sequence
Step 2. Find best matches to pattern in all
sequences
Step 3. Construct position-dependent frequency
table based on matches
Step 4. Calculate relative probability of matches
from frequency table
Step 5. If probability score high, remember
pattern and score
79Surrogate FiltersPattern finders
How do pattern finders work?
snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t      GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14           CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1              GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1     CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from
a sequence
Step 2. Find best matches to pattern in all
sequences
Step 3. Construct position-dependent frequency
table based on matches
Step 4. Calculate relative probability of matches
from frequency table
Step 5. If probability score high, remember
pattern and score
Step 6. Repeat Steps 1 - 5
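Steps 1 to 6 can be sketched as a loop. Two simplifications are assumed here and are not how Meme or the Gibbs sampler actually work: every window is tried as the candidate pattern in turn (rather than arbitrary picks), and the "relative probability" of Step 4 is replaced by a plain conservation score, the mean frequency of the commonest base per column. The three test sequences are hypothetical, with TATAAA planted in each.

```python
def best_match(pattern, seq):
    """Step 2: window of seq with the most identities to pattern."""
    k = len(pattern)
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return max(windows,
               key=lambda w: sum(a == b for a, b in zip(pattern, w)))


def conservation(matches):
    """Steps 3-4: mean frequency of the commonest base in each column.

    1.0 means the matches are identical in every position.
    """
    n = len(matches)
    columns = list(zip(*matches))
    return sum(max(col.count(base) for base in "ACGT") / n
               for col in columns) / len(columns)


def find_pattern(seqs, k):
    """Return (score, pattern, matches) for the best pattern of width k."""
    best = (0.0, None, None)
    for seq in seqs:                                  # Step 6: repeat
        for i in range(len(seq) - k + 1):
            candidate = seq[i:i + k]                  # Step 1
            matches = [best_match(candidate, s) for s in seqs]
            score = conservation(matches)
            if score > best[0]:                       # Step 5: remember
                best = (score, candidate, matches)
    return best


if __name__ == "__main__":
    # Hypothetical sequences, each containing TATAAA once.
    seqs = ["CCGTATAAAGGC", "ATATAAACTTG", "GGTTTATAAATC"]
    print(find_pattern(seqs, 6))
```

Trying every window is affordable only for a handful of short sequences. Real pattern finders sample candidates and refine them iteratively, which is why they can return different answers on different runs.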
80Surrogate FiltersScenario II Case of the
Masked Motif
- You've found a gene related to Purple Tongue Syndrome
- BlastP: Encoded protein related to cAMP-binding proteins
- Are the similarities trivial? Related to cAMP
binding?
- Does your protein contain cAMP-binding site?
- What IS a cAMP-binding site?
- Task
- Determine what is a cAMP-binding site
- Determine if your protein has one
81Surrogate FiltersScenario II Case of the
Masked Motif
Strategy
- Collect sequences of known cAMP-binding proteins
- Run Meme, a pattern-finding program. Ask it to find any significant motifs
Do it
- Rerun Meme. Demand that every protein has
identified motifs
- Run Pfam over known sequence to check
83Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
- Progressive External Ophthalmoplegia (PEO)
- Slow paralysis of voluntary eye muscles
- Many other symptoms (e.g., frequent deafness)
- Loss of mitochondrial DNA
84Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
- Progressive External Ophthalmoplegia (PEO)
- Slow paralysis of voluntary eye muscles
- Many other symptoms (e.g., frequent deafness)
- Loss of mitochondrial DNA
- Inheritance
- Mendelian
- Autosomal dominant
- Linked to chromosome 4q34
85Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
- Progressive External Ophthalmoplegia (PEO)
- Slow paralysis of voluntary eye muscles
- Many other symptoms (e.g., frequent deafness)
- Loss of mitochondrial DNA
- Inheritance
- Mendelian
- Autosomal dominant
- Linked to chromosome 4q34
Your task
- Examine sequence of 4q34 region
- Assess likelihood that a gene in the area could
cause disease symptoms
86Surrogate FiltersScenario III Case of the
Mortal Mitochondrion Examining Sequence of 4q34
Region
tctacttatattcaatccacagggctacacctagttcttggtacacagta
catgctcagcaagagtctgttgaatgaacacatacatggtttatctgttt
gtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcc
tgccccactcttagataaacgcatgccctctgtggccctggaaccttagt
gacttctgctataccaaagtctccacgcccagggtgacacgcagctgcag
ctccgtaaacctctaacatgatgtcagcaaatattaaaaaaaaaaagttt
ataaaaacaatgaataaactttgttaaaggtacaaatgaaaattagcaaa
catgggaagataattgagtaaagagtttaaagttaaaaacgaattgcagt
cattctaggggaaggaacagttgtatttgaaaacctgtatggttacatga
actgcctaaaaaacaagctaaggaaaattaaagctcagatttatatattt
taagaaattaattgcaattaatttcctgggattaaatagcatttcctcaa
ccccagctgtcattaaaaagaggcaaatacagccaaggactggatcttct
ccggaaggctgacagcactgaccctcaagaaggcaccggctgacagacag
aacattctgccctaatatgtgctgaaattccgctgagagcagagtggtac
attgaaccctttaggggcttacaaaagaagtgtcctgtgttttagagtca
cagagttttgcagaaacaagtatgaattcacctagtggccccctgcacca
ggtctttcctgtgggcactgagtgcagacacatcaatatgtaatagcaga
atgaatgactgaacgaacgattgaatgaaaagaaatgagaggcagcaggt
tgtcagattctatgaggcaatcacagcatcaggtgaccttagtatctatt
tgagaggactgccatttattctcgggagcgcacggctctaaagaggccca
tatccaggcagtgagctctggtggggggcgcctttagatgcaagaaggag
gaaacagctcgaaatccctgggcctgagcgcggcccgtgcaggccggagg
gtcaagaactctccaccggcggcagcggcccggtgtctgccccggcttcg
ccccggcctaaggctgcctgtgctataaatacgcggcccacatgccgcgg
tgacacggtgttccctgggctcggcgggacagataacatgaatgtgccct
ttaaacgtcccaagttgcagggacagcccccggcccagcctcgctcccgg
aagcgccttcgcccccgatgccctctgcagctgggaggagggggcgcccc
gcacctgcccagccaatgcgcggcgcgagcgccggccgcgacccgcctcc
tctcgcgagagcccggcggggatataagggggagctgcgggccaggcggc
ggccccctagcgtcgcgcagggtcggggactgcgcgcggtgccaggccgg
gcgtgggcgagagcacgaacgggctgcctgcgggctgagagcgtcgagct
gtcaccatgggtgatcacgcttggagcttcctaaaggacttcctggccgg
gggcgtcgccgctgccgtctccaagaccgcggtcgcccccatcgagaggg
tcaaactgctgctgcaggtgaggaccgcgcggtgcaagaggcgggcgcgg
gcgcggcgggccgggcggggcgcgcgatgcggcgcgagctgcagggcgcg
gggcgccgcggaaaatctgcgccaggccacaggcccgggcgcccgcccgc
ccgcgggggaagaaggtgccctctgcgtagagacaggtccagcgtcagtc
gcagattcctggtgtcgggtggcgcccggcgttcgggtgtctatatatgg
aaacccacccggagccggtttacgtgtgccagatcctgcgcccgtgacag
cacgggcgtgcactcaggcccggaggcacctagtgattgccagtattttt
ggcaccgtcttatgcgcacgcacctttacaataaaaacatcaaaataatc
atcacccaagaattcccttatcgtatctcatgcacaatgctgtatgtagg
ctgacgccttcatctttatgtaacctctgtgagagagttattcttctcca
ttttacagatgaagctgaggttttgaaatattaagaaacaattttcggaa
taaactcagatcatcctgtctccaaatcttttcctcccctacctggtcgc
tgaatggtttatcatcctctcgtgttttcctccacctgcccaaaaggtca
gggcccctcaatgaggaagagcccaatttgggagtcagaattactaacaa
caaaacccccacaaattgctcacaacggcagcaaacccttaataattgat
tacttggattatctgcttgaaaactttggaggcctaatgtttagtggatt
tattctccttcctctattagagcatctagtagagatcctcatctccaggg
tgatcagagtgacactgagaaattgtcattttttggccatcatgtctatt
aaatccaaagccctttgaagcagggagtgttactcatttctgtcccccag
taagcccctcatacagttctcaaacctagggaaagtgaaataaataaatg
gctatagctttatataattcaatcaccttttcagtttatttggggcaata
cctttccctcaaataccctaataattgaagcaacattggattattttggc
ttgttatccagtaactaacatggataacagtatccatttacacgtcctcg
tatccatttgatttcctcatcctttttttcttcaaaaaaaaaatctagga
agtgcaaaccttttttttttctcctgtcctcttcccttctctctaccctg
cctgtcctctgtcacccaccctcccctccaccaggtccagcatgccagca
aacagatcagtgctgagaagcagtacaaagggatcattgattgtgtggtg
agaatccctaaggagcagggcttcctctccttctggaggggtaacctggc
caacgtgatccgttacttccccacccaagctctcaacttcgccttcaagg
acaagtacaagcagctcttcttagggggtgtggatcggcataagcagttc
tggcgctactttgctggtaacctggcgtccggtggggccgctggggccac
ctccctttgctttgtctacccgctggactttgctaggaccaggttggctg
ctgatgtgggcaagggcgccgcccagcgtgagttccatggtctgggcgac
tgtatcatcaagatcttcaagtctgatggcctgagggggctctaccaggg
tttcaacgtctctgtccaaggcatcattatctatagagctgcctacttcg
gagtctatgatactgccaagggtgagagaggggcatcggggagaaggagg
gtggtgtggaaagaggatcctatgggatctataactcacaaaggacctga
tatatattgatcttgttttttctagtctctgggataattgaggcttctga
atgaggaggtgatgtgcataagttaatagctgaagcgttccttgtgtcct
ctactgaaataaactctggcctttagttattcagagaggaggagggggga
gcctgtctccctctagacacagccatagcagttactgagtttaacttgaa
gccacttccaatgccctgtatacaagctgagcactgcccctccggggtcc
ggagagggcagcagccacctttgctgtctgcctggtcatatgtgaagcac
ctgcacaggggcaggttccccgcaaggtcagagcatggagctggaggtgc
agtggcctctctccctccacctgctttctgctgagaacaggcacttcata
gccgttcggcttctgggctctgtccacagggatgctgcctgaccccaaga
acgtgcacatttttgtgagctggatgattgcccagagtgtgacggcagtc
gcagggctggtgtcctacccctttgacactgttcgtcgtagaatgatgat
gcagtccggccggaaagggggtaagcttgtgctctactcatctaaacttg
tttggttttgcccgaggagaacattttacagggctcctttcagtcttcct
tactggaaattaattttcaaaattatttgataaggacttagggaagaaag
atggtattaattccccctaacgttctcaactatcctattagggaaaagta
ttttccattttattagagatgataagaacatgaatagtaagacatttaga
tgtgaatttaactaggtatccagcattatagagaccctaggccctcttcc
cttagagcctgggtgcaaaagctagggaaaagaagtagttagctacttct
tacaaagaactcttgcttccctcctagttacaggtgttagtgggatgggg
tgtttagctgggtagagatggcctgaagcaatctgttgtgccagagaaag
ttttggcttctataggttgaaccatatgaaattgccactttaaaagtcaa
aaacagtccaatgttagcagtttcgtatgtttcaacgaatagttacagcc
ttttatttagactgcataacctcgtgcaggatcatctgaggctcagcctc
agttcggtcctccataaaaaaaggtaaccgcgtagcataatactcctgct
ccactgcgcccttcttgtttcgcagttgggcagtccatgaattacttggt
taattgccccagttcttcactgaccttgaactaatggagtaggaatgaca
ggagacccagcctgccagtgaagcaaggaaggagatgtccagtgggatgt
tgcatggagctgggactccatgcccagatgaccctgattttataaaactg
gtaacagtgtgtacagatatgtttcaggggaaaagtctctttcctccagc
gttacggagccctcaccagcatttgtttccacagccgatattatgtacac
ggggacagttgactgctggaggaagattgcaaaagacgaaggagccaagg
ccttcttcaaaggtgcctggtccaatgtgctgagaggcatgggcggtgct
tttgtattggtgttgtatgatgagatcaaaaaatatgtctaatgtaatta
aaacacaagttcacagatttacatgaacttgatctacaagttcacagatc
cattgtgtggtttaatagactattcctaggggaagtaaaaagatctggga
taaaaccagactgaaggaatacctcagaagagatgcttcattgagtgttc
attaaaccacacatgtattttgtatttattttacatttaaattcccacag
caaatagaaaataatttatcatacttgtacaattaactgaagaattgata
ataactgaatgtgaaacatcaataaagaccacttaatgcacgctttctat
tttattgaactcttattaactgtaaaatgcatttttaaaagatcaaaaat
gcatattttctagcatgattcatgtatcagtcagcagccaagcttctaaa
tgccagatattatattgagaatgtattatatgagaacgtacaatgcttaa
agttccggttttcaaacttaggcaggtcatattctatctatcttatccag
cgttactgtaggctagaaagtgataatggctttcataatcctgccttgtc
ttaggcactttcctgcag
87Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
Strategy
- Assume that encoded protein is in mitochondria
- Protein has function associated with
mitochondrial location?
- Use Gene finder to identify protein sequence(s)
- Use Similarity finder to identify possible
function
Do it
- Protein has structure associated with
mitochondrial location?
- Use Feature finders to identify pertinent regions
- (What ARE pertinent regions?)
89Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
Run 4q34 region through FGeneSH
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg

Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human genomic DNA
Time: Wed Feb 27 16:59:14 2002
Seq name: PEO-related_gene?
Length of sequence: 5768  GC content: 48  Zone: 2
Positions of predicted genes and exons:
 G  Str  Feature    Start    End     Score    ORF            Len
 1       TSS         1216            -2.70
 1       1 CDSf      1607 - 1717     18.01    1607 - 1717    111
 1       2 CDSi      2985 - 3471     52.41    2985 - 3470    486
 1       3 CDSi      3980 - 4120     20.99    3982 - 4119    138
 1       4 CDSl      5035 - 5192      2.32    5037 - 5192    156
 1       PolA        5471             0.92
Predicted protein(s):
>FGENESH 1 4 exon (s) 1607 - 5192 298 aa, chain
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS
VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM
QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV
/ 3
Translated message
?
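The link between the predicted exons and the 298-residue protein is simple arithmetic: add up the exon lengths and divide by 3. The sketch below uses the four CDS coordinates from the FGENESH table. The `splice` helper and its toy sequence are assumptions for illustration, and the calculation assumes the last exon includes the stop codon, which the numbers are consistent with.

```python
# Exon coordinates (1-based, inclusive) from the FGENESH prediction.
EXONS = [(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)]


def coding_length(exons):
    """Total number of coding nucleotides across all exons."""
    return sum(end - start + 1 for start, end in exons)


def splice(genomic, exons):
    """Join the exon segments of a genomic sequence (1-based coordinates)."""
    return "".join(genomic[start - 1:end] for start, end in exons)


if __name__ == "__main__":
    nucleotides = coding_length(EXONS)
    codons, leftover = divmod(nucleotides, 3)
    print(nucleotides, "nt =", codons, "codons, remainder", leftover)
    print("amino acids (one codon is the stop):", codons - 1)
    # Toy example of splicing: two 3-base exons in a 14-base sequence.
    print(splice("NNATGNNNNTAANN", [(3, 5), (10, 12)]))
```

A remainder other than zero would be a warning sign that an exon boundary is off, since the spliced message has to stay in frame.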
90How to decide where exons are?
Strategy
Do it
- Compare sequence of 4q34 region to sequence of mRNA
  - Sequence of mRNA may be in cDNA library
  - Expressed Sequence Tag (EST) library
Problems
- Library may not exist
- Expression of gene may be low
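Comparing an mRNA (cDNA) sequence with the genomic region locates exons directly: the stretches of genomic sequence that reappear, in order, in the message are exons, and whatever lies between them is intron. The sketch below is a greedy, exact-match version of that idea. The function name, the seed length, and the toy sequences are assumptions for illustration; real comparisons (such as BlastN against ESTs) tolerate mismatches and sequencing errors.

```python
def map_cdna_to_genome(cdna, genome, seed=8):
    """Return exon coordinates (1-based, inclusive) on the genomic sequence.

    Greedy sketch: find the next `seed` bases of the cDNA downstream in
    the genome, extend the exact match as far as it goes, and repeat with
    whatever cDNA is left.
    """
    exons = []
    i = 0          # position reached in the cDNA
    g = 0          # position reached in the genome
    while i < len(cdna):
        hit = genome.find(cdna[i:i + seed], g)
        if hit < 0:
            break                          # rest of cDNA not found
        n = 0
        while (i + n < len(cdna) and hit + n < len(genome)
               and cdna[i + n] == genome[hit + n]):
            n += 1
        exons.append((hit + 1, hit + n))
        i += n
        g = hit + n
    return exons


if __name__ == "__main__":
    exon1, intron, exon2 = "ATGGCCAAGC", "GTAAGTTTTTCAG", "TGGACCTGAA"
    genome = "TTTT" + exon1 + intron + exon2 + "CCCC"
    print(map_cdna_to_genome(exon1 + exon2, genome))
```

Because extension is greedy, an exon can be reported a base or two too long when the intron happens to begin with the same bases as the next exon. That is one more reason to check predicted boundaries by eye.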
93Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
Run 4q34 region through BlastN (vs. human ESTs)
MORAL: Trust, but verify.
94Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
Strategy
- Assume that encoded protein is in mitochondria
- Protein has function associated with
mitochondrial location?
?
- Use Gene finder to identify protein sequence(s)
- Use Similarity finder to identify possible
function
- Protein has structure associated with
mitochondrial location?
- Use Feature finders to identify pertinent structures
- (What ARE pertinent structures?)
95Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
Run 4q34 region through BlastP
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg

Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human genomic DNA
Time: Wed Feb 27 16:59:14 2002
Seq name: PEO-related_gene?
Length of sequence: 5768  GC content: 48  Zone: 2
Positions of predicted genes and exons:
 G  Str  Feature    Start    End     Score    ORF            Len
 1       TSS         1216            -2.70
 1       1 CDSf      1607 - 1717     18.01    1607 - 1717    111
 1       2 CDSi      2985 - 3471     52.41    2985 - 3470    486
 1       3 CDSi      3980 - 4120     20.99    3982 - 4119    138
 1       4 CDSl      5035 - 5192      2.32    5037 - 5192    156
 1       PolA        5471             0.92
Predicted protein(s):
>FGENESH 1 4 exon (s) 1607 - 5192 298 aa, chain
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS
VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM
QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV
Translated message
96Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
Strategy
- Assume that encoded protein is in mitochondria
- Protein has function associated with
mitochondrial location?
?
- Use Gene finder to identify protein sequence(s)
- Use Similarity finder to identify possible
function
?
- Protein has structure associated with
mitochondrial location?
- Use Feature finders to identify pertinent structures
- (What ARE pertinent structures?)
97Surrogate FiltersScenario III Case of the
Mortal Mitochondrion
- Progressive External Ophthalmoplegia (PEO)
- Slow paralysis of voluntary eye muscles
- Many other symptoms (e.g., frequent deafness)
- Loss of mitochondrial DNA
- Inheritance
- Mendelian
- Autosomal dominant
- Linked to chromosome 4q34
982nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
992nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
1002nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
Escherichia coli . . .
haemorrhagic colitis
1012nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
E. coli K12
E. coli O157:H7
1022nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
E. coli K12
E. coli O157:H7
1032nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
1042nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
E. coli K12
E. coli O157:H7
What tool to use?
Go to http://www.vcu.edu/elhaij/BioInf
1052nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
E. coli K12
E. coli O157:H7
ASSIGN K12-set FROM Gene-finder (K12-DNA)
ASSIGN O157-set FROM Gene-finder (O157-DNA)
CONSIDER EACH protein IN O157-set
  WHEN Constituent-of (K12-set, protein) = FALSE
  COLLECT protein
1062nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
E. coli K12
E. coli O157:H7
FUNCTION Constituent-of (set, item)
  CONSIDER EACH protein IN set
    WHEN protein = item
      RETURN TRUE
    OTHERWISE RETURN FALSE
1072nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
E. coli K12
E. coli O157:H7
FUNCTION Constituent-of (set, item)
  CONSIDER EACH protein IN set
    WHEN protein = item
      RETURN TRUE
    FINALLY RETURN FALSE
1082nd Most Powerful ToolScenario IV Case of the
Lethal Look-alike
E. coli K12
E. coli O157:H7
ASSIGN K12-set FROM Gene-finder (K12-DNA)
ASSIGN O157-set FROM Gene-finder (O157-DNA)
CONSIDER EACH protein IN O157-set
  WHEN Constituent-of (K12-set, protein) = FALSE
  COLLECT protein
FUNCTION Constituent-of (set, item)
CONSIDER EACH
FINALLY RETURN FALSE
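The pseudocode translates almost line for line into a real language. The Python sketch below includes both versions of Constituent-of: the OTHERWISE version, which gives up after looking at the first protein, and the FINALLY version, which answers FALSE only after every protein has been considered. The protein names are placeholders, and exact equality stands in for what would really be a similarity search.

```python
def constituent_of_otherwise(collection, item):
    """OTHERWISE version: wrong, decides after the first protein."""
    for protein in collection:
        if protein == item:
            return True
        else:
            return False        # never reaches the second protein
    return False


def constituent_of(collection, item):
    """FINALLY version: FALSE only once the whole set has been checked."""
    for protein in collection:
        if protein == item:
            return True
    return False


def unique_to(first_set, second_set):
    """COLLECT each protein of first_set that is not in second_set."""
    return [protein for protein in first_set
            if not constituent_of(second_set, protein)]


if __name__ == "__main__":
    # Placeholder gene-finder output, for illustration only.
    k12_set = ["thrA", "lacZ", "recA"]
    o157_set = ["lacZ", "stx1A", "recA", "eae"]
    print(unique_to(o157_set, k12_set))
    print(constituent_of_otherwise(k12_set, "lacZ"))   # misses lacZ
```

With the OTHERWISE version, nearly every O157:H7 protein would be collected as unique, a result that should set off the nonsense detector before any biology is read into it.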
1092nd Most Powerful Tool
Computer programming
110FIRST Most Powerful Tool
Your brain
- Keep your nonsense detector on high alert
111FIRST Most Powerful Tool
Your brain
- Keep your nonsense detector on high alert
- Appreciate the limitations of bioinformatic tools
112FIRST Most Powerful Tool
Your brain
- Keep your nonsense detector on high alert
- Appreciate the limitations of bioinformatic tools
- Look out for surprises in the underlying data