Title: Welcome to BNFO 601
1Welcome to BNFO 601: Integrated Bioinformatics
The villains gallery
Paul Fawcett pfawcett_at_vcu.edu
Jeff Elhai elhaij_at_vcu.edu
2Welcome to BNFO 601: Course Organization
www.vcu.edu/csbc/bnfo601
- Scientific problems -> bioinformatic solutions!
- For each topic
  - Lecture and web supplements
  - Discussion and computer time
  - Problem sets
Focus on principles, not soon-to-be-obsolete software!
3Optional Textbook BNFO 601
Beginning Perl by Simon Cozens and Peter Wainwright. 700 pages. Wrox Press Inc.
Also available for free on the web! http://learn.perl.org/library/beginning_perl
4What is bioinformatics, anyway?
One reasonable definition: the study and application of computational and statistical methods for the management and analysis of biological information.
This is a very broad definition - therefore there
are many flavours of bioinformatics
5Why do we need bioinformatics?
Year    Base pairs      Sequences
1982    680338          606
1983    2274029         2427
1984    3368765         4175
1985    5204420         5700
1986    9615371         9978
1987    15514776        14584
1988    23800000        20579
1989    34762585        28791
1990    49179285        39533
1991    71947426        55627
1992    101008486       78608
1993    157152442       143492
1994    217102462       215273
1995    384939485       555694
1996    651972984       1021211
1997    1160300687      1765847
1998    2008761784      2837897
1999    3841163011      4864570
2000    11101066288     10106023
2001    15849921438     14976310
2002    28507990166     22318883
Available biological data is growing exponentially!
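One way to check the "exponential" claim is to fit a straight line to the logarithm of the yearly base-pair totals listed above and read off a doubling time. The sketch below is only an illustration in Python: the function name `doubling_time_years` is made up for this example, and an ordinary least-squares fit is an assumption about how one might summarise the trend, not something taken from the slides.

```python
import math

# (year, base pairs) pairs copied from the figures on this slide
GROWTH = [
    (1982, 680338), (1983, 2274029), (1984, 3368765), (1985, 5204420),
    (1986, 9615371), (1987, 15514776), (1988, 23800000), (1989, 34762585),
    (1990, 49179285), (1991, 71947426), (1992, 101008486), (1993, 157152442),
    (1994, 217102462), (1995, 384939485), (1996, 651972984),
    (1997, 1160300687), (1998, 2008761784), (1999, 3841163011),
    (2000, 11101066288), (2001, 15849921438), (2002, 28507990166),
]


def doubling_time_years(points):
    """Fit log2(base pairs) against year by least squares.

    The slope is doublings per year, so its inverse is years per doubling.
    """
    years = [year for year, _ in points]
    logs = [math.log2(base_pairs) for _, base_pairs in points]
    mean_year = sum(years) / len(years)
    mean_log = sum(logs) / len(logs)
    covariance = sum((y - mean_year) * (l - mean_log) for y, l in zip(years, logs))
    variance = sum((y - mean_year) ** 2 for y in years)
    return variance / covariance


print(f"Estimated doubling time: {doubling_time_years(GROWTH):.2f} years")
```

If growth were merely linear, the points would curve away from the fitted line on a log scale; a roughly constant doubling time is what "exponential" means here.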
6Why do we need bioinformatics?
Humans have a hard time finding patterns in large
or complex data sets!
7MDVEEFLSRVDAGELVISLGDLSGAILSEVDLSGINLSGANLSGLWKNLS
TILSNTLWDIKEADALATIREIQDESNRAHALIALADKISLPPDLLSEAL
TVARVDEADCADALIALARKLPPDLLSEALATAAEREIQDEYFRTSTLIE
LKLPSVLSEALAAAREIQDEYFRASTLIADEYLAEKLPSVLSEALAASRE
IQFRADALRELAQKLPPDLLSEALAAVREIQPEYLRADALIALVEKLPSV
LSEALAAIREIQDEYLHADALRELVQKLPPDLLGEVLAAATEIRGGYPHT
NPLRELAEKLPPDLLSEALAAAREIQDESNRAHALRELAEKLPPDLLSEA
LTATREIQSEYHRASTLRALAQKLPPDLLSEALAAAREIQDESNRASTLR
ELAEKLPSVLPEALAAVRKIRHKSNRAYGLIALAEKLPSVLPEALAAATE
IEPEYHRASTLRELAEKLPPDLLSELTAISEIQPKSNRADALIALAEKLP
PDLLSEALAAIREIQDESNRAHALIALAEKLPPDLLSEALAAIREIQDES
NRAHALIALAQKLPPDLLSEALAATREIQSKSNRVHALIALAQKLPSVLP
EALAAATEIQDESNRASTLRELAEKLPPDLLSEALAAIREIQPKSNRVHA
LIALAQKLPSVLPEALAAIREIHHEYHRDNALRELAEKLPPNLLSEALAV
IREIHYESNRTNALIALAKKLPSVLPEALAAVRKIRDKSNRIYALRELAD
KLPSVLPEALATAREIHDESYRADALKELAEKLPPDLLSEALTAIREIHD
ESYRADALIALAEKLPSVLPEALAAATVIRPESYRADALRDLAQKLPPDL
LSEALAAIREIQSESNRAHALIALAEKMSLHNPSLSNVSANCVNLNHSTL
TEAKLNQSDLRYGNLKGANLNKANLSRAFLNHADLSNTMLAQSNLSGTNL
RNANLRNANLIMREEIRKVNQSLGESPKFGPFTGRQFVIFAGIFCIVFGL
LCLIIGLDIFWGLGFAFWSSFSVALLSGDQPYIYWSKVYPIVPRWTRGYA
TYTSPHLKKKVGTRKVKLTRSSKPKTLNPFEDWLDLTTIVRLKKDAYTVG
AYLLSKKNLTDSNNTLQLIFGFSCTGIHPLFNSEQEIEAVAKIFESGCKE
IPPGEKITFRWSSFCDDSDAEQYLMQRINNSSSLECEFLDWGRLARTQKL
TNQRARKDIKLNIYWSFTVSSEALETSDPVDKFLAKLANFVQRRFTDSGV
NQLTKKRFTQILTKALEASLRYQQILTEMGLNPQPKTDKDLWQELCKNIG
AKTVIAPHTLVFDEQGVREEIDEKAVFDKPIEIINQPHLSSIILNNGVPF
ADKRWICLPTGENKKFVGVMVLTRKPEIFASTKHQIRFLWDLFSRNNIFD
VEIITEFSPADRGITRAAQQMITKRSRALDLNVQQKKSIDVSAQINVERS
VEAQRQLYTGDVPLNLSLVVLVYRDTPEEIDDACRLISGYISQPTELTRE
VEYAWLIWLQTLLIRLEPILLRPYNRRLTFFASEILGLTNIVQNSPADEQ
GFELIADESDSPLHLDLSKTKNILILGTTGSGKSVLVSSIIGECQAQDMS
VLMIDLPNDDGTGTFGDYTPYHNGFYFDISKESNNLVQPLDLSKIPPDEW
EDRLQAHRNDVNLIVLQLVLGSQTFDGFLSQTIESLIPLGTKAFYDHADI
QRRFAKAKKDGLGSAAWDDTPTLADMERFFSKEHISLGYEDENVDRALNY
IRLRFQYWRNSSIGNAICRPSTFDTDAKLITFALTNLQSSKDAEVFGMSA
YIAASRQSLSAPNSVFFMDEASVLLRFAALSRLVGRKCATARKGGCRVML
AAQDILSIANSEAGEQILQNMPCRLIGRIVPGAAKSFTEHLGIPKDIIDK
NESFRPNIKQLYTLWLLDYNN
Biological data can be confusing!
But it is rich in information content!
9Where did bioinformatics come from?
- Evolved from, but is distinct from, the intellectual traditions of:
  - Genetics
  - Biochemistry
  - Molecular Biology
  - Computer Science
  - Probability & Statistics
  - Genomics
10Pre-genomic Molecular Biology
The cell as a factory
15Pre-genomic Molecular Biology
The cell as a Black box
16Pre-genomic Molecular Biology
How do we figure out how cars are made?
Genetic approach
Biochemical approach
20Pre-genomic Molecular Biology: Biochemist's Approach
An inherently reductionist approach!
24Pre-genomic Molecular Biology: Geneticist's Approach
Isolation of a Defective Gene
25Pre-genomic Molecular Biology: How we viewed the world
- Highly filtered perception
- Subject to ascertainment bias
26Post-genomic Molecular Biology
A major goal is to achieve a synoptic, integrated
understanding of cell function
27Post-genomic Molecular Biology: Bioinformaticist's Approach (short term)
28Post-genomic Molecular Biology: Bioinformaticist's Approach (long term)
29 What is Bioinformatics?
30TGAGACACATATTTTTGATATTCCAGTTGTTGCAATC GAATGTAAAACA
TATTTAGATCTTTAAATGTATGGTAC ATTCAAGATCCAACCTTCATTCT
AGTGTTTAAAGAGAAC TGATTTGTTTGCAGGGGCAGGAGGCTTTGGTTT
AGGTTTTG AAATGGCAGGCTTCTCTGTACCTTTATCTGTTGAAATTGAT
ACCTGGGCTTGTGATACACTACGCTACAACCGCCCTGATTCAACAGTTAT
TCAAAATGATATCGGTAACTTTAGTACAGAAAATGACGTTAAGAATATCT
GCAACTTTAAACCTGATATTATTATTGGCGGGCCTCCATGCCAGGGATTT
AGTATTGCTGGGCCAGCCCAAAAAGATCCTAAAGATCCTAGAAATGG AA
TTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAATCCTCATAA
AACT TTTATTCATCAACTTTGCACAATGGATAAAATTTCTTGAACCTAA
AGCGTTTGTCATGGAAAACGTAAAAGGATTGCTATCAAGGAAAAATGCAG
AAGGTTTTAAAGTTATAGATA CTTCTCACTAAATATAAAGATTTTTTAG
ATCAGCAGCATTATGCAGAAAAATTTGATTCA AGACGACGGTACTGGTT
TAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCT TT
ATTAAGAAAACATTTGAAGAACTTGGTTATTTTGTCGAAGTATGGGTTTT
AAATGCTGCGGAATATGGCATTCCGCAAATTAGAGAACGTATTTTTATTG
TTGGCAATAAAAAAGGTAAAGTACTAGGTATGAGTATTATACCTGCACTA
ACTTTGTGGGACGCAATATCAGACTTACCAGAACTTAATGCGCGTGAAGG
AAGTGAAGAGCAACCCTATCATTTAAAACCTCAAAATACTTATCAGACTT
GGGCTAGAAATGGTAGTGCTACGCTTTACAATCATGTTGCAATGGAACAT
TCTGACCGTTTAGTAGAACGTTTCCGGCATATAAAATGGGGTGAATCCAG
TTCGGATGTATCTAAAGAACATGGAGCTAGACGACGTAGTGGTAATGGTG
AATTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAATCCTCAT
AAACCGTCTCACACTATTGCTGCGTCATTCTATGCTAATTTTGTCCATCC
TTTTCAACATCGAAATTTAACAGCCCGTGAAGGAGCTAGAATCCAATCTT
TTCCAGATAACTATAGATTTTTTGGAAAAAAAACTGTCGTATCTCATAAA
CTATTGCATCGAGAAGAAAGATTTGATGAAAAATTTCTTTGTCAATATAA
TCAAATCGGTAATGCTGTACCCCCTCTTCTCGCTAAAGTAATTGCACATC
ATCTTCTAGAGAAATTAGAGTTATGCCAACAACTGATAGAAATCCTCTAG
TGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAATACAGA
GATACTGAAAGCAGGACTTTCCTTAGAGAAATCAGAACTGAATATGACAA
ATGGCATAAAGCAAATATGAACCTGGTTGGACCAAAATCAGAAATTACTG
ACCAAGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATAT
AAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATC
CAACCTTCATTCTAGTGTTTTAGAGACCATTTATAAAGTAAATCTTTAGA
CGACTAGACGACGTAGCATAATACGAGTCATAACGGCATATATGGCAGCC
TCACTCATTTCTGGGAGACGCTCATAATCCTTACTGAGACGACGGTACTG
GTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCT
GAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCATC
AGCCAAACAGAGAGCGCAAATTTATCACCGTCATAGCCGGAATCAACCCA
GATGACTTCAACTTTTTCCAGTAATTCTGGACGCTCTTCTAACAGTTCCA
TCAAAGTATAGGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACA
ACCACTTTTAACAAAAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCG
TCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTC
AGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTG
ACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCA
GTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGACAACCTGT
TTTCAGATGGTAGTAGATAGCGTTGCATACTTCTCGCATATCAGTTGTTC
GGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCC
CATTCTGAGTCATTAAGGTCTGTAGAATAAGACTTTCGTCTCATTGTTTC
CTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTT
AGCTCTCCTTTAGATTTACTTTATAAATAGCCTCTTAGAAGAATTTCTTT
ATTATTTATTTAAAGATTTAGTACAAGATTTCGGGCAGAACGCTCTTATT
GGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTC
TGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATT
TCGTAATTGGTGCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCA
GAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAG
CGAGAAATCCTAACAGTTTATACCTTGTGGTTATGGAATGGATAAAACTG
ACCAATGATGTAAATTTACGAAAATATAAAGTTGATCAAATTTATGTACT
ACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACA
TAAAAAAT
What is genomic data?
31Partial Hierarchy of Genomic Data
- DNA Sequences
- Contigs of assembled sequences
- Predicted introns, exons, promoters, etc.
- Genes
- RNA sequences
- Predicted gene products, proteins
- Chromosomes
- Genome
Sequence Analysis is therefore a fundamental
component of bioinformatics!
33E. coli: What makes it kill?
Escherichia coli . . .
haemorrhagic colitis
34E. coli: What makes it kill?
E. coli K-12
E. coli O157:H7
37Metabolomics
What is Bioinformatics?
38Towards a Treatment for Sleeping Sickness
Prevalence: 66 million sufferers
Standard treatment: derivative of arsenic
39Towards a Treatment for Sleeping Sickness
Trypanosomes: dependent on glycolysis
Humans: dependent on glycolysis OR oxidative metabolism
IDEA: Identify a drug that selectively blocks glycolysis
40Towards a Treatment for Sleeping Sickness
How to block glycolysis?
Need a method to predict effectiveness!!
41Towards a Treatment for Sleeping Sickness
Glucose + ATP -> Glucose-6-phosphate + ADP (enzyme: hexokinase)
$d[\mathrm{G6P}]/dt = k_3\,[\mathrm{glucose}]\,[\mathrm{ATP}]$
42Towards a Treatment for Sleeping Sickness
Glucose + ATP -> Glucose-6-phosphate + ADP (enzyme: hexokinase)
Model of glycolysis
$d[\mathrm{G6P}]/dt = k_3\,[\mathrm{glucose}]\,[\mathrm{ATP}]$
$d[\mathrm{F6P}]/dt = k_4\,[\mathrm{G6P}]$
$d[\mathrm{FDP}]/dt = k_6\,[\mathrm{F6P}]\,[\mathrm{ATP}]$
...
$d[\mathrm{pyruvate}]/dt = k_{20}\,[\mathrm{PEP}]\,[\mathrm{ADP}]$
44Towards a Treatment for Sleeping Sickness
Run model with different realities
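"Running the model with different realities" can be pictured as integrating the rate equations from the preceding slides under different parameter settings. The Python sketch below is a deliberately tiny, hypothetical version: it keeps only the first two steps, holds glucose and ATP constant, adds the consumption of G6P by the second step (which the slide's equations leave implicit), and uses invented rate constants. Lowering `k3` stands in for a drug that partially blocks hexokinase.

```python
def simulate(k3, k4, glucose=5.0, atp=1.0, dt=0.001, steps=10000):
    """Euler integration of a two-step toy model of early glycolysis.

    Assumptions (not from the slides): glucose and ATP are held constant,
    G6P is consumed by the second reaction, and all numbers are made up.
    Returns the final (G6P, F6P) concentrations in arbitrary units.
    """
    g6p = 0.0
    f6p = 0.0
    for _ in range(steps):
        v_hexokinase = k3 * glucose * atp   # rate of the first step
        v_isomerase = k4 * g6p              # rate of the second step
        g6p += (v_hexokinase - v_isomerase) * dt
        f6p += v_isomerase * dt
    return g6p, f6p


# Two "realities": an untreated cell and one whose hexokinase is 90% blocked.
untreated = simulate(k3=1.0, k4=2.0)
treated = simulate(k3=0.1, k4=2.0)
print("F6P formed, untreated:", round(untreated[1], 3))
print("F6P formed, treated:  ", round(treated[1], 3))
```

A realistic model would track every intermediate and cofactor; the point of the sketch is only that a candidate drug can be tested on the model before it is tested on the organism.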
45Metabolomics
What is bioinformatics?
46What is bioinformatics, revisited
How to extract biological meaning from overwhelming information
47A Walk in the Forest
Photo courtesy of www.webshots.com
48Observation
Photos courtesy of www.webshots.com and Peter
Smallwood
52Experiment
Photos courtesy of www.webshots.com and Peter
Smallwood
53Filters: Information reducers
A squirrel filter!
54Filters: Information reducers
A molecule filter
55Filters: Information reducers
How an organism is made; how an organism works
A sequence filter
56From Sequence to Organism: How does Nature do it?
57From Sequence to Organism: How does Nature do it?
Genetic code
Rules of folding
58From Sequence to Organism: How does Nature do it?
Genetic code
Gives us
59From Sequence to Organism: How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
Gives us
60From Sequence to Organism: How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
- Begin transcription
- End transcription
- Splice transcript
- Begin translation
Rules of transcriptional and post-transcriptional control
62From Sequence to Organism: How does Nature do it?
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding
Functional protein
DNA
63From Sequence to Organism: How can we do it?
Natural filters/transformations
Functional protein
DNA
Simulation of Nature
Surrogate Processes
64From Sequence to Organism: How can we do it?
Simulation of Nature
"Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune..."
"We must give our military every tool and weapon it needs to prevail..."
???
66From Sequence to Organism: How can we do it?
Surrogate Processes
"Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune..."
Utterance of W. Shakespeare
Utterance of George W. Bush
"We must give our military every tool and weapon it needs to prevail..."
Word frequency, words/sentence
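The two surrogate measures named on this slide, word frequency and words per sentence, are easy to compute without understanding the text at all, which is the point of a surrogate process. A minimal Python sketch follows; the function name `surrogate_measures` and the simple regular expressions used to split words and sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def surrogate_measures(text):
    """Return word frequencies and mean words per sentence for a text.

    Sentences are split on '.', '!' and '?'; words are runs of letters
    and apostrophes. Both rules are crude and chosen only for brevity.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "word_frequency": Counter(words),
        "words_per_sentence": len(words) / len(sentences),
    }


shakespeare = ("Whether 'tis nobler in the mind to suffer the slings "
               "and arrows of outrageous fortune.")
bush = "We must give our military every tool and weapon it needs to prevail."

for author, utterance in (("Shakespeare", shakespeare), ("Bush", bush)):
    stats = surrogate_measures(utterance)
    print(author, stats["words_per_sentence"],
          stats["word_frequency"].most_common(3))
```

With enough text from each author, statistics like these can attribute an unseen utterance without simulating how either speaker thinks.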
67From Sequence to Organism: How can we do it?
Surrogate filters
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding/function
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG
AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT
TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC
TGGATTTCGG AACTCTAGCC TGCCCCACTC
My sequence
68From Sequence to Organism: How can we do it?
- Surrogate filters
- Gene finders
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding/function
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Function?
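The peptide shown here is what the standard genetic code gives for the sequence ATGACTTATGATCAACGCACAGGGCTA from the earlier slides. As a small illustration, the Python sketch below builds the standard codon table and translates that sequence; the helper name `translate` is hypothetical, and the table is written in the conventional T, C, A, G ordering.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for all 64 codons, ordered TTT, TTC, TTA, TTG, TCT, ... GGG
# (standard genetic code; '*' marks a stop codon).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGACTTATGATCAACGCACAGGGCTA"))  # MTYDQRTGL
```

In one-letter code, MTYDQRTGL is Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu. Translation is the easy filter; predicting the function of the resulting protein is the hard one.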
69From Sequence to Organism: How can we do it?
- Surrogate filters
- Gene finders
- Natural filters/transformations
- Selective transcription
- Selective processing
- Translation
- Folding/function
globin?
globin
71Surrogate Filters: Gene finders
Start/Stop codon search
Look for stop codons (TAA, TAG, TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAA
TGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATT
TGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
Highly inaccurate
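A start/stop codon search of the kind described above can be sketched in a few lines of Python. The version below scans the three reading frames of one strand only and reports stretches that begin with ATG and end at the next in-frame stop codon; the function names are invented for this example, and a fuller search would also scan the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq, frame):
    """Yield (position, codon) pairs for one reading frame (0, 1 or 2)."""
    for i in range(frame, len(seq) - 2, 3):
        yield i, seq[i:i + 3]


def stop_codon_positions(seq, frame):
    """Positions of every stop codon in the given reading frame."""
    return [i for i, codon in codons(seq, frame) if codon in STOP_CODONS]


def open_reading_frames(seq):
    """Return (start, end) spans running from an ATG to the next in-frame stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i, codon in codons(seq, frame):
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3))
                start = None
    return orfs


slide_sequence = ("CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAA"
                  "TGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA")
for frame in range(3):
    print("frame", frame, "stops at", stop_codon_positions(slide_sequence, frame))
print("ORFs:", open_reading_frames(slide_sequence))
```

The method is inaccurate because stop codons and ATG triplets occur by chance, so short open reading frames appear everywhere, genes or not.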
72Surrogate Filters: Gene finders
Hidden Markov Model (HMM)-based recognition
73Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 1: Create model through extensive training set
AAA AAC AAG AAT ACA . . . TTG TTT
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC
AGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTAC
CATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTC
TGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCT
CCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACT
TCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAAC
AAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACT
TTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTA
GTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTG
AGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTA
ATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGT
GGAGATGATTCGGTAGCTTT
75Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
76Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model (rows: preceding three bases; columns: next base)
        A      C      G      T
AAA   0.33   0.25   0.12   0.30
AAC   0.30   0.20   0.15   0.35
AAG   0.35   0.15   0.20   0.30
AAT   0.30   0.15   0.20   0.25
ACA   0.25   0.20   0.15   0.35
. . .
TTG   0.25   0.30   0.15   0.30
TTT   0.30   0.25   0.10   0.35
Candidate gene: AAAGCAA
P(G after AAA) x P(C after AAG) = 0.12 x 0.15
77Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model (same table as above)
Candidate gene: AAAGCTA
0.12 x 0.15 x . . .
So far, not a good candidate!
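The two steps on these slides, training on known genes and then scoring a candidate, can be sketched in Python as follows. Note that this is a plain fixed-order Markov chain (each base conditioned on the three before it), which is the table the slide shows, rather than a full hidden Markov model. The function names and the pseudocount of 1 are assumptions made for the example.

```python
from collections import defaultdict


def train_markov(training_seqs, order=3, pseudocount=1.0):
    """Step 1: estimate P(next base | previous `order` bases) from sequences.

    A pseudocount (an assumption of this sketch) keeps unseen
    transitions from getting probability zero.
    """
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for seq in training_seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {base: n / total for base, n in row.items()}
    return model


def sequence_probability(model, candidate, order=3):
    """Step 2: multiply the transition probabilities along a candidate.

    Contexts missing from the model fall back to 0.25 per base.
    """
    probability = 1.0
    for i in range(order, len(candidate)):
        row = model.get(candidate[i - order:i], {})
        probability *= row.get(candidate[i], 0.25)
    return probability


# The two table rows used on the slide, entered by hand.
SLIDE_ROWS = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}
print(sequence_probability(SLIDE_ROWS, "AAAGC"))  # 0.12 * 0.15

toy_model = train_markov(["AAAGAAAGAAAG"])
print(toy_model["AAA"])
```

In practice the products become vanishingly small for gene-length sequences, so scores are usually summed as logarithms and compared against a background model of non-coding DNA.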
78Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
Candidate genes
Predicted genes
79Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
Conform to standard model
Challenge accepted beliefs
Candidate genes
Predicted genes
80Computers are an ideal tool
81The Crisis in Bioinformatics
1. Need high-level filters
2. Need access to raw phenomena
3. Need new tools for new phenomena
4. Need ability to build new tools
Need a new generation!!
82AATAAAGCTTTACAAACCAAACTCTGGCTTCAATTGTGTAACCCAAGCTT
TGATTCTTTCCTCTGTTAAATCGGATTGATTATCTTCATCAAGGGCAAGA
CCTACAAATTTACCATCACGAACAGCTTTAGACTCACTGAATTCATAACC
TTCTGTAGGCCAATAGCCAACTGTTTCACCACCATTTTCTGAAATTTTTT
CCTCTAGAATACCGAGGGCATCTTGAAATGTATCAGGATAACCAACCTGG
TCTCCAGGAGCAAAATAAGCAACTTTTTTGCCGATGAAGTCAATGTTATC
TAACTCATCATAAAAATTTTCCCAATCACTTTGCAATTCTCCAACATTCC
AGGTAGGACAACCAACAACGATATAATCGTAGTTATTGAAATCACTTGGT
TCAGCTTGTGAAATATCATATAAAGTTACAACACTATCACCACCAAACTC
CTTCTGAATTATTTCTGATTCAGTTTGGGTATTGCCTGTTTGAGTACCAA
AAAATAAACCAATATTAGACATTTTTACTCCTTTTATGTATTTGCAAAAT
TATTTCAATTAAAATATTTAGTAATAATTAATTGTTAGCTAGCTAATAAT
TAAATTTTTATTACAATCATTGTAAAAGGCATTGAAAAAGTAAATAAAAA
TTTTTATTCTACGTTATTTCAAAAATATTTACTTACATATACTTAACCTT
TATAGTGATGTAATATACTCTAATTCCTATTTTACTTATAAATACCATCT
CAGCTTAATGTAACGAATTTTTCTGTTTATCTTTAAATACAAAAAATTCA
ACAAAACTACAGAAAATTAATCTTAATAACACAAAACAAGTATCAATCTG
TAATACAACTAAGCTTAAATAAATTAATAGAAAGCTTCATCTATCTAATA
GGTTGAGAATAGTTTATGTCTAATGACATAAATTCATTCGTGTTGATTTC
ATTTGGGTATATTCATCTGATTTAGGATTTACTCCATTAAGTTTGTACTC
ATCAATGCCCGCCTGTTGGTATCCACAATTCTCATACAGTGCGCGAGCAA
AGTAATCAATCGTTCGTCGCCATATCTAACTTTGAGTCAAACAAACCAGT
TGGATTACCAACCCTCAACTAATCGCTTCTTTAAGGCGAGCGATCGCACA
TTTAACTGTTGGTTGTCACAAGAGAACTAATACTACAGCAGTATATTTAA
CAACTAAGGGTGGTTCAACTTTCGCTGCGACTCCTCCAACGCGCTGAAAT
ACACAGGACTGATGCGATCGCAAACTCTTTGACTAAATTCCATACATTAT
CATGACCATCTCCCAAACAAACAAGTGGGTTAACCAGATGCTGACTATTA
ACATCCCCTGAGTTCGGAGTTGTAGGTCTATTTGACTGGTTCAAAGCGAT
GATGGAACGGCTTTGTTGCATGAATTAAAAAAAGACACACCATCACCTAC
TTCTAGGATAGACACATCAAACGTCCCACCGCCTAAGTCAAATACCAAGA
TAATTTCGTTAGTTTTCTTGTCAAGTCCGTAAGCGAGGGCCGCCGCCGTG
GGCTAGTTGATAATTCGCAGAACTTTAATCCCGGCAATTCTACTGGCATC
TTTGGTAGCCTGCCGTTGAGAGTCATTGAAATAGGCAGGGGTGGTAATTA
CCGCTTGCCTCACTGGTTCCCCCAGATATGTGCTGGCATCATCTATCAGC
TTGCGGACTACCTCATACCATTTCACGAAAAACCTGATACACATGTAAAC
TCTGAAACCCTTGCTGTATCAAAGTTTTGTAATTACGAATTACGAATTAC
GAATTGATATCAGCCGAGATTTCTTCGGGTGAAAATTCCTTGTTCAGAGC
GGGACAGTGTAGCTTGACATTGCCATTACTGTCACGTACCACTTTGTAAG
TAACTTGTTTTGCCTCTTGCGTAACTTCATCATACCTGCGCCCGATGAAC
CGCTTCACAGAATAAAAAGTGTTTTCTGGGTTCATTACACCCTGGCGCTT
Future Biology
83How to get there?
Bioinformatics
84How to get there?
The Challenge
- Some expert molecular biologists
- Some knowledgeable in the statistical arts
- Most have little experience with bioinformatic
tools
Overall goals of the course
99How to get there?
Overall goals of the course
Introduction to the questions and tools of
bioinformatics
- Through specific scientific scenarios
- Through consideration of how common tools work
- Through manipulation of the tools to solve
problems
- Through computer programming
100Can normal people program?
Sample problem: What's the probability of getting at least one pair when rolling five dice?
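For comparison with a simulated answer, the exact value can be worked out by counting the complement, assuming "at least one pair" means at least two of the five dice show the same face and the dice are fair:

```latex
\[
P(\text{at least one pair})
  = 1 - P(\text{all five faces different})
  = 1 - \frac{6 \cdot 5 \cdot 4 \cdot 3 \cdot 2}{6^{5}}
  = 1 - \frac{720}{7776}
  = \frac{49}{54} \approx 0.907
\]
```

A simulation with many trials should therefore report a fraction of successes close to 0.907.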
101Can normal people program?
#### MAIN PROGRAM
# Simulate the roll of many dice ($number_of_trials)
# Count successes (how many trials conditions are met)

my $successes = 0;
foreach my $trial (1..$number_of_trials) {
    roll_dice();
    if (any_matches()) { $successes = $successes + 1 }
}
print "Number of successes: ", $successes, "\n";
print "Number of trials: ", $number_of_trials, "\n";
print "Fraction successful: ", $successes/$number_of_trials, "\n";
102Can normal people program?
sub roll_dice {
    # Roll some number of dice, count ones, twos,... sixes
    $number_of_ones = 0;
    $number_of_twos = 0;
    $number_of_threes = 0;
    $number_of_fours = 0;
    $number_of_fives = 0;
    $number_of_sixes = 0;
    foreach my $roll (1..$number_of_dice) {
        my $die_value = random_integer(1,6);
        if ($die_value == 1) { $number_of_ones = $number_of_ones + 1 }
        if ($die_value == 2) { $number_of_twos = $number_of_twos + 1 }
        if ($die_value == 3) { $number_of_threes = $number_of_threes + 1 }
        if ($die_value == 4) { $number_of_fours = $number_of_fours + 1 }
        if ($die_value == 5) { $number_of_fives = $number_of_fives + 1 }
        if ($die_value == 6) { $number_of_sixes = $number_of_sixes + 1 }
    }
}
103Can normal people program?
#### CONSTANTS
my $number_of_trials = 10000;
my $number_of_dice = 5;
my $matches_wanted = 2;
104How to get there?
Computer programming
Goals of course
- Be able to understand well-written programs in Perl
- Be able to modify working programs
- Gain increasing skill in writing programs from scratch
105How to get there?
What do you do?
- Read notes before coming to class
- Respond to questionnaire by 7:00 AM, day of class
- Attend to problem set questions (particularly those you can't do)
- Serve as TA in area of your expertise