Title: Lecture 20: Bioinformatics (continued)
1. Lecture 20: Bioinformatics (continued)
2. Outline
- Genomics in general
- DNA sequencing method
- Human Genome Project
- What to learn from HGP
- Bioinformatics
  - predicting Open Reading Frames
  - comparing two sequences ---> information
3. Open Reading Frame Prediction
Let's use the YY1 gene as an example. Get the mRNA sequence for human YY1 from
(http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=960) ---> NM_003403.
Cut and paste the sequence into the window of an ORF prediction program:
1) http://searchlauncher.bcm.tmc.edu/seq-util/seq-util.html
2) http://www.ncbi.nlm.nih.gov/gorf/gorf.html

>hYY1, 1320 bases.
ccgcccgcccgcagccgaggagccgaggccg
ccgcggccgtggcggcggagccctcagccatggcctcgggcgacaccctc
tacatcgccacggacggctcggagatgccggccgagatcgtggagctgca
cgagatcgaggtggagaccatcccggtggagaccatcgagaccacagtgg
tgggcgaggaggaggaggaggacgacgacgacgaggacggcggcggtggc
gaccacggcggcgggggcggccacgggcacgccggccaccaccaccacca
ccatcaccaccaccaccacccgcccatgatcgctctgcagccgctggtca
ccgacgacccgacccaggtgcaccaccaccaggaggtgatcctggtgcag
acgcgcgaggaggtggtgggcggcgacgactcggacgggctgcgcgccga
ggacggcttcgaggatcagattctcatcccggtgcccgcgccggccggcg
gcgacgacgactacattgaacaaacgctggtcaccgtggcggcggccggc
aagagcggcggcggcggctcgtcgtcgtcgggaggcggccgcgtcaagaa
gggcggcggcaagaagagcggcaagaagagttacctcagcggcggggccg
gcgcggcgggcggcggcggcgccgacccgggcaacaagaagtgggagcag
aagcaggtgcagatcaagaccctggagggcgagttctcggtcaccatgtg
gtcctcagatgaaaaaaaagatattgaccatgagacagtggttgaagaac
agatcattggagagaactcacctcctgattattcagaatatatgacagga
aagaaacttcctcctggaggaatacctggcattgacctctcagatcccaa
acaactggcagaatttgctagaatgaagccaagaaaaattaaagaagatg
atgctccaagaacaatagcttgccctcataaaggctgcacaaagatgttc
agggataactcggccatgagaaaacatctgcacacccacggtcccagagt
ccacgtctgtgcagaatgtggcaaagcttttgttgagagttcaaaactaa
aacgacaccaactggttcatactggagagaagccctttcagtgcacgttc
gaaggctgtgggaaacgcttttcactggacttcaatttgcgcacacatgt
gcgaatccataccggagacaggccctatgtgtgccccttcgatggttgta
ataagaagtttgctcagtcaactaacctgaaatctcacatcttaacacat
gctaaggccaaaaacaaccagtgaaaagaagagagaaga
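
As a rough illustration of what an ORF prediction program does with the pasted sequence, the sketch below translates a DNA string in the three forward reading frames. It is a minimal sketch, not the code behind either web tool: the function name `translate_frames` is made up for this example, it assumes the standard genetic code, and it ignores the reverse strand. The demo input is the first 31 bases of the hYY1 sequence above.

```python
# Standard genetic code, built from the usual TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_frames(dna):
    """Translate dna in forward frames 1, 2 and 3; returns {frame: protein}."""
    dna = dna.upper().replace("U", "T")
    frames = {}
    for offset in range(3):
        # Step through complete codons only; a trailing partial codon is dropped.
        codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
        # Codons with unexpected letters (e.g. N) are shown as "X".
        frames[offset + 1] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames


# First 31 bases of the hYY1 mRNA shown above.
start_of_hyy1 = "ccgcccgcccgcagccgaggagccgaggccg"
for frame, protein in translate_frames(start_of_hyy1).items():
    print(f"frame{frame}: {protein}")
```

The three translations begin PPARSRGAEA, RPPAAEEPRP and ARPQPRSRG, which matches the starts of the frame1, frame2 and frame3 translations on the following slide.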
4. Three possible frames of YY1
Which one is the right ORF? Hints ---> use the rules: 1) ATG start codon and
STOP codon, and 2) the longest frame.

>hYY1, 1320 bases., frame3, 439 bases, 1139 checksum.
ARPQPRSRGRRGRGGGALSHGLGRHPLHRHGRLGDAGRDR
GAARDRGGDHPGGDHRDHSGGRGGGGGRRRRGRRRWRPRRRGRPRAR RP
PPPPPSPPPPPAHDRSAAAGHRRPDPGAPPPGGDPGADARGGGGRRRLGR
AARRGRLRGSDSHPGARAGRRRRRLHTNAGHRGG GRQERRRRLVVVGR
RPRQEGRRQEERQEELPQRRGRRGGRRRRRPGQQEVGAEAGADQDPGGRV
LGHHVVLRKKRYPDSGRTD HWRELTSLFRIYDRKETSSWRNTWH
PLRSQTTGRICNEAKKNRRCSKNNSLPSRLHKDVQGLGHEKTSA
HPRSQSPRLCR MWQSFCEFKTKTTPTGSYWREALSVHVRRLWETLFTG
LQFAHTCANPYRRQALCVPLRWLEVCSVNPEISHLNTCGQKQPVK
RREK

>hYY1, 1320 bases., frame2, 439 bases, E51 checksum.
RPPAAEEPRPPRPWRRSPQPWPRATPSTSPRTARRCRPRS
WSCTRSRWRPSRWRPSRPQWWARRRRRTTTTRTAAVATTAAGAATGT PA
TTTTTITTTTTRPSLCSRWSPTTRPRCTTTRRSWCRRARRWWAATTRT
GCAPRTASRIRFSSRCPRRPAATTTTLNKRWSPWR RPARAAAAARRRRE
AAASRRAAARRAARRVTSAAGPARRAAAAPTRATRSGSRSRCRSRPWRAS
SRSPCGPQMKKKILTMRQWLKNR SLERTHLLIIQNIQERNFLLEEYLA
LTSQIPNNWQNLLESQEKLKKMMLQEQLALIKAAQRCSGITRPENIC
TPTVPESTSVQ NVAKLLLRVQNNDTNWFILERSPFSARSKAVGNAFHW
TSICAHMCESIPETGPMCAPSMVVIRSLLSQLTNLTSHMLRPKTTSE
KKRE

>hYY1, 1320 bases., frame1, 440 bases, 149 checksum.
PPARSRGAEAAAAVAAEPSAMASGDTLYIATDGSEMPAEI
VELHEIEVETIPVETIETTVVGEEEEEDDDDEDGGGGDHGGGGGHGH AG
HHHHHHHHHHHPPMIALQPLVTDDPTQVHHHQEVILVQTREEVVGGDDSD
GLRAEDGFEDQILIPVPAPAGGDDDYIEQTLVTVA AAGKSGGGGSSSSG
GGRVKKGGGKKSGKKSYLSGGAGAAGGGGADPGNKKWEQKQVQIKTLEGE
FSVTMWSSDEKKDIDHETVVEEQ IIGENSPPDYSEYMTGKKLPPGGIPG
IDLSDPKQLAEFARMKPRKIKEDDAPRTIACPHKGCTKMFRDNSAMRKHL
HTHGPRVHVCA ECGKAFVESSKLKRHQLVHTGEKPFQCTFEGCGKRFSL
DFNLRTHVRIHTGDRPYVCPFDGCNKKFAQSTNLKSHILTHAKAKNNQ
KEERR
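
The two hints can be turned into a small search: in each forward frame, look for an ATG, read codons until the first in-frame STOP, and keep the longest stretch found. The sketch below does only that. The name `longest_orf` and the toy input are assumptions made for illustration, and real ORF finders also consider the reverse strand and other evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna):
    """Return (frame, start, end, orf) for the longest ATG...STOP stretch.

    frame is 1, 2 or 3; start and end are 0-based, end-exclusive positions
    in dna; orf includes the stop codon. Returns None if no ORF is found.
    """
    dna = dna.upper()
    best = None
    for offset in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(offset, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[2] - best[1]:
                    best = (offset + 1, start, end, dna[start:end])
                start = None  # keep looking for later ORFs in this frame
    return best


# Toy sequence (made up): a short ORF in frame 1 and a longer one in frame 3.
toy = "ccatggcctcgggcgactgaaatgaaataa"
print(longest_orf(toy))
```

Applied to the full hYY1 mRNA, a search like this would be expected to favour the frame1 translation, the one that contains MASGDTLY, but the final call still rests on the other hints as well.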
5. How to confirm the predicted ORF
- We can test the existence of the predicted ORF by searching for similar proteins in a protein database.
- The logic behind this approach:
  - many biological organisms have similar sets of protein machinery
  - if the predicted ORF is the right one, it should find similar sequences in the existing database
- Most popular sequence-searching program ---> BLAST (Basic Local Alignment Search Tool)
  - 1) you can use either DNA or protein sequences to find similar sequences in databases
- Demonstration using the three ORFs
6. ORF prediction with another program
- ORF Finder (Open Reading Frame Finder)
  - http://www.ncbi.nlm.nih.gov/gorf/gorf.
- These two approaches just provide several hints, and you (the human) still make the call on which one is right. (Use all the possible hints to predict the right one!)
  - exon-intron structure
  - ATG start and STOP codons
  - the longest one
7. Comparing two sequences
- Sequence alignment: a procedure for comparing two (pair-wise) or more (multiple) DNA or protein sequences by looking for a series of characters or patterns that are in the same order in the sequences.
- Sequence alignment can provide us with:
  - Functional information
  - Structural information
  - Evolutionary information
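
To make the idea of "characters in the same order" concrete, here is a minimal sketch of one classic pair-wise method, Needleman-Wunsch global alignment. The slides do not name this algorithm; it is used here only as an illustration, and the scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = global_align("ACGT", "ACT")
print(score)
print(top)
print(bottom)
```

When several alignments share the best score, this traceback reports just one of them (it prefers a match or mismatch over a gap).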
8. Comparing two sequences
Many sequence alignment approaches have been developed, ranging from dot plots
to BLAST.
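
A dot plot is simple enough to sketch in a few lines: put one sequence along the rows, the other along the columns, and mark every cell where the two match. The function name `dot_plot`, the `window` parameter and the text rendering are choices made for this example, not part of any particular tool.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return the dot plot as a list of text rows ("*" = match, "." = no match).

    Cell (i, j) is marked when the window-long words starting at seq_a[i]
    and seq_b[j] are identical. A larger window filters out chance matches.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        word_a = seq_a[i:i + window]
        row = "".join(
            "*" if seq_b[j:j + window] == word_a else "."
            for j in range(len(seq_b) - window + 1)
        )
        rows.append(row)
    return rows


# Similar regions show up as diagonal runs of "*".
for row in dot_plot("GATTACA", "GATCACA"):
    print(row)
```

Raising `window` to 2 or 3 removes most of the isolated single-base matches, so that the diagonals stand out more clearly.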
9. Comparing two sequences
Let's use the YY1 sequence again. Compare two YY1 sequences, one from human and
one from mouse. Retrieve the YY1 sequences from the database (use Map Viewer:
http://www.ncbi.nlm.nih.gov/mapview/).

>Mus musculus YY1 transcription factor (Yy1), mRNA
CTTCCCCACGGCCGGCCGCCTCCTCGCCCGCCCGCCCT
CCCTCCCGCAGCCCAGGAGCCGACGCCGCCTGCCGCGGCGGCCGTGGC G
GCGGAGCCCTCAGCCATGGCCTCGGGCGACACCCTCTACATCGCCACGGA
CGGCTCGGAGATGCCGGCCGAGATCGTGGAGCTGC ATGAGATCGAGGTG
GAGACCATCCCGGTGGAGACCATCGAGACCACGGTGGTGGGCGAGGAGGA
GGAGGAGGACGACGACGACGAG GACGGCGGCGGCGGCGACCACGGCGGC
GGCGGGGGCGGCCACGGGCACGCCGGCCACCACCATCACCACCACCACCA
CCACCACCA CCACCCGCCCATGATCGCGCTGCAGCCGCTGGTGACGGAC
GACCCGACCCAAGTGCACCACCACCAGGAGGTGATCCTGGTGCAGA CGC
GCGAGGAGGTGGTCGGCGGGGACGACTCGGACGGGCTGCGCGCCGAGGAC
GGCTTCGAGGACCAGATCCTCATCCCGGTGCCC GCGCCGGCCGGCGGCG
ACGACGACTACATAGAGCAGACGCTGGTCACCGTGGCGGCGGCCGGCAAG
AGCGGCGGCGGGGCCTCGTC GGGCGGCGGTCGCGTGAAGAAGGGCGGCG
GCAAGAAGAGCGGCAAGAAGAGTTACCTGGGCGGCGGGGCCGGCGCGGCG
GGCGGCG GCGGCGCCGACCCGGGGAATAAGAAGTGGGAGCAGAAGCAGG
TGCAGATCAAGACCCTGGAGGGCGAGTTCTCGGTCACCATGTGG TCCTC
GGATGAAAAAAAAGATATTGACCATGAAACAGTGGTTGAAGAGCAGATCA
TTGGAGAGAACTCACCTCCTGATTATTCTGA ATATATGACAGGCAAGAA
ACTCCCTCCTGGAGGGATACCTGGCATTGACCTCTCAGACCCTAAGCAAC
TGGCAGAATTTGCCAGAA TGAAGCCAAGAAAAATTAAAGAAGATGATGC
TCCAAGAACAATAGCTTGCCCTCATAAAGGCTGCACAAAGATGTTCAGGG
ATAAC TCTGCTATGAGAAAGCATCTGCACACCCACGGTCCCAGAGTCCA
CGTCTGTGCAGAGTGTGGCAAAGCGTTCGTTGAGAGCTCAAA GCTAAAA
CGACACCAGCTGGTTCATACTGGAGAAAAGCCCTTTCAGTGCACATTCGA
AGGCTGCGGGAAGCGCTTTTCACTGGACT TCAATTTGCGCACACATGTG
CGAATCCATACCGGAGACAGGCCCTATGTGTGCCCCTTCGACGGTTGTAA
TAAGAAGTTTGCTCAG TCAACTAACCTGAAATCTCACATCTTAACACAC
GCTAAAGCCAAAAACAACCAGTGAAAAGAAGAGAGAAGACCTTCTCGACC
CGG GAAGCCTCTTCAGGAGTGTGATTGGGAATAAATATGCCTCTCCTTT
GTATATTATTTCTAGGAAGAATTTTAAAAATGAATCCTAC ACACTTAAG
GGACATGTTTTGATAAAGTAGTAAAAATTTAAAAAATACTTTAATAAGAT
GACATTGCTAAGATGCTATATCTTGCT CTGTAATCTCGTTTCAAAAACA
AGGTATTTTTGTAAAGTGTGGTCCCAACAGGAGGACAATTCATGAACTTC
GCATCAAAAGACAA TTCTTTATACAACAGTGCTAAAAATGGGACTTCTT
TTCACATTCTTATAAATATGAAGCTCACCTGTTGCTTACAATTTTTTTAA
T TTTGTATTTTCCAAGTGTGCATATTGTACACTTTTTGGGGATATGCTT
AGTAATGCTGTGTGATTTTCTGGAGGTTGATAACTTTG CTTGCGGTAGA
TTTTCTTTAAAAGAATGGGCAGTTACATGCATACTTCAAAAGTATTTTTC
CTGTACAAAAAAAAAAGTTATATAG GTTTTGTTTGCTATCTTAATTTTG
GTTGTATTCTTTGATGTTAACACATTTTGTATAATTGTATCGTATAGCTG
TATTGAATCATG TAGAATCAAATATTAGATGTGATTTAATAGTGTTAAT
CAATTTAAACCCATTTTAGTCACTTTTTTTTTCCCCAAAAATACTGCCA
GATGCTGATGTTCAGTGTAATTTCTTTGCCTGTTCAGTTACAGAAAGTGG
TGCTCAGTTGTAGAATGTATTGTACCTTTTAACATC TGATGTGTACATC
CGTGTAACAGGAAGGGCAACAATAAAATAGCGATCCTAAAGAAAGATTAC
GGCAGAAAGAGCTCTGTAAGCAC AGCCTTATTTTCTTCTGTTGTCCAGA
ATACTTAGAATTCTTGAGCCTCCCAGAAATTGGAAGCAAATAAAGCAACT
TGAGTTTCCT TTAAAAAA
10. Comparing two sequences
Compare the two species' sequences at the DNA and protein levels. Use one of
the BLAST programs: bl2seq (BLAST 2 SEQUENCES).
- human YY1: mRNA (NM_003403), protein (NP_003394)
- mouse YY1: mRNA (NM_009537), protein (NP_033563)
DNA sequence similarity ---> 94 - 96%
Protein sequence similarity ---> 98%
Why is the protein sequence similarity higher than the DNA sequence similarity?
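
One well-known contributor to this gap is the redundancy of the genetic code: several codons encode the same amino acid, so a base can change without changing the protein. The sketch below checks this on a 27-base stretch taken from the human and mouse YY1 mRNAs shown earlier. The helper names and the simple position-by-position identity measure are assumptions for illustration and are not what bl2seq computes.

```python
# Standard genetic code, built from the usual TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA string codon by codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(x, y):
    """Percent of positions that are identical in two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    same = sum(1 for p, q in zip(x, y) if p == q)
    return 100.0 * same / len(x)


# Same in-frame stretch of the YY1 coding region in both species.
human = "GTGGAGCTGCACGAGATCGAGGTGGAG"  # from NM_003403 as shown in the slides
mouse = "GTGGAGCTGCATGAGATCGAGGTGGAG"  # from the mouse Yy1 mRNA shown in the slides

print(f"DNA identity:     {percent_identity(human, mouse):.1f}%")
print(f"Protein identity: {percent_identity(translate(human), translate(mouse)):.1f}%")
print(translate(human), translate(mouse))
```

Here the single difference (CAC in human, CAT in mouse) falls in the third position of a histidine codon, so both stretches translate to VELHEIEVE even though the DNA differs.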
11. The final exam will cover the following
- everything from Transcription to Genomics
- try to solve all the practice and homework tests
- go over each lecture slide and focus on the main points
- the final review will be held at the Annex auditorium at 1:40 on Wednesday