Title: Bioinformatika
1 Bioinformatika Biologisko virknu
salidzinaana - datubazes un programmas Alignment
of biological sequences - databases and software
LU, 2008, Juris Viksna
2Prakse lietotas virknu salidzinaanas metodes
Smith-Waterman algoritms (1981) Lokala
salidzinaana Linearas gap penalties Lieto
substituciju matricas
Vai laiks ?(n2) ir praktiski pienemams? Proteina
garums - ap 100 aminoskabju... Divu proteinu
salidzinaana 10000 op. (0.01 sek, pie 1
MHz) Proteinu datubaze - 1000000
ieraksti... Mekleana datubaze 108 op, 100
sek Divu datubazu salidzinaana - 1016 op, 25
gadi (
3Heiristiskas metodes - FASTA
- Hash table of short words in the query sequence
- Go through DB and look for matches in the query
hash (linear in size of DB) - K-tuple determines word size (k-tup 1 is single
aa) - Lipman Pearson 1985
VLICT _
VLICTAVLMVLICTAAAVLICTMSDFFD
Adapted from M.Gerstein
4The FASTA Algorithm
- 4 steps
- use lookup table to find all identities at least
ktup long, find regions of identities (Fig.1A) - rescan 10 regions (diagonals) with highest
density of identities using PAM250 (Fig.1B) - join regions if possible without decreasing score
below threshold (Fig.1C) - rescore ala Smith-Waterman 32 residues around
initial region (Note doesnt save alignment)
(Fig.1D)
5FASTA Parameters
- ktup 2 for proteins, 6 for DNA
- init1 Score after rescanning with PAM250 (or
other) - initn Score after joining regions
- opt Score after Smith-Waterman
6FASTA algoritms
Fig.1 of Pearson and Lipman 1988
7FASTA algoritms
Adapted from D Brutlag
8FASTA histogrammas
9Heiristiskas metodes - Blast
- Altschul, S., Gish, W., Miller, W., Myers, E. W.
Lipman, D. J. (1990). Basic local alignment
search tool. J. Mol. Biol. 215, 403-410 - Indexes query and DB
- Starts with all overlapping words from query
- Calculates neighborhood of each word using PAM
matrix and probability threshold matrix and
probability threshold - Looks up all words and neighbors from query in
database index - Extends High Scoring Pairs (HSPs) left and right
to maximal length - Finds Maximal Segment Pairs (MSPs) between query
and database - Blast 1 does not permit gaps in alignments
Adapted from M.Gerstein
10BLAST algoritms
- Keyword search of all words of length w from the
in the query of length n in database of length m
with score above threshold - w 11 for nucleotide queries, 3 for proteins
- Do local alignment extension for each found
keyword - Extend result until longest match above threshold
is achieved - Running time O(nm)
Adapted from S.Daudenarde
11BLAST algoritms
keyword
Query KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVL
KIFLENVIRD
GVK 18 GAK 16 GIK 16 GGK 14 GLK 13 GNK 12 GRK
11 GEK 11 GDK 11
Neighborhood words
neighborhood score threshold (T 13)
extension
Query 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK
60 DN G IR L GK I L E
RGK Sbjct 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EK
HRGIIK 263
High-scoring Pair (HSP)
Adapted from S.Daudenarde
12Original BLAST
- Dictionary
- All words of length w
- Alignment
- Ungapped extensions until score falls below some
threshold - Output
- All local alignments with score gt statistical
threshold
Adapted from S.Daudenarde
13Original BLAST Example
- w 4
- Exact keyword match of GGTC
- Extend diagonals with mismatches until score is
under 50 - Output result
- GTAAGGTCC
- GTTAGGTCC
A C G A A G T A A G G T C
C A G T
C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
14Gapped BLAST Example
A C G A A G T A A G G T C
C A G T
- Original BLAST exact keyword search, THEN
- Extend with gaps in a zone around ends of exact
match until score lt threshold then merge nearby
alignments - Output result
- GTAAGGTCC-AGT
- GTTAGGTCCTAGT
C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
15Gapped BLAST Example
- Original BLAST exact keyword search, THEN
- Extend with gaps around ends of exact match until
score lt threshold, then merge nearby alignments - Output result
- GTAAGGTCC-AGT
- GTTAGGTCCTAGT
A C G A A G T A A G G T C
C A G T
C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
16BLAST - Programmas
blastp compares an amino acid query sequence
against a protein sequence database blastn compa
res a nucleotide query sequence against a
nucleotide sequence database blastx compares a
nucleotide query sequence translated in all
reading frames against a protein sequence
database tblastn compares a protein query
sequence against a nucleotide sequence database
dynamically translated in all reading
frames tblastx compares the six-frame
translations of a nucleotide query sequence
against the six-frame translations of a
nucleotide sequence database. Please note that
tblastx is extremely slow and cpu-intensive
17PSI-BLAST and transitive sequence comparison
18Rezultatu vizualizacija (dotplots)
19Proteinu virknu formati - FASTA
gtspP00156CYB_HUMAN Cytochrome b - Homo sapiens
(Human). MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGAC
LILQITTGLFLAMHYSPDAS TAFSSIAHITRDVNYGWIIRYLHANGASM
FFICLFLHIGRGLYYGSFLYSETWNIGIILL LATMATAFMGYVLPWGQM
SFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFT FHFILPF
IIAALATLHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLLLFLL
SLM TLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPN
KLGGVLALLLSILI LAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLIL
TWIGGQPVSYPFTIIGQVASVLYFT TILILMPTISLIENKMLKWA
20Proteinu virknu formati - SwissProt
ID CYB_HUMAN STANDARD PRT 380 AA. AC P00156
Q34786 Q8HBR6 Q8HNQ0 Q8HNQ1 Q8HNQ9 Q8HNR4
Q8HNR7 AC Q8W7V8 Q8WCV9 Q8WCY2 Q8WCY7
Q8WCY8 Q9B1A6 Q9B1B6 Q9B1B8 AC Q9B2X7
Q9B2X9 Q9B2Y3 Q9B2Z0 Q9B2Z4 Q9T6H6 Q9T9Y0
Q9TEH4 DT 21-JUL-1986, integrated into
UniProtKB/Swiss-Prot. DT 21-JUL-1986, sequence
version 1. DT 19-SEP-2006, entry version 75. DE
Cytochrome b. GN NameMT-CYB SynonymsCOB,
CYTB, MTCYB OS Homo sapiens (Human). OG
Mitochondrion. OC Eukaryota Metazoa Chordata
Craniata Vertebrata Euteleostomi OC Mammalia
Eutheria Euarchontoglires Primates
Haplorrhini OC Catarrhini Hominidae Homo. OX
NCBI_TaxID9606 RN 1 RP NUCLEOTIDE SEQUENCE.
RC TISSUEPlacenta RX MEDLINE81173052
PubMed7219534 NCBI, ExPASy, EBI, Israel RA
Anderson S., Bankier A.T., Barrell B.G., de
Bruijn M.H.L., RA Coulson A.R., Drouin J.,
Eperon I.C., Nierlich D.P., Roe B.A., RA Sanger
F., Schreier P.H., Smith A.J.H., Staden R., Young
I.G. RT "Sequence and organization of the human
mitochondrial genome." RL Nature
290457-465(1981).
21Proteinu virknu formati - SwissProt
FT VARIANT 360 360 T -gt A. FT /FTIdVAR_013667.
FT VARIANT 368 368 T -gt I. FT /FTIdVAR_013668.
SQ SEQUENCE 380 AA 42730 MW 90097B07FF2C1FD8
CRC64 MTPMRKINPL MKLINHSFID LPTPSNISAW
WNFGSLLGAC LILQITTGLF LAMHYSPDAS TAFSSIAHIT
RDVNYGWIIR YLHANGASMF FICLFLHIGR GLYYGSFLYS
ETWNIGIILL LATMATAFMG YVLPWGQMSF WGATVITNLL
SAIPYIGTDL VQWIWGGYSV DSPTLTRFFT FHFILPFIIA
ALATLHLLFL HETGSNNPLG ITSHSDKITF HPYYTIKDAL
GLLLFLLSLM TLTLFSPDLL GDPDNYTLAN PLNTPPHIKP
EWYFLFAYTI LRSVPNKLGG VLALLLSILI LAMIPILHMS
KQQSMMFRPL SQSLYWLLAA DLLILTWIGG QPVSYPFTII
GQVASVLYFT TILILMPTIS LIENKMLKWA
22Proteinu virknu formati - NiceProt
23Proteinu virknu datubazes - UniProt
http//www.expasy.uniprot.org/
24Proteinu virknu datubazes - PIR
http//pir.georgetown.edu/
25Proteinu virknu datubazes - SwissProt
http//www.expasy.org/sprot/
26Proteinu virknu datubazes - NRL-3D )
27Proteinu strukturu datubazes - PDB
http//www.pdb.org
28Salidzinaanas programmas - BLAST
http//www.ncbi.nlm.nih.gov/BLAST/
http//www.ebi.ac.uk/blastall/
29Salidzinaanas programmas - FASTA
http//fasta.bioch.virginia.edu/
http//www.ebi.ac.uk/fasta33/
30Salidzinaanas programmas - Smith-Waterman
http//www.ebi.ac.uk/MPsrch/
31Salidzinajumu noverteana
- S Total Score
- S(i,j) similarity matrix score for aligning i
and j - Sum is carried out over all aligned i and j
- n number of gaps (assuming no gap ext. penalty)
- G gap penalty
Simplest score (for identity matrix) is S
matches What does a Score of 10 mean? What is
the Right Cutoff?
Adapted from M.Gerstein
32Assessing sequence similarity
- Need to know how strong an alignment can be
expected from chance alone - Chance is the comparison of
- Real but non-homologous sequences
- Real sequences that are shuffled to preserve
compositional properties - Sequences that are generated randomly based upon
a DNA or protein sequence model (favored)
Adapted from S.Daudenarde
33Model Random Sequence
- Necessary to evaluate the score of a match
- Take into account background
- Adjust for GC content
- Poly-A tails
- Junk sequences
- Codon bias
Adapted from S.Daudenarde
34E-Values
E-Value Expectation value cutoff. An E-Value of
1 means that one would expect to find at most 1
alignment with a score as high as reported in a
search of the given database. Setting the E-value
higher will report more sequence matches Resp.,
jo mazaka, jo labak... Rekina apmeram adi E
mn2-S, kur m,n - virknu garumi, S' -
rekina, piem., adi S lSln(K)
ln(2)
35P-Values
P-Value Varbutiba, ka ads (vai labaks)
rezultats tiks ieguts "nejauam" proteinam Ari,
jo mazaka, jo labak...
36P-Values
- According to Poisons distribution, the
probability of finding b HSPs with a score ?S is
given by - (e-EEb)/b!
- For b 0, that chance is
- e-E
- Thus the probability of finding at least one such
HSP is - P 1 e-E
Adapted from S.Daudenarde
37P-Values
- P(s gt S) .01
- P-value of .01 occurs at score threshold S (392
below) where score s from random comparison is
greater than this threshold 1 of the time - Likewise for P.001 and so on.
Adapted from M.Gerstein
38General Protein Search Principles
- Choose between local or global search algorithms
- Use most sensitive search algorithm available
- Original BLAST for no gaps
- Smith-Waterman for most sensitivity
- FASTA with k-tuple 1 is a good compromise
- Gapped BLAST for well delimited regions
- PSI-BLAST for families
- Initially BLOSUM62 and default gap penalties
- If no significant results, use BLOSUM30 and lower
gap penalties - FASTA cutoff of .01
- Blast cutoff of .0001
- Examine results between exp. 0.05 and 10 for
biological significance - Ensure expected score is negative
- Beware of hits on long sequences or hits with
unusual aa composition - Reevaluate results of borderline significance
using limited query region - Segment long queries ³ 300 amino acids
- Segment around known motifs
(some text adapted from D Brutlag)
39ROC diagrammas
Uzskatisim, ka proteini ir homologi, ja lidziba
parsniedz kaut kadu slieksni t true positives
(tp) - s(a,b) ? t un a, b ir homologi false
positives (fp) - s(a,b) ? t un a, b nav
homologi true negatives (tn) - s(a,b) lt t un a,
b nav homologi false negatives (fn) - s(a,b) lt t
un a, b ir homologi Sensitivity tp/n
tp/(tpfn) Specificity tn/n tn(tnfp)
40ROC diagrammas
41ROC diagrammas
Coverage (roughly, fraction of sequences that one
confidently says something about)
Thresh10
Thresh20
sensitivitytp/ntp/(tpfn)
Thresh30
Different score thresholds
Error rate (fraction of the statements that are
false positives)
Two methods (red is more effective)
Specificity tn/n tn/(tnfp) error rate
1-specificity fp/n
Adapted from M.Gerstein