Title: Sequence Similarity Searching
1Sequence Similarity Searching
Stuart M. Brown, Ph.D. Center for Health
Informatics and BioinformaticsNYU School of
Medicine
2Why Compare Sequences?
- Identify sequences found in lab experiments
- What is this thing I just found?
- Compare new genes to known ones
- Compare genes from different species
- information about evolution
- Guess functions for entire genomes full of new
gene sequences - Map sequence reads to a Reference
Genome (ChIP-seq, RNA-seq, etc.)
3Are there other sequences like this one?
- 1) Huge public databases - GenBank, Swissprot,
etc. - 2) Sequence comparison is the most powerful and
reliable method to determine evolutionary
relationships between genes - 3) Similarity searching is based on alignment
- 4) BLAST and FASTA provide rapid similarity
searching - a. rapid approximate (heuristic)
- b. false and - scores
4Similarity ? Homology
- 1) 25 similarity 100 AAs is strong evidence
for homology - 2) Homology is an evolutionary statement which
means descent from a common ancestor - common 3D structure
- usually common function
- homology is all or nothing, you cannot say "50
homologous"
5How to Compare Sequences?
- Manually line them up and count?
- an alignment program can do it for you
- or a just use a text editor
- Dot Plot
- shows regions of similarity as diagonals
GATGCCATAGAGCTGTAGTCGTACCCT lt gt
CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC
6Percent Sequence Identity
- The extent to which two nucleotide or amino acid
sequences are invariant
A C C T G A G A G A C G T G G C
A G
mismatch
indel
70 identical
7 Global vs Local similarity
- Global similarity uses complete aligned sequences
- total matches - - Needleman Wunch algorithm
- 2) Local similarity looks for best internal
matching region between 2 sequences - Smith-Waterman algorithm,
- BLAST and FASTA
- 3) dynamic programming
- optimal computer solution, not approximate
8Global vs. Local Alignments
9Global Alignment
Two sequences sharing several regions of local
similarity
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAA
TTAAAGAGGAGGTAGACCG.... 67
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAG
CACTAAAGCGTCAGCGAGACCG 70
10 Similarity is Based on Dot Plots
- 1) two sequences on vertical and horizontal axes
of graph - 2) put dots wherever there is a match
- 3) diagonal line is region of identity (local
alignment) - 4) apply a window filter - look at a group of
bases, must meet identity to get a dot
11Simple Dot Plot
12Dot plot filtered with 4 base window and 75
identity
13Dot plot of real data
14Dotplot (Window 130 / Stringency 9)
Hemoglobin?-chain
Hemoglobin ?-chain
15Window / Stringency
Score 11
PTHPLASKTQILPEDLASEDLTI
?
PTHPLAGERAIGLARLAEEDFGM
Filtering
Score 11
Window 12 Stringency 9
PTHPLASKTQILPEDLASEDLTI
?
PTHPLAGERAIGLARLAEEDFGM
Score 7
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
16Dotplot (Window 18 / Stringency 10)
Hemoglobin?-chain
Hemoglobin ?-chain
17 Scoring Similarity
- 1) Can only score aligned sequences
- 2) DNA is usually scored as identical or not
- 3) modified scoring for gaps - single vs.
multiple base gaps (gap extension) - 4) Protein AAs have varying degrees of similarity
- a. of mutations to convert one to another
- b. chemical similarity
- c. observed mutation frequencies
- 5) PAM matrix calculated from observed mutations
in protein families
18 Search with Protein, not DNA Sequences
- 1) 4 DNA bases vs. 20 amino acids - less chance
similarity - 2) can have varying degrees of similarity between
different AAs - - of mutations, chemical similarity, PAM matrix
- 3) protein databanks are much smaller than DNA
databanks
19The PAM 250 scoring matrix
20What program to use for searching?
- 1) Smith-Waterman is slower, but more sensitive
- known as a rigorous or exhaustive search
optimal alignments - EMBOSS water program
- 2) FASTA
- more sensitive for DNA-DNA comparisons
- FASTX and TFASTX can find similarities in
sequences with frameshifts - 3) BLAST is fastest and easily accessed on the
Web - limited sets of databases for web version
- Free software to install on UNIX computers, make
custom databases - nice translation tools (BLASTX, TBLASTN)
21Smith-Waterman Method
Basic principles of dynamic programming
- Creation of an alignment path matrix -
Stepwise calculation of score values -
Backtracking (evaluation of the optimal path)
22(No Transcript)
23Creation of an alignment path matrix
Idea Build up an optimal alignment using
previous solutions for optimal alignments of
smaller subsequences
- Construct matrix F indexed by i and j (one index
for each sequence) - F(i,j) is the score of the best alignment between
the initial segment x1...i of x up to xi and
the initial segment y1...j of y up to yj - Build F(i,j) recursively beginning with F(0,0) 0
24Creation of an alignment path matrix
- If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we
can calculate F(i,j) - Three possibilities
- xi and yj are aligned, F(i,j) F(i-1,j-1)
s(xi ,yj) - xi is aligned to a gap, F(i,j) F(i-1,j) - d
- yj is aligned to a gap, F(i,j) F(i,j-1) - d
- The best score up to (i,j) will be the largest of
the three options -
25Backtracking
H E A G A W G H
E E 0 -8 -16 -24 -32 -40 -48
-56 -64 -72 -80 P
-8 -2 -9 -17 -25 -33 -42 -49 -57 -65
-73 A -16 -10 -3 -4 -12 -20 -28 -36
-44 -52 -60 W -24 -18 -11 -6 -7 -15
-5 -13 -21 -29 -37 H -32 -14 -18 -13
-8 -9 -13 -7 -3 -11 -19 E -40 -22
-8 -16 -16 -9 -12 -15 -7 3 -5 A -48
-30 -16 -3 -11 -11 -12 -12 -15 -5
2 E -56 -38 -24 -11 -6 -12 -14 -15
-12 -9 1
0
-8
-16
-25
-17
-20
-5
-13
-3
3
-5
1
- A
E E
H H
G -
W W
A A
G -
A P
E -
H -
Optimal global alignment
26Smith-Waterman is OPTIMAL but computationally slow
- SW search requires computing of matrix of scores
at every possible alignment position with every
possible gap. - Compute task increases with the product of the
lengths of two sequence to be compared - Difficult for comparison of one small sequence
to a much larger one, very difficult for two
large sequences, essentially impossible to search
very large databases.
27 FASTA
- 1) Derived from logic of the dot plot
- compute best diagonals from all frames of
alignment - 2) Word method looks for exact matches between
words in query and test sequence - hash tables (fast computer technique)
- DNA words are usually 6 bases
- protein words are 1 or 2 amino acids
- only searches for diagonals in region of word
matches faster searching
28FASTA Format
- simple format used by almost all programs
- gtheader line with a return at end
- Sequence (no specific requirements for line
length, characters, etc)
gtURO1 uro1.seq Length 2018 November 9, 2000
1150 Type N Check 3854 .. CGCAGAAAGAGGAGGCGC
TTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA ACTCAAC
TGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTT
GTT GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCA
ACACAGCCTCTACC CACTGCTTGAAGCCACCGACAACGATGACATCTAT
GGGGCTGCCTGGATCGGCATATTTG TGGGCATCTGCCTCTTCTGCCTGT
CTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA GGAAAATTCTTCT
GGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT CT
TGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCT
GAAGCAGA TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGAT
GACCAGTGGAAAAACAATG GAGTCACCAAAACCTGGGACAGGCTCATGC
TCCAGGACAATTGCTGTGGCGTAAATGGTC CATCAGACTGGCAAAAATA
CACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC CCTGGCCT
CGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGC
TT
29FASTA Algorithm
30Makes Longest Diagonal
- 3) after all diagonals found, tries to join
diagonals by adding gaps - 4) computes alignments in regions of best
diagonals
31FASTA Alignments
32FASTA Results - Alignment
- SCORES Init1 1515 Initn 1565 Opt 1687
z-score 1158.1 E() 2.3e-58 - gtgtGB_IN3DMU09374
(2038 nt) - initn 1565 init1 1515 opt 1687 Z-score
1158.1 expect() 2.3e-58 - 66.2 identity in 875 nt overlap
- (83-957151-1022)
- 60 70 80
90 100 110 - u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAG
CGGAGGCGATGGCGCTGTTGGCC -
- DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAAC
AGAAGGCGCTCCAACTGATGGCC - 130 140 150
160 170 180 - 120 130 140
150 160 170 - u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCT
TCTCTGGCCTCTTTGGAGGCTCA -
- DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTC
TGGGATCGCTGTTCGGAGGGTCC - 190 200 210
220 230 240 - 180 190 200
210 220 230
33BLAST Searches GenBank(or a custom database)
- BLAST Basic Local Alignment Search Tool
- The NCBI BLAST web server lets you compare your
query sequence to various sections of GenBank - nr non-redundant (main sections)
- month new sequences from the past few weeks
- ESTs
- human, drososphila, yeast, or E.coli genomes
- proteins (by automatic translation)
- This is a VERY fast and powerful computer.
34BLAST
- Uses word matching
- Similarity matching of words (3 aas, 11 bases)
- does not require identical words.
- If no words are similar, then no alignment
- wont find matches for very short sequences
- gapped BLAST (BLAST 2) improved handling of
gaps in alignment - BLAST searches can be sent to the NCBIs server
from website, or a custom client program (Unix)
35BLAST Algorithm
36BLAST Word Matching
- MEAAVKEEISVEDEAVDKNI
- MEA
- EAA
- AAV
- AVK
- VKE
- KEE
- EEI
- EIS
- ISV
- ...
-
Break query into words
Break database sequences into words
37Compare Word Lists
- Database Sequence Word Lists
-
- RTT AAQ
- SDG KSS
- SRW LLN
- QEL RWY
- VKI GKG
- DKI NIS
- LFC WDV
- AAV KVR
- PFR DEI
-
- Query Word List
- MEA
- EAA
- AAV
- AVK
- VKL
- KEE
- EEI
- EIS
- ISV
?
Compare word lists by Hashing (allow near
matches)
38Find locations of matching words in database
sequences
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
MEA EAA AAV AVK KLV KEE EEI EIS ISV
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSR
WNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
39Extend hits one base at a time
40BLAST 2 algorithm
- The NCBIs BLAST website and GCG (NETBLAST)
now both use BLAST 2 (also known as gapped
BLAST) - This algorithm is more complex than the original
BLAST - It requires two word matches close to each other
on a pair of sequences (i.e. with a gap) before
it creates an alignment
41Gapped BLAST
HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
Seq_XYZ
QSVFDYIYYGCYCGWGLG_GK__PRDA
Query
E-val10-13
- Use two word matches as anchors to build an
alignment between the query and a database
sequence. - Then score the alignment.
42HSPs are Aligned Regions
- The results of the word matching and attempts to
extend the alignment are segments - - called HSPs (High-scoring Segment Pairs)
- BLAST often produces several short HSPs rather
than a single aligned region
43- gtgbBE588357.1BE588357 194087 BARC 5BOV Bos
taurus cDNA 5'. - Length 369
- Score 272 bits (137), Expect 4e-71
- Identities 258/297 (86), Gaps 1/297 (0)
- Strand Plus / Plus
-
- Query 17 aggatccaacgtcgctccagctgctcttgacgactccac
agataccccgaagccatggca 76 -
- Sbjct 1 aggatccaacgtcgctgcggctacccttaaccact-cgc
agaccccccgcagccatggcc 59 -
- Query 77 agcaagggcttgcaggacctgaagcaacaggtggagggg
accgcccaggaagccgtgtca 136 -
- Sbjct 60 agcaagggcttgcaggacctgaagaagcaagtggagggg
gcggcccaggaagcggtgaca 119 -
- Query 137 gcggccggagcggcagctcagcaagtggtggaccaggcc
acagaggcggggcagaaagcc 196 -
- Sbjct 120 tcggccggaacagcggttcagcaagtggtggatcaggcc
acagaagcagggcagaaagcc 179 -
- Query 197 atggaccagctggccaagaccacccaggaaaccatcgac
aagactgctaaccaggcctct 256
44(No Transcript)
45BLAST has Automatic Translation
- BLASTX makes automatic translation (in all 6
reading frames) of your DNA query sequence to
compare with protein databanks - TBLASTN makes automatic translation of an entire
DNA database to compare with your protein query
sequence - Only make a DNA-DNA search if you are working
with a sequence that does not code for protein.
46BLAST Results - Summary
47BLAST Results - List
48BLAST Results - Alignment
gtgi17556182refNP_497582.1 Predicted CDS,
phosphatidylinositol transfer protein
Caenorhabditis elegans gi14574401gbAAK68521.
1AC024814_1 Hypothetical protein Y54F10AR.1
Caenorhabditis elegans Length 336
Score 283 bits (723), Expect 8e-75
Identities 144/270 (53), Positives 186/270
(68), Gaps 13/270 (4) Query 48
KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----
DGE--KGQYT 101 K RVLPSVEYQVGQLSVAE
ASK P G KGQYT Sbjct 70
KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLL
NGQFTKGQYT 129 Query 102 HKIYHLQSKVPTFVRMLAPEGAL
NIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160
HKIYHLQSKVP R APGL IHEAWNAYPYCTVTN
YMKEF KIET H P Sbjct 130 HKIYHLQSKVPAILRKIAPKG
SLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
189 Query 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-
SKDYKAEEDPAKFKSIKTGRGPLGPN 219 D GT EN
H L E V IIA L S D PKFS KTGRGPL
N Sbjct 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSG
DLHPDSTPSKFQSTKTGRGPLSGN 249 Query 220
WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHR
QLFCWLDKWV 279 WK P MCAYKLVTV
FKWG Q VEN H Q RLF FHRFCWDKW Sbjct 250
WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHR
EVFCWIDKWH 304 Query 280 DLTMDDIRRMEEETKRQLDEMRQ
KDPVKGM 309 LTM DIR E LE R
VGM Sbjct 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM
334
49BLAST alignments are short segments
- BLAST tends to break alignments into
non-overlapping segments - can be confusing
- reduces overall significance score
50BLAST is Approximate
- BLAST makes similarity searches very quickly
because it takes shortcuts. - looks for short, nearly identical words (11
bases) - It also makes errors
- misses some important similarities
- makes many incorrect matches
- easily fooled by repeats or skewed composition
51Web BLAST runs on a big computer at NCBI
- Usually fast, but does get busy sometimes
- Fixed choices of databases
- problems with genome data clogging the system
- ESTs are not part of the default NR dataset
- Uses filtering of repeats
- Graphical summary of output
- Links to GenBank sequences
52BLAST Statistics
- E() value is equivalent to standard P value
- Significant if E() lt 0.05 (smaller numbers are
more significant) - The E-value represents the likelihood that the
observed alignment is due to chance alone. A
value of 1 indicates that an alignment this good
would happen by chance with any random sequence
searched against this database.
53Interpretation of output
- very low E() values (e-100) are homologs or
identical genes - moderate E() values are related genes
- long list of gradually declining of E() values
indicates a large gene family - long regions of moderate similarity are more
significant than short regions of high identity
54Borderline similarity
- What to do with matches with E() values in the
0.5 -1.0 range? - this is the Twilight Zone
- retest these sequences and look for related hits
(not just your original query sequence) - similarity is transitive
- if AB and BC, then AC
55Biological Relevance
- It is up to you, the biologist to scrutinize
these alignments and determine if they are
significant. - Were you looking for a short region of nearly
identical sequence or a larger region of general
similarity? - Are the mismatches conservative ones?
- Are the matching regions important structural
components of the genes or just introns and
flanking regions?