Sequence Similarity Searching

About This Presentation

Title:

Sequence Similarity Searching

Description:

What is this thing I just found? ... Guess functions for entire genomes full of new gene sequences ... if A~B and B~C, then A~C. Advanced Similarity Techniques ... – PowerPoint PPT presentation

Number of Views:534

Avg rating:3.0/5.0

Slides: 56

Provided by: researchco3

Category:

more less

Transcript and Presenter's Notes

Title: Sequence Similarity Searching

1
Sequence Similarity Searching
Stuart M. Brown, Ph.D. Center for Health
Informatics and BioinformaticsNYU School of
Medicine
2
Why Compare Sequences?

Identify sequences found in lab experiments
What is this thing I just found?
Compare new genes to known ones
Compare genes from different species
information about evolution
Guess functions for entire genomes full of new
gene sequences
Map sequence reads to a Reference
Genome (ChIP-seq, RNA-seq, etc.)

3
Are there other sequences like this one?

1) Huge public databases - GenBank, Swissprot,
etc.
2) Sequence comparison is the most powerful and
reliable method to determine evolutionary
relationships between genes
3) Similarity searching is based on alignment
4) BLAST and FASTA provide rapid similarity
searching
a. rapid approximate (heuristic)
b. false and - scores

4
Similarity ? Homology

1) 25 similarity 100 AAs is strong evidence
for homology
2) Homology is an evolutionary statement which
means descent from a common ancestor
common 3D structure
usually common function
homology is all or nothing, you cannot say "50
homologous"

5
How to Compare Sequences?

Manually line them up and count?
an alignment program can do it for you
or a just use a text editor
Dot Plot
shows regions of similarity as diagonals

GATGCCATAGAGCTGTAGTCGTACCCT lt gt
CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC
6
Percent Sequence Identity

The extent to which two nucleotide or amino acid
sequences are invariant

A C C T G A G A G A C G T G G C
A G
mismatch
indel
70 identical
7
Global vs Local similarity

Global similarity uses complete aligned sequences
- total matches
- Needleman Wunch algorithm
2) Local similarity looks for best internal
matching region between 2 sequences
Smith-Waterman algorithm,
BLAST and FASTA
3) dynamic programming
optimal computer solution, not approximate

8
Global vs. Local Alignments
9
Global Alignment
Two sequences sharing several regions of local
similarity
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAA
TTAAAGAGGAGGTAGACCG.... 67

1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAG
CACTAAAGCGTCAGCGAGACCG 70
10
Similarity is Based on Dot Plots

1) two sequences on vertical and horizontal axes
of graph
2) put dots wherever there is a match
3) diagonal line is region of identity (local
alignment)
4) apply a window filter - look at a group of
bases, must meet identity to get a dot

11
Simple Dot Plot
12
Dot plot filtered with 4 base window and 75
identity
13
Dot plot of real data
14
Dotplot (Window 130 / Stringency 9)
Hemoglobin?-chain
Hemoglobin ?-chain
15
Window / Stringency
Score 11
PTHPLASKTQILPEDLASEDLTI
?
PTHPLAGERAIGLARLAEEDFGM
Filtering
Score 11
Window 12 Stringency 9
PTHPLASKTQILPEDLASEDLTI
?
PTHPLAGERAIGLARLAEEDFGM
Score 7
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
16
Dotplot (Window 18 / Stringency 10)
Hemoglobin?-chain
Hemoglobin ?-chain
17
Scoring Similarity

1) Can only score aligned sequences
2) DNA is usually scored as identical or not
3) modified scoring for gaps - single vs.
multiple base gaps (gap extension)
4) Protein AAs have varying degrees of similarity
a. of mutations to convert one to another
b. chemical similarity
c. observed mutation frequencies
5) PAM matrix calculated from observed mutations
in protein families

18
Search with Protein, not DNA Sequences

1) 4 DNA bases vs. 20 amino acids - less chance
similarity
2) can have varying degrees of similarity between
different AAs
- of mutations, chemical similarity, PAM matrix
3) protein databanks are much smaller than DNA
databanks

19
The PAM 250 scoring matrix
20
What program to use for searching?

1) Smith-Waterman is slower, but more sensitive
known as a rigorous or exhaustive search
optimal alignments
EMBOSS water program
2) FASTA
more sensitive for DNA-DNA comparisons
FASTX and TFASTX can find similarities in
sequences with frameshifts
3) BLAST is fastest and easily accessed on the
Web
limited sets of databases for web version
Free software to install on UNIX computers, make
custom databases
nice translation tools (BLASTX, TBLASTN)

21
Smith-Waterman Method
Basic principles of dynamic programming
- Creation of an alignment path matrix -
Stepwise calculation of score values -
Backtracking (evaluation of the optimal path)
22
(No Transcript)
23
Creation of an alignment path matrix
Idea Build up an optimal alignment using
previous solutions for optimal alignments of
smaller subsequences

Construct matrix F indexed by i and j (one index
for each sequence)
F(i,j) is the score of the best alignment between
the initial segment x1...i of x up to xi and
the initial segment y1...j of y up to yj
Build F(i,j) recursively beginning with F(0,0) 0

24
Creation of an alignment path matrix

If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we
can calculate F(i,j)
Three possibilities
xi and yj are aligned, F(i,j) F(i-1,j-1)
s(xi ,yj)
xi is aligned to a gap, F(i,j) F(i-1,j) - d
yj is aligned to a gap, F(i,j) F(i,j-1) - d
The best score up to (i,j) will be the largest of
the three options

25
Backtracking
H E A G A W G H
E E 0 -8 -16 -24 -32 -40 -48
-56 -64 -72 -80 P
-8 -2 -9 -17 -25 -33 -42 -49 -57 -65
-73 A -16 -10 -3 -4 -12 -20 -28 -36
-44 -52 -60 W -24 -18 -11 -6 -7 -15
-5 -13 -21 -29 -37 H -32 -14 -18 -13
-8 -9 -13 -7 -3 -11 -19 E -40 -22
-8 -16 -16 -9 -12 -15 -7 3 -5 A -48
-30 -16 -3 -11 -11 -12 -12 -15 -5
2 E -56 -38 -24 -11 -6 -12 -14 -15
-12 -9 1
0
-8
-16
-25
-17
-20
-5
-13
-3
3
-5
1
- A
E E
H H
G -
W W
A A
G -
A P
E -
H -
Optimal global alignment
26
Smith-Waterman is OPTIMAL but computationally slow

SW search requires computing of matrix of scores
at every possible alignment position with every
possible gap.
Compute task increases with the product of the
lengths of two sequence to be compared
Difficult for comparison of one small sequence
to a much larger one, very difficult for two
large sequences, essentially impossible to search
very large databases.

27
FASTA

1) Derived from logic of the dot plot
compute best diagonals from all frames of
alignment
2) Word method looks for exact matches between
words in query and test sequence
hash tables (fast computer technique)
DNA words are usually 6 bases
protein words are 1 or 2 amino acids
only searches for diagonals in region of word
matches faster searching

28
FASTA Format

simple format used by almost all programs
gtheader line with a return at end
Sequence (no specific requirements for line
length, characters, etc)

gtURO1 uro1.seq Length 2018 November 9, 2000
1150 Type N Check 3854 .. CGCAGAAAGAGGAGGCGC
TTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA ACTCAAC
TGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTT
GTT GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCA
ACACAGCCTCTACC CACTGCTTGAAGCCACCGACAACGATGACATCTAT
GGGGCTGCCTGGATCGGCATATTTG TGGGCATCTGCCTCTTCTGCCTGT
CTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA GGAAAATTCTTCT
GGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT CT
TGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCT
GAAGCAGA TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGAT
GACCAGTGGAAAAACAATG GAGTCACCAAAACCTGGGACAGGCTCATGC
TCCAGGACAATTGCTGTGGCGTAAATGGTC CATCAGACTGGCAAAAATA
CACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC CCTGGCCT
CGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGC
TT
29
FASTA Algorithm
30
Makes Longest Diagonal

3) after all diagonals found, tries to join
diagonals by adding gaps
4) computes alignments in regions of best
diagonals

31
FASTA Alignments
32
FASTA Results - Alignment

SCORES Init1 1515 Initn 1565 Opt 1687
z-score 1158.1 E() 2.3e-58
gtgtGB_IN3DMU09374
(2038 nt)
initn 1565 init1 1515 opt 1687 Z-score
1158.1 expect() 2.3e-58
66.2 identity in 875 nt overlap
(83-957151-1022)
60 70 80
90 100 110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAG
CGGAGGCGATGGCGCTGTTGGCC
DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAAC
AGAAGGCGCTCCAACTGATGGCC
130 140 150
160 170 180
120 130 140
150 160 170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCT
TCTCTGGCCTCTTTGGAGGCTCA
DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTC
TGGGATCGCTGTTCGGAGGGTCC
190 200 210
220 230 240
180 190 200
210 220 230

33
BLAST Searches GenBank(or a custom database)

BLAST Basic Local Alignment Search Tool
The NCBI BLAST web server lets you compare your
query sequence to various sections of GenBank
nr non-redundant (main sections)
month new sequences from the past few weeks
ESTs
human, drososphila, yeast, or E.coli genomes
proteins (by automatic translation)
This is a VERY fast and powerful computer.

34
BLAST

Uses word matching
Similarity matching of words (3 aas, 11 bases)
does not require identical words.
If no words are similar, then no alignment
wont find matches for very short sequences
gapped BLAST (BLAST 2) improved handling of
gaps in alignment
BLAST searches can be sent to the NCBIs server
from website, or a custom client program (Unix)

35
BLAST Algorithm
36
BLAST Word Matching

MEAAVKEEISVEDEAVDKNI
MEA
EAA
AAV
AVK
VKE
KEE
EEI
EIS
ISV
...

Break query into words
Break database sequences into words
37
Compare Word Lists

Database Sequence Word Lists
RTT AAQ
SDG KSS
SRW LLN
QEL RWY
VKI GKG
DKI NIS
LFC WDV
AAV KVR
PFR DEI

Query Word List
MEA
EAA
AAV
AVK
VKL
KEE
EEI
EIS
ISV

?
Compare word lists by Hashing (allow near
matches)
38
Find locations of matching words in database
sequences
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
MEA EAA AAV AVK KLV KEE EEI EIS ISV
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSR
WNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
39
Extend hits one base at a time
40
BLAST 2 algorithm

The NCBIs BLAST website and GCG (NETBLAST)
now both use BLAST 2 (also known as gapped
BLAST)
This algorithm is more complex than the original
BLAST
It requires two word matches close to each other
on a pair of sequences (i.e. with a gap) before
it creates an alignment

41
Gapped BLAST
HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
Seq_XYZ
QSVFDYIYYGCYCGWGLG_GK__PRDA
Query
E-val10-13

Use two word matches as anchors to build an
alignment between the query and a database
sequence.
Then score the alignment.

42
HSPs are Aligned Regions

The results of the word matching and attempts to
extend the alignment are segments
- called HSPs (High-scoring Segment Pairs)
BLAST often produces several short HSPs rather
than a single aligned region

gtgbBE588357.1BE588357 194087 BARC 5BOV Bos
taurus cDNA 5'.
Length 369
Score 272 bits (137), Expect 4e-71
Identities 258/297 (86), Gaps 1/297 (0)
Strand Plus / Plus
Query 17 aggatccaacgtcgctccagctgctcttgacgactccac
agataccccgaagccatggca 76
Sbjct 1 aggatccaacgtcgctgcggctacccttaaccact-cgc
agaccccccgcagccatggcc 59
Query 77 agcaagggcttgcaggacctgaagcaacaggtggagggg
accgcccaggaagccgtgtca 136
Sbjct 60 agcaagggcttgcaggacctgaagaagcaagtggagggg
gcggcccaggaagcggtgaca 119
Query 137 gcggccggagcggcagctcagcaagtggtggaccaggcc
acagaggcggggcagaaagcc 196
Sbjct 120 tcggccggaacagcggttcagcaagtggtggatcaggcc
acagaagcagggcagaaagcc 179
Query 197 atggaccagctggccaagaccacccaggaaaccatcgac
aagactgctaaccaggcctct 256

44
(No Transcript)
45
BLAST has Automatic Translation

BLASTX makes automatic translation (in all 6
reading frames) of your DNA query sequence to
compare with protein databanks
TBLASTN makes automatic translation of an entire
DNA database to compare with your protein query
sequence
Only make a DNA-DNA search if you are working
with a sequence that does not code for protein.

46
BLAST Results - Summary
47
BLAST Results - List
48
BLAST Results - Alignment
gtgi17556182refNP_497582.1 Predicted CDS,
phosphatidylinositol transfer protein
Caenorhabditis elegans gi14574401gbAAK68521.
1AC024814_1 Hypothetical protein Y54F10AR.1
Caenorhabditis elegans Length 336
Score 283 bits (723), Expect 8e-75
Identities 144/270 (53), Positives 186/270
(68), Gaps 13/270 (4) Query 48
KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----
DGE--KGQYT 101 K RVLPSVEYQVGQLSVAE
ASK P G KGQYT Sbjct 70
KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLL
NGQFTKGQYT 129 Query 102 HKIYHLQSKVPTFVRMLAPEGAL
NIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160
HKIYHLQSKVP R APGL IHEAWNAYPYCTVTN
YMKEF KIET H P Sbjct 130 HKIYHLQSKVPAILRKIAPKG
SLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
189 Query 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-
SKDYKAEEDPAKFKSIKTGRGPLGPN 219 D GT EN
H L E V IIA L S D PKFS KTGRGPL
N Sbjct 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSG
DLHPDSTPSKFQSTKTGRGPLSGN 249 Query 220
WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHR
QLFCWLDKWV 279 WK P MCAYKLVTV
FKWG Q VEN H Q RLF FHRFCWDKW Sbjct 250
WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHR
EVFCWIDKWH 304 Query 280 DLTMDDIRRMEEETKRQLDEMRQ
KDPVKGM 309 LTM DIR E LE R
VGM Sbjct 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM
334
49
BLAST alignments are short segments

BLAST tends to break alignments into
non-overlapping segments
can be confusing
reduces overall significance score

50
BLAST is Approximate

BLAST makes similarity searches very quickly
because it takes shortcuts.
looks for short, nearly identical words (11
bases)
It also makes errors
misses some important similarities
makes many incorrect matches
easily fooled by repeats or skewed composition

51
Web BLAST runs on a big computer at NCBI

Usually fast, but does get busy sometimes
Fixed choices of databases
problems with genome data clogging the system
ESTs are not part of the default NR dataset
Uses filtering of repeats
Graphical summary of output
Links to GenBank sequences

52
BLAST Statistics

E() value is equivalent to standard P value
Significant if E() lt 0.05 (smaller numbers are
more significant)
The E-value represents the likelihood that the
observed alignment is due to chance alone. A
value of 1 indicates that an alignment this good
would happen by chance with any random sequence
searched against this database.

53
Interpretation of output

very low E() values (e-100) are homologs or
identical genes
moderate E() values are related genes
long list of gradually declining of E() values
indicates a large gene family
long regions of moderate similarity are more
significant than short regions of high identity

54
Borderline similarity

What to do with matches with E() values in the
0.5 -1.0 range?
this is the Twilight Zone
retest these sequences and look for related hits
(not just your original query sequence)
similarity is transitive
if AB and BC, then AC

55
Biological Relevance

It is up to you, the biologist to scrutinize
these alignments and determine if they are
significant.
Were you looking for a short region of nearly
identical sequence or a larger region of general
similarity?
Are the mismatches conservative ones?
Are the matching regions important structural
components of the genes or just introns and
flanking regions?