Title: Sequence Alignment and Approaches to Database Searching
1Sequence Alignment and Approaches to Database
Searching
2Why do we align sequences?
- To discover functional, structural and evolutionary similarities
- Because similarity may be an indicator of homology and thus provide some insight into function or gene identification.
3Origins of similar sequences
[Figure: gene A gives rise to A1 and A2 by gene duplication; gene duplication followed by speciation leaves A1 and A2 in both Species A and Species B. Also listed: gene conversion and horizontal gene transfer.]
4The various algorithms
- Dynamic programming algorithms provide a rigorous mathematical approach to sequence alignment. They are guaranteed to find the best alignment for a given scoring matrix and gap penalty.
- Local alignments, as opposed to global alignments, are better for database (DB) searching and for finding similar domains.
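The global versus local distinction can be sketched with a small dynamic programming routine. This is a minimal illustration, not a production aligner: it assumes a simple match/mismatch score and a linear gap penalty (the values +1, -1 and -2 are arbitrary choices, not taken from the slides), and it returns only the optimal score, not the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False gives a global (Needleman-Wunsch style) score,
    local=True gives a local (Smith-Waterman style) score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(H[i - 1][j - 1] + s,   # align the two residues
                        H[i - 1][j] + gap,     # gap in b
                        H[i][j - 1] + gap)     # gap in a
            if local:
                score = max(score, 0)          # a local alignment may restart anywhere
                best = max(best, score)
            H[i][j] = score
    return best if local else H[-1][-1]


a, b = "TTTTACGTTTTT", "GGGACGGGG"
print("local :", align_score(a, b, local=True))
print("global:", align_score(a, b))
```

The two toy sequences share only the short segment ACG. The local score picks out that shared segment, while the global score is dragged down by the unrelated flanks, which is the reason local alignment suits database searching and domain finding.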
5Scoring Matrices are designed to detect signal
above background, to detect similarities beyond
what would be observed by chance alone
6Why do we need these matrices?
- Database searching
- Need different levels of sensitivity
- Close relationships (low PAM, high BLOSUM)
- Distant relationships (high PAM, low BLOSUM)
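The idea of "signal above background" is usually expressed as a log-odds score: the observed frequency of a residue pair in trusted alignments divided by the frequency expected by chance. The sketch below assumes scores are reported in half-bit units (scale = 2) and uses made-up frequencies purely for illustration; none of the numbers come from a real matrix.

```python
import math


def log_odds(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds score for a residue pair.

    q_ij  : observed frequency of the pair in related sequences
    p_i   : background frequency of residue i
    p_j   : background frequency of residue j
    scale : 2.0 means half-bit units (an assumption for this sketch)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))


# Hypothetical frequencies: chance expectation is 0.1 * 0.1 = 0.01.
print(log_odds(0.02, 0.1, 0.1))    # pair seen twice as often as chance
print(log_odds(0.01, 0.1, 0.1))    # pair seen exactly as often as chance
print(log_odds(0.005, 0.1, 0.1))   # pair seen half as often as chance
```

Positive scores mark substitutions seen more often than chance, negative scores mark substitutions seen less often. Matrices built from closely related sequences (low PAM, high BLOSUM) penalise substitutions more heavily than matrices built from distant ones.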
7Dot Plot Nuts & Bolts
[Figure: dot plot of two short nucleotide sequences, word size 1]
8Dot Plot Nuts & Bolts
[Figure: dot plot of two short nucleotide sequences, word size 2]
9Dot Plot Nuts & Bolts
[Figure: dot plot of two short nucleotide sequences, word size 3]
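The effect of word size on a dot plot can be reproduced with a few lines of code. This is a minimal sketch: the two sequences below are hypothetical stand-ins (the sequences on the slides could not be recovered), and a dot is placed wherever a word of the chosen size is identical in both sequences.

```python
def dot_plot(seq_x, seq_y, word=1):
    """Return the set of (i, j) where seq_x[i:i+word] == seq_y[j:j+word]."""
    dots = set()
    for i in range(len(seq_x) - word + 1):
        for j in range(len(seq_y) - word + 1):
            if seq_x[i:i + word] == seq_y[j:j + word]:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, word=1):
    """Draw the plot as text: seq_x across the top, seq_y down the side."""
    dots = dot_plot(seq_x, seq_y, word)
    lines = ["  " + " ".join(seq_x)]
    for j, ch in enumerate(seq_y):
        row = " ".join("*" if (i, j) in dots else "." for i in range(len(seq_x)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


# Hypothetical example sequences, not the ones from the slides.
x, y = "gctggaaggcat", "gcagagcact"
for w in (1, 2, 3):
    print(f"word size {w}: {len(dot_plot(x, y, w))} dots")
print(render(x, y, word=2))
```

Every dot found with a larger word is also a dot with a smaller word, so increasing the word size only removes dots. That is why larger words (or windows) suppress background noise and leave the diagonals that reflect real similarity.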
10Plasmodium falciparum circumsporozoite protein
MMRKLAILSVSSFLFVEALFQEYQCYGSSSNTRVLNELNYDNAGTNLYNE
LEMNYYGKQENWYSLKKNSRSLGENDDGNN NNGDNGREGKDEDKRDGNN
EDNEKLRKPKHKKLKQPGDGNPDPNANPNVDPNANPNVDPNANPNVDPNA
NPNANPNANPN ANPNANPNANPNANPNANPNANPNANPNANPNANPNAN
PNANPNANPNVDPNANPNANPNANPNANPNANPNANPNANPN ANPNANP
NANPNANPNANPNANPNANPNANPNANPNANPNANPNANPNKNNQGNGQG
HNMPNDPNRNVDENANANNAVKN NNNEEPSDKHIEQYLKKIKNSISTEW
SPCSVTCGNGIQVRIKPGSANKPKDELDYENDIEKKICKMEKCSSVFNVV
NSSI GLIMVLSFLFLN
Plasmodium vivax circumsporozoite protein
MKNFILLAVSSILLVDLFPTHCGHNVDLSKAINLNGVNFNNVDASSLGAA
HVGQSASRGRGLGENPDDEEGDAKKKKDGK KAEPKNPRENKLKQPGDRA
DGQPAGDRADGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRADGQPA
GDRADGQPAGD RADGQPAGDRAAGQPAGDRAAGQPAGDRADGQPAGDRA
AGQPAGDRADGQPAGDRAAGQPAGDRADGQPAGDRAAGQPAG DRAAGQP
AGDRAAGQPAGDRAAGQPAGNGAGGQAAGGNAGGGQGQNNEGANAPNEKS
VKEYLDKVRATVGTEWTPCSVTC GVGVRVRRRVNAANKKPEDLTLNDLE
TDVCTMDKCAGIFNVVSNSLGLVILLVLALFN
11Plasmodium falciparum CS protein vs. Plasmodium vivax CS protein
[Figure: dot plot of the two CS protein sequences, window = 2]
12Plasmodium falciparum CS protein vs. Plasmodium vivax CS protein
[Figure: dot plot of the two CS protein sequences, window = 7]
13Database Searching
- Database searching ≠ sequence alignment
- Database searching is the application of knowledge gained from previous experiments to the problem of gene discovery
- Similarity ≠ homology
14Database Searching
- The Assumptions
- The sequences being sought have an evolutionary ancestral sequence in common with the query sequence
- The best guess at the actual path of evolution is the path that requires the fewest evolutionary events (most parsimonious)
- Not all substitutions are equally likely, and they should be weighted accordingly
- Insertions and deletions are less likely than substitutions and should be weighted accordingly
15Database Searching
- Applied Considerations
- The choice of search algorithm influences the sensitivity and selectivity of the search
- The choice of matrix determines both the pattern and the extent of substitution in the sequences the database search is most likely to discover
16Protein vs Nucleotide
- Which molecules should you search with?
- Which databases should you search, nucleotide or
protein?
17Why can't we just look at the DNA sequence for the protein?
- It was once thought that we might be able to calculate a minimum mutation matrix, i.e. one in which the minimum number of steps (nucleotide changes) needed to change from one amino acid to another was counted. The problem is that, because of the degeneracy of the genetic code, likely and unlikely mutations would often receive the same score.
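The weakness of a minimum mutation matrix is easy to demonstrate. The sketch below assumes the standard genetic code and counts the smallest number of nucleotide changes separating any codon of one amino acid from any codon of another. The function names are illustrative only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(aa):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, residue in CODON_TABLE.items() if residue == aa]


def min_mutations(aa1, aa2):
    """Smallest number of base changes that turns aa1 into aa2."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in codons_for(aa1)
               for c2 in codons_for(aa2))


print("I -> M:", min_mutations("I", "M"))
print("W -> C:", min_mutations("W", "C"))
```

Both substitutions need a single base change, so a minimum mutation matrix scores them identically. In the PAM 120 matrix in this deck, however, the I/M pair scores 1 while the W/C pair scores -8, reflecting how differently the two substitutions are tolerated in real proteins.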
18BLAST
- BLAST is less sensitive than Smith-Waterman (SW)
- Basic BLAST uses a word size of 3 for proteins and is more sensitive than FASTA (even though FASTA uses a word size of 2)
- Basic BLAST uses a word size of 11 or 12 for nucleic acid sequences
- The heuristic is applied to the words in BLAST via a threshold value, T, for alignments of words.
19Basic BLAST Algorithms
- BLASTN - compares a nucleotide query to a nucleotide database
- BLASTP - compares a protein query to a protein database
- BLASTX - compares a nucleotide query sequence translated in all reading frames against a protein sequence database
- TBLASTN - compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
- TBLASTX - compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
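"Translated in all reading frames" means three frames on the forward strand and three on the reverse complement. The following is a rough sketch of that step, assuming the standard genetic code. It is not how BLAST itself implements translation, and the frame numbering (+1 to +3, -1 to -3) is just a common convention.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]   # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(f"{frame:+d}: {protein}")
```

BLASTX applies this to the query, TBLASTN to each database entry, and TBLASTX to both, which is why TBLASTX is the most expensive of the five programs.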
20BLAST Nuts & Bolts
- Search a database with the initial words and the expanded word set (neighborhood words) with scores above some threshold, T.
- If a match is found between a word and a DB entry, attempt to extend the alignment until the score falls off by some value, i.e. the score is no longer maximal.
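The neighborhood word step can be sketched as follows. To keep the example short it makes two loud assumptions: the alphabet is restricted to five residues (A, G, P, S, T), and the scores are the corresponding entries copied from the PAM 120 table in this deck. The threshold T = 11 and the function names are illustrative, not BLAST defaults.

```python
from itertools import product

ALPHABET = "AGPST"   # reduced alphabet, an assumption made for brevity

# Lower-triangle PAM 120 entries for the five residues above.
_LOWER = {
    "A": {"A": 3},
    "G": {"A": 1, "G": 5},
    "P": {"A": 1, "G": -2, "P": 6},
    "S": {"A": 1, "G": 1, "P": 1, "S": 3},
    "T": {"A": 1, "G": -1, "P": -1, "S": 2, "T": 4},
}
SCORE = {}
for row, cols in _LOWER.items():
    for col, value in cols.items():
        SCORE[row, col] = SCORE[col, row] = value   # make the matrix symmetric


def word_score(w1, w2):
    """Sum of pairwise substitution scores for two equal-length words."""
    return sum(SCORE[a, b] for a, b in zip(w1, w2))


def neighborhood(word, threshold):
    """All words scoring at least `threshold` against `word`."""
    return {"".join(c) for c in product(ALPHABET, repeat=len(word))
            if word_score(word, "".join(c)) >= threshold}


def find_hits(query, subject, w=3, threshold=11):
    """Return (query_pos, subject_pos) for every neighborhood word hit."""
    index = {}
    for q in range(len(query) - w + 1):
        for neighbor in neighborhood(query[q:q + w], threshold):
            index.setdefault(neighbor, []).append(q)
    hits = []
    for s in range(len(subject) - w + 1):
        for q in index.get(subject[s:s + w], []):
            hits.append((q, s))
    return hits


print(sorted(neighborhood("GSP", 11)))
print(find_hits("GSP", "AAGTPAA"))
```

Raising T shrinks the neighborhood (fewer, more exact seeds, faster search) and lowering T enlarges it (more seeds, greater sensitivity), which is the trade-off the threshold controls.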
21Blast in a Nutshell
22PAM 120
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   3
R  -3   6
N  -1  -1   4
D   0  -3   2   5
C  -3  -4  -5  -7   9
Q  -1   1   0   1  -7   6
E   0  -3   1   3  -7   2   5
G   1  -4   0   0  -4  -3  -1   5
H  -3   1   2   0  -4   3  -1  -4   7
I  -1  -2  -2  -3  -3  -3  -3  -4  -4   6
L  -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5
K  -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5
M  -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8
F  -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8
P   1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6
S   1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3
T   1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4
W  -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12
Y  -4  -5  -2  -5  -1  -5  -5  -6  -1  -2  -2  -5  -4   4  -6  -3  -3  -2   8
V   0  -3  -3  -3  -3  -3  -3  -2  -3   3   1  -4   1  -3  -2  -2   0  -8  -3   5
B   0  -2   3   4  -6   0   3   0   1  -3  -4   0  -4  -5  -2   0   0  -6  -3  -3   4
Z  -1  -1   0   3  -7   4   4  -2   1  -3  -3  -1  -2  -6  -1  -1  -2  -7  -5  -3   2   4
X  -1  -2  -1  -2  -4  -1  -1  -2  -2  -1  -2  -2  -2  -3  -2  -1  -1  -5  -3  -1  -1  -1  -2
*  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1
23Blast Extension
Database Sequence T G Y A A S S S T Y M Q V G P
R E G V L K
P R E G A I
Word has a hit
Extend the word
During extension, the scoring matrix is used to calculate the score. Extension continues until the required score is reached or the score deteriorates from its best value by the specified drop-off (cutoff).
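Ungapped extension with a drop-off can be sketched as below. This is a simplified illustration rather than BLAST's actual code: it uses nucleotide sequences with the (5, -4) match/mismatch values quoted in this deck for WUBLAST, and the sequences, seed position and drop-off value are made up for the example.

```python
def _extend(pairs, score_fn, x_drop):
    """Walk along aligned residue pairs; stop once the running score
    falls more than x_drop below the best score seen so far."""
    best = running = best_len = 0
    for n, (a, b) in enumerate(pairs, start=1):
        running += score_fn(a, b)
        if running > best:
            best, best_len = running, n
        elif best - running > x_drop:
            break
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, w, x_drop, match=5, mismatch=-4):
    """Extend a word hit of length w in both directions without gaps.

    Returns (score, query_start, query_end) of the resulting segment pair.
    """
    def score_fn(a, b):
        return match if a == b else mismatch

    seed = sum(score_fn(a, b)
               for a, b in zip(query[q_pos:q_pos + w], subject[s_pos:s_pos + w]))
    right, r_len = _extend(zip(query[q_pos + w:], subject[s_pos + w:]),
                           score_fn, x_drop)
    left, l_len = _extend(zip(query[:q_pos][::-1], subject[:s_pos][::-1]),
                          score_fn, x_drop)
    return seed + left + right, q_pos - l_len, q_pos + w + r_len


query   = "AAACGTACGTCC"
subject = "GGACGTACGTGG"
score, start, end = extend_hit(query, subject, q_pos=4, s_pos=4, w=4, x_drop=10)
print(score, query[start:end])
```

The extension keeps the coordinates of the best score seen, so trailing mismatches explored before the drop-off triggers are not included in the reported segment.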
24BLAST Nuts & Bolts
- Normal word sizes for proteins are W = 3 with T = 14 or W = 4 with T = 16.
- Normal word sizes for nucleic acids are W = 11 or W = 12.
- The default scoring matrix (match, mismatch) for nucleic acid sequences is (1, -3) for NCBI BLAST and (5, -4) for WUBLAST.
25Gapped BLAST - 3 Changes to the Algorithm
- The criterion for extending word pairs was modified, which gives an increase in speed
- The ability to create gapped alignments was added
- Smith-Waterman calculations are used to produce the final alignment
26Word Extension
- In the older versions of BLAST, if a word pair with a score above T was encountered when screening the DB, it was extended.
- In the newer version, two non-overlapping words located within some distance X (the hitdist) of each other must hit the same sequence in the DB before an extension is performed.
- To maintain sensitivity, the value of T must be lowered. This yields more hits, but few are extended.
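The two-hit rule can be sketched as a filter over word hits. One detail below is not stated on the slide and is taken from the published gapped BLAST method as best understood here: the two hits must lie on the same diagonal (same subject position minus query position). The function name, the word size and the distance of 40 are illustrative choices.

```python
def two_hit_triggers(hits, w, max_dist):
    """Return the word hits that would trigger an extension.

    hits     : iterable of (query_pos, subject_pos) word hits
    w        : word size, used to reject overlapping hits
    max_dist : maximum distance between two hits on the same diagonal
    """
    last_on_diag = {}          # diagonal -> subject position of most recent hit
    triggers = []
    for q, s in sorted(hits, key=lambda h: h[1]):
        diag = s - q
        prev = last_on_diag.get(diag)
        if prev is not None:
            gap = s - prev
            if gap < w:
                continue       # overlaps the previous hit: ignore it
            if gap <= max_dist:
                triggers.append((q, s))
        last_on_diag[diag] = s
    return triggers


hits = [(0, 10), (5, 15), (20, 50), (2, 40)]
print(two_hit_triggers(hits, w=3, max_dist=40))
```

Only the second hit of a qualifying pair triggers an extension. Isolated hits, which are the great majority when T is lowered, are discarded cheaply, and that is where the speed gain comes from.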
27Gapped Alignment
- The original BLAST found many HSPs (high-scoring segment pairs) and used all of them to generate a sum statistic
- With gapping, only one of the ungapped alignments needs to be found rather than all of them.
- This allows T to be raised, which speeds up the initial scan
- Gapped alignments are achieved via dynamic programming to extend a central pair of aligned residues in both directions.
28PSI-BLAST
- Distant relationships are often best detected by motif or profile searches rather than pairwise comparisons
- PSI-BLAST searches are iterated, with a position-specific matrix generated from significant alignments found in round i used in round i + 1.
- BLAST uses a generalized matrix
- May not be as sensitive as a motif search but is very general and easy to use.
29A PSSM (position specific scoring matrix) for
PSI-BLAST
       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
20 N   0   0   3  -2  -4   2   0   0  -2   0   0   2  -2  -4  -3   2   0  -5  -3  -3
21 S  -2   0   3   0  -4   0   0   0  -2  -4  -4   1  -3  -4  -3   2   2   4  -3  -3
22 G   1   0   2  -2  -3   0  -2   1   2  -2   0   1  -2  -3  -3   1  -2  -4  -3   0
23 W  -2   2   1   1  -4   0   1   0   2  -1  -3   0  -3   2  -3   1  -2   3  -2  -3
24 D  -3   0   0   4  -4  -1   3  -3   1  -2   0   0  -2  -4   0  -2   0  -5  -3  -1
25 Q  -2   0   1   0  -4   2   3   0  -2  -1  -4  -1  -3  -3  -3   1   2  -4   0  -3
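Scoring a peptide against a PSSM is a per-position lookup rather than a lookup in one shared matrix. The sketch below uses the six rows shown on the slide (positions 20 to 25). The function name and the test peptides are illustrative, and the code covers only the scoring step, not how PSI-BLAST builds the matrix.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows copied from the slide: position -> scores in COLUMNS order.
PSSM = {
    20: [0, 0, 3, -2, -4, 2, 0, 0, -2, 0, 0, 2, -2, -4, -3, 2, 0, -5, -3, -3],
    21: [-2, 0, 3, 0, -4, 0, 0, 0, -2, -4, -4, 1, -3, -4, -3, 2, 2, 4, -3, -3],
    22: [1, 0, 2, -2, -3, 0, -2, 1, 2, -2, 0, 1, -2, -3, -3, 1, -2, -4, -3, 0],
    23: [-2, 2, 1, 1, -4, 0, 1, 0, 2, -1, -3, 0, -3, 2, -3, 1, -2, 3, -2, -3],
    24: [-3, 0, 0, 4, -4, -1, 3, -3, 1, -2, 0, 0, -2, -4, 0, -2, 0, -5, -3, -1],
    25: [-2, 0, 1, 0, -4, 2, 3, 0, -2, -1, -4, -1, -3, -3, -3, 1, 2, -4, 0, -3],
}


def pssm_score(peptide, start=20):
    """Sum the position-specific scores of `peptide` aligned from `start`."""
    return sum(PSSM[start + offset][COLUMNS.index(residue)]
               for offset, residue in enumerate(peptide))


print(pssm_score("NSGWDQ"))   # the residues labelling the six rows
print(pssm_score("NSGWEQ"))   # D -> E at position 24
```

The same residue can score differently at different positions (for example, N scores 3 at position 20 but 0 at position 24), which is what lets a profile capture position-specific conservation that a general matrix cannot.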
30There are 2 Blast Variants
- NCBI BLAST (http://ncbi.nlm.nih.gov/BLAST/) or via local install
- WUBLAST (see http://blast.wustl.edu/ for information). This program is most often used at database web sites and for local installs.
31Available GenBank Peptide Sequence Databases
nr                 All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF
month              All new or revised GenBank CDS translation + PDB + SwissProt + PIR + PRF released in the last 30 days.
swissprot          Last major release of the SWISS-PROT protein sequence database (no updates)
Drosophila genome  Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP).
yeast              Yeast (Saccharomyces cerevisiae) genomic CDS translations
ecoli              Escherichia coli genomic CDS translations
pdb                Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
Patent             Protein sequences derived from the Patent division of GenBank
32Available Genbank Nucleotide Sequence Databases
nr                 All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
month              All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days.
Drosophila genome  Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP).
dbest              Database of GenBank + EMBL + DDBJ sequences from EST Divisions
dbsts              Database of GenBank + EMBL + DDBJ sequences from STS Divisions
htgs               Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)
gss                Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
33Available GenBank Nucleotide Databases continued
yeast              Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
E. coli            Escherichia coli genomic nucleotide sequences
pdb                Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
Patent             Nucleotide sequences derived from the Patent division of GenBank
vector             Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/
mito               Database of mitochondrial sequences
alu                Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
34Essential BLAST Parameters
- W = word size
- T = neighborhood word score threshold (varies by word size and matrix used)
- V = number of descriptions to report
- B = number of alignments to report
- M = value of a nucleotide match
- N = value of a nucleotide mismatch
- X = word hit extension drop-off score
- E = expected frequency of chance occurrences
- S = score at which a single HSP would satisfy E
- -matrix defines a matrix to use
- -filter defines a specific filter program
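The link between E and S can be made concrete with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below is illustrative only: K and lambda depend on the scoring matrix and gap penalties, and the values used here (0.13 and 0.32) are placeholders, as are the lengths.

```python
import math


def expected_hits(score, m, n, K, lam):
    """Expected number of chance HSPs with at least this score."""
    return K * m * n * math.exp(-lam * score)


def score_for_evalue(e_value, m, n, K, lam):
    """Score S at which a single HSP would satisfy the given E."""
    return math.log(K * m * n / e_value) / lam


# Placeholder parameters for illustration only.
m, n = 300, 1_000_000          # query length, database length
K, lam = 0.13, 0.32

s = score_for_evalue(1.0e-5, m, n, K, lam)
print(f"S needed for E = 1e-5: {s:.1f}")
print(f"E at that score: {expected_hits(s, m, n, K, lam):.2e}")
```

Because n appears in the formula, the same alignment score becomes less significant as the database grows. That is why S is derived from E for each search rather than being fixed.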
35Command line BLAST
- Format: algorithm db query options
- Example: blastp nr myprot.txt -matrix=pam70 V=10 B=10
- Example: blastn nt mynuc.txt M=5 N=-4 E=1.0e-5
- Example: blastn nt mynuc.txt M=5 N=-4 E=1.0e-5 > blast.out
36Making your own BLAST DB
- Any file of FASTA-formatted sequences can be turned into a BLAST DB.
- How you do this depends on which BLAST variant you are using.
- NCBI BLAST, protein DB: setdb myseqfile
- NCBI BLAST, nucleotide DB: pressdb myseqfile
- WUBLAST, protein DB: formatdb -p myseqfile
- WUBLAST, nucleotide DB: