Title: Pairwise Sequence Alignment Part 1
1Pairwise Sequence Alignment Part 1
- VIBE Education Edition (VIBE-Ed) Initiative
2Overview
- Part 1 Introduction
- Why compare sequences?
- Dynamic Programming Algorithms (Global vs. Local)
- Heuristic algorithms (K-tuple / Word-size)
- Scoring Matrices
- Part 2 Statistics of Similarity Searches
- Scoring Matrices, contd
- Statistics of similarity searching
3Why compare sequences?
- Nature is conservative
- Incremental modifications give rise to genetic diversity and novel function
- Detection of similarity between sequences allows us to transfer information about one sequence to other similar sequences with reasonable, though not always total, confidence
4Sequence Alignment
- Before we can make comparative statements about two (nucleic acid or protein) sequences, we have to produce a pairwise sequence alignment
- What is the optimal alignment between two sequences?
- How is it quantified? How are matches, mismatches, gaps and gap extensions scored?
- Is an optimal alignment always significant? What about random sequences?
5Protein Evolution
- For many proteins, evolutionary history can be traced back > 1 billion years
- Evolutionary time scales / Tree of Life
- Sequence Homology vs. Sequence Similarity
- Homology means common ancestry
6Three alignments, three meanings
(The middle line of each alignment marks identical residues.)
- HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
              G   VK HGKKV  A     AH D        LS LH  KL
  HBB_HUMAN   GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
- HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
                     H  KV      A               L  L   H  K
  LGB2_LUPLU  NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
- HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
              GS    G      D L     H  D   A  AL D      AH
  F11G11.2    GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE
7Pairwise Sequence Alignment Methods
- Dynamic Programming
- Global Alignment (Needleman-Wunsch)
- Local Alignment (Smith-Waterman)
- Word, or k-tuple methods
- FASTA
- BLAST
8Dynamic Programming Algorithm
- Provides the best (optimal) alignment between two sequences
- Includes matches, mismatches and gaps to maximize the number of matched characters
- Scores are assigned to matches, mismatches and gaps (non-affine vs. affine gap penalties)
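The difference between non-affine (linear) and affine gap scoring can be written out as below. This is the usual textbook form rather than something stated on the slide; the symbols g (gap length), d (gap-open penalty) and e (gap-extension penalty) are assumed names.

```latex
\gamma_{\text{linear}}(g) = -g\,d
\qquad
\gamma_{\text{affine}}(g) = -d - (g-1)\,e
```

With a linear penalty every gap position costs the same; with an affine penalty and e < d, one long gap costs less than several short ones.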
9Example
- Find optimal global alignment for sequences
- GAAGA
- and
- GTTTAAG
10Define Rules
- Score for match = 1
- Score for mismatch = -1
- Score for gap = -3
12Implement Rules
- When Moving Horizontally (gap)
- Alignment Score = Existing Score + Gap Score
- When Moving Vertically (gap)
- Alignment Score = Existing Score + Gap Score
- When Moving Diagonally (match or mismatch)
- Alignment Score = Existing Score + Corner Value
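The three move rules can be sketched as a small Needleman-Wunsch routine. This is a minimal illustration, not the slides' own code: the function name `needleman_wunsch` and the variable `corner` (the match or mismatch value for a cell) are assumptions, the gap penalty is linear, and the traceback returns only one of possibly several optimal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-3):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one optimal alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # first column: vertical moves only
    for j in range(1, cols):
        score[0][j] = j * gap          # first row: horizontal moves only
    for i in range(1, rows):
        for j in range(1, cols):
            corner = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + corner,  # diagonal move
                score[i - 1][j] + gap,         # vertical move (gap in b)
                score[i][j - 1] + gap,         # horizontal move (gap in a)
            )
    # Trace back one optimal path from the bottom-right cell
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        corner = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + corner:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("_"); i -= 1
        else:
            out_a.append("_"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GAAGA", "GTTTAAG")
print(score)   # -7 with match 1, mismatch -1, gap -3
print(top)
print(bottom)
```

Ties between the three moves are broken in a fixed order here, which is why only one alignment is reported.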
21Optimal Global Alignments
1) GAAGA__  (score -7)
   GTTTAAG
2) G_A_AGA  (score -7)
   GTTTAAG
3) G__AAGA  (score -7)
   GTTTAAG
4) G_AAGA_  (score -7)
   GTTTAAG
5) GAAG_A_  (score -7)
   GTTTAAG
6) GAA_GA_  (score -7)
   GTTTAAG
22Global vs. Local Alignment
23Smith-Waterman
- Dynamic programming - local alignment
- Compares query to each sequence in database.
- Performs full pairwise comparison.
- More sensitive, but much slower than heuristic
algorithms (BLAST or FASTA)
25Optimal Local Alignment
Sequences: GAAGA and GTTTAAG
Optimal local alignment (score 3):
AAG
AAG
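The local alignment above can be reproduced with a short Smith-Waterman sketch. It assumes the same scores as the global example (match 1, mismatch -1, gap -3) and a linear gap penalty; the function name `smith_waterman` is illustrative. The differences from the global method are that no cell may drop below zero and that the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-3):
    """Local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            corner = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                              # start a fresh local alignment
                score[i - 1][j - 1] + corner,   # diagonal move
                score[i - 1][j] + gap,          # vertical move
                score[i][j - 1] + gap,          # horizontal move
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        corner = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + corner:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("_"); i -= 1
        else:
            out_a.append("_"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GAAGA", "GTTTAAG"))   # (3, 'AAG', 'AAG')
```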
26Heuristic (word or k-tuple based) algorithms
- Make reasonable assumptions about nature of sequence alignments and try out only most likely alignments
- Find perfect match (word, k-tuple)
- Extend alignment until
- One or the other sequence ends
- Score drops below a threshold
- Much faster than dynamic programming methods, but less sensitive
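The find-then-extend idea can be sketched as follows. This is a simplified illustration, not how FASTA or BLAST is actually implemented: the names `extend_right`, `seed_and_extend` and the `drop` parameter are hypothetical, extension is ungapped and goes to the right only, and the stopping rule is taken here to mean "the running score falls more than `drop` below the best score seen".

```python
def extend_right(q, s, qi, si, drop=2, match=1, mismatch=-1):
    """Extend an ungapped alignment to the right, starting at q[qi] and s[si].

    Returns (best_gain, best_length) of the best-scoring extension.
    """
    score = best = 0
    length = best_len = 0
    # Stop when one or the other sequence ends
    while qi + length < len(q) and si + length < len(s):
        score += match if q[qi + length] == s[si + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop:     # score dropped too far below the best
            break
    return best, best_len


def seed_and_extend(q, s, k=3, drop=2):
    """Find exact k-letter word matches, extend each one, return the best hit.

    A hit is (score, query start, subject start, length), or None if no word matches.
    """
    hits = []
    for qi in range(len(q) - k + 1):
        word = q[qi:qi + k]
        si = s.find(word)
        while si != -1:
            gain, extra = extend_right(q, s, qi + k, si + k, drop)
            hits.append((k + gain, qi, si, k + extra))
            si = s.find(word, si + 1)   # look for further occurrences
    return max(hits, default=None)


print(seed_and_extend("GAAGA", "GTTTAAG"))   # (3, 1, 4, 3): the word AAG
```

Only positions that share an exact word are ever examined, which is where the speed comes from and also why a weak similarity with no perfect word match is missed.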
27FASTA (Pearson and Lipman 1988)
- Combination of word (k-tup) search and Smith-Waterman algorithm
- The query sequence is divided into small words of a certain size.
- The initial comparison of the query sequence to the database is performed using these words.
- If these words are located on the same diagonal in an array, the region surrounding the diagonal is analyzed further.
- Search time is only proportional to size of database, not (database × query sequence)
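The word lookup and the "same diagonal" test can be illustrated with a small sketch. It is an assumption-laden simplification of the first FASTA step: the function name `ktup_diagonals` is hypothetical, and a diagonal is identified by its offset (position in the database sequence minus position in the query).

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count shared words on each diagonal (offset = subject pos - query pos)."""
    # Lookup table: word -> list of positions where it occurs in the query
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    # Single pass over the subject: every word hit votes for one diagonal
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in table.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    return diagonals


print(ktup_diagonals("GAAGA", "GTTTAAG").most_common(1))   # [(3, 2)]
```

Because the query table is built once and each database sequence is scanned once, the work grows with the size of the database rather than with every query-by-database cell.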
28The ktup value
- The ktup (for k-tuples) value stands for the length of the word used to search for identity.
- For proteins a ktup value of 3 would give a hash table of 20^3 elements (8000 entries).
- The higher the ktup value the less likely you will get a match unless it is identical (remember the dot plots).
- The lower the ktup value the more background you will have
- The higher the ktup value the faster analysis (fewer diagonals).
Typical ktup values
ktup  analysis
1     proteins - distantly related
2     proteins - somewhat related (default)
3     DNA - default
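The table-size arithmetic and the link between ktup and background can be checked with a few lines. The helper names are hypothetical, and the chance-hit estimate assumes uniform, independent letters, which real sequences only approximate.

```python
def word_table_size(alphabet_size, ktup):
    """Number of distinct words of length ktup over the alphabet."""
    return alphabet_size ** ktup


def expected_chance_hits(m, n, alphabet_size, ktup):
    """Expected number of identical word pairs between two random sequences
    of lengths m and n (assumes uniform, independent letters)."""
    words_q = max(m - ktup + 1, 0)
    words_s = max(n - ktup + 1, 0)
    return words_q * words_s / word_table_size(alphabet_size, ktup)


for k in (1, 2, 3):
    # 20 amino acids; two random proteins of length 100
    print(k, word_table_size(20, k), expected_chance_hits(100, 100, 20, k))
```

Under these assumptions each increase in ktup cuts the expected number of chance word hits by roughly a factor of 20 for proteins.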
29FASTA Steps
[Figure: four numbered panels illustrating the FASTA steps. Labels recovered from the figure: "Local regions of identity are found"; "Identical offset values in a contiguous sequence"; "Different offset values"; "Diagonals are extended"; "Rescore the local regions using scoring matrices"; "Eliminate short diagonals below a cutoff score"; "Create a gapped alignment in a narrow segment and then perform S-W alignment"]
30Summary of FASTA steps
- 1. Analyzes database for identical matches that are contiguous.
- 2. Longest diagonals are scored again using the PAM matrix (or other matrix). The best scores are saved as init1 scores.
- 3. Short diagonals are removed.
- 4. Long diagonals that are neighbors are joined. The score for this joined region is initn. This score may be lower due to a penalty for a gap.
- 5. A S-W dynamic programming alignment is performed around the joined sequences to give an opt score.
- Thus, the time-consuming S-W step is performed only on top scoring sequences
31FASTA Versions
- fasta: compares a protein sequence to a protein database or nucleotide sequence to a nucleotide database
- fastx, fasty: compare a translated query sequence to a protein sequence database (forward or backward translation of the query)
- tfastx, tfasty: compare a protein query sequence to a nucleotide sequence database that has been translated into three forward and three reverse reading frames
32BLAST (Karlin and Altschul 1990)
- Basic Local Alignment Search Tool
- Database is pre-indexed to increase speed
- The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix.
- Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S".
- The "T" parameter dictates the speed and sensitivity of the search.
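The role of "W" and "T" can be illustrated by listing, for each query word, every word that scores at least T against it. This is a brute-force sketch for a tiny alphabet, not BLAST's actual implementation; `neighborhood_words`, `toy_score` and the residue groups are hypothetical stand-ins for a real substitution matrix.

```python
from itertools import product


def neighborhood_words(query, W, T, score, alphabet):
    """For each W-letter query word, list all words scoring >= T against it."""
    hits = {}
    for i in range(len(query) - W + 1):
        qword = query[i:i + W]
        neighbors = []
        for cand in product(alphabet, repeat=W):
            s = sum(score(a, b) for a, b in zip(qword, cand))
            if s >= T:
                neighbors.append("".join(cand))
        hits[(i, qword)] = neighbors
    return hits


# Hypothetical toy scoring: +2 identical, +1 within a made-up group, -1 otherwise
GROUPS = [set("IL"), set("KR")]


def toy_score(a, b):
    if a == b:
        return 2
    return 1 if any(a in g and b in g for g in GROUPS) else -1


print(neighborhood_words("KIL", W=3, T=5, score=toy_score, alphabet="IKLR"))
print(neighborhood_words("KIL", W=3, T=4, score=toy_score, alphabet="IKLR"))
```

Lowering T lets more words count as hits, so more extensions are attempted: the search becomes more sensitive but slower.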
33BLAST Versions
- BLASTN: Compares a nucleotide query to a nucleotide database
- BLASTP: Compares a protein query to a protein database
- BLASTX: Compares a translated nucleotide query to a protein database.
- TBLASTN: Compares a protein query to a translated nucleotide database.
- TBLASTX: Compares a translated nucleotide query to a translated nucleotide database.
34Scoring Matrices
- The alignment score represents the ratio of the odds of obtaining that score between sequences known to be related to the odds of obtaining it by chance alignment between unrelated sequences
- When the correct scoring matrix is used, alignment statistics are meaningful
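The odds interpretation is commonly written as a log-odds score. The notation below is assumed rather than taken from the slides: q_ab is the probability of seeing residues a and b aligned in related sequences, p_a and p_b are the background frequencies of each residue, and S is the score of an ungapped alignment x, y of length n.

```latex
s(a,b) = \log \frac{q_{ab}}{p_a\, p_b}
\qquad
S = \sum_{i=1}^{n} s(x_i, y_i)
  = \log \frac{\prod_i q_{x_i y_i}}{\prod_i p_{x_i}\, p_{y_i}}
```

A positive entry means the pair is seen more often in related sequences than expected by chance; a negative entry means less often.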
35Dayhoff PAM Matrix (Point Accepted Mutation)
- Lists the likelihood of change from one amino acid to another in homologous protein sequences during evolution
- PAM 1 estimated using 1572 changes in 71 groups of protein sequences that were at least 85% similar
- Assumes each amino acid change at a site is independent of previous changes at the site
- PAM 250 (20% similarity) obtained by multiplying PAM1 by itself 250 times
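The extrapolation from PAM1 to PAM250 is repeated matrix multiplication, which the sketch below shows on a made-up example. `TOY_PAM1` is a hypothetical 3-letter mutation probability matrix, not Dayhoff's data; each row gives the probabilities that a residue stays the same or changes in one step. Published PAM score tables are log-odds values derived from such probabilities, a step not shown here.

```python
def mat_mul(x, y):
    """Product of two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Return m raised to the n-th power by repeated multiplication (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical toy matrix: rows sum to 1, about 1% of sites change per step
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after 250 steps, but the diagonal (probability of seeing the original residue) is far lower than in the one-step matrix, which is why the extrapolated matrix suits distantly related sequences.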
36Blocks Amino Acid Substitution (BLOSUM) Matrix
- Based on the observed amino acid substitutions in blocks (large set of 2000 conserved amino acid patterns)
- Used 500 families of related proteins
- Not based on explicit evolutionary model, but from considering all amino acid changes observed in an aligned region from a related family of proteins.
37PAM250 Scoring Matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
38Summary
- Choose appropriate algorithm (speed vs. sensitivity)
- Use smallest database that will answer your question
- Default matrices may not always give a meaningful score