Title: LSM2104 Part II Lectures 3 and 4
1LSM2104Part IILectures 3 and 4
- Tan Tin Wee
- Dept of Biochemistry
- NUS
2Topics
- Derivation of PAM and BLOSUM and Scoring Matrices
- Basic Introduction to Dynamic Programming
Techniques - Details of Needleman-Wunsch and Smith-Waterman
Algoriths - FASTA and BLAST Heuristics
- Database Similarity Searches and Statistics of
the E Value - Multiple Sequence Alignments
3PAM
- PAM (Percent Accepted Mutations) similarity
matrix - PAM matrices pertaining to protein sequences are
constructed using amino acid similarities in
evolutionarily related sequences. They show
probability scores of replacement of amino acids
by each other based on natural mutation rates in
related protein families. Hence, these matrices
are sometimes called "substitution matrices". - A score above zero assigned to two amino acids
indicates that these two replace each other more
often than expected by chance alone. ie., they
are functionally exchangeable. - A negative score (below zero) indicates that the
two amino acids are rarely interchangeable. eg.,
a basic amino acid for an acidic one or one with
an aromatic side chain for one with aliphatic
side chain. - The number 250 in PAM250 corresponds to an
average of 250 amino acid replacements per 100
residues from a data set of 71 aligned sequences
Ref Dayhoff, M, Schwartz, RM, Orcutt, BC (1978)
A model of evolutionary change in proteins. in
Atlas of Protein Sequence and Structure, vol 5,
sup. 3, pp 345-352. M. Dayhoff ed., National
Biomedical Research Foundation, Silver Spring,
MD.. The higher the matrix number, the farther
the evolutionary distance between the compared
sequences
4PAM
- Margaret Dayhoff et al (1970s)
- First to assemble sequences into protein seq
atlas families and superfamilies based on seq
similarity - Tables of frequency of changes/mutations observed
in the sequences of a family derived. - Percent amino acid mutations accepted by
evolutionary selection or PAM Tables derived. - Shows probability that one amino acid change into
any other in these families - Shows which amino acids are most conserved at the
corresponding position in two sequences. - Assess the probabilities of residue replacement
in related proteins - Model of molecular evolution
- 1 PAM average change in 1 of all amino acid
possibilities - 100 PAMs does not mean every residue is changed
- Time is not correlated with PAM ie. Different
families of proteins evolve at different rates
5BLOSUM
- BLOSUM (Blocks Substitution Matrix) matrix
- These are substitution matrices derived from the
observed frequencies of amino acid replacements
in highly conserved regions of ungapped local
alignments. The data for the substitution scores
in these matrices come from about 2000 blocks of
aligned sequence segments characterizing more
than 500 groups of related proteins Ref
Henikoff, S., and Henikoff, J. G. (1992) Proc.
Natl. Acad. Sci. USA 89 10915-10919 - The BLAST server from NCBI and the search servers
from EBI use different versions of the BLOSUM
matrix for protein similarity searches and
alignments.
6BLOSUM Scoring Matrices
- Block Substitution Matrix
- Henikoff and Henikoff PNAS 1992
- Number indicate percent identity within set eg.
BLOSUM62 means 62 identity - Finds short highly similar sequences
7Use of these matrices
- PAM, BLOSUM, and other matrices
- Used as scoring matrices for sequence alignment
programs
8Elements of Dynamic Programming
- Dynamic programming is a computational method
that is used in bioinformatics and computational
biology to align two protein or nucleic acid
sequences - Every pair of characters in the two sequences are
compared pairwise - O(MN)
- Dynamic programming algorithms proven
mathematically to generate the best or OPTIMAL
alignment between two sequences under a given set
of conditions - http//bioinformatics.weizmann.ac.il/courses/BCG/l
ectures/02_pairwise/2.4tools/index.html - http//www.sbc.su.se/per/molbioinfo2001/dynprog/d
ynamic.html
9Example of Dynamic programming algorithm
- Needleman-Wunsch techniques. For this example,
the two sequences to be globally aligned are G A
A T T C A G T T A (sequence 1) G G A T C G A
(sequence 2) - So M 11 and N 7 (the length of sequence 1
and sequence 2, respectively) - A simple scoring scheme is assumed where
- Si,j 1 if the residue at position i of sequence
1 is the same as the residue at position j of
sequence 2 (match score) otherwise - Si,j 0 (mismatch score)
- w 0 (gap penalty)
10- Three steps in dynamic programming
- Initialization
- Matrix fill (scoring)
- Traceback (alignment)
- Good detailed description
- http//www.sbc.su.se/per/molbioinfo2001/dynprog/a
dv_dynamic.html
11Needleman-Wunsch
- S.B. Needleman and C.D. Wunsch. A general method
applicable to the search for similarities in the
amino acid sequence of two proteins. J. Mol.
Biol., 48443--453, 1970 - General Algorithm for Sequence Comparison
- Compuationally expensive O(M,N) where M and N are
lengths of sequences being pairwise compared - To Calculate the alignment score S(i,j), you only
need to enumerate and score all ways in which one
aligned pair can be added to a shorter alignment
to produce an alignment of the first I residues
of sequence A and the first j residues of
sequence B
12- All possible pairs are calculated in a two-D
array - All possible alignments are represented by
pathways through this array, the shortest being
the most optimal - Global alignments every residue of the two
sequences A and B is involved in the calculation. - Misses out on specific motifs or active sites
homology, or domains within a longer sequence
which match each other, where other parts dont
match
13Step 1 assign similarity values
- Assign Similarity Scores
- A numerical value/score is assigned to every cell
in the 2D array, based on the scoring matrix. - The similarity scores e.g. unitary soring matrix,
PAM, BLOSUM etc.
14Step 2 for each cell trace back all possible
pathways back to the start allowing for indels,
giving cell the value of the pathway with the max
score
- Score pathways through the 2D Array
- For each cell, calculate the maximum score for an
alignment ending at that point. - Compute Cumulative score by adding scores in a
path through the matrix - Search subrow and subcolumn for the highest score
- Include Gap penalty and Gap extension penalty
- The best optimal alignment is the pathway with
the highest score - Max match is the largest number of amino acids of
one protein that can be matched with those of
another protein while allowing for gaps
15Step 3 traceback
- Construct the alignment or pathway back from the
highest scoring cell to give the highest scoring
alignment - Examples of implementations GAP (in GCG package)
16- N-W statistical significance
- Probability of obtaining the same result of the
maximum score from a pair of random sequences - Estimated using
- Form pairs of random sequences by randomly
drawing one member from each set e.g. with the
same amino acid composition as the real sequences - If the value calculated for real proteins is
significantly different from that of the random
proteins, then it is likely that the alignment is
due to the specific sequences rather than
accidentally due to their composition.
17Smith Waterman Algorithm
- Smith, T. F., Waterman, M. S., J. Mol. Biol.
(1981) 147195-197 - Based on Needleman Wunsch Algorithm
- Local Alignment instead of N-W global alignment
- Examples of implementations BESTFIT (in GCG
package) - S-M Algo compares segments of all possible
lengths (local alignments) rather than entire
length (global alignments) in the NW algo and
chooses whichever optimises the similarity score
of these local alignments
18- Comparing two sequences A (a1 a2 a3 ... an) and
B (b1 b2 b3 ... bm) - HijmaxHi-1, j-1 s(ai,bj), maxHi-k,j -Wk,
maxHi, j-l -Wl, 0 - Hij is the maximum similarity of two segments
ending in ai and bj respectively - Including the alternative that the similarity
score be zero in the expression for Hij allows
the local alignment to restart at any pair of
aligned residues. In addition, it makes the
calculation much more sensitive to the precise
match and mismatch scores and gap penalties.
19- For every cell, the SW algo computes all possible
paths leadng to it, including paths of any
length, and not necessarily leading from the
beginning of the sequence (unlike the global
alignment). - So paths can be of any length, starting anywhere
in the sequences being compared, and can contain
insertions or deletions. - Relies on gap penalties to inhibit extensions
20Forming the optimal alignment diagonal path
- For every residue in the query sequence
- Align with next residue of the db target sequence
Previous score plus similarity score for the
two residues - Deletion score is previous score minus gap
penalty and gap size penalty query seq with gap - Insertion score is previous score minus gap
penalty and gap size penalty db target with gap - Stop when score is zero.
- Choose whichever above is highest
21SW alignment
- Score in each cell is max possible score for an
alignment of any length (not necessarily from the
beginning unlike the NW) ending at that cell - Trace pathway back form the highest scoring cell
- This cell can be anywhere in the 2D array
- Align the highest scoring segment
- http//www.maths.tcd.ie/lily/pres2/
22NW vs SW
- Global vs local alignment
- NW Alignment score for a pair of residues must be
gt 0 while SW can be or - No gap penalty required for NW to work properly,
but SW needs gap penalty - Score cannot decrease between two cells of a
pathway for NW, but for SW, in getting to an
alignment, score can increase, decrease or stay
level between two cells of a pathway
23BLAST and FASTA
- Heuristic approximations of N-W and S-M
algorithms - Reduce computation and increase speed
- NW and SM exhaustive, guarantee optimal
- BLAST is statistically sound. But it is not
mathematically guaranteed to produce the best
optimal alignment
24FASTA
- Pearson, W. R., Lipman, D. J., P.N.A.S. (1988)
852444-2448 - 1. Lookup Table
- http//acer.gen.tcd.ie/amclysag/fasta.html
- FastA is a family of heuristic algorithms
developed by William Pearson of the University of
Virginia. FastA lies between BLAST and
Smith-Waterman in both accuracy and speed. An
optimized FastA option makes use of
Smith-Waterman for part of the alignment process.
The FastA family includes DNA to DNA, protein to
protein, and translation searches.
25FASTA
- http//workbench.sdsc.edu
- Rapid Global Alignment
- Allows alignments to shift frames
- LALIGN a FASTA derivative for local alignments
- Works for internal repeats missed by FASTA
because of gaps (now there is gapped BLAST
BLAST 2.0)
26Basic Local Alignment Search Tool BLAST
- Altschul, S. F., Gish, W., Miller, W., Myers, E.
W., Lipman, D. J., J. Mol. Biol. (1990)
215403-410 - Strategy is to Optimise Maximal Segment Pair
(MSP) score - Heuristic modification of the SW dynamic
programming algo - Ungapped alignments only
- BLAST1 uses PAM120 matrix for protein alignments
- DNA alignment score matrix 5 for match and 4
for mismatch
27BLAST is heuristic
- BLAST (Basic Local Alignment Search Tool) is a
heuristic algorithm first released in 1990 by the
National Center for Biotechnology Information
(NCBI). The original BLAST compares very short
segments of the query and database sequences in
search of alignments that exceed a particular
score. No insertions or deletions are considered
during this process. If the required score is
exceeded, BLAST investigates whether the aligned
segments can be lengthened to produce a
higher-scoring alignment. In the last two years,
new and improved versions of BLAST have been
released. These newer releases consider
insertions and deletions to some extent producing
better results. BLAST offers several search
types DNA to DNA, protein to protein, and
translation searches.
28MSP max segment pair
- Similarity Score for a segment is the Sum of
Scores for individual pairs of residues. - MSP is highest scoring segment
- MSP is locally max if score cannot be increased
by extending or shortening both segments - Search for all locally maximal MSPs above some
cutoff/threshold. - Word pair of fixed length w
- Look for scores above threshold T
- Scan target db sequence for word of length w with
score geT - Find word hits - Extend word hits to see if it is part of a longer
segment with the segment pair with score at least
S where S is highest MSP score at which chance
similarities are likely to appear
29Algorithm
- Compile a list of high scoring words of length w
(w4 for proteins and 12 for nt) for query
sequence - Scan for word hits of score greater than
threshold T against db target sequence - Extend word hit in both directions to find High
Scoring Pairs with scores greater than S
30Step 1 compile list of high scoring words
- Generate All words of w length w-mers in query
sequence - Look for w-mers with score at least T when
matched with target sequence - Speed of program is dependent on list generated
in a time proportional to the length of the list
unlike SW or NW which is proportional to the
product of the lengths.
31Step 2. Scanning Phase
- Scan DB
- Search long sequences for occurrences of words in
the list - Use deterministic finite automaton or finite
state machine at approx 0.5M residues a sec
32Step 3. Extending Hits
- Extend w-mers
- Stop when score falls a certain distance below
highest so far.ie the highest score for shorter
or no extensions. - http//acer.gen.tcd.ie/amclysag/blast.html for
DNA
33Other more advanced programs
- See http//www.paracel.com/faq/faq_algorithm_prime
r.html - Gapped Blast (BLAST 2.0) extend words from no-gap
to gapped, can generate gapped alignments - PSI Blast Position specific iterated BLASTie.
Use gapped BLAST, generate a profile from
multiple iterations. This profile is used instead
of the input and distance matrix
34Limitations to BLAST for sequence database
searching
- Needs islands of strong similarity
- Limits on scoring and penalty values
- Blastx, tblastn and tblastx use 6 frame
translation but miss sequences with frameshifts - Finds and reports only local alignments
35BLAST rules of thumb
- For short aas 20-40, 50 identity happens by
chance - If A homologus to B, B to C then A and C are
homologous - Similarity can be present even if there is
absence of homology e.g. low complexity
transmembrane and coiled coil regions.
36BLAST significance
- S (lambdaS-lnK)/ln2
- Lambda and K are associated with the scoring
system - S with a given E, is sigificant if it is greater
than N/E where N is the size of the search space.
37Significance
- A common and simple test to determine if the
alignment of two sequences is statistically
significant is to carry out a simple permutation
test. This consists of - Randomly rearrange the order of one or both
sequences - Align the permuted sequences
- Record the score for this alignment
- Repeat steps 1-3 a large number of times.
38- http//helix.biology.mcmaster.ca/721/outline2/node
40.htmlSECTION00630000000000000000 - For protein sequences Doolittle's rule of thumb
is that greater than 25 identity will suggest
homology, less than 15 is doubtful and for those
cases between 15-25 identity a strong
statistical argument is required. Personally, I
would prefer the statistical test in all cases,
since they are easy to do and things such as
internal repeats and unusual amino acid
compositions can sometimes confuse the picture.
39- Gumbel Extreme Value Distribution
- Make many random protein sequences of varying
lengths - Locally align two sequences of a given length
with the same scoring matrix and gap penalty - Repeat for many pairs of sequences of approx same
length - Plot distribution of scores
- See variation from normal distribution curve
skew the Gumbel extreme value distribution
40- Probability that a score S between two unrelated
sequence is equal to or greater than a value x,
P(Sgtx) is given by P 1- exp (-Kmn e -lambdax) - Where m and n are sequence lengths, and lambda is
scaling factor for the scoring matrix used, - K is a constant that depends on the scoring
matrix and the gap penalty combination used. - For significant matches, we are looking for v.
low value of P, less than 0.01 to 0.05. - Calculate another value E, the expect value for
the alignment score that depends on how we found
P. E is used by BLAST. - For BLOSUM62, gap 11/-1, lambda 0.3 and K0.1
approx
41Sequence scrambling method shuffling
- Align two sequences and obtain optimal score S
- Scramble one of the sequences thousands of times
(N) and align it with the first sequence and
obtain a distribution of scores, scrambled
sequences should preserve same composition - Fit the scores into an extreme value distribution
and calculate lambda and K - Calculate P for probability that one of the
scrambled pairs will exceed optimal score S - Calculate E(expect value), which is usually P x
the number of sequence pairs compared. Eg. If P
10-6, and N 10,000 then E10-2 - E is the number we want to be lt0.01 to 0.05
42Tips on how to assess statistical significance
- http//www.ncbi.nlm.nih.gov/Education/BLASTinfo/Bl
ast_output.html
43- Extreme-Value Distribution Probability density
function for distribution of local alignment
score. - Probability density function
- Several illustrative visual examples may be found
in - Karlin, Samuel and Altschul, Stephen,
Applications and statistics for multiple
high-scoring segments in molecular sequences
Proc. Natl. Acad. Sci. USA Vol. 90, pp.
5873-5877, June 1993.
44- Take an example (ChenXin)
- Suppose the genome of B.pseudomallei has some
base pairs that could code for 10,000,000 amino
acids. Now, we want to know if this genome could
code for a very short sequence "MF". What would
be the possibility of finding a sequence "MF" in
the sequence coded by the B.Pse... genome? - Math have it 1 - ( (1- (1/20)(1/20) )
9,999,999 ) - Since we have 20 amino acids, the possibility of
getting a sequence of "MF" is (1/20)(1/20). The
possibility of the sequence is not "MF" is
1-(1/20)(1/20). In the genome, there are
9,999,999 possible starting point of a length-two
sequence. So, the possibility that they are all
not "MF" is (1- (1/20)(1/20) ) 9,999,999. So
the possibility of finding a "MF" is 1 - ( (1-
(1/20)(1/20) ) 9,999,999 ). If you know well
about math, you will know this value is nearly
1(100) which means you are very, very likely to
find the sequence. - The sequence "MF" is arbitrarily dictated by me
which have no biological significance. That means
the existence of "MF" is because of chance (by
distribution), not because of the evolutionary
pressure. The sequence we are interested are
those developed and fixed through the
evolutionary process by the force of natural
selection (survival of the fittest). So, we wish
to calculate how possible the match is because of
chance (by distribution). - This is the E value. So the smaller the E value
is (The less possible the match can occur by
distribution), the more biological significance
it have (survived through selection).
45Multiple Sequence Alignments
- NW algorithm can be extended from pairwise to gt 2
sequences. Matrix become n dimensional. - O(MN) to O(Nm)
- just 100 nucleotides from 5 species, this is
- 100e5 10,000,000,000 operations
- Eg. Like alignment of 2 seqs 100,000 nt long
- Most popular is Higgins and Sharp (1989) Clustal
464 Steps in CLustal
- All pairwise similarity scores calculated using
Rapid alignment methods - Create a similarity matrix and to cluster the
sequences based on this similarity using a
cluster algorithm - Create an alignment of clusters via a consensus
method. - Create a progressive multiple alignment by
sequentially aligning groups of sequences
according to their branching order in the
clustering. - ClustalW a local alignment algorithm
- ClustalV the older global alignment method
47(No Transcript)
48- MSA ( Gupta, Kececioglu Schaffer, 1995)
- MACAW ( Schuler, Altschul Lipman, 1991).
- a multiple dimensional dot plot and then look for
dots that are common to each group ( Vingron
Argos 1991).)
49- Need to construct alignment and phylogeny at the
same time. - Eg TREEALIGN
- When all is said and done, people will still find
that the alignments produced by the programs can
be improved by a judicious and critical
examination by eye. Spending time to slowly and
carefully re-examine your alignments by hand is
recommended - http//helix.biology.mcmaster.ca/721/outline2/node
37.html
50FastDNAML Phylogenetic analysis Construction of
a Phylogenetic tree http//www.bic.nus.edu.sg/ bi
ogrid/phylo/cascon.html
51Major Milestones in bioinformatics
- http//www.ncbi.nlm.nih.gov/Education/BLASTinfo/mi
lestones.html - 1962 Pauling's theory of molecular evolution
- 1965 Margaret Dayhoff's Atlas of Protein
Sequences - 1970 Needleman-Wunsch algorithm
- 1977 DNA sequencing and software to analyze it
(Staden) - 1981 Smith-Waterman algorithm developed
- 1981 The concept of a sequence motif (Doolittle)
- 1982 GenBank Release 3 made public
- 1982 Phage lambda genome sequenced
- 1983 Sequence database searching algorithm
(Wilbur-Lipman) - 1985 FASTP/FASTN fast sequence similarity
searching - 1988 National Center for Biotechnology
Information (NCBI) created at NIH/NLM - 1988 EMBnet network for database distribution
- 1990 BLAST fast sequence similarity searching
- 1991 EST expressed sequence tag sequencing
- 1993 Sanger Centre, Hinxton, UK
- 1994 EMBL European Bioinformatics Institute,
Hinxton, UK - 1995 First bacterial genomes completely sequenced
- 1996 Yeast genome completely sequenced