LSM2104 Part II Lectures 3 and 4

About This Presentation

Title:

LSM2104 Part II Lectures 3 and 4

Description:

Derivation of PAM and BLOSUM and Scoring Matrices. Basic Introduction to Dynamic Programming Techniques ... MSA ( Gupta, Kececioglu & Schaffer, 1995) MACAW ( Schu ... – PowerPoint PPT presentation

Number of Views:42

Avg rating:3.0/5.0

Slides: 52

Provided by: everestB

Category:

more less

Transcript and Presenter's Notes

Title: LSM2104 Part II Lectures 3 and 4

1
LSM2104Part IILectures 3 and 4

Tan Tin Wee
Dept of Biochemistry
NUS

2
Topics

Derivation of PAM and BLOSUM and Scoring Matrices
Basic Introduction to Dynamic Programming
Techniques
Details of Needleman-Wunsch and Smith-Waterman
Algoriths
FASTA and BLAST Heuristics
Database Similarity Searches and Statistics of
the E Value
Multiple Sequence Alignments

3
PAM

PAM (Percent Accepted Mutations) similarity
matrix
PAM matrices pertaining to protein sequences are
constructed using amino acid similarities in
evolutionarily related sequences. They show
probability scores of replacement of amino acids
by each other based on natural mutation rates in
related protein families. Hence, these matrices
are sometimes called "substitution matrices".
A score above zero assigned to two amino acids
indicates that these two replace each other more
often than expected by chance alone. ie., they
are functionally exchangeable.
A negative score (below zero) indicates that the
two amino acids are rarely interchangeable. eg.,
a basic amino acid for an acidic one or one with
an aromatic side chain for one with aliphatic
side chain.
The number 250 in PAM250 corresponds to an
average of 250 amino acid replacements per 100
residues from a data set of 71 aligned sequences
Ref Dayhoff, M, Schwartz, RM, Orcutt, BC (1978)
A model of evolutionary change in proteins. in
Atlas of Protein Sequence and Structure, vol 5,
sup. 3, pp 345-352. M. Dayhoff ed., National
Biomedical Research Foundation, Silver Spring,
MD.. The higher the matrix number, the farther
the evolutionary distance between the compared
sequences

4
PAM

Margaret Dayhoff et al (1970s)
First to assemble sequences into protein seq
atlas families and superfamilies based on seq
similarity
Tables of frequency of changes/mutations observed
in the sequences of a family derived.
Percent amino acid mutations accepted by
evolutionary selection or PAM Tables derived.
Shows probability that one amino acid change into
any other in these families
Shows which amino acids are most conserved at the
corresponding position in two sequences.
Assess the probabilities of residue replacement
in related proteins
Model of molecular evolution
1 PAM average change in 1 of all amino acid
possibilities
100 PAMs does not mean every residue is changed
Time is not correlated with PAM ie. Different
families of proteins evolve at different rates

5
BLOSUM

BLOSUM (Blocks Substitution Matrix) matrix
These are substitution matrices derived from the
observed frequencies of amino acid replacements
in highly conserved regions of ungapped local
alignments. The data for the substitution scores
in these matrices come from about 2000 blocks of
aligned sequence segments characterizing more
than 500 groups of related proteins Ref
Henikoff, S., and Henikoff, J. G. (1992) Proc.
Natl. Acad. Sci. USA 89 10915-10919
The BLAST server from NCBI and the search servers
from EBI use different versions of the BLOSUM
matrix for protein similarity searches and
alignments.

6
BLOSUM Scoring Matrices

Block Substitution Matrix
Henikoff and Henikoff PNAS 1992
Number indicate percent identity within set eg.
BLOSUM62 means 62 identity
Finds short highly similar sequences

7
Use of these matrices

PAM, BLOSUM, and other matrices
Used as scoring matrices for sequence alignment
programs

8
Elements of Dynamic Programming

Dynamic programming is a computational method
that is used in bioinformatics and computational
biology to align two protein or nucleic acid
sequences
Every pair of characters in the two sequences are
compared pairwise
O(MN)
Dynamic programming algorithms proven
mathematically to generate the best or OPTIMAL
alignment between two sequences under a given set
of conditions
http//bioinformatics.weizmann.ac.il/courses/BCG/l
ectures/02_pairwise/2.4tools/index.html
http//www.sbc.su.se/per/molbioinfo2001/dynprog/d
ynamic.html

9
Example of Dynamic programming algorithm

Needleman-Wunsch techniques. For this example,
the two sequences to be globally aligned are G A
A T T C A G T T A (sequence 1) G G A T C G A
(sequence 2)
So M 11 and N 7 (the length of sequence 1
and sequence 2, respectively)
A simple scoring scheme is assumed where
Si,j 1 if the residue at position i of sequence
1 is the same as the residue at position j of
sequence 2 (match score) otherwise
Si,j 0 (mismatch score)
w 0 (gap penalty)

Three steps in dynamic programming
Initialization
Matrix fill (scoring)
Traceback (alignment)
Good detailed description
http//www.sbc.su.se/per/molbioinfo2001/dynprog/a
dv_dynamic.html

11
Needleman-Wunsch

S.B. Needleman and C.D. Wunsch. A general method
applicable to the search for similarities in the
amino acid sequence of two proteins. J. Mol.
Biol., 48443--453, 1970
General Algorithm for Sequence Comparison
Compuationally expensive O(M,N) where M and N are
lengths of sequences being pairwise compared
To Calculate the alignment score S(i,j), you only
need to enumerate and score all ways in which one
aligned pair can be added to a shorter alignment
to produce an alignment of the first I residues
of sequence A and the first j residues of
sequence B

All possible pairs are calculated in a two-D
array
All possible alignments are represented by
pathways through this array, the shortest being
the most optimal
Global alignments every residue of the two
sequences A and B is involved in the calculation.
Misses out on specific motifs or active sites
homology, or domains within a longer sequence
which match each other, where other parts dont
match

13
Step 1 assign similarity values

Assign Similarity Scores
A numerical value/score is assigned to every cell
in the 2D array, based on the scoring matrix.
The similarity scores e.g. unitary soring matrix,
PAM, BLOSUM etc.

14
Step 2 for each cell trace back all possible
pathways back to the start allowing for indels,
giving cell the value of the pathway with the max
score

Score pathways through the 2D Array
For each cell, calculate the maximum score for an
alignment ending at that point.
Compute Cumulative score by adding scores in a
path through the matrix
Search subrow and subcolumn for the highest score
Include Gap penalty and Gap extension penalty
The best optimal alignment is the pathway with
the highest score
Max match is the largest number of amino acids of
one protein that can be matched with those of
another protein while allowing for gaps

15
Step 3 traceback

Construct the alignment or pathway back from the
highest scoring cell to give the highest scoring
alignment
Examples of implementations GAP (in GCG package)

N-W statistical significance
Probability of obtaining the same result of the
maximum score from a pair of random sequences
Estimated using
Form pairs of random sequences by randomly
drawing one member from each set e.g. with the
same amino acid composition as the real sequences
If the value calculated for real proteins is
significantly different from that of the random
proteins, then it is likely that the alignment is
due to the specific sequences rather than
accidentally due to their composition.

17
Smith Waterman Algorithm

Smith, T. F., Waterman, M. S., J. Mol. Biol.
(1981) 147195-197
Based on Needleman Wunsch Algorithm
Local Alignment instead of N-W global alignment
Examples of implementations BESTFIT (in GCG
package)
S-M Algo compares segments of all possible
lengths (local alignments) rather than entire
length (global alignments) in the NW algo and
chooses whichever optimises the similarity score
of these local alignments

Comparing two sequences A (a1 a2 a3 ... an) and
B (b1 b2 b3 ... bm)
HijmaxHi-1, j-1 s(ai,bj), maxHi-k,j -Wk,
maxHi, j-l -Wl, 0
Hij is the maximum similarity of two segments
ending in ai and bj respectively
Including the alternative that the similarity
score be zero in the expression for Hij allows
the local alignment to restart at any pair of
aligned residues. In addition, it makes the
calculation much more sensitive to the precise
match and mismatch scores and gap penalties.

For every cell, the SW algo computes all possible
paths leadng to it, including paths of any
length, and not necessarily leading from the
beginning of the sequence (unlike the global
alignment).
So paths can be of any length, starting anywhere
in the sequences being compared, and can contain
insertions or deletions.
Relies on gap penalties to inhibit extensions

20
Forming the optimal alignment diagonal path

For every residue in the query sequence
Align with next residue of the db target sequence
Previous score plus similarity score for the
two residues
Deletion score is previous score minus gap
penalty and gap size penalty query seq with gap
Insertion score is previous score minus gap
penalty and gap size penalty db target with gap
Stop when score is zero.
Choose whichever above is highest

21
SW alignment

Score in each cell is max possible score for an
alignment of any length (not necessarily from the
beginning unlike the NW) ending at that cell
Trace pathway back form the highest scoring cell
This cell can be anywhere in the 2D array
Align the highest scoring segment
http//www.maths.tcd.ie/lily/pres2/

22
NW vs SW

Global vs local alignment
NW Alignment score for a pair of residues must be
gt 0 while SW can be or
No gap penalty required for NW to work properly,
but SW needs gap penalty
Score cannot decrease between two cells of a
pathway for NW, but for SW, in getting to an
alignment, score can increase, decrease or stay
level between two cells of a pathway

23
BLAST and FASTA

Heuristic approximations of N-W and S-M
algorithms
Reduce computation and increase speed
NW and SM exhaustive, guarantee optimal
BLAST is statistically sound. But it is not
mathematically guaranteed to produce the best
optimal alignment

24
FASTA

Pearson, W. R., Lipman, D. J., P.N.A.S. (1988)
852444-2448
1. Lookup Table
http//acer.gen.tcd.ie/amclysag/fasta.html
FastA is a family of heuristic algorithms
developed by William Pearson of the University of
Virginia. FastA lies between BLAST and
Smith-Waterman in both accuracy and speed. An
optimized FastA option makes use of
Smith-Waterman for part of the alignment process.
The FastA family includes DNA to DNA, protein to
protein, and translation searches.

25
FASTA

http//workbench.sdsc.edu
Rapid Global Alignment
Allows alignments to shift frames
LALIGN a FASTA derivative for local alignments
Works for internal repeats missed by FASTA
because of gaps (now there is gapped BLAST
BLAST 2.0)

26
Basic Local Alignment Search Tool BLAST

Altschul, S. F., Gish, W., Miller, W., Myers, E.
W., Lipman, D. J., J. Mol. Biol. (1990)
215403-410
Strategy is to Optimise Maximal Segment Pair
(MSP) score
Heuristic modification of the SW dynamic
programming algo
Ungapped alignments only
BLAST1 uses PAM120 matrix for protein alignments
DNA alignment score matrix 5 for match and 4
for mismatch

27
BLAST is heuristic

BLAST (Basic Local Alignment Search Tool) is a
heuristic algorithm first released in 1990 by the
National Center for Biotechnology Information
(NCBI). The original BLAST compares very short
segments of the query and database sequences in
search of alignments that exceed a particular
score. No insertions or deletions are considered
during this process. If the required score is
exceeded, BLAST investigates whether the aligned
segments can be lengthened to produce a
higher-scoring alignment. In the last two years,
new and improved versions of BLAST have been
released. These newer releases consider
insertions and deletions to some extent producing
better results. BLAST offers several search
types DNA to DNA, protein to protein, and
translation searches.

28
MSP max segment pair

Similarity Score for a segment is the Sum of
Scores for individual pairs of residues.
MSP is highest scoring segment
MSP is locally max if score cannot be increased
by extending or shortening both segments
Search for all locally maximal MSPs above some
cutoff/threshold.
Word pair of fixed length w
Look for scores above threshold T
Scan target db sequence for word of length w with
score geT - Find word hits
Extend word hits to see if it is part of a longer
segment with the segment pair with score at least
S where S is highest MSP score at which chance
similarities are likely to appear

29
Algorithm

Compile a list of high scoring words of length w
(w4 for proteins and 12 for nt) for query
sequence
Scan for word hits of score greater than
threshold T against db target sequence
Extend word hit in both directions to find High
Scoring Pairs with scores greater than S

30
Step 1 compile list of high scoring words

Generate All words of w length w-mers in query
sequence
Look for w-mers with score at least T when
matched with target sequence
Speed of program is dependent on list generated
in a time proportional to the length of the list
unlike SW or NW which is proportional to the
product of the lengths.

31
Step 2. Scanning Phase

Scan DB
Search long sequences for occurrences of words in
the list
Use deterministic finite automaton or finite
state machine at approx 0.5M residues a sec

32
Step 3. Extending Hits

Extend w-mers
Stop when score falls a certain distance below
highest so far.ie the highest score for shorter
or no extensions.
http//acer.gen.tcd.ie/amclysag/blast.html for
DNA

33
Other more advanced programs

See http//www.paracel.com/faq/faq_algorithm_prime
r.html
Gapped Blast (BLAST 2.0) extend words from no-gap
to gapped, can generate gapped alignments
PSI Blast Position specific iterated BLASTie.
Use gapped BLAST, generate a profile from
multiple iterations. This profile is used instead
of the input and distance matrix

34
Limitations to BLAST for sequence database
searching

Needs islands of strong similarity
Limits on scoring and penalty values
Blastx, tblastn and tblastx use 6 frame
translation but miss sequences with frameshifts
Finds and reports only local alignments

35
BLAST rules of thumb

For short aas 20-40, 50 identity happens by
chance
If A homologus to B, B to C then A and C are
homologous
Similarity can be present even if there is
absence of homology e.g. low complexity
transmembrane and coiled coil regions.

36
BLAST significance

S (lambdaS-lnK)/ln2
Lambda and K are associated with the scoring
system
S with a given E, is sigificant if it is greater
than N/E where N is the size of the search space.

37
Significance

A common and simple test to determine if the
alignment of two sequences is statistically
significant is to carry out a simple permutation
test. This consists of
Randomly rearrange the order of one or both
sequences
Align the permuted sequences
Record the score for this alignment
Repeat steps 1-3 a large number of times.

http//helix.biology.mcmaster.ca/721/outline2/node
40.htmlSECTION00630000000000000000
For protein sequences Doolittle's rule of thumb
is that greater than 25 identity will suggest
homology, less than 15 is doubtful and for those
cases between 15-25 identity a strong
statistical argument is required. Personally, I
would prefer the statistical test in all cases,
since they are easy to do and things such as
internal repeats and unusual amino acid
compositions can sometimes confuse the picture.

Gumbel Extreme Value Distribution
Make many random protein sequences of varying
lengths
Locally align two sequences of a given length
with the same scoring matrix and gap penalty
Repeat for many pairs of sequences of approx same
length
Plot distribution of scores
See variation from normal distribution curve
skew the Gumbel extreme value distribution

Probability that a score S between two unrelated
sequence is equal to or greater than a value x,
P(Sgtx) is given by P 1- exp (-Kmn e -lambdax)
Where m and n are sequence lengths, and lambda is
scaling factor for the scoring matrix used,
K is a constant that depends on the scoring
matrix and the gap penalty combination used.
For significant matches, we are looking for v.
low value of P, less than 0.01 to 0.05.
Calculate another value E, the expect value for
the alignment score that depends on how we found
P. E is used by BLAST.
For BLOSUM62, gap 11/-1, lambda 0.3 and K0.1
approx

41
Sequence scrambling method shuffling

Align two sequences and obtain optimal score S
Scramble one of the sequences thousands of times
(N) and align it with the first sequence and
obtain a distribution of scores, scrambled
sequences should preserve same composition
Fit the scores into an extreme value distribution
and calculate lambda and K
Calculate P for probability that one of the
scrambled pairs will exceed optimal score S
Calculate E(expect value), which is usually P x
the number of sequence pairs compared. Eg. If P
10-6, and N 10,000 then E10-2
E is the number we want to be lt0.01 to 0.05

42
Tips on how to assess statistical significance

http//www.ncbi.nlm.nih.gov/Education/BLASTinfo/Bl
ast_output.html

Extreme-Value Distribution Probability density
function for distribution of local alignment
score.
Probability density function
Several illustrative visual examples may be found
in
Karlin, Samuel and Altschul, Stephen,
Applications and statistics for multiple
high-scoring segments in molecular sequences
Proc. Natl. Acad. Sci. USA Vol. 90, pp.
5873-5877, June 1993.

Take an example (ChenXin)
Suppose the genome of B.pseudomallei has some
base pairs that could code for 10,000,000 amino
acids. Now, we want to know if this genome could
code for a very short sequence "MF". What would
be the possibility of finding a sequence "MF" in
the sequence coded by the B.Pse... genome?
Math have it 1 - ( (1- (1/20)(1/20) )
9,999,999 )
Since we have 20 amino acids, the possibility of
getting a sequence of "MF" is (1/20)(1/20). The
possibility of the sequence is not "MF" is
1-(1/20)(1/20). In the genome, there are
9,999,999 possible starting point of a length-two
sequence. So, the possibility that they are all
not "MF" is (1- (1/20)(1/20) ) 9,999,999. So
the possibility of finding a "MF" is 1 - ( (1-
(1/20)(1/20) ) 9,999,999 ). If you know well
about math, you will know this value is nearly
1(100) which means you are very, very likely to
find the sequence.
The sequence "MF" is arbitrarily dictated by me
which have no biological significance. That means
the existence of "MF" is because of chance (by
distribution), not because of the evolutionary
pressure. The sequence we are interested are
those developed and fixed through the
evolutionary process by the force of natural
selection (survival of the fittest). So, we wish
to calculate how possible the match is because of
chance (by distribution).
This is the E value. So the smaller the E value
is (The less possible the match can occur by
distribution), the more biological significance
it have (survived through selection).

45
Multiple Sequence Alignments

NW algorithm can be extended from pairwise to gt 2
sequences. Matrix become n dimensional.
O(MN) to O(Nm)
just 100 nucleotides from 5 species, this is
100e5 10,000,000,000 operations
Eg. Like alignment of 2 seqs 100,000 nt long
Most popular is Higgins and Sharp (1989) Clustal

46
4 Steps in CLustal

All pairwise similarity scores calculated using
Rapid alignment methods
Create a similarity matrix and to cluster the
sequences based on this similarity using a
cluster algorithm
Create an alignment of clusters via a consensus
method.
Create a progressive multiple alignment by
sequentially aligning groups of sequences
according to their branching order in the
clustering.
ClustalW a local alignment algorithm
ClustalV the older global alignment method

47
(No Transcript)
48

MSA ( Gupta, Kececioglu Schaffer, 1995)
MACAW ( Schuler, Altschul Lipman, 1991).
a multiple dimensional dot plot and then look for
dots that are common to each group ( Vingron
Argos 1991).)

Need to construct alignment and phylogeny at the
same time.
Eg TREEALIGN
When all is said and done, people will still find
that the alignments produced by the programs can
be improved by a judicious and critical
examination by eye. Spending time to slowly and
carefully re-examine your alignments by hand is
recommended
http//helix.biology.mcmaster.ca/721/outline2/node
37.html

50
FastDNAML Phylogenetic analysis Construction of
a Phylogenetic tree http//www.bic.nus.edu.sg/ bi
ogrid/phylo/cascon.html
51
Major Milestones in bioinformatics