Title: Bioinformatics 1 lecture 6
1Bioinformatics 1 -- lecture 6
Affine gap penalty Substitution
Matrices PAM BLOSUM Matrix bias in local
alignment
2Reminder scoring an alignment
The score of the alignment is the sum of the
scores of each column (match, deletion or
insertion) in the alignment. Match look up
match score from substitution matrix New gap
use gap initiation penalty Additional gap use
gap extension penalty End gap Optional, may be
zero.
A T S F M
A G L S T F M
3Affine gap exercise
- Which alignment scores the highest?
- Given
- No end gap penalty.
- BLOSUM score.
- Affine gap penalty. (-2, -1)
A T S F M
A G L S T F M
A T S F M
A G L S T F M
A T S F M
A G L S T F M
4Worksheet for affine gap local dynamic programming
Fill in the scores are the traceback letters
using the BLOSUM62 matrix and Gap opening -2,
gap extension -1. Start at 0. End at maximum.
No end gaps (meaning no starting gaps).
D
I
M
L T V K P
L T V K P
Q S I M P R P
Q S I M P R P
5Rules for affine gap penalty DP
Do not penalize end-gaps. I can follow D, and
vice versa, but it is a gap opening. For each
box, write the score and the traceback letter
(M,I,or D)
Mi-1,j-1match score
Mi,j MAX
Ii-1,j-1match score
Di-1,j-1match score
Mi,j-1 - 2
Ii,j MAX
Ii,j-1 - 1
Di,j-1 - 2
Mi-1,j - 2
Di,j MAX
Ii-1,j - 2
Di-1,j - 1
6BLOSUM matrix for match scores
Two 20x20 substitution matrices are used BLOSUM
PAM.
A C D E F G H I K L M N P Q R S T V W Y
4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1
0 0 -3 -2 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3
-3 -3 -1 -1 -1 -2 -2 6 2 -3 -1 -1 -3 -1
-4 -3 1 -1 0 -2 0 -1 -3 -4 -3 5 -3
-2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2
6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1
1 3 6 -2 -4 -2 -4 -3 0 -2 -2 -2
0 -2 -3 -2 -3 8 -3 -1 -3 -2
1 -2 0 0 -1 -2 -3 -2 2 4
-3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1
5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2
4 2 -3 -3 -2 -2 -2 -1 1
-2 -1 5 -2 -2 0
-1 -1 -1 1 -1 -1
6 -2 0 0 1 0 -3 -4 -2
7 -1 -2 -1 -1 -2 -4 -3
5 1 0 -1 -2 -2 -1
5 -1 -1
-3 -3 -2
4 1 -2 -3 -2
5 0 -2 -2
4 -3 -1
11
2
7
Each number is the score for aligning a single
pair of amino acids.
ACDEFGH IKLMNPQRSTVWY
What is the score for this alignment? ACEPGAA
ASDDGTV
BLOSUM62
7Substitution matrices
Read Mount pp94-113
- Used to score aligned positions, usually of amino
acids. - Expressed as the log-likelihood ratio of mutation
(or log-odds ratio) - Derived from multiple sequence alignmentsTwo
commonly used matrices PAM and BLOSUM - PAM percent accepted mutations (Dayhoff)
- BLOSUM Blocks substitution matrix (Henikoff)
8PAM
M Dayhoff, 1978
- Evolutionary time is measured in Percent Accepted
Mutations, or PAMs - One PAM of evolution means 1 of the
residues/bases have changed, averaged over all 20
amino acids. - To get the relative frequency of each type of
mutation, we count the times it was observed in a
database of multiple sequence alignments. - Based on global alignments
- Assumes a Markov model for evolution.
9BLOSUM
Henikoff Henikoff, 1992
- Based on database of ungapped local alignments
(BLOCKS) - Alignments have lower similarity than PAM
alignments. - BLOSUM number indicates the percent identity
level of sequences in the alignment. For example,
for BLOSUM62 sequences with approximately 62
identity were counted. - Some BLOCKS represent functional units, providing
validation of the alignment.
10Multiple Sequence Alignment
A multiple sequence alignment is made using many
pairwise sequence alignments
11Columns in a MSA have a common evolutionary
history
By aligning the sequences, we assert that the
aligned residues in each column had a common
ancestor.
12How do you count the mutations?
Assume any of the sequences could be the
ancestral one.
G
GGWWNGG
L K F R L S K K P
L K F R L S K K P
L K F R L T K K P
L K F R L S K K P
G W W N G G
L K F R L S R K P
If the first sequence was the ancestor, then it
mutated to a W twice, to N once, and conserved G
three times.
L K F R L T R K P
L K F R L K K P
13Or, we could have picked...
W
GGWWNGG
L K F R L S K K P
L K F R L S K K P
L K F R L T K K P
L K F R L S K K P
G W W N G G
L K F R L S R K P
W was the ancestor, then it mutated to a G four
times, to N once, and was conserved once.
L K F R L T R K P
L K F R L K K P
14Subsitution matrices are symmetrical
Since we don't know which sequence came first, we
don't know whether
w
G
or
w
G
...is correct. So we count this as one mutation
of each type. G--gtW and W--gtG. In the end the
20x20 matrix will have the same number for
elements (i,j) and (j,i). (That's why we only
show the upper triangle)
15Summing the substitution counts
We assume the ancester is one of the observed
amino acids, but we don't know which, so we try
them all.
G
W
N
GGWWNGG
G
3
1
2
N
symmetrical matrix
one column of a MSA
W
16Next possible ancester...
We already counted this residue against all
others, so be blank it out.
G
W
N
GGWWNGG
G
2
1
2
N
W
17Next...
G
W
N
GGWWNGG
G
2
N
1
W
1
18Next...
G
W
N
GGWWNGG
G
2
N
1
W
0
19Next...
G
W
N
GGWWNGG
G
2
N
0
0
W
20Next...
G
W
N
GGWWNGG
G
1
0
0
N
W
21Last...
G
W
N
GGWWNGG
G
0
0
0
N
(no counts for last seq.)
W
22Summing the substitution counts
G
W
N
GGWWNGG
G
6
4
8
N
0
2
TOTAL21
Now we do this for every column in every multiple
sequence alignment...
W
1
23log odds
Substitutions (and many other things in
bioinformatics) are expressed as a "likelihood
ratio", or "odds ratio" of the observed data over
the expected value. Likelihood and odds are
synomyms for Probability. So Log Odds is the log
(usually base 2) of the odds ratio.
log odds ratio log2(observed/expected )
24Getting log-odds from counts
P(G) 4/7 0.57
Observed probability of G-gtG qGG P(G-gtG)6/21
0.29
Expected probability of G-gtG, eGG 0.570.57
0.33
If the lod is lt 0., then the mutation is less
likely than expected by chance. If it is gt 0., it
is more likely.
odds ratio qGG/eGG 0.29/0.33
log odds ratio log2(qGG/eGG )
25Different observations, same expectation
P(G)0.50 eGG 0.25qGG 21/42 0.5 lod
log2(0.50/0.25) 1
P(G)0.50 eGG 0.25qGG 9/42 0.21 lod
log2(0.21/0.25) 0.2
G GG AW GW AN GG AG A
G WG AG WG AG WG AG A
Gs spread over many columns
Gs concentrated
26Different observations, same expectation
P(G)0.50, P(W)0.14 eGW 0.07qGG 3/42
0.07 lod log2(0.07/0.07) 0
P(G)0.50, P(W)0.14 eGW 0.07qGW 7/42
0.17 lod log2(0.17/0.07) 1.3
G GG AW GA WN GG AG A
G WG AG WG AG WG AA G
G and W seen together more often than expected.
Gs and Ws not seen together.
27Get the substitution value for P-gtQ
In class exercise
...given a very small database.
PQPPQQQPQQPPQPPPQQQP
P(P)_____, P(Q)_____ ePQ _____qPQ ___/___
_____ lod log2(ePQ/qPQ) ____
28Markovian evolution and PAM
A Markov process is one where te likelihood of
the next "state" depends only on the current
state. Markovian evolution assumes that base
changes (or amino acid changes) occur at a
constant rate and depend only on the identity of
the current base (or amino acid).
.9946
.0001
.9932
.0002
.0021
one position in a protein
G
G
A
V
V
G
millions of years
29Markovian evolution is an extrapolation
Start with all G's. Wait 1 million years. Where
do they go? Using PAM1, we expect them to mutate
to about 0.0002 A, 0.0007 P, 0.9946 G, etc Wait
another million years. The new A's mutate
according to PAM1 for A's, P's mutate according
to PAM1 for P's, etc. Wait another million, etc ,
etc etc. What is the final distribution of amino
acids at the positions that were once G's?
PAM1
PAM1
30Matrix multiplication
PAM1
To start we have 100G, 0 everything else
After 1MY we have each amino acid according to
the PAM probabilities.
0.00010.00010.000150.000050.999430.000020.00
0050.000010.00020.000150.000020.000030.0006
0.00060.00002
P(G-gtA)P(G-gtC)P(G-gtD)P(G-gtE)P(G-gtG)P(G-gtF)P(
G-gtH)P(G-gtI) P(G-gtK) P(G-gtL) P(G-gtM) P(G-gtN)P(G-
gtP) P(G-gtQ) P(G-gtR) P(G-gtS) P(G-gtT)
000010000 0 0 0000 00000
PAM1
x
31Matrix multiplication
PAM1
PAM1
After 2MY each amino acid has mutated again
according to the PAM1 probabilities.
0.00010.00010.000150.000050.999430.000020.00
0050.000010.00020.000150.000020.000030.0006
0.00060.00002
PAM1
x
etc.
32250 PAMs
PAM1
PAM1
PAM1
250
PAM1
33Differences between PAM and BLOSUM
- PAM
- PAM matrices are based on global alignments of
closely related proteins. - The PAM1 is the matrix calculated from
comparisons of sequences with no more than 1
divergence. - Other PAM matrices are extrapolated from PAM1
using an assumed Markov chain. - BLOSUM
- BLOSUM matrices are based on local alignments.
- BLOSUM 62 is a matrix calculated from comparisons
of sequences with approx 62 identity. - All BLOSUM matrices are based on observed
alignments they are not extrapolated from
comparisons of closely related proteins. - BLOSUM 62 is the default matrix in BLAST (the
database search program). It is tailored for
comparisons of moderately distant proteins.
Alignment of distant relatives may be more
accurate with a different matrix.
34Increasing sophistication in match scoring
1. Identity score. 2. Genetic code changes
(mutations on one base more likely than 2,3).
(1966) 3. Matrices based on chemical similarity
of amino acids. (1985) 4. Matrices based on
multiple sequence alignments (PAM (1978), BLOSUM
(1994)) 5. Dipeptide substitution matrices (ie.
AG --gt DG, etc) (1994) 6. Class specific
substitution matrices (D. Jones' transmembrane
protein matrix) (1994) 7. Structure-based
substitution matrices (2000) 8.
Position-specific, structure-based substitution
matrices (2006)
35PAM250
36BLOSUM62
37Which substitution matrix favors...
PAM250 BLOSUM62
conservation of polar residues conservation of
non-polar residues conservation of C, Y, or
W polar-to-nonpolar mutations polar-to-polar
mutations
38Local alignment, revisited
- Starts at zero (the score of a non-alignment)
- Ends at the maximum score anywhere in the matrix.
- Advantages
- Does not care if the aligned region has long
"tails". - Can align pieces of one sequence to pieces of
another. - Multi-domain sequences are OK
Global
Local
39Local alignment, revisited
- Disadvantages
- Fails on multidomain alignments if large gaps
are present. - Success depends on an additional parameter
Matrix Bias
Global
Local
40Local Alignment with matrix bias
Matrix bias a constant added to the substitution
score. Has the same effect as starting the
alignment at a number other than zero.
A(i-1,j-1) match score bias
A(i,j-1) gap
A(i,j) MAX
A(i-1,j) gap
0 match score
linear gap penalty.
41Effect of matrix bias
Higher matrix bias favors matches over gaps.
RESULT more matches, longer local alignments.
more matrix bias
42How do we know when the alignment is correct?
Compare to known structure-based alignments
2DRCA 1/2 MISLIAALAVDRVIGMENAM-PFNLPADL
AWFKRNTL-------DKPVIMGRHTWESIG- 1DRF_ 3/4
SLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLV
IMGKKTWFSIPE 2DRCA 52/53
--RPLPGRKNIILSSQP--GTDDRVTWVKSVDEAIAACG------DVPEI
MVIGGGRVYE 1DRF_ 63/64 KNRPLKGRINLVLSRELKE
PPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYK
2DRCA 102/103 QFLPK--AQKLYLTHIDAEVEGDTHFPDYEPD
DWESVF------SEFHDADAQNSHSYCF 1DRF_ 123/124
EAMNHPGHLKLFVTRIMQDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEE
---KGIKYKF 2DRCA 154/155 EILERR 1DRF_
180/181 EVYEKN
43Aligning fibronectin
Fibronectin is a long multidomain protein
involved in adhesion/migration of cells, blood
clotting, signaling, and interactions with the
extracellular matrix (ECM). Interacts with
collagen, fibrin, heparin and integins. It is
made up of many copies of at least 3 "modules".
Small differences within modules cause important
biological effects. How do you align fibronectins?
44Multiple local alignments?
One way would be to select the maximum score,
then the next highest score, and so on to get all
of the possible alignments.
II
II
II
II
III
III
II
III
Fragment-based alignment methods find all local
alignments. (BLAST, FASTA)
45You have seen....
Dynamic programming Global alignment
Global/local alignment (no end gaps. 3 ways to
do it.) Local alignment Linear gap
penalty Affine gap penalty
How many ways are there to do DP?
46In class exercise local alignment using BestFit
Start SeqLab Go to Editor mode, with no
sequences. Download two protein sequences from
the PIR database(File/Add sequences/Databases)PI
R2A46444PIR2S58653 Select both. Run BestFit
(Functions/Pairwise comparison/BestFit.) Go to
Options set gap creation (range 1-20), gap
extension (range 0-10) . Run. Look at the
results. (Compare to what you see on the screen.)