Title: Jacques.van.Helden@ulb.ac.be
1Sequence analysis. Part 1: Dot plots and substitution matrices
- Introduction to Bioinformatics
2Contents
- Pairwise sequence alignment
- Dot plots (dottup, dotmatcher)
- Substitution matrices
- Gapless alignment
- Alignment with gaps
- Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
- Matching a sequence against a database (FASTA, BLAST)
- Multiple sequence alignment (ClustalX)
- Matching motifs against sequences
3Dot plots
4Dot plot
- A dot plot is a simple graphical representation of identical residues between two sequences.
- The X axis represents the first sequence (PHO5).
- The Y axis represents the second sequence (PHO3).
- A dot is plotted for each match between two residues of the sequences.
- Diagonal lines reveal regions of identity between the two sequences.
Example: protein sequences of the Pho5p and Pho3p phosphatases in the yeast Saccharomyces cerevisiae.
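The construction described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `dot_plot` and `render` and the two short peptides are made up for the example (they are not the Pho5p and Pho3p sequences).

```python
def dot_plot(seq_x, seq_y):
    """Return a matrix with one row per residue of seq_y (Y axis) and one
    column per residue of seq_x (X axis); a cell is 1 where residues match."""
    return [[1 if x == y else 0 for x in seq_x] for y in seq_y]


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' for a mismatch."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


# Hypothetical toy peptides differing at a single position (index 4).
seq_x = "MFKSVVYS"
seq_y = "MFKSLVYS"
matrix = dot_plot(seq_x, seq_y)
print(render(matrix))
```

In the printed output, the conserved regions show up as a diagonal of stars, interrupted at the substituted position.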
5Dot plot with word matches
- With nucleic acid sequences, any given residue is expected to occur once every 4 positions on average, so a letter-based dot plot is cluttered with chance matches and is very confusing.
- The dot plot can be adapted to display only word matches, which correspond to a diagonal of dots in the letter-based dot plot.
- Example: alignment of PHO5 and PHO3 coding sequences, with different word sizes.
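A possible sketch of a word-based dot plot is shown here. The function name `word_dot_plot` and the two short DNA strings are hypothetical; real tools such as dottup may proceed differently. The example counts the hits obtained with several word sizes, to illustrate how longer words filter out chance matches.

```python
def word_dot_plot(seq_x, seq_y, word_size):
    """Return the set of (i, j) start positions such that the words of
    length word_size starting at seq_x[i] and seq_y[j] are identical."""
    # Index every word of seq_x by its start positions.
    index = {}
    for i in range(len(seq_x) - word_size + 1):
        index.setdefault(seq_x[i:i + word_size], []).append(i)
    # Look up every word of seq_y in the index.
    hits = set()
    for j in range(len(seq_y) - word_size + 1):
        for i in index.get(seq_y[j:j + word_size], []):
            hits.add((i, j))
    return hits


# Hypothetical toy sequences.
seq_x = "ACGTACGTTGCA"
seq_y = "TTACGTTGAACG"
for w in (1, 2, 4):
    print("word size", w, "->", len(word_dot_plot(seq_x, seq_y, w)), "hits")
```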
6Detecting repeats with a dot plot
- Sequence repeats are easily detected in a dot plot when a sequence is compared to itself.
- The main diagonal is completely marked (by definition, since the sequence is identical to itself).
- Repeats appear as segments of lines parallel to the diagonal.
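As a rough illustration of repeat detection by self-comparison, the sketch below lists the off-diagonal word matches of a sequence against itself. The function name `repeat_hits` and the toy sequence are assumptions made for the example. Hits sharing the same offset (j - i) lie on the same line parallel to the main diagonal.

```python
def repeat_hits(seq, word_size):
    """Return sorted (i, j) pairs with i < j such that the words of length
    word_size starting at positions i and j of seq are identical.
    The main diagonal (i == j) is skipped, since it always matches."""
    positions = {}
    for i in range(len(seq) - word_size + 1):
        positions.setdefault(seq[i:i + word_size], []).append(i)
    hits = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                hits.append((starts[a], starts[b]))
    return sorted(hits)


# Hypothetical sequence containing the segment "GATTACA" twice.
seq = "GATTACAGGCCGATTACATT"
hits = repeat_hits(seq, 4)
offsets = sorted({j - i for i, j in hits})
print(hits, offsets)
```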
7Gray scale dottups
- Word matches require a perfect match over the whole word length.
- A more refined way to show partial similarities is to use windows.
- For each point, the window score is the sum of matches of its neighbours on the diagonal.
- The gray level reflects the window similarity score.
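The window score can be sketched as follows, assuming a window centred on each point and extending `half_width` residues on each side along the diagonal (the centring, the function name `window_scores` and the toy sequences are assumptions; a program may also anchor the window at its start). A gray level would then be derived from each score, the maximum being 2 * half_width + 1.

```python
def window_scores(seq_x, seq_y, half_width):
    """For each point (i, j), count the identical residue pairs among the
    point itself and its neighbours on the same diagonal."""
    n_x, n_y = len(seq_x), len(seq_y)
    scores = [[0] * n_x for _ in range(n_y)]
    for j in range(n_y):
        for i in range(n_x):
            total = 0
            for k in range(-half_width, half_width + 1):
                a, b = i + k, j + k
                # Neighbours falling outside the sequences are ignored.
                if 0 <= a < n_x and 0 <= b < n_y and seq_x[a] == seq_y[b]:
                    total += 1
            scores[j][i] = total
    return scores


# Hypothetical toy sequences with one mismatch at index 3.
scores = window_scores("ACGTAC", "ACGAAC", half_width=1)
for row in scores:
    print(row)
```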
8Alignment matrices and substitution matrices
9Alignment matrix
- An alignment matrix is conceptually related to a dot plot.
- One sequence is placed horizontally, the other vertically.
- A score is assigned to each match.
- In the case of dot plots, the scoring scheme was very simple:
- match: 1
- mismatch: 0
10Substitution matrix
- One could decide to give a lower cost to A-T substitutions, if we assume that these are more likely to occur in our sequences.
- Example:
- match: 2
- A-T mismatch: -1
- other mismatch: -2
- The scoring scheme can be represented as a substitution matrix.
11Scoring an alignment matrix with a substitution
matrix
- Let us come back to our previous alignment matrix.
- For each cell of the alignment matrix, we compare the residues of sequences A and B, and take the score for this pair of residues from the substitution matrix.
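The filling of the alignment matrix can be sketched as below, using the example scoring scheme given earlier (match 2, A-T mismatch -1, other mismatch -2). The function names and the two short sequences are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def build_substitution_matrix():
    """Substitution matrix for the example scheme:
    match 2, A-T mismatch -1, any other mismatch -2."""
    matrix = {}
    for a in NUCLEOTIDES:
        for b in NUCLEOTIDES:
            if a == b:
                matrix[(a, b)] = 2
            elif {a, b} == {"A", "T"}:
                matrix[(a, b)] = -1
            else:
                matrix[(a, b)] = -2
    return matrix


def alignment_matrix(seq_a, seq_b, substitution):
    """seq_a is placed horizontally (columns), seq_b vertically (rows).
    Each cell holds the substitution score of the corresponding residues."""
    return [[substitution[(a, b)] for a in seq_a] for b in seq_b]


substitution = build_substitution_matrix()
cells = alignment_matrix("GATC", "GTTC", substitution)  # toy sequences
for row in cells:
    print(row)
```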
12Substitution counts in 71 groups of aligned
proteins (Dayhoff, 1978)
13Substitution matrices for proteins
- Margaret Dayhoff (1978) measured the rate of substitutions between each pair of amino acids, in a collection of aligned proteins.
- Scores are calculated as log-odds.
- Positive values reflect frequent ("accepted") substitutions, i.e. substitutions that occur more frequently than expected by chance.
- Negative values reflect rare ("unfavourable") mutations, i.e. substitutions that occur less frequently than expected by chance.
- The diagonal reflects residue conservation.
Reference: Dayhoff et al. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 345-352. National Biomedical Research Foundation, Silver Spring, MD, 1978.
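To make the log-odds idea concrete, here is a minimal sketch. The frequencies are invented for the illustration and are not taken from Dayhoff's data; the logarithm base and any scaling factor are conventions that differ between matrix families, and base 2 is used here only as an assumption.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, base=2.0):
    """Log of the ratio between the observed frequency of an aligned pair
    and the frequency expected by chance (product of residue frequencies)."""
    return math.log(pair_freq / (freq_a * freq_b), base)


# Hypothetical values: each residue has a frequency of 0.1,
# so a pair is expected by chance with frequency 0.1 * 0.1 = 0.01.
frequent = log_odds(0.02, 0.1, 0.1)   # observed twice as often as expected
rare = log_odds(0.005, 0.1, 0.1)      # observed half as often as expected
print(round(frequent, 3), round(rare, 3))
```

A pair observed more often than expected gets a positive score, a pair observed less often gets a negative score.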
14PAM scoring matrices
- The alignments used by Dayhoff had at least 85% identity.
- However, frequencies of substitutions are expected to depend on the rate of divergence between sequences: the number of substitutions increases with time.
- In order to take into account the divergence rate, Margaret Dayhoff calculated a series of scoring matrices, each reflecting a certain level of divergence:
- PAM001: rates of substitutions between amino-acid pairs expected for proteins with an average of 1% substitution per position
- PAM050: rates of substitutions between amino-acid pairs expected for proteins with an average of 50% substitutions per position
- PAM250: 250% mutations/position (note: a position could mutate several times)
- The substitution matrix must thus be chosen according to the relatedness of the sequences to be aligned.
15Extrapolation of the PAM series from PAM001
Residue X   M(i,3) = P(Asn -> X)   M(17,j) = P(X -> Thr)
Ala         0.0009                 0.0022
Arg         0.0001                 0.0002
Asn         0.9822                 0.0013
Asp         0.0042                 0.0004
Cys         0.0000                 0.0001
Gln         0.0004                 0.0003
...         ...                    ...
Thr         0.0013                 0.9871
Trp         0.0000                 0.0000
Tyr         0.0003                 0.0002
Val         0.0001                 0.0009
Probability of an Asn -> Thr change after two PAM001 steps, summing over every intermediate residue:
P(Asn -> Thr) = P(Asn -> Ala -> Thr) + P(Asn -> Arg -> Thr) + ... + P(Asn -> Val -> Thr)
= (0.0009)(0.0022) + (0.0001)(0.0002) + ... + (0.0001)(0.0009)
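The sum over intermediate residues is a matrix product, so extrapolating the series amounts to raising the PAM001 probability matrix to a power. The sketch below illustrates this with an invented 3-letter alphabet; the matrix values are hypothetical and are not Dayhoff's. As an assumed convention, m[i][j] holds the probability that residue j is replaced by residue i, so each column sums to 1.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, exponent):
    """Multiply m by itself 'exponent' times, starting from the identity."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix for a toy alphabet of 3 residues.
pam1 = [[0.990, 0.004, 0.002],
        [0.006, 0.992, 0.003],
        [0.004, 0.004, 0.995]]

pam2 = mat_power(pam1, 2)      # two successive steps
pam250 = mat_power(pam1, 250)  # 250 successive steps
print(pam2[2][0], pam250[0][0])
```

After many steps the diagonal values drop well below their one-step values, reflecting the fact that a position may have mutated several times.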
16PAM250 matrix
17Hinton diagram of the PAM250 matrix
- Yellow boxes indicate positive values (accepted mutations).
- Red boxes indicate negative values (avoided mutations).
- The area of each box is proportional to the absolute value of the log-odds score.
18BLOSUM scoring matrices
- Henikoff and Henikoff (1992) analyzed substitution rates on the basis of aligned regions (blocks).
- They calculated scoring matrices from blocks with different percentages of protein divergence.
- Example:
- BLOSUM62: calculated from blocks with 62% similarity
- BLOSUM80: calculated from blocks with 80% similarity
- When these substitution matrices are used to score sequence alignments, one should always choose the matrix appropriate to the expected percentage of similarity.
Reference: Henikoff, S. and Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. PNAS 89: 10915-10919.
19BLOSUM62
20BLOSUM62: substitutions between acidic residues
21BLOSUM62: substitutions between basic residues
22BLOSUM62: substitutions between aromatic residues
23BLOSUM62: substitutions between polar residues
24BLOSUM62: substitutions between hydrophobic residues
25Substitution matrices - summary
- Different substitution scoring matrices have been established:
- PAM (Dayhoff, 1978)
- BLOSUM (Henikoff & Henikoff, 1992)
- Residue categories (Phylip)
- Substitution matrices make it possible to detect similarities between more distant proteins than would be detected with the simple identity of residues.
- The matrix must be chosen carefully, depending on the expected rate of conservation between the sequences to be aligned.
- Beware:
- With PAM matrices
- the number indicates the percentage of substitutions per position -> higher numbers are appropriate for more distant proteins
- With BLOSUM matrices
- the number indicates the percentage of conservation -> higher numbers are appropriate for more conserved proteins
26Scoring an alignment with a substitution matrix
- The substitution matrix can be used to assign a score to a pair-wise alignment.
- The score of the alignment is the sum, over all the aligned positions (i from 1 to L), of the scores of the pairs of residues (r1,i and r2,i).
- Gaps are penalized, with two parameters:
- Gap opening (go) penalty
- Typical values: between -10 and -15
- Gap extension (ge) penalty
- Typical values: between -0.5 and -2
Example (here go = -10 and ge = -1):
i       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
r1      R   L   A   S   V   E   T   D   M   P   -   -   -   -   -   L   T   L   R   Q   H
gap     .   .   .   .   .   .   .   .   .   .  go  ge  ge  ge  ge   .   .   .   .   .   .
r2      T   L   T   S   L   Q   T   T   L   K   N   L   K   E   M   A   H   L   G   T   H
score  -1   4   0   4   1   2   5  -1   2  -1 -10  -1  -1  -1  -1  -1  -2   4  -2  -1   8
Total score S = 7
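The scoring of this example alignment can be reproduced with a short sketch. Only the residue pairs that occur in the example are stored, with the scores shown on the slide; the names `PAIR_SCORES`, `pair_score` and `score_alignment` are hypothetical, and the gap handling is a simplification (any run of consecutive gapped columns counts as a single gap).

```python
# Scores of the residue pairs occurring in the example alignment.
PAIR_SCORES = {
    ("R", "T"): -1, ("L", "L"): 4, ("A", "T"): 0, ("S", "S"): 4,
    ("V", "L"): 1, ("E", "Q"): 2, ("T", "T"): 5, ("D", "T"): -1,
    ("M", "L"): 2, ("P", "K"): -1, ("L", "A"): -1, ("T", "H"): -2,
    ("R", "G"): -2, ("Q", "T"): -1, ("H", "H"): 8,
}


def pair_score(a, b):
    """Look up a pair in either order (the matrix is symmetric)."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def score_alignment(seq1, seq2, gap_open=-10, gap_extend=-1):
    """Sum the pair scores over all aligned positions; the first column of
    a gap costs gap_open, each following gapped column costs gap_extend."""
    total = 0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(a, b)
            in_gap = False
    return total


seq1 = "RLASVETDMP-----LTLRQH"
seq2 = "TLTSLQTTLKNLKEMAHLGTH"
total = score_alignment(seq1, seq2)
print(total)
```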
27Substitution matrices in dot plots
- Substitution matrices can be used in dot plots.
- The user has to specify the following parameters:
- Window size (word size)
- Score threshold
- Substitution matrix
- At each position of the plot:
- A pair of words of size w is extracted from the first and second sequences.
- The score of the word pair is calculated by summing the scores of the corresponding pairs of residues.
- If the pair of words passes the threshold, a black diagonal is displayed at the corresponding position of the dot plot.
- The regions of similarity between the two sequences appear as black diagonals on the dot plot.
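The procedure above can be sketched as follows. To keep the example self-contained, it uses the toy nucleotide scheme introduced earlier (match 2, A-T mismatch -1, other mismatch -2) instead of a protein matrix; the function names and sequences are hypothetical, and this is not the actual dotmatcher implementation.

```python
def score_pair(a, b):
    """Toy scoring scheme: match 2, A-T mismatch -1, other mismatch -2."""
    if a == b:
        return 2
    if {a, b} == {"A", "T"}:
        return -1
    return -2


def matrix_dot_plot(seq_x, seq_y, window, threshold):
    """Return the (i, j) start positions of the word pairs whose summed
    substitution score reaches the threshold."""
    hits = []
    for j in range(len(seq_y) - window + 1):
        for i in range(len(seq_x) - window + 1):
            score = sum(score_pair(seq_x[i + k], seq_y[j + k])
                        for k in range(window))
            if score >= threshold:
                hits.append((i, j))
    return hits


# Hypothetical toy sequences differing by one A-T substitution.
relaxed = matrix_dot_plot("GATTACA", "GTTTACA", window=4, threshold=5)
strict = matrix_dot_plot("GATTACA", "GTTTACA", window=4, threshold=8)
print(relaxed, strict)
```

With the strict threshold only perfect word matches are kept, whereas the relaxed threshold also reports words containing the tolerated A-T substitution.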
28Aspartokinases dot plot with simple word matches
- Let us compare the peptidic sequences of two enzymes from the bacterium Escherichia coli K12.
- LysC: aspartokinase involved in lysine biosynthesis
- MetL: bifunctional enzyme which combines two domains
- aspartokinase: catalyses the first step of methionine biosynthesis
- homoserine dehydrogenase: catalyses the third step of methionine biosynthesis
- On the dot plot, the region of similarity between the two aspartokinase domains is barely visible.
29Aspartokinases dot plot with substitution matrix
(BLOSUM62)
- With dotmatcher, a substitution matrix is used to score the similarity between each pair of residues.
- This reveals the similarity between the aspartokinase domains of MetL and LysC.
- Note that this similarity only covers the N-terminal half of MetL, because it is a bifunctional enzyme: the C-terminal domain is a homoserine dehydrogenase.
- It is quite tricky to find the appropriate parameters, because these can vary from case to case.
30Software tools for displaying dot plot
- DNA Strider
- Marck C (1988) 'DNA Strider': a 'C' program for the fast analysis of DNA and protein sequences on the Apple Macintosh family of computers. Nucleic Acids Res. 16: 1829-36.
- http://www...
- Dotlet
- Junier T, Pagni M (2000) Dotlet: diagonal plots in a web browser. Bioinformatics. 16: 178-9.
- http://myhits.isb-sib.ch/cgi-bin/dotlet
- Dotter
- Sonnhammer EL, Durbin R (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene. 167: GC1-10.
- http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
31Bibliography
- Substitution matrices
- PAM series
- Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. (1978). A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5, 345-352.
- BLOSUM substitution matrices
- Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89, 10915-9.
- Gonnet matrices, built by an iterative procedure
- Gonnet, G. H., Cohen, M. A. & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science 256, 1443-5.