Jacques van Helden (Jacques.van.Helden@ulb.ac.be)

1
Sequence analysis. Part 1: Dot plots and
substitution matrices

Introduction to Bioinformatics

2
Contents

Pairwise sequence alignment
Dot plots (dottup, dotmatcher)
Substitution matrices
Gapless alignment
Alignment with gaps
Global alignment (Needleman-Wunsch)
Local alignment (Smith-Waterman)
Matching a sequence against a database (Fasta,
BLAST)
Multiple sequence alignment (ClustalX)
Matching motifs against sequences

3
Dot plots

Bioinformatics

4
Dot plot

A dot plot is a simple graphical representation
of identical residues between two sequences.
The X axis represents the first sequence (PHO5).
The Y axis represents the second sequence (PHO3).
A dot is plotted for each match between two
residues of the sequences.
Diagonal lines reveal regions of identity between
the two sequences.

Example: protein sequences of the Pho5p and Pho3p
phosphatases in the yeast Saccharomyces
cerevisiae.
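The letter-based dot plot described above can be sketched in a few lines. This is a minimal illustration, not the code of any dot plot program; the function names `dot_plot` and `render` and the toy peptide are assumptions made for the example, and the real Pho5p/Pho3p sequences are not used.

```python
def dot_plot(seq_x, seq_y):
    """Return a grid where grid[j][i] is True when seq_x[i] == seq_y[j].

    seq_x runs along the X axis (columns), seq_y along the Y axis (rows).
    """
    return [[a == b for a in seq_x] for b in seq_y]


def render(grid):
    """Draw the grid as text: '*' for a matching residue pair, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Toy peptide (hypothetical), compared to itself: the main diagonal is filled.
print(render(dot_plot("MFKSVVYS", "MFKSVVYS")))
```

Each `*` is one dot of the plot; runs of `*` along a diagonal correspond to regions of identity.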
5
Dot plot with word matches

With nucleic acid sequences, each residue is
expected to occur once every 4 positions on
average, so a letter-based dot plot is covered
with chance matches and is very confusing.
The dot plot can be adapted to display only word
matches, which correspond to a diagonal of dots
in the letter-based dot plot.
Example: alignment of PHO5 and PHO3 coding
sequences, with different word sizes.
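A word-based dot plot can be sketched as follows. The function name `word_dot_plot` and the two short DNA strings are hypothetical; the sketch only illustrates how increasing the word size removes chance matches.

```python
def word_dot_plot(seq_x, seq_y, word_size):
    """Return the (i, j) positions where the words of length word_size
    starting at seq_x[i] and seq_y[j] are strictly identical."""
    hits = []
    for j in range(len(seq_y) - word_size + 1):
        for i in range(len(seq_x) - word_size + 1):
            if seq_x[i:i + word_size] == seq_y[j:j + word_size]:
                hits.append((i, j))
    return hits


# Toy DNA sequences (hypothetical).
x, y = "ACGTACGT", "TTACGTAA"
for w in (1, 2, 4):
    print(w, len(word_dot_plot(x, y, w)))
```

With a word size of 1 this reduces to the letter-based dot plot; larger words keep only the matches that form a diagonal of at least `word_size` dots.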

6
Detecting repeats with a dot plot

Sequence repeats are easily detected in a dot
plot when a sequence is compared to itself.
The main diagonal is completely marked (by
definition, since the sequence is identical to
itself).
Repeats appear as segments of lines parallel to
the diagonal.
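Repeat detection by self-comparison can be sketched by listing the off-diagonal word matches of a sequence against itself. The function name `self_repeats` and the toy sequence are assumptions for the example.

```python
def self_repeats(seq, word_size):
    """Return sorted (i, j) pairs with i < j where the same word of length
    word_size starts at both positions (off-diagonal dots of a self dot plot)."""
    positions = {}
    for i in range(len(seq) - word_size + 1):
        positions.setdefault(seq[i:i + word_size], []).append(i)
    pairs = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)


# Toy sequence (hypothetical) containing the word GATTACA twice.
for i, j in self_repeats("GATTACATTGATTACA", 4):
    print(i, j, "offset", j - i)
```

Consecutive pairs with the same offset `j - i` lie on one line parallel to the main diagonal, which is how a repeat shows up in the plot.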

7
Gray scale dottups

Word matches require a perfect match over the
whole word length.
A more refined way to show partial similarities
is to use windows.
For each point, the window score is the sum of
matches of its neighbours on the diagonal.
The gray level reflects the window similarity
score.
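The window score can be sketched as below. This assumes an odd window centred on each point and counts positions that fall outside the sequences as mismatches; both choices, and the name `window_scores`, are assumptions for the illustration rather than the behaviour of a specific program.

```python
def window_scores(seq_x, seq_y, window):
    """scores[j][i] = number of identical residue pairs among the `window`
    diagonal neighbours centred on (i, j). `window` is assumed to be odd."""
    half = window // 2
    scores = []
    for j in range(len(seq_y)):
        row = []
        for i in range(len(seq_x)):
            total = 0
            for k in range(-half, half + 1):
                a, b = i + k, j + k
                # Neighbours beyond the sequence ends contribute nothing.
                if 0 <= a < len(seq_x) and 0 <= b < len(seq_y):
                    total += int(seq_x[a] == seq_y[b])
            row.append(total)
        scores.append(row)
    return scores


# Toy sequences (hypothetical) differing at one position.
for row in window_scores("ACGTA", "ACCTA", 3):
    print(row)
```

Dividing each score by `window` gives a value between 0 and 1 that can be mapped to a gray level.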

8
Alignment matrices and substitution matrices

Bioinformatics

9
Alignment matrix

An alignment matrix is conceptually related to a
dot plot
One sequence is pasted horizontally, the other
vertically
A score is assigned to each match
In the case of dot plots, the scoring scheme was
very simple:
match: 1
mismatch: 0

10
Substitution matrix

One could decide to give a lower cost to A-T
substitutions, if we assume that these are more
likely to occur in our sequences.
Example:
match: 2
A-T mismatch: -1
other mismatch: -2
The scoring scheme can be represented as a
substitution matrix.

11
Scoring an alignment matrix with a substitution
matrix

Let us come back to our previous alignment matrix
For each cell of the alignment matrix, we compare
the residue in sequences A and B, and take the
score for this pair of residues in the
substitution matrix.
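Filling an alignment matrix from a substitution matrix can be sketched with the example scheme given earlier (match 2, A-T mismatch -1, other mismatch -2). The function names and the two short sequences are hypothetical.

```python
def substitution_score(a, b):
    """Example DNA scoring scheme: match 2, A-T mismatch -1, other mismatch -2."""
    if a == b:
        return 2
    if {a, b} == {"A", "T"}:
        return -1
    return -2


def alignment_matrix(seq_a, seq_b):
    """Cell [j][i] holds the substitution score of seq_a[i] against seq_b[j]
    (seq_a is written horizontally, seq_b vertically)."""
    return [[substitution_score(a, b) for a in seq_a] for b in seq_b]


# Toy sequences (hypothetical).
for row in alignment_matrix("ATGC", "TTGA"):
    print(row)
```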

12
Substitution counts in 71 groups of aligned
proteins (Dayhoff, 1978)
13
Substitution matrices for proteins

Margaret Dayhoff (1978) measured the rate of
substitutions between each pair of amino acids,
in a collection of aligned proteins.
Scores are calculated as log-odds
Positive values reflect frequent ("accepted")
substitutions, i.e. substitutions that occur more
frequently than expected by chance.
Negative values reflect rare ("unfavourable")
mutations, i.e. substitutions that occur less
frequently than expected by chance
The diagonal reflects residue conservation.
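The sign convention of log-odds scores can be illustrated with a small sketch. The frequencies below are invented, and the logarithm base and the absence of a scaling factor are arbitrary choices for the example; they are not the exact procedure used to build the published matrices.

```python
import math


def log_odds(pair_freq, freq_a, freq_b):
    """log2 of the observed frequency of an aligned residue pair divided by
    the frequency expected by chance (product of the residue frequencies)."""
    return math.log2(pair_freq / (freq_a * freq_b))


# Invented frequencies: both residues have a background frequency of 0.1,
# so a pair is expected by chance with frequency 0.01.
print(log_odds(0.02, 0.1, 0.1))   # observed twice as often as expected: positive
print(log_odds(0.005, 0.1, 0.1))  # observed half as often as expected: negative
```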

The alignments used by Dayhoff had at least 85%
identity.
However, frequencies of substitutions are
expected to depend on the rate of divergence
between sequences: the number of substitutions
increases with time.
In order to take into account the divergence
rate, Margaret Dayhoff calculated a series of
scoring matrices, each reflecting a certain level
of divergence:
PAM001: rates of substitutions between amino-acid
pairs expected for proteins with an average of 1
substitution per 100 positions (1%)
PAM050: rates of substitutions between amino-acid
pairs expected for proteins with an average of
50 substitutions per 100 positions
PAM250: 250 mutations per 100 positions (note: a
position could mutate several times)
The substitution matrix must thus be chosen
according to the relatedness of the sequences to
be aligned.

Reference: Dayhoff et al. (1978). A model of
evolutionary change in proteins. In Atlas of
Protein Sequence and Structure, vol. 5, suppl. 3,
345-352. National Biomedical Research Foundation,
Silver Spring, MD.
15
Extrapolation of the PAM series from PAM001

Residue X   M(i,3) = P(Asn -> X)   M(17,j) = P(X -> Thr)
Ala         0.0009                 0.0022
Arg         0.0001                 0.0002
Asn         0.9822                 0.0013
Asp         0.0042                 0.0004
Cys         0.0000                 0.0001
Gln         0.0004                 0.0003
...         ...                    ...
Thr         0.0013                 0.9871
Trp         0.0000                 0.0000
Tyr         0.0003                 0.0002
Val         0.0001                 0.0009

P(Asn -> Thr) = P(Asn -> Ala -> Thr) + P(Asn -> Arg -> Thr) + ... + P(Asn -> Val -> Thr)
              = (0.0009)(0.0022) + (0.0001)(0.0002) + ... + (0.0001)(0.0009)
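The sum over intermediate residues is one entry of a matrix product, so extrapolating to n steps amounts to raising the one-step matrix to the power n. The sketch below uses an invented 3-residue transition matrix rather than the real 20 x 20 PAM001 values, and assumes rows are the original residue and columns the replacement; the names `mat_mul`, `mat_power` and `toy_pam1` are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Invented one-step matrix: toy_pam1[i][j] = P(residue i -> residue j).
# Each row sums to 1.
toy_pam1 = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.01, 0.98],
]

toy_pam2 = mat_power(toy_pam1, 2)
# P(0 -> 2) in two steps = sum over the intermediate residue k
# of P(0 -> k) * P(k -> 2).
print(toy_pam2[0][2])
```

Under these assumptions, a 250-step matrix would be obtained as `mat_power(toy_pam1, 250)`.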
16
PAM250 matrix
17
Hinton diagram of the PAM250 matrix

Yellow boxes indicate positive values (accepted
mutations)
Red boxes indicate negative values (avoided
mutations).
The area of each box is proportional to the
absolute value of the log-odds score.

18
BLOSUM scoring matrices

Henikoff and Henikoff (1992) analyzed
substitution rates on the basis of aligned
regions (blocks)
They calculated scoring matrices from blocks with
different percentages of protein divergence
Examples:
BLOSUM62: calculated from blocks with 62%
similarity
BLOSUM80: calculated from blocks with 80%
similarity
When these substitution matrices are used to
score sequence alignments, one should always
choose the matrix appropriate to the expected
percentage of similarity.

Reference: Henikoff, S. and Henikoff, J.G.
(1992). Amino acid substitution matrices from
protein blocks. PNAS 89:10915-10919.
19
BLOSUM62
20
BLOSUM62: substitutions between acidic residues
21
BLOSUM62: substitutions between basic residues
22
BLOSUM62: substitutions between aromatic residues
23
BLOSUM62: substitutions between polar residues
24
BLOSUM62: substitutions between hydrophobic
residues
25
Substitution matrices - summary

Different substitution scoring matrices have been
established:
PAM (Dayhoff, 1978)
BLOSUM (Henikoff & Henikoff, 1992)
Residue categories (Phylip)
Substitution matrices make it possible to detect
similarities between more distant proteins than
would be detected with simple residue identity.
The matrix must be chosen carefully, depending on
the expected rate of conservation between the
sequences to be aligned.
Beware:
With PAM matrices,
the number indicates the percentage of
substitutions per position -> higher numbers are
appropriate for more distant proteins.
With BLOSUM matrices,
the number indicates the percentage of
conservation -> higher numbers are appropriate
for more conserved proteins.

26
Scoring an alignment with a substitution matrix

The substitution matrix can be used to assign a
score to a pair-wise alignment.
The score of the alignment is the sum, over all
the aligned positions (i from 1 to L), of the
scores of the pairs of residues (r1,i and r2,i).
Gaps are treated by subtracting a penalty, with
two parameters
Gap opening (go) penalty
Typical values between -10 and -15
Gap extension (ge) penalty
Typical values between -0.5 and -2

i      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
seq1   R   L   A   S   V   E   T   D   M   P   -   -   -   -   -   L   T   L   R   Q   H
gap    .   .   .   .   .   .   .   .   .   .  go  ge  ge  ge  ge   .   .   .   .   .   .
seq2   T   L   T   S   L   Q   T   T   L   K   N   L   K   E   M   A   H   L   G   T   H
S     -1   4   0   4   1   2   5  -1   2  -1 -10  -1  -1  -1  -1  -1  -2   4  -2  -1   8
Total score: 7
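The scoring of this example alignment can be reproduced with a short sketch. The pair scores are copied from the score row of the example (only the pairs that occur in it are listed), with a gap opening penalty of -10 and a gap extension penalty of -1 as in the example; the names `PAIR_SCORES` and `alignment_score` are hypothetical, and the gap handling is a simplification that treats any run of gap columns as a single gap.

```python
# Scores for the residue pairs that occur in the example alignment only.
# Keys are alphabetically sorted pairs, so lookup is order-independent.
PAIR_SCORES = {
    ("R", "T"): -1, ("L", "L"): 4, ("A", "T"): 0, ("S", "S"): 4,
    ("L", "V"): 1, ("E", "Q"): 2, ("T", "T"): 5, ("D", "T"): -1,
    ("L", "M"): 2, ("K", "P"): -1, ("A", "L"): -1, ("H", "T"): -2,
    ("G", "R"): -2, ("Q", "T"): -1, ("H", "H"): 8,
}


def alignment_score(seq1, seq2, pair_scores, gap_open=-10, gap_extend=-1):
    """Sum the pair scores over all aligned positions; the first column of a
    gap costs gap_open, each following gap column costs gap_extend."""
    score = 0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += pair_scores[tuple(sorted((a, b)))]
            in_gap = False
    return score


print(alignment_score("RLASVETDMP-----LTLRQH",
                      "TLTSLQTTLKNLKEMAHLGTH",
                      PAIR_SCORES))
```

The 15 scored residue pairs contribute 21 and the 5-residue gap costs -10 + 4 x (-1) = -14, which gives the total of 7.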
27
Substitution matrices in dot plots

Substitution matrices can be used in dot plots
The user has to specify the following parameters
Window size (word size)
Score threshold
Substitution matrix
At each position of the plot
A pair of words of size w are extracted from the
first and second sequences.
The score of the word pair is calculated by
summing the scores of the corresponding pairs of
residues.
If the pair of words passes the threshold, a
black diagonal is displayed at the corresponding
position of the dot plot.
The regions of similarity between the two
sequences appear as black diagonals on the
dot plot.
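The procedure above can be sketched as follows. This is an illustration of the idea only, not the implementation of dotmatcher or any other program; the toy DNA scoring scheme (match 2, A-T mismatch -1, other mismatch -2) stands in for a real substitution matrix, and the names `score_pair` and `matrix_dot_plot` are hypothetical.

```python
def score_pair(a, b):
    """Toy substitution scores: match 2, A-T mismatch -1, other mismatch -2."""
    if a == b:
        return 2
    if {a, b} == {"A", "T"}:
        return -1
    return -2


def matrix_dot_plot(seq_x, seq_y, window, threshold, score=score_pair):
    """Return (i, j, word_score) for every pair of words of length `window`
    whose summed residue-pair score reaches `threshold`."""
    hits = []
    for j in range(len(seq_y) - window + 1):
        for i in range(len(seq_x) - window + 1):
            total = sum(score(seq_x[i + k], seq_y[j + k]) for k in range(window))
            if total >= threshold:
                hits.append((i, j, total))
    return hits


# Toy words (hypothetical) differing by one A-T substitution: an exact word
# match would miss them, a threshold of 5 keeps them.
print(matrix_dot_plot("ACGT", "TCGT", window=4, threshold=5))
```

Raising the threshold or shrinking the window makes the plot stricter, which is why the parameters need to be tuned for each pair of sequences.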

28
Aspartokinases dot plot with simple word matches

Let us compare the peptidic sequences of two
enzymes from the bacterium Escherichia coli K12.
LysC: aspartokinase involved in lysine
biosynthesis
MetL: bifunctional enzyme which combines two
domains
aspartokinase: catalyses the first step of
methionine biosynthesis
homoserine dehydrogenase: catalyses the third step
of methionine biosynthesis
On the dot plot, the region of similarity between
the two aspartokinase domains is barely visible.

29
Aspartokinases dot plot with substitution matrix
(BLOSUM62)

With dotmatcher, a substitution matrix is used to
score the similarity between each pair of
residues.
This reveals the similarity between the
aspartokinase domains of MetL and LysC.
Note that this similarity only covers the
N-terminal half of MetL, because it is a
bifunctional enzyme: the C-terminal domain is a
homoserine dehydrogenase.
It is quite tricky to find the appropriate
parameters, because these can vary from case to
case.

30
Software tools for displaying dot plot

DNA Strider
Marck C (1988) 'DNA Strider': a 'C' program for
the fast analysis of DNA and protein sequences on
the Apple Macintosh family of computers. Nucleic
Acids Res. 16:1829-36.
http://www...
Dotlet
Junier T, Pagni M (2000) Dotlet: diagonal plots
in a web browser. Bioinformatics. 16:178-9.
http://myhits.isb-sib.ch/cgi-bin/dotlet
Dotter
Sonnhammer EL, Durbin R (1995) A dot-matrix
program with dynamic threshold control suited for
genomic DNA and protein sequence analysis. Gene.
167:GC1-10.
http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html

31
Bibliography

Substitution matrices
PAM series
Dayhoff, M. O., Schwartz, R. M. & Orcutt, B.
(1978). A model of evolutionary change in
proteins. Atlas of Protein Sequence and Structure
5, 345-352.
BLOSUM substitution matrices
Henikoff, S. & Henikoff, J. G. (1992). Amino acid
substitution matrices from protein blocks. Proc
Natl Acad Sci U S A 89, 10915-9.
Gonnet matrices, built by an iterative procedure
Gonnet, G. H., Cohen, M. A. & Benner, S. A.
(1992). Exhaustive matching of the entire protein
sequence database. Science 256, 1443-5.