Jacques.van.Helden@ulb.ac.be

Provided by: jacquesv8
1
Sequence analysis, Part 1: Dot plots and
substitution matrices
  • Introduction to Bioinformatics

2
Contents
  • Pairwise sequence alignment
  • Dot plots (dottup, dotmatcher)
  • Substitution matrices
  • Gapless alignment
  • Alignment with gaps
  • Global alignment (Needleman-Wunsch)
  • Local alignment (Smith-Waterman)
  • Matching a sequence against a database (Fasta,
    BLAST)
  • Multiple sequence alignment (ClustalX)
  • Matching motifs against sequences

3
Dot plots
  • Bioinformatics

4
Dot plot
  • A dot plot is a simple graphical representation
    of identical residues between two sequences.
  • The X axis represents the first sequence (PHO5).
  • The Y axis represents the second sequence (PHO3).
  • A dot is plotted for each match between two
    residues of the sequences.
  • Diagonal lines reveal regions of identity between
    the two sequences.

Example: protein sequences of the Pho5p and Pho3p
phosphatases in the yeast Saccharomyces
cerevisiae
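
The definition above can be sketched in a few lines of Python. This is only an illustration: the two short sequences are invented toy strings, not the real Pho5p and Pho3p proteins, and the function names are hypothetical.

```python
def dot_plot(seq_x, seq_y):
    """Return a grid where grid[j][i] is True when seq_y[j] equals seq_x[i]."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(grid):
    """Draw the grid as text: '*' for a match, '.' for a mismatch."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Hypothetical toy sequences (X axis = first sequence, Y axis = second sequence).
grid = dot_plot("MFKSVVYS", "MFKSLVYS")
print(render(grid))
```

The run of '*' along the main diagonal is the region of identity; isolated '*' elsewhere are chance matches.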
5
Dot plot with word matches
  • With nucleic acid sequences, a given residue is
    expected to match by chance at one position in 4
    on average. A letter-based dot plot is thus very
    confusing.
  • The dot plot can be adapted to display only word
    matches, which correspond to a diagonal of dots
    in the letter-based dot plot.
  • Example: alignment of PHO5 and PHO3 coding
    sequences, with different word sizes.
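
A minimal sketch of the word-based variant, assuming exact string comparison of words. The DNA strings are invented for illustration (they are not the PHO5 and PHO3 coding sequences) and `word_matches` is a hypothetical name.

```python
def word_matches(seq_x, seq_y, word_size):
    """Return (i, j) start positions where the words of length word_size are identical."""
    hits = []
    for j in range(len(seq_y) - word_size + 1):
        for i in range(len(seq_x) - word_size + 1):
            if seq_x[i:i + word_size] == seq_y[j:j + word_size]:
                hits.append((i, j))
    return hits


# Hypothetical toy DNA sequences differing at one position.
seq_a = "ATGTTTAAATCT"
seq_b = "ATGTTCAAATCT"
single = word_matches(seq_a, seq_b, 1)  # letter-based plot: many chance matches
words = word_matches(seq_a, seq_b, 4)   # word-based plot: only the diagonal remains
print(len(single), words)
```

Increasing the word size removes the background of chance matches, at the cost of losing the words that overlap the mismatch.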

6
Detecting repeats with a dot plot
  • Sequence repeats are easily detected in a dot
    plot when a sequence is compared to itself.
  • The main diagonal is completely marked (by
    definition, since the sequence is identical to
    itself)
  • Repeats appear as segments of lines parallel to
    the diagonal.
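
As a rough illustration of repeat detection, the sketch below compares a sequence to itself and reports only the word matches above the main diagonal. The sequence and the function name `self_repeats` are invented for the example.

```python
def self_repeats(seq, word_size):
    """Return (i, j) pairs with i < j where the same word starts at both positions."""
    hits = []
    last_start = len(seq) - word_size
    for i in range(last_start + 1):
        for j in range(i + 1, last_start + 1):
            if seq[i:i + word_size] == seq[j:j + word_size]:
                hits.append((i, j))
    return hits


# Hypothetical sequence containing the motif GATTACA twice, 10 positions apart.
seq = "CCGATTACATTTGATTACAGG"
repeats = self_repeats(seq, 5)
print(repeats)
```

All hits share the same offset j - i, which is what appears on the plot as a line segment parallel to the diagonal.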

7
Gray scale dottups
  • Word matches require a perfect match over the
    whole word length.
  • A more refined way to show partial similarities
    is to use windows.
  • For each point, the window score is the sum of
    matches of its neighbours on the diagonal.
  • The gray level reflects the window similarity
    score.
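
One possible reading of the window score, sketched under the assumption that the window is centred on the point and that positions falling outside the sequences are simply skipped. The function names and the `half_width` parameter are hypothetical.

```python
def window_score(seq_x, seq_y, i, j, half_width):
    """Count identical residues on the diagonal through (i, j), within half_width positions."""
    score = 0
    for k in range(-half_width, half_width + 1):
        x, y = i + k, j + k
        if 0 <= x < len(seq_x) and 0 <= y < len(seq_y) and seq_x[x] == seq_y[y]:
            score += 1
    return score


def gray_levels(seq_x, seq_y, half_width=2):
    """Return window scores scaled to the range 0..1 (row = seq_y, column = seq_x)."""
    size = 2 * half_width + 1
    return [[window_score(seq_x, seq_y, i, j, half_width) / size
             for i in range(len(seq_x))]
            for j in range(len(seq_y))]


# Hypothetical toy sequences with one mismatch at position 3.
levels = gray_levels("ACGTACGT", "ACGAACGT")
print(levels[3][3])
```

Unlike a word match, the mismatching position still receives a high (but not maximal) gray level because its neighbours on the diagonal match.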

Source ?
8
Alignment matrices and substitution matrices
  • Bioinformatics

9
Alignment matrix
  • An alignment matrix is conceptually related to a
    dot plot
  • One sequence is pasted horizontally, the other
    vertically
  • A score is assigned to each match
  • In the case of dot plots, the scoring scheme was
    very simple
  • match: 1
  • mismatch: 0

10
Substitution matrix
  • One could decide to give a lower cost to A-T
    substitutions, if we assume that these are more
    likely to occur in our sequences
  • Example:
  • match: 2
  • A-T mismatch: -1
  • other mismatch: -2
  • The scoring scheme can be represented as a
    substitution matrix

11
Scoring an alignment matrix with a substitution
matrix
  • Let us come back to our previous alignment matrix
  • For each cell of the alignment matrix, we compare
    the residue in sequences A and B, and take the
    score for this pair of residues in the
    substitution matrix.
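
The procedure can be sketched with the example scoring scheme given earlier (match 2, A-T mismatch -1, other mismatch -2). The two sequences are invented for illustration and the function names are hypothetical.

```python
def substitution_score(a, b):
    """Example scheme: match 2, A-T mismatch -1, any other mismatch -2."""
    if a == b:
        return 2
    if {a, b} == {"A", "T"}:
        return -1
    return -2


def alignment_matrix(seq_a, seq_b):
    """One column per residue of seq_a (horizontal), one row per residue of seq_b (vertical)."""
    return [[substitution_score(a, b) for a in seq_a] for b in seq_b]


# Hypothetical toy sequences.
matrix = alignment_matrix("GATTA", "GTTCA")
for row in matrix:
    print(row)
```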

12
Substitution counts in 71 groups of aligned
proteins (Dayhoff, 1978)
13
Substitution matrices for proteins
  • Margaret Dayhoff (1978) measured the rate of
    substitutions between each pair of amino acids,
    in a collection of aligned proteins.
  • Scores are calculated as log-odds
  • Positive values reflect frequent ("accepted")
    substitutions, i.e. substitutions that occur more
    frequently than expected by chance.
  • Negative values reflect rare ("unfavourable")
    mutations, i.e. substitutions that occur less
    frequently than expected by chance
  • The diagonal reflects residue conservation

Reference: Dayhoff et al. (1978). A model of
evolutionary change in proteins. In Atlas of
Protein Sequence and Structure, vol. 5, suppl. 3,
345-352. National Biomedical Research Foundation,
Silver Spring, MD, 1978.
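
A simplified sketch of a log-odds score: the logarithm of the ratio between the observed frequency of a residue pair and the frequency expected by chance (the product of the two residue frequencies). The scaling factor of 10 with a base-10 logarithm is an assumption, and the frequencies below are invented, not Dayhoff's data.

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b, scale=10.0):
    """Scaled log10 of observed pair frequency over the chance expectation freq_a * freq_b."""
    expected = freq_a * freq_b
    return scale * math.log10(observed_pair_freq / expected)


# Hypothetical frequencies: both residues occur at 10%, so chance expectation is 1%.
frequent = log_odds(0.02, 0.1, 0.1)   # pair seen twice as often as expected
rare = log_odds(0.005, 0.1, 0.1)      # pair seen half as often as expected
print(round(frequent, 2), round(rare, 2))
```

A pair observed more often than expected gets a positive score, a pair observed less often gets a negative score, and a pair observed exactly as often as expected scores 0.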
14
PAM scoring matrices
  • The alignments used by Dayhoff had 85% identity
  • However, frequencies of substitutions are
    expected to depend on the rate of divergence
    between sequences: the number of substitutions
    increases with time.
  • In order to take into account the divergence
    rate, Margaret Dayhoff calculated a series of
    scoring matrices, each reflecting a certain level
    of divergence
  • PAM001: rates of substitutions between amino-acid
    pairs expected for proteins with an average of 1
    substitution per 100 positions
  • PAM050: rates of substitutions between amino-acid
    pairs expected for proteins with an average of
    50 substitutions per 100 positions
  • PAM250: 250 mutations per 100 positions (note: a
    position could mutate several times)
  • The substitution matrix must thus be chosen
    according to the relatedness of the sequences to
    be aligned

15
Extrapolation of the PAM series from PAM001
X      M(i,3) = P(Asn -> X)   M(17,j) = P(X -> Thr)
Ala    0.0009                 0.0022
Arg    0.0001                 0.0002
Asn    0.9822                 0.0013
Asp    0.0042                 0.0004
Cys    0.0000                 0.0001
Gln    0.0004                 0.0003
...    ...                    ...
Thr    0.0013                 0.9871
Trp    0.0000                 0.0000
Tyr    0.0003                 0.0002
Val    0.0001                 0.0009
P(Asn -> Thr) = P(Asn -> Ala -> Thr) + P(Asn -> Arg -> Thr) + ... + P(Asn -> Val -> Thr)
              = (0.0009)(0.0022) + (0.0001)(0.0002) + ... + (0.0001)(0.0009)
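
The sum over intermediate residues is one entry of a matrix product, so higher members of the series can be obtained by raising the one-step matrix to a power. The sketch below assumes a hypothetical 3-residue alphabet with invented probabilities (not Dayhoff's values), and uses the convention row = starting residue, column = resulting residue.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power (n >= 1) by repeated multiplication."""
    result = matrix
    for _ in range(n - 1):
        result = mat_mul(result, matrix)
    return result


# Invented one-step substitution probabilities; each row sums to 1.
step1 = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.03, 0.96],
]
step2 = mat_power(step1, 2)      # two successive steps
step250 = mat_power(step1, 250)  # analogous to going from PAM001 to PAM250
print(step2[0][2])
```

Here step2[0][2] is the sum, over every intermediate residue, of the probability of reaching it in one step times the probability of going from it to the final residue, which is the same calculation as the Asn -> Thr example.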
16
PAM250 matrix
17
Hinton diagram of the PAM250 matrix
  • Yellow boxes indicate positive values (accepted
    mutations)
  • Red boxes indicate negative values (avoided
    mutations).
  • The area of each box is proportional to the
    absolute value of the log-odds score.

18
BLOSUM scoring matrices
  • Henikoff and Henikoff (1992) analyzed
    substitution rates on the basis of aligned
    regions (blocks)
  • They calculated scoring matrices from blocks with
    different percentages of protein divergence
  • Example:
  • BLOSUM62: calculated from blocks with 62%
    similarity
  • BLOSUM80: calculated from blocks with 80%
    similarity
  • When these substitution matrices are used to
    score sequence alignments, one should always
    choose the matrix appropriate to the expected
    percentage of similarity.

Reference: Henikoff, S. and Henikoff, J.G.
(1992). Amino acid substitution matrices from
protein blocks. PNAS 89:10915-10919.
19
BLOSUM62
20
BLOSUM62: substitutions between acidic residues
21
BLOSUM62: substitutions between basic residues
22
BLOSUM62: substitutions between aromatic residues
23
BLOSUM62: substitutions between polar residues
24
BLOSUM62: substitutions between hydrophobic
residues
25
Substitution matrices - summary
  • Different substitution scoring matrices have been
    established
  • PAM (Dayhoff, 1978)
  • BLOSUM (Henikoff & Henikoff, 1992)
  • Residue categories (Phylip)
  • Substitution matrices make it possible to detect
    similarities between more distant proteins than
    would be detected with the simple identity of
    residues.
  • The matrix must be chosen carefully, depending on
    the expected rate of conservation between the
    sequences to be aligned.
  • Beware:
  • With PAM matrices
  • the number indicates the percentage of
    substitutions per position -> higher numbers are
    appropriate for more distant proteins
  • With BLOSUM matrices
  • the number indicates the percentage of
    conservation -> higher numbers are appropriate
    for more conserved proteins

26
Scoring an alignment with a substitution matrix
  • The substitution matrix can be used to assign a
    score to a pair-wise alignment.
  • The score of the alignment is the sum, over all
    the aligned positions (i from 1 to L), of the
    scores of the pairs of residues (r1,i and r2,i).
  • Gaps are treated by subtracting a penalty, with
    two parameters
  • Gap opening (go) penalty
  • Typical values between -10 and -15
  • Gap extension (ge) penalty
  • Typical values between -0.5 and -2

i      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
seq1   R   L   A   S   V   E   T   D   M   P   -   -   -   -   -   L   T   L   R   Q   H
gap    .   .   .   .   .   .   .   .   .   .  go  ge  ge  ge  ge   .   .   .   .   .   .
seq2   T   L   T   S   L   Q   T   T   L   K   N   L   K   E   M   A   H   L   G   T   H
S     -1   4   0   4   1   2   5  -1   2  -1 -10  -1  -1  -1  -1  -1  -2   4  -2  -1   8
Total score: 7
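
The example alignment on this slide can be re-scored with a short sketch. The pair scores are copied from the score row of the example (only the pairs that occur are listed), with go = -10 and ge = -1 as used there. The function name is hypothetical, and the gap handling is simplified: any run of consecutive gap columns counts as a single gap.

```python
# Scores of the residue pairs occurring in the example alignment.
PAIR_SCORES = {
    ("R", "T"): -1, ("L", "L"): 4, ("A", "T"): 0, ("S", "S"): 4, ("V", "L"): 1,
    ("E", "Q"): 2, ("T", "T"): 5, ("D", "T"): -1, ("M", "L"): 2, ("P", "K"): -1,
    ("L", "A"): -1, ("T", "H"): -2, ("R", "G"): -2, ("Q", "T"): -1, ("H", "H"): 8,
}


def alignment_score(row1, row2, gap_open=-10, gap_extend=-1):
    """Sum pair scores over aligned columns; charge gap_open once, then gap_extend."""
    total = 0
    in_gap = False
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += PAIR_SCORES[(a, b)]
            in_gap = False
    return total


seq1 = "RLASVETDMP-----LTLRQH"
seq2 = "TLTSLQTTLKNLKEMAHLGTH"
print(alignment_score(seq1, seq2))  # prints 7
```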
27
Substitution matrices in dot plots
  • Substitution matrices can be used in dot plots
  • The user has to specify the following parameters
  • Window size (word size)
  • Score threshold
  • Substitution matrix
  • At each position of the plot
  • A pair of words of size w is extracted from the
    first and second sequences.
  • The score of the word pair is calculated by
    summing the scores of the corresponding pairs of
    residues.
  • If the pair of words passes the threshold, a
    black diagonal is displayed at the corresponding
    position of the dot plot.
  • The regions of similarity between the two
    sequences appear as black diagonals on the
    dot plot.
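
The steps above can be sketched as follows. The scoring function is an invented toy scheme standing in for a real substitution matrix such as BLOSUM62, "passes the threshold" is assumed to mean "greater than or equal to", and all names are hypothetical.

```python
def word_score(word_a, word_b, score):
    """Sum the substitution scores of the corresponding residue pairs."""
    return sum(score(a, b) for a, b in zip(word_a, word_b))


def scored_dot_plot(seq_x, seq_y, word_size, threshold, score):
    """Return (i, j) start positions whose word pair scores at least threshold."""
    hits = []
    for j in range(len(seq_y) - word_size + 1):
        for i in range(len(seq_x) - word_size + 1):
            word_a = seq_x[i:i + word_size]
            word_b = seq_y[j:j + word_size]
            if word_score(word_a, word_b, score) >= threshold:
                hits.append((i, j))
    return hits


def toy_score(a, b):
    """Invented scheme: identity 2, exchange among I, L and V 1, anything else -1."""
    if a == b:
        return 2
    if a in "ILV" and b in "ILV":
        return 1
    return -1


# Hypothetical peptides differing by a conservative V/I substitution and a T/S substitution.
hits = scored_dot_plot("MKVLAT", "MKILAS", 4, 6, toy_score)
print(hits)
```

With exact word matching and the same word size, these two peptides would give no hit at all, since every word of size 4 contains at least one substitution.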

28
Aspartokinases dot plot with simple word matches
  • Let us compare the peptide sequences of two
    enzymes from the bacterium Escherichia coli K12.
  • LysC: aspartokinase involved in lysine
    biosynthesis
  • MetL: bifunctional enzyme which combines two
    domains
  • aspartokinase: catalyses the first step of
    methionine biosynthesis
  • homoserine dehydrogenase: catalyses the third step
    of methionine biosynthesis
  • On the dot plot, the region of similarity between
    the two aspartokinase domains is barely visible.

29
Aspartokinases dot plot with substitution matrix
(BLOSUM62)
  • With dotmatcher, a substitution matrix is used to
    score the similarity between each pair of
    residues.
  • This reveals the similarity between the
    aspartokinase domains of MetL and LysC.
  • Note that this similarity only covers the
    N-terminal half of MetL, because it is a
    bi-functional enzyme: the C-terminal domain is a
    homoserine dehydrogenase.
  • It is quite tricky to find the appropriate
    parameters, because these can vary from case to
    case.

30
Software tools for displaying dot plots
  • DNA Strider
  • Marck C (1988) 'DNA Strider': a 'C' program for
    the fast analysis of DNA and protein sequences on
    the Apple Macintosh family of computers. Nucleic
    Acids Res. 16:1829-36.
  • http://www...
  • Dotlet
  • Junier T, Pagni M (2000) Dotlet: diagonal plots
    in a web browser. Bioinformatics. 16:178-9.
  • http://myhits.isb-sib.ch/cgi-bin/dotlet
  • Dotter
  • Sonnhammer EL, Durbin R (1995) A dot-matrix
    program with dynamic threshold control suited for
    genomic DNA and protein sequence analysis. Gene.
    167:GC1-10.
  • http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html

31
Bibliography
  • Substitution matrices
  • PAM series
  • Dayhoff, M. O., Schwartz, R. M. & Orcutt, B.
    (1978). A model of evolutionary change in
    proteins. Atlas of Protein Sequence and Structure
    5, 345-352.
  • BLOSUM substitution matrices
  • Henikoff, S. & Henikoff, J. G. (1992). Amino acid
    substitution matrices from protein blocks. Proc
    Natl Acad Sci U S A 89, 10915-9.
  • Gonnet matrices, built by an iterative procedure
  • Gonnet, G. H., Cohen, M. A. & Benner, S. A.
    (1992). Exhaustive matching of the entire protein
    sequence database. Science 256, 1443-5.