Scoring Matrices - PowerPoint PPT Presentation

1 / 47
About This Presentation
Title:

Scoring Matrices

Description:

It is easy to score if an amino acid is identical to another (the score is 1 if ... a score for amino acids that are somewhat ... Amino Acid Scoring Matrices ... – PowerPoint PPT presentation

Number of Views:759
Avg rating:3.0/5.0
Slides: 48
Provided by: jmom5
Category:
Tags: acid | amino | matrices | scoring

less

Transcript and Presenter's Notes

Title: Scoring Matrices


1
Scoring Matrices
2
(No Transcript)
3
Limitations to Needleman-Wunsch
  • The problem with Needleman-Wunsch is the amount
    of processor memory resources it requires.
  • Because of this, it is not favored for practical
    use, despite the guarantee of an optimal
    alignment.

4
What is the problem?
  • There are about 10 88 possible alignments for
    two sequences with 300 nucleotides long( There
    are only about 10 80 elementary particles in the
    universe.
  • It is not possible to solve the alignment
    problem with brute force.
  • Therefore, we need some smart methods (or
    algorithms to overcome this problem

5
Limitations to Needleman-Wunsch
  • The other difficulty is that the concept of
    global alignment is not used in pairwise sequence
    comparison searches.

6
Global Alignment vs. Local Alignment
Global
Needleman-Wunsch Method
Local
Dot Plots Smith-Waterman FastA BLAST
7
Global alignment The global alignment optimizes
the alignment over the full length of the
sequences. LGPSTKDFGKISESREFDN LNQLERSFGKINMRLE-D
A
  • Local Alignment
  • ---------------FGKI-----------
  • ---------------FGKI-----------
  • In local alignment ,stretches with the highest
    density of matches are given the highest
    priority.
  • The alignment tends to stop at the ends of
    regions of identity or strong similarity.

8
Purpose of Smith Waterman Algorithm
  • Smith-Waterman dynamic programming algorithm,
    finds the most similar subsequences of two
    sequences, that has been generally recognized as
    the most sensitive sequence.
  • The search sequences in protein and DNA databases
    searches for similarity to the query sequence by
    using Smith-Waterman algorithm as the core
    sequence comparison method.

9
Smith-Waterman searches
  • A more sensitive brute force approach to
    searching
  • much slower than BLAST or FASTA
  • uses dynamic programming
  • SSEARCH is a GCG program for Smith-Waterman
    searches

10
Differences
Needleman- Wunsch
Smith - Waterman
  • Local alignments
  • Residue alignment score may be positive or
    negative
  • Requires a gap penalty to work effectively
  • Score can increase, decrease or stay level
    between two cells of a pathway.
  • Global alignments
  • Requires alignments score for a pair of residues
    to be gt0
  • No gap penalty required

11
Scoring Matrix/Substitution Matrix
  • To score quality of an alignment
  • Contains scores for pairs of residues (amino
    acids or nucleic acids) in a sequence alignment
  • For protein/protein comparisons a 20 x 20
    matrix of similarity scores where identical amino
    acids and those of similar character (e.g. Ile,
    Leu) give higher scores compared to those of
    different character (e.g. Ile, Asp).
  • Symmetric, so often only half is shown.

12
Substitution Matrices
  • Not all amino acids are equal
  • Some are more easily substituted than others
  • Some mutations occur more often
  • Some substitutions are kept more often
  • Mutations tend to favor some substitutions
  • Some amino acids have similar codons
  • They are more likely to be changed from DNA
    mutation
  • Selection tends to favor some substitutions
  • Some amino acids have similar properties or
    structure
  • They are more likely to be kept

13
Substitution Matrix
  • A substitution matrix describes the likelihood
    that two residue types would mutate to each other
    in evolutionary time.
  • This is used to estimate how well two residues of
    given types would match if they were aligned in a
    sequence alignment.

14
Substitution Matrix
  • An amino acid substitution matrix is a
    symmetrical 2020 matrix, where each element
    contains the score for substituting a residue of
    type i with a residue of type j in a protein,
    where i and j are one of the 20 amino-acid
    residue types.
  • Same residues should obviously have high scores,
    but if we have different residues in a position,
    how should that be scored?

15
Scoring Matrices
  • Scoring matrices tell how similar amino acids
    are.
  • There are two main sets of scoring matrices PAM
    and BLOSUM.
  • PAM is based on evolutionary distances
  • BLOSUM is based on structure/function similarities

16
Substitution Matrix Scoring
  • The same residues in a position give the score
    value 1, and different residues give 0.
  • The same residues give a score 1, similar
    residues (for example Tyr/Phe, or Ile/Leu) give
    0.5, and all others 0.
  • One may calculate, using well established
    sequence alignments, the frequencies
    (probabilities) that a particular residue in a
    position is exchanged for another.

17
Similarity Searching
  • It is easy to score if an amino acid is identical
    to another (the score is 1 if identical and 0 if
    not). However, it is not easy to give a score
    for amino acids that are somewhat similar.
  • Should they get a 0 (non-identical) or a 1
    (identical) or something in between?

Isoleucine
Leucine
18
Scoring Similarity
  • 1) Can only score aligned sequences
  • 2) DNA is usually scored as identical or not
  • 3) Modified scoring for gaps - single vs.
    multiple base gaps (gap extension)
  • 4) AAs have varying degrees of similarity
  • a. of mutations to convert one to another
  • b. chemical similarity
  • c. observed mutation frequencies
  • 5) PAM matrix calculated from observed mutations
    in protein families

19
Dayhoff Matrix
  • This was done originally be Margaret Dayhoff.
    Her matrices are called the PAM (Point Accepted
    Mutation) matrices, which describe the exchange
    frequencies after having accepted a given number
    of point mutations over the sequence.
  • Typical values are PAM 120 (120 mutations per 100
    residues in a protein) and PAM 250.
  • There are many other substitution matrices
    BLOSUM, Gonnet, etc.

20
Dayhoff Matrix
  • Derived from how often different amino acids
    replace other amino acids in evolution.
  • Created from a dataset of closely similar protein
    sequences (less than 15 amino acid difference).
    These could be unambiguously aligned.
  • A mutation probability matrix was derived where
    the entries reflect the probabilities of a
    mutational event.
  • This matrix is called PAM 1. An evolutionary
    distance of 1 PAM (point accepted mutation) means
    there has been 1 point mutation per 100 residues

21
Importance of Scoring Matrices
  • Scoring matrices appear in all analyses involving
    sequence comparisons.
  • The choice of matrix can strongly influence the
    outcome of the analysis.
  • Scoring matrices implicitly represent a
    particular theory of relationships.
  • Understanding theories underlying a given scoring
    matrix can aid in making proper choice.

22
Scoring Matrix Conventions
  • Scoring matrices are conventionally numbered with
    numeric indices corresponding to the rows and
    columns of the matrix.
  • For example, M11 refers to the entry at the first
    row and the first column.
  • In general, Mij refers to the entry at the ith
    row and the jth column.

23
Scoring Matrices
  • To use this for sequence alignment, we simply
    associate a numeric value to each letter in the
    alphabet of the sequence.
  • For example, if the matrix is A,C,T,G then A
    1, C 2, etc. Thus, one would find the score for
    a match between A and C at M12.

24
The Filled-in F matrix for global alignment of
xAAGT and YAGCGT(using BLOSUM50 substitution
matrix)
25
Global alignment using BLOSUM50 substitution
matrix
alignment AAG _T AGCGT
26
Amino Acid Scoring Matrices
  • There are two major scoring matrices for amino
    acid sequence comparisons
  • PAM-derived from sequences known to be closely
    related (Eg. Chimpanzee and human). Ranges from
    PAM1 to PAM500
  • BLOSUM-derived from sequences not closely related
    (Eg. E. coli and human). Ranges from BLOSUM
    10-BLOSUM 100

27
PAM250 Matrix
28
The Point-Accepted-Mutation (PAM) model
  • This model implies that amino acids (AA) mutate
    independently of each other with a probability
    which depends only on the AA.
  • Since there are 20 AA, the transition
    probabilities are described by a 20X20-mutation
    matrix, denoted by M. A standard M defines a
    1-PAM change.
  • Point Accepted Mutation (PAM) Distance A 1-PAM
    unit changes 1 of the amino acids on average
  • where fi is the frequency of AAi, and Mii is the
    frequency of no change in amino acid i.

29
The Point-Accepted-Mutation (PAM) model
  • Started by Margaret Dayhoff, 1978
  • A series of matrices describing the extent to
    which two amino acids have been interchanged in
    evolution
  • PAM-1 was obtained by aligning very similar
    sequences. Other PAMs were obtained by
    extrapolation

30
The Point-Accepted-Mutation (PAM) model of
evolution and the PAM scoring matrix
A 2-PAM unit is equivalent to two 1-PAM unit
evolution (or M2). A k-PAM unit is equivalent to
k 1-PAM unit evolution (or Mk). Example 1
CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQ
V
CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPE
IQV length 50 1 mismatch PAM distance 2
31
The Point-Accepted-Mutation (PAM) model of
evolution and the PAM scoring matrix
Observed Sequence Difference
Evolutionary Distance In PAMs
1 5 10 20 40 50 60 70 80
1 5 11 23 56 80 112 159 246
32
Assumptions in the PAM model
1. Replacement at any site depends only on the
amino acid at that site and the probability given
by the table (Markov model). 2. Sequences that
are being compared have average amino acid
composition.
33
Steps to building the first PAM
  • Aligned sequences that were at least 85
    identical.
  • Reconstructed phylogenetic trees and inferred
    ancestral sequences. 71 trees containing 1,572 aa
    exchanges were used.
  • Tallied aa replacements "accepted" by natural
    selection, in all pairwise comparisons (each Aij
    is the number of times amino acid j was replaced
    by amino acid i in all comparisons).

34
Steps to building PAM
  • 4. Computed amino acid mutability, mj (the
    propensity of a given amino acid, j, to be
    replaced by any other amino acid)
  • 5. Combined data from 3 4 to produce a
    Mutation Probability Matrix for one PAM of
    evolutionary distance, according to the following
    formula

Replacements
35
Steps to building PAM
6. Take the log odds ratio to obtain each
score Sij log (Mij/fi) Where fi is the
normalized frequency of aai in the sequences
used. 7. Note must multiply the Mij/fi by a
factor of 10 prior to avoid fractions.
36
Sources of error in PAM model
1. Many sequences depart from average aa
composition. 2. Rare replacements were observed
too infrequently to determine probabilities
accurately (for 36 aa pairs (out of 400 aa pairs)
no replacements were observed!). 3. Errors in 1
PAM are magnified when extrapolated to 250 PAM.
(Mijk k PAM) 4. The idea that each amino acid
is acting independently is an imperfect
representation of evolution. Actually, distantly
related sequences usually have islands (blocks)
of conserved residues implying that replacement
is not equally probable over entire sequence.
37
The bottom line on PAM
Frequency of alignment
Frequency of occurrence
The probability that two amino acids, i and j
are aligned by evolutionary descent divided by
the probability that they are aligned by chance
38
BLOSUM Matrix (BLOcks SUbstitution Matrices)
  • Blocks Sum-created from BLOCKS database
  • A series of matrices describing the extent to
    which two amino acids are interchangeable in
    conserved structures of proteins
  • The number in the series represents the threshold
    percent similarity between sequences, for
    consideration for calculation
  • (For example, BLOSUM62 means 62 of the aas were
    similar)

39
BLOSUM Matrices
  • BLOSUM is built from distantly related sequences
    within conserved blocks whereas PAM is built from
    closely related sequences
  • BLOSUM is built from conserved blocks of aligned
    protein segments found in the BLOCKS database
    (the BLOCKS database is a secondary database that
    depends on the PROSITE Family database)

40
BLOSUM Matrices (cont.1)
  • Version 8.0 of the Blocks Database consists of
    2884 blocks based on 770 protein families
    documented in PROSITE. PROSITE supplies
    documentation for each family.

Hypothetical entry in red box in BLOCK record
AABCDA...BBCDA DABCDA.A.BBCBB BBBCDABA.BCCAA AAACD
AC.DCBCDB CCBADAB.DBBDCC AAACAA...BBCCC
41
Building BLOSUM Matrices
  • 1. To build the BLOSUM 62 matrix one must
    eliminate sequences that are identical in more
    than 62 of their amino acid sequences. This is
    done by either removing sequences from the Block
    or by finding a cluster of similar sequences and
    replacing it with a single representative
    sequence.
  • 2. Next, the probability for a pair of amino
    acids to be in the same column is calculated. In
    the previous page this would be the probability
    of replacement of A with A, A with B, A with C,
    and B with C. This gives the value qij
  • 3. Next, one calculates the probability that a
    certain amino acid frequency exists, fi.

42
Building BLOSUM Matrices (cont.)
  • 4. Finally, we calculate the log odds ratio si,j
    log2 (qij/fi). This value is entered into the
    matrix.
  • Which BLOSUM to use?

BLOSUM Identity 80
80 62
62 (usually default value) 35
35
If you are comparing sequences that are very
similar, use BLOSUM 80. Sequences that are more
divergent (dissimilar) than 20 are given very
low scores in this matrix.
43
Which Scoring Matrix to use?
  • PAM-1
  • BLOSUM-100
  • Small evolutionary distance
  • High identity within short sequences
  • PAM-250
  • BLOSUM-20
  • Large evolutionary distance
  • Low identity within long sequences

44
The PAM 250 Scoring Matrix
45
GCG Wisconsin Package GAP
  • GAP is the implementation of the Needleman-Wunsch
    algorithm in the GCG program package.
  • The NW algorithm will present you with a single
    globally optimal alignment, not all possible
    optimal alignments - different alignments may
    exist that give the same score.
  • GAP presents you with one member of the family of
    best alignments that align the full length of one
    sequence to the full length of a second sequence.
  • There may be many members of this family, but no
    other member has a higher score.

46
GCG Wisconsin Package GAP
  • The primary use of a global alignment algorithm
    is when you really want the whole of two
    sequences to be aligned, without truncation.
  • GAP could completely bypass a region of high
    local homology, if a better (or even just as
    good) path can be found in a different way.
  • This is problematic if one short sequence is
    aligned against a longer one with internal
    repeats.
  • If there is weak or unknown similarity between
    two sequences, a local alignment algorithm
    (BESTFIT) is the better choice.
  • Use GAP only when you believe the similarity is
    over the whole length.

47
Global Alignment vs. Local Alignment
  • Global alignment is used when the overall gene
    sequence is similar to another sequence-often
    used in multiple sequence alignment.
  • Clustal W algorithm
  • Local alignment is used when only a small portion
    of one gene is similar to a small portion of
    another gene.
  • BLAST
  • FASTA
  • Smith-Waterman algorithm
Write a Comment
User Comments (0)
About PowerShow.com