Provided by: jiml7
Title: Scoring Matrices


1
Scoring Matrices
  • Scoring matrices, PSSMs, and HMMs

BIO520 Bioinformatics, Jim Lund
2
Alignment scoring matrix
  • DNA matrix
         A   C   G   T
     A   5  -4  -4  -4
     C  -4   5  -4  -4
     G  -4  -4   5  -4
     T  -4  -4  -4   5

3
Alignment scoring matrix
  • Protein matrix

4
Use of a scoring matrix
  • P  L  S  -  -  C  F  G
  • G  L  T  -  A  C  H  L
  • 1  1  1 -2 -1  1  1  1
  • Score = 3
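The column-by-column scoring shown above can be sketched in a few lines. This example uses the +5/-4 DNA matrix from slide 2 rather than a protein matrix; the gap penalty of -8 and the function name are assumptions made for illustration, not values from the slides.

```python
# Substitution scores from the DNA matrix on slide 2: +5 match, -4 mismatch.
BASES = "ACGT"
DNA_MATRIX = {(a, b): (5 if a == b else -4) for a in BASES for b in BASES}

GAP_PENALTY = -8  # assumed value; the slides do not give a DNA gap penalty


def score_alignment(top, bottom, matrix=DNA_MATRIX, gap=GAP_PENALTY):
    """Sum the per-column scores of two equal-length aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # any column containing a gap character
        else:
            total += matrix[(a, b)]
    return total


print(score_alignment("ACGT", "ACGT"))    # four matches
print(score_alignment("ACGTA", "AC-TT"))  # three matches, one gap, one mismatch
```

A single flat gap penalty is the simplest choice; the protein example above appears to charge more for opening a gap than for extending it.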

5
Consensus sequences
  • Different ways to describe a consensus, from
    crude to refined
  • Consensus site
  • Sequence logos
  • Position Specific Score Matrix (PSSM)
  • Hidden Markov Model (HMM)

6
Consensus sequences and sequence logos
GTMGFGLPAAIGAKLARPDRRVVAIDGDGSFQMTVQELST
Consensus sequence
Sequence logo
7
Constructing (and using) a consensus sequence
  • Collect sequences
  • Align sequences (consensus sites are descriptions
    of the alignment)
  • Condense the set of sequences into a summary
    (a consensus, PSSM, or HMM).
  • Apply the scoring matrix in alignments/searches.
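As a rough sketch of the "condense" step, a consensus can be read off an alignment by taking the most common residue in each column. The majority rule (more than half of the sequences) and the "x" placeholder for columns without a majority are assumptions for this example; the three sequences are the ones used in the PSSM example slides.

```python
from collections import Counter


def consensus(alignment, placeholder="x"):
    """Return the majority residue per column, or a placeholder if none."""
    n = len(alignment)
    result = []
    for column in zip(*alignment):          # iterate over alignment columns
        residue, count = Counter(column).most_common(1)[0]
        # Keep the residue only if it occurs in more than half the sequences.
        result.append(residue if count > n / 2 else placeholder)
    return "".join(result)


print(consensus(["NTEGEWI", "NITRGEW", "NIAGECC"]))
```

This is the crudest of the descriptions listed earlier: it discards how strongly each position is conserved, which is what a PSSM keeps.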

8
Position Specific Score Matrix (PSSM)
  • A position specific scoring matrix (PSSM) is a
    matrix based on the amino acid frequencies (or
    nucleic acid frequencies) at every position of a
    multiple alignment.
  • The PSSM calculated from these frequencies
    assigns higher scores to residues that appear
    more often than expected by chance at a given
    position.

9
Creating a PSSM Example
  • NTEGEWI
  • NITRGEW
  • NIAGECC

Amino acid frequencies at every position of the
alignment
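The per-position counts and frequencies for the three example sequences can be computed as follows. This is a minimal sketch; the function names are made up for illustration.

```python
from collections import Counter

ALIGNMENT = ["NTEGEWI", "NITRGEW", "NIAGECC"]


def column_counts(alignment):
    """Count how often each residue occurs in each alignment column."""
    return [Counter(column) for column in zip(*alignment)]


def column_frequencies(alignment):
    """Convert the counts to fractions of the number of sequences."""
    n = len(alignment)
    return [{aa: c / n for aa, c in counts.items()}
            for counts in column_counts(alignment)]


for position, freqs in enumerate(column_frequencies(ALIGNMENT), start=1):
    print(position, freqs)
```

Residues that never occur in a column are simply absent from its dictionary, which is the zero-frequency problem the next slide addresses.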
10
Creating a PSSM Example
  • Amino acids that do not appear at a specific
    position of a multiple alignment must also be
    considered in order to model every possible
    sequence and have calculable log-odds scores. A
    simple procedure called pseudocounts assigns
    minimal scores to residues that do not appear at
    a certain position of the alignment according to
    the following equation:
  • Score(i, j) = (Frequency(i, j) + pseudocount) / (N + 20 × pseudocount)
  • Where:
  • Frequency(i, j) is the frequency of residue i in
    column j (the count of occurrences).
  • pseudocount is a number greater than or equal to 1.
  • N is the number of sequences in the multiple
    alignment.
  • 20 is the number of amino acid types.

11
Creating a PSSM Example
  • In this example, N = 3, and let's use pseudocount
    = 1.
  • Score(N) at position 1 = 3/3 = 1.
  • Score(I) at position 1 = 0/3 = 0.
  • Readjust with the pseudocount:
  • Score(I) at position 1 -> (0 + 1) / (3 + 20) = 1/23
    ≈ 0.0435.
  • Score(N) at position 1 -> (3 + 1) / (3 + 20) = 4/23
    ≈ 0.174.
  • The PSSM is obtained by taking the logarithm
    (base 10 here) of (the values obtained above
    divided by the background frequency of the
    residues).
  • To simplify, for this example we'll assume that
    every amino acid appears equally often in protein
    sequences, i.e. f_i = 0.05 for every i.
  • PSSM Score(I) at position 1 = log10((1/23) / 0.05)
    ≈ -0.061.
  • PSSM Score(N) at position 1 = log10((4/23) / 0.05)
    ≈ 0.541.
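The arithmetic in this worked example can be checked with a short script. It assumes the adjusted frequency is (count + pseudocount) / (N + 20 × pseudocount), which is what the numbers above use, and a base-10 logarithm, which reproduces the quoted scores. The function names are illustrative only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def adjusted_frequency(count, n_seqs, pseudocount=1):
    """Pseudocount-adjusted frequency of a residue in one column."""
    return (count + pseudocount) / (n_seqs + len(AMINO_ACIDS) * pseudocount)


def log_odds(count, n_seqs, background=0.05, pseudocount=1):
    """Base-10 log of adjusted frequency over background frequency."""
    return math.log10(adjusted_frequency(count, n_seqs, pseudocount) / background)


# Position 1 of the example alignment: N occurs 3 times, I occurs 0 times.
print(round(log_odds(3, 3), 3))  # PSSM Score(N)
print(round(log_odds(0, 3), 3))  # PSSM Score(I)
```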

12
Creating a PSSM Example
  • The matrix assigns positive scores to residues
    that appear more often than expected by chance
    and negative scores to residues that appear less
    often than expected by chance.

13
Using a PSSM
  • To search for matches to a PSSM, scan along the
    sequence using a window of the same length (L)
    as the PSSM.
  • The matrix is slid along the sequence one residue
    at a time, and the scores of the residues in
    every region of length L are added.
  • Scores that are higher than an empirically
    predetermined threshold are reported.
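A sliding-window scan of this kind might look like the sketch below. The PSSM is built from the three example sequences with the pseudocount method (pseudocount 1, uniform background 0.05, base-10 log); the query sequence and the threshold of 1.0 are invented for illustration, since a real threshold would be chosen empirically.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1, background=0.05):
    """One dict of log-odds scores per alignment column."""
    n = len(alignment)
    denom = n + len(AMINO_ACIDS) * pseudocount
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({aa: math.log10((counts[aa] + pseudocount) / denom / background)
                     for aa in AMINO_ACIDS})
    return pssm


def scan(sequence, pssm, threshold):
    """Slide the PSSM along the sequence; report windows above threshold."""
    length = len(pssm)
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        # Assumes the sequence contains only the 20 standard residues.
        score = sum(pssm[i][aa] for i, aa in enumerate(window))
        if score > threshold:
            hits.append((start, window, score))
    return hits


pssm = build_pssm(["NTEGEWI", "NITRGEW", "NIAGECC"])
hits = scan("MKNIEGEWIAA", pssm, threshold=1.0)  # hypothetical query and threshold
for start, window, score in hits:
    print(start, window, round(score, 2))
```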

14
Advantages of PSSM
  • Weights sequence according to observed diversity
    specific to the family of interest
  • Minimal assumptions
  • Easy to compute
  • Can be used in comprehensive evaluations.

15
More sophisticated PSSMs
From less to more complicated:
  1. PSSM with pseudocounts.
  2. Giving pseudocounts less weight when more
     alignment data is available.
  3. Weight pseudocount amino acids by their frequency
     of occurrence in proteins.
  4. Instead of giving pseudocounts all the same
     value, weight them by their similarity to the
     consensus (like BLOSUM62 does) at each position
     (PSI-BLAST method).
  5. Combine 2-4 (Dirichlet mixture method).
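One common way to get the second and third refinements in the list above is to add a fixed total pseudocount B, split among residues in proportion to their background frequencies. The formula (count + B × background) / (N + B) is an assumption for this sketch, not necessarily the one behind these slides; B and the function names are illustrative.

```python
def weighted_frequency(count, n_seqs, background, total_pseudocount=20.0):
    """Adjusted frequency with pseudocounts proportional to background."""
    return (count + total_pseudocount * background) / (n_seqs + total_pseudocount)


def pseudocount_weight(n_seqs, total_pseudocount=20.0):
    """Fraction of the estimate contributed by pseudocounts."""
    return total_pseudocount / (n_seqs + total_pseudocount)


# With B = 20 and a uniform background of 0.05 this reduces to the simple
# (count + 1) / (N + 20) rule from the worked example.
print(weighted_frequency(0, 3, background=0.05))

# The pseudocount share shrinks as the alignment grows.
print(pseudocount_weight(3), pseudocount_weight(300))
```

Because B stays fixed while N grows, the observed counts dominate for large alignments, and rare residues receive smaller pseudocounts than common ones.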

16
Figure: a PSSM column with a perfectly conserved
isoleucine, scored with different methods.
Panels: Method 1 and the standard BLOSUM62 matrix; Method 5.
17
Using Hidden Markov models to describe sequence
alignment profiles
  • A profile HMM can represent a sequence alignment
    profile similar to how a PSSM does.
  • A profile HMM includes information on the amino
    acid consensus at each position in the alignment
    like a PSSM.
  • A profile HMM also has position-specific scores
    for gap insertion and extensions.

18
Background: Creating HMMs
  • To create an HMM to model data we need to
    determine two things:
  • The structure/topology of the HMM: states and
    transitions.
  • The values of the parameters: emission and
    transition probabilities.
  • Determining the parameters is called training.

19
An HMM structure/topology
  • M = match state (score the aa in the sequence at
    this position in the profile)
  • I = insertion (w.r.t. profile: insert gap
    characters in profile)
  • D = deletion (w.r.t. sequence: insert gap
    characters in sequence)
  • M1 is the first aa in the profile, M2 is the
    second, etc.
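The topology can be written down as a list of allowed transitions between the M, I and D states. The sketch below is simplified: it assumes the seven transition types m->m, m->i, m->d, i->m, i->i, d->m and d->d, and it omits begin and end states. The state naming scheme and function name are made up for this example.

```python
def profile_hmm_transitions(length):
    """Allowed (from_state, to_state) pairs for a profile with `length` columns."""
    transitions = []
    for k in range(1, length):
        transitions += [
            (f"M{k}", f"M{k + 1}"),  # match -> next match
            (f"M{k}", f"I{k}"),      # match -> insertion after column k
            (f"M{k}", f"D{k + 1}"),  # match -> skip next column
            (f"I{k}", f"M{k + 1}"),  # insertion -> next match
            (f"I{k}", f"I{k}"),      # insertion self-loop (longer insert)
            (f"D{k}", f"M{k + 1}"),  # deletion -> next match
            (f"D{k}", f"D{k + 1}"),  # deletion -> next deletion
        ]
    return transitions


for src, dst in profile_hmm_transitions(3):
    print(src, "->", dst)
```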
20
Example HMMER parameters
  • NULE 595 -1558 85 338 -294 453 -1158 (...) -21
    -313 45 531 201 384
  • HMM A C D E F G H (...) m->m m->i m->d i->m i->i
    d->m d->d b->m m->e
  • 1 -1084 390 -8597 -8255 -5793 -8424 -8268
    (...) 1
  • - -149 -500 233 43 -381 399 106 (...)
  • C -1 -11642 -12684 -894 -1115 -701 -1378 -16
  • 2 -2140 -3785 -6293 -2251 3226 -2495 -727
    (...) 2
  • - -149 -500 233 43 -381 399 106 (...)
  • C -1 -11642 -12684 -894 -1115 -701 -1378
    (...)
  • 76 -2255 -5128 -302 363 -784 -2353 1398 (...)
    103
  • - -149 -500 233 43 -381 399 106 (...)
  • E -1 -11642 -12684 -894 -1115 -701 -1378
  • 77 -633 879 -2198 -5620 -1457 -5498 -4367
    (...) 104
  • - (...)
  • C 0
  • //

21
A profile HMM with match state probabilities shown for the AAs.
PATH is the consensus sequence.
22
Building a profile HMM
  • Pick an HMM structure/topology.
  • Estimate initial parameters.
  • Train the HMM by running sequences through it.
  • Transitions that get used are given higher
    probabilities, those rarely used are given lower
    probabilities.
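The idea that frequently used transitions end up with higher probabilities can be illustrated with a simple counting estimate. This sketch assumes the state path of each training sequence is already known (for example, read off an alignment), which is simpler than iterative training on unaligned sequences; the pseudocount of 1 and all names are assumptions.

```python
from collections import Counter, defaultdict


def estimate_transitions(paths, allowed, pseudocount=1.0):
    """Estimate transition probabilities from observed state paths."""
    counts = Counter()
    for path in paths:
        counts.update(zip(path, path[1:]))   # consecutive state pairs

    by_source = defaultdict(list)
    for src, dst in allowed:
        by_source[src].append(dst)

    probs = {}
    for src, dsts in by_source.items():
        # Normalize so the outgoing probabilities of each state sum to 1.
        total = sum(counts[(src, d)] + pseudocount for d in dsts)
        for d in dsts:
            probs[(src, d)] = (counts[(src, d)] + pseudocount) / total
    return probs


allowed = [("M1", "M2"), ("M1", "I1"), ("M1", "D2")]
paths = [["M1", "M2"], ["M1", "M2"], ["M1", "M2"], ["M1", "D2"]]  # toy data
probs = estimate_transitions(paths, allowed)
for transition, p in probs.items():
    print(transition, round(p, 3))
```

In the toy data M1 -> M2 is used three times, M1 -> D2 once and M1 -> I1 never, so their estimated probabilities come out in that order, with the pseudocount keeping the unused transition above zero.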

23
Protein profile HMMs
  • Better (in theory) representations than PSSMs.
  • More complicated.
  • Not hand-tuned by curators.
  • Used in some protein profile databases
  • Pfam (http://pfam.sanger.ac.uk/)
  • SMART (http://smart.embl-heidelberg.de/)
  • Difficult to describe in human readable formats.

Schuster-Böckler et al., 2004
(http://www.biomedcentral.com/1471-2105/5/7)