Title: Scoring Matrices
1. Scoring Matrices
- Scoring matrices, PSSMs, and HMMs
BIO520 Bioinformatics, Jim Lund
2. Alignment scoring matrix
- DNA matrix
        A    C    G    T
   A    5   -4   -4   -4
   C   -4    5   -4   -4
   G   -4   -4    5   -4
   T   -4   -4   -4    5
3. Alignment scoring matrix
4. Use of a scoring matrix
   P   L   S   -   -   C   F   G
   G   L   T   -   A   C   H   L
   1   1   1  -2  -1   1   1   1
- Score = 3
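The column-by-column scoring shown above can be sketched in Python. This is a minimal sketch, assuming the DNA matrix given earlier (match 5, mismatch -4) and a flat gap penalty of -4; the gap penalty and the name `score_alignment` are hypothetical and not taken from the slides.

```python
# Substitution scores from the DNA matrix: +5 for a match, -4 for a mismatch.
BASES = "ACGT"
DNA_MATRIX = {(a, b): (5 if a == b else -4) for a in BASES for b in BASES}

GAP_PENALTY = -4  # hypothetical value, not from the slides


def score_alignment(row1, row2, matrix=DNA_MATRIX, gap=GAP_PENALTY):
    """Sum the per-column scores of two aligned, equal-length sequences."""
    if len(row1) != len(row2):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap             # column containing a gap character
        else:
            total += matrix[(a, b)]  # substitution score for this column
    return total


print(score_alignment("ACG-T", "ACGAT"))
```

The alignment score is simply the sum of the column scores, so a different matrix (for example a protein matrix) can be passed in without changing the function.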
5. Consensus sequences
Different ways to describe a consensus, from crude to refined:
- Consensus site
- Sequence logos
- Position Specific Score Matrix (PSSM)
- Hidden Markov Model (HMM)
6. Consensus sequences and sequence logos
GTMGFGLPAAIGAKLARPDRRVVAIDGDGSFQMTVQELST
Consensus sequence
Sequence logo
7. Constructing (and using) a consensus sequence
- Collect sequences.
- Align sequences (consensus sites are descriptions of the alignment).
- Condense the set of sequences into a consensus description (a consensus sequence, PSSM, or HMM).
- Apply the resulting scoring matrix in alignments/searches.
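The "condense" step can be illustrated for the crudest description, a consensus sequence. The sketch below assumes a simple majority rule per column; the `min_fraction` cutoff, the `x` placeholder for poorly conserved columns, and the toy alignment are all hypothetical choices made for illustration.

```python
from collections import Counter


def consensus(aligned_seqs, min_fraction=0.5, unknown="x"):
    """Return one consensus character per alignment column."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*aligned_seqs):
        residue, count = Counter(column).most_common(1)[0]
        # Keep the residue only if it is frequent enough in this column.
        result.append(residue if count / len(column) >= min_fraction else unknown)
    return "".join(result)


# Hypothetical toy alignment, for illustration only.
alignment = ["GDGSF", "GDGTF", "GEGSY"]
print(consensus(alignment))
```

A consensus sequence discards the per-column frequencies, which is the information a PSSM or HMM retains.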
8. Position Specific Score Matrix (PSSM)
- A position specific scoring matrix (PSSM) is a matrix based on the amino acid (or nucleic acid) frequencies at every position of a multiple alignment.
- The PSSM calculated from these frequencies assigns higher scores to residues that appear more often than expected by chance at a given position.
9. Creating a PSSM Example
Amino acid frequencies at every position of the
alignment
10. Creating a PSSM Example
- Amino acids that do not appear at a specific position of a multiple alignment must also be considered, so that every possible sequence can be modeled and every log-odds score can be calculated (the logarithm of zero is undefined). A simple procedure called pseudocounts assigns minimal scores to residues that do not appear at a certain position of the alignment, according to the following equation:
  Score(i, j) = (Frequency(i, j) + pseudocount) / (N + 20 × pseudocount)
- Where:
- Frequency(i, j) is the count of occurrences of residue i in column j.
- pseudocount is a number greater than or equal to 1.
- N is the number of sequences in the multiple alignment.
- 20 is the number of amino acid types.
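The pseudocount adjustment can be sketched for a single alignment column. This is a minimal sketch, assuming the adjusted value is (count + pseudocount) / (N + 20 × pseudocount), the form used in the worked example that follows; the function name and the three-residue column are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_scores(column, pseudocount=1.0):
    """Pseudocount-adjusted frequency of every amino acid in one column."""
    counts = Counter(column)
    n = len(column)  # N: number of sequences in the alignment
    denom = n + len(AMINO_ACIDS) * pseudocount
    return {aa: (counts[aa] + pseudocount) / denom for aa in AMINO_ACIDS}


# Hypothetical column taken from three aligned sequences.
scores = column_scores("LLV")
print(round(scores["L"], 3), round(scores["A"], 3))
```

Every residue now has a nonzero value, and the 20 values in a column still sum to 1.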
11. Creating a PSSM Example
- In this example, N = 3, and let's use pseudocount = 1. (In Score(N) below, N denotes the residue asparagine, not the number of sequences.)
- Score(N) at position 1 = 3/3 = 1.
- Score(I) at position 1 = 0/3 = 0.
- Readjust with the pseudocount:
- Score(I) at position 1 -> (0 + 1) / (3 + 20) = 1/23 = 0.043.
- Score(N) at position 1 -> (3 + 1) / (3 + 20) = 4/23 = 0.174.
- The PSSM is obtained by taking the logarithm (base 10 here) of the values obtained above divided by the background frequency of the residues.
- To simplify, for this example we'll assume that every amino acid appears equally often in protein sequences, i.e. f_i = 0.05 for every i.
- PSSM Score(I) at position 1 = log10((1/23) / 0.05) = -0.061.
- PSSM Score(N) at position 1 = log10((4/23) / 0.05) = 0.541.
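The arithmetic of this example can be checked with a few lines of Python. The sketch assumes base-10 logarithms (which reproduces the values above) and the uniform background of 0.05; `log_odds` is a hypothetical helper name.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds(count, n_seqs, pseudocount=1.0, background=0.05):
    """log10 of the pseudocount-adjusted frequency over the background."""
    adjusted = (count + pseudocount) / (n_seqs + len(AMINO_ACIDS) * pseudocount)
    return math.log10(adjusted / background)


print(round(log_odds(3, 3), 3))  # residue seen in all 3 sequences: 0.541
print(round(log_odds(0, 3), 3))  # residue never seen in the column: -0.061
```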
12. Creating a PSSM Example
- The matrix assigns positive scores to residues
that appear more often than expected by chance
and negative scores to residues that appear less
often than expected by chance.
13. Using a PSSM
- To search for matches to a PSSM, scan along the sequence using a window the length (L) of the PSSM.
- The matrix is slid along the sequence one residue at a time, and the scores of the residues in every window of length L are added.
- Windows whose total score is higher than an empirically predetermined threshold are reported.
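The sliding-window search described above can be sketched as follows. The PSSM is represented here as a list of per-position dictionaries; that representation, the function name `scan`, and the toy two-position matrix are assumptions made for illustration.

```python
def scan(sequence, pssm, threshold):
    """Slide the PSSM along the sequence and report high-scoring windows.

    pssm is a list of dicts, one per position, mapping residue -> score.
    Returns (start_index, score) pairs for windows scoring above threshold.
    A residue missing from a column raises KeyError.
    """
    length = len(pssm)  # L, the window length
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        score = sum(col[res] for col, res in zip(pssm, window))
        if score > threshold:
            hits.append((start, score))
    return hits


# Hypothetical 2-position PSSM over a reduced alphabet, for illustration.
toy_pssm = [
    {"N": 0.5, "I": -0.1, "G": -0.1},
    {"N": -0.1, "I": 0.5, "G": -0.1},
]
print(scan("GNIGN", toy_pssm, threshold=0.5))
```

Because the scores are log-odds values, adding them across the window corresponds to multiplying the underlying odds ratios.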
14. Advantages of PSSM
- Weights sequence according to observed diversity specific to the family of interest.
- Minimal assumptions.
- Easy to compute.
- Can be used in comprehensive evaluations.
15. More sophisticated PSSMs
From less to more complicated:
1. PSSM with pseudocounts.
2. Giving pseudocounts less weight when more alignment data is available.
3. Weight pseudocount amino acids by their frequency of occurrence in proteins.
4. Instead of giving pseudocounts all the same value, weight them by their similarity to the consensus at each position, like BLOSUM62 does (PSI-BLAST method).
5. Combine 2-4 (Dirichlet mixture method).
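The second and third refinements listed above can be combined in one small sketch: pseudocounts are distributed in proportion to background frequencies, and their total weight shrinks relative to the data as the alignment grows. Setting the total pseudocount to the square root of the number of sequences is an assumed heuristic, not something stated in the slides, and the four-letter alphabet and background values are hypothetical.

```python
import math


def weighted_frequencies(counts, background):
    """Pseudocount-adjusted frequencies for one alignment column.

    counts: residue -> observed count in the column.
    background: residue -> background frequency (should sum to 1).
    """
    n = sum(counts.values())
    total_pseudo = math.sqrt(n)  # assumed heuristic for the total pseudocount
    return {
        res: (counts.get(res, 0) + total_pseudo * bg) / (n + total_pseudo)
        for res, bg in background.items()
    }


# Hypothetical reduced alphabet and background, for illustration only.
background = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
few = weighted_frequencies({"C": 4}, background)
many = weighted_frequencies({"C": 100}, background)
print(round(few["A"], 3), round(many["A"], 3))
```

With 4 sequences the unobserved residue A keeps a noticeable share of the column; with 100 sequences the observed data dominates.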
16. A PSSM column with a perfectly conserved isoleucine, with different methods used to calculate the scores.
Method 1 and standard BLOSUM62 matrix
Method 5
17. Using Hidden Markov models to describe sequence alignment profiles
- A profile HMM can represent a sequence alignment profile, similar to how a PSSM does.
- Like a PSSM, a profile HMM includes information on the amino acid consensus at each position in the alignment.
- A profile HMM also has position-specific scores for gap insertion and extension.
18. Background: Creating HMMs
To create an HMM to model data we need to determine two things:
- The structure/topology of the HMM: states and transitions.
- The values of the parameters: emission and transition probabilities.
- Determining the parameters is called training.
19. An HMM structure/topology
M = match state (score the aa in the sequence at this position in the profile)
I = insertion (w.r.t. profile: insert gap characters in profile)
D = deletion (w.r.t. sequence: insert gap characters in sequence)
M1 is the first aa in the profile, M2 is the second, etc.
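The match/insert/delete topology can be made concrete by listing which transitions are allowed. This sketch assumes the seven transition types that also appear in the HMMER parameter header (m->m, m->i, m->d, i->m, i->i, d->m, d->d) and leaves out the begin and end states; the function name and state naming are hypothetical.

```python
def profile_transitions(length):
    """List the allowed (from_state, to_state) pairs for a profile HMM.

    States are named M1..Mk (match), I1..Ik (insert), D1..Dk (delete).
    Begin and end states are omitted for brevity.
    """
    transitions = []
    for k in range(1, length):
        nxt = k + 1
        transitions += [
            (f"M{k}", f"M{nxt}"),  # match -> next match
            (f"M{k}", f"I{k}"),    # match -> insert after position k
            (f"M{k}", f"D{nxt}"),  # match -> delete the next position
            (f"I{k}", f"M{nxt}"),  # insert -> next match
            (f"I{k}", f"I{k}"),    # insert self-loop (longer insertion)
            (f"D{k}", f"M{nxt}"),  # delete -> next match
            (f"D{k}", f"D{nxt}"),  # delete -> delete (longer deletion)
        ]
    return transitions


for src, dst in profile_transitions(3):
    print(src, "->", dst)
```

Each of these transitions carries its own probability, which is how a profile HMM expresses position-specific gap costs.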
20. Example HMMER parameters
- NULE 595 -1558 85 338 -294 453 -1158 (...) -21 -313 45 531 201 384
- HMM A C D E F G H (...) m->m m->i m->d i->m i->i d->m d->d b->m m->e
- 1 -1084 390 -8597 -8255 -5793 -8424 -8268 (...) 1
- - -149 -500 233 43 -381 399 106 (...)
- C -1 -11642 -12684 -894 -1115 -701 -1378 -16
- 2 -2140 -3785 -6293 -2251 3226 -2495 -727 (...) 2
- - -149 -500 233 43 -381 399 106 (...)
- C -1 -11642 -12684 -894 -1115 -701 -1378 (...)
- 76 -2255 -5128 -302 363 -784 -2353 1398 (...) 103
- - -149 -500 233 43 -381 399 106 (...)
- E -1 -11642 -12684 -894 -1115 -701 -1378
- 77 -633 879 -2198 -5620 -1457 -5498 -4367 (...) 104
- - (...)
- C 0
- //
21. A profile HMM with match state probabilities shown
AAs PATH is the consensus sequence.
22. Building a profile HMM
- Pick an HMM structure/topology.
- Estimate initial parameters.
- Train the HMM by running sequences through it.
- Transitions that get used are given higher
probabilities, those rarely used are given lower
probabilities.
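The idea that frequently used transitions end up with higher probabilities can be sketched in a simplified setting. The sketch assumes the state path of each training sequence is already known, so training reduces to counting transitions and normalizing; when paths are unknown, iterative methods such as Baum-Welch are used instead. The function name and the example paths are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(paths):
    """Estimate transition probabilities from observed state paths."""
    counts = defaultdict(Counter)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1  # each use of a transition raises its count
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: c / total for dst, c in dsts.items()}
    return probs


# Hypothetical state paths for three training sequences.
paths = [
    ["M1", "M2", "M3"],
    ["M1", "M2", "M3"],
    ["M1", "I1", "M2", "M3"],
]
probs = estimate_transitions(paths)
print(probs["M1"])
```

Here the M1 to M2 transition is used twice and M1 to I1 once, so M1 to M2 receives the higher probability. In practice pseudocounts are usually added so that unused transitions do not get a probability of zero.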
23. Protein profile HMMs
- Better (in theory) representations than PSSMs.
- More complicated.
- Not hand-tuned by curators.
- Used in some protein profile databases
- Pfam (http://pfam.sanger.ac.uk/)
- SMART (http://smart.embl-heidelberg.de/)
- Difficult to describe in human readable formats.
Schuster-Böckler et al., 2004 (http://www.biomedcentral.com/1471-2105/5/7)