Title: Overview of Hidden Markov Models (HMMs) and profiles
1Overview of Hidden Markov Models (HMMs) and
profiles
2From this lecture
- Profiles
- Basics of Hidden Markov models
- Estimating HMM parameters
- Sequence weighting
- Using HMMs for alignment and homolog detection
- Subfamily HMMs
3Eddy papers in Nature Biotechnology
- http://selab.janelia.org/publications
Recommended reading
4UCSC tutorial on HMMs (by Rachel Karchin)
- http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html
(useful, but not required)
5HMMs are a kind of profile
7Sample profile
- Gribskov et al, PNAS 1987
9HMM for 5' splice site (5'SS) recognition
- Assumptions (encoded in model)
- Exons (E) have a uniform base composition
- Introns (I) are A/T rich
- The 5'SS base is almost always G
- Eddy, Nature Biotechnology 2004
10HMM for splice site recognition
- Eddy, Nature Biotechnology 2004
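The three assumptions above can be turned into a small three-state HMM (E, 5, I) and decoded with the Viterbi algorithm. The sketch below is illustrative only: the probabilities and the toy sequence are assumptions patterned on the example in Eddy (2004), and should be checked against the paper before being relied on.

```python
import math

# Emission probabilities per state (illustrative values, not fitted to data).
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # exon: uniform
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},    # 5'SS: almost always G
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},      # intron: A/T rich
}
# Transition probabilities; "end" is a silent terminal state.
TRANS = {
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "end": 0.1},
}
START = {"E": 1.0}  # every path begins in the exon state


def safe_log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return (most probable state path, its log probability)."""
    states = list(EMIT)
    score = {s: safe_log(START.get(s, 0.0)) + safe_log(EMIT[s][seq[0]])
             for s in states}
    backpointers = []
    for sym in seq[1:]:
        new_score, ptr = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda r: score[r] + safe_log(TRANS[r].get(s, 0.0)))
            new_score[s] = (score[prev] + safe_log(TRANS[prev].get(s, 0.0))
                            + safe_log(EMIT[s][sym]))
            ptr[s] = prev
        score = new_score
        backpointers.append(ptr)
    # The path must finish with a transition into the end state.
    last = max(states, key=lambda s: score[s] + safe_log(TRANS[s].get("end", 0.0)))
    best = score[last] + safe_log(TRANS[last].get("end", 0.0))
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), best


if __name__ == "__main__":
    path, logp = viterbi("CTTCATGTGAAAGCAGACGTAAGTCA")
    print(path, round(logp, 2))
```

The single "5" in the returned path marks the predicted splice site; positions before it are labeled exon and positions after it intron.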
11HMM parameter estimation using unaligned training
sequences
(State types in the figure: Delete/skip, Insert, Match)
- HMM parameter estimation
- 1. Compute probabilities of data given model (Expectation step): align sequences to the HMM and gather statistics of paths taken through the HMM
- 2. Modify HMM parameters to maximize Prob(data | model) (Maximization step) (Maximum Likelihood)
- 3. Iterate steps 1-2 until parameters converge.
Training data
>Seq1
MIVSP
>Seq2
MVVSTGP
>Seq3
MVVSSGP
>Seq4
MVLSSPP
>Seq5
MLSGPP
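The Expectation and Maximization steps listed above can be sketched with the Baum-Welch algorithm. This is a minimal illustration under stated assumptions: it uses a small fully connected HMM with two states rather than a profile HMM with match, insert and delete states, and the starting parameters and the names `forward_backward` and `baum_welch` are arbitrary choices made here, not part of the lecture or of any HMM package.

```python
import math


def forward_backward(seq, start, trans, emit):
    """E-step for one sequence, using scaled forward and backward variables.

    Returns (log-likelihood, per-position state posteriors, expected
    transition counts).
    """
    n, length = len(start), len(seq)
    alpha, scale = [], []
    for t, sym in enumerate(seq):
        if t == 0:
            row = [start[i] * emit[i][sym] for i in range(n)]
        else:
            row = [sum(alpha[-1][i] * trans[i][j] for i in range(n)) * emit[j][sym]
                   for j in range(n)]
        c = sum(row)
        scale.append(c)
        alpha.append([x / c for x in row])
    beta = [[1.0] * n for _ in range(length)]
    for t in range(length - 2, -1, -1):
        nxt = seq[t + 1]
        for i in range(n):
            beta[t][i] = sum(trans[i][j] * emit[j][nxt] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(length)]
    xi = [[sum(alpha[t][i] * trans[i][j] * emit[j][seq[t + 1]]
               * beta[t + 1][j] / scale[t + 1] for t in range(length - 1))
           for j in range(n)] for i in range(n)]
    return sum(math.log(c) for c in scale), gamma, xi


def baum_welch(seqs, n_states=2, n_iter=10):
    """Run EM; returns (start, trans, emit, log-likelihood per iteration)."""
    alphabet = sorted(set("".join(seqs)))
    # Arbitrary, slightly asymmetric starting point (an assumption).
    start = [1.0 / n_states] * n_states
    trans = [[1.0 / n_states] * n_states for _ in range(n_states)]
    emit = []
    for i in range(n_states):
        w = {a: 1.0 + 0.5 * ((k + i) % 2) for k, a in enumerate(alphabet)}
        z = sum(w.values())
        emit.append({a: v / z for a, v in w.items()})
    history = []
    for _ in range(n_iter):
        s_acc = [0.0] * n_states
        t_acc = [[0.0] * n_states for _ in range(n_states)]
        e_acc = [dict.fromkeys(alphabet, 0.0) for _ in range(n_states)]
        total = 0.0
        for seq in seqs:  # Expectation: gather expected counts
            ll, gamma, xi = forward_backward(seq, start, trans, emit)
            total += ll
            for i in range(n_states):
                s_acc[i] += gamma[0][i]
                for j in range(n_states):
                    t_acc[i][j] += xi[i][j]
                for t, sym in enumerate(seq):
                    e_acc[i][sym] += gamma[t][i]
        history.append(total)
        # Maximization: renormalize expected counts into probabilities
        start = [x / sum(s_acc) for x in s_acc]
        trans = [[x / sum(row) for x in row] for row in t_acc]
        emit = [{a: v / sum(d.values()) for a, v in d.items()} for d in e_acc]
    return start, trans, emit, history


if __name__ == "__main__":
    training = ["MIVSP", "MVVSTGP", "MVVSSGP", "MVLSSPP", "MLSGPP"]
    print(baum_welch(training)[3])
```

The log-likelihood recorded at each iteration should never decrease, which is a useful sanity check on any EM implementation. EM only finds a local optimum, so different starting points can give different models.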
12Hidden Markov Model (HMM)
(Figure: HMM with START and END states; state types Delete/skip, Insert, Match; example sequence M O R N I N G)
- Originally used in speech recognition (Rabiner, 1986)
- Proposed for DNA modeling (Churchill, 1989)
- Applied to modeling proteins (Haussler et al, 1992)
- Multiple sequence alignment
- Identification of related family members (homologs)
13Aligning sequences to an HMM to construct an MSA
Note: how to read a UCSC a2m-formatted MSA (covered in class)
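As a reminder of the a2m conventions: uppercase letters and "-" occupy match columns (aligned residues and deletions), while lowercase letters and "." occupy insert columns (inserted residues and padding). The sketch below separates the two; the helper names and the three example rows are hypothetical, made up here for illustration.

```python
def parse_a2m(text):
    """Parse FASTA-style a2m text into {sequence name: alignment row}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def match_columns(row):
    """Keep only match-column characters: uppercase residues and '-'."""
    return "".join(ch for ch in row if ch.isupper() or ch == "-")


def inserted_residues(row):
    """Keep only residues emitted by insert states (lowercase letters)."""
    return "".join(ch for ch in row if ch.islower())


if __name__ == "__main__":
    example = """
>s1
MIV.SP
>s2
MVVsTP
>s3
M-V.SP
"""
    for name, row in parse_a2m(example).items():
        print(name, match_columns(row), inserted_residues(row))
```

Every row has the same number of match-column characters, so the match columns can be stacked directly into an alignment; insert columns are not aligned across sequences.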
14Generating a multiple alignment by aligning
sequences to an HMM
Sequences
>Seq6
MIVSTSG
>Seq7
MVVTTG
>Seq8
SP
>Seq9
PP
Alignment
Seq6 M I V S T S G
Seq7 M V V - T T G
Seq8 - - - - - S P
Seq9 - - - - - P P
15Estimating HMM parameters
17Viterbi and Baum-Welch algorithms
18Simulated annealing and other methods for
handling local optima
19Sequence weighting
20Henikoff weighting
21Henikoff weighting
Weight of a character in the MSA = 1/(mk)
m = number of unique amino acids seen in the column
k = number of times a particular amino acid is seen in the column
Weight of a sequence is the average of the weights in all positions,
with the sequence weights normalized to sum to 1.
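The weighting rule above can be sketched in a few lines. The function name `henikoff_weights` is chosen here for illustration, and treating a gap character as just another symbol is an assumption of this sketch, since the slide does not say how gaps are handled.

```python
from collections import Counter


def henikoff_weights(msa):
    """Position-based sequence weights for equal-length aligned rows.

    Each character contributes 1/(m*k), where m is the number of distinct
    characters in its column and k is how often that character occurs there.
    """
    n_cols = len(msa[0])
    raw = [0.0] * len(msa)
    for col in range(n_cols):
        column = [row[col] for row in msa]
        counts = Counter(column)
        m = len(counts)
        for idx, ch in enumerate(column):
            raw[idx] += 1.0 / (m * counts[ch])
    total = sum(raw)  # averaging over columns cancels in this normalization
    return [w / total for w in raw]


if __name__ == "__main__":
    rows = ["DSLFMKI", "DSIFMKV", "DTIWMKM", "DTIWMKL",
            "DTVWMKF", "DTFRDKI", "DTFRDKV"]
    for row, weight in zip(rows, henikoff_weights(rows)):
        print(row, round(weight, 3))
```

Sequences that carry rare residues in many columns receive larger weights, which counteracts over-representation of closely related sequences in the training set.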
22Overfitting and regularization
23Using pseudocounts
24Dirichlet mixture densities
25Including prior information in profile or HMM
construction
- The use of Dirichlet mixture densities
26Profile or HMM parameter estimation using small
training sets
What other amino acids might be seen at this
position among homologs? What are their
probabilities?
27The context is critical when estimating amino
acid distributions
This position may be critical for function or
structure, and may not allow substitutions
28Dirichlet Mixture Prior: Blocks9
Parameters estimated using the Expectation
Maximization (EM) algorithm. Training data:
86,000 columns from the BLOCKS alignment database.
29Combining Prior Knowledge with Observations using
Dirichlet Mixture Densities
Dirichlet Mixtures: A Method for Improved
Detection of Weak but Significant Protein
Sequence Homology. Sjölander, Karplus, Brown,
Hughey, Krogh, Mian and Haussler. CABIOS (1996)
34Log-odds ratio
35HMM construction using an initial multiple
sequence alignment
(State types in the figure: Delete/skip, Insert, Match)
36In searching for family members, all features
must be assumed to be equally informative.
37Without knowing which features are more
important, would we recognize this relative?
38Gathering family members allows us to identify
conserved attributes and create a profile
Conserved: stripes, cat. Variable: coat color,
size.
39Profile generalization allows us to identify
some truly remote relatives
40Conflict
- For effective remote homolog detection, a profile or HMM needs information from divergent family members
- Without this context, we cannot differentiate critical from variable positions
- HMMs constructed with such data provide a coarse classification
- But, the more variability we introduce in training data, the greater the potential noise at some positions
D S L F M K I
D S I F M K V
D T I W M K M
D T I W M K L
D T V W M K F
D T F R D K I
D T F R D K V
41Divergence across the family, conservation
within subfamilies
(Plot: average BLOSUM62 score by position)
42Subfamily HMM Construction
43Assessing classification accuracy
(Figure: results by family: 7TM GPCR, ABC Transporter, Amidohydrolase, ATPase)
44Discovering and Modeling Functional Subtypes
45How to build Subfamily HMMs (SHMMs)
Share statistics between subfamilies where there
is evidence of a common distribution.
D S L F M K I
D S I F M K V
D T I W M K M
D T I W M K L
D T V W M K F
D T F R K K I
D T F R K K V
1 2 3 4 5 6 7
Keep statistics separate at positions where there
is evidence of divergent structure.
(Figure labels: positions 3 4 5; positions 1 2; positions 6 7)
Improved specificity, sensitivity, alignment
accuracy
46Step 1: Form Dirichlet Mixture Posterior
At each position, for each subfamily, construct a
Dirichlet mixture posterior by combining the
Dirichlet mixture prior with the amino
acids aligned at that position by that subfamily.
(Formula labels: (weighted) subfamily counts; mixture coefficient; component parameters; (weighted) subfamily counts of amino acid i)
47Step 2: Calculate family contribution
Other subfamilies contribute, proportional to the
probability of the amino acids they aligned at
that position, given the revised Dirichlet
mixture density.
D S L F M K I
D S I F M K V
D T I W M K M
D T I W M K L
D T V W M K F
D T F R K K I
D T F R K K V
(Formula label: (weighted) counts from subfamily s')
(Formulas for computing Prob(n | T) are in
Sjölander et al, 1996)
48Step 3: Compute pseudocounts
Add the family contribution to the observed
(weighted) counts, to obtain the pseudocounts t_i
of amino acid i
(Formula labels: (weighted) subfamily counts for subfamily s; family contribution)
49Step 4: Compute amino acid probabilities
Normally, we compute amino acid probabilities by
combining a Dirichlet mixture prior with
observed counts as follows:
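The slide's formula did not survive in this transcript. As a hedged sketch, the code below implements the usual posterior mean estimate under a Dirichlet mixture prior, as I understand it from Sjölander et al. (1996): each component j is reweighted by how well it explains the observed counts n, and the estimate for amino acid i is the weighted average of (n_i + alpha_ji) / (|n| + |alpha_j|). The three-letter alphabet and the two components in the usage example are hypothetical toy values, not the Blocks9 parameters.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_mixture_weights(counts, q, alphas):
    """Prob(component j | counts), proportional to q_j * B(n + a_j) / B(a_j).

    The multinomial coefficient is the same for every component, so it
    cancels in the normalization.
    """
    logs = [math.log(qj) + log_beta([n + a for n, a in zip(counts, aj)]) - log_beta(aj)
            for qj, aj in zip(q, alphas)]
    top = max(logs)
    unnorm = [math.exp(v - top) for v in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def estimate_probs(counts, q, alphas):
    """Posterior mean amino acid probabilities given (weighted) counts."""
    post = posterior_mixture_weights(counts, q, alphas)
    n_total = sum(counts)
    return [sum(pj * (counts[i] + aj[i]) / (n_total + sum(aj))
                for pj, aj in zip(post, alphas))
            for i in range(len(counts))]


if __name__ == "__main__":
    # Toy prior: component 0 favors letter 0, component 1 is flat.
    q = [0.5, 0.5]
    alphas = [[5.0, 0.1, 0.1], [1.0, 1.0, 1.0]]
    counts = [3.0, 0.0, 0.0]  # three observations of letter 0
    print(posterior_mixture_weights(counts, q, alphas))
    print(estimate_probs(counts, q, alphas))
```

Letters that were never observed still receive nonzero probability, and the amount depends on which prior components the observed counts resemble. That is how the prior generalizes from small training sets.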
50SHMM Remote Homolog Detection
- 515 PFAM Full MSAs, each corresponding to a unique SCOP Fold.
- Family HMMs constructed using UCSC SAM w0.5 software.
- Subfamily HMMs constructed using BETE.
- Each sequence in PDB90 assigned a family score and a subfamily score (best-of-SHMMs).
- E-values computed by fitting these scores to an extreme value distribution
Brown D, Krishnamurthy N, Dale J, Christopher W,
and Sjölander K, "Subfamily HMMs in Functional
Genomics", Proceedings of the Pacific Symposium
on Biocomputing, 2005
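To make the E-value step concrete, here is a minimal sketch of converting a score to an E-value under a Gumbel (type I extreme value) distribution, for scores where higher is better. The slide does not say how the fit was done; the method-of-moments fit below is an assumption chosen for brevity (maximum likelihood fitting is the more common choice), and all function names and scores are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Method-of-moments fit; returns (mu, lam) of a Gumbel distribution."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    lam = math.pi / (math.sqrt(6.0) * sd)
    mu = mean - EULER_GAMMA / lam
    return mu, lam


def gumbel_pvalue(score, mu, lam):
    """P(S >= score) = 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


def evalue(score, mu, lam, n_sequences):
    """Expected number of database sequences scoring at least this well by chance."""
    return n_sequences * gumbel_pvalue(score, mu, lam)


if __name__ == "__main__":
    background_scores = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up scores of unrelated sequences
    mu, lam = fit_gumbel_moments(background_scores)
    print(evalue(12.0, mu, lam, n_sequences=10000))
```

Because the E-value scales with the number of sequences searched, the same score is less significant in a larger database.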
51The Sum of the Parts Is Greater Than the Whole
(Plot labels: Error; Subfamily HMM; General HMM)
52Subfamily Decomposition Preserves Information
(Plot: average BLOSUM62 score by position)
53Scoring assumes independence of positions
http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html#statprof
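The independence assumption shows up directly in how a profile score is computed: the log-odds score of a sequence is a plain sum of per-position terms. The sketch below uses a hypothetical ungapped three-column DNA profile with made-up probabilities; a full HMM score would also add transition terms along the path.

```python
import math

# Hypothetical background and profile columns (each column sums to 1).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def column_scores(seq, profile=PROFILE, background=BACKGROUND):
    """Per-position log-odds (in bits): log2 P(x | column) / P(x | background)."""
    return [math.log2(col[x] / background[x]) for col, x in zip(profile, seq)]


def log_odds(seq, profile=PROFILE, background=BACKGROUND):
    """Total score: positions are scored independently and summed."""
    return sum(column_scores(seq, profile, background))


if __name__ == "__main__":
    for s in ("AGT", "CGT", "AAT", "CAT"):
        print(s, round(log_odds(s), 3))
```

Changing the residue at one position shifts the total by the same amount whatever the other positions contain, so correlated substitutions between positions cannot be rewarded or penalized by this kind of score.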
54HMMs do not include higher-order correlations
between positions
From UCSC
55HMM scoring assumes independence across
sites (this is not supported by biology)
56Conclusions
57Summary
- HMMs and profiles are related
- Profiles are generalizations of multiple alignments
- HMMs include parameters for emission probabilities as well as state transition probabilities
- Generalization of amino acid distributions is critical (overfitting observed data is problematic)
- Sequence weighting can help improve sensitivity, but can also cause problems
- HMMs can also be estimated from unaligned sequences (use buildmodel)
- HMM surgery enables nodes to be inserted or deleted to explore alternative topologies
- The most effective HMMs are derived from good multiple sequence alignments
- A one-size-fits-all approach to constructing HMMs for a family may not be effective
- Inclusion of too many remote homologs can degrade HMM performance
- Using structure information (secondary or tertiary) can improve the indel parameter estimation, especially if training data is limited