Title: HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
1HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
- CS 594: An Introduction to Computational Molecular Biology
- By
- Shalini Venkataraman
- Vidhya Gunaseelan
2Relationship Between DNA, RNA And Proteins
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
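The three lines above can be reproduced with a few lines of Python. This is a minimal sketch: the codon table below is a hypothetical helper that lists only the seven codons appearing in this example (a complete genetic-code table has 64 entries), and the DNA string is assumed to be the coding strand read in frame from its first base.

```python
# Minimal codon table: only the codons that occur in the slide's example.
# A full genetic-code table has 64 entries; this subset is an illustration.
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamate
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartate
}


def transcribe(dna: str) -> str:
    """Return the mRNA with the same sequence as the coding DNA strand (T -> U)."""
    return dna.replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon using CODON_TABLE."""
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)
print(protein)
```

Each amino acid is written with its one-letter code, which is why the translated product spells out the word on the slide.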
3Protein Structure
Primary Structure of Proteins
The primary structure of peptides and proteins
refers to the linear number and order of the
amino acids present.
4Protein Structure
Secondary Structure
Protein secondary structure refers to regular,
repeated patterns of folding of the protein
backbone. How a protein folds is largely dictated
by the primary sequence of amino acids.
Beta Sheet
Alpha Helix
5Multiple Alignment Process
- Process of aligning three or more sequences with each other
- Generalization of the algorithm to align two sequences
- Local multiple alignment uses the sum-of-pairs scoring scheme
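As an illustration of sum-of-pairs scoring, the sketch below scores every pair of characters in every column of an alignment and adds the results. The match, mismatch and gap values are hypothetical and chosen only for the example; real scoring schemes typically use a substitution matrix.

```python
from itertools import combinations

# Hypothetical scores, chosen only for illustration.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned characters."""
    if a == "-" and b == "-":
        return 0  # assumption: two gaps aligned to each other cost nothing
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(alignment: list[str]) -> int:
    """Sum pair_score over all pairs of rows in every column."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


alignment = ["ACG-T", "ACGAT", "A-GAT"]
print(sum_of_pairs(alignment))
```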
6HMM Architecture
- Markov Chains
- What is a Hidden Markov Model (HMM)?
- Components of HMM
- Problems of HMMs
7Markov Chains
Rain
Sunny
Cloudy
State transition matrix: the probability of the weather given the previous day's weather.
States: three states - sunny, cloudy, rainy.
Initial distribution: the probability of the system being in each of the states at time 0.
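A minimal sketch of such a weather chain follows. All probabilities are hypothetical (the slide's figure gives none in the text); the point is only that the probability of a sequence of days is the initial probability of the first day times one transition probability per following day.

```python
STATES = ["sunny", "cloudy", "rainy"]

# Hypothetical numbers: INITIAL[s] = P(state at time 0 is s).
INITIAL = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# TRANSITION[prev][cur] = P(today is cur | yesterday was prev); rows sum to 1.
TRANSITION = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}


def chain_probability(sequence):
    """Probability of observing this exact sequence of weather states."""
    p = INITIAL[sequence[0]]
    for prev, cur in zip(sequence, sequence[1:]):
        p *= TRANSITION[prev][cur]
    return p


print(chain_probability(["sunny", "sunny", "rainy"]))  # 0.5 * 0.6 * 0.1
```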
8Hidden Markov Models
Hidden states: the (true) states of a system that may be described by a Markov process (e.g., the weather).
Observable states: the states of the process that are visible (e.g., seaweed dampness).
9Components Of HMM
Output matrix: contains the probability of observing a particular observable state given that the hidden model is in a particular hidden state.
Initial distribution: contains the probability of the (hidden) model being in a particular hidden state at time t = 1.
State transition matrix: holds the probability of a hidden state given the previous hidden state.
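The three components can be written down as plain tables. The sketch below does so for the weather and seaweed example and then samples from the model, which shows how the hidden chain drives the visible output. Every number, and the observable labels "dry", "damp" and "soggy", are assumptions made for illustration.

```python
import random

HIDDEN = ["sunny", "cloudy", "rainy"]
OBSERVABLE = ["dry", "damp", "soggy"]  # hypothetical seaweed states

# All probabilities below are hypothetical.
INITIAL = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}
TRANSITION = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}
# OUTPUT[hidden][observable] = P(observable | hidden)
OUTPUT = {
    "sunny":  {"dry": 0.7, "damp": 0.2, "soggy": 0.1},
    "cloudy": {"dry": 0.3, "damp": 0.4, "soggy": 0.3},
    "rainy":  {"dry": 0.1, "damp": 0.3, "soggy": 0.6},
}


def draw(distribution, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(length, rng):
    """Sample a hidden state path and the observations it emits."""
    hidden, observed = [], []
    state = draw(INITIAL, rng)
    for _ in range(length):
        hidden.append(state)
        observed.append(draw(OUTPUT[state], rng))
        state = draw(TRANSITION[state], rng)
    return hidden, observed


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden, observed = generate(5, rng)
print(hidden)
print(observed)
```

In practice only the second printed list would be available; the first is what the scoring, alignment and training problems try to reason about.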
10Example-HMM
Transition Prob.
Output Prob.
Scoring a Sequence with an HMM: the probability of ACCY along this path is
.4 x .3 x .46 x .6 x .97 x .5 x .015 x .73 x .01 x 1 = 1.76 x 10^-6.
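The arithmetic can be checked directly. The sketch below multiplies the ten factors in the order they are listed on the slide; it does not try to say which factor is a transition and which is an output probability, since that depends on the figure. The log-space sum is included because products of many small probabilities underflow on longer sequences.

```python
import math

# Factors copied from the slide, in the order they are listed there.
factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(factors)

# Equivalent computation in log space, which stays stable for long paths.
log_probability = sum(math.log(f) for f in factors)

print(f"{path_probability:.3g}")
print(log_probability)
```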
11Problems With HMM
Scoring problem: given an existing HMM and an
observed sequence, what is the probability that
the HMM can generate the sequence?
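The scoring problem is usually solved with the forward algorithm, which sums over all hidden paths without enumerating them. The sketch below uses a hypothetical two-state model (all numbers are assumptions) and compares the result with a brute-force sum over every path.

```python
from itertools import product

# Hypothetical two-state model, for illustration only.
STATES = ["sunny", "rainy"]
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
OUTPUT = {
    "sunny": {"dry": 0.8, "damp": 0.2},
    "rainy": {"dry": 0.3, "damp": 0.7},
}


def forward(observations):
    """Return P(observations | model), summed over all hidden paths."""
    # alpha[s] = P(observations so far, current hidden state is s)
    alpha = {s: INITIAL[s] * OUTPUT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: OUTPUT[s][obs] * sum(alpha[r] * TRANSITION[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def brute_force(observations):
    """Same quantity by explicit enumeration of every path (exponential cost)."""
    total = 0.0
    for path in product(STATES, repeat=len(observations)):
        p = INITIAL[path[0]] * OUTPUT[path[0]][observations[0]]
        for prev, cur, obs in zip(path, path[1:], observations[1:]):
            p *= TRANSITION[prev][cur] * OUTPUT[cur][obs]
        total += p
    return total


obs = ["dry", "damp", "damp"]
print(forward(obs), brute_force(obs))
```

The forward pass costs on the order of (number of states)^2 per observation, while the enumeration grows exponentially with the sequence length.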
12Problems With HMM
- Alignment Problem
- Given a sequence, what is the optimal state
sequence that the HMM would use to generate it?
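The alignment problem is commonly solved with the Viterbi algorithm, which keeps only the best path into each state at each step. The sketch below is a minimal version on a hypothetical two-state model; all probabilities are assumptions.

```python
# Hypothetical two-state model, for illustration only.
STATES = ["sunny", "rainy"]
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
OUTPUT = {
    "sunny": {"dry": 0.8, "damp": 0.2},
    "rainy": {"dry": 0.3, "damp": 0.7},
}


def viterbi(observations):
    """Return (best hidden path, joint probability of that path and the data)."""
    # best[s] = probability of the most probable path that ends in state s
    best = {s: INITIAL[s] * OUTPUT[s][observations[0]] for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: best[r] * TRANSITION[r][s])
            pointers[s] = prev
            new_best[s] = best[prev] * TRANSITION[prev][s] * OUTPUT[s][obs]
        back.append(pointers)
        best = new_best
    # Trace the back pointers from the best final state.
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


path, probability = viterbi(["dry", "damp", "damp"])
print(path, probability)
```

For long sequences the products should be replaced by sums of logarithms to avoid underflow.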
13Problems With HMM
- Training Problem
- Given a large amount of data, how can we estimate
the structure and the parameters of the HMM that
best account for the data?
14HMMs in Biology
- Gene finding and prediction
- Protein-Profile Analysis
- Secondary Structure prediction
- Advantages
- Limitations
15Finding genes in DNA sequence
This is one of the most challenging and
interesting problems in computational biology at
the moment. With so many genomes being sequenced
so rapidly, it remains important to begin by
identifying genes computationally.
16What is a (protein-coding) gene?
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
17In more detail (color state)
(Left)
(Removed)
18Gene Finding HMMs
- Our Objective
- To find the coding and non-coding regions of an unlabeled string of DNA nucleotides
- Our Motivation
- Assist in the annotation of genomic data produced by genome sequencing methods
- Gain insight into the mechanisms involved in transcription, splicing and other processes
19Why HMMs
- Classification: classifying observations within a sequence
- Order: a DNA sequence is a set of ordered observations
- Grammar: our grammatical structure (and the beginnings of our architecture) is right here
- Success measure: complete exons correctly labeled
- Training data: available from various genome annotation projects
20HMMs for gene finding
- Training: Expectation Maximization (EM)
- Parsing: Viterbi algorithm
An HMM for unspliced genes. x: non-coding DNA, c: coding state
21Genefinders- a comparison
Sn: Sensitivity, Sp: Specificity, Ac: Approximate Correlation, ME: Missing Exons, WE: Wrong Exons
GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
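The slide does not define these measures, so the sketch below assumes the nucleotide-level definitions commonly used in gene-finder benchmarks: sensitivity is the fraction of true coding bases that were predicted as coding, specificity is the fraction of predicted coding bases that are truly coding, and the approximate correlation rescales the average of four conditional probabilities to the range -1 to 1. The counts in the example are hypothetical and are not GENSCAN results.

```python
def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, AC) from per-base counts.

    tp: coding bases predicted as coding
    fp: non-coding bases predicted as coding
    tn: non-coding bases predicted as non-coding
    fn: coding bases predicted as non-coding
    Assumed definitions (gene-finding convention, see lead-in).
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    # Average conditional probability, then rescale from [0, 1] to [-1, 1].
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp)
                  + tn / (tn + fp) + tn / (tn + fn))
    ac = (acp - 0.5) * 2
    return sn, sp, ac


# Hypothetical counts for a 1000-base test sequence.
sn, sp, ac = nucleotide_accuracy(tp=90, fp=10, tn=880, fn=20)
print(round(sn, 3), round(sp, 3), round(ac, 3))
```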
22Protein Profile HMMs
- Motivation
- Given a single amino acid target sequence of unknown structure, we want to infer the structure of the resulting protein. Use profile similarity.
- What is a Profile?
- Protein families of related sequences and structures
- Same function
- Clear evolutionary relationship
- Patterns of conservation: some positions are more conserved than others
23An Overview
Aligned Sequences Build a Profile HMM (Training)
Database search
Multiple alignments (Viterbi)
Query against Profile HMM database (Forward)
24Building from an existing alignment
ACA --- ATG
TCA ACT ATC
ACA C-- AGC
AGA --- ATC
ACC G-- ATC
insertion
Transition probabilities
Output Probabilities
An HMM for a DNA motif alignment. The
transitions are shown with arrows whose thickness
indicates their probability. In each state, the
histogram shows the probabilities of the four
bases.
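The per-state histograms can be estimated by counting bases in each alignment column. The sketch below assumes the slide's alignment is five sequences of nine columns, with columns 3 to 5 (zero-based) forming the region labelled "insertion"; that layout is a reading of the slide, not something it states in text. No pseudocounts are added, so a base never seen in a column gets probability zero.

```python
from collections import Counter

# Assumed layout of the slide's alignment: five rows, nine columns.
ALIGNMENT = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]
INSERT_COLUMNS = {3, 4, 5}  # assumption: the block labelled "insertion"


def column_frequencies(alignment, column):
    """Relative frequency of each base in one column, ignoring gaps."""
    bases = [row[column] for row in alignment if row[column] != "-"]
    counts = Counter(bases)
    return {base: n / len(bases) for base, n in counts.items()}


# One output distribution per match state (every non-insert column).
match_columns = [c for c in range(len(ALIGNMENT[0])) if c not in INSERT_COLUMNS]
for c in match_columns:
    print(c, column_frequencies(ALIGNMENT, c))
```

Transition probabilities can be estimated the same way, by counting how many sequences move from each state to each possible next state.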
25Building Final Topology
Deletion states
Matching states
Insertion states
No. of matching states = average sequence length in the family
PFAM: database of protein families (http://pfam.wustl.edu)
26Database Searching
- Given an HMM, M, for a sequence family, find all members of the family in the database.
- LL score: LL(x) = log P(x | M)
- (The LL score is length dependent: must normalize or use a Z-score)
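The sketch below shows the two corrections mentioned above under stated assumptions. The per-residue score simply divides by the sequence length. The Z-score is assumed to be measured against the LL scores of other sequences of similar length, which is one common choice and is not spelled out on the slide. All numbers are hypothetical.

```python
import math
import statistics


def ll_score(probability):
    """LL(x) = log P(x | M), given the probability from the forward algorithm."""
    return math.log(probability)


def per_residue(ll, length):
    """Length-normalized score: average log-likelihood per residue."""
    return ll / length


def z_score(ll, background_lls):
    """Standard deviations above the mean score of comparable sequences.

    background_lls: LL scores of sequences of similar length (assumption).
    """
    mean = statistics.mean(background_lls)
    sd = statistics.stdev(background_lls)
    return (ll - mean) / sd


# Hypothetical scores for one query and three background sequences.
query_ll = -10.0
print(per_residue(query_ll, length=7))
print(z_score(query_ll, [-20.0, -22.0, -18.0]))
```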
27Query a new sequence
Suppose I have a query protein sequence and I am
interested in which family it belongs to. There
can be many paths leading to the generation of
this sequence. Need to find all these paths and
sum the probabilities.
Consensus sequence: ACAC - - ATC
P(ACACATC) = 0.8x1 x 0.8x1 x 0.8x0.6 x 0.4x0.6 x 1x1 x 0.8x1 x 0.8 = 4.7 x 10^-2
28Multiple Alignments
- Try every possible path through the model that would produce the target sequences
- Keep the best one and its probability.
- Output: sequence of match, insert and delete states
- Viterbi algorithm: dynamic programming
29Building from unaligned sequences
- Baum-Welch expectation-maximization method
- Start with a model whose length matches the average length of the sequences and with random output and transition probabilities.
- Align all the sequences to the model.
- Use the alignment to alter the output and transition probabilities
- Repeat. Continue until the model stops changing
- By-product: it produces a multiple alignment
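The loop described above can be sketched in pure Python. To stay short, this example trains a generic two-state HMM over two symbols rather than a profile HMM with match, insert and delete states, and it uses unscaled forward and backward passes, which is only safe for short sequences. The toy data, the state count and the stopping threshold are all assumptions.

```python
import math
import random


def forward(obs, init, trans, emit):
    n = len(init)
    alpha = [[init[s] * emit[s][obs[0]] for s in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([emit[s][o] * sum(prev[r] * trans[r][s] for r in range(n))
                      for s in range(n)])
    return alpha


def backward(obs, trans, emit):
    n = len(trans)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[s][r] * emit[r][o] * nxt[r] for r in range(n))
                        for s in range(n)])
    return beta


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def baum_welch_step(sequences, init, trans, emit):
    """One EM iteration. Returns new parameters and the log-likelihood
    of the data under the parameters passed in."""
    n, m = len(init), len(emit[0])
    init_acc = [0.0] * n
    trans_acc = [[0.0] * n for _ in range(n)]
    emit_acc = [[0.0] * m for _ in range(n)]
    log_likelihood = 0.0
    for obs in sequences:
        alpha = forward(obs, init, trans, emit)
        beta = backward(obs, trans, emit)
        p_obs = sum(alpha[-1])
        log_likelihood += math.log(p_obs)
        for t, o in enumerate(obs):
            for s in range(n):
                # Expected occupancy of state s at time t.
                gamma = alpha[t][s] * beta[t][s] / p_obs
                emit_acc[s][o] += gamma
                if t == 0:
                    init_acc[s] += gamma
                if t + 1 < len(obs):
                    for r in range(n):
                        # Expected use of the transition s -> r at time t.
                        trans_acc[s][r] += (alpha[t][s] * trans[s][r]
                                            * emit[r][obs[t + 1]]
                                            * beta[t + 1][r] / p_obs)
    return (normalize(init_acc), [normalize(r) for r in trans_acc],
            [normalize(r) for r in emit_acc], log_likelihood)


def random_row(size, rng):
    return normalize([rng.random() + 0.1 for _ in range(size)])


rng = random.Random(1)
# Toy data: each sequence is a list of symbol indices (0 or 1).
sequences = [[0, 0, 1, 1, 0], [0, 1, 1, 1], [0, 0, 0, 1, 1, 1]]
n_states, n_symbols = 2, 2
init = random_row(n_states, rng)
trans = [random_row(n_states, rng) for _ in range(n_states)]
emit = [random_row(n_symbols, rng) for _ in range(n_states)]

history = []
for _ in range(50):
    init, trans, emit, ll = baum_welch_step(sequences, init, trans, emit)
    history.append(ll)
    if len(history) > 1 and history[-1] - history[-2] < 1e-9:
        break  # the model has stopped changing
print(history[0], history[-1])
```

The log-likelihood never decreases from one iteration to the next, but the final parameters depend on the random starting point, so several restarts are usually compared.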
30PHMM Example
An alignment of 30 short amino acid sequences
chopped out of an alignment of the SH3 domain. The
shaded areas are the most conserved and were
represented by the main states in the HMM. The
unshaded area was represented by an insert state.
31Prediction of Protein Secondary structures
- Prediction of secondary structures is needed for the prediction of protein function.
- Analyze the amino-acid sequences of proteins
- Learn secondary structures
- helix, sheet and turn
- Predict the secondary structures of sequences
32Advantages
- Characterize an entire family of sequences.
- Position-dependent character distributions and position-dependent insertion and deletion gap penalties.
- Built on a formal probabilistic basis
- Can make libraries of hundreds of profile HMMs and apply them on a large scale (whole genome)
33Limitations
- Markov Chains
- Probabilities of states are supposed to be independent
- P(y) must be independent of P(x), and vice versa
- This usually isn't true
P(x)
P(y)
34Limitations - contd
- Standard Machine Learning Problems
- Watch out for local maxima
- Model may not converge to a truly optimal parameter set for a given training set
- Avoid over-fitting
- You're only as good as your training set
- More training is not always good
35CONCLUSION
- For links and slides:
- www.evl.uic.edu/shalini/hmm/