MARKOV MODELS - PowerPoint PPT Presentation

Provided by: ValuedGate1823

Learn more at: https://people.brandeis.edu

1
HIDDEN MARKOV MODELS

Presentation by Jeff Rosenberg, Toru Sakamoto, Freeman Chen

2
The Plan

Modeling Biological Sequences
Markov Chains
Hidden Markov Models
Issues
Examples
Techniques and Algorithms
Doing it with Mathematica

3
Biological Sequences
FA12_HUMAN VVGGLVALRGAHPYIAALYWGHSFCAGSLIAPC
TRYP_PIG IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWV
TRY1_BOVIN IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWV
URT1_DESRO STGGLFTDITSHPWQAAIFAQNRRSSGERFLCGG
TRY1_SALSA IVGGYECKAYSQTHQVSLNSGYHFCGGSLVNENWV
TRY1_RAT IVGGYTCPEHSVPYQVSLNSGYHFCGGSLINDQWV
NRPN_MOUSE ILEGRECIPHSQPWQAALFQGERLICGGVLVGDRW
COGS_UCAPU IVGGVEAVPNSWPHQAALFIDDMYFCGGSLISPEW
4
Sequences and Models

Many biological sequences (DNA/RNA, proteins)
have very subtle rules for their structure:
they clearly form families and are related,
yet simple measures or descriptions of these
relationships or rules rarely apply.
There is a need to create some kind of model
that can be used to identify relationships among
sequences and distinguish members of families
from non-members.
Given the complexity and variability of these
biological structures, any practical model must
have a probabilistic component; that is, it will
be a stochastic model, rather than a mechanistic
one. It will be evaluated by the (statistical)
accuracy and usefulness of its predictions,
rather than the correspondence of its internal
features to any corresponding internal mechanism
in the structures being modeled.

5
Markov Chains

A system with a set of m possible states, S_i. At
each of a sequence of discrete points in time
t > 0, the system is in exactly one of those
states. The state at time t > 0 is designated by
q_t. The movement from q_t to q_{t+1} is
probabilistic, and depends only on the states of
the system at or prior to t.
An initial state distribution p(i) = Prob(q_0 =
S_i).
The process terminates either at time T or when
reaching a designated final state S_f.

6
Markov Chains of Order N

Nth-order Markov chain (N > 0): transition
probabilities out of state q_t depend only on the
values of q_t, q_{t-1}, ..., q_{t-(N-1)}.
Typically we deal with a 1st-order Markov chain, so
only q_t itself affects the transition
probabilities.
In a 1st-order chain, for each state S_j, there is
a set of m probabilities for selecting the next
state to move to: a_{i,j} = Prob(q_{t+1} = S_i | q_t =
S_j), 1 <= i <= m, t > 0.
If there is some ordering of states such that
a_{i,j} = 0 whenever i < j (i.e., no non-trivial
loops), then this is a linear (or
left-to-right) Markov process.
Homogeneous Markov model: a_{i,j} is independent
of t.
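A minimal Python sketch of a homogeneous 1st-order chain, assuming the slide's convention a_{i,j} = Prob(q_{t+1} = S_i | q_t = S_j), so that each column of the matrix sums to 1. The three-state matrix, the initial distribution, and the function names are made up for illustration.

```python
import random

# Hypothetical 3-state chain. Following the slide's convention,
# A[i][j] = Prob(next state is S_i | current state is S_j),
# so every COLUMN sums to 1.
A = [
    [0.5, 0.0, 0.0],
    [0.3, 0.6, 0.0],
    [0.2, 0.4, 1.0],
]
p0 = [1.0, 0.0, 0.0]  # initial state distribution p(i)


def is_left_to_right(A):
    """True if A[i][j] == 0 whenever i < j (no moves back to earlier states)."""
    m = len(A)
    return all(A[i][j] == 0 for i in range(m) for j in range(m) if i < j)


def sample_chain(A, p0, T, rng):
    """Draw q_0 .. q_{T-1} as a list of state indices."""
    m = len(A)
    states = rng.choices(range(m), weights=p0)
    for _ in range(T - 1):
        j = states[-1]
        column = [A[i][j] for i in range(m)]  # distribution over next states
        states += rng.choices(range(m), weights=column)
    return states


path = sample_chain(A, p0, T=10, rng=random.Random(0))
```

Because this example matrix is left-to-right, the sampled state indices never decrease along the path.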

7
Simple Markov Models

Might use a Markov chain to model a sequence
where the symbol in position n depends on the
symbol(s) in position(s) n-1, ..., n-N.
For example, if a protein is more likely to have
Lys after a sequence Arg-Cys, this could be
encoded as (a small part of) a 2nd-order Markov
model.
If the probabilities of a given symbol are the
same for all positions in the sequence, and
independent of symbols in other positions, then
one can use the degenerate 0th-order Markov chain,
where the probability of a given symbol is
constant, regardless of the preceding symbol (or
of the position in the sequence).
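To make the difference between a 0th-order and a 1st-order model concrete, here is a rough Python sketch that estimates both from a single sequence by counting. The function names are illustrative, and a single short fragment is far too little data for a real model.

```python
from collections import Counter, defaultdict


def zeroth_order(seq):
    """Symbol frequencies, ignoring position and context."""
    counts = Counter(seq)
    return {sym: n / len(seq) for sym, n in counts.items()}


def first_order(seq):
    """Estimate Prob(next symbol | previous symbol) from adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        pair_counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for prev, c in pair_counts.items()
    }


seq = "IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWV"  # TRYP_PIG fragment from the slides
freqs = zeroth_order(seq)
trans = first_order(seq)
```

In this fragment V is followed once by G and once by S, so the estimated 1st-order distribution after V splits evenly between those two residues.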

8
Hidden Markov Models (HMMs)

In a Markov chain (or model), the states are
themselves visible: they can be considered the
outputs of the system (or deterministically
associated with those outputs).
However, if each state can emit (generate) any of
several possible outputs (symbols) v_k, from an
output alphabet O of M symbols, on a
probabilistic basis, then it is not possible (in
general) to determine the sequence of the states
themselves: they are hidden.
Classic example: the urn game
A set of N urns (states), each containing various
colored balls (output symbols; total of M colors
available), behind a curtain
Player 1 selects an urn at random (with Markov
assumptions), then picks a ball at random from
that urn and announces its color to player 2
Player 1 then repeats the above process, a total
of T times
Player 2 must determine the sequence of urns selected
based on the sequence of colors announced

9
Additional Parameters for HMMs

Now, in addition to the transition probabilities,
each state has a prescribed probability
distribution to emit or produce a symbol v_k from
O: b_{i,k} = Prob(v_k | S_i)
If q_t = S_i, then the generated output at time t
is v_k with probability b_{i,k}.
So, an HMM is doubly stochastic: both the
(hidden) state transition process and the
(visible) output symbol generation process are
probabilistic.
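The doubly stochastic process can be sketched as a simulation of the urn game. This is only an illustration: the two urns, three colors, and all probabilities below are invented, and the matrices follow the convention A[i][j] = Prob(q_{t+1} = S_i | q_t = S_j), B[i][k] = Prob(v_k | S_i).

```python
import random

# Hypothetical urn game: 2 urns (hidden states), 3 ball colors (outputs).
colors = ["red", "green", "blue"]
p0 = [0.5, 0.5]                 # p(i) = Prob(q_0 = S_i)
A = [[0.9, 0.2],                # A[i][j] = Prob(q_{t+1} = S_i | q_t = S_j)
     [0.1, 0.8]]
B = [[0.7, 0.2, 0.1],           # B[i][k] = Prob(v_k | S_i)
     [0.1, 0.3, 0.6]]


def play(T, rng):
    """Return (hidden urn indices, announced colors) for T draws."""
    urns, announced = [], []
    for t in range(T):
        if t == 0:
            weights = p0
        else:
            # Column of A for the previous urn: distribution over next urn.
            weights = [A[i][urns[-1]] for i in range(len(A))]
        urn = rng.choices(range(len(A)), weights=weights)[0]
        ball = rng.choices(colors, weights=B[urn])[0]
        urns.append(urn)
        announced.append(ball)
    return urns, announced


urns, announced = play(T=8, rng=random.Random(1))
```

Player 2 sees only `announced`; the list `urns` is the hidden state sequence.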

10
Bayesian Aspects of HMM Usage

Given an HMM M, we can relatively easily
calculate the probability of occurrence of an
arbitrary output sequence s: P(s | M)
However, we often want to determine the
underlying set of states, transition
probabilities, etc. (the model) that is most
likely to have produced the output sequence s
that we have observed: P(M | s)
Bayes' formula for sequence recognition: P(M |
s) = P(s | M) P(M) / P(s)
Very hard to find this absolute probability (it
depends on specific a priori probabilities we are
unlikely to know)
Instead, make it a discrimination problem:
define a null model N, find P(M | s) / P(N | s)
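Why the ratio is easier to work with: applying Bayes' formula to both models cancels the unknown P(s). A short derivation follows; the log form is the usual log-odds score, and the prior ratio P(M)/P(N) still has to be chosen or assumed.

```latex
\frac{P(M \mid s)}{P(N \mid s)}
  = \frac{P(s \mid M)\,P(M)/P(s)}{P(s \mid N)\,P(N)/P(s)}
  = \frac{P(s \mid M)}{P(s \mid N)} \cdot \frac{P(M)}{P(N)},
\qquad
\log\frac{P(M \mid s)}{P(N \mid s)}
  = \log\frac{P(s \mid M)}{P(s \mid N)} + \log\frac{P(M)}{P(N)}.
```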

11
Issues in Using HMMs

Model architecture/topology
Training
Selecting an appropriate training set
Finding an optimal HMM that fits that set
Must avoid overfitting
Scoring (for HMM construction, sequence
recognition)
How likely is it that our sequence was generated by our
HMM?
Versus some null model: this converts a very
difficult recognition problem into a tractable
discrimination problem
Score is the relative likelihood for our HMM
Efficiency of evaluation
Pruning the search: dynamic programming
Using log-odds scores

12
A Simple HMM for Some DNA Sequences
[Figure not preserved; labels: State (S_i), Emission Probabilities (b_{i,k})]
13
HMMs and Multiple Alignments

Can convert a multiple alignment into an HMM
Create a node for each column in which most
sequences have an aligned residue
Columns with many missing letters go to Insert
states
Emission probabilities are computed from the
relative frequencies in the alignment column (for
Match states), usually with aid of a regularizer
(to avoid zero-probability cases)
Emission probabilities for Insert states are
taken from background frequencies
Can also create a multiple alignment from a
linear HMM
Find Viterbi (most likely) path in the HMM for
each sequence
Each match state on that path creates a column in
the alignment
Letters from insert states are either ignored or
shown in lower case
Setting transition probabilities is equivalent to
setting gap penalties in sequence alignment -
more an art than a science.

14
Protein Sequences Alignment
ALYW-------GHSFCAGSL
AIFAKHRRSPGERFLCGGIL
AIYRRHRG-GSVTYVCGGSL
AIFAQNRRSSGERFLCGGIL
ALFQGE------RLICGGVL
ALFIDD------MYFCGGSL
AIYHYS------SFQCGGVL
SLNS-------GSHFCGGSL

Three states
Match states
Insertion states
Deletion states
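The rule "create a node for each column in which most sequences have an aligned residue" can be sketched on the eight aligned fragments shown on this slide. The 50% occupancy cutoff and the function name are assumptions made for illustration; the slides do not give a threshold.

```python
# The eight aligned fragments from the slide ('-' marks a gap).
alignment = [
    "ALYW-------GHSFCAGSL",
    "AIFAKHRRSPGERFLCGGIL",
    "AIYRRHRG-GSVTYVCGGSL",
    "AIFAQNRRSSGERFLCGGIL",
    "ALFQGE------RLICGGVL",
    "ALFIDD------MYFCGGSL",
    "AIYHYS------SFQCGGVL",
    "SLNS-------GSHFCGGSL",
]


def column_kinds(rows, min_occupancy=0.5):
    """Label each column 'M' (match) or 'I' (insert) by residue occupancy.

    min_occupancy is an assumed cutoff, not a value from the slides.
    """
    kinds = []
    for col in zip(*rows):
        occupied = sum(ch != "-" for ch in col) / len(col)
        kinds.append("M" if occupied > min_occupancy else "I")
    return "".join(kinds)


kinds = column_kinds(alignment)
```

With this cutoff, the first six and last nine columns come out as match columns, and the five sparsely populated columns in between are assigned to insert states.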

15
Topology of Profile HMM
Match states
Deletion states
Insertion states
M_i -> M_{i+1}, M_i -> I_i, M_i -> D_{i+1}
D_i -> M_{i+1}, D_i -> I_i, D_i -> D_{i+1}
I_i -> M_{i+1}, I_i -> I_i, I_i -> D_{i+1}
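The nine transition types above can be enumerated for a profile of a given length. This is a sketch only: the state naming ('M1', 'I1', 'D1') is an illustrative choice, and Begin/End states are left out because the slide does not list them.

```python
def profile_transitions(length):
    """List the allowed transitions between M, I, D states of a profile HMM.

    States are named like 'M1', 'I1', 'D1'. Begin/End states and any
    insert state before M1 are left out of this sketch.
    """
    edges = []
    for i in range(1, length + 1):
        for kind in "MDI":
            src = f"{kind}{i}"
            edges.append((src, f"I{i}"))          # stay at position i, insert
            if i < length:
                edges.append((src, f"M{i + 1}"))  # advance to next match
                edges.append((src, f"D{i + 1}"))  # advance by deleting
    return edges


edges = profile_transitions(3)
```

Every path through such a topology moves left to right through positions, with the self-loop I_i -> I_i as the only way to emit extra symbols without advancing.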
16
Regularizers

For avoiding overfitting to training set
Substitution matrices
Identify more likely amino acid substitutions,
reflecting biochemical similarities/differences
Fixed for all positions in a sequence: one value
for a given pair of amino acids
Pseudocounts
For protein sequences, typically based on
observed (relative) frequencies of various amino
acids
Universal frequencies or position/type
dependent values
Dirichlet mixtures
Probabilistic combinations of Dirichlet densities
Densities over probability distributions, i.e.,
the probability density of various distributions
of symbols (in a given sequence position)
Used to generate data-dependent pseudocounts
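As a small illustration of the pseudocount idea, the sketch below estimates emission probabilities for one alignment column by adding background-weighted pseudocounts to the observed counts. The DNA alphabet, the uniform background, the weight of 1.0, and the example column are assumptions chosen for brevity, not values from the slides.

```python
from collections import Counter


def emission_probs(column, alphabet, background, weight=1.0):
    """Estimate match-state emission probabilities for one alignment column.

    Adds weight * background[a] pseudocounts to each observed count.
    `weight` and the uniform background used below are illustrative
    assumptions, not values from the slides.
    """
    counts = Counter(ch for ch in column if ch != "-")
    total = sum(counts.values()) + weight
    return {a: (counts[a] + weight * background[a]) / total for a in alphabet}


alphabet = "ACGT"
background = {a: 0.25 for a in alphabet}
probs = emission_probs("AAAC-A", alphabet, background)
```

Symbols never seen in the column (G and T here) still receive a small non-zero probability, which is the zero-probability case the regularizer is meant to avoid.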

17
Algorithms for HMM Tasks (1)

3 Major Problems
Determine HMM parameters (given some HMM topology
and a training set)
Calculate (relative) likelihood of a given output
sequence through a given HMM
Find the optimal (most likely) path through a
given HMM for a specific output sequence, and its
(relative) likelihood
Forward/Backward Algorithm
Used for determining the parameters of an HMM
from training set data
Calculates the probability of going forward to a
given state (from the initial state), and of
generating the rest of the output sequence (a member
of the training set) from that state
Iteratively adjusts the model parameters
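The second problem above, the likelihood of an output sequence given an HMM, can be sketched with the forward recursion. This is a minimal version under stated assumptions: no scaling or log-space arithmetic (so it would underflow on long sequences), the convention A[i][j] = Prob(q_{t+1} = S_i | q_t = S_j), and an invented two-state model.

```python
def forward(obs, p0, A, B):
    """Return P(obs | model) by summing over all hidden state paths.

    p0[i]   = Prob(q_0 = S_i)
    A[i][j] = Prob(q_{t+1} = S_i | q_t = S_j)   (slide convention)
    B[i][k] = Prob(v_k | S_i); obs is a list of symbol indices k.
    """
    m = len(p0)
    # alpha[i] = P(symbols so far, current state = S_i)
    alpha = [p0[i] * B[i][obs[0]] for i in range(m)]
    for k in obs[1:]:
        alpha = [
            B[i][k] * sum(A[i][j] * alpha[j] for j in range(m))
            for i in range(m)
        ]
    return sum(alpha)


# Hypothetical 2-state, 2-symbol model for illustration.
p0 = [0.6, 0.4]
A = [[0.7, 0.4],
     [0.3, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]

likelihood = forward([0, 1, 0], p0, A, B)
```

The work grows linearly with the sequence length and quadratically with the number of states, instead of enumerating every possible state path.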

18
Algorithms for HMM Tasks (2)

Baum-Welch (Expectation-Maximization, EM) Algorithm
Often used to determine the HMM parameters
Can also determine most likely path for a (set
of) output sequence(s)
Add up probabilities over all possible paths
Then re-update parameters and iterate
Cannot guarantee a global optimum; very expensive
Forward Algorithm
Calculates probability of a particular output
sequence given the HMM
Straightforward summation of product of (partial
path) probabilities
Viterbi Algorithm
Classical dynamic programming algorithm
Choose best path (at each point), based on
log-odds scores
Save results of subproblems and re-use them
as part of higher-level evaluations
More efficient than Baum-Welch
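A compact sketch of the Viterbi recursion with traceback, under stated assumptions: it uses plain log probabilities rather than log-odds scores (a log-odds variant would subtract a null-model term per symbol), the convention A[i][j] = Prob(q_{t+1} = S_i | q_t = S_j), and an invented two-state model.

```python
import math


def viterbi(obs, p0, A, B):
    """Return (best state path, its log probability) for symbol indices obs.

    A[i][j] = Prob(q_{t+1} = S_i | q_t = S_j), B[i][k] = Prob(v_k | S_i).
    Plain log probabilities are used here; a log-odds variant would
    subtract a null-model score per symbol.
    """
    def log(x):
        return math.log(x) if x > 0 else -math.inf

    m = len(p0)
    score = [log(p0[i]) + log(B[i][obs[0]]) for i in range(m)]
    back = []  # back[t][i] = best predecessor of state i at step t + 1
    for k in obs[1:]:
        prev, score, ptr = score, [], []
        for i in range(m):
            j = max(range(m), key=lambda j: prev[j] + log(A[i][j]))
            ptr.append(j)
            score.append(prev[j] + log(A[i][j]) + log(B[i][k]))
        back.append(ptr)
    # Traceback from the best final state.
    last = max(range(m), key=lambda i: score[i])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1], score[last]


# Hypothetical 2-state, 2-symbol model for illustration.
p0 = [0.6, 0.4]
A = [[0.7, 0.4],
     [0.3, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]

path, logp = viterbi([0, 0, 1, 1], p0, A, B)
```

The only change from a sum-over-paths recursion is that the sum over predecessor states is replaced by a maximum, plus the stored back-pointers needed to recover the path.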

19
HMMs for Protein/Gene Sequence Analysis

Using any of various means, identify a set of
related sequences with conserved regions
Make 1st-order Markov assumptions: transitions
independent of sequence history and sequence
content (other than at the substitution site
itself)
Construct an HMM based on the set of sequences
Use this HMM to search for additional members of
this family, possibly performing alignments
Search by comparing fit to HMM against fit to
some null model
For phylogenetic trees, also concerned with the
length of the paths involved and with shared
intermediate states (sequences)

20
Very Simple Viterbi
DNA sequence alignment with a value of 1 for a
match, -1 for a mismatch, -2 for a gap; remember the
best score at each step
min -1099
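The scoring scheme above (1 for a match, -1 for a mismatch, -2 for a gap) can be sketched as a small dynamic-programming alignment with traceback. This is an illustrative global-alignment version, not necessarily the one shown on the original slides; the example sequences and the tie-breaking order in the traceback are arbitrary choices.

```python
def align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with traceback.

    Returns (score, aligned_a, aligned_b). Scores follow the slide;
    the tie-breaking order in the traceback is an arbitrary choice.
    """
    n, m = len(a), len(b)
    # best[i][j] = best score aligning a[:i] with b[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best[i][j] = max(
                best[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                best[i - 1][j] + gap,      # gap in b
                best[i][j - 1] + gap,      # gap in a
            )
    # Traceback: walk from the corner back to the origin.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and best[i][j] == best[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = align("GATTACA", "GATACA")
```

For these two example sequences the best alignment has six matches and one gap, giving a score of 6 - 2 = 4.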
21
Very Simple Viterbi Traceback
22
Onward to Mathematica!
