MARKOV MODELS - PowerPoint PPT Presentation

1 / 22
About This Presentation
Title:

MARKOV MODELS

Description:

Typically deal with 1st-order Markov chain, so only qt itself affects the transition probabilities. In a 1st-order chain, for each state Sj, ... – PowerPoint PPT presentation

Number of Views:279
Avg rating:3.0/5.0
Slides: 23
Provided by: ValuedGate1823
Category:
Tags: markov | models | chain | markov

less

Transcript and Presenter's Notes

Title: MARKOV MODELS


1
MARKOV MODELS
HIDDEN
  • Presentation
  • by
  • Jeff Rosenberg, Toru Sakamoto, Freeman Chen

2
The Plan
  • Modeling Biological Sequences
  • Markov Chains
  • Hidden Markov Models
  • Issues
  • Examples
  • Techniques and Algorithms
  • Doing it with Mathematica

3
Biological Sequences
VVGGLVALRGAHPYIAALYWGHSFCAGSLIAPC
FA12_HUMAN
TRYP_PIG IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWV
TRY1_BOVIN IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWV
URT1_DESRO STGGLFTDITSHPWQAAIFAQNRRSSGERFLCGG
TRY1_SALSA IVGGYECKAYSQTHQVSLNSGYHFCGGSLVNENWV
TRY1_RAT IVGGYTCPEHSVPYQVSLNSGYHFCGGSLINDQWV
NRPN_MOUSE ILEGRECIPHSQPWQAALFQGERLICGGVLVGDRW
COGS_UCAPU IVGGVEAVPNSWPHQAALFIDDMYFCGGSLISPEW
4
Sequences and Models
  • Many biological sequences (DNA/RNA, proteins)
    have very subtle rules for their structure
    they clearly form families and are related,
    yet simple measures or descriptions of these
    relationships or rules rarely apply
  • There is a need to create some kind of model
    that can be used to identify relationships among
    sequences and distinguish members of families
    from non-members
  • Given the complexity and variability of these
    biological structures, any practical model must
    have a probabilistic component that is, it will
    be a stochastic model, rather than a mechanistic
    one. It will be evaluated by the (statistical)
    accuracy and usefulness of its predictions,
    rather than the correspondence of its internal
    features to any corresponding internal mechanism
    in the structures being modeled.

5
Markov Chains
  • A system with a set of m possible states, Si at
    each of a sequence of discrete points in time
    tgt0, the system is in exactly one of those
    states the state at time t gt 0 is designated by
    qt the movement from qt to qt1 is
    probabilistic, and depends only on the states of
    the system at or prior to t.
  • An initial state distribution p(i) Prob(q0
    Si)
  • Process terminates either at time T or when
    reaching a designated final state Sf

6
Markov Chains of Order N
  • Nth-order Markov chain (N gt 0) transition
    probabilities out of state qt depend only on the
    values of qt, qt-1, qt-(N-1).
  • Typically deal with 1st-order Markov chain, so
    only qt itself affects the transition
    probabilities.
  • In a 1st-order chain, for each state Sj, there is
    set of m probabilities for selecting the next
    state to move to ai,j Prob(qt1 Si qt
    Sj) 1 lt i lt m, t gt 0
  • If there is some ordering of states such that
    ai,j 0 whenever i lt j (i.e., no non-trivial
    loops), then this is a linear (or
    left-to-right) Markov process
  • Homogeneous Markov model ai,j is independent
    of t

7
Simple Markov Models
  • Might use a Markov chain to model a sequence
    where the symbol in position n depends on the
    symbol(s) in position(s) n-1,n-N.
  • For example, if a protein is more likely to have
    Lys after a sequence Arg-Cys, this could be
    encoded as (a small part of) a 2nd-order Markov
    model.
  • If the probabilities of a given symbol are the
    same for all positions in the sequence, and
    independent of symbols in other positions, then
    can use the degenerate 0th-order Markov chain,
    where the probability of a given symbol is
    constant, regardless of the preceding symbol (or
    of the position in the sequence).

8
Hidden Markov Models (HMMs)
  • In a Markov chain (or model), the states are
    themselves visible they can be considered the
    outputs of the system (or deterministically
    associated with those outputs).
  • However, if each state can emit (generate) any of
    several possible outputs (symbols) vk, from an
    output alphabet O of M symbols, on a
    probabilistic basis, then it is not possible (in
    general) to determine the sequence of the states
    themselves they are hidden.
  • Classic example the urn game
  • A set of N urns (states), each containing various
    colored balls (output symbols total of M colors
    available), behind a curtain
  • Player 1 selects an urn at random (with Markov
    assumptions), then picks a ball at random from
    that urn and announces its color to player 2
  • Player 1 then repeats the above process, a total
    of T times
  • Player 2 must determine sequence of urns selected
    based on the sequence of colors announced

9
Additional Parameters for HMMs
  • Now, in addition to the transition probabilities,
    each state has a prescribed probability
    distribution to emit or produce a symbol vk from
    O bi,k Prob(vk Si)
  • If qt Si, then the generated output at time t
    is vk with probability bi,k.
  • So, a HMM is doubly stochastic both the
    (hidden) state transition process and the
    (visible) output symbol generation process are
    probabilistic.

10
Bayesian Aspects of HMM Usage
  • Given an HMM M, we can relatively easily
    calculate the probability of occurrence of an
    arbitrary output sequence, s P(s M)
  • However, we often want to determine the
    underlying set of states, transition
    probabilities, etc. (the model) that is most
    likely to have produced the output sequence s
    that we have observed P(M s)
  • Bayes Formula for sequence recognition P(M
    s) P(s M) P(M) / P(s)
  • Very hard to find this absolute probability
    depends on specific a priori probabilities we are
    unlikely to know)
  • Instead, make it a discrimination problem
    define a null model N, find P(M s) / P(N s)

11
Issues in Using HMMs
  • Model architecture/topology
  • Training
  • Selecting an appropriate training set
  • Finding an optimal HMM that fits that set
  • Must avoid overfitting
  • Scoring (for HMM construction, sequence
    recognition)
  • How likely that our sequence was generated by our
    HMM?
  • Versus some null model this converts a very
    difficult recognition problem into a tractable
    discrimination problem
  • Score is the relative likelihood for our HMM
  • Efficiency of evaluation
  • Pruning the search Dynamic programming
  • Using log-odds scores

12
A Simple HMM for Some DNA Sequences
State(Si)
Emission Probabilities (bi,k)
13
HMMs and Multiple Alignments
  • Can convert a multiple alignment into an HMM
  • Create a node for each column in which most
    sequences have an aligned residue
  • Columns with many missing letters go to Insert
    states
  • Emission probabilities are computed from the
    relative frequencies in the alignment column (for
    Match states), usually with aid of a regularizer
    (to avoid zero-probability cases)
  • Emission probabilities for Insert states are
    taken from background frequencies
  • Can also create a multiple alignment from a
    linear HMM
  • Find Viterbi (most likely) path in the HMM for
    each sequence
  • Each match state on that path creates a column in
    the alignment
  • Ignore or show in lower case letters from insert
    states
  • Setting transition probabilities is equivalent to
    setting gap penalties in sequence alignment -
    more an art than a science.

14
Protein Sequences Alignment
ALYW-------GHSFCAGSL AIFAKHRRSPGERFLCGGIL AIYRRHRG
-GSVTYVCGGSL AIFAQNRRSSGERFLCGGIL ALFQGE------RLIC
GGVL ALFIDD------MYFCGGSL AIYHYS------SFQCGGVL SLN
S-------GSHFCGGSL
  • Three states
  • Match states
  • Insertion states
  • Deletion states

15
Topology of Profile HMM
Match states
Deletion states
Insertion states
Mi -gt Mi1 Mi -gt Ii Mi -gt Di1
Di -gt Mi1 Di -gt Ii Di -gt Di1
Ii -gt Mi1 Ii -gt Ii Ii -gt Di1
16
Regularizers
  • For avoiding overfitting to training set
  • Substitution matrices
  • Identify more likely amino acid substitutions,
    reflecting biochemical similarities/differences
  • Fixed for all positions in a sequence one value
    for a given pair of amino acides
  • Pseudocounts
  • For protein sequences, typically based on
    observed (relative) frequencies of various amino
    acids
  • Universal frequencies or position/type
    dependent values
  • Dirichlet mixtures
  • Probabilistic combinations of Dirichlet densities
  • Densities over probability distributions i.e.,
    the probability density of various distributions
    of symbols (in a given sequence position)
  • Used to generate data-dependent pseudocounts

17
Algorithms for HMM Tasks (1)
  • 3 Major Problems
  • Determine HMM parameters (given some HMM topology
    and a training set)
  • Calculate (relative) likelihood of a given output
    sequence through a given HMM
  • Find the optimal (most likely) path through a
    given HMM for a specific output sequence, and its
    (relative) likelihood
  • Forward/Backward Algorithm
  • Used for determining the parameters of an HMM
    from training set data
  • Calculates probability of going forward to a
    given state (from initial state), and of
    generating final model state (member of training
    set) from that state
  • Iteratively adjusts the model parameters

18
Algorithms for HMM Tasks (2)
  • Baum-Welch (Expectation-maximization,
    EM)Algorithm
  • Often used to determine the HMM parameters
  • Can also determine most likely path for a (set
    of) output sequence(s)
  • Add up probabilities over all possible paths
  • Then re-update parameters and iterate
  • Cannot guarantee global optimum very expensive
  • Forward Algorithm
  • Calculates probability of a particular output
    sequence given the HMM
  • Straightforward summation of product of (partial
    path) probabilities
  • Viterbi Algorithm
  • Classical dynamic programming algorithm
  • Choose best path (at each point), based on
    log-odds scores
  • Save results of subsubproblems and re-use them
    as part of higher-level evaluations
  • More efficient than Baum-Welch

19
HMMs for Protein/Gene Sequence Analysis
  • Using any of various means, identify a set of
    related sequences with conserved regions
  • Make 1st-order Markov assumptions transitions
    independent of sequence history and sequence
    content (other than at the substitution site
    itself)
  • Construct a HMM based on the set of sequences
  • Use this HMM to search for additional members of
    this family, possibly performing alignments
  • Search by comparing fit to HMM against fit to
    some null model
  • For phylogenetic trees, also concerned with the
    length of the paths involved and with shared
    intermediate states (sequences)

20
Very Simple Viterbi
DNA Sequence Alignment with value of 1 for
match, -1 for mismatch, -2 for gap remember
best at each step
min -1099
21
Very Simple Viterbi Traceback
22
Onward to Mathematica!
Write a Comment
User Comments (0)
About PowerShow.com