Algorithms for variable length Markov chain modeling - PowerPoint PPT Presentation

1
Algorithms for variable length Markov chain
modeling
  • Author: Gill Bejerano
  • Presented by: Xiangbin Qiu

2
Review of Markov Chain Model
  • Often used in bioinformatics to capture
    relatively simple sequence patterns, such as
    genomic CpG islands.

3
Problem
  • Low-order Markov chains are poor classifiers.
  • Higher-order chains are often impractical to
    implement or train.
  • The memory and training-set size requirements
    of an order-k Markov chain grow exponentially
    with k.
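The exponential growth is easy to see: an order-k chain must store a next-symbol distribution for each of the |Σ|^k possible contexts. A quick sketch for the 20-letter amino-acid alphabet (the constant name is mine, for illustration):

```python
# Contexts an order-k Markov chain must parameterize: |alphabet| ** k.
# For the 20-letter amino-acid alphabet the table grows very fast.
ALPHABET_SIZE = 20

for k in range(1, 6):
    print(f"order {k}: {ALPHABET_SIZE ** k:>12,} contexts")
```

Already at order 5 there are 3,200,000 contexts, each needing its own reliably estimated distribution.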

4
Variable length Markov Model (VMM)
  • The model is not restricted to a predefined
    uniform depth (e.g., order k).
  • The model is constructed to fit higher-order
    Markov dependencies where such contexts exist,
    while using lower-order dependencies elsewhere.
  • The order at each context is determined by
    examining the training data.

5
Description of the Author's Work
  • Four main modules are implemented:
  • train
  • predict
  • emit
  • 2pfa

6
Probabilistic Suffix Tree (PST)
  • A tree data structure whose nodes are
    variable-length contexts over the alphabet,
    each storing a next-symbol probability
    distribution

7
PST - Definitions
  • Σ — the alphabet; the sample is a set of m
    training strings over Σ
  • Empirical probability P̃(s) — the relative
    frequency of the subsequence s among all
    length-|s| windows in the sample
  • Conditional empirical probability P̃(σ|s) —
    the probability of observing the symbol σ
    immediately after an occurrence of the
    context s
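As a concrete reading of these definitions, here is a minimal sketch (function names are mine, not from the paper) that computes the empirical and conditional empirical probabilities directly from a set of training strings:

```python
def count_occ(r: str, s: str) -> int:
    """Number of (possibly overlapping) occurrences of s in r."""
    return sum(1 for i in range(len(r) - len(s) + 1) if r[i:i + len(s)] == s)

def empirical(strings, s):
    """P~(s): frequency of s among all length-|s| windows in the sample."""
    occ = sum(count_occ(r, s) for r in strings)
    windows = sum(max(len(r) - len(s) + 1, 0) for r in strings)
    return occ / windows if windows else 0.0

def cond_empirical(strings, s, sigma):
    """P~(sigma|s): fraction of occurrences of the context s that are
    immediately followed by the symbol sigma."""
    num = sum(count_occ(r, s + sigma) for r in strings)
    # occurrences of s that have at least one following symbol
    den = sum(count_occ(r, s) - (1 if r.endswith(s) else 0) for r in strings)
    return num / den if den else 0.0
```

For example, in the sample `["abab", "abb"]` the context `"ab"` occurs three times with a following symbol, and is followed by `"b"` in one of them plus `"a"` in one, so `cond_empirical(..., "ab", "b")` is 0.5.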

8
Parameters
  • Minimum probability Pmin
  • Smoothing factors
  • Memory length L
  • Difference measure parameter r

9
Building the PST
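The construction can be sketched as follows. This is a simplified version of the suffix-tree-growing procedure the paper builds on, not the author's exact code: candidate contexts are grown up to length L, and a context is kept if it is frequent enough (empirical probability at least Pmin) and predicts at least one symbol r-fold differently from its parent (its longest proper suffix). Parameter names follow slide 8; the helpers repeat the empirical-probability definitions.

```python
def count_occ(r, s):
    """Overlapping occurrences of s in r."""
    return sum(1 for i in range(len(r) - len(s) + 1) if r[i:i + len(s)] == s)

def empirical(strings, s):
    occ = sum(count_occ(r, s) for r in strings)
    windows = sum(max(len(r) - len(s) + 1, 0) for r in strings)
    return occ / windows if windows else 0.0

def cond_empirical(strings, s, sigma):
    num = sum(count_occ(r, s + sigma) for r in strings)
    den = sum(count_occ(r, s) - (1 if r.endswith(s) else 0) for r in strings)
    return num / den if den else 0.0

def build_pst(strings, alphabet, L=3, p_min=0.01, r=1.05):
    """Keep a context if it is frequent enough AND predicts some symbol
    r-fold differently from its parent context (its suffix)."""
    tree = {""}                      # the root (empty context) is always kept
    candidates = [a for a in alphabet if empirical(strings, a) >= p_min]
    while candidates:
        s = candidates.pop()
        parent = s[1:]               # suf(s): drop the oldest symbol
        for a in alphabet:
            p_s = cond_empirical(strings, s, a)
            p_par = cond_empirical(strings, parent, a)
            if p_par > 0 and (p_s / p_par >= r or p_s / p_par <= 1 / r):
                tree.add(s)
                break
        if len(s) < L:               # extend by prepending an older symbol
            candidates += [a + s for a in alphabet
                           if empirical(strings, a + s) >= p_min]
    return tree
```

On the toy sample `"ababababab"` this keeps the contexts `"a"` and `"b"` (after an `a`, a `b` is far more likely than the unconditional 0.5) but discards `"ba"`, whose prediction matches that of its parent `"a"`.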
10
Biologically Extended PST- a Variant of PST Model
11
Incremental Model Refinement
  • Pmin ↓ (lower the minimum probability)
  • L ↑ (increase the memory length)
  • r → 1 (tighten the difference measure)

12
Prediction using a PST
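Prediction walks the query sequence left to right: at each position, the longest suffix of the history that exists as a tree node supplies the next-symbol distribution, and the per-symbol log-probabilities are summed. A minimal sketch (the function name and toy tree are illustrative, not from the paper):

```python
import math

def pst_log_likelihood(tree, seq, max_depth=10):
    """tree maps each kept context to its next-symbol distribution;
    the empty context '' must be present as the final fallback."""
    ll = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - max_depth):i]
        while ctx not in tree:       # back off to shorter suffixes
            ctx = ctx[1:]
        ll += math.log(tree[ctx][sym])
    return ll

# Toy PST: after seeing 'a', a 'b' is three times as likely as an 'a'.
toy = {
    "":  {"a": 0.5,  "b": 0.5},
    "a": {"a": 0.25, "b": 0.75},
}
```

With this toy tree, the sequence `"ab"` scores log(0.5) + log(0.75): the first symbol uses the empty context, the second uses the node `"a"`. Classification then compares such scores (typically length-normalized) across family models.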
13
Results and Discussion
  • When averaged over all 170 families, the PST
    detected 90.7% of the true positives.
  • Much better than a typical BLAST search, and
    comparable to an HMM trained from a multiple
    alignment of the input sequences in a global
    search mode.

14
Results and Discussion (Cont.)
15
Results and Discussion (Cont.)
16
Limitations
17
Why Significant?
  • Performance is comparable to HMM models, yet
    the PST is
  • Built in a fully automated manner
  • Without multiple alignment
  • Without scoring matrices
  • Less demanding than HMMs in terms of data
    abundance and quality

18
Future Work
  • An additional improvement is expected if a
    larger sample set is used to train the PST;
    currently the PST is built from the training
    set alone.
  • Training the PST on all strings of a family
    should improve its predictions as well.

19
Confused?