1
Generalizations of Markov model to characterize
biological sequences
Authors: Junwen Wang and Sridhar Hannenhalli
  • CISC841 Bioinformatics
  • Presented by: Nikhil Shirude
  • November 20, 2007

2
Outline
  • Motivation
  • Model Implementation
  • - Training
  • - Testing
  • Results
  • Challenges
  • Conclusion

3
Motivation
  • Markov Model: a statistical technique for modeling
    sequences such that the probability of a sequence
    element depends on a limited context preceding
    the element
  • Current kth-order Markov Model: generates a
    single base (model unit size = 1) according to a
    probability distribution depending on the k bases
    immediately preceding the generated base (gap = 0)
  • Used in DNA sequence recognition problems such as
    promoter and gene prediction

4
Motivation contd
  • Longer-range dependencies and joint dependency of
    neighboring bases have been observed in protein
    and DNA sequences
  • The CG di-nucleotide characterizes CpG islands
  • So a model with a unit size of 2 is appropriate
    to characterize this joint dependency
  • Longer-range dependencies (gap > 0) are useful to
    model the periodicity of the helix pattern

5
Model Implementation
  • Generalized Markov Model (GMM): a configurable
    tool that allows for these generalizations
  • Posterior bases - bases whose probability is to
    be computed
  • Prior bases - bases upon which that
    probability is conditioned
  • 6 parameters specify the Markov Model
  • Other parameters include the type of biological
    sequence, the threshold on the minimum count of a
    prior for k-mer elimination, and the pseudo-count
    for k-mers absent from the training set

6
Model Implementation contd
[Figure: schematic of one window, showing the prior units U1, U2, ..., UO
(each of length L1, separated by g1 bases), a gap of G bases, and the
posterior bases X1, X2, ..., XL2 (separated by g2 bases)]

Parameters:
L1 - model unit size in the prior
O  - order, i.e., the number of prior units
g1 - spacing between prior units
L2 - model unit size in the posterior
g2 - spacing between posterior bases
G  - gap between prior and posterior
7
Model Implementation contd
  • Examples

A gap of length 2 within the posterior of an amino
acid model captures the joint dependency of the
first and fourth amino acid residues, which are
likely to form a hydrogen bond that is vital for
the protein helix structure.
For a model where each tri-nucleotide depends on
the previous 4 bases, the configurable parameters
can be set as L1 = 4, O = 1, L2 = 3, g1 = g2 = G = 0.
To use the 4 bases after ignoring the 3 immediately
preceding bases, set G = 3 instead.
8
Training
  • K-mer: a specific nucleic or amino acid
    sequence that can be used to identify certain
    regions within bio-molecules such as DNA or proteins
  • For statistical robustness, consider only k-mers above
    a certain frequency threshold in positive sequences
  • For the current model, the default frequency
    threshold for positive sequences is set at 300
  • For nucleosome sequences, the default frequency
    threshold is set at 50 due to the smaller size of
    the data set

9
Training contd
  • Slide a window one base at a time along the
    training sequence
  • The window size is given by the user-defined parameters
  • For each window, extract the words corresponding
    to the prior and the posterior

Window size = L1·O + g1·(O-1) + G + L2 + g2·(L2-1)
With user-defined parameters L1 = 1, O = 6, L2 = 2,
g1 = 0, G = 1, g2 = 1, the window size is 10. For the
sequence ACTGATGCAG, the prior is ACTGAT and the
di-nucleotide CG (positions 8 and 10) is the posterior
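A minimal sketch (hypothetical code, 0-based indexing) of the window-size formula and the extraction step, reproducing the example above:

```python
def window_size(L1, O, g1, G, L2, g2):
    """Window size = L1*O + g1*(O-1) + G + L2 + g2*(L2-1)."""
    return L1 * O + g1 * (O - 1) + G + L2 + g2 * (L2 - 1)

def extract(window, L1, O, g1, G, L2, g2):
    """Split one window into its prior and posterior strings."""
    prior, pos = "", 0
    for _ in range(O):                     # O prior units of L1 bases
        prior += window[pos:pos + L1]
        pos += L1 + g1
    pos += G - g1                          # last unit has no trailing spacing
    posterior = "".join(window[pos + i * (g2 + 1)] for i in range(L2))
    return prior, posterior

print(window_size(1, 6, 0, 1, 2, 1))            # 10
print(extract("ACTGATGCAG", 1, 6, 0, 1, 2, 1))  # ('ACTGAT', 'CG')
```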
10
Training (contd)
  • Increment the k-mer counts:
  • ACTGATCG (6th order), CTGATCG (5th
    order), ...,
  • and so on down to CG (0th order)
  • Thus, 7 sub-models are present, one for each
    order
  • After processing the training sequences,
    calculate the transition probabilities from the
    k-mer counts
  • - for the 0th order, the probability is the
    composition of the L2-mers
  • - for a higher order, compute the sum of the
    frequencies of all the k-mers of that form
  • (e.g., for the 4th-order TGATCG, compute the sum of
    the frequencies of all hexamers of the form TGATNN,
    i.e., beginning with TGAT)
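The counting step above can be sketched as follows (a hypothetical illustration, assuming single-base prior units so that the order equals the number of prior bases used):

```python
from collections import Counter

def update_counts(counts, prior, posterior):
    """For one window, increment one count per sub-model: the full prior
    down to the empty prior (0th order), concatenated with the posterior."""
    for order in range(len(prior), -1, -1):
        counts[order][prior[len(prior) - order:] + posterior] += 1

# One sub-model (k-mer table) per order, 0 through 6:
counts = {k: Counter() for k in range(7)}
update_counts(counts, "ACTGAT", "CG")
print(counts[6]["ACTGATCG"], counts[5]["CTGATCG"], counts[0]["CG"])  # 1 1 1
```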

11
Training contd
  • - if (sum > threshold),
  • - calculate the probability by dividing the count of
    that sequence form by the sum
  • - else the program automatically falls back to the
    (k-1)-mer
  • Finally, convert the probability for each k-mer
    into a log-odds score
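The threshold fallback and log-odds conversion might look like the sketch below (hypothetical code; the slides do not specify the background model, so a separately trained negative model is assumed, and the pseudo-count for absent k-mers is omitted for brevity):

```python
import math
from collections import Counter

def transition_prob(counts, prior, posterior, threshold=300):
    """P(posterior | prior); counts[k] maps (k prior bases + posterior)
    strings to frequencies. Backs off to a shorter prior whenever the
    prior context was seen at most `threshold` times."""
    while prior:
        k = len(prior)
        total = sum(n for kmer, n in counts[k].items() if kmer.startswith(prior))
        if total > threshold:
            return counts[k][prior + posterior] / total
        prior = prior[1:]              # fall back to the (k-1)-mer context
    total = sum(counts[0].values())    # 0th order: plain L2-mer composition
    return counts[0][posterior] / total

def log_odds(p_pos, p_neg):
    """Per-k-mer score: log ratio of positive- vs negative-model probability."""
    return math.log2(p_pos / p_neg)

demo = {0: Counter({"CG": 3, "AT": 1}), 1: Counter()}
# The context "A" is unseen, so we back off to the 0th-order composition:
print(transition_prob(demo, "A", "CG"))  # 0.75
```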

12
Testing
  • The program reads the model, i.e., the k-mer
    log-odds scores
  • Scoring proceeds in the same sliding-window fashion
  • - to score a window, consider the
    highest order first
  • - if the string exists in the table, use its
    score
  • - else look for the string corresponding to a
    lower order
  • The sequence score is obtained by adding all the
    window scores

To score ACTGATGCAG, first look for the 6th-order
dependence, i.e., ACTGATCG, in the 8-mer table;
if absent, look for the 5th order and so on down to
the 0th order
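The backoff scoring loop can be sketched like this (hypothetical code for the contiguous 6th-order single-nucleotide configuration, G = g1 = g2 = 0, posterior of length 2):

```python
def score_sequence(seq, tables, width=8, post=2):
    """Score a sequence as the sum of its window scores. tables[k] maps a
    (k + post)-mer to its log-odds score; each window uses the highest
    order whose string is present, backing off to lower orders otherwise.
    Windows with no match at any order contribute 0 in this sketch."""
    total = 0.0
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        for k in range(width - post, -1, -1):  # 6th order down to 0th
            key = window[width - post - k:]    # k prior bases + posterior
            if key in tables[k]:
                total += tables[k][key]
                break
    return total

tables = {k: {} for k in range(7)}
tables[6]["ACTGATCG"] = 1.5   # hypothetical log-odds scores
tables[0]["AG"] = -0.5
print(score_sequence("ACTGATCGAG", tables))  # 1.0
```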
13
Results
  • Tested on
  • - Human Promoter Sequences
  • - CpG poor promoters
  • - All promoters
  • - Human Exon Dataset
  • - Nucleosome positioning sequences

14
Model Evaluation
  • 10-fold cross-validation to train and test the
    models
  • Sequences were partitioned into 10 equal parts
  • Each part was tested after training on the 9
    other parts
  • Once the models were trained, a score was calculated
    on the training set using the models
  • A cutoff was obtained from the
    Specificity-Sensitivity curve
  • Choose the score cutoff that yields the best
    Correlation Coefficient (CC) on the training set
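The partitioning step can be sketched as below (a hypothetical helper; a real run would shuffle the sequences before splitting):

```python
def ten_fold_splits(seqs):
    """Partition seqs into 10 near-equal parts; yield each part as the
    test set paired with the remaining 9 parts as the training set."""
    parts = [seqs[i::10] for i in range(10)]
    for i in range(10):
        train = [s for j, part in enumerate(parts) if j != i for s in part]
        yield train, parts[i]

splits = list(ten_fold_splits(list(range(20))))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 18 2
```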

15
Model Evaluation contd
  • Score the independent test set and apply this
    cutoff to obtain the CC values
  • Calculate the mean and standard deviation over
    the 10 CC values

Sensitivity (Sn) = TP / (TP + FN)
Specificity (Sp) = TP / (TP + FP)
CC = (TP·TN - FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
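As a sanity check, the three statistics above can be computed directly from the confusion-matrix counts:

```python
from math import sqrt

def metrics(TP, TN, FP, FN):
    """Sensitivity, specificity (as defined above), and correlation
    coefficient, from the four confusion-matrix counts."""
    sn = TP / (TP + FN)
    sp = TP / (TP + FP)
    cc = (TP * TN - FP * FN) / sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return sn, sp, cc

print(metrics(40, 40, 10, 10))  # (0.8, 0.8, 0.6)
```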
16
Model Evaluation contd
  • Total number of prior bases = 6 for all 3 models
  • Classification accuracy for the three sequence
    classes was tested using the 3 configurations below

6th-order single-nucleotide model: L1 = L2 = 1,
O = 6, g1 = 0, G = 0, g2 = 0
3rd-order di-nucleotide model: L1 = L2 = 2, O = 3,
g1 = 0, G = 0, g2 = 0
2nd-order tri-nucleotide model: L1 = L2 = 3,
O = 2, g1 = 0, G = 0, g2 = 0
17
Model Evaluation contd
  • Classification of CpG poor promoters

Sample (size)                Single nucleotide  Di-nucleotide  Tri-nucleotide
CpG-poor promoters (1,466)   0.24 ± 0.05        0.28 ± 0.03    0.34 ± 0.04
18
Model Evaluation contd
  • Classification of all promoters
  • Classification of Exons

Sample (size)             Single nucleotide  Di-nucleotide  Tri-nucleotide
All promoters (12,333)    0.54 ± 0.02        0.54 ± 0.03    0.56 ± 0.02

Sample (size)             Single nucleotide  Di-nucleotide  Tri-nucleotide
All exons (219,624)       0.63 ± 0.00        0.64 ± 0.00    0.67 ± 0.00
19
Model Evaluation contd
  • Classification of nucleosome positioning
    sequences (112)
  • Best classification accuracy at G = 4, 15, 25
  • Worst classification accuracy at G = 7, 18

20
Model Evaluation contd
  • Compare run-times for the three models
  • Training time for the single-nucleotide model was
    55.8 minutes
  • Training time reduced to 23.8 minutes for the
    di-nucleotide model
  • Training time reduced to 18.9 minutes for the
    tri-nucleotide model
  • Testing time reduces from 22.9 minutes to
    15.4 and 14.0 minutes for the di-nucleotide and
    tri-nucleotide models respectively

21
Conclusion
  • A configurable tool to explore generalizations
    of Markov models, incorporating the joint and
    long-range dependencies of sequence elements
  • Evaluation was done on 4 classes of sequences
  • Compared two special cases, the
    di-nucleotide model and the tri-nucleotide model,
    vs. the traditional single-nucleotide model
  • Evaluation shows improved classification accuracy
    for the di- and tri-nucleotide models
  • Improved running time of the software for the
    di- and tri-nucleotide models

22
  • Thank You!