Title: Generalizations of Markov model to characterize biological sequences
1 Generalizations of Markov model to characterize biological sequences
Authors: Junwen Wang and Sridhar Hannenhalli
- CISC841 Bioinformatics
- Presented By Nikhil Shirude
- November 20, 2007
2 Outline
- Motivation
- Model Implementation
- - Training
- - Testing
- Results
- Challenges
- Conclusion
3 Motivation
- Markov Model: a statistical technique for modeling sequences such that the probability of a sequence element depends on a limited context preceding that element
- The current kth-order Markov model generates a single base (model unit size = 1) according to a probability distribution that depends on the k bases immediately preceding the generated base (gap = 0)
- Used in DNA sequence recognition problems such as promoter and gene prediction
4 Motivation (contd.)
- Longer-range dependencies and joint dependencies of neighboring bases have been observed in protein and DNA sequences
- The CG di-nucleotide characterizes CpG islands, so a model with a unit size of 2 is appropriate to characterize this joint dependency
- Longer-range dependencies (gap > 0) are useful to model the periodicity of the helix pattern
5 Model Implementation
- Generalized Markov Model (GMM): a configurable tool that allows for these generalizations
- Posterior bases: the bases whose probability is to be computed
- Prior bases: the bases upon which that probability is conditioned
- 6 parameters specify the Markov model
- Other parameters include the type of biological sequence, a threshold on the minimum prior count below which a k-mer is eliminated, and a pseudo-count for k-mers absent from the training set
6 Model Implementation (contd.)

[Figure: schematic of a model window. The prior consists of O units U1, U2, ..., UO, each of size L1, with spacing g1 between units; a gap of G bases follows; the posterior consists of bases X1, X2, ..., XL2, with spacing g2 between bases.]

Parameters:
- L1: model unit size in the prior
- O: order, i.e., the number of prior units
- g1: spacing between prior units
- L2: model unit size in the posterior
- g2: spacing between posterior bases
- G: gap between prior and posterior
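The six parameters above can be gathered into a small configuration object. A minimal Python sketch (the class name and layout are illustrative, not part of the published tool):

```python
from dataclasses import dataclass

@dataclass
class GMMConfig:
    """Illustrative container for the six GMM parameters."""
    L1: int  # model unit size in the prior
    O: int   # order: number of prior units
    g1: int  # spacing between prior units
    L2: int  # model unit size in the posterior
    g2: int  # spacing between posterior bases
    G: int   # gap between prior and posterior

# Example: the 6th-order single-nucleotide model used later in the evaluation
cfg = GMMConfig(L1=1, O=6, g1=0, L2=1, g2=0, G=0)
print(cfg)
```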
7 Model Implementation (contd.)

A gap of length 2 within the posterior model (g2 = 2) of an amino acid sequence captures the joint dependency between the first and fourth residues, which are likely to form a hydrogen bond vital to the protein helix structure.

For a model in which each tri-nucleotide depends on the previous 4 bases, the configurable parameters can be set as L1 = 4, O = 1, L2 = 3, g1 = g2 = G = 0.

To use the 4 bases after ignoring the 3 immediately preceding bases, set G = 3.
8 Training
- K-mer: a specific nucleic-acid or amino-acid sequence of length k that can be used to identify certain regions within bio-molecules such as DNA or proteins
- For statistical robustness, only k-mers above a certain count threshold in the positive sequences are considered
- For the current model, the default frequency threshold for positive sequences is set at 300
- For nucleosome sequences, the default frequency threshold is set at 50 due to the smaller size of the data set
9 Training (contd.)
- Slide a window one base at a time along the training sequence
- The window size is determined by the user-defined parameters
- For each window, extract the words corresponding to the prior and the posterior

Window size = L1·O + g1·(O − 1) + G + L2 + g2·(L2 − 1)

Example: for the user-defined parameters L1 = 1, O = 6, L2 = 2, g1 = 0, G = 1, g2 = 1, the window size is 10. In the window ACTGATGCAG, the di-nucleotide CG (bases 8 and 10) represents the posterior.
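The window-size formula and the extraction step can be sketched in Python. The function names and the extraction logic are illustrative (not from the published tool); the formula and the example are from the slides:

```python
def window_size(L1, O, g1, L2, g2, G):
    # Window size = L1*O + g1*(O-1) + G + L2 + g2*(L2-1)
    return L1 * O + g1 * (O - 1) + G + L2 + g2 * (L2 - 1)

def extract_words(window, L1, O, g1, L2, g2, G):
    """Split one window into its prior and posterior words."""
    prior, pos = "", 0
    for _ in range(O):            # O prior units of size L1, spaced g1 apart
        prior += window[pos:pos + L1]
        pos += L1 + g1
    pos += G - g1                 # undo the trailing spacing, then skip the gap
    posterior = ""
    for _ in range(L2):           # L2 posterior bases, spaced g2 apart
        posterior += window[pos]
        pos += 1 + g2
    return prior, posterior

# Slide example: L1=1, O=6, g1=0, L2=2, g2=1, G=1
print(window_size(1, 6, 0, 2, 1, 1))                  # → 10
print(extract_words("ACTGATGCAG", 1, 6, 0, 2, 1, 1))  # → ('ACTGAT', 'CG')
```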
10 Training (contd.)
- Increment the k-mer counts: ACTGATCG (6th order), CTGATCG (5th order), ..., and so on down to CG (0th order)
- Thus 7 sub-models are present, one for each order
- After processing the training sequences, calculate the transition probabilities from the k-mer counts
  - for the 0th order, the probability is the composition of the L2-mers
  - for higher orders, compute the sum of the frequencies of all k-mers of that form (e.g., for the 4th-order word TGATCG, compute the sum of the frequencies of all hexamers beginning with TGAT)
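The counting step above can be sketched as follows (the function name and count-table layout are assumptions):

```python
from collections import defaultdict

def add_counts(prior, posterior, counts):
    """Increment one count per sub-model: the full prior, each shorter
    suffix of it, and finally the bare posterior (0th order)."""
    for k in range(len(prior), -1, -1):
        counts[k][prior[len(prior) - k:] + posterior] += 1

counts = defaultdict(lambda: defaultdict(int))
add_counts("ACTGAT", "CG", counts)
print(len(counts))      # → 7 sub-models, one per order
print(list(counts[6]))  # → ['ACTGATCG']
print(list(counts[0]))  # → ['CG']
```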
11 Training (contd.)
- If the sum exceeds the threshold, calculate the probability by dividing the count of that word by the sum
- Otherwise, the program automatically backs off to the (k − 1)-mer
- Finally, convert the probability for each k-mer into a log-odds score
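The threshold-and-back-off rule can be sketched as below. The function name, table layout, and add-one pseudo-count scheme are assumptions; only the back-off rule itself is from the slides:

```python
def kmer_prob(counts, k, kmer, L2, threshold=300, pseudo=1):
    """Transition probability with back-off.

    counts[k] maps order-k words (k prior bases + L2 posterior bases) to
    training frequencies. If the prior's total count does not exceed the
    threshold, fall back to the (k-1)-mer. The pseudo-count (an assumed
    add-one scheme) guards against words absent from training."""
    prior = kmer[:-L2]
    total = sum(c for w, c in counts[k].items() if w.startswith(prior))
    if k > 0 and total <= threshold:
        return kmer_prob(counts, k - 1, kmer[1:], L2, threshold, pseudo)
    return (counts[k].get(kmer, 0) + pseudo) / (total + pseudo)

# Toy tables with L2 = 1: the 1st-order prior 'A' is well supported,
# while 'G' is rare and forces a back-off to the 0th-order table.
counts = {1: {"AC": 200, "AG": 150, "GC": 5}, 0: {"C": 400, "G": 360}}
print(kmer_prob(counts, 1, "AC", L2=1))  # uses the 1st-order table
print(kmer_prob(counts, 1, "GC", L2=1))  # backs off to the 0th order
```

A full implementation would then divide by the corresponding background probability and take the logarithm to obtain the log-odds score; only the back-off is sketched here.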
12 Testing
- The program reads the model, i.e., the k-mer log-odds scores
- Scoring proceeds in the same sliding-window fashion
  - to score a window, consider the highest order first
  - if the string exists in that table, use its score
  - else look for the string corresponding to the next lower order
- The sequence score is obtained by adding all the window scores

To score ACTGATGCAG, first look for the 6th-order dependence, i.e., ACTGATCG, in the 8-mer table; then look for the 5th order, and so on, down to the 0th order.
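The scoring loop above can be sketched as follows. The function signature and the toy extractor are assumptions made for illustration:

```python
def score_sequence(seq, logodds, window, max_k, extract):
    """Sliding-window scoring: at each window, use the highest-order
    word that has a score, backing off to lower orders otherwise.

    logodds[k] maps order-k words to precomputed log-odds scores;
    extract(window) returns the (prior, posterior) pair."""
    total = 0.0
    for i in range(len(seq) - window + 1):
        prior, posterior = extract(seq[i:i + window])
        for k in range(max_k, -1, -1):          # highest order first
            word = prior[len(prior) - k:] + posterior
            if word in logodds[k]:
                total += logodds[k][word]
                break
    return total

# Toy model: only the 0th-order table has entries, so the single window
# backs off all the way down. The extractor assumes the slide's layout
# L1=1, O=6, g1=0, G=1, L2=2, g2=1 (prior = first 6 bases, posterior =
# bases 8 and 10).
tables = {k: {} for k in range(7)}
tables[0] = {"CG": 1.5}
ex = lambda w: (w[:6], w[7] + w[9])
print(score_sequence("ACTGATGCAG", tables, 10, 6, ex))  # → 1.5
```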
13 Results
- Tested on:
  - Human promoter sequences
    - CpG-poor promoters
    - All promoters
  - Human exon dataset
  - Nucleosome positioning sequences
14 Model Evaluation
- 10-fold cross-validation was used to train and test the models
- The sequences were partitioned into 10 equal parts
- Each part was tested after training on the 9 other parts
- Once the models were trained, scores were calculated on the training set using the models
- A score cutoff was obtained from the specificity-sensitivity curve: the cutoff chosen is the one that yields the best correlation coefficient on the training set
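The partitioning step above can be sketched as (function name and equal-size splitting are assumptions):

```python
def ten_fold(seqs, folds=10):
    """Partition sequences into equal parts and yield (train, test)
    pairs: each part is tested after training on the other nine."""
    size = len(seqs) // folds
    for i in range(folds):
        test = seqs[i * size:(i + 1) * size]
        train = seqs[:i * size] + seqs[(i + 1) * size:]
        yield train, test

splits = list(ten_fold(list(range(20))))
print(len(splits))        # → 10 folds
print(len(splits[0][1]))  # → 2 test items per fold
```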
15 Model Evaluation (contd.)
- Score the independent test set and apply this cutoff to obtain the CC values
- Calculate the mean and standard deviation over the 10 CC values
Sensitivity (Sn) = TP / (TP + FN)
Specificity (Sp) = TP / (TP + FP)
CC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
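The three metrics can be computed directly from the confusion-matrix counts; the counts in the example are invented for illustration:

```python
import math

def sn_sp_cc(TP, TN, FP, FN):
    """Sensitivity, specificity, and correlation coefficient as defined above."""
    sn = TP / (TP + FN)
    sp = TP / (TP + FP)
    cc = (TP * TN - FP * FN) / math.sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return sn, sp, cc

sn, sp, cc = sn_sp_cc(TP=80, TN=70, FP=30, FN=20)
print(round(sn, 3), round(sp, 3), round(cc, 3))
```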
16 Model Evaluation (contd.)
- The total number of prior bases is 6 for all 3 models
- Classification accuracy for the three sequence classes was tested using these 3 configurations:

6th-order single-nucleotide model: L1 = L2 = 1, O = 6, g1 = 0, G = 0, g2 = 0
3rd-order di-nucleotide model: L1 = L2 = 2, O = 3, g1 = 0, G = 0, g2 = 0
2nd-order tri-nucleotide model: L1 = L2 = 3, O = 2, g1 = 0, G = 0, g2 = 0
17 Model Evaluation (contd.)
- Classification of CpG-poor promoters (CC, mean ± s.d.)

Sample (size)              | Single nucleotide | Di-nucleotide | Tri-nucleotide
CpG-poor promoters (1,466) | 0.24 ± 0.05       | 0.28 ± 0.03   | 0.34 ± 0.04
18 Model Evaluation (contd.)
- Classification of all promoters
- Classification of exons

Sample (size)          | Single nucleotide | Di-nucleotide | Tri-nucleotide
All promoters (12,333) | 0.54 ± 0.02       | 0.54 ± 0.03   | 0.56 ± 0.02
All exons (219,624)    | 0.63 ± 0.00       | 0.64 ± 0.00   | 0.67 ± 0.00
19 Model Evaluation (contd.)
- Classification of nucleosome positioning sequences (112 sequences)
- Best classification accuracy at G = 4, 15, and 25
- Worst classification accuracy at G = 7 and 18
20 Model Evaluation (contd.)
- Run-time comparison of the three models
- Training time for the single-nucleotide model was 55.8 minutes
- Training time dropped to 23.8 minutes for the di-nucleotide model and 18.9 minutes for the tri-nucleotide model
- Testing time dropped from 22.9 minutes to 15.4 and 14.0 minutes for the di-nucleotide and tri-nucleotide models, respectively
21 Conclusion
- A configurable tool to explore generalizations of Markov models, incorporating joint and long-range dependencies of sequence elements
- Evaluated on 4 classes of sequences
- Two special cases, the di-nucleotide model and the tri-nucleotide model, were compared against the traditional single-nucleotide model
- Evaluation shows improved classification accuracy for the di- and tri-nucleotide models
- The software also runs faster for the di- and tri-nucleotide models