Title: Bayesian Segmentation of Protein Sequences
1. Bayesian Segmentation of Protein Sequences
Paper: "A Graphical Model for Protein Secondary Structure Prediction" by Wei Chu, Zoubin Ghahramani, and David L. Wild, ICML 2004
- Presentation by Jivko Sinapov
2. Outline
- Introduction
- Background
- Naïve Bayes, HMM, HSMM
- Bayesian Segmentation Model
- Experiments and Results
- Discussion
3. Introduction
- Prediction of class labels for residues in a protein sequence
Secondary structure
Sequence: VTSYTLSDVVSLKDVVPEWVRIGFSATTGAEYAAHEVLSWSFHSELS
Class:    -EEEEEEEE--HHHH---EEEEEEEEE------EEEEEEEEEEEEE-
Protein-RNA interface
Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Class:    1111110011111110011111001011111100000001111101000000
Protein-Protein interface
Sequence: AVTTYKLVINGKTLKGETTTKAVDAETAEKAFKQYANDNGVDGVWTYDDA
Class:    00000000000111111110100000000000001011110000000000
4. Background
- Naïve assumption: the label of a residue is independent of the labels of neighboring residues.
- e.g. Naïve Bayes on windows
[Figure: class labels c1 ... c8 shown above sequence residues s1 ... s8]
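The window idea above can be sketched as follows. This is a minimal illustration, not the classifier used in the paper: the window half-width, the `#` padding symbol, the Laplace smoothing constant `alpha`, and the function names `windows`, `train`, and `predict` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def windows(seq, half_width):
    """Yield one fixed-size window per residue, padded with '#' at both ends."""
    padded = "#" * half_width + seq + "#" * half_width
    for i in range(len(seq)):
        yield padded[i:i + 2 * half_width + 1]

def train(pairs, half_width=2, alpha=1.0):
    """Count class frequencies and, per (class, window offset), residue frequencies."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)  # (class, offset) -> residue counts
    alphabet = set("#")
    for seq, labels in pairs:
        alphabet.update(seq)
        for win, c in zip(windows(seq, half_width), labels):
            class_counts[c] += 1
            for k, aa in enumerate(win):
                feat_counts[(c, k)][aa] += 1
    return class_counts, feat_counts, sorted(alphabet), half_width, alpha

def predict(model, seq):
    """Label every residue independently of its neighbours' labels."""
    class_counts, feat_counts, alphabet, half_width, alpha = model
    total = sum(class_counts.values())
    out = []
    for win in windows(seq, half_width):
        scores = {}
        for c, n_c in class_counts.items():
            s = math.log(n_c / total)  # class prior
            for k, aa in enumerate(win):
                # Laplace-smoothed estimate of P(residue at offset k | class c)
                num = feat_counts[(c, k)][aa] + alpha
                den = n_c + alpha * len(alphabet)
                s += math.log(num / den)
            scores[c] = s
        out.append(max(scores, key=scores.get))
    return "".join(out)

# Toy training data (made up for illustration only).
model = train([("AAAAGGGG", "HHHHEEEE")], half_width=1)
print(predict(model, "AAAA"), predict(model, "GGGG"))
```

Because each label is chosen from its own window alone, nothing stops the output from switching class at every residue, which is the weakness the next slides address.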
5. Background
- Hidden Markov Model
- Models dependencies between adjacent class labels
- Observation can be multi-dimensional (PSSM or multiple sequence alignment)
- Efficient algorithms for learning and prediction of class labels
- Captures some local dependencies that are obvious from the data
[Figure: class labels c1 ... c8 shown above sequence residues s1 ... s8]
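As a rough illustration of how an HMM uses the dependency between adjacent labels, here is a log-space Viterbi decoder for a first-order HMM with single-letter observations. The two-state model and all of its probabilities (`P_STAY`, the emission table) are hypothetical values chosen for the example, not parameters from the paper.

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Most probable label path for a non-empty observation string."""
    # score[t][c]: best log-probability of any label path ending in c at position t
    score = [{c: log_init[c] + log_emit[c][obs[0]] for c in states}]
    back = []
    for x in obs[1:]:
        prev = score[-1]
        col, ptr = {}, {}
        for c in states:
            best_prev = max(states, key=lambda p: prev[p] + log_trans[p][c])
            col[c] = prev[best_prev] + log_trans[best_prev][c] + log_emit[c][x]
            ptr[c] = best_prev
        score.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda c: score[-1][c])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

# Hypothetical toy model: two classes that tend to persist.
P_STAY = 0.9
states = "HE"
log_init = {c: math.log(0.5) for c in states}
log_trans = {p: {c: math.log(P_STAY if p == c else 1 - P_STAY) for c in states}
             for p in states}
log_emit = {"H": {"A": math.log(0.8), "G": math.log(0.2)},
            "E": {"A": math.log(0.2), "G": math.log(0.8)}}

print(viterbi("AAGAA", states, log_init, log_trans, log_emit))
print(viterbi("AAAGGG", states, log_init, log_trans, log_emit))
```

In the first call the single `G` is not enough to pay for two label switches, so the decoder keeps one class throughout; a per-residue classifier with the same emission table would flip the label at that position.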
6. Previous methods
- Hidden semi-Markov Model
- Each state is a segment of the sequence
- Each state emits a sequence, rather than a single amino acid
Segment type | coil | sheet    | coil | helix | coil
Residues     | V    | TSYTLSDV | VS   | LKDV  | VPE
More formally:
7. Sequence segmentation
- T: secondary structural type of each segment (H, E, or L)
- S: end positions of the individual structural segments
- R: known amino acid sequence
- Example: $T_2 = E$ (β-strand), $S_2 = 9$, and segment 2 covers residues $R_{[S_1+1 : S_2]}$
From "Bayesian Segmentation of Protein Secondary Structure", S. C. Schmidler, J. S. Liu, D. L. Brutlag, Journal of Computational Biology, 2000
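The (T, S) representation can be made concrete with a small conversion between per-residue labels and segments. The sketch below uses the first 18 residues of the secondary structure example from the introduction, with `-` standing in for the coil type; the helper names `to_segments` and `to_labels` are made up for illustration.

```python
from itertools import groupby

def to_segments(labels):
    """Per-residue labels -> list of (type, end position), 1-indexed ends."""
    segments, end = [], 0
    for t, run in groupby(labels):
        end += len(list(run))
        segments.append((t, end))
    return segments

def to_labels(segments):
    """Inverse of to_segments: expand (type, end) pairs into a label string."""
    out, start = [], 0
    for t, end in segments:
        out.append(t * (end - start))
        start = end
    return "".join(out)

seq = "VTSYTLSDVVSLKDVVPE"
labels = "-EEEEEEEE--HHHH---"
segs = to_segments(labels)
print(segs)

# Residues emitted by each segment: R[S_{i-1}+1 : S_i] in 1-indexed notation.
starts = [0] + [end for _, end in segs[:-1]]
print([seq[s:e] for s, (_, e) in zip(starts, segs)])
```

The second segment comes out as type `E` ending at position 9, matching the example on the slide, and the residue pieces match the coil/sheet/coil/helix/coil split shown on the previous slide.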
8. Bayesian Segmentation of Protein Secondary Structure
- Introduces a model that utilizes multiple sequence alignments or PSSM
- Efficient algorithms for learning and inference
- Improves accuracy by 10% against window-based methods
- Models long-range interactions between β-strands
9. The Model
- The observation at position i, $O_i$, is a 20 x 1 vector containing the occurrence counts for each amino acid at that position.
- Segment locations are determined by the position of the last residue in each segment.
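10.
- The variables (m, e, T) describe the segmentation: m is the number of segments, e the segment end positions, and T the segment types
- Bayesian approach: $P(m, e, T \mid O) \propto P(O \mid m, e, T)\, P(m, e, T)$
11.
- The segment type transition probability $P(T_i \mid T_{i-1})$ is specified by a 3 x 3 transition matrix.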
12.
- Likelihood function for each segment
13.
- Likelihood function for each residue
14.
15.
- Likelihood function for each segment becomes
16. Inference (i.e. classification)
- MAP estimate (use Viterbi)
- Marginal posterior mode estimate (use forward-backward)
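To show what a Viterbi-style MAP estimate looks like when states are whole segments, here is a generic semi-Markov (segmental) Viterbi sketch. It is not the paper's likelihood: the segment score is supplied by a caller-provided function `log_seg_lik`, adjacent segments are assumed to have different types, and the toy model (uniform length prior, made-up emission table `EMIT`) is hypothetical.

```python
import math

def segmental_viterbi(n, types, log_init, log_trans, log_len, log_seg_lik, max_len):
    """MAP segmentation of residues 1..n as a list of (type, end) pairs."""
    neg_inf = -math.inf
    # best[e][t]: best log-score of any segmentation of residues 1..e
    # whose last segment has type t and ends exactly at position e.
    best = [{t: neg_inf for t in types} for _ in range(n + 1)]
    back = [{t: None for t in types} for _ in range(n + 1)]
    for e in range(1, n + 1):
        for t in types:
            for length in range(1, min(max_len, e) + 1):
                s = e - length  # end of the previous segment (0 = sequence start)
                local = log_len[t][length] + log_seg_lik(s, e, t)
                if s == 0:
                    cands = [(log_init[t] + local, None)]
                else:
                    # Assumption: consecutive segments never share a type.
                    cands = [(best[s][p] + log_trans[p][t] + local, p)
                             for p in types if p != t]
                for score, p in cands:
                    if score > best[e][t]:
                        best[e][t], back[e][t] = score, (s, p)
    # Trace back from the best type for the final segment.
    t = max(types, key=lambda c: best[n][c])
    e, segs = n, []
    while e > 0:
        segs.append((t, e))
        e, t = back[e][t]
    return segs[::-1]

# Hypothetical toy model; every number here is made up for illustration.
EMIT = {"H": {"A": 0.8, "G": 0.2}, "E": {"A": 0.2, "G": 0.8}}

def toy_model(seq, max_len=6):
    types = "HE"
    log_init = {t: math.log(0.5) for t in types}
    log_trans = {"H": {"E": 0.0}, "E": {"H": 0.0}}  # log 1: always switch type
    log_len = {t: {l: -math.log(max_len) for l in range(1, max_len + 1)}
               for t in types}
    def log_seg_lik(s, e, t):
        # Residues s+1..e (1-indexed), scored independently given the type.
        return sum(math.log(EMIT[t][aa]) for aa in seq[s:e])
    return len(seq), types, log_init, log_trans, log_len, log_seg_lik, max_len

print(segmental_viterbi(*toy_model("AAAGGG")))
print(segmental_viterbi(*toy_model("AAGAA")))
```

The cost is roughly n x max_len x (number of types squared) segment evaluations. The marginal posterior mode would replace the max over predecessors with a sum (forward-backward) and pick the most probable type at each residue.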
17. Long-range interactions
18.
- r is the number of interacting pairs; each pair consists of two interacting segments together with their alignment information
- Incorporate into model
- Prior
- Conditional
- Final segmental likelihood for strand
19. Parameter estimation and inference
- Discrete parameters: segment lengths, occurrence counts, state transition probabilities.
- Estimated directly from data
- Weight parameters for neighboring and long-distance dependencies
- Estimated using a MAP estimate
- Uses variational method (!!!)
- Use Markov chain Monte Carlo for inference
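"Estimated directly from data" can be read as simple counting. The sketch below estimates segment type transition probabilities and per-type segment length distributions as relative frequencies from labelled sequences. Plain relative frequencies without smoothing are an assumption here, as are the function names; the slides do not say how the counts are normalised.

```python
from collections import Counter
from itertools import groupby

def runs(labels):
    """Collapse a label string into (type, length) segments."""
    return [(t, len(list(g))) for t, g in groupby(labels)]

def normalise(counter):
    """Turn counts into relative frequencies (empty dict if nothing was seen)."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()} if total else {}

def estimate_discrete(label_strings, types="HEL"):
    """Return (transition probabilities, segment length distributions).

    Labels must use only the characters in `types`.
    """
    trans = {t: Counter() for t in types}
    lengths = {t: Counter() for t in types}
    for labels in label_strings:
        segs = runs(labels)
        for t, n in segs:
            lengths[t][n] += 1
        # Count type-to-type transitions between consecutive segments.
        for (a, _), (b, _) in zip(segs, segs[1:]):
            trans[a][b] += 1
    return ({t: normalise(c) for t, c in trans.items()},
            {t: normalise(c) for t, c in lengths.items()})

# Made-up label string, for illustration only.
trans, lengths = estimate_discrete(["LHHHLEEL"])
print(trans)
print(lengths)
```

The weight parameters for neighboring and long-distance dependencies are not covered by this sketch, since they are fitted by MAP estimation rather than counted.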
20. Experiments and Results
- Dataset: CB513
- 513 non-homologous protein chains with solved structure
- Removed sequences shorter than 30 and longer than 550 residues.
- 7-fold cross-validation
- Three types of features
- Sequence only
- Multiple sequence alignment profile
- PSSM
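The data preparation described above (length filter, then 7-fold cross-validation) might look like the following. How the folds were actually assigned is not stated on the slide, so the shuffled round-robin split and the fixed `seed` are assumptions, and both function names are hypothetical.

```python
import random

def filter_by_length(chains, min_len=30, max_len=550):
    """Drop chains shorter than min_len or longer than max_len."""
    return [c for c in chains if min_len <= len(c) <= max_len]

def k_fold_splits(items, k=7, seed=0):
    """Yield (train, test) lists for each of k folds."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = [items[j] for j in folds[i]]
        train = [items[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Dummy chains of various lengths, for illustration only.
chains = ["A" * n for n in (10, 30, 100, 550, 600)]
kept = filter_by_length(chains)
print([len(c) for c in kept])

for train, test in k_fold_splits(list(range(14)), k=7):
    print(len(train), len(test))
```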
21. Experiments and Results
- Window-based Naïve Bayes approach: 60.5%
22. Predicting long-range interactions
- A subset of 153 proteins with long-range contact maps between β-strands
- Good performance in predicting the contact maps between a pair of β-strands (AUC = 0.89)
- However, no clear difference between using MCMC with long-range interactions and exact inference without such interactions
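For readers unfamiliar with the AUC figure quoted above, it can be computed as the probability that a randomly chosen true contact receives a higher score than a randomly chosen non-contact, with ties counted as one half. The sketch below uses that pairwise definition on made-up scores; it is not the evaluation code behind the reported 0.89.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (labels are 0 or 1)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A positive "wins" against a negative when it scores higher; ties count half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up contact scores and true contact labels.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise form is quadratic in the number of examples, which is fine for a sketch; a rank-based computation gives the same value faster on large contact maps.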
23. Discussion
- The model achieves comparable performance to the state of the art
- Can be extended to other problems
- Can model short-distance and long-distance interactions between segments
- If using sequence only, the model achieves a 5% improvement over Naïve Bayes