1
Bayesian Segmentation of Protein Sequences
Paper: "A Graphical Model for Protein Secondary Structure Prediction" by Wei Chu, Zoubin Ghahramani, and David L. Wild, ICML 2004

Presentation by Jivko Sinapov

2
Outline

Introduction
Background
Naïve Bayes, HMM, HSMM
Bayesian Segmentation Model
Experiments and Results
Discussion

3
Introduction

Prediction of class labels for residues in a protein sequence

Secondary structure
Sequence: VTSYTLSDVVSLKDVVPEWVRIGFSATTGAEYAAHEVLSWSFHSELS
Class:    -EEEEEEEE--HHHH---EEEEEEEEE------EEEEEEEEEEEEE-
Protein-RNA interface
Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Class:    1111110011111110011111001011111100000001111101000000
Protein-protein interface
Sequence: AVTTYKLVINGKTLKGETTTKAVDAETAEKAFKQYANDNGVDGVWTYDDA
Class:    00000000000111111110100000000000001011110000000000
4
Background

Naïve assumption: the label of a residue is independent of the labels of neighboring residues.
E.g., Naïve Bayes on windows

[Figure: class labels c1 to c8 shown above sequence positions s1 to s8]
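
To make the window-based idea concrete, here is a minimal Python sketch of a Naïve Bayes classifier that labels each residue independently from a window of surrounding residues. The function names (`windows`, `train`, `predict`), the padding symbol, the Laplace smoothing, and the toy training pair are all assumptions for illustration; they are not taken from the paper or the slides.

```python
import math
from collections import Counter, defaultdict

def windows(seq, half_width=2, pad="#"):
    """Yield the window of residues centered on each position of seq."""
    padded = pad * half_width + seq + pad * half_width
    for i in range(len(seq)):
        yield padded[i:i + 2 * half_width + 1]

def train(pairs, half_width=2):
    """Count class frequencies and per-offset residue frequencies."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, offset) -> residue counts
    for seq, labels in pairs:
        for window, label in zip(windows(seq, half_width), labels):
            class_counts[label] += 1
            for offset, residue in enumerate(window):
                feature_counts[(label, offset)][residue] += 1
    return class_counts, feature_counts

def predict(seq, model, half_width=2, alphabet_size=21, alpha=1.0):
    """Label each residue independently with its most probable class."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    out = []
    for window in windows(seq, half_width):
        best_label, best_score = None, -math.inf
        for label, n in class_counts.items():
            score = math.log(n / total)
            for offset, residue in enumerate(window):
                count = feature_counts[(label, offset)][residue]
                # Laplace smoothing; 21 = 20 amino acids + the padding symbol
                score += math.log((count + alpha) / (n + alpha * alphabet_size))
            if score > best_score:
                best_label, best_score = label, score
        out.append(best_label)
    return "".join(out)

# Toy example (hypothetical data): windows of width 1, i.e. the residue itself
model = train([("AAAAGGGG", "HHHHEEEE")], half_width=0)
print(predict("AAGG", model, half_width=0))  # HHEE
```

Because every position is scored on its own, nothing in this sketch stops the predicted labels from flipping between classes at adjacent residues, which is the weakness the models on the following slides address.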
5
Background

Hidden Markov Model
Models dependencies between adjacent class labels
Observation can be multi-dimensional (PSSM, or
multiple sequence alignment)
Efficient algorithms for learning and prediction
of class labels
Captures some local dependencies that are obvious
from the data

[Figure: class labels c1 to c8 shown above sequence positions s1 to s8]
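
As an illustration of the efficient prediction mentioned above, the following Python sketch decodes the most probable label path of a discrete HMM with the Viterbi algorithm. The two-state model, its probabilities, and the two-letter observation alphabet are toy assumptions, not parameters from the paper; the function assumes a non-empty observation sequence.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for a non-empty observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy two-state model (hypothetical numbers): labels tend to persist
log = math.log
states = ["H", "C"]
log_start = {"H": log(0.5), "C": log(0.5)}
log_trans = {"H": {"H": log(0.8), "C": log(0.2)},
             "C": {"H": log(0.2), "C": log(0.8)}}
log_emit = {"H": {"a": log(0.9), "g": log(0.1)},
            "C": {"a": log(0.1), "g": log(0.9)}}

print("".join(viterbi("aaagg", states, log_start, log_trans, log_emit)))  # HHHCC
```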
6
Previous methods

Hidden semi-Markov Model
Each state is a segment of the sequence
Each state emits a sequence, rather than a single
amino-acid

Segment type: coil  sheet     coil  helix  coil
Sequence:     V     TSYTLSDV  VS    LKDV   VPE
More formally:
7
Sequence segmentation

T: secondary structure type of each segment (H, E, or L)
S: end position of each structural segment
R: known amino acid sequence

Example: $T_2 = E$ (β-strand), $S_2 = 9$, $R_{[S_1+1:S_2]}$

From "Bayesian Segmentation of Protein Secondary Structure", S. C. Schmidler, J. S. Liu, and D. L. Brutlag, Journal of Computational Biology, 2000
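
The segment notation above is equivalent to a per-residue labeling. The short Python sketch below converts between the two views, using segment types T and 1-based end positions S. The function names and the use of `-` for coil are illustrative assumptions; the example reuses the first 18 residues of the secondary-structure sequence from the introduction.

```python
def segments_to_labels(types, ends):
    """Expand segment types and 1-based end positions into per-residue labels."""
    labels, start = [], 0
    for t, end in zip(types, ends):
        labels.append(t * (end - start))
        start = end
    return "".join(labels)

def labels_to_segments(labels):
    """Collapse per-residue labels into (types, 1-based end positions)."""
    types, ends = [], []
    for i, label in enumerate(labels):
        if types and types[-1] == label:
            ends[-1] = i + 1  # extend the current segment
        else:
            types.append(label)
            ends.append(i + 1)
    return types, ends

# V | TSYTLSDV | VS | LKDV | VPE  ->  coil, sheet, coil, helix, coil
T, S = labels_to_segments("-EEEEEEEE--HHHH---")
print(T)  # ['-', 'E', '-', 'H', '-']
print(S)  # [1, 9, 11, 15, 18]
```

In this example the second segment has type E and ends at position 9, matching the slide's example.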
8
Bayesian Segmentation of Protein Secondary
Structure

Introduces a model that utilizes multiple sequence alignments or PSSMs
Efficient algorithms for learning and inference
Improves accuracy by 10% over window-based methods
Models long-range interactions between beta-strands

9
The Model

Observation

where $O_i$ is a 20 × 1 vector containing the occurrence counts for each amino acid at position i.

Set of structure types

Segment sequence

Segment locations determined by position of last
residue in the segments

The variables (m,e,T) describe the segmentation
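
As a rough illustration of the observation vectors described above, this Python sketch builds one 20-entry occurrence-count vector per column of a multiple sequence alignment. The function name `occurrence_counts`, the residue ordering in `AMINO_ACIDS`, the handling of gaps, and the toy alignment are assumptions for illustration rather than details from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 entries

def occurrence_counts(alignment):
    """Return one 20-entry count vector per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    counts = [[0] * len(AMINO_ACIDS) for _ in range(length)]
    for row in alignment:
        for i, residue in enumerate(row):
            if residue in index:  # gaps and unknown symbols are skipped
                counts[i][index[residue]] += 1
    return counts

# Toy alignment of three sequences (hypothetical data)
O = occurrence_counts(["VTS", "VAS", "-TS"])
print(O[1][AMINO_ACIDS.index("T")])  # 2
```

With a single unaligned sequence, each vector reduces to a one-hot indicator of the residue at that position.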
Bayesian approach

Prior

where the transition probability between segment types is specified by a 3 × 3 transition matrix.
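
One common way to write a segmentation prior of this kind is sketched below. It assumes that segment types follow a first-order Markov chain and that each segment's end position depends only on the previous end position and the segment's type, as in the semi-Markov formulation of Schmidler et al.; it may differ in detail from the form used in the paper. Here m is the number of segments, $e_i$ is the end position of segment i, and $T_i$ is its type.

```latex
P(m, e, T) = P(m) \prod_{i=1}^{m} P(T_i \mid T_{i-1}) \, P(e_i \mid e_{i-1}, T_i)
```

Under this reading, the factor $P(T_i \mid T_{i-1})$ is the entry of the 3 × 3 transition matrix, and $P(e_i \mid e_{i-1}, T_i)$ acts as a type-specific distribution over segment lengths $e_i - e_{i-1}$.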
12

Likelihood

and

Likelihood function for each segment

Likelihood function for each residue

Dirichlet prior on the multinomial parameters

Likelihood function for each segment becomes
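
When occurrence counts are modeled as multinomial and the multinomial parameters are given a Dirichlet prior, the parameters can be integrated out in closed form. The Python sketch below computes that standard Dirichlet-multinomial marginal likelihood in log space. It is a generic illustration: the fixed pseudo-count vector `gamma` is a hypothetical stand-in, and the Dirichlet parameters in the paper may be constructed differently.

```python
from math import lgamma, log

def log_dirichlet_multinomial(counts, gamma):
    """Log marginal likelihood of counts under a multinomial with a Dirichlet prior.

    counts: observed counts (length 20 in the protein setting)
    gamma:  Dirichlet pseudo-counts, same length, all positive
    """
    n = sum(counts)
    g = sum(gamma)
    # Multinomial coefficient n! / prod(c_a!)
    log_coeff = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    # Ratio of Dirichlet normalizers after integrating out the parameters
    log_ratio = lgamma(g) - lgamma(g + n) + sum(
        lgamma(c + a) - lgamma(a) for c, a in zip(counts, gamma)
    )
    return log_coeff + log_ratio

# Two-category sanity check: a uniform prior makes all splits of n=2 equally likely
print(abs(log_dirichlet_multinomial([1, 1], [1.0, 1.0]) - log(1 / 3)) < 1e-12)  # True
```

Working in log space with `lgamma` avoids overflow when the counts come from deep alignments.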

16
Inference (i.e. classification)

Bayes rule

MAP estimate (use Viterbi)

Marginal posterior mode estimate (use
forward-backward)
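
For a segmental model, the Viterbi recursion runs over segment end positions rather than single residues. The Python sketch below finds a MAP segmentation by dynamic programming, under stated assumptions: `segment_score` is a hypothetical callable that returns the log-score of one segment (length prior plus segment likelihood) and is finite for at least one type, `max_len` caps the segment length, and the toy model at the end is invented for illustration. It is not the paper's implementation.

```python
import math

def map_segmentation(n, types, log_start, log_trans, segment_score, max_len):
    """Return the highest-scoring segmentation as a list of (type, end) pairs.

    segment_score(start, end, t) is the log-score of residues start..end-1
    (0-based) forming one segment of type t; ends are reported 1-based.
    """
    # best[j][t]: best log-score of segmenting the first j residues
    # so that the last segment has type t and ends at position j
    best = [{t: -math.inf for t in types} for _ in range(n + 1)]
    back = [{t: None for t in types} for _ in range(n + 1)]
    for end in range(1, n + 1):
        for t in types:
            for start in range(max(0, end - max_len), end):
                score = segment_score(start, end, t)
                if start == 0:
                    cand, prev = log_start[t] + score, None
                else:
                    prev = max(types, key=lambda p: best[start][p] + log_trans[p][t])
                    cand = best[start][prev] + log_trans[prev][t] + score
                if cand > best[end][t]:
                    best[end][t] = cand
                    back[end][t] = (start, prev)
    # Trace back from the best type of the final segment
    t = max(types, key=lambda u: best[n][u])
    end, segments = n, []
    while end > 0:
        start, prev = back[end][t]
        segments.append((t, end))
        end, t = start, prev
    return segments[::-1]

# Toy model (hypothetical): adjacent segments must differ in type
seq = "aaagg"
emit = {"H": {"a": math.log(0.9), "g": math.log(0.1)},
        "C": {"a": math.log(0.1), "g": math.log(0.9)}}
start_lp = {"H": math.log(0.5), "C": math.log(0.5)}
trans_lp = {"H": {"H": -math.inf, "C": 0.0}, "C": {"H": 0.0, "C": -math.inf}}

def toy_score(start, end, t):
    return sum(emit[t][x] for x in seq[start:end])

print(map_segmentation(len(seq), ["H", "C"], start_lp, trans_lp, toy_score, 5))
# [('H', 3), ('C', 5)]
```

Replacing each `max` with a log-sum-exp gives the forward pass of the forward-backward algorithm used for the marginal posterior mode estimate.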

17
Long Range interactions
18

Long range interactions

where r is the number of interacting pairs, and each pair of interacting segments is recorded with its alignment information

Incorporate into model
Prior
Conditional
Final segmental likelihood for strand

19
Parameter estimation and inference

Discrete parameters: segment lengths, occurrence counts, state transition probabilities
Estimated directly from data

Weight parameters for neighboring and long-distance dependencies
Estimated using a MAP estimate
Uses a variational method (!!!)

Uses Markov chain Monte Carlo (MCMC) for inference

20
Experiments and Results

Dataset: CB513
513 non-homologous protein chains with solved structures
Removed sequences shorter than 30 or longer than 550 residues
7-fold cross-validation
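
The data preparation above can be sketched in a few lines of Python. This is a generic illustration of length filtering and k-fold splitting, not the authors' protocol: the function names, the random shuffling, and the seed are assumptions, and the slides do not say how chains were assigned to folds.

```python
import random

def filter_by_length(chains, min_len=30, max_len=550):
    """Keep chains whose length lies in [min_len, max_len]."""
    return [c for c in chains if min_len <= len(c) <= max_len]

def k_fold_indices(n_items, k=7, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Toy usage with placeholder chains of various lengths
chains = filter_by_length(["A" * 29, "A" * 30, "A" * 550, "A" * 551])
print([len(c) for c in chains])  # [30, 550]
splits = list(k_fold_indices(20, k=7))
print(len(splits))  # 7
```

Each chain appears in exactly one test fold, so every prediction used for evaluation comes from a model that did not see that chain during training.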

Three types of features
Sequence only
Multiple sequence alignment profile
PSSM

21
Experiments and Results
Window-based Naïve Bayes approach: 60.5%
22
Predicting long range interactions

A subset of 153 proteins with long-range contact maps between beta-strands
Good performance in predicting the contact maps between a pair of beta-strands (AUC 0.89)
However, no clear difference between using MCMC with long-range interactions and exact inference without such interactions
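
For readers unfamiliar with the AUC figure quoted above, this Python sketch computes the area under the ROC curve as the fraction of positive/negative pairs that are ranked correctly. The function name and the toy scores are illustrative assumptions; the slides do not describe how the reported AUC was computed.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy contact predictions: 1 = residues in contact, 0 = not in contact
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of contacts above non-contacts.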

23
Discussion