Title: Susie Jo
Slide 1: Protein Family Classification Using AI Techniques (Profile HMMs, SVM)
2003.12.04, Susie Jo, Bio-Information System Laboratory, BioSystem Dept., KAIST
Slide 2: Table of Contents
Introduction
Profile HMM
SVM-Fisher
Slide 3: Protein homology detection
CS774
- Proteins and DNA can be encoded as primary sequences (amino acid residues: 20 types; nucleotides: A/G/C/T)
- Functionally annotated sequence
- Functionally unknown sequence
- Sequence similarity: similar sequence, similar function
Sections: Introduction, Profile HMM, SVM-Fisher, SOM
Slide 4: Protein Classification: SCOP (Structural Classification of Proteins database)
- SCOP hierarchy of protein domains
Primary Level
Varying degrees of similarity
Slide 5: GPCR
- The three major subfamilies include the receptors related to the light receptor rhodopsin and the β2-adrenergic receptor (family A)
- can be subdivided into six major subgroups
- overall homology among all family A receptors is low
- a few key residues are highly conserved
- Asp-Arg-Tyr (DRY) motif
- Receptors related to the glucagon receptor (family B)
- Receptors related to the metabotropic neurotransmitter receptors (family C)
- Yeast pheromone receptors (families D, E)
- cAMP receptors (family F)
Slide 6: GPCR: 3 major subfamilies
Low Sequence Homology
Slide 7: Protein remote homology detection methods
- Pairwise similarities between proteins
- Simple sequence similarity
- Using Smith-Waterman dynamic programming
- BLAST, FASTA
- (+) Simple, easy
- (-) Low accuracy
Example: the urotensin receptor is very similar in sequence to the type 4 somatostatin receptor, yet they actually have different ligands.
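The Smith-Waterman dynamic programming mentioned above can be sketched as follows. This is a minimal illustration, not the scoring used in the talk: the match, mismatch, and gap values are assumptions chosen for readability, whereas real protein searches would use a substitution matrix such as BLOSUM and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman("RWDAGCVN", "RWDSGCVN"))
```

A high pairwise score only says that two sequences look alike; as the urotensin example shows, it does not guarantee shared function.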
Slide 8: Protein remote homology detection methods
- Profiles and hidden Markov models (HMMs)
- Profile-based methods iteratively collect homologous sequences from a large database and incorporate the resulting statistics into a single model
- PSI-BLAST and SAM-T98
- 3. SVM-Fisher method (Jaakkola et al., 1999, 2000)
- couples an iterative HMM training scheme with the SVM
Slide 10: Motivation
- Objective: given a family of related sequences, what is an effective way to capture what they have in common, so that we can recognize other members of the family?
- Some standard methods for characterization:
- Multiple alignments
- Profiles
- Regular expressions
- Consensus sequences
- Hidden Markov models
Slide 11: Using Family Profile
Source: A. Gaulton, T. K. Attwood, Bioinformatics approach for the classification of GPCR, Current Opinion in Pharmacology 2003, 3:114-120
- Use an MSA of the family
- Identify the most highly conserved regions
RWDAGCVN RWDSGCVN RWHHGCVQ RWKGACYN RWLWACEQ
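As a rough sketch of how the most conserved regions could be read off such an alignment, the snippet below counts residues per column for the five sequences shown on this slide. The function names and the conservation threshold are illustrative assumptions, not part of the original talk.

```python
from collections import Counter

# The five aligned sequences shown on the slide.
ALIGNMENT = ["RWDAGCVN", "RWDSGCVN", "RWHHGCVQ", "RWKGACYN", "RWLWACEQ"]


def column_profile(alignment):
    """Return, for each column, a dict of residue -> relative frequency."""
    n = len(alignment)
    profile = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        profile.append({residue: c / n for residue, c in counts.items()})
    return profile


def conserved_columns(profile, threshold=1.0):
    """Indices of columns whose most common residue reaches the threshold."""
    return [i for i, freqs in enumerate(profile)
            if max(freqs.values()) >= threshold]


if __name__ == "__main__":
    profile = column_profile(ALIGNMENT)
    print(conserved_columns(profile))        # fully conserved columns
    print(conserved_columns(profile, 0.6))   # looser threshold
```

With a threshold of 1.0, only the columns where every sequence agrees (R, W, and C here) are reported.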
Slide 12: Method of characterizing a family of nucleotide sequences
- 1. Regular expression
- [AT] [CG] [AC] [ACTG]* A [TG] [GC]
- But it cannot distinguish between
- the highly implausible T G C T - - A G G
- and the consensus A C A C - - A T C
- 2. Consensus sequence
- A C A C - - A T C
- Unclear what "consensus" means
- Need some kind of similarity table between nucleotides to measure the probability of a sequence
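The weakness of the regular expression can be checked directly. The sketch below uses Python's `re` module; the helper name `matches_family` is an assumption for illustration. Both the implausible sequence and the consensus are accepted, so the pattern gives no way to rank them.

```python
import re

# The family pattern from the slide, written as a Python regular expression.
FAMILY_PATTERN = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")


def matches_family(aligned_sequence):
    """True if the sequence (gaps and spaces removed) fits the pattern."""
    raw = aligned_sequence.replace("-", "").replace(" ", "")
    return FAMILY_PATTERN.fullmatch(raw) is not None


if __name__ == "__main__":
    print(matches_family("T G C T - - A G G"))  # highly implausible member
    print(matches_family("A C A C - - A T C"))  # consensus
```

A regular expression only answers yes or no, which is why a probabilistic model is needed to say how well a sequence fits.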
Slide 13: HMM
Source: Sean R. Eddy, Profile hidden Markov models, Bioinformatics vol. 14, no. 9, 1998, 755-763
- A model that generates sequences
- A symbol sequence (the observations) is generated by moving between states
- The state sequence is hidden
- States
- Symbol emission probabilities
- State transition probabilities
Hidden state sequence, $S$
Observed symbol sequence, $X$
$P(X, S \mid \text{HMM})$
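When both the state path and the symbols are known, the joint probability is simply a product of start, transition, and emission probabilities. The sketch below uses a hypothetical two-state nucleotide HMM ("at" for AT-rich, "gc" for GC-rich); all state names and numbers are made up for illustration.

```python
# Hypothetical two-state HMM; every value here is an illustrative assumption.
START = {"at": 0.5, "gc": 0.5}
TRANS = {"at": {"at": 0.9, "gc": 0.1},
         "gc": {"at": 0.1, "gc": 0.9}}
EMIT = {"at": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "gc": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}


def joint_probability(states, symbols, start=START, trans=TRANS, emit=EMIT):
    """P(X, S | HMM) for a state path S and a symbol sequence X of equal length."""
    p = start[states[0]] * emit[states[0]][symbols[0]]
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        p *= trans[prev][cur] * emit[cur][sym]
    return p


if __name__ == "__main__":
    print(joint_probability(["at", "at", "gc"], "ATG"))
```

In practice the state path is hidden, so this quantity has to be summed or maximized over all paths.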
Slide 14: 3. HMM (example: gene sequence)
Insertion state
M M M I M M M
Slide 15: Deriving HMM and Scoring HMM
- Deriving the HMM from a known alignment
- Each column in the alignment generates a state
- Count the occurrences of A, T, G, C in each column to determine the emission probabilities for each state
- Transition probabilities to insertion states are determined in a similar way (with some caution)
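The counting step can be sketched as follows. The alignment below is hypothetical, and the rule that a column with more than half gaps is treated as an insert region rather than a match state is a common heuristic assumed here, not something stated on the slide.

```python
from collections import Counter

# Hypothetical nucleotide alignment used only for illustration.
ALIGNMENT = [
    "ACA--ATG",
    "TCAACATC",
    "ACA--AGC",
    "AGAC-ATC",
    "ACA--ATC",
]


def match_state_emissions(alignment, alphabet="ACGT", max_gap_fraction=0.5):
    """Estimate emission probabilities for each match column by counting."""
    n = len(alignment)
    emissions = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if (n - len(residues)) / n > max_gap_fraction:
            continue  # mostly gaps: assumed to belong to an insert state
        counts = Counter(residues)
        emissions.append({a: counts[a] / len(residues) for a in alphabet})
    return emissions


if __name__ == "__main__":
    for k, dist in enumerate(match_state_emissions(ALIGNMENT), start=1):
        print(k, dist)
```

Note that residues never seen in a column get probability zero under plain counting, which is one reason some caution is needed.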
Slide 16: Probability and Log-odds
- Probability: depends on the sequence length (L)
- Raw probability penalizes insertions and favors deletions
- Log-odds is computed using a null model
- The null model considers the overall sequence of nucleotides as random
- Better estimate: use the overall frequency of nucleotides (or amino acids) in the organism's genome
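A minimal sketch of the log-odds idea, assuming a gap-free model with one emission distribution per position and a uniform null model. The two-position model and all of its probabilities are hypothetical.

```python
import math

# Hypothetical per-position emission probabilities (illustrative only).
MODEL = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.5, "G": 0.125, "T": 0.125},
]
# Null model: every nucleotide equally likely (an assumption; genome-wide
# frequencies would be a better estimate, as the slide notes).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def probability(seq, model=MODEL):
    """Raw probability of seq under the model; shrinks as seq gets longer."""
    p = 1.0
    for dist, symbol in zip(model, seq):
        p *= dist[symbol]
    return p


def log_odds(seq, model=MODEL, background=BACKGROUND):
    """Sum of log2(model probability / null probability) over positions."""
    return sum(math.log2(dist[s] / background[s]) for dist, s in zip(model, seq))


if __name__ == "__main__":
    print(probability("AC"), log_odds("AC"))
    print(probability("GT"), log_odds("GT"))
```

A positive log-odds score means the sequence fits the family model better than the null model, regardless of its length.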
Slide 17: Profile HMM
Source: Sean R. Eddy, Profile hidden Markov models, Bioinformatics vol. 14, no. 9, 1998, 755-763
- HMM architecture for representing profiles of multiple sequence alignments
- Linear left-right model
- Match state
- Insert state
- Delete state
Example alignment (columns 1 2 3): C A F, C G W, C D Y, C V F, C K Y
Slide 18: Elements of Profile HMMs
Source: Visual Recognition Tutorial. Thad Starner, Alex Pentland. Visual Recognition of American Sign Language Using HMMs. In International Workshop on Automatic Face and Gesture Recognition, pages 189-194, 1995
- $N$: the number of hidden states
- $Q$: the set of states, $Q = \{1, 2, \dots, N\}$
- $M$: the number of symbols
- $V$: the set of symbols, $V = \{1, 2, \dots, M\}$
- $A$: the state-transition probability matrix
- $B$: the observation probability distribution
- $\pi$: the initial state distribution
- $\lambda$: the entire model, $\lambda = (A, B, \pi)$
Slide 19: Three Basic Problems
- EVALUATION: given an observation $O = (o_1, o_2, \dots, o_T)$ and a model $\lambda$, efficiently compute $P(O \mid \lambda)$
- Hidden states complicate the evaluation
- Given two models $\lambda_1$ and $\lambda_2$, this can be used to choose the better one
- DECODING: given an observation $O = (o_1, o_2, \dots, o_T)$ and a model $\lambda$, find the optimal state sequence $q = (q_1, q_2, \dots, q_T)$
- An optimality criterion has to be decided (e.g. maximum likelihood)
- "Explanation" of the data
- LEARNING: given $O = (o_1, o_2, \dots, o_T)$, estimate the model parameters $\lambda = (A, B, \pi)$ that maximize $P(O \mid \lambda)$
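Slide 20: Solution to problem 1
1. Forward Algorithm
- Define the forward variable $\alpha_t(i)$ as $\alpha_t(i) = P(o_1, o_2, \dots, o_t, q_t = i \mid \lambda)$
- $\alpha_t(i)$ is the probability of observing the partial sequence $(o_1, o_2, \dots, o_t)$ such that the state $q_t$ is $i$
- Induction
- 1. Initialization: $\alpha_1(i) = \pi_i \, b_i(o_1)$
- 2. Induction: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(o_{t+1})$
- 3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$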
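The three steps of the forward algorithm can be sketched in a few lines. States and symbols are encoded as integer indices, and the two-state example parameters are hypothetical; for long sequences a real implementation would work in log space or rescale to avoid underflow.

```python
def forward(obs, pi, A, B):
    """Return P(O | lambda) using the forward algorithm.

    obs: list of symbol indices; pi[i]: initial probability of state i;
    A[i][j]: transition probability i -> j; B[i][k]: probability that
    state i emits symbol k.
    """
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)


# Hypothetical two-state, two-symbol model used only as an example.
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

if __name__ == "__main__":
    print(forward([0, 1], PI, A, B))
```

The cost grows as N squared times T, instead of the exponential cost of enumerating every state path.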
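Slide 21: Solution to problem 1
2. Backward Algorithm
- Define the backward variable $\beta_t(i)$ as $\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = i, \lambda)$
- $\beta_t(i)$ is the probability of observing the partial sequence $(o_{t+1}, o_{t+2}, \dots, o_T)$ given that the state $q_t$ is $i$
- Induction
- 1. Initialization: $\beta_T(i) = 1$
- 2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$, for $t = T-1, \dots, 1$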
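Slide 22: Solution to problem 2
- Choose the most likely path
- Find the path $(q_1, q_2, \dots, q_T)$ that maximizes the likelihood $P(q_1, q_2, \dots, q_T \mid O, \lambda)$
- Solution by dynamic programming
- Define $\delta_t(i) = \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = i, o_1, \dots, o_t \mid \lambda)$
- $\delta_t(i)$ is the probability of the most likely path ending in state $i$ at time $t$
- By induction we have $\delta_{t+1}(j) = \left[ \max_i \delta_t(i) \, a_{ij} \right] b_j(o_{t+1})$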
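Slide 23: Solution to problem 2
Viterbi Algorithm
- Initialization: $\delta_1(i) = \pi_i \, b_i(o_1)$, $\psi_1(i) = 0$
- Recursion: $\delta_t(j) = \max_i \left[ \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t)$, $\psi_t(j) = \arg\max_i \left[ \delta_{t-1}(i) \, a_{ij} \right]$
- Termination: $P^* = \max_i \delta_T(i)$, $q_T^* = \arg\max_i \delta_T(i)$
- Path (state sequence) backtracking: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, for $t = T-1, \dots, 1$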
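A compact sketch of the four Viterbi steps, using integer indices for states and symbols. The two-state example model is hypothetical, and probabilities are multiplied directly, so a practical version would switch to log probabilities for long sequences.

```python
def viterbi(obs, pi, A, B):
    """Return (most likely state path, its joint probability with obs)."""
    n = len(pi)
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    psi = []  # psi[t][j] = best predecessor of state j at step t+1
    # Recursion
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        psi.append(step)
        delta = new_delta
    # Termination: pick the best final state
    last = max(range(n), key=lambda i: delta[i])
    # Backtracking: follow the stored predecessors from the end
    path = [last]
    for step in reversed(psi):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical two-state, two-symbol model used only as an example.
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

if __name__ == "__main__":
    print(viterbi([0, 1], PI, A, B))
    print(viterbi([1, 1, 1], PI, A, B))
```

For a profile HMM, the decoded path is what aligns a query sequence to the match, insert, and delete states.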
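Slide 24: Solution to problem 3
Baum-Welch Algorithm
- Estimate $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$
- No analytic method because of complexity: iterative solution
- Baum-Welch algorithm (actually an EM algorithm)
- 1. Let the initial model be $\lambda_0$
- 2. Compute a new $\lambda$ based on $\lambda_0$ and the observation $O$
- 3. If $\log P(O \mid \lambda) - \log P(O \mid \lambda_0) < \Delta$, stop
- 4. Else set $\lambda_0 \leftarrow \lambda$ and go to step 2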
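The iteration can be sketched for a single observation sequence as below. This is a minimal, unscaled version under stated assumptions: the starting parameters are hypothetical, the stopping threshold `delta` and the iteration cap are arbitrary choices, and real profile HMM training would use multiple sequences, pseudocounts, and log-space arithmetic.

```python
import math


def forward_backward(obs, pi, A, B):
    """Return (alpha, beta, P(O | lambda)) for one observation sequence."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    return alpha, beta, sum(alpha[T - 1])


def reestimate(obs, pi, A, B):
    """One EM step: expected counts under the old model give the new model."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta, p_obs = forward_backward(obs, pi, A, B)
    # gamma[t][i]: posterior probability of being in state i at time t
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(n)] for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T)) for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B


def baum_welch(obs, pi, A, B, delta=1e-6, max_iter=50):
    """Iterate until the log-likelihood gain drops below delta."""
    log_p = math.log(forward_backward(obs, pi, A, B)[2])
    for _ in range(max_iter):
        pi, A, B = reestimate(obs, pi, A, B)
        new_log_p = math.log(forward_backward(obs, pi, A, B)[2])
        if new_log_p - log_p < delta:
            log_p = new_log_p
            break
        log_p = new_log_p
    return pi, A, B, log_p


# Hypothetical starting model and observation sequence.
PI0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.5], [0.1, 0.9]]
OBS = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]

if __name__ == "__main__":
    print(baum_welch(OBS, PI0, A0, B0))
```

Each EM step cannot decrease the likelihood, but the result is only a local optimum that depends on the initial model.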
Slide 25: Preventing Overfitting
- Pseudocounts (fake counts)
- It is dangerous to estimate a probability distribution from just a few examples
- Pretend you saw an amino acid in a position even though it wasn't there
- Sequence weighting
- Some sequences are more frequent than others
- Get more data!
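A minimal sketch of the pseudocount idea, assuming the simplest add-one (Laplace) scheme over the 20 amino acids. The function name and the choice of one pseudocount per residue are illustrative; tools such as SAM use more refined priors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission_with_pseudocounts(column, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Emission probabilities for one alignment column with added fake counts."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


if __name__ == "__main__":
    # A column observed in only five sequences: three G and two A.
    dist = emission_with_pseudocounts("GGGAA")
    print(dist["G"], dist["A"], dist["W"])
```

Residues that never appeared in the column, such as W here, now get a small non-zero probability instead of ruling out a sequence entirely.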
Slide 26: Pseudocount
Slide 27: SAM-T98 (software tool)
Slide 28: SVM-Fisher
Slide 29: Discriminative Framework for Detecting Remote Protein Homologies
Source: Jaakkola et al., A Discriminative Framework for Detecting Remote Protein Homologies, Journal of Computational Biology
- A variant of support vector machines using a new kernel function
- Kernel function: derived from a generative statistical model for a protein family, in this case an HMM
- Use generative statistical models built from multiple sequences, in this case HMMs, as a way of extracting features from protein sequences. This maps all protein sequences to points in a Euclidean feature space of fixed dimension.
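Slide 30: Method
- $X = (x_1, \dots, x_n)$: protein sequence, where $x_i$ is an amino acid residue
- $H_1$: estimated HMM for a particular protein family
- $P(X \mid H_1)$: the corresponding probability model
- Likelihood ratio score, used in place of the simple probability $P(X \mid H_1)$: $L(X) = \log \frac{P(X \mid H_1)}{P(X \mid H_0)}$, where $H_0$ is the null model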
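Slide 31: Method
- 1. Discriminative approaches
- By Bayes' rule, $P(H_1 \mid X) = \frac{P(X \mid H_1) P(H_1)}{P(X \mid H_1) P(H_1) + P(X \mid H_0) P(H_0)}$
- $P(H_1 \mid X)$: posterior probability of the model
- This is the posterior probability that the sequence $X$ belongs to the protein family being modeled
- Score function $L(X)$: the log posterior odds score, $L(X) = \log \frac{P(H_1 \mid X)}{P(H_0 \mid X)} = \log \frac{P(X \mid H_1)}{P(X \mid H_0)} + \log \frac{P(H_1)}{P(H_0)}$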
Slide 32: Method
- 2. Kernel methods
- $K(X_i, X)$: kernel function
- a measure of pairwise similarity between the training example $X_i$ and the new example $X$
- The sign of the discriminant function $L(X)$ determines the predicted class for any sequence $X$
Slide 33: Method
- 3. The Fisher kernel
- Fisher score: $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
- the gradient with respect to the parameters of the HMM
- $P(X \mid H_1, \theta)$: the corresponding probability model, an HMM estimated for a particular family of proteins
- $\theta$ includes the output (emission) and the transition probabilities of an HMM trained to model the protein family
Slide 34: Method
- Fisher score
- the gradient with respect to the parameters of the HMM
- Probability value of the HMM for each sequence
Slide 35: Method
- Fisher score
- Derivatives of $\log P(X \mid H_1, \theta)$ with respect to the emission probabilities
Slide 36: Method
- Fisher score vector $U_X$ relative to the emission probabilities
- a vector whose components are indexed by $(x, s)$, with the corresponding values given by $U_X(x, s) = \frac{\xi(x, s)}{\theta_{x \mid s}} - \xi(s)$
- Dimension of the Fisher score vector: $20m$ ($m$ = number of states)
- $\xi(x, s)$: expected posterior frequency of visiting state $s$ and generating residue $x$; $\xi(s) = \sum_x \xi(x, s)$
Slide 37: Method
- A natural (squared) distance between the gradient vectors
- Quantifies the similarity between two fixed-length gradient vectors $U_X$ and $U_{X'}$ corresponding to two sequences $X$ and $X'$
- Gaussian kernel: $K(X, X') = \exp\left( -\frac{\lVert U_X - U_{X'} \rVert^2}{2\sigma^2} \right)$
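Given two Fisher score vectors of the same length, the Gaussian kernel is a one-line computation. The sketch below treats the vectors as plain lists of floats; the width `sigma` is a free parameter here, and how it should be chosen for real data is outside this example.

```python
import math


def gaussian_kernel(u_x, u_y, sigma=1.0):
    """exp(-||u_x - u_y||^2 / (2 sigma^2)) for two equal-length score vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u_x, u_y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


if __name__ == "__main__":
    u1 = [0.0, 0.0]   # hypothetical Fisher score vectors
    u2 = [3.0, 4.0]
    print(gaussian_kernel(u1, u1))           # identical vectors
    print(gaussian_kernel(u1, u2, sigma=5))  # squared distance 25
```

The kernel is 1 for identical score vectors and decays toward 0 as the vectors move apart.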
Slide 38: Method
- Method summary (SVM-Fisher method)
- 1. Begin with an HMM trained from positive examples to model a given protein family.
- 2. Use this HMM to map each new protein sequence $X$ we want to classify into a fixed-length vector, its Fisher score.
- 3. Compute the kernel function on the basis of the Euclidean distance between the score vector for $X$ and the score vectors for known positive and negative examples $X_i$ of the protein family.
- 4. The resulting discriminant function is given by $L(X) = \sum_{i:\, X_i \text{ positive}} \lambda_i K(X, X_i) - \sum_{i:\, X_i \text{ negative}} \lambda_i K(X, X_i)$, where the $\lambda_i$ are the weights estimated by the SVM.
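Step 4 can be sketched as a weighted vote of kernel similarities. In the snippet below the Fisher score vectors, labels, and weights are all made up; in the actual method the weights would come from SVM training, which is not shown.

```python
import math


def gaussian_kernel(u_x, u_y, sigma=1.0):
    """Gaussian kernel on two equal-length score vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u_x, u_y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def discriminant(u_query, examples, sigma=1.0):
    """Weighted kernel sum: positives add to the score, negatives subtract.

    examples: list of (score_vector, label, weight) with label +1 or -1.
    """
    return sum(label * weight * gaussian_kernel(u_query, u, sigma)
               for u, label, weight in examples)


# Hypothetical training examples (vectors, labels, and weights are made up).
EXAMPLES = [
    ([0.0, 0.0], +1, 1.0),   # known family member
    ([4.0, 0.0], -1, 1.0),   # known non-member
]

if __name__ == "__main__":
    for query in ([0.5, 0.0], [3.5, 0.0]):
        score = discriminant(query, EXAMPLES)
        print(query, "member" if score > 0 else "non-member")
```

The sign of the returned score gives the predicted class, matching the role of the discriminant function described earlier.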
Slide 39: Method
- 4. Combination of scores
- In many cases, we can construct more than one HMM for the family or superfamily of interest.
- Combine the scores from the multiple models rather than selecting just one.
- $L_i(X)$: the score for the query sequence $X$ based on the $i$-th model
Slide 40: Result
Slide 41: Result 2
- GPCR level 1 subfamily recognition
- GPCR level 2 subfamily recognition
Slide 43: Reference
- A. Gaulton, T. K. Attwood, Bioinformatics approach for the classification of GPCR, Current Opinion in Pharmacology 2003, 3:114-120
- Sean R. Eddy, Profile hidden Markov models, Bioinformatics vol. 14, no. 9, 1998, 755-763
- Jaakkola et al., A Discriminative Framework for Detecting Remote Protein Homologies, Journal of Computational Biology
- Visual Recognition Tutorial. Thad Starner, Alex Pentland. Visual Recognition of American Sign Language Using HMMs. In International Workshop on Automatic Face and Gesture Recognition, pages 189-194, 1995
Slide 44: Thank You!