Title: Pedro Ferreira, Paulo Azevedo
1. Protein Sequence Classification Through Relevant
Sequence Mining and Bayes Classifiers
Pedro Ferreira, Paulo Azevedo
Dep. Informatics, University of Minho
12th EPIA 2005, CMB workshop, Covilhã, Portugal
5 December 2005
2. Outline
- Motivation
- Types of Patterns
- Method
- Results
- Conclusions
3. Motivation
- Protein sequence classification is one of the
most important problems in protein sequence
analysis, with applications in many domains.
- Due to the exponential growth of newly generated
sequences, it requires automatic and efficient
methods.
- Sequence patterns, or motifs, are elements
conserved across different proteins.
- Since these patterns are tightly related to the
function and structure of proteins, they can be
used as a tool to classify the function or family
of a protein.
- Automatic classification of protein sequences
attracts considerable effort from the
bioinformatics and data mining communities.
4. Some Notations
- A linear sequence is composed of successive
atomic elements, generically called events
(here, amino acids).
- A sequence pattern is frequent if it is a
subsequence of a number of sequences in the
dataset greater than or equal to a specified
threshold value, the minimum support.
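The definition of a frequent pattern can be illustrated with a minimal sketch. The function names and the toy dataset below are hypothetical, and the pattern is treated as a plain ordered subsequence with arbitrary gaps; this is an illustration of the definition, not the mining algorithm used in the talk.

```python
from typing import Iterable, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if the events of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `event in remaining` advances the iterator, so order is respected.
    return all(event in remaining for event in pattern)


def support(pattern: Sequence[str], dataset: Iterable[Sequence[str]]) -> int:
    """Number of sequences in the dataset that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in dataset)


def is_frequent(pattern: Sequence[str], dataset: Iterable[Sequence[str]],
                min_support: int) -> bool:
    """A pattern is frequent if its support reaches the minimum support."""
    return support(pattern, dataset) >= min_support


# Hypothetical toy sequences, for illustration only.
toy_dataset = ["MKVLA", "MAVLC", "KVLMA", "GGGGG"]
```

With this toy data, `support("VL", toy_dataset)` counts three sequences, so "VL" is frequent for a minimum support of 3, while "MVA" is found in only one sequence.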
5. Types of Patterns
Patterns or motifs are typically classified into
two types:
- Deterministic Patterns: consist of words over a
defined syntax. Besides the protein alphabet
(amino acids), they may contain wild-cards and
fixed or variable length gaps to enhance their
expressive power. Example (PROSITE database):
C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H
- Probabilistic Patterns: describe a model that
assigns a probability to the pattern matching a
given sequence. Example: PWM (Position Weight
Matrix).
We will only consider deterministic patterns.
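Deterministic patterns in PROSITE notation map naturally onto regular expressions. The following is a minimal sketch, assuming the residue class is written in square brackets as in PROSITE syntax ([LIVMFYWC]). The function name `prosite_to_regex` is hypothetical, and only the subset of the syntax used in the example is handled (single residues, bracketed classes, x, x(n) and x(n,m)).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        gap = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if gap is None:
            parts.append(element)            # single residue or [class]
        elif gap.group(1) is None:
            parts.append(".")                # bare wild-card x
        elif gap.group(2) is None:
            parts.append(".{%s}" % gap.group(1))                     # fixed gap x(n)
        else:
            parts.append(".{%s,%s}" % (gap.group(1), gap.group(2)))  # x(n,m)
    return "".join(parts)


zinc_finger = "C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H"
regex = prosite_to_regex(zinc_finger)

# Hypothetical peptide built to satisfy the pattern, for illustration only.
peptide = "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
has_match = re.search(regex, peptide) is not None
```

Here `regex` is `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`, and the constructed peptide matches it.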
6. Types of Patterns
Consider patterns of the form
A1 - x(p1, q1) - A2 - x(p2, q2) - ... - An
- Flexible Gap Patterns (FPs): contain gaps of
variable size, greater than or equal to zero,
with pi ≤ qi for any i. From a biological point
of view, FPs make it possible to find relations
in larger sets of proteins with a larger span.
- Rigid Gap Patterns (RPs): gaps have a fixed size
across all database occurrences of the sequence
pattern, with pi = qi for any i. RPs express
strongly conserved regions, tightly related to
the function or structure of the proteins.
- Relevant Pattern: a pattern that is frequent and
satisfies a minimal length.
7. Pattern Constraints
- Event Constraints: define the set of allowed
events.
- Gap Constraints (minGap and maxGap): define the
minimum and maximum distance between adjacent
events.
- Window Constraints: bound the window size of the
pattern, i.e. the distance between its first and
last events.
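The three kinds of constraints can be expressed as a simple check on one occurrence of a pattern. This is a sketch under stated assumptions: a gap is taken to be the number of positions skipped between two adjacent pattern events, the window is taken to be the number of positions spanned from the first to the last event, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence, Set


def satisfies_constraints(events: Sequence[str], positions: List[int],
                          allowed_events: Set[str],
                          min_gap: int, max_gap: int, window: int) -> bool:
    """Check event, gap and window constraints for one pattern occurrence.

    `positions` holds the (non-empty, increasing) sequence indices at which
    the pattern events were matched.
    """
    # Event constraint: every event must belong to the allowed set.
    if not set(events) <= allowed_events:
        return False
    # Gap constraint: positions skipped between adjacent events (assumed definition).
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    if not all(min_gap <= g <= max_gap for g in gaps):
        return False
    # Window constraint: positions covered from first to last event (assumed definition).
    span = positions[-1] - positions[0] + 1
    return span <= window


AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
```

For example, an occurrence at positions 0, 2 and 4 passes with maxGap 15 and a window of 20, while an occurrence at positions 0, 10 and 21 fails the window constraint because it spans 22 positions.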
8. Example
Sequences:
1 2 2 3 4
1 2 3 4 5
1 6 3 7 5
Flexible Pattern: 1 - x(1,2) - 3 - 4 (occurs in the first two sequences)
Rigid Pattern: 1 . 3 . 5 (occurs in the last two sequences)
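The example can be checked mechanically with regular expressions. The sketch below assumes the fifteen events on the slide form three sequences of five events each (a reading of the slide layout), and encodes each single-digit event as one character; the helper name `matching` is hypothetical.

```python
import re

# Assumed reading of the slide: three sequences of five events each.
sequences = ["12234", "12345", "16375"]

# Flexible gap pattern 1 - x(1,2) - 3 - 4: one or two arbitrary events after 1.
flexible = re.compile(r"1.{1,2}34")
# Rigid gap pattern 1 . 3 . 5: exactly one arbitrary event in each gap.
rigid = re.compile(r"1.3.5")


def matching(pattern: "re.Pattern[str]") -> list:
    """Return the sequences that contain an occurrence of the pattern."""
    return [seq for seq in sequences if pattern.search(seq)]


flexible_hits = matching(flexible)
rigid_hits = matching(rigid)
```

Under that reading, the flexible pattern is found in the first two sequences and the rigid pattern in the last two, so each has a support of 2.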
9. Method: Goal
- Our goal is to suggest a robust and adaptable
classification method using a straightforward
algorithm.
- Our method:
  - performs multi-class classification
  - does not require sequence transformation (direct
  sequence classifier)
  - does not require multiple alignment or
  background knowledge
10. Method: Formulation
- Given a collection of classified sequences D, a
query sequence Q, a minimum support s and a
minimal length L, determine the similarity of Q
w.r.t. all the classes in D.
- We used a query-driven sequence mining algorithm
that, for each Q, D, s and L, extracts the number
of relevant patterns and the average length of
the patterns.
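The two extracted parameters can be sketched as a small post-processing step over patterns that have already been mined for one class. This is an illustration only: the mining step itself is not shown, the patterns are assumed to be frequent already, pattern length is assumed to count concrete events (ignoring "." wild-cards), and all names are hypothetical.

```python
from typing import List, Tuple


def pattern_length(pattern: str) -> int:
    """Number of concrete events in a rigid gap pattern (assumed definition)."""
    return sum(1 for symbol in pattern if symbol != ".")


def parameter_vector(frequent_patterns: List[str],
                     min_length: int) -> Tuple[int, float]:
    """Return (number of relevant patterns, average length of those patterns)."""
    # Relevant = frequent (assumed on input) and at least `min_length` long.
    relevant = [p for p in frequent_patterns if pattern_length(p) >= min_length]
    if not relevant:
        return 0, 0.0
    average = sum(pattern_length(p) for p in relevant) / len(relevant)
    return len(relevant), average


# Hypothetical patterns shared by a query and one class, for illustration only.
example = parameter_vector(["A.C", "AC.D.E", "K"], min_length=2)
```

In this toy case two patterns reach the minimal length, with lengths 2 and 4, so the vector is (2, 3.0). One such vector would be computed per class.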
11. Method: Bayes Classifier
The goal is to assign a probability to Q w.r.t.
each of the classes C1, C2, ..., Cn, based on the
vector of observed parameters. This can be
achieved through the conditional probability of
each class given that vector, using Bayes'
Theorem (Eq 1), whose terms include:
- the a priori probability of the class
- the probability of the parameters (class
independent)
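For reference, Bayes' theorem in its standard form is written out below. The symbol x for the vector of observed parameters is a label chosen here for illustration; the notation used on the original slide is not available.

```latex
P(C_i \mid \mathbf{x}) \;=\; \frac{P(\mathbf{x} \mid C_i)\, P(C_i)}{P(\mathbf{x})},
\qquad i = 1, \dots, n
```

Here P(C_i) is the a priori probability of class C_i, and P(x) is the probability of the parameters, which does not depend on the class and therefore does not affect which class scores highest.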
12. Method: Bayes Classifier
We weight the parameters and rewrite the Bayes
formula (Eq 2). We assume that the parameters are
statistically independent (not entirely true).
One term of Eq 2 is a constant for the respective
class Ci.
13. Method: Bayes Classifier
We suggest three models based on the model of Eq
2.
- (A) The a priori probability of the classes is
not taken into account.
- (B) The inverse a priori probability is used: to
avoid bias due to different family sizes, the a
priori probability is normalized by the length of
the class.
- (C) A term of the formula is raised to the power
of three: the parameter "number of patterns" is
given a greater relative weight.
14. Method: Bayes Classifier
Given a query sequence Q and the respective
parameter vector, the classification is simply
given by the class Ci with the highest
probability.
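The decision rule can be sketched as a generic weighted naive Bayes scorer followed by an argmax over the classes. This is not the exact formula of Eq 2, which is not reproduced in the text: the per-parameter likelihoods are taken as given inputs, the weights act as exponents, the prior is optional, and all names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Optional


def class_scores(likelihoods: Dict[str, List[float]],
                 priors: Optional[Dict[str, float]] = None,
                 weights: Optional[List[float]] = None) -> Dict[str, float]:
    """Score each class as prior * product of likelihood_j ** weight_j.

    Assumes the parameters are independent given the class.
    """
    scores = {}
    for cls, probs in likelihoods.items():
        w = weights if weights is not None else [1.0] * len(probs)
        score = math.prod(p ** wj for p, wj in zip(probs, w))
        if priors is not None:  # leaving priors out ignores the class prior
            score *= priors[cls]
        scores[cls] = score
    return scores


def classify(likelihoods: Dict[str, List[float]],
             priors: Optional[Dict[str, float]] = None,
             weights: Optional[List[float]] = None) -> str:
    """Return the class with the highest score (argmax)."""
    scores = class_scores(likelihoods, priors, weights)
    return max(scores, key=scores.get)


# Hypothetical likelihoods for [number of patterns, average length].
toy = {"F1": [0.5, 0.4], "F2": [0.3, 0.8]}
unweighted = classify(toy)
cubed_first = classify(toy, weights=[3.0, 1.0])
```

With these toy numbers the unweighted scores favour F2 (0.24 against 0.20), while cubing the first parameter favours F1, which shows how giving the number of patterns a larger exponent can change the decision.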
15. Results: Setup
- Use the Query Driven Miner to extract Rigid Gap
Patterns, with maxGap = 15 and WindowSize = 20.
- Three collections of protein families:
  - Pfam version 17.0 (26 families, mostly taken from
  the top-20 list, April 2005)
  - Pfam version 1.0 (50 families)
  - Prosite Receptors Group (27 families)
- Competitors: Probabilistic Suffix Trees (PST) and
Sparse Markov Transducers (SMT).
- Evaluation based on leave-one-out methodology,
according to the precision rate (PR).
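The leave-one-out protocol can be sketched as follows. The precision rate is assumed here to be the fraction of held-out sequences assigned to their true family; the stand-in classifier, the function names and the toy data are all hypothetical and unrelated to the method of the talk.

```python
from typing import Callable, List, Tuple

Labeled = Tuple[str, str]  # (sequence, family label)


def leave_one_out_precision(dataset: List[Labeled],
                            classify: Callable[[List[Labeled], str], str]) -> float:
    """Hold out each sequence in turn, classify it against the rest,
    and return the fraction classified into the correct family."""
    correct = 0
    for i, (query, label) in enumerate(dataset):
        training = dataset[:i] + dataset[i + 1:]
        if classify(training, query) == label:
            correct += 1
    return correct / len(dataset)


def toy_classifier(training: List[Labeled], query: str) -> str:
    """Stand-in classifier: label of the training sequence sharing the
    most distinct symbols with the query."""
    best = max(training, key=lambda item: len(set(item[0]) & set(query)))
    return best[1]


# Hypothetical data; the last sequence is deliberately hard to classify.
toy_data = [("AAAC", "X"), ("AACA", "X"), ("GGGT", "Y"),
            ("GGTG", "Y"), ("CCCC", "Y")]
precision_rate = leave_one_out_precision(toy_data, toy_classifier)
```

On this toy data four of the five held-out sequences are classified correctly, giving a precision rate of 0.8. Any classifier with the same call signature can be plugged in.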
16. Results
Pfam version 17.0 (26 families)
Average results
17. Results
Prosite receptors (27 families)
Similarity matrix based on True Positives (main
diagonal) and False Negatives.
18. Results
Pfam 1.0 (50 families)
- Applied a two-tailed signed rank test to test the
null hypothesis that the medians of the pairs of
classifiers (C and PST, C and SMT) are equal.
- The medians for C and PST are significantly
different.
- For C and SMT the null hypothesis is not
rejected: there is no significant difference.
- Previously published results
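A two-tailed signed rank test on paired results can be sketched in plain Python. This version uses average ranks for tied differences and a normal approximation for the p-value, without tie or continuity corrections, so it is only a rough illustration for small samples. The function name and the paired values are hypothetical and are not results from the talk.

```python
import math
from typing import List, Tuple


def signed_rank_test(a: List[float], b: List[float]) -> Tuple[float, float]:
    """Return (W, two-tailed p-value) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed, normal approximation
    return w, p_value


# Hypothetical paired precision rates (percent) for two classifiers.
scores_a = [90, 85, 88, 92, 75, 80]
scores_b = [85, 86, 80, 90, 70, 83]
w_stat, p_val = signed_rank_test(scores_a, scores_b)
```

The null hypothesis of equal medians is rejected when the p-value falls below the chosen significance level; otherwise no significant difference is claimed.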
19. Conclusions: Performance Factors
- Model C has a higher Precision Rate.
- The average length parameter has a bigger impact.
- Lower support values result in a higher Precision
Rate (they allow finding patterns shared by
smaller subsets of sequences).
- The support values used are a trade-off between
precision and performance.
- Patterns reveal local and global similarity.
- The method does not discriminate patterns based
on biological or statistical relevance.
20. Conclusions
- We propose a straightforward method to perform
multi-class and multi-domain classification.
- Based on a Bayesian classifier, three
probabilistic models are suggested.
- The method shows performance equivalent to
state-of-the-art methods.
- Greatest drawback: a priori determination of
support values.
- In order to improve the precision of the method,
patterns need to be discriminated according to
their biological and statistical relevance.
21. Thanks. Questions?