Pedro Ferreira, Paulo Azevedo

1
Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers
Pedro Ferreira, Paulo Azevedo
Dep. Informatics, University of Minho
12th EPIA 2005, CMB workshop, Covilhã, Portugal
5 December 2005
2
Outline

Motivation
Types of Patterns
Method
Results
Conclusions

3
Motivation
Protein Sequence Classification is one of the most important problems in protein sequence analysis, with applications in many domains.
Due to the exponential growth of newly generated sequences, it requires automatic and efficient methods.
Sequence Patterns or Motifs are elements conserved across different proteins.
Since these patterns are tightly related to the function and structure of the proteins, they can be used as a tool to classify the function or family of the proteins.
Automatic classification of protein sequence patterns attracts a large effort from the bio and data mining (DM) communities!
4
Some Notations
A linear sequence is a sequence composed of successive atomic elements, generically called events (here, amino acids).
A sequence pattern is a Frequent Sequence Pattern if it is a subsequence of a number of sequences in the dataset greater than or equal to a specified threshold value, the minimum support.
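The notion of support can be sketched in a few lines. This is only an illustration: the function names and the toy sequences are made up, and the minimum support is assumed to be an absolute count of sequences rather than a fraction.

```python
from typing import Iterable, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if the events of `pattern` occur in `sequence` in the same order
    (not necessarily contiguously)."""
    remaining = iter(sequence)
    # `event in remaining` consumes the iterator up to the first match,
    # so later events can only match further to the right.
    return all(event in remaining for event in pattern)


def support(pattern: Sequence[str], dataset: Iterable[Sequence[str]]) -> int:
    """Number of sequences in the dataset that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in dataset)


def is_frequent(pattern: Sequence[str], dataset, min_support: int) -> bool:
    """A pattern is frequent if its support reaches the minimum support."""
    return support(pattern, dataset) >= min_support


# Made-up toy sequences, for illustration only.
toy_dataset = ["MKVLA", "MKLLA", "GKVLT"]
```

With these toy sequences, the pattern "MKA" is a subsequence of the first two sequences only, so it is frequent for a minimum support of 2 but not for 3.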
5
Types of Patterns
Patterns or Motifs are typically classified into two types:
Deterministic Patterns consist of words over a defined syntax. Besides the protein alphabet (amino acids), they may contain wild-cards and fixed- or variable-length gaps to enhance the expressive power. Ex.: PROSITE database.
C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H
Probabilistic Patterns describe a model that assigns a probability of the pattern matching a given sequence. Ex.: PWM (Position Weight Matrix).
We will only consider deterministic patterns!
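A deterministic pattern of this kind maps naturally onto a regular expression. The sketch below translates a small subset of the PROSITE syntax (single residues, bracketed residue classes, x, x(n) and x(m,n)). It assumes the residue class in the slide's example is written in square brackets, as in the PROSITE convention. The helper name and the test sequence are made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE syntax into a Python regex.

    Handled elements: residues (C), residue classes ([LIV]),
    wild-cards (x), fixed gaps x(n) and variable gaps x(m,n).
    """
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        gap = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if gap is None:
            parts.append(element)                 # residue or [class], kept as is
        elif gap.group(1) is None:
            parts.append(".")                     # single wild-card
        elif gap.group(2) is None:
            parts.append(".{%s}" % gap.group(1))  # fixed-length gap
        else:
            parts.append(".{%s,%s}" % (gap.group(1), gap.group(2)))  # variable gap
    return "".join(parts)


slide_pattern = "C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H"
slide_regex = prosite_to_regex(slide_pattern)

# Made-up sequence built to contain one occurrence of the pattern.
toy_sequence = "MM" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
match = re.search(slide_regex, toy_sequence)
```

Here `slide_regex` is the plain regex form of the pattern, and `match` locates the occurrence inside the toy sequence.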
6
Types of Patterns
Consider patterns in the form A1 - x(p1, q1) - A2 - x(p2, q2) - ... - An
Flexible Gap Patterns (FPs) contain gaps with a size greater than or equal to zero, with pi ≤ qi for any i. From a biological point of view, FPs allow finding relations in larger sets of proteins, with a larger span!
Rigid Gap Patterns (RPs): gaps have a fixed size for all the database occurrences of the sequence pattern, with pi = qi for any i. RPs express strongly conserved regions, tightly related to the function or structure of the proteins!
Relevant Pattern: Frequent and satisfies a Minimal Length.
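One possible way to represent such patterns in code is sketched below. The class and function names are hypothetical, and the length of a pattern is assumed to be its number of events, which the slides do not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GapPattern:
    events: List[str]            # A1 ... An
    gaps: List[Tuple[int, int]]  # (p_i, q_i) between A_i and A_(i+1)

    def __post_init__(self) -> None:
        if len(self.gaps) != len(self.events) - 1:
            raise ValueError("need exactly one gap between each pair of events")
        if any(p < 0 or p > q for p, q in self.gaps):
            raise ValueError("each gap must satisfy 0 <= p <= q")

    def is_rigid(self) -> bool:
        """Rigid gap pattern: every gap has a fixed size (p_i == q_i)."""
        return all(p == q for p, q in self.gaps)


def is_relevant(pattern: GapPattern, support: int,
                min_support: int, min_length: int) -> bool:
    """Relevant = frequent and at least `min_length` events long (assumed measure)."""
    return support >= min_support and len(pattern.events) >= min_length


flexible = GapPattern(events=["C", "C", "H"], gaps=[(2, 4), (3, 3)])
rigid = GapPattern(events=["C", "C", "H"], gaps=[(2, 2), (3, 3)])
```

The first pattern is flexible because its first gap may span 2 to 4 positions, while the second is rigid because both gaps are fixed.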
7
Patterns Constraints
Event Constraints define the set of allowed events.
Gap Constraints (minGap and maxGap) define the min and max distance between adjacent events.
Window Constraints define the window distance of the pattern (the span between its first and last events).
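These three constraints can be checked on a single occurrence of a pattern, as in the sketch below. The function and its parameters are hypothetical. A gap is assumed to be the number of positions skipped between two adjacent events, and the window is assumed to be the number of positions from the first to the last event, both inclusive.

```python
from typing import Sequence, Set


def satisfies_constraints(events: Sequence[str], positions: Sequence[int],
                          allowed_events: Set[str],
                          min_gap: int, max_gap: int, window: int) -> bool:
    """Check event, gap and window constraints for one pattern occurrence.

    `positions` are the increasing sequence indices at which `events` matched.
    """
    # Event constraint: only allowed events may appear in the pattern.
    if not set(events) <= allowed_events:
        return False
    # Gap constraint: positions skipped between adjacent events.
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    if any(g < min_gap or g > max_gap for g in gaps):
        return False
    # Window constraint: total span covered by the occurrence.
    span = positions[-1] - positions[0] + 1
    return span <= window


AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
```

For instance, an occurrence at positions 0, 3 and 10 has gaps of 2 and 6 and a span of 11, so it passes with a minimum gap of 0, a maximum gap of 15 and a window of 20.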
8
Example
Sequences:
1 2 2 3 4
1 2 3 4 5
1 6 3 7 5
Flexible Pattern: 1 - x(1, 2) - 3 4
Rigid Pattern: 1 . 3 . 5
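The example can be checked mechanically under one reading of the slide, which is an assumption: the numbers form three sequences of five events each (1 2 2 3 4, 1 2 3 4 5 and 1 6 3 7 5), and the flexible pattern is 1 - x(1, 2) - 3 4. Each event is encoded as one character so that the patterns can be written as regular expressions.

```python
import re
from typing import List, Pattern

# Assumed reading of the slide: three sequences of five events each.
sequences = ["12234", "12345", "16375"]

# Flexible pattern 1 - x(1, 2) - 3 4: a gap of one or two events after the 1.
flexible = re.compile(r"1.{1,2}34")

# Rigid pattern 1 . 3 . 5: every gap is exactly one event.
rigid = re.compile(r"1.3.5")


def supporting(pattern: Pattern[str], dataset: List[str]) -> List[str]:
    """Sequences of the dataset in which the pattern occurs."""
    return [seq for seq in dataset if pattern.search(seq)]


flexible_hits = supporting(flexible, sequences)
rigid_hits = supporting(rigid, sequences)
```

Under this reading, the flexible pattern occurs in the first two sequences (with gaps of two and one events) and the rigid pattern occurs in the last two.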
9
Method Goal

Our goal is to suggest a robust and adaptable classification method using a straightforward algorithm.
Our method:
performs multi-class classification
does not require sequence transformation (direct sequence classifier)
does not require multiple alignment or background knowledge

10
Method formulation

Given a collection of classified sequences D, a query sequence Q, a minimum support s and a minimal length L, determine the similarity of Q w.r.t. all the classes in D.
We used a query-driven sequence mining algorithm that, for each Q, D, s and L, extracts the number of relevant patterns and the average length of the patterns.
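The shape of this computation can be sketched as follows. This is not the query-driven miner used by the authors: as a deliberately naive stand-in, it only considers contiguous substrings of Q instead of rigid gap patterns, and it treats the minimum support as an absolute count. All names and the toy data are made up.

```python
from typing import Dict, List, Tuple


def query_parameters(query: str, class_sequences: List[str],
                     min_support: int, min_length: int) -> Tuple[int, float]:
    """Return (number of relevant patterns, average pattern length) for one class.

    Naive stand-in: candidate patterns are the contiguous substrings of the
    query with at least `min_length` events.
    """
    candidates = {query[i:j]
                  for i in range(len(query))
                  for j in range(i + min_length, len(query) + 1)}
    relevant = [p for p in candidates
                if sum(p in seq for seq in class_sequences) >= min_support]
    if not relevant:
        return 0, 0.0
    return len(relevant), sum(len(p) for p in relevant) / len(relevant)


def parameters_per_class(query: str, dataset: Dict[str, List[str]],
                         min_support: int, min_length: int) -> Dict[str, Tuple[int, float]]:
    """Parameter vector of the query with respect to every class in D."""
    return {name: query_parameters(query, seqs, min_support, min_length)
            for name, seqs in dataset.items()}


# Made-up toy collection D with two classes.
toy_d = {"fam1": ["ABCD", "XBCD"], "fam2": ["ZZZZ"]}
toy_parameters = parameters_per_class("ABCD", toy_d, min_support=2, min_length=2)
```

For the toy query, the substrings BC, BCD and CD are supported by both sequences of fam1, giving three relevant patterns with an average length of 7/3, while fam2 yields none.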

11
Method Bayes Classifier
The goal is to assign a probability to Q w.r.t. all of the classes C1, C2, ..., Cn based on the vector of observed parameters $x$.
This can be achieved through the conditional probability $P(C_i \mid x)$, using Bayes' Theorem:
$P(C_i \mid x) = \frac{P(x \mid C_i)\, P(C_i)}{P(x)}$ (Eq 1)
$P(C_i)$ - A priori probability of the class
$P(x)$ - Probability of the parameters (class independent)
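A small numeric illustration of Bayes' theorem for this setting is given below. The likelihood and prior values are made up, and the function name is hypothetical. The probability of the parameters is obtained by summing over the classes, which is why it does not depend on any single class.

```python
from typing import Dict


def posteriors(likelihoods: Dict[str, float],
               priors: Dict[str, float]) -> Dict[str, float]:
    """Posterior probability of every class given the observed parameters.

    `likelihoods[c]` is the probability of the parameters given class c,
    `priors[c]` is the a priori probability of class c.
    """
    # Probability of the parameters: the same normalising term for all classes.
    evidence = sum(likelihoods[c] * priors[c] for c in likelihoods)
    return {c: likelihoods[c] * priors[c] / evidence for c in likelihoods}


# Made-up values: fam2 explains the parameters better, but fam1 is more common.
toy_posteriors = posteriors({"fam1": 0.2, "fam2": 0.6},
                            {"fam1": 0.75, "fam2": 0.25})
```

In this toy case the higher likelihood of fam2 is exactly offset by the higher prior of fam1, so both classes end up with a posterior of one half.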
12
Method Bayes Classifier
We weight the parameters and rewrite the Bayes formula, assuming that the parameters are statistically independent (not entirely true).
In (Eq 2), one factor is a constant for the respective class Ci.
13
Method Bayes Classifier
We suggest three models based on the model of Eq 2.
(A) The a priori probability of the classes is not taken into account.
(B) The inverse a priori probability is used: to avoid bias due to different family sizes, the a priori probability is normalized by the length of the class.
(C) The term for the number of patterns is raised to the power of three, so that this parameter is given a greater relative weight.
14
Method Bayes Classifier
Given a query sequence Q and the respective parameter vector, the classification is simply given by the class Ci that maximizes the probability under the chosen model.
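A minimal sketch of the three models and of the final decision rule follows. Several points are assumptions, because the formulas did not survive in this transcript. The per-class probabilities of the two parameters are taken as given inputs. Model (B) multiplies by the inverse of the prior. Model (C) is assumed to ignore the prior, as model (A) does. All names and numbers are made up.

```python
from typing import Dict, Tuple


def class_score(p_count: float, p_length: float, prior: float, model: str) -> float:
    """Score of one class under an assumed reading of models (A), (B), (C).

    p_count  : probability of the observed number of patterns given the class
    p_length : probability of the observed average length given the class
    prior    : a priori probability of the class
    """
    if model == "A":    # prior not taken into account
        return p_count * p_length
    if model == "B":    # inverse prior, to counter family-size bias
        return (1.0 / prior) * p_count * p_length
    if model == "C":    # number-of-patterns term raised to the power of three
        return p_count ** 3 * p_length
    raise ValueError("unknown model: %r" % model)


def classify(likelihoods: Dict[str, Tuple[float, float]],
             priors: Dict[str, float], model: str) -> str:
    """Return the class with the highest score (arg max over the classes)."""
    return max(likelihoods,
               key=lambda c: class_score(likelihoods[c][0], likelihoods[c][1],
                                         priors[c], model))


# Made-up probabilities for two families.
toy_likelihoods = {"fam1": (0.5, 0.4), "fam2": (0.6, 0.3)}
toy_priors = {"fam1": 0.8, "fam2": 0.2}
```

With these toy numbers, model A picks fam1, while models B and C pick fam2. This shows how the prior correction and the extra weight on the number of patterns can change the decision.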
15
Results Setup

Use Query Driven Miner to extract Rigid Gap Patterns, with maxGap = 15 and WindowSize = 20.
Three collections of protein families:
Pfam version 17.0 (26 families, mostly taken from the top-20 list, April 2005)
Pfam version 1.0 (50 families)
Prosite Receptors Group (27 families)
Competitors: Probabilistic Suffix Trees (PST) and Sparse Markov Transducers (SMT).
Evaluation based on leave-one-out methodology, according to the precision rate (PR).
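A leave-one-out evaluation loop can be sketched as below. The precision rate is taken here to be the fraction of held-out sequences assigned to their true family, which is an assumption about the exact definition used. The toy classifier and data are made up and only serve to exercise the loop.

```python
from typing import Callable, Dict, List


def leave_one_out_precision(dataset: Dict[str, List[str]],
                            classify: Callable[[str, Dict[str, List[str]]], str]) -> float:
    """Hold out each sequence in turn, classify it against the rest,
    and return the fraction of correct classifications."""
    correct = total = 0
    for true_class, sequences in dataset.items():
        for i, query in enumerate(sequences):
            # Training collection: everything except the held-out sequence.
            training = {name: (seqs[:i] + seqs[i + 1:] if name == true_class else list(seqs))
                        for name, seqs in dataset.items()}
            correct += classify(query, training) == true_class
            total += 1
    return correct / total


def toy_classifier(query: str, training: Dict[str, List[str]]) -> str:
    """Made-up classifier: pick the class with most sequences sharing the first event."""
    return max(training,
               key=lambda name: sum(seq[0] == query[0] for seq in training[name]))


toy_data = {"a": ["AX", "AY", "BZ"], "b": ["BX", "BY", "BW"]}
toy_precision = leave_one_out_precision(toy_data, toy_classifier)
```

In the toy data only the sequence BZ of class a is misclassified, so five of the six held-out sequences are classified correctly.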

16
Results
Pfam version 17.0 (26 families)
Average results
17
Results
Prosite receptors (27 families)
Similarity matrix based on True Positives (main diagonal) and False Negatives.
18
Results
Pfam 1.0 (50 families)

Applied a 2-tailed signed rank test to test the null hypothesis that the medians of pairs of classifiers (C and PST, C and SMT) are equal.
The medians for C and PST are significantly different.
For C and SMT the null hypothesis is not rejected: there is no significant difference.
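For readers unfamiliar with the test, a two-tailed Wilcoxon signed rank test on paired per-family scores can be sketched with the standard library alone. The sketch uses the normal approximation without tie or continuity correction, which is a simplification. The scores below are invented for illustration and are not the results of the presentation.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-tailed signed rank test for paired samples.

    Returns (W, p) where W is the smaller of the positive and negative rank
    sums and p comes from the normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute differences
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p-value
    return w, p


# Invented per-family precision values (in percent) for two classifiers.
scores_1 = [90, 85, 70, 95, 88, 60, 77, 92]
scores_2 = [80, 83, 75, 85, 80, 61, 70, 90]
w_stat, p_value = wilcoxon_signed_rank(scores_1, scores_2)
```

The null hypothesis of equal medians is rejected when the p-value falls below the chosen significance level. In practice a library routine such as `scipy.stats.wilcoxon` would normally be preferred over a hand-written version.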
Previously published results

19
Conclusions Factors Performance

Model C has the higher Precision Rate
The average length parameter has a bigger impact
Lower support values result in a higher Precision Rate (they allow finding patterns among smaller subsets of sequences)
The support values used are a trade-off between precision and performance
Patterns reveal local and global similarity
The method does not discriminate patterns based on biological or statistical relevance

20
Conclusions

We propose a straightforward method to perform multi-class and multi-domain classification.
Based on a Bayesian classifier, three probabilistic models are suggested.
The method shows performance equivalent to state-of-the-art methods.
Greatest drawback: a priori determination of support values.
In order to improve the precision of the method, patterns need to be discriminated according to their biological and statistical relevance.

21
Thanks! Questions?
