Pedro Ferreira, Paulo Azevedo

1
Protein Sequence Pattern Mining with Constraints
Pedro Ferreira, Paulo Azevedo
Department of Informatics, University of Minho
9th PKDD 2005, Porto, Portugal, 5 October 2005
2
Outline

Motivation
Types of Patterns
Method
Results
Conclusions

3
Motivation
Sequence pattern discovery is one of the most important problems
in protein sequence analysis, with applications in many domains.
Sequence patterns, or motifs, are elements conserved across
different proteins. Because these patterns are tightly related to
the function and structure of proteins, they can be used as a
tool to predict the function or family of a protein. The
automatic extraction of sequence patterns attracts a large effort
from the bioinformatics (BIO) and data mining (DM) communities!
4
Some Notations
Linear Sequence: a sequence composed of successive atomic
elements, generically called events (here, amino acids).
Frequent Sequence Pattern: a pattern is frequent if it is a
subsequence of a number of sequences in the dataset greater than
or equal to a specified threshold value, the minimum support.
Cover: the list of identifiers of the sequences where the
pattern occurs.
Offset Lists: contain the positions of the events in the
database sequences.
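As a small illustration of these notions, the Python sketch below computes offset lists, the cover and the support test for a pattern over a toy database. The function names and the three toy sequences are assumptions made for this example; they do not come from the presentation.

```python
from collections import defaultdict


def offset_lists(database):
    """Map each event to {sequence id: [positions of the event]}."""
    offsets = defaultdict(dict)
    for sid, sequence in database.items():
        for position, event in enumerate(sequence):
            offsets[event].setdefault(sid, []).append(position)
    return dict(offsets)


def is_subsequence(pattern, sequence):
    """True if the events of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    # `event in remaining` consumes the iterator up to the first match,
    # so later events can only match at later positions.
    return all(event in remaining for event in pattern)


def cover(pattern, database):
    """Identifiers of the sequences in which the pattern occurs."""
    return sorted(sid for sid, seq in database.items()
                  if is_subsequence(pattern, seq))


def is_frequent(pattern, database, min_support):
    """A pattern is frequent when its support reaches the threshold."""
    return len(cover(pattern, database)) >= min_support


# Hypothetical toy database: sequence id -> amino-acid string.
toy_db = {1: "MKVLA", 2: "MAVLK", 3: "KLMVA"}
print(cover("MVL", toy_db))
print(is_frequent("MVL", toy_db, 2))
print(offset_lists(toy_db)["M"])
```

Here the support of a pattern is simply the length of its cover, which is why the cover list is enough to run the minimum support test.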
5
Types of Patterns
Patterns or Motifs are typically classified in two types.
Deterministic Patterns: consist of words over a defined syntax.
Besides the protein alphabet (amino acids), they may contain
wild-cards and fixed or variable length gaps to enhance their
expressive power. Ex: PROSITE database,
C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H
Probabilistic Patterns: describe a model that assigns a
probability of the pattern matching a given sequence.
Ex: PWM (Position Weight Matrix).
We will only consider deterministic patterns!
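To make the deterministic syntax concrete, the following Python sketch translates a PROSITE-style pattern into a regular expression and tests it against a sequence. It handles only the elements used in the example above (single amino acids, bracketed sets and `x` gaps); the function name and the test sequence are assumptions made for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style deterministic pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        # x, x(n) or x(n,m): a wild-card gap of fixed or variable length.
        gap = re.fullmatch(r"x(?:\((\d+)(?:,\s*(\d+))?\))?", element)
        if gap:
            low, high = gap.group(1), gap.group(2)
            if low is None:
                parts.append(".")
            elif high is None:
                parts.append(".{%s}" % low)
            else:
                parts.append(".{%s,%s}" % (low, high))
        else:
            # A single amino acid or a bracketed set such as [LIVMFYWC];
            # both are already valid regex fragments.
            parts.append(element)
    return "".join(parts)


zinc_finger = "C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H"
regex = prosite_to_regex(zinc_finger)
print(regex)

# Made-up sequence built to satisfy every element of the pattern.
sequence = "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
print(re.search(regex, sequence) is not None)
```

The variable length gaps map onto bounded regex quantifiers, which is why such patterns are called deterministic: a sequence either matches or it does not.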
6
Types of Patterns
Consider patterns of the form
A1 - x(p1, q1) - A2 - x(p2, q2) - ... - An
Concrete Patterns: only contain contiguous events.
Ambiguous Patterns: only contain contiguous events, and each
position may be occupied by two or more events.
Arbitrary Gap Patterns: contain gaps with a size greater than or
equal to zero, pi <= qi for any i.
Rigid Gap Patterns: gaps have a fixed size for all the database
occurrences of the sequence pattern, pi = qi for any i.
7
Example
Database sequences:
1 2 2 3 4
1 2 3 4 5
1 6 3 7 5
Flexible Pattern: 1 - x(1, 2) - 3 4
Rigid Pattern: 1 . 3 . 5
8
Method Problem Issues (I)
Sequence pattern mining algorithms come from two communities.
Data Mining: methods best suited for data with many sequences
(from hundreds of thousands to millions) of relatively small
length (from 10 to 20) and an alphabet of thousands of events.
Ex: GSP, SPAN, SPAM.
Bioinformatics: methods need to be very efficient when mining a
small number of sequences (in the order of hundreds) of large
length (a few hundred events). The alphabet size is very small
(e.g. 4 for DNA and 20 for protein sequences). Ex: Teiresias.
9
Method Problem Issues (II)
Major Problem: these methods usually generate too many patterns!
When mining biological data (proteins or DNA), too many patterns
make interpretation difficult for the user, since interesting
patterns are lost among spurious ones.
Solution: introduce new interest measures or impose user
restrictions (constraints) on the mining process.
Interest measures can be considered independent of the mining
process.
Constraints enhance user queries and help to confine the search.
10
Method Constraints Substitution Sets
Event Constraints: define the set of allowed events.
Gap Constraints: maxGap and minGap, the maximum and minimum
distance between adjacent events.
Window Constraints: define the window distance of the pattern.
Substitution Sets: allow events to be substituted by other
related events without loss of meaning. Especially useful in
biology!
11
Method Properties
Property 1 (Anti-Monotonic): All supersequences of an infrequent
sequence are infrequent.
Property 2 (Sequence Transitive Extension): Let S = s1 s2 ... sn,
where CS is the cover list and OS is the offset list of S in the
database sequences. Let P = sj -> sm, where CP and OP are the
cover list and the offset list of P. If sn = sj, then
E = s1 s2 ... sn sm, where
CE = {X : X in intersect(CS, CP), OP(X) > OS(X)}
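One possible reading of Property 2 as code is sketched below. It assumes that the offset list of S stores, per sequence, the positions where S can end, and that the offset list of P stores the positions of its second event sm; both representations, and the function name, are assumptions of this sketch rather than details given in the slides.

```python
def transitive_extension(s_offsets, p_offsets):
    """Extend S with the second event of P, assuming last(S) == first(P).

    s_offsets: {sid: [offsets at which S ends]}
    p_offsets: {sid: [offsets of the new event sm]}
    Returns {sid: [offsets at which the extended sequence E can end]}.
    """
    extended = {}
    # Only sequences in both covers can contain E: intersect(CS, CP).
    for sid in s_offsets.keys() & p_offsets.keys():
        earliest_end = min(s_offsets[sid])
        # Order condition OP(X) > OS(X): sm must appear after S ends.
        later = [pos for pos in p_offsets[sid] if pos > earliest_end]
        if later:
            extended[sid] = later
    return extended


# Hypothetical offsets for S = AB and the new event C in the
# sequences 1: "ABC", 2: "CAB", 3: "ABAC".
ab_ends = {1: [1], 2: [2], 3: [1]}
c_positions = {1: [2], 2: [0], 3: [3]}
abc = transitive_extension(ab_ends, c_positions)
print(sorted(abc))   # cover of ABC
```

Sequence 2 is in both covers but fails the order condition, which shows why the cover intersection alone is not sufficient.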
12
Method Basic Idea
Basic Idea: successively extend a frequent sequence with a
frequent pair, as long as the first event of the pair is equal
to the last event of the sequence. Ex: ABC comes from A->B and
B->C.
Note: this differs from APRIORI-based algorithms, where the ABC
pattern results from joining AB with AC. SPADE needs two
frequent k-patterns p1 and p2 having the same (k-1)-pattern as
prefix to generate a (k+1)-pattern p. This requires keeping all
k-patterns in memory.
We propose a new algorithm called gIL!
13
Method Algorithm Phase One
Scanning Phase: only performed once!
BM: Bitmap Matrix (each cell contains the cover list and the
support count)
OM: Offset Matrix (vertical representation of the DB)
Example (Sid 5): 1->2, 1->3, 1->4, 2->2, 2->3, 2->4 and 3->4
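A minimal sketch of such a single scan is given below. It uses Python sets in place of real bitmaps, and it assumes the example sequence is 1 2 2 3 4, since that sequence produces exactly the seven pairs listed above; neither choice is confirmed by the slides.

```python
from collections import defaultdict


def scan(database):
    """Single pass over the database that builds both matrices."""
    offset_matrix = defaultdict(dict)   # event -> {sid: [positions]}
    bitmap_matrix = defaultdict(set)    # (a, b) -> sids where a precedes b
    for sid, sequence in database.items():
        seen = set()                    # events met so far in this sequence
        for position, event in enumerate(sequence):
            offset_matrix[event].setdefault(sid, []).append(position)
            for earlier in seen:
                bitmap_matrix[(earlier, event)].add(sid)
            seen.add(event)
    return dict(bitmap_matrix), dict(offset_matrix)


# Assumed content of the sequence with Sid 5.
bm, om = scan({5: [1, 2, 2, 3, 4]})
print(sorted(bm))    # the ordered pairs found in the sequence
print(om[2])         # offsets of event 2, per sequence id
```

The support count of a cell is the size of its cover, for instance `len(bm[(1, 2)])`.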
14
Method Algorithm Phase Two

Sequence Extension Phase
Guided by the Bitmap Matrix, the search space can be traversed
using two possible approaches: BFS or DFS.
The Offset Matrix is used to ensure the order of the events.
The set of frequent sequences starts as the set of frequent
pairs.
DFS: start with a sequence of size 2 and successively expand it
until it cannot be extended further. Then backtrack and start
again.
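The depth-first variant can be sketched as follows. This is not the authors' implementation: it ignores gap constraints, keeps only the earliest end offset of each pattern, and takes the pair covers and offsets as plain dictionaries. All names and the toy input (sequences 1: ABC, 2: ABC, 3: ACB) are assumptions.

```python
from collections import defaultdict


def mine_dfs(pair_cover, offsets, min_support):
    """Depth-first extension of frequent pairs; returns {pattern: cover}."""
    frequent_pairs = sorted(pair for pair, sids in pair_cover.items()
                            if len(sids) >= min_support)
    successors = defaultdict(list)      # event -> events that may follow it
    for first, second in frequent_pairs:
        successors[first].append(second)

    found = {}

    def extend(pattern, ends):
        # `ends` maps sid -> earliest offset at which `pattern` can end.
        found[pattern] = sorted(ends)
        for event in successors[pattern[-1]]:
            new_ends = {}
            for sid, end in ends.items():
                later = [p for p in offsets[event].get(sid, []) if p > end]
                if later:                       # order test
                    new_ends[sid] = later[0]
            if len(new_ends) >= min_support:    # support test
                extend(pattern + (event,), new_ends)
        # Returning from this call is the backtracking step.

    for first, second in frequent_pairs:
        ends = {}
        for sid in pair_cover[(first, second)]:
            start = offsets[first][sid][0]
            ends[sid] = next(p for p in offsets[second][sid] if p > start)
        extend((first, second), ends)
    return found


# Hypothetical input for the sequences 1: "ABC", 2: "ABC", 3: "ACB".
offsets = {"A": {1: [0], 2: [0], 3: [0]},
           "B": {1: [1], 2: [1], 3: [2]},
           "C": {1: [2], 2: [2], 3: [1]}}
pair_cover = {("A", "B"): {1, 2, 3}, ("A", "C"): {1, 2, 3},
              ("B", "C"): {1, 2}, ("C", "B"): {3}}
patterns = mine_dfs(pair_cover, offsets, min_support=2)
print(patterns)
```

Only the current branch of the search is held on the call stack, in contrast with approaches that keep all patterns of the current length in memory.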

15
Method Algorithm Phase Two
Sequence Extension Phase: if a new sequence respects the two
properties
1. Support Test: has minimum support
2. Order Test: respects the order (property 2)
then, by property 3, it can be considered frequent.
Property 3: Given a minimum support, a frequent sequence
S = s1 s2 ... sn and a pair P = sk -> sm. If sn = sk, then
E = s1 s2 ... sn - x(p, q) - sm (in flexible gap mode) or
E = s1 s2 ... sn - r(p) - sm (in rigid gap mode)
is frequent if it satisfies properties 1 and 2.
16
Method Implementing Constraints (I)
Events Exclusion: in the BM, set to zero the support count of
the rows and columns of the events to be excluded.
Start Events: in the initial set of pairs, only consider pairs
that start with the defined events.
Substitution Sets: one or more sets of equivalent events are
available. For each set of equivalent events do:
In the BM: union of the rows (horizontal union) and columns
(vertical union) of the Bitmap Matrix where those events occur.
In the OM: pairwise intersect the sequences where they occur
and then perform the union of the offsetLists. The new
offsetLists are assigned to the events.
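The Bitmap Matrix part of these operations might look like the sketch below, which stores the matrix as a dictionary from event pairs to cover sets. The representation, the function names and the toy matrix are assumptions; the Offset Matrix update is not covered.

```python
def exclude_events(bitmap_matrix, excluded):
    """Empty every cell that lies in a row or column of an excluded event."""
    return {
        (a, b): set() if a in excluded or b in excluded else set(sids)
        for (a, b), sids in bitmap_matrix.items()
    }


def apply_substitution_set(bitmap_matrix, equivalent):
    """Share rows and columns among the events of one substitution set."""
    def group(event):
        # Events in the substitution set stand for the whole set.
        return equivalent if event in equivalent else {event}

    events = {e for pair in bitmap_matrix for e in pair} | set(equivalent)
    merged = {}
    for a in events:
        for b in events:
            sids = set()
            for x in group(a):          # union over the rows
                for y in group(b):      # union over the columns
                    sids |= bitmap_matrix.get((x, y), set())
            if sids:
                merged[(a, b)] = sids
    return merged


# Hypothetical matrix: pair -> cover (set of sequence ids).
bm = {("A", "I"): {1}, ("A", "L"): {2}, ("L", "C"): {3}}
merged = apply_substitution_set(bm, {"I", "L"})
print(merged[("A", "I")])                 # cover shared by I and L
print(exclude_events(bm, {"L"}))
```

After the union, a pair such as A followed by I gains the sequences where A is followed by L, so its support count reflects the whole substitution set.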
17
Method Implementing Constraints (II)
Max, Min and Window Constraints: in the order test (supported
by property 2), verify these three conditions.
Note: the maxGap constraint is not on the pattern itself but on
its occurrences. Boulicaut and Jeudy (2005) consider it neither
monotonic nor anti-monotonic. This type of constraint requires
an examination of the database sequences.
Ex: A->C is infrequent w.r.t. maxGap, B->C is frequent and
A->B->C is frequent.
By means of the Offset Matrix, this check is straightforward in
gIL!
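The example can be reproduced with a short sketch that counts support under a maxGap constraint by inspecting the occurrences directly. Here the gap is taken to be the number of events strictly between two adjacent pattern events; that definition, the function names and the two toy sequences are assumptions of this sketch.

```python
def occurs(pattern, sequence, max_gap):
    """True if `pattern` occurs with at most `max_gap` events between
    each pair of adjacent pattern events."""
    def extend(index, last_pos):
        if index == len(pattern):
            return True
        first = last_pos + 1
        last = min(len(sequence), first + max_gap + 1)
        # Try every position allowed by the gap constraint.
        return any(sequence[pos] == pattern[index] and extend(index + 1, pos)
                   for pos in range(first, last))

    return any(sequence[pos] == pattern[0] and extend(1, pos)
               for pos in range(len(sequence)))


def support(pattern, database, max_gap):
    """Number of sequences with at least one valid occurrence."""
    return sum(occurs(pattern, seq, max_gap) for seq in database)


toy_db = ["AXBXC", "AYBYC"]   # hypothetical sequences
for pattern in ("AC", "BC", "ABC"):
    print(pattern, support(pattern, toy_db, max_gap=1))
```

With a minimum support of 2 and maxGap of 1, the subsequence AC is infrequent while its supersequence ABC is frequent, so pruning on AC alone would discard a valid pattern.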
18
Results Competitors
Rigid Patterns: Teiresias is a reference in the bioinformatics
community. This algorithm accepts the parameters minimum
support, L (number of non-wild-card events) and W (maximum span
between events). We set L = 2 and W to the maxGap value.
Flexible Patterns: SPAM, which has been shown to outperform
SPADE and PREFIXSPAN. The datasets are converted into the
transactional dataset format.
19
Results Flexible Gap Patterns (I)
Evaluated with real and synthetic datasets, always keeping in
mind the characteristics of protein datasets!
20
Results Flexible Gap Patterns (II)
Mining for flexible gap patterns expands the search space and
considerably increases the number of candidates. From a
biological point of view, FPs make it possible to find
relations in larger sets of proteins with a larger span!
21
Results Rigid Gap Patterns
RPs express strongly conserved regions, tightly related to the
function or structure of the proteins!
22
Conclusions

The algorithm is highly adaptable: it mines two types of
patterns.
Data organization allows a straightforward implementation of
constraints and substitution sets.
The algorithm is especially suitable for protein datasets but
can be used in other domains.
gIL combines some of the most efficient techniques from itemset
and sequence mining!
Extending one event at a time can efficiently handle the
explosive nature of pattern search in protein sequence datasets!