Title: Applying Pruning Techniques to Single-Class Emerging Substring Mining
1Applying Pruning Techniques to Single-Class
Emerging Substring Mining
- Speaker Sarah Chan
- Supervisor Dr. B. C. M. Kao
- M.Phil. Probation Talk
- CSIS DB Seminar
- Aug 30, 2002
2Presentation Outline
- Introduction
- The single-class ES mining problem
- Data structure: merged suffix tree
- Algorithms: baseline, s-pruning, g-pruning, l-pruning
- Performance evaluation
- Conclusions
3Introduction
- Emerging Substrings (ESs)
- A new type of KDD patterns
- Substrings whose supports (or frequencies) increase significantly from one class to another (measured by a growth rate)
- Motivation: Emerging Patterns (EPs) by Dong and Li
- Jumping Emerging Substrings (JESs) as a specialization of ESs
- Substrings which can only be found in one class but not others
4Introduction
- Emerging Substrings (ESs)
- Usefulness
- Capture sharp contrasts between datasets, or trends over time
- Provide knowledge for building sequence classifiers
- Applications (virtually endless)
- Language identification, purchase behavior analysis, financial data analysis, bioinformatics, melody track selection, web-log mining, content-based e-mail processing systems, ...
5Introduction
- Mining ESs
- Brute-force approach
- To enumerate all possible substrings in the database, find their support counts in each class, and check growth rate
- But a huge sequence database contains millions of sequences (GenBank had 15 million sequences in 2001), and
- No. of substrings in a sequence grows quadratically with sequence length: a length-n sequence has n(n+1)/2 substrings (a typical human genome has 3 billion characters)
- → Too many candidates
- → Expensive in terms of time ( O(|D|^2 n^3) ) and memory
- Other shortcomings: repeated substrings, common substrings, ... (Please refer to seminar020201)
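The brute-force approach above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy data, and the parameter names `rho_s` (support threshold) and `rho_g` (growth rate threshold) are hypothetical, and candidates are drawn from the target class only, since a string absent from the target class cannot meet a positive support threshold.

```python
def substrings(seq):
    """Return the set of distinct non-empty substrings of seq."""
    return {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}

def brute_force_es(target, opponent, rho_s, rho_g):
    """Enumerate every substring of the target dataset and test it directly."""
    candidates = set().union(*(substrings(s) for s in target))
    result = []
    for cand in sorted(candidates):
        # Support = fraction of sequences in the dataset that contain cand.
        supp_t = sum(cand in s for s in target) / len(target)
        supp_o = sum(cand in s for s in opponent) / len(opponent)
        # Growth rate from the opponent dataset to the target dataset.
        growth = float("inf") if supp_o == 0 else supp_t / supp_o
        if supp_t >= rho_s and growth >= rho_g:
            result.append(cand)
    return result

# Hypothetical toy datasets, not taken from the talk.
target = ["abc", "abd"]
opponent = ["bc", "bd"]
print(brute_force_es(target, opponent, rho_s=0.5, rho_g=2))
```

Every candidate is checked against every sequence of both datasets, which is where the cost noted on the slide comes from.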
6Introduction
- Mining ESs
- An Apriori-like approach
- E.g. if both abcd and bcde are frequent in D, generate candidate abcde
- Find frequent substrings and check growth rate
- Still requires many database scans
- A candidate may not be contained in any sequence in D
- Apriori property does not hold for ESs: abcde can be an ES even if both abcd and bcde are not
- We need algorithms which are more efficient and which allow us to filter out ES candidates
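The candidate-generation step of the Apriori-like approach can be illustrated as follows. This is a minimal sketch: the helper name `join` and the two-sequence database are assumptions made for the example, not part of the talk.

```python
def join(x, y):
    """Join two length-k strings that overlap in k-1 symbols into a
    length-(k+1) candidate; return None if they do not overlap that way."""
    if x and len(x) == len(y) and x[1:] == y[:-1]:
        return x + y[-1]
    return None

# Hypothetical database in which abcd and bcde each occur once.
database = ["abcd", "bcde"]
candidate = join("abcd", "bcde")
print(candidate)
print(any(candidate in seq for seq in database))
```

Here the generated candidate occurs in no sequence of the database, which is the wasted work the slide points out.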
7Introduction
- Mining ESs
- Our approach: a suffix tree-based framework
- A compact way of storing all substrings, with support counters maintained
- Deal with suffixes (not substrings) of sequences
- Do not consider substrings not existing in the database
- Time complexity: O( lg(|Σ|) |D| n^2 )
- Techniques for pruning of ES candidates can be easily applied
8Basic Definitions
- Sequence
- An ordered list of symbols over an alphabet Σ
- Class
- In a sequence database, each sequence σi has a class label Ci ∈ C, the set of all class labels
- σ does not belong to Ck ⇔ σ belongs to Ck' (all classes other than Ck)
- Dataset
- If database D is associated with m class labels, we can partition D into m datasets, such that all sequences in dataset Di have class label Ci
- σ ∉ Dk ⇔ σ ∈ Dk'
9Basic Definitions
- Count and support of string s in dataset D
- countD(s) = no. of sequences in D that contain s
- suppD(s) = countD(s) / |D|
- Growth rate of string s from D1 to D2
- growthRateD1→D2(s) = suppD2(s) / suppD1(s)
- growth rate = 0 if suppD1(s) = suppD2(s) = 0
- growth rate = ∞ if suppD1(s) = 0 and suppD2(s) > 0
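The definitions above translate directly into code. The following is a small sketch with hypothetical function names and made-up datasets; the datasets are chosen only so that the string b has supports 2/4 and 3/4.

```python
import math

def count(dataset, s):
    """Number of sequences in the dataset that contain s."""
    return sum(1 for seq in dataset if s in seq)

def supp(dataset, s):
    """Support of s: fraction of sequences in the dataset that contain s."""
    return count(dataset, s) / len(dataset)

def growth_rate(d1, d2, s):
    """Growth rate of s from d1 to d2, including the two special cases."""
    s1, s2 = supp(d1, s), supp(d2, s)
    if s1 == 0:
        return 0.0 if s2 == 0 else math.inf
    return s2 / s1

# Hypothetical datasets of four sequences each.
d1 = ["ab", "cb", "cc", "ac"]
d2 = ["b", "bc", "ab", "c"]
print(growth_rate(d1, d2, "b"))   # (3/4) / (2/4) = 1.5
print(growth_rate(d1, d2, "bc"))  # absent from d1, present in d2: inf
```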
10ES and JES
- Emerging Substring (ES)
- Given ρs and ρg, a string s is an ES from Dk' to Dk (or s is an ES of Ck) if these hold:
- support condition: suppDk(s) ≥ ρs
- growth rate condition: growthRateDk'→Dk(s) ≥ ρg
- Jumping Emerging Substring (JES)
- It is an ES with ∞ growth rate
- JES of Ck: suppDk'(s) = 0 and suppDk(s) > 0
11ES and JES
- With ρg = 1.5:
- ESs from D2 to D1: a, abc, bcd, abcd
- ESs from D1 to D2: b, abd
12ES and JES
- With ρg = 1.5:
- ESs from D2 to D1: a, abc, bcd, abcd
- ESs from D1 to D2: b, abd
- growthRateD1→D2(b) = (3/4) / (2/4) = 1.5
13ES and JES
- With ρg = 1.5:
- ESs from D2 to D1: a, abc, bcd, abcd
- ESs from D1 to D2: b, abd
- JESs are underlined
14The ES Mining Problem
- The ES mining problem
- Given a database D, the set C of all class labels, a support threshold ρs and a growth rate threshold ρg, to discover the set of all ESs for each class Cj ∈ C
- The single-class ES mining problem
- A target class Ck is specified and our goal is to discover the set of all ESs of Ck
- Ck': opponent class
15Merged Suffix Tree
- Suffix tree
- Represents all the substrings of a length-n sequence in O(n) space
- Merged suffix tree
- Represents all the substrings of all sequences in a dataset Dk in O(|Dk| n) space
- Each node has a support counter for each dataset
- Each node is associated with a substring and related to one or more substrings
- Each edge is denoted as an index range [istart, iend)
- E.g. if σ = abcd, then σ[1, 3) = ab
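The index-range representation of edges can be illustrated as below. The sketch assumes 1-based positions, which is what the slide's own example (range 1 to 3 of abcd giving ab) implies; the function name `edge_label` is hypothetical.

```python
def edge_label(seq, i_start, i_end):
    """Label of an edge stored as the 1-based half-open range [i_start, i_end)."""
    # Python slices are 0-based and half-open, so both bounds shift by one.
    return seq[i_start - 1:i_end - 1]

print(edge_label("abcd", 1, 3))  # ab
```

Storing two integers per edge instead of a copy of the label is what keeps the tree linear in the total sequence length.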
16Merged Suffix Tree
- Example
- (c1, c2) = (count in Ck, count in Ck')
[Figure: example merged suffix tree; node A labeled]
17Merged Suffix Tree
- Example
- (c1, c2) = (count in Ck, count in Ck')
- countDk(a) = 2, countDk'(a) = 1
[Figure: example merged suffix tree; node A labeled]
18Merged Suffix Tree
- Example
- Node Y is associated with abcd (concatenation)
- and related to abc and abcd (all share Y's counters)
- An implicit node Z is associated with abc
[Figure: example merged suffix tree; nodes Z and Y labeled]
19Algorithms
- The baseline algorithm
- Consists of 3 phases
- Three pruning techniques
- Support threshold pruning (s-pruning algorithm)
- Growth rate threshold pruning (g-pruning algorithm)
- Length threshold pruning (l-pruning algorithm)
20Baseline Algorithm
- 1. Construction Phase (C-Phase)
- A merged tree MT is built from all the sequences of the target class Ck: each suffix sj of each sequence is matched against substrings in the tree
- Update c1 counter for substrings contained in sj (but a sequence should not contribute twice to the same counter)
- Explicitize implicit nodes when necessary
- When a mismatch occurs, add a new edge and a new leaf to represent the unmatched part of sj
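The counter updates of the construction phase can be sketched with a simplified structure. Note the assumption: the code below builds an uncompressed suffix trie (one node per symbol) rather than the compressed merged suffix tree of the talk, so there are no implicit nodes to explicitize. The names `new_node`, `c_phase` and `lookup`, and the sample sequences, are hypothetical.

```python
def new_node():
    # c1/c2: counts in the target and opponent datasets;
    # seen: id of the last target sequence that touched this node.
    return {"c1": 0, "c2": 0, "seen": None, "kids": {}}

def c_phase(target_seqs):
    """Insert every suffix of every target sequence, updating c1."""
    root = new_node()
    for seq_id, seq in enumerate(target_seqs):
        for start in range(len(seq)):            # each suffix of the sequence
            node = root
            for symbol in seq[start:]:
                if symbol not in node["kids"]:   # mismatch: extend the tree
                    node["kids"][symbol] = new_node()
                node = node["kids"][symbol]
                if node["seen"] != seq_id:       # count each sequence once per node
                    node["seen"] = seq_id
                    node["c1"] += 1
    return root

def lookup(root, s):
    """Return the node reached by spelling s from the root, or None."""
    node = root
    for symbol in s:
        node = node["kids"].get(symbol)
        if node is None:
            return None
    return node

root = c_phase(["abcd", "abe", "aa"])
print(lookup(root, "ab")["c1"])  # 2
print(lookup(root, "a")["c1"])   # 3: "aa" contributes only once
```

The `seen` marker is one simple way to honour the rule that a sequence should not contribute twice to the same counter.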
21Baseline Algorithm
- 1. Construction Phase (C-Phase)
- Example
[Figure: suffix tree during the C-Phase, annotated "Update of c1 counter", "Explicitization of implicit node" and "Update of edges"]
22Baseline Algorithm
- 1. Construction Phase (C-Phase)
- Example
[Figure: suffix tree during the C-Phase, annotated "Addition of new edge and leaf node"]
23Baseline Algorithm
- 2. Update Phase (U-Phase)
- MT is updated with all the sequences of the opponent class Ck'
- Only update c2 counter for substrings that are already present in the tree; do not introduce any substring that is only present in Dk'
- Only internal nodes will be added (no new leaf nodes)
- Resultant tree: MT'
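The update phase can be sketched on the same kind of simplified structure. Assumptions: the tree is an uncompressed trie written out by hand with a hypothetical `node` helper, so no internal nodes ever need to be added (in the compressed tree of the talk, implicit nodes may have to be made explicit); the opponent sequences are made up.

```python
def node(c1, **kids):
    """Hand-built trie node: count in the target class, then child nodes by symbol."""
    return {"c1": c1, "c2": 0, "kids": kids}

def u_phase(root, opponent_seqs):
    """Update c2 for substrings already in the tree; never add new substrings."""
    for seq_id, seq in enumerate(opponent_seqs):
        for start in range(len(seq)):
            current = root
            for symbol in seq[start:]:
                current = current["kids"].get(symbol)
                if current is None:              # not in the tree: stop matching
                    break
                if current.get("seen2") != seq_id:
                    current["seen2"] = seq_id    # count each sequence once per node
                    current["c2"] += 1
    return root

# Trie for a hypothetical target dataset containing only "ab".
root = node(0, a=node(1, b=node(1)), b=node(1))
u_phase(root, ["bab", "cb"])
print(root["kids"]["b"]["c2"])  # 2
```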
24Baseline Algorithm
- 3. eXtraction Phase (X-Phase)
- All ESs of Ck are extracted by a pre-order tree traversal on the updated tree MT'
- At each node X, we check the values of its counters against ρs and ρg to determine whether its related substrings satisfy both the support and growth rate conditions
- If the related substrings of a node X cannot fulfill the support condition, we can ignore the subtree rooted at X
- Baseline algorithm: C-U-X phases
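The extraction phase can be sketched as a recursive pre-order traversal. Assumptions: an uncompressed trie stands in for the merged suffix tree, the example tree is written out by hand with a hypothetical `node` helper, `n1` and `n2` are the sizes of the target and opponent datasets, and `rho_s` and `rho_g` are the support and growth rate thresholds.

```python
import math

def node(c1, c2, **kids):
    """Hand-built trie node: (count in target, count in opponent), children."""
    return {"c1": c1, "c2": c2, "kids": kids}

def x_phase(root, n1, n2, rho_s, rho_g):
    """Return, in pre-order, the substrings meeting both conditions."""
    found = []

    def visit(current, prefix):
        for symbol, child in sorted(current["kids"].items()):
            s = prefix + symbol
            supp1 = child["c1"] / n1
            if supp1 < rho_s:
                continue          # extensions cannot be more frequent: skip subtree
            supp2 = child["c2"] / n2
            growth = math.inf if supp2 == 0 else supp1 / supp2
            if growth >= rho_g:
                found.append(s)
            visit(child, s)       # a longer substring may still qualify

    visit(root, "")
    return found

# Counters for hypothetical datasets: target ["ab", "ac"], opponent ["b", "cb"].
root = node(0, 0,
            a=node(2, 0, b=node(1, 0), c=node(1, 0)),
            b=node(1, 2),
            c=node(1, 1))
print(x_phase(root, n1=2, n2=2, rho_s=0.5, rho_g=2))
```

Only a failed support condition cuts off a subtree. A failed growth rate condition does not, because a longer substring can have a lower opponent count than its prefix.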
25s-Pruning Algorithm
- Observations
- The c2 counter of each substring in MT would be updated in the U-Phase if the substring is contained in some sequence in Dk'
- If a substring is infrequent with respect to Dk, it is not qualified to be an ES of Ck, and its descendant nodes will not even be visited in the X-Phase
- Pruning idea
- To prune infrequent substrings in MT after the C-Phase
26s-Pruning Algorithm
- ρs-Pruning Phase (Ps-Phase)
- With the use of ρs, all substrings being infrequent in Dk are pruned by a pre-order traversal on MT
- Resultant tree: MTs (input to the U-Phase)
- s-pruning algorithm: C-Ps-U-X phases
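The support-threshold pruning pass can be sketched as follows, again on a hand-built uncompressed trie rather than the compressed tree of the talk. The names `node`, `ps_phase` and `size`, the example counters, and the parameter `rho_s` (support threshold) are assumptions for illustration; `n1` is the size of the target dataset.

```python
def node(c1, **kids):
    """Hand-built trie node: count in the target class, then child nodes by symbol."""
    return {"c1": c1, "c2": 0, "kids": kids}

def ps_phase(current, n1, rho_s):
    """Remove every subtree whose root is infrequent in the target dataset."""
    for symbol in list(current["kids"]):
        child = current["kids"][symbol]
        if child["c1"] / n1 < rho_s:
            del current["kids"][symbol]   # descendants are infrequent too
        else:
            ps_phase(child, n1, rho_s)
    return current

def size(current):
    """Number of nodes in the subtree, including its root."""
    return 1 + sum(size(child) for child in current["kids"].values())

# Counters for a hypothetical target dataset ["ab", "ac"].
root = node(0, a=node(2, b=node(1), c=node(1)), b=node(1), c=node(1))
print(size(root))                       # 6
ps_phase(root, n1=2, rho_s=1.0)
print(size(root))                       # 2: only the root and "a" remain
```

A smaller tree means fewer nodes to match against when the opponent sequences are processed afterwards.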
27g-Pruning Algorithm
- Observations
- As sequences in Dk' are being added to MT, the value of the c2 counter of some nodes would become larger
- → Support of these nodes' related substrings in Dk' is monotonically increasing
- → Ratio of the support of these substrings in Dk to that in Dk' is monotonically decreasing
- At some point, this ratio may become less than ρg. When this happens, these substrings have actually lost their candidature for being ESs of Ck
28g-Pruning Algorithm
- Pruning idea
- To prune substrings in MT as soon as they are found to be failing the growth rate requirement
- ρg-Update Phase (Ug-Phase)
- When the support count of a substring in Dk' increases, check if it still satisfies the growth rate condition. If not, prune the substring by path compression or node deletion
- Supported by the [istart, iq, iend) representation of edges
- g-pruning algorithm: C-Ug-X phases
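The idea of pruning during the update can be sketched as below. This is a simplification under stated assumptions, not the talk's implementation: the tree is a hand-built uncompressed trie, a failing node is merely flagged with a hypothetical `pruned` marker (standing in for path compression) and is physically deleted only when it is a leaf, and `rho_g` is the growth rate threshold.

```python
def node(c1, **kids):
    """Hand-built trie node: count in the target class, then child nodes by symbol."""
    return {"c1": c1, "c2": 0, "kids": kids}

def ug_phase(root, opponent_seqs, n1, rho_g):
    """Update c2 and prune substrings as soon as the growth rate test fails."""
    n2 = len(opponent_seqs)
    for seq_id, seq in enumerate(opponent_seqs):
        for start in range(len(seq)):
            current = root
            for symbol in seq[start:]:
                child = current["kids"].get(symbol)
                if child is None:
                    break
                if child.get("seen2") != seq_id:
                    child["seen2"] = seq_id
                    child["c2"] += 1
                    # c2 never decreases, so once (c1/n1) / (c2/n2) < rho_g
                    # the substring can no longer be an ES.
                    if child["c1"] * n2 < rho_g * child["c2"] * n1:
                        child["pruned"] = True
                if child.get("pruned") and not child["kids"]:
                    del current["kids"][symbol]   # failing leaf: delete it
                    break
                current = child                   # failing internal node is kept
    return root

# Counters for a hypothetical target dataset ["ab", "ac"] (n1 = 2).
root = node(0, a=node(2, b=node(1), c=node(1)), b=node(1), c=node(1))
ug_phase(root, ["b", "cb"], n1=2, rho_g=2)
print(sorted(root["kids"]))  # ['a']
```

A failing internal node cannot simply take its subtree with it, since a longer substring may occur less often in the opponent class and still satisfy the growth rate condition.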
29l-Pruning Algorithm
- Observations
- Longer substrings often have lower support than shorter ones → less likely to fulfill the support condition for ESs
- It is not desirable to append these longer substrings to the tree in the C-Phase and subsequently prune them in the Ps-Phase (for the s-pruning algorithm)
- Pruning idea
- To limit the length of substrings to be added to MT in the tree construction phase
30l-Pruning Algorithm
- ρl-Construction Phase (Cl-Phase)
- Only match min(|sj|, ρl) symbols of each suffix against the tree (ignore the remainder) → a smaller MT is built
- Unlike the previous two pruning approaches, it may result in ES loss
- l-pruning algorithm: Cl-U-X phases
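The length limit amounts to truncating each suffix before it is matched against the tree. A minimal sketch, with a hypothetical function name and `rho_l` standing for the length threshold:

```python
def truncated_suffixes(seq, rho_l):
    """Suffixes of seq, each cut to at most rho_l symbols."""
    # Slicing past the end of the string is safe in Python, so each piece
    # has length min(len(suffix), rho_l).
    return [seq[j:j + rho_l] for j in range(len(seq))]

print(truncated_suffixes("abcde", 3))  # ['abc', 'bcd', 'cde', 'de', 'e']
```

Any ES longer than the threshold is never inserted, which is why this technique, unlike the other two, can lose ESs.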
31Summary of Phases
- Baseline: C-U-X
- s-pruning: C-Ps-U-X (earlier use of ρs)
- g-pruning: C-Ug-X (earlier use of ρg)
- l-pruning: Cl-U-X (addition of ρl)
- Combination of the use of pruning techniques
- <s, g>, <l, s>, <l, g>, <l, s, g>
32Performance Evaluation
- Dataset: CI3 (music feature in MIDI tracks)
- Goal: to extract ESs from target class melody (opponent class: non-melody)
- Assumptions: all sequences are pre-stored in memory (appended in a vector, with the starting and ending positions of each sequence recorded)
33Number of ESs Mined
34Take a look at the tree size
35Baseline Algorithm C-U-X
- Performance: same for all ρs and ρg
- Time: about 35 s
36s-Pruning Algorithm C-Ps-U-X
- Faster than baseline alg. by 25-45%
- But reduction in time < reduction in tree size
- Performance improves with ↑ in ρs; same for all ρg
37g-Pruning Algorithm C-Ug-X
- When ρg = ∞, faster than baseline alg. by 2-5%
- When ρg = 2 or 5, slower than baseline alg. by 1-4%
- Performance improves with ↑ in ρg; same for all ρs
38sg-Pruning Algorithm C-Ps-Ug-X
- Faster than baseline, s-pruning and g-pruning alg. (all cases)
- Faster than baseline alg. by 31-54% (ρg = 2 or 5), 47-81% (ρg = ∞)
- Performance improves with ↑ in ρs and ρg
39Target Class Melody
(ρg = 2)
- Performance of algorithms
- (fastest) sg-pruning > s-pruning > baseline > g-pruning
40What If Target Class Non-Melody?
(ρg = 2)
- Performance of algorithms
- (fastest) s-pruning > sg-pruning > baseline > g-pruning
41What If Target Class Non-Melody?
- sg-pruning performs worse than s-pruning
- Due to overhead in node creation (g-pruning requires one more index for each edge)
- Not much performance gain with s-pruning (just 3-5%) or sg-pruning (1-3%)
- Bottleneck: formation of MT (over 93% of the time is spent in the C-Phase)
- In fact, these pruning techniques are very effective since much time is saved in the U-Phase
- 42-80% (for s-pruning) and 54-85% (for sg-pruning)
42l-Pruning Algorithm Loss of ESs
[Figure: loss of ESs by ρl for each (ρs, ρg) setting]
- avg. seq. length = 331, max. seq. length = 1085
- Except when ρs = 0.25, there is loss of non-jumping ESs only when ρl < 20 (15 for the case of JESs)
43l-Pruning Algorithm Time Saved
[Figure: time saved by ρl for each (ρs, ρg) setting]
- avg. seq. length = 331, max. seq. length = 1085
- Time saved becomes obvious when ρl < 100
- For ρs ≥ 0.50, can save over 30% of the time without ES loss
44To be Explored . . .
- ls-pruning
- lg-pruning
- lsg-pruning
45Conclusions
- ESs of a class are substrings which occur more frequently in that class than in other classes.
- ESs are useful features as they capture distinguishing characteristics of data classes.
- We have proposed a suffix tree-based framework for mining ESs.
46Conclusions
- Three basic techniques for pruning ES candidates have been described, and most of them have been shown to be effective.
- Future work: to study whether pruning techniques can be efficiently applied to suffix tree merging algorithms or other ES mining models.
47Applying Pruning Techniques to Single-Class
Emerging Substring Mining
- The End -