Title: Constraint-based sequential pattern mining: the pattern-growth method
1Constraint-based sequential pattern mining: the pattern-growth method
- Authors: Jian Pei, Jiawei Han, and Wei Wang
- Source: Journal of Intelligent Information Systems, Volume 28, Number 2, pp. 133-160, 2007/4
- Reporter: Chin-Chih Chan
- Date: 2007/12/17
- E-mail: m9622967
2Outline
- Introduction
- Definition
- Constraint Sequential pattern mining
- Categories of constraints
- Algorithm PG
- Computing projection
- Mining sequential patterns with prefix-monotone constraints
- Experimental results and performance
3Introduction
- Sequential pattern mining is an important data mining task with broad applications
- Sequential pattern mining: finding the sequences that occur frequently in a large database
- Frequent: the support of the sequence ≥ the user-defined minimum support
- Support of a sequence s: the number of sequences in the DB that contain s
- Applications: network traffic analysis, bioinformatics
4Example sequential pattern mining
- Suppose the support threshold is min_sup = 2
- <(ab)d> is a subsequence of both the second sequence, <e(ab)(bc)dd>, and the third one, <c(aef)(abc)dd>
- So <(ab)d> is one of the sequential patterns
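The support count behind this example can be sketched in Python. This is a minimal sketch, not the authors' code: the second and third sequences are the ones quoted above, while the first and fourth sequences (`a(bc)e`, `addcb`) and the helper names `parse`, `contains`, and `support` are assumptions made for illustration.

```python
import re

def parse(text):
    """Turn compact notation such as 'e(ab)(bc)dd' into a list of itemsets."""
    return [frozenset(tok.strip("()")) for tok in re.findall(r"\([a-z]+\)|[a-z]", text)]

def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy left-to-right matching)."""
    k = 0
    for itemset in seq:
        if k < len(pattern) and pattern[k] <= itemset:
            k += 1
    return k == len(pattern)

def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db)

# Assumed database: only the second and third sequences appear in the example.
db = [parse(s) for s in ("a(bc)e", "e(ab)(bc)dd", "c(aef)(abc)dd", "addcb")]
min_sup = 2

sup = support(db, parse("(ab)d"))
print(sup, sup >= min_sup)
```

Under these assumptions the pattern is contained only in the second and third sequences, which is exactly enough to reach min_sup.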
5Introduction
- Mining the complete set of sequential patterns is still tough in both effectiveness and efficiency
- We want to mine only the sequential patterns that are highly interesting to users
- Improve the effectiveness by focusing only on interesting patterns
- We want to mine sequential patterns with all kinds of constraints
6Pattern-growth (PG) method
- We present an algorithm: pattern-growth (PG)
- Constraints can be effectively and efficiently
pushed deep into the sequential data mining method
7Sequential pattern mining concepts
- Let I = {x1, ..., xn} be a set of items
- An itemset is a non-empty subset of items
- A sequence α = <X1, ..., Xl> is an ordered list of itemsets
- An itemset Xi (1 ≤ i ≤ l) in a sequence is called a transaction
- The number of transactions in a sequence is called the length of the sequence
- For an l-sequence α, we have len(α) = l
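These definitions map directly onto a small data structure. The following is a minimal sketch, assuming single-letter items and the compact notation used elsewhere in these slides; the helper name `parse` is hypothetical.

```python
import re

def parse(text):
    """Turn compact notation such as 'e(ab)(bc)dd' into a list of itemsets."""
    return [frozenset(tok.strip("()")) for tok in re.findall(r"\([a-z]+\)|[a-z]", text)]

alpha = parse("e(ab)(bc)dd")      # a sequence is an ordered list of itemsets
items = set().union(*alpha)       # the items that occur in alpha
print(len(alpha))                 # len(alpha): the number of transactions
print(sorted(items))
```

A list keeps the transactions in time order, and each `frozenset` holds the items of one transaction.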
8Subsequence and super-sequence
- A transaction Xi has a special attribute, time-stamp, denoted by Xi.time
- For a sequence α = <X1, ..., Xl>, we assume Xi.time < Xj.time for 1 ≤ i < j ≤ l
- <X1, X2> ≠ <X2, X1>
- A sequence α = <X1, ..., Xn> is called a subsequence of sequence β = <Y1, ..., Ym> (n ≤ m), denoted by α ⊑ β, if there exist integers 1 ≤ i1 < ... < in ≤ m such that X1 ⊆ Yi1, ..., Xn ⊆ Yin
- And β is a super-sequence of α
9Example subsequence
- Sequence <(ab)d> is a subsequence of both <e(ab)(bc)dd> and <c(aef)(abc)dd>
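[Figure: <(ab)d> aligned with <e(ab)(bc)dd> and with <c(aef)(abc)dd>; each transaction of <(ab)d> is contained in a matching transaction, and the matches are in time order]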
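The "contained and in time order" check can be made concrete by returning the positions i1 < ... < in from the subsequence definition. This is a minimal sketch that assumes leftmost matching; the function names `parse` and `embedding` are hypothetical.

```python
import re

def parse(text):
    """Turn compact notation such as '(ab)d' into a list of itemsets."""
    return [frozenset(tok.strip("()")) for tok in re.findall(r"\([a-z]+\)|[a-z]", text)]

def embedding(alpha, beta):
    """Return 1-based positions i1 < ... < in with each Xk a subset of Y_ik, or None."""
    positions, j = [], 0
    for x in alpha:
        while j < len(beta) and not x <= beta[j]:
            j += 1
        if j == len(beta):
            return None           # some Xk cannot be placed: alpha is not a subsequence
        positions.append(j + 1)
        j += 1                    # the next itemset must match a strictly later transaction
    return positions

print(embedding(parse("(ab)d"), parse("e(ab)(bc)dd")))
print(embedding(parse("(ab)d"), parse("c(aef)(abc)dd")))
print(embedding(parse("d(ab)"), parse("e(ab)(bc)dd")))
```

The last call swaps the order of the two transactions and finds no embedding, which illustrates why <X1, X2> and <X2, X1> are different sequences.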
10Categories of constraints
- Constraint 1 (Item constraint)
- Example: Cbookstore(α) ≡ (∀i: 1 ≤ i ≤ len(α), α[i] ⊆ B)
- Constraint 2 (Length constraint)
- Example: Clen(α) compares len(α) with the bound 50
- Constraint 3 (Super-pattern constraint)
- Example: Cpat(α) ≡ <(PC)(digital_camera)> ⊑ α
- Constraint 4 (Aggregate constraint)
- Example: Cavg(α) compares avg(α) with the bound 30
- Constraint 5 (Regular expression constraint)
- Example: Travel (New York | New York City) (Hotels | Motels)
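Each category can be read as a Boolean predicate over a sequence. The sketch below is illustrative only: the function names, the `price` table, the sample sequences, and the `/` separator in the regular expression are assumptions, and the comparison direction for the length and aggregate constraints is passed in as a parameter because it is not fixed here.

```python
import operator
import re

def contains(seq, pattern):
    """True if pattern is a subsequence of seq."""
    k = 0
    for itemset in seq:
        if k < len(pattern) and pattern[k] <= itemset:
            k += 1
    return k == len(pattern)

def item_constraint(seq, allowed):
    """Every transaction uses only items from `allowed` (the set B)."""
    return all(x <= allowed for x in seq)

def length_constraint(seq, compare, bound):
    return compare(len(seq), bound)

def super_pattern_constraint(seq, pattern):
    return contains(seq, pattern)

def avg_constraint(seq, value, compare, bound):
    vals = [value[i] for x in seq for i in x]
    return compare(sum(vals) / len(vals), bound)

def regex_constraint(seq, regex):
    """Assumes single-item transactions, joined with '/' before matching."""
    return re.fullmatch(regex, "/".join(next(iter(x)) for x in seq)) is not None

seq = [frozenset({"PC"}), frozenset({"memory"}), frozenset({"digital_camera"})]
price = {"PC": 60, "memory": 10, "digital_camera": 50}      # hypothetical values
trip = [frozenset({"Travel"}), frozenset({"New York City"}), frozenset({"Motels"})]

print(item_constraint(seq, {"PC", "memory"}))
print(length_constraint(seq, operator.le, 50))
print(super_pattern_constraint(seq, [frozenset({"PC"}), frozenset({"digital_camera"})]))
print(avg_constraint(seq, price, operator.ge, 30))
print(regex_constraint(trip, r"Travel/(New York|New York City)/(Hotels|Motels)"))
```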
12Algorithm PG
- Difference from traditional sequential pattern mining
- Definition
- Prefix
- Projection
- Projected DB
- Algorithm PG
13The classical sequential pattern mining
- The classical algorithms are based on the Apriori property
- Property: any super-pattern of an infrequent pattern cannot be frequent
- A breadth-first, level-by-level search
- Constraints are simply squeezed into the Apriori framework
- However, some important constraints cannot be handled on the basis of the Apriori property
- Ex: regular expression constraints
14Prefix growth(PG) algorithm
- A prefix-monotone property
- Can solve most of the constraints discussed so far
- PG pushes such constraints into sequential pattern mining
- Makes mining more efficient and more effective
- Efficiency: takes less computational time and space
- Effectiveness: patterns that the user is not interested in are pruned
15Definition order
- All items in a transaction are written with respect to the order R
- Written in the form of (ade)(bc) instead of (dae)(cb)
- "Item x precedes item y" is denoted by x ≺ y
- The alphabetical order is often used
16Definition the prefix
- Given a sequence α = <X1, ..., Xn>, sequence β = <X1, ..., Xk Y> is called a prefix of α if
- (1) k < n
- (2) Y ⊆ Xk+1
- (3) ∀y ∈ Y, ∀z ∈ (Xk+1 − Y), y ≺ z
- Example: sequence α = <(abc)(acd)(bef)>
- Sequence β = <(abc)(ac)> is a prefix of sequence α
- Sequence γ = <(abc)(ad)> is not a prefix of α
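The three prefix conditions translate almost line by line into code. This is a minimal sketch assuming alphabetical item order and single-letter items; `parse` and `is_prefix` are hypothetical helper names.

```python
import re

def parse(text):
    """Turn compact notation such as '(abc)(ac)' into a list of itemsets."""
    return [frozenset(tok.strip("()")) for tok in re.findall(r"\([a-z]+\)|[a-z]", text)]

def is_prefix(beta, alpha):
    """True if beta = <X1 ... Xk Y> is a prefix of alpha = <X1 ... Xn>."""
    if not beta or len(beta) > len(alpha):
        return False                        # condition (1): k < n, with k = len(beta) - 1
    k = len(beta) - 1
    if beta[:k] != alpha[:k]:
        return False                        # the first k transactions must be identical
    y, x_next = beta[k], alpha[k]           # Y and X_{k+1} (index k is 0-based)
    if not y <= x_next:
        return False                        # condition (2): Y is a subset of X_{k+1}
    # condition (3): every item of Y precedes every remaining item of X_{k+1}
    return all(a < b for a in y for b in x_next - y)

alpha = parse("(abc)(acd)(bef)")
print(is_prefix(parse("(abc)(ac)"), alpha))
print(is_prefix(parse("(abc)(ad)"), alpha))
```

The second call fails condition (3): c remains in X2 − Y, but d in Y does not precede c.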
17The concept of projected database
- For sequence α ⊑ β, sequence γ is said to be the projection of β with respect to α if
- (1) γ ⊑ β
- (2) α is a prefix of γ
- (3) there exists no proper super-sequence γ' of γ such that γ' ⊑ β and γ' also has α as a prefix
- The projection is also denoted by γ = β / α
18Example for projection
- For example, if α = <bc> and β = <(abc)d(ace)f>, then γ = β/α = <b(ce)f>
19Algorithm for computing projection
20Example for computing projection
- For example, if α = <bc> and β = <(abc)d(ace)f>, then γ = β/α = <b(ce)f>
[Figure: β = (abc) d (ace) f with transactions labeled J1 to J4; α = b c with transactions labeled A1, A2 and match positions i1, i2; x = c, Z = e; output <b (c ∪ e) f>]
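The projection in this example can be reproduced with a short routine. This is a sketch under stated assumptions rather than the slide's own procedure: it matches each transaction of α at its leftmost possible position in β, assumes alphabetical item order, and uses the hypothetical names `parse` and `project`.

```python
import re

def parse(text):
    """Turn compact notation such as '(abc)d(ace)f' into a list of itemsets."""
    return [frozenset(tok.strip("()")) for tok in re.findall(r"\([a-z]+\)|[a-z]", text)]

def project(alpha, beta):
    """Return beta / alpha as a list of itemsets, or None if alpha is not in beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return None
        j += 1
    last = j - 1                                   # transaction matching alpha's last itemset
    # Items of that transaction that come after every item of alpha's last itemset.
    tail = {z for z in beta[last] if z > max(alpha[-1])}
    return alpha[:-1] + [alpha[-1] | tail] + beta[last + 1:]

gamma = project(parse("bc"), parse("(abc)d(ace)f"))
print(gamma == parse("b(ce)f"))
```

Here b matches in (abc) and c matches in (ace); e follows c inside (ace), so the result is b, then (ce), then f. The transaction d is dropped because it lies between the two matches.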
21Algorithm for PG
22Flow for PG
- Sequence DB S
- Scan S for 1-item support: <a>, <b>, <c>
- If frequent, then build the <a>-projected DB, <b>-projected DB, <c>-projected DB
- Scan the <a>-projected DB for 2-sequence support: <aa>, <ab>, <ac>, <(ab)>, <(ac)>
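The flow above can be sketched as a recursive pattern-growth routine without constraints. This is a simplified illustration, not the algorithm as published: the projected database is represented as the list of sequences that contain the current prefix (not shortened), and the four-sequence database and the names `parse`, `show`, `contains`, and `grow` are assumptions.

```python
import re

def parse(text):
    """Turn compact notation such as 'e(ab)(bc)dd' into a list of itemsets."""
    return [frozenset(tok.strip("()")) for tok in re.findall(r"\([a-z]+\)|[a-z]", text)]

def show(seq):
    """Render a list of itemsets back into compact notation."""
    parts = ("".join(sorted(x)) for x in seq)
    return "".join(p if len(p) == 1 else "(" + p + ")" for p in parts)

def contains(seq, pattern):
    k = 0
    for itemset in seq:
        if k < len(pattern) and pattern[k] <= itemset:
            k += 1
    return k == len(pattern)

def grow(prefix, supporting, min_sup, found):
    """Extend prefix one item at a time; `supporting` stands in for the projected DB."""
    items = sorted({i for seq in supporting for x in seq for i in x})
    for x in items:
        candidates = [prefix + [frozenset(x)]]                    # e.g. <a> to <ab>
        if prefix and x > max(prefix[-1]):
            candidates.append(prefix[:-1] + [prefix[-1] | {x}])   # e.g. <a> to <(ab)>
        for cand in candidates:
            kept = [seq for seq in supporting if contains(seq, cand)]
            if len(kept) >= min_sup:                              # frequent: record and recurse
                found[show(cand)] = len(kept)
                grow(cand, kept, min_sup, found)

# Assumed example database and threshold.
db = [parse(s) for s in ("a(bc)e", "e(ab)(bc)dd", "c(aef)(abc)dd", "addcb")]
found = {}
grow([], db, 2, found)
print(found["a"], found["(ab)d"], "f" in found)
```

Each pattern is generated exactly once because an item is either appended as a new transaction or added to the last transaction only if it follows all items already there.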
23Example for mining sequential patterns with
prefix-monotone constraints
- The task is to mine sequential patterns with a regular expression constraint C = <a*{bb | (bc)d | dd}> and min_sup = 2
24Prefix_growth(<>, SDB)
- Let l be the length of <>. Scan SDB<> to find the length-(l + 1) frequent prefixes in SDB<>
- <a>: 4, <b>: 4, <c>: 4, <d>: 3, <e>: 3
- <f>: 1, so f is an infrequent item
- <a(bc)e> contains no subsequence satisfying the constraint
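This first scan can be reproduced in a few lines. The sketch assumes a four-sequence database in which `a(bc)e`, `e(ab)(bc)dd`, and `c(aef)(abc)dd` come from these slides and `addcb` is an assumed fourth sequence; it also reads the constraint as "a followed by bb, (bc)d, or dd". The helper names are hypothetical.

```python
import re
from collections import Counter

def parse(text):
    """Turn compact notation such as 'e(ab)(bc)dd' into a list of itemsets."""
    return [frozenset(tok.strip("()")) for tok in re.findall(r"\([a-z]+\)|[a-z]", text)]

def contains(seq, pattern):
    k = 0
    for itemset in seq:
        if k < len(pattern) and pattern[k] <= itemset:
            k += 1
    return k == len(pattern)

names = ("a(bc)e", "e(ab)(bc)dd", "c(aef)(abc)dd", "addcb")   # last one is assumed
db = {name: parse(name) for name in names}
min_sup = 2

# Count each item once per sequence that contains it.
item_support = Counter(i for seq in db.values() for i in set().union(*seq))
frequent = sorted(i for i, n in item_support.items() if n >= min_sup)

# Sequences accepted by the constraint, under the reading stated above.
targets = [parse(t) for t in ("abb", "a(bc)d", "add")]
pruned = [name for name, seq in db.items()
          if not any(contains(seq, t) for t in targets)]

print(dict(item_support))
print(frequent)
print(pruned)
```

A sequence that contains none of the accepted sequences cannot support any pattern satisfying the constraint, so it can be dropped before the projected databases are built.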
25SDB<a>
<a> fails C
<(ae)>: 1 and <aa>: 1 are infrequent
<(ab)>: 2, <ab>: 3, <ac>: 3, <ad>: 3
SDB<ab>
<(abc)>: 1 and <(abb)>: 1 are infrequent
<a(bc)>: 2, <abd>: 3
26SDB<a(bc)>
Sequential pattern a(bc)d satisfies the
constraint
27SDB<ac>
No sequence in the projected database contains a subsequence satisfying the constraint
SDB<ad>
<add> is a sequential pattern satisfying the constraint
This results in two final patterns: a(bc)d and add
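The whole walkthrough can be sketched as a constrained prefix-growth routine. This is an illustrative sketch, not the published algorithm. It assumes the constraint accepts exactly the sequences abb, a(bc)d, and add, so a candidate is kept only if it is a prefix of one of them. It also assumes a fourth sequence `addcb` in the database and leftmost-match projection. Because the pruning is applied before counting, intermediate supports may differ from those listed on the slides.

```python
import re

def parse(text):
    return [frozenset(t.strip("()")) for t in re.findall(r"\([a-z]+\)|[a-z]", text)]

def show(seq):
    parts = ("".join(sorted(x)) for x in seq)
    return "".join(p if len(p) == 1 else "(" + p + ")" for p in parts)

def contains(seq, pattern):
    k = 0
    for itemset in seq:
        if k < len(pattern) and pattern[k] <= itemset:
            k += 1
    return k == len(pattern)

def is_prefix(beta, alpha):
    if not beta or len(beta) > len(alpha):
        return False
    k = len(beta) - 1
    y, x_next = beta[k], alpha[k]
    return (beta[:k] == alpha[:k] and y <= x_next
            and all(a < b for a in y for b in x_next - y))

def project(alpha, beta):
    """Projection beta / alpha with leftmost matching, or None if alpha is not in beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return None
        j += 1
    tail = {z for z in beta[j - 1] if z > max(alpha[-1])}
    return alpha[:-1] + [alpha[-1] | tail] + beta[j:]

def prefix_growth(alpha, projected, targets, min_sup, found):
    items = sorted({i for seq in projected for x in seq for i in x})
    for x in items:
        candidates = [alpha + [frozenset(x)]]                   # start a new transaction
        if alpha and x > max(alpha[-1]):
            candidates.append(alpha[:-1] + [alpha[-1] | {x}])   # extend the last transaction
        for cand in candidates:
            if not any(is_prefix(cand, t) for t in targets):
                continue                                        # can never grow into a match
            new_db = [p for p in (project(cand, s) for s in projected) if p is not None]
            if len(new_db) >= min_sup:
                if cand in targets:
                    found[show(cand)] = len(new_db)             # satisfies the constraint
                prefix_growth(cand, new_db, targets, min_sup, found)

targets = [parse(t) for t in ("abb", "a(bc)d", "add")]
sdb = [parse(s) for s in ("a(bc)e", "e(ab)(bc)dd", "c(aef)(abc)dd", "addcb")]
sdb = [s for s in sdb if any(contains(s, t) for t in targets)]   # drops a(bc)e
found = {}
prefix_growth([], sdb, targets, 2, found)
print(found)
```

Under these assumptions the routine reports the same two patterns as the walkthrough, a(bc)d and add.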
28Experimental results and performance study
- Experiment hardware
- Compare response time without constraint
- GSP, SPADE, Prefix growth
- Pushing regular expression constraints into sequential pattern mining
- Scalability of PG
- With support threshold
- With database size
29Experiment hardware
30Compare response time without constraint
- Experiment on dataset C10T5S4I1.25D200k
- Contains 100,000 sequences with 10,000 items
- The expected average number of items within a transaction is 5
- Denoted as T5
- The expected average number of transactions in a maximal sequential pattern is 4
- Denoted as S4
31Pushing regular expression constraints into
sequential pattern mining
- We randomly generate 1,000 constraints
- The support threshold is set to 0.2
- PG can prune both patterns and projected
databases, but SPIRIT(V) has to scan the whole
sequence database repeatedly
32Scalability of PG with respect to support
threshold
33Scalability of PG with respect to database size
- The support threshold is set to 0.2
34Conclusions
- We characterize constraints for sequential pattern mining
- An efficient algorithm, PG, is developed to push prefix-monotone constraints deep into the mining process