Title: Sequential Pattern Mining
1. COMP5318/4044, Lecture 9: Knowledge Discovery and Data Mining
- Sequential Pattern Mining
2. Sequence Databases and Sequential Pattern Analysis
- Transaction databases, time-series databases vs. sequence databases
- Frequent patterns vs. (frequent) sequential patterns
- Applications of sequential pattern mining
  - Customer shopping sequences
    - First buy computer, then CD-ROM, and then digital camera, within 3 months.
  - Medical treatment, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
  - Telephone calling patterns, Weblog click streams
  - DNA sequences and gene structures
3. Recall about Support and Confidence
- The support of an association rule X -> Y is the percentage of transactions that contain X ∪ Y
- The confidence of an association rule X -> Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X
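The two definitions can be checked on a toy example. The sketch below is only an illustration: the function name `support_and_confidence` and the four transactions are hypothetical and do not come from the lecture.

```python
def support_and_confidence(transactions, x, y):
    """Return (support, confidence) of the rule X -> Y over a list of item sets."""
    x, y = set(x), set(y)
    n_xy = sum(1 for t in transactions if (x | y) <= set(t))  # contain X ∪ Y
    n_x = sum(1 for t in transactions if x <= set(t))         # contain X
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical toy transactions (not taken from the lecture).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
sup, conf = support_and_confidence(transactions, {"bread"}, {"milk"})
print(sup, conf)
```

Here two of the four transactions contain both bread and milk, and three contain bread, so the rule bread -> milk has support 2/4 and confidence 2/3.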
4. What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
A sequence database is a set of such sequences, each identified by a sequence ID (SID)
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern (cf. SIDs 10 and 30)
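A minimal sketch of the subsequence test and of support counting follows. Each element is written as a string of items, so <a(bc)dc> becomes `["a", "bc", "d", "c"]`. The function names are hypothetical, and the four-sequence database `DB` is an assumption: it is the standard example from the PrefixSpan paper, which agrees with the SIDs and supports quoted in this lecture, but the table itself is not reproduced in this text.

```python
def is_subsequence(pattern, sequence):
    """True if the elements of `pattern` can be matched, in order, to distinct
    elements of `sequence` that contain them (greedy leftmost matching)."""
    pos = 0
    for element in pattern:
        need = set(element)
        while pos < len(sequence) and not need <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# Assumed example database (SID -> sequence); see the note above.
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

print(is_subsequence(["a", "bc", "d", "c"], DB[10]))  # True
print(support(["ab", "c"], DB))                       # 2 (SIDs 10 and 30)
```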
5. Challenges on Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
  - find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  - be highly efficient and scalable, involving only a small number of database scans
  - be able to incorporate various kinds of user-specific constraints
6. Studies on Sequential Pattern Mining
- Concept introduction and an initial Apriori-like algorithm
  - R. Agrawal & R. Srikant. Mining sequential patterns. ICDE'95
- GSP: an Apriori-based, influential mining method (developed at IBM Almaden)
  - R. Srikant & R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96
- From sequential patterns to episodes (Apriori-like constraints)
  - H. Mannila, H. Toivonen & A.I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1997
- Mining sequential patterns with constraints
  - M.N. Garofalakis, R. Rastogi & K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999
7. GSP: A Generalized Sequential Pattern Mining Algorithm
- GSP (Generalized Sequential Pattern) mining algorithm
  - proposed by Agrawal and Srikant, EDBT'96
- Outline of the method
  - Initially, every item in DB is a candidate of length 1
  - for each level (i.e., sequences of length k) do
    - scan database to collect support count for each candidate sequence
    - generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  - repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori
8. A Basic Property of Sequential Patterns: Apriori
- A basic property: Apriori (Agrawal & Srikant'94)
  - If a sequence S is not frequent
  - Then none of the super-sequences of S is frequent
  - E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent too
Given support threshold min_sup = 2
9. The GSP Algorithm
- Take sequences in form of <x> as length-1 candidates
- Scan database once, find F_1, the set of length-1 sequential patterns
- Let k = 1; while F_k is not empty do
  - Form C_{k+1}, the set of length-(k+1) candidates from F_k
  - If C_{k+1} is not empty, scan database once, find F_{k+1}, the set of length-(k+1) sequential patterns
  - Let k = k + 1
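The level-wise loop can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the published GSP implementation: candidates are built by extending each frequent pattern with one frequent item (as a new element or inside the last element) and are then pruned with the Apriori property, which stands in for GSP's self-join. The names `contains`, `drop_one`, `gsp_sketch` and the database `DB` are hypothetical; `DB` is the four-sequence example from the PrefixSpan paper, written with each element as a sorted string of items.

```python
def contains(sequence, pattern):
    """Greedy left-to-right test that `pattern` is a subsequence of `sequence`."""
    pos = 0
    for element in pattern:
        need = set(element)
        while pos < len(sequence) and not need <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def drop_one(pattern):
    """Yield every sub-pattern obtained by deleting exactly one item."""
    for i, element in enumerate(pattern):
        for item in element:
            rest = element.replace(item, "")
            yield pattern[:i] + ((rest,) if rest else ()) + pattern[i + 1:]


def gsp_sketch(database, min_sup):
    items = sorted({x for seq in database for element in seq for x in element})
    candidates = [(x,) for x in items]       # length-1 candidates <x>
    frequent_items, patterns = None, {}
    while candidates:
        # Count the support of every candidate of the current length.
        frequent = {}
        for cand in candidates:
            sup = sum(contains(seq, cand) for seq in database)
            if sup >= min_sup:
                frequent[cand] = sup
        patterns.update(frequent)
        if frequent_items is None:
            frequent_items = [p[0] for p in frequent]
        # Build length-(k+1) candidates from the length-k frequent patterns.
        grown = set()
        for p in frequent:
            for x in frequent_items:
                grown.add(p + (x,))                    # x starts a new element
                if x > p[-1][-1]:
                    grown.add(p[:-1] + (p[-1] + x,))   # x joins the last element
        # Apriori pruning: every length-k sub-pattern must itself be frequent.
        candidates = sorted(c for c in grown
                            if all(q in frequent for q in drop_one(c)))
    return patterns


# Assumed example database (not reproduced in the lecture text).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
patterns = gsp_sketch(DB, min_sup=2)
print(len(patterns), patterns[("ab", "c")])
```

Patterns are returned as tuples of element strings, so `("ab", "c")` stands for <(ab)c>. The sketch rescans the whole database for every level, which is exactly the cost that the pattern-growth methods later in the lecture try to avoid.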
10. Finding Length-1 Sequential Patterns
- Examine GSP using an example
- Initial candidates: all singleton sequences
  - <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan database once, count support for candidates
11. Generating Length-2 Candidates
- 51 length-2 candidates: 36 of the form <xy> (6 × 6) and 15 of the form <(xy)> (6 × 5 / 2)
- Without the Apriori property: 8 × 8 + 8 × 7 / 2 = 92 candidates
- Apriori prunes 44.57% of the candidates
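The counts on this slide can be reproduced by enumerating the candidates. The sketch below assumes six frequent items a to f out of eight items a to h, which is what the figures 36, 15 and 92 imply; the function name `length2_candidates` is hypothetical.

```python
from itertools import combinations, product


def length2_candidates(items):
    """Length-2 candidates built from single items: <xy> for every ordered
    pair (x may equal y) and <(xy)> for every unordered pair of distinct items."""
    items = sorted(items)
    sequential = [[x, y] for x, y in product(items, repeat=2)]  # <xy>
    itemset = [[x + y] for x, y in combinations(items, 2)]      # <(xy)>
    return sequential, itemset


seq6, set6 = length2_candidates("abcdef")      # frequent items only
seq8, set8 = length2_candidates("abcdefgh")    # all items, no Apriori pruning
with_apriori = len(seq6) + len(set6)
without_apriori = len(seq8) + len(set8)
pruned = 1 - with_apriori / without_apriori
print(len(seq6), len(set6), with_apriori, without_apriori, round(100 * pruned, 2))
```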
12. Generating Length-2 Candidates (cont)
13. Generating Length-2 Candidates (cont)
14. Length-2 Sequential Patterns
- After scanning the database to collect the support count for each length-2 candidate
  - There are 19 length-2 candidates which pass the minimum support threshold
  - They are the length-2 sequential patterns
    - 16 of them in the pattern of <xy>
    - 3 of them in the pattern of <(xy)>
15. Generating Length-3 Candidates and Finding Length-3 Patterns
- Generate Length-3 Candidates
  - Self-join length-2 sequential patterns
  - Based on the Apriori property
    - <ab>, <aa> and <ba> are all length-2 sequential patterns, so <aba> is a length-3 candidate
    - 46 candidates are generated
    - <(bd)>, <bb> and <db> are all length-2 sequential patterns, so <(bd)b> is a length-3 candidate
    - 27 candidates are generated
- Find Length-3 Sequential Patterns
  - Scan database once more, collect support counts for candidates
  - 19 out of 73 candidates pass support threshold
16. Generating Length-3 Candidates
- <aaa>:0, <aab>:0
- <aba>:2, <abb>:2, <abc>:1, <abd>:1, <abe>:1, <abf>:1
- <baa>, <bab>
- <bba>, <bbb>, <bbc>, <bbd>, <bbe>, <bbf>
- <bca>, <bcb>, <bcd>
- <bda>, <bdb>, <bdc>
- <bfb>, <bff>
- <caa>, <cab>
- <cba>, <cbb>, <cbc>, <cbd>, <cbe>, <cbf>
- <cda>, <cdb>, <cdc>
- <daa>, <dab>
- <dba>, <dbb>, <dbc>, <dbd>, <dbe>, <dbf>
- <dca>, <dcb>, <dcd>
- <fba>, <fbb>, <fbc>, <fbd>, <fbe>, <fbf>
- <ffb>, <fff>
- Example of generating <xyz> patterns for <aa>
  - Need to concatenate another length-2 frequent sequence
  - Concatenating the other frequent sequences that start with a forms <aaa> and <aab>
min_sup = 2
17. Generating Length-3 Candidates
- Example of generating <(xy)z> patterns for <(bd)>
  - Need to concatenate another length-2 frequent sequence
  - Concatenating those patterns that end with b to form something like
    - <a(bd)>, <b(bd)>, <c(bd)>, <d(bd)>, <f(bd)>
  - Concatenating those patterns that start with d to form something like
    - <(bd)a>, <(bd)b>, <(bd)c>
18. The GSP Mining Process
min_sup = 2
19. Bottlenecks of GSP
- A huge set of candidates could be generated
  - 1,000 frequent length-1 sequences generate 1000 × 1000 + (1000 × 999) / 2 = 1,499,500 length-2 candidates!
- Multiple scans of database in mining
- Real challenge: mining long sequential patterns
  - An exponential number of short candidates
  - A length-100 sequential pattern needs about 10^30 candidate sequences!
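Both figures can be reproduced with a few lines of arithmetic. The sketch assumes the same counting rule as for the length-2 candidates earlier (n × n ordered pairs plus n(n-1)/2 unordered pairs), and it reads the 10^30 figure as the number of non-empty sub-patterns of a length-100 pattern made of 100 distinct single items; that reading is an assumption, since the lecture text does not spell out the formula.

```python
from math import comb


def length2_candidate_count(n):
    """Number of length-2 candidates from n frequent items:
    n * n of the form <xy> plus n * (n - 1) / 2 of the form <(xy)>."""
    return n * n + n * (n - 1) // 2


count_1000 = length2_candidate_count(1000)

# Every non-empty selection of the 100 items is a shorter pattern that an
# Apriori-style method would have to generate and count as a candidate.
sub_patterns = sum(comb(100, i) for i in range(1, 101))

print(count_1000)              # 1499500
print(f"{sub_patterns:.2e}")   # 1.27e+30
```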
20. Pattern-growth methods
- A divide-and-conquer approach
  - Recursively project a sequence database into a set of smaller databases
  - Mine each projected database to find the subset of patterns
- Algorithms
  - FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining
  - PrefixSpan: Prefix-Projected Sequential Pattern Mining
21. FreeSpan
- J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.
- Example: given a sequence database S and min_support = 2
- Step 1: find length-1 sequential patterns and list them in support descending order
  - f_list = a:4, b:4, c:4, d:3, e:3, f:3 (g:1 is below min_support)
22. FreeSpan (cont)
- Step 2: divide search space. The complete set of sequential patterns can be partitioned into 6 disjoint subsets (move down the f_list)
  - ones that contain only item a
  - ones that contain item b but no items after b in f_list
  - ones that contain item c but no items after c in f_list
  - ones that contain item d but no items after d in f_list
  - ones that contain item e but no items after e in f_list
  - ones that contain item f
- Find subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively
23. FreeSpan (cont)
- Finding sequential patterns containing item b but no items after b in f_list
  - <b>-projected database
    - <a(ab)a>, <aba>, <(ab)b>, <ab>
  - Find all the length-2 sequential patterns containing item b but no items after b in f_list
    - <ab>:4, <ba>:2, <(ab)>:2
  - Further partition and mining
f_list = a:4, b:4, c:4, d:3, e:3, f:3
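The f_list and the <b>-projected database above can be reproduced with a short sketch. The function names `f_list` and `item_projected` are hypothetical, and `DB` is an assumed database: the four-sequence example from the PrefixSpan paper, which yields exactly the f_list and projected sequences quoted on these slides. The projection shown is a simplified reading of FreeSpan's step (keep the sequences that contain the item, and drop every item that comes after it in f_list or is infrequent), not the full published algorithm.

```python
from collections import Counter


def f_list(database, min_sup):
    """Frequent items with their support, in support-descending order."""
    counts = Counter()
    for seq in database:
        # Count each item at most once per sequence.
        counts.update({x for element in seq for x in element})
    frequent = [(x, n) for x, n in counts.items() if n >= min_sup]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


def item_projected(database, item, order):
    """Sequences containing `item`, reduced to the items up to `item` in `order`."""
    keep = set(order[:order.index(item) + 1])
    projected = []
    for seq in database:
        if not any(item in element for element in seq):
            continue
        reduced = ["".join(x for x in element if x in keep) for element in seq]
        projected.append([element for element in reduced if element])
    return projected


# Assumed example database (each element is a sorted string of items).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
flist = f_list(DB, min_sup=2)
order = [item for item, _ in flist]
b_projected = item_projected(DB, "b", order)
print(flist)
print(b_projected)
```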
24. From FreeSpan to PrefixSpan
- FreeSpan
  - Projection-based: no candidate sequence needs to be generated
  - But projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the size of the f-projected database is the same as the original sequence database
- PrefixSpan
  - Projection-based
  - But only prefix-based projection: fewer projections and quickly shrinking sequences
25. PrefixSpan (Prefix-projected Sequential Pattern Mining)
- Projection-based
- But only prefix-based projection: fewer projections and quickly shrinking sequences
- J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
26. Prefix of a Sequence
- Given sequence s = <(abc)(bd)(ace)>
  - <(abc)>, <(abc)(bd)> are prefixes of s
  - Given an alphabetical order of items in each itemset (element), <(a)>, <(ab)>, <(abc)(b)>, <(abc)(bd)(a)>, and <(abc)(bd)(ac)> are also prefixes of s
  - <(ab)(bd)>, <(bd)(ac)> are NOT prefixes of s
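The prefix examples above can be checked mechanically. The sketch below is one possible encoding, with each element written as a sorted string of items; the function name `is_prefix` is hypothetical. It requires all elements but the last to be equal, and the last element of the prefix to be an alphabetical head of the corresponding element of the sequence.

```python
def is_prefix(beta, alpha):
    """True if `beta` is a prefix of `alpha` (elements are strings of items)."""
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False  # the empty sequence is not treated as a prefix here
    # All elements except the last must be identical.
    if any(set(b) != set(a) for b, a in zip(beta[:-1], alpha)):
        return False
    last, full = set(beta[-1]), set(alpha[m - 1])
    if not last <= full:
        return False
    # Items left over in alpha's element must come alphabetically after
    # every item of beta's last element.
    return all(x > max(last) for x in full - last)


s = ["abc", "bd", "ace"]
prefixes = [["abc"], ["abc", "bd"], ["a"], ["ab"], ["abc", "b"],
            ["abc", "bd", "a"], ["abc", "bd", "ac"]]
non_prefixes = [["ab", "bd"], ["bd", "ac"]]
print([is_prefix(p, s) for p in prefixes])
print([is_prefix(p, s) for p in non_prefixes])
```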
27. Pattern Growth (PrefixSpan)
- Prefix and Suffix (Projection)
  - <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
  - Given sequence <a(abc)(ac)d(cf)>
28. PrefixSpan-concepts
- Suppose all items in an element are listed alphabetically.
- Given a sequence α = <e1 e2 ... en> and β = <e1' e2' ... em'> (m ≤ n)
- Prefix: β is the prefix of α iff (1) ei' = ei (i ≤ m-1); (2) em' ⊆ em; (3) all items in (em - em') are alphabetically after those in em'.
  - e.g. α = <a(abc)(ac)d(cf)>, β = <a(ab)>
- Postfix: given prefix β = <e1' e2' ... em'>, the sequence γ = <em'' e(m+1) ... en> is called the postfix of α w.r.t. prefix β, where em'' = (em - em'), denoted as γ = α / β
  - e.g. γ = <(_c)(ac)d(cf)> is the postfix of α w.r.t. prefix <a(ab)>
29. PrefixSpan-concepts (cont)
- Projected database: let α be a sequential pattern in S. The α-projected database, denoted S|α, is the collection of postfixes of sequences in S w.r.t. prefix α
- Support count in projected database: let α be a sequential pattern in S, and β be a sequence having prefix α. The support count of β in the α-projected database is the number of sequences γ in S|α such that β ⊑ α · γ
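A minimal sketch of postfixes and projected databases follows. It locates the leftmost occurrence of the pattern in each sequence and returns what remains, writing a partially consumed element with a leading `_` as in the lecture's notation. The function names and the database `DB` are assumptions (`DB` is the four-sequence example from the PrefixSpan paper). Unlike the worked examples in the lecture, this sketch does not drop items that are infrequent within a projected database.

```python
def postfix(sequence, pattern):
    """Postfix of `sequence` w.r.t. the leftmost occurrence of `pattern`,
    or None if the pattern does not occur."""
    pos = 0
    for element in pattern:
        need = set(element)
        while pos < len(sequence) and not need <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return None
        pos += 1
    matched = sequence[pos - 1]
    # Items of the matched element that come after the pattern's last item.
    rest = "".join(x for x in matched if x > max(pattern[-1]))
    return (["_" + rest] if rest else []) + list(sequence[pos:])


def projected_database(database, pattern):
    """Non-empty postfixes of all sequences that contain `pattern`."""
    result = []
    for seq in database:
        p = postfix(seq, pattern)
        if p:  # skips both non-matching sequences and empty postfixes
            result.append(p)
    return result


# Assumed example database (each element is a sorted string of items).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
print(postfix(DB[0], ["a", "ab"]))      # ['_c', 'ac', 'd', 'cf']
print(projected_database(DB, ["a"]))
```

With this encoding, the <a>-projected database comes out as <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>.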
30. PrefixSpan-process
- Step 1: find length-1 sequential patterns
  - <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3 (<g>:1 is below min_sup)
- Step 2: divide search space. The complete set of sequential patterns can be partitioned into 6 subsets
  - ones having prefix <a>
  - ones having prefix <b>
  - ...
  - ones having prefix <f>
- Find subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively
31. PrefixSpan-Process (cont)
- Finding sequential patterns with prefix <a>
  - <a>-projected database
    - <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  - Find all the length-2 sequential patterns having prefix <a>
    - <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
  - Further partition into 6 subsets
    - Having prefix <aa>
    - ...
    - Having prefix <af>
32. Example
An Example (min_sup = 2)
33. PrefixSpan Algorithm
Main idea: use frequent prefixes to divide the search space and to project sequence databases. Only search the relevant sequences.
PrefixSpan(α, i, S|α)
- Scan S|α once, find the set of frequent items b such that
  - b can be assembled to the last element of α to form a sequential pattern, or
  - <b> can be appended to α to form a sequential pattern.
- For each frequent item b, append it to α to form a sequential pattern α', and output α'
- For each α', construct the α'-projected database S|α', and call PrefixSpan(α', i+1, S|α').
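The recursion can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: each projected database is kept as a list of (sequence, index) pairs, where the index marks the element matched by the last element of the current prefix in its leftmost occurrence, so no postfix is copied. The names `prefixspan` and `grow` and the database `DB` (the four-sequence example from the PrefixSpan paper) are assumptions.

```python
from collections import Counter


def prefixspan(database, min_sup):
    """Return {pattern: support}; a pattern is a tuple of sorted item strings."""
    results = {}

    def grow(pattern, projected):
        # projected: (sequence, i) pairs; sequence[i] is the element matched by
        # the last element of `pattern` (i = -1 for the empty pattern).
        counts = Counter()
        for seq, i in projected:
            found = set()
            for j in range(max(i, 0), len(seq)):
                if j > i:   # item can be appended as a new element
                    found.update(("s", x) for x in seq[j])
                if pattern and set(pattern[-1]) <= set(seq[j]):
                    # item can be assembled into the last element
                    found.update(("i", x) for x in seq[j] if x > pattern[-1][-1])
            counts.update(found)   # each sequence counts at most once per item
        for (kind, x), sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            if kind == "s":
                new = pattern + (x,)
            else:
                new = pattern[:-1] + (pattern[-1] + x,)
            results[new] = sup
            # Project: move each pointer to the leftmost match of the new
            # last element; sequences without a match are dropped.
            new_projected = []
            for seq, i in projected:
                start = i + 1 if kind == "s" else max(i, 0)
                for j in range(start, len(seq)):
                    if set(new[-1]) <= set(seq[j]):
                        new_projected.append((seq, j))
                        break
            grow(new, new_projected)

    grow((), [(seq, -1) for seq in database])
    return results


# Assumed example database (each element is a sorted string of items).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
res = prefixspan(DB, min_sup=2)
print(len(res), res[("a", "b")], res[("ab",)])
```

On this database with min_sup = 2 the sketch finds 53 patterns, including the six length-2 patterns with prefix <a> and their supports listed earlier in the lecture.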
34. Example to be continued
- <aa>-projected database
  - <(_bc)(ac)d(cf)>
- <ab>-projected database
  - <(_c)(ac)d(cf)>, <(_c)a>, and <c>
- <(ab)>-projected database
  - <(_c)(ac)d(cf)>, <(df)cb>
35. Example to be continued
36. PrefixSpan Is Faster than GSP and FreeSpan
37. Effect of Pseudo-Projection
38. Completeness of PrefixSpan
- SDB
  - Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
    - Having prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
      - Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
        - Having prefix <aa>: <aa>-projected database
        - ...
        - Having prefix <af>: <af>-projected database
    - Having prefix <b>: <b>-projected database
    - Having prefix <c>, ..., <f>
39. Efficiency of PrefixSpan
- No candidate sequence needs to be generated
- Projected databases keep shrinking
- Major cost of PrefixSpan: constructing projected databases
  - Can be improved by bi-level projections
40. Bi-Level Projection
For each length-2 sequential pattern α, construct the α-projected database
41. <ab>-Projected Database
- The <ab>-projected database contains three sequences
  - <(_c)(ac)(cf)>, <(_c)a>, <c>
- Scanning it produces three frequent items
  - <a>, <(_c)>, <c>
- Only one pattern reaches min_sup = 2, which is <(_c)a>
- The same set of sequential patterns will be produced
- Level-by-level projection builds 53 projected databases for the 53 sequential patterns, but the bi-level approach needs only 22 projections
42. Other Optimization Technique in PrefixSpan
- Pseudo-projection may reduce the effort of projection when the projected database fits in main memory
43. Speed-up by Pseudo-projection
- Major cost of PrefixSpan: projection
  - Postfixes of sequences often appear repeatedly in recursive projected databases
- When the (projected) database can be held in main memory, use pointers to form projections
  - Pointer to the sequence
  - Offset of the postfix
s = <a(abc)(ac)d(cf)>
s|<a> = <(abc)(ac)d(cf)>, stored as (pointer to s, 2)
s|<ab> = <(_c)(ac)d(cf)>, stored as (pointer to s, 4)
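The offsets 2 and 4 in this example are consistent with counting items from 1 across the whole sequence: a is item 1, the items of (abc) are 2 to 4, and so on. Under that reading, which is an assumption, a pseudo-projection can be sketched as a function that returns only the offset, plus a reader that rebuilds the postfix on demand. Both function names are hypothetical.

```python
def pseudo_project(sequence, pattern):
    """1-based item offset at which the postfix w.r.t. the leftmost
    occurrence of `pattern` starts, or None if the pattern does not occur."""
    pos = 0
    for element in pattern:
        need = set(element)
        while pos < len(sequence) and not need <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return None
        pos += 1
    before = sum(len(element) for element in sequence[:pos - 1])
    consumed = sum(1 for x in sequence[pos - 1] if x <= max(pattern[-1]))
    return before + consumed + 1


def read_postfix(sequence, offset):
    """Rebuild the postfix from the original sequence and an offset.
    A partially consumed element is written with a leading '_'."""
    out, seen = [], 0
    for element in sequence:
        skip = min(max(offset - 1 - seen, 0), len(element))
        seen += len(element)
        if skip == len(element):
            continue  # this element lies entirely before the offset
        out.append(("_" if skip else "") + element[skip:])
    return out


s = ["a", "abc", "ac", "d", "cf"]
off_a = pseudo_project(s, ["a"])          # 2
off_ab = pseudo_project(s, ["a", "b"])    # 4
print(off_a, read_postfix(s, off_a))
print(off_ab, read_postfix(s, off_ab))
```

Only the integer offset (together with a reference to `s`) is stored per projection, so nested projections of the same sequence share one copy of the data.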
44. Performance Comparison
- PrefixSpan-1 is PrefixSpan with level-by-level projection
- PrefixSpan-2 is PrefixSpan with bi-level projection
45. More about Pseudo-Projection
- Pseudo-projection avoids physically copying postfixes
  - Efficient in running time and space when the database can be held in main memory
  - However, it is not efficient when the database cannot fit in main memory
    - Disk-based random accessing is very costly
- Suggested approach
  - Integration of physical and pseudo-projection
  - Swapping to pseudo-projection when the data set fits in memory
46. The Final Word
- Sequential pattern mining is useful in many applications, e.g. weblog analysis, financial market prediction and even NLP
- It is similar to frequent itemset mining, but takes temporal order (sequences) into consideration
- We have looked at two different approaches that descend from two popular algorithms for mining frequent itemsets
  - Candidate generation: GSP
  - Pattern growth: PrefixSpan