Title: Classification
1Classification
2Sequence Data
Sequence Database
3Examples of Sequence Data
(Figure: a sequence drawn as an ordered series of elements (transactions), each holding one or more events (items): <{E1 E2} {E1 E3} {E2} {E3 E4} {E2}>)
4Formal Definition of a Sequence
- A sequence is an ordered list of elements (transactions): s = <e1 e2 e3 ...>
- Each element contains a collection of events (items): ei = {i1, i2, ..., ik}
- Each element is attributed to a specific time or location
- Length of a sequence, |s|, is given by the number of elements of the sequence
- A k-sequence is a sequence that contains k events (items)
5Formal Definition of a Subsequence
- A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> (m ≥ n) if there exist integers i1 < i2 < ... < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, ..., an ⊆ b_in
- The support of a subsequence w is defined as the fraction of data sequences that contain w
- A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)
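The containment test and the support definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `contains`, `support`, and `fs`, and the three-sequence toy database, are assumptions made for the example.

```python
def contains(data_seq, sub):
    """Return True if `sub` is contained in `data_seq`.

    Both are lists of frozensets (one frozenset per element). Matching each
    element of `sub` to the earliest possible element of `data_seq` is enough,
    because an earlier match never rules out a later one.
    """
    pos = 0
    for a in sub:
        # Advance to the first remaining element that is a superset of `a`.
        while pos < len(data_seq) and not a <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next element of `sub` must match strictly later
    return True


def support(db, sub):
    """Fraction of data sequences in `db` that contain `sub`."""
    return sum(contains(s, sub) for s in db) / len(db)


def fs(*events):
    """Shorthand for building one element (hypothetical helper)."""
    return frozenset(events)


# Hypothetical toy database of three data sequences.
db = [
    [fs("a", "b"), fs("c"), fs("d")],
    [fs("a"), fs("c", "d")],
    [fs("b"), fs("a"), fs("c")],
]

w = [fs("a"), fs("d")]   # the subsequence <{a} {d}>
minsup = 0.5
print(support(db, w), support(db, w) >= minsup)
```

Here <{a} {d}> is contained in the first two data sequences but not the third, so its support is 2/3 and it counts as a sequential pattern for minsup = 0.5.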
6Sequential Pattern Mining: Definition
- Given:
- a database of sequences
- a user-specified minimum support threshold, minsup
- Task:
- Find all subsequences with support ≥ minsup
7Sequential Pattern Mining: Challenge
- Given a sequence: <{a b} {c d e} {f} {g h i}>
- Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.
- How many k-subsequences can be extracted from a given n-sequence?
- <{a b} {c d e} {f} {g h i}>, n = 9
- k = 4: Y _ _ Y Y _ _ _ Y
- <{a} {d e} {i}>
- Answer: C(n, k) = C(9, 4) = 126
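The count of k-subsequences can be checked by brute force. The sketch below assumes the nine events are grouped as <{a b} {c d e} {f} {g h i}>. The function name `k_subsequences` is illustrative only.

```python
from itertools import combinations
from math import comb

# The 9-event sequence, one tuple per element (grouping is an assumption).
sequence = [("a", "b"), ("c", "d", "e"), ("f",), ("g", "h", "i")]


def k_subsequences(seq, k):
    """Yield every subsequence obtained by keeping exactly k events."""
    # Flatten to (element index, event) pairs so events can be chosen freely.
    events = [(idx, ev) for idx, element in enumerate(seq) for ev in element]
    for chosen in combinations(events, k):
        grouped = {}  # element index -> kept events, in original order
        for idx, ev in chosen:
            grouped.setdefault(idx, []).append(ev)
        yield tuple(tuple(evs) for evs in grouped.values())


subs = list(k_subsequences(sequence, 4))
print(len(subs), comb(9, 4))
print((("a",), ("d", "e"), ("i",)) in subs)
```

Keeping events a, d, e and i (the pattern Y _ _ Y Y _ _ _ Y) gives <{a} {d e} {i}>, one of the C(9, 4) enumerated subsequences.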
8Challenges on Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
- find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
- be highly efficient and scalable, involving only a small number of database scans
- be able to incorporate various kinds of user-specific constraints
9Sequential Pattern Mining Algorithms
- Concept introduction and an initial Apriori-like algorithm: Agrawal & Srikant, Mining sequential patterns, ICDE'95
- Apriori-based method: GSP (Generalized Sequential Patterns; Srikant & Agrawal @ EDBT'96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
- Vertical format-based mining: SPADE (Zaki @ Machine Learning'01)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)
10Extracting Sequential Patterns
- Given n events: i1, i2, i3, ..., in
- Candidate 1-subsequences:
- <{i1}>, <{i2}>, <{i3}>, ..., <{in}>
- Candidate 2-subsequences:
- <{i1, i2}>, <{i1, i3}>, ..., <{i1} {i1}>, <{i1} {i2}>, ..., <{in-1} {in}>
- Candidate 3-subsequences:
- <{i1, i2, i3}>, <{i1, i2, i4}>, ..., <{i1, i2} {i1}>, <{i1, i2} {i2}>, ...,
- <{i1} {i1, i2}>, <{i1} {i1, i3}>, ..., <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, ...
11Generalized Sequential Pattern (GSP)
- Step 1:
- Make the first pass over the sequence database D to yield all the 1-element frequent sequences
- Step 2:
- Repeat until no new frequent sequences are found:
- Candidate Generation: merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
- Candidate Pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences
- Support Counting: make a new pass over the sequence database D to find the support for these candidate sequences
- Candidate Elimination: eliminate candidate k-sequences whose actual support is less than minsup
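The generate / prune / count / eliminate loop can be sketched as follows. To stay short, the sketch makes one loud simplification: every element holds a single event, so a sequence is just a tuple of items. This is not the full GSP algorithm, which also handles multi-event elements. The function names, the toy database, and the use of an absolute support count are all assumptions.

```python
def is_subseq(sub, seq):
    """True if the items of `sub` appear in `seq` in the same order."""
    it = iter(seq)
    return all(x in it for x in sub)


def gsp_single_events(db, minsup_count):
    """Level-wise mining for sequences whose elements are single events."""
    def count(candidate):
        # One pass over the database for this candidate.
        return sum(is_subseq(candidate, s) for s in db)

    # Step 1: frequent 1-sequences.
    items = sorted({x for s in db for x in s})
    level = {(i,): count((i,)) for i in items}
    level = {c: n for c, n in level.items() if n >= minsup_count}

    frequent = {}
    while level:  # Step 2: repeat until no new frequent sequences are found
        frequent.update(level)
        prev = set(level)
        # Candidate generation: join w1 and w2 when w1 without its first
        # event equals w2 without its last event.
        candidates = {w1 + w2[-1:] for w1 in prev for w2 in prev
                      if w1[1:] == w2[:-1]}
        # Candidate pruning: every (k-1)-subsequence must be frequent.
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}
        # Support counting, then candidate elimination.
        level = {c: count(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= minsup_count}
    return frequent


# Hypothetical database: four sequences of single events.
db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "d")]
patterns = gsp_single_events(db, minsup_count=2)
for p in sorted(patterns, key=lambda p: (len(p), p)):
    print(p, patterns[p])
```

In this toy run the candidate (a, b, d) is generated from (a, b) and (b, d) but pruned before counting, because its subsequence (a, d) is not frequent.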
12Candidate Generation
- Base case (k = 2):
- Merging two frequent 1-sequences <{i1}> and <{i2}> will produce two candidate 2-sequences: <{i1} {i2}> and <{i1 i2}>
- General case (k > 2):
- A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2
- The resulting candidate after merging is given by the sequence w1 extended with the last event of w2.
- If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1
- Otherwise, the last event in w2 becomes a separate element appended to the end of w1
13Candidate Generation Examples
- Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}> because the last two events in w2 (4 and 5) belong to the same element
- Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}> because the last two events in w2 (4 and 5) do not belong to the same element
- We do not have to merge the sequences w1 = <{1} {2 6} {4}> and w2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}> because if the latter is a viable candidate, then it can be obtained by merging w1 with <{2 6} {4 5}>
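The general-case merge rule (k > 2) can be written as a small function. In this sketch a sequence is a tuple of elements and each element is a tuple of events in sorted order, so "first event" and "last event" are well defined. The representation and the names `drop_first`, `drop_last`, and `merge` are assumptions, and the k = 2 base case is not covered.

```python
def drop_first(seq):
    """Remove the first event of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last event of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def merge(w1, w2):
    """Return the candidate built from w1 and w2, or None if they do not join."""
    if drop_first(w1) != drop_last(w2):
        return None
    last = w2[-1][-1]
    if len(w2[-1]) > 1:
        # The last two events of w2 share an element:
        # the last event joins the last element of w1.
        return w1[:-1] + (w1[-1] + (last,),)
    # Otherwise the last event becomes a new element appended to w1.
    return w1 + ((last,),)


w1 = ((1,), (2, 3), (4,))
print(merge(w1, ((2, 3), (4, 5))))       # 4 and 5 in the same element
print(merge(w1, ((2, 3), (4,), (5,))))   # 5 in its own element
print(merge(((1,), (2, 6), (4,)), ((1,), (2,), (4, 5))))  # does not join
```

The three calls mirror the three cases discussed above: the first two produce <{1} {2 3} {4 5}> and <{1} {2 3} {4} {5}>, and the third returns None because the two sequences do not satisfy the join condition.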
14GSP Example
15Finding Length-1 Sequential Patterns
- Examine GSP using an example
- Initial candidates: all singleton sequences
- <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan database once, count support for candidates
16GSP: Generating Length-2 Candidates
51 length-2 candidates
Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
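The candidate counts follow from a simple formula: n items give n*n candidates of the form <x y> (two elements, order matters, x may equal y) plus C(n, 2) candidates of the form <(x y)> (one element). The sketch below assumes that 6 of the 8 singleton candidates turned out to be frequent, which is the value that reproduces the 51 candidates quoted above. The function name is illustrative.

```python
from math import comb


def length2_candidates(n):
    """Number of length-2 candidates that n items can form."""
    return n * n + comb(n, 2)


without_apriori = length2_candidates(8)  # all 8 singleton candidates
with_apriori = length2_candidates(6)     # assumption: 6 of them are frequent
pruned = 1 - with_apriori / without_apriori
print(without_apriori, with_apriori, f"{pruned:.2%}")
```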
17The GSP Mining Process
min_sup = 2
18Candidate Generate-and-test: Drawbacks
- A huge set of candidate sequences is generated.
- Especially 2-item candidate sequences.
- Multiple scans of the database are needed.
- The length of each candidate grows by one at each database scan.
- Inefficient for mining long sequential patterns.
- A long pattern grows from short patterns.
- The number of short patterns is exponential in the length of the mined patterns.
19The SPADE Algorithm
- SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001)
- A vertical format sequential pattern mining method
- A sequence database is mapped to a large set of entries of the form
- Item: <SID, EID>
- Sequential pattern mining is performed by
- growing the subsequences (patterns) one item at a time by Apriori candidate generation
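The vertical format and one growth step can be illustrated with ID-lists. The sketch below is a simplified illustration, not Zaki's implementation. It maps a toy database to item: <SID, EID> entries and joins two ID-lists to obtain the ID-list of the 2-sequence "a followed later by b". The function names, the toy data, and the choice to keep only the EID of the last item in the joined list are assumptions.

```python
from collections import defaultdict


def to_vertical(db):
    """Map {SID: [element, ...]} to {item: [(SID, EID), ...]}.

    EID is the 1-based position of the element inside its sequence.
    """
    idlists = defaultdict(list)
    for sid, seq in db.items():
        for eid, element in enumerate(seq, start=1):
            for item in sorted(element):
                idlists[item].append((sid, eid))
    return dict(idlists)


def temporal_join(list_a, list_b):
    """ID-list of 'a then b': entries of b preceded by some a in the same SID."""
    first_a = {}
    for sid, eid in list_a:
        first_a[sid] = min(eid, first_a.get(sid, eid))
    return [(sid, eid) for sid, eid in list_b
            if sid in first_a and first_a[sid] < eid]


def support_count(idlist):
    """Number of distinct sequences that appear in an ID-list."""
    return len({sid for sid, _ in idlist})


# Hypothetical horizontal database: SID -> list of elements.
db = {
    1: [{"a"}, {"a", "b", "c"}, {"a", "c"}],
    2: [{"a", "d"}, {"c"}, {"b", "c"}],
    3: [{"b"}, {"a"}],
}
vertical = to_vertical(db)
ab = temporal_join(vertical["a"], vertical["b"])
print(vertical["b"])
print(ab, support_count(ab))
```

Support is counted from the ID-lists alone, so growing a pattern by one item needs no further pass over the original sequences.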
20The SPADE Algorithm
21Bottlenecks of GSP and SPADE
- A huge set of candidates could be generated
- 1,000 frequent length-1 sequences generate a huge number of length-2 candidates: 1,000 × 1,000 + (1,000 × 999) / 2 = 1,499,500
- Multiple scans of the database in mining
- Mining long sequential patterns
- Needs an exponential number of short candidates
- A length-100 sequential pattern needs about 10^30 candidate sequences!
22Prefix and Suffix (Projection)
- <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
- Given sequence <a(abc)(ac)d(cf)>
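The suffix left after a prefix can be computed mechanically. The sketch below handles only the case where the prefix grows by one item placed in a new element (for example <a> to <ab>), not growth inside the same element (for example <a> to <(ab)>). Elements are strings of single-letter events, and a leading "_" marks the remainder of an element partly consumed by the prefix. The names `suffix` and `show` are assumptions.

```python
def suffix(seq, item):
    """Suffix of `seq` after matching `item` in the first possible new element.

    Returns None when `item` does not occur in any such element.
    """
    for pos, element in enumerate(seq):
        if element.startswith("_"):
            continue  # leftover of the prefix's own element, not a new element
        if item in element:
            rest = element[element.index(item) + 1:]
            head = ["_" + rest] if rest else []
            return head + seq[pos + 1:]
    return None


def show(seq):
    """Render a sequence in the <a(abc)...> notation."""
    return "<" + "".join(e if len(e) == 1 else "(" + e + ")" for e in seq) + ">"


s = ["a", "abc", "ac", "d", "cf"]   # the sequence <a(abc)(ac)d(cf)>
after_a = suffix(s, "a")            # suffix w.r.t. prefix <a>
after_aa = suffix(after_a, "a")     # suffix w.r.t. prefix <aa>
after_ab = suffix(after_a, "b")     # suffix w.r.t. prefix <ab>
print(show(after_a))
print(show(after_aa))
print(show(after_ab))
```

For the given sequence this yields <(abc)(ac)d(cf)> for prefix <a>, <(_bc)(ac)d(cf)> for <aa>, and <(_c)(ac)d(cf)> for <ab>.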
23Mining Sequential Patterns by Prefix Projections
- Step 1: find length-1 sequential patterns
- <a>, <b>, <c>, <d>, <e>, <f>
- Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
- The ones having prefix <a>
- The ones having prefix <b>
- ...
- The ones having prefix <f>
24Finding Seq. Patterns with Prefix ltagt
- Only need to consider projections w.r.t. <a>
- <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
- Further partition into 6 subsets
- Having prefix <aa>
- ...
- Having prefix <af>
25Completeness of PrefixSpan
SDB
- Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
- Having prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
- Having prefix <aa>: <aa>-proj. db
- ...
- Having prefix <af>: <af>-proj. db
- Having prefix <b>: <b>-projected database
- Having prefix <c>, ..., <f>
26Efficiency of PrefixSpan
- No candidate sequence needs to be generated
- Projected databases keep shrinking
- Major cost of PrefixSpan: constructing projected databases
- Can be improved by pseudo-projections
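The recursive structure of prefix-projected pattern growth can be sketched as follows. To keep it short, the sketch assumes every element holds a single event, so a sequence is a tuple of items. It is a simplified illustration rather than the published PrefixSpan, which also grows patterns inside an element. The function name, toy database, and absolute support count are assumptions.

```python
def prefixspan_single_events(db, minsup_count):
    """Pattern growth by prefix projection for single-event elements."""
    results = {}

    def mine(prefix, projected):
        # Count, per item, the number of projected sequences containing it.
        # No candidate sequences are generated.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < minsup_count:
                continue
            new_prefix = prefix + (item,)
            results[new_prefix] = counts[item]
            # Physical projection: keep what follows the first occurrence.
            new_projected = [s[s.index(item) + 1:] for s in projected
                             if item in s]
            mine(new_prefix, new_projected)

    mine((), [tuple(s) for s in db])
    return results


# Hypothetical database: four sequences of single events.
db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "d")]
patterns = prefixspan_single_events(db, minsup_count=2)
for p, n in patterns.items():
    print(p, n)
```

Each recursive call works on a smaller projected database, and the slicing step that builds `new_projected` is the copying cost that pseudo-projection aims to avoid.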
27Speed-up by Pseudo-projection
- Major cost of PrefixSpan: projection
- Postfixes of sequences often appear repeatedly in recursive projected databases
- When the (projected) database can be held in main memory, use pointers to form projections
- Pointer to the sequence
- Offset of the postfix
- Example: s = <a(abc)(ac)d(cf)>
- Prefix <a>: physical projection s|<a> = <(abc)(ac)d(cf)>, pseudo-projection s|<a>: (pointer to s, 2)
- Prefix <ab>: physical projection s|<ab> = <(_c)(ac)d(cf)>, pseudo-projection s|<ab>: (pointer to s, 4)
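A pseudo-projection can be sketched as a reference to the stored sequence plus an integer offset. In the sketch below the sequence is stored once as a flat list of events with a parallel list of element indices. The class and function names are assumptions. Offsets are 0-based internally and printed 1-based.

```python
class FlatSequence:
    """A sequence stored once: flat events plus the element index of each."""

    def __init__(self, elements):
        self.items = [ev for el in elements for ev in el]
        self.elem = [i for i, el in enumerate(elements) for _ in el]


def extend(seq, offset, item):
    """Offset of the postfix after matching `item` in a later element, or None."""
    last_elem = seq.elem[offset - 1] if offset > 0 else -1
    for idx in range(offset, len(seq.items)):
        if seq.items[idx] == item and seq.elem[idx] != last_elem:
            return idx + 1
    return None


def render(seq, offset):
    """Materialise the postfix only for display, e.g. <(_c)(ac)d(cf)>."""
    out, idx = [], offset
    while idx < len(seq.items):
        end = idx
        while end < len(seq.items) and seq.elem[end] == seq.elem[idx]:
            end += 1
        body = "".join(seq.items[idx:end])
        if idx > 0 and seq.elem[idx - 1] == seq.elem[idx]:
            out.append("(_" + body + ")")   # element partly used by the prefix
        elif end - idx > 1:
            out.append("(" + body + ")")
        else:
            out.append(body)
        idx = end
    return "<" + "".join(out) + ">"


s = FlatSequence(["a", "abc", "ac", "d", "cf"])   # <a(abc)(ac)d(cf)>
off_a = extend(s, 0, "a")        # pseudo-projection for prefix <a>
off_ab = extend(s, off_a, "b")   # pseudo-projection for prefix <ab>
proj_a, proj_ab = (s, off_a), (s, off_ab)   # (pointer, offset), no copying
print(off_a + 1, render(*proj_a))
print(off_ab + 1, render(*proj_ab))
```

Both projections share the same `FlatSequence` object, so only two integers are stored per projected sequence. The 1-based offsets printed here are 2 and 4.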
28Pseudo-Projection vs. Physical Projection
- Pseudo-projection avoids physically copying postfixes
- Efficient in running time and space when the database can be held in main memory
- However, it is not efficient when the database cannot fit in main memory
- Disk-based random access is very costly
- Suggested approach:
- Integration of physical and pseudo-projection
- Swapping to pseudo-projection when the data set fits in memory
29Performance on Data Set C10T8S8I8
30CloSpan: Mining Closed Sequential Patterns
- A closed sequential pattern s: there exists no superpattern s' such that s' ⊃ s, and s' and s have the same support
- Motivation: reduces the number of (redundant) patterns but attains the same expressive power
- Uses Backward Subpattern and Backward Superpattern pruning to prune redundant search space
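The definition of a closed pattern can be checked directly by post-filtering a set of frequent patterns. This is not CloSpan itself, which prunes during the search rather than afterwards. The sketch assumes single-event elements, and the pattern supports below are hypothetical values chosen for illustration.

```python
def is_subseq(sub, seq):
    """True if the items of `sub` appear in `seq` in the same order."""
    it = iter(seq)
    return all(x in it for x in sub)


def closed_patterns(frequent):
    """Keep patterns with no proper superpattern of equal support."""
    return {
        p: sup for p, sup in frequent.items()
        if not any(len(q) > len(p) and sup == other and is_subseq(p, q)
                   for q, other in frequent.items())
    }


# Hypothetical frequent patterns with their support counts.
frequent = {
    ("a",): 3, ("b",): 3, ("c",): 3, ("d",): 2,
    ("a", "b"): 2, ("a", "c"): 3, ("b", "c"): 2, ("b", "d"): 2,
    ("a", "b", "c"): 2,
}
closed = closed_patterns(frequent)
print(sorted(closed.items()))
```

Nine frequent patterns shrink to four closed ones. Every dropped pattern and its support can be recovered from a closed superpattern, for example (a, b) from (a, b, c).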
31Constraint-Based Seq.-Pattern Mining
- Constraint-based sequential pattern mining
- Constraints: user-specified, for focused mining of desired patterns
- How to explore efficient mining with constraints? Optimization
- Classification of constraints
- Anti-monotone: e.g., value_sum(S) < 150, min(S) > 10
- Monotone: e.g., count(S) > 5, S ⊇ {PC, digital_camera}
- Succinct: e.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office, MS/Money}
- Convertible: e.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) - min(S) > 5
- Inconvertible: e.g., avg(S) - median(S) = 0
32From Sequential Patterns to Structured Patterns
- Sets, sequences, trees, graphs, and other structures
- Transaction DB: sets of items
- {{i1, i2, ..., im}, ...}
- Seq. DB: sequences of sets
- {<{i1, i2}, ..., {im, in, ik}>, ...}
- Sets of sequences
- {{<i1, i2>, ..., <im, in, ik>}, ...}
- Sets of trees: {t1, t2, ..., tn}
- Sets of graphs (mining for frequent subgraphs)
- {g1, g2, ..., gn}
- Mining structured patterns in XML documents, bio-chemical structures, etc.
33Episodes and Episode Pattern Mining
- Other methods for specifying the kinds of patterns
- Serial episodes: A → B
- Parallel episodes: A & B
- Regular expressions: (A | B)C*(D → E)
- Methods for episode pattern mining
- Variations of Apriori-like algorithms, e.g., GSP
- Database projection-based pattern growth
- Similar to the frequent pattern growth without candidate generation
34Periodicity Analysis
- Periodicity is everywhere: tides, seasons, daily power consumption, etc.
- Full periodicity
- Every point in time contributes (precisely or approximately) to the periodicity
- Partial periodicity: a more general notion
- Only some segments contribute to the periodicity
- Jim reads the NY Times 7:00-7:30 am every weekday
- Cyclic association rules
- Associations which form cycles
- Methods
- Full periodicity: FFT, other statistical analysis methods
- Partial and cyclic periodicity: variations of Apriori-like mining methods
35Ref: Mining Sequential Patterns
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI'97.
- M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
- J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
- J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
- X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
- J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
- H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
- J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
- J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.