Classification

Slides: 36

Provided by: jiaw198

Learn more at: https://www.cs.bu.edu

1
Classification
2
Sequence Data
Sequence Database
3
Examples of Sequence Data
[Figure: a sequence drawn as an ordered series of elements (transactions), each holding one or more events (items): {E1, E2}, {E1, E3}, {E2}, {E3, E4}, {E2}]
4
Formal Definition of a Sequence

A sequence is an ordered list of elements
(transactions)
s = <e1 e2 e3 ...>
Each element contains a collection of events
(items)
ei = {i1, i2, ..., ik}
Each element is attributed to a specific time or
location
Length of a sequence, |s|, is given by the number
of elements of the sequence
A k-sequence is a sequence that contains k events
(items)
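
The definitions above map onto a small data structure. The sketch below is an illustration only: the tuple-of-tuples representation, the helper names length and num_events, and the sample sequence are assumptions, not part of the slides.

```python
# A sequence is an ordered tuple of elements; each element is a tuple of events.
s = (("a", "b"), ("c",), ("d", "e"))  # hypothetical sample: <{a b} {c} {d e}>

def length(seq):
    """|s|: the number of elements (transactions) in the sequence."""
    return len(seq)

def num_events(seq):
    """k: the total number of events (items), so seq is a k-sequence."""
    return sum(len(element) for element in seq)

print(length(s))      # 3 elements
print(num_events(s))  # 5 events, so s is a 5-sequence
```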

5
Formal Definition of a Subsequence

A sequence <a1 a2 ... an> is contained in another
sequence <b1 b2 ... bm> (m ≥ n) if there exist
integers i1 < i2 < ... < in such that a1 ⊆ bi1,
a2 ⊆ bi2, ..., an ⊆ bin
The support of a subsequence w is defined as the
fraction of data sequences that contain w
A sequential pattern is a frequent subsequence
(i.e., a subsequence whose support is ≥ minsup)
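
A minimal sketch of the containment test and of support, assuming each element is stored as a Python set. The function names contains and support and the three-sequence toy database are hypothetical. Matching each pattern element against the earliest possible data element is enough to decide containment.

```python
def contains(data_seq, sub):
    """True if every element of sub is a subset of a distinct, later-or-equal
    positioned element of data_seq, with order preserved."""
    pos = 0
    for element in sub:
        # Skip forward to the first data element that covers this element.
        while pos < len(data_seq) and not element <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True

def support(db, sub):
    """Fraction of data sequences in db that contain sub."""
    return sum(contains(seq, sub) for seq in db) / len(db)

# Hypothetical toy database of three data sequences.
db = [
    [{"a", "b"}, {"c"}, {"d"}],
    [{"a"}, {"c", "d"}],
    [{"b"}, {"d"}],
]
w = [{"a"}, {"d"}]
print(support(db, w))  # contained in the first two sequences only
```

With a threshold such as minsup = 0.5, w would count as a sequential pattern in this toy database.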

6
Sequential Pattern Mining Definition

Given
a database of sequences
a user-specified minimum support threshold,
minsup
Task
Find all subsequences with support ≥ minsup

7
Sequential Pattern Mining Challenge

Given a sequence <{a b} {c d e} {f} {g h i}>
Examples of subsequences
<{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>,
etc.
How many k-subsequences can be extracted from a
given n-sequence?
<{a b} {c d e} {f} {g h i}>, n = 9
k = 4: Y _ _ Y Y _ _ _ Y
<{a} {d e} {i}>
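
The question above can be checked by brute force. The sketch below assumes the nine events are grouped into four elements as {a b} {c d e} {f} {g h i}; the helper name k_subsequences is hypothetical. Choosing k of the n events, and keeping the chosen events in their original elements, gives C(n, k) subsequences when all events are distinct.

```python
from itertools import combinations
from math import comb

# Assumed grouping of the 9 events into elements.
seq = [("a", "b"), ("c", "d", "e"), ("f",), ("g", "h", "i")]

def k_subsequences(seq, k):
    """Enumerate every subsequence obtained by keeping exactly k events."""
    events = [(idx, ev) for idx, element in enumerate(seq) for ev in element]
    result = []
    for chosen in combinations(events, k):
        grouped = {}
        for idx, ev in chosen:
            grouped.setdefault(idx, []).append(ev)
        # Elements that lost all of their events are dropped.
        result.append([tuple(evs) for evs in grouped.values()])
    return result

subs = k_subsequences(seq, 4)
print(len(subs), comb(9, 4))                  # 126 126
print([("a",), ("d", "e"), ("i",)] in subs)   # True: the mask Y _ _ Y Y _ _ _ Y
```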

8
Challenges on Sequential Pattern Mining

A huge number of possible sequential patterns are
hidden in databases
A mining algorithm should
find the complete set of patterns, when possible,
satisfying the minimum support (frequency)
threshold
be highly efficient, scalable, involving only a
small number of database scans
be able to incorporate various kinds of
user-specific constraints

9
Sequential Pattern Mining Algorithms

Concept introduction and an initial Apriori-like
algorithm
Agrawal & Srikant. Mining sequential patterns,
ICDE'95
Apriori-based method: GSP (Generalized Sequential
Patterns: Srikant & Agrawal @ EDBT'96)
Pattern-growth methods: FreeSpan & PrefixSpan
(Han et al. @ KDD'00; Pei, et al. @ ICDE'01)
Vertical format-based mining: SPADE (Zaki @ Machine
Learning'01)
Constraint-based sequential pattern mining
(SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei,
Han, Wang @ CIKM'02)
Mining closed sequential patterns: CloSpan (Yan,
Han & Afshar @ SDM'03)

10
Extracting Sequential Patterns

Given n events: i1, i2, i3, ..., in
Candidate 1-subsequences
<{i1}>, <{i2}>, <{i3}>, ..., <{in}>
Candidate 2-subsequences
<{i1, i2}>, <{i1, i3}>, ..., <{i1} {i1}>, <{i1}
{i2}>, ..., <{in-1} {in}>
Candidate 3-subsequences
<{i1, i2, i3}>, <{i1, i2, i4}>, ..., <{i1, i2}
{i1}>, <{i1, i2} {i2}>, ...,
<{i1} {i1, i2}>, <{i1} {i1, i3}>, ..., <{i1} {i1}
{i1}>, <{i1} {i1} {i2}>, ...

11
Generalized Sequential Pattern (GSP)

Step 1
Make the first pass over the sequence database D
to yield all the 1-element frequent sequences
Step 2
Repeat until no new frequent sequences are found
Candidate Generation
Merge pairs of frequent subsequences found in the
(k-1)th pass to generate candidate sequences that
contain k items
Candidate Pruning
Prune candidate k-sequences that contain
infrequent (k-1)-subsequences
Support Counting
Make a new pass over the sequence database D to
find the support for these candidate sequences
Candidate Elimination
Eliminate candidate k-sequences whose actual
support is less than minsup
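
The four steps above can be sketched in a few lines for a restricted case. This is not the full GSP algorithm: it assumes every element holds a single event, takes minsup as an absolute count, and uses hypothetical names (is_subsequence, gsp_single_events) and a made-up toy database.

```python
def is_subsequence(pattern, seq):
    """True if pattern occurs in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def gsp_single_events(db, minsup):
    """Level-wise mining; returns {pattern: support count}."""
    def count(pattern):
        return sum(is_subsequence(pattern, seq) for seq in db)

    # Step 1: one pass to find the frequent 1-sequences.
    items = sorted({item for seq in db for item in seq})
    frequent = {(i,): count((i,)) for i in items}
    frequent = {p: c for p, c in frequent.items() if c >= minsup}
    result = dict(frequent)
    k = 2
    # Step 2: repeat until no new frequent sequences are found.
    while frequent:
        prev = set(frequent)
        # Candidate generation: join w1 and w2 when w1 without its first
        # event equals w2 without its last event.
        candidates = {w1 + w2[-1:] for w1 in prev for w2 in prev
                      if w1[1:] == w2[:-1]}
        # Candidate pruning: every (k-1)-subsequence must be frequent.
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in prev for i in range(k))}
        # Support counting and candidate elimination.
        frequent = {c: count(c) for c in candidates}
        frequent = {c: n for c, n in frequent.items() if n >= minsup}
        result.update(frequent)
        k += 1
    return result

# Hypothetical toy database.
db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "c")]
patterns = gsp_single_events(db, minsup=2)
print(sorted(patterns.items()))
```

Each pass through the loop corresponds to one scan of the database, which is why the number of scans grows with the length of the longest pattern.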

12
Candidate Generation

Base case (k = 2)
Merging two frequent 1-sequences <{i1}> and
<{i2}> will produce two candidate 2-sequences
<{i1} {i2}> and <{i1 i2}>
General case (k > 2)
A frequent (k-1)-sequence w1 is merged with
another frequent (k-1)-sequence w2 to produce a
candidate k-sequence if the subsequence obtained
by removing the first event in w1 is the same as
the subsequence obtained by removing the last
event in w2
The resulting candidate after merging is given
by the sequence w1 extended with the last event
of w2.
If the last two events in w2 belong to the same
element, then the last event in w2 becomes part
of the last element in w1
Otherwise, the last event in w2 becomes a
separate element appended to the end of w1

13
Candidate Generation Examples

Merging the sequences w1 = <{1} {2 3} {4}> and w2 =
<{2 3} {4 5}> will produce the candidate
sequence <{1} {2 3} {4 5}> because the last two
events in w2 (4 and 5) belong to the same element
Merging the sequences w1 = <{1} {2 3} {4}> and w2 =
<{2 3} {4} {5}> will produce the candidate
sequence <{1} {2 3} {4} {5}> because the last
two events in w2 (4 and 5) do not belong to the
same element
We do not have to merge the sequences w1 = <{1}
{2 6} {4}> and w2 = <{1} {2} {4 5}> to produce
the candidate <{1} {2 6} {4 5}> because if the
latter is a viable candidate, then it can be
obtained by merging w1 with <{2 6} {4 5}>
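
The general-case merge rule can be sketched as follows. Elements are stored as tuples of events; the element boundaries chosen for w1 and w2 are an assumption made to agree with the explanations above ("belong to the same element" and "do not belong to the same element"), and the helper names drop_first, drop_last and merge are hypothetical.

```python
def drop_first(seq):
    """Remove the first event of the sequence."""
    head = seq[0][1:]
    return ([head] if head else []) + list(seq[1:])

def drop_last(seq):
    """Remove the last event of the sequence."""
    tail = seq[-1][:-1]
    return list(seq[:-1]) + ([tail] if tail else [])

def merge(w1, w2):
    """Return the candidate produced by merging w1 with w2, or None."""
    if drop_first(w1) != drop_last(w2):
        return None
    last = w2[-1][-1]
    if len(w2[-1]) > 1:
        # Last two events of w2 share an element: extend w1's last element.
        return list(w1[:-1]) + [w1[-1] + (last,)]
    # Otherwise the last event of w2 becomes a new element of its own.
    return list(w1) + [(last,)]

w1 = [(1,), (2, 3), (4,)]
print(merge(w1, [(2, 3), (4, 5)]))      # [(1,), (2, 3), (4, 5)]
print(merge(w1, [(2, 3), (4,), (5,)]))  # [(1,), (2, 3), (4,), (5,)]
print(merge([(1,), (2, 6), (4,)], [(1,), (2,), (4, 5)]))  # None: not mergeable
```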

14
GSP Example
15
Finding Length-1 Sequential Patterns

Examine GSP using an example
Initial candidates: all singleton sequences
<a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan database once, count support for candidates

16
GSP Generating Length-2 Candidates
51 length-2 candidates
Without Apriori property, 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
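
The counts on this slide can be reproduced with a short calculation. The sketch assumes that n items yield n*n two-element candidates <x y> (order matters, x may equal y) plus n*(n-1)/2 single-element candidates <(xy)>. The figure of 51 matches n = 6 frequent items, which is an inference from the arithmetic rather than something stated in the transcript.

```python
def length2_candidates(n):
    """Number of length-2 candidates built from n items."""
    return n * n + n * (n - 1) // 2

without_apriori = length2_candidates(8)  # all 8 items
with_apriori = length2_candidates(6)     # assumed: 6 items survive the first scan
pruned = 1 - with_apriori / without_apriori

print(without_apriori)   # 92
print(with_apriori)      # 51
print(f"{pruned:.2%}")   # 44.57%
```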
17
The GSP Mining Process
min_sup = 2
18
Candidate Generate-and-test Drawbacks

A huge set of candidate sequences is generated,
especially 2-item candidate sequences.
Multiple scans of the database are needed.
The length of each candidate grows by one at each
database scan.
Inefficient for mining long sequential patterns.
A long pattern grows from short patterns.
The number of short patterns is exponential in
the length of the mined patterns.

19
The SPADE Algorithm

SPADE (Sequential PAttern Discovery using
Equivalence classes), developed by Zaki (2001)
A vertical format sequential pattern mining
method
A sequence database is mapped to a large set of
entries of the form Item: <SID, EID>
Sequential pattern mining is performed by
growing the subsequences (patterns) one item at a
time by Apriori candidate generation
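
A minimal sketch of the vertical format, assuming SID is a sequence identifier and EID an event (time) identifier. It shows only the id-list construction and a temporal join for "x followed later by y"; the equivalence-class partitioning and itemset joins of SPADE are left out. The names vertical, temporal_join and support, and the toy database, are hypothetical.

```python
from collections import defaultdict

def vertical(db):
    """Map {sid: [(eid, items), ...]} to {item: [(sid, eid), ...]}."""
    idlists = defaultdict(list)
    for sid, events in db.items():
        for eid, items in events:
            for item in sorted(items):
                idlists[item].append((sid, eid))
    return idlists

def temporal_join(left, right):
    """Id-list of the pattern 'left pattern, then right item later'."""
    earliest = {}
    for sid, eid in left:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return sorted((sid, eid) for sid, eid in set(right)
                  if sid in earliest and earliest[sid] < eid)

def support(idlist):
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in idlist})

# Hypothetical toy database: sid -> list of (eid, set of items).
db = {
    1: [(1, {"a"}), (2, {"a", "b", "c"}), (3, {"a", "c"})],
    2: [(1, {"a", "d"}), (2, {"c"})],
    3: [(1, {"b"}), (2, {"a"})],
}
ids = vertical(db)
ab = temporal_join(ids["a"], ids["b"])
ac = temporal_join(ids["a"], ids["c"])
print(ab, support(ab))  # [(1, 2)] 1
print(ac, support(ac))  # [(1, 2), (1, 3), (2, 2)] 2
```

Support is read directly from the joined id-list, so no further scan of the original database is needed for that candidate.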

20
The SPADE Algorithm
21
Bottlenecks of GSP and SPADE

A huge set of candidates could be generated
1,000 frequent length-1 sequences generate a huge
number of length-2 candidates!
Multiple scans of database in mining
Mining long sequential patterns
Needs an exponential number of short candidates
A length-100 sequential pattern needs about 10^30
candidate sequences!
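
The two magnitudes above can be made concrete. The sketch assumes the length-2 count follows n*n + n*(n-1)/2 (ordered pairs plus two-item elements), and that the candidates for a length-100 pattern are its non-empty sub-patterns, one for each non-empty subset of 100 distinct items. Both readings are assumptions used for illustration.

```python
from math import comb

n = 1000
length2 = n * n + n * (n - 1) // 2
print(length2)  # 1499500

# Non-empty sub-patterns of a pattern with 100 distinct items.
subpatterns = sum(comb(100, i) for i in range(1, 101))
print(subpatterns == 2**100 - 1)  # True
print(f"{subpatterns:.2e}")       # 1.27e+30
```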

22
Prefix and Suffix (Projection)

<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of
sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>

23
Mining Sequential Patterns by Prefix Projections

Step 1: find length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of
seq. pat. can be partitioned into 6 subsets
The ones having prefix <a>
The ones having prefix <b>
...
The ones having prefix <f>

24
Finding Seq. Patterns with Prefix ltagt

Only need to consider projections w.r.t. <a>
<a>-projected database: <(abc)(ac)d(cf)>,
<(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 seq. pat. having prefix
<a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets
Having prefix <aa>
...
Having prefix <af>
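
A sketch of projection with respect to a single-item prefix. The four input sequences are hypothetical: they were built by putting the prefix back in front of each postfix listed above, so they are not the original database. Elements are tuples of alphabetically sorted items, and the helper names project and show are assumptions.

```python
def project(seq, item):
    """Postfix of seq after the first occurrence of item, or None."""
    for idx, element in enumerate(seq):
        if item in element:
            # Items sharing the element with the prefix are kept as (_...).
            rest = tuple(x for x in element if x > item)
            head = [("_",) + rest] if rest else []
            return head + list(seq[idx + 1:])
    return None

def show(seq):
    """Render a sequence in the <a(abc)...> notation."""
    parts = []
    for element in seq:
        text = "".join(element)
        parts.append(text if len(element) == 1 else "(" + text + ")")
    return "<" + "".join(parts) + ">"

# Hypothetical inputs that contain the prefix <a>.
db = [
    [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")],
    [("a", "d"), ("c",), ("b", "c"), ("a", "e")],
    [("a", "b"), ("d", "f"), ("c",), ("b",)],
    [("a", "f"), ("c",), ("b",), ("c",)],
]
projected = [show(project(seq, "a")) for seq in db]
print(projected)
```

Recursing on each projected database with the next frequent item is what grows the prefix from <a> to <aa>, <ab>, and so on.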

25
Completeness of PrefixSpan
SDB
Length-1 sequential patterns: <a>, <b>, <c>, <d>,
<e>, <f>
Having prefix <a>
<a>-projected database: <(abc)(ac)d(cf)>,
<(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Length-2 sequential patterns: <aa>, <ab>,
<(ab)>, <ac>, <ad>, <af>
Having prefix <aa>: <aa>-proj. db
...
Having prefix <af>: <af>-proj. db
Having prefix <b>
<b>-projected database
...
Having prefix <c>, ..., <f>
26
Efficiency of PrefixSpan

No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected
databases
Can be improved by pseudo-projections

27
Speed-up by Pseudo-projection

Major cost of PrefixSpan: projection
Postfixes of sequences often appear repeatedly in
recursive projected databases
When (projected) database can be held in main
memory, use pointers to form projections
Pointer to the sequence
Offset of the postfix

s = <a(abc)(ac)d(cf)>
<a>
<(abc)(ac)d(cf)>
s|<a>: (pointer to s, 2)
<ab>
<(_c)(ac)d(cf)>
s|<ab>: (pointer to s, 4)
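
The pointer-and-offset idea can be sketched as below. The offsets are read as 1-based positions in the flattened item list of s, an interpretation inferred from the values 2 and 4 above. The names flatten and postfix_view are hypothetical. No postfix is copied; each projection is just a (sequence, offset) pair that is rendered on demand.

```python
s = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]

def flatten(seq):
    """List of (item, element index) pairs, in order."""
    return [(item, idx) for idx, element in enumerate(seq) for item in element]

def postfix_view(flat, offset):
    """Render the postfix that starts at the 1-based item offset."""
    start = offset - 1
    parts = []
    i = start
    while i < len(flat):
        elem_idx = flat[i][1]
        j = i
        items = ""
        while j < len(flat) and flat[j][1] == elem_idx:
            items += flat[j][0]
            j += 1
        # The first element is partial if the previous item shares it.
        partial = i == start and i > 0 and flat[i - 1][1] == elem_idx
        if partial:
            parts.append("(_" + items + ")")
        elif len(items) > 1:
            parts.append("(" + items + ")")
        else:
            parts.append(items)
        i = j
    return "<" + "".join(parts) + ">"

flat = flatten(s)
proj_a = (flat, 2)    # s|<a>: pointer to s, offset 2
proj_ab = (flat, 4)   # s|<ab>: pointer to s, offset 4
print(postfix_view(*proj_a))   # <(abc)(ac)d(cf)>
print(postfix_view(*proj_ab))  # <(_c)(ac)d(cf)>
```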
28
Pseudo-Projection vs. Physical Projection

Pseudo-projection avoids physically copying
postfixes
Efficient in running time and space when database
can be held in main memory
However, it is not efficient when database cannot
fit in main memory
Disk-based random accessing is very costly
Suggested Approach
Integration of physical and pseudo-projection
Swapping to pseudo-projection when the data set
fits in memory

29
Performance on Data Set C10T8S8I8
30
CloSpan Mining Closed Sequential Patterns

A closed sequential pattern s: there exists no
superpattern s' such that s' ⊃ s, and s' and s
have the same support
Motivation: reduces the number of (redundant)
patterns but attains the same expressive power
Using Backward Subpattern and Backward
Superpattern pruning to prune redundant search
space
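
The definition of a closed pattern can be illustrated with a post-processing filter. This is not CloSpan, which prunes during the search; it is a naive check applied to an already mined result. It assumes single-event elements, and the names is_subsequence and closed_patterns and the support table are hypothetical.

```python
def is_subsequence(pattern, seq):
    """True if pattern occurs in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def closed_patterns(supports):
    """Keep patterns with no proper superpattern of equal support."""
    return {
        p: n for p, n in supports.items()
        if not any(q != p and m == n and is_subsequence(p, q)
                   for q, m in supports.items())
    }

# Hypothetical mined result: pattern -> support count.
supports = {
    ("a",): 3, ("b",): 3, ("c",): 4,
    ("a", "b"): 2, ("a", "c"): 3, ("b", "c"): 3,
    ("a", "b", "c"): 2,
}
closed = closed_patterns(supports)
print(sorted(closed.items()))
```

Here <a> is dropped because <a c> has the same support, while <c> stays because no superpattern reaches its support of 4.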

31
Constraint-Based Seq.-Pattern Mining

Constraint-based sequential pattern mining
Constraints: User-specified, for focused mining
of desired patterns
How to explore efficient mining with constraints?
Optimization
Classification of constraints
Anti-monotone: E.g., value_sum(S) < 150, min(S) >
10
Monotone: E.g., count(S) > 5, S ⊇ {PC,
digital_camera}
Succinct: E.g., length(S) ≥ 10, S ∈ {Pentium,
MS/Office, MS/Money}
Convertible: E.g., value_avg(S) < 25, profit_sum(S)
> 160, max(S)/avg(S) < 2, median(S) - min(S) > 5
Inconvertible: E.g., avg(S) - median(S) = 0
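
The difference between anti-monotone and monotone constraints can be checked by brute force on a tiny item set. The prices below are hypothetical non-negative values, the count threshold is lowered to 2 so that three items suffice, and S is treated as a set of items for simplicity.

```python
from itertools import combinations

# Hypothetical non-negative item values.
price = {"PC": 120, "digital_camera": 80, "MS/Office": 40}

def value_sum(S):
    return sum(price[item] for item in S)

items = sorted(price)
all_sets = [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

# Anti-monotone: once value_sum(S) < 150 fails, it fails for every superset,
# so the search below S can be pruned.
violators = [S for S in all_sets if not value_sum(S) < 150]
anti_monotone = all(not value_sum(T) < 150
                    for S in violators for T in all_sets if S <= T)

# Monotone: once count(S) >= 2 holds, it holds for every superset,
# so it never needs to be rechecked.
satisfiers = [S for S in all_sets if len(S) >= 2]
monotone = all(len(T) >= 2 for S in satisfiers for T in all_sets if S <= T)

print(anti_monotone, monotone)  # True True
```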

32
From Sequential Patterns to Structured Patterns

Sets, sequences, trees, graphs, and other
structures
Transaction DB: Sets of items
{{i1, i2, ..., im}, ...}
Seq. DB: Sequences of sets
{<{i1, i2}, ..., {im, in, ik}>, ...}
Sets of Sequences
{{<i1, i2>, ..., <im, in, ik>}, ...}
Sets of trees: {t1, t2, ..., tn}
Sets of graphs (mining for frequent subgraphs):
{g1, g2, ..., gn}
Mining structured patterns in XML documents,
bio-chemical structures, etc.

33
Episodes and Episode Pattern Mining

Other methods for specifying the kinds of
patterns
Serial episodes: A → B
Parallel episodes: A & B
Regular expressions: (A | B)C*(D → E)
Methods for episode pattern mining
Variations of Apriori-like algorithms, e.g., GSP
Database projection-based pattern growth
Similar to the frequent pattern growth without
candidate generation

34
Periodicity Analysis

Periodicity is everywhere: tides, seasons, daily
power consumption, etc.
Full periodicity
Every point in time contributes (precisely or
approximately) to the periodicity
Partial periodicity: A more general notion
Only some segments contribute to the periodicity
Jim reads NY Times 7:00-7:30 am every week day
Cyclic association rules
Associations which form cycles
Methods
Full periodicity: FFT, other statistical analysis
methods
Partial and cyclic periodicity: Variations of
Apriori-like mining methods
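
For full periodicity, a frequency-domain analysis picks out the dominant period. The sketch below uses a naive O(n^2) discrete Fourier transform from the standard library instead of an FFT, purely to stay self-contained; the function name dominant_period and the synthetic signal are assumptions.

```python
import cmath
import math

def dominant_period(x):
    """Period (in samples) of the strongest non-constant frequency in x."""
    n = len(x)
    mean = sum(x) / n
    best_k, best_mag = None, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum((x[t] - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return n / best_k if best_k else None

# Synthetic signal: 48 samples that repeat every 12 samples.
signal = [math.sin(2 * math.pi * t / 12) for t in range(48)]
print(dominant_period(signal))  # 12.0
```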

35
Ref: Mining Sequential Patterns

R. Srikant and R. Agrawal. Mining sequential
patterns: Generalizations and performance
improvements. EDBT'96.
H. Mannila, H. Toivonen, and A. I. Verkamo.
Discovery of frequent episodes in event
sequences. DAMI'97.
M. Zaki. SPADE: An Efficient Algorithm for Mining
Frequent Sequences. Machine Learning, 2001.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and
M.-C. Hsu. PrefixSpan: Mining Sequential Patterns
Efficiently by Prefix-Projected Pattern Growth.
ICDE'01 (TKDE'04).
J. Pei, J. Han, and W. Wang. Constraint-Based
Sequential Pattern Mining in Large Databases.
CIKM'02.
X. Yan, J. Han, and R. Afshar. CloSpan: Mining
Closed Sequential Patterns in Large Datasets.
SDM'03.
J. Wang and J. Han. BIDE: Efficient Mining of
Frequent Closed Sequences. ICDE'04.
H. Cheng, X. Yan, and J. Han. IncSpan:
Incremental Mining of Sequential Patterns in
Large Database. KDD'04.
J. Han, G. Dong, and Y. Yin. Efficient Mining of
Partial Periodic Patterns in Time Series
Database. ICDE'99.
J. Yang, W. Wang, and P. S. Yu. Mining
asynchronous periodic patterns in time series
data. KDD'00.