Classification - PowerPoint PPT Presentation

Provided by: jiaw198
Learn more at: https://www.cs.bu.edu
1
Classification
2
Sequence Data
Sequence Database
3
Examples of Sequence Data
Element (Transaction)
Event (Item)
E1E2
E1E3
E2
E3E4
E2
Sequence
4
Formal Definition of a Sequence
  • A sequence is an ordered list of elements
    (transactions)
  • s = <e1 e2 e3 ...>
  • Each element contains a collection of events
    (items)
  • ei = {i1, i2, ..., ik}
  • Each element is attributed to a specific time or
    location
  • Length of a sequence, s, is given by the number
    of elements of the sequence
  • A k-sequence is a sequence that contains k events
    (items)

5
Formal Definition of a Subsequence
  • A sequence <a1 a2 ... an> is contained in another
    sequence <b1 b2 ... bm> (m ≥ n) if there exist
    integers i1 < i2 < ... < in such that a1 ⊆ bi1,
    a2 ⊆ bi2, ..., an ⊆ bin
  • The support of a subsequence w is defined as the
    fraction of data sequences that contain w
  • A sequential pattern is a frequent subsequence
    (i.e., a subsequence whose support is ≥ minsup)
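The containment and support definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the helper names `seq`, `contains` and `support` and the three-sequence toy database are assumptions made for the example.

```python
from typing import FrozenSet, List

Element = FrozenSet[str]
Seq = List[Element]


def seq(*elements: str) -> Seq:
    """Build a sequence from strings, one string per element: seq("ab", "c") is <{a b} {c}>."""
    return [frozenset(e) for e in elements]


def contains(data: Seq, pattern: Seq) -> bool:
    """True if each pattern element is a subset of a distinct data element, in order."""
    pos = 0
    for wanted in pattern:
        # Greedily take the earliest data element that covers this pattern element.
        while pos < len(data) and not wanted <= data[pos]:
            pos += 1
        if pos == len(data):
            return False
        pos += 1
    return True


def support(db: List[Seq], pattern: Seq) -> float:
    """Fraction of data sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in db) / len(db)


# Hypothetical toy database of three data sequences.
db = [seq("ab", "c", "d"), seq("a", "bc"), seq("b", "d")]
print(support(db, seq("a", "c")))  # two of the three sequences contain <{a} {c}>
print(support(db, seq("ab")))      # only the first sequence contains <{a b}>
```

With minsup = 0.5, <{a} {c}> would count as a sequential pattern in this toy database and <{a b}> would not.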

6
Sequential Pattern Mining: Definition
  • Given:
  • a database of sequences
  • a user-specified minimum support threshold,
    minsup
  • Task:
  • Find all subsequences with support ≥ minsup

7
Sequential Pattern Mining: Challenge
  • Given a sequence: <a b c d e f g h i>
  • Examples of subsequences:
  • <a c d f g>, <c d e>, <b g>, etc.
  • How many k-subsequences can be extracted from a
    given n-sequence?
  • <a b c d e f g h i>, n = 9
  • k = 4: Y _ _ Y Y _ _ _ Y
  • <a d e i>
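One way to answer the counting question above: a k-subsequence is fixed by choosing which k of the n event positions to keep, so there are C(n, k) choices. The sketch below checks this for n = 9 and k = 4; the variable names are illustrative only.

```python
from itertools import combinations
from math import comb

events = list("abcdefghi")  # the 9 events of the example sequence
n, k = len(events), 4

# Every k-subsequence keeps k of the n positions in their original order.
picks = list(combinations(range(n), k))

# The Y _ _ Y Y _ _ _ Y mask keeps positions 0, 3, 4 and 8.
mask = (0, 3, 4, 8)
chosen = [events[i] for i in mask]

print(len(picks), comb(n, k))  # 126 126
print(chosen)                  # ['a', 'd', 'e', 'i']
```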

8
Challenges on Sequential Pattern Mining
  • A huge number of possible sequential patterns are
    hidden in databases
  • A mining algorithm should
  • find the complete set of patterns, when possible,
    satisfying the minimum support (frequency)
    threshold
  • be highly efficient, scalable, involving only a
    small number of database scans
  • be able to incorporate various kinds of
    user-specific constraints

9
Sequential Pattern Mining Algorithms
  • Concept introduction and an initial Apriori-like
    algorithm
  • Agrawal & Srikant. Mining sequential patterns,
    ICDE'95
  • Apriori-based method: GSP (Generalized Sequential
    Patterns: Srikant & Agrawal @ EDBT'96)
  • Pattern-growth methods: FreeSpan & PrefixSpan
    (Han et al. @ KDD'00; Pei, et al. @ ICDE'01)
  • Vertical format-based mining: SPADE (Zaki @ Machine
    Learning'00)
  • Constraint-based sequential pattern mining
    (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei,
    Han, Wang @ CIKM'02)
  • Mining closed sequential patterns: CloSpan (Yan,
    Han & Afshar @ SDM'03)

10
Extracting Sequential Patterns
  • Given n events: i1, i2, i3, ..., in
  • Candidate 1-subsequences:
  • <{i1}>, <{i2}>, <{i3}>, ..., <{in}>
  • Candidate 2-subsequences:
  • <{i1, i2}>, <{i1, i3}>, ..., <{i1} {i1}>, <{i1}
    {i2}>, ..., <{in-1} {in}>
  • Candidate 3-subsequences:
  • <{i1, i2, i3}>, <{i1, i2, i4}>, ..., <{i1, i2}
    {i1}>, <{i1, i2} {i2}>, ...,
  • <{i1} {i1, i2}>, <{i1} {i1, i3}>, ..., <{i1} {i1}
    {i1}>, <{i1} {i1} {i2}>, ...

11
Generalized Sequential Pattern (GSP)
  • Step 1
  • Make the first pass over the sequence database D
    to yield all the 1-element frequent sequences
  • Step 2
  • Repeat until no new frequent sequences are found
  • Candidate Generation
  • Merge pairs of frequent subsequences found in the
    (k-1)th pass to generate candidate sequences that
    contain k items
  • Candidate Pruning
  • Prune candidate k-sequences that contain
    infrequent (k-1)-subsequences
  • Support Counting
  • Make a new pass over the sequence database D to
    find the support for these candidate sequences
  • Candidate Elimination
  • Eliminate candidate k-sequences whose actual
    support is less than minsup

12
Candidate Generation
  • Base case (k = 2):
  • Merging two frequent 1-sequences <{i1}> and
    <{i2}> will produce two candidate 2-sequences:
    <{i1} {i2}> and <{i1 i2}>
  • General case (k > 2):
  • A frequent (k-1)-sequence w1 is merged with
    another frequent (k-1)-sequence w2 to produce a
    candidate k-sequence if the subsequence obtained
    by removing the first event in w1 is the same as
    the subsequence obtained by removing the last
    event in w2
  • The resulting candidate after merging is given
    by the sequence w1 extended with the last event
    of w2.
  • If the last two events in w2 belong to the same
    element, then the last event in w2 becomes part
    of the last element in w1
  • Otherwise, the last event in w2 becomes a
    separate element appended to the end of w1

13
Candidate Generation Examples
  • Merging the sequences w1 = <{1} {2 3} {4}> and w2 =
    <{2 3} {4 5}> will produce the candidate
    sequence <{1} {2 3} {4 5}> because the last two
    events in w2 (4 and 5) belong to the same element
  • Merging the sequences w1 = <{1} {2 3} {4}> and w2 =
    <{2 3} {4} {5}> will produce the candidate
    sequence <{1} {2 3} {4} {5}> because the last
    two events in w2 (4 and 5) do not belong to the
    same element
  • We do not have to merge the sequences w1 = <{1}
    {2 6} {4}> and w2 = <{1} {2} {4 5}> to produce
    the candidate <{1} {2 6} {4 5}> because if the
    latter is a viable candidate, then it can be
    obtained by merging w1 with <{2 6} {4 5}>
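The general-case merge rule (k > 2) can be sketched as follows. This is an illustrative sketch, not the GSP reference implementation: sequences are tuples of elements, each element a sorted tuple of events, and the element grouping used in the checks (for example <{1} {2 3} {4}>) is an assumption about how the examples are bracketed. The function names are made up for the example.

```python
from typing import Optional, Tuple

Seq = Tuple[Tuple[int, ...], ...]  # elements in order, events sorted inside each element


def drop_first(s: Seq) -> Seq:
    """Remove the first event; drop the first element entirely if it becomes empty."""
    head = s[0][1:]
    return ((head,) if head else ()) + s[1:]


def drop_last(s: Seq) -> Seq:
    """Remove the last event; drop the last element entirely if it becomes empty."""
    tail = s[-1][:-1]
    return s[:-1] + ((tail,) if tail else ())


def merge(w1: Seq, w2: Seq) -> Optional[Seq]:
    """Return the candidate produced by merging w1 with w2, or None if they do not merge."""
    if drop_first(w1) != drop_last(w2):
        return None
    last = w2[-1][-1]
    if len(w2[-1]) > 1:
        # The last two events of w2 share an element: extend the last element of w1.
        return w1[:-1] + (w1[-1] + (last,),)
    # Otherwise the last event of w2 becomes a new element at the end of w1.
    return w1 + ((last,),)


w1 = ((1,), (2, 3), (4,))
print(merge(w1, ((2, 3), (4, 5))))     # ((1,), (2, 3), (4, 5))
print(merge(w1, ((2, 3), (4,), (5,)))) # ((1,), (2, 3), (4,), (5,))

# These two do not satisfy the merge condition, so no candidate is produced.
print(merge(((1,), (2, 6), (4,)), ((1,), (2,), (4, 5))))  # None
```

The base case k = 2 is not handled here; it needs the separate rule stated for 1-sequences.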

14
GSP Example
15
Finding Length-1 Sequential Patterns
  • Examine GSP using an example
  • Initial candidates: all singleton sequences
  • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan database once, count support for candidates

16
GSP: Generating Length-2 Candidates
51 length-2 candidates
Without Apriori property, 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of candidates
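The candidate counts can be reproduced with a short calculation. The sketch assumes that 6 of the 8 singleton items turn out to be frequent, which is the value consistent with 51 candidates; the function name is illustrative.

```python
def length2_candidates(n_items: int) -> int:
    """Count length-2 candidates over n_items items.

    n_items * n_items ordered pairs <x y> (x may equal y), plus
    n_items * (n_items - 1) / 2 unordered pairs <(x y)> inside one element.
    """
    return n_items * n_items + n_items * (n_items - 1) // 2


without_pruning = length2_candidates(8)  # all 8 singleton items a..h
with_pruning = length2_candidates(6)     # assumption: 6 items are frequent
pruned_pct = 100 * (without_pruning - with_pruning) / without_pruning

print(without_pruning, with_pruning, round(pruned_pct, 2))  # 92 51 44.57
```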
17
The GSP Mining Process
min_sup = 2
18
Candidate Generate-and-test: Drawbacks
  • A huge set of candidate sequences is generated.
  • Especially 2-item candidate sequences.
  • Multiple scans of the database are needed.
  • The length of each candidate grows by one at each
    database scan.
  • Inefficient for mining long sequential patterns.
  • A long pattern grows from short patterns.
  • The number of short patterns is exponential in
    the length of mined patterns.

19
The SPADE Algorithm
  • SPADE (Sequential PAttern Discovery using
    Equivalence classes), developed by Zaki (2001)
  • A vertical format sequential pattern mining
    method
  • A sequence database is mapped to a large set of
  • Item: <SID, EID>
  • Sequential pattern mining is performed by
  • growing the subsequences (patterns) one item at a
    time by Apriori candidate generation
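The vertical format can be illustrated with id-lists of (SID, EID) pairs that are joined to grow a pattern by one item. This is a rough sketch of the idea, not SPADE itself: the toy database is hypothetical, the element position stands in for the EID, and the function names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

IdList = Set[Tuple[int, int]]  # (SID, EID) pairs


def to_vertical(db: Dict[int, List[Set[str]]]) -> Dict[str, IdList]:
    """Map each item to the set of (sequence id, element position) pairs where it occurs."""
    vertical: Dict[str, IdList] = defaultdict(set)
    for sid, sequence in db.items():
        for eid, element in enumerate(sequence, start=1):
            for item in element:
                vertical[item].add((sid, eid))
    return dict(vertical)


def temporal_join(first: IdList, then: IdList) -> IdList:
    """Id-list of 'first, later followed by then'; keeps the EID of the later event."""
    return {(s2, e2) for (s1, e1) in first for (s2, e2) in then
            if s1 == s2 and e2 > e1}


def itemset_join(x: IdList, y: IdList) -> IdList:
    """Id-list of x and y occurring in the same element."""
    return x & y


def support(idlist: IdList) -> int:
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in idlist})


# Hypothetical toy database: SID -> ordered list of elements.
db = {1: [{"a"}, {"b", "c"}, {"a"}], 2: [{"a", "b"}, {"c"}], 3: [{"b"}, {"a"}]}
v = to_vertical(db)
print(support(temporal_join(v["a"], v["c"])))  # <a c> occurs in sequences 1 and 2
print(support(itemset_join(v["a"], v["b"])))   # <(a b)> occurs only in sequence 2
```

Support is read off the id-list directly, so counting needs no further pass over the original sequences.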

20
The SPADE Algorithm
21
Bottlenecks of GSP and SPADE
  • A huge set of candidates could be generated
  • 1,000 frequent length-1 sequences generate a huge
    number of length-2 candidates!
  • Multiple scans of database in mining
  • Mining long sequential patterns
  • Needs an exponential number of short candidates
  • A length-100 sequential pattern needs 10^30
    candidate sequences!
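The two magnitudes above can be checked with quick arithmetic. The sketch assumes that length-2 candidates are counted as ordered pairs plus same-element pairs, and that the length-100 figure counts every non-empty sub-combination of the 100 items (2^100 - 1); both readings are assumptions about how the figures were derived.

```python
# Length-2 candidates from 1,000 frequent items:
# 1000 * 1000 ordered pairs <x y> plus 1000 * 999 / 2 same-element pairs <(x y)>.
length2 = 1000 * 1000 + 1000 * 999 // 2
print(length2)  # 1499500

# Non-empty sub-combinations of a length-100 pattern: sum of C(100, i) for i = 1..100.
subpatterns = 2**100 - 1
print(f"{subpatterns:.2e}")  # 1.27e+30
```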

22
Prefix and Suffix (Projection)
  • <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of
    sequence <a(abc)(ac)d(cf)>
  • Given sequence <a(abc)(ac)d(cf)>

23
Mining Sequential Patterns by Prefix Projections
  • Step 1: find length-1 sequential patterns
  • <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide search space. The complete set of
    seq. pat. can be partitioned into 6 subsets:
  • The ones having prefix <a>
  • The ones having prefix <b>
  • ...
  • The ones having prefix <f>

24
Finding Seq. Patterns with Prefix <a>
  • Only need to consider projections w.r.t. <a>
  • <a>-projected database: <(abc)(ac)d(cf)>,
    <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 seq. pat. having prefix
    <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  • Further partition into 6 subsets
  • Having prefix <aa>
  • ...
  • Having prefix <af>
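The recursive prefix-projection idea can be sketched in Python under a strong simplification: every element holds exactly one item, so a sequence is just a string and itemset patterns such as <(ab)> are not generated. The toy database and the function name `prefixspan` are assumptions for illustration; this is not the full PrefixSpan algorithm.

```python
from collections import defaultdict
from typing import Dict, List


def prefixspan(db: List[str], min_sup: int) -> Dict[str, int]:
    """Mine frequent subsequences of single-item sequences by prefix projection."""
    patterns: Dict[str, int] = {}

    def mine(prefix: str, projected: List[str]) -> None:
        # Count each item at most once per projected sequence.
        counts: Dict[str, int] = defaultdict(int)
        for suffix in projected:
            for item in set(suffix):
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue
            pattern = prefix + item
            patterns[pattern] = count
            # Project: keep only what follows the first occurrence of the item.
            next_db = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, next_db)

    mine("", db)
    return patterns


# Hypothetical toy database, min_sup given as an absolute count.
print(prefixspan(["abc", "acb", "bac"], min_sup=2))
# {'a': 3, 'ab': 2, 'ac': 3, 'b': 3, 'bc': 2, 'c': 3}
```

No candidates are generated and tested here: each recursive call only counts items that actually occur in the current projected database.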

25
Completeness of PrefixSpan
SDB
Length-1 sequential patterns: <a>, <b>, <c>, <d>,
<e>, <f>
Having prefix <c>, ..., <f>
Having prefix <a>
Having prefix <b>
<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>,
<(_b)(df)cb>, <(_f)cbc>
<b>-projected database

Length-2 sequential patterns: <aa>, <ab>,
<(ab)>, <ac>, <ad>, <af>

Having prefix <aa>
Having prefix <af>

<aa>-proj. db
<af>-proj. db
26
Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected
    databases
  • Can be improved by pseudo-projections

27
Speed-up by Pseudo-projection
  • Major cost of PrefixSpan: projection
  • Postfixes of sequences often appear repeatedly in
    recursive projected databases
  • When (projected) database can be held in main
    memory, use pointers to form projections
  • Pointer to the sequence
  • Offset of the postfix

s = <a(abc)(ac)d(cf)>
s|<a>: (pointer to s, 2), i.e., postfix <(abc)(ac)d(cf)>
s|<ab>: (pointer to s, 4), i.e., postfix <(_c)(ac)d(cf)>
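A pseudo-projection stores a (sequence, offset) pair instead of a copy of the postfix. The sketch below flattens the example sequence into the string "aabcacdcf", which ignores element boundaries, and uses 0-based offsets, so the offsets 2 and 4 above appear as 1 and 3. The second sequence and the names `Pointer` and `project` are assumptions made for the example.

```python
from typing import List, Tuple

Pointer = Tuple[int, int]  # (index of the sequence in db, offset where the postfix starts)


def project(db: List[str], pointers: List[Pointer], item: str) -> List[Pointer]:
    """Advance each pointer past the next occurrence of item; drop sequences without it."""
    projected = []
    for sid, offset in pointers:
        pos = db[sid].find(item, offset)
        if pos != -1:
            projected.append((sid, pos + 1))
    return projected


# First entry: the example sequence with element boundaries flattened away.
# Second entry: a hypothetical sequence that does not contain 'a'.
db = ["aabcacdcf", "efg"]
start = [(sid, 0) for sid in range(len(db))]

after_a = project(db, start, "a")
after_ab = project(db, after_a, "b")
print(after_a, after_ab)  # [(0, 1)] [(0, 3)]

# The postfix is only materialized when it is actually needed.
sid, offset = after_ab[0]
print(db[sid][offset:])   # cacdcf
```

Only two integers are stored per projected sequence, which is why this pays off when the database fits in main memory.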
28
Pseudo-Projection vs. Physical Projection
  • Pseudo-projection avoids physically copying
    postfixes
  • Efficient in running time and space when database
    can be held in main memory
  • However, it is not efficient when database cannot
    fit in main memory
  • Disk-based random accessing is very costly
  • Suggested Approach
  • Integration of physical and pseudo-projection
  • Swapping to pseudo-projection when the data set
    fits in memory

29
Performance on Data Set C10T8S8I8
30
CloSpan: Mining Closed Sequential Patterns
  • A closed sequential pattern s: there exists no
    superpattern s' such that s' ⊃ s, and s' and s
    have the same support
  • Motivation: reduces the number of (redundant)
    patterns but attains the same expressive power
  • Using Backward Subpattern and Backward
    Superpattern pruning to prune redundant search
    space
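The definition of a closed pattern can be illustrated by filtering an already mined pattern set. Note that this is a plain post-filter, not CloSpan's backward subpattern and superpattern pruning, which avoids exploring the redundant search space in the first place. The pattern set, with one item per element and patterns written as strings, is hypothetical.

```python
from typing import Dict


def is_subsequence(small: str, big: str) -> bool:
    """True if the items of small appear in big in the same order."""
    remaining = iter(big)
    return all(item in remaining for item in small)


def closed_patterns(patterns: Dict[str, int]) -> Dict[str, int]:
    """Keep patterns that have no proper superpattern with the same support."""
    return {
        p: sup
        for p, sup in patterns.items()
        if not any(
            len(q) > len(p) and sup_q == sup and is_subsequence(p, q)
            for q, sup_q in patterns.items()
        )
    }


# Hypothetical frequent patterns with their support counts.
frequent = {"a": 3, "ab": 2, "ac": 3, "b": 3, "bc": 2, "c": 3}
print(closed_patterns(frequent))  # {'ab': 2, 'ac': 3, 'b': 3, 'bc': 2}
```

Here <a> and <c> are dropped because <ac> has the same support, so their supports can be recovered from the closed set.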

31
Constraint-Based Seq.-Pattern Mining
  • Constraint-based sequential pattern mining
  • Constraints: User-specified, for focused mining
    of desired patterns
  • How to explore efficient mining with constraints?
    Optimization
  • Classification of constraints
  • Anti-monotone: E.g., value_sum(S) < 150, min(S) >
    10
  • Monotone: E.g., count(S) > 5, S ⊇ {PC,
    digital_camera}
  • Succinct: E.g., length(S) ≥ 10, S ∈ {Pentium,
    MS/Office, MS/Money}
  • Convertible: E.g., value_avg(S) < 25, profit_sum
    (S) > 160, max(S)/avg(S) < 2, median(S) - min(S)
    > 5
  • Inconvertible: E.g., avg(S) - median(S) = 0
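The anti-monotone and monotone classes can be checked by brute force on a tiny example. The sketch treats a pattern as a plain set of items, uses hypothetical non-negative item values, and scales count(S) > 5 down to count(S) > 2 so that a four-item universe suffices; all names are illustrative.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

prices = {"pen": 20, "book": 60, "lamp": 90, "desk": 140}  # hypothetical item values

ItemSet = FrozenSet[str]
Predicate = Callable[[ItemSet], bool]


def value_sum_ok(items: ItemSet) -> bool:
    return sum(prices[i] for i in items) < 150  # value_sum(S) < 150


def count_ok(items: ItemSet) -> bool:
    return len(items) > 2  # scaled-down stand-in for count(S) > 5


def subsets(universe: Iterable[str]) -> List[ItemSet]:
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def is_anti_monotone(pred: Predicate, universe: Iterable[str]) -> bool:
    """If S violates the constraint, every superset of S violates it too."""
    sets = subsets(universe)
    return all(pred(s) or not pred(t) for s in sets for t in sets if s <= t)


def is_monotone(pred: Predicate, universe: Iterable[str]) -> bool:
    """If S satisfies the constraint, every superset of S satisfies it too."""
    sets = subsets(universe)
    return all(not pred(s) or pred(t) for s in sets for t in sets if s <= t)


print(is_anti_monotone(value_sum_ok, prices), is_monotone(value_sum_ok, prices))  # True False
print(is_anti_monotone(count_ok, prices), is_monotone(count_ok, prices))          # False True
```

An anti-monotone constraint lets a miner stop growing a pattern as soon as it is violated, which is what makes it cheap to push into the mining process.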

32
From Sequential Patterns to Structured Patterns
  • Sets, sequences, trees, graphs, and other
    structures
  • Transaction DB: Sets of items
  • {{i1, i2, ..., im}, ...}
  • Seq. DB: Sequences of sets
  • {<{i1, i2}, ..., {im, in, ik}>, ...}
  • Sets of Sequences
  • {{<i1, i2>, ..., <im, in, ik>}, ...}
  • Sets of trees: {t1, t2, ..., tn}
  • Sets of graphs (mining for frequent subgraphs)
  • {g1, g2, ..., gn}
  • Mining structured patterns in XML documents,
    bio-chemical structures, etc.

33
Episodes and Episode Pattern Mining
  • Other methods for specifying the kinds of
    patterns
  • Serial episodes: A → B
  • Parallel episodes: A & B
  • Regular expressions: (A | B)C*(D → E)
  • Methods for episode pattern mining
  • Variations of Apriori-like algorithms, e.g., GSP
  • Database projection-based pattern growth
  • Similar to the frequent pattern growth without
    candidate generation

34
Periodicity Analysis
  • Periodicity is everywhere: tides, seasons, daily
    power consumption, etc.
  • Full periodicity
  • Every point in time contributes (precisely or
    approximately) to the periodicity
  • Partial periodicity: A more general notion
  • Only some segments contribute to the periodicity
  • Jim reads NY Times 7:00-7:30 am every week day
  • Cyclic association rules
  • Associations which form cycles
  • Methods
  • Full periodicity: FFT, other statistical analysis
    methods
  • Partial and cyclic periodicity: Variations of
    Apriori-like mining methods
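Partial periodicity can be illustrated with a simple frequency count per offset within a fixed period. This is only a sketch of the notion, not one of the Apriori-like mining methods mentioned above; the symbol series, the period and the confidence threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def partial_periodic(series: str, period: int, min_conf: float) -> Dict[int, Tuple[str, float]]:
    """For each offset in the period, report the dominant symbol if it recurs often enough."""
    n_periods = len(series) // period
    found: Dict[int, Tuple[str, float]] = {}
    for offset in range(period):
        counts = Counter(series[k * period + offset] for k in range(n_periods))
        symbol, hits = counts.most_common(1)[0]
        confidence = hits / n_periods
        if confidence >= min_conf:
            found[offset] = (symbol, confidence)
    return found


# Hypothetical series of 4 periods of length 3: abc, abd, aec, abf.
print(partial_periodic("abcabdaecabf", period=3, min_conf=0.75))
# {0: ('a', 1.0), 1: ('b', 0.75)}
```

Only offsets 0 and 1 show a regular symbol, so the series is partially periodic: the third slot of each period contributes nothing to the pattern.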

35
Ref: Mining Sequential Patterns
  • R. Srikant and R. Agrawal. Mining sequential
    patterns: Generalizations and performance
    improvements. EDBT'96.
  • H. Mannila, H. Toivonen, and A. I. Verkamo.
    Discovery of frequent episodes in event
    sequences. DAMI'97.
  • M. Zaki. SPADE: An Efficient Algorithm for Mining
    Frequent Sequences. Machine Learning, 2001.
  • J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and
    M.-C. Hsu. PrefixSpan: Mining Sequential Patterns
    Efficiently by Prefix-Projected Pattern Growth.
    ICDE'01 (TKDE'04).
  • J. Pei, J. Han and W. Wang, Constraint-Based
    Sequential Pattern Mining in Large Databases,
    CIKM'02.
  • X. Yan, J. Han, and R. Afshar. CloSpan: Mining
    Closed Sequential Patterns in Large Datasets.
    SDM'03.
  • J. Wang and J. Han, BIDE: Efficient Mining of
    Frequent Closed Sequences, ICDE'04.
  • H. Cheng, X. Yan, and J. Han, IncSpan:
    Incremental Mining of Sequential Patterns in
    Large Database, KDD'04.
  • J. Han, G. Dong and Y. Yin, Efficient Mining of
    Partial Periodic Patterns in Time Series
    Database, ICDE'99.
  • J. Yang, W. Wang, and P. S. Yu, Mining
    asynchronous periodic patterns in time series
    data, KDD'00.