Title: Jen-Wei Huang
1
- Jen-Wei Huang
- jwhuang_at_gmail.com
- National Taiwan University
3 http://www.wretch.cc/blog/EtudeBIKE
4 http://www.giant-bicycles.com/zh-TW/
7 http://cape7.pixnet.net/blog
8 http://cape7.pixnet.net/blog
9 http://cape7.pixnet.net/blog
10 http://www.wretch.cc/blog/orzboyz
http://blog.sina.com.tw/9winds/
http://atomcinema.pixnet.net/blog
12 http://www.amazon.com
13 http://www.amazon.com
14 http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html
15A General Model for Sequential Pattern Mining with a Progressive Database
- Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen
- National Taiwan University
IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008
16Outlines
- Introduction
- Preliminaries
- Algorithm Pisa
- Experiments
- Conclusions
- Q & A
17Introduction to SPM
- Mining of frequently occurring patterns related to time or other sequences.
- J. Han, Data Mining: Concepts and Techniques
- Given a set of sequences, find the complete set of frequent subsequences.
- J. Pei, PrefixSpan
- Ex) Which items will a customer buy after having bought certain other items?
18Time-related data
- Customers buying behavior
- Natural phenomena
- Sensor network data
- Web access patterns
- Stock price changes
- DNA sequence applications
19Definition
- Let I = {x1, x2, ..., xn} be a set of different items.
- An element e, denoted by (xi xj ...), is a subset of items ⊆ I that appear in a sequence at the same time.
- A sequence s, denoted by <e1, e2, ..., em>, is an ordered list of elements.
- A sequence database Db contains a set of sequences, and |Db| represents the number of sequences in Db.
20Definition
- A sequence α = <a1, a2, ..., an> is a subsequence of another sequence β = <b1, b2, ..., bm> if
- there exists a set of integers,
- 1 ≤ i1 < i2 < ... < in ≤ m, such that
- a1 ⊆ bi1, a2 ⊆ bi2, ..., and an ⊆ bin.
21Definition
- The sequential pattern mining problem can be defined as
- "Given a sequence database, Db, and a user-defined minimum support, min_sup, find the complete set of subsequences whose occurrence frequencies ≥ min_sup × |Db|."
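The subsequence and support definitions above can be sketched in a few lines of Python. This is only an illustration: the names `is_subsequence`, `support`, `is_frequent`, the helper `seq`, and the three-sequence database are assumptions made for the example, not part of the talk.

```python
from typing import FrozenSet, List

Seq = List[FrozenSet[str]]  # a sequence is an ordered list of elements (itemsets)

def is_subsequence(alpha: Seq, beta: Seq) -> bool:
    """True if every element of alpha is a subset of a distinct element of beta, in order."""
    j = 0
    for a in alpha:
        # advance to the first remaining element of beta that contains a
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

def support(pattern: Seq, db: List[Seq]) -> int:
    """Number of sequences in db that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in db)

def is_frequent(pattern: Seq, db: List[Seq], min_sup: float) -> bool:
    """Occurrence frequency >= min_sup * |Db|."""
    return support(pattern, db) >= min_sup * len(db)

def seq(*elements: str) -> Seq:
    """Hypothetical helper: seq("AB", "C") builds <(A B) (C)>."""
    return [frozenset(e) for e in elements]

# Hypothetical database with three sequences
db = [seq("A", "BC", "D"), seq("AB", "C"), seq("B", "D")]
print(support(seq("A", "C"), db))           # 2
print(is_frequent(seq("A", "C"), db, 0.5))  # True, since 2 >= 0.5 * 3
```

Greedy left-to-right matching is enough here because matching an element as early as possible never rules out a later match.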
22Three Categories
- Depending on the management of the corresponding database, sequential pattern mining can be divided into three categories, namely sequential pattern mining with
- a static database.
- an incremental database.
- a progressive database.
23How To Do Sequential Pattern Mining on a Static
Database
24How?
- Apriori-like algorithms
- AprioriAll by Agrawal et al.
- GSP by R. Srikant et al.
- Partition-based algorithms
- FreeSpan by J. Han et al.
- PrefixSpan by J. Pei et al.
- Vertical format algorithms
- SPADE by Zaki et al.
- SPAM by Ayres et al.
25Apriori-like Algorithms
- 1. Sort phase
- Sort the database
- Customer id as the primary key and time as the secondary key
- 2. Litemset phase
- Count the frequency of each itemset
- The fraction of customers who bought the itemset
26Apriori-like Algorithms
- 3. Transformation phase
- Transform each transaction into all the litemsets it contains, in the form of
- C01: <(1,5) (2) (3) (4)>
- C02: <(1) (3) (4) (3,5)>
- C03: <(1) (2) (3) (4)>
- C04: <(1) (3) (5)>
- C05: <(4) (5)>
27[Tables: itemset supports, transactions, and customer sequences]
Itemset     Support
10          3
20          3
30          4
40          3
50          1
60          1
70          4
90          4
10 20       1
40 60       1
40 70       3
60 70       1
40 60 70    1
30 50       1
30 70       1
50 70       1
30 50 70    1
CID  Items
2    10 20
5    90
2    30
2    40 60 70
4    30
3    30 50 70
1    30
1    90
4    40 70
4    90
3    10
5    10
1    40 70
5    20
2    90
3    20
CID  Customer sequence
1    <(30) (90) (40 70)>
2    <(10 20) (30) (40 60 70) (90)>
3    <(30 50 70) (10) (20)>
4    <(30) (40 70) (90)>
5    <(90) (10) (20)>
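The sort and litemset phases can be sketched on the transaction table above. This is a minimal illustration under two assumptions: the rows are already listed in time order (the table shows no time column), and the minimum support is 2 customers, a value inferred from the supports in the example rather than stated in it. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# (customer id, items) rows, assumed to be listed in time order
transactions = [
    (2, {10, 20}), (5, {90}), (2, {30}), (2, {40, 60, 70}), (4, {30}),
    (3, {30, 50, 70}), (1, {30}), (1, {90}), (4, {40, 70}), (4, {90}),
    (3, {10}), (5, {10}), (1, {40, 70}), (5, {20}), (2, {90}), (3, {20}),
]

def sort_phase(rows):
    """Group transactions by customer id, keeping their time order."""
    sequences = defaultdict(list)
    for cid, items in rows:
        sequences[cid].append(frozenset(items))
    return dict(sorted(sequences.items()))

def litemset_phase(sequences, min_count):
    """Count, for every itemset, the customers having a transaction that contains it."""
    support = defaultdict(int)
    for elements in sequences.values():
        seen = set()  # each customer counts at most once per itemset
        for element in elements:
            items = sorted(element)
            for r in range(1, len(items) + 1):
                seen.update(combinations(items, r))
        for itemset in seen:
            support[itemset] += 1
    return {s: n for s, n in support.items() if n >= min_count}

sequences = sort_phase(transactions)
litemsets = litemset_phase(sequences, min_count=2)
print(sorted(litemsets.items()))
```

With these assumptions the surviving litemsets are 10, 20, 30, 40, 70, 90 and (40 70), each with the support listed in the table.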
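28[Tables: litemsets mapped to new IDs, and the transformed customer sequences]
Itemset   Support   New ID
10        3         1
20        3         2
30        4         3
40        3         4
70        4         5
90        4         6
40 70     3         7
CID  Transformed sequence
1    <(3) (6) (4, 5, 7)>
2    <(1, 2) (3) (4, 5, 7) (6)>
3    <(3, 5) (1) (2)>
4    <(3) (4, 5, 7) (6)>
5    <(6) (1) (2)>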
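The transformation step can be sketched as follows: each element of a customer sequence is replaced by the set of IDs of the litemsets it contains. The mapping and sequences are copied from the tables above; the function name `transform` is an assumption for the example.

```python
# Litemset -> new ID, as in the mapping table
litemset_id = {
    frozenset({10}): 1, frozenset({20}): 2, frozenset({30}): 3,
    frozenset({40}): 4, frozenset({70}): 5, frozenset({90}): 6,
    frozenset({40, 70}): 7,
}

customer_sequences = {
    1: [{30}, {90}, {40, 70}],
    2: [{10, 20}, {30}, {40, 60, 70}, {90}],
    3: [{30, 50, 70}, {10}, {20}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}, {10}, {20}],
}

def transform(sequence, mapping):
    """Replace every element by the IDs of all litemsets contained in it."""
    out = []
    for element in sequence:
        ids = {i for itemset, i in mapping.items() if itemset <= element}
        if ids:  # an element that contains no litemset is dropped
            out.append(ids)
    return out

transformed = {cid: transform(s, litemset_id) for cid, s in customer_sequences.items()}
for cid, s in transformed.items():
    print(cid, s)
```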
29Apriori-like Algorithms
- 4. Mining phase
- Apriori-like algorithm
- 5. Maximal phase
- Find the maximal patterns
30[Tables: transformed customer sequences and supports of the candidate 2-sequences]
CID  Transformed sequence
1    <(3) (6) (4, 5, 7)>
2    <(1, 2) (3) (4, 5, 7) (6)>
3    <(3, 5) (1) (2)>
4    <(3) (4, 5, 7) (6)>
5    <(6) (1) (2)>
Support of each candidate 2-sequence <a b> (row: first item a, column: second item b)
a\b   1  2  3  4  5  6  7
1     -  2  1  1  1  1  1
2     0  -  1  1  1  1  1
3     1  1  -  3  3  3  3
4     0  0  0  -  0  2  0
5     1  1  0  0  -  2  0
6     1  1  0  1  1  -  1
7     0  0  0  0  0  2  -
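The support counts of the candidate 2-sequences can be reproduced with a short sketch over the transformed sequences. The minimum support of 2 customers is an assumption inferred from the example, and the names `contains`, `support2` and `frequent2` are hypothetical.

```python
from itertools import permutations

transformed = {
    1: [{3}, {6}, {4, 5, 7}],
    2: [{1, 2}, {3}, {4, 5, 7}, {6}],
    3: [{3, 5}, {1}, {2}],
    4: [{3}, {4, 5, 7}, {6}],
    5: [{6}, {1}, {2}],
}

def contains(sequence, pattern):
    """True if the IDs in pattern occur in strictly later and later elements of sequence."""
    j = 0
    for item in pattern:
        while j < len(sequence) and item not in sequence[j]:
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True

# Support of every ordered pair <a b> with a != b, as on the slide
support2 = {
    (a, b): sum(contains(s, (a, b)) for s in transformed.values())
    for a, b in permutations(range(1, 8), 2)
}
frequent2 = {p: n for p, n in support2.items() if n >= 2}
print(frequent2)
```

The same `contains` check extends to longer candidates, for instance `(3, 4, 6)`, which two customers support.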
31[Table: supports of the frequent 3-sequences in the transformed customer sequences]
3-sequence   Support
<3 4 6>      2
<3 5 6>      2
<3 7 6>      2
Therefore, the frequent sequential patterns are <1 2>, <3 4>, <3 5>, <3 6>, <3 7>, <4 6>, <5 6>, <7 6>, <3 4 6>, <3 5 6>, and <3 7 6>.
According to the mappings, the original frequent sequential patterns are <10 20>, <30 40>, <30 70>, <30 90>, <30 (40 70)>, <40 90>, <70 90>, <(40 70) 90>, <30 40 90>, <30 70 90>, and <30 (40 70) 90>.
32According to the mappings, the original frequent sequential patterns are <10 20>, <30 40>, <30 70>, <30 90>, <30 (40 70)>, <40 90>, <70 90>, <(40 70) 90>, <30 40 90>, <30 70 90>, and <30 (40 70) 90>.
Because <30 40> and <30 70> are contained in <30 (40 70)>, <40 90> and <70 90> are contained in <(40 70) 90>, and <30 90>, <30 40 90>, <30 70 90>, <30 (40 70)> and <(40 70) 90> are all contained in <30 (40 70) 90>,
the final maximal sequential patterns are <10 20> and <30 (40 70) 90>.
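The maximal phase can be sketched as a containment filter: a pattern is kept only if no other frequent pattern contains it as a subsequence. The helper names `fs` and `contained_in` are assumptions for the example; the pattern list is the mapped-back list of frequent patterns, with (40 70) written as a two-item element.

```python
def fs(*items):
    """Hypothetical shorthand for one element (itemset) of a pattern."""
    return frozenset(items)

frequent = [
    [fs(10), fs(20)], [fs(30), fs(40)], [fs(30), fs(70)], [fs(30), fs(90)],
    [fs(30), fs(40, 70)], [fs(40), fs(90)], [fs(70), fs(90)], [fs(40, 70), fs(90)],
    [fs(30), fs(40), fs(90)], [fs(30), fs(70), fs(90)], [fs(30), fs(40, 70), fs(90)],
]

def contained_in(alpha, beta):
    """True if alpha is a subsequence of beta (element-wise subset, order kept)."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

# Keep a pattern only if no other frequent pattern contains it
maximal = [
    p for p in frequent
    if not any(p is not q and contained_in(p, q) for q in frequent)
]
print(maximal)
```

Under this containment test only <10 20> and <30 (40 70) 90> remain, because every other pattern in the list, including <30 90> and <(40 70) 90>, is a subsequence of <30 (40 70) 90>.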
33Related Works
- Static database
- AprioriAll by Agrawal et al.
- GSP by R. Srikant et al.
- SPADE by Zaki et al.
- FreeSpan by J. Han et al.
- PrefixSpan by J. Pei et al.
- SPAM by Ayres et al.
34Related Works
- Incremental database
- ISM by Parthasarathy et al.
- IncSP by Lin et al.
- ISE by Masseglia et al.
- IncSpan by Cheng et al.
- MILE by Chen et al.
35Motivation
- The assumption of having a static database may not hold in practice.
- The data in the real world change on the fly.
- Sequential patterns found in an incremental database may be of little interest to the users.
- Users are usually more interested in recent data than in old data.
36Motivation
- If a certain sequence does not have any newly arriving elements, this sequence will still stay in the database and undesirably contribute to |Db|.
- New sequential patterns which appear frequently in the recent sequences may not be considered as frequent sequential patterns.
37Definition -- Period of Interest
- Period of Interest (abbreviated as POI) is a sliding window
- whose length is a user-specified time interval,
- continuously advancing as time goes by.
- The sequences having elements whose timestamps fall into this period, POI, contribute to the |Db| for current sequential patterns.
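The effect of the POI on |Db| can be sketched as a simple filter: a sequence counts only if at least one of its elements has a timestamp inside the window. The database below is hypothetical, and `db_size` is a name chosen for the example.

```python
def db_size(sequences, p, q):
    """Number of sequences with at least one element whose timestamp lies in [p, q]."""
    return sum(any(p <= t <= q for t in timed) for timed in sequences.values())

# Hypothetical progressive database: sequence ID -> {timestamp: element}
sequences = {
    "S01": {1: "A", 2: "B"},
    "S02": {2: "D", 7: "C"},
    "S03": {8: "A"},
}

poi = 5
for q in range(5, 9):          # the window slides forward one timestamp at a time
    p = q - poi + 1
    print((p, q), db_size(sequences, p, q))
```

In this toy data S01 stops contributing once the window moves past timestamp 2, which is the behavior the motivation slides ask for.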
38[Figure: example progressive sequence database, with the elements of each sequence (SID) laid out along the time axis; POI = 5, min_supp = 0.5]
39Outlines
- Introduction
- Preliminaries
- Algorithm Pisa
- Experiments
- Conclusions
- Q & A
40Progressive Sequential Pattern
- The progressive sequential pattern mining problem is defined as follows:
- "Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database."
41Naïve Algorithm
- Use conventional static sequential pattern mining algorithms to mine sequential patterns separately from all combinations of POIs
- e.g., Db^{1,5}, Db^{2,6}, Db^{3,7}, Db^{4,8}, Db^{5,9}, etc., where Db^{p,q} is the part of the database with timestamps p to q.
- For a sequence database whose elements appear in an interval of n timestamps, the total number of POIs in this interval is equal to (n - POI + 1).
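The naïve approach can be sketched as a loop over all windows that re-runs a static miner on each sub-database. The functions `poi_windows` and `naive_progressive`, the `mine_static` parameter and the toy data are assumptions for illustration; any static algorithm could be passed in as `mine_static`.

```python
def poi_windows(n, poi):
    """All windows [p, p + poi - 1] inside timestamps 1..n; there are n - poi + 1 of them."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]

def naive_progressive(sequences, n, poi, mine_static):
    """Re-mine every sub-database Db^{p,q} from scratch with a static miner."""
    results = {}
    for p, q in poi_windows(n, poi):
        sub_db = [
            [e for t, e in sorted(timed.items()) if p <= t <= q]
            for timed in sequences.values()
        ]
        sub_db = [s for s in sub_db if s]   # sequences without elements in the POI are left out
        results[(p, q)] = mine_static(sub_db)
    return results

# Hypothetical data: sequence ID -> {timestamp: element}
sequences = {"S01": {1: "A", 2: "B"}, "S02": {2: "D", 7: "C"}, "S03": {8: "A"}}

print(poi_windows(9, 5))
# Using len as a stand-in "miner" simply reports |Db^{p,q}| for each window
print(naive_progressive(sequences, 9, 5, mine_static=len))
```

Every window is processed independently, so the work done for Db^{1,5} is not reused for Db^{2,6}; that repetition is the cost the progressive algorithms try to avoid.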
42Prior Work
- The only prior work on progressive databases is GSP+ and MFS+, proposed by Zhang et al. based on the static algorithms GSP and MFS (also derived by the same authors).
- However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS.
- Moreover, the performance improvement of GSP+ and MFS+ over GSP and MFS is only within 15% as reported by their authors.
43Algorithm DirApp
- Stands for Direct Append.
- Consists of two procedures
- Progressively Updating
- abbreviated as PrUp
- Immediately Filtering
- abbreviated as ImFi
44Procedure PrUp
- When progressively reading newly incoming elements, Procedure PrUp can
- update each sequence in the sequence database
- generate candidate sequential patterns
- calculate occurrence frequencies of all candidate sequential patterns in the current POI.
45Procedure ImFi
- DirApp uses Procedure ImFi to
- filter out obsolete data from the existing sequence database
- prune away obsolete candidate sequential patterns from the candidate set.
- report the most up-to-date frequent sequential patterns to the user in every POI
46[Figure: an example sequence with the elements A, B, C, (AD), B]
47Example
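48[Figure: candidate set of the example sequence, steps (1) to (4); X_t denotes candidate pattern X marked with timestamp t]
(1) Db^{1,1}: A_1
(2) Db^{1,2}: A_1, B_2, AB_1
(3) Db^{1,3}: A_1, B_2, AB_1
(4) Db^{1,4}: A_1, B_2, AB_1, C_4, AC_1, BC_2, ABC_1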
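49[Figure: candidate set of the example sequence, steps (4) and (5); X_t denotes candidate pattern X marked with timestamp t]
(4) Db^{1,4}: A_1, B_2, AB_1, C_4, AC_1, BC_2, ABC_1
(5) Db^{1,5}: A_5, B_2, AB_1, C_4, AC_1, BC_2, ABC_1, D_5, (AD)_5, AD_1, A(AD)_1, BA_2, BD_2, B(AD)_2, ABD_1, AB(AD)_1, CA_4, CD_4, C(AD)_4, ACD_1, AC(AD)_1, BCA_2, BCD_2, BC(AD)_2, ABCD_1, ABC(AD)_1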
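50[Figure: candidate set of the example sequence, steps (5) and (6); X_t denotes candidate pattern X marked with timestamp t]
(5) Db^{1,5}: A_5, B_2, AB_1, C_4, AC_1, BC_2, ABC_1, D_5, (AD)_5, AD_1, A(AD)_1, BA_2, BD_2, B(AD)_2, ABD_1, AB(AD)_1, CA_4, CD_4, C(AD)_4, ACD_1, AC(AD)_1, BCA_2, BCD_2, BC(AD)_2, ABCD_1, ABC(AD)_1
(6) Db^{2,6}: A_5, B_2, C_4, BC_2, D_5, (AD)_5, BA_2, BD_2, B(AD)_2, CA_4, CD_4, C(AD)_4, BCA_2, BCD_2, BC(AD)_2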
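51[Figure: candidate set of the example sequence, steps (6) and (7); X_t denotes candidate pattern X marked with timestamp t]
(6) Db^{2,6}: A_5, B_2, C_4, BC_2, D_5, (AD)_5, BA_2, BD_2, B(AD)_2, CA_4, CD_4, C(AD)_4, BCA_2, BCD_2, BC(AD)_2
(7) Db^{3,7}: A_5, C_4, D_5, (AD)_5, CA_4, CD_4, C(AD)_4, B_7, AB_5, CB_4, DB_5, (AD)B_5, CAB_4, CDB_4, C(AD)B_4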
52[Figure: steps (1) to (7) of the example shown together, with the same candidate sets for Db^{1,1}, Db^{1,2}, Db^{1,3}, Db^{1,4}, Db^{1,5}, Db^{2,6} and Db^{3,7} as on slides 48 to 51]
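The single-sequence example can be replayed with a small sketch of the two DirApp ideas: appending each newly arriving candidate element to the existing candidates, and filtering out candidates whose timestamp has left the POI. This is an interpretation rather than the authors' code. The element timestamps (A at 1, B at 2, C at 4, (AD) at 5, B at 7) are read off the candidate sets, POI = 5 is taken from the example setting, and skipping a candidate element that already occurs in the pattern follows the rule stated later for the PS-tree. All names are hypothetical.

```python
from itertools import combinations

def subsets(element):
    """All non-empty candidate elements of an arriving element, e.g. AD -> A, D, (AD)."""
    items = sorted(element)
    return [frozenset(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]

def step(cands, element, t, poi):
    """One timestamp for one sequence: filter obsolete candidates, then append the new element."""
    # keep only candidates whose timestamp still lies in the POI [t - poi + 1, t]
    alive = {p: ts for p, ts in cands.items() if ts > t - poi}
    out = dict(alive)
    for c in subsets(element):
        for pattern, ts in alive.items():
            if c not in pattern:            # do not repeat an element already in the pattern
                out[pattern + (c,)] = ts    # the longer pattern inherits the timestamp
        out[(c,)] = t                       # the element alone starts at the current time
    return out

def label(pattern):
    """Render a pattern like the slides do, e.g. C(AD)B."""
    parts = ("".join(sorted(e)) for e in pattern)
    return "".join(x if len(x) == 1 else f"({x})" for x in parts)

sequence = {1: "A", 2: "B", 4: "C", 5: "AD", 7: "B"}   # timestamp -> arriving element
history, cands = {}, {}
for t in range(1, 8):
    cands = step(cands, sequence.get(t, ""), t, poi=5)
    history[t] = {label(p): ts for p, ts in cands.items()}

print(history[4])   # candidate set of Db^{1,4}
print(history[7])   # candidate set of Db^{3,7}
```

With these assumptions the sketch yields 26 candidates at timestamp 5 and 15 at timestamps 6 and 7, matching the sizes of the candidate sets in the example.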
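53[Figure: example with four sequences, S01 to S04, in Db^{1,2}, |Db^{1,2}| = 4; X_t denotes candidate pattern X marked with timestamp t]
Candidate sets of the four individual sequences:
- A_1, B_2, AB_1
- A_1, D_1, (AD)_1, B_2, AB_1, DB_1, (AD)B_1
- A_1, B_2, C_2, (BC)_2, AB_1, AC_1, A(BC)_1
- D_2
Candidate patterns in Db^{1,2} and their occurrence counts:
Pattern    Count
AB_1       3
A(BC)_1    1
AC_1       1
(AD)B_1    1
DB_1       1
Frequent sequential pattern: AB_1 (3)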
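54[Figure: candidate patterns and occurrence counts, steps (2) to (5); X_t: n denotes candidate pattern X marked with timestamp t and occurring in n sequences]
(2) Db^{1,2}, |Db^{1,2}| = 4: AB_1: 3, A(BC)_1: 1, AC_1: 1, (AD)B_1: 1, DB_1: 1
Frequent: AB_1 (3)
(3) Db^{1,3}, |Db^{1,3}| = 5: AB_1: 3, A(BC)_1: 1, AC_1: 1, (AD)B_1: 1, DB_1: 1, A(BC)B_1: 1, ACB_1: 1, (BC)B_2: 1, CB_2: 1, DC_2: 1
Frequent: AB_1 (3)
(4) Db^{1,4}, |Db^{1,4}| = 5: AB_1: 3, A(BC)_1: 1, AC_1: 2, (AD)B_1: 1, DB_3: 2, A(BC)B_1: 1, ACB_1: 1, (BC)B_2: 1, CB_2: 1, DC_2: 1, ABC_1: 2, A(BC)BC_1: 1, A(BC)C_1: 1, (AD)A_1: 1, (AD)BA_1: 1, BA_2: 1, BC_3: 2, (BC)BC_2: 1, (BC)C_2: 1, DA_1: 1, DBA_1: 1
Frequent: AB_1 (3)
(5) Db^{1,5}, |Db^{1,5}| = 5: AB_1: 3, A(BC)_1: 1, AC_1: 2, (AD)B_1: 1, DB_3: 2, A(BC)B_1: 1, ACB_1: 1, (BC)B_2: 1, CB_2: 1, DC_2: 1, ABC_1: 2, A(BC)BC_1: 1, A(BC)C_1: 1, (AD)A_1: 1, (AD)BA_1: 1, BA_4: 3, BC_3: 2, (BC)BC_2: 1, (BC)C_2: 1, DA_3: 3, DBA_3: 2, A(AD)_1: 1, AB(AD)_1: 1, ABC(AD)_1: 1, ABCD_1: 1, ABD_1: 1, AC(AD)_1: 1, ACD_1: 1, AD_1: 1, B(AD)_2: 1, BCA_2: 1, BC(AD)_2: 1, BCD_2: 1, BD_2: 1, CA_4: 2, C(AD)_4: 1, CD_4: 1, DCA_2: 1
Frequent: AB_1 (3), DA_3 (3), BA_4 (3)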
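55[Figure: candidate patterns and occurrence counts, steps (6) to (9); X_t: n denotes candidate pattern X marked with timestamp t and occurring in n sequences]
(6) Db^{2,6}, |Db^{2,6}| = 5: DB_3: 1, (BC)B_2: 1, CB_2: 1, DC_2: 1, BA_4: 4, BC_3: 2, (BC)BC_2: 1, (BC)C_2: 1, DA_3: 2, DBA_3: 1, B(AD)_2: 1, BCA_3: 2, BC(AD)_2: 1, BCD_2: 1, BD_2: 1, CA_4: 3, C(AD)_4: 1, CD_4: 1, DCA_2: 1, (BC)A_2: 1, (BC)BA_2: 1, (BC)BCA_2: 1, (BC)CA_2: 1, CBA_2: 1
Frequent: BA_4 (4), CA_4 (3)
(7) Db^{3,7}, |Db^{3,7}| = 5: DB_5: 2, BA_4: 2, BC_4: 2, DA_3: 1, DBA_3: 1, BCA_3: 1, CA_4: 3, C(AD)_4: 1, CD_4: 1, AB_5: 2, A(BC)_5: 1, AC_5: 2, (AD)B_5: 1, BAC_4: 1, CAB_4: 2, CA(BC)_3: 1, C(AD)B_4: 1, CB_4: 2, C(BC)_3: 1, CDB_4: 1, DAC_3: 1, DBAC_3: 1, DBC_3: 1, DC_3: 1
Frequent: CA_4 (3)
(8) Db^{4,8}, |Db^{4,8}| = 6: DB_5: 1, BA_4: 1, BC_7: 2, CA_4: 2, C(AD)_4: 1, CD_4: 1, AB_5: 2, A(BC)_5: 1, AC_6: 4, (AD)B_5: 1, BAC_4: 1, CAB_4: 1, C(AD)B_4: 1, CB_4: 1, CDB_4: 1, ABC_5: 1, (AD)BC_5: 1, (AD)C_5: 1, DBC_5: 1, DC_5: 1
Frequent: AC_6 (4)
(9) Db^{5,9}, |Db^{5,9}| = 5: DB_5: 1, BC_7: 1, AB_5: 2, A(BC)_5: 1, AC_8: 5, (AD)B_5: 1, ABC_5: 1, (AD)BC_5: 1, (AD)C_5: 1, DBC_5: 1, DC_5: 1, ACD_6: 2, AD_6: 2, CD_8: 2
Frequent: AC_8 (5)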
56The Advantages of DirApp
- DirApp needs only one scan of newly arriving elements and the candidate set at each timestamp rather than quadratic scans by conventional algorithms.
- DirApp can
- maintain latest data sequences
- find the complete set of up-to-date sequential patterns
- delete obsolete data and patterns rapidly
57The Disadvantages of DirApp
- DirApp needs lots of working space to store the candidate sets for all sequences.
- Scanning all candidate sets incurs a large computation cost in execution time.
- DirApp needs another data structure to calculate the occurrence frequencies of all candidate sequential patterns.
58Outlines
- Introduction
- Preliminaries
- Algorithm Pisa
- Experiments
- Conclusions
- Q & A
59Algorithm Pisa
- Pisa stands for Progressive mIning of Sequential pAtterns
- Pisa utilizes a Progressive Sequential tree (abbreviated as PS-tree) to maintain the information of all sequences in each POI to
- update each sequence
- find up-to-date sequential patterns
60PS-tree
- The nodes in PS-tree can be divided into two different types
- Root node
- Common nodes
- Each common node stores two pieces of information
- Node label: element in a sequence
- Sequence list
- sequence IDs containing this element
- marked by corresponding timestamps
61PS-tree
- Whenever there are a series of elements appearing in the same sequence, there will be a series of nodes labeled by each element with the same sequence IDs in their sequence lists.
- The first node will be connected to the Root node, representing the first element.
- The other nodes will be connected to the first node analogously.
62PS-tree
[Figure: PS-tree example]
63PS-tree
- The path from Root node to any other node represents the candidate sequential pattern appearing in this sequence.
- The appearing timestamp for each candidate sequential pattern will be marked in the node labeled by the last element.
64PS-tree
[Figure: PS-tree example]
65Algorithm Pisa
- When receiving elements at timestamp t+1, Pisa traverses the PS-tree in post-order to
- delete the obsolete elements from
- update current sequences in
- insert newly arriving elements into
- the PS-tree of timestamp t and
- transforms it into the PS-tree of timestamp t+1.
66For a common node
- Pisa deletes the obsolete sequences in the sequence list of this node
- If there is no sequence ID left in the sequence list, Pisa prunes this node away from its parent
- Pisa checks the sequence IDs left in the sequence list to see if there is a newly arriving element of the sequences
- If there is no newly arriving element, Pisa goes to the next node
67For a common node
- Otherwise, Pisa generates all combinations of candidate elements from the arriving element
- Ex) ABC -> A, B, C, AB, AC, BC, ABC
- For each candidate element that does not exist on the path from Root to the current node
- If there is a child of the same label, Pisa updates the timestamp of this sequence to the timestamp of the same sequence in the parent's sequence list.
- Otherwise, Pisa creates a new child of this element with the sequence ID and the timestamp of the same sequence in the parent's sequence list.
68For Root node
- Instead of checking the sequence list, Pisa examines all sequences that have newly arriving elements.
- After Pisa generates all combinations of candidate elements, for each of them
- If there is a child of the same label, Pisa updates the timestamp of this sequence to t+1.
- Otherwise, Pisa creates a new child of this element with sequence ID and timestamp t+1.
69Algorithm Pisa
- After Pisa processes a common node, if the number of sequence IDs in the sequence list is greater than or equal to min_supp × |Db^{p,q}|,
- the path from Root to this node will be output as a frequent sequential pattern.
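The traversal described on the preceding slides can be sketched as a small PS-tree in Python. This is a simplified reading, not the authors' implementation: the class and function names are hypothetical, frequent patterns are collected in a separate pass after the update instead of during the traversal, and |Db^{p,q}| is taken as the number of distinct sequence IDs under Root. The two toy inputs are assumptions as well.

```python
from itertools import combinations

class Node:
    def __init__(self, label=None):
        self.label = label      # candidate element (frozenset); None for Root
        self.seqs = {}          # sequence list: sequence ID -> timestamp
        self.children = {}      # label -> Node

def subsets(element):
    """All candidate elements of an arriving element, e.g. ABC -> A, B, C, AB, AC, BC, ABC."""
    items = sorted(element)
    return [frozenset(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]

def update(node, path, arrivals, t, poi):
    """Post-order update for timestamp t; arrivals maps sequence ID -> arriving element."""
    for child in list(node.children.values()):
        update(child, path + (child.label,), arrivals, t, poi)
        if not child.seqs:                       # empty sequence list: prune the node
            del node.children[child.label]
    if node.label is None:                       # Root: every arriving element starts at t
        for sid, element in arrivals.items():
            for c in subsets(element):
                node.children.setdefault(c, Node(c)).seqs[sid] = t
    else:                                        # common node
        node.seqs = {s: ts for s, ts in node.seqs.items() if ts > t - poi}  # drop obsolete
        for sid, ts in node.seqs.items():
            for c in subsets(arrivals.get(sid, "")):
                if c not in path:                # skip elements already on the path
                    node.children.setdefault(c, Node(c)).seqs[sid] = ts

def paths(node, path=()):
    """Yield (pattern, sequence list) for every common node."""
    for child in node.children.values():
        p = path + (child.label,)
        yield p, child.seqs
        yield from paths(child, p)

def frequent(root, min_supp):
    db_size = len({sid for ch in root.children.values() for sid in ch.seqs})
    return {p: len(s) for p, s in paths(root) if len(s) >= min_supp * db_size}

def label(pattern):
    parts = ("".join(sorted(e)) for e in pattern)
    return "".join(x if len(x) == 1 else f"({x})" for x in parts)

# Hypothetical arrivals: timestamp -> {sequence ID: element}
arrivals = {1: {"S1": "A", "S2": "A"}, 2: {"S1": "B", "S2": "B", "S3": "C"}}
root = Node()
for t in range(1, 3):
    update(root, (), arrivals.get(t, {}), t, poi=5)
at_t2 = {label(p): n for p, n in frequent(root, 0.5).items()}
for t in range(3, 7):
    update(root, (), arrivals.get(t, {}), t, poi=5)
at_t6 = {label(p): n for p, n in frequent(root, 0.5).items()}
print(at_t2, at_t6)
```

Visiting children before their parent matters in this sketch: a child created at timestamp t is not visited again at t, so the arriving element is never appended to itself. Sharing one node per pattern across sequences is also what lets the sequence list length serve directly as the occurrence count.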
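70PS-tree
[Figure: PS-tree example]
71[Figure: PS-tree example starting from Root; POI = 5, min_supp = 0.5]
72[Figure: PS-tree of Db^{1,1}, |Db^{1,1}| = 3]
73[Figure: PS-tree updated from Db^{1,1} (3 sequences) to Db^{1,2} (4 sequences); frequent pattern: AB_1 (3)]
74[Figure: PS-tree of Db^{1,3}, |Db^{1,3}| = 5; frequent pattern: AB_1 (3)]
75[Figure: PS-tree of Db^{1,4}, |Db^{1,4}| = 5; frequent pattern: AB_1 (3)]
76[Figure: PS-tree of Db^{1,5}, |Db^{1,5}| = 5; frequent patterns: AB_1 (3), BA_4 (3), DA_3 (3)]
77[Figure: PS-tree of Db^{2,6}, |Db^{2,6}| = 5; frequent patterns: CA_4 (3), BA_4 (4)]
78[Figure: PS-tree of Db^{3,7}, |Db^{3,7}| = 5; frequent pattern: CA_4 (3)]
79[Figure: PS-tree of Db^{4,8}, |Db^{4,8}| = 6; frequent pattern: AC_6 (3)]
80[Figure: PS-tree of Db^{5,9}, |Db^{5,9}| = 5; frequent pattern: AC_8 (4)]
81[Figure: PS-tree of Db^{6,10}, |Db^{6,10}| = 5; frequent pattern: CD_8 (4)]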
82The Advantages of Pisa
- Pisa needs only one scan of newly arriving elements and the PS-tree at each timestamp rather than quadratic scans by conventional algorithms.
- Pisa can
- maintain latest data sequences
- find the complete set of up-to-date sequential patterns
- delete obsolete data and patterns rapidly
83The Advantages of Pisa
- Each path from Root to any other node on PS-tree forms a unique candidate sequential pattern. Thus Pisa combines the same candidate patterns together, and patterns do not have to store their prefix elements.
- PS-tree consumes less space.
- Dealing with the same sequential patterns together is also very efficient in execution time.
- Fast Pisa with approximation results.
84Outlines
- Introduction
- Preliminaries
- Algorithm Pisa
- Experiments
- Conclusions
- Q & A
85Experiments
- Comparative algorithms
- GSP -- re-mining version of GSP
- SPAM -- re-mining version of SPAM
- DirApp
- Environment
- Pentium 4 3GHz CPU and 2GB RAM
- Coded in C
86Experiments
- The synthetic datasets are generated in a way similar to the IBM data generator designed for testing sequential pattern mining algorithms.
87Experiments
- We divide the target dataset into n timestamps.
- According to the POI, the first m timestamps (m = POI and m < n) are viewed as the original database and the rest of the transactions in the dataset are received by the system incrementally.
88Experiments
- The first run of the experiments mines the first POI from the beginning m timestamps of the dataset.
- After that, we shift the POI forward by t (t << m) timestamps for each of the following runs.
89Experiments
- The real data sets are from KDDCUP07.
- We randomly choose 120 successive days for the performance evaluation. A timestamp is set as 3 days in order to obtain sufficient frequent sequential patterns.
- Therefore, there are 40 timestamps in total and POI is set as 10. The new datasets contain more than 5000 sequences and 2000 different items.
90Cumulative Execution Time
91Minimum Support
92Length of POI
93Number of Sequences
94Scalability of Pisa
95Real Data Set
96Improvement of FastPisa
97Information Loss of FastPisa
98Outlines
- Introduction
- Preliminaries
- Algorithm Pisa
- Experiments
- Conclusions
- Q & A
99Conclusions
- We proposed a progressive algorithm Pisa to handle the progressive sequential pattern mining problem without re-mining all sub-databases at each timestamp.
- Pisa needs only one scan of newly arriving elements and the PS-tree at each timestamp rather than quadratic scans by conventional algorithms.
100Conclusions
- Pisa can
- maintain the latest information of sequences
- find the complete set of up-to-date sequential patterns
- delete obsolete data and patterns rapidly
- Pisa also
- consumes less space
- has high efficiency
- possesses great scalability
101References
- R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. Proc. of EDBT, 1996.
- J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential pattern mining using a bitmap representation. Proc. of ACM SIGKDD, 2002.
- M. Zhang, B. Kao, D. W.-L. Cheung, and C. L. Yip. Efficient algorithms for incremental update of frequent sequences. Proc. of PAKDD, 2002.
102Thank You!