Jen-Wei Huang


1
Jen-Wei Huang
jwhuang_at_gmail.com
National Taiwan University

2
(No Transcript)
3
http://www.wretch.cc/blog/EtudeBIKE
4
http://www.giant-bicycles.com/zh-TW/
5
(No Transcript)
6
(No Transcript)
7
http://cape7.pixnet.net/blog
8
http://cape7.pixnet.net/blog
9
http://cape7.pixnet.net/blog
10
http://www.wretch.cc/blog/orzboyz
http://blog.sina.com.tw/9winds/
http://atomcinema.pixnet.net/blog
11
(No Transcript)
12
http://www.amazon.com
13
http://www.amazon.com
14
http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html
15
A General Model for Sequential Pattern Mining
with a Progressive Database

Jen-Wei Huang, Chi-Yao Tseng,
Jian-Chih Ou and Ming-Syan Chen
National Taiwan University

IEEE Trans. on Knowledge and Data Engineering,
Vol. 20, No. 6, June 2008
16
Outlines

Introduction
Preliminaries
Algorithm Pisa
Experiments
Conclusions
Q & A
17
Introduction to Sequential Pattern Mining (SPM)

Mining of frequently occurring patterns related to time or other sequences (J. Han, Data Mining: Concepts and Techniques).
Given a set of sequences, find the complete set of frequent subsequences (J. Pei, PrefixSpan).
Example: which items a customer will buy, given the items he or she has already bought.
18
Time-related data

Customer buying behavior
Natural phenomena
Sensor network data
Web access patterns
Stock price changes
DNA sequence applications
19
Definition

Let $I = \{x_1, x_2, \ldots, x_n\}$ be a set of distinct items.
An element $e$, denoted by $(x_i x_j \ldots)$, is a subset of items $\subseteq I$ that appear in a sequence at the same time.
A sequence $s$, denoted by $\langle e_1, e_2, \ldots, e_m \rangle$, is an ordered list of elements.
A sequence database $Db$ contains a set of sequences, and $|Db|$ represents the number of sequences in $Db$.
20
Definition

A sequence $\alpha = \langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of another sequence $\beta = \langle b_1, b_2, \ldots, b_m \rangle$ if there exists a set of integers $1 \le i_1 < i_2 < \ldots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.
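The containment test in this definition can be sketched in Python. This is only an illustration, not code from the presentation: the name `is_subsequence` and the choice to represent a sequence as a list of sets are assumptions made here. Each element of the candidate is matched greedily against the earliest remaining element that contains it, which is enough to decide whether suitable increasing indices exist.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta.

    Both arguments are lists of sets, one set per element.
    """
    j = 0  # next position in beta that may still be matched
    for a in alpha:
        # Advance to the earliest element of beta that contains a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # matched indices must be strictly increasing
    return True


beta = [{10, 20}, {30}, {40, 60, 70}, {90}]
print(is_subsequence([{30}, {40, 70}], beta))  # order and containment hold
print(is_subsequence([{90}, {30}], beta))      # wrong order
```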
21
Definition

Sequential pattern mining can be defined as follows:
"Given a sequence database $Db$ and a user-defined minimum support $min\_sup$, find the complete set of subsequences whose occurrence frequencies are $\ge min\_sup \times |Db|$."
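A minimal sketch of this frequency test follows. The function names are hypothetical, the five sequences are the customer sequences from the Apriori example elsewhere in these slides, and the threshold of 0.4 is an assumption chosen only for illustration.

```python
def contains(sequence, pattern):
    """True if pattern (a list of sets) is a subsequence of sequence."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Number of sequences in db that contain the pattern."""
    return sum(contains(seq, pattern) for seq in db)


def is_frequent(db, pattern, min_sup):
    """Apply the test: occurrence frequency >= min_sup * |Db|."""
    return support(db, pattern) >= min_sup * len(db)


DB = [
    [{30}, {90}, {40, 70}],
    [{10, 20}, {30}, {40, 60, 70}, {90}],
    [{30, 50, 70}, {10}, {20}],
    [{30}, {40, 70}, {90}],
    [{90}, {10}, {20}],
]

print(support(DB, [{30}, {90}]), is_frequent(DB, [{30}, {90}], 0.4))
```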
22
Three Categories

Depending on the management of the corresponding
database, sequential pattern mining can be
divided into three categories, namely sequential
pattern mining with
a static database.
an incremental database.
a progressive database.

23
How To Do Sequential Pattern Mining on a Static
Database

An Overview

24
How?

Apriori-like algorithms
  AprioriAll by Agrawal et al.
  GSP by R. Srikant et al.
Partition-based algorithms
  FreeSpan by J. Han et al.
  PrefixSpan by J. Pei et al.
Vertical format algorithms
  SPADE by Zaki et al.
  SPAM by Ayres et al.

25
Apriori-like Algorithms

1. Sort phase
  Sort the database with customer ID as the primary key and time as the secondary key.
2. Litemset phase
  Count the support of each itemset, i.e., the fraction of customers who bought the itemset.

26
Apriori-like Algorithms

3. Transformation phase
  Transform each transaction into the set of all litemsets it contains, giving customer sequences of the form
  C01 <(1,5) (2) (3) (4)>
  C02 <(1) (3) (4) (3,5)>
  C03 <(1) (2) (3) (4)>
  C04 <(1) (3) (5)>
  C05 <(4) (5)>

27
Transactions
CID  Items
2    10 20
5    90
2    30
2    40 60 70
4    30
3    30 50 70
1    30
1    90
4    40 70
4    90
3    10
5    10
1    40 70
5    20
2    90
3    20

Customer sequences
CID  Sequence
1    <(30) (90) (40 70)>
2    <(10 20) (30) (40 60 70) (90)>
3    <(30 50 70) (10) (20)>
4    <(30) (40 70) (90)>
5    <(90) (10) (20)>

Itemset supports
Itemset      Support
(10)         3
(20)         3
(30)         4
(40)         3
(50)         1
(60)         1
(70)         4
(90)         4
(10 20)      1
(40 60)      1
(40 70)      3
(60 70)      1
(40 60 70)   1
(30 50)      1
(30 70)      1
(50 70)      1
(30 50 70)   1
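The sort and litemset phases on this slide's data can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, the listed transaction order is assumed to be the time order, and the minimum count of 2 customers is an assumption that happens to reproduce the seven litemsets kept on the next slide.

```python
from collections import defaultdict
from itertools import combinations

# (CID, items) pairs in the order listed on the slide (assumed to be time order).
TRANSACTIONS = [
    (2, {10, 20}), (5, {90}), (2, {30}), (2, {40, 60, 70}),
    (4, {30}), (3, {30, 50, 70}), (1, {30}), (1, {90}),
    (4, {40, 70}), (4, {90}), (3, {10}), (5, {10}),
    (1, {40, 70}), (5, {20}), (2, {90}), (3, {20}),
]


def sort_phase(transactions):
    """Group transactions by customer, keeping their order within a customer."""
    sequences = defaultdict(list)
    for cid, items in transactions:
        sequences[cid].append(frozenset(items))
    return dict(sorted(sequences.items()))


def litemset_phase(sequences, min_count):
    """Count, for each itemset, the customers with a transaction containing it."""
    buyers = defaultdict(set)
    for cid, seq in sequences.items():
        for element in seq:
            for size in range(1, len(element) + 1):
                for subset in combinations(sorted(element), size):
                    buyers[subset].add(cid)
    return {s: len(c) for s, c in buyers.items() if len(c) >= min_count}


sequences = sort_phase(TRANSACTIONS)
litemsets = litemset_phase(sequences, min_count=2)
print(sorted(litemsets.items()))
```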
28
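Litemsets and their new IDs
Itemset   Support  New ID
(10)      3        1
(20)      3        2
(30)      4        3
(40)      3        4
(70)      4        5
(90)      4        6
(40 70)   3        7

Transformed customer sequences
CID  Sequence
1    <(3) (6) (4, 5, 7)>
2    <(1, 2) (3) (4, 5, 7) (6)>
3    <(3, 5) (1) (2)>
4    <(3) (4, 5, 7) (6)>
5    <(6) (1) (2)>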
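The transformation phase can be sketched as a lookup of every litemset contained in a transaction. The names `LITEMSET_IDS` and `transform` are hypothetical; the mapping itself is copied from this slide.

```python
# Litemset -> new ID, as listed on the slide.
LITEMSET_IDS = {
    frozenset({10}): 1, frozenset({20}): 2, frozenset({30}): 3,
    frozenset({40}): 4, frozenset({70}): 5, frozenset({90}): 6,
    frozenset({40, 70}): 7,
}


def transform(sequence, litemset_ids):
    """Replace each transaction by the IDs of all litemsets it contains."""
    transformed = []
    for element in sequence:
        ids = {i for itemset, i in litemset_ids.items() if itemset <= element}
        if ids:  # a transaction containing no litemset is dropped
            transformed.append(ids)
    return transformed


customer_2 = [{10, 20}, {30}, {40, 60, 70}, {90}]
print(transform(customer_2, LITEMSET_IDS))
```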
29
Apriori-like Algorithms

4. Mining phase
  Apply an Apriori-like algorithm to the transformed sequences
5. Maximal phase
  Find the maximal patterns

30
Candidate 2-sequences and their supports
Sequence  Support   Sequence  Support   Sequence  Support
<1 2>     2         <3 4>     3         <5 6>     2
<1 3>     1         <3 5>     3         <5 7>     0
<1 4>     1         <3 6>     3         <6 1>     1
<1 5>     1         <3 7>     3         <6 2>     1
<1 6>     1         <4 1>     0         <6 3>     0
<1 7>     1         <4 2>     0         <6 4>     1
<2 1>     0         <4 3>     0         <6 5>     1
<2 3>     1         <4 5>     0         <6 7>     1
<2 4>     1         <4 6>     2         <7 1>     0
<2 5>     1         <4 7>     0         <7 2>     0
<2 6>     1         <5 1>     1         <7 3>     0
<2 7>     1         <5 2>     1         <7 4>     0
<3 1>     1         <5 3>     0         <7 5>     0
<3 2>     1         <5 4>     0         <7 6>     2
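The support counts for the candidate sequences can be reproduced with a short sketch. It is an illustration under assumptions: the function names are hypothetical, candidates are simply all ordered pairs of distinct litemset IDs (as on this slide) rather than the output of a real Apriori join, and a minimum count of 2 customers is assumed.

```python
from itertools import permutations

# Transformed customer sequences (litemset IDs per transaction).
DB = {
    1: [{3}, {6}, {4, 5, 7}],
    2: [{1, 2}, {3}, {4, 5, 7}, {6}],
    3: [{3, 5}, {1}, {2}],
    4: [{3}, {4, 5, 7}, {6}],
    5: [{6}, {1}, {2}],
}


def supports(sequence, candidate):
    """True if the IDs in candidate occur in order in distinct transactions."""
    pos = 0
    for litemset_id in candidate:
        while pos < len(sequence) and litemset_id not in sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def count_candidates(db, candidates):
    """Map each candidate to the number of customers supporting it."""
    return {c: sum(supports(seq, c) for seq in db.values()) for c in candidates}


two_seq = count_candidates(DB, permutations(range(1, 8), 2))
frequent_2 = sorted(c for c, n in two_seq.items() if n >= 2)
three_seq = count_candidates(DB, [(3, 4, 6), (3, 5, 6), (3, 7, 6)])
print(frequent_2)
print(three_seq)
```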
31
Candidate 3-sequences and their supports
Sequence  Support
<3 4 6>   2
<3 5 6>   2
<3 7 6>   2

Therefore, the frequent sequential patterns are <1 2>, <3 4>, <3 5>, <3 6>, <3 7>, <4 6>, <5 6>, <7 6>, <3 4 6>, <3 5 6>, <3 7 6>.
According to the mapping, the original frequent sequential patterns are <10 20>, <30 40>, <30 70>, <30 90>, <30 (40 70)>, <40 90>, <70 90>, <(40 70) 90>, <30 40 90>, <30 70 90>, <30 (40 70) 90>.
32
Because <30 40> and <30 70> are contained in <30 (40 70)>, <40 90> and <70 90> are contained in <(40 70) 90>, and <30 40 90> and <30 70 90> are contained in <30 (40 70) 90>, the patterns remaining after this step are <10 20>, <30 90>, <30 (40 70)>, <(40 70) 90> and <30 (40 70) 90>.
Since <30 90>, <30 (40 70)> and <(40 70) 90> are themselves contained in <30 (40 70) 90>, the maximal sequential patterns are <10 20> and <30 (40 70) 90>.
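The maximal phase can be sketched as a filter that keeps a pattern only if no other frequent pattern contains it. The function names are hypothetical, and the eleven patterns are written with the itemset (40 70) as a single element, following the litemset mapping. Under this strict definition the filter keeps <10 20> and <30 (40 70) 90>.

```python
def contains(big, small):
    """True if small (a list of sets) is a subsequence of big."""
    pos = 0
    for element in small:
        while pos < len(big) and not element <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal(patterns):
    """Keep the patterns that are not contained in any other pattern."""
    return [p for p in patterns
            if not any(p != q and contains(q, p) for q in patterns)]


FREQUENT = [
    [{10}, {20}], [{30}, {40}], [{30}, {70}], [{30}, {90}],
    [{30}, {40, 70}], [{40}, {90}], [{70}, {90}], [{40, 70}, {90}],
    [{30}, {40}, {90}], [{30}, {70}, {90}], [{30}, {40, 70}, {90}],
]

print(maximal(FREQUENT))
```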
33
Related Works

Static database
AprioriAll by Agrawal et al
GSP by R. Srikant et al
SPADE by Zaki et al
FreeSpan by J. Han et al
PrefixSpan by J. Pei et al
SPAM by Ayres et al

34
Related Works

Incremental database
ISM by Parthasarathy et al
IncSP by Lin et al
ISE by Masseglia et al
IncSpan by Cheng et al
MILE by Chen et al

35
Motivation

The assumption of a static database may not hold in practice.
Real-world data change on the fly.
Sequential patterns found in an incremental database may be of little interest to users.
Users are usually more interested in recent data than in old data.
36
Motivation

If a sequence receives no newly arriving elements, it still stays in the database and undesirably contributes to $|Db|$.
As a result, new sequential patterns that appear frequently in recent sequences may not be counted as frequent sequential patterns.
37
Definition -- Period of Interest

The Period of Interest (abbreviated as POI) is a sliding window whose length is a user-specified time interval and which advances continuously as time goes by.
Only the sequences that have elements whose timestamps fall into the current POI contribute to $|Db|$ for the current sequential patterns.
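The sliding window can be sketched as a filter over timestamped elements. Everything named here is an assumption for illustration: the functions `current_poi` and `window`, timestamps starting at 1, and the small database, which is reconstructed from the first two timestamps of the DirApp example later in these slides.

```python
def current_poi(now, poi):
    """Return the inclusive timestamp range (p, q) covered by the POI at time now."""
    return max(1, now - poi + 1), now


def window(db, p, q):
    """Return the sequences restricted to elements with timestamps in p..q.

    Sequences with no element inside the window do not contribute.
    """
    result = {}
    for sid, seq in db.items():
        recent = [(t, element) for t, element in seq if p <= t <= q]
        if recent:
            result[sid] = recent
    return result


# (timestamp, element) pairs per sequence ID.
DB = {
    "S01": [(1, "A"), (2, "B")],
    "S02": [(1, "AD"), (2, "B")],
    "S03": [(1, "A"), (2, "BC")],
    "S04": [(2, "D")],
}

for now in (1, 2):
    p, q = current_poi(now, poi=5)
    print((p, q), len(window(DB, p, q)))
```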
38
[Figure: example progressive sequence database. Elements such as A, B, C, D, AD, BC and BD are laid out by sequence ID (SID) against time. POI = 5, min_supp = 0.5]
39
Outlines

Introduction
Preliminaries
Algorithm Pisa
Experiments
Conclusions
Q & A
40
Progressive Sequential Pattern

The progressive sequential pattern mining problem is defined as follows:
"Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database."
41
Naïve Algorithm

Use conventional static sequential pattern mining algorithms to mine sequential patterns separately from every POI,
e.g., $Db^{1,5}, Db^{2,6}, Db^{3,7}, Db^{4,8}, Db^{5,9}$, etc.
For a sequence database whose elements appear within an interval of $n$ timestamps, the total number of POIs in this interval is equal to $n - POI + 1$.
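The window count can be checked with a few lines of Python. The function name `poi_windows` is hypothetical; it lists every full-length POI in an interval of n timestamps, each of which the naive approach would have to mine separately.

```python
def poi_windows(n, poi):
    """List the (start, end) timestamps of every full-length POI in 1..n."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]


windows = poi_windows(n=9, poi=5)
print(windows)
print(len(windows) == 9 - 5 + 1)
```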
42
Prior Work

The only prior work on progressive databases is GSP+ and MFS+, proposed by Zhang et al. and based on the static algorithms GSP and MFS (also derived by the same authors).
However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS.
Consequently, the performance improvement of GSP+ and MFS+ over GSP and MFS is only within 15%, as reported by their authors.
43
Algorithm DirApp

Stands for Direct Append.
Consists of two procedures:
Progressively Updating (abbreviated as PrUp)
Immediately Filtering (abbreviated as ImFi)
44
Procedure PrUp

While progressively reading newly incoming elements, Procedure PrUp can
update each sequence in the sequence database
generate candidate sequential patterns
calculate the occurrence frequencies of all candidate sequential patterns in the current POI
45
Procedure ImFi

DirApp uses Procedure ImFi to
filter out obsolete data from the existing sequence database
prune away obsolete candidate sequential patterns from the candidate set
report the most up-to-date frequent sequential patterns to the user in every POI
46
[Figure: an example sequence whose elements A, B, C, (AD) and B arrive over time]
47
Example
48
(1) Db^{1,1}: A_1
(2) Db^{1,2}: A_1, B_2, AB_1
(3) Db^{1,3}: A_1, B_2, AB_1
(4) Db^{1,4}: A_1, B_2, AB_1, C_4, AC_1, BC_2, ABC_1
49
(4) Db^{1,4}: A_1, B_2, AB_1, C_4, AC_1, BC_2, ABC_1
(5) Db^{1,5}: A_5, B_2, AB_1, C_4, AC_1, BC_2, ABC_1, D_5, (AD)_5, AD_1, A(AD)_1, BA_2, BD_2, B(AD)_2, ABD_1, AB(AD)_1, CA_4, CD_4, C(AD)_4, ACD_1, AC(AD)_1, BCA_2, BCD_2, BC(AD)_2, ABCD_1, ABC(AD)_1
50
(5) Db^{1,5}: A_5, B_2, AB_1, C_4, AC_1, BC_2, ABC_1, D_5, (AD)_5, AD_1, A(AD)_1, BA_2, BD_2, B(AD)_2, ABD_1, AB(AD)_1, CA_4, CD_4, C(AD)_4, ACD_1, AC(AD)_1, BCA_2, BCD_2, BC(AD)_2, ABCD_1, ABC(AD)_1
(6) Db^{2,6}: A_5, B_2, C_4, BC_2, D_5, (AD)_5, BA_2, BD_2, B(AD)_2, CA_4, CD_4, C(AD)_4, BCA_2, BCD_2, BC(AD)_2
51
(6) Db^{2,6}: A_5, B_2, C_4, BC_2, D_5, (AD)_5, BA_2, BD_2, B(AD)_2, CA_4, CD_4, C(AD)_4, BCA_2, BCD_2, BC(AD)_2
(7) Db^{3,7}: A_5, C_4, D_5, (AD)_5, CA_4, CD_4, C(AD)_4, B_7, AB_5, CB_4, DB_5, (AD)B_5, CAB_4, CDB_4, C(AD)B_4

52
(1) Db^{1,1}: A_1
(2) Db^{1,2}: A_1, B_2, AB_1
(3) Db^{1,3}: A_1, B_2, AB_1
(4) Db^{1,4}: A_1, B_2, AB_1, C_4, AC_1, BC_2, ABC_1
(5) Db^{1,5}: A_5, B_2, AB_1, C_4, AC_1, BC_2, ABC_1, D_5, (AD)_5, AD_1, A(AD)_1, BA_2, BD_2, B(AD)_2, ABD_1, AB(AD)_1, CA_4, CD_4, C(AD)_4, ACD_1, AC(AD)_1, BCA_2, BCD_2, BC(AD)_2, ABCD_1, ABC(AD)_1
(6) Db^{2,6}: A_5, B_2, C_4, BC_2, D_5, (AD)_5, BA_2, BD_2, B(AD)_2, CA_4, CD_4, C(AD)_4, BCA_2, BCD_2, BC(AD)_2
(7) Db^{3,7}: A_5, C_4, D_5, (AD)_5, CA_4, CD_4, C(AD)_4, B_7, AB_5, CB_4, DB_5, (AD)B_5, CAB_4, CDB_4, C(AD)B_4
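The candidate sets in this example can be reproduced with a small single-sequence sketch. It is not the authors' DirApp implementation: the function names are hypothetical, the arrival times (A at 1, B at 2, C at 4, AD at 5, B at 7) are read off the candidate labels in the example, and the number attached to each candidate is treated as its starting timestamp. The sketch covers only the candidate bookkeeping for one sequence; counting occurrence frequencies across sequences is left out.

```python
from itertools import combinations


def sub_elements(element):
    """All non-empty sub-itemsets of an element, each as a sorted string."""
    items = sorted(element)
    return ["".join(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def step(candidates, t, poi, arriving=None):
    """Advance one sequence's candidate set to timestamp t.

    candidates maps a pattern (tuple of element strings) to its starting timestamp.
    """
    # Filtering: drop candidates that start before the window [t - poi + 1, t].
    kept = {p: ts for p, ts in candidates.items() if ts >= t - poi + 1}
    if arriving is None:
        return kept
    updated = dict(kept)
    subs = sub_elements(arriving)
    # Updating: append each sub-element to every surviving candidate.
    for pattern, ts in kept.items():
        for s in subs:
            if s in pattern:  # skip elements already present in the pattern
                continue
            new = pattern + (s,)
            updated[new] = max(updated.get(new, ts), ts)
    for s in subs:
        updated[(s,)] = t  # the sub-element alone starts a candidate at t
    return updated


ARRIVALS = {1: "A", 2: "B", 4: "C", 5: "AD", 7: "B"}
history, cands = {}, {}
for t in range(1, 8):
    cands = step(cands, t, poi=5, arriving=ARRIVALS.get(t))
    history[t] = cands

print({t: len(c) for t, c in history.items()})
```

Keeping the starting timestamp with each candidate is what makes the filtering step a simple comparison against the start of the current window.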
53
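Candidate sets of each sequence in Db^{1,2}
S01: A_1, B_2, AB_1
S02: A_1, D_1, (AD)_1, B_2, AB_1, DB_1, (AD)B_1
S03: A_1, B_2, C_2, (BC)_2, AB_1, AC_1, A(BC)_1
S04: D_2

Db^{1,2} (4 sequences), candidates with occurrence counts in parentheses
AB_1 (3), A(BC)_1 (1), AC_1 (1), (AD)B_1 (1), DB_1 (1)
Frequent: AB_1 (3)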
54
Candidates with occurrence counts in parentheses
(2) Db^{1,2} (4 sequences)
AB_1 (3), A(BC)_1 (1), AC_1 (1), (AD)B_1 (1), DB_1 (1)
Frequent: AB_1 (3)
(3) Db^{1,3} (5 sequences)
AB_1 (3), A(BC)_1 (1), AC_1 (1), (AD)B_1 (1), DB_1 (1), A(BC)B_1 (1), ACB_1 (1), (BC)B_2 (1), CB_2 (1), DC_2 (1)
Frequent: AB_1 (3)
(4) Db^{1,4} (5 sequences)
AB_1 (3), A(BC)_1 (1), AC_1 (2), (AD)B_1 (1), DB_3 (2), A(BC)B_1 (1), ACB_1 (1), (BC)B_2 (1), CB_2 (1), DC_2 (1), ABC_1 (2), A(BC)BC_1 (1), A(BC)C_1 (1), (AD)A_1 (1), (AD)BA_1 (1), BA_2 (1), BC_3 (2), (BC)BC_2 (1), (BC)C_2 (1), DA_1 (1), DBA_1 (1)
Frequent: AB_1 (3)
(5) Db^{1,5} (5 sequences)
AB_1 (3), A(BC)_1 (1), AC_1 (2), (AD)B_1 (1), DB_3 (2), A(BC)B_1 (1), ACB_1 (1), (BC)B_2 (1), CB_2 (1), DC_2 (1), ABC_1 (2), A(BC)BC_1 (1), A(BC)C_1 (1), (AD)A_1 (1), (AD)BA_1 (1), BA_4 (3), BC_3 (2), (BC)BC_2 (1), (BC)C_2 (1), DA_3 (3), DBA_3 (2), A(AD)_1 (1), AB(AD)_1 (1), ABC(AD)_1 (1), ABCD_1 (1), ABD_1 (1), AC(AD)_1 (1), ACD_1 (1), AD_1 (1), B(AD)_2 (1), BCA_2 (1), BC(AD)_2 (1), BCD_2 (1), BD_2 (1), CA_4 (2), C(AD)_4 (1), CD_4 (1), DCA_2 (1)
Frequent: AB_1 (3), DA_3 (3), BA_4 (3)
55
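Candidates with occurrence counts in parentheses
(6) Db^{2,6} (5 sequences)
DB_3 (1), (BC)B_2 (1), CB_2 (1), DC_2 (1), BA_4 (4), BC_3 (2), (BC)BC_2 (1), (BC)C_2 (1), DA_3 (2), DBA_3 (1), B(AD)_2 (1), BCA_3 (2), BC(AD)_2 (1), BCD_2 (1), BD_2 (1), CA_4 (3), C(AD)_4 (1), CD_4 (1), DCA_2 (1), (BC)A_2 (1), (BC)BA_2 (1), (BC)BCA_2 (1), (BC)CA_2 (1), CBA_2 (1)
Frequent: BA_4 (4), CA_4 (3)
(7) Db^{3,7} (5 sequences)
DB_5 (2), BA_4 (2), BC_4 (2), DA_3 (1), DBA_3 (1), BCA_3 (1), CA_4 (3), C(AD)_4 (1), CD_4 (1), AB_5 (2), A(BC)_5 (1), AC_5 (2), (AD)B_5 (1), BAC_4 (1), CAB_4 (2), CA(BC)_3 (1), C(AD)B_4 (1), CB_4 (2), C(BC)_3 (1), CDB_4 (1), DAC_3 (1), DBAC_3 (1), DBC_3 (1), DC_3 (1)
Frequent: CA_4 (3)
(8) Db^{4,8} (6 sequences)
DB_5 (1), BA_4 (1), BC_7 (2), CA_4 (2), C(AD)_4 (1), CD_4 (1), AB_5 (2), A(BC)_5 (1), AC_6 (4), (AD)B_5 (1), BAC_4 (1), CAB_4 (1), C(AD)B_4 (1), CB_4 (1), CDB_4 (1), ABC_5 (1), (AD)BC_5 (1), (AD)C_5 (1), DBC_5 (1), DC_5 (1)
Frequent: AC_6 (4)
(9) Db^{5,9} (5 sequences)
DB_5 (1), BC_7 (1), AB_5 (2), A(BC)_5 (1), AC_8 (5), (AD)B_5 (1), ABC_5 (1), (AD)BC_5 (1), (AD)C_5 (1), DBC_5 (1), DC_5 (1), ACD_6 (2), AD_6 (2), CD_8 (2)
Frequent: AC_8 (5)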
56
The Advantages of DirApp

DirApp needs only one scan of the newly arriving elements and the candidate set at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
DirApp can
maintain the latest data sequences
find the complete set of up-to-date sequential patterns
delete obsolete data and patterns rapidly
57
The Disadvantages of DirApp

DirApp needs a lot of working space to store the candidate sets of all sequences.
Scanning all candidate sets takes considerable execution time.
DirApp needs another data structure to calculate the occurrence frequencies of all candidate sequential patterns.
58
Outlines

Introduction
Preliminaries
Algorithm Pisa
Experiments
Conclusions
Q & A
59
Algorithm Pisa

Pisa stands for Progressive mIning of Sequential pAtterns.
Pisa uses a Progressive Sequential tree (abbreviated as PS-tree) to maintain the information of all sequences in each POI in order to
update each sequence
find up-to-date sequential patterns
60
PS-tree

The nodes in a PS-tree are of two types:
Root node
Common nodes
Each common node stores two pieces of information:
Node label: an element in a sequence
Sequence list: the IDs of the sequences containing this element, each marked with its corresponding timestamp
61
PS-tree

Whenever a series of elements appears in the same sequence, there is a series of nodes, each labeled by one of these elements, with the same sequence ID in their sequence lists.
The node representing the first element is connected to the Root node.
The other nodes are connected below the first node analogously.
62
PS-tree
[Figure: example PS-tree]
63
PS-tree

The path from the Root node to any other node represents a candidate sequential pattern appearing in this sequence.
The timestamp at which each candidate sequential pattern appears is marked in the node labeled by its last element.
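A minimal sketch of such a tree follows, assuming a node holds a label, a sequence list (sequence ID to timestamp) and its children. The names `PSNode`, `insert_path` and `paths` are hypothetical and do not come from the presentation; the example inserts the three candidates A_1, B_2 and AB_1 of one sequence and reads them back as root-to-node paths.

```python
from dataclasses import dataclass, field


@dataclass
class PSNode:
    label: str
    seq_list: dict = field(default_factory=dict)  # sequence ID -> timestamp
    children: dict = field(default_factory=dict)  # label -> PSNode


def insert_path(root, elements, seq_id, timestamp):
    """Walk or create the path for elements and mark seq_id at its last node."""
    node = root
    for label in elements:
        node = node.children.setdefault(label, PSNode(label))
    node.seq_list[seq_id] = timestamp
    return node


def paths(node, prefix=()):
    """Yield (candidate pattern, sequence list) for every non-root node."""
    for child in node.children.values():
        pattern = prefix + (child.label,)
        yield pattern, child.seq_list
        yield from paths(child, pattern)


root = PSNode("Root")
insert_path(root, ["A"], "S01", 1)
insert_path(root, ["B"], "S01", 2)
insert_path(root, ["A", "B"], "S01", 1)
print(dict(paths(root)))
```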
64
PS-tree
[Figure: example PS-tree]
65
Algorithm Pisa

When receiving elements at timestamp $t+1$, Pisa traverses the PS-tree of timestamp $t$ in post-order to
delete the obsolete elements from it
update the current sequences in it
insert the newly arriving elements into it
and thereby transforms it into the PS-tree of timestamp $t+1$.
66
For a common node

Pisa deletes the obsolete sequences from the sequence list of this node.
If no sequence ID is left in the sequence list, Pisa prunes this node away from its parent.
Otherwise, Pisa checks the sequence IDs left in the sequence list to see whether any of these sequences has a newly arriving element.
If there is no newly arriving element, Pisa goes on to the next node.
67
For a common node

Otherwise, Pisa generates all combinations of candidate elements from the arriving element.
Ex) ABC -> A, B, C, AB, AC, BC, ABC
For each candidate element that does not exist on the path from Root to the current node:
If there is a child with the same label, Pisa updates the timestamp of this sequence to the timestamp of the same sequence in the parent's sequence list.
Otherwise, Pisa creates a new child for this element with the sequence ID and the timestamp of the same sequence in the parent's sequence list.
68
For Root node

Instead of checking a sequence list, Pisa examines all sequences that have newly arriving elements.
After Pisa generates all combinations of candidate elements, for each of them:
If there is a child with the same label, Pisa updates the timestamp of this sequence to $t+1$.
Otherwise, Pisa creates a new child for this element with the sequence ID and timestamp $t+1$.
69
Algorithm Pisa

After Pisa processes a common node, if the number of sequence IDs in its sequence list is at least $min\_supp \times |Db^{p,q}|$, the path from Root to this node is output as a frequent sequential pattern.
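The update rules for common nodes and the Root node can be sketched as one post-order traversal. This is a simplified illustration, not the authors' Pisa implementation: the names `Node`, `update` and `frequent` are hypothetical, frequent patterns are collected in a separate pass after the update rather than during it, and the arriving elements are reconstructed from the first two timestamps of the DirApp example.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    label: str
    seqs: dict = field(default_factory=dict)      # sequence ID -> timestamp
    children: dict = field(default_factory=dict)  # label -> Node


def sub_elements(element):
    """All non-empty sub-itemsets of an element, each as a sorted string."""
    items = sorted(element)
    return ["".join(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def update(node, arriving, t, poi, path=()):
    """Transform the subtree at node into its state at timestamp t."""
    # Post-order: children first, so nodes created below are not extended again.
    for child in list(node.children.values()):
        update(child, arriving, t, poi, path + (child.label,))
    if path:  # common node: delete obsolete sequences, reuse their timestamps
        node.seqs = {s: ts for s, ts in node.seqs.items() if ts > t - poi}
        starts = dict(node.seqs)
    else:     # Root: every sequence with an arriving element starts at t
        starts = {s: t for s in arriving}
    for sid, ts in starts.items():
        if sid not in arriving:
            continue
        for label in sub_elements(arriving[sid]):
            if label in path:  # element already on the path from Root
                continue
            child = node.children.setdefault(label, Node(label))
            child.seqs[sid] = ts  # create the entry or update its timestamp
    # Prune children whose sequence lists are empty.
    node.children = {l: c for l, c in node.children.items() if c.seqs}


def frequent(node, min_count, path=()):
    """Yield (pattern, count) for nodes with at least min_count sequence IDs."""
    for child in node.children.values():
        pattern = path + (child.label,)
        if len(child.seqs) >= min_count:
            yield pattern, len(child.seqs)
        yield from frequent(child, min_count, pattern)


root = Node("Root")
update(root, {"S01": "A", "S02": "AD", "S03": "A"}, t=1, poi=5)
update(root, {"S01": "B", "S02": "B", "S03": "BC", "S04": "D"}, t=2, poi=5)
patterns = dict(frequent(root, min_count=2))  # 0.5 * 4 sequences
print(patterns[("A", "B")])
```

Processing children before their parent means that a child created at this timestamp is not extended again in the same step, and that each new child inherits the parent's timestamp from before the parent itself is refreshed.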
70
PS-tree
[Figure: example PS-tree]
71
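[Figure: PS-tree containing only the Root node. POI = 5, min_supp = 0.5]
72
[Figure: PS-tree for Db^{1,1} (3 sequences)]
73
[Figure: PS-tree updated from Db^{1,1} (3 sequences) to Db^{1,2} (4 sequences). Frequent pattern: AB_1 (3)]
74
[Figure: PS-tree for Db^{1,3} (5 sequences). Frequent pattern: AB_1 (3)]
75
[Figure: PS-tree for Db^{1,4} (5 sequences). Frequent pattern: AB_1 (3)]
76
[Figure: PS-tree for Db^{1,5} (5 sequences). Frequent patterns: AB_1 (3), BA_4 (3), DA_3 (3)]
77
[Figure: PS-tree for Db^{2,6} (5 sequences). Frequent patterns: CA_4 (3), BA_4 (4)]
78
[Figure: PS-tree for Db^{3,7} (5 sequences). Frequent pattern: CA_4 (3)]
79
[Figure: PS-tree for Db^{4,8} (6 sequences). Frequent pattern: AC_6 (3)]
80
[Figure: PS-tree for Db^{5,9} (5 sequences). Frequent pattern: AC_8 (4)]
81
[Figure: PS-tree for Db^{6,10} (5 sequences). Frequent pattern: CD_8 (4)]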
82
The Advantages of Pisa

Pisa needs only one scan of the newly arriving elements and the PS-tree at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
Pisa can
maintain the latest data sequences
find the complete set of up-to-date sequential patterns
delete obsolete data and patterns rapidly
83
The Advantages of Pisa

Each path from Root to any other node of the PS-tree forms a unique candidate sequential pattern. Thus Pisa merges identical candidate patterns, and patterns do not have to store their prefix elements.
The PS-tree consumes less space.
Handling identical sequential patterns together is also efficient in execution time.
FastPisa, a faster variant, returns approximate results.
84
Outlines

Introduction
Preliminaries
Algorithm Pisa
Experiments
Conclusions
Q & A
85
Experiments

Comparative algorithms
GSP -- re-mining version of GSP
SPAM -- re-mining version of SPAM
DirApp
Environment
Pentium 4 3GHz CPU and 2GB RAM
Coded in C

86
Experiments

The synthetic datasets are generated in a way similar to the IBM data generator designed for testing sequential pattern mining algorithms.
87
Experiments

We divide the target dataset into n timestamps.
According to the POI, the first m timestamps (m POI and m < n) are viewed as the original database, and the remaining transactions in the dataset are received by the system incrementally.
88
Experiments

The first run of the experiments mines the first POI from the first m timestamps of the dataset.
After that, we shift the POI forward by $t$ ($t \ll m$) timestamps for each of the following runs.
89
Experiments

The real data sets are from KDDCUP07.
We randomly choose 120 successive days for the performance evaluation. A timestamp is set to 3 days in order to obtain sufficient frequent sequential patterns.
Therefore, there are 40 timestamps in total, and the POI is set to 10. The new datasets contain more than 5000 sequences and 2000 different items.
90
Cumulative Execution Time
91
Minimum Support
92
Length of POI
93
Number of Sequences
94
Scalability of Pisa
95
Real Data Set
96
Improvement of FastPisa
97
Information Loss of FastPisa
98
Outlines

Introduction
Preliminaries
Algorithm Pisa
Experiments
Conclusions
Q & A
99
Conclusions

We proposed a progressive algorithm, Pisa, to handle the progressive sequential pattern mining problem without re-mining all sub-databases at each timestamp.
Pisa needs only one scan of the newly arriving elements and the PS-tree at each timestamp, rather than the quadratic number of scans needed by conventional algorithms.
100
Conclusions

Pisa can
maintain the latest information of sequences
find the complete set of up-to-date sequential
patterns
delete obsolete data and patterns rapidly
Pisa also
consumes less space
has high efficiency
possesses great scalability

101
References

R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. Proc. of EDBT, 1996.
J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential Pattern Mining Using a Bitmap Representation. Proc. of ACM SIGKDD, 2002.
M. Zhang, B. Kao, D. W.-L. Cheung, and C. L. Yip. Efficient Algorithms for Incremental Update of Frequent Sequences. Proc. of PAKDD, 2002.