1Approximate Mining of Consensus Sequential
Patterns
- Hye-Chung (Monica) Kum
- University of North Carolina, Chapel Hill
- Computer Science Department
- School of Social Work
- http://www.cs.unc.edu/kum/approxMAP
2Knowledge Discovery and Data mining (KDD)
- "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data"
- The goal is to discover and present knowledge in a form that is easily comprehensible to humans, in a timely manner
- Combines ideas drawn from databases, machine learning, artificial intelligence, knowledge-based systems, information retrieval, statistics, pattern recognition, visualization, and parallel and distributed computing
- Fayyad, Piatetsky-Shapiro, Smyth 1996
10What is KDD ?
- Purpose
- Extract useful information
- Source
- Operational or Administrative Data
- Example
- VIC card database for buying patterns
- monthly welfare service patterns
11Example
- Analyze buying patterns for sales marketing
12Example
13Example
14Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
16Sequential Pattern Mining
17Sequential Pattern Mining
- Detecting patterns in sequences of sets
18Welfare Program Participation Patterns
- What are the common participation patterns ?
- What are the variations to them ?
- How do different policies affect these patterns?
19Thesis Statement
- The author of this dissertation asserts that multiple alignment is an effective model to uncover the underlying trend in sequences of sets.
- I will show that approxMAP
- is a novel method to apply multiple alignment techniques to sequences of sets,
- will effectively extract the underlying trend in the data
- by organizing the large database into clusters
- as well as by giving reasonable descriptors (weighted sequences and consensus sequences) for the clusters via multiple alignment.
- Furthermore, I will show that approxMAP
- is robust to its input parameters,
- is robust to noise and outliers in the data,
- is scalable with respect to the size of the database,
- and, in comparison to the conventional support model, can better recover the underlying pattern with little confounding information under most circumstances.
- In addition, I will demonstrate the usefulness of approxMAP using real world data.
20Thesis Statement
- Multiple alignment is an effective model to uncover the underlying trend in sequences of sets.
- ApproxMAP is a novel method to apply multiple alignment techniques to sequences of sets.
- ApproxMAP can recover the underlying patterns with little confounding information under most circumstances, including those in which the conventional methods fail.
- I will demonstrate the usefulness of approxMAP using real world data.
21Sequential Pattern Mining
- Detecting patterns in sequences of sets
- Nseq: total number of sequences in the database
- Lseq: average number of itemsets in a sequence
- Iseq: average number of items in an itemset
- Lseq · Iseq: average length of a sequence
22Conventional Methods: Support Model
- Super-sequence ⊇ Sub-sequence
- <(A,B,D)(B)(C,D)(B,C)> ⊇ <(A)(B)(C,D)>
- Support(P) = number of super-sequences of P in D
- Given D and a user threshold min_sup,
- find the complete set of patterns P s.t. Support(P) ≥ min_sup
- Methods
- Breadth first, Apriori Principle (GSP)
- R. Agrawal and R. Srikant, ICDE 95, EDBT 96
- Depth first, pattern growth (PrefixSpan)
- J. Han and J. Pei, SIGKDD 2000, ICDE 2001
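The sub-sequence relation and support count above can be sketched in Python (a minimal illustration, not the GSP or PrefixSpan algorithms themselves; the tiny `db` is made up for the example):

```python
def is_subsequence(pattern, seq):
    """True if each itemset of `pattern` is contained, in order,
    in some itemset of `seq` (the support-model sub-sequence relation)."""
    i = 0
    for itemset in seq:
        if i < len(pattern) and pattern[i] <= itemset:  # subset test
            i += 1
    return i == len(pattern)

def support(pattern, db):
    """Number of sequences in `db` that are super-sequences of `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in db)

db = [
    [{'A', 'B', 'D'}, {'B'}, {'C', 'D'}, {'B', 'C'}],
    [{'A'}, {'B'}, {'C', 'D'}],
]
print(support([{'A'}, {'B'}, {'C', 'D'}], db))  # 2
```

With a threshold min_sup, the support model would then enumerate every pattern P with support(P, db) ≥ min_sup.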
27Example: Support Model
- <(Dp,Br)(Mk,Dp)(Mk,Dp,Br)>: support 2/3 = 67%
- 2^L − 1 = 2^7 − 1 = 128 − 1 = 127 subsequences
- <(Dp,Br)(Mk,Dp)(Mk,Br)>
- <(Dp,Br)(Mk,Dp)(Mk,Dp)>
- <(Mk,Dp)(Mk,Dp,Br)>
- <(Dp,Br)(Mk,Dp,Br)>
- etc.
- <(Br)(Mk,Dp)(Mk,Dp,Br)>
- <(Dp)(Mk,Dp)(Mk,Dp,Br)>
- <(Dp,Br)(Dp)(Mk,Dp,Br)>
- <(Dp,Br)(Mk)(Mk,Dp,Br)>
- <(Dp,Br)(Mk,Dp)(Dp,Br)>
28Inherent Problems: the support model
- Support cannot distinguish between statistically significant patterns and random occurrences
- Theoretically
- short random sequences occur often in long sequential data simply by chance
- Empirically
- number of spurious patterns grows exponentially w.r.t. Lseq
29Inherent Problems: exact match
- A pattern gets support only if the pattern is exactly contained in the sequence
- Often may not find general long patterns
- Example
- many customers may share similar buying habits
- but few of them follow exactly the same pattern
30Inherent Problems: complete set
- Mines the complete set
- Too many trivial patterns
- Given long sequences with noise
- too expensive and too many patterns
- 2^L − 1 = 2^10 − 1 = 1023
- Finding max / closed sequential patterns is non-trivial
- In a noisy environment, still too many max/closed patterns
31Possible Models
- Support model
- Patterns in sets
- unordered list
- Multiple alignment model
- Find common patterns among strings
- Simple ordered list of characters
32Multiple Alignment
- line up the sequences to detect the trend
- Find common patterns among strings
- DNA / bio sequences
34Edit Distance
- Pairwise Score (edit distance): dist(seq1, seq2)
- Minimum number of ops required to change seq1 to seq2
- Ops: INDEL(a) and/or REPLACE(a,b)
- Recurrence relation
- Multiple Alignment Score
- Σ PS(seqi, seqj) (∀ 1 ≤ i ≤ N and 1 ≤ j ≤ N)
- Optimal alignment: minimum score
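The recurrence can be sketched as a standard dynamic program, here with INDEL cost 1 and the normalized set difference (defined later in the talk) as the REPLACE cost; a minimal illustration, assuming sequences are Python lists of sets:

```python
def repl(x, y):
    """REPLACE cost: normalized set difference (|X-Y| + |Y-X|) / (|X| + |Y|)."""
    return (len(x - y) + len(y - x)) / (len(x) + len(y))

def edit_dist(s1, s2):
    """Pairwise score dist(seq1, seq2) via the usual DP recurrence:
    min over an INDEL in either sequence or a REPLACE of the two itemsets."""
    n, m = len(s1), len(s2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)          # i INDELs, each of cost 1
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + repl(s1[i - 1], s2[j - 1]))
    return d[n][m]

print(edit_dist([{'A'}, {'B', 'C'}], [{'A'}, {'B'}]))  # 0.3333333333333333 (one REPLACE of cost 1/3)
```

Summing this pairwise score over all pairs of aligned sequences gives the multiple alignment score of the slide.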
36Consensus Sequence
- Weighted Sequence
- compression of aligned sequences into one sequence
- strength(i, j) = number of occurrences of item i in position j / total number of sequences
- e.g., A: 3/3 = 100%, E: 1/3 = 33%, H: 1/3 = 33%
- Consensus itemset(j) (given min_strength)
- = { ia | ia ∈ I and strength(ia, j) ≥ min_strength }
- Consensus sequence
- concatenation of the consensus itemsets
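A minimal sketch of compressing aligned sequences into a weighted sequence and cutting consensus itemsets at min_strength (gaps are `None`; the three-sequence example mirrors the A: 3/3, E: 1/3 strengths above, and the function names are mine):

```python
from collections import Counter

def weighted_sequence(aligned):
    """Compress aligned sequences (all the same length, None = gap) into
    one weighted sequence: per position, a count of each item."""
    cols = []
    for j in range(len(aligned[0])):
        c = Counter()
        for seq in aligned:
            if seq[j] is not None:
                c.update(seq[j])
        cols.append(c)
    return cols, len(aligned)

def consensus(cols, n, min_strength):
    """Keep items with strength(i, j) = count / n >= min_strength;
    drop empty itemsets and concatenate the rest."""
    result = []
    for c in cols:
        itemset = {item for item, cnt in c.items() if cnt / n >= min_strength}
        if itemset:
            result.append(itemset)
    return result

aligned = [
    [{'A'}, {'B', 'C'}],
    [{'A', 'E'}, {'B'}],
    [{'A'}, None],
]
cols, n = weighted_sequence(aligned)
print(consensus(cols, n, 0.5))  # [{'A'}, {'B'}]
```

Lowering min_strength to 0.2 would also admit the 33%-strength items E and C, which is exactly how the pattern vs. variation consensus sequences differ later in the talk.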
45Multiple Alignment Sequential Pattern Mining
- Given
- N sequences of sets,
- Op costs (INDEL, REPLACE) for itemsets, and
- Strength thresholds for consensus sequences
- To
- (1) partition the N sequences into K sets of sequences such that the sum of the K multiple alignment scores is minimum,
- (2) find the multiple alignment for each partition, and
- (3) find the pattern consensus sequence and the variation consensus sequence for each partition
46Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
47ApproxMAP (Approximate Multiple Alignment
Pattern mining)
- Exact solution: too expensive!
- Approximation Method: ApproxMAP
- Organize into K partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
48Tasks
- Op costs (INDEL REPLACE) for itemsets
- Organize into K partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
50Op costs for itemsets
- Normalized set difference
- R(X,Y) = (|X−Y| + |Y−X|) / (|X| + |Y|)
- 0 ≤ R ≤ 1, metric
- INDEL(X) = R(X,∅) = 1
- Jaccard coefficient
- 1 − |X∩Y| / |X∪Y|
- = 1 − |X∩Y| / (|X−Y| + |Y−X| + |X∩Y|)
- Sørensen coefficient (simple matching index)
- gives greater "weight" to common elements
- 1 − 2|X∩Y| / (|X−Y| + |Y−X| + 2|X∩Y|)
- = (|X| + |Y| − 2|X∩Y|) / (|X| + |Y|) = R(X,Y)
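The three distances can be checked against each other directly; the point of the slide, that the normalized set difference R coincides with the Sørensen distance, falls out numerically (a sketch with an arbitrary pair of itemsets):

```python
def norm_set_diff(x, y):
    """R(X,Y) = (|X-Y| + |Y-X|) / (|X| + |Y|)."""
    return (len(x - y) + len(y - x)) / (len(x) + len(y))

def jaccard_dist(x, y):
    """1 - |X∩Y| / |X∪Y|."""
    return 1 - len(x & y) / len(x | y)

def sorensen_dist(x, y):
    """1 - 2|X∩Y| / (|X| + |Y|), i.e. (|X| + |Y| - 2|X∩Y|) / (|X| + |Y|)."""
    return 1 - 2 * len(x & y) / (len(x) + len(y))

x, y = {'A', 'B', 'D'}, {'A', 'C'}
# R and the Sørensen distance coincide; Jaccard weights the overlap less
print(norm_set_diff(x, y), sorensen_dist(x, y), jaccard_dist(x, y))  # 0.6 0.6 0.75
```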
59Tasks
- Op costs (INDEL REPLACE) for itemsets
- Organize into K partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
60Organize: partition into K sets
- Goal
- to minimize the sum of the K multiple alignment scores
- Group similar sequences
- Approximate
- Calculate the N×N proximity matrix
- Pairwise score: edit distance
- Any clustering that works best for your data
61Organize Clustering
- Desirable Properties
- Form groups of arbitrary shape and size
- Can estimate the number of clusters from the data
62Density Based Clustering
- k-nearest neighbor
- Partition based at the valleys of the density estimate
- Density of a sequence = n / (|D| · d) ∝ n / d
- n, d based on the user defined k nearest neighbor space
- n: number of neighbors
- d: size of the neighbor region
- Parameter k: neighbor space
- Can cluster at different resolutions as desired
- General: uniform kernel k-NN clustering
- Efficient: O(kN)
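A sketch of the uniform-kernel k-NN density estimate, assuming a precomputed symmetric proximity matrix of pairwise edit distances (the partitioning step at the valleys is omitted; the small matrix is made up, with row 3 an outlier):

```python
def knn_density(dist_matrix, k):
    """Uniform-kernel k-NN density: density(i) ∝ n / d, where d is the
    distance to the k-th nearest neighbor of i (the size of the neighbor
    region) and n is the number of sequences within that region."""
    densities = []
    for i, row in enumerate(dist_matrix):
        others = sorted(row[j] for j in range(len(row)) if j != i)
        d = others[k - 1]                      # size of the neighbor region
        n = sum(1 for v in others if v <= d)   # neighbors within that region
        densities.append(n / d if d > 0 else float('inf'))
    return densities

dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.15, 0.8],
    [0.2, 0.15, 0.0, 0.85],
    [0.9, 0.8, 0.85, 0.0],
]
dens = knn_density(dist, k=2)
print([round(v, 2) for v in dens])  # [10.0, 13.33, 10.0, 2.35] — the outlier has the lowest density
```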
63Tasks
- Op costs (INDEL REPLACE) for itemsets
- Organize into k partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
64Data Compression: Multiple Alignment
- Optimal multiple alignment: too expensive!
- Greedy approximation
- Incrementally align in density-descending order
- Pairwise alignment: sequence to weighted sequence
65(No Transcript)
66(No Transcript)
67Multiple Alignment
75Op Cost for Itemset to Weighted Itemset
- Replace((A:3, E:1):3, (A,G)) with n = 4 → 65/120
- Rw(Xw,Y) = (weight(X) + |Y|·wX − 2·weight(X∩Y)) / (weight(X) + |Y|·wX)
- cf. R(X,Y) = (|X| + |Y| − 2|X∩Y|) / (|X| + |Y|)
- Adjusting for the n − wX sequences not represented in Xw:
- R′w(Xw,Y) = (Rw(Xw,Y)·wX + (n − wX)) / n
- 0 ≤ Rw ≤ 1, metric
- INDEL(Xw) = Rw(Xw,∅) = INDEL(Y) = Rw(∅,Y) = 1
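A sketch of the REPLACE cost between a weighted itemset and a plain itemset, following my reading of the garbled formulas above (variable names are mine; the final step spreads the cost over the n − wX sequences not represented in Xw):

```python
def weighted_repl(xw, wx, y, n):
    """REPLACE cost between a weighted itemset and a plain itemset.
    xw: dict item -> count, wx: number of sequences aligned into xw,
    y: plain itemset, n: total sequences in the weighted sequence."""
    weight_x = sum(xw.values())
    weight_xy = sum(cnt for item, cnt in xw.items() if item in y)
    denom = weight_x + len(y) * wx
    rw = (weight_x + len(y) * wx - 2 * weight_xy) / denom
    # adjust for the n - wx sequences that contribute nothing to xw
    return (rw * wx + (n - wx)) / n

print(weighted_repl({'A': 3, 'E': 1}, 3, {'A', 'G'}, 4))  # 0.55 under this reconstruction
```

With y empty the cost degenerates to 1, matching the INDEL cost on the slide.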
84Tasks
- Op costs (INDEL REPLACE) for itemsets
- Organize into K partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
85Summarize: Generate and Present results
- N sequences → K weighted sequences
- Weighted sequence: huge
- a compression of all sequences
86< (E1, L1, R1, T1, V1, d1) (A1, B9, C8,
D8, E12, F1, L4, P1, S1, T8, V5, X1,
a1, d10, e2, f1, g1, p1) (B99, C96, D91,
E24, F2, G1, L15, P7, R2, S8, T95, V15,
X2, Y1, a2, d26, e3, g6, l1, m1) (A5,
B16, C5, D3, E13, F1, H2, L7, P1, R2,
S7, T6, V7, Y3, d3, g1) (A13, B126, C27,
D1, E32, G5, H3, J1, L1, R1, S32, T21,
V1, W3, X2, Y8, d13, e1, f8, i2, p7,
l3, g1) (A12, B6, C28, D1, E28, G5, H2,
J6, L2, S137, T10, V2, W6, X8, Y124, a1,
d6, g2, i1, l1, m2) (A135, B2, C23, E36,
G12, H124, K1, L4, O2, R2, S27, T6, V6,
W10, X3, Y8, Z2, a1, d6, g1, h2, j1,
k5, l3, m7, n1) (A11, B1, C5, E12, G3,
H10, L7, O4, S5, T1, V7, W3, X2, Y3,
a1, m2) (A31, C15, E10, G15, H25, K1,
L7, M1, O1, R4, S12, T10, V6, W3, Y3,
Z3, d7, h3, j2, l1, n1, p1, q1) (A3,
C5, E4, G7, H1, K1, R1, T1, W2, Z2, a1,
d1, h1, n1) (A20, C27, E13, G35, H7, K7,
L111, N2, O1, Q3, R11, S10, T20, V111,
W2, X2, Y3, Z8, a1, b1, d13, h9, j1,
n1, o2) (A17, B2, C14, E17, F1, G31, H8,
K13, L2, M2, N1, R22, S2, T140, U1, V2,
W2, X1, Z13, a1, b8, d6, h14, n6, p1,
q1) (A12, B7, C5, E13, G16, H5, K106,
L8, N2, O1, R32, S3, T29, V9, X2, Z9,
b16, c5, d5, h7, l1) (A7, B1, C9, E5,
G7, H3, K7, R8, S1, T10, X1, Z3, a2,
b3, c1, d5, h3) (A1, B1, H1, R1, T1,
b2, c1) (A3, B2, C2, E6, F2, G4, H2,
K20, M2, N3, R19, S3, T11, U2, X4, Z34,
a3, b11, c2, d4) (H1, Y1, a1, d1) > 162
87Presentation model
- Frequent items
- definite pattern items
- Cutoff ≥ 50%
- Common items
- uncertain
- Cutoff ≥ 20%
- Rare items
- noise items
- Strength scale: W = 100% … ≥ 50% … ≥ 20% … W = 0%
88Visualization
- Pattern Consensus Sequence
- Cutoff: minimum cluster strength (≥ 50%)
- Frequent items
- Variation Consensus Sequence
- Cutoff: minimum cluster strength (≥ 20%)
- Frequent items + common items
89Example Given 10 seqs lexically sorted
91Color scheme: < 100% 85% 70% 50% 35% 20% >
97Example: Support Model (202 seq)
(A) (B,C) (D,E) (I,J) (K) (L,M)
98ApproxMAP (Approximate Multiple Alignment Pattern mining)
- Approximation Method: ApproxMAP
- Organize into K partitions: O(Nseq² · Lseq² · Iseq)
- Proximity matrix: O(Nseq² · Lseq² · Iseq)
- Clustering: O(k · Nseq)
- Compress each partition: O(n · L²)
- weighted sequences: O(n · L²)
- Summarize each partition: O(1)
- Pattern consensus sequence
- Variation consensus sequence
- Time Complexity: O(Nseq² · Lseq² · Iseq)
- 2 optimizations
99Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
100Evaluation
- Up to now: only performance / scalability
- Quality?
- What kind of patterns will the model generate?
- Evaluate the correctness of the model
- Why?
- Basis for comparison of different models
- Essential in understanding results of approximate solutions
101Evaluation Method
- Given
- set of base patterns B, with E(FB) and E(LB), → D
- D → set of result patterns P
- How?
- Map each Pi to the best Bj
- based on Longest Common Subsequences
- of all Bj, the one maximizing the shared subsequence |B ∩ P|
102Item level
Confusion Matrix
105Evaluation Criteria: Item level
- Recoverability
- degree of pattern items found in the base patterns (weighted)
- R = Σ_B E(FB) · (max over result patterns P of |B ∩ P|) / E(LB)
- cut off so that 0 ≤ R ≤ 1
- Precision
- degree of pattern items in the result patterns
- Pattern Items / (Pattern Items + Extraneous Items)
106Evaluation Criteria: Sequence level
- Spurious patterns
- Pattern Items < Extraneous Items
- Max patterns: determine the max pattern for each Bj
- of all Pi mapped to a particular Bj,
- the Pi with the Longest Common Subsequence: max over P of |B ∩ P|
- Redundant patterns
- all other patterns
- Ntotal = Nmax + Nspur + Nredun
107Evaluation Example
- 30% (A)(BC)(DE)
- 70% (IJ)(K)(LM)
- Result patterns
- (A)(B)(DE)
- (A)(BC)(D)
- (B)(BC)(DE)
- (IJ)(LM)
- (J)(K)(LM)
- (IJ)(KX)(LM)
- (XY)(K)(Z)
- Ntotal = 7
- Spurious = 1
- Recoverability = (30%)(4/5) + (70%)(5/5) = 94%
- Redundant = 4
- Precision = 1 − 5/31 = 84%
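The item-level numbers of this example can be reproduced with a small script; the talk maps result patterns to base patterns via longest common subsequences, but plain shared-item counts give the same figures here (a sketch, not the evaluation code from the dissertation):

```python
from collections import Counter

def items(pat):
    """Multiset of items in a pattern (a list of itemsets)."""
    c = Counter()
    for itemset in pat:
        c.update(itemset)
    return c

def evaluate(bases, weights, results):
    """Map each result pattern to the base pattern it shares the most
    items with; shared items count as pattern items, the rest as
    extraneous. Recoverability weights each base pattern's best
    coverage; precision penalizes extraneous items."""
    pattern_items = extraneous = 0
    best = [0] * len(bases)
    for res in results:
        shared = [sum((items(res) & items(b)).values()) for b in bases]
        j = shared.index(max(shared))
        best[j] = max(best[j], shared[j])
        pattern_items += shared[j]
        extraneous += sum(items(res).values()) - shared[j]
    recoverability = sum(
        w * best[j] / sum(items(b).values())
        for j, (w, b) in enumerate(zip(weights, bases)))
    precision = pattern_items / (pattern_items + extraneous)
    return recoverability, precision

bases = [[{'A'}, {'B', 'C'}, {'D', 'E'}], [{'I', 'J'}, {'K'}, {'L', 'M'}]]
results = [
    [{'A'}, {'B'}, {'D', 'E'}],
    [{'A'}, {'B', 'C'}, {'D'}],
    [{'B'}, {'B', 'C'}, {'D', 'E'}],
    [{'I', 'J'}, {'L', 'M'}],
    [{'J'}, {'K'}, {'L', 'M'}],
    [{'I', 'J'}, {'K', 'X'}, {'L', 'M'}],
    [{'X', 'Y'}, {'K'}, {'Z'}],
]
r, p = evaluate(bases, [0.3, 0.7], results)
print(round(r, 2), round(p, 2))  # 0.94 0.84
```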
113Synthetic data
- Patterned data: IBM synthetic data generator
- given certain DB parameters, outputs
- sequence DB
- base patterns used to generate it, E(FB) and E(LB)
- R. Agrawal and R. Srikant, ICDE 95, EDBT 96
- Random data
- independence both within and across itemsets
- Patterned data + systematic noise
- randomly change an item with probability α
- Yang et al., SIGMOD 2002
- Patterned data + systematic outliers
- random sequences
114Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
115Results
- ApproxMAP
- Pattern consensus sequence
- no null or one-itemset patterns
- Machine: swan
- 2GHz Intel Xeon processor
- 2GB of memory
- public machine
- difficult to get consistent running time measurements
- Thanks!
116Database Parameter
117Database Parameter
118(No Transcript)
119(No Transcript)
120(weighted sequence repeated from slide 86)
ApproxMAP
121ApproxMAP
122ApproxMAP
123(No Transcript)
124
- 8 patterns returned: 7 max patterns, 1 redundant pattern, 0 spurious patterns
- Recoverability: 91.16%
- Precision: 97.17%
- Extraneous Items: 3/106
125Comparative Study
- Conventional Sequential Pattern Mining
- Support Model
- Empirical analysis
- Totally random data
- Patterned data
- Patterned data + noise
- Patterned data + outliers
126Evaluation Comparison
127Evaluation Comparison
128Evaluation Comparison
129Robustness w.r.t. noise
130Evaluation Comparison
131Understanding ApproxMAP
- 5 experiments
- k in kNN clustering
- Strength cutoff
- Order of alignment
- Optimization 1 reduced precision in prox.
matrix - Optimization 2 sample based iterative clustering
132A realistic DB
133Input parameters
134The order in multiple alignment
135Understanding ApproxMAP
- Optimization 1 (Lseq)
- running time reduced to 40%
- Optimization 2 (Nseq)
- running time reduced to 10-40%
- for a negligible reduction in recoverability
136Effects of the DB parameters: scalability
- 4 experiments
- I: number of unique items in the database (density of the database)
- 1,000 → 10,000
- Nseq: number of sequences in the data
- 10,000 → 100,000
- Lseq: avg. number of itemsets per data sequence
- 10 → 50
- Iseq: avg. number of items per itemset in the DB
- 2.5 → 10
137(No Transcript)
138Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
139Case Study: Real data
- Monthly services to children with an AN report
- 992 sequences
- 15 interpretable and useful patterns
- (RPT)(INV,FC)(FC) ..11.. (FC): 419 sequences
- (RPT)(INV,FC)(FC)(FC): 57 sequences
- (RPT)(INV,FC,T)(FC,T)(FC,HM)(FC)(FC,HM): 39 sequences
140Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
141Conclusion: why does it work well?
- Robust on random weak patterned noise
- Very good at organizing sequences
- Long sequence data that are not random have
unique signatures
142What have I done?
- defines a new model, Multiple Alignment Sequential Pattern Mining,
- describes a novel solution, ApproxMAP (for APPROXimate Multiple Alignment Pattern mining), that introduces
- a new metric for itemsets,
- weighted sequences, a new representation of alignment information,
- and the effective use of strength cutoffs to control the level of detail included in the consensus patterns,
- designs a general evaluation method to assess the quality of results from sequential pattern mining algorithms,
143What have I done?
- employs the evaluation method to run an extensive set of empirical evaluations of approxMAP on synthetic data,
- employs the evaluation method to compare the effectiveness of approxMAP to the conventional methods based on the support model,
- derives the expected support of a random sequence under the null hypothesis of no pattern in the database, to better understand the behavior of the support based methods, and
- demonstrates the usefulness of approxMAP using real world data.
144Future Work
- Sample based iterative clustering
- Memory management
- Distance metric
- Multisets
- Taxonomy tree
- Strength cutoff
- Automatic detection of customized
- Local alignment
145Thank You !
- Advisor
- Wei Wang (02-04)
- James Coggins (00-02)
- Prasun Dewan (99-00)
- Kye Hedlund (96-99)
- Jan Prins (95-96)
- SW advisor
- Dean Duncan (98-04)
- Other people
- Janet Jones
- Kim Flair
- Susan Paulsen
- Fellow students
- Priyank Porwal, Andrew Leaver-Fay, Leland Smith
- Committee
- Stephen Aylward
- Jan Prins
- Andrew Nobel
- Other faculty
- Jian Pei
- Jack Snoeyink
- J. S. Marron
- Stephen Pizer
- Stephen Weiss
- Colleagues
- Sang-Uok Kum, Jisung Kim
- Alexandra, Michelle,
- Aron, Chris
- Family
- Sohmee, My mom, dad, sister
146(No Transcript)
147Rw
- R(X,Y) = (|X| + |Y| − 2|X∩Y|) / (|X| + |Y|)
- Rw(Xw,Y) = (weight(X) + |Y|·wX − 2·weight(X∩Y)) / (weight(X) + |Y|·wX)
- R′w(Xw,Y) = (Rw(Xw,Y)·wX + (n − wX)) / n