1Approximate Mining of Consensus Sequential
Patterns
- Hye-Chung (Monica) Kum
- University of North Carolina, Chapel Hill
- Computer Science Department
- School of Social Work
- http://www.cs.unc.edu/kum/approxMAP
2Knowledge Discovery and Data mining (KDD)
- "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data"
- The goal is to discover and present knowledge in a form that is easily comprehensible to humans, in a timely manner
- Combines ideas drawn from databases, machine learning, artificial intelligence, knowledge-based systems, information retrieval, statistics, pattern recognition, visualization, and parallel and distributed computing
- Fayyad, Piatetsky-Shapiro, Smyth 1996
10What is KDD ?
- Purpose
- Extract useful information
- Source
- Operational or Administrative Data
- Example
- VIC card database for buying patterns
- monthly welfare service patterns
11Example
- Analyze buying patterns for sales marketing
12Example
13Example
14Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
16Sequential Pattern Mining
17Sequential Pattern Mining
- Detecting patterns in sequences of sets
18Welfare Program Participation Patterns
- What are the common participation patterns ?
- What are the variations to them ?
- How do different policies affect these patterns?
19Thesis Statement
- The author of this dissertation asserts that multiple alignment is an effective model to uncover the underlying trend in sequences of sets.
- I will show that approxMAP
- is a novel method to apply multiple alignment techniques to sequences of sets,
- will effectively extract the underlying trend in the data
- by organizing the large database into clusters
- as well as by giving reasonable descriptors (weighted sequences and consensus sequences) for the clusters via multiple alignment.
- Furthermore, I will show that approxMAP
- is robust to its input parameters,
- is robust to noise and outliers in the data,
- is scalable with respect to the size of the database,
- and, in comparison to the conventional support model, can better recover the underlying pattern with little confounding information under most circumstances.
- In addition, I will demonstrate the usefulness of approxMAP using real world data.
20Thesis Statement
- Multiple alignment is an effective model to uncover the underlying trend in sequences of sets.
- ApproxMAP is a novel method to apply multiple alignment techniques to sequences of sets.
- ApproxMAP can recover the underlying patterns with little confounding information under most circumstances, including those in which the conventional methods fail.
- I will demonstrate the usefulness of approxMAP using real world data.
21Sequential Pattern Mining
- Detecting patterns in sequences of sets
- Nseq: total number of sequences in the database
- Lseq: average number of itemsets in a sequence
- Iseq: average number of items in an itemset
- Lseq · Iseq: average length of a sequence
22Conventional Methods: Support Model
- Super-sequence ⊇ Sub-sequence
- <(A,B,D)(B)(C,D)(B,C)> ⊇ <(A)(B)(C,D)>
- Support(P) = number of super-sequences of P in D
- Given D and a user threshold min_sup,
- find the complete set of patterns P s.t. Support(P) ≥ min_sup
- Methods
- Breadth first, Apriori Principle (GSP)
- R. Agrawal and R. Srikant, ICDE 95, EDBT 96
- Depth first, pattern growth (PrefixSpan)
- J. Han and J. Pei, SIGKDD 2000, ICDE 2001
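The sub-sequence relation and support count above can be sketched in Python (a minimal illustration, not the GSP or PrefixSpan algorithms themselves; the tiny `db` is made up for the example):

```python
def is_subsequence(pattern, seq):
    """True if each itemset of `pattern` is contained, in order,
    in some itemset of `seq` (the support-model sub-sequence relation)."""
    i = 0
    for itemset in seq:
        if i < len(pattern) and pattern[i] <= itemset:  # subset test
            i += 1
    return i == len(pattern)

def support(pattern, db):
    """Number of sequences in `db` that are super-sequences of `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in db)

db = [
    [{'A', 'B', 'D'}, {'B'}, {'C', 'D'}, {'B', 'C'}],
    [{'A'}, {'B'}, {'C', 'D'}],
]
print(support([{'A'}, {'B'}, {'C', 'D'}], db))  # 2
```

With a threshold min_sup, the support model would then enumerate every pattern P with support(P, db) ≥ min_sup.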
27Example: Support Model
- <(Dp,Br)(Mk,Dp)(Mk,Dp,Br)>: support 2/3 = 67%
- 2^L − 1 = 2^7 − 1 = 128 − 1 = 127 subsequences
- <(Dp,Br)(Mk,Dp)(Mk,Br)>
- <(Dp,Br)(Mk,Dp)(Mk,Dp)>
- <(Mk,Dp)(Mk,Dp,Br)>
- <(Dp,Br)(Mk,Dp,Br)>
- etc.
- <(Br)(Mk,Dp)(Mk,Dp,Br)>
- <(Dp)(Mk,Dp)(Mk,Dp,Br)>
- <(Dp,Br)(Dp)(Mk,Dp,Br)>
- <(Dp,Br)(Mk)(Mk,Dp,Br)>
- <(Dp,Br)(Mk,Dp)(Dp,Br)>
28Inherent Problems: the support model
- Support cannot distinguish between statistically significant patterns and random occurrences
- Theoretically
- short random sequences occur often in long sequential data simply by chance
- Empirically
- number of spurious patterns grows exponentially w.r.t. Lseq
29Inherent Problems: exact match
- A pattern gets support only if the pattern is exactly contained in the sequence
- Often may not find general long patterns
- Example
- many customers may share similar buying habits
- but few of them follow exactly the same pattern
30Inherent Problems: complete set
- Mines the complete set
- Too many trivial patterns
- Given long sequences with noise
- too expensive and too many patterns
- 2^L − 1 = 2^10 − 1 = 1023
- Finding max / closed sequential patterns is non-trivial
- In a noisy environment, still too many max/closed patterns
31Possible Models
- Support model
- Patterns in sets
- unordered list
- Multiple alignment model
- Find common patterns among strings
- Simple ordered list of characters
32Multiple Alignment
- line up the sequences to detect the trend
- Find common patterns among strings
- DNA / bio sequences
34Edit Distance
- Pairwise Score (edit distance): dist(seq1, seq2)
- Minimum number of ops required to change seq1 to seq2
- Ops: INDEL(a) and/or REPLACE(a,b)
- Recurrence relation
- Multiple Alignment Score
- Σ PS(seqi, seqj) (∀ 1 ≤ i ≤ N and 1 ≤ j ≤ N)
- Optimal alignment: minimum score
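The recurrence can be sketched as a standard dynamic program, here with INDEL cost 1 and the normalized set difference (defined later in the talk) as the REPLACE cost; a minimal illustration, assuming sequences are Python lists of sets:

```python
def repl(x, y):
    """REPLACE cost: normalized set difference (|X-Y| + |Y-X|) / (|X| + |Y|)."""
    return (len(x - y) + len(y - x)) / (len(x) + len(y))

def edit_dist(s1, s2):
    """Pairwise score dist(seq1, seq2) via the usual DP recurrence:
    min over an INDEL in either sequence or a REPLACE of the two itemsets."""
    n, m = len(s1), len(s2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)          # i INDELs, each of cost 1
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + repl(s1[i - 1], s2[j - 1]))
    return d[n][m]

print(edit_dist([{'A'}, {'B', 'C'}], [{'A'}, {'B'}]))  # 0.3333333333333333 (one REPLACE of cost 1/3)
```

Summing this pairwise score over all pairs of aligned sequences gives the multiple alignment score of the slide.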
36Consensus Sequence
- Weighted Sequence
- compression of aligned sequences into one sequence
- strength(i, j) = number of occurrences of item i in position j / total number of sequences
- e.g., A: 3/3 = 100%, E: 1/3 = 33%, H: 1/3 = 33%
- Consensus itemset(j) (given min_strength)
- = { ia | ia ∈ I and strength(ia, j) ≥ min_strength }
- Consensus sequence
- concatenation of the consensus itemsets
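A minimal sketch of compressing aligned sequences into a weighted sequence and cutting consensus itemsets at min_strength (gaps are `None`; the three-sequence example mirrors the A: 3/3, E: 1/3 strengths above, and the function names are mine):

```python
from collections import Counter

def weighted_sequence(aligned):
    """Compress aligned sequences (all the same length, None = gap) into
    one weighted sequence: per position, a count of each item."""
    cols = []
    for j in range(len(aligned[0])):
        c = Counter()
        for seq in aligned:
            if seq[j] is not None:
                c.update(seq[j])
        cols.append(c)
    return cols, len(aligned)

def consensus(cols, n, min_strength):
    """Keep items with strength(i, j) = count / n >= min_strength;
    drop empty itemsets and concatenate the rest."""
    result = []
    for c in cols:
        itemset = {item for item, cnt in c.items() if cnt / n >= min_strength}
        if itemset:
            result.append(itemset)
    return result

aligned = [
    [{'A'}, {'B', 'C'}],
    [{'A', 'E'}, {'B'}],
    [{'A'}, None],
]
cols, n = weighted_sequence(aligned)
print(consensus(cols, n, 0.5))  # [{'A'}, {'B'}]
```

Lowering min_strength to 0.2 would also admit the 33%-strength items E and C, which is exactly how the pattern vs. variation consensus sequences differ later in the talk.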
45Multiple Alignment Sequential Pattern Mining
- Given
- N sequences of sets,
- Op costs (INDEL, REPLACE) for itemsets, and
- Strength thresholds for consensus sequences
- To
- (1) partition the N sequences into K sets of sequences such that the sum of the K multiple alignment scores is minimum,
- (2) find the multiple alignment for each partition, and
- (3) find the pattern consensus sequence and the variation consensus sequence for each partition
46Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
47ApproxMAP (Approximate Multiple Alignment
Pattern mining)
- Exact solution: too expensive!
- Approximation Method: ApproxMAP
- Organize into K partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
48Tasks
- Op costs (INDEL REPLACE) for itemsets
- Organize into K partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
50Op costs for itemsets
- Normalized set difference
- R(X,Y) = (|X−Y| + |Y−X|) / (|X| + |Y|)
- 0 ≤ R ≤ 1, metric
- INDEL(X) = R(X,∅) = 1
- Jaccard coefficient
- 1 − |X∩Y| / |X∪Y|
- = 1 − |X∩Y| / (|X−Y| + |Y−X| + |X∩Y|)
- Sørensen coefficient (simple matching index)
- gives greater "weight" to common elements
- 1 − 2|X∩Y| / (|X−Y| + |Y−X| + 2|X∩Y|)
- = (|X| + |Y| − 2|X∩Y|) / (|X| + |Y|) = R(X,Y)
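The three distances can be checked against each other directly; the point of the slide, that the normalized set difference R coincides with the Sørensen distance, falls out numerically (a sketch with an arbitrary pair of itemsets):

```python
def norm_set_diff(x, y):
    """R(X,Y) = (|X-Y| + |Y-X|) / (|X| + |Y|)."""
    return (len(x - y) + len(y - x)) / (len(x) + len(y))

def jaccard_dist(x, y):
    """1 - |X∩Y| / |X∪Y|."""
    return 1 - len(x & y) / len(x | y)

def sorensen_dist(x, y):
    """1 - 2|X∩Y| / (|X| + |Y|), i.e. (|X| + |Y| - 2|X∩Y|) / (|X| + |Y|)."""
    return 1 - 2 * len(x & y) / (len(x) + len(y))

x, y = {'A', 'B', 'D'}, {'A', 'C'}
# R and the Sørensen distance coincide; Jaccard weights the overlap less
print(norm_set_diff(x, y), sorensen_dist(x, y), jaccard_dist(x, y))  # 0.6 0.6 0.75
```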
59Tasks
- Op costs (INDEL REPLACE) for itemsets
- Organize into K partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
60Organize: partition into K sets
- Goal
- to minimize the sum of the K multiple alignment scores
- Group similar sequences
- Approximate
- Calculate the N×N proximity matrix
- Pairwise score: edit distance
- Any clustering that works best for your data
61Organize Clustering
- Desirable Properties
- Form groups of arbitrary shape and size
- Can estimate the number of clusters from the data
62Density Based Clustering
- k-nearest neighbor
- Partition based at the valleys of the density estimate
- Density of a sequence = n / (|D| · d) ∝ n / d
- n, d based on the user defined k nearest neighbor space
- n: number of neighbors
- d: size of the neighbor region
- Parameter k: neighbor space
- Can cluster at different resolutions as desired
- General: uniform kernel k-NN clustering
- Efficient: O(kN)
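A sketch of the uniform-kernel k-NN density estimate, assuming a precomputed symmetric proximity matrix of pairwise edit distances (the partitioning step at the valleys is omitted; the small matrix is made up, with row 3 an outlier):

```python
def knn_density(dist_matrix, k):
    """Uniform-kernel k-NN density: density(i) ∝ n / d, where d is the
    distance to the k-th nearest neighbor of i (the size of the neighbor
    region) and n is the number of sequences within that region."""
    densities = []
    for i, row in enumerate(dist_matrix):
        others = sorted(row[j] for j in range(len(row)) if j != i)
        d = others[k - 1]                      # size of the neighbor region
        n = sum(1 for v in others if v <= d)   # neighbors within that region
        densities.append(n / d if d > 0 else float('inf'))
    return densities

dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.15, 0.8],
    [0.2, 0.15, 0.0, 0.85],
    [0.9, 0.8, 0.85, 0.0],
]
dens = knn_density(dist, k=2)
print([round(v, 2) for v in dens])  # [10.0, 13.33, 10.0, 2.35] — the outlier has the lowest density
```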
63Tasks
- Op costs (INDEL REPLACE) for itemsets
- Organize into k partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
64Data Compression: Multiple Alignment
- Optimal multiple alignment: too expensive!
- Greedy approximation
- Incrementally align in density-descending order
- Pairwise alignment: sequence to weighted sequence
65(No Transcript)
66(No Transcript)
67Multiple Alignment
75Op Cost for Itemset to Weighted Itemset
- Replace((A:3, E:1):3, (A,G)) with n = 4 → 65/120
- Rw(Xw,Y) = (weight(X) + |Y|·wX − 2·weight(X∩Y)) / (weight(X) + |Y|·wX)
- cf. R(X,Y) = (|X| + |Y| − 2|X∩Y|) / (|X| + |Y|)
- Adjusting for the n − wX sequences not represented in Xw:
- R′w(Xw,Y) = (Rw(Xw,Y)·wX + (n − wX)) / n
- 0 ≤ Rw ≤ 1, metric
- INDEL(Xw) = Rw(Xw,∅) = INDEL(Y) = Rw(∅,Y) = 1
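A sketch of the REPLACE cost between a weighted itemset and a plain itemset, following my reading of the garbled formulas above (variable names are mine; the final step spreads the cost over the n − wX sequences not represented in Xw):

```python
def weighted_repl(xw, wx, y, n):
    """REPLACE cost between a weighted itemset and a plain itemset.
    xw: dict item -> count, wx: number of sequences aligned into xw,
    y: plain itemset, n: total sequences in the weighted sequence."""
    weight_x = sum(xw.values())
    weight_xy = sum(cnt for item, cnt in xw.items() if item in y)
    denom = weight_x + len(y) * wx
    rw = (weight_x + len(y) * wx - 2 * weight_xy) / denom
    # adjust for the n - wx sequences that contribute nothing to xw
    return (rw * wx + (n - wx)) / n

print(weighted_repl({'A': 3, 'E': 1}, 3, {'A', 'G'}, 4))  # 0.55 under this reconstruction
```

With y empty the cost degenerates to 1, matching the INDEL cost on the slide.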
84Tasks
- Op costs (INDEL REPLACE) for itemsets
- Organize into K partitions
- Use clustering
- Compress each partition into
- weighted sequences
- Summarize each partition into
- Pattern consensus sequence
- Variation consensus sequence
85Summarize: Generate and Present results
- N sequences → K weighted sequences
- Weighted sequence: huge
- a compression of all sequences
86< (E1, L1, R1, T1, V1, d1) (A1, B9, C8,
D8, E12, F1, L4, P1, S1, T8, V5, X1,
a1, d10, e2, f1, g1, p1) (B99, C96, D91,
E24, F2, G1, L15, P7, R2, S8, T95, V15,
X2, Y1, a2, d26, e3, g6, l1, m1) (A5,
B16, C5, D3, E13, F1, H2, L7, P1, R2,
S7, T6, V7, Y3, d3, g1) (A13, B126, C27,
D1, E32, G5, H3, J1, L1, R1, S32, T21,
V1, W3, X2, Y8, d13, e1, f8, i2, p7,
l3, g1) (A12, B6, C28, D1, E28, G5, H2,
J6, L2, S137, T10, V2, W6, X8, Y124, a1,
d6, g2, i1, l1, m2) (A135, B2, C23, E36,
G12, H124, K1, L4, O2, R2, S27, T6, V6,
W10, X3, Y8, Z2, a1, d6, g1, h2, j1,
k5, l3, m7, n1) (A11, B1, C5, E12, G3,
H10, L7, O4, S5, T1, V7, W3, X2, Y3,
a1, m2) (A31, C15, E10, G15, H25, K1,
L7, M1, O1, R4, S12, T10, V6, W3, Y3,
Z3, d7, h3, j2, l1, n1, p1, q1) (A3,
C5, E4, G7, H1, K1, R1, T1, W2, Z2, a1,
d1, h1, n1) (A20, C27, E13, G35, H7, K7,
L111, N2, O1, Q3, R11, S10, T20, V111,
W2, X2, Y3, Z8, a1, b1, d13, h9, j1,
n1, o2) (A17, B2, C14, E17, F1, G31, H8,
K13, L2, M2, N1, R22, S2, T140, U1, V2,
W2, X1, Z13, a1, b8, d6, h14, n6, p1,
q1) (A12, B7, C5, E13, G16, H5, K106,
L8, N2, O1, R32, S3, T29, V9, X2, Z9,
b16, c5, d5, h7, l1) (A7, B1, C9, E5,
G7, H3, K7, R8, S1, T10, X1, Z3, a2,
b3, c1, d5, h3) (A1, B1, H1, R1, T1,
b2, c1) (A3, B2, C2, E6, F2, G4, H2,
K20, M2, N3, R19, S3, T11, U2, X4, Z34,
a3, b11, c2, d4) (H1, Y1, a1, d1) > 162
87Presentation model
- Frequent items
- definite pattern items
- Cutoff ≥ 50%
- Common items
- uncertain
- Cutoff ≥ 20%
- Rare items
- noise items
- Strength scale: W = 100% … ≥ 50% … ≥ 20% … W = 0%
88Visualization
- Pattern Consensus Sequence
- Cutoff: minimum cluster strength (≥ 50%)
- Frequent items
- Variation Consensus Sequence
- Cutoff: minimum cluster strength (≥ 20%)
- Frequent items + common items
89Example Given 10 seqs lexically sorted
91Color scheme: < 100% 85% 70% 50% 35% 20% >
97Example: Support Model (202 seq)
(A) (B,C) (D,E) (I,J) (K) (L,M)
98ApproxMAP (Approximate Multiple Alignment Pattern mining)
- Approximation Method: ApproxMAP
- Organize into K partitions: O(Nseq² · Lseq² · Iseq)
- Proximity matrix: O(Nseq² · Lseq² · Iseq)
- Clustering: O(k · Nseq)
- Compress each partition: O(n · L²)
- weighted sequences: O(n · L²)
- Summarize each partition: O(1)
- Pattern consensus sequence
- Variation consensus sequence
- Time Complexity: O(Nseq² · Lseq² · Iseq)
- 2 optimizations
99Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
100Evaluation
- Up to now: only performance / scalability
- Quality?
- What kind of patterns will the model generate?
- Evaluate the correctness of the model
- Why?
- Basis for comparison of different models
- Essential in understanding results of approximate solutions
101Evaluation Method
- Given
- set of base patterns B, with E(FB) and E(LB), → D
- D → set of result patterns P
- How?
- Map each Pi to the best Bj
- based on Longest Common Subsequences
- of all Bj, the one maximizing the shared subsequence |B ∩ P|
102Item level
Confusion Matrix
105Evaluation Criteria: Item level
- Recoverability
- degree of pattern items found in the base patterns (weighted)
- R = Σ_B E(FB) · (max over result patterns P of |B ∩ P|) / E(LB)
- cut off so that 0 ≤ R ≤ 1
- Precision
- degree of pattern items in the result patterns
- Pattern Items / (Pattern Items + Extraneous Items)
106Evaluation Criteria: Sequence level
- Spurious patterns
- Pattern Items < Extraneous Items
- Max patterns: determine the max pattern for each Bj
- of all Pi mapped to a particular Bj,
- the Pi with the Longest Common Subsequence: max over P of |B ∩ P|
- Redundant patterns
- all other patterns
- Ntotal = Nmax + Nspur + Nredun
107Evaluation Example
- 30% (A)(BC)(DE)
- 70% (IJ)(K)(LM)
- Result patterns
- (A)(B)(DE)
- (A)(BC)(D)
- (B)(BC)(DE)
- (IJ)(LM)
- (J)(K)(LM)
- (IJ)(KX)(LM)
- (XY)(K)(Z)
- Ntotal = 7
- Spurious = 1
- Recoverability = (30%)(4/5) + (70%)(5/5) = 94%
- Redundant = 4
- Precision = 1 − 5/31 = 84%
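The item-level numbers of this example can be reproduced with a small script; the talk maps result patterns to base patterns via longest common subsequences, but plain shared-item counts give the same figures here (a sketch, not the evaluation code from the dissertation):

```python
from collections import Counter

def items(pat):
    """Multiset of items in a pattern (a list of itemsets)."""
    c = Counter()
    for itemset in pat:
        c.update(itemset)
    return c

def evaluate(bases, weights, results):
    """Map each result pattern to the base pattern it shares the most
    items with; shared items count as pattern items, the rest as
    extraneous. Recoverability weights each base pattern's best
    coverage; precision penalizes extraneous items."""
    pattern_items = extraneous = 0
    best = [0] * len(bases)
    for res in results:
        shared = [sum((items(res) & items(b)).values()) for b in bases]
        j = shared.index(max(shared))
        best[j] = max(best[j], shared[j])
        pattern_items += shared[j]
        extraneous += sum(items(res).values()) - shared[j]
    recoverability = sum(
        w * best[j] / sum(items(b).values())
        for j, (w, b) in enumerate(zip(weights, bases)))
    precision = pattern_items / (pattern_items + extraneous)
    return recoverability, precision

bases = [[{'A'}, {'B', 'C'}, {'D', 'E'}], [{'I', 'J'}, {'K'}, {'L', 'M'}]]
results = [
    [{'A'}, {'B'}, {'D', 'E'}],
    [{'A'}, {'B', 'C'}, {'D'}],
    [{'B'}, {'B', 'C'}, {'D', 'E'}],
    [{'I', 'J'}, {'L', 'M'}],
    [{'J'}, {'K'}, {'L', 'M'}],
    [{'I', 'J'}, {'K', 'X'}, {'L', 'M'}],
    [{'X', 'Y'}, {'K'}, {'Z'}],
]
r, p = evaluate(bases, [0.3, 0.7], results)
print(round(r, 2), round(p, 2))  # 0.94 0.84
```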
113Synthetic data
- Patterned data: IBM synthetic data generator
- given certain DB parameters, outputs
- sequence DB
- base patterns used to generate it, E(FB) and E(LB)
- R. Agrawal and R. Srikant, ICDE 95, EDBT 96
- Random data
- independence both within and across itemsets
- Patterned data + systematic noise
- randomly change an item with probability α
- Yang et al., SIGMOD 2002
- Patterned data + systematic outliers
- random sequences
114Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
115Results
- ApproxMAP
- Pattern consensus sequence
- no null or one-itemset patterns
- Machine: swan
- 2GHz Intel Xeon processor
- 2GB of memory
- public machine
- difficult to get consistent running time measurements
- Thanks!
116Database Parameter
117Database Parameter
118(No Transcript)
119(No Transcript)
120(weighted sequence repeated from slide 86)
ApproxMAP
121ApproxMAP
122ApproxMAP
123(No Transcript)
124
- 8 patterns returned: 7 max patterns, 1 redundant pattern, 0 spurious patterns
- Recoverability: 91.16%
- Precision: 97.17%
- Extraneous Items: 3/106
125Comparative Study
- Conventional Sequential Pattern Mining
- Support Model
- Empirical analysis
- Totally random data
- Patterned data
- Patterned data + noise
- Patterned data + outliers
126Evaluation Comparison
127Evaluation Comparison
128Evaluation Comparison
129Robustness w.r.t. noise
130Evaluation Comparison
131Understanding ApproxMAP
- 5 experiments
- k in kNN clustering
- Strength cutoff
- Order of alignment
- Optimization 1 reduced precision in prox.
matrix - Optimization 2 sample based iterative clustering
132A realistic DB
133Input parameters
134The order in multiple alignment
135Understanding ApproxMAP
- Optimization 1 (Lseq)
- running time reduced to 40%
- Optimization 2 (Nseq)
- running time reduced to 10-40%
- for a negligible reduction in recoverability
136Effects of the DB parameters: scalability
- 4 experiments
- I: number of unique items in the database (density of the database)
- 1,000 → 10,000
- Nseq: number of sequences in the data
- 10,000 → 100,000
- Lseq: avg. number of itemsets per data sequence
- 10 → 50
- Iseq: avg. number of items per itemset in the DB
- 2.5 → 10
137(No Transcript)
138Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
139Case Study: Real data
- Monthly services to children with an AN report
- 992 sequences
- 15 interpretable and useful patterns
- (RPT)(INV,FC)(FC) ..11.. (FC): 419 sequences
- (RPT)(INV,FC)(FC)(FC): 57 sequences
- (RPT)(INV,FC,T)(FC,T)(FC,HM)(FC)(FC,HM): 39 sequences
140Overview
- What is KDD (Knowledge Discovery and Data mining)?
- Problem: Sequential Pattern Mining
- Method: ApproxMAP
- Evaluation Method
- Results
- Case Study
- Conclusion
141Conclusion: why does it work well?
- Robust on random weak patterned noise
- Very good at organizing sequences
- Long sequence data that are not random have
unique signatures
142What have I done?
- defines a new model, Multiple Alignment Sequential Pattern Mining,
- describes a novel solution, ApproxMAP (for APPROXimate Multiple Alignment Pattern mining), that introduces
- a new metric for itemsets,
- weighted sequences, a new representation of alignment information,
- and the effective use of strength cutoffs to control the level of detail included in the consensus patterns,
- designs a general evaluation method to assess the quality of results from sequential pattern mining algorithms,
143What have I done?
- employs the evaluation method to run an extensive set of empirical evaluations of approxMAP on synthetic data,
- employs the evaluation method to compare the effectiveness of approxMAP to the conventional methods based on the support model,
- derives the expected support of a random sequence under the null hypothesis of no pattern in the database, to better understand the behavior of the support based methods, and
- demonstrates the usefulness of approxMAP using real world data.
144Future Work
- Sample based iterative clustering
- Memory management
- Distance metric
- Multisets
- Taxonomy tree
- Strength cutoff
- Automatic detection of customized
- Local alignment
145Thank You !
- Advisor
- Wei Wang (02-04)
- James Coggins (00-02)
- Prasun Dewan (99-00)
- Kye Hedlund (96-99)
- Jan Prins (95-96)
- SW advisor
- Dean Duncan (98-04)
- Other people
- Janet Jones
- Kim Flair
- Susan Paulsen
- Fellow students
- Priyank Porwal, Andrew Leaver-Fay, Leland Smith
- Committee
- Stephen Aylward
- Jan Prins
- Andrew Nobel
- Other faculty
- Jian Pei
- Jack Snoeyink
- J. S. Marron
- Stephen Pizer
- Stephen Weiss
- Colleagues
- Sang-Uok Kum, Jisung Kim
- Alexandra, Michelle,
- Aron, Chris
- Family
- Sohmee, My mom, dad, sister
146(No Transcript)
147Rw
- R(X,Y) = (|X| + |Y| − 2|X∩Y|) / (|X| + |Y|)
- Rw(Xw,Y) = (weight(X) + |Y|·wX − 2·weight(X∩Y)) / (weight(X) + |Y|·wX)
- R′w(Xw,Y) = (Rw(Xw,Y)·wX + (n − wX)) / n