Transcript and Presenter's Notes

Title: HyeChung Monica Kum


1
Approximate Mining of Consensus Sequential
Patterns
  • Hye-Chung (Monica) Kum
  • University of North Carolina, Chapel Hill
  • Computer Science Department
  • School of Social Work
  • http://www.cs.unc.edu/kum/approxMAP

9
Knowledge Discovery & Data mining (KDD)
  • "The nontrivial process of identifying valid,
    novel, potentially useful, and ultimately
    understandable patterns in data"
  • The goal is to discover and present knowledge in
    a form which is easily comprehensible to humans,
    in a timely manner
  • combining ideas drawn from databases, machine
    learning, artificial intelligence,
    knowledge-based systems, information retrieval,
    statistics, pattern recognition, visualization,
    and parallel and distributed computing
  • Fayyad, Piatetsky-Shapiro, & Smyth, 1996

10
What is KDD?
  • Purpose
  • extract useful information
  • Source
  • operational or administrative data
  • Examples
  • VIC card database for buying patterns
  • monthly welfare service patterns

11
Example
  • Analyze buying patterns for sales marketing

12
Example
  • VIC card: 4/8 = 50%

13
Example
  • VIC card: 5/8 = 63%

14
Overview
  • What is KDD (Knowledge Discovery & Data mining)
  • Problem: Sequential Pattern Mining
  • Method: ApproxMAP
  • Evaluation Method
  • Results
  • Case Study
  • Conclusion

17
Sequential Pattern Mining
  • Detecting patterns in sequences of sets

18
Welfare Program Participation Patterns
  • What are the common participation patterns?
  • What are the variations to them?
  • How do different policies affect these patterns?

19
Thesis Statement
  • The author of this dissertation asserts that
    multiple alignment is an effective model to
    uncover the underlying trend in sequences of
    sets.
  • I will show that approxMAP,
  • a novel method that applies multiple alignment
    techniques to sequences of sets,
  • will effectively extract the underlying trend in
    the data
  • by organizing the large database into clusters,
  • as well as giving reasonable descriptors (weighted
    sequences and consensus sequences) for the
    clusters via multiple alignment.
  • Furthermore, I will show that approxMAP
  • is robust to its input parameters,
  • is robust to noise and outliers in the data,
  • is scalable with respect to the size of the
    database,
  • and, in comparison to the conventional support
    model, can better recover the underlying pattern
    with little confounding information under most
    circumstances.
  • In addition, I will demonstrate the usefulness of
    approxMAP using real world data.

20
Thesis Statement
  • Multiple alignment is an effective model to
    uncover the underlying trend in sequences of
    sets.
  • ApproxMAP is a novel method to apply multiple
    alignment techniques to sequences of sets.
  • ApproxMAP can recover the underlying patterns
    with little confounding information under most
    circumstances including those in which the
    conventional methods fail.
  • I will demonstrate the usefulness of approxMAP
    using real world data.

21
Sequential Pattern Mining
  • Detecting patterns in sequences of sets
  • Nseq: total # of sequences in the database
  • Lseq: avg # of itemsets in a sequence
  • Iseq: avg # of items in an itemset
  • Lseq · Iseq: avg length of a sequence

26
Conventional Methods: Support Model
  • Super-sequence ⊇ Sub-sequence
  • (A,B,D)(B)(C,D)(B,C) ⊇ (A)(B)(C,D)
  • Support(P) = # of super-sequences of P in D
  • Given D and a user threshold min_sup,
  • find the complete set of P s.t. Support(P) ≥
    min_sup
  • R. Agrawal and R. Srikant, ICDE '95 & EDBT '96
  • Methods
  • Breadth first: Apriori principle (GSP)
  • R. Agrawal and R. Srikant, ICDE '95 & EDBT '96
  • Depth first: pattern growth (PrefixSpan)
  • J. Han and J. Pei, SIGKDD 2000 & ICDE 2001

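As a concrete reading of the support model, here is a minimal Python sketch (the representation of sequences as lists of sets and the names is_subseq and support are illustrative, not from the talk): a pattern is supported by a sequence iff each of its itemsets is contained, in order, in some itemset of the sequence.

```python
def is_subseq(pattern, seq):
    """True iff pattern is a sub-sequence of seq: each itemset of the
    pattern is a subset of some itemset of seq, in left-to-right order."""
    j = 0
    for itemset in seq:
        if j < len(pattern) and pattern[j] <= itemset:  # subset test
            j += 1
    return j == len(pattern)

def support(pattern, db):
    """# of sequences in db that are super-sequences of pattern."""
    return sum(is_subseq(pattern, seq) for seq in db)

# The slide's example: (A,B,D)(B)(C,D)(B,C) is a super-sequence of (A)(B)(C,D)
seq = [{'A', 'B', 'D'}, {'B'}, {'C', 'D'}, {'B', 'C'}]
assert is_subseq([{'A'}, {'B'}, {'C', 'D'}], seq)
```
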
27
Example: Support Model
  • (Dp,Br)(Mk,Dp)(Mk,Dp,Br) : 2/3 = 67%
  • 2^L - 1 = 2^7 - 1 = 128 - 1 = 127 subsequences
  • (Dp,Br)(Mk,Dp)(Mk,Br)
  • (Dp,Br)(Mk,Dp)(Mk,Dp)
  • (Mk,Dp)(Mk,Dp,Br)
  • (Dp,Br)(Mk,Dp,Br)
  • etc.
  • (Br)(Mk,Dp)(Mk,Dp,Br)
  • (Dp)(Mk,Dp)(Mk,Dp,Br)
  • (Dp,Br)(Dp)(Mk,Dp,Br)
  • (Dp,Br)(Mk)(Mk,Dp,Br)
  • (Dp,Br)(Mk,Dp)(Dp,Br)

28
Inherent Problems: The Support Model
  • Support
  • cannot distinguish between statistically
    significant patterns and random occurrences
  • Theoretically
  • short random sequences occur often in long
    sequential data simply by chance
  • Empirically
  • # of spurious patterns grows exponentially w.r.t.
    Lseq

29
Inherent Problems: Exact Match
  • A pattern gets support only if
  • the pattern is exactly contained in the sequence
  • Often may not find general long patterns
  • Example
  • many customers may share similar buying habits
  • few of them follow exactly the same pattern

30
Inherent Problems: Complete Set
  • Mines the complete set
  • too many trivial patterns
  • Given long sequences with noise
  • too expensive and too many patterns
  • 2^L - 1 = 2^10 - 1 = 1023
  • Finding max / closed sequential patterns
  • is non-trivial
  • in a noisy environment, still too many max/closed
    patterns

31
Possible Models
  • Support model
  • Patterns in sets
  • unordered list
  • Multiple alignment model
  • Find common patterns among strings
  • Simple ordered list of characters

32
Multiple Alignment
  • Line up the sequences to detect the trend
  • Find common patterns among strings
  • DNA / bio sequences

35
Edit Distance
  • Pairwise score (edit distance): dist(seq1, seq2)
  • minimum # of ops required to change seq1 to seq2
  • ops: INDEL(a) and/or REPLACE(a,b)
  • Recurrence relation
  • Multiple alignment score
  • = Σ PS(seqi, seqj) over all pairs (1 ≤ i < j ≤ N)
  • Optimal alignment: minimum score

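The recurrence relation itself appeared as a figure on the slide; the sketch below is a hedged dynamic-programming reconstruction of dist(seq1, seq2), with INDEL cost 1 and the REPLACE cost left pluggable (the normalized set difference used as the default here is the cost the talk introduces later, on slide 58):

```python
def set_diff(X, Y):
    """Normalized set difference, the talk's REPLACE cost for itemsets."""
    return (len(X - Y) + len(Y - X)) / (len(X) + len(Y))

def dist(seq1, seq2, repl=set_diff):
    """Edit distance between two sequences of itemsets, INDEL cost 1."""
    m, n = len(seq1), len(seq2)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = float(i)                  # delete all of seq1's itemsets
    for j in range(1, n + 1):
        D[0][j] = float(j)                  # insert all of seq2's itemsets
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + repl(seq1[i - 1], seq2[j - 1]),  # REPLACE
                D[i - 1][j] + 1,                                   # INDEL
                D[i][j - 1] + 1)                                   # INDEL
    return D[m][n]
```
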
44
Consensus Sequence
  • Weighted Sequence
  • compression of aligned sequences into one
    sequence
  • strength(i, j) = # of occurrences of item i in
    position j / total # of sequences
  • A: 3/3 = 100%
  • E: 1/3 = 33%
  • H: 1/3 = 33%
  • Consensus itemset(j)
  • = { ia | ia ∈ (I ∪ {()}) ∧ strength(ia, j) ≥
    min_strength }
  • Consensus sequence
  • concatenation of the consensus itemsets

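A minimal sketch of this compression step, assuming the multiple alignment is already available as equal-length rows in which None marks a gap (the representation and names are illustrative):

```python
from collections import Counter

def consensus(aligned, min_strength):
    """Compress aligned sequences into a consensus sequence: per position,
    keep the items whose strength (share of all sequences) >= min_strength."""
    n = len(aligned)                       # total # of sequences
    result = []
    for j in range(len(aligned[0])):       # every row has the same length
        counts = Counter()
        for row in aligned:
            if row[j] is not None:         # None = gap at this position
                counts.update(row[j])
        itemset = {i for i, c in counts.items() if c / n >= min_strength}
        if itemset:                        # drop null consensus itemsets
            result.append(itemset)
    return result

# Toy column: A occurs 3/3 = 100%, E and H 1/3 = 33% each
rows = [[{'A', 'E'}], [{'A'}], [{'A', 'H'}]]
assert consensus(rows, 0.5) == [{'A'}]
```
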
45
Multiple Alignment Sequential Pattern Mining
  • Given
  • N sequences of sets,
  • op costs (INDEL & REPLACE) for itemsets, and
  • strength thresholds for consensus sequences
  • To
  • (1) partition the N sequences into K sets of
    sequences such that the sum of the K multiple
    alignment scores is minimum,
  • (2) find the multiple alignment for each
    partition, and
  • (3) find the pattern consensus sequence and the
    variation consensus sequence for each partition

46
Overview
  • What is KDD (Knowledge Discovery & Data mining)
  • Problem: Sequential Pattern Mining
  • Method: ApproxMAP
  • Evaluation Method
  • Results
  • Case Study
  • Conclusion

47
ApproxMAP (Approximate Multiple Alignment
Pattern mining)
  • Exact solution: too expensive!
  • Approximation Method: ApproxMAP
  • Organize into K partitions
  • Use clustering
  • Compress each partition into
  • weighted sequences
  • Summarize each partition into
  • Pattern consensus sequence
  • Variation consensus sequence

48
Tasks
  • Op costs (INDEL & REPLACE) for itemsets
  • Organize into K partitions
  • Use clustering
  • Compress each partition into
  • weighted sequences
  • Summarize each partition into
  • Pattern consensus sequence
  • Variation consensus sequence

58
Op costs for itemsets
  • Normalized set difference
  • R(X,Y) = ( |X - Y| + |Y - X| ) / ( |X| + |Y| )
  • 0 ≤ R ≤ 1; a metric
  • INDEL(X) = R(X, ∅) = 1
  • Jaccard coefficient
  • 1 - |X ∩ Y| / |X ∪ Y|
  • = 1 - |X ∩ Y| / ( |X - Y| + |Y - X| + |X ∩ Y| )
  • Sørensen coefficient: simple index
  • gives greater "weight" to common elements
  • 1 - 2|X ∩ Y| / ( |X - Y| + |Y - X| + 2|X ∩ Y| )
  • = ( |X| + |Y| - 2|X ∩ Y| ) / ( |X| + |Y| ) = R(X,Y)

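The normalized set difference, and the identity with the Sørensen form shown on the last line above, can be checked directly; a small sketch (treating two empty itemsets as distance 0 is an assumed convention):

```python
def R(X, Y):
    """Normalized set difference: (|X-Y| + |Y-X|) / (|X| + |Y|)."""
    if not X and not Y:
        return 0.0                # assumed convention for two empty itemsets
    return (len(X - Y) + len(Y - X)) / (len(X) + len(Y))

X, Y = {'A', 'B', 'D'}, {'A', 'C'}
assert R(X, Y) == (2 + 1) / (3 + 2)                 # = 0.6
assert R(X, set()) == 1.0                           # INDEL(X) = R(X, {}) = 1
# Sørensen form gives the same value: 1 - 2|X ∩ Y| / (|X| + |Y|)
assert abs(R(X, Y) - (1 - 2 * len(X & Y) / (len(X) + len(Y)))) < 1e-12
```
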
59
Tasks
  • Op costs (INDEL & REPLACE) for itemsets
  • Organize into K partitions
  • Use clustering
  • Compress each partition into
  • weighted sequences
  • Summarize each partition into
  • Pattern consensus sequence
  • Variation consensus sequence

60
Organize: Partition into K Sets
  • Goal
  • to minimize the sum of the K multiple alignment
    scores
  • group similar sequences
  • Approximate
  • calculate the N×N proximity matrix
  • pairwise score: edit distance
  • any clustering that works best for your data

61
Organize: Clustering
  • Desirable Properties
  • Form groups of arbitrary shape and size
  • Can estimate the number of clusters from the data

62
Density Based Clustering
  • k-nearest neighbor
  • partition at the valleys of the density estimate
  • Density of a sequence = n / (|D| · d) ∝ n / d
  • n, d based on the user-defined k-nearest-neighbor
    space
  • n = # of neighbors
  • d = size of the neighbor region
  • Parameter k: neighbor space
  • can cluster at different resolutions as desired
  • General: uniform-kernel k-NN clustering
  • Efficient: O(kN)

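A sketch of the density estimate, assuming dist is the precomputed N×N proximity (edit-distance) matrix and reading the slide's formula as density = n / (|D| · d), where d is the distance to the k-th nearest neighbor and n is the number of sequences within d:

```python
def knn_density(dist, k):
    """Uniform-kernel k-NN density for each of the N sequences."""
    N = len(dist)
    densities = []
    for i in range(N):
        d_others = sorted(dist[i][j] for j in range(N) if j != i)
        d = d_others[k - 1]                       # size of the neighbor region
        n = sum(1 for x in d_others if x <= d)    # # of neighbors within d
        densities.append(n / (N * d) if d > 0 else float('inf'))
    return densities
```
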
63
Tasks
  • Op costs (INDEL & REPLACE) for itemsets
  • Organize into K partitions
  • Use clustering
  • Compress each partition into
  • weighted sequences
  • Summarize each partition into
  • Pattern consensus sequence
  • Variation consensus sequence

64
Data Compression: Multiple Alignment
  • Optimal multiple alignment: too expensive!
  • Greedy approximation
  • incrementally align
  • in density-descending order
  • Pairwise alignment
  • sequence to weighted sequence

67
Multiple Alignment
75
Op Cost for Itemset to weighted itemset
  • Replace ((A3,E1)3 4 , (AG) ) ?

76
Op Cost for Itemset to weighted itemset
  • Replace ((A3,E1)3 4 , (AG) ) ? 65/120

77
Op Cost for Itemset to weighted itemset
78
Op Cost for Itemset to weighted itemset
  • 1 ( n wX )

79
Op Cost for Itemset to weighted itemset
  • Rw(Xw,Y) weight(X)YwX 2weight(X?Y)
  • weight(X) YwX

(XY-2X?Y) (XY)
80
Op Cost for Itemset to weighted itemset
  • Rw(Xw,Y) weight(X)YwX 2weight(X?Y)
  • weight(X) YwX
  • Rw(Xw,Y) wX

81
Op Cost for Itemset to weighted itemset Rw
  • 1 ( n wX )
  • Rw(Xw,Y) wX
  • Replace ((A3,E1)3 4 , (AG) )
  • Rw(Xw, Y) Rw(Xw,Y) wX n wX / n

82
Op Cost for Itemset to weighted itemset Rw
  • 1 ( n wX )
  • Rw(Xw,Y) wX
  • Replace ((A3,E1)3 4 , (AG) )
  • Rw(Xw, Y) Rw(Xw,Y) wX n wX / n

83
Op Cost for Itemset to weighted itemsetRw
  • Op cost
  • Rw(Xw,Y) weight(X)YwX 2weight(X?Y)
  • weight(X) YwX
  • Rw(Xw, Y) Rw wX n wX / n
  • 0 Rw 1 , metric
  • INDEL(Xw) Rw (Xw,?) INDEL(Y) Rw (?, Y) 1

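Combining slide 64's greedy, density-descending scheme with the weighted replace cost reconstructed above: the sketch below aligns one more sequence into a weighted sequence and merges it. The (item-counts, itemset-weight) representation, the helper names, and the tie-breaking in the backtrace are assumptions, not the talk's exact procedure.

```python
from collections import Counter

def repl_w(pos, Y, n):
    """Adjusted replace cost between weighted position pos = (counts, v)
    and itemset Y; the n - v sequences absent here each pay INDEL = 1."""
    counts, v = pos
    wx = sum(counts.values())                          # weight(X)
    shared = sum(counts[i] for i in Y if i in counts)  # weight(X ∩ Y)
    rw = (wx + len(Y) * v - 2 * shared) / (wx + len(Y) * v)
    return (rw * v + (n - v)) / n

def align_into(wseq, n, seq):
    """Align seq (a list of frozensets) against wseq and merge it in."""
    m, l = len(wseq), len(seq)
    D = [[0.0] * (l + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = float(i)
    for j in range(1, l + 1):
        D[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, l + 1):
            D[i][j] = min(D[i - 1][j - 1] + repl_w(wseq[i - 1], seq[j - 1], n),
                          D[i - 1][j] + 1,             # gap in seq
                          D[i][j - 1] + 1)             # gap in wseq
    merged, i, j = [], m, l                            # backtrace, right to left
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                D[i][j] == D[i - 1][j - 1] + repl_w(wseq[i - 1], seq[j - 1], n)):
            counts, v = wseq[i - 1]
            merged.append((counts + Counter(seq[j - 1]), v + 1))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            merged.append(wseq[i - 1])                 # seq has a gap here
            i -= 1
        else:
            merged.append((Counter(seq[j - 1]), 1))    # new position, weight 1
            j -= 1
    return merged[::-1], n + 1

def greedy_alignment(seqs, density):
    """Fold all sequences into one weighted sequence, densest first."""
    order = sorted(range(len(seqs)), key=lambda i: -density[i])
    wseq, n = [(Counter(X), 1) for X in seqs[order[0]]], 1
    for idx in order[1:]:
        wseq, n = align_into(wseq, n, seqs[idx])
    return wseq, n
```
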
84
Tasks
  • Op costs (INDEL & REPLACE) for itemsets
  • Organize into K partitions
  • Use clustering
  • Compress each partition into
  • weighted sequences
  • Summarize each partition into
  • Pattern consensus sequence
  • Variation consensus sequence

85
Summarize: Generate and Present Results
  • N sequences → K weighted sequences
  • Weighted sequence: a compression of all the
    sequences, but huge

86
< (E1, L1, R1, T1, V1, d1) (A1, B9, C8,
D8, E12, F1, L4, P1, S1, T8, V5, X1,
a1, d10, e2, f1, g1, p1) (B99, C96, D91,
E24, F2, G1, L15, P7, R2, S8, T95, V15,
X2, Y1, a2, d26, e3, g6, l1, m1) (A5,
B16, C5, D3, E13, F1, H2, L7, P1, R2,
S7, T6, V7, Y3, d3, g1) (A13, B126, C27,
D1, E32, G5, H3, J1, L1, R1, S32, T21,
V1, W3, X2, Y8, d13, e1, f8, i2, p7,
l3, g1) (A12, B6, C28, D1, E28, G5, H2,
J6, L2, S137, T10, V2, W6, X8, Y124, a1,
d6, g2, i1, l1, m2) (A135, B2, C23, E36,
G12, H124, K1, L4, O2, R2, S27, T6, V6,
W10, X3, Y8, Z2, a1, d6, g1, h2, j1,
k5, l3, m7, n1) (A11, B1, C5, E12, G3,
H10, L7, O4, S5, T1, V7, W3, X2, Y3,
a1, m2) (A31, C15, E10, G15, H25, K1,
L7, M1, O1, R4, S12, T10, V6, W3, Y3,
Z3, d7, h3, j2, l1, n1, p1, q1) (A3,
C5, E4, G7, H1, K1, R1, T1, W2, Z2, a1,
d1, h1, n1) (A20, C27, E13, G35, H7, K7,
L111, N2, O1, Q3, R11, S10, T20, V111,
W2, X2, Y3, Z8, a1, b1, d13, h9, j1,
n1, o2) (A17, B2, C14, E17, F1, G31, H8,
K13, L2, M2, N1, R22, S2, T140, U1, V2,
W2, X1, Z13, a1, b8, d6, h14, n6, p1,
q1) (A12, B7, C5, E13, G16, H5, K106,
L8, N2, O1, R32, S3, T29, V9, X2, Z9,
b16, c5, d5, h7, l1) (A7, B1, C9, E5,
G7, H3, K7, R8, S1, T10, X1, Z3, a2,
b3, c1, d5, h3) (A1, B1, H1, R1, T1,
b2, c1) (A3, B2, C2, E6, F2, G4, H2,
K20, M2, N3, R19, S3, T11, U2, X4, Z34,
a3, b11, c2, d4) (H1, Y1, a1, d1) > 162
87
Presentation Model
  • Frequent items
  • definite pattern items
  • cutoff ≥ 50%
  • Common items
  • uncertain
  • cutoff ≥ 20%
  • Rare items
  • noise items

(Color scale: w = 100% ... ≥ 50% ... ≥ 20% ... w = 0%)
88
Visualization
  • Pattern Consensus Sequence
  • cutoff: minimum cluster strength (≥ 50%)
  • frequent items
  • Variation Consensus Sequence
  • cutoff: minimum cluster strength (≥ 20%)
  • frequent items + common items

89
Example: Given 10 seqs, lexically sorted
91
Color scheme: < 100% | 85% | 70% | 50% | 35% | 20% >
97
Example: Support Model (202 seqs)
(A) (B,C) (D,E) (I,J) (K) (L,M)
98
ApproxMAP (Approximate Multiple Alignment
Pattern mining)
  • Approximation Method: ApproxMAP
  • Organize into K partitions: O(Nseq² Lseq² Iseq)
  • proximity matrix: O(Nseq² Lseq² Iseq)
  • clustering: O(k Nseq)
  • Compress each partition: O(n L²)
  • weighted sequences: O(n L²)
  • Summarize each partition: O(1)
  • pattern consensus sequence
  • variation consensus sequence
  • Time complexity: O(Nseq² Lseq² Iseq)
  • 2 optimizations

99
Overview
  • What is KDD (Knowledge Discovery & Data mining)
  • Problem: Sequential Pattern Mining
  • Method: ApproxMAP
  • Evaluation Method
  • Results
  • Case Study
  • Conclusion

100
Evaluation
  • Up to now: only performance / scalability
  • Quality?
  • What kind of patterns will the model generate?
  • Evaluate correctness of the model
  • Why?
  • Basis for comparison of different models
  • Essential in understanding results of approximate
    solutions

101
Evaluation Method
  • Given
  • set of base patterns B, E(F_B), E(L_B) → D
  • D → set of result patterns P
  • How?
  • Map each Pi to the best Bj
  • based on the longest common subsequence:
  • of all Bj, the one with max |LCS(Bj, Pi)|

102
Item Level: Confusion Matrix
105
Evaluation Criteria: Item Level
  • Recoverability
  • degree of pattern items found in the base
    patterns (weighted)
  • = Σ_B E(F_B) · ( max over result patterns P of
    |LCS(B, P)| / E(L_B) )
  • cutoff so that 0 ≤ R ≤ 1
  • Precision
  • degree of pattern items in the result patterns
  • = Pattern Items / (Pattern Items + Extraneous Items)

106
Evaluation Criteria: Sequence Level
  • Spurious patterns
  • Pattern Items < Extraneous Items
  • Determine the max pattern for each Bj
  • of all the Pi mapped to a particular Bj,
  • the Pi with the longest common subsequence:
  • max over result patterns P of |LCS(B, P)|
  • Redundant patterns
  • all other patterns
  • Ntotal = Nmax + Nspur + Nredun

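A sketch of these criteria, reading the item-level LCS as an order-preserving itemset alignment that maximizes shared items, and standing in each base pattern's observed weight and item count for E(F_B) and E(L_B) (names are illustrative). On the worked example that follows, it yields recoverability 0.94 and precision 26/31 ≈ 84%.

```python
def shared(P, B):
    """Max # of items P and B share under an order-preserving alignment."""
    m, l = len(P), len(B)
    D = [[0] * (l + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, l + 1):
            D[i][j] = max(D[i - 1][j], D[i][j - 1],
                          D[i - 1][j - 1] + len(P[i - 1] & B[j - 1]))
    return D[m][l]

def evaluate(results, bases, weights):
    """Recoverability and precision; weights ~ E(F_B), item counts ~ E(L_B)."""
    best = {}                                # best shared-item count per base
    pat_items = total_items = 0
    for P in results:
        scores = [shared(P, B) for B in bases]
        b = max(range(len(bases)), key=lambda k: scores[k])  # map P -> best B
        pat_items += scores[b]
        total_items += sum(len(X) for X in P)
        best[b] = max(best.get(b, 0), scores[b])
    recoverability = sum(w * best.get(b, 0) / sum(len(X) for X in B)
                         for b, (B, w) in enumerate(zip(bases, weights)))
    precision = pat_items / total_items      # = 1 - extraneous / total
    return recoverability, precision
```
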
112
Evaluation Example
  • 30% (A)(BC)(DE)
  • 70% (IJ)(K)(LM)
  • Ntotal = 7
  • Spurious = 1
  • Recoverability = (30%)(4/5) + (70%)(5/5) = 94%
  • Redundant = 4
  • Precision = 1 - 5/31 = 84%
  • Result patterns:
  • (A)(B)(DE)
  • (A)(BC)(D)
  • (B)(BC)(DE)
  • (IJ)(LM)
  • (J)(K)(LM)
  • (IJ)(KX)(LM)
  • (XY)(K)(Z)

113
Synthetic Data
  • Patterned data: IBM synthetic data generator
  • given certain DB parameters, outputs
  • a sequence DB
  • the base patterns used to generate it, with
    E(F_B) and E(L_B)
  • R. Agrawal and R. Srikant, ICDE '95 & EDBT '96
  • Random data
  • independence both within and across itemsets
  • Patterned data + systematic noise
  • randomly change an item with probability alpha
  • Yang, SIGMOD 2002
  • Patterned data + systematic outliers
  • random sequences

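A sketch of the noise step, assuming each item is independently corrupted with probability alpha by substituting an item drawn uniformly from the alphabet:

```python
import random

def add_noise(db, alpha, alphabet):
    """Return a copy of db where each item is, with probability alpha,
    replaced by an item drawn uniformly at random from the alphabet."""
    return [[{random.choice(alphabet) if random.random() < alpha else item
              for item in itemset}
             for itemset in seq]
            for seq in db]
```
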
114
Overview
  • What is KDD (Knowledge Discovery & Data mining)
  • Problem: Sequential Pattern Mining
  • Method: ApproxMAP
  • Evaluation Method
  • Results
  • Case Study
  • Conclusion

115
Results
  • ApproxMAP
  • pattern consensus sequences
  • no null or one-itemset sequences
  • Machine: swan
  • 2GHz Intel Xeon processor
  • 2GB of memory
  • public machine
  • difficult to get consistent running-time
    measurements
  • Thanks!

116
Database Parameter
120
ApproxMAP
124
8 patterns returned: 7 max patterns, 1 redundant
pattern, 0 spurious patterns. Recoverability =
91.16%, Precision = 97.17%, Extraneous items = 3/106
125
Comparative Study
  • Conventional sequential pattern mining
  • support model
  • Empirical analysis
  • totally random data
  • patterned data
  • patterned data + noise
  • patterned data + outliers

126
Evaluation Comparison
129
Robustness w.r.t. noise
130
Evaluation Comparison
131
Understanding ApproxMAP
  • 5 experiments
  • k in kNN clustering
  • Strength cutoff
  • Order of alignment
  • Optimization 1: reduced precision in the
    proximity matrix
  • Optimization 2: sample-based iterative clustering

132
A realistic DB
133
Input parameters
134
The order in multiple alignment
135
Understanding ApproxMAP
  • Optimization 1 (Lseq)
  • running time reduced to 40%
  • Optimization 2 (Nseq)
  • running time reduced to 10-40%
  • for negligible reduction in recoverability

136
Effects of the DB Parameters: Scalability
  • 4 experiments
  • |I|: # of unique items in the database
  • density of the database
  • 1,000 to 10,000
  • Nseq: # of sequences in the data
  • 10,000 to 100,000
  • Lseq: avg # of itemsets per data seq
  • 10 to 50
  • Iseq: avg # of items per itemset in the DB
  • 2.5 to 10

138
Overview
  • What is KDD (Knowledge Discovery & Data mining)
  • Problem: Sequential Pattern Mining
  • Method: ApproxMAP
  • Evaluation Method
  • Results
  • Case Study
  • Conclusion

139
Case Study: Real Data
  • Monthly services to children with an AN report
  • 992 sequences
  • 15 interpretable and useful patterns
  • (RPT)(INV,FC)(FC) ..11.. (FC)
  • 419 sequences
  • (RPT)(INV,FC)(FC)(FC)
  • 57 sequences
  • (RPT)(INV,FC,T)(FC,T)(FC,HM)(FC)(FC,HM)
  • 39 sequences

140
Overview
  • What is KDD (Knowledge Discovery & Data mining)
  • Problem: Sequential Pattern Mining
  • Method: ApproxMAP
  • Evaluation Method
  • Results
  • Case Study
  • Conclusion

141
Conclusion: Why does it work well?
  • Robust on random data, weakly patterned data,
    and noise
  • Very good at organizing sequences
  • Long sequence data that are not random have
    unique signatures

142
What have I done?
  • defines a new model, Multiple Alignment
    Sequential Pattern Mining,
  • describes a novel solution ApproxMAP (for
    APPROXimate Multiple Alignment Pattern mining)
  • that introduces a new metric for itemsets
  • weighted sequences, a new representation of
    alignment information,
  • and the effective use of strength cutoffs to
    control the level of detail included in the
    consensus patterns
  • designs a general evaluation method to assess the
    quality of results from sequential pattern mining
    algorithms,

143
What have I done?
  • employs the evaluation method to run an extensive
    set of empirical evaluations of approxMAP on
    synthetic data,
  • employs the evaluation method to compare the
    effectiveness of approxMAP to the conventional
    methods based on the support model,
  • derives the expected support of a random
    sequence under the null hypothesis of no pattern
    in the database, to better understand the behavior
    of the support-based methods, and
  • demonstrates the usefulness of approxMAP using
    real world data.

144
Future Work
  • Sample-based iterative clustering
  • Memory management
  • Distance metric
  • Multisets
  • Taxonomy tree
  • Strength cutoff
  • automatic detection of customized cutoffs
  • Local alignment

145
Thank You!
  • Advisor
  • Wei Wang (02-04)
  • James Coggins (00-02)
  • Prasun Dewan (99-00)
  • Kye Hedlund (96-99)
  • Jan Prins (95-96)
  • SW advisor
  • Dean Duncan (98-04)
  • Other people
  • Janet Jones
  • Kim Flair
  • Susan Paulsen
  • Fellow students
  • Priyank Porwal, Andrew Leaver-Fay, Leland Smith
  • Committee
  • Stephen Aylward
  • Jan Prins
  • Andrew Nobel
  • Other faculty
  • Jian Pei
  • Jack Snoeyink
  • J. S. Marron
  • Stephen Pizer
  • Stephen Weiss
  • Colleagues
  • Sang-Uok Kum, Jisung Kim
  • Alexandra, Michelle,
  • Aron, Chris
  • Family
  • Sohmee, My mom, dad, sister

147
Rw
( |X| + |Y| - 2|X∩Y| ) / ( |X| + |Y| ) = R(X,Y)
  • Rw*(Xw, Y) = ( Rw(Xw,Y)·wX + (n - wX) ) / n
  • Rw(Xw,Y) = ( weight(X) + |Y|·wX - 2·weight(X∩Y) )
    / ( weight(X) + |Y|·wX )