Title: Data Mining (Toon Calders)
1 Data Mining
Toon Calders
2 Why Data Mining?
- Explosive growth of data: from terabytes to petabytes
- Data collection and data availability
- Major sources of abundant data
3 Why Data Mining?
- We are drowning in data, but starving for knowledge!
- Necessity is the mother of invention: data mining, the automated analysis of massive data sets
[Figure "The Data Gap": total new disk capacity (TB) since 1995 vs. the number of analysts]
4 What Is Data Mining?
- Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
- Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
5 Current Applications
- Data analysis and decision support
- Market analysis and management
- Risk analysis and management
- Fraud detection and detection of unusual patterns (outliers)
- Other applications
- Text mining (newsgroups, email, documents) and Web mining
- Stream data mining
- Bioinformatics and bio-data analysis
6 Ex. 1: Market Analysis and Management
- Data from credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
- Find groups of customers who share the same characteristics
- Determine customer purchasing patterns over time
- Find associations/correlations between product sales; predict based on the associations
- Customer requirement analysis
- Identify the best products for different customers
- Predict what factors will attract new customers
- Provision of summary information
7 Ex. 2: Fraud Detection and Unusual Patterns
- Auto insurance: ring of collisions
- Money laundering: suspicious monetary transactions
- Medical insurance
- Professional patients, rings of doctors, and rings of references
- Unnecessary or correlated screening tests
- Telecommunications: phone-call fraud
- Phone-call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
- Tax fraud
- The Belgian government successfully applies data mining to find fraud
8 Ex. 3: Process Mining
- Process mining can be used for
- Process discovery (What is the process?)
- Delta analysis (Are we doing what was specified?)
- Performance analysis (How can we improve?)
9 Ex. 3: Process Mining
Event log (case, task): (1, A), (2, A), (3, A), (3, B), (1, B), (1, C), (2, C), (4, A), (2, B), (2, D), (5, E), (4, C), (1, D), (3, C), (3, D), (4, B), (5, F), (4, D)
10 Knowledge Discovery (KDD) Process
- Data mining: the core of the knowledge discovery process
[Figure: Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]
11 Data Mining Tasks
- Previous lectures
- Classification (predictive)
- Clustering (descriptive)
- This lecture
- Association Rule Discovery (descriptive)
- Sequential Pattern Discovery (descriptive)
- Other techniques
- Regression (predictive)
- Deviation Detection (predictive)
12 Outline of today's lecture
- Association Rule Mining
- Frequent itemsets and association rules
- Algorithms: Apriori and Eclat
- Sequential Pattern Mining
- Mining frequent episodes
- Algorithms: WinEpi and MinEpi
- Other types of patterns
- strings, graphs, ...
- process mining
13 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
14 Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-basket transactions [table of example transactions]
Examples of association rules:
{Diaper} -> {Beer}, {Milk, Bread} -> {Eggs, Coke}, {Beer, Bread} -> {Milk}
Implication means co-occurrence, not causality!
15 Association Rule Discovery: Application
- Marketing and Sales Promotion
- Let the rule discovered be {Bagels, ...} -> {Potato Chips}
- Potato Chips as consequent => can be used to determine what should be done to boost its sales
- Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels
- Bagels in the antecedent and Potato Chips in the consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!
16 Definition: Frequent Itemset
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s)
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold
17 Definition: Association Rule
- Association Rule
- An implication expression of the form X -> Y, where X and Y are itemsets
- Example: {Milk, Diaper} -> {Beer}
- Rule Evaluation Metrics
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in transactions that contain X
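To make the two metrics concrete, here is a minimal Python sketch. The market-basket table from the slides did not survive conversion, so the five transactions below are a reconstruction chosen to match the quoted numbers (σ({Milk, Bread, Diaper}) = 2, s = 2/5, and the rule confidences on slide 19).

```python
# Reconstructed transaction database (an assumption consistent with the slides)
DB = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(lhs | rhs, db) / support_count(lhs, db)

print(support_count({"Milk", "Bread", "Diaper"}, DB))  # 2
print(support({"Milk", "Bread", "Diaper"}, DB))        # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, DB))    # 0.666...
```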
18 Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
- support >= minsup threshold
- confidence >= minconf threshold
- Brute-force approach
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- => Computationally prohibitive!
19 Mining Association Rules
Example rules:
{Milk, Diaper} -> {Beer} (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk} (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer} (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer} (s=0.4, c=0.5)
- Observations
- All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
20 Mining Association Rules
- Two-step approach
- Frequent Itemset Generation
- Generate all itemsets whose support >= minsup
- Rule Generation
- Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
21 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
22 Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
23 Frequent Itemset Generation
- Brute-force approach
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity O(NMw) => expensive, since M = 2^d !!!
24 Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M
- Reduce the number of transactions (N)
- Reduce the size of N as the size of the itemset increases
- Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
- Use efficient data structures to store the candidates or transactions
- No need to match every candidate against every transaction
25 Reducing the Number of Candidates
- Apriori principle
- If an itemset is frequent, then all of its subsets must also be frequent
- The Apriori principle holds due to the following property of the support measure:
- The support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support
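In contrapositive form, the principle licenses pruning: once any subset is known to be infrequent, every superset can be discarded without counting. A small Python sketch of that check (the item names are just the running example):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_minus_1):
    """Apriori pruning: discard a k-candidate as soon as one of its
    (k-1)-subsets is missing from the frequent (k-1)-itemsets."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_k_minus_1
               for sub in combinations(candidate, k - 1))

F2 = {frozenset(p) for p in [("Milk", "Diaper"), ("Milk", "Beer"),
                             ("Diaper", "Beer")]}
# {Milk, Diaper, Beer} survives because all three 2-subsets are frequent
print(has_infrequent_subset(("Milk", "Diaper", "Beer"), F2))  # False
```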
26 Illustrating the Apriori Principle
27 Illustrating the Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets): no need to generate candidates involving Coke or Eggs
Minimum support = 3
Triplets (3-itemsets)
If every subset is considered: 6C1 + 6C2 + 6C3 = 41. With support-based pruning: 6 + 6 + 1 = 13
28 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
29 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidates (1-itemsets), before the scan: A: 0, B: 0, C: 0, D: 0
30 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 1: A: 0, B: 1, C: 1, D: 0
31 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 2: A: 0, B: 2, C: 2, D: 0
32 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 3: A: 1, B: 2, C: 3, D: 1
33 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 4: A: 2, B: 3, C: 4, D: 2
34 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 5: A: 2, B: 4, C: 4, D: 3
35 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
New candidates (2-itemsets): AB, AC, AD, BC, BD, CD
36 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
2-itemset supports: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2
37 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
2-itemset supports: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2
New candidates (3-itemsets): ACD, BCD (AB is infrequent, so no candidate contains both A and B)
38 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
2-itemset supports: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2
3-itemset supports: ACD: 2, BCD: 1
39 Apriori Algorithm
- Apriori Algorithm
- k := 1
- C1 := { {A} | A is an item }
- Repeat until Ck = ∅:
- Count the support of each candidate in Ck in one scan over the DB
- Fk := { I ∈ Ck | I is frequent }
- Generate new candidates: Ck+1 := { I : |I| = k+1 and all J ⊂ I with |J| = k are in Fk }
- k := k+1
- Return F1 ∪ ... ∪ Fk-1
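A runnable Python version of this pseudocode (a sketch: absolute support threshold, itemsets as frozensets), applied to the five-transaction DB of the preceding slides:

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise frequent itemset mining, following the pseudocode above.
    db: list of transactions (sets); minsup: absolute support threshold."""
    items = {i for t in db for i in t}
    ck = {frozenset([i]) for i in items}              # C1
    frequent, k = {}, 1
    while ck:
        # one scan over the DB counts all candidates of size k
        counts = {c: sum(1 for t in db if c <= t) for c in ck}
        fk = {c for c, n in counts.items() if n >= minsup}
        frequent.update((c, counts[c]) for c in fk)
        # C_{k+1}: size-(k+1) unions all of whose k-subsets are in Fk
        ck = {a | b for a in fk for b in fk if len(a | b) == k + 1
              and all(frozenset(s) in fk for s in combinations(a | b, k))}
        k += 1
    return frequent

DB = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
for itemset, n in sorted(apriori(DB, 2).items(),
                         key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(itemset)), n)
```

On this DB it reports exactly the frequent sets found in the walkthrough: A, B, C, D, AC, AD, BC, BD, CD, and ACD (AB and BCD fall below minsup = 2).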
40 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
41 Depth-first strategy
- Recursive procedure
- FSET(DB) = frequent sets in DB
- Based on divide-and-conquer
- Count the frequency of all items
- Let D be a frequent item
- FSET(DB) =
- frequent sets with item D
- + frequent sets without item D
42 Depth-first strategy
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
- Frequent items: A, B, C, D
- Frequent sets with D
- Remove transactions without D, and D itself, from the DB
- Count frequent sets: A, B, C, AC
- Append D: AD, BD, CD, ACD
- Frequent sets without D
- Remove D from all transactions in the DB
- Find frequent sets: AC, BC
43 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
44 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Item supports: A: 2, B: 4, C: 4, D: 3
45 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Item supports: A: 2, B: 4, C: 4, D: 3
DB|D (transactions containing D, with D removed): 3 {A, C}; 4 {A, B, C}; 5 {B}
Supports in DB|D: A: 2, B: 2, C: 2
46 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
DB|CD (transactions of DB|D containing C, with C removed): 3 {A}; 4 {A, B}
Supports in DB|CD: A: 2
47 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
DB|CD: 3 {A}; 4 {A, B}; supports A: 2
Result in DB|D: AC: 2 (the frequent set {A} of DB|CD, with C appended)
48 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
Found in DB|D so far: AC: 2
49 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
Found in DB|D so far: AC: 2
DB|BD (transactions of DB|D containing B, with B removed): 4 {A}; support A: 1, infrequent
50 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
Found in DB|D so far: AC: 2
51 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
Results of the D-branch: AD: 2, BD: 2, CD: 2, ACD: 2
52 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
D removed from the DB; frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
53 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
DB|C (transactions containing C, with C and the processed D removed): 1 {B}; 2 {B}; 3 {A}; 4 {A, B}
Supports in DB|C: A: 2, B: 3
54 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
DB|C: 1 {B}; 2 {B}; 3 {A}; 4 {A, B}; supports A: 2, B: 3
DB|BC (transactions of DB|C containing B, with B removed): 4 {A}; support A: 1, infrequent
55 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
DB|C: 1 {B}; 2 {B}; 3 {A}; 4 {A, B}; supports A: 2, B: 3
56 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
DB|C: 1 {B}; 2 {B}; 3 {A}; 4 {A, B}; supports A: 2, B: 3
Results of the C-branch: AC: 2, BC: 3
57 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
C removed from the DB; frequent sets so far: A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2
58 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2
DB|B (transactions containing B, with B and the processed C, D removed): 1 {}; 2 {}; 4 {A}; 5 {}; support A: 1, infrequent
59 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2
60 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Final set of frequent itemsets: A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2
61 Depth-first strategy
- FSET(DB):
- 1. Count the frequency of the items in DB
- 2. F := { {A} | A is frequent in DB }
- 3. // Remove infrequent items from DB
-    DB := { T ∩ F | T ∈ DB }
- 4. For all frequent items D, except the last one, do
-    // Find frequent, strict supersets of D in DB
-    4a. Let DB_D := { T \ {D} | T ∈ DB, D ∈ T }
-    4b. F := F ∪ { I ∪ {D} | I ∈ FSET(DB_D) }
-    4c. // Remove D from DB
-        DB := { T \ {D} | T ∈ DB }
- 5. Return F
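A direct Python transcription of FSET (a sketch; it returns supports along with the sets and processes items in sorted order):

```python
def fset(db, minsup):
    """Depth-first frequent set mining, mirroring the FSET pseudocode.
    db: list of transactions (sets). Returns {frozenset: support}."""
    counts = {}
    for t in db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    freq = sorted(i for i in counts if counts[i] >= minsup)
    db = [t & set(freq) for t in db]          # drop infrequent items
    result = {frozenset([i]): counts[i] for i in freq}
    for d in freq[:-1]:                       # all frequent items but the last
        # conditional DB: transactions containing d, with d removed
        db_d = [t - {d} for t in db if d in t]
        for itemset, n in fset(db_d, minsup).items():
            result[itemset | {d}] = n         # append d to each recursive result
        db = [t - {d} for t in db]            # remove d before the next branch
    return result

DB = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
print(fset(DB, 2))   # A, B, C, D, AC, AD, BC, BD, CD, ACD with their supports
```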
62 Depth-first strategy
- All depth-first algorithms use this strategy
- Difference: the data structure used for the DB
- prefix-tree: FPGrowth
- vertical database: Eclat
63 ECLAT
- For each item, store a list of transaction ids (tids): its TID-list
64 ECLAT
- Support of item A = length of its tidlist
- Removing item A from the DB = removing the tidlist of A
- Create the conditional database DB|E:
- Intersect all other tidlists with the tidlist of E
- Only keep the frequent items
[Table: vertical database with one tid-column per item A-E; after intersecting with E's tidlist, only the frequent items remain]
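A compact Python sketch of this tidlist recursion (the slides' own A-E table did not survive conversion, so the vertical database is built here from the running five-transaction example):

```python
from collections import defaultdict

def eclat(tidlists, minsup, prefix=frozenset(), out=None):
    """Depth-first search over a vertical database.
    tidlists: {item: set of transaction ids}. Returns {itemset: support}."""
    if out is None:
        out = {}
    items = sorted(tidlists)
    for i, item in enumerate(items):
        tids = tidlists[item]
        if len(tids) < minsup:
            continue
        itemset = prefix | {item}
        out[itemset] = len(tids)                  # support = tidlist length
        # conditional DB: intersect the remaining tidlists with this one
        cond = {j: tidlists[j] & tids for j in items[i + 1:]}
        eclat({j: t for j, t in cond.items() if len(t) >= minsup},
              minsup, itemset, out)
    return out

DB = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
vertical = defaultdict(set)
for tid, transaction in enumerate(DB, start=1):
    for item in transaction:
        vertical[item].add(tid)
print(eclat(vertical, 2))   # same frequent sets as Apriori / FSET
```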
65 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
66 Association Rule Mining
- Remember the original problem: find rules X -> Y such that
- support(X ∪ Y) >= minsup
- support(X ∪ Y) / support(X) >= minconf
- Frequent itemsets are the combinations X ∪ Y
- Hence: get X and Y by splitting up the frequent itemsets I
67 Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f -> L \ f satisfies the minimum confidence requirement
- If {A, B, C, D} is a frequent itemset, the candidate rules are:
- ABC -> D, ABD -> C, ACD -> B, BCD -> A, A -> BCD, B -> ACD, C -> ABD, D -> ABC, AB -> CD, AC -> BD, AD -> BC, BC -> AD, BD -> AC, CD -> AB
- If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L -> ∅ and ∅ -> L)
68 Rule Generation
- How to efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property
- c(ABC -> D) can be larger or smaller than c(AB -> D)
- But the confidence of rules generated from the same itemset does have an anti-monotone property
- e.g., L = {A, B, C, D}: c(ABC -> D) >= c(AB -> CD) >= c(A -> BCD)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
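This property drives the rule-generation step: consequents are grown level-wise, and a consequent is only extended if it already produced a confident rule. A Python sketch (supports are absolute counts; `support_of` maps frozensets to the counts found during itemset mining; the supports below are consistent with the market-basket example of slide 19):

```python
def gen_rules(itemset, support_of, minconf):
    """High-confidence rules from one frequent itemset, growing the RHS
    level-wise; non-confident consequents are not extended (anti-monotone).
    Note: a weaker prune than checking all RHS subsets, but the confidence
    test keeps the output correct."""
    itemset = frozenset(itemset)
    rules, rhs_level = [], [frozenset([i]) for i in itemset]
    while rhs_level and len(rhs_level[0]) < len(itemset):
        next_level = set()
        for rhs in rhs_level:
            lhs = itemset - rhs
            conf = support_of[itemset] / support_of[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                next_level.update(rhs | {i} for i in lhs)
        rhs_level = list(next_level)
    return rules

sup = {frozenset(s): n for s, n in [
    (("Milk",), 4), (("Diaper",), 4), (("Beer",), 3),
    (("Milk", "Diaper"), 3), (("Milk", "Beer"), 2), (("Diaper", "Beer"), 3),
    (("Milk", "Diaper", "Beer"), 2)]}
for lhs, rhs, c in gen_rules({"Milk", "Diaper", "Beer"}, sup, 0.6):
    print(sorted(lhs), "->", sorted(rhs), round(c, 2))
```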
69 Rule Generation for the Apriori Algorithm
[Figure: lattice of rules for one itemset; everything below a low-confidence rule is pruned]
70 Summary: Association Rule Mining
- Find associations X -> Y such that
- the rule appears in a sufficiently large part of the database
- the conditional probability P(Y | X) is high
- This problem can be split into two sub-problems
- find frequent itemsets
- split the frequent itemsets to get association rules
- Finding frequent itemsets
- Apriori property
- breadth-first vs depth-first algorithms
- From itemsets to association rules
- split up frequent sets, use anti-monotonicity
71 Outline
- Association Rule Mining
- Frequent itemsets and association rules
- Algorithms: Apriori and Eclat
- Sequential Pattern Mining
- Mining frequent episodes
- Algorithms: WinEpi and MinEpi
- Other types of patterns
- strings, graphs, ...
- process mining
72 Series and Sequences
- In many applications, the order and transaction times are very important
- stock prices
- events in a networking environment
- crash, starting a program, certain commands
- The specific format of the data is very important
- Goal: find temporal rules; order is important
73 Series and Sequences
- Examples
- 70% of the customers that buy shoes and socks will buy shoe polish within 5 days.
- User U1 logging on, followed by user U2 starting program P, is always followed by a crash.
- Here, we will concentrate on the problem of finding frequent episodes
- episodes can be used in the same way as itemsets
- split episodes to get the rules
74 Episode Mining
- Event sequence: a sequence of pairs (e,t), where e is an event and t an integer indicating the time of occurrence of e.
- A linear episode is a sequence of events <e1, ..., en>.
- A window of length w is an interval [s,e] with (e - s + 1) = w.
- An episode E = <e1, ..., en> occurs in sequence S = <(s1,t1), ..., (sm,tm)> within window W = [s,e] if there exist integers s <= i1 < ... < in <= e such that for all j = 1..n, (ej, ij) is in S.
75 Episode mining: support measure
- Given a sequence S, find all linear episodes that occur frequently in S
76 Episode mining: support measure
- Given a sequence S, find all linear episodes that occur frequently in S
- Given an integer w, the w-support of an episode E = <e1, ..., en> in a sequence S = <(s1,t1), ..., (sm,tm)> is the number of windows W of length w such that E occurs in S within window W.
- Note: if an episode occurs in a very short time span, it will be in many subsequent windows, and thus contribute a lot to the support count!
77 Example
- S = <b a a c b a a b c>
- E = <b a c>
- E occurs in S within window [0,4], within [1,4], within [5,9], ...
- The 5-support of E in S is 3, since E occurs only in the following windows of length 5: [0,4], [1,5], [5,9]
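The occurrence test and the w-support count are easy to write down. A Python sketch (events carry explicit timestamps and, matching the example above, windows may extend slightly beyond the ends of the sequence):

```python
def occurs(episode, events, window):
    """Does the episode occur within window [s, e]?
    events: time-ordered list of (event, time) pairs."""
    s, e = window
    pos = 0
    for ev, t in events:
        if s <= t <= e and ev == episode[pos]:
            pos += 1
            if pos == len(episode):
                return True
    return False

def w_support(episode, events, w):
    """Number of length-w windows in which the episode occurs."""
    times = [t for _, t in events]
    return sum(occurs(episode, events, (s, s + w - 1))
               for s in range(min(times) - w + 1, max(times) + 1))

S = [(e, i + 1) for i, e in enumerate("baacbaabc")]
print(w_support(("b", "a", "c"), S, 5))   # 3, as in the example
```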
78 Sub-episodes
- An episode E1 = <e1, ..., en> is a sub-episode of E2 = <f1, ..., fm>, denoted E1 ⊑ E2, if there exist integers 1 <= i1 < ... < in <= m such that for all j = 1..n, ej = f_ij.
- Example: <b, a, a, c> is a sub-episode of <a, b, c, a, a, b, c>.
79 Episode Mining Problem
- Given a sequence S, a minimal support minsup, and a window width w, find all episodes that have a w-support above minsup.
- Monotonicity
- Let S be a sequence, E1 and E2 episodes, and w a number.
- If E1 ⊑ E2, then w-support(E2) <= w-support(E1).
80 WinEpi Algorithm
- We can again apply a level-wise algorithm like Apriori.
- Start with small episodes; only proceed with a larger episode if all its sub-episodes are frequent.
- <a,a,b> is evaluated after <a>, <b>, <a,a>, <a,b>, and only if all these episodes were frequent.
- Counting the frequency
- slide a window over the stream
- use a smart update technique for the supports
81 Search space
Number of episodes of length k: e^k (where e is the number of events). An episode of length k has maximally k sub-sequences of length k-1. We can count supports by sliding a window over the sequence.
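A level-wise WinEpi-style loop, sketched on top of the `w_support` function defined earlier. This is a simplification: the real algorithm counts all candidates in one pass with incremental window updates, candidate pruning here only checks the length-k prefix and suffix rather than all sub-episodes, and window boundary conventions vary, so absolute counts may differ slightly from the slides' numbers.

```python
def winepi(events, alphabet, w, minsup, max_len=3):
    """Level-wise mining of frequent linear episodes (simplified WinEpi)."""
    frequent = {}
    ck = [(a,) for a in alphabet]                  # candidate 1-episodes
    k = 1
    while ck and k <= max_len:
        supports = {e: w_support(e, events, w) for e in ck}
        fk = {e for e, n in supports.items() if n >= minsup}
        frequent.update((e, supports[e]) for e in fk)
        # extend frequent k-episodes by one event; keep a candidate only
        # if its length-k suffix is also frequent (its prefix is by construction)
        ck = [e + (a,) for e in fk for a in alphabet if (e + (a,))[1:] in fk]
        k += 1
    return frequent

S = [("a",1),("b",2),("c",4),("b",5),("b",6),("a",7),("b",8),("b",9),
     ("c",13),("a",14),("c",17),("c",18)]
print(winepi(S, "abc", 4, 3))   # frequent 1- and 2-episodes of the next example
```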
82 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
[Figure: the events of S plotted on a timeline]
83 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
- Slide a window of length 4 over S
- 4-supports: <a>: 12, <b>: 12, <c>: 14
84 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
- Slide a window of length 4 over S
- 4-supports: <a>: 12, <b>: 12, <c>: 14
- C2 = {<a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>, <c,a>, <c,b>, <c,c>}
85 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
- Slide a window of length 4 over S
- 4-supports: <a>: 12, <b>: 12, <c>: 14
- C2 = {<a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>, <c,a>, <c,b>, <c,c>}
- 4-supports: <a,a>: 0, <a,b>: 6, <a,c>: 2, <b,a>: 3, <b,b>: 7, <b,c>: 3, <c,a>: 3, <c,b>: 1, <c,c>: 3
86 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
- Slide a window of length 4 over S
- 4-supports: <a>: 12, <b>: 12, <c>: 14
- C2 = {<a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>, <c,a>, <c,b>, <c,c>}
- 4-supports: <a,a>: 0, <a,b>: 6, <a,c>: 2, <b,a>: 3, <b,b>: 7, <b,c>: 3, <c,a>: 3, <c,b>: 1, <c,c>: 3
- C3 = {<a,b,b>, <b,a,b>, <b,b,a>, <b,b,b>, <b,b,c>, <b,c,a>, <b,c,c>, <c,c,a>, <c,c,c>}
- 4-supports: <a,b,b>: 2, <b,a,b>: 2, <b,b,a>: 2, <b,b,b>: 2, <b,b,c>: 0, <b,c,a>: 0, <b,c,c>: 0, <c,c,a>: 0, <c,c,c>: 0
87 MinEpi
- A very similar algorithm, based on another support measure
- minimal occurrence of a sequence: the smallest window in which the sequence occurs
- support of E: the number of minimal occurrences of E with a width less than w
- S = <a b c b b a b b c a c c c b b>, window length 5
- 5-support of <a b b> = ?
- mo-support of <a b b> = ?
88 MinEpi
- A very similar algorithm, based on another support measure
- minimal occurrence of a sequence: the smallest window in which the sequence occurs
- support of E: the number of minimal occurrences of E with a width less than w
- S = <a b c b b a b b c a c c c b b>, window length 5
- 5-support of <a b b> = 5
- mo-support of <a b b> = ?
89 MinEpi
- A very similar algorithm, based on another support measure
- minimal occurrence of a sequence: the smallest window in which the sequence occurs
- support of E: the number of minimal occurrences of E with a width less than w
- S = <a b c b b a b b c a c c c b b>, window length 5
- 5-support of <a b b> = 5
- mo-support of <a b b> = 2
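A Python sketch of the minimal-occurrence computation (positions are 1-based like the slide; a window is minimal if no other occurrence window lies inside it):

```python
def minimal_occurrences(episode, seq):
    """All minimal windows [s, e] (1-based) in which the episode occurs."""
    windows = []
    for start in range(len(seq)):
        if seq[start] != episode[0]:
            continue
        pos, end = 1, None
        if pos == len(episode):
            end = start                      # single-event episode
        else:
            for j in range(start + 1, len(seq)):
                if seq[j] == episode[pos]:
                    pos += 1
                    if pos == len(episode):
                        end = j              # earliest completion from this start
                        break
        if end is not None:
            windows.append((start + 1, end + 1))
    # keep only windows that contain no other occurrence window
    return [w for w in windows
            if not any(v != w and w[0] <= v[0] and v[1] <= w[1]
                       for v in windows)]

def mo_support(episode, seq, w):
    """Number of minimal occurrences with width less than w (as defined above)."""
    return sum(1 for s, e in minimal_occurrences(episode, seq) if e - s + 1 < w)

S = "abcbbabbcacccbb"
print(minimal_occurrences(("a", "b", "b"), S))  # [(1, 4), (6, 8), (10, 15)]
print(mo_support(("a", "b", "b"), S, 5))        # 2, as on the slide
```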
90 Sequential Pattern Mining: Summary
- Mining sequential episodes
- Two definitions of support
- w-support
- mo-support
- Two algorithms
- WinEpi
- MinEpi
- Based on the monotonicity principle
- generate candidates level-wise
- only count candidates without infrequent subsequences
91 Outline
- Association Rule Mining
- Frequent itemsets and association rules
- Algorithms: Apriori and Eclat
- Sequential Pattern Mining
- Mining frequent episodes
- Algorithms: WinEpi and MinEpi
- Other types of patterns
- strings, graphs, ...
- process mining
92 Other types of patterns
- Sequence problems
- Strings
- Other types of sequences
- Other patterns in sequences
- Graphs
- Molecules
- WWW
- Social Networks
93 Strings require different support measures
IR context: what if the query or the document contains typos or misspellings?
A subsequence of a string is obtained by deleting zero or more characters.
94 Other Types of Sequences
- e.g. DNA: CGATGGGCCAGTCGATACGTCGATGCCGATGTCACGA
95 Other Patterns in Sequences
- Substrings
- Regular expressions
- Partial orders
- Directed Acyclic Graphs
96 Graphs
97 Patterns in Graphs
98 Rules
[Figure: rules between graph patterns; antecedent patterns with frequencies f = 7, f = 8, f = 5 and consequent patterns with f = 4 give confidences 4/7 = 0.57, 4/8 = 0.5, 4/5 = 0.8]
99 Summary
- What data mining is and why it is important
- huge volumes of data
- not enough human analysts
- Pattern discovery as an important descriptive data mining task
- association rule mining
- sequential pattern mining
- Important principles
- Apriori principle
- breadth-first vs depth-first algorithms
- Many kinds and varieties of data types, pattern types, support measures, ...