Title: Data Mining (Toon Calders)
1 Data Mining
Toon Calders
2 Why Data Mining?
- Explosive growth of data: from terabytes to petabytes
- Data collection and data availability
- Major sources of abundant data
3 Why Data Mining?
- We are drowning in data, but starving for knowledge!
- Necessity is the mother of invention: data mining, the automated analysis of massive data sets
[Figure "The Data Gap": total new disk capacity (TB) since 1995 vs. the number of analysts]
4 What Is Data Mining?
- Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
- Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
5 Current Applications
- Data analysis and decision support
- Market analysis and management
- Risk analysis and management
- Fraud detection and detection of unusual patterns (outliers)
- Other applications
- Text mining (newsgroups, email, documents) and Web mining
- Stream data mining
- Bioinformatics and bio-data analysis
6 Ex. 1: Market Analysis and Management
- Data from credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
- Find groups of customers who share the same characteristics
- Determine customer purchasing patterns over time
- Find associations/correlations between product sales; predict based on the associations
- Customer requirement analysis
- Identify the best products for different customers
- Predict what factors will attract new customers
- Provision of summary information
7 Ex. 2: Fraud Detection and Unusual Patterns
- Auto insurance: ring of collisions
- Money laundering: suspicious monetary transactions
- Medical insurance
- Professional patients, rings of doctors, and rings of references
- Unnecessary or correlated screening tests
- Telecommunications: phone-call fraud
- Phone-call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
- Tax fraud
- The Belgian government successfully applies data mining to find fraud
8 Ex. 3: Process Mining
- Process mining can be used for
- Process discovery (What is the process?)
- Delta analysis (Are we doing what was specified?)
- Performance analysis (How can we improve?)
9 Ex. 3: Process Mining
Event log (case, task): (1, A), (2, A), (3, A), (3, B), (1, B), (1, C), (2, C), (4, A), (2, B), (2, D), (5, E), (4, C), (1, D), (3, C), (3, D), (4, B), (5, F), (4, D)
10 Knowledge Discovery (KDD) Process
- Data mining: the core of the knowledge discovery process
[Figure: Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]
11 Data Mining Tasks
- Previous lectures
- Classification (predictive)
- Clustering (descriptive)
- This lecture
- Association Rule Discovery (descriptive)
- Sequential Pattern Discovery (descriptive)
- Other techniques
- Regression (predictive)
- Deviation Detection (predictive)
12 Outline of today's lecture
- Association Rule Mining
- Frequent itemsets and association rules
- Algorithms: Apriori and Eclat
- Sequential Pattern Mining
- Mining frequent episodes
- Algorithms: WinEpi and MinEpi
- Other types of patterns
- strings, graphs, ...
- process mining
13 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
14 Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-basket transactions [table of example transactions]
Examples of association rules:
{Diaper} -> {Beer}, {Milk, Bread} -> {Eggs, Coke}, {Beer, Bread} -> {Milk}
Implication means co-occurrence, not causality!
15 Association Rule Discovery: Application
- Marketing and Sales Promotion
- Let the rule discovered be {Bagels, ...} -> {Potato Chips}
- Potato Chips as consequent => can be used to determine what should be done to boost its sales
- Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels
- Bagels in the antecedent and Potato Chips in the consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!
16 Definition: Frequent Itemset
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s)
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold
17 Definition: Association Rule
- Association Rule
- An implication expression of the form X -> Y, where X and Y are itemsets
- Example: {Milk, Diaper} -> {Beer}
- Rule Evaluation Metrics
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in transactions that contain X
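To make the two metrics concrete, here is a minimal Python sketch. The market-basket table from the slides did not survive conversion, so the five transactions below are a reconstruction chosen to match the quoted numbers (σ({Milk, Bread, Diaper}) = 2, s = 2/5, and the rule confidences on slide 19).

```python
# Reconstructed transaction database (an assumption consistent with the slides)
DB = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(lhs | rhs, db) / support_count(lhs, db)

print(support_count({"Milk", "Bread", "Diaper"}, DB))  # 2
print(support({"Milk", "Bread", "Diaper"}, DB))        # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, DB))    # 0.666...
```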
18 Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
- support >= minsup threshold
- confidence >= minconf threshold
- Brute-force approach
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- => Computationally prohibitive!
19 Mining Association Rules
Example rules:
{Milk, Diaper} -> {Beer} (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk} (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer} (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer} (s=0.4, c=0.5)
- Observations
- All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
20 Mining Association Rules
- Two-step approach
- Frequent Itemset Generation
- Generate all itemsets whose support >= minsup
- Rule Generation
- Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
21 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
22 Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
23 Frequent Itemset Generation
- Brute-force approach
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity O(NMw) => expensive, since M = 2^d !!!
24 Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M
- Reduce the number of transactions (N)
- Reduce the size of N as the size of the itemset increases
- Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
- Use efficient data structures to store the candidates or transactions
- No need to match every candidate against every transaction
25 Reducing the Number of Candidates
- Apriori principle
- If an itemset is frequent, then all of its subsets must also be frequent
- The Apriori principle holds due to the following property of the support measure:
- The support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support
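In contrapositive form, the principle licenses pruning: once any subset is known to be infrequent, every superset can be discarded without counting. A small Python sketch of that check (the item names are just the running example):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_minus_1):
    """Apriori pruning: discard a k-candidate as soon as one of its
    (k-1)-subsets is missing from the frequent (k-1)-itemsets."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_k_minus_1
               for sub in combinations(candidate, k - 1))

F2 = {frozenset(p) for p in [("Milk", "Diaper"), ("Milk", "Beer"),
                             ("Diaper", "Beer")]}
# {Milk, Diaper, Beer} survives because all three 2-subsets are frequent
print(has_infrequent_subset(("Milk", "Diaper", "Beer"), F2))  # False
```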
26 Illustrating the Apriori Principle
27 Illustrating the Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets): no need to generate candidates involving Coke or Eggs
Minimum support = 3
Triplets (3-itemsets)
If every subset is considered: 6C1 + 6C2 + 6C3 = 41. With support-based pruning: 6 + 6 + 1 = 13
28 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
29 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidates (1-itemsets), before the scan: A: 0, B: 0, C: 0, D: 0
30 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 1: A: 0, B: 1, C: 1, D: 0
31 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 2: A: 0, B: 2, C: 2, D: 0
32 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 3: A: 1, B: 2, C: 3, D: 1
33 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 4: A: 2, B: 3, C: 4, D: 2
34 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Candidate supports after transaction 5: A: 2, B: 4, C: 4, D: 3
35 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
New candidates (2-itemsets): AB, AC, AD, BC, BD, CD
36 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
2-itemset supports: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2
37 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
2-itemset supports: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2
New candidates (3-itemsets): ACD, BCD (AB is infrequent, so no candidate contains both A and B)
38 Apriori
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
2-itemset supports: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2
3-itemset supports: ACD: 2, BCD: 1
39 Apriori Algorithm
- Apriori Algorithm
- k := 1
- C1 := { {A} | A is an item }
- Repeat until Ck = ∅:
- Count the support of each candidate in Ck in one scan over the DB
- Fk := { I ∈ Ck | I is frequent }
- Generate new candidates: Ck+1 := { I : |I| = k+1 and all J ⊂ I with |J| = k are in Fk }
- k := k+1
- Return F1 ∪ ... ∪ Fk-1
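A runnable Python version of this pseudocode (a sketch: absolute support threshold, itemsets as frozensets), applied to the five-transaction DB of the preceding slides:

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise frequent itemset mining, following the pseudocode above.
    db: list of transactions (sets); minsup: absolute support threshold."""
    items = {i for t in db for i in t}
    ck = {frozenset([i]) for i in items}              # C1
    frequent, k = {}, 1
    while ck:
        # one scan over the DB counts all candidates of size k
        counts = {c: sum(1 for t in db if c <= t) for c in ck}
        fk = {c for c, n in counts.items() if n >= minsup}
        frequent.update((c, counts[c]) for c in fk)
        # C_{k+1}: size-(k+1) unions all of whose k-subsets are in Fk
        ck = {a | b for a in fk for b in fk if len(a | b) == k + 1
              and all(frozenset(s) in fk for s in combinations(a | b, k))}
        k += 1
    return frequent

DB = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
for itemset, n in sorted(apriori(DB, 2).items(),
                         key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(itemset)), n)
```

On this DB it reports exactly the frequent sets found in the walkthrough: A, B, C, D, AC, AD, BC, BD, CD, and ACD (AB and BCD fall below minsup = 2).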
40 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
41 Depth-first strategy
- Recursive procedure
- FSET(DB) = frequent sets in DB
- Based on divide-and-conquer
- Count the frequency of all items
- Let D be a frequent item
- FSET(DB) =
- frequent sets with item D
- + frequent sets without item D
42 Depth-first strategy
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
- Frequent items: A, B, C, D
- Frequent sets with D
- Remove transactions without D, and D itself, from the DB
- Count frequent sets: A, B, C, AC
- Append D: AD, BD, CD, ACD
- Frequent sets without D
- Remove D from all transactions in the DB
- Find frequent sets: AC, BC
43 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
44 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Item supports: A: 2, B: 4, C: 4, D: 3
45 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Item supports: A: 2, B: 4, C: 4, D: 3
DB|D (transactions containing D, with D removed): 3 {A, C}; 4 {A, B, C}; 5 {B}
Supports in DB|D: A: 2, B: 2, C: 2
46 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
DB|CD (transactions of DB|D containing C, with C removed): 3 {A}; 4 {A, B}
Supports in DB|CD: A: 2
47 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
DB|CD: 3 {A}; 4 {A, B}; supports A: 2
Result in DB|D: AC: 2 (the frequent set {A} of DB|CD, with C appended)
48 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
Found in DB|D so far: AC: 2
49 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
Found in DB|D so far: AC: 2
DB|BD (transactions of DB|D containing B, with B removed): 4 {A}; support A: 1, infrequent
50 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
Found in DB|D so far: AC: 2
51 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
DB|D: 3 {A, C}; 4 {A, B, C}; 5 {B}; supports A: 2, B: 2, C: 2
Results of the D-branch: AD: 2, BD: 2, CD: 2, ACD: 2
52 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
D removed from the DB; frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
53 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
DB|C (transactions containing C, with C and the processed D removed): 1 {B}; 2 {B}; 3 {A}; 4 {A, B}
Supports in DB|C: A: 2, B: 3
54 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
DB|C: 1 {B}; 2 {B}; 3 {A}; 4 {A, B}; supports A: 2, B: 3
DB|BC (transactions of DB|C containing B, with B removed): 4 {A}; support A: 1, infrequent
55 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
DB|C: 1 {B}; 2 {B}; 3 {A}; 4 {A, B}; supports A: 2, B: 3
56 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AD: 2, BD: 2, CD: 2, ACD: 2
DB|C: 1 {B}; 2 {B}; 3 {A}; 4 {A, B}; supports A: 2, B: 3
Results of the C-branch: AC: 2, BC: 3
57 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
C removed from the DB; frequent sets so far: A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2
58 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2
DB|B (transactions containing B, with B and the processed C, D removed): 1 {}; 2 {}; 4 {A}; 5 {}; support A: 1, infrequent
59 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Frequent sets so far: A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2
60 Depth-First Algorithm
minsup = 2
DB: 1 {B, C}; 2 {B, C}; 3 {A, C, D}; 4 {A, B, C, D}; 5 {B, D}
Final set of frequent itemsets: A: 2, B: 4, C: 4, D: 3, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2, ACD: 2
61 Depth-first strategy
- FSET(DB):
- 1. Count the frequency of the items in DB
- 2. F := { {A} | A is frequent in DB }
- 3. // Remove infrequent items from DB
-    DB := { T ∩ F | T ∈ DB }
- 4. For all frequent items D, except the last one, do
-    // Find frequent, strict supersets of D in DB
-    4a. Let DB_D := { T \ {D} | T ∈ DB, D ∈ T }
-    4b. F := F ∪ { I ∪ {D} | I ∈ FSET(DB_D) }
-    4c. // Remove D from DB
-        DB := { T \ {D} | T ∈ DB }
- 5. Return F
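A direct Python transcription of FSET (a sketch; it returns supports along with the sets and processes items in sorted order):

```python
def fset(db, minsup):
    """Depth-first frequent set mining, mirroring the FSET pseudocode.
    db: list of transactions (sets). Returns {frozenset: support}."""
    counts = {}
    for t in db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    freq = sorted(i for i in counts if counts[i] >= minsup)
    db = [t & set(freq) for t in db]          # drop infrequent items
    result = {frozenset([i]): counts[i] for i in freq}
    for d in freq[:-1]:                       # all frequent items but the last
        # conditional DB: transactions containing d, with d removed
        db_d = [t - {d} for t in db if d in t]
        for itemset, n in fset(db_d, minsup).items():
            result[itemset | {d}] = n         # append d to each recursive result
        db = [t - {d} for t in db]            # remove d before the next branch
    return result

DB = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
print(fset(DB, 2))   # A, B, C, D, AC, AD, BC, BD, CD, ACD with their supports
```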
62 Depth-first strategy
- All depth-first algorithms use this strategy
- Difference: the data structure used for the DB
- prefix-tree: FPGrowth
- vertical database: Eclat
63 ECLAT
- For each item, store a list of transaction ids (tids): its TID-list
64 ECLAT
- Support of item A = length of its tidlist
- Removing item A from the DB = removing the tidlist of A
- Create the conditional database DB|E:
- Intersect all other tidlists with the tidlist of E
- Only keep the frequent items
[Table: vertical database with one tid-column per item A-E; after intersecting with E's tidlist, only the frequent items remain]
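A compact Python sketch of this tidlist recursion (the slides' own A-E table did not survive conversion, so the vertical database is built here from the running five-transaction example):

```python
from collections import defaultdict

def eclat(tidlists, minsup, prefix=frozenset(), out=None):
    """Depth-first search over a vertical database.
    tidlists: {item: set of transaction ids}. Returns {itemset: support}."""
    if out is None:
        out = {}
    items = sorted(tidlists)
    for i, item in enumerate(items):
        tids = tidlists[item]
        if len(tids) < minsup:
            continue
        itemset = prefix | {item}
        out[itemset] = len(tids)                  # support = tidlist length
        # conditional DB: intersect the remaining tidlists with this one
        cond = {j: tidlists[j] & tids for j in items[i + 1:]}
        eclat({j: t for j, t in cond.items() if len(t) >= minsup},
              minsup, itemset, out)
    return out

DB = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
vertical = defaultdict(set)
for tid, transaction in enumerate(DB, start=1):
    for item in transaction:
        vertical[item].add(tid)
print(eclat(vertical, 2))   # same frequent sets as Apriori / FSET
```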
65 Association Rule Mining
- Definitions
- Frequent itemsets
- Association rules
- Frequent itemset mining
- breadth-first: Apriori
- depth-first: Eclat
- Association Rule Mining
66 Association Rule Mining
- Remember the original problem: find rules X -> Y such that
- support(X ∪ Y) >= minsup
- support(X ∪ Y) / support(X) >= minconf
- Frequent itemsets are the combinations X ∪ Y
- Hence: get X and Y by splitting up the frequent itemsets I
67 Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f -> L \ f satisfies the minimum confidence requirement
- If {A, B, C, D} is a frequent itemset, the candidate rules are:
- ABC -> D, ABD -> C, ACD -> B, BCD -> A, A -> BCD, B -> ACD, C -> ABD, D -> ABC, AB -> CD, AC -> BD, AD -> BC, BC -> AD, BD -> AC, CD -> AB
- If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L -> ∅ and ∅ -> L)
68 Rule Generation
- How to efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property
- c(ABC -> D) can be larger or smaller than c(AB -> D)
- But the confidence of rules generated from the same itemset does have an anti-monotone property
- e.g., L = {A, B, C, D}: c(ABC -> D) >= c(AB -> CD) >= c(A -> BCD)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
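This property drives the rule-generation step: consequents are grown level-wise, and a consequent is only extended if it already produced a confident rule. A Python sketch (supports are absolute counts; `support_of` maps frozensets to the counts found during itemset mining; the supports below are consistent with the market-basket example of slide 19):

```python
def gen_rules(itemset, support_of, minconf):
    """High-confidence rules from one frequent itemset, growing the RHS
    level-wise; non-confident consequents are not extended (anti-monotone).
    Note: a weaker prune than checking all RHS subsets, but the confidence
    test keeps the output correct."""
    itemset = frozenset(itemset)
    rules, rhs_level = [], [frozenset([i]) for i in itemset]
    while rhs_level and len(rhs_level[0]) < len(itemset):
        next_level = set()
        for rhs in rhs_level:
            lhs = itemset - rhs
            conf = support_of[itemset] / support_of[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                next_level.update(rhs | {i} for i in lhs)
        rhs_level = list(next_level)
    return rules

sup = {frozenset(s): n for s, n in [
    (("Milk",), 4), (("Diaper",), 4), (("Beer",), 3),
    (("Milk", "Diaper"), 3), (("Milk", "Beer"), 2), (("Diaper", "Beer"), 3),
    (("Milk", "Diaper", "Beer"), 2)]}
for lhs, rhs, c in gen_rules({"Milk", "Diaper", "Beer"}, sup, 0.6):
    print(sorted(lhs), "->", sorted(rhs), round(c, 2))
```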
69 Rule Generation for the Apriori Algorithm
[Figure: lattice of rules for one itemset; everything below a low-confidence rule is pruned]
70 Summary: Association Rule Mining
- Find associations X -> Y such that
- the rule appears in a sufficiently large part of the database
- the conditional probability P(Y | X) is high
- This problem can be split into two sub-problems
- find frequent itemsets
- split the frequent itemsets to get association rules
- Finding frequent itemsets
- Apriori property
- breadth-first vs depth-first algorithms
- From itemsets to association rules
- split up frequent sets, use anti-monotonicity
71 Outline
- Association Rule Mining
- Frequent itemsets and association rules
- Algorithms: Apriori and Eclat
- Sequential Pattern Mining
- Mining frequent episodes
- Algorithms: WinEpi and MinEpi
- Other types of patterns
- strings, graphs, ...
- process mining
72 Series and Sequences
- In many applications, the order and transaction times are very important
- stock prices
- events in a networking environment
- crash, starting a program, certain commands
- The specific format of the data is very important
- Goal: find temporal rules; order is important
73 Series and Sequences
- Examples
- 70% of the customers that buy shoes and socks will buy shoe polish within 5 days.
- User U1 logging on, followed by user U2 starting program P, is always followed by a crash.
- Here, we will concentrate on the problem of finding frequent episodes
- episodes can be used in the same way as itemsets
- split episodes to get the rules
74 Episode Mining
- Event sequence: a sequence of pairs (e,t), where e is an event and t an integer indicating the time of occurrence of e.
- A linear episode is a sequence of events <e1, ..., en>.
- A window of length w is an interval [s,e] with (e - s + 1) = w.
- An episode E = <e1, ..., en> occurs in sequence S = <(s1,t1), ..., (sm,tm)> within window W = [s,e] if there exist integers s <= i1 < ... < in <= e such that for all j = 1..n, (ej, ij) is in S.
75 Episode mining: support measure
- Given a sequence S, find all linear episodes that occur frequently in S
76 Episode mining: support measure
- Given a sequence S, find all linear episodes that occur frequently in S
- Given an integer w, the w-support of an episode E = <e1, ..., en> in a sequence S = <(s1,t1), ..., (sm,tm)> is the number of windows W of length w such that E occurs in S within window W.
- Note: if an episode occurs in a very short time span, it will be in many subsequent windows, and thus contribute a lot to the support count!
77 Example
- S = <b a a c b a a b c>
- E = <b a c>
- E occurs in S within window [0,4], within [1,4], within [5,9], ...
- The 5-support of E in S is 3, since E occurs only in the following windows of length 5: [0,4], [1,5], [5,9]
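The occurrence test and the w-support count are easy to write down. A Python sketch (events carry explicit timestamps and, matching the example above, windows may extend slightly beyond the ends of the sequence):

```python
def occurs(episode, events, window):
    """Does the episode occur within window [s, e]?
    events: time-ordered list of (event, time) pairs."""
    s, e = window
    pos = 0
    for ev, t in events:
        if s <= t <= e and ev == episode[pos]:
            pos += 1
            if pos == len(episode):
                return True
    return False

def w_support(episode, events, w):
    """Number of length-w windows in which the episode occurs."""
    times = [t for _, t in events]
    return sum(occurs(episode, events, (s, s + w - 1))
               for s in range(min(times) - w + 1, max(times) + 1))

S = [(e, i + 1) for i, e in enumerate("baacbaabc")]
print(w_support(("b", "a", "c"), S, 5))   # 3, as in the example
```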
78 Sub-episodes
- An episode E1 = <e1, ..., en> is a sub-episode of E2 = <f1, ..., fm>, denoted E1 ⊑ E2, if there exist integers 1 <= i1 < ... < in <= m such that for all j = 1..n, ej = f_ij.
- Example: <b, a, a, c> is a sub-episode of <a, b, c, a, a, b, c>.
79 Episode Mining Problem
- Given a sequence S, a minimal support minsup, and a window width w, find all episodes that have a w-support above minsup.
- Monotonicity
- Let S be a sequence, E1 and E2 episodes, and w a number.
- If E1 ⊑ E2, then w-support(E2) <= w-support(E1).
80 WinEpi Algorithm
- We can again apply a level-wise algorithm like Apriori.
- Start with small episodes; only proceed with a larger episode if all its sub-episodes are frequent.
- <a,a,b> is evaluated after <a>, <b>, <a,a>, <a,b>, and only if all these episodes were frequent.
- Counting the frequency
- slide a window over the stream
- use a smart update technique for the supports
81 Search space
Number of episodes of length k: e^k (where e is the number of events). An episode of length k has maximally k sub-sequences of length k-1. We can count supports by sliding a window over the sequence.
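A level-wise WinEpi-style loop, sketched on top of the `w_support` function defined earlier. This is a simplification: the real algorithm counts all candidates in one pass with incremental window updates, candidate pruning here only checks the length-k prefix and suffix rather than all sub-episodes, and window boundary conventions vary, so absolute counts may differ slightly from the slides' numbers.

```python
def winepi(events, alphabet, w, minsup, max_len=3):
    """Level-wise mining of frequent linear episodes (simplified WinEpi)."""
    frequent = {}
    ck = [(a,) for a in alphabet]                  # candidate 1-episodes
    k = 1
    while ck and k <= max_len:
        supports = {e: w_support(e, events, w) for e in ck}
        fk = {e for e, n in supports.items() if n >= minsup}
        frequent.update((e, supports[e]) for e in fk)
        # extend frequent k-episodes by one event; keep a candidate only
        # if its length-k suffix is also frequent (its prefix is by construction)
        ck = [e + (a,) for e in fk for a in alphabet if (e + (a,))[1:] in fk]
        k += 1
    return frequent

S = [("a",1),("b",2),("c",4),("b",5),("b",6),("a",7),("b",8),("b",9),
     ("c",13),("a",14),("c",17),("c",18)]
print(winepi(S, "abc", 4, 3))   # frequent 1- and 2-episodes of the next example
```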
82 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
[Figure: the events of S plotted on a timeline]
83 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
- Slide a window of length 4 over S
- 4-supports: <a>: 12, <b>: 12, <c>: 14
84 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
- Slide a window of length 4 over S
- 4-supports: <a>: 12, <b>: 12, <c>: 14
- C2 = {<a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>, <c,a>, <c,b>, <c,c>}
85 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
- Slide a window of length 4 over S
- 4-supports: <a>: 12, <b>: 12, <c>: 14
- C2 = {<a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>, <c,a>, <c,b>, <c,c>}
- 4-supports: <a,a>: 0, <a,b>: 6, <a,c>: 2, <b,a>: 3, <b,b>: 7, <b,c>: 3, <c,a>: 3, <c,b>: 1, <c,c>: 3
86 Example
- S = <(a,1),(b,2),(c,4),(b,5),(b,6),(a,7),(b,8),(b,9),(c,13),(a,14),(c,17),(c,18)>
- w = 4, minsup = 3
- C1 = {<a>, <b>, <c>}
- Slide a window of length 4 over S
- 4-supports: <a>: 12, <b>: 12, <c>: 14
- C2 = {<a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>, <c,a>, <c,b>, <c,c>}
- 4-supports: <a,a>: 0, <a,b>: 6, <a,c>: 2, <b,a>: 3, <b,b>: 7, <b,c>: 3, <c,a>: 3, <c,b>: 1, <c,c>: 3
- C3 = {<a,b,b>, <b,a,b>, <b,b,a>, <b,b,b>, <b,b,c>, <b,c,a>, <b,c,c>, <c,c,a>, <c,c,c>}
- 4-supports: <a,b,b>: 2, <b,a,b>: 2, <b,b,a>: 2, <b,b,b>: 2, <b,b,c>: 0, <b,c,a>: 0, <b,c,c>: 0, <c,c,a>: 0, <c,c,c>: 0
87 MinEpi
- A very similar algorithm, based on another support measure
- minimal occurrence of a sequence: the smallest window in which the sequence occurs
- support of E: the number of minimal occurrences of E with a width less than w
- S = <a b c b b a b b c a c c c b b>, window length 5
- 5-support of <a b b> = ?
- mo-support of <a b b> = ?
88 MinEpi
- A very similar algorithm, based on another support measure
- minimal occurrence of a sequence: the smallest window in which the sequence occurs
- support of E: the number of minimal occurrences of E with a width less than w
- S = <a b c b b a b b c a c c c b b>, window length 5
- 5-support of <a b b> = 5
- mo-support of <a b b> = ?
89 MinEpi
- A very similar algorithm, based on another support measure
- minimal occurrence of a sequence: the smallest window in which the sequence occurs
- support of E: the number of minimal occurrences of E with a width less than w
- S = <a b c b b a b b c a c c c b b>, window length 5
- 5-support of <a b b> = 5
- mo-support of <a b b> = 2
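A Python sketch of the minimal-occurrence computation (positions are 1-based like the slide; a window is minimal if no other occurrence window lies inside it):

```python
def minimal_occurrences(episode, seq):
    """All minimal windows [s, e] (1-based) in which the episode occurs."""
    windows = []
    for start in range(len(seq)):
        if seq[start] != episode[0]:
            continue
        pos, end = 1, None
        if pos == len(episode):
            end = start                      # single-event episode
        else:
            for j in range(start + 1, len(seq)):
                if seq[j] == episode[pos]:
                    pos += 1
                    if pos == len(episode):
                        end = j              # earliest completion from this start
                        break
        if end is not None:
            windows.append((start + 1, end + 1))
    # keep only windows that contain no other occurrence window
    return [w for w in windows
            if not any(v != w and w[0] <= v[0] and v[1] <= w[1]
                       for v in windows)]

def mo_support(episode, seq, w):
    """Number of minimal occurrences with width less than w (as defined above)."""
    return sum(1 for s, e in minimal_occurrences(episode, seq) if e - s + 1 < w)

S = "abcbbabbcacccbb"
print(minimal_occurrences(("a", "b", "b"), S))  # [(1, 4), (6, 8), (10, 15)]
print(mo_support(("a", "b", "b"), S, 5))        # 2, as on the slide
```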
90 Sequential Pattern Mining: Summary
- Mining sequential episodes
- Two definitions of support
- w-support
- mo-support
- Two algorithms
- WinEpi
- MinEpi
- Based on the monotonicity principle
- generate candidates level-wise
- only count candidates without infrequent subsequences
91 Outline
- Association Rule Mining
- Frequent itemsets and association rules
- Algorithms: Apriori and Eclat
- Sequential Pattern Mining
- Mining frequent episodes
- Algorithms: WinEpi and MinEpi
- Other types of patterns
- strings, graphs, ...
- process mining
92 Other types of patterns
- Sequence problems
- Strings
- Other types of sequences
- Other patterns in sequences
- Graphs
- Molecules
- WWW
- Social Networks
93 Strings require different support measures
IR context: what if the query or the document contains typos or misspellings?
A subsequence of a string is obtained by deleting zero or more characters.
94 Other Types of Sequences
- e.g. DNA: CGATGGGCCAGTCGATACGTCGATGCCGATGTCACGA
95 Other Patterns in Sequences
- Substrings
- Regular expressions
- Partial orders
- Directed Acyclic Graphs
96 Graphs
97 Patterns in Graphs
98 Rules
[Figure: rules between graph patterns; antecedent patterns with frequencies f = 7, f = 8, f = 5 and consequent patterns with f = 4 give confidences 4/7 = 0.57, 4/8 = 0.5, 4/5 = 0.8]
99 Summary
- What data mining is and why it is important
- huge volumes of data
- not enough human analysts
- Pattern discovery as an important descriptive data mining task
- association rule mining
- sequential pattern mining
- Important principles
- Apriori principle
- breadth-first vs depth-first algorithms
- Many kinds and varieties of data types, pattern types, support measures, ...