Data Mining (Toon Calders)


1
Data Mining
Toon Calders
2
Why Data mining?
  • Explosive growth of data: from terabytes to
    petabytes
  • Data collection and data availability
  • Major sources of abundant data

3
Why Data mining?
  • We are drowning in data, but starving for
    knowledge!
  • Necessity is the mother of invention: data
    mining, the automated analysis of massive data sets

[Chart: "The Data Gap": total new disk (TB) since 1995 grows much faster than the number of analysts]
4
What Is Data Mining?
  • Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    patterns or knowledge from huge amounts of data
  • Alternative names
  • Knowledge discovery (mining) in databases (KDD),
    knowledge extraction, data/pattern analysis, data
    archeology, data dredging, information
    harvesting, business intelligence, etc.

5
Current Applications
  • Data analysis and decision support
  • Market analysis and management
  • Risk analysis and management
  • Fraud detection and detection of unusual patterns
    (outliers)
  • Other Applications
  • Text mining (newsgroups, email, documents) and
    Web mining
  • Stream data mining
  • Bioinformatics and bio-data analysis

6
Ex. 1: Market Analysis and Management
  • Data from credit card transactions, loyalty
    cards, discount coupons, customer complaint
    calls, plus (public) lifestyle studies
  • Target marketing
  • Find groups of customers who share the same
    characteristics
  • Determine customer purchasing patterns over time
  • Find associations/correlations between product
    sales; predict based on the associations
  • Customer requirement analysis
  • Identify the best products for different
    customers
  • Predict what factors will attract new customers
  • Provision of summary information

7
Ex. 2: Fraud Detection and Unusual Patterns
  • Auto insurance: ring of collisions
  • Money laundering: suspicious monetary
    transactions
  • Medical insurance
  • Professional patients, rings of doctors, and rings
    of references
  • Unnecessary or correlated screening tests
  • Telecommunications: phone-call fraud
  • Phone-call model: destination of the call,
    duration, time of day or week. Analyze patterns
    that deviate from an expected norm
  • Tax fraud
  • The Belgian government successfully applies data
    mining to find fraud

8
Ex. 3: Process Mining
  • Process mining can be used for
  • Process discovery (What is the process?)
  • Delta analysis (Are we doing what was specified?)
  • Performance analysis (How can we improve?)

9
Ex. 3: Process Mining
Example event log (case, task):
case 1: task A; case 2: task A; case 3: task A;
case 3: task B; case 1: task B; case 1: task C;
case 2: task C; case 4: task A; case 2: task B;
case 2: task D; case 5: task E; case 4: task C;
case 1: task D; case 3: task C; case 3: task D;
case 4: task B; case 5: task F; case 4: task D
10
Knowledge Discovery (KDD) Process
  • Data mining: the core of the knowledge discovery
    process

[Diagram of the KDD process: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
11
Data Mining Tasks
  • Previous lectures
  • Classification [Predictive]
  • Clustering [Descriptive]
  • This lecture
  • Association Rule Discovery [Descriptive]
  • Sequential Pattern Discovery [Descriptive]
  • Other techniques
  • Regression [Predictive]
  • Deviation Detection [Predictive]

12
Outline of today's lecture
  • Association Rule Mining
  • Frequent itemsets and association rules
  • Algorithms: Apriori and Eclat
  • Sequential Pattern Mining
  • Mining frequent episodes
  • Algorithms: WinEpi and MinEpi
  • Other types of patterns
  • strings, graphs, ...
  • process mining

13
Association Rule Mining
  • Definition
  • Frequent itemsets
  • Association rules
  • Frequent itemset mining
  • breadth-first: Apriori
  • depth-first: Eclat
  • Association Rule Mining

14
Association Rule Mining
  • Given a set of transactions, find rules that will
    predict the occurrence of an item based on the
    occurrences of other items in the transaction

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
15
Association Rule Discovery Application
  • Marketing and Sales Promotion
  • Let the rule discovered be
  • {Bagels} → {Potato Chips}
  • Potato Chips as consequent ⇒ can be used to
    determine what should be done to boost its sales.
  • Bagels in the antecedent ⇒ can be used to see
    which products would be affected if the store
    discontinues selling bagels.
  • Bagels in the antecedent and Potato Chips in the
    consequent ⇒ can be used to see what products
    should be sold with Bagels to promote sales of
    Potato Chips!

16
Definition: Frequent Itemset
  • Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
  • k-itemset
  • An itemset that contains k items
  • Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
  • Support (s)
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
  • An itemset whose support is greater than or equal
    to a minsup threshold

17
Definition: Association Rule
  • Association Rule
  • An implication expression of the form X → Y,
    where X and Y are itemsets
  • Example: {Milk, Diaper} → {Beer}
  • Rule Evaluation Metrics
  • Support (s)
  • Fraction of transactions that contain both X and
    Y: s(X → Y) = σ(X ∪ Y) / |T|
  • Confidence (c)
  • Measures how often items in Y appear in
    transactions that contain X:
    c(X → Y) = σ(X ∪ Y) / σ(X)
  • (see the sketch below)
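A minimal Python sketch of both metrics (added for illustration, not part of the original slides; the function names are illustrative), run against the market-basket transactions of the earlier slide:

    # Support and confidence of {Milk, Diaper} -> {Beer} on the
    # 5-transaction market-basket example.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        # sigma(X): number of transactions containing every item of X
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        # s(X): fraction of transactions that contain X
        return support_count(itemset, transactions) / len(transactions)

    def confidence(X, Y, transactions):
        # c(X -> Y) = sigma(X u Y) / sigma(X)
        return support_count(X | Y, transactions) / support_count(X, transactions)

    X, Y = {"Milk", "Diaper"}, {"Beer"}
    print(support(X | Y, transactions))    # 0.4
    print(confidence(X, Y, transactions))  # 0.666...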

18
Association Rule Mining Task
  • Given a set of transactions T, the goal of
    association rule mining is to find all rules
    having
  • support ≥ minsup threshold
  • confidence ≥ minconf threshold
  • Brute-force approach
  • List all possible association rules
  • Compute the support and confidence for each rule
  • Prune rules that fail the minsup and minconf
    thresholds
  • ⇒ Computationally prohibitive!

19
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
  • Observations
  • All the above rules are binary partitions of the
    same itemset Milk, Diaper, Beer
  • Rules originating from the same itemset have
    identical support but can have different
    confidence
  • Thus, we may decouple the support and confidence
    requirements

20
Mining Association Rules
  • Two-step approach
  • Frequent Itemset Generation
  • Generate all itemsets whose support ≥ minsup
  • Rule Generation
  • Generate high-confidence rules from each frequent
    itemset, where each rule is a binary partitioning
    of a frequent itemset
  • Frequent itemset generation is still
    computationally expensive

21
Association Rule Mining
  • Definition
  • Frequent itemsets
  • Association rules
  • Frequent itemset mining
  • breadth-first: Apriori
  • depth-first: Eclat
  • Association Rule Mining

22
Frequent Itemset Generation
Given d items, there are 2^d possible candidate
itemsets
23
Frequent Itemset Generation
  • Brute-force approach
  • Each itemset in the lattice is a candidate
    frequent itemset
  • Count the support of each candidate by scanning
    the database
  • Match each transaction against every candidate
  • Complexity O(NMw) ⇒ expensive since M = 2^d !!!
    (see the sketch below)
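As an illustration (a sketch added here, not from the slides), the brute-force approach in Python: with N transactions of width w and M = 2^d - 1 candidates, the nested loops below do exactly the O(NMw) work the slide warns about:

    from itertools import chain, combinations

    def brute_force_frequent(transactions, minsup_count):
        # enumerate all M = 2^d - 1 non-empty candidate itemsets
        items = sorted(set().union(*transactions))
        candidates = chain.from_iterable(
            combinations(items, k) for k in range(1, len(items) + 1))
        frequent = {}
        for cand in candidates:                  # M candidates
            cset = set(cand)
            count = sum(1 for t in transactions  # N transactions
                        if cset <= t)            # subset test: O(w)
            if count >= minsup_count:
                frequent[frozenset(cand)] = count
        return frequent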

24
Frequent Itemset Generation Strategies
  • Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M
  • Reduce the number of transactions (N)
  • Reduce size of N as the size of itemset increases
  • Used by DHP and vertical-based mining algorithms
  • Reduce the number of comparisons (NM)
  • Use efficient data structures to store the
    candidates or transactions
  • No need to match every candidate against every
    transaction

25
Reducing Number of Candidates
  • Apriori principle
  • If an itemset is frequent, then all of its
    subsets must also be frequent
  • The Apriori principle holds due to the following
    property of the support measure:
  • ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
  • The support of an itemset never exceeds the
    support of its subsets
  • This is known as the anti-monotone property of
    support

26
Illustrating Apriori Principle
27
Illustrating Apriori Principle
[Figure: items (1-itemsets); pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs; triplets (3-itemsets); minimum support count 3]
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
28
Association Rule Mining
  • Definition
  • Frequent itemsets
  • Association rules
  • Frequent itemset mining
  • breadth-first: Apriori
  • depth-first: Eclat
  • Association Rule Mining

29
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Candidate 1-itemsets, counts before the scan: A: 0, B: 0, C: 0, D: 0

30
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Candidate 1-itemset counts after transaction 1: A: 0, B: 1, C: 1, D: 0

31
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Candidate 1-itemset counts after transactions 1-2: A: 0, B: 2, C: 2, D: 0

32
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Candidate 1-itemset counts after transactions 1-3: A: 1, B: 2, C: 3, D: 1

33
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Candidate 1-itemset counts after transactions 1-4: A: 2, B: 3, C: 4, D: 2

34
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Candidate 1-itemset counts after all five transactions: A: 2, B: 4, C: 4, D: 3 (all frequent)

35
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
Candidate 2-itemsets: AB, AC, AD, BC, BD, CD

36
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
2-itemset counts after the scan: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2 (AB infrequent)

37
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
Frequent 2-itemsets: AC: 2, AD: 2, BC: 3, BD: 2, CD: 2
Candidate 3-itemsets: ACD, BCD (ABC and ABD are pruned: their subset AB is infrequent)

38
Apriori
  1. {B, C}
  2. {B, C}
  3. {A, C, D}
  4. {A, B, C, D}
  5. {B, D}

minsup = 2
Frequent 1-itemsets: A: 2, B: 4, C: 4, D: 3
Frequent 2-itemsets: AC: 2, AD: 2, BC: 3, BD: 2, CD: 2
3-itemset counts after the scan: ACD: 2, BCD: 1 (only ACD is frequent)

39
Apriori Algorithm
  • Apriori Algorithm
  • k := 1
  • C1 := { {A} | A is an item }
  • Repeat until Ck = ∅:
  • Count the support of each candidate in Ck
    in one scan over DB
  • Fk := { I ∈ Ck | I is frequent }
  • Generate new candidates:
    Ck+1 := { I | |I| = k+1 and all J ⊆ I with |J| = k
    are in Fk }
  • k := k+1
  • Return F1 ∪ ... ∪ Fk-1
  • (a runnable sketch follows below)
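A runnable Python sketch of this pseudocode (added for illustration; itemsets are represented as frozensets, and candidate generation joins pairs of frequent k-itemsets, which yields exactly the (k+1)-sets all of whose k-subsets are frequent):

    from itertools import combinations

    def apriori(db, minsup):
        db = [set(t) for t in db]
        # C1: all 1-itemsets occurring in DB
        candidates = {frozenset([a]) for t in db for a in t}
        result, k = {}, 1
        while candidates:
            # count every candidate in one scan over DB
            counts = {c: sum(1 for t in db if c <= t) for c in candidates}
            frequent = {c: n for c, n in counts.items() if n >= minsup}
            result.update(frequent)
            # C_{k+1}: (k+1)-sets all of whose k-subsets are in F_k
            fk = list(frequent)
            candidates = set()
            for i in range(len(fk)):
                for j in range(i + 1, len(fk)):
                    union = fk[i] | fk[j]
                    if len(union) == k + 1 and all(
                            frozenset(s) in frequent
                            for s in combinations(union, k)):
                        candidates.add(union)
            k += 1
        return result

    db = [{'B','C'}, {'B','C'}, {'A','C','D'}, {'A','B','C','D'}, {'B','D'}]
    print(apriori(db, 2))
    # finds A:2, B:4, C:4, D:3, AC:2, AD:2, BC:3, BD:2, CD:2, ACD:2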

40
Association Rule Mining
  • Definition
  • Frequent itemsets
  • Association rules
  • Frequent itemset mining
  • breadth-first: Apriori
  • depth-first: Eclat
  • Association Rule Mining

41
Depth-first strategy
  • Recursive procedure
  • FSET(DB): the frequent sets in DB
  • Based on divide-and-conquer
  • Count the frequency of all items
  • Let D be a frequent item
  • FSET(DB) =
  • frequent sets with item D
  • ∪ frequent sets without item D

42
Depth-first strategy
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
  • Frequent items
  • A, B, C, D
  • Frequent sets with D
  • remove transactions without D, and D itself, from
    DB
  • count frequent sets: A, B, C, AC
  • append D: AD, BD, CD, ACD
  • Frequent sets without D
  • remove D from all transactions in DB
  • find frequent sets: AC, BC
43
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
44
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Item counts in DB: A: 2, B: 4, C: 4, D: 3 (all frequent)
45
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3
DB^D (transactions containing D, with D removed): 3 {A, C}  4 {A, B, C}  5 {B}
Item counts in DB^D: A: 2, B: 2, C: 2
46
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3
DB^D: 3 {A, C}  4 {A, B, C}  5 {B}    (A: 2, B: 2, C: 2)
DB^CD (transactions of DB^D containing C, with C removed): 3 {A}  4 {A, B}
Item counts in DB^CD: A: 2
47
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3
DB^D: 3 {A, C}  4 {A, B, C}  5 {B}    (A: 2, B: 2, C: 2)
DB^CD: 3 {A}  4 {A, B}    (A: 2)
A is frequent in DB^CD, so AC: 2 is frequent in DB^D
48
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3
DB^D: 3 {A, C}  4 {A, B, C}  5 {B}    (A: 2, B: 2, C: 2; AC: 2)
49
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3
DB^D: 3 {A, C}  4 {A, B, C}  5 {B}    (A: 2, B: 2, C: 2; AC: 2)
DB^BD (transactions of DB^D containing B, with B and the processed C removed): 4 {A}
Item counts in DB^BD: A: 1 (infrequent; nothing found)
50
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3
DB^D: 3 {A, C}  4 {A, B, C}  5 {B}    (A: 2, B: 2, C: 2; AC: 2)
51
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3
DB^D: frequent sets A, B, C, AC; append D:
AD: 2, BD: 2, CD: 2, ACD: 2
52
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3; AD: 2, BD: 2, CD: 2; ACD: 2
53
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3; AD: 2, BD: 2, CD: 2; ACD: 2
DB^C (transactions containing C, with C and the processed D removed): 1 {B}  2 {B}  3 {A}  4 {A, B}
Item counts in DB^C: A: 2, B: 3
54
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3; AD: 2, BD: 2, CD: 2; ACD: 2
DB^C: 1 {B}  2 {B}  3 {A}  4 {A, B}    (A: 2, B: 3)
DB^BC (transactions of DB^C containing B, with B removed): 1 {}  2 {}  4 {A}
Item counts in DB^BC: A: 1 (infrequent; nothing found)
55
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3; AD: 2, BD: 2, CD: 2; ACD: 2
DB^C: 1 {B}  2 {B}  3 {A}  4 {A, B}    (A: 2, B: 3)
56
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
DB^C: frequent sets A, B; append C: AC: 2, BC: 3
Found so far: A: 2, B: 4, C: 4, D: 3; AD: 2, BD: 2, CD: 2; ACD: 2; AC: 2, BC: 3
57
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3; AD: 2, BD: 2, CD: 2; ACD: 2; AC: 2, BC: 3
58
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3; AD: 2, BD: 2, CD: 2; ACD: 2; AC: 2, BC: 3
DB^B (transactions containing B, with B and the processed C, D removed): 1 {}  2 {}  4 {A}  5 {}
Item counts in DB^B: A: 1 (infrequent; nothing found)
59
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Found so far: A: 2, B: 4, C: 4, D: 3; AD: 2, BD: 2, CD: 2; ACD: 2; AC: 2, BC: 3
60
Depth-First Algorithm
minsup = 2
DB: 1 {B, C}  2 {B, C}  3 {A, C, D}  4 {A, B, C, D}  5 {B, D}
Final set of frequent itemsets:
A: 2, B: 4, C: 4, D: 3; AD: 2, BD: 2, CD: 2; ACD: 2; AC: 2, BC: 3
61
Depth-first strategy
  • FSET(DB):
  • 1. Count the frequency of items in DB
  • 2. F := { {A} | A is frequent in DB }
  • 3. // Remove infrequent items from DB
  •    DB := { T ∩ F | T ∈ DB }
  • 4. For all frequent items D except the last one do
  •    // Find frequent, strict supersets of {D} in
       DB
  • 4a. Let DB_D := { T \ {D} | T ∈ DB, D ∈ T }
  • 4b. F := F ∪ { I ∪ {D} | I ∈ FSET(DB_D) }
  • 4c. // Remove D from DB
  •    DB := { T \ {D} | T ∈ DB }
  • 5. Return F
  • (a runnable sketch follows below)
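A direct Python transcription of FSET (a sketch added for illustration; the database is a list of sets and itemsets are frozensets):

    def fset(db, minsup):
        # 1. count item frequencies in DB
        counts = {}
        for t in db:
            for a in t:
                counts[a] = counts.get(a, 0) + 1
        freq_items = [a for a, n in counts.items() if n >= minsup]
        f = {frozenset([a]) for a in freq_items}
        # 3. remove infrequent items from DB
        keep = set(freq_items)
        db = [t & keep for t in db]
        # 4. for all frequent items D except the last one
        for d in freq_items[:-1]:
            # 4a. conditional database DB^D
            db_d = [t - {d} for t in db if d in t]
            # 4b. recurse, then re-attach D to every set found
            f |= {i | {d} for i in fset(db_d, minsup)}
            # 4c. remove D from DB
            db = [t - {d} for t in db]
        # 5. return the frequent sets
        return f

    db = [{'B','C'}, {'B','C'}, {'A','C','D'}, {'A','B','C','D'}, {'B','D'}]
    print(fset(db, 2))  # the same frequent sets as Apriori on this DB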

62
Depth-first strategy
  • All depth-first algorithms use this strategy
  • Difference: the data structure for DB
  • prefix-tree: FP-growth
  • vertical database: Eclat

63
ECLAT
  • For each item, store a list of transaction ids
    (tids)

[Figure: the transaction database and the TID-list of each item]
64
ECLAT
  • Support of item A = the length of its tidlist
  • Remove item A from DB = remove the tidlist of A
  • Create the conditional database DB^E:
  • intersect all other tidlists with the tidlist of
    E
  • only keep frequent items
  • (see the sketch below)

[Table: tidlists for items A through E; DB^E is obtained by intersecting the tidlists of A-D with E's tidlist and keeping the frequent items]
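A small Eclat sketch (added for illustration; the exact tidlists in the slide's table are not recoverable here, so the demo reuses the five-transaction example database). Support is the length of a tidlist, and conditional databases are built by tidlist intersection:

    def eclat(items, minsup, prefix=frozenset(), out=None):
        # items: list of (item, tidset) pairs, all already frequent
        if out is None:
            out = {}
        for i, (item, tids) in enumerate(items):
            itemset = prefix | {item}
            out[itemset] = len(tids)             # support = |tidlist|
            # conditional database: intersect the remaining tidlists
            suffix = [(o, tids & otids)
                      for o, otids in items[i + 1:]
                      if len(tids & otids) >= minsup]
            eclat(suffix, minsup, itemset, out)
        return out

    # build the vertical database (item -> set of transaction ids)
    db = [{'B','C'}, {'B','C'}, {'A','C','D'}, {'A','B','C','D'}, {'B','D'}]
    vertical = {}
    for tid, t in enumerate(db, start=1):
        for a in t:
            vertical.setdefault(a, set()).add(tid)
    items = [(a, s) for a, s in sorted(vertical.items()) if len(s) >= 2]
    print(eclat(items, minsup=2))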
65
Association Rule Mining
  • Definition
  • Frequent itemsets
  • Association rules
  • Frequent itemset mining
  • breadth-first: Apriori
  • depth-first: Eclat
  • Association Rule Mining

66
Association Rule Mining
  • Remember the original problem: find rules X → Y
    such that
  • support(X ∪ Y) ≥ minsup
  • support(X ∪ Y) / support(X) ≥ minconf
  • The frequent itemsets are the combinations X ∪ Y
  • Hence:
  • get X and Y by splitting up the frequent itemsets

67
Rule Generation
  • Given a frequent itemset L, find all non-empty
    subsets f ⊂ L such that f → L \ f satisfies the
    minimum confidence requirement
  • If {A, B, C, D} is a frequent itemset, the
    candidate rules are:
  • ABC → D, ABD → C, ACD → B, BCD → A, A → BCD,
    B → ACD, C → ABD, D → ABC, AB → CD, AC → BD,
    AD → BC, BC → AD, BD → AC, CD → AB
  • If |L| = k, then there are 2^k - 2 candidate
    association rules (ignoring L → ∅ and ∅ → L);
    see the sketch below
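A sketch (added for illustration) that enumerates all 2^k - 2 candidate rules of one frequent itemset and keeps those meeting minconf; the support counts are assumed to come from the mining phase:

    from itertools import combinations

    def rules_from_itemset(L, support_count, minconf):
        # L: frozenset; support_count: dict from frozenset to sigma(X)
        rules = []
        for k in range(1, len(L)):        # all non-empty proper subsets f
            for f in combinations(L, k):
                f = frozenset(f)
                conf = support_count[L] / support_count[f]
                if conf >= minconf:
                    rules.append((set(f), set(L - f), conf))
        return rules

    # example with sigma-counts from the demo database used earlier
    sc = {frozenset(x): n for x, n in
          [('A', 2), ('C', 4), ('D', 3), ('AC', 2), ('AD', 2),
           ('CD', 2), ('ACD', 2)]}
    print(rules_from_itemset(frozenset('ACD'), sc, minconf=0.8))
    # keeps A -> CD, AC -> D, AD -> C, CD -> A (confidence 1.0 each)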

68
Rule Generation
  • How to efficiently generate rules from frequent
    itemsets?
  • In general, confidence does not have an
    anti-monotone property
  • c(ABC → D) can be larger or smaller than c(AB → D)
  • But the confidence of rules generated from the
    same itemset has an anti-monotone property
  • e.g., L = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD)
    ≥ c(A → BCD)
  • Confidence is anti-monotone w.r.t. the number of
    items on the RHS of the rule

69
Rule Generation for Apriori Algorithm
[Figure: lattice of rules; if a rule has low confidence, the entire sub-lattice below it can be pruned]
70
Summary Association Rule Mining
  • Find associations X → Y
  • the rule appears in a sufficiently large part of
    the database
  • the conditional probability P(Y | X) is high
  • This problem can be split into two sub-problems
  • find frequent itemsets
  • split the frequent itemsets to get association
    rules
  • Finding frequent itemsets
  • Apriori property
  • breadth-first vs depth-first algorithms
  • From itemsets to association rules
  • split up frequent sets, use anti-monotonicity

71
Outline
  • Association Rule Mining
  • Frequent itemsets and association rules
  • Algorithms: Apriori and Eclat
  • Sequential Pattern Mining
  • Mining frequent episodes
  • Algorithms: WinEpi and MinEpi
  • Other types of patterns
  • strings, graphs, ...
  • process mining

72
Series and Sequences
  • In many applications, the order and transaction
    times are very important
  • stock prices
  • events in a networking environment
  • crash, starting a program, certain commands
  • Specific format of the data is very important
  • Goal: find temporal rules; order is important.

73
Series and Sequences
  • Example
  • 70% of the customers that buy shoes and socks
    will buy shoe polish within 5 days.
  • User U1 logging on, followed by user U2 starting
    program P, is always followed by a crash.
  • Here, we will concentrate on the problem of
    finding frequent episodes
  • can be used in the same way as itemsets
  • split episodes to get the rules

74
Episode Mining
  • Event sequence: a sequence of pairs (e, t), where
    e is an event and t an integer indicating the time
    of occurrence of e.
  • A linear episode is a sequence of events
    <e1, ..., en>.
  • A window of length w is an interval [s, e] with
    (e - s + 1) = w.
  • An episode E = <e1, ..., en> occurs in sequence
    S = <(s1, t1), ..., (sm, tm)> within window
    W = [s, e] if there exist integers
    s ≤ i1 < ... < in ≤ e such that for all j = 1..n,
    (ej, ij) is in S.

75
Episode mining support measure
  • Given a sequence S: find all linear episodes that
    occur frequently in S

76
Episode mining support measure
  • Given a sequence S: find all linear episodes that
    occur frequently in S
  • Given an integer w: the w-support of an episode
    E = <e1, ..., en> in a sequence S = <(s1, t1), ...,
    (sm, tm)> is the number of windows W of length w
    such that E occurs in S within window W.
  • Note: if an episode occurs in a very short time
    span, it will be in many subsequent windows, and
    thus contribute a lot to the support count!
  • (see the sketch below)
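A sketch of w-support counting (added for illustration; window-counting conventions vary, so this follows the [s, s+w-1] windows of the example on the next slide):

    def occurs_in(episode, events):
        # greedy test: does <e1, ..., en> occur, in order, in the
        # time-ordered event list?
        it = iter(e for e, t in events)
        return all(e in it for e in episode)

    def w_support(episode, seq, w):
        # seq: list of (event, time) pairs sorted by time
        t_min, t_max = seq[0][1], seq[-1][1]
        count = 0
        for s in range(t_min - w + 1, t_max + 1):  # windows [s, s+w-1]
            window = [(e, t) for e, t in seq if s <= t <= s + w - 1]
            if occurs_in(episode, window):
                count += 1
        return count

    # the example from the next slide: the 5-support of <b, a, c> is 3
    S = list(zip("baacbaabc", range(1, 10)))
    print(w_support(["b", "a", "c"], S, 5))  # 3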

77
Example
  • S = <b a a c b a a b c>
  • E = <b a c>
  • E occurs in S within window [0, 4], within [1, 4],
    within [5, 9], ...
  • The 5-support of E in S is 3, since E only occurs
    in the following windows of length 5:
    [0, 4], [1, 5], [5, 9]

78
  • An episode E1 = <e1, ..., en> is a sub-episode of
    E2 = <f1, ..., fm>, denoted E1 ⊑ E2, if there
    exist integers 1 ≤ i1 < ... < in ≤ m such that for
    all j = 1..n, ej = f_ij.
  • Example
  • <b, a, a, c> is a sub-episode of <a, b, c, a,
    a, b, c>.
  • (see the sketch below)
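The sub-episode test is a plain subsequence check; a short sketch (added for illustration):

    def is_subepisode(e1, e2):
        # E1 is a sub-episode of E2 iff E1 embeds into E2 in order;
        # a greedy scan over an iterator does exactly that
        it = iter(e2)
        return all(e in it for e in e1)

    print(is_subepisode(list("baac"), list("abcaabc")))  # True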

79
  • Episode Mining Problem
  • Given a sequence S, a minimal support minsup, and
    a window width w, find all episodes that have a
    w-support above minsup.
  • Monotonicity
  • Let S be a sequence, E1, E2 episodes, w a number.
  • If E1 ⊑ E2, then w-support(E2) ≤ w-support(E1).

80
WinEpi Algorithm
  • We can again apply a level-wise algorithm like
    Apriori.
  • Start with small episodes; only proceed with a
    larger episode if all its sub-episodes are
    frequent.
  • <a, a, b> is evaluated after <a>, <b>, <a, a>,
    <a, b>, and only if all these episodes were
    frequent.
  • Counting the frequency:
  • slide a window over the stream
  • use a smart update technique for the supports

81
Search space
Number of episodes of length k: e^k (e is the
number of events). An episode of length k has
maximally k sub-episodes of length k-1. We can
count supports by sliding a window over the
sequence.
82
Example
  • S = <(a,1), (b,2), (c,4), (b,5), (b,6), (a,7),
    (b,8), (b,9), (c,13), (a,14), (c,17), (c,18)>
  • w = 4, minsup = 3
  • C1 = {<a>, <b>, <c>}

[Timeline figure: the events of S plotted on a time axis]
83
Example
  • S = <(a,1), (b,2), (c,4), (b,5), (b,6), (a,7),
    (b,8), (b,9), (c,13), (a,14), (c,17), (c,18)>
  • w = 4, minsup = 3
  • C1 = {<a>, <b>, <c>}
  • Slide a window of length 4 over S
  • 4-supports: <a>: 12, <b>: 12, <c>: 14

[Timeline figure: the events of S plotted on a time axis]
84
Example
  • S = <(a,1), (b,2), (c,4), (b,5), (b,6), (a,7),
    (b,8), (b,9), (c,13), (a,14), (c,17), (c,18)>
  • w = 4, minsup = 3
  • C1 = {<a>, <b>, <c>}
  • Slide a window of length 4 over S
  • 4-supports: <a>: 12, <b>: 12, <c>: 14
  • C2 = {<a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>,
    <c,a>, <c,b>, <c,c>}

[Timeline figure: the events of S plotted on a time axis]
85
Example
  • S = <(a,1), (b,2), (c,4), (b,5), (b,6), (a,7),
    (b,8), (b,9), (c,13), (a,14), (c,17), (c,18)>
  • w = 4, minsup = 3
  • C1 = {<a>, <b>, <c>}
  • Slide a window of length 4 over S
  • 4-supports: <a>: 12, <b>: 12, <c>: 14
  • C2 = {<a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>,
    <c,a>, <c,b>, <c,c>}
  • 4-supports: <a,a>: 0, <a,b>: 6, <a,c>: 2,
    <b,a>: 3, <b,b>: 7, <b,c>: 3, <c,a>: 3,
    <c,b>: 1, <c,c>: 3

[Timeline figure: the events of S plotted on a time axis]
86
Example
  • S = <(a,1), (b,2), (c,4), (b,5), (b,6), (a,7),
    (b,8), (b,9), (c,13), (a,14), (c,17), (c,18)>
  • w = 4, minsup = 3
  • C1 = {<a>, <b>, <c>}
  • Slide a window of length 4 over S
  • 4-supports: <a>: 12, <b>: 12, <c>: 14
  • C2 = {<a,a>, <a,b>, <a,c>, <b,a>, <b,b>, <b,c>,
    <c,a>, <c,b>, <c,c>}
  • 4-supports: <a,a>: 0, <a,b>: 6, <a,c>: 2,
    <b,a>: 3, <b,b>: 7, <b,c>: 3, <c,a>: 3,
    <c,b>: 1, <c,c>: 3
  • C3 = {<a,b,b>, <b,a,b>, <b,b,a>, <b,b,b>,
    <b,b,c>, <b,c,a>, <b,c,c>, <c,c,a>, <c,c,c>}
  • 4-supports: <a,b,b>: 2, <b,a,b>: 2, <b,b,a>: 2,
    <b,b,b>: 2, <b,b,c>: 0, <b,c,a>: 0, <b,c,c>: 0,
    <c,c,a>: 0, <c,c,c>: 0
  • No 3-episode is frequent.

[Timeline figure: the events of S plotted on a time axis]
87
MinEpi
  • Very similar algorithm
  • based on another support measure
  • minimal occurrence of a sequence: the smallest
    window in which the sequence occurs
  • support of E: the number of minimal occurrences
    of E with a width less than w
  • S = <a b c b b a b b c a c c c b b>, window
    length 5
  • 5-support of <a b b> = ?
  • mo-support of <a b b> = ?

88
MinEpi
  • Very similar algorithm
  • based on another support measure
  • minimal occurrence of a sequence: the smallest
    window in which the sequence occurs
  • support of E: the number of minimal occurrences
    of E with a width less than w
  • S = <a b c b b a b b c a c c c b b>, window
    length 5
  • 5-support of <a b b> = 5
    (windows [0, 4], [1, 5], [4, 8], [5, 9], [6, 10])
  • mo-support of <a b b> = ?

89
MinEpi
  • Very similar algorithm
  • based on another support measure
  • minimal occurrence of a sequence: the smallest
    window in which the sequence occurs
  • support of E: the number of minimal occurrences
    of E with a width less than w
  • S = <a b c b b a b b c a c c c b b>, window
    length 5
  • 5-support of <a b b> = 5
  • mo-support of <a b b> = 2
    (minimal occurrences of width < 5: [1, 4] and
    [6, 8]; the third minimal occurrence, [10, 15],
    is too wide; see the sketch below)
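A sketch of mo-support (added for illustration; per the slide, only minimal occurrences of width less than w are counted):

    def minimal_occurrences(episode, seq):
        # seq: list of (event, time) pairs sorted by time; returns windows
        # (t_start, t_end) containing the episode but no proper sub-window
        occs = []
        for i, (e, t_start) in enumerate(seq):
            if e != episode[0]:
                continue
            # greedily match the rest of the episode as early as possible
            j, t_end, ok = i, t_start, True
            for sym in episode[1:]:
                j += 1
                while j < len(seq) and seq[j][0] != sym:
                    j += 1
                if j == len(seq):
                    ok = False
                    break
                t_end = seq[j][1]
            if ok:
                occs.append((t_start, t_end))
        # keep only windows with no other occurrence strictly inside them
        return [w1 for w1 in occs
                if not any(w2 != w1 and w1[0] <= w2[0] and w2[1] <= w1[1]
                           for w2 in occs)]

    def mo_support(episode, seq, w):
        return sum(1 for s, e in minimal_occurrences(episode, seq)
                   if e - s + 1 < w)

    S = list(zip("abcbbabbcacccbb", range(1, 16)))
    print(minimal_occurrences(list("abb"), S))  # [(1, 4), (6, 8), (10, 15)]
    print(mo_support(list("abb"), S, 5))        # 2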

90
Sequential Pattern Mining Summary
  • Mining sequential episodes
  • Two definitions of support
  • w-support
  • mo-support
  • Two algorithms
  • WinEpi
  • MinEpi
  • Based on the monotonicity principle
  • generate candidates level-wise
  • only count candidates without infrequent
    sub-episodes

91
Outline
  • Association Rule Mining
  • Frequent itemsets and association rules
  • Algorithms: Apriori and Eclat
  • Sequential Pattern Mining
  • Mining frequent episodes
  • Algorithms: WinEpi and MinEpi
  • Other types of patterns
  • strings, graphs, ...
  • process mining

92
Other types of patterns
  • Sequence problems
  • Strings
  • Other types of sequences
  • Other patterns in sequences
  • Graphs
  • Molecules
  • WWW
  • Social Networks

93
Strings require different support measures
IR context: what if the query or a document contains
typos or misspellings?
A subsequence of a string is obtained by deleting
zero or more characters.
94
Other Types of Sequences
  • CGATGGGCCAGTCGATACGTCGATGCCGATGTCACGA

95
Other Patterns in Sequences
  • Substrings
  • Regular expressions (bbb2)
  • Partial orders
  • Directed Acyclic Graphs

96
Graphs
97
Patterns in Graphs
98
Rules
[Figure: three graph-pattern rules with antecedent frequency f = 4 and rule frequencies f = 8, f = 5, f = 7, giving confidences 4/8 = 0.5, 4/5 = 0.8, and 4/7 ≈ 0.57]
99
Summary
  • What data mining is and why it is important
  • huge volumes of data
  • not enough human analysts
  • Pattern discovery as an important descriptive
    data mining task
  • association rule mining
  • sequential pattern mining
  • Important principles
  • Apriori principle
  • breadth-first vs depth-first algorithms
  • Many kinds and varieties of data types, pattern
    types, support measures, ...