Title: Data Mining and Knowledge Acquisition Chapter 5
1. Data Mining and Knowledge Acquisition
Chapter 5
2. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
3. What Is Association Mining?
- Association rule mining
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications
- Market basket analysis, cross-marketing, catalog design, etc.
- Examples
- Rule form: Body ⇒ Head [support, confidence]
- buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
- major(x, "MIS") ∧ takes(x, "DM") ⇒ grade(x, "AA") [1%, 75%]
4. Association Rules: Basic Concepts
- Given
- (1) a database of transactions,
- (2) each transaction is a list of items (purchased by a customer in one visit)
- Find all rules that correlate the presence of one set of items with that of another set of items
- E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- The user specifies
- Minimum support level
- Minimum confidence level
- Rules exceeding the two thresholds are listed as interesting
5. Basic Concepts (cont.)
- I = {i1, ..., im}: the set of all items; T: any transaction
- A ⊆ T: T contains the itemset A
- A ⊆ T, B ⊆ T: A, B are itemsets
- Examine rules of the form
- A ⇒ B, where
- A ∩ B = ∅
- support s = P(A ∪ B)
- the frequency of transactions containing both A and B
- confidence c = P(B|A) = P(A ∪ B) / P(A)
- the conditional probability that a transaction containing A also contains B
6. Rule Measures: Support and Confidence
[Venn diagram: customers who buy beer, customers who buy diapers, and the overlap of customers who buy both]
- Find all rules X ∧ Y ⇒ Z with minimum confidence and support
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction containing {X, Y} also contains Z
- With minimum support 50% and minimum confidence 50%, we have
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)
7. Frequent Itemsets
- Strong association rules
- rule support > min_support
- rule confidence > min_confidence
- k-itemset: an itemset containing k items
- occurrence frequency = count = support count
- Minimum support count
- min_sup × |transactions in the database|
- Frequent itemsets
- itemsets satisfying the minimum support count
- The Apriori algorithm has two steps:
- (1) Find all frequent itemsets
- (2) Generate strong association rules from the frequent itemsets
8. Mining Association Rules, An Example (1)
Min_support = 50%, Min_confidence = 50%, Min_count = 0.5 × 4 = 2
- {A}, {B}, {C}, {D} are 1-itemsets
- {A}, {B}, {C} are frequent 1-itemsets as, e.g.,
- Count{A} = 3 > 2 (minimum count), or
- Support{A} = 75% > 50% (minimum support)
- {D} is not a frequent 1-itemset as
- Count{D} = 1 < 2 (minimum count), or
- Support{D} = 25% < 50% (minimum support)
9. Mining Association Rules, An Example (2)
Min_support = 50%, Min_confidence = 50%, Min_count = 0.5 × 4 = 2
- {A,B}, {A,C}, {A,D}, {B,C} are 2-itemsets
- {A,C} is a frequent 2-itemset as
- Count{A,C} = 2 ≥ 2 (minimum count), or
- Support{A,C} = 50% ≥ 50% (minimum support)
- {A,B}, {A,D} are not frequent 2-itemsets as, e.g.,
- Count{A,D} = 1 < 2 (minimum count), or
- Support{A,D} = 25% < 50% (minimum support)
10. Mining Association Rules, An Example (3)
Min. support 50%, Min. confidence 50%
- For rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
- A strong rule, as support > min_support
- and confidence > min_confidence
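To make the two measures concrete, here is a minimal Python sketch that computes them over the four-transaction database this example appears to use (T1 = {A,B,C}, T2 = {A,C}, T3 = {A,D}, T4 = {B,E,F}; the exact transactions are an assumption reconstructed from the counts quoted on these slides):

```python
# Four-transaction example DB (assumed from the counts on these slides).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))        # 0.5    -> 50%
print(confidence({"A"}, {"C"}))   # 0.666  -> 66.6%
print(confidence({"C"}, {"A"}))   # 1.0    -> 100%
```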
11. The Apriori Principle
Min. support 50%, Min. confidence 50%
- The Apriori principle
- Any subset of a frequent itemset must be frequent
- {A,C} is a frequent 2-itemset
- A and C, the subsets of {A,C}, must be frequent 1-itemsets
12. The Apriori Algorithm Has Two Steps
- (1) Find the frequent itemsets: the sets of items that have minimum support (the key step)
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets),
- until Lk, the set of frequent k-itemsets, is empty
- (2) Use the frequent itemsets to generate association rules.
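As a concrete illustration of step (1), here is a minimal, self-contained Python sketch of the level-wise search (a simplified sketch, not the book's exact pseudocode: itemsets are frozensets, and candidates are formed by unioning pairs of frequent k-itemsets and pruning with the Apriori principle):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Step (1): find all frequent itemsets, level by level."""
    def freq(cands):
        # keep only candidates meeting the minimum support count
        return {c for c in cands
                if sum(c <= t for t in transactions) >= min_count}

    items = {frozenset([i]) for t in transactions for i in t}
    L, k, result = freq(items), 1, []          # L1: frequent 1-itemsets
    while L:
        result.extend(L)
        # candidate (k+1)-itemsets: unions of two frequent k-itemsets ...
        cands = {a | b for a in L for b in L if len(a | b) == k + 1}
        # ... kept only if every k-subset is frequent (Apriori principle)
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))}
        L, k = freq(cands), k + 1
    return result

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(db, min_count=2))  # {A}, {B}, {C}, {A,C}
```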
13. Generation of Frequent Itemsets from Candidate Itemsets (Step 1)
- C1 → L1 → C2 → L2 → C3 → L3 → C4 → L4 → ...
- From Ck (candidate k-itemsets) generate Lk (Ck → Lk)
- From candidate k-itemsets generate frequent k-itemsets
- (a) Using the Apriori principle:
- eliminate itemset sk in Ck if
- at least one (k-1)-subset of sk is not in Lk-1
- (b) For the candidate k-itemsets remaining in Ck:
- make a database scan to eliminate those itemsets whose support counts are below the minimum support count
- From frequent k-itemsets Lk generate candidate (k+1)-itemsets Ck+1 (Lk → Ck+1)
- Self-joining Lk with Lk
14. The Self-Join Operation
- Sort the items in each li ∈ Lk in some lexicographic order
- li1 < li2 < ... < li(k-1) < lik
- li and lj are elements of Lk (li, lj ∈ Lk)
- li and lj are joinable if
- li1 = lj1 and li2 = lj2 and ... and li(k-1) = lj(k-1),
- and lik < ljk
- The first k-1 elements are the same
- Only the last elements differ
- For all li, lj satisfying the above condition,
- construct the itemset lk+1:
- (li1, li2, ..., li(k-1), lik, ljk)
- the first k-1 items (the common items) are taken from li or lj
- the k-th item is taken from li
- the (k+1)-th item is taken from lj
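A short Python sketch of this join, assuming each itemset is given as a set of comparable items; it reproduces the L3 example on the next slide:

```python
def self_join(Lk):
    """li and lj in Lk are joinable when their first k-1 (sorted) items
    agree and li's last item precedes lj's; the join appends lj's last item."""
    Lk = [sorted(l) for l in Lk]
    out = []
    for li in Lk:
        for lj in Lk:
            if li[:-1] == lj[:-1] and li[-1] < lj[-1]:
                out.append(li + [lj[-1]])   # li1..lik plus ljk
    return out

print(self_join([{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"},
                 {"a", "c", "e"}, {"b", "c", "d"}]))
# [['a','b','c','d'], ['a','c','d','e']]  (before Apriori pruning)
```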
15. Example of the Self-Join Operation
- Lexicographic order: alphabetic, a < b < c < d ...
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining L3 with L3 (Step 2)
- abcd from abc and abd
- acde from acd and ace
- Pruning by the Apriori principle (Step 1a)
- acde is removed because ade is not in L3
- C4 = {abcd}
16. The Apriori Algorithm, Example (min. support count = 2)
[Diagram: database D is scanned to produce C1, pruned to L1; L1 is joined to form C2, D is scanned again to get L2; L2 is joined to form C3, and a third scan of D yields L3]
17. Example 6.1 (Han)
- TID: list of item IDs
- T100: 1, 2, 5
- T200: 2, 4
- T300: 2, 3
- T400: 1, 2, 4
- T500: 1, 3
- T600: 2, 3
- T700: 1, 3
- T800: 1, 2, 3, 5
- T900: 1, 2, 3
- 9 transactions in D; minimum transaction support count = 2 (min_sup = 2/9 = 22%); min. confidence = 70%
- Find strong association rules having a min. support count of 2 and min. confidence of 70%
18. Data Dictionary
- 1 = milk
- 2 = apple
- 3 = butter
- 4 = bread
- 5 = orange
19. 1st and 2nd Iterations of the Algorithm
- C1 (itemset: sup_count): {1}: 6, {2}: 7, {3}: 6, {4}: 2, {5}: 2
- All five meet the minimum support count, so L1 = C1
- C2 = L1 joined with L1
- C2 (itemset: sup_count): {1,2}: 4, {1,3}: 4, {1,4}: 1 (✗), {1,5}: 2, {2,3}: 4, {2,4}: 2, {2,5}: 2, {3,4}: 0 (✗), {3,5}: 1 (✗), {4,5}: 0 (✗)
- L2 = the frequent 2-itemsets, i.e., those itemsets in C2 having minimum support (Step 1b):
- {1,2}: 4, {1,3}: 4, {1,5}: 2, {2,3}: 4, {2,4}: 2, {2,5}: 2
20. 3rd Iteration
- Self-join to get C3 (Step 2)
- C3 = L2 joined with L2 = {1 2 3}, {1 2 5}, {1 3 5}, {2 3 4}, {2 3 5}, {2 4 5}
- Now Step 1a: apply the Apriori principle to every itemset in C3
- 2-item subsets of {1 2 3}: {1 2}, {1 3}, {2 3}
- all 2-item subsets are members of L2
- keep {1 2 3} in C3
- 2-item subsets of {1 2 5}: {1 2}, {1 5}, {2 5}
- all 2-item subsets are members of L2
- keep {1 2 5} in C3
- 2-item subsets of {1 3 5}: {1 3}, {1 5}, {3 5}
- {3 5} is not a member of L2, so it is not frequent
- remove {1 3 5} from C3
21. 3rd Iteration (cont.)
- 2-item subsets of {2 3 4}: {2 3}, {2 4}, {3 4}
- {3 4} is not a member of L2, so it is not frequent
- remove {2 3 4} from C3
- 2-item subsets of {2 3 5}: {2 3}, {2 5}, {3 5}
- {3 5} is not a member of L2, so it is not frequent
- remove {2 3 5} from C3
- 2-item subsets of {2 4 5}: {2 4}, {2 5}, {4 5}
- {4 5} is not a member of L2, so it is not frequent
- remove {2 4 5} from C3
- C3 = {1 2 3}, {1 2 5} after pruning
22. 4th Iteration
- C3 → L3: check min. support (Step 1b)
- L3 = those itemsets having minimum support
- L3 (itemset: sup_count):
- {1 2 3}: 2
- {1 2 5}: 2
- L3 joined with L3 to generate C4 (Step 2)
- L3 joined with L3 = {1 2 3 5}
- pruned, since its subset {2 3 5} is not frequent
- C4 = ∅
- the algorithm terminates
23. Generating Association Rules from Frequent Itemsets
- Strong rules satisfy
- min. support and min. confidence
- confidence(A ⇒ B) = P(B|A) = sup_count(A ∪ B) / sup_count(A)
- For each frequent itemset l:
- generate the non-empty subsets of l, denoted s
- For each s ⊂ l:
- construct the rule s ⇒ (l - s)
- Rules satisfying the condition
- sup_count(l) / sup_count(s) ≥ min_conf
- are listed as interesting
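A minimal Python sketch of this rule-generation step, assuming the support counts were already collected by Apriori (the `sup_count` mapping below is filled with the counts from Example 6.2):

```python
from itertools import chain, combinations

def rules_from(l, sup_count, min_conf):
    """For each non-empty proper subset s of frequent itemset l, emit
    s => (l - s) when sup_count(l) / sup_count(s) >= min_conf."""
    l = frozenset(l)
    subsets = chain.from_iterable(combinations(l, r) for r in range(1, len(l)))
    for s in map(frozenset, subsets):
        conf = sup_count[l] / sup_count[s]
        if conf >= min_conf:
            yield (set(s), set(l - s), conf)

# support counts from Example 6.2, for l = {1, 2, 5}
counts = {frozenset(f): c for f, c in [
    ({1}, 6), ({2}, 7), ({5}, 2),
    ({1, 2}, 4), ({1, 5}, 2), ({2, 5}, 2), ({1, 2, 5}, 2)]}

for lhs, rhs, conf in rules_from({1, 2, 5}, counts, 0.7):
    print(lhs, "=>", rhs, round(conf, 2))
# {1,5} => {2} 1.0 ; {2,5} => {1} 1.0 ; {5} => {1,2} 1.0
```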
24. Example 6.2 (Han, cont.)
- The frequent 3-itemset l = {1 2 5} (transactions containing milk, apple, and orange) is frequent
- The non-empty proper subsets of l are
- {1 2}, {1 5}, {2 5}, {1}, {2}, {5}
- The resulting association rules are
- 1 ∧ 2 ⇒ 5, conf = 2/4 = 50%
- 1 ∧ 5 ⇒ 2, conf = 2/2 = 100%
- 2 ∧ 5 ⇒ 1, conf = 2/2 = 100%
- 1 ⇒ 2 ∧ 5, conf = 2/6 = 33%
- 2 ⇒ 1 ∧ 5, conf = 2/7 = 29%
- 5 ⇒ 1 ∧ 2, conf = 2/2 = 100%
- If min_conf = 70%, the 2nd, 3rd, and last rules are strong
25. Example 6.2 (cont.): Detail on Confidence for Two Rules
- For the rule
- 1 ∧ 5 ⇒ 2: conf = s(1,2,5) / s(1,5)
- conf = 2/2 = 100% > 70%
- A strong rule
- For the rule
- 2 ⇒ 1 ∧ 5: conf = s(1,2,5) / s(2)
- conf = 2/7 = 29% < 70%
- Not a strong rule
26. Exercise
- Find all strong association rules in Example 6.2
- Check minimum confidence
- for the 2-frequent itemsets
- {1,2}, {1,3}, {1,5}, {2,3}, {2,4}, {2,5}
- 1 ⇒ 2, 2 ⇒ 1
- 2 ⇒ 5, 5 ⇒ 2, etc.
- for the 3-frequent itemset
- {1,2,5}
- 1 ∧ 2 ⇒ 5
- 5 ⇒ 1 ∧ 2, etc.
27. Exercise
- a) Suppose A ⇒ B and B ⇒ C are strong rules.
- Does this imply that A ⇒ C is also a strong rule?
- b) Suppose A ⇒ B and A ⇒ C are strong rules.
- Does this imply that B ⇒ C is also a strong rule?
- c) Suppose A ⇒ C and B ⇒ C are strong rules.
- Does this imply that A ∧ B ⇒ C is also a strong rule?
28. Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
- To find the frequent itemset i1 i2 ... i100:
- # of scans: 100
- # of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
29. Is Apriori Fast Enough? Performance Bottlenecks
- The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
- Huge candidate sets:
- 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates.
- Multiple scans of the database:
- Needs (n + 1) scans, where n is the length of the longest pattern
30. Mining Frequent Patterns Without Candidate Generation
- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
- highly condensed, but complete for frequent pattern mining
- avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
- A divide-and-conquer methodology: decompose mining tasks into smaller ones
- Avoid candidate generation: sub-database test only!
31. Construct an FP-tree from a Transaction DB
TID | Items bought            | (Ordered) frequent items
100 | f, a, c, d, g, i, m, p  | f, c, a, m, p
200 | a, b, c, f, l, m, o     | f, c, a, b, m
300 | b, f, h, j, o           | f, b
400 | b, c, k, s, p           | c, b, p
500 | a, f, c, e, l, p, m, n  | f, c, a, m, p
min_support = 0.5
- Steps:
- Scan the DB once, find frequent 1-itemsets (single-item patterns)
- Order frequent items in frequency-descending order
- Scan the DB again, construct the FP-tree
32. Benefits of the FP-tree Structure
- Completeness
- never breaks a long pattern of any transaction
- preserves complete information for frequent pattern mining
- Compactness
- reduces irrelevant information: infrequent items are gone
- frequency-descending ordering: more frequent items are more likely to be shared
- never larger than the original database (if node-links and counts are not counted)
- Example: for the Connect-4 DB, the compression ratio can be over 100
33. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
34. Multiple-Level Association Rules
- Items often form a hierarchy.
- Items at lower levels are expected to have lower support.
- Rules regarding itemsets at appropriate levels can be quite useful.
- Transaction databases can be encoded based on dimensions and levels
- We can explore shared multi-level mining
35. Mining Multi-Level Associations
- A top-down, progressive deepening approach:
- First find high-level strong rules:
- milk ⇒ bread [20%, 60%]
- Then find their lower-level "weaker" rules:
- 2% milk ⇒ wheat bread [6%, 50%]
- Variations at mining multiple-level association rules:
- Level-crossed association rules:
- 2% milk ⇒ Wonder wheat bread
- Association rules with multiple, alternative hierarchies:
- 2% milk ⇒ Wonder bread
36. Multi-level Association: Uniform Support vs. Reduced Support
- Uniform support: the same minimum support for all levels
- One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
- Lower-level items do not occur as frequently. If the support threshold is
- too high ⇒ miss low-level associations
- too low ⇒ generate too many high-level associations
- Reduced support: reduced minimum support at lower levels
- There are 4 search strategies:
- Level-by-level independent
- Level-cross filtering by k-itemset
- Level-cross filtering by single item
- Controlled level-cross filtering by single item
37. Uniform Support
Multi-level mining with uniform support:
- Level 1 (min_sup = 5%): milk [support = 10%]
- Level 2 (min_sup = 5%): 2% milk [support = 6%], skim milk [support = 4%]
38. Reduced Support
Multi-level mining with reduced support:
- Level 1 (min_sup = 5%): milk [support = 10%]
- Level 2 (min_sup = 3%): 2% milk [support = 6%], skim milk [support = 4%]
39.
- Controlled level-cross filtering by single item
- Specify a "level passage threshold" (LPT) for each level k:
- min_sup_T(k+1) < LPT(k) < min_sup_T(k)
- Example:
- High level: milk
- min. supp. = 5%
- Low level: 2% milk, skim milk
- min. supp. = 3%
- Level passage threshold = 4%
40. Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items.
- Example:
- milk ⇒ wheat bread [support = 8%, confidence = 70%]
- 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
41. Multi-Level Mining: Progressive Deepening
- A top-down, progressive deepening approach:
- First mine high-level frequent items:
- milk (15%), bread (10%)
- Then mine their lower-level "weaker" frequent itemsets:
- 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across multiple levels lead to different algorithms:
- If adopting the same min_support across multiple levels,
- then toss t if any of t's ancestors is infrequent.
- If adopting reduced min_support at lower levels,
- then examine only those descendants whose ancestors' support is frequent/non-negligible.
42. Progressive Refinement of Data Mining Quality
- Why progressive refinement?
- Mining operators can be expensive or cheap, fine or rough
- Trade speed for quality: step-by-step refinement.
- Superset coverage property:
- Preserve all the positive answers: allow a false positive test, but not a false negative test.
- Two- or multi-step mining:
- First apply a rough/cheap operator (superset coverage)
- Then apply an expensive algorithm on a substantially reduced candidate set (Koperski & Han, SSD'95).
43. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
44. Interestingness Measurements
- Objective measures
- Two popular measurements:
- support, and
- confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD'95)
- A rule (pattern) is interesting if
- it is unexpected (surprising to the user), and/or
- actionable (the user can do something with it)
45. Criticism of Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98)
- Among 5000 students:
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
- play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence
46. Criticism of Support and Confidence (cont.)
- Example 2:
- X and Y: positively correlated
- X and Z: negatively related
- the support and confidence of X ⇒ Z dominate
- We need a measure of dependent or correlated events
- P(B|A)/P(B) is also called the lift of rule A ⇒ B
47. Other Interestingness Measures: Interest
- Interest (correlation, lift)
- takes both P(A) and P(B) into consideration
- P(A ∧ B) = P(A) × P(B), if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
48. Example
- Total transactions: 10,000
- Items: C = computers, V = videos
- V: 7,500; C: 6,000; C and V: 4,000
- Min_support = 0.3, min_conf = 0.50
- Consider the rule:
- buys(X, computer) ⇒ buys(X, video)
- Support = 4000/10000 = 0.4
- Confidence = P(C and V)/P(C) = 4000/6000 = 66%
- Strong, but:
- the probability of buying a video is 0.75; buying a computer reduces the probability of buying a video from 0.75 to 0.66
- Computers and videos are negatively correlated
49.
- Lift of A ⇒ B
- Lift = P(A and B) / (P(A) × P(B))
- P(A and B) = P(B|A) × P(A), hence
- Lift = P(B|A) / P(B)
- The ratio of the probability of buying A and B together to the probability of buying A and B independently
- Or it can be interpreted as:
- the conditional probability of buying B given that A is purchased, divided by the unconditional probability of buying B
50.
Contingency table (counts out of 10,000 transactions):
        C      not C   total
V       4000   3500    7500
not V   2000   500     2500
total   6000   4000    10000
Lift(C ⇒ V) = P(C and V) / (P(C) × P(V)) = P(V|C) / P(V) = 0.4 / (0.6 × 0.75) ≈ 0.89 < 1, so there is a negative correlation between video and computer
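The same lift computation as a few lines of Python:

```python
# Lift for the computer/video example: 10,000 transactions,
# P(C) = 0.6, P(V) = 0.75, P(C and V) = 0.4.
p_c, p_v, p_cv = 6000 / 10000, 7500 / 10000, 4000 / 10000
lift = p_cv / (p_c * p_v)
print(round(lift, 2))  # 0.89 < 1: computers and videos are negatively correlated
```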
51. Are All the Rules Found Interesting?
- "buy walnuts ⇒ buy milk [1%, 80%]" is misleading
- if 85% of customers buy milk
- Support and confidence are not good for representing correlations
- So many interestingness measures? (Tan, Kumar & Srivastava @KDD'02)
52. All-Confidence
- All-confidence:
- all_conf = sup(X) / max{sup(Xi) | Xi ∈ X}
- X = (X1, X2, ..., Xk)
- For k = 2:
- the rules are X1 ⇒ X2 and X2 ⇒ X1
- all_conf = sup(X1, X2) / max{sup(X1), sup(X2)}
- Here sup(X1, X2)/sup(X1) is the confidence of the rule X1 ⇒ X2
- Ex.: all_conf = 0.4 / max(0.6, 0.75) = 0.4/0.75 = 0.53
53. Cosine
- Cosine = P(A, B) / sqrt(P(A) × P(B))
- Similar to lift, but takes the square root of the denominator
- Both cosine and all_conf are null-invariant:
- not affected by null transactions
- Ex.: cosine = 0.4 / sqrt(0.6 × 0.75) ≈ 0.60
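Both measures for the computer/video example, as a short Python check:

```python
from math import sqrt

# Support figures from the computer/video example:
# sup(C) = 0.6, sup(V) = 0.75, sup({C, V}) = 0.4
sup_c, sup_v, sup_cv = 0.6, 0.75, 0.4

all_conf = sup_cv / max(sup_c, sup_v)    # 0.4 / 0.75  = 0.533
cosine   = sup_cv / sqrt(sup_c * sup_v)  # 0.4 / 0.671 ≈ 0.596
print(round(all_conf, 2), round(cosine, 2))  # 0.53 0.6
```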
54. Mining Highly Correlated Patterns
- lift and χ² are not good measures for correlations in transactional DBs
- all_conf or cosine could be good measures (Omiecinski @TKDE'03)
- Both all_conf and coherence have the downward closure property
55. (No text: figure-only slide)
56. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
57. Constraint-Based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic!
- The patterns could be too many, and not focused!
- Data mining should be an interactive process
- The user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining:
- User flexibility: provides constraints on what is to be mined
- System optimization: explores such constraints for efficient, constraint-based mining
58. Constraints in Data Mining
- Knowledge type constraint:
- classification, association, etc.
- Data constraint (using SQL-like queries):
- find product pairs sold together in stores in Chicago in Dec. '02
- Dimension/level constraint:
- in relevance to region, price, brand, customer category
- Rule (or pattern) constraint:
- small sales (price < 10) triggers big sales (sum > 200)
- Interestingness constraint:
- strong rules: min_support ≥ 3%, min_confidence ≥ 60%
59. Example
- bread ⇒ milk
- milk ⇒ butter
- Strong rules, but the items are not that valuable
- TV ⇒ VCD player
- Support may be lower than for the previous rules, but the value of the items is much higher
- This rule may be more valuable
60. Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning:
- Both aim at reducing the search space
- Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
- Constraint-pushing vs. heuristic search
- How to integrate the two is an interesting research problem
- Constrained mining vs. query processing in DBMS:
- Database query processing requires finding all answers
- Constrained pattern mining shares a similar philosophy with pushing selections deep into query processing
61. Rule Constraints in Association Mining
- Two kinds of rule constraints:
- Rule form constraints: meta-rule guided mining.
- P(x, y) ∧ Q(x, w) ⇒ takes(x, "database systems")
- Rule (content) constraints: constraint-based query optimization (Ng et al., SIGMOD'98).
- sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
- 1-variable vs. 2-variable constraints (Lakshmanan et al., SIGMOD'99):
- 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above.
- 2-var: a constraint confining both sides (L and R).
- sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)
62.
- The Apriori principle states that
- all non-empty subsets of a frequent itemset must also be frequent
- Note that
- if a given itemset does not satisfy minimum support,
- none of its supersets can
- Other examples of anti-monotone constraints:
- min(l.price) > 500
- count(l) < 10
- average(l.price) < 10 is not anti-monotone
63. Anti-Monotonicity in Constraint Pushing
TDB (min_sup = 2)
- Anti-monotonicity:
- When an itemset S violates the constraint, so does any of its supersets
- sum(S.price) ≤ v is anti-monotone
- sum(S.price) ≥ v is not anti-monotone
- Example: C: range(S.profit) ≤ 15 is anti-monotone
- Itemset ab violates C
- So does every superset of ab
64. Monotonicity for Constraint Pushing
TDB (min_sup = 2)
- Monotonicity:
- When an itemset S satisfies the constraint, so does any of its supersets
- sum(S.price) ≥ v is monotone
- min(S.price) ≤ v is monotone
- Example: C: range(S.profit) ≥ 15
- Itemset ab satisfies C
- So does every superset of ab
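A small Python sketch of pushing an anti-monotone constraint: any candidate that violates it is discarded before being extended, since every superset would violate it too. The profit values below are hypothetical, chosen in the spirit of the slides' TDB example:

```python
def range_profit(S, profit):
    """range(S.profit) = max profit in S minus min profit in S."""
    return max(profit[i] for i in S) - min(profit[i] for i in S)

def prune_by_antimonotone(candidates, constraint):
    """Anti-monotone pushing: a violating candidate (and hence all of
    its supersets) can be dropped from the search immediately."""
    return [S for S in candidates if constraint(S)]

# hypothetical profit table (not the slides' actual TDB)
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "f": 30}
C = lambda S: range_profit(S, profit) <= 15     # C: range(S.profit) <= 15

print(prune_by_antimonotone([{"a", "b"}, {"b", "d"}], C))  # [{'b', 'd'}]
```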
65. The Apriori Algorithm, Example
[Same diagram as slide 16: database D scanned to form C1 → L1; L1 joined to form C2, D scanned to get L2; L2 joined to form C3, D scanned to get L3]
66. Naïve Algorithm: Apriori + Constraint
[Apriori diagram as before; the constraint sum(S.price) < 5 is checked only on the final result]
Constraint: sum(S.price) < 5
67. The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
[Apriori diagram as before; itemsets violating sum(S.price) < 5 are pruned during candidate generation]
Constraint: sum(S.price) < 5
68. The Constrained Apriori Algorithm: Push Another Constraint Deep
[Apriori diagram as before, pruning with the constraint min(S.price) < 1]
Constraint: min(S.price) < 1
69. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Mining various kinds of association/correlation rules
- Constraint-based association mining
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary
70. Sequence Databases and Sequential Pattern Analysis
- Transaction databases and time-series databases vs. sequence databases
- Frequent patterns vs. (frequent) sequential patterns
- Applications of sequential pattern mining:
- Customer shopping sequences:
- first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
- Medical treatment, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
- Telephone calling patterns, Weblog click streams
- DNA sequences and gene structures
71. What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences
- A sequence database; a sequence: <(ef)(ab)(df)cb>
- An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
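A minimal Python check of this subsequence relation, with sequences represented as lists of item sets; a greedy earliest-match scan is sufficient for an existence test:

```python
def is_subsequence(sub, seq):
    """True if `sub` is a subsequence of `seq` (both lists of sets):
    each element of `sub` must be contained in some element of `seq`,
    preserving left-to-right order."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:  # subset match
            i += 1
    return i == len(sub)

# the slide's example: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]
print(is_subsequence(sub, seq))  # True
```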
72. Challenges in Sequential Pattern Mining
- A huge number of possible sequential patterns is hidden in databases
- A mining algorithm should:
- find the complete set of patterns satisfying the minimum support (frequency) threshold, when possible
- be highly efficient and scalable, involving only a small number of database scans
- be able to incorporate various kinds of user-specific constraints
73. Studies on Sequential Pattern Mining
- Concept introduction and an initial Apriori-like algorithm:
- R. Agrawal & R. Srikant. "Mining sequential patterns", ICDE'95
- GSP, an Apriori-based, influential mining method (developed at IBM Almaden):
- R. Srikant & R. Agrawal. "Mining sequential patterns: Generalizations and performance improvements", EDBT'96
- From sequential patterns to episodes (Apriori-like constraints):
- H. Mannila, H. Toivonen & A.I. Verkamo. "Discovery of frequent episodes in event sequences", Data Mining and Knowledge Discovery, 1997
- Mining sequential patterns with constraints:
- M.N. Garofalakis, R. Rastogi & K. Shim. "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints", VLDB 1999
74. Sequential Pattern Mining: Cases and Parameters
- Duration of a time sequence T
- Sequential pattern mining can then be confined to the data within a specified duration
- Ex.: the subsequence corresponding to the year 1999
- Ex.: partitioned sequences, such as every year, or every week after a stock crash, or every two weeks before and after a volcano eruption
- Event folding window w
- If w = T, time-insensitive frequent patterns are found
- If w = 0 (no event-sequence folding), sequential patterns are found where each event occurs at a distinct time instant
- If 0 < w < T, sequences occurring within the same period w are folded in the analysis
75. Example
- When the event folding window is 5 minutes,
- purchases within 5 minutes are considered to be taken together
76. Sequential Pattern Mining: Cases and Parameters (2)
- Time interval, int, between events in the discovered pattern
- int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found
- Ex.: find frequent patterns occurring in consecutive weeks
- min_int ≤ int ≤ max_int: find patterns that are separated by at least min_int but at most max_int
- Ex.: if a person rents movie A, it is likely she will rent movie B within 30 days (int ≤ 30)
- int = c ≠ 0: find patterns carrying an exact interval
- Ex.: every time the Dow Jones drops more than 5%, what will happen exactly two days later? (int = 2)
77. A Basic Property of Sequential Patterns: Apriori
- A basic property: Apriori (Agrawal & Srikant '94)
- If a sequence S is not frequent,
- then none of the super-sequences of S is frequent
- E.g., if <hb> is infrequent, so are <hab> and <(ah)b>
Given support threshold min_sup = 2
78. GSP: A Generalized Sequential Pattern Mining Algorithm
- GSP (Generalized Sequential Pattern) mining algorithm
- proposed by Srikant and Agrawal, EDBT'96
- Outline of the method:
- Initially, every item in the DB is a candidate of length 1
- For each level (i.e., sequences of length k):
- scan the database to collect the support count for each candidate sequence
- generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
- Repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori
- Examine GSP using an example
- Initial candidates all singleton sequences
- ltagt, ltbgt, ltcgt, ltdgt, ltegt, ltfgt, ltggt, lthgt
- Scan database once, count support for candidates
80. Generating Length-2 Candidates
- 51 length-2 candidates
- Without the Apriori property: 8×8 + 8×7/2 = 92 candidates
- Apriori prunes 44.57% of the candidates
81. Finding Length-2 Sequential Patterns
- Scan the database one more time, collect the support count for each length-2 candidate
- There are 19 length-2 candidates which pass the minimum support threshold
- They are the length-2 sequential patterns
82. Generating Length-3 Candidates and Finding Length-3 Patterns
- Generate length-3 candidates:
- self-join the length-2 sequential patterns,
- based on the Apriori property:
- <ab>, <aa> and <ba> are all length-2 sequential patterns ⇒ <aba> is a length-3 candidate
- <(bd)>, <bb> and <db> are all length-2 sequential patterns ⇒ <(bd)b> is a length-3 candidate
- 46 candidates are generated
- Find length-3 sequential patterns:
- scan the database once more, collect support counts for the candidates
- 19 out of the 46 candidates pass the support threshold
83. The GSP Mining Process (min_sup = 2)
84.
- Definition: c is a contiguous subsequence of a sequence s = <s1, s2, ..., sn> if
- c is derived by dropping an item from s1 or sn, or
- c is derived by dropping an item from some si which has at least 2 items, or
- c is a contiguous subsequence of c', and c' is a contiguous subsequence of s
- Ex.: s = <(1,2),(3,4),5,6>
- <2,(3,4),5>, <(1,2),3,5,6>, <3,5> are contiguous subsequences, but
- <(1,2),(3,4),6>, <1,5,6> are not
85. Candidate Generation
- Step 1, the join step: Lk-1 is joined with Lk-1 to give Ck
- s1 and s2 are joined if dropping the first item of s1 and the last item of s2 gives the same sequence
- s1 is extended by adding the last item of s2
- Step 2, the prune step: delete candidate sequences having a contiguous (k-1)-subsequence whose support count is less than the minimum support count
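A Python sketch of the join step, with sequences represented as lists of sorted item lists; it reproduces the two joins on the next slide:

```python
def drop_first(seq):
    """Drop the first item of the first element (remove it if emptied)."""
    head = seq[0][1:]
    return ([head] if head else []) + [e[:] for e in seq[1:]]

def drop_last(seq):
    """Drop the last item of the last element (remove it if emptied)."""
    tail = seq[-1][:-1]
    return [e[:] for e in seq[:-1]] + ([tail] if tail else [])

def join(s1, s2):
    """s1 and s2 join when dropping s1's first item and s2's last item
    leaves the same sequence; s1 is then extended with s2's last item."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) > 1:                   # last item shared s2's final element
        return s1[:-1] + [s1[-1] + [last]]
    return s1 + [[last]]                  # last item was its own element

print(join([[1, 2], [3]], [[2], [3, 4]]))    # [[1, 2], [3, 4]]
print(join([[1, 2], [3]], [[2], [3], [5]]))  # [[1, 2], [3], [5]]
```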
86.
- L3: <(1,2),3>, <(1,2),4>, <1,(3,4)>, <(1,3),5>, <2,(3,4)>, <2,3,5>
- C4: <(1,2),(3,4)>, <(1,2),3,5>
- L4: <(1,2),(3,4)>
- <(1,2),3> joined with <2,(3,4)> gives <(1,2),(3,4)>
- <(1,2),3> joined with <2,3,5> gives <(1,2),3,5>
- <(1,2),3,5> is dropped, since its contiguous 3-subsequence <1,3,5> is not in L3
87. The GSP Algorithm
- Take sequences of the form <x> as length-1 candidates
- Scan the database once, find F1, the set of length-1 sequential patterns
- Let k = 1; while Fk is not empty, do:
- form Ck+1, the set of length-(k+1) candidates, from Fk;
- if Ck+1 is not empty, scan the database once, find Fk+1, the set of length-(k+1) sequential patterns;
- let k = k+1
88. Bottlenecks of GSP
- A huge set of candidates may be generated
- 1,000 frequent length-1 sequences generate 1,000 × 1,000 + (1,000 × 999)/2 = 1,499,500 length-2 candidates!
- Multiple scans of the database in mining
- The real challenge: mining long sequential patterns
- An exponential number of short candidates
- A length-100 sequential pattern needs 10^30 candidate sequences!