Title: Data Mining and Knowledge Acquisition Chapter 5
1. Data Mining and Knowledge Acquisition
Chapter 5
2. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
3. What Is Association Mining?
- Association rule mining
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications
- Market basket analysis, cross-marketing, catalog design, etc.
- Examples
- Rule form: Body ⇒ Head [support, confidence]
- buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
- major(x, "MIS") ∧ takes(x, "DM") ⇒ grade(x, "AA") [1%, 75%]
4. Association Rules: Basic Concepts
- Given
- (1) a database of transactions,
- (2) each transaction is a list of items (purchased by a customer in one visit)
- Find all rules that correlate the presence of one set of items with that of another set of items
- E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- The user specifies
- Minimum support level
- Minimum confidence level
- Rules exceeding the two thresholds are listed as interesting
5. Basic Concepts (cont.)
- I = {i1, ..., im}: the set of all items; T: any transaction
- A ⊆ T: T contains the itemset A
- A ⊆ T, B ⊆ T: A, B are itemsets
- Examine rules of the form
- A ⇒ B, where
- A ∩ B = ∅
- support s = P(A ∪ B)
- the frequency of transactions containing both A and B
- confidence c = P(B|A) = P(A ∪ B) / P(A)
- the conditional probability that a transaction containing A also contains B
6. Rule Measures: Support and Confidence
[Venn diagram: customers who buy beer, customers who buy diapers, and the overlap of customers who buy both]
- Find all rules X ∧ Y ⇒ Z with minimum confidence and support
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction containing {X, Y} also contains Z
- With minimum support 50% and minimum confidence 50%, we have
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)
7. Frequent Itemsets
- Strong association rules
- rule support > min_support
- rule confidence > min_confidence
- k-itemset: an itemset containing k items
- occurrence frequency = count = support count
- Minimum support count
- min_sup × |transactions in the database|
- Frequent itemsets
- itemsets satisfying the minimum support count
- The Apriori algorithm has two steps:
- (1) Find all frequent itemsets
- (2) Generate strong association rules from the frequent itemsets
8. Mining Association Rules, An Example (1)
Min_support = 50%, Min_confidence = 50%, Min_count = 0.5 × 4 = 2
- {A}, {B}, {C}, {D} are 1-itemsets
- {A}, {B}, {C} are frequent 1-itemsets as, e.g.,
- Count{A} = 3 > 2 (minimum count), or
- Support{A} = 75% > 50% (minimum support)
- {D} is not a frequent 1-itemset as
- Count{D} = 1 < 2 (minimum count), or
- Support{D} = 25% < 50% (minimum support)
9. Mining Association Rules, An Example (2)
Min_support = 50%, Min_confidence = 50%, Min_count = 0.5 × 4 = 2
- {A,B}, {A,C}, {A,D}, {B,C} are 2-itemsets
- {A,C} is a frequent 2-itemset as
- Count{A,C} = 2 ≥ 2 (minimum count), or
- Support{A,C} = 50% ≥ 50% (minimum support)
- {A,B}, {A,D} are not frequent 2-itemsets as, e.g.,
- Count{A,D} = 1 < 2 (minimum count), or
- Support{A,D} = 25% < 50% (minimum support)
10. Mining Association Rules, An Example (3)
Min. support 50%, Min. confidence 50%
- For rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
- A strong rule, as support > min_support
- and confidence > min_confidence
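To make the two measures concrete, here is a minimal Python sketch that computes them over the four-transaction database this example appears to use (T1 = {A,B,C}, T2 = {A,C}, T3 = {A,D}, T4 = {B,E,F}; the exact transactions are an assumption reconstructed from the counts quoted on these slides):

```python
# Four-transaction example DB (assumed from the counts on these slides).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))        # 0.5    -> 50%
print(confidence({"A"}, {"C"}))   # 0.666  -> 66.6%
print(confidence({"C"}, {"A"}))   # 1.0    -> 100%
```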
11. The Apriori Principle
Min. support 50%, Min. confidence 50%
- The Apriori principle
- Any subset of a frequent itemset must be frequent
- {A,C} is a frequent 2-itemset
- A and C, the subsets of {A,C}, must be frequent 1-itemsets
12. The Apriori Algorithm Has Two Steps
- (1) Find the frequent itemsets: the sets of items that have minimum support (the key step)
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets),
- until Lk, the set of frequent k-itemsets, is empty
- (2) Use the frequent itemsets to generate association rules.
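As a concrete illustration of step (1), here is a minimal, self-contained Python sketch of the level-wise search (a simplified sketch, not the book's exact pseudocode: itemsets are frozensets, and candidates are formed by unioning pairs of frequent k-itemsets and pruning with the Apriori principle):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Step (1): find all frequent itemsets, level by level."""
    def freq(cands):
        # keep only candidates meeting the minimum support count
        return {c for c in cands
                if sum(c <= t for t in transactions) >= min_count}

    items = {frozenset([i]) for t in transactions for i in t}
    L, k, result = freq(items), 1, []          # L1: frequent 1-itemsets
    while L:
        result.extend(L)
        # candidate (k+1)-itemsets: unions of two frequent k-itemsets ...
        cands = {a | b for a in L for b in L if len(a | b) == k + 1}
        # ... kept only if every k-subset is frequent (Apriori principle)
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))}
        L, k = freq(cands), k + 1
    return result

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(db, min_count=2))  # {A}, {B}, {C}, {A,C}
```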
13. Generation of Frequent Itemsets from Candidate Itemsets (Step 1)
- C1 → L1 → C2 → L2 → C3 → L3 → C4 → L4 → ...
- From Ck (candidate k-itemsets) generate Lk (Ck → Lk)
- From candidate k-itemsets generate frequent k-itemsets
- (a) Using the Apriori principle:
- eliminate itemset sk in Ck if
- at least one (k-1)-subset of sk is not in Lk-1
- (b) For the candidate k-itemsets remaining in Ck:
- make a database scan to eliminate those itemsets whose support counts are below the minimum support count
- From frequent k-itemsets Lk generate candidate (k+1)-itemsets Ck+1 (Lk → Ck+1)
- Self-joining Lk with Lk
14. The Self-Join Operation
- Sort the items in each li ∈ Lk in some lexicographic order
- li1 < li2 < ... < li(k-1) < lik
- li and lj are elements of Lk (li, lj ∈ Lk)
- li and lj are joinable if
- li1 = lj1 and li2 = lj2 and ... and li(k-1) = lj(k-1),
- and lik < ljk
- The first k-1 elements are the same
- Only the last elements differ
- For all li, lj satisfying the above condition,
- construct the itemset lk+1:
- (li1, li2, ..., li(k-1), lik, ljk)
- the first k-1 items (the common items) are taken from li or lj
- the k-th item is taken from li
- the (k+1)-th item is taken from lj
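A short Python sketch of this join, assuming each itemset is given as a set of comparable items; it reproduces the L3 example on the next slide:

```python
def self_join(Lk):
    """li and lj in Lk are joinable when their first k-1 (sorted) items
    agree and li's last item precedes lj's; the join appends lj's last item."""
    Lk = [sorted(l) for l in Lk]
    out = []
    for li in Lk:
        for lj in Lk:
            if li[:-1] == lj[:-1] and li[-1] < lj[-1]:
                out.append(li + [lj[-1]])   # li1..lik plus ljk
    return out

print(self_join([{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"},
                 {"a", "c", "e"}, {"b", "c", "d"}]))
# [['a','b','c','d'], ['a','c','d','e']]  (before Apriori pruning)
```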
15. Example of the Self-Join Operation
- Lexicographic order: alphabetic, a < b < c < d ...
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining L3 with L3 (Step 2)
- abcd from abc and abd
- acde from acd and ace
- Pruning by the Apriori principle (Step 1a)
- acde is removed because ade is not in L3
- C4 = {abcd}
16. The Apriori Algorithm, Example (min. support count = 2)
[Diagram: database D is scanned to produce C1, pruned to L1; L1 is joined to form C2, D is scanned again to get L2; L2 is joined to form C3, and a third scan of D yields L3]
17. Example 6.1 (Han)
- TID: list of item IDs
- T100: 1, 2, 5
- T200: 2, 4
- T300: 2, 3
- T400: 1, 2, 4
- T500: 1, 3
- T600: 2, 3
- T700: 1, 3
- T800: 1, 2, 3, 5
- T900: 1, 2, 3
- 9 transactions in D; minimum transaction support count = 2 (min_sup = 2/9 = 22%); min. confidence = 70%
- Find strong association rules having a min. support count of 2 and min. confidence of 70%
18. Data Dictionary
- 1 = milk
- 2 = apple
- 3 = butter
- 4 = bread
- 5 = orange
19. 1st and 2nd Iterations of the Algorithm
- C1 (itemset: sup_count): {1}: 6, {2}: 7, {3}: 6, {4}: 2, {5}: 2
- All five meet the minimum support count, so L1 = C1
- C2 = L1 joined with L1
- C2 (itemset: sup_count): {1,2}: 4, {1,3}: 4, {1,4}: 1 (✗), {1,5}: 2, {2,3}: 4, {2,4}: 2, {2,5}: 2, {3,4}: 0 (✗), {3,5}: 1 (✗), {4,5}: 0 (✗)
- L2 = the frequent 2-itemsets, i.e., those itemsets in C2 having minimum support (Step 1b):
- {1,2}: 4, {1,3}: 4, {1,5}: 2, {2,3}: 4, {2,4}: 2, {2,5}: 2
20. 3rd Iteration
- Self-join to get C3 (Step 2)
- C3 = L2 joined with L2 = {1 2 3}, {1 2 5}, {1 3 5}, {2 3 4}, {2 3 5}, {2 4 5}
- Now Step 1a: apply the Apriori principle to every itemset in C3
- 2-item subsets of {1 2 3}: {1 2}, {1 3}, {2 3}
- all 2-item subsets are members of L2
- keep {1 2 3} in C3
- 2-item subsets of {1 2 5}: {1 2}, {1 5}, {2 5}
- all 2-item subsets are members of L2
- keep {1 2 5} in C3
- 2-item subsets of {1 3 5}: {1 3}, {1 5}, {3 5}
- {3 5} is not a member of L2, so it is not frequent
- remove {1 3 5} from C3
21. 3rd Iteration (cont.)
- 2-item subsets of {2 3 4}: {2 3}, {2 4}, {3 4}
- {3 4} is not a member of L2, so it is not frequent
- remove {2 3 4} from C3
- 2-item subsets of {2 3 5}: {2 3}, {2 5}, {3 5}
- {3 5} is not a member of L2, so it is not frequent
- remove {2 3 5} from C3
- 2-item subsets of {2 4 5}: {2 4}, {2 5}, {4 5}
- {4 5} is not a member of L2, so it is not frequent
- remove {2 4 5} from C3
- C3 = {1 2 3}, {1 2 5} after pruning
22. 4th Iteration
- C3 → L3: check min. support (Step 1b)
- L3 = those itemsets having minimum support
- L3 (itemset: sup_count):
- {1 2 3}: 2
- {1 2 5}: 2
- L3 joined with L3 to generate C4 (Step 2)
- L3 joined with L3 = {1 2 3 5}
- pruned, since its subset {2 3 5} is not frequent
- C4 = ∅
- the algorithm terminates
23. Generating Association Rules from Frequent Itemsets
- Strong rules satisfy
- min. support and min. confidence
- confidence(A ⇒ B) = P(B|A) = sup_count(A ∪ B) / sup_count(A)
- For each frequent itemset l:
- generate the non-empty subsets of l, denoted s
- For each s ⊂ l:
- construct the rule s ⇒ (l - s)
- Rules satisfying the condition
- sup_count(l) / sup_count(s) ≥ min_conf
- are listed as interesting
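A minimal Python sketch of this rule-generation step, assuming the support counts were already collected by Apriori (the `sup_count` mapping below is filled with the counts from Example 6.2):

```python
from itertools import chain, combinations

def rules_from(l, sup_count, min_conf):
    """For each non-empty proper subset s of frequent itemset l, emit
    s => (l - s) when sup_count(l) / sup_count(s) >= min_conf."""
    l = frozenset(l)
    subsets = chain.from_iterable(combinations(l, r) for r in range(1, len(l)))
    for s in map(frozenset, subsets):
        conf = sup_count[l] / sup_count[s]
        if conf >= min_conf:
            yield (set(s), set(l - s), conf)

# support counts from Example 6.2, for l = {1, 2, 5}
counts = {frozenset(f): c for f, c in [
    ({1}, 6), ({2}, 7), ({5}, 2),
    ({1, 2}, 4), ({1, 5}, 2), ({2, 5}, 2), ({1, 2, 5}, 2)]}

for lhs, rhs, conf in rules_from({1, 2, 5}, counts, 0.7):
    print(lhs, "=>", rhs, round(conf, 2))
# {1,5} => {2} 1.0 ; {2,5} => {1} 1.0 ; {5} => {1,2} 1.0
```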
24. Example 6.2 (Han, cont.)
- The frequent 3-itemset l = {1 2 5} (transactions containing milk, apple, and orange) is frequent
- The non-empty proper subsets of l are
- {1 2}, {1 5}, {2 5}, {1}, {2}, {5}
- The resulting association rules are
- 1 ∧ 2 ⇒ 5, conf = 2/4 = 50%
- 1 ∧ 5 ⇒ 2, conf = 2/2 = 100%
- 2 ∧ 5 ⇒ 1, conf = 2/2 = 100%
- 1 ⇒ 2 ∧ 5, conf = 2/6 = 33%
- 2 ⇒ 1 ∧ 5, conf = 2/7 = 29%
- 5 ⇒ 1 ∧ 2, conf = 2/2 = 100%
- If min_conf = 70%, the 2nd, 3rd, and last rules are strong
25. Example 6.2 (cont.): Detail on Confidence for Two Rules
- For the rule
- 1 ∧ 5 ⇒ 2: conf = s(1,2,5) / s(1,5)
- conf = 2/2 = 100% > 70%
- A strong rule
- For the rule
- 2 ⇒ 1 ∧ 5: conf = s(1,2,5) / s(2)
- conf = 2/7 = 29% < 70%
- Not a strong rule
26. Exercise
- Find all strong association rules in Example 6.2
- Check minimum confidence
- for the 2-frequent itemsets
- {1,2}, {1,3}, {1,5}, {2,3}, {2,4}, {2,5}
- 1 ⇒ 2, 2 ⇒ 1
- 2 ⇒ 5, 5 ⇒ 2, etc.
- for the 3-frequent itemset
- {1,2,5}
- 1 ∧ 2 ⇒ 5
- 5 ⇒ 1 ∧ 2, etc.
27. Exercise
- a) Suppose A ⇒ B and B ⇒ C are strong rules.
- Does this imply that A ⇒ C is also a strong rule?
- b) Suppose A ⇒ B and A ⇒ C are strong rules.
- Does this imply that B ⇒ C is also a strong rule?
- c) Suppose A ⇒ C and B ⇒ C are strong rules.
- Does this imply that A ∧ B ⇒ C is also a strong rule?
28. Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
- To find the frequent itemset i1 i2 ... i100:
- # of scans: 100
- # of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
29. Is Apriori Fast Enough? Performance Bottlenecks
- The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
- Huge candidate sets:
- 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates.
- Multiple scans of the database:
- Needs (n + 1) scans, where n is the length of the longest pattern
30. Mining Frequent Patterns Without Candidate Generation
- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
- highly condensed, but complete for frequent pattern mining
- avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
- A divide-and-conquer methodology: decompose mining tasks into smaller ones
- Avoid candidate generation: sub-database test only!
31. Construct an FP-tree from a Transaction DB
TID | Items bought            | (Ordered) frequent items
100 | f, a, c, d, g, i, m, p  | f, c, a, m, p
200 | a, b, c, f, l, m, o     | f, c, a, b, m
300 | b, f, h, j, o           | f, b
400 | b, c, k, s, p           | c, b, p
500 | a, f, c, e, l, p, m, n  | f, c, a, m, p
min_support = 0.5
- Steps:
- Scan the DB once, find frequent 1-itemsets (single-item patterns)
- Order frequent items in frequency-descending order
- Scan the DB again, construct the FP-tree
32. Benefits of the FP-tree Structure
- Completeness
- never breaks a long pattern of any transaction
- preserves complete information for frequent pattern mining
- Compactness
- reduces irrelevant information: infrequent items are gone
- frequency-descending ordering: more frequent items are more likely to be shared
- never larger than the original database (if node-links and counts are not counted)
- Example: for the Connect-4 DB, the compression ratio can be over 100
33. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
34. Multiple-Level Association Rules
- Items often form a hierarchy.
- Items at lower levels are expected to have lower support.
- Rules regarding itemsets at appropriate levels can be quite useful.
- Transaction databases can be encoded based on dimensions and levels
- We can explore shared multi-level mining
35. Mining Multi-Level Associations
- A top-down, progressive deepening approach:
- First find high-level strong rules:
- milk ⇒ bread [20%, 60%]
- Then find their lower-level "weaker" rules:
- 2% milk ⇒ wheat bread [6%, 50%]
- Variations at mining multiple-level association rules:
- Level-crossed association rules:
- 2% milk ⇒ Wonder wheat bread
- Association rules with multiple, alternative hierarchies:
- 2% milk ⇒ Wonder bread
36. Multi-level Association: Uniform Support vs. Reduced Support
- Uniform support: the same minimum support for all levels
- One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
- Lower-level items do not occur as frequently. If the support threshold is
- too high ⇒ miss low-level associations
- too low ⇒ generate too many high-level associations
- Reduced support: reduced minimum support at lower levels
- There are 4 search strategies:
- Level-by-level independent
- Level-cross filtering by k-itemset
- Level-cross filtering by single item
- Controlled level-cross filtering by single item
37. Uniform Support
Multi-level mining with uniform support:
- Level 1 (min_sup = 5%): milk [support = 10%]
- Level 2 (min_sup = 5%): 2% milk [support = 6%], skim milk [support = 4%]
38. Reduced Support
Multi-level mining with reduced support:
- Level 1 (min_sup = 5%): milk [support = 10%]
- Level 2 (min_sup = 3%): 2% milk [support = 6%], skim milk [support = 4%]
39.
- Controlled level-cross filtering by single item
- Specify a "level passage threshold" (LPT) for each level k:
- min_sup_T(k+1) < LPT(k) < min_sup_T(k)
- Example:
- High level: milk
- min. supp. = 5%
- Low level: 2% milk, skim milk
- min. supp. = 3%
- Level passage threshold = 4%
40. Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items.
- Example:
- milk ⇒ wheat bread [support = 8%, confidence = 70%]
- 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
41. Multi-Level Mining: Progressive Deepening
- A top-down, progressive deepening approach:
- First mine high-level frequent items:
- milk (15%), bread (10%)
- Then mine their lower-level "weaker" frequent itemsets:
- 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across multiple levels lead to different algorithms:
- If adopting the same min_support across multiple levels,
- then toss t if any of t's ancestors is infrequent.
- If adopting reduced min_support at lower levels,
- then examine only those descendants whose ancestors' support is frequent/non-negligible.
42. Progressive Refinement of Data Mining Quality
- Why progressive refinement?
- Mining operators can be expensive or cheap, fine or rough
- Trade speed for quality: step-by-step refinement.
- Superset coverage property:
- Preserve all the positive answers: allow a false positive test, but not a false negative test.
- Two- or multi-step mining:
- First apply a rough/cheap operator (superset coverage)
- Then apply an expensive algorithm on a substantially reduced candidate set (Koperski & Han, SSD'95).
43. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
44. Interestingness Measurements
- Objective measures
- Two popular measurements:
- support, and
- confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD'95)
- A rule (pattern) is interesting if
- it is unexpected (surprising to the user), and/or
- actionable (the user can do something with it)
45. Criticism of Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98)
- Among 5000 students:
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
- play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence
46. Criticism of Support and Confidence (cont.)
- Example 2:
- X and Y: positively correlated
- X and Z: negatively related
- the support and confidence of X ⇒ Z dominate
- We need a measure of dependent or correlated events
- P(B|A)/P(B) is also called the lift of rule A ⇒ B
47. Other Interestingness Measures: Interest
- Interest (correlation, lift)
- takes both P(A) and P(B) into consideration
- P(A ∧ B) = P(A) × P(B), if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
48. Example
- Total transactions: 10,000
- Items: C = computers, V = videos
- V: 7,500; C: 6,000; C and V: 4,000
- Min_support = 0.3, min_conf = 0.50
- Consider the rule:
- buys(X, computer) ⇒ buys(X, video)
- Support = 4000/10000 = 0.4
- Confidence = P(C and V)/P(C) = 4000/6000 = 66%
- Strong, but:
- the probability of buying a video is 0.75; buying a computer reduces the probability of buying a video from 0.75 to 0.66
- Computers and videos are negatively correlated
49.
- Lift of A ⇒ B
- Lift = P(A and B) / (P(A) × P(B))
- P(A and B) = P(B|A) × P(A), hence
- Lift = P(B|A) / P(B)
- The ratio of the probability of buying A and B together to the probability of buying A and B independently
- Or it can be interpreted as:
- the conditional probability of buying B given that A is purchased, divided by the unconditional probability of buying B
50.
Contingency table (counts out of 10,000 transactions):
        C      not C   total
V       4000   3500    7500
not V   2000   500     2500
total   6000   4000    10000
Lift(C ⇒ V) = P(C and V) / (P(C) × P(V)) = P(V|C) / P(V) = 0.4 / (0.6 × 0.75) ≈ 0.89 < 1, so there is a negative correlation between video and computer
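The same lift computation as a few lines of Python:

```python
# Lift for the computer/video example: 10,000 transactions,
# P(C) = 0.6, P(V) = 0.75, P(C and V) = 0.4.
p_c, p_v, p_cv = 6000 / 10000, 7500 / 10000, 4000 / 10000
lift = p_cv / (p_c * p_v)
print(round(lift, 2))  # 0.89 < 1: computers and videos are negatively correlated
```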
51. Are All the Rules Found Interesting?
- "buy walnuts ⇒ buy milk [1%, 80%]" is misleading
- if 85% of customers buy milk
- Support and confidence are not good for representing correlations
- So many interestingness measures? (Tan, Kumar & Srivastava @KDD'02)
52. All-Confidence
- All-confidence:
- all_conf = sup(X) / max{sup(Xi) | Xi ∈ X}
- X = (X1, X2, ..., Xk)
- For k = 2:
- the rules are X1 ⇒ X2 and X2 ⇒ X1
- all_conf = sup(X1, X2) / max{sup(X1), sup(X2)}
- Here sup(X1, X2)/sup(X1) is the confidence of the rule X1 ⇒ X2
- Ex.: all_conf = 0.4 / max(0.6, 0.75) = 0.4/0.75 = 0.53
53. Cosine
- Cosine = P(A, B) / sqrt(P(A) × P(B))
- Similar to lift, but takes the square root of the denominator
- Both cosine and all_conf are null-invariant:
- not affected by null transactions
- Ex.: cosine = 0.4 / sqrt(0.6 × 0.75) ≈ 0.60
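Both measures for the computer/video example, as a short Python check:

```python
from math import sqrt

# Support figures from the computer/video example:
# sup(C) = 0.6, sup(V) = 0.75, sup({C, V}) = 0.4
sup_c, sup_v, sup_cv = 0.6, 0.75, 0.4

all_conf = sup_cv / max(sup_c, sup_v)    # 0.4 / 0.75  = 0.533
cosine   = sup_cv / sqrt(sup_c * sup_v)  # 0.4 / 0.671 ≈ 0.596
print(round(all_conf, 2), round(cosine, 2))  # 0.53 0.6
```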
54. Mining Highly Correlated Patterns
- lift and χ² are not good measures for correlations in transactional DBs
- all_conf or cosine could be good measures (Omiecinski @TKDE'03)
- Both all_conf and coherence have the downward closure property
55. (No text: figure-only slide)
56. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
57. Constraint-Based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic!
- The patterns could be too many, and not focused!
- Data mining should be an interactive process
- The user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining:
- User flexibility: provides constraints on what is to be mined
- System optimization: explores such constraints for efficient, constraint-based mining
58. Constraints in Data Mining
- Knowledge type constraint:
- classification, association, etc.
- Data constraint (using SQL-like queries):
- find product pairs sold together in stores in Chicago in Dec. '02
- Dimension/level constraint:
- in relevance to region, price, brand, customer category
- Rule (or pattern) constraint:
- small sales (price < 10) triggers big sales (sum > 200)
- Interestingness constraint:
- strong rules: min_support ≥ 3%, min_confidence ≥ 60%
59. Example
- bread ⇒ milk
- milk ⇒ butter
- Strong rules, but the items are not that valuable
- TV ⇒ VCD player
- Support may be lower than for the previous rules, but the value of the items is much higher
- This rule may be more valuable
60. Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning:
- Both aim at reducing the search space
- Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
- Constraint-pushing vs. heuristic search
- How to integrate the two is an interesting research problem
- Constrained mining vs. query processing in DBMS:
- Database query processing requires finding all answers
- Constrained pattern mining shares a similar philosophy with pushing selections deep into query processing
61. Rule Constraints in Association Mining
- Two kinds of rule constraints:
- Rule form constraints: meta-rule guided mining.
- P(x, y) ∧ Q(x, w) ⇒ takes(x, "database systems")
- Rule (content) constraints: constraint-based query optimization (Ng et al., SIGMOD'98).
- sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
- 1-variable vs. 2-variable constraints (Lakshmanan et al., SIGMOD'99):
- 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above.
- 2-var: a constraint confining both sides (L and R).
- sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)
62.
- The Apriori principle states that
- all non-empty subsets of a frequent itemset must also be frequent
- Note that
- if a given itemset does not satisfy minimum support,
- none of its supersets can
- Other examples of anti-monotone constraints:
- min(l.price) > 500
- count(l) < 10
- average(l.price) < 10 is not anti-monotone
63. Anti-Monotonicity in Constraint Pushing
TDB (min_sup = 2)
- Anti-monotonicity:
- When an itemset S violates the constraint, so does any of its supersets
- sum(S.price) ≤ v is anti-monotone
- sum(S.price) ≥ v is not anti-monotone
- Example: C: range(S.profit) ≤ 15 is anti-monotone
- Itemset ab violates C
- So does every superset of ab
64. Monotonicity for Constraint Pushing
TDB (min_sup = 2)
- Monotonicity:
- When an itemset S satisfies the constraint, so does any of its supersets
- sum(S.price) ≥ v is monotone
- min(S.price) ≤ v is monotone
- Example: C: range(S.profit) ≥ 15
- Itemset ab satisfies C
- So does every superset of ab
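A small Python sketch of pushing an anti-monotone constraint: any candidate that violates it is discarded before being extended, since every superset would violate it too. The profit values below are hypothetical, chosen in the spirit of the slides' TDB example:

```python
def range_profit(S, profit):
    """range(S.profit) = max profit in S minus min profit in S."""
    return max(profit[i] for i in S) - min(profit[i] for i in S)

def prune_by_antimonotone(candidates, constraint):
    """Anti-monotone pushing: a violating candidate (and hence all of
    its supersets) can be dropped from the search immediately."""
    return [S for S in candidates if constraint(S)]

# hypothetical profit table (not the slides' actual TDB)
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "f": 30}
C = lambda S: range_profit(S, profit) <= 15     # C: range(S.profit) <= 15

print(prune_by_antimonotone([{"a", "b"}, {"b", "d"}], C))  # [{'b', 'd'}]
```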
65. The Apriori Algorithm, Example
[Same diagram as slide 16: database D scanned to form C1 → L1; L1 joined to form C2, D scanned to get L2; L2 joined to form C3, D scanned to get L3]
66. Naïve Algorithm: Apriori + Constraint
[Apriori diagram as before; the constraint sum(S.price) < 5 is checked only on the final result]
Constraint: sum(S.price) < 5
67. The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
[Apriori diagram as before; itemsets violating sum(S.price) < 5 are pruned during candidate generation]
Constraint: sum(S.price) < 5
68. The Constrained Apriori Algorithm: Push Another Constraint Deep
[Apriori diagram as before, pruning with the constraint min(S.price) < 1]
Constraint: min(S.price) < 1
69. Chapter 5: Mining Association Rules in Large Databases
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Mining various kinds of association/correlation rules
- Constraint-based association mining
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary
70. Sequence Databases and Sequential Pattern Analysis
- Transaction databases and time-series databases vs. sequence databases
- Frequent patterns vs. (frequent) sequential patterns
- Applications of sequential pattern mining:
- Customer shopping sequences:
- first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
- Medical treatment, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
- Telephone calling patterns, Weblog click streams
- DNA sequences and gene structures
71. What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences
- A sequence database; a sequence: <(ef)(ab)(df)cb>
- An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
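A minimal Python check of this subsequence relation, with sequences represented as lists of item sets; a greedy earliest-match scan is sufficient for an existence test:

```python
def is_subsequence(sub, seq):
    """True if `sub` is a subsequence of `seq` (both lists of sets):
    each element of `sub` must be contained in some element of `seq`,
    preserving left-to-right order."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:  # subset match
            i += 1
    return i == len(sub)

# the slide's example: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]
print(is_subsequence(sub, seq))  # True
```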
72. Challenges in Sequential Pattern Mining
- A huge number of possible sequential patterns is hidden in databases
- A mining algorithm should:
- find the complete set of patterns satisfying the minimum support (frequency) threshold, when possible
- be highly efficient and scalable, involving only a small number of database scans
- be able to incorporate various kinds of user-specific constraints
73. Studies on Sequential Pattern Mining
- Concept introduction and an initial Apriori-like algorithm:
- R. Agrawal & R. Srikant. "Mining sequential patterns", ICDE'95
- GSP, an Apriori-based, influential mining method (developed at IBM Almaden):
- R. Srikant & R. Agrawal. "Mining sequential patterns: Generalizations and performance improvements", EDBT'96
- From sequential patterns to episodes (Apriori-like constraints):
- H. Mannila, H. Toivonen & A.I. Verkamo. "Discovery of frequent episodes in event sequences", Data Mining and Knowledge Discovery, 1997
- Mining sequential patterns with constraints:
- M.N. Garofalakis, R. Rastogi & K. Shim. "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints", VLDB 1999
74. Sequential Pattern Mining: Cases and Parameters
- Duration of a time sequence T
- Sequential pattern mining can then be confined to the data within a specified duration
- Ex.: the subsequence corresponding to the year 1999
- Ex.: partitioned sequences, such as every year, or every week after a stock crash, or every two weeks before and after a volcano eruption
- Event folding window w
- If w = T, time-insensitive frequent patterns are found
- If w = 0 (no event-sequence folding), sequential patterns are found where each event occurs at a distinct time instant
- If 0 < w < T, sequences occurring within the same period w are folded in the analysis
75. Example
- When the event folding window is 5 minutes,
- purchases within 5 minutes are considered to be taken together
76. Sequential Pattern Mining: Cases and Parameters (2)
- Time interval, int, between events in the discovered pattern
- int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found
- Ex.: find frequent patterns occurring in consecutive weeks
- min_int ≤ int ≤ max_int: find patterns that are separated by at least min_int but at most max_int
- Ex.: if a person rents movie A, it is likely she will rent movie B within 30 days (int ≤ 30)
- int = c ≠ 0: find patterns carrying an exact interval
- Ex.: every time the Dow Jones drops more than 5%, what will happen exactly two days later? (int = 2)
77. A Basic Property of Sequential Patterns: Apriori
- A basic property: Apriori (Agrawal & Srikant '94)
- If a sequence S is not frequent,
- then none of the super-sequences of S is frequent
- E.g., if <hb> is infrequent, so are <hab> and <(ah)b>
Given support threshold min_sup = 2
78. GSP: A Generalized Sequential Pattern Mining Algorithm
- GSP (Generalized Sequential Pattern) mining algorithm
- proposed by Srikant and Agrawal, EDBT'96
- Outline of the method:
- Initially, every item in the DB is a candidate of length 1
- For each level (i.e., sequences of length k):
- scan the database to collect the support count for each candidate sequence
- generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
- Repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori
- Examine GSP using an example
- Initial candidates all singleton sequences
- ltagt, ltbgt, ltcgt, ltdgt, ltegt, ltfgt, ltggt, lthgt
- Scan database once, count support for candidates
80. Generating Length-2 Candidates
- 51 length-2 candidates
- Without the Apriori property: 8×8 + 8×7/2 = 92 candidates
- Apriori prunes 44.57% of the candidates
81. Finding Length-2 Sequential Patterns
- Scan the database one more time, collect the support count for each length-2 candidate
- There are 19 length-2 candidates which pass the minimum support threshold
- They are the length-2 sequential patterns
82. Generating Length-3 Candidates and Finding Length-3 Patterns
- Generate length-3 candidates:
- self-join the length-2 sequential patterns,
- based on the Apriori property:
- <ab>, <aa> and <ba> are all length-2 sequential patterns ⇒ <aba> is a length-3 candidate
- <(bd)>, <bb> and <db> are all length-2 sequential patterns ⇒ <(bd)b> is a length-3 candidate
- 46 candidates are generated
- Find length-3 sequential patterns:
- scan the database once more, collect support counts for the candidates
- 19 out of the 46 candidates pass the support threshold
83. The GSP Mining Process (min_sup = 2)
84.
- Definition: c is a contiguous subsequence of a sequence s = <s1, s2, ..., sn> if
- c is derived by dropping an item from s1 or sn, or
- c is derived by dropping an item from some si which has at least 2 items, or
- c is a contiguous subsequence of c', and c' is a contiguous subsequence of s
- Ex.: s = <(1,2),(3,4),5,6>
- <2,(3,4),5>, <(1,2),3,5,6>, <3,5> are contiguous subsequences, but
- <(1,2),(3,4),6>, <1,5,6> are not
85. Candidate Generation
- Step 1, the join step: Lk-1 is joined with Lk-1 to give Ck
- s1 and s2 are joined if dropping the first item of s1 and the last item of s2 gives the same sequence
- s1 is extended by adding the last item of s2
- Step 2, the prune step: delete candidate sequences having a contiguous (k-1)-subsequence whose support count is less than the minimum support count
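A Python sketch of the join step, with sequences represented as lists of sorted item lists; it reproduces the two joins on the next slide:

```python
def drop_first(seq):
    """Drop the first item of the first element (remove it if emptied)."""
    head = seq[0][1:]
    return ([head] if head else []) + [e[:] for e in seq[1:]]

def drop_last(seq):
    """Drop the last item of the last element (remove it if emptied)."""
    tail = seq[-1][:-1]
    return [e[:] for e in seq[:-1]] + ([tail] if tail else [])

def join(s1, s2):
    """s1 and s2 join when dropping s1's first item and s2's last item
    leaves the same sequence; s1 is then extended with s2's last item."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) > 1:                   # last item shared s2's final element
        return s1[:-1] + [s1[-1] + [last]]
    return s1 + [[last]]                  # last item was its own element

print(join([[1, 2], [3]], [[2], [3, 4]]))    # [[1, 2], [3, 4]]
print(join([[1, 2], [3]], [[2], [3], [5]]))  # [[1, 2], [3], [5]]
```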
86.
- L3: <(1,2),3>, <(1,2),4>, <1,(3,4)>, <(1,3),5>, <2,(3,4)>, <2,3,5>
- C4: <(1,2),(3,4)>, <(1,2),3,5>
- L4: <(1,2),(3,4)>
- <(1,2),3> joined with <2,(3,4)> gives <(1,2),(3,4)>
- <(1,2),3> joined with <2,3,5> gives <(1,2),3,5>
- <(1,2),3,5> is dropped, since its contiguous 3-subsequence <1,3,5> is not in L3
87. The GSP Algorithm
- Take sequences of the form <x> as length-1 candidates
- Scan the database once, find F1, the set of length-1 sequential patterns
- Let k = 1; while Fk is not empty, do:
- form Ck+1, the set of length-(k+1) candidates, from Fk;
- if Ck+1 is not empty, scan the database once, find Fk+1, the set of length-(k+1) sequential patterns;
- let k = k+1
88. Bottlenecks of GSP
- A huge set of candidates may be generated
- 1,000 frequent length-1 sequences generate 1,000 × 1,000 + (1,000 × 999)/2 = 1,499,500 length-2 candidates!
- Multiple scans of the database in mining
- The real challenge: mining long sequential patterns
- An exponential number of short candidates
- A length-100 sequential pattern needs 10^30 candidate sequences!