Title: Data Mining
1. Data Mining
2. Papers
- Rakesh Agrawal and Ramakrishnan Srikant
- Fast Algorithms for Mining Association Rules.
- Mining Sequential Patterns.
3. Outline
- For each paper
- Present the problem.
- Describe the algorithms.
- Intuition
- Design
- Performance.
4. Market Basket Introduction
- Retailers are able to collect massive amounts of sales data (basket data).
- Bar-code technology
- E-commerce
- Sales data generally includes customer id, transaction date, and items bought.
5. Market Basket Problem
- It would be useful to find association rules between transactions.
- e.g., 75% of the people who buy spaghetti also buy tomato sauce.
- Given a set of basket data, how can we efficiently find the set of association rules?
6. Formal Definition (1)
- L = {i1, i2, ..., im} is the set of items.
- Database D is a set of transactions.
- Transaction T is a set of items such that T ⊆ L.
- A unique identifier, TID, is associated with each transaction.
7. Formal Definition (2)
- T contains X, a set of some items in L, if X ⊆ T.
- Association rule: X ⇒ Y, where X ⊂ L, Y ⊂ L, X ∩ Y = ∅.
- Confidence: % of transactions which contain X that also contain Y.
- Support: % of transactions in D which contain X ∪ Y.
8. Formal Definition (3)
- Given a set of transactions D, we want to
generate all association rules that have support
and confidence greater than the user-specified
minimum support (minsup) and minimum confidence
(minconf).
9. Problem Decomposition
- Two sub-problems
- Find all itemsets that have transaction support above minsup.
- These itemsets are called large itemsets.
- From all the large itemsets, generate the set of association rules that have confidence above minconf.
10. Second Sub-problem
- Straightforward approach
- For every large itemset l, find all non-empty subsets of l.
- For every such subset a, output a rule of the form a ⇒ (l − a) if the ratio of support(l) to support(a) is at least minconf.
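This subset-enumeration approach is easy to sketch in code. A minimal Python illustration, assuming itemsets are frozensets and `support` is a precomputed map from itemset to support count (every non-empty subset of a large itemset is itself large, so the lookups succeed):

from itertools import combinations

def generate_rules(large_itemsets, support, minconf):
    """For each large itemset l, emit a => (l - a) for every non-empty
    proper subset a whose confidence support(l)/support(a) meets minconf."""
    rules = []
    for l in large_itemsets:
        for r in range(1, len(l)):                # all non-empty proper subsets
            for a in combinations(sorted(l), r):
                a = frozenset(a)
                conf = support[l] / support[a]    # confidence of a => (l - a)
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules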
11. Discovering Large Itemsets
- Done with multiple passes over the data.
- First pass: find all individual items that are large (have minimum support).
- Subsequent passes: using the large itemsets found in the previous pass,
- Generate candidate itemsets.
- Count support for each candidate itemset.
- Eliminate itemsets that do not have minimum support.
12. Algorithm
- L1 = {large 1-itemsets}
- for (k = 2; Lk-1 ≠ ∅; k++) do begin
-   Ck = apriori-gen(Lk-1) // New candidates
-   forall transactions t ∈ D do // Counting support
-     Ct = subset(Ck, t) // Candidates contained in t
-     forall candidates c ∈ Ct do
-       c.count++
-   end
-   Lk = {c ∈ Ck | c.count ≥ minsup}
- end
- Answer = ∪k Lk
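The same loop as runnable Python. This is a sketch, not the paper's exact code: itemsets are sorted tuples, minsup is an absolute count, and `apriori_gen` and `subset` stand for the routines sketched on the following slides:

from collections import defaultdict

def apriori(transactions, minsup, apriori_gen, subset):
    """Level-wise search: L1 from single items, then alternate candidate
    generation and support counting until no large itemsets remain."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[(item,)] += 1
    L = [{c for c, n in counts.items() if n >= minsup}]   # L1
    while L[-1]:
        Ck = apriori_gen(L[-1])                 # new candidates
        counts = defaultdict(int)
        for t in transactions:
            for c in subset(Ck, t):             # candidates contained in t
                counts[c] += 1
        L.append({c for c in Ck if counts[c] >= minsup})
    return set().union(*L)                      # answer = union of all Lk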
13. Candidate Generation
- AIS and SETM algorithms
- Use the transactions in the database to generate new candidates.
- But this generates a lot of candidates which we know beforehand are not large!
14. Apriori Algorithms
- Generate candidates using only the large itemsets found in the previous pass, without considering the database.
- Intuition
- Any subset of a large itemset must be large.
15. Apriori Candidate Generation
- Takes in Lk-1 and returns Ck.
- Two steps
- Join large itemsets Lk-1 with Lk-1.
- Prune out all itemsets in the joined result which contain a (k-1)-subset not found in Lk-1.
16. Candidate Generation (Join)
- insert into Ck
- select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
- from Lk-1 p, Lk-1 q
- where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
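The join and prune steps together, as a Python sketch (itemsets as sorted tuples; the SQL join above corresponds to the `p[:-1] == q[:-1] and p[-1] < q[-1]` test):

from itertools import combinations

def apriori_gen(L_prev):
    """Join L(k-1) with itself, then prune candidates that have a
    (k-1)-subset not present in L(k-1)."""
    k = len(next(iter(L_prev))) + 1
    L_prev = set(L_prev)
    joined = {p + (q[-1],) for p in L_prev for q in L_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}      # join step
    # prune step: every (k-1)-subset must itself be large
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k - 1))}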
17. Candidate Gen. (Example)
- L3: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}
- Join → C4: {1 2 3 4}, {1 3 4 5}
- Prune → C4: {1 2 3 4}
- {1 3 4 5} is pruned because its 3-subset {1 4 5} is not in L3.
18. Counting Support
- Need to count the number of transactions which support a given itemset.
- For efficiency, use a hash-tree.
- Subset Function
19. Subset Function (Hash-tree)
- Candidate itemsets are stored in a hash-tree.
- A leaf node contains a list of itemsets.
- An interior node contains a hash table.
- Each bucket of the hash table points to another node.
- The root is at depth 1.
- Interior nodes at depth d point to nodes at depth d+1.
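A simplified hash-tree in Python. This is a sketch: the bucket count and leaf capacity are illustrative parameters, not values from the paper:

class HashTreeNode:
    """Leaf nodes hold a list of itemsets; interior nodes hold a hash
    table whose buckets point to child nodes."""
    def __init__(self):
        self.children = {}   # bucket -> child node (interior nodes)
        self.itemsets = []   # stored itemsets (leaf nodes)

LEAF_CAPACITY = 4            # illustrative split threshold
NBUCKETS = 7                 # illustrative hash-table size

def insert(node, itemset, depth=0):
    if node.children:                            # interior: hash and descend
        bucket = hash(itemset[depth]) % NBUCKETS
        child = node.children.setdefault(bucket, HashTreeNode())
        insert(child, itemset, depth + 1)
    else:                                        # leaf: store, split if full
        node.itemsets.append(itemset)
        if len(node.itemsets) > LEAF_CAPACITY and depth < len(itemset):
            pending, node.itemsets = node.itemsets, []
            for s in pending:
                b = hash(s[depth]) % NBUCKETS
                child = node.children.setdefault(b, HashTreeNode())
                insert(child, s, depth + 1)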
20. Hash-tree Example (1)
- (Diagram: hash-tree storing C3 = {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}, with the root at depth 1; interior nodes hash on items, and the leaves hold {2 3 4}, {1 2 3, 1 2 4}, and {1 3 4, 1 3 5}.)
21. Using the Hash-tree
- If we are at a leaf: find all itemsets contained in the transaction.
- If we are at an interior node: hash on each remaining element in the transaction.
- At the root node: hash on every element in the transaction.
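The subset function as a Python sketch over the tree above (the transaction is a sorted tuple; at each interior node we hash on every remaining item, matching the root and interior rules above). In the main loop of slide 12, the candidates in Ck would first be inserted into a tree, and probing the root with a transaction t then plays the role of subset(Ck, t):

def subset(node, transaction, start=0, found=None):
    """Collect the candidate itemsets stored in the hash-tree that are
    contained in `transaction` (a sorted tuple of items)."""
    if found is None:
        found = set()
    if not node.children:                        # leaf: check each itemset
        for s in node.itemsets:
            if all(i in transaction for i in s):
                found.add(s)
        return found
    for pos in range(start, len(transaction)):   # hash on remaining items
        bucket = hash(transaction[pos]) % NBUCKETS
        if bucket in node.children:
            subset(node.children[bucket], transaction, pos + 1, found)
    return found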
22. Hash-tree Example (2)
- (Diagram: probing the hash-tree with the transactions in D, e.g. {1 2 3 4}, {2 3 5}, and {2 3 4}, to find the candidate itemsets each transaction contains.)
23. AprioriTid (1)
- Does not use the transactions in the database for counting itemset support.
- Instead stores transactions as sets of possible large itemsets, Ck.
- Each member of Ck is of the form <TID, {Xk}>, where each Xk is a possible large k-itemset.
24. AprioriTid (2)
- Advantage of Ck
- If a transaction does not contain any candidate k-itemset, then it will have no entry in Ck.
- The number of entries in Ck may be less than the number of transactions in D.
- Especially for large k.
- Speeds up counting!
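A simplified sketch of one AprioriTid counting pass. It scans all of Ck for each stored entry rather than following the generation-time links the paper uses; candidates are sorted tuples, and each entry pairs a TID with the set of candidate (k-1)-itemsets that transaction contained on the previous pass:

from collections import defaultdict

def aprioritid_pass(Ck_bar_prev, Ck):
    """Count support from the stored entries instead of the raw database;
    entries with no surviving candidate are dropped, so the stored set
    shrinks as k grows."""
    counts = defaultdict(int)
    Ck_bar = []
    for tid, prev_sets in Ck_bar_prev:
        cur = set()
        for c in Ck:
            # c is in the transaction iff both of these (k-1)-subsets
            # were in the transaction on the previous pass
            if c[:-1] in prev_sets and c[:-2] + c[-1:] in prev_sets:
                cur.add(c)
                counts[c] += 1
        if cur:
            Ck_bar.append((tid, cur))
    return counts, Ck_bar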
25. AprioriTid (3)
- However
- For small k, each entry in Ck may be larger than its corresponding transaction.
- The usual space vs. time trade-off.
26. AprioriTid (4) Example
27. Observation
- When Ck does not fit in main memory, we see a large jump in execution time.
- AprioriTid beats Apriori only when Ck can fit in main memory.
28. AprioriHybrid
- It is not necessary to use the same algorithm for all the passes.
- Combine the two algorithms!
- Start with Apriori.
- When Ck can fit in main memory, switch to AprioriTid.
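A sketch of the switch decision. The paper estimates the size of the next pass's stored set from the candidates' support counts; the bytes-per-entry figure here is an assumption for illustration:

def should_switch(candidate_counts, memory_budget, entry_bytes=8):
    """Estimate the next pass's stored-set size as the sum of candidate
    support counts (each supporting transaction adds one entry) and
    switch to AprioriTid once the estimate fits in memory."""
    estimate = sum(candidate_counts.values()) * entry_bytes
    return estimate <= memory_budget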
29. Performance (1)
- Measured performance by running the algorithms on generated synthetic data.
- Used the following parameters:
30. Performance (2)
31. Performance (3)
32. Mining Sequential Patterns (1)
- Sequential patterns are ordered lists of itemsets.
- Market basket example
- Customers typically rent Star Wars, then The Empire Strikes Back, then Return of the Jedi.
- Fitted sheets and pillow cases, then comforter, then drapes and ruffles.
33. Mining Sequential Patterns (2)
- Looks at sequences of transactions as opposed to a single transaction.
- Groups transactions based on customer ID.
- Customer sequence.
34. Formal Definition (1)
- Given a database D of customer transactions.
- Each transaction consists of: customer id, transaction-time, items purchased.
- No customer has more than one transaction with the same transaction-time.
35. Formal Definition (2)
- Itemset i = (i1 i2 ... im), where each ij is an item.
- Sequence s = ⟨s1 s2 ... sn⟩, where each sj is an itemset.
- Sequence ⟨a1 a2 ... an⟩ is contained in ⟨b1 b2 ... bm⟩ if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
- A sequence s is maximal if it is not contained in any other sequence.
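Containment can be tested greedily; a Python sketch with sequences as lists of itemsets (each itemset an iterable of items):

def contains(big, small):
    """True if `small` is contained in `big`: each itemset of small is
    a subset of a distinct itemset of big, in the same order."""
    i = 0
    for a in small:
        while i < len(big) and not set(a) <= set(big[i]):
            i += 1
        if i == len(big):
            return False
        i += 1               # later itemsets must match strictly later
    return True

# e.g. contains([(7,), (3, 8), (9,), (4, 5, 6), (8,)], [(3,), (4, 5), (8,)]) is True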
36. Formal Definition (3)
- A customer supports a sequence s if s is contained in the customer sequence for this customer.
- Support of a sequence: % of customers who support the sequence.
- For mining association rules, support was % of transactions.
37. Formal Definition (4)
- Given a database D of customer transactions, find the maximal sequences among all sequences that have a certain user-specified minimum support.
- Sequences that have support above minsup are large sequences.
38. Algorithm: Sort Phase
- Customer ID: major key
- Transaction-time: minor key
- Converts the original transaction database into a database of customer sequences.
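A minimal sketch of the sort phase in Python (rows are (customer_id, transaction_time, items) tuples; the field layout is an assumption):

from itertools import groupby
from operator import itemgetter

def to_customer_sequences(rows):
    """Sort by customer id (major key) and transaction-time (minor key),
    then collect each customer's transactions into one sequence."""
    rows = sorted(rows, key=itemgetter(0, 1))
    return {cid: [items for _, _, items in grp]
            for cid, grp in groupby(rows, key=itemgetter(0))}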
39. Algorithm: Litemset Phase (1)
- Litemset Phase
- Find all large itemsets.
- Why?
- Because each itemset in a large sequence has to
be a large itemset.
40. Algorithm: Litemset Phase (2)
- To get all large itemsets we can use the Apriori algorithms discussed earlier.
- Need to modify support counting.
- For sequential patterns, support is measured by the fraction of customers.
41. Algorithm: Litemset Phase (3)
- Each large itemset is then mapped to a set of contiguous integers.
- Used to compare two large itemsets in constant time.
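A one-line sketch of that mapping (litemsets as sorted tuples; any fixed order of enumeration works):

def map_litemsets(large_itemsets):
    """Give each large itemset a contiguous integer id, so comparing two
    litemsets becomes a constant-time integer comparison."""
    return {ls: n for n, ls in enumerate(sorted(large_itemsets))}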
42. Algorithm: Transformation (1)
- Need to repeatedly determine which of a given set of large sequences are contained in a customer sequence.
- Represent transactions as sets of large itemsets.
- A customer sequence now becomes a list of sets of itemsets.
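A sketch of the transformation, reusing the id map above (transactions containing no large itemset are dropped, as are customers left with an empty sequence):

def transform(customer_sequences, litemset_ids):
    """Replace each transaction with the set of ids of the large
    itemsets it contains."""
    out = {}
    for cid, seq in customer_sequences.items():
        tseq = []
        for items in seq:
            ids = {n for ls, n in litemset_ids.items()
                   if set(ls) <= set(items)}
            if ids:
                tseq.append(ids)
        if tseq:
            out[cid] = tseq
    return out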
43. Algorithm: Transformation (2)
44. Algorithm: Sequence Phase (1)
- Use the set of large itemsets to find the desired sequences.
- Similar in structure to the Apriori algorithms used to find large itemsets.
- Use a seed set to generate candidate sequences.
- Count support for each candidate.
- Eliminate candidate sequences which are not large.
45. Algorithm: Sequence Phase (2)
- Two types of algorithms
- Count-all: counts all large sequences, including non-maximal sequences.
- AprioriAll
- Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first.
- AprioriSome
- DynamicSome
46. Algorithm: Maximal Phase (1)
- Find the maximal sequences among the set of large sequences.
- S: the set of all large subsequences.
47. Algorithm: Maximal Phase (2)
- Use a hash-tree to find all subsequences of sk in S.
- Similar to the subset function used in finding large itemsets.
- S is stored in a hash-tree.
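A sketch of the maximal phase as a plain longest-first scan (the paper stores S in a hash-tree to speed up the containment probes; `contains` is the test sketched after slide 35):

def maximal(sequences, contains):
    """Keep only sequences not contained in a longer large sequence."""
    kept = []
    for s in sorted(sequences, key=len, reverse=True):
        if not any(contains(t, s) for t in kept):
            kept.append(s)
    return kept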
48. AprioriAll (1)
49. AprioriAll (2)
- A hash-tree is used for counting.
- Candidate generation is similar to candidate generation in finding large itemsets.
- Except that order matters, and therefore we don't have the condition p.itemk-1 < q.itemk-1.
50. AprioriAll (3)
Example of candidate generation
51. AprioriAll (4)
52. Count-some Algorithms
- Try to avoid counting non-maximal sequences by counting longer sequences first.
- 2 phases
- Forward Phase: find all large sequences of certain lengths.
- Backward Phase: find all remaining large sequences.
53. AprioriSome (1)
- Determines which lengths to count using a next() function.
- next() takes in as a parameter the length of the sequence counted in the last pass.
- next(k) = k + 1 is the same as AprioriAll.
- Balances the tradeoff between
- Counting non-maximal sequences
- Counting extensions of small candidate sequences
54. AprioriSome (2)
- hitk = |Lk| / |Ck|
- Intuition: as hitk increases, the time wasted by counting extensions of small candidates decreases.
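A sketch of a next() built on hitk (the exact thresholds here are illustrative; the idea is just to jump further as the hit ratio grows):

def next_len(k, hit_k):
    """Pick the next sequence length to count: higher hit ratios mean
    fewer wasted extensions, so we can afford to skip more lengths."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    return k + 5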
55. AprioriSome (3)
56. AprioriSome (4)
- Backward Phase
- For all lengths which we skipped:
- Delete sequences in the candidate set which are contained in some large sequence.
- Count the remaining candidates and find all sequences with minimum support.
- Also delete large sequences found in the forward phase which are non-maximal.
57. AprioriSome (5)
58. AprioriSome (6)
- (Diagram: forward phase example with next(k) = 2k and minsup = 2, showing C3 and the large 3-sequences.)
59. AprioriSome (7)
- (Diagram: backward phase of the same example, showing C3 and the large 3-sequences.)
60. Performance (1)
- Used generated datasets again.
- Parameters for the data:
61. Performance (2)
- DynamicSome generates too many candidates.
- AprioriSome does a little better than AprioriAll.
- It avoids counting many non-maximal sequences.
62. Performance (3)
- The advantage of AprioriSome is reduced for 2 reasons:
- AprioriSome generates more candidates.
- Candidates remain memory resident even if skipped over.
- It cannot always follow the heuristic.
63. Wrap up
- Just presented two classic papers on data mining.
- Association Rules
- Sequential Patterns