Title: Data Mining
1. Data Mining
2. Papers
- Rakesh Agrawal and Ramakrishnan Srikant
- Fast Algorithms for Mining Association Rules.
- Mining Sequential Patterns.
3. Outline
- For each paper
- Present the problem.
- Describe the algorithms.
- Intuition
- Design
- Performance.
4. Market Basket Introduction
- Retailers are able to collect massive amounts of sales data (basket data).
- Bar-code technology
- E-commerce
- Sales data generally includes customer id, transaction date, and items bought.
5. Market Basket Problem
- It would be useful to find association rules between transactions.
- e.g., 75% of the people who buy spaghetti also buy tomato sauce.
- Given a set of basket data, how can we efficiently find the set of association rules?
6. Formal Definition (1)
- L = {i1, i2, ..., im} is the set of items.
- Database D is a set of transactions.
- Transaction T is a set of items such that T ⊆ L.
- A unique identifier, TID, is associated with each transaction.
7. Formal Definition (2)
- T contains X, a set of some items in L, if X ⊆ T.
- Association rule: X ⇒ Y, where X ⊂ L, Y ⊂ L, X ∩ Y = ∅.
- Confidence: % of transactions which contain X that also contain Y.
- Support: % of transactions in D which contain X ∪ Y.
8. Formal Definition (3)
- Given a set of transactions D, we want to
generate all association rules that have support
and confidence greater than the user-specified
minimum support (minsup) and minimum confidence
(minconf).
9. Problem Decomposition
- Two sub-problems
- Find all itemsets that have transaction support above minsup.
- These itemsets are called large itemsets.
- From all the large itemsets, generate the set of association rules that have confidence above minconf.
10. Second Sub-problem
- Straightforward approach
- For every large itemset l, find all non-empty subsets of l.
- For every such subset a, output a rule of the form a ⇒ (l − a) if the ratio of support(l) to support(a) is at least minconf.
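This subset-enumeration approach is easy to sketch in code. A minimal Python illustration, assuming itemsets are frozensets and `support` is a precomputed map from itemset to support count (every non-empty subset of a large itemset is itself large, so the lookups succeed):

from itertools import combinations

def generate_rules(large_itemsets, support, minconf):
    """For each large itemset l, emit a => (l - a) for every non-empty
    proper subset a whose confidence support(l)/support(a) meets minconf."""
    rules = []
    for l in large_itemsets:
        for r in range(1, len(l)):                # all non-empty proper subsets
            for a in combinations(sorted(l), r):
                a = frozenset(a)
                conf = support[l] / support[a]    # confidence of a => (l - a)
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules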
11. Discovering Large Itemsets
- Done with multiple passes over the data.
- First pass: find all individual items that are large (have minimum support).
- Subsequent passes: using the large itemsets found in the previous pass,
- Generate candidate itemsets.
- Count support for each candidate itemset.
- Eliminate itemsets that do not have minimum support.
12. Algorithm
- L1 = {large 1-itemsets}
- for (k = 2; Lk-1 ≠ ∅; k++) do begin
-   Ck = apriori-gen(Lk-1) // New candidates
-   forall transactions t ∈ D do // Counting support
-     Ct = subset(Ck, t) // Candidates contained in t
-     forall candidates c ∈ Ct do
-       c.count++
-   end
-   Lk = {c ∈ Ck | c.count ≥ minsup}
- end
- Answer = ∪k Lk
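The same loop as runnable Python. This is a sketch, not the paper's exact code: itemsets are sorted tuples, minsup is an absolute count, and `apriori_gen` and `subset` stand for the routines sketched on the following slides:

from collections import defaultdict

def apriori(transactions, minsup, apriori_gen, subset):
    """Level-wise search: L1 from single items, then alternate candidate
    generation and support counting until no large itemsets remain."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[(item,)] += 1
    L = [{c for c, n in counts.items() if n >= minsup}]   # L1
    while L[-1]:
        Ck = apriori_gen(L[-1])                 # new candidates
        counts = defaultdict(int)
        for t in transactions:
            for c in subset(Ck, t):             # candidates contained in t
                counts[c] += 1
        L.append({c for c in Ck if counts[c] >= minsup})
    return set().union(*L)                      # answer = union of all Lk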
13. Candidate Generation
- AIS and SETM algorithms
- Use the transactions in the database to generate new candidates.
- But this generates a lot of candidates which we know beforehand are not large!
14. Apriori Algorithms
- Generate candidates using only the large itemsets found in the previous pass, without considering the database.
- Intuition
- Any subset of a large itemset must be large.
15. Apriori Candidate Generation
- Takes in Lk-1 and returns Ck.
- Two steps
- Join large itemsets Lk-1 with Lk-1.
- Prune out all itemsets in the joined result which contain a (k-1)-subset not found in Lk-1.
16. Candidate Generation (Join)
- insert into Ck
- select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
- from Lk-1 p, Lk-1 q
- where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
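The join and prune steps together, as a Python sketch (itemsets as sorted tuples; the SQL join above corresponds to the `p[:-1] == q[:-1] and p[-1] < q[-1]` test):

from itertools import combinations

def apriori_gen(L_prev):
    """Join L(k-1) with itself, then prune candidates that have a
    (k-1)-subset not present in L(k-1)."""
    k = len(next(iter(L_prev))) + 1
    L_prev = set(L_prev)
    joined = {p + (q[-1],) for p in L_prev for q in L_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}      # join step
    # prune step: every (k-1)-subset must itself be large
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k - 1))}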
17. Candidate Gen. (Example)
- L3: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}
- Join → C4: {1 2 3 4}, {1 3 4 5}
- Prune → C4: {1 2 3 4}
- {1 3 4 5} is pruned because its 3-subset {1 4 5} is not in L3.
18. Counting Support
- Need to count the number of transactions which support a given itemset.
- For efficiency, use a hash-tree.
- Subset Function
19. Subset Function (Hash-tree)
- Candidate itemsets are stored in a hash-tree.
- A leaf node contains a list of itemsets.
- An interior node contains a hash table.
- Each bucket of the hash table points to another node.
- The root is at depth 1.
- Interior nodes at depth d point to nodes at depth d+1.
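A simplified hash-tree in Python. This is a sketch: the bucket count and leaf capacity are illustrative parameters, not values from the paper:

class HashTreeNode:
    """Leaf nodes hold a list of itemsets; interior nodes hold a hash
    table whose buckets point to child nodes."""
    def __init__(self):
        self.children = {}   # bucket -> child node (interior nodes)
        self.itemsets = []   # stored itemsets (leaf nodes)

LEAF_CAPACITY = 4            # illustrative split threshold
NBUCKETS = 7                 # illustrative hash-table size

def insert(node, itemset, depth=0):
    if node.children:                            # interior: hash and descend
        bucket = hash(itemset[depth]) % NBUCKETS
        child = node.children.setdefault(bucket, HashTreeNode())
        insert(child, itemset, depth + 1)
    else:                                        # leaf: store, split if full
        node.itemsets.append(itemset)
        if len(node.itemsets) > LEAF_CAPACITY and depth < len(itemset):
            pending, node.itemsets = node.itemsets, []
            for s in pending:
                b = hash(s[depth]) % NBUCKETS
                child = node.children.setdefault(b, HashTreeNode())
                insert(child, s, depth + 1)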
20. Hash-tree Example (1)
- (Diagram: hash-tree storing C3 = {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}, with the root at depth 1; interior nodes hash on items, and the leaves hold {2 3 4}, {1 2 3, 1 2 4}, and {1 3 4, 1 3 5}.)
21. Using the Hash-tree
- If we are at a leaf: find all itemsets contained in the transaction.
- If we are at an interior node: hash on each remaining element in the transaction.
- At the root node: hash on every element in the transaction.
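The subset function as a Python sketch over the tree above (the transaction is a sorted tuple; at each interior node we hash on every remaining item, matching the root and interior rules above). In the main loop of slide 12, the candidates in Ck would first be inserted into a tree, and probing the root with a transaction t then plays the role of subset(Ck, t):

def subset(node, transaction, start=0, found=None):
    """Collect the candidate itemsets stored in the hash-tree that are
    contained in `transaction` (a sorted tuple of items)."""
    if found is None:
        found = set()
    if not node.children:                        # leaf: check each itemset
        for s in node.itemsets:
            if all(i in transaction for i in s):
                found.add(s)
        return found
    for pos in range(start, len(transaction)):   # hash on remaining items
        bucket = hash(transaction[pos]) % NBUCKETS
        if bucket in node.children:
            subset(node.children[bucket], transaction, pos + 1, found)
    return found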
22. Hash-tree Example (2)
- (Diagram: probing the hash-tree with the transactions in D, e.g. {1 2 3 4}, {2 3 5}, and {2 3 4}, to find the candidate itemsets each transaction contains.)
23. AprioriTid (1)
- Does not use the transactions in the database for counting itemset support.
- Instead stores transactions as sets of possible large itemsets, Ck.
- Each member of Ck is of the form <TID, {Xk}>, where each Xk is a possible large k-itemset.
24. AprioriTid (2)
- Advantage of Ck
- If a transaction does not contain any candidate k-itemset, then it will have no entry in Ck.
- The number of entries in Ck may be less than the number of transactions in D.
- Especially for large k.
- Speeds up counting!
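A simplified sketch of one AprioriTid counting pass. It scans all of Ck for each stored entry rather than following the generation-time links the paper uses; candidates are sorted tuples, and each entry pairs a TID with the set of candidate (k-1)-itemsets that transaction contained on the previous pass:

from collections import defaultdict

def aprioritid_pass(Ck_bar_prev, Ck):
    """Count support from the stored entries instead of the raw database;
    entries with no surviving candidate are dropped, so the stored set
    shrinks as k grows."""
    counts = defaultdict(int)
    Ck_bar = []
    for tid, prev_sets in Ck_bar_prev:
        cur = set()
        for c in Ck:
            # c is in the transaction iff both of these (k-1)-subsets
            # were in the transaction on the previous pass
            if c[:-1] in prev_sets and c[:-2] + c[-1:] in prev_sets:
                cur.add(c)
                counts[c] += 1
        if cur:
            Ck_bar.append((tid, cur))
    return counts, Ck_bar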
25. AprioriTid (3)
- However
- For small k, each entry in Ck may be larger than its corresponding transaction.
- The usual space vs. time trade-off.
26. AprioriTid (4) Example
27. Observation
- When Ck does not fit in main memory, we see a large jump in execution time.
- AprioriTid beats Apriori only when Ck can fit in main memory.
28. AprioriHybrid
- It is not necessary to use the same algorithm for all the passes.
- Combine the two algorithms!
- Start with Apriori.
- When Ck can fit in main memory, switch to AprioriTid.
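A sketch of the switch decision. The paper estimates the size of the next pass's stored set from the candidates' support counts; the bytes-per-entry figure here is an assumption for illustration:

def should_switch(candidate_counts, memory_budget, entry_bytes=8):
    """Estimate the next pass's stored-set size as the sum of candidate
    support counts (each supporting transaction adds one entry) and
    switch to AprioriTid once the estimate fits in memory."""
    estimate = sum(candidate_counts.values()) * entry_bytes
    return estimate <= memory_budget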
29. Performance (1)
- Measured performance by running the algorithms on generated synthetic data.
- Used the following parameters:
30. Performance (2)
31. Performance (3)
32. Mining Sequential Patterns (1)
- Sequential patterns are ordered lists of itemsets.
- Market basket example
- Customers typically rent Star Wars, then The Empire Strikes Back, then Return of the Jedi.
- Fitted sheets and pillow cases, then comforter, then drapes and ruffles.
33. Mining Sequential Patterns (2)
- Looks at sequences of transactions as opposed to a single transaction.
- Groups transactions based on customer ID.
- Customer sequence.
34. Formal Definition (1)
- Given a database D of customer transactions.
- Each transaction consists of: customer id, transaction-time, items purchased.
- No customer has more than one transaction with the same transaction-time.
35. Formal Definition (2)
- Itemset i = (i1 i2 ... im), where each ij is an item.
- Sequence s = ⟨s1 s2 ... sn⟩, where each sj is an itemset.
- Sequence ⟨a1 a2 ... an⟩ is contained in ⟨b1 b2 ... bm⟩ if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
- A sequence s is maximal if it is not contained in any other sequence.
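Containment can be tested greedily; a Python sketch with sequences as lists of itemsets (each itemset an iterable of items):

def contains(big, small):
    """True if `small` is contained in `big`: each itemset of small is
    a subset of a distinct itemset of big, in the same order."""
    i = 0
    for a in small:
        while i < len(big) and not set(a) <= set(big[i]):
            i += 1
        if i == len(big):
            return False
        i += 1               # later itemsets must match strictly later
    return True

# e.g. contains([(7,), (3, 8), (9,), (4, 5, 6), (8,)], [(3,), (4, 5), (8,)]) is True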
36. Formal Definition (3)
- A customer supports a sequence s if s is contained in the customer sequence for this customer.
- Support of a sequence: % of customers who support the sequence.
- For mining association rules, support was % of transactions.
37. Formal Definition (4)
- Given a database D of customer transactions, find the maximal sequences among all sequences that have a certain user-specified minimum support.
- Sequences that have support above minsup are large sequences.
38. Algorithm: Sort Phase
- Customer ID: major key
- Transaction-time: minor key
- Converts the original transaction database into a database of customer sequences.
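A minimal sketch of the sort phase in Python (rows are (customer_id, transaction_time, items) tuples; the field layout is an assumption):

from itertools import groupby
from operator import itemgetter

def to_customer_sequences(rows):
    """Sort by customer id (major key) and transaction-time (minor key),
    then collect each customer's transactions into one sequence."""
    rows = sorted(rows, key=itemgetter(0, 1))
    return {cid: [items for _, _, items in grp]
            for cid, grp in groupby(rows, key=itemgetter(0))}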
39. Algorithm: Litemset Phase (1)
- Litemset Phase
- Find all large itemsets.
- Why?
- Because each itemset in a large sequence has to
be a large itemset.
40. Algorithm: Litemset Phase (2)
- To get all large itemsets we can use the Apriori algorithms discussed earlier.
- Need to modify support counting.
- For sequential patterns, support is measured by the fraction of customers.
41. Algorithm: Litemset Phase (3)
- Each large itemset is then mapped to a set of contiguous integers.
- Used to compare two large itemsets in constant time.
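A one-line sketch of that mapping (litemsets as sorted tuples; any fixed order of enumeration works):

def map_litemsets(large_itemsets):
    """Give each large itemset a contiguous integer id, so comparing two
    litemsets becomes a constant-time integer comparison."""
    return {ls: n for n, ls in enumerate(sorted(large_itemsets))}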
42. Algorithm: Transformation (1)
- Need to repeatedly determine which of a given set of large sequences are contained in a customer sequence.
- Represent transactions as sets of large itemsets.
- A customer sequence now becomes a list of sets of itemsets.
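A sketch of the transformation, reusing the id map above (transactions containing no large itemset are dropped, as are customers left with an empty sequence):

def transform(customer_sequences, litemset_ids):
    """Replace each transaction with the set of ids of the large
    itemsets it contains."""
    out = {}
    for cid, seq in customer_sequences.items():
        tseq = []
        for items in seq:
            ids = {n for ls, n in litemset_ids.items()
                   if set(ls) <= set(items)}
            if ids:
                tseq.append(ids)
        if tseq:
            out[cid] = tseq
    return out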
43. Algorithm: Transformation (2)
44. Algorithm: Sequence Phase (1)
- Use the set of large itemsets to find the desired sequences.
- Similar in structure to the Apriori algorithms used to find large itemsets.
- Use a seed set to generate candidate sequences.
- Count support for each candidate.
- Eliminate candidate sequences which are not large.
45. Algorithm: Sequence Phase (2)
- Two types of algorithms
- Count-all: counts all large sequences, including non-maximal sequences.
- AprioriAll
- Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first.
- AprioriSome
- DynamicSome
46. Algorithm: Maximal Phase (1)
- Find the maximal sequences among the set of large sequences.
- S: the set of all large subsequences.
47. Algorithm: Maximal Phase (2)
- Use a hash-tree to find all subsequences of sk in S.
- Similar to the subset function used in finding large itemsets.
- S is stored in a hash-tree.
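A sketch of the maximal phase as a plain longest-first scan (the paper stores S in a hash-tree to speed up the containment probes; `contains` is the test sketched after slide 35):

def maximal(sequences, contains):
    """Keep only sequences not contained in a longer large sequence."""
    kept = []
    for s in sorted(sequences, key=len, reverse=True):
        if not any(contains(t, s) for t in kept):
            kept.append(s)
    return kept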
48. AprioriAll (1)
49. AprioriAll (2)
- A hash-tree is used for counting.
- Candidate generation is similar to candidate generation in finding large itemsets.
- Except that order matters, and therefore we don't have the condition p.itemk-1 < q.itemk-1.
50. AprioriAll (3)
Example of candidate generation
51. AprioriAll (4)
52. Count-some Algorithms
- Try to avoid counting non-maximal sequences by counting longer sequences first.
- 2 phases
- Forward Phase: find all large sequences of certain lengths.
- Backward Phase: find all remaining large sequences.
53. AprioriSome (1)
- Determines which lengths to count using a next() function.
- next() takes in as a parameter the length of the sequence counted in the last pass.
- next(k) = k + 1 is the same as AprioriAll.
- Balances the tradeoff between
- Counting non-maximal sequences
- Counting extensions of small candidate sequences
54. AprioriSome (2)
- hitk = |Lk| / |Ck|
- Intuition: as hitk increases, the time wasted by counting extensions of small candidates decreases.
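A sketch of a next() built on hitk (the exact thresholds here are illustrative; the idea is just to jump further as the hit ratio grows):

def next_len(k, hit_k):
    """Pick the next sequence length to count: higher hit ratios mean
    fewer wasted extensions, so we can afford to skip more lengths."""
    if hit_k < 0.666:
        return k + 1
    elif hit_k < 0.75:
        return k + 2
    elif hit_k < 0.80:
        return k + 3
    elif hit_k < 0.85:
        return k + 4
    return k + 5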
55. AprioriSome (3)
56. AprioriSome (4)
- Backward Phase
- For all lengths which we skipped:
- Delete sequences in the candidate set which are contained in some large sequence.
- Count the remaining candidates and find all sequences with minimum support.
- Also delete large sequences found in the forward phase which are non-maximal.
57. AprioriSome (5)
58. AprioriSome (6)
- (Diagram: forward phase example with next(k) = 2k and minsup = 2, showing C3 and the large 3-sequences.)
59. AprioriSome (7)
- (Diagram: backward phase of the same example, showing C3 and the large 3-sequences.)
60. Performance (1)
- Used generated datasets again.
- Parameters for the data:
61. Performance (2)
- DynamicSome generates too many candidates.
- AprioriSome does a little better than AprioriAll.
- It avoids counting many non-maximal sequences.
62. Performance (3)
- The advantage of AprioriSome is reduced for 2 reasons:
- AprioriSome generates more candidates.
- Candidates remain memory resident even if skipped over.
- It cannot always follow the heuristic.
63. Wrap up
- Just presented two classic papers on data mining.
- Association Rules
- Sequential Patterns