Title: David Newman, UC Irvine Lecture 17: Pattern Finding 1
1 CS 277 Data Mining, Lecture 17: Pattern Discovery Algorithms
- David Newman
- Department of Computer Science
- University of California, Irvine
2 Notices
- Presentations (10 minutes)
- Thurs Dec 6th (5 students), usual time 5pm
- Thurs Dec 13th (5 students), note time 4pm
- Volunteers for presenting Dec 6th? Otherwise random
- Final project report due Thurs Dec 13th
- please bring a printed copy to class and email me a copy
- Will give instructions for presentations and final report
3 Project Presentations
- On Thursdays of the following two weeks, each student will make an in-class 10-minute presentation on their project (with 1 or 2 minutes for questions)
- Email me your PowerPoint or PDF slides, with your name (e.g., joesmith.ppt), before 3pm on the day you are presenting
- Suggested content
- Definition of the task/goal
- Description of data sets
- Description of algorithms
- Experimental results and conclusions
- Be visual where possible! (use figures, graphs, etc.)
4 Final Project Reports
- Must be submitted as an email attachment (PDF, Word, etc.) by 3pm on Thursday Dec 13
- Use "ICS 278 final project report" in the subject line of your email
- Report should be self-contained
- Like a short technical paper
- A reader should be able to repeat your results
- Include details in appendices if necessary
- Approximately 1 page of text per section (see next slide)
- graphs/plots don't count; include as many of these as you like
- Can re-use material from proposal and from midterm progress report if you wish
5 Suggested Outline of Final Project Report
- Introduction
- Clear description of the task/goals of the project
- Motivation: why is this problem interesting and/or important?
- Discussion of relevant literature
- Summarize relevant aspects of prior published/related work
- Technical approach
- Data used in your project
- Exploratory data analysis relevant to your task
- Include as many plots/graphs as you think are useful/relevant
- Algorithms used in your project
- Clear description of all algorithms used
- Credit appropriate sources if you used other implementations
- Experimental Results
- Clear description of your experimental methodology
- Detailed description of your results (graphs, tables, etc.)
6 Homework 3 review
- K-means
- SVD (LSI)
- NMF
- PLSI
7 K-means
- Classic, Euclidean version
- Euclidean objective Q
- E-Step (Expectation)
- set assignments r_dk s.t. Q is minimized
- M-Step (Maximization)
- recompute each cluster mean (one full iteration is sketched below)
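A minimal Matlab sketch of one iteration, assuming X is an N-by-P data matrix, C is a K-by-P matrix of current centroids, and implicit expansion (R2016b+) is available:

  % E-step: assign each point to its nearest centroid (minimizes Q given C)
  D2 = sum(X.^2, 2) + sum(C.^2, 2)' - 2*X*C';  % squared Euclidean distances, N-by-K
  [~, z] = min(D2, [], 2);                     % z(d) = k encodes r_dk = 1
  % M-step: recompute each centroid as the mean of its points (minimizes Q given r)
  for k = 1:size(C, 1)
    if any(z == k)
      C(k,:) = mean(X(z == k, :), 1);
    end
  end

Iterating these two steps until the assignments stop changing gives the convergence behavior discussed on the next slide.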
8 K-means
- Cosine similarity version
- Correct update for objective Q?
- E-Step
- set assignments r_dk s.t. Q is maximized
- M-Step
- ???
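The slide leaves the M-step as a question. One standard answer is the spherical k-means update, which maximizes Q = sum_d cos(x_d, c_{z_d}) by renormalizing each cluster's mean direction; a sketch, with X and C as in the previous slide's sketch:

  Xn = X ./ sqrt(sum(X.^2, 2));       % normalize rows of X to unit length
  S = Xn * C';                        % cosine similarities (rows of C unit length)
  [~, z] = max(S, [], 2);             % E-step: assign to most similar centroid
  for k = 1:size(C, 1)
    v = sum(Xn(z == k, :), 1);        % M-step: sum of unit vectors in cluster k
    C(k,:) = v / norm(v);             % renormalizing maximizes Q given assignments
  end

This is offered as the textbook spherical k-means update, not necessarily the exact answer intended on the slide.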
9 K-means
- Iterations till convergence
- 20, 1, 5, 15, 12, 10, 5-15
- Do a reality check on the topics printed
- You have 4 methods to find topics
- You should have some prior expectation that the different methods will produce similar topics
- therefore if you have a method that produces strange topics, take a second look
10 SVD
- [W, D] = size(X);
- [U, S, V] = svds(X, K);
- for k = 1:K
-   [xsort, isort] = sort(-abs(U(:,k)));   % sort word loadings by magnitude
-   fprintf('vector %d: ', k);
-   for i = 1:12
-     fprintf('%s ', word{isort(i)});      % word: cell array of vocabulary terms
-   end
-   fprintf('\n');
- end
11 SVD Time complexity
- Dense SVD of D x W matrix X
- Time ~ min(D*W^2, W*D^2)
- Google "matlab svd"
- svd uses LAPACK DGESVD
- Time complexity less if
- Sparse
- Computing only the K leading singular values / singular vectors
12 Matlab svd, svds
- svd
- uses LAPACK DGESVD
- svds
- uses eigs to find eigenvalues/eigenvectors of the augmented matrix [0 X; X' 0]
- eigs uses ARPACK (Arnoldi Package)
- ARPACK requires (a routine for) y = A*x
- y = A*x
- dense: O(n^2)
- sparse: O(n)
- Question: estimate the work for svds (a back-of-envelope sketch follows)
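One hedged back-of-envelope answer: assuming each Arnoldi iteration is dominated by one sparse mat-vec costing O(M) (M = number of nonzeros), K leading singular triplets with a modest number of restarts cost roughly O(M * K) per sweep. A toy timing sketch with a hypothetical sparse matrix:

  X = sprand(1e5, 1e4, 1e-4);         % hypothetical sparse matrix, ~1e5 nonzeros
  tic; [U, S, V] = svds(X, 10); toc   % K = 10; runtime scales roughly with nnz(X)*K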
13 Space Complexity
- Size parameters
- D documents
- W words
- N total words, M nonzero entries (M ~ N/2)
- L average words per document (L = N/D)
- K topics
- Data: many of you said space = D x W
- PubMed: D = 10^7, W = 10^5, 4-byte integers
- need 4 x 10^12 bytes = 4,000 GB of memory
- memory, not disk
- this is exceptionally large
- DON'T IGNORE SPARSENESS (a sparse-vs-dense sketch follows)
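A small sketch of why sparseness matters, at a toy scale (the PubMed numbers above would not fit in memory as a dense array):

  D = 1e4; W = 1e4;
  Xs = sprand(D, W, 1e-3);                          % ~1e5 nonzero counts
  s = whos('Xs'); fprintf('sparse: %.1f MB\n', s.bytes/2^20);
  % dense storage would need D*W*8 bytes = 800 MB even at this toy scale;
  % sparse storage is O(M): roughly 16 bytes per nonzero plus column pointers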
14 Space Complexity
- D documents
- W words
- N total words, M nonzero entries (M ~ N/2)
- L average words per document
- K topics
Parameters (typically): topics = K x W, mixes = D x K, total = K(D + W)
Data: N = D x L << D x W
Assumptions: K << W << D
Data + Parameters ~ D(L + K)
15 Complexity comparison
16 Complexity comparison
- no free lunch
17 PLSI
- Some of you did
- prob_d_w = zeros(D, W);
- this is not scalable
- D*W likely to be bigger than K*D*L
- Space for p(z | w, d): K*D*L (largest of all!!!)
- store it as K sparse W-by-D matrices instead:
- for k = 1:K
-   prob_z_given_w_d{k} = sparse(W, D);
- end
18 Complexity of operations on n x n matrix A
- Dense A
- y = A*x
- x = inv(A)*y
- A*x = lambda*x
- det(A)
- Sparse A
- y = A*x
- x = inv(A)*y
- A*x = lambda*x
- det(A)
- (a timing sketch for the y = A*x row follows)
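A quick sketch of the first row of each list (y = A*x), assuming the point of the slide is the dense O(n^2) vs sparse O(nnz) contrast; the remaining rows are left as on the slide:

  n = 4000;
  As = sprand(n, n, 1e-3);            % ~16,000 nonzeros
  Ad = full(As);                      % same matrix, n^2 stored entries
  x = randn(n, 1);
  tic; y1 = Ad*x; t_dense = toc;      % O(n^2) flops
  tic; y2 = As*x; t_sparse = toc;     % O(nnz(A)) flops
  fprintf('dense %.4g s, sparse %.4g s\n', t_dense, t_sparse);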
19 Lecture
- Pattern Discovery Algorithms
20 Pattern-Based Algorithms
- Global predictive and descriptive modeling
- global models in the sense that they cover all of the data space
- Patterns
- More local structure, only describe certain aspects of the data
- Examples
- A single small very dense cluster in input space
- e.g., a new type of galaxy in astronomy data
- An unusual set of outliers
- e.g., indications of an anomalous event in time-series climate data
- Associations or rules
- If bread is purchased and wine is purchased, then cheese is purchased with probability p
- Motif-finding in sequences, e.g.,
- motifs in DNA sequences: "noisy" words in a random background
21 General Ideas for Patterns
- Many patterns can be described in the general form
- if condition 1 then condition 2 (with some certainty)
- Probabilistic rules: If Age > 40 and education = college, then income > 50k with probability p
- Bumps
- If Age > 40 and education = college, then mean income = 73k
- if antecedent then consequent
- if A then B
- where A is generally some "box" in the input space
- where B is a statement about a variable of interest, e.g., p(y | A) or E[y | A]
- Pattern support
- Support = p(A) or p(A, B)
- Fraction of points in input space where the condition applies
- Often interested in patterns with larger support
22 How Interesting is a Pattern?
- Note: interestingness is inherently subjective
- Depends on what the data analyst already knows
- Difficult to quantify prior knowledge
- How interesting a pattern is can be a function of
- How surprising it is relative to prior knowledge
- How useful (actionable) it is
- This is a somewhat open research problem
- In general pattern interestingness is difficult to quantify
- ⇒ Use simple surrogate measures in practice
23 How Interesting is a Pattern?
- Interestingness of a pattern
- Measures how interesting the pattern A → B is
- Typical measures of interest
- Conditional probability: p(B | A)
- Change in probability: p(B | A) - p(B)
- Lift: p(B | A) / p(B) (also log of this)
- Change in mean target response, e.g., E[y | A] / E[y]
24 Pattern-Finding Algorithms
- Typically search a data set for the set of patterns that maximize some score function
- Usually a function of both support and interestingness
- E.g.,
- Association rules
- Bump-hunting
- Issues
- Huge combinatorial search space
- How many patterns to return to the user
- How to avoid problems with redundant patterns
- Statistical issues
- Even in random noise, if we search over a very large number of patterns, we are likely to find something that looks significant
- This is known as multiple hypothesis testing in statistics
- One approach that can help is to conduct randomization tests (sketched after this list)
- e.g., for matrix data, randomly permute the values in each column
- Run the pattern-discovery algorithm; the resulting scores provide a null distribution
- Ideally, also need a 2nd data set to validate patterns
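A minimal sketch of the column-permutation randomization test, where score_fn is a hypothetical handle to whatever pattern score your algorithm returns on a data matrix X:

  nperm = 100;
  null_scores = zeros(nperm, 1);
  for r = 1:nperm
    Xperm = X;
    for j = 1:size(X, 2)
      Xperm(:, j) = X(randperm(size(X, 1)), j);   % destroy inter-column structure
    end
    null_scores(r) = score_fn(Xperm);             % best score found in null data
  end
  % a pattern found in the real X is suspect unless its score clears
  % the upper quantiles of null_scores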
25 Transaction Data and Market Baskets
(figure: sparse transactions-by-items matrix, with an "x" marking each item bought in each transaction)
- Supermarket example (Srikant and Agrawal, 1997)
- #items = 50,000; #transactions = 1.5 million
- Data sets are typically very sparse
26 Market Basket Analysis
- given a (huge) transactions database
- each transaction representing the basket for 1 customer visit
- each transaction containing a set of items (itemset)
- finite set of (boolean) items (e.g. wine, cheese, diaper, beer, ...)
- Association rules
- classically used on supermarket transaction databases
- associations: Trader Joe's customers frequently buy wine + cheese
- rule: people who buy wine also buy cheese 60% of the time
- infamous "beer + diapers" example
- in evening hours, beer and diapers often purchased together
- generalize to many other problems, e.g.
- baskets = documents, items = words
- baskets = WWW pages, items = links
27 Market Basket Analysis Complexity
- usually transaction DB too huge to fit in RAM
- common sizes
- number of transactions: 10^5 to 10^8 (hundreds of millions)
- number of items: 10^2 to 10^6 (hundreds to millions)
- entire DB needs to be examined
- usually very sparse
- e.g. 0.1% chance of buying a random item
- subsampling often a useful trick in DM, but
- here, subsampling could easily miss the (rare) interesting patterns
- thus, runtime dominated by disk read times
- motivates focus on minimizing number of disk scans
28 Association Rules
From Ullman's data mining lectures: http://infolab.stanford.edu/~ullman/mining/2006/index.html
- Market Baskets
- Frequent Itemsets
- A-priori Algorithm
29 The Market-Basket Model
- A large set of items, e.g., things sold in a supermarket.
- A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.
30 Support
- Simplest question: find sets of items that appear "frequently" in the baskets.
- Support for itemset I = the number of baskets containing all items in I.
- Given a support threshold s, sets of items that appear in >= s baskets are called frequent itemsets.
31 Example
- Items = {milk, coke, pepsi, beer, juice}.
- Support threshold = 3 baskets.
- B1 = {m, c, b}   B2 = {m, p, j}
- B3 = {m, b}      B4 = {c, j}
- B5 = {m, p, b}   B6 = {m, c, b, j}
- B7 = {c, b, j}   B8 = {b, c}
- Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {c,b}, {j,c} (checked in the sketch below).
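The claim can be checked directly; a sketch with the baskets as a binary basket-by-item matrix (columns m, c, p, b, j):

  B = [1 1 0 1 0;   % B1
       1 0 1 0 1;   % B2
       1 0 0 1 0;   % B3
       0 1 0 0 1;   % B4
       1 0 1 1 0;   % B5
       1 1 0 1 1;   % B6
       0 1 0 1 1;   % B7
       0 1 0 1 0];  % B8
  sum(B, 1)         % singleton supports: m=5 c=5 p=2 b=6 j=4
  B' * B            % off-diagonal = pair supports; {m,b}=4, {c,b}=4, {c,j}=3 clear s=3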
32 Applications --- (1)
- Real market baskets: chain stores keep terabytes of information about what customers buy together.
- Tells how typical customers navigate stores, lets them position tempting items.
- Suggests tie-in "tricks," e.g., run a sale on diapers and raise the price of beer.
- High support needed, or no $'s.
33 Applications --- (2)
- Baskets = documents; items = words in those documents.
- Lets us find words that appear together unusually frequently, i.e., linked concepts.
- Baskets = sentences; items = documents containing those sentences.
- Items that appear together too often could represent plagiarism.
34 Applications --- (3)
- Baskets = Web pages; items = linked pages.
- Pairs of pages with many common references may be about the same topic.
- Baskets = Web pages p; items = pages that link to p.
- Pages with many of the same links may be mirrors or about the same topic.
35 Important Point
- "Market Baskets" is an abstraction that models any many-many relationship between two concepts: "items" and "baskets."
- Items need not be "contained" in baskets.
- The only difference is that we count co-occurrences of items related to a basket, not vice-versa.
36 Scale of Problem
- Wal-Mart sells 100,000 items and can store billions of baskets.
- The Web has over 100,000,000 words and billions of pages.
37 Association Rules
- If-then rules about the contents of baskets.
- {i1, i2, ..., ik} → j means: "if a basket contains all of i1, ..., ik then it is likely to contain j."
- Confidence of this association rule is the probability of j given i1, ..., ik.
38 Example
- B1 = {m, c, b}   B2 = {m, p, j}
- B3 = {m, b}      B4 = {c, j}
- B5 = {m, p, b}   B6 = {m, c, b, j}
- B7 = {c, b, j}   B8 = {b, c}
- An association rule: {m, b} → c.
- Confidence = 2/4 = 50%.
39 Interest
- The interest of an association rule X → Y is the absolute value of the amount by which the confidence differs from the probability of Y.
40 Example
- B1 = {m, c, b}   B2 = {m, p, j}
- B3 = {m, b}      B4 = {c, j}
- B5 = {m, p, b}   B6 = {m, c, b, j}
- B7 = {c, b, j}   B8 = {b, c}
- For association rule {m, b} → c, item c appears in 5/8 of the baskets.
- Interest = |2/4 - 5/8| = 1/8 --- not very interesting.
41 Relationships Among Measures
- Rules with high support and confidence may be useful even if they are not "interesting."
- We don't care if buying bread causes people to buy milk, or whether simply a lot of people buy both bread and milk.
- But high interest suggests a cause that might be worth investigating.
42 Finding Association Rules
- A typical question: find all association rules with support >= s and confidence >= c.
- Note: support of an association rule is the support of the set of items it mentions.
- Hard part: finding the high-support (frequent) itemsets.
- Checking the confidence of association rules involving those sets is relatively easy.
43 Association Rules Problem Definition
- given a set I of items and a set T of transactions, where each t ∈ T satisfies t ⊆ I
- Itemset Z: a set of items (any subset of I)
- support count σ(Z) = number of transactions containing Z
- given any itemset Z ⊆ I, σ(Z) = |{t : t ∈ T, Z ⊆ t}|
- association rule
- R: X → Y [s, c], with X, Y ⊆ I and X ∩ Y = ∅
- support
- s(R) = s(X ∪ Y) = σ(X ∪ Y) / |T| = p(X ∪ Y)
- confidence
- c(R) = s(X ∪ Y) / s(X) = σ(X ∪ Y) / σ(X) = p(Y | X)
- goal: find all R such that
- s(R) >= given minsup
- c(R) >= given minconf
- (s(R) and c(R) are computed directly in the sketch below)
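A direct transcription of these definitions into Matlab, assuming T is a binary |T|-by-|I| transactions matrix and X, Y are disjoint vectors of item indices:

  inX  = all(T(:, X), 2);          % transactions containing all of X
  inXY = all(T(:, [X Y]), 2);      % transactions containing all of X u Y
  s = mean(inXY);                  % s(R) = sigma(X u Y) / |T|
  c = sum(inXY) / sum(inX);        % c(R) = sigma(X u Y) / sigma(X) = p(Y | X)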
44 Comments on Association Rules
- association rule R: X → Y [s, c]
- Strictly speaking these are not rules
- i.e., we could have both wine → cheese and cheese → wine
- correlation is not causation
- The space of all possible rules is enormous
- O(2^p), where p = the number of different items
- Will need some form of combinatorial search algorithm
- How are thresholds minsup and minconf selected?
- Not that easy to know ahead of time how to select these
45 Example
- simple example transaction database (|T| = 4)
- Transaction1 = {A, B, C}
- Transaction2 = {A, C}
- Transaction3 = {A, D}
- Transaction4 = {B, E, F}
- with minsup = 50%, minconf = 50%
- R1: A → C [s = 50%, c = 66.6%]
- s(R1) = s({A,C}), c(R1) = s({A,C}) / s(A) = 2/3
- R2: C → A [s = 50%, c = 100%]
- s(R2) = s({A,C}), c(R2) = s({A,C}) / s(C) = 2/2
s(A) = 3/4 = 75%   s(B) = 2/4 = 50%   s(C) = 2/4 = 50%   s({A,C}) = 2/4 = 50%
46 Finding Association Rules
- two steps
- step 1: find all frequent itemsets (F)
- F = {Z : s(Z) >= minsup} (e.g. Z = {a,b,c,d,e})
- step 2: find all rules R: X → Y such that
- X ∪ Y ∈ F and X ∩ Y = ∅ (e.g. R: {a,b,c} → {d,e})
- s(R) >= minsup and c(R) >= minconf
- step 1's time-complexity is typically much greater than step 2's
- step 2 need not scan the data (s(X), s(Y) all cached in step 1)
- search space is exponential in |I|; step 1 filters choices for step 2
- so, most work focuses on fast frequent itemset generation
- step 1 never filters viable candidates for step 2
47 Finding Frequent Itemsets
- frequent itemsets = {Z : s(Z) >= minsup}
- Apriori (monotonicity) Principle: s(X) >= s(X ∪ Y)
- any subset of a frequent itemset must be frequent
- finding frequent itemsets
- bottom-up approach
- do level-wise, for k = 1 ... |I|
- k = 1: find frequent singletons
- k = 2: find frequent pairs (often most costly)
- step k.1: find size-k itemset candidates from the frequent size-(k-1) itemsets of the previous level
- step k.2: prune candidates Z for which s(Z) < minsup
- each level requires a single scan over all the transaction data
- computes support counts σ(Z) = |{t : t ∈ T, Z ⊆ t}| for all size-k candidates Z
- (a sketch of levels k = 1, 2 follows)
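A Matlab sketch of the first two levels, with T a binary transactions-by-items matrix and minsup an absolute count:

  cnt1 = sum(T, 1);                        % scan 1: singleton support counts
  F1 = find(cnt1 >= minsup);               % frequent items
  C2 = nchoosek(F1, 2);                    % candidates: pairs of frequent items only
  cnt2 = zeros(size(C2, 1), 1);
  for r = 1:size(C2, 1)                    % scan 2: count candidate pairs
    cnt2(r) = sum(T(:, C2(r,1)) & T(:, C2(r,2)));
  end
  F2 = C2(cnt2 >= minsup, :);              % frequent pairs (their counts are in cnt2)

The next slide traces exactly this process, plus the k = 3 level, on a small example.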
48 Apriori Example (minsup = 2)
- transactions T: {1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}
- C1 (count, scan T): {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
- F1 (filter): {1}:2, {2}:3, {3}:3, {5}:3
- C2 (gen): {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5} (counting C2 is the bottleneck)
- C2 (count, scan T): {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
- F2 (filter): {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
- C3 (gen): {2,3,5}
- apriori can avoid generating {1,2,3} (and {1,3,5}), without counting, because {1,2} ({1,5}) is not frequent
- C3 (count, scan T): {2,3,5}:2
- F3 (filter): {2,3,5}:2
- notice how much smaller C3 is thanks to the apriori pruning
49 Problems with Association Rules
- Consider 4 highly correlated items A, B, C, D
- Say p(subset_i | subset_j) > minconf for all possible pairs of disjoint subsets
- And p(subset_i ∪ subset_j) > minsup
- How many possible rules?
- E.g., A → B, {A,B} → C, {A,C} → B, {B,C} → A, ...
- All possible combinations: 4 x 2^3 = 32
- In general, for K such items, K x 2^(K-1) rules
- For highly correlated items there is a combinatorial explosion of redundant rules
- In practice this makes interpretation of association rule results difficult
50 Computation Model
- Typically, data is kept in a flat file rather than a database system.
- Stored on disk.
- Stored basket-by-basket.
- Expand baskets into pairs, triples, etc. as you read baskets.
51 File Organization
(figure: the flat file is laid out basket-by-basket: the items of Basket 1, then the items of Basket 2, then Basket 3, etc.)
52 Computation Model --- (2)
- The true cost of mining disk-resident data is usually the number of disk I/Os.
- In practice, association-rule algorithms read the data in passes --- all baskets read in turn.
- Thus, we measure the cost by the number of passes an algorithm takes.
53 Main-Memory Bottleneck
- For many frequent-itemset algorithms, main memory is the critical resource.
- As we read baskets, we need to count something, e.g., occurrences of pairs.
- The number of different things we can count is limited by main memory.
- Swapping counts in/out is a disaster.
54 Finding Frequent Pairs
- The hardest problem often turns out to be finding the frequent pairs.
- We'll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.
55 Naïve Algorithm
- Read file once, counting in main memory the occurrences of each pair.
- Expand each basket of n items into its n(n-1)/2 pairs.
- Fails if (#items)^2 exceeds main memory.
- Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
56 Details of Main-Memory Counting
- Two approaches:
- (1) Count all item pairs, using a triangular matrix.
- (2) Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c."
- (1) requires only (say) 4 bytes/pair.
- (2) requires 12 bytes, but only for those pairs with count > 0.
57 (figure comparing the two counting methods)
- Method (1): 4 bytes per pair
- Method (2): 12 bytes per occurring pair
58 Details of Approach 1
- Number items 1, 2, ...
- Keep pairs in the order {1,2}, {1,3}, ..., {1,n}, {2,3}, {2,4}, ..., {2,n}, {3,4}, ..., {3,n}, ..., {n-1,n}.
- Find pair {i, j} at the position (i - 1)(n - i/2) + j - i.
- Total number of pairs n(n-1)/2; total bytes about 2n^2.
- (the position formula is sanity-checked in the sketch below)
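The position formula can be checked in a couple of lines; n = 5 here is arbitrary:

  n = 5;
  pos = @(i, j) (i - 1)*(n - i/2) + j - i;   % valid for 1 <= i < j <= n
  [pos(1,2), pos(1,5), pos(2,3), pos(4,5)]   % = [1 4 5 10]; 10 = n*(n-1)/2

Each pair lands in a distinct slot of a length-n(n-1)/2 array, so a flat 4-bytes-per-entry array of counts suffices.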
59 Details of Approach 2
- You need a hash table, with i and j as the key, to locate (i, j, c) triples efficiently.
- Typically, the cost of the hash structure can be neglected.
- Total bytes used is about 12p, where p is the number of pairs that actually occur.
- Beats the triangular matrix if at most 1/3 of possible pairs actually occur (12p < 2n^2 when p is below about a third of the ~n^2/2 possible pairs).
60 A-Priori Algorithm --- (1)
- A two-pass approach called a-priori limits the need for main memory.
- Key idea (monotonicity): if a set of items appears at least s times, so does every subset.
- Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.
61 A-Priori Algorithm --- (2)
- Pass 1: Read baskets and count in main memory the occurrences of each item.
- Requires only memory proportional to #items.
- Pass 2: Read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to be frequent.
- Requires memory proportional to the square of #frequent items only.
62 Picture of A-Priori
(figure: main-memory layout for the two passes: Pass 1 holds the item counts; Pass 2 holds the frequent items plus counts of candidate pairs)
63 Detail for A-Priori
- You can use the triangular matrix method with n = number of frequent items.
- Saves space compared with storing triples.
- Trick: re-number frequent items 1, 2, ..., and keep a table relating new numbers to original item numbers.
64 Frequent Triples, Etc.
- For each k, we construct two sets of k-tuples:
- Ck = candidate k-tuples = those that might be frequent sets (support >= s) based on information from the pass for k-1.
- Lk = the set of truly frequent k-tuples.
65
(figure: C1 → filter → L1 → construct → C2 → filter → L2 → construct → C3; the first pass produces C1 and L1, the second pass produces C2 and L2)
66 A-Priori for All Frequent Itemsets
- One pass for each k.
- Needs room in main memory to count each candidate k-tuple.
- For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory.
67 Frequent Itemsets --- (2)
- C1 = all items
- L1 = those counted on the first pass to be frequent.
- C2 = pairs, both chosen from L1.
- In general, Ck = k-tuples, each k-1 of which is in Lk-1.
- Lk = members of Ck with support >= s.
- (the construct step for k = 3 is sketched below)
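A sketch of the construct step for k = 3: join frequent pairs (rows of F2, each sorted, as produced by the earlier level-wise sketch) that share their first item, then prune candidates whose remaining 2-subset is not frequent:

  C3 = zeros(0, 3);
  for a = 1:size(F2, 1)
    for b = a+1:size(F2, 1)
      if F2(a, 1) == F2(b, 1)                  % join: {i,j} + {i,k} -> {i,j,k}
        cand = [F2(a, :), F2(b, 2)];
        if ismember(cand(2:3), F2, 'rows')     % a-priori prune: {j,k} must be in L2
          C3 = [C3; cand];                     %#ok<AGROW> small candidate sets
        end
      end
    end
  end

With the F2 from the slide-48 example ({1,3}, {2,3}, {2,5}, {3,5}), this yields the single candidate {2,3,5}.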