Title: Lecture 2: Data Mining
1. Lecture 2: Data Mining
2. Roadmap
- What is data mining?
- Data Mining Tasks
  - Classification/Decision Tree
  - Clustering
  - Association Mining
- Data Mining Algorithms
  - Decision Tree Construction
  - Frequent 2-itemsets
  - Frequent Itemsets (Apriori)
  - Clustering/Collaborative Filtering
3. What is Data Mining?
- Discovery of useful, possibly unexpected, patterns in data.
- Subsidiary issues:
  - Data cleansing: detection of bogus data, e.g., age = 150.
  - Visualization: something better than megabyte files of output.
  - Warehousing of data (for retrieval).
4. Typical Kinds of Patterns
- Decision trees: succinct ways to classify by testing properties.
- Clusters: another succinct classification, by similarity of properties.
- Bayes, hidden-Markov, and other statistical models; frequent itemsets expose important associations within data.
5. Example: Clusters
[Scatter plot of points grouped into several clusters]
6. Applications (Among Many)
- Intelligence-gathering.
  - Total Information Awareness.
- Web analysis.
  - PageRank.
- Marketing.
  - Run a sale on diapers; raise the price of beer.
- Detective?
7. Cultures
- Databases: concentrate on large-scale (non-main-memory) data.
- AI (machine learning): concentrates on complex methods, small data.
- Statistics: concentrates on inferring models.
8. Models vs. Analytic Processing
- To a database person, data mining is a powerful form of analytic processing --- queries that examine large amounts of data.
  - Result is the data that answers the query.
- To a statistician, data mining is the inference of models.
  - Result is the parameters of the model.
9. Meaningfulness of Answers
- A big risk when data mining is that you will discover patterns that are meaningless.
- Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
10. Examples
- A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
- The Rhine Paradox: a great example of how not to conduct scientific research.
11. Rhine Paradox --- (1)
- David Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
- He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
- He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
12. Rhine Paradox --- (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
  - Answer on next slide.
13. Rhine Paradox --- (3)
- He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
14. Data Mining Tasks
- Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Prediction based on past history:
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
  - Predict if a pattern of phone calling card usage is likely to be fraudulent.
- Some examples of prediction mechanisms:
  - Classification
    - Given a new item whose class is unknown, predict to which class it belongs.
  - Regression formulae
    - Given a set of mappings for an unknown function, predict the function result for a new parameter value.
15. Data Mining (Cont.)
- Descriptive patterns
  - Associations
    - Find books that are often bought by similar customers. If a new such customer buys one such book, suggest the others too.
    - Associations may be used as a first step in detecting causation, e.g., association between exposure to chemical X and cancer.
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well.
    - Detection of clusters remains important in detecting epidemics.
16. Decision Trees
- Example:
  - Conducted a survey to see which customers were interested in a new model car.
  - Want to select customers for the advertising campaign; the survey results serve as the training set.
17. One Possibility
[Decision tree: the root tests age < 30; the Yes branch then tests city = sf and the No branch tests car = van; each test leads to "likely" / "unlikely" leaves]
18. Another Possibility
[Decision tree: the root tests car = taurus; the Yes branch then tests city = sf and the No branch tests age < 45; each test leads to "likely" / "unlikely" leaves]
19. Issues
- Decision tree cannot be too deep:
  - otherwise it would not have statistically significant amounts of data for the lower decisions.
- Need to select the tree that most reliably predicts outcomes.
20. Clustering
[Scatter plot of customers along the dimensions income, education, and age]
21. Another Example: Text
- Each document is a vector
  - e.g., <1 0 0 1 1 0 ...> contains words 1, 4, 5, ...
- Clusters contain similar documents
- Useful for understanding and searching documents
[Figure: document clusters labeled "sports", "international news", and "business"]
22. Issues
- Given desired number of clusters?
- Finding best clusters
- Are clusters semantically meaningful?
23. Association Rule Mining
[Table of sales records (market-basket data) with columns: transaction id, customer id, products bought]
- Trend: products p5, p8 are often bought together
- Trend: customer 12 likes product p9
24. Association Rule
- Rule: {p1, p3, p8}
- Support: the number of baskets where these products appear together
- High-support set: support >= threshold s
- Problem: find all high-support sets
25. Finding High-Support Pairs
- Baskets(basket, item)
- SELECT I.item, J.item, COUNT(I.basket)
  FROM Baskets I, Baskets J
  WHERE I.basket = J.basket AND I.item < J.item
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s
26. Example
27. Issues
- Performance for size-2 rules: the self-join is big.
- Performance for size-k rules: even bigger!
28. Roadmap
- What is data mining?
- Data Mining Tasks
  - Classification/Decision Tree
  - Clustering
  - Association Mining
- Data Mining Algorithms
  - Decision Tree Construction
  - Frequent 2-itemsets
  - Frequent Itemsets (Apriori)
  - Clustering/Collaborative Filtering
29. Classification Rules
- Classification rules help assign new objects to classes.
  - E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
- Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income >= 25,000 and P.income <= 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be shown compactly as a decision tree.
30. Decision Tree
31. Decision Tree Construction
[Decision tree figure: the root tests Employed (Yes/No); one branch is a leaf with Class = Not Default, the other leads to a node testing Balance (>= 50K / < 50K), which in turn leads to a leaf with Class = Yes (Default) and to a node testing Age (> 45 / <= 45) whose leaves are Class = Not Default and Class = Yes (Default). The figure labels the Root, an internal Node, and a Leaf.]
32. Construction of Decision Trees
- Training set: a data sample in which the classification is already known.
- Greedy top-down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute and a partitioning condition for the node.
  - Leaf node:
    - all (or most) of the items at the node belong to the same class, or
    - all attributes have been considered, and no further partitioning is possible.
33. Finding the Best Split Point for Numerical Attributes
[Figure: candidate split values for a numerical attribute with the best split point marked; the data comes from an IBM Quest synthetic dataset for function 0]
- In-core algorithms, such as C4.5, will simply sort the numerical attribute values online!
34. Best Splits
- Pick the best attributes and conditions on which to partition.
- The purity of a set S of training instances can be measured quantitatively in several ways.
  - Notation: number of classes k, number of instances |S|, fraction of instances in class i is p_i.
- The Gini measure of purity is defined as
  - Gini(S) = 1 - Σ_i p_i^2
- When all instances are in a single class, the Gini value is 0.
- It reaches its maximum (of 1 - 1/k) if each class has the same number of instances.
35. Best Splits (Cont.)
- Another measure of purity is the entropy measure, which is defined as
  - entropy(S) = - Σ_i p_i log2 p_i
- When a set S is split into multiple sets S_i, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as
  - purity(S1, S2, ..., Sr) = Σ_{i=1..r} (|S_i| / |S|) purity(S_i)
- The information gain due to a particular split of S into S_i, i = 1, 2, ..., r:
  - Information-gain(S, {S1, S2, ..., Sr}) = purity(S) - purity(S1, S2, ..., Sr) (see the code sketch below)
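The purity measures above translate directly into code. A minimal sketch in Python, assuming class labels are given as plain lists (function names here are illustrative, not from the lecture):

    import math
    from collections import Counter

    def gini(labels):
        # Gini(S) = 1 - sum_i p_i^2 over the class fractions p_i
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        # entropy(S) = - sum_i p_i log2 p_i
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, parts, purity=gini):
        # purity(S) minus the size-weighted purity of the parts S1..Sr
        n = len(labels)
        return purity(labels) - sum(len(p) / n * purity(p) for p in parts)

    # Example: splitting a mixed set into two pure halves gives gain 0.5 (Gini).
    S = ["yes", "yes", "no", "no"]
    print(information_gain(S, [["yes", "yes"], ["no", "no"]]))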
36. Finding Best Splits
- Categorical attributes (with no meaningful order):
  - Multi-way split: one child for each value.
  - Binary split: try all possible breakups of the values into two sets, and pick the best.
- Continuous-valued attributes (can be sorted in a meaningful order):
  - Binary split:
    - Sort values, try each as a split point.
    - E.g., if values are 1, 10, 15, 25, split at <= 1, <= 10, <= 15.
    - Pick the value that gives the best split (see the sketch below).
  - Multi-way split:
    - A series of binary splits on the same attribute has a roughly equivalent effect.
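For continuous-valued attributes, the sort-and-scan search for a binary split point can be sketched as follows. This reuses the gini and information_gain helpers from the sketch above; the repeated rescan is for clarity, not efficiency:

    def best_numeric_split(values, labels, purity=gini):
        # Sort the attribute values and try each distinct value v as the
        # split "x <= v"; return the split with the highest information gain.
        pairs = sorted(zip(values, labels))
        best_value, best_gain = None, -1.0
        for i in range(len(pairs) - 1):
            v = pairs[i][0]
            if v == pairs[i + 1][0]:
                continue  # only split between distinct values
            left = [lbl for x, lbl in pairs if x <= v]
            right = [lbl for x, lbl in pairs if x > v]
            gain = information_gain(labels, [left, right], purity)
            if gain > best_gain:
                best_value, best_gain = v, gain
        return best_value, best_gain

    # Example: ages vs. a "likely buyer" label; the best split is at age <= 28.
    print(best_numeric_split([22, 28, 41, 50], ["yes", "yes", "no", "no"]))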
37. Decision-Tree Construction Algorithm
- Procedure GrowTree(S)
    Partition(S)
- Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
38. Finding Association Rules
- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
- Naïve algorithm:
  - Consider all possible sets of relevant items.
  - For each set, find its support (i.e., count how many transactions purchase all items in the set).
  - Large itemsets: sets with sufficiently high support.
39. Example: Association Rules
- How do we perform rule mining efficiently?
- Observation: if set X has support t, then each subset of X must have support at least t.
- For 2-sets:
  - if we need support s for {i, j},
  - then each of i and j must appear in at least s baskets.
40. Algorithm for 2-Sets
- (1) Find OK products:
  - those appearing in s or more baskets.
- (2) Find high-support pairs using only OK products.
41. Algorithm for 2-Sets
- INSERT INTO okBaskets(basket, item)
  SELECT basket, item
  FROM Baskets
  WHERE item IN (SELECT item
                 FROM Baskets
                 GROUP BY item
                 HAVING COUNT(basket) >= s);
- Perform mining on okBaskets:
  SELECT I.item, J.item, COUNT(I.basket)
  FROM okBaskets I, okBaskets J
  WHERE I.basket = J.basket AND I.item < J.item
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s;
42. Counting Efficiently
[Worked example of counting pair occurrences, with support threshold = 3]
43. Counting Efficiently
[Worked example of counting pair occurrences, with support threshold = 3 (continued)]
44. Yet Another Way
[Worked example: hashing pairs into buckets of counts, threshold = 3; bucket counts may yield false positives]
45. Discussion
- Hashing scheme: 2 (or 3) scans of the data.
- Sorting scheme: requires a sort!
- Hashing works well if there are few high-support pairs and many low-support ones ("iceberg queries"); a sketch of the bucket-hashing idea follows.
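A minimal Python sketch of the bucket-hashing idea, assuming the baskets fit in a list of item lists and using a fixed number of hash buckets (a generic two-pass illustration of the scheme, not the exact procedure from the omitted example slides):

    from collections import Counter
    from itertools import combinations

    def frequent_pairs_hashed(baskets, s, n_buckets=1021):
        # Pass 1: hash every pair to a bucket and count the bucket.
        bucket = Counter()
        for b in baskets:
            for pair in combinations(sorted(set(b)), 2):
                bucket[hash(pair) % n_buckets] += 1
        # Pass 2: count exactly only pairs whose bucket reached the threshold.
        # Bucket counts can over-count (false positives) but never under-count.
        exact = Counter()
        for b in baskets:
            for pair in combinations(sorted(set(b)), 2):
                if bucket[hash(pair) % n_buckets] >= s:
                    exact[pair] += 1
        return {p: c for p, c in exact.items() if c >= s}

    baskets = [["A", "B", "E"], ["B", "D"], ["A", "B", "E"], ["A", "C"], ["B", "C"]]
    print(frequent_pairs_hashed(baskets, s=2))  # pairs AB, AE, BE each appear twice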
46. Frequent Itemsets Mining
TID Transactions
100 A, B, E
200 B, D
300 A, B, E
400 A, C
500 B, C
600 A, C
700 A, B
800 A, B, C, E
900 A, B, C
1000 A, C, E
- Desired frequency: 50% (support level)
  - Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
- Down-closure (apriori) property:
  - If an itemset is frequent, all of its subsets must also be frequent.
47. Lattice for Enumerating Frequent Itemsets
48. Apriori
- L0 = {}
- C1 = {1-item subsets of all the transactions}
- for (k = 1; Ck != {}; k++)
  - // support counting
  - for all transactions t ∈ D
    - for all k-subsets s of t
      - if s ∈ Ck then s.count++
  - // candidate generation
  - Lk = {c ∈ Ck | c.count >= minsup}
  - Ck+1 = apriori_gen(Lk)
- Answer = ∪k Lk (a Python sketch of this loop follows)
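A minimal, self-contained Python sketch of the same loop, using the transactions from the table above; apriori_gen is simplified here to a join of the frequent (k-1)-itemsets followed by down-closure pruning (an illustration, not an optimized implementation):

    from itertools import combinations

    def apriori(transactions, minsup):
        # Return every itemset whose support (count) is >= minsup.
        transactions = [frozenset(t) for t in transactions]
        candidates = {frozenset([i]) for t in transactions for i in t}  # C1
        answer, k = {}, 1
        while candidates:
            # support counting
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            frequent = {c: n for c, n in counts.items() if n >= minsup}  # Lk
            answer.update(frequent)
            # candidate generation: join Lk with itself, keep (k+1)-sets whose
            # k-subsets are all frequent (down-closure pruning)
            prev = set(frequent)
            joined = {a | b for a in prev for b in prev if len(a | b) == k + 1}
            candidates = {c for c in joined
                          if all(frozenset(s) in prev for s in combinations(c, k))}
            k += 1
        return answer

    # The 10 transactions from the table above; 50% support level -> minsup = 5.
    T = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
         {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]
    print(apriori(T, minsup=5))  # frequent: {A}, {B}, {C}, {A,B}, {A,C}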
49. Clustering
- Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.
- Can be formalized using distance metrics in several ways:
  - Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
    - Centroid: point defined by taking the average of the coordinates in each dimension.
  - Another metric: minimize the average distance between every pair of points in a cluster.
- Has been studied extensively in statistics, but on small data sets.
- Data mining systems aim at clustering techniques that can handle very large data sets.
  - E.g., the BIRCH clustering algorithm!
50. K-Means Clustering
51. K-means Clustering
- Partitional clustering approach.
- Each cluster is associated with a centroid (center point).
- Each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.
- The basic algorithm is very simple (see the sketch below).
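A minimal sketch of that basic algorithm for points in the plane (plain Python, Euclidean distance; taking the first K points as initial centroids is an illustrative choice, not part of the lecture):

    import math

    def kmeans(points, k, iters=100):
        # Repeat: assign each point to the nearest centroid, then recompute
        # each centroid as the mean of the points assigned to it.
        centroids = points[:k]  # naive initialization
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[j].append(p)
            new_centroids = [
                tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
                for i, cl in enumerate(clusters)
            ]
            if new_centroids == centroids:  # converged
                break
            centroids = new_centroids
        return centroids, clusters

    pts = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1.2), (8.5, 9.5)]
    print(kmeans(pts, k=2)[0])  # two centroids, one per visible cluster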
52. Hierarchical Clustering
- Example from biological classification
  - (the word classification here does not mean a prediction mechanism)
  - chordata → mammalia (leopards, humans), reptilia (snakes, crocodiles)
- Other examples: Internet directory systems (e.g., Yahoo!)
- Agglomerative clustering algorithms
  - Build small clusters, then cluster small clusters into bigger clusters, and so on (see the sketch below).
- Divisive clustering algorithms
  - Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.
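A minimal sketch of agglomerative (bottom-up) clustering for one-dimensional points, using single-linkage distance and stopping at k clusters (both choices are illustrative assumptions, not specified in the lecture):

    def agglomerative(points, k):
        # Start with each point in its own cluster; repeatedly merge the two
        # closest clusters (single linkage) until only k clusters remain.
        clusters = [[p] for p in points]
        while len(clusters) > k:
            best = None  # (distance, i, j)
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i].extend(clusters[j])  # merge the closest pair
            del clusters[j]
        return clusters

    print(agglomerative([1, 2, 3, 10, 11, 25], k=3))  # [[1, 2, 3], [10, 11], [25]]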
53. Collaborative Filtering
- Goal: predict what movies/books/... a person may be interested in, on the basis of
  - past preferences of the person,
  - other people with similar past preferences, and
  - the preferences of such people for a new movie/book/...
- One approach is based on repeated clustering:
  - Cluster people on the basis of their preferences for movies.
  - Then cluster movies on the basis of being liked by the same clusters of people.
  - Again cluster people based on their preferences for (the newly created clusters of) movies.
  - Repeat the above till equilibrium.
- The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest.
54. Other Types of Mining
- Text mining: application of data mining to textual documents.
  - Cluster Web pages to find related pages.
  - Cluster pages a user has visited to organize their visit history.
  - Classify Web pages automatically into a Web directory.
- Data visualization systems help users examine large volumes of data and detect patterns visually.
  - Can visually encode large amounts of information on a single screen.
  - Humans are very good at detecting visual patterns.
55. Data Streams
- What are data streams?
  - Continuous streams.
  - Huge, fast, and changing.
- Why data streams?
  - The arrival speed of streams and the huge amount of data are beyond our capability to store them.
  - Real-time processing.
- Window models:
  - Landmark window (the entire data stream)
  - Sliding window
  - Damped window
- Mining data streams
56. A Simple Problem
- Finding frequent items.
- Given a sequence (x1, ..., xN) where xi ∈ [1, m], and a real number θ between zero and one.
- Looking for the items xi whose frequency is > θ.
- Naïve algorithm: m counters.
  - The number of frequent items is <= 1/θ.
  - Problem: N >> m >> 1/θ.
57. KRP Algorithm - Karp et al. (TODS '03)
[Worked example on a stream with N = 30, m = 12, θ = 0.35: the algorithm keeps at most ⌈1/θ⌉ = 3 counters, and N / ⌈1/θ⌉ <= Nθ. A generic sketch of the counting scheme follows.]
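A generic Python sketch in the spirit of the KRP (Misra-Gries style) frequent-items scheme: keep fewer than ⌈1/θ⌉ counters, decrement all of them whenever an untracked item arrives and no counter slot is free, then verify the surviving candidates with a second pass. The exact bookkeeping in the lecture's figures may differ; this is an illustration of the idea only.

    import math
    from collections import Counter

    def frequent_item_candidates(stream, theta):
        # One pass; returns a superset of the items whose frequency exceeds
        # theta * N, using fewer than ceil(1/theta) counters.
        k = math.ceil(1 / theta)
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k - 1:
                counters[x] = 1
            else:
                # Decrement every counter (conceptually discarding k distinct
                # items, including x); drop counters that reach zero.
                counters = {y: c - 1 for y, c in counters.items() if c > 1}
        return set(counters)

    def frequent_items(stream, theta):
        # Second pass: keep only candidates whose exact count exceeds theta * N.
        candidates = frequent_item_candidates(stream, theta)
        exact = Counter(x for x in stream if x in candidates)
        return {x for x, c in exact.items() if c > theta * len(stream)}

    s = list("ABABABCADAAB")  # N = 12, "A" appears 6 times
    print(frequent_items(s, theta=0.4))  # {'A'}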
58. Enhance the Accuracy
[Worked example with m = 12, N = 30, θ = 0.35 (Nθ > 10), ⌈1/θ⌉ = 3; with ε = 0.5: θ(1 - ε) = 0.175 and ⌈1/(θε)⌉ = 6]
59. Frequent Items for Transactions with Fixed Length
- Each transaction has 2 items; θ = 0.60
- ⌈1/θ⌉ × 2 = 4