Title: Lecture 2: Data Mining
1. Lecture 2: Data Mining
2. Roadmap
- What is data mining?
- Data Mining Tasks
  - Classification/Decision Tree
  - Clustering
  - Association Mining
- Data Mining Algorithms
  - Decision Tree Construction
  - Frequent 2-itemsets
  - Frequent Itemsets (Apriori)
  - Clustering/Collaborative Filtering
3. What is Data Mining?
- Discovery of useful, possibly unexpected, patterns in data.
- Subsidiary issues:
  - Data cleansing: detection of bogus data, e.g., age = 150.
  - Visualization: something better than megabyte files of output.
  - Warehousing of data (for retrieval).
4. Typical Kinds of Patterns
- Decision trees: succinct ways to classify by testing properties.
- Clusters: another succinct classification, by similarity of properties.
- Bayes, hidden-Markov, and other statistical models; frequent itemsets expose important associations within data.
5. Example: Clusters
[Scatter plot of points grouped into several clusters]
6. Applications (Among Many)
- Intelligence-gathering.
  - Total Information Awareness.
- Web analysis.
  - PageRank.
- Marketing.
  - Run a sale on diapers; raise the price of beer.
- Detective?
7. Cultures
- Databases: concentrate on large-scale (non-main-memory) data.
- AI (machine learning): concentrates on complex methods, small data.
- Statistics: concentrates on inferring models.
8. Models vs. Analytic Processing
- To a database person, data mining is a powerful form of analytic processing --- queries that examine large amounts of data.
  - Result is the data that answers the query.
- To a statistician, data mining is the inference of models.
  - Result is the parameters of the model.
9. Meaningfulness of Answers
- A big risk when data mining is that you will discover patterns that are meaningless.
- Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
10. Examples
- A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
- The Rhine Paradox: a great example of how not to conduct scientific research.
11. Rhine Paradox --- (1)
- David Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
- He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
- He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
12. Rhine Paradox --- (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
  - Answer on next slide.
13. Rhine Paradox --- (3)
- He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
14. Data Mining Tasks
- Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Prediction based on past history:
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
  - Predict if a pattern of phone calling card usage is likely to be fraudulent.
- Some examples of prediction mechanisms:
  - Classification
    - Given a new item whose class is unknown, predict to which class it belongs.
  - Regression formulae
    - Given a set of mappings for an unknown function, predict the function result for a new parameter value.
15. Data Mining (Cont.)
- Descriptive patterns
  - Associations
    - Find books that are often bought by similar customers. If a new such customer buys one such book, suggest the others too.
    - Associations may be used as a first step in detecting causation, e.g., association between exposure to chemical X and cancer.
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well.
    - Detection of clusters remains important in detecting epidemics.
16. Decision Trees
- Example:
  - Conducted a survey to see which customers were interested in a new model car.
  - Want to select customers for the advertising campaign; the survey results serve as the training set.
17. One Possibility
[Decision tree: the root tests age < 30; the Yes branch then tests city = sf and the No branch tests car = van; each test leads to "likely" / "unlikely" leaves]
18. Another Possibility
[Decision tree: the root tests car = taurus; the Yes branch then tests city = sf and the No branch tests age < 45; each test leads to "likely" / "unlikely" leaves]
19. Issues
- Decision tree cannot be too deep:
  - otherwise it would not have statistically significant amounts of data for the lower decisions.
- Need to select the tree that most reliably predicts outcomes.
20. Clustering
[Scatter plot of customers along the dimensions income, education, and age]
21. Another Example: Text
- Each document is a vector
  - e.g., <1 0 0 1 1 0 ...> contains words 1, 4, 5, ...
- Clusters contain similar documents
- Useful for understanding and searching documents
[Figure: document clusters labeled "sports", "international news", and "business"]
22. Issues
- Given desired number of clusters?
- Finding best clusters
- Are clusters semantically meaningful?
23. Association Rule Mining
[Table of sales records (market-basket data) with columns: transaction id, customer id, products bought]
- Trend: products p5, p8 are often bought together
- Trend: customer 12 likes product p9
24. Association Rule
- Rule: {p1, p3, p8}
- Support: the number of baskets where these products appear together
- High-support set: support >= threshold s
- Problem: find all high-support sets
25. Finding High-Support Pairs
- Baskets(basket, item)
- SELECT I.item, J.item, COUNT(I.basket)
  FROM Baskets I, Baskets J
  WHERE I.basket = J.basket AND I.item < J.item
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s
26. Example
27. Issues
- Performance for size-2 rules: the self-join is big.
- Performance for size-k rules: even bigger!
28. Roadmap
- What is data mining?
- Data Mining Tasks
  - Classification/Decision Tree
  - Clustering
  - Association Mining
- Data Mining Algorithms
  - Decision Tree Construction
  - Frequent 2-itemsets
  - Frequent Itemsets (Apriori)
  - Clustering/Collaborative Filtering
29. Classification Rules
- Classification rules help assign new objects to classes.
  - E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
- Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income >= 25,000 and P.income <= 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be shown compactly as a decision tree.
30. Decision Tree
31. Decision Tree Construction
[Decision tree figure: the root tests Employed (Yes/No); one branch is a leaf with Class = Not Default, the other leads to a node testing Balance (>= 50K / < 50K), which in turn leads to a leaf with Class = Yes (Default) and to a node testing Age (> 45 / <= 45) whose leaves are Class = Not Default and Class = Yes (Default). The figure labels the Root, an internal Node, and a Leaf.]
32. Construction of Decision Trees
- Training set: a data sample in which the classification is already known.
- Greedy top-down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute and a partitioning condition for the node.
  - Leaf node:
    - all (or most) of the items at the node belong to the same class, or
    - all attributes have been considered, and no further partitioning is possible.
33. Finding the Best Split Point for Numerical Attributes
[Figure: candidate split values for a numerical attribute with the best split point marked; the data comes from an IBM Quest synthetic dataset for function 0]
- In-core algorithms, such as C4.5, will simply sort the numerical attribute values online!
34. Best Splits
- Pick the best attributes and conditions on which to partition.
- The purity of a set S of training instances can be measured quantitatively in several ways.
  - Notation: number of classes k, number of instances |S|, fraction of instances in class i is p_i.
- The Gini measure of purity is defined as
  - Gini(S) = 1 - Σ_i p_i^2
- When all instances are in a single class, the Gini value is 0.
- It reaches its maximum (of 1 - 1/k) if each class has the same number of instances.
35. Best Splits (Cont.)
- Another measure of purity is the entropy measure, which is defined as
  - entropy(S) = - Σ_i p_i log2 p_i
- When a set S is split into multiple sets S_i, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as
  - purity(S1, S2, ..., Sr) = Σ_{i=1..r} (|S_i| / |S|) purity(S_i)
- The information gain due to a particular split of S into S_i, i = 1, 2, ..., r:
  - Information-gain(S, {S1, S2, ..., Sr}) = purity(S) - purity(S1, S2, ..., Sr) (see the code sketch below)
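The purity measures above translate directly into code. A minimal sketch in Python, assuming class labels are given as plain lists (function names here are illustrative, not from the lecture):

    import math
    from collections import Counter

    def gini(labels):
        # Gini(S) = 1 - sum_i p_i^2 over the class fractions p_i
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        # entropy(S) = - sum_i p_i log2 p_i
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, parts, purity=gini):
        # purity(S) minus the size-weighted purity of the parts S1..Sr
        n = len(labels)
        return purity(labels) - sum(len(p) / n * purity(p) for p in parts)

    # Example: splitting a mixed set into two pure halves gives gain 0.5 (Gini).
    S = ["yes", "yes", "no", "no"]
    print(information_gain(S, [["yes", "yes"], ["no", "no"]]))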
36. Finding Best Splits
- Categorical attributes (with no meaningful order):
  - Multi-way split: one child for each value.
  - Binary split: try all possible breakups of the values into two sets, and pick the best.
- Continuous-valued attributes (can be sorted in a meaningful order):
  - Binary split:
    - Sort values, try each as a split point.
    - E.g., if values are 1, 10, 15, 25, split at <= 1, <= 10, <= 15.
    - Pick the value that gives the best split (see the sketch below).
  - Multi-way split:
    - A series of binary splits on the same attribute has a roughly equivalent effect.
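For continuous-valued attributes, the sort-and-scan search for a binary split point can be sketched as follows. This reuses the gini and information_gain helpers from the sketch above; the repeated rescan is for clarity, not efficiency:

    def best_numeric_split(values, labels, purity=gini):
        # Sort the attribute values and try each distinct value v as the
        # split "x <= v"; return the split with the highest information gain.
        pairs = sorted(zip(values, labels))
        best_value, best_gain = None, -1.0
        for i in range(len(pairs) - 1):
            v = pairs[i][0]
            if v == pairs[i + 1][0]:
                continue  # only split between distinct values
            left = [lbl for x, lbl in pairs if x <= v]
            right = [lbl for x, lbl in pairs if x > v]
            gain = information_gain(labels, [left, right], purity)
            if gain > best_gain:
                best_value, best_gain = v, gain
        return best_value, best_gain

    # Example: ages vs. a "likely buyer" label; the best split is at age <= 28.
    print(best_numeric_split([22, 28, 41, 50], ["yes", "yes", "no", "no"]))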
37. Decision-Tree Construction Algorithm
- Procedure GrowTree(S)
    Partition(S)
- Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
38. Finding Association Rules
- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
- Naïve algorithm:
  - Consider all possible sets of relevant items.
  - For each set, find its support (i.e., count how many transactions purchase all items in the set).
  - Large itemsets: sets with sufficiently high support.
39. Example: Association Rules
- How do we perform rule mining efficiently?
- Observation: if set X has support t, then each subset of X must have support at least t.
- For 2-sets:
  - if we need support s for {i, j},
  - then each of i and j must appear in at least s baskets.
40. Algorithm for 2-Sets
- (1) Find OK products:
  - those appearing in s or more baskets.
- (2) Find high-support pairs using only OK products.
41. Algorithm for 2-Sets
- INSERT INTO okBaskets(basket, item)
  SELECT basket, item
  FROM Baskets
  WHERE item IN (SELECT item
                 FROM Baskets
                 GROUP BY item
                 HAVING COUNT(basket) >= s);
- Perform mining on okBaskets:
  SELECT I.item, J.item, COUNT(I.basket)
  FROM okBaskets I, okBaskets J
  WHERE I.basket = J.basket AND I.item < J.item
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s;
42. Counting Efficiently
[Worked example of counting pair occurrences, with support threshold = 3]
43. Counting Efficiently
[Worked example of counting pair occurrences, with support threshold = 3 (continued)]
44. Yet Another Way
[Worked example: hashing pairs into buckets of counts, threshold = 3; bucket counts may yield false positives]
45. Discussion
- Hashing scheme: 2 (or 3) scans of the data.
- Sorting scheme: requires a sort!
- Hashing works well if there are few high-support pairs and many low-support ones ("iceberg queries"); a sketch of the bucket-hashing idea follows.
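A minimal Python sketch of the bucket-hashing idea, assuming the baskets fit in a list of item lists and using a fixed number of hash buckets (a generic two-pass illustration of the scheme, not the exact procedure from the omitted example slides):

    from collections import Counter
    from itertools import combinations

    def frequent_pairs_hashed(baskets, s, n_buckets=1021):
        # Pass 1: hash every pair to a bucket and count the bucket.
        bucket = Counter()
        for b in baskets:
            for pair in combinations(sorted(set(b)), 2):
                bucket[hash(pair) % n_buckets] += 1
        # Pass 2: count exactly only pairs whose bucket reached the threshold.
        # Bucket counts can over-count (false positives) but never under-count.
        exact = Counter()
        for b in baskets:
            for pair in combinations(sorted(set(b)), 2):
                if bucket[hash(pair) % n_buckets] >= s:
                    exact[pair] += 1
        return {p: c for p, c in exact.items() if c >= s}

    baskets = [["A", "B", "E"], ["B", "D"], ["A", "B", "E"], ["A", "C"], ["B", "C"]]
    print(frequent_pairs_hashed(baskets, s=2))  # pairs AB, AE, BE each appear twice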
46. Frequent Itemsets Mining
TID Transactions
100 A, B, E
200 B, D
300 A, B, E
400 A, C
500 B, C
600 A, C
700 A, B
800 A, B, C, E
900 A, B, C
1000 A, C, E
- Desired frequency: 50% (support level)
  - Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
- Down-closure (apriori) property:
  - If an itemset is frequent, all of its subsets must also be frequent.
47. Lattice for Enumerating Frequent Itemsets
48. Apriori
- L0 = {}
- C1 = {1-item subsets of all the transactions}
- for (k = 1; Ck != {}; k++)
  - // support counting
  - for all transactions t ∈ D
    - for all k-subsets s of t
      - if s ∈ Ck then s.count++
  - // candidate generation
  - Lk = {c ∈ Ck | c.count >= minsup}
  - Ck+1 = apriori_gen(Lk)
- Answer = ∪k Lk (a Python sketch of this loop follows)
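A minimal, self-contained Python sketch of the same loop, using the transactions from the table above; apriori_gen is simplified here to a join of the frequent (k-1)-itemsets followed by down-closure pruning (an illustration, not an optimized implementation):

    from itertools import combinations

    def apriori(transactions, minsup):
        # Return every itemset whose support (count) is >= minsup.
        transactions = [frozenset(t) for t in transactions]
        candidates = {frozenset([i]) for t in transactions for i in t}  # C1
        answer, k = {}, 1
        while candidates:
            # support counting
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            frequent = {c: n for c, n in counts.items() if n >= minsup}  # Lk
            answer.update(frequent)
            # candidate generation: join Lk with itself, keep (k+1)-sets whose
            # k-subsets are all frequent (down-closure pruning)
            prev = set(frequent)
            joined = {a | b for a in prev for b in prev if len(a | b) == k + 1}
            candidates = {c for c in joined
                          if all(frozenset(s) in prev for s in combinations(c, k))}
            k += 1
        return answer

    # The 10 transactions from the table above; 50% support level -> minsup = 5.
    T = [{"A", "B", "E"}, {"B", "D"}, {"A", "B", "E"}, {"A", "C"}, {"B", "C"},
         {"A", "C"}, {"A", "B"}, {"A", "B", "C", "E"}, {"A", "B", "C"}, {"A", "C", "E"}]
    print(apriori(T, minsup=5))  # frequent: {A}, {B}, {C}, {A,B}, {A,C}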
49. Clustering
- Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.
- Can be formalized using distance metrics in several ways:
  - Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
    - Centroid: point defined by taking the average of the coordinates in each dimension.
  - Another metric: minimize the average distance between every pair of points in a cluster.
- Has been studied extensively in statistics, but on small data sets.
- Data mining systems aim at clustering techniques that can handle very large data sets.
  - E.g., the BIRCH clustering algorithm!
50. K-Means Clustering
51. K-means Clustering
- Partitional clustering approach.
- Each cluster is associated with a centroid (center point).
- Each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.
- The basic algorithm is very simple (see the sketch below).
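A minimal sketch of that basic algorithm for points in the plane (plain Python, Euclidean distance; taking the first K points as initial centroids is an illustrative choice, not part of the lecture):

    import math

    def kmeans(points, k, iters=100):
        # Repeat: assign each point to the nearest centroid, then recompute
        # each centroid as the mean of the points assigned to it.
        centroids = points[:k]  # naive initialization
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[j].append(p)
            new_centroids = [
                tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
                for i, cl in enumerate(clusters)
            ]
            if new_centroids == centroids:  # converged
                break
            centroids = new_centroids
        return centroids, clusters

    pts = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1.2), (8.5, 9.5)]
    print(kmeans(pts, k=2)[0])  # two centroids, one per visible cluster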
52. Hierarchical Clustering
- Example from biological classification
  - (the word classification here does not mean a prediction mechanism)
  - chordata → mammalia (leopards, humans), reptilia (snakes, crocodiles)
- Other examples: Internet directory systems (e.g., Yahoo!)
- Agglomerative clustering algorithms
  - Build small clusters, then cluster small clusters into bigger clusters, and so on (see the sketch below).
- Divisive clustering algorithms
  - Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.
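A minimal sketch of agglomerative (bottom-up) clustering for one-dimensional points, using single-linkage distance and stopping at k clusters (both choices are illustrative assumptions, not specified in the lecture):

    def agglomerative(points, k):
        # Start with each point in its own cluster; repeatedly merge the two
        # closest clusters (single linkage) until only k clusters remain.
        clusters = [[p] for p in points]
        while len(clusters) > k:
            best = None  # (distance, i, j)
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i].extend(clusters[j])  # merge the closest pair
            del clusters[j]
        return clusters

    print(agglomerative([1, 2, 3, 10, 11, 25], k=3))  # [[1, 2, 3], [10, 11], [25]]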
53. Collaborative Filtering
- Goal: predict what movies/books/... a person may be interested in, on the basis of
  - past preferences of the person,
  - other people with similar past preferences, and
  - the preferences of such people for a new movie/book/...
- One approach is based on repeated clustering:
  - Cluster people on the basis of their preferences for movies.
  - Then cluster movies on the basis of being liked by the same clusters of people.
  - Again cluster people based on their preferences for (the newly created clusters of) movies.
  - Repeat the above till equilibrium.
- The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest.
54. Other Types of Mining
- Text mining: application of data mining to textual documents.
  - Cluster Web pages to find related pages.
  - Cluster pages a user has visited to organize their visit history.
  - Classify Web pages automatically into a Web directory.
- Data visualization systems help users examine large volumes of data and detect patterns visually.
  - Can visually encode large amounts of information on a single screen.
  - Humans are very good at detecting visual patterns.
55. Data Streams
- What are data streams?
  - Continuous streams.
  - Huge, fast, and changing.
- Why data streams?
  - The arrival speed of streams and the huge amount of data are beyond our capability to store them.
  - Real-time processing.
- Window models:
  - Landmark window (the entire data stream)
  - Sliding window
  - Damped window
- Mining data streams
56. A Simple Problem
- Finding frequent items.
- Given a sequence (x1, ..., xN) where xi ∈ [1, m], and a real number θ between zero and one.
- Looking for the items xi whose frequency is > θ.
- Naïve algorithm: m counters.
  - The number of frequent items is <= 1/θ.
  - Problem: N >> m >> 1/θ.
57. KRP Algorithm - Karp et al. (TODS '03)
[Worked example on a stream with N = 30, m = 12, θ = 0.35: the algorithm keeps at most ⌈1/θ⌉ = 3 counters, and N / ⌈1/θ⌉ <= Nθ. A generic sketch of the counting scheme follows.]
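A generic Python sketch in the spirit of the KRP (Misra-Gries style) frequent-items scheme: keep fewer than ⌈1/θ⌉ counters, decrement all of them whenever an untracked item arrives and no counter slot is free, then verify the surviving candidates with a second pass. The exact bookkeeping in the lecture's figures may differ; this is an illustration of the idea only.

    import math
    from collections import Counter

    def frequent_item_candidates(stream, theta):
        # One pass; returns a superset of the items whose frequency exceeds
        # theta * N, using fewer than ceil(1/theta) counters.
        k = math.ceil(1 / theta)
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k - 1:
                counters[x] = 1
            else:
                # Decrement every counter (conceptually discarding k distinct
                # items, including x); drop counters that reach zero.
                counters = {y: c - 1 for y, c in counters.items() if c > 1}
        return set(counters)

    def frequent_items(stream, theta):
        # Second pass: keep only candidates whose exact count exceeds theta * N.
        candidates = frequent_item_candidates(stream, theta)
        exact = Counter(x for x in stream if x in candidates)
        return {x for x, c in exact.items() if c > theta * len(stream)}

    s = list("ABABABCADAAB")  # N = 12, "A" appears 6 times
    print(frequent_items(s, theta=0.4))  # {'A'}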
58. Enhance the Accuracy
[Worked example with m = 12, N = 30, θ = 0.35 (Nθ > 10), ⌈1/θ⌉ = 3; with ε = 0.5: θ(1 - ε) = 0.175 and ⌈1/(θε)⌉ = 6]
59. Frequent Items for Transactions with Fixed Length
- Each transaction has 2 items; θ = 0.60
- ⌈1/θ⌉ × 2 = 4