Lecture 2: Data Mining

1
Lecture 2: Data Mining
2
Roadmap
  • What is data mining?
  • Data Mining Tasks
  • Classification/Decision Tree
  • Clustering
  • Association Mining
  • Data Mining Algorithms
  • Decision Tree Construction
  • Frequent 2-itemsets
  • Frequent Itemsets (Apriori)
  • Clustering/Collaborative Filtering

3
What is Data Mining?
  • Discovery of useful, possibly unexpected,
    patterns in data.
  • Subsidiary issues:
  • Data cleansing: detection of bogus data.
  • E.g., age = 150.
  • Visualization: something better than megabyte files of output.
  • Warehousing of data (for retrieval).

4
Typical Kinds of Patterns
  1. Decision trees: succinct ways to classify by testing properties.
  2. Clusters: another succinct classification, by similarity of properties.
  3. Bayes, hidden-Markov, and other statistical models, and frequent itemsets: expose important associations within data.

5
Example Clusters
[Scatter plot omitted: points forming visually distinct clusters]
6
Applications (Among Many)
  • Intelligence-gathering.
  • E.g., Total Information Awareness.
  • Web analysis.
  • E.g., PageRank.
  • Marketing.
  • E.g., run a sale on diapers; raise the price of beer.
  • Detective work?

7
Cultures
  • Databases: concentrate on large-scale (non-main-memory) data.
  • AI (machine learning): concentrate on complex methods, small data.
  • Statistics: concentrate on inferring models.

8
Models vs. Analytic Processing
  • To a database person, data mining is a powerful form of analytic processing --- queries that examine large amounts of data.
  • The result is the data that answers the query.
  • To a statistician, data mining is the inference of models.
  • The result is the parameters of the model.

9
Meaningfulness of Answers
  • A big risk when data mining is that you will discover patterns that are meaningless.
  • Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

10
Examples
  • A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
  • The Rhine Paradox: a great example of how not to conduct scientific research.

11
Rhine Paradox --- (1)
  • David Rhine was a parapsychologist in the 1950s
    who hypothesized that some people had
    Extra-Sensory Perception.
  • He devised an experiment where subjects were
    asked to guess 10 hidden cards --- red or blue.
  • He discovered that almost 1 in 1000 had ESP ---
    they were able to get all 10 right!

12
Rhine Paradox --- (2)
  • He told these people they had ESP and called them
    in for another test of the same type.
  • Alas, he discovered that almost all of them had
    lost their ESP.
  • What did he conclude?
  • Answer on next slide.

13
Rhine Paradox --- (3)
  • He concluded that you shouldn't tell people they have ESP; it causes them to lose it.

14
Data Mining Tasks
  • Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
  • Prediction based on past history:
  • Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history.
  • Predict whether a pattern of phone calling card usage is likely to be fraudulent.
  • Some examples of prediction mechanisms:
  • Classification: given a new item whose class is unknown, predict to which class it belongs.
  • Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value.

15
Data Mining (Cont.)
  • Descriptive patterns:
  • Associations:
  • Find books that are often bought by similar customers. If a new such customer buys one such book, suggest the others too.
  • Associations may be used as a first step in detecting causation.
  • E.g., an association between exposure to chemical X and cancer.
  • Clusters:
  • E.g., typhoid cases were clustered in an area surrounding a contaminated well.
  • Detection of clusters remains important in detecting epidemics.

16
Decision Trees
  • Example:
  • Conducted a survey to see which customers were interested in a new model car.
  • Want to select customers for an advertising campaign.

[training set: table of survey responses omitted]
17
One Possibility
age < 30?
├── Y: city = sf?
│   ├── Y: likely
│   └── N: unlikely
└── N: car = van?
    ├── Y: likely
    └── N: unlikely
18
Another Possibility
car = taurus?
├── Y: city = sf?
│   ├── Y: likely
│   └── N: unlikely
└── N: age < 45?
    ├── Y: likely
    └── N: unlikely
19
Issues
  • The decision tree cannot be too deep:
  • there would not be statistically significant amounts of data for the lower decisions.
  • Need to select the tree that most reliably predicts outcomes.

20
Clustering
[3-D scatter plot omitted: axes income, education, and age, with points forming clusters]
21
Another Example: Text
  • Each document is a vector.
  • E.g., <100110...> contains words 1, 4, 5, ...
  • Clusters contain similar documents.
  • Useful for understanding and searching documents (see the sketch below).

[Figure omitted: document clusters labeled sports, international news, and business]
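To make the vector representation concrete, here is a minimal Python sketch; the vocabulary, the documents, and the use of cosine similarity are illustrative assumptions, not from the slides.

    # Documents as binary word-presence vectors over a tiny assumed vocabulary.
    vocab = ["ball", "bank", "game", "market", "team", "trade"]

    def to_vector(doc):
        words = set(doc.lower().split())
        return [1 if w in words else 0 for w in vocab]

    def cosine(u, v):
        """Cosine similarity of two vectors (0 = unrelated, 1 = same direction)."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
        return dot / norm if norm else 0.0

    d1 = to_vector("The team won the game")
    d2 = to_vector("A great game for the team")
    d3 = to_vector("Bank stocks trade on the market")
    print(cosine(d1, d2), cosine(d1, d3))   # 1.0 0.0: the sports documents cluster together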
22
Issues
  • Is the desired number of clusters given in advance?
  • Finding the best clusters.
  • Are the clusters semantically meaningful?

23
Association Rule Mining
[Figure omitted: table of sales records (transaction id, customer id, products bought), i.e., market-basket data]
  • Trend: products p5 and p8 are often bought together.
  • Trend: customer 12 likes product p9.

24
Association Rule
  • Rule: {p1, p3, p8}.
  • Support: the number of baskets in which all these products appear.
  • High-support set: support ≥ threshold s.
  • Problem: find all high-support sets.

25
Finding High-Support Pairs
  • Baskets(basket, item)
  • SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s

26
Example
27
Issues
  • Performance for size-2 rules: big!
  • Performance for size-k rules: even bigger!

28
Roadmap
  • What is data mining?
  • Data Mining Tasks
  • Classification/Decision Tree
  • Clustering
  • Association Mining
  • Data Mining Algorithms
  • Decision Tree Construction
  • Frequent 2-itemsets
  • Frequent Itemsets (Apriori)
  • Clustering/Collaborative Filtering

29
Classification Rules
  • Classification rules help assign new objects to classes.
  • E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
  • Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
  • ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  • ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
  • Rules are not necessarily exact: there may be some misclassifications.
  • Classification rules can be shown compactly as a decision tree; a direct encoding of the two rules above is sketched below.

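A direct encoding of the two example rules as code; this is an illustrative sketch, and the function name and the fallback class are our assumptions.

    # The two credit-rating rules from this slide, written as a Python function.
    def credit_class(degree, income):
        if degree == "masters" and income > 75_000:
            return "excellent"
        if degree == "bachelors" and 25_000 <= income <= 75_000:
            return "good"
        return "unclassified"   # the slide's rules do not cover other cases

    print(credit_class("masters", 90_000))   # excellent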
30
Decision Tree
31
Decision Tree Construction
Employed? (root)
├── No: Class = Not Default (leaf)
└── Yes: Balance (node)
    ├── ≥ 50K: Class = Yes Default
    └── < 50K: Age
        ├── > 45: Class = Not Default
        └── ≤ 45: Class = Yes Default
32
Construction of Decision Trees
  • Training set: a data sample in which the classification is already known.
  • Greedy top-down generation of decision trees.
  • Each internal node of the tree partitions the
    data into groups based on a partitioning
    attribute, and a partitioning condition for the
    node
  • Leaf node
  • all (or most) of the items at the node belong to
    the same class, or
  • all attributes have been considered, and no
    further partitioning is possible.

33
Finding the Best Split Point for Numerical
Attributes
[Figure omitted: numerical attribute values from an IBM Quest synthetic dataset (function 0), with the best split point marked]
In-core algorithms, such as C4.5, simply sort the numerical attributes in memory to find the best split point.
34
Best Splits
  • Pick the best attributes and conditions on which to partition.
  • The purity of a set S of training instances can be measured quantitatively in several ways.
  • Notation: number of classes k, number of instances |S|, fraction of instances in class i is pi.
  • The Gini measure of purity is defined as Gini(S) = 1 − Σi pi².
  • When all instances are in a single class, the Gini value is 0.
  • It reaches its maximum (of 1 − 1/k) if each class has the same number of instances.

35
Best Splits (Cont.)
  • Another measure of purity is the entropy measure, which is defined as entropy(S) = − Σi pi log2 pi.
  • When a set S is split into multiple sets Si, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as purity(S1, S2, ..., Sr) = Σi (|Si| / |S|) · purity(Si).
  • The information gain due to a particular split of S into Si, i = 1, 2, ..., r, is Information-gain(S; S1, S2, ..., Sr) = purity(S) − purity(S1, S2, ..., Sr). Both measures are sketched in code below.

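A small Python sketch of these purity measures and of information gain under the Gini measure; the helper names and the toy class counts are ours.

    import math

    def gini(counts):
        """Gini impurity: 1 - sum of squared class fractions."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        """Entropy: -sum(p_i * log2(p_i)) over the classes."""
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    def split_purity(subsets, measure=gini):
        """Weighted purity of a split: sum(|S_i|/|S| * measure(S_i))."""
        total = sum(sum(s) for s in subsets)
        return sum(sum(s) / total * measure(s) for s in subsets)

    # Example: 10 instances in two classes, split into two subsets.
    parent = [5, 5]                  # maximally impure: gini = 0.5
    children = [[4, 1], [1, 4]]      # class counts after a candidate split
    print(round(gini(parent) - split_purity(children), 3))   # information gain: 0.18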
36
Finding Best Splits
  • Categorical attributes (with no meaningful order):
  • Multi-way split: one child for each value.
  • Binary split: try all possible breakups of the values into two sets, and pick the best.
  • Continuous-valued attributes (can be sorted in a meaningful order):
  • Binary split:
  • Sort the values, and try each as a split point.
  • E.g., if the values are 1, 10, 15, 25, split at ≤ 1, ≤ 10, ≤ 15.
  • Pick the value that gives the best split.
  • Multi-way split:
  • A series of binary splits on the same attribute has roughly equivalent effect.

37
Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)

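Below is a self-contained Python sketch of this greedy procedure under simplifying assumptions: numeric attributes, binary ≤ splits, Gini impurity as the purity measure, and depth/size bounds standing in for the thresholds δp and δs. All names and the toy data are ours.

    from collections import Counter

    def gini(labels):
        """Gini impurity of a list of class labels."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def partition(rows, labels, depth=0, max_depth=3, min_size=2):
        # Stopping rule: pure enough, too small, or too deep.
        if gini(labels) == 0.0 or len(rows) < min_size or depth == max_depth:
            return Counter(labels).most_common(1)[0][0]   # leaf: majority class
        best = None
        for a in rows[0]:                                 # evaluate each attribute A
            for v in sorted({r[a] for r in rows})[:-1]:   # candidate split points
                left = [i for i, r in enumerate(rows) if r[a] <= v]
                right = [i for i, r in enumerate(rows) if r[a] > v]
                # Weighted impurity of the split (lower is purer).
                w = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
                if best is None or w < best[0]:
                    best = (w, a, v, left, right)
        if best is None:                                  # no usable split remains
            return Counter(labels).most_common(1)[0][0]
        _, a, v, left, right = best
        return (a, v,
                partition([rows[i] for i in left], [labels[i] for i in left], depth + 1),
                partition([rows[i] for i in right], [labels[i] for i in right], depth + 1))

    rows = [{"age": 25, "income": 30}, {"age": 40, "income": 80},
            {"age": 35, "income": 20}, {"age": 50, "income": 90}]
    labels = ["likely", "unlikely", "likely", "unlikely"]
    print(partition(rows, labels))   # ('age', 35, 'likely', 'unlikely')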
38
Finding Association Rules
  • We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
  • Naïve algorithm:
  • Consider all possible sets of relevant items.
  • For each set, find its support (i.e., count how many transactions purchase all items in the set).
  • Large itemsets: sets with sufficiently high support.

39
Example Association Rules
  • How do we perform rule mining efficiently?
  • Observation: if a set X has support t, then each subset of X must have support at least t.
  • For 2-sets:
  • if we need support s for {i, j},
  • then i and j must each appear in at least s baskets.

40
Algorithm for 2-Sets
  • (1) Find OK products:
  • those appearing in s or more baskets.
  • (2) Find high-support pairs using only OK products.

41
Algorithm for 2-Sets
  • (1) Keep only the baskets' OK products:
    INSERT INTO okBaskets(basket, item)
    SELECT basket, item
    FROM Baskets
    WHERE item IN (SELECT item
                   FROM Baskets
                   GROUP BY item
                   HAVING COUNT(basket) >= s)
  • (2) Perform mining on okBaskets:
    SELECT I.item, J.item, COUNT(I.basket)
    FROM okBaskets I, okBaskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s

42
Counting Efficiently
  • One way:

[Figure omitted: a pair-counting scheme, threshold = 3]
43
Counting Efficiently
  • Another way:

[Figure omitted: an alternative pair-counting scheme, threshold = 3]
44
Yet Another Way
[Figure omitted: hashing pairs into buckets, threshold = 3; bucket counting can produce false positives]
45
Discussion
  • Hashing scheme: 2 (or 3) scans of the data.
  • Sorting scheme: requires a sort!
  • Hashing works well if there are few high-support pairs and many low-support ones (see the counting sketch below).

Queries of this form are known as iceberg queries.
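For intuition, a minimal main-memory hash-count pass in Python; the toy baskets and the threshold are our own.

    from collections import Counter
    from itertools import combinations

    baskets = [["milk", "bread", "beer"],
               ["milk", "bread"],
               ["bread", "beer"],
               ["milk", "bread", "beer"]]
    s = 3   # support threshold

    pair_counts = Counter()
    for basket in baskets:
        # Count each unordered pair once per basket (one scan of the data).
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    high_support = {p: c for p, c in pair_counts.items() if c >= s}
    print(high_support)   # {('beer', 'bread'): 3, ('bread', 'milk'): 3}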
46
Frequent Itemsets Mining
TID   Transactions
100   A, B, E
200   B, D
300   A, B, E
400   A, C
500   B, C
600   A, C
700   A, B
800   A, B, C, E
900   A, B, C
1000  A, C, E
  • Desired frequency (support level): 50%.
  • Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}.
  • Down-closure (apriori) property:
  • If an itemset is frequent, all of its subsets must also be frequent.

47
Lattice for Enumerating Frequent Itemsets
48
Apriori
L0 = ∅
C1 = { 1-item subsets of all the transactions }
for ( k = 1; Ck ≠ ∅; k++ )
    // support counting
    for all transactions t ∈ D
        for all k-subsets s of t
            if s ∈ Ck then s.count++
    // candidate generation
    Lk = { c ∈ Ck | c.count > min_sup }
    Ck+1 = apriori_gen( Lk )
Answer = ∪k Lk

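A compact Python sketch of this level-wise loop; apriori_gen is folded into two set comprehensions (join, then down-closure pruning), and the data is the 10-transaction example from the earlier slide with minimum support 5 (50%).

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Level-wise frequent-itemset mining."""
        candidates = {frozenset([i]) for t in transactions for i in t}
        frequent = []
        while candidates:
            # Support counting: one pass over the transactions per level.
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            Lk = {c for c, n in counts.items() if n >= min_sup}
            frequent.extend(Lk)
            # Candidate generation: join Lk with itself, then prune candidates
            # that have an infrequent subset (the down-closure property).
            k = len(next(iter(Lk))) + 1 if Lk else 0
            candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
            candidates = {c for c in candidates
                          if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        return frequent

    # The 10 transactions from the Frequent Itemsets Mining slide.
    T = [set(t) for t in ["ABE", "BD", "ABE", "AC", "BC", "AC", "AB", "ABCE", "ABC", "ACE"]]
    print(sorted("".join(sorted(s)) for s in apriori(T, 5)))
    # ['A', 'AB', 'AC', 'B', 'C']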
49
Clustering
  • Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.
  • Can be formalized using distance metrics in several ways:
  • Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
  • Centroid: the point defined by taking the average of the coordinates in each dimension.
  • Another metric: minimize the average distance between every pair of points in a cluster.
  • Has been studied extensively in statistics, but on small data sets.
  • Data mining systems aim at clustering techniques that can handle very large data sets.
  • E.g., the BIRCH clustering algorithm.

50
K-Means Clustering
51
K-means Clustering
  • Partitional clustering approach.
  • Each cluster is associated with a centroid (center point).
  • Each point is assigned to the cluster with the closest centroid.
  • The number of clusters, K, must be specified.
  • The basic algorithm is very simple (see the sketch below).

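A minimal Python sketch of the basic algorithm on toy 2-D points; the data, the random initialization, and the convergence test are our assumptions.

    import random

    def dist2(p, q):
        """Squared Euclidean distance between two points."""
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def mean(cluster):
        """Centroid: coordinate-wise average of the points in a cluster."""
        n = len(cluster)
        return tuple(sum(xs) / n for xs in zip(*cluster))

    def kmeans(points, k, iters=100):
        """Assign each point to the nearest centroid, recompute, repeat."""
        centroids = random.sample(points, k)
        clusters = []
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: dist2(p, centroids[i]))
                clusters[j].append(p)
            new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
            if new == centroids:        # converged: centroids stopped moving
                break
            centroids = new
        return centroids, clusters

    random.seed(0)
    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    centroids, clusters = kmeans(pts, k=2)
    print(clusters)   # the two natural groups of points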
52
Hierarchical Clustering
  • Example from biological classification:
  • (the word classification here does not mean a prediction mechanism)

chordata
├── mammalia
│   ├── leopards
│   └── humans
└── reptilia
    ├── snakes
    └── crocodiles
  • Other examples: Internet directory systems (e.g., Yahoo!).
  • Agglomerative clustering algorithms:
  • Build small clusters, then cluster the small clusters into bigger clusters, and so on (see the sketch below).
  • Divisive clustering algorithms:
  • Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

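A tiny Python sketch of the agglomerative (bottom-up) idea, using single-linkage distance on 1-D toy data; all names and data are ours.

    def agglomerative(points, k):
        """Repeatedly merge the two closest clusters until only k remain."""
        clusters = [[p] for p in points]
        while len(clusters) > k:
            # Single-linkage: distance between the closest pair of members.
            i, j = min(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda ij: min(abs(a - b)
                                          for a in clusters[ij[0]]
                                          for b in clusters[ij[1]]))
            clusters[i] += clusters.pop(j)
        return clusters

    print(agglomerative([1, 2, 3, 10, 11, 12, 30], 3))
    # [[1, 2, 3], [10, 11, 12], [30]]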
53
Collaborative Filtering
  • Goal: predict which movies/books/... a person may be interested in, on the basis of:
  • past preferences of the person,
  • other people with similar past preferences, and
  • the preferences of such people for a new movie/book/...
  • One approach is based on repeated clustering:
  • Cluster people on the basis of their preferences for movies.
  • Then cluster movies on the basis of being liked by the same clusters of people.
  • Again cluster people based on their preferences for (the newly created clusters of) movies.
  • Repeat the above till equilibrium.
  • The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest.

54
Other Types of Mining
  • Text mining: application of data mining to textual documents.
  • E.g., cluster Web pages to find related pages.
  • E.g., cluster pages a user has visited to organize their visit history.
  • E.g., classify Web pages automatically into a Web directory.
  • Data visualization systems help users examine large volumes of data and detect patterns visually.
  • They can visually encode large amounts of information on a single screen.
  • Humans are very good at detecting visual patterns.

55
Data Streams
  • What are data streams?
  • Continuous streams.
  • Huge, fast, and changing.
  • Why data streams?
  • The arrival rate and sheer volume of the streams are beyond our capability to store them.
  • Real-time processing.
  • Window models:
  • Landmark window (the entire data stream).
  • Sliding window.
  • Damped window.
  • Mining data streams.

56
A Simple Problem
  • Finding frequent items:
  • Given a sequence (x1, ..., xN) where xi ∈ {1, ..., m}, and a real number θ between zero and one.
  • We are looking for the items xi whose relative frequency exceeds θ (i.e., count > θN); a counter-based sketch follows.
  • Naïve algorithm: m counters, one per possible item.
  • The number of frequent items is at most 1/θ.
  • Problem: N >> m >> 1/θ.

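A hedged Python sketch of the counter-based idea behind the algorithm on the next slide: a first pass that keeps at most ⌈1/θ⌉ counters and can only over-report (false positives), followed by a verification pass. Names and data are ours.

    import math
    from collections import Counter

    def candidates(stream, theta):
        """First pass: every item with count > theta*N survives; some others may too."""
        counters, limit = {}, math.ceil(1 / theta)
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < limit:
                counters[x] = 1
            else:
                # Decrement all counters; drop any that reach zero.
                for y in list(counters):
                    counters[y] -= 1
                    if counters[y] == 0:
                        del counters[y]
        return set(counters)

    def frequent(stream, theta):
        cands = candidates(stream, theta)
        counts = Counter(x for x in stream if x in cands)   # verification pass
        return {x for x, c in counts.items() if c > theta * len(stream)}

    s = [1, 2, 1, 3, 1, 2, 1, 4, 1, 1]
    print(frequent(s, 0.35))   # {1}: item 1 occurs 6 times, and 6 > 0.35 * 10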
57
KRP algorithm (Karp et al., TODS 2003)

[Figure omitted: a run on a stream with N = 30, m = 12, θ = 0.35, using ⌈1/θ⌉ = 3 counters; the number of decrement rounds is at most N / ⌈1/θ⌉ ≤ Nθ]
58
Enhance the Accuracy
[Figure omitted: the same stream (N = 30, m = 12, θ = 0.35, Nθ > 10) with accuracy parameter ε = 0.5, reporting threshold θ(1 − ε) = 0.175, and ⌈1/(θε)⌉ = 6 counters]
59
Frequent Items for Transactions with Fixed Length
Each transaction has 2 items; θ = 0.60.
⌈1/θ⌉ · 2 = 4