1
Data Mining Tutorial
  • Tomasz Imielinski
  • Rutgers University

2
What is data mining?
  • Finding interesting, useful, unexpected
  • Finding patterns, clusters, associations,
    classifications
  • Answering inductive queries
  • Aggregations and their changes on
    multidimensional cubes

3
Table of Contents
  • Association Rules
  • Interesting Rules
  • OLAP
  • Cubegrades: a unification of association rules and OLAP
  • Classification and clustering methods (not included in this tutorial)

4
Association Rules
  • [AIS 1993] Agrawal, Imielinski, Swami: Mining Association Rules between Sets of Items in Large Databases, SIGMOD 1993
  • [AS 1994] Agrawal, Srikant: Fast Algorithms for Mining Association Rules in Large Databases, VLDB 1994
  • [B 1998] Bayardo: Efficiently Mining Long Patterns from Databases, SIGMOD 1998
  • [SA 1996] Srikant, Agrawal: Mining Quantitative Association Rules in Large Relational Tables, SIGMOD 1996
  • [T 1996] Toivonen: Sampling Large Databases for Association Rules, VLDB 1996
  • [BMS 1997] Brin, Motwani, Silverstein: Beyond Market Baskets: Generalizing Association Rules to Correlations, SIGMOD 1997
  • [IV 1999] Imielinski, Virmani: MSQL: A Query Language for Database Mining, DMKD 1999

5
Baskets
  • I = {I1, ..., Im}: a set of (binary) attributes called items
  • T: a database of transactions
  • t[k] = 1 if transaction t bought item k
  • Association rule X => I with support s and confidence c
  • Support: the fraction of transactions in T that contain X together with I
  • Confidence: the fraction of transactions containing X that also contain I (see the sketch below)
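To make the two definitions concrete, here is a minimal Python sketch on made-up basket data (the items and transactions are invented for illustration):

    # Toy basket data: each transaction is a set of items.
    T = [
        {"milk", "bread", "butter"},
        {"milk", "bread"},
        {"milk", "butter"},
        {"bread"},
    ]

    def support(itemset, transactions):
        # Fraction of transactions containing every item in itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(body, head, transactions):
        # Fraction of transactions containing body that also contain head.
        return support(body | head, transactions) / support(body, transactions)

    print(support({"milk", "bread"}, T))       # 0.5
    print(confidence({"milk"}, {"bread"}, T))  # 0.666... (2 of 3 milk baskets)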

6
Baskets
  • Minsup, Minconf: user-given thresholds
  • Frequent sets: sets of items X such that their support sup(X) > minsup
  • If X is frequent, all its subsets are frequent (downward closure)

7
Examples
  • Example: 20% of transactions which bought cereal and milk also bought bread (support 2%)
  • In the worst case there is an exponential number (in the size of the set of items) of such rules.
  • What set of transactions leads to an exponential blow-up of the rule set?
  • Fortunately, worst cases are unlikely and not typical. Support provides excellent pruning ability.

8
General Strategy
  • Generate frequent sets
  • Get association rules X => I with support s = support(X ∪ I) and confidence c = support(X ∪ I) / support(X)
  • Key property: downward closure of the frequent sets; we don't have to consider supersets of X if X is not frequent

9
General strategies
  • Make repetitive passes through the database of transactions
  • In each pass, count the support of CANDIDATE frequent sets
  • In the next pass, continue with the frequent sets obtained so far by expanding them; do not expand sets which were determined NOT to be frequent (see the sketch below)
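A minimal Python sketch of this level-wise strategy (essentially the Apriori pattern; all names here are ours): each iteration generates candidates from the current frequent sets, then one pass over the transactions counts them.

    from itertools import combinations

    def frequent_sets(transactions, minsup):
        # Pass 1: count individual items.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s for s, c in counts.items() if c >= minsup}
        result = set(frequent)
        k = 2
        while frequent:
            # Candidates: unions of two frequent (k-1)-sets, kept only if
            # every (k-1)-subset is frequent (downward closure pruning).
            prev = list(frequent)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    u = prev[i] | prev[j]
                    if len(u) == k and all(frozenset(s) in frequent
                                           for s in combinations(u, k - 1)):
                        candidates.add(u)
            # One pass over the database counts all surviving candidates.
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            frequent = {s for s, c in counts.items() if c >= minsup}
            result |= frequent
            k += 1
        return result

    baskets = [frozenset(b) for b in
               [{"milk", "bread"}, {"milk", "bread", "butter"}, {"milk"}]]
    print(frequent_sets(baskets, minsup=2))  # {milk}, {bread}, {milk, bread}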

10
AIS Algorithm
(R. Agrawal, T. Imielinski, A. Swami, Mining
Association Rules Between Sets of Items in Large
Databases, SIGMOD93)
11
AIS generating association rules
(R. Agrawal, T. Imielinski, A. Swami, Mining
Association Rules Between Sets of Items in Large
Databases, SIGMOD93)
12
AIS estimation part
(R. Agrawal, T. Imielinski, A. Swami, Mining
Association Rules Between Sets of Items in Large
Databases, SIGMOD93)
13
Apriori
(R. Agrawal, R Srikant, Fast Algorithms for
Mining Association Rules, VLDB94)
14
Apriori algorithm
(R. Agrawal, R Srikant, Fast Algorithms for
Mining Association Rules, VLDB94)
15
Pruning in apriori through self-join
(R. Agrawal, R Srikant, Fast Algorithms for
Mining Association Rules, VLDB94)
16
Performance improvement due to Apriori pruning
(R. Agrawal, R Srikant, Fast Algorithms for
Mining Association Rules, VLDB94)
17
Other pruning techniques
  • Key question: at any point in time, how do we determine which extensions of a given candidate set are worth counting?
  • Apriori: only those for which all subsets are frequent
  • Alternative: only those for which the estimated upper bound of the count is above minsup
  • Take a risk: count a large superset of the given candidate set. If it is frequent, then all its subsets are also frequent (a saving). If not, we have at least pruned all its supersets.

18
Jump-ahead schemes: Bayardo's Max-Miner
(R. Bayardo, Efficiently Mining Long Patterns
from Databases, SIGMOD98)
19
Jump-ahead scheme
  • h(g) and t(g): the head and tail of an item group g. The tail is the maximal set of items with which g can possibly be extended

20
Max-miner
(R. Bayardo, Efficiently Mining Long Patterns
from Databases, SIGMOD98)
21
Max-miner
(R. Bayardo, Efficiently Mining Long Patterns
from Databases, SIGMOD98)
22
Max-miner
(R. Bayardo, Efficiently Mining Long Patterns
from Databases, SIGMOD98)
23
Max-miner
(R. Bayardo, Efficiently Mining Long Patterns
from Databases, SIGMOD98)
24
Max-miner vs Apriori vs Apriori LB
  • Max-Miner is over two orders of magnitude faster than Apriori at identifying maximal frequent patterns on data sets with long max patterns
  • It considers fewer candidate sets
  • It indexes only on head items
  • It uses dynamic item reordering

25
Quantitative Rules
  • Rules which involve continuous/quantitative attributes
  • Standard approach: discretize into intervals
  • Problem: discretization is arbitrary, so we may miss rules
  • MinSup problem: if the number of intervals is large, their support will be low
  • MinConf problem: if the intervals are large, rules may not meet minimum confidence

26
Correlation Rules [BMS 1997]
  • Suppose the conditional probability that a customer buys coffee given that he buys tea is 80%; is this an important/interesting rule?
  • It depends: if the a priori probability of a customer buying coffee is 90%, then it is not
  • We need 2x2 contingency tables rather than just pure association rules: a chi-square test for correlation, rather than just the support/confidence framework, which can be misleading

27
Correlation Rules
  • Events A and B are independent if p(A ∧ B) = p(A) × p(B)
  • If any of AB, A(not B), (not A)B, (not A)(not B) are dependent, then A and B are correlated; likewise for three items: if any of the eight combinations of A, B and C are dependent, then A, B, C are correlated
  • I = {i1, ..., in} is a correlation rule iff the occurrences of i1, ..., in are correlated
  • Correlation is upward closed: if S is correlated, so is any superset of S (see the sketch below)
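As an illustration, a chi-square test on a 2x2 tea/coffee contingency table whose counts mirror the 80%/90% numbers from the previous slide (scipy's generic test stands in for the paper's machinery):

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: tea / no tea; columns: coffee / no coffee.
    # P(coffee | tea) = 20/25 = 0.8, P(coffee) = 90/100 = 0.9.
    table = np.array([[20,  5],
                      [70,  5]])

    chi2, p, dof, expected = chi2_contingency(table)
    # Declare correlation when chi2 exceeds the critical value at the
    # chosen significance level (equivalently, when p is below it).
    print(chi2, p)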

28
Downward vs upward closure
  • Downward closure (frequent sets) is a pruning property
  • Upward closure: look for minimal correlated itemsets, such that no subsets of them are correlated. Then finding a correlation is itself a pruning step: prune all the supersets of a correlated itemset, because they are not minimal.
  • Border of correlation

29
Pruning based on support-correlation
  • Correlation can be an additional pruning criterion next to support
  • Unlike the support/confidence framework, where confidence is not upward closed

30
Chi-square
(S. Brin, R. Motwani, C. Silverstein, Beyond
Market Baskets Generalizing Association Rules to
Correlations, SIGMOD97)
31
Correlation Rules
(S. Brin, R. Motwani, C. Silverstein, Beyond
Market Baskets Generalizing Association Rules to
Correlations, SIGMOD97)
32
(S. Brin, R. Motwani, C. Silverstein, Beyond
Market Baskets Generalizing Association Rules to
Correlations, SIGMOD97)
33
Algorithms for Correlation Rules
  • The border can be large, exponential in the size of the item set; we need better pruning functions
  • A support function needs to be defined, but one that also covers negative dependencies
  • A set of items S has support s at the p level if at least p percent of the cells in the contingency table for S have value at least s
  • Problem: if p < 50%, all itemsets have support at the level one
  • For p > 25%, at least two cells in the contingency table will have support s

34
Pruning
  • Antisupport (for rare events)
  • Prune itemsets with very high chi-square to
    eliminate obvious correlations
  • Combine chi-squared correlation rules with
    pruning via support
  • Itemset is significant iff it is supported and
    minimally correlated

35
Algorithm χ²-support
  • INPUT: a chi-squared significance level α, support s, support fraction p > 0.25, basket data B.
  • OUTPUT: a set SIG of minimal correlated itemsets from B.
  • 1. For each item i, count O(i). We can use these values to calculate any necessary expected value.
  • 2. Initialize CAND, SIG and NOTSIG to empty sets.
  • 3. For each pair of items i, j such that O(i) > s and O(j) > s, add {i, j} to CAND.
  • 4. If CAND is empty, then return SIG and terminate.
  • 5. For each itemset in CAND, construct the contingency table for the itemset. If less than p percent of the cells have count at least s, then go to Step 7.
  • 6. If the χ² value for the contingency table is at least the critical value at level α, then add the itemset to SIG, else add the itemset to NOTSIG.
  • 7. Continue with the next itemset in CAND. If there are no more itemsets in CAND, then set CAND to be the set of all sets S such that every subset of S of size |S| - 1 is in NOTSIG.
  • 8. Go to Step 4.

(S. Brin, R. Motwani, C. Silverstein, Beyond
Market Baskets Generalizing Association Rules to
Correlations, SIGMOD97)
36
Sampling Large Databases for Association Rules [T 1996]
  • Pick a random sample
  • Find all association rules which hold in that sample
  • Verify the results against the rest of the database
  • Missing rules can be found in a second pass

37
Key idea in more detail
  • Find a collection of frequent sets in the sample using a lowered support threshold. This collection is likely to be a superset of the frequent sets in the entire database
  • Concept of the negative border: the minimal sets which are not in a set collection S, but all of whose proper subsets are (see the sketch below)
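A small Python sketch of the negative-border computation under that definition (the function name and toy collection are ours):

    from itertools import combinations

    def negative_border(collection, items):
        # Minimal itemsets NOT in `collection` all of whose proper
        # (one-smaller) subsets ARE in `collection`.
        coll = {frozenset(s) for s in collection}
        candidates = {frozenset([i]) for i in items}
        for s in coll:
            candidates |= {s | {i} for i in items if i not in s}
        border = set()
        for c in candidates - coll:
            if len(c) == 1 or all(frozenset(s) in coll
                                  for s in combinations(c, len(c) - 1)):
                border.add(c)
        return border

    S = [{"a"}, {"b"}, {"c"}, {"a", "b"}]
    print(negative_border(S, {"a", "b", "c"}))
    # -> {a,c} and {b,c}; {a,b,c} is excluded since {a,c} is not in S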

38
Algorithm
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
39
Second pass
  • The negative border consists of the closest itemsets which could also be frequent
  • These have to be tried (counted) in the second pass

40
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
41
Probability that a sample s has exactly c rows that contain X
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
42
Bounding error
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
43
Approximate mining
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
44
Approximate mining
(H. Toivonen, Sampling Large Databases for
Association Rules, VLDB96)
45
Summary
  • Discover all frequent sets in one pass in a fraction 1 - δ of the cases, where δ is given by the user; missing sets may be found in a second pass

46
Rules and what's next?
  • Querying rules
  • Embedding rules in applications (API)

47
MSQL
(T. Imielinski, A. Virmani, MSQL A Query
Language for Database Mining, Data Mining and
Knowledge Discovery 3, 99)
48
MSQL
(T. Imielinski, A. Virmani, MSQL A Query
Language for Database Mining, Data Mining and
Knowledge Discovery 3, 99)
49
Applications with embedded rules (what are rules good for?)
  • Typicality
  • Characteristic of
  • Changing patterns
  • Best N
  • What if
  • Prediction
  • Classification

50
OLAP
  • Multidimensional queries
  • Dimensions
  • Measures
  • Cubes

51
Data CUBE
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
52
Data Cube
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
53
Data Cube
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
54
Data Cube
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
55
Data Cube
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
56
Data Cube
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
57
Data Cube
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
58
Measure Properties
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
59
Measure Properties
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
60
Measure Properties
(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman,
D. Reichart, M. Venkatrao, Data Cube: A
Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals, Data Mining
and Knowledge Discovery 1, 1997)
61
Monotonicity
  • Iceberg queries
  • COUNT, MAX, SUM, etc. allow pruning
  • AVG does not: the AVG of a cube extension can be larger or smaller than the AVG over the original cube, so there is no pruning in the Apriori sense

62
Examples of Monotonic Conditions
  • MAX, MIN
  • TOP-k AVG (illustrated below)
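A tiny numeric illustration of both points on made-up measure values: specializing a cube can move AVG either way, while TOP-k AVG can only stay equal or drop, so a failed TOP-k AVG >= c test safely prunes all subcubes.

    cube = [10, 20, 90]   # measure values of records in a cube
    avg = lambda xs: sum(xs) / len(xs)
    topk_avg = lambda xs, k: avg(sorted(xs)[-k:])

    print(avg(cube))              # 40.0
    print(avg([10, 90]))          # 50.0 -> AVG rose under specialization
    print(avg([10, 20]))          # 15.0 -> ...or fell: no Apriori-style pruning
    print(topk_avg(cube, 2))      # 55.0
    print(topk_avg([10, 90], 2))  # 50.0 -> TOP-k AVG never exceeds the parent's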

63
Cubegrades: combining OLAP and association rules
  • Consider the rule {milk, butter} => bread (support s = 100, confidence c = 75%).
  • Consider it as a gradient or derivative of a cube.
  • Body: a 2-d cube in multidimensional space representing the transactions where milk and butter are bought together.
  • Consequent: represents the specialization of the body cube by bread. Body + consequent represents the subcube where milk, butter and bread are bought together.
  • Support: COUNT of records in the body cube.
  • Confidence: measures how COUNT is affected when we specialize the body cube by the consequent.

64
A Different Perspective
  • The same rule {milk, butter} => bread, read again as a gradient of a cube (see the decomposition on the previous slide).

65
Cubegrades: Generalization of Association Rules
  • We can generalize this in two ways:
  • Allow additional operators for cube transformation, including specializations, generalizations and mutations.
  • Allow additional measures such as MIN, MAX, SUM, etc.
  • Result: cubegrades, entities that describe how transforming a source cube X into a target cube Y affects a set of measure values.

66
Mathematical Similarity
  • Similar to a function gradient, which measures how a change in the function's argument affects the function's value.
  • A cubegrade measures how changes in a cube affect measure (function) values.

67
Using cubegrades: Examples
  • Data description: monthly summaries of item sales per customer, plus customer demographics.
  • Examples:
  • How is the average amount of milk bought affected by different age categories among buyers of cereals?
  • What factors cause the average amount of milk bought to increase by more than 25% among suburban buyers?
  • How do buyers in rural cubes compare with buyers in suburban cubes in terms of the average amount spent on bread, milk and cereal?

68
Cubegrade lingo
  • Consider the following cube:
  • areaType=urban, Age=[25,35] (Avg(salesMilk)=25)
  • Descriptor: an attribute-value pair.
  • k-conjunct: a conjunct of k descriptors.
  • Cube: the set of objects in a database that satisfy the k-conjunct.
  • Dimensions: the attributes used in the descriptors.
  • Measures: attributes that are aggregated over the cube's objects.

69
Cubegrade Definition
  • Mathematically, a cubegrade is a 5-tuple <Source, Target, Measures, Values, Delta-Value>
  • Source: the source or initial cube.
  • Target: the cube obtained by applying a factor F to the source: Target = Source + Factor.
  • Measures: the set of measures evaluated.
  • Values: a function evaluating each measure in the source.
  • Delta-Value: a function evaluating the ratio of a measure's value in the target cube versus its value in the source cube (see the sketch below).
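A minimal sketch of evaluating one cubegrade on hypothetical records (the schema and numbers are invented to mirror the running areaType/Age/salesMilk example):

    rows = [
        {"areaType": "urban", "age": 30, "salesMilk": 30},
        {"areaType": "urban", "age": 50, "salesMilk": 20},
        {"areaType": "rural", "age": 28, "salesMilk": 40},
    ]

    def avg(cube, m):
        return sum(r[m] for r in cube) / len(cube)

    # Source cube: areaType=urban; target: specialize by Age=[25,35].
    source = [r for r in rows if r["areaType"] == "urban"]
    target = [r for r in source if 25 <= r["age"] <= 35]

    value = avg(source, "salesMilk")                # measure in the source
    delta = 100 * avg(target, "salesMilk") / value  # target/source ratio, in %
    print(value, delta)                             # 25.0 120.0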

70
Cubegrade Example
  • Source cube: areaType=urban
  • Target cube: areaType=urban, Age=[25,35]
  • Measure: Avg(salesMilk)
  • Value: Avg(salesMilk) = 25 (in the source)
  • Delta Value: DeltaAvg(salesMilk) = 125%
  • Written out: areaType=urban -> areaType=urban, Age=[25,35] (Avg(salesMilk), Avg(salesMilk)=25, DeltaAvg(salesMilk)=125%)
71
Types of cubegrades
  • Starting from the cube A=a1, B=b1, C=c1:
  • Generalize on C: drop the descriptor C=c1
  • Mutate C to c2: replace C=c1 with C=c2
  • Specialize by D: add the descriptor D=d1, giving A=a1, B=b1, C=c1, D=d1
72
Querying cubegrades
  • CubeQL (for querying cubes) and CubegradeQL (for querying cubegrades).
  • Features:
  • SQL-like, declarative style.
  • Conditions on the source cube and the target cube.
  • Conditions on measure values and delta values.
  • Join conditions between source and target.

73
How, which and what
(A. Abdulgani, Ph.D. Thesis, Rutgers University, 2000)
74
The Challenge
  • Pruning was what made association rules practical.
  • Computation was bottom-up: if a cube doesn't satisfy the support threshold, no subcube will satisfy it.
  • COUNT is no longer the sole constraint; there are new, additional constraints.

75
Assumptions
  • We deal with the SQL aggregate measures MIN, MAX, SUM, AVG.
  • Each constraint is of the form AGG(X) θ c, with θ in {>, <, =} and c a constant.

76
Monotonicity
  • Consider a query Q, a database D and a cube X in D.
  • Query Q is monotonic if:

    Q(X) is FALSE in D  =>  Q(X') is FALSE in D, for every subcube X' of X
77
View Monotonicity
  • Alternatively, define a cube's view as the projection of the measure and dimension values holding on the cube.
  • A view is not tied to a particular cube or database.
  • Q is monotonic for view V if:

    For any cube X in any D such that V is a view for X:
    Q(X) is FALSE  =>  Q(X') is FALSE, for every subcube X' of X
78
GBP Sketch
  • Grid construction for the input query:
  • Axes defined on the dimension/measure attributes used in the query.
  • Axis intervals based on the constants used in the query.
  • The Cartesian product of the intervals defines the individual cells.
  • Query evaluation for each cell.

[Figure: query-evaluation grid; each cell holds T if the query can be satisfied there, F otherwise.]

                     AVG(X) in [0,25)   [25,50)   [50,+)
    MAX(X) >= 150:          F              T         T
    MAX(X) in [50,150):     T              F         T
    MAX(X) in [0,50):       F              F         T
79
Checking for satisfiability
  • A cell C is defined by:
    mL <= MIN(A) <= mH
    ML <= MAX(A) <= MH
    AL <= AVG(A) <= AH
    SL <= SUM(A) <= SH
    CL <= COUNT(*) <= CH
  • Reduce to the following system in N = COUNT and S = SUM:
    (N-1)·mL + ML <= S <= (N-1)·MH + mH
    SL <= S <= SH
    AL·N <= S <= AH·N
    CL <= N <= CH
  • Solve for N and check the interval returned for N (see the sketch below).
  • For measures on multiple attributes, solve independently for the distinct attributes and check for a common shared interval for N.
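A sketch of the satisfiability check implied by this reduction: for each admissible integer N, intersect the three bounds on S and report the first feasible count. This is our reconstruction, not the thesis code; the constants in the demo correspond to the cell used in the later example.

    def cell_satisfiable(mL, mH, ML, MH, AL, AH, SL, SH, CL, CH):
        # For each count N, S = SUM must lie in the intersection of:
        #   (N-1)*mL + ML <= S <= (N-1)*MH + mH   (min/max bounds)
        #   SL <= S <= SH                         (sum bounds)
        #   AL*N <= S <= AH*N                     (avg bounds)
        for N in range(max(CL, 1), CH + 1):
            lo = max((N - 1) * mL + ML, SL, AL * N)
            hi = min((N - 1) * MH + mH, SH, AH * N)
            if lo <= hi:
                return N  # smallest feasible count
        return None

    # MIN in [0,10], MAX in [0,50], AVG in [46.5,50], SUM free, COUNT in [1,19]:
    print(cell_satisfiable(0, 10, 0, 50, 46.5, 50, 0, float("inf"), 1, 19))  # 12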

80
View Reachability
  • Question: is there a cube X with view V such that X has a subcube which falls in a TRUE cell? Is a TRUE cell C reachable from V?

[Figure: the same AVG(X)/MAX(X) grid of T/F cells as on the previous slide, with the view V plotted in the middle band, MAX(X) in [50,150).]
81
Defining View Reachability
  • A cell C is defined by:
    mL <= MIN(A) <= mH
    ML <= MAX(A) <= MH
    AL <= AVG(A) <= AH
    SL <= SUM(A) <= SH
    CL <= COUNT(*) <= CH
  • A view V is defined by:
    MIN(A) = m
    MAX(A) = M
    AVG(A) = a
    SUM(A) = σ
    COUNT(A) = c
  • Cell C is reachable from view V if there is a set X = {x1, x2, ..., xc} of real elements which satisfies the view constraints and a subset X' = {x1, x2, ..., xN} of X which satisfies the cell constraints.

82
Checking for View Reachability
  • View reachability on measures of a single attribute can be reduced to at most 4 systems with a constant number of linear constraints on N.
  • For measures on multiple distinct attributes, obtain a set of intervals on every attribute separately. C is reachable from V if there is a shared interval obtained on N containing an integral point.

83
Example
  • Consider a view V of 19 records X = {x1, ..., x19} with MIN(X) = 0, MAX(X) = 75, SUM(X) = 1000.
  • Let C be defined by:
    [CL, CH] = [1, 19], [mL, mH] = [0, 10], [ML, MH] = [0, 50], [AL, AH] = [46.5, 50].
  • C is reachable from V either with N = 12 or with N = 15 (checked in the sketch below).
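A rough arithmetic check of the claim, reusing the S-bounds from the earlier cell_satisfiable sketch and adding the view-side constraint that the 19 - N leftover records must carry the remaining sum with values in [0, 75]:

    def reachable_with(N):
        # Feasible S-range inside the cell for this N (see earlier sketch).
        lo = max((N - 1) * 0 + 0, 46.5 * N)
        hi = min((N - 1) * 50 + 10, 50 * N)
        # Leftover records: count 19-N, sum 1000-S, each value in [0, 75];
        # the global max 75 must sit among them since the subset max <= 50.
        rest = 19 - N
        lo = max(lo, 1000 - 75 * rest)  # need 1000 - S <= 75 * rest
        return rest >= 1 and lo <= hi

    print(reachable_with(12))  # True
    print(reachable_with(15))  # True
    print(reachable_with(11))  # False: the S-interval is empty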

84
Complexity Analysis
  • Let Q be a query in disjunctive normal form consisting of m conjuncts over J dimensions and K distinct measure attributes.
  • The monotonicity of Q for a given view can be tested in O(m(J + K log K)) time.

85
Computing cubegrades
  • Algorithm Cubegrade-Gen-Basic (a code sketch follows below):
  • Evaluate Q_source
  • For each S in Q_source:
  • Evaluate Q_S, the target query relative to S
  • For each T in Q_S:
  • Form the cubegrade <S, T, Measures, Values, Delta Values>, where the Delta Values are calculated as ratios of the measures evaluated on the target and on the source cubes respectively.
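A minimal Python rendering of this nested loop; the query arguments are stand-ins for CubeQL evaluation, so this is our sketch, not the prototype's code:

    def cubegrade_gen_basic(q_source, q_target, measures):
        # q_source() yields source cubes; q_target(S) yields targets for S.
        # `measures` maps a measure name to a function from a cube (a list
        # of records) to a value.
        grades = []
        for S in q_source():
            for T in q_target(S):
                values = {name: f(S) for name, f in measures.items()}
                deltas = {name: f(T) / f(S) for name, f in measures.items()}
                grades.append((S, T, list(measures), values, deltas))
        return grades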

86
Cube and Cubegrade query classes
  • Cube Query classification
  • Queries with strong monotonicity.
  • Queries with weak monotonicity.
  • Hopeless queries.
  • Cubegrade query classification, based on source
    cube query classification and target cube
    classification
  • Focused.
  • Weakly focused.
  • Hopeless.

87
Cubegrade Application Development
  • Cubegrades are not end products; rather, they are an investment that drives a set of applications.
  • Definition of an application framework for cubegrades. Features include:
  • Extension of the Dmajor data mining platform.
  • Generation, storage and retrieval of cubegrades.
  • Accessing internal components of cubegrades for browsing, comparisons and modifications.
  • Traversals through a set of cubegrades.
  • Primitives for correlating cubegrades with the underlying data and vice versa.

88
Application Example: Effective Factors
  • Find factors which are effective in changing a measure value m for a collection of cubes C by a significant ratio.
  • Factor F is effective for C iff for all cubegrades G = <C, C+F, m, V, Delta> where C is in C, it holds that Delta(m) > (1+x) or Delta(m) < (1-x) (see the sketch below).
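As a sketch, the test reduces to checking every delta a factor induces against the 1 ± x band (the deltas below are invented ratios):

    def effective(deltas, x):
        # Effective iff every cubegrade's delta lies outside [1-x, 1+x].
        return all(d > 1 + x or d < 1 - x for d in deltas)

    print(effective([1.30, 0.60, 1.40], x=0.25))  # True
    print(effective([1.30, 1.10], x=0.25))        # False: 1.10 is in the band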

89
Cubegrades and OLAP
90
Future work
  • Extending GBP to cover additional constraint types.
  • The monotonicity threshold of a query.
  • Domain-specific application: gene expression mining.

91
Summary
  • The cubegrade concept as a generalization of association rules and cubes.
  • The concept of querying cubes and cubegrades.
  • Description of the GBP method for efficient pruning of queries with constraints of type Agg(a) >= c or Agg(a) <= c, where Agg() can be MIN(), MAX(), SUM(), AVG().
  • A cubegrade engine prototype experimentally showed the viability of GBP and of the cubegrade generation process.
  • Classification of a hierarchy of query classes based on theoretical pruning characteristics.
  • Presentation of a framework for developing cubegrade applications.

92
Conclusions
  • OLAP and association rules are really one approach
  • Key problem: the set of rules/cubegrades can be orders of magnitude larger than the source data set
  • Hence, the key issue is how we present and use the obtained rules in applications that provide real value for the user
  • Discovery as querying