Data Mining: A Database Perspective

Slides: 49
Provided by: pyxid

1
Data Mining: A Database Perspective
  • Presented by YC Liu

2
References
  • Jiawei Han and Micheline Kamber, "Data Mining:
    Concepts and Techniques," Chapter 6.
  • M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An
    Overview from a Database Perspective," IEEE
    Transactions on Knowledge and Data Engineering,
    8(6):866-883, 1996.
  • J. Liu, Y. Pan, K. Wang, and J. Han, "Mining
    Frequent Item Sets by Opportunistic Projection,"
    in Proc. of 2002 Int. Conf. on Knowledge
    Discovery in Databases (KDD'02), Edmonton,
    Canada, July 2002.

3
Outline
  • Introduction
  • Mining Association Rules
  • Multilevel Data Generalization, Summarization,
    and Characterization
  • Data Classification
  • Clustering Analysis
  • (Pattern-Based Similarity Search)
  • (Mining Path Traversal Patterns)
  • (Recommendation)
  • (Web Mining)
  • (Text Mining)

4
Introduction (1/5)
  • Knowledge Discovery in Databases (KDD)
    • A process of nontrivial extraction of implicit,
      previously unknown, and potentially useful
      information.

5
Introduction (2/5)
  • [Slide text lost in extraction: the original
    characters did not survive encoding. The only
    legible fragment is "Data Mining".]

6
Introduction (3/5): Data Mining, a KDD Process
  • Data mining: the core of the knowledge discovery
    process.
[Figure: the KDD process, from bottom to top:
Databases → Data Cleaning and Data Integration →
Data Warehouse → Selection → Task-relevant Data →
Data Mining → Pattern Evaluation → Knowledge]

7
Introduction (4/5): Challenges of Data Mining (1/2)
  • Handling of Different Types of Data
  • Efficiency and Scalability of Data Mining
    Algorithms
  • Usefulness, Certainty, and Expressiveness of Data
    Mining Results
  • Expression of Various Kinds of Data Mining
    Requests and Results

8
Introduction (5/5): Challenges of Data Mining (2/2)
  • Interactive Mining of Knowledge at Multiple
    Abstraction Levels
  • Mining Information from Different Sources of Data
  • Protection of Privacy and Data Security

9
An Overview of Data Mining Techniques
  • Classifying Data Mining Techniques
    • What kinds of databases to work on
      • Relational database, transaction database,
        spatial database, temporal database, ...
    • What kinds of knowledge to be mined
      • Association rules, classification, clustering, ...
    • What kinds of techniques to be utilized
      • Generalization-based mining, pattern-based
        mining, mining based on statistics or
        mathematical theories

10
Mining Different Kinds of Knowledge from Databases
  • Association Rules
  • Data generalization, summarization, and
    characterization
  • Data classification
  • Data clustering
  • Pattern-based similarity search
  • Path traversal patterns
  • Recommendation
  • Web Mining
  • Text Mining

11
Mining Association Rules
  • An association rule is an implication of the form
    X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅
    (I is the set of all items).
  • The rule X ⇒ Y has support s in the transaction
    set D if s% of transactions in D contain X ∪ Y.
  • The rule X ⇒ Y holds in the transaction set D with
    confidence c if c% of transactions in D that
    contain X also contain Y.

12
What Is Association Mining?
  • Association rule mining
    • Finding frequent patterns, associations,
      correlations, or causal structures among sets of
      items or objects in transaction databases,
      relational databases, and other information
      repositories.
  • Applications
    • Cross-marketing and attached mailing
      applications. Other applications include catalog
      design, add-on sales, store layout, and customer
      segmentation based on buying patterns.
  • Examples
    • Rule form: Body ⇒ Head [support, confidence]
    • buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
    • major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A")
      [1%, 75%]

13
Association Rule: Basic Concepts
  • Given (1) a database of transactions, where (2) each
    transaction is a list of items (purchased by a
    customer in a visit)
  • Find all rules that correlate the presence of
    one set of items with that of another set of
    items
    • E.g., 98% of people who purchase tires and auto
      accessories also get automotive services done
  • Applications
    • * ⇒ Maintenance Agreement (What should the store
      do to boost Maintenance Agreement sales?)
    • Home Electronics ⇒ * (What other products
      should the store stock up on?)

14
Rule Measures: Support and Confidence
[Figure: Venn diagram of "Customer buys diaper" and
"Customer buys beer", overlapping in "Customer buys
both"]
  • Find all the rules X ∧ Y ⇒ Z with minimum
    confidence and support
    • support, s: probability that a transaction
      contains X ∪ Y ∪ Z
    • confidence, c: conditional probability that a
      transaction having X ∪ Y also contains Z
  • With minimum support 50% and minimum confidence
    50%, we have
    • A ⇒ C (50%, 66.6%)
    • C ⇒ A (50%, 100%)

15
Association Rule Mining: A Road Map
  • Boolean vs. quantitative associations (based on
    the types of values handled)
    • buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒
      buys(x, "DBMiner") [0.2%, 60%]
    • age(x, "30..39") ∧ income(x, "42..48K") ⇒
      buys(x, "PC") [1%, 75%]
  • Single-dimensional vs. multidimensional
    associations
    • age(x, "30..39") ∧ income(x, "42..48K") ⇒
      buys(x, "PC") [1%, 75%]
  • Single-level vs. multiple-level analysis
    • What brands of beers are associated with what
      brands of diapers?
  • Various extensions
    • Correlation, causality analysis
      • Association does not necessarily imply
        correlation or causality
    • Max-patterns and closed itemsets
    • Constraints enforced
      • E.g., do small sales (sum < 100) trigger big
        buys (sum > 1,000)?

16
Mining Association Rules: An Example
Min. support 50%, min. confidence 50%
  • For rule A ⇒ C
    • support = support({A, C}) = 50%
    • confidence = support({A, C}) / support({A}) = 66.6%
  • The Apriori principle
    • Any subset of a frequent itemset must be frequent
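The support and confidence figures for A ⇒ C can be checked with a short Python sketch. The transaction table from the slide did not survive in this transcript, so the four transactions below are an assumed stand-in chosen to agree with the quoted numbers; the names `TRANSACTIONS`, `support`, and `confidence` are illustrative.

```python
# Assumed stand-in for the slide's lost transaction table.
TRANSACTIONS = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "C"}),
    frozenset({"A", "D"}),
    frozenset({"B", "E", "F"}),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(body, head, transactions):
    """support(body union head) divided by support(body)."""
    joint = support(set(body) | set(head), transactions)
    return joint / support(body, transactions)


print(f"A => C: support {support({'A', 'C'}, TRANSACTIONS):.1%}, "
      f"confidence {confidence({'A'}, {'C'}, TRANSACTIONS):.1%}")
print(f"C => A: support {support({'A', 'C'}, TRANSACTIONS):.1%}, "
      f"confidence {confidence({'C'}, {'A'}, TRANSACTIONS):.1%}")
```

Under these assumed transactions, A appears in three of four transactions and {A, C} in two, which is where 50% support and roughly 66.7% confidence come from.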

17
Mining Association Rules
  • Steps for mining association rules
    • Discover all large itemsets
    • Use the large itemsets to generate the
      association rules for the database
  • Algorithm for identifying the large itemsets:
    Apriori
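The two steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration of the Apriori idea (level-wise candidate generation with subset pruning), not the presenter's implementation; the sample database `DB` and the function names `apriori` and `rules` are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # One pass over the data to count each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: combine large (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


def rules(large_itemsets, min_confidence):
    """Return (body, head, support, confidence) tuples from the large itemsets."""
    found = []
    for itemset, sup in large_itemsets.items():
        for r in range(1, len(itemset)):
            for body in map(frozenset, combinations(itemset, r)):
                conf = sup / large_itemsets[body]
                if conf >= min_confidence:
                    found.append((body, itemset - body, sup, conf))
    return found


# Hypothetical transaction database used only for illustration.
DB = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
large = apriori(DB, min_support=0.5)
for body, head, sup, conf in rules(large, min_confidence=0.5):
    print(sorted(body), "=>", sorted(head), f"[{sup:.0%}, {conf:.1%}]")
```

The pruning step is what keeps the candidate set small: a k-itemset is only counted against the database if all of its (k-1)-subsets were already found to be large.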

18
Mining generalized and multi-level association
rules
  • Interesting associations among data items often
    occur at a relatively high concept level

19
Interestingness of Discovered Association Rules
  • Example 1 (Aggarwal & Yu, PODS'98)
    • Among 5000 students
      • 3000 play basketball
      • 3750 eat cereal
      • 2000 both play basketball and eat cereal
    • play basketball ⇒ eat cereal [40%, 66.7%] is
      misleading because the overall percentage of
      students eating cereal is 75%, which is higher
      than 66.7%.
    • play basketball ⇒ not eat cereal [20%, 33.3%] is
      far more accurate, although with lower support
      and confidence
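The arithmetic behind this example can be reproduced with a few lines of Python. The counts come from the slide; the helper name `rule_stats` and the idea of printing the rule's confidence next to the overall frequency of its head are illustrative choices.

```python
TOTAL = 5000
BASKETBALL = 3000
CEREAL = 3750
BOTH = 2000


def rule_stats(n_body, n_head, n_both, n_total):
    """Return (support, confidence, prior of head) for the rule body => head."""
    support = n_both / n_total
    confidence = n_both / n_body
    prior = n_head / n_total
    return support, confidence, prior


# play basketball => eat cereal
s1, c1, p1 = rule_stats(BASKETBALL, CEREAL, BOTH, TOTAL)
# play basketball => not eat cereal
s2, c2, p2 = rule_stats(BASKETBALL, TOTAL - CEREAL, BASKETBALL - BOTH, TOTAL)

print(f"basketball => cereal:     [{s1:.0%}, {c1:.1%}], prior {p1:.0%}")
print(f"basketball => not cereal: [{s2:.0%}, {c2:.1%}], prior {p2:.0%}")
```

For the first rule the confidence falls below the overall cereal rate, so playing basketball actually lowers the chance of eating cereal; for the second rule the confidence exceeds the overall rate of not eating cereal (25%), which is why it is the more informative rule.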

20
Interestingness of Discovered Association Rules
  • An association rule A ⇒ B is interesting if its
    confidence exceeds a certain measure, or formally,
    P(A ∩ B) / P(A) > d,
    where d is a suitable constant.

21
Improving the Efficiency of Mining Association
Rules
  • Database Scan Reduction
  • FP-tree......
  • Sampling
  • Incremental Updating of Discovered Association
    Rules
  • Parallel Data Mining

22
Classification
  • A process of learning a function that maps a data
    item into one of several predefined classes.
  • Every classification method based on
    inductive-learning algorithms takes as input a set
    of samples, each consisting of a vector of
    attribute values and a corresponding class.
  • Predicts categorical class labels.
  • Constructs a model from the training set and the
    class labels in a classifying attribute, then uses
    the model to classify new data.

23
Classification Process (1): Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
24
Classification Process (2): Use the Model in
Prediction
(Jeff, Professor, 4)
Tenured?
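As a small illustration, the rule from the model-construction slide can be applied to the unseen tuple (Jeff, Professor, 4). The sketch assumes the garbled rule reads "rank = professor OR years > 6", and that any tuple not matching the rule is labelled "no"; the function name `predict_tenured` is hypothetical.

```python
def predict_tenured(rank: str, years: int) -> str:
    """Apply the learned rule; the 'no' default is an assumption."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"


# Unseen tuple from the slide: (Jeff, Professor, 4)
name, rank, years = "Jeff", "Professor", 4
print(name, "tenured?", predict_tenured(rank, years))
```

Jeff satisfies the first condition of the rule (rank is professor), so the model predicts "yes" even though his four years fall below the second threshold.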
25
Data Classification
  • Decision-tree-based Classification Method
    • Decision Tree Learning System, ID3
  • Evaluation Functions
    • Information Gain
    • Gini Index
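A minimal Python sketch of the two evaluation functions follows. The class counts are hypothetical (14 samples split by a three-valued attribute) and are not taken from the slide, whose training table did not survive in this transcript; the function names are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini index of a class distribution given as a list of counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def information_gain(parent_counts, partitions):
    """Entropy of the parent minus the weighted entropy of its partitions."""
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder


# Hypothetical counts: [yes, no] overall, then per attribute value.
parent = [9, 5]
by_attribute = [[2, 3], [4, 0], [3, 2]]

print(f"entropy(parent) = {entropy(parent):.3f}")
print(f"gain(attribute) = {information_gain(parent, by_attribute):.3f}")
print(f"gini(parent)    = {gini(parent):.3f}")
```

A tree learner in the style of ID3 would compute the gain for every candidate attribute and split on the one with the largest value; a Gini-based learner would instead pick the split that most reduces the weighted Gini index.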

26
Training Dataset
This follows an example from Quinlan's ID3
27
Output: A Decision Tree for buys_computer
[Figure: decision tree]
  • age?
    • <30: student?
      • no: no
      • yes: yes
    • 30..40: yes
    • >40: credit rating?
      • fair: yes
      • excellent: no
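A decision tree of this kind is just a nest of tests on attribute values. The sketch below encodes one reading of the buys_computer tree (age at the root, then student for the youngest group and credit rating for the oldest group). The exact pairing of branches and leaves is an assumption, since the figure is flattened in this transcript, and the function name and argument types are hypothetical.

```python
def buys_computer(age: int, student: bool, credit_rating: str) -> str:
    """Classify one customer with the (assumed) buys_computer tree."""
    if age < 30:
        # Youngest group: the decision depends on student status.
        return "yes" if student else "no"
    if age <= 40:
        # Middle group: always predicted to buy.
        return "yes"
    # Oldest group: the decision depends on credit rating.
    return "yes" if credit_rating == "fair" else "no"


print(buys_computer(25, student=True, credit_rating="fair"))
print(buys_computer(35, student=False, credit_rating="excellent"))
print(buys_computer(45, student=False, credit_rating="excellent"))
```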
28
Performance Improvement
  • Database Indices
  • Attribute-oriented Induction
  • Two-phase Multiattribute Extraction
  • Inference Power
  • Feature Extraction Phase
  • Feature Combination Phase

29
Clustering Analysis
  • Clustering: the process of grouping physical or
    abstract objects into classes of similar objects.
  • Clustering analysis: constructing a meaningful
    partitioning of a large set of objects based on a
    divide-and-conquer methodology.
  • Methods
    • Statistical Analysis (Bayesian Classification
      Method)
    • Probability Analysis

30
Clustering Based on Randomized Search
  • PAM (Partitioning Around Medoids)
  • CLARA (Clustering LARge Applications)
  • CLARANS (Clustering Large Applications Based
    Upon RANdomized Search)

31
PAM (Partitioning Around Medoids) (1987)
  • PAM (Kaufman and Rousseeuw, 1987), built into S-PLUS
  • Uses real objects to represent the clusters
    1. Select k representative objects arbitrarily
    2. For each pair of non-selected object h and
       selected object i, calculate the total swapping
       cost TC_ih
    3. For each pair of i and h, if TC_ih < 0, replace
       i by h; then assign each non-selected object to
       the most similar representative object
    4. Repeat steps 2-3 until there is no change
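The PAM loop can be sketched in Python as below. This is a simplified illustration under stated assumptions: the initial medoids are simply the first k objects, the swapping cost is computed by re-evaluating the full clustering cost, and only the single best negative-cost swap is applied per iteration. The names `pam` and `total_cost` and the sample points are hypothetical.

```python
from itertools import product
from math import dist


def total_cost(points, medoids):
    """Sum of distances from each object to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    medoids = list(range(k))  # arbitrary initial representatives (assumption)
    while True:
        current = total_cost(points, medoids)
        best_delta, best_swap = 0.0, None
        # Evaluate the swapping cost for every (selected i, non-selected h) pair.
        for pos, h in product(range(k), range(len(points))):
            if h in medoids:
                continue
            trial = medoids[:pos] + [h] + medoids[pos + 1:]
            delta = total_cost(points, trial) - current
            if delta < best_delta:
                best_delta, best_swap = delta, (pos, h)
        if best_swap is None:  # no swap has negative cost: stop
            break
        pos, h = best_swap
        medoids[pos] = h
    # Assign every object to its most similar representative object.
    labels = [
        min(range(k), key=lambda j: dist(p, points[medoids[j]])) for p in points
    ]
    return medoids, labels


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids, labels = pam(points, k=2)
print("medoids:", [points[m] for m in medoids])
print("labels: ", labels)
```

Each iteration evaluates k(n-k) candidate swaps and each evaluation touches all n objects, which is why PAM becomes expensive on large data sets and motivates sampling-based variants such as CLARA.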

32
PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih
33
CLARA (Clustering Large Applications) (1990)
  • CLARA (Kaufman and Rousseeuw, 1990)
    • Built into statistical analysis packages, such as
      S+
  • It draws multiple samples of the data set,
    applies PAM to each sample, and gives the best
    clustering as the output
  • Strength: deals with larger data sets than PAM
  • Weaknesses
    • Efficiency depends on the sample size
    • A good clustering based on samples will not
      necessarily represent a good clustering of the
      whole data set if the sample is biased

34
Focusing Methods
  • Motivation
    • CLARANS assumes that all the objects to be
      clustered are stored in main memory
    • The most computationally expensive step of
      CLARANS is calculating the total distances
      between the two clusters
  • Reducing the number of objects considered
    • Only the most central object of a leaf node of
      the R-tree is used to compute the medoids of
      the clusters
  • Restricting the access
    • Focus on relevant clusters
    • Focus on a cluster

35
BIRCH (Balanced Iterative Reducing and Clustering)
  • An incremental method whose memory requirements
    can be adjusted to the size of the memory that is
    available
  • Clustering Features
    • Summarize information about the subclusters of
      points instead of storing all points
  • CF Trees
    • Branching factor B and threshold T
    • By changing the threshold value we can change the
      size of the tree
  • Use an arbitrary clustering algorithm to cluster
    the leaf nodes of the CF-tree

36
Clustering Feature Vector
CF = (5, (16,30), (54,190))
Points: (3,4) (2,6) (4,5) (4,7) (3,8)
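The clustering feature on this slide can be recomputed from the five points. The sketch below assumes the triple is (number of points, per-dimension linear sum, per-dimension sum of squares), which matches the values shown; the function names `clustering_feature`, `merge`, and `centroid` are illustrative.

```python
def clustering_feature(points):
    """Return (N, linear sum, sum of squares), sums taken per dimension."""
    n = len(points)
    dims = range(len(points[0]))
    linear_sum = tuple(sum(p[d] for p in points) for d in dims)
    square_sum = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return n, linear_sum, square_sum


def merge(cf_a, cf_b):
    """CFs are additive: merging two subclusters adds them componentwise."""
    (na, lsa, ssa), (nb, lsb, ssb) = cf_a, cf_b
    return (
        na + nb,
        tuple(a + b for a, b in zip(lsa, lsb)),
        tuple(a + b for a, b in zip(ssa, ssb)),
    )


def centroid(cf):
    """The centroid is recoverable from the CF without the raw points."""
    n, linear_sum, _ = cf
    return tuple(x / n for x in linear_sum)


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print("CF =", cf)
print("centroid =", centroid(cf))
```

Additivity is the property that makes the CF tree incremental: inserting a point or merging two subclusters only updates three small summaries, never the stored points themselves.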
37
CF Tree
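[Figure: CF tree with B = 7 and L = 6. The root and
each non-leaf node hold CF entries (CF1, CF2, CF3,
...), each paired with a pointer (child1, child2,
child3, ...) to a child node. Leaf nodes hold CF
entries and are chained together by prev and next
pointers.]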
38
Data Generalization, Summarization, and
Characterization
  • Data generalization: a process that abstracts a
    large set of relevant data in a database from a
    low concept level to relatively high ones
  • Approaches
  • Data Cube Approach
  • Attribute-oriented Induction Approach

39
Data Cube Approach
  • Multidimensional database, OLAP, ...
  • The general idea of the approach is to
    materialize certain expensive computations that
    are frequently requested
    • Such as count, sum, average, max, min, ...
  • Fast response time and flexible views of data
    from different angles at different abstraction
    levels
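The idea of materializing aggregates ahead of time can be illustrated with a small Python sketch. It precomputes count and sum for every combination of dimensions over a hypothetical sales table; the table, the dimension names, and the function `materialize_cube` are all assumptions made for the example, and a real OLAP system would materialize only a chosen subset of these views.

```python
from collections import defaultdict
from itertools import combinations


def materialize_cube(rows, dimensions, measure):
    """Precompute count and sum of `measure` for every subset of dimensions."""
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            cells = defaultdict(lambda: [0, 0.0])
            for row in rows:
                key = tuple(row[d] for d in dims)
                cells[key][0] += 1
                cells[key][1] += row[measure]
            cube[dims] = {
                key: {"count": c, "sum": s} for key, (c, s) in cells.items()
            }
    return cube


# Hypothetical fact table.
sales = [
    {"region": "east", "product": "tv", "amount": 100.0},
    {"region": "east", "product": "pc", "amount": 200.0},
    {"region": "west", "product": "tv", "amount": 150.0},
    {"region": "west", "product": "tv", "amount": 50.0},
]

cube = materialize_cube(sales, ("region", "product"), "amount")
print(cube[()][()])                                 # grand total
print(cube[("region",)][("east",)])                 # roll-up by region
print(cube[("region", "product")][("west", "tv")])  # most detailed cell
```

Once the cube is built, answering a query at any abstraction level is a dictionary lookup rather than a scan of the fact table. Averages follow from the stored sum and count.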

40
Attribute-oriented Induction Approach
  • Essential background knowledge: concept hierarchy
  • Steps
    • Retrieval of the initial relation
    • Attribute removal
    • Concept-tree climbing
    • Vote propagation
    • Threshold control
    • Rule transformation
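A rough Python sketch of the middle steps (attribute removal, concept-tree climbing, vote propagation, and threshold control) is given below. It is a simplification under stated assumptions: the concept hierarchy is a one-level mapping invented for the example, all attributes climb one level per pass, and the attributes to remove are passed in explicitly. The student tuples and the function name `generalize` are hypothetical.

```python
from collections import Counter


def generalize(tuples, hierarchies, removed, threshold):
    """Return a Counter mapping each generalized tuple to its vote."""
    # Attribute removal: drop attributes that cannot be generalized.
    kept = [a for a in tuples[0] if a not in removed]
    # Every tuple of the initial relation starts with a vote of 1.
    relation = Counter(tuple((a, t[a]) for a in kept) for t in tuples)
    # Threshold control: keep climbing while the relation is too large.
    while len(relation) > threshold:
        climbed = Counter()
        changed = False
        for row, vote in relation.items():
            # Concept-tree climbing: replace each value by its parent concept.
            new_row = tuple((a, hierarchies.get(a, {}).get(v, v)) for a, v in row)
            changed = changed or new_row != row
            # Vote propagation: identical generalized tuples merge their votes.
            climbed[new_row] += vote
        relation = climbed
        if not changed:  # nothing left to climb
            break
    return relation


# Hypothetical one-level concept hierarchy and initial relation.
hierarchies = {
    "major": {"history": "art", "literature": "art",
              "physics": "science", "math": "science"},
}
students = [
    {"name": "Ann", "major": "history", "status": "graduate"},
    {"name": "Bob", "major": "physics", "status": "graduate"},
    {"name": "Cat", "major": "math", "status": "graduate"},
    {"name": "Dan", "major": "literature", "status": "graduate"},
]

result = generalize(students, hierarchies, removed={"name"}, threshold=2)
for row, vote in result.items():
    print(dict(row), "vote =", vote)
```

The votes record how many original tuples each generalized tuple stands for, so the final relation can be turned into quantitative rules such as "half of the graduate students major in science".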

41
Concept Hierarchy and Concept-Tree
  • [Slide text lost in extraction; legible
    fragments: "ANY", "ALL", "Birth place".]

42
example
  • [Slide text lost in extraction; legible
    fragment: "graduated student".]

43
example
  • [Slide text lost in extraction; legible
    fragment: "Concept Hierarchy Table".]

44
example
  • [Slide text lost in extraction; legible
    fragments: "Status", "Graduate", "Vote".]

45
Example-attribute removal
  • [Slide text lost in extraction.]

46
Example-Concept-tree Climbing and Vote Propagation
  • [Slide text lost in extraction; legible
    fragments: "history, physics, math...",
    "science".]
  • [Slide text lost in extraction; legible
    fragments: "tuples", "vote", "tuple".]

47
Example-Concept-tree Climbing and Vote Propagation
48
Example-Threshold Control and Rule Transformation
  • Threshold Control [remaining text lost in
    extraction]
  • [Slide text lost in extraction.]