Data Mining: A Database Perspective

1
Data Mining: A Database Perspective

Presented by YC Liu

2
References

Jiawei Han and Micheline Kamber, "Data Mining:
Concepts and Techniques," Chapter 6.
M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An
Overview from a Database Perspective," IEEE
Transactions on Knowledge and Data Engineering,
8(6): 866-883, 1996.
J. Liu, Y. Pan, K. Wang, and J. Han, "Mining
Frequent Item Sets by Opportunistic Projection,"
in Proc. of 2002 Int. Conf. on Knowledge
Discovery in Databases (KDD'02), Edmonton,
Canada, July 2002.

3
Outline

Introduction
Mining Association Rules
Multilevel Data Generalization, Summarization,
and Characterization
Data Classification
Clustering Analysis
(Pattern-Based Similarity Search)
(Mining Path Traversal Patterns)
(Recommendation)
(Web Mining)
(Text Mining)

4
Introduction (1/5)

Knowledge Discovery in Databases
A process of nontrivial extraction of implicit,
previously unknown and potentially useful
information.

5
Introduction (2/5)

[Slide text lost in extraction. Surviving term: Data Mining]

6
Introduction (3/5): Data Mining, a KDD Process

Data mining: the core of the knowledge discovery
process.

[Figure: the KDD process. Databases -> Data Cleaning
and Data Integration -> Data Warehouse -> Selection ->
Task-relevant Data -> Data Mining -> Pattern
Evaluation -> Knowledge]

7
Introduction (4/5): Challenges of Data Mining (1/2)

Handling of Different Types of Data
Efficiency and Scalability of Data Mining
Algorithms
Usefulness, Certainty, and Expressiveness of Data
Mining Results
Expression of Various Kinds of Data Mining
Requests and Results

8
Introduction (5/5): Challenges of Data
Mining (2/2)

Interactive Mining of Knowledge at Multiple
Abstraction Levels
Mining Information from Different Sources of Data
Protection of Privacy and Data Security

9
An Overview of Data Mining Techniques

Classifying Data Mining Techniques
What kinds of databases to work on:
relational databases, transaction databases,
spatial databases, temporal databases, ...
What kinds of knowledge to be mined:
association rules, classification, clustering, ...
What kinds of techniques to be utilized:
generalization-based mining, pattern-based
mining, mining based on statistics or
mathematical theories, ...

10
Mining Different Kinds of Knowledge from
Databases

Association Rules
Data generalization, summarization, and
characterization
Data classification
Data clustering
Pattern-based similarity search
Path traversal patterns
Recommendation
Web Mining
Text Mining

11
Mining Association Rules

An association rule is an implication of the form
X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
The rule X ⇒ Y has support s in the transaction
set D if s% of the transactions in D contain X ∪ Y.
The rule X ⇒ Y holds in the transaction set D with
confidence c if c% of the transactions in D that
contain X also contain Y.

12
What Is Association Mining?

Association rule mining
Finding frequent patterns, associations,
correlations, or causal structures among sets of
items or objects in transaction databases,
relational databases, and other information
repositories.
Applications
For cross-marketing and attached mailing
applications. Other applications include catalog
design, add-on sales, store layout and customer
segmentation based on buying patterns.
Examples
Rule form: Body ⇒ Head [support, confidence]
buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A")
[1%, 75%]

13
Association Rule: Basic Concepts

Given: (1) a database of transactions, where (2) each
transaction is a list of items (purchased by a
customer in a visit)
Find: all rules that correlate the presence of
one set of items with that of another set of
items
E.g., 98% of people who purchase tires and auto
accessories also get automotive services done
Applications
* ⇒ Maintenance Agreement (What should the store
do to boost Maintenance Agreement sales?)
Home Electronics ⇒ * (What other products
should the store stock up on?)

14
Rule Measures: Support and Confidence

[Figure: Venn diagram of customers who buy diapers,
customers who buy beer, and customers who buy both]

Find all the rules X ∧ Y ⇒ Z with minimum
confidence and support
support, s: probability that a transaction
contains {X, Y, Z}
confidence, c: conditional probability that a
transaction having {X, Y} also contains Z

With minimum support 50% and minimum confidence
50%, we have
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
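
The two measures can be checked with a few lines of Python. The transaction table behind the figures above is not part of this transcript, so the four transactions below are an assumption, chosen only because they reproduce the stated values; `support` and `confidence` are illustrative helper names.

```python
from fractions import Fraction

# Assumed toy database (not from the slides): four transactions chosen so that
# A => C and C => A come out with the support and confidence quoted above.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in db), len(db))

def confidence(body, head, db):
    """Share of the transactions containing body that also contain head."""
    return support(set(body) | set(head), db) / support(body, db)

for body, head in [({"A"}, {"C"}), ({"C"}, {"A"})]:
    s = support(body | head, transactions)
    c = confidence(body, head, transactions)
    print(f"{sorted(body)} => {sorted(head)}: "
          f"support {float(s):.1%}, confidence {float(c):.1%}")
```

Exact fractions are used so that the 2/3 confidence of A ⇒ C is not blurred by rounding before it is printed.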

15
Association Rule Mining: A Road Map

Boolean vs. quantitative associations (based on
the types of values handled)
buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒
buys(x, "DBMiner") [0.2%, 60%]
age(x, "30..39") ∧ income(x, "42..48K") ⇒
buys(x, "PC") [1%, 75%]
Single-dimensional vs. multidimensional
associations
age(x, "30..39") ∧ income(x, "42..48K") ⇒
buys(x, "PC") [1%, 75%]
Single-level vs. multiple-level analysis
What brands of beers are associated with what
brands of diapers?
Various extensions
Correlation, causality analysis
Association does not necessarily imply
correlation or causality
Max-patterns and closed itemsets
Constraints enforced
E.g., do small sales (sum < 100) trigger big buys
(sum > 1,000)?

16
Mining Association Rules: An Example
Min. support 50%, min. confidence 50%

For rule A ⇒ C
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle
Any subset of a frequent itemset must be frequent

17
Mining Association Rules

Steps for mining association rules:
Discover all large itemsets
Use the large itemsets to generate the
association rules for the database
Algorithm for identifying the large itemsets:
Apriori
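
The two steps can be sketched as follows. This is a minimal, unoptimized illustration of level-wise candidate generation with Apriori pruning, not the tuned algorithm from the literature; the function names `apriori` and `rules` and the four-transaction database are assumptions made for the example.

```python
from itertools import combinations

def apriori(db, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(db)

    def frequent(candidates):
        counts = {c: sum(c <= t for t in db) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    level = frequent({frozenset([i]) for t in db for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune (Apriori principle): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result

def rules(freq, min_confidence):
    """Step 2: derive (body, head, support, confidence) from the large itemsets."""
    out = []
    for itemset, s in freq.items():
        for r in range(1, len(itemset)):
            for body in map(frozenset, combinations(itemset, r)):
                conf = s / freq[body]
                if conf >= min_confidence:
                    out.append((body, itemset - body, s, conf))
    return out

# Assumed toy database, not taken from the slides.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
freq = apriori(transactions, min_support=0.5)
for body, head, s, conf in rules(freq, min_confidence=0.5):
    print(sorted(body), "=>", sorted(head), f"support={s:.2f} confidence={conf:.2f}")
```

Looking up `freq[body]` in `rules` is safe because every subset of a large itemset is itself large and was therefore recorded in step 1.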

18
Mining generalized and multi-level association
rules

Interesting associations among data items often
occur at a relatively high concept level

19
Interestingness of Discovered Association Rules

Example 1 (Aggarwal & Yu, PODS'98)
Among 5000 students
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is
misleading because the overall percentage of
students eating cereal is 75%, which is higher
than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is
far more accurate, although with lower support
and confidence
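
The arithmetic behind this example can be reproduced directly from the counts above. The ratio of confidence to base rate, often called lift, is not named on the slide; it is added here as an assumption, only to make the comparison with the 75% base rate explicit.

```python
n_students = 5000
n_basketball = 3000
n_cereal = 3750
n_both = 2000

# Rule: play basketball => eat cereal
support = n_both / n_students                  # 0.40
confidence = n_both / n_basketball             # about 0.667
base_rate = n_cereal / n_students              # 0.75, P(eat cereal) overall
lift = confidence / base_rate                  # below 1: negative association

# Rule: play basketball => not eat cereal
neg_support = (n_basketball - n_both) / n_students        # 0.20
neg_confidence = (n_basketball - n_both) / n_basketball   # about 0.333
neg_base_rate = (n_students - n_cereal) / n_students      # 0.25
neg_lift = neg_confidence / neg_base_rate                 # above 1

print(f"cereal:     support={support:.1%} confidence={confidence:.1%} lift={lift:.3f}")
print(f"not cereal: support={neg_support:.1%} confidence={neg_confidence:.1%} lift={neg_lift:.3f}")
```

A ratio below 1 for the first rule says that basketball players eat cereal less often than students in general, which is why the rule misleads despite passing both thresholds.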

20
Interestingness of Discovered Association Rules

An association rule A ⇒ B is interesting if its
confidence exceeds a certain measure, or formally,
P(A ∩ B) / P(A) > d,
where d is a suitable constant.

21
Improving the Efficiency of Mining Association
Rules

Database Scan Reduction
FP-tree, ...
Sampling
Incremental Updating of Discovered Association
Rules
Parallel Data Mining

22
Classification

A process of learning a function that maps a data
item into one of several predefined classes.
Every inductive-learning classification algorithm
takes as input a set of samples, each consisting
of a vector of attribute values and a
corresponding class.
Predicts categorical class labels
Classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute, and uses the model to
classify new data

23
Classification Process (1): Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured =
'yes'
24
Classification Process (2): Use the Model in
Prediction
(Jeff, Professor, 4)
Tenured?
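
As a small illustration, the rule learned in the model construction step can be applied to the unseen record (Jeff, Professor, 4). The function name `tenured` and the second and third test records are assumptions made for this sketch.

```python
def tenured(rank: str, years: int) -> bool:
    """Classifier from the slide: IF rank = 'professor' OR years > 6 THEN tenured."""
    return rank.lower() == "professor" or years > 6

# Unseen record from the slide: (Jeff, Professor, 4)
print("Jeff tenured?", tenured("Professor", 4))

# Hypothetical extra records, included only to exercise both conditions.
print(tenured("Assistant Prof", 3), tenured("Assistant Prof", 7))
```
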
25
Data Classification

Decision-tree-based Classification Method
Decision Tree Learning System, ID3
Evaluation Functions
Information Gain
Gini Index
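
Both evaluation functions can be sketched in a few lines. The four records below are hypothetical and exist only to exercise the functions; they are not the training set used in the presentation, and the helper names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information (in bits) needed to classify a sample."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy of the target minus its weighted entropy after splitting on attr."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - remainder

# Hypothetical records (assumption, not from the slides).
rows = [
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
]
for attr in ("student", "credit"):
    print(attr, "gain =", information_gain(rows, attr, "buys"))
print("gini of class labels =", gini([r["buys"] for r in rows]))
```

An ID3-style learner would split on the attribute with the largest gain, here `student`, because it separates the two classes completely.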

26
Training Dataset
This follows an example from Quinlan's ID3
27
Output: A Decision Tree for buys_computer
age?
  <30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
28
Performance Improvement

Database Indices
Attribute-oriented Induction
Two-phase Multiattribute Extraction
Inference Power
Feature Extraction Phase
Feature Combination Phase

29
Clustering Analysis

Clustering: the process of grouping physical or
abstract objects into classes of similar objects.
Clustering analysis: to construct a meaningful
partitioning of a large set of objects based on a
divide-and-conquer methodology.
Method
Statistical Analysis (Bayesian Classification
Method)
Probability Analysis

30
Clustering Based on Randomized Search

PAM (Partitioning Around Medoids)
CLARA (Clustering LARge Applications)
CLARANS (Clustering Large Applications Based
Upon RANdomized Search)

31
PAM (Partitioning Around Medoids) (1987)

PAM (Kaufman and Rousseeuw, 1987), built into S-PLUS
Uses real objects to represent the clusters
1. Select k representative objects arbitrarily
2. For each pair of non-selected object h and
selected object i, calculate the total swapping
cost TC_ih
3. For each pair of i and h,
if TC_ih < 0, i is replaced by h
Then assign each non-selected object to the most
similar representative object
4. Repeat steps 2-3 until there is no change
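
The steps above can be sketched as follows. This is a simplified, unoptimized variant that accepts the first improving swap it finds and recomputes the full cost each time, rather than the tuned procedure from the original publication; the function names, the sample points, and the Manhattan distance are assumptions for the example.

```python
def total_cost(points, medoids, dist):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    medoids = list(points[:k])          # step 1: arbitrary initial representatives
    current = total_cost(points, medoids, dist)
    improved = True
    while improved:                     # step 4: repeat until no swap helps
        improved = False
        for i in list(medoids):
            for h in points:
                if h in medoids or i not in medoids:
                    continue
                trial = [h if m == i else m for m in medoids]
                new_cost = total_cost(points, trial, dist)
                tc_ih = new_cost - current      # step 2: swapping cost TC_ih
                if tc_ih < 0:                   # step 3: replace i by h
                    medoids, current, improved = trial, new_cost, True
    # Assign every object to its most similar representative object.
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters = pam(points, k=2, dist=manhattan)
for medoid, members in clusters.items():
    print(medoid, members)
```

Because each accepted swap strictly lowers the total cost and there are finitely many medoid sets, the loop terminates.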

32
PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih
33
CLARA (Clustering Large Applications) (1990)

CLARA (Kaufman and Rousseeuw, 1990)
Built into statistical analysis packages, such as
S-PLUS
It draws multiple samples of the data set,
applies PAM on each sample, and gives the best
clustering as the output
Strength: deals with larger data sets than PAM
Weakness
Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the
whole data set if the sample is biased

34
Focusing Methods

CLARANS assumes that the objects to be
clustered are all stored in main memory
The most computationally expensive step of
CLARANS is calculating the total distances
between the two clusters
Reducing the number of objects considered
Only the most central object of each leaf node of
the R-tree is used to compute the medoids of
the clusters
Restricting the access
Focus on Relevant Clusters
Focus on a Cluster

35
BIRCH (Balanced Iterative Reducing and Clustering
using Hierarchies)

An incremental method whose memory requirements
can be adjusted to the size of the memory that is
available
Clustering Features
Summarize information about the subclusters of
points instead of storing all points
CF Trees
Branching factor B and threshold T
By changing the threshold value we can change the
size of the tree
Use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree

36
Clustering Feature Vector
CF = (5, (16,30), (54,190))
for the points (3,4), (2,6), (4,5), (4,7), (3,8)
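
The CF entry above can be recomputed from its five points: the count, the per-dimension linear sum, and the per-dimension sum of squares. The sketch below follows the slide's per-dimension form of the square sum; the function names are illustrative assumptions.

```python
def clustering_feature(points):
    """Return (N, LS, SS): count, per-dimension sum, per-dimension sum of squares."""
    n = len(points)
    ls = tuple(sum(coord) for coord in zip(*points))
    ss = tuple(sum(x * x for x in coord) for coord in zip(*points))
    return n, ls, ss

def merge(cf_a, cf_b):
    """CF vectors are additive, so two subclusters merge without revisiting points."""
    (na, lsa, ssa), (nb, lsb, ssb) = cf_a, cf_b
    return (na + nb,
            tuple(a + b for a, b in zip(lsa, lsb)),
            tuple(a + b for a, b in zip(ssa, ssb)))

def centroid(cf):
    """Centroid of the subcluster, derived from the summary alone."""
    n, ls, _ = cf
    return tuple(x / n for x in ls)

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)
print(centroid(cf))
```

Additivity is what lets a CF tree summarize subclusters instead of storing all points.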
37
CF Tree
[Figure: CF tree with B = 7 and L = 6. The root and
non-leaf nodes hold entries CF1, CF2, CF3, ..., each
pointing to a child node (child1, child2, child3,
...). Leaf nodes hold CF entries and are chained by
prev and next pointers.]
38
Data Generalization, Summarization, and
Characterization

Data Generalization: a process which abstracts a
large set of relevant data in a database from a
low concept level to relatively high ones
Approaches
Data Cube Approach
Attribute-oriented Induction Approach

39
Data Cube Approach

Multidimensional database, OLAP, ....
The general idea of the approach is to
materialize certain expensive computations that
are frequently requested,
such as count, sum, average, max, min, ...
Fast response time and flexible views of data
from different angles at different abstraction
levels
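
The idea of materializing aggregates can be illustrated with a tiny in-memory cube that precomputes sum and count for every combination of dimensions. The records, the dimension names, and the function `materialize_cube` are hypothetical; a real OLAP system would choose which cuboids to materialize rather than computing all of them.

```python
from collections import defaultdict
from itertools import combinations

def materialize_cube(rows, dimensions, measure):
    """Precompute [sum, count] of measure for every subset of dimensions."""
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            cuboid = defaultdict(lambda: [0, 0])
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key][0] += row[measure]   # running sum
                cuboid[key][1] += 1              # running count
            cube[dims] = dict(cuboid)
    return cube

# Hypothetical sales records (assumption, not from the slides).
rows = [
    {"region": "east", "year": 2001, "amount": 10},
    {"region": "east", "year": 2002, "amount": 20},
    {"region": "west", "year": 2001, "amount": 5},
]
cube = materialize_cube(rows, ("region", "year"), "amount")

total, count = cube[()][()]                     # most general view
print("overall average:", total / count)
print("by region:", cube[("region",)])          # one abstraction level down
print("by region and year:", cube[("region", "year")])
```

Once the cube is built, a query at any abstraction level is a dictionary lookup, which is where the fast response time comes from.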

40
Attribute-oriented Induction Approach

Essential Background Knowledge: Concept Hierarchy
Steps
Retrieval of the initial relation
Attribute removal
Concept-tree climbing
Vote propagation
Threshold control
Rule transformation
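
Concept-tree climbing, vote propagation, and threshold control can be sketched together. The concept hierarchy, the (major, status) tuples, and the function names below are hypothetical and chosen only to show the mechanics: each value is replaced by its parent concept, identical generalized tuples are merged, and their votes are added.

```python
from collections import Counter

# Hypothetical concept hierarchy (child -> parent); ANY is the most general concept.
hierarchy = {
    "physics": "science", "math": "science", "biology": "science",
    "history": "art", "music": "art",
    "science": "ANY", "art": "ANY",
}

def climb(relation, column, hierarchy):
    """Climb one level in column, merging identical tuples and summing votes."""
    merged = Counter()
    for row, vote in relation.items():
        row = list(row)
        row[column] = hierarchy.get(row[column], row[column])
        merged[tuple(row)] += vote
    return merged

def generalize(relation, column, hierarchy, threshold):
    """Threshold control: climb until column has at most threshold distinct values."""
    while len({row[column] for row in relation}) > threshold:
        climbed = climb(relation, column, hierarchy)
        if climbed == relation:       # top of the hierarchy reached
            break
        relation = climbed
    return relation

# Initial relation after attribute removal: (major, status), one vote per tuple.
initial = Counter([
    ("physics", "graduate"), ("math", "graduate"),
    ("history", "graduate"), ("music", "graduate"),
    ("biology", "undergraduate"),
])
generalized = generalize(initial, column=0, hierarchy=hierarchy, threshold=2)
for row, vote in sorted(generalized.items()):
    print(row, "vote =", vote)
```

The accumulated votes are what a later rule transformation step would turn into quantitative statements about each generalized tuple.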

41
Concept Hierarchy and Concept-Tree

[Slide text lost in extraction. Surviving terms: ANY, ALL, Birth place]

42
Example

[Slide text lost in extraction. Surviving term: graduate student]

43
Example

[Slide text lost in extraction. Surviving term: Concept Hierarchy Table]

44
Example

[Slide text lost in extraction. Surviving terms: Status, Graduate, Vote]

45
Example: Attribute Removal

[Slide text lost in extraction.]

46
Example: Concept-tree Climbing and Vote Propagation

[Slide text lost in extraction. Surviving terms: history, physics, math, science; tuples, vote]

47
Example: Concept-tree Climbing and Vote Propagation
48
Example: Threshold Control and Rule Transformation
