Title: Data Mining: A Database Perspective
1 Data Mining: A Database Perspective
2 Reference
- Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Chapter 6.
- M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, 8(6): 866-883, 1996.
- J. Liu, Y. Pan, K. Wang, and J. Han, "Mining Frequent Item Sets by Opportunistic Projection," in Proc. of 2002 Int. Conf. on Knowledge Discovery in Databases (KDD'02), Edmonton, Canada, July 2002.
3 Outline
- Introduction
- Mining Association Rules
- Multilevel Data Generalization, Summarization, and Characterization
- Data Classification
- Clustering Analysis
- (Pattern-Based Similarity Search)
- (Mining Path Traversal Patterns)
- (Recommendation)
- (Web Mining)
- (Text Mining)
4 Introduction (1/5)
- Knowledge Discovery in Databases (KDD)
  - A process of nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
5 Introduction (2/5)
- Data Mining [remaining slide text unrecoverable: non-ASCII characters were lost in extraction]
6 Introduction (3/5): Data Mining, a KDD Process
- Data mining: the core of the knowledge discovery process.
- Figure (process flow): Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
7 Introduction (4/5): Challenges of Data Mining (1/2)
- Handling of Different Types of Data
- Efficiency and Scalability of Data Mining Algorithms
- Usefulness, Certainty, and Expressiveness of Data Mining Results
- Expression of Various Kinds of Data Mining Requests and Results
8 Introduction (5/5): Challenges of Data Mining (2/2)
- Interactive Mining of Knowledge at Multiple Abstraction Levels
- Mining Information from Different Sources of Data
- Protection of Privacy and Data Security
9 An Overview of Data Mining Techniques
- Classifying Data Mining Techniques
  - What kinds of databases to work on
    - Relational database, transaction database, spatial database, temporal database, ...
  - What kinds of knowledge to be mined
    - Association rules, classification, clustering, ...
  - What kinds of techniques to be utilized
    - Generalization-based mining, pattern-based mining, mining based on statistics or mathematical theories
10 Mining Different Kinds of Knowledge from Databases
- Association Rules
- Data generalization, summarization, and characterization
- Data classification
- Data clustering
- Pattern-based similarity search
- Path traversal patterns
- Recommendation
- Web Mining
- Text Mining
11 Mining Association Rules
- An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅ (I is the set of all items).
- The rule X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y.
- The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y.
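The two definitions above can be sketched in a few lines of Python. The transaction set `D` and the function names below are illustrative assumptions, not part of the original slides; support and confidence are returned as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    support_x = support(transactions, x)
    if support_x == 0:
        raise ValueError("X never occurs, so confidence is undefined")
    return support(transactions, set(x) | set(y)) / support_x


# Hypothetical transaction set D (illustrative data only).
D = [frozenset(t) for t in (["A", "B", "C"], ["A", "C"], ["A", "D"], ["B", "E", "F"])]

print(support(D, {"A", "C"}))        # support of A => C (and of C => A)
print(confidence(D, {"A"}, {"C"}))   # confidence of A => C
print(confidence(D, {"C"}, {"A"}))   # confidence of C => A
```

Note that support is symmetric in X and Y while confidence is not, which is why A ⇒ C and C ⇒ A can share a support value but differ in confidence.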
12 What Is Association Mining?
- Association rule mining
  - Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications
  - Cross-marketing and attached mailing. Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns.
- Examples
  - Rule form: Body ⇒ Head [support, confidence]
  - buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
13 Association Rule: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
  - Home Electronics ⇒ * (What other products should the store stock up on?)
14 Rule Measures: Support and Confidence
- Figure: Venn diagram of "Customer buys diaper" and "Customer buys beer", with the overlap labeled "Customer buys both"
- Find all the rules X ∧ Y ⇒ Z with minimum confidence and support
  - support, s: probability that a transaction contains X ∪ Y ∪ Z
  - confidence, c: conditional probability that a transaction having X ∪ Y also contains Z
- Let minimum support be 50% and minimum confidence be 50%; we have
  - A ⇒ C (50%, 66.6%)
  - C ⇒ A (50%, 100%)
15 Association Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multidimensional associations
  - age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
- Single-level vs. multiple-level analysis
  - What brands of beers are associated with what brands of diapers?
- Various extensions
  - Correlation, causality analysis
    - Association does not necessarily imply correlation or causality
  - Max-patterns and closed itemsets
  - Constraints enforced
    - E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
16 Mining Association Rules: An Example
- Min. support 50%, min. confidence 50%
- For rule A ⇒ C
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle
  - Any subset of a frequent itemset must be frequent
17 Mining Association Rules
- Steps for mining association rules
  - Discover all large (i.e., frequent) itemsets
  - Use the large itemsets to generate the association rules for the database
- Algorithm for identifying the large itemsets: Apriori
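The two steps above can be sketched as follows. This is a minimal, unoptimized level-wise search in the spirit of Apriori, not the authors' implementation; the transaction data, function names, and thresholds are assumptions chosen for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Step 1: return {itemset: support} for every large (frequent) itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_large(candidates):
        counted = {c: sum(1 for t in transactions if c <= t) / n for c in candidates}
        return {c: s for c, s in counted.items() if s >= min_support}

    level = keep_large({frozenset([i]) for t in transactions for i in t})
    large = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: combine two large (k-1)-itemsets into a candidate k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune (Apriori principle): every (k-1)-subset must itself be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = keep_large(candidates)
        large.update(level)
        k += 1
    return large


def generate_rules(large, min_confidence):
    """Step 2: derive rules body => head from the large itemsets."""
    rules = []
    for itemset, s in large.items():
        for r in range(1, len(itemset)):
            for body in map(frozenset, combinations(itemset, r)):
                conf = s / large[body]  # every subset of a large itemset is large
                if conf >= min_confidence:
                    rules.append((body, itemset - body, s, conf))
    return rules


# Hypothetical transactions (illustrative data only).
D = [["A", "B", "C"], ["A", "C"], ["A", "D"], ["B", "E", "F"]]
large_itemsets = apriori(D, min_support=0.5)
found_rules = generate_rules(large_itemsets, min_confidence=0.5)
for body, head, s, c in found_rules:
    print(sorted(body), "=>", sorted(head), f"support={s:.0%} confidence={c:.1%}")
```

The prune step is where the Apriori principle pays off: a candidate is never counted against the database if any of its subsets has already been found infrequent.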
18 Mining Generalized and Multi-level Association Rules
- Interesting associations among data items often
occur at a relatively high concept level
19 Interestingness of Discovered Association Rules
- Example 1 (Aggarwal & Yu, PODS'98)
  - Among 5000 students
    - 3000 play basketball
    - 3750 eat cereal
    - 2000 both play basketball and eat cereal
  - play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  - play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence
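The arithmetic behind this example can be checked with a short script. It compares each rule's confidence with the overall frequency of the rule's head; the helper name `rule_stats` is an assumption made for illustration.

```python
def rule_stats(n_body, n_body_and_head, n_head, n_total):
    """Return (support, confidence, overall frequency of the head)."""
    support = n_body_and_head / n_total
    confidence = n_body_and_head / n_body
    head_frequency = n_head / n_total
    return support, confidence, head_frequency


TOTAL, BASKETBALL, CEREAL, BOTH = 5000, 3000, 3750, 2000

eats = rule_stats(BASKETBALL, BOTH, CEREAL, TOTAL)
not_eats = rule_stats(BASKETBALL, BASKETBALL - BOTH, TOTAL - CEREAL, TOTAL)

for name, (s, c, f) in (("basketball => cereal", eats),
                        ("basketball => not cereal", not_eats)):
    print(f"{name}: support={s:.1%} confidence={c:.1%} head frequency={f:.1%}")
```

For the first rule the confidence falls below the head frequency (playing basketball makes eating cereal less likely than average), while for the second rule the confidence is above it, which is the sense in which the second rule is more informative.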
20 Interestingness of Discovered Association Rules
- An association rule A ⇒ B is interesting if its confidence exceeds a certain measure, or formally, P(A ∩ B) / P(A) > d, where d is a suitable constant.
21 Improving the Efficiency of Mining Association Rules
- Database Scan Reduction
- FP-tree, ...
- Sampling
- Incremental Updating of Discovered Association Rules
- Parallel Data Mining
22 Classification
- A process of learning a function that maps a data item into one of several predefined classes.
- Every classification based on inductive-learning algorithms is given as input a set of samples that consist of vectors of attribute values and a corresponding class.
- Predicts categorical class labels
- Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model in classifying new data
23 Classification Process (1): Model Construction
- Classification algorithms produce a classifier (model) such as:
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
24 Classification Process (2): Use the Model in Prediction
- Unseen data: (Jeff, Professor, 4)
- Tenured?
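As a small illustration of the second phase, the rule from slide 23 can be written as a function and applied to the unseen tuple. The function name and the case-insensitive comparison are assumptions; the rule itself is the one shown on the slide.

```python
def tenured(rank: str, years: int) -> str:
    """Apply the learned rule: rank = 'professor' OR years > 6 => tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Unseen data from slide 24: (Jeff, Professor, 4)
name, rank, years = ("Jeff", "Professor", 4)
print(name, "tenured?", tenured(rank, years))
```

Jeff satisfies the first condition of the rule (rank is professor), so the model predicts tenure even though his 4 years do not exceed 6.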
25 Data Classification
- Decision-tree-based Classification Method
- Decision Tree Learning System, ID3
- Evaluation Functions
- Information Gain
- Gini Index
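The two evaluation functions named above can be sketched as follows, using their standard textbook definitions. The toy rows and attribute names are hypothetical and only serve to show how a candidate split attribute is scored.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify a sample."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_after_split(rows, attr, target, impurity):
    """Weighted impurity of the partitions induced by splitting on `attr`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    return sum(len(g) / len(rows) * impurity(g) for g in groups.values())


def information_gain(rows, attr, target):
    """Reduction in entropy obtained by splitting on `attr`."""
    before = entropy([row[target] for row in rows])
    return before - impurity_after_split(rows, attr, target, entropy)


# Hypothetical training rows (illustrative only).
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]
for attr in ("student", "credit"):
    print(attr,
          "gain:", information_gain(rows, attr, "buys"),
          "gini after split:", impurity_after_split(rows, attr, "buys", gini))
```

A decision-tree learner such as ID3 would pick the attribute with the highest information gain (or, with the Gini index, the lowest impurity after the split) at each node.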
26 Training Dataset
This follows an example from Quinlan's ID3.
27 Output: A Decision Tree for buys_computer
- age?
  - < 30: student?
    - no: no
    - yes: yes
  - 30..40: yes
  - > 40: credit rating?
    - excellent: no
    - fair: yes
28 Performance Improvement
- Database Indices
- Attribute-oriented Induction
- Two-phase Multiattribute Extraction
- Inference Power
- Feature Extraction Phase
- Feature Combination Phase
29 Clustering Analysis
- Clustering: the process of grouping physical or abstract objects into classes of similar objects.
- Clustering analysis: constructing a meaningful partitioning of a large set of objects based on a divide-and-conquer methodology.
- Methods
  - Statistical Analysis (Bayesian Classification Method)
  - Probability Analysis
30 Clustering Based on Randomized Search
- PAM (Partitioning Around Medoids)
- CLARA (Clustering LARge Applications)
- CLARANS (Clustering Large Applications based upon RANdomized Search)
31 PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987), built into S-PLUS
- Uses real objects to represent the clusters
  - (1) Select k representative objects arbitrarily
  - (2) For each pair of non-selected object h and selected object i, calculate the total swapping cost $TC_{ih}$
  - (3) For each pair of i and h,
    - If $TC_{ih} < 0$, i is replaced by h
    - Then assign each non-selected object to the most similar representative object
  - (4) Repeat steps 2-3 until there is no change
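32 PAM Clustering: Total swapping cost $TC_{ih} = \sum_j C_{jih}$, where $C_{jih}$ is the change in cost for object j when selected object i is swapped with non-selected object h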
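A minimal sketch of the swap loop, assuming Euclidean distance and taking the first k points as the arbitrary initial medoids. Summing the per-object cost changes over all objects j is the same as taking the difference between the total cost after and before the swap, which is how `swap_cost` computes the total swapping cost here. This variant applies the single best swap per iteration; function names and data are illustrative.

```python
from math import dist


def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def swap_cost(points, medoids, i, h):
    """TC_ih: change in total cost if medoid i is replaced by non-medoid h."""
    swapped = [h if m == i else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)


def pam(points, k):
    medoids = list(range(k))  # step 1: arbitrary choice (here: the first k points)
    while True:
        # Step 2: evaluate TC_ih for every (selected i, non-selected h) pair.
        best = min(((swap_cost(points, medoids, i, h), i, h)
                    for i in medoids
                    for h in range(len(points)) if h not in medoids),
                   default=(0.0, None, None))
        if best[0] >= 0:      # step 4: stop when no swap lowers the cost
            break
        _, i, h = best        # step 3: replace i by h when TC_ih < 0
        medoids[medoids.index(i)] = h
    # Assign each object to its most similar (nearest) representative object.
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = pam(points, k=2)
print("medoid indices:", medoids)
print("labels:", labels)
```

Because every accepted swap strictly lowers the total cost and there are finitely many medoid sets, the loop terminates, although only at a local optimum in general.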
33 CLARA (Clustering Large Applications) (1990)
- CLARA (Kaufman and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as S-PLUS
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weakness
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
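The sampling idea can be sketched as below. To keep the example short and self-contained, the per-sample step uses an exhaustive medoid search as a stand-in for PAM (an assumption that is only feasible for tiny samples); the sample count, sample size, seed, and data are likewise illustrative.

```python
import random
from itertools import combinations
from math import dist


def cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def medoids_for_sample(sample, k):
    # Stand-in for PAM: try every k-subset of the sample (tiny samples only).
    return min(combinations(sample, k), key=lambda ms: cost(sample, ms))


def clara(points, k, n_samples=5, sample_size=4, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = medoids_for_sample(sample, k)
        c = cost(points, medoids)  # judge each candidate on the WHOLE data set
        if c < best_cost:
            best, best_cost = list(medoids), c
    return best, best_cost


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, c = clara(points, k=2)
print("medoids:", medoids, "cost on full data:", round(c, 3))
```

The weakness noted above is visible in the code: the returned medoids can only be points that happened to be drawn in some sample, so a biased sample limits the quality of the result.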
34 Focusing Methods
- Focusing Methods
  - CLARANS assumes that all the objects to be clustered are stored in main memory
  - The most computationally expensive step of CLARANS is calculating the total distances between the two clusters
- Reducing the number of objects considered
  - Only the most central object of each leaf node of the R-tree is used to compute the medoids of the clusters
- Restricting the access
  - Focus on Relevant Clusters
  - Focus on a Cluster
35 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- An incremental method whose memory requirements can be adjusted to the size of the memory that is available
- Clustering Features
  - Summarize information about the subclusters of points instead of storing all points
- CF Trees
  - Branching factor B and threshold T
  - By changing the threshold value we can change the size of the tree
- Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
36 Clustering Feature Vector
- CF = (5, (16,30), (54,190))
- Points: (3,4), (2,6), (4,5), (4,7), (3,8)
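The numbers on this slide are consistent with reading the clustering feature as a triple (N, LS, SS): the number of points, their per-dimension linear sum, and their per-dimension sum of squares. The sketch below recomputes the triple from the five points under that reading; the function names are illustrative.

```python
def clustering_feature(points):
    """Return (N, LS, SS): count, per-dimension sum, per-dimension sum of squares."""
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return len(points), ls, ss


def merge(cf_a, cf_b):
    """CFs are additive: merging two subclusters just adds their components."""
    (na, lsa, ssa), (nb, lsb, ssb) = cf_a, cf_b
    return (na + nb,
            tuple(a + b for a, b in zip(lsa, lsb)),
            tuple(a + b for a, b in zip(ssa, ssb)))


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)  # (5, (16, 30), (54, 190))
```

Additivity is what lets a CF tree summarize subclusters without storing the points: a parent entry is simply the merge of its children's entries.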
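37 CF Tree
- Figure: a CF tree with B = 7 and L = 6. The root and each non-leaf node hold CF entries (CF1, CF2, CF3, ...), each with a pointer to a child node (child1, child2, child3, ...); each leaf node holds CF entries and is linked to its neighboring leaf nodes through prev and next pointers.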
38 Data Generalization, Summarization, and Characterization
- Data generalization: a process which abstracts a large set of relevant data in a database from a low concept level to relatively high ones
- Approaches
  - Data Cube Approach
  - Attribute-oriented Induction Approach
39 Data Cube Approach
- Multidimensional database, OLAP, ...
- The general idea of the approach is to materialize certain expensive computations that are frequently requested
  - Such as count, sum, average, max, min, ...
- Fast response time and flexible views of data from different angles at different abstraction levels
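The idea of materializing aggregates ahead of time can be sketched with plain dictionaries. The code precomputes count and sum for every combination of grouping dimensions, so later queries become lookups; the rows, dimension names, and function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def materialize_cube(rows, dims, measure):
    """Precompute (count, sum) of `measure` for every subset of `dims`."""
    cube = {}
    for r in range(len(dims) + 1):
        for group_by in combinations(dims, r):
            cells = defaultdict(lambda: [0, 0])
            for row in rows:
                key = tuple(row[d] for d in group_by)
                cells[key][0] += 1
                cells[key][1] += row[measure]
            cube[group_by] = {k: tuple(v) for k, v in cells.items()}
    return cube


# Hypothetical sales records (illustrative only).
rows = [
    {"region": "east", "year": 2001, "amount": 10},
    {"region": "east", "year": 2002, "amount": 20},
    {"region": "west", "year": 2001, "amount": 5},
]
cube = materialize_cube(rows, dims=("region", "year"), measure="amount")

count, total = cube[("region",)][("east",)]
print("east: count =", count, "sum =", total, "average =", total / count)
print("grand total:", cube[()][()])
```

Storing count and sum is enough to answer average queries as well; max and min could be kept per cell in the same way.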
40 Attribute-oriented Induction Approach
- Essential background knowledge: concept hierarchy
- Steps
  - Retrieval of the initial relation
  - Attribute removal
  - Concept-tree climbing
  - Vote propagation
  - Threshold control
  - Rule transformation
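A rough sketch of three of these steps (attribute removal, concept-tree climbing, and vote propagation) on a toy relation. The relation, the concept hierarchy, and the function names are all hypothetical; each tuple starts with a vote of 1, and identical generalized tuples are merged by adding their votes.

```python
from collections import Counter


def remove_attribute(relation, index):
    """Drop an attribute that cannot be generalized, merging identical tuples."""
    merged = Counter()
    for tup, vote in relation.items():
        merged[tup[:index] + tup[index + 1:]] += vote
    return dict(merged)


def climb(relation, index, parents):
    """Replace each value by its parent concept and propagate (sum) the votes."""
    merged = Counter()
    for tup, vote in relation.items():
        value = parents.get(tup[index], "ANY")  # unknown values go to the root
        merged[tup[:index] + (value,) + tup[index + 1:]] += vote
    return dict(merged)


# Hypothetical initial relation (name, major, status), one vote per tuple.
initial = {
    ("Ann", "history", "graduate"): 1,
    ("Bob", "physics", "graduate"): 1,
    ("Cho", "math", "graduate"): 1,
    ("Dee", "math", "undergraduate"): 1,
}
# Hypothetical concept hierarchy for the major attribute.
MAJOR_PARENTS = {"history": "art", "physics": "science", "math": "science"}

without_name = remove_attribute(initial, 0)           # names have no higher concept
generalized = climb(without_name, 0, MAJOR_PARENTS)   # major -> art / science
for tup, vote in generalized.items():
    print(tup, "vote =", vote)
```

Threshold control would repeat the climbing step until the number of distinct tuples drops to a chosen limit, after which the surviving tuples and their votes can be turned into rules.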
41 Concept Hierarchy and Concept-Tree
- [Slide text unrecoverable; surviving terms: ANY, ALL, Birth place]
42 Example
- [Slide text unrecoverable; surviving term: graduate student]
43 Example
- [Slide text unrecoverable; surviving term: Concept Hierarchy Table]
44 Example
- [Slide text unrecoverable; surviving terms: Status, Graduate, Vote]
45 Example: Attribute Removal
46 Example: Concept-Tree Climbing and Vote Propagation
- [Slide text unrecoverable; surviving terms: history, physics, math, ..., science]
- [Slide text unrecoverable; surviving terms: tuples, vote]
47 Example: Concept-Tree Climbing and Vote Propagation
48 Example: Threshold Control and Rule Transformation