Title: Data Mining: Concepts, Techniques and Applications
1. Data Mining: Concepts, Techniques and Applications
- Motivation: why data mining?
- What is data mining?
- Data mining: on what kind of data?
- Data mining functionalities
- Are all the patterns interesting?
2. Motivation: Necessity is the Mother of Invention
- Data explosion problem
- Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
- Solution: data warehousing and data mining
- Data warehousing and on-line analytical processing
- Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
3. Why Data Mining? Potential Applications
- Database analysis and decision support
- Market analysis and management
- Risk analysis and management
- Prediction
- Web analysis
- Intelligent query answering
4. Data Mining Functionalities (1)
- Association
- Multi-dimensional vs. single-dimensional association
- age(X, "20..29") ∧ income(X, "20..29K") ⇒ buys(X, "PC") [support = 2%, confidence = 60%]
- contains(T, "computer") ⇒ contains(T, "software") [support = 1%, confidence = 75%]
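To make the support and confidence figures above concrete, here is a minimal Python sketch that computes both measures for a rule A ⇒ B; the transaction database and item names are invented for illustration:

```python
# Minimal sketch: support and confidence of an association rule A => B.
# The toy transactions below are hypothetical, not from the slides.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "monitor"},
    {"software", "mouse"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """confidence(A => B) = support(A union B) / support(A)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"computer", "software"}, transactions))       # 0.5
print(confidence({"computer"}, {"software"}, transactions))  # ~0.667
```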
5. Data Mining Functionalities (2)
- Classification and prediction
- Finding models (functions) that describe and distinguish classes or concepts for future prediction
- Presentation: decision tree, classification rules, neural network
- Prediction: predict some unknown or missing numerical values
- Cluster analysis
- Class label is unknown: group data to form new classes
- Clustering based on the principle: maximize the intra-class similarity and minimize the inter-class similarity
6. Are All the Discovered Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Interestingness measures: a pattern is interesting if it is easily understood by humans, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Examples: support, confidence
7. Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed on a variety of information repositories
- Data mining functionalities: association, classification, clustering, etc.
8. What is Cluster Analysis?
- Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
- Cluster analysis
- Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes
- Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
9. Quality: What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
- high intra-class similarity
- low inter-class similarity
10. Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
- There is a separate "quality" function that measures the goodness of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
11. Type of Data in Clustering Analysis
- Interval-scaled variables
- Binary variables
- Categorical, ordinal, and ratio variables
- Variables of mixed types
12. Interval-valued Variables
- Standardize data
- Calculate the mean absolute deviation
  s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
  where m_f = (1/n)(x_1f + x_2f + ... + x_nf)
- Calculate the standardized measurement (z-score)
  z_if = (x_if - m_f) / s_f
- Using mean absolute deviation is more robust than using standard deviation
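A minimal Python sketch of this standardization step, computing z-scores with the mean absolute deviation in place of the standard deviation (the example values are invented):

```python
# Minimal sketch: z-scores using the mean absolute deviation s_f,
# as defined on the slide above, for one variable f at a time.
def standardize(values):
    """Return z_if = (x_if - m_f) / s_f for each value of variable f."""
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in values]

print(standardize([2.0, 4.0, 4.0, 6.0]))  # [-2.0, 0.0, 0.0, 2.0]
```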
13. Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- A popular family is the Minkowski distance
  d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer
- If q = 1, d is the Manhattan distance
  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
14. Similarity and Dissimilarity Between Objects (Cont.)
- If q = 2, d is the Euclidean distance
  d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
- Properties
- d(i, j) >= 0
- d(i, i) = 0
- d(i, j) = d(j, i)
- d(i, j) <= d(i, k) + d(k, j)
- Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures
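A minimal Python sketch of the Minkowski distance just defined; q = 1 yields the Manhattan distance and q = 2 the Euclidean distance:

```python
# Minimal sketch of the Minkowski distance for p-dimensional points.
def minkowski(i, j, q):
    """d(i, j) = (sum_f |x_if - x_jf|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

p1, p2 = (1.0, 2.0), (4.0, 6.0)
print(minkowski(p1, p2, 1))  # Manhattan: 7.0
print(minkowski(p1, p2, 2))  # Euclidean: 5.0
```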
15. Binary Variables
- A contingency table for binary data (objects i and j):

                  object j
                  1        0        sum
   object i  1    q        r        q + r
             0    s        t        s + t
        sum       q + s    r + t    p

- Distance measure for symmetric binary variables:
  d(i, j) = (r + s) / (q + r + s + t)
- Distance measure for asymmetric binary variables:
  d(i, j) = (r + s) / (q + r + s)
- Jaccard coefficient (similarity measure for asymmetric binary variables):
  sim_Jaccard(i, j) = q / (q + r + s)
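A minimal Python sketch of these binary measures; objects are tuples of 0/1 values, and the two example vectors are invented for illustration:

```python
# Minimal sketch: contingency counts q, r, s, t for two binary objects,
# then the symmetric, asymmetric, and Jaccard measures defined above.
def binary_counts(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))  # 0-0 matches
    return q, r, s, t

def d_symmetric(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def d_asymmetric(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s)  # 0-0 matches carry no information

def jaccard(i, j):
    q, r, s, _ = binary_counts(i, j)
    return q / (q + r + s)

x, y = (1, 0, 1, 0, 0, 0), (1, 0, 1, 0, 1, 0)
print(d_symmetric(x, y), d_asymmetric(x, y), jaccard(x, y))  # 1/6 1/3 2/3
```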
16. Variables of Mixed Types
- A database may contain all six types of variables
- symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
- One may use a weighted formula to combine their effects:
  d(i, j) = (sum over f of delta_ij^(f) * d_ij^(f)) / (sum over f of delta_ij^(f))
- f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
- f is interval-based: use the normalized distance
- f is ordinal or ratio-scaled
- compute ranks r_if and z_if = (r_if - 1) / (M_f - 1)
- and treat z_if as interval-scaled
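A minimal Python sketch of the weighted combination above; it assumes each per-variable distance d_ij^(f) has already been normalized to [0, 1], and that the indicator delta_ij^(f) flags whether variable f is usable for this pair (both values present, and not an ignorable 0-0 match on an asymmetric binary):

```python
# Minimal sketch of d(i,j) = sum_f(delta_f * d_f) / sum_f(delta_f).
# d_ij[f] is the per-variable distance in [0, 1]; delta_ij[f] is 1
# when variable f contributes to this pair, else 0.
def mixed_distance(d_ij, delta_ij):
    den = sum(delta_ij)
    num = sum(dl * d for dl, d in zip(delta_ij, d_ij))
    return num / den if den else 0.0

# Three variables: a nominal mismatch (1.0), a normalized interval
# distance (0.25), and one variable skipped (delta = 0).
print(mixed_distance([1.0, 0.25, 0.9], [1, 1, 0]))  # 0.625
```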
17. Vector Objects
- Vector objects: keywords in documents, gene features in micro-arrays, etc.
- Broad applications: information retrieval, biologic taxonomy, etc.
- Cosine measure:
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
- A variant: the Tanimoto coefficient
  sim(d1, d2) = (d1 · d2) / (d1 · d1 + d2 · d2 - d1 · d2)
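A minimal Python sketch of both measures for dense vectors (e.g., keyword counts for two documents; the example vectors are invented):

```python
# Minimal sketch: cosine measure and Tanimoto coefficient.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto(a, b):
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)

d1, d2 = (1, 1, 0, 1), (1, 1, 1, 0)
print(round(cosine(d1, d2), 3))    # 0.667
print(round(tanimoto(d1, d2), 3))  # 0.5
```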
18. Major Clustering Approaches (I)
- Partitioning approach
- Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors (a minimal k-means sketch follows this slide)
- Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach
- Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach
- Based on connectivity and density functions
- Typical methods: DBSCAN, OPTICS, DenClue
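As promised above, a minimal k-means sketch illustrating the partitioning approach (pure Python, 2-D points; initialization by random sampling is a simplification of what production implementations do):

```python
# Minimal k-means sketch: assign each point to its nearest center,
# recompute each center as the mean of its cluster, repeat.
import math, random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):  # update step
            if cl:
                centers[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers, clusters

pts = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
print(kmeans(pts, k=2)[0])  # two centers, one per blob
```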
19. Major Clustering Approaches (II)
- Grid-based approach
- Based on a multiple-level granularity structure
- Typical methods: STING, WaveCluster, CLIQUE
- Model-based approach
- A model is hypothesized for each of the clusters; the method tries to find the best fit of the data to the given model
- Typical methods: EM, SOM, COBWEB
- Frequent pattern-based approach
- Based on the analysis of frequent patterns
- Typical methods: pCluster
- User-guided or constraint-based approach
- Clustering by considering user-specified or application-specific constraints
- Typical methods: COD (obstacles), constrained clustering
20. Typical Alternatives to Calculate the Distance Between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(d(tip, tjq)) over all tip in Ki and tjq in Kj
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(d(tip, tjq))
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(d(tip, tjq))
- Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = d(Ci, Cj)
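A minimal Python sketch of the four inter-cluster distances, using Euclidean distance between 2-D points (the example clusters are invented):

```python
# Minimal sketch: single link, complete link, average, and centroid
# distances between two clusters of 2-D points.
import math

def single_link(K1, K2):
    return min(math.dist(p, q) for p in K1 for q in K2)

def complete_link(K1, K2):
    return max(math.dist(p, q) for p in K1 for q in K2)

def average_link(K1, K2):
    return sum(math.dist(p, q) for p in K1 for q in K2) / (len(K1) * len(K2))

def centroid_dist(K1, K2):
    c1 = tuple(sum(x) / len(K1) for x in zip(*K1))
    c2 = tuple(sum(x) / len(K2) for x in zip(*K2))
    return math.dist(c1, c2)

A, B = [(0, 0), (1, 0)], [(4, 0), (6, 0)]
print(single_link(A, B), complete_link(A, B))    # 3.0 6.0
print(average_link(A, B), centroid_dist(A, B))   # 4.5 4.5
```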
21. Centroid, Radius and Diameter of a Cluster (for numerical data sets)
- Centroid: the "middle" of a cluster
  Cm = (sum over i of ti) / N
- Radius: square root of the average squared distance from any point of the cluster to its centroid
  Rm = sqrt((sum over i of (ti - Cm)^2) / N)
- Diameter: square root of the average squared distance between all pairs of points in the cluster
  Dm = sqrt((sum over all pairs i != j of (ti - tj)^2) / (N(N - 1)))
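A minimal Python sketch of the three quantities for a small numeric cluster (the points are invented):

```python
# Minimal sketch: centroid, radius, and diameter of a cluster of N
# numeric points, following the definitions above.
import math

def centroid(pts):
    return tuple(sum(x) / len(pts) for x in zip(*pts))

def radius(pts):
    c = centroid(pts)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in pts) / len(pts))

def diameter(pts):
    n = len(pts)
    total = sum(math.dist(p, q) ** 2 for p in pts for q in pts)
    return math.sqrt(total / (n * (n - 1)))

cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(cluster))            # (1.0, 1.0)
print(round(radius(cluster), 3))    # 1.414
print(round(diameter(cluster), 3))  # 2.309
```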
22. Hierarchical Clustering
- Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition
23. Dendrogram: Shows How the Clusters are Merged
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
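As an illustration of cutting a dendrogram, here is a minimal sketch assuming SciPy and NumPy are available; `linkage` builds the agglomerative merge tree and `fcluster` cuts it into a requested number of clusters (the data points are invented):

```python
# Minimal sketch: build a dendrogram and cut it at the desired level.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]])
Z = linkage(X, method="single")                  # agglomerative merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)                                    # e.g., [1 1 1 2 2 2]
```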
24. Recent Hierarchical Clustering Methods
- Major weaknesses of agglomerative clustering methods
- do not scale well: time complexity of at least O(n^2), where n is the total number of objects
- can never undo what was done previously
- Integration of hierarchical with distance-based clustering
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
- ROCK (1999): clusters categorical data using neighbor and link analysis
- CHAMELEON (1999): hierarchical clustering using dynamic modeling
25. BIRCH (1996)
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD '96)
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
- Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data, and is sensitive to the order of the data records
26. Clustering Feature Vector in BIRCH
A clustering feature summarizes a sub-cluster as CF = (N, LS, SS), where N is the number of data points, LS is the linear sum of the N points, and SS is the square sum of the N points.
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))
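A minimal Python sketch computing the clustering feature for the five points above, reproducing CF = (5, (16,30), (54,190)):

```python
# Minimal sketch: CF = (N, LS, SS) with component-wise linear and
# square sums, as in the slide's example.
def clustering_feature(points):
    N = len(points)
    LS = tuple(sum(col) for col in zip(*points))                 # linear sum
    SS = tuple(sum(x * x for x in col) for col in zip(*points))  # square sum
    return N, LS, SS

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # (5, (16, 30), (54, 190))

# CFs are additive: merging two sub-clusters just adds their CFs
# component-wise, which is what lets BIRCH build the tree incrementally.
```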
27. CF-Tree in BIRCH
- Clustering feature
- Summary of the statistics for a given sub-cluster: the 0th, 1st, and 2nd moments of the sub-cluster from the statistical point of view
- Registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
- A non-leaf node in the tree has descendants or "children"
- The non-leaf nodes store sums of the CFs of their children
- A CF-tree has two parameters
- Branching factor: specifies the maximum number of children
- Threshold: maximum diameter of sub-clusters stored at the leaf nodes
28. The CF-Tree Structure
[Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold CF entries (CF1, CF2, ..., CF5), each with a pointer to a child node; leaf nodes hold up to six CF entries (CF1, ..., CF6) and are chained to their sibling leaves by prev/next pointers.]