Title: Data Mining: Concepts, Techniques and Applications
1. Data Mining: Concepts, Techniques and Applications
- Motivation: why data mining?
- What is data mining?
- Data mining: on what kind of data?
- Data mining functionalities
- Are all the patterns interesting?
2. Motivation: Necessity is the Mother of Invention
- Data explosion problem
- Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
- Solution: data warehousing and data mining
- Data warehousing and on-line analytical processing
- Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
3. Why Data Mining? Potential Applications
- Database analysis and decision support
- Market analysis and management
- Risk analysis and management
- Prediction
- Web analysis
- Intelligent query answering
4. Data Mining Functionalities (1)
- Association
- Multi-dimensional vs. single-dimensional association
- age(X, "20..29") ∧ income(X, "20..29K") ⇒ buys(X, "PC") [support = 2%, confidence = 60%]
- contains(T, "computer") ⇒ contains(T, "software") [support = 1%, confidence = 75%]
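To make the support and confidence figures above concrete, here is a minimal Python sketch that computes both measures for a rule A ⇒ B; the transaction database and item names are invented for illustration:

```python
# Minimal sketch: support and confidence of an association rule A => B.
# The toy transactions below are hypothetical, not from the slides.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "monitor"},
    {"software", "mouse"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """confidence(A => B) = support(A union B) / support(A)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"computer", "software"}, transactions))       # 0.5
print(confidence({"computer"}, {"software"}, transactions))  # ~0.667
```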
5. Data Mining Functionalities (2)
- Classification and prediction
- Finding models (functions) that describe and distinguish classes or concepts for future prediction
- Presentation: decision tree, classification rules, neural network
- Prediction: predict some unknown or missing numerical values
- Cluster analysis
- Class label is unknown: group data to form new classes
- Clustering based on the principle: maximize the intra-class similarity and minimize the inter-class similarity
6. Are All the Discovered Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Interestingness measures: a pattern is interesting if it is easily understood by humans, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Examples: support, confidence
7. Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed on a variety of information repositories
- Data mining functionalities: association, classification, clustering, etc.
8. What is Cluster Analysis?
- Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
- Cluster analysis
- Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes
- Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
9. Quality: What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
- high intra-class similarity
- low inter-class similarity
10. Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
- There is a separate "quality" function that measures the goodness of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
11. Type of Data in Clustering Analysis
- Interval-scaled variables
- Binary variables
- Categorical, ordinal, and ratio variables
- Variables of mixed types
12. Interval-valued Variables
- Standardize data
- Calculate the mean absolute deviation
  s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
  where m_f = (1/n)(x_1f + x_2f + ... + x_nf)
- Calculate the standardized measurement (z-score)
  z_if = (x_if - m_f) / s_f
- Using mean absolute deviation is more robust than using standard deviation
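A minimal Python sketch of this standardization step, computing z-scores with the mean absolute deviation in place of the standard deviation (the example values are invented):

```python
# Minimal sketch: z-scores using the mean absolute deviation s_f,
# as defined on the slide above, for one variable f at a time.
def standardize(values):
    """Return z_if = (x_if - m_f) / s_f for each value of variable f."""
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in values]

print(standardize([2.0, 4.0, 4.0, 6.0]))  # [-2.0, 0.0, 0.0, 2.0]
```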
13. Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- A popular family is the Minkowski distance
  d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer
- If q = 1, d is the Manhattan distance
  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
14. Similarity and Dissimilarity Between Objects (Cont.)
- If q = 2, d is the Euclidean distance
  d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
- Properties
- d(i, j) >= 0
- d(i, i) = 0
- d(i, j) = d(j, i)
- d(i, j) <= d(i, k) + d(k, j)
- Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures
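A minimal Python sketch of the Minkowski distance just defined; q = 1 yields the Manhattan distance and q = 2 the Euclidean distance:

```python
# Minimal sketch of the Minkowski distance for p-dimensional points.
def minkowski(i, j, q):
    """d(i, j) = (sum_f |x_if - x_jf|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

p1, p2 = (1.0, 2.0), (4.0, 6.0)
print(minkowski(p1, p2, 1))  # Manhattan: 7.0
print(minkowski(p1, p2, 2))  # Euclidean: 5.0
```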
15. Binary Variables
- A contingency table for binary data (objects i and j):

                  object j
                  1        0        sum
   object i  1    q        r        q + r
             0    s        t        s + t
        sum       q + s    r + t    p

- Distance measure for symmetric binary variables:
  d(i, j) = (r + s) / (q + r + s + t)
- Distance measure for asymmetric binary variables:
  d(i, j) = (r + s) / (q + r + s)
- Jaccard coefficient (similarity measure for asymmetric binary variables):
  sim_Jaccard(i, j) = q / (q + r + s)
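A minimal Python sketch of these binary measures; objects are tuples of 0/1 values, and the two example vectors are invented for illustration:

```python
# Minimal sketch: contingency counts q, r, s, t for two binary objects,
# then the symmetric, asymmetric, and Jaccard measures defined above.
def binary_counts(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))  # 0-0 matches
    return q, r, s, t

def d_symmetric(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def d_asymmetric(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s)  # 0-0 matches carry no information

def jaccard(i, j):
    q, r, s, _ = binary_counts(i, j)
    return q / (q + r + s)

x, y = (1, 0, 1, 0, 0, 0), (1, 0, 1, 0, 1, 0)
print(d_symmetric(x, y), d_asymmetric(x, y), jaccard(x, y))  # 1/6 1/3 2/3
```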
16. Variables of Mixed Types
- A database may contain all six types of variables
- symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
- One may use a weighted formula to combine their effects:
  d(i, j) = (sum over f of delta_ij^(f) * d_ij^(f)) / (sum over f of delta_ij^(f))
- f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
- f is interval-based: use the normalized distance
- f is ordinal or ratio-scaled
- compute ranks r_if and z_if = (r_if - 1) / (M_f - 1)
- and treat z_if as interval-scaled
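A minimal Python sketch of the weighted combination above; it assumes each per-variable distance d_ij^(f) has already been normalized to [0, 1], and that the indicator delta_ij^(f) flags whether variable f is usable for this pair (both values present, and not an ignorable 0-0 match on an asymmetric binary):

```python
# Minimal sketch of d(i,j) = sum_f(delta_f * d_f) / sum_f(delta_f).
# d_ij[f] is the per-variable distance in [0, 1]; delta_ij[f] is 1
# when variable f contributes to this pair, else 0.
def mixed_distance(d_ij, delta_ij):
    den = sum(delta_ij)
    num = sum(dl * d for dl, d in zip(delta_ij, d_ij))
    return num / den if den else 0.0

# Three variables: a nominal mismatch (1.0), a normalized interval
# distance (0.25), and one variable skipped (delta = 0).
print(mixed_distance([1.0, 0.25, 0.9], [1, 1, 0]))  # 0.625
```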
17. Vector Objects
- Vector objects: keywords in documents, gene features in micro-arrays, etc.
- Broad applications: information retrieval, biologic taxonomy, etc.
- Cosine measure:
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
- A variant: the Tanimoto coefficient
  sim(d1, d2) = (d1 · d2) / (d1 · d1 + d2 · d2 - d1 · d2)
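A minimal Python sketch of both measures for dense vectors (e.g., keyword counts for two documents; the example vectors are invented):

```python
# Minimal sketch: cosine measure and Tanimoto coefficient.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto(a, b):
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)

d1, d2 = (1, 1, 0, 1), (1, 1, 1, 0)
print(round(cosine(d1, d2), 3))    # 0.667
print(round(tanimoto(d1, d2), 3))  # 0.5
```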
18. Major Clustering Approaches (I)
- Partitioning approach
- Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors (a minimal k-means sketch follows this slide)
- Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach
- Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach
- Based on connectivity and density functions
- Typical methods: DBSCAN, OPTICS, DenClue
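As promised above, a minimal k-means sketch illustrating the partitioning approach (pure Python, 2-D points; initialization by random sampling is a simplification of what production implementations do):

```python
# Minimal k-means sketch: assign each point to its nearest center,
# recompute each center as the mean of its cluster, repeat.
import math, random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):  # update step
            if cl:
                centers[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers, clusters

pts = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
print(kmeans(pts, k=2)[0])  # two centers, one per blob
```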
19. Major Clustering Approaches (II)
- Grid-based approach
- Based on a multiple-level granularity structure
- Typical methods: STING, WaveCluster, CLIQUE
- Model-based approach
- A model is hypothesized for each of the clusters; the method tries to find the best fit of the data to the given model
- Typical methods: EM, SOM, COBWEB
- Frequent pattern-based approach
- Based on the analysis of frequent patterns
- Typical methods: pCluster
- User-guided or constraint-based approach
- Clustering by considering user-specified or application-specific constraints
- Typical methods: COD (obstacles), constrained clustering
20. Typical Alternatives to Calculate the Distance Between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(d(tip, tjq)) over all tip in Ki and tjq in Kj
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(d(tip, tjq))
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(d(tip, tjq))
- Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = d(Ci, Cj)
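A minimal Python sketch of the four inter-cluster distances, using Euclidean distance between 2-D points (the example clusters are invented):

```python
# Minimal sketch: single link, complete link, average, and centroid
# distances between two clusters of 2-D points.
import math

def single_link(K1, K2):
    return min(math.dist(p, q) for p in K1 for q in K2)

def complete_link(K1, K2):
    return max(math.dist(p, q) for p in K1 for q in K2)

def average_link(K1, K2):
    return sum(math.dist(p, q) for p in K1 for q in K2) / (len(K1) * len(K2))

def centroid_dist(K1, K2):
    c1 = tuple(sum(x) / len(K1) for x in zip(*K1))
    c2 = tuple(sum(x) / len(K2) for x in zip(*K2))
    return math.dist(c1, c2)

A, B = [(0, 0), (1, 0)], [(4, 0), (6, 0)]
print(single_link(A, B), complete_link(A, B))    # 3.0 6.0
print(average_link(A, B), centroid_dist(A, B))   # 4.5 4.5
```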
21. Centroid, Radius and Diameter of a Cluster (for numerical data sets)
- Centroid: the "middle" of a cluster
  Cm = (sum over i of ti) / N
- Radius: square root of the average squared distance from any point of the cluster to its centroid
  Rm = sqrt((sum over i of (ti - Cm)^2) / N)
- Diameter: square root of the average squared distance between all pairs of points in the cluster
  Dm = sqrt((sum over all pairs i != j of (ti - tj)^2) / (N(N - 1)))
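A minimal Python sketch of the three quantities for a small numeric cluster (the points are invented):

```python
# Minimal sketch: centroid, radius, and diameter of a cluster of N
# numeric points, following the definitions above.
import math

def centroid(pts):
    return tuple(sum(x) / len(pts) for x in zip(*pts))

def radius(pts):
    c = centroid(pts)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in pts) / len(pts))

def diameter(pts):
    n = len(pts)
    total = sum(math.dist(p, q) ** 2 for p in pts for q in pts)
    return math.sqrt(total / (n * (n - 1)))

cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(cluster))            # (1.0, 1.0)
print(round(radius(cluster), 3))    # 1.414
print(round(diameter(cluster), 3))  # 2.309
```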
22. Hierarchical Clustering
- Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition
23. Dendrogram: Shows How the Clusters are Merged
Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
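As an illustration of cutting a dendrogram, here is a minimal sketch assuming SciPy and NumPy are available; `linkage` builds the agglomerative merge tree and `fcluster` cuts it into a requested number of clusters (the data points are invented):

```python
# Minimal sketch: build a dendrogram and cut it at the desired level.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]])
Z = linkage(X, method="single")                  # agglomerative merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)                                    # e.g., [1 1 1 2 2 2]
```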
24. Recent Hierarchical Clustering Methods
- Major weaknesses of agglomerative clustering methods
- do not scale well: time complexity of at least O(n^2), where n is the total number of objects
- can never undo what was done previously
- Integration of hierarchical with distance-based clustering
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
- ROCK (1999): clusters categorical data using neighbor and link analysis
- CHAMELEON (1999): hierarchical clustering using dynamic modeling
25. BIRCH (1996)
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD '96)
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
- Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data, and is sensitive to the order of the data records
26. Clustering Feature Vector in BIRCH
A clustering feature summarizes a sub-cluster as CF = (N, LS, SS), where N is the number of data points, LS is the linear sum of the N points, and SS is the square sum of the N points.
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))
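A minimal Python sketch computing the clustering feature for the five points above, reproducing CF = (5, (16,30), (54,190)):

```python
# Minimal sketch: CF = (N, LS, SS) with component-wise linear and
# square sums, as in the slide's example.
def clustering_feature(points):
    N = len(points)
    LS = tuple(sum(col) for col in zip(*points))                 # linear sum
    SS = tuple(sum(x * x for x in col) for col in zip(*points))  # square sum
    return N, LS, SS

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # (5, (16, 30), (54, 190))

# CFs are additive: merging two sub-clusters just adds their CFs
# component-wise, which is what lets BIRCH build the tree incrementally.
```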
27. CF-Tree in BIRCH
- Clustering feature
- Summary of the statistics for a given sub-cluster: the 0th, 1st, and 2nd moments of the sub-cluster from the statistical point of view
- Registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
- A non-leaf node in the tree has descendants or "children"
- The non-leaf nodes store sums of the CFs of their children
- A CF-tree has two parameters
- Branching factor: specifies the maximum number of children
- Threshold: maximum diameter of sub-clusters stored at the leaf nodes
28. The CF-Tree Structure
[Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold CF entries (CF1, CF2, ..., CF5), each with a pointer to a child node; leaf nodes hold up to six CF entries (CF1, ..., CF6) and are chained to their sibling leaves by prev/next pointers.]