Title: Region Discovery Using Hierarchical Supervised Clustering
Outline
- Goals of the thesis
- Introduction to Spatial Data Mining and Region Discovery
- Introduction to Supervised Clustering
- A Measure of Interestingness: the Reward-based Fitness Function
- Agglomerative Hierarchical Technique (SCAH)
- Supervised Clustering using Multi-Resolution Grids (SCMRG)
- Experimental Evaluation
- Summary and Future Work
1. Goals of this Thesis
- Design supervised clustering algorithms for region discovery
- Evaluate a reward-based fitness function for region discovery
- Analyze the performance of the developed algorithms on a benchmark of spatial datasets
2. Introduction to Spatial Data Mining
- Spatial Data Mining is the process of discovering interesting, unexpected, and previously unknown, but potentially useful patterns from large spatial datasets.
- Typical tasks:
  - Revelation of interesting groups
  - Co-location discovery
  - Location prediction
  - Spatial interaction
  - Hot spot discovery
Spatial Dataset Formats
- The raster data model represents map features as cells in a grid matrix.
  - The matrix is organized into continuous, evenly spaced rows and columns.
  - Each cell is coded with an attribute that represents a data parameter appearing within the cell.
- The vector data model is used to represent discrete features that are defined as points, lines, and polygons in a geographic information system (GIS).
  - Vector data represent features as pairs of x, y coordinates.
  - A point is defined as a single x, y coordinate pair; lines and polygons are defined by a series of coordinate pairs.
Differences between Traditional and Spatial Datasets
- Spatial data usually come in a continuous space, whereas traditional datasets are often discrete.
- Spatial data mining usually focuses on local patterns, whereas traditional data mining techniques often focus on global patterns.
- Data samples are assumed to be independently generated in traditional statistical analysis; spatial data, in contrast, tend to be highly auto-correlated.
Region Discovery
- Discovering interesting regions in spatial datasets
- In particular, identifying disjoint, contiguous regions that are unusual with respect to the distribution of a given class
  - e.g., a region that contains an unusually low or high number of instances of a particular class
- In a reasonable time
Region Discovery and Related Work
- [MAH98]: Characterization and Trend Detection in Spatial Databases
  - Starts with a small set of target objects (see Figure, left).
  - Expands regions around the target objects, simultaneously selecting those attributes of the regions for which the distribution of values differs significantly from the distribution in the whole database (Figure, right).
3. Supervised Clustering
Categories of Clustering Algorithms
- Hierarchical clustering methods
  - either start with each pattern in its own cluster and successively merge clusters (agglomerative), or start with all patterns in a single cluster and successively split it (divisive), until a stopping criterion is met.
- Partitional / representative-based clustering methods
  - start with an initial partition of the data into k clusters and iteratively reassign each data object to one of the clusters, forming new clusters until a stopping criterion is met.
- Density-based clustering methods
  - try to find clusters based on the density of data points in a region; dense regions of objects in the data space form a cluster.
- Grid-based clustering algorithms
  - quantize the clustering space into a finite number of cells and then perform the required operations on the quantized space. Cells that contain more than a certain number of points are treated as dense; the dense cells are then merged or split to form the clusters.
4. Measuring the Interestingness of Regions Using a Reward-based Fitness Function
- Evaluates a clustering based on the density of a class of interest C, assigning rewards to regions in which the distribution of class C significantly deviates from the prior probability of class C in the whole dataset.
- The quality of a clustering is the sum of the rewards associated with its clusters.
- The reward of a cluster c is computed as: Reward(c) = interestingness(c) * size(c)^β
- Idea: a cluster's reward increases non-linearly with cluster size.
- Consequence: merging two clusters with equal per-object reward leads to a better clustering, since a^β + b^β < (a + b)^β for β > 1, with a and b being the sizes of the two clusters.
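The reward computation above can be sketched in Python. The quality function follows the slide's definition (sum over clusters of interestingness times size^β); the piecewise form of the interestingness term, and the names gamma1/gamma2 for its thresholds, are illustrative assumptions rather than the thesis's exact definition.

```python
def interestingness(p, prior, gamma1=0.5, gamma2=1.5, r_plus=1.0, r_minus=1.0):
    """Reward deviation of the class-of-interest density p in a region
    from the prior probability of that class in the whole dataset.
    Assumed piecewise-linear form; the thesis's exact definition may differ."""
    low, high = gamma1 * prior, gamma2 * prior
    if p > high:                       # unusually high density of class C
        return r_plus * (p - high) / (1.0 - high)
    if p < low:                        # unusually low density of class C
        return r_minus * (low - p) / low
    return 0.0                         # close to the prior: no reward

def quality(clusters, labels, class_of_interest, prior, beta=1.1):
    """Quality of a clustering: sum over clusters of
    interestingness(c) * size(c)**beta."""
    total = 0.0
    for members in clusters:           # each cluster is a list of object indices
        p = sum(labels[i] == class_of_interest for i in members) / len(members)
        total += interestingness(p, prior) * len(members) ** beta
    return total
```

Because β > 1, the size term is superadditive (a^β + b^β < (a + b)^β), which is exactly why merging two equally interesting clusters raises the overall quality.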
Example of the Reward-based Fitness Function
- Class of interest: Poor
- Prior probability: 20%
- γ1 = 0.5, γ2 = 1.5
- R+ = 1, R- = 1
- β = 1.1, η = 1
SCAH: Description
- SCAH constructs a two-dimensional (N x N) dissimilarity matrix that stores the spatial distances between clusters.
- SCAH then computes the merge candidates: pairs of clusters that can potentially be merged.
  - A pair of clusters (ci, cj) is a merge candidate if ci is the closest cluster to cj or cj is the closest cluster to ci.
- Merge candidates and distance information are updated incrementally when clusters are merged.
- In each step, SCAH selects from the pool of merge candidates the pair of clusters whose merging improves the overall quality most.
- The algorithm terminates when no such pair of clusters exists.
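The merge loop described above can be sketched as follows. This is a simplified sketch, without the incremental matrix updates, and `dist` (inter-cluster distance) and `reward` (cluster fitness) are assumed caller-supplied interfaces, not the thesis's code:

```python
def scah(n_objects, dist, reward):
    """SCAH sketch: greedy agglomerative supervised clustering.
    dist(ci, cj) -> inter-cluster distance (assumed interface),
    reward(c)    -> fitness reward of a cluster (assumed interface)."""
    clusters = [[i] for i in range(n_objects)]
    while len(clusters) > 1:
        # (ci, cj) is a merge candidate if one is the other's nearest cluster
        nearest = [min((j for j in range(len(clusters)) if j != i),
                       key=lambda j: dist(clusters[i], clusters[j]))
                   for i in range(len(clusters))]
        candidates = {tuple(sorted((i, nearest[i]))) for i in range(len(clusters))}
        # pick the candidate pair whose merge improves overall quality most
        best_pair, best_gain = None, 0.0
        for i, j in candidates:
            gain = (reward(clusters[i] + clusters[j])
                    - reward(clusters[i]) - reward(clusters[j]))
            if gain > best_gain:
                best_pair, best_gain = (i, j), gain
        if best_pair is None:          # no merge improves quality: terminate
            break
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Note that exactly one candidate pair is merged per iteration; as discussed later, this one-merge-at-a-time greediness is also the source of SCAH's tendency to get stuck for large β.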
Inter-Cluster Distance Measures for SCAH
- Single-linkage
  - suffers from the so-called chaining phenomenon.
- Complete-linkage
  - if the dataset contains noise or outliers, the farthest objects are very possibly outliers.
- Average-linkage
  - computationally more expensive,
  - but the chaining problem is avoided and outliers have much less negative impact on clustering decisions.
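The three linkage measures can be compared side by side; a minimal sketch for two clusters of Euclidean points (coordinate tuples):

```python
import itertools

def linkage_distances(a, b):
    """Single, complete, and average linkage between two clusters of
    Euclidean points (given as coordinate tuples)."""
    def d(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    pair_dists = [d(p, q) for p, q in itertools.product(a, b)]
    # single = closest pair, complete = farthest pair, average = mean pair
    return min(pair_dists), max(pair_dists), sum(pair_dists) / len(pair_dists)
```

Average linkage examines every pair (hence its higher cost), which is also what makes it robust to the single closest or farthest outlier pair.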
Flow Chart
6. Supervised Clustering Using Multi-Resolution Grids
- SCMRG is a hierarchical, grid-based method for reward-based region discovery.
- SCMRG uses rectangular grid cells and searches top-down, subdividing a rectangular grid cell into sets of smaller rectangular grid cells (cf. STING [Wang97]).
Example of SCMRG
- If a cell receives a reward that is larger than the sum of the rewards of its children and larger than the sum of the rewards of its grandchildren, the cell is returned as a cluster by the algorithm.
- If a cell does not receive a reward, and its children and grandchildren do not receive a reward either, neither the cell nor any of its descendants is included in the computed clustering.
- Otherwise, all children of the cell are put on a queue for further processing.
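The three-way decision rule above can be sketched recursively. Here `reward(cell)` and `split(cell)` are assumed caller-supplied helpers, and the hypothetical `max_depth` cap stands in for the grid's finest resolution; the thesis's algorithm uses a queue rather than recursion, but the per-cell logic is the same:

```python
def scmrg_cell(cell, reward, split, max_depth=3, depth=0):
    """SCMRG decision rule for one grid cell (recursive sketch).
    reward(cell) -> reward of a cell, split(cell) -> its child cells at
    the next finer resolution; max_depth is a hypothetical depth cap."""
    r = reward(cell)
    children = split(cell) if depth < max_depth else []
    grandchildren = ([g for c in children for g in split(c)]
                     if depth + 1 < max_depth else [])
    child_r = sum(reward(c) for c in children)
    grand_r = sum(reward(g) for g in grandchildren)
    if r > 0 and r > child_r and r > grand_r:
        return [cell]                  # keep the cell itself as a region
    if r == 0 and child_r == 0 and grand_r == 0:
        return []                      # prune: no rewards anywhere below
    regions = []                       # otherwise descend into the children
    for c in children:
        regions += scmrg_cell(c, reward, split, max_depth, depth + 1)
    return regions
```

The pruning branch is what makes SCMRG fast: whole subtrees of the grid that carry no reward are never subdivided further.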
Pseudo Code
Indexing Structure for SCMRG
Objectives of the Experiments
- To illustrate how SCAH and SCMRG perform on different kinds of spatial datasets
- To study how changes in the parameters of the fitness function affect the clustering results and region discovery in general
- To determine which supervised clustering algorithm performs well for which dataset, and to identify the strengths and weaknesses of each algorithm
Datasets Used
These two datasets were obtained from Salvador, S. and Chan, P., who used them in their publication "Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms."
Datasets Used
- Obtained from the Geosciences Department at the University of Houston.
- The Volcano dataset contains basic geographic and geologic information for volcanoes thought to have been active in the last 10,000 years.
- The original data include a unique volcano number, volcano name, location, latitude and longitude, summit elevation, volcano type, status, and the time range of the last recorded eruption.
- The subset of the Volcano dataset used in this thesis contains longitude, latitude, and a class variable that indicates whether a volcano is non-violent (blue) or violent (red).
Datasets Used
- Obtained from the Geosciences Department at the University of Houston.
- The Earthquake dataset contains worldwide earthquake data collected by the United States Geological Survey (USGS) National Earthquake Information Center (NEIC).
- The modified Earthquake dataset contains longitude, latitude, and a class variable that indicates the depth of the earthquake: 0 (shallow), 1 (medium), or 2 (deep).
Datasets Used
- The Wyoming datasets were created from U.S. Census 2000 data.
- The Wyoming Modified Poverty Status in 1999 dataset is a modified version of the original dataset, Wyoming Poverty Status.
- The Wyoming poverty datasets were created using county statistics. For each county, random population coordinates were generated using the complete spatial randomness (CSR) functions in S-PLUS.
- Then, the background information was attached to each individual based on the county's distribution for the class of interest. Finally, all counties were merged into a single dataset that describes the whole state.
Datasets Used
Evaluation Measures Used
- Number of clusters: indicates the number of regions discovered.
  - SCMRG: the regions discovered are counted as the number of black cells that correspond to the set of discovered regions receiving the highest reward.
  - SCAH: the number of clusters is reported.
- Outliers:
  - SCMRG: if a region does not receive a reward, the objects belonging to that region are considered outliers.
  - SCAH: every object has to belong to a cluster; therefore, there are no outliers when running SCAH.
- Quality: the result of the reward-based fitness function.
- Cluster purity:
  - measured as the percentage of majority examples in the clusters returned by the supervised clustering algorithm.
  - A majority example in a cluster is an instance that belongs to the most frequent class in that cluster.
- Time complexity: the total time used to process the whole dataset and produce the final clustering result. In the experiments, we used Wall-Clock Time (WCT) in seconds.
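Cluster purity as defined above can be computed in a few lines. The aggregation over clusters used here (total majority examples divided by total objects) is an assumption about how the per-cluster percentages are combined:

```python
from collections import Counter

def purity(clusters, labels):
    """Overall cluster purity: fraction of objects that are majority
    examples (instances of the most frequent class) in their cluster."""
    majority = sum(Counter(labels[i] for i in members).most_common(1)[0][1]
                   for members in clusters)
    total = sum(len(members) for members in clusters)
    return majority / total
```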
Input Parameters Tested
Using Different Inter-Cluster Distance Measures for SCAH
Complex9 Experimental Result Analysis
- The outer elliptical shape belongs to class 1, and the two spots inside belong to class 0.
- Based on average distance, the merge candidates are {1, 6}, {5, 6}, {3, 7}, {2, 7}.
- SCAH stops here, since no single merge improves the quality. However, for β = 3 a higher reward could be obtained by merging all 7 clusters, but SCAH fails to do so.
Experimental Results for SCAH
SCAH Earthquake Results
SCAH Volcano Original Dataset
SCAH Volcano Experimental Result
SCAH: Advantages and Weaknesses
- SCAH does not need prior knowledge of the dataset to run; it processes the objects based on the distance matrix. It can precisely find the nearest neighbor of each cluster and produce the merge candidates for the next iteration.
- SCAH does a good job of picking up small, pure clusters, such as the chain-like patterns in the Volcano datasets that representative-based clustering algorithms approximate as a union of clusters.
- SCAH's restriction to merging one pair of candidates per step is too restrictive to obtain good results as β increases.
- SCAH also has no look-ahead option and, in the Complex9 experiments, terminates at a certain step even when β increases.
- Moreover, SCAH has to store a two-dimensional distance matrix, so the complexity of the algorithm is O(N²), where N is the total number of objects in the dataset. Therefore, space complexity and time complexity remain the main challenges for this algorithm.
SCMRG: Experimental Results
- Yellow indicates reward regions with a high density of the class of interest.
- Black indicates reward regions with a very low density of the class of interest.
- Non-reward regions are not colored at all.
β = 1.01, η = 6
β = 1.3, η = 1
β = 3, η = 1
SCMRG Complex9
SCMRG Complex8
SCMRG Volcano
SCMRG Earthquake
SCMRG: Advantages and Weaknesses
- The main advantage of this approach is its fast processing time compared to SCAH. The time complexity is determined by the number of cells added to the queue for further processing at each level, which makes this algorithm much faster than SCAH.
- The disadvantage of SCMRG is that the boundaries of the regions it discovers can only be vertical and horizontal, since the discovered regions are mostly composed of square cells.
Comparison of Experimental Results
Comparison of Experimental Results
Purity and Quality
Comparison
- SCAH outperforms SCMRG and SCEC in the purity and quality of the clustering results. However, SCAH does not do well for β = 3: it only considers one merge candidate per cluster, which is too restrictive to get good results. When β increases, the algorithm gets stuck because of this restriction.
- SCMRG dramatically outperforms SCAH with respect to time consumed. SCMRG can process thousands of objects in a dataset in a few seconds, while SCAH takes more than an hour to process the same dataset.
- Compared to SCAH, SCEC takes much more time to process the same dataset. SCEC takes more than 14 hours to process the earthquake 1 dataset, while SCMRG takes just 4 minutes. Normally, the time consumed by SCEC is 3 to 5 times that of SCAH. With respect to time complexity, SCMRG has the best performance, and SCAH is better than SCEC.
Summary
- We introduced two supervised clustering approaches and a reward-based evaluation framework suitable for region discovery problems.
- Finding interesting regions in spatial datasets is viewed as a clustering problem in which the sum of rewards for the obtained clusters is maximized, and the reward associated with each cluster reflects its degree of interestingness for the problem at hand.
- This approach is unique and quite different from most other work in spatial data mining, which mostly uses association rules. Different measures of interestingness can easily be supported in the proposed framework by designing different cluster reward functions; in this case, neither the supervised clustering algorithm itself nor our general evaluation framework has to be modified.
- We also discussed how hierarchical and grid-based clustering algorithms can be adapted for supervised clustering in general, and for region discovery in particular, and provided evidence of the usefulness of the proposed framework for hotspot discovery problems.
Future Work
- The algorithms should be tested with other reward functions.
- For SCAH, strategies should be developed to prevent the algorithm from getting stuck and returning too large a number of regions.
- SCMRG needs a better way to visualize its results, and it has to approximate regions more precisely.
- More comparisons with other region discovery algorithms are needed as well.
References
- [Fayyad96] Usama M. Fayyad, "Tutorial on Knowledge Discovery and Data Mining," IJCAI-95/KDD-95.
- [NgHan94] R. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. 1994 Int'l Conf. on Very Large Data Bases (VLDB'94), Santiago, Chile, September 1994, pp. 144-155.
- [MAHKDD98] M. Ester, A. Frommelt, H.-P. Kriegel, and J. Sander, "Algorithms for Characterization and Trend Detection in Spatial Databases," Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD'98), New York City, NY, 1998, pp. 44-50.
- [KR87] See [KR90]; clustering large applications based on randomized search.
- [NH02] Raymond T. Ng and Jiawei Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining," IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 5, September/October 2002.
- [EST96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "DBSCAN: Density-Based Spatial Clustering of Applications with Noise," Proc. Knowledge Discovery and Data Mining (KDD'96).
- [ABHS99] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander, "OPTICS: Ordering Points To Identify the Clustering Structure," Proc. ACM SIGMOD'99 Int. Conf. on Management of Data, Philadelphia, PA, 1999.
- [WANG97] Wei Wang, Jiong Yang, and Richard Muntz, "STING: A Statistical Information Grid Approach to Spatial Data Mining," Proc. 23rd Int'l Conf. on Very Large Data Bases (VLDB'97), 1997.
- [BBM03] S. Basu, M. Bilenko, and R. Mooney, "Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering," Proc. ICML'03, pp. 42-49, Washington, DC, August 2003.
- [BHSW03] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning Distance Functions Using Equivalence Relations," Proc. ICML'03, Washington, DC, August 2003.
- [DBE99] A. Demiriz, K.-P. Bennett, and M. J. Embrechts, "Semi-Supervised Clustering Using Genetic Algorithms," Proc. ANNIE'99.
- [ERCBV04] C. Eick, A. Rouhana, C. Chen, A. Bagherjeiran, and R. Vilalta, "Using Clustering to Learn Distance Functions for Supervised Similarity Assessment," submitted for publication.
- [EZ05] C. Eick and N. Zeidat, "Using Supervised Clustering to Enhance Classifiers," Proc. 15th International Symposium on Methodologies for Intelligent Systems (ISMIS), Saratoga Springs, New York, May 2005.
- [EZV04] "Using Representative-Based Clustering for Nearest Neighbor Dataset Editing," Proc. IEEE Int. Conference on Data Mining, Brighton, England, November 2004.
- [KKM02] D. Klein, S.-D. Kamvar, and C. Manning, "From Instance-Level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering," Proc. ICML'02, Sydney, Australia.
- [KR90] L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," John Wiley & Sons, 1990.
- [ST99] N. Slonim and N. Tishby, "Agglomerative Information Bottleneck," Neural Information Processing Systems (NIPS-1999).
- [Z04] Z. Zhao, "Evolutionary Computing and Splitting Algorithms for Supervised Clustering," Master's Thesis, University of Houston, Department of Computer Science, May 2004, cs.uh.edu/zhenzhao/ZhenghongThesis.zip.
- [ZE04] N. Zeidat and C. Eick, "K-Medoid-Style Clustering Algorithms for Supervised Summary Generation," Proc. 2004 International Conference on Machine Learning: Models, Technologies and Applications (MLMTA'04), Las Vegas, Nevada, June 2004.