Title: Region Discovery Using Supervised Clustering Algorithms
1Region Discovery Using Supervised Clustering
Algorithms
2Outline
- Goals of the Thesis
- Introduction
- Supervised Clustering (SC)
- Fitness Function for Region Discovery
- An Environment for Region Discovery
- Experimental Results
- Conclusion and Future Work
3Goals of the Thesis
- Investigate using supervised clustering (SC) algorithms for region discovery
- Design a graphic display program to aid in visualizing the results of region discovery
- Create census-based spatial datasets for the state of Wyoming
- Analyze and compare the performance of SC algorithms in region discovery
5Introduction
- Spatial Data Mining
  - seeks to discover meaningful and interesting patterns in data where a key dimension of the data is geographical location
- Region Discovery
  - subdivides a territory into disjoint, contiguous regions, minimizing some measure of interestingness
6Overview of Clustering
- Identify groups of objects (clusters) in a dataset according to their similarity with respect to a particular distance metric
- Three types of clustering: unsupervised (traditional) clustering, semi-supervised clustering, and supervised clustering
8Supervised Clustering (SC)
- Representative-based
  - Find a set of objects (representatives) that best represent the objects in a dataset
  - A solution is a set of representatives
  - Objects are assigned to the nearest representative to form clusters
- The goal of SC is to find a clustering that minimizes the given fitness function, or measure of interestingness
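The assignment step described above can be sketched in a few lines. This is a minimal illustration, assuming 2-D points and Euclidean distance; the function name and the toy coordinates are hypothetical and do not come from the thesis.

```python
import math

def assign_to_representatives(points, rep_indices):
    """For each point, return the position (within rep_indices) of its nearest representative."""
    assignment = []
    for p in points:
        # Euclidean distance from this point to every representative
        distances = [math.dist(p, points[r]) for r in rep_indices]
        assignment.append(distances.index(min(distances)))
    return assignment

# Hypothetical toy data: two tight groups of points
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (0.1, 0.3)]
reps = [0, 2]  # indices of the objects chosen as representatives
clusters = assign_to_representatives(points, reps)
```

Here each entry of `clusters` names the cluster of the corresponding point, so a solution is fully described by the set of representatives alone.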
9Supervised Clustering Algorithms Used
- SRIDHCR
  - Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart
- SCEC
  - Supervised Clustering using Evolutionary Computation
10SRIDHCR Algorithm Pseudo-code
- REPEAT r TIMES
  - curr := a randomly created set of representatives (with size between c+1 and 2*c)
  - WHILE NOT DONE DO
    - 1. Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr
    - 2. Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one)
    - 3. IF q(s) < q(curr) THEN curr := s
      ELSE IF q(s) = q(curr) AND |s| > |curr| THEN curr := s
      ELSE terminate and return curr as the solution for this run
- Report the best out of the r solutions found
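The hill-climbing loop above can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the fitness `q_toy` (impurity plus a square-root penalty weighted by `beta`), the function names, and the toy dataset are all hypothetical, and c is taken to be the number of classes.

```python
import math
import random

def q_toy(points, labels, reps, beta=0.4):
    """Assumed fitness: cluster impurity plus a penalty for using more clusters than classes."""
    nearest = [min(reps, key=lambda r: math.dist(p, points[r])) for p in points]
    minority = 0
    for r in reps:
        members = [labels[i] for i, a in enumerate(nearest) if a == r]
        if members:
            minority += len(members) - max(members.count(c) for c in set(members))
    n, k, c = len(points), len(reps), len(set(labels))
    penalty = math.sqrt((k - c) / n) if k > c else 0.0
    return minority / n + beta * penalty

def sridhcr(points, labels, q, r=5, seed=0):
    rng = random.Random(seed)
    n, c = len(points), len(set(labels))
    best, best_q = None, math.inf
    for _ in range(r):  # REPEAT r TIMES
        curr = frozenset(rng.sample(range(n), rng.randint(c + 1, min(2 * c, n))))
        q_curr = q(points, labels, sorted(curr))
        while True:
            # Step 1: every single insertion and every single deletion
            neighbours = [curr | {i} for i in range(n) if i not in curr]
            if len(curr) > 1:
                neighbours += [curr - {i} for i in curr]
            # Step 2: steepest step, ties broken at random
            scored = [(q(points, labels, sorted(s)), s) for s in neighbours]
            q_s = min(score for score, _ in scored)
            s = rng.choice([cand for score, cand in scored if score == q_s])
            # Step 3: accept improvements, or equal fitness with more representatives
            if q_s < q_curr or (q_s == q_curr and len(s) > len(curr)):
                curr, q_curr = s, q_s
            else:
                break
        if q_curr < best_q:
            best, best_q = sorted(curr), q_curr
    return best, best_q

# Hypothetical toy data: two well-separated groups, one per class
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["A"] * 4 + ["B"] * 4
best, best_q = sridhcr(points, labels, q_toy)
```

On this toy data every run ends with one representative in each group, because any extra representative only adds penalty and any missing group adds impurity.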
11SCEC: applies biological principles of natural evolution
- A Genetic Algorithm cycle
12SCEC Algorithm Pseudo-code
- t ← 0
- init_population P(t)
- evaluate P(t)
- WHILE NOT DONE DO
  - t ← t + 1
  - P' ← select_parents P(t)
  - recombine P'(t)
  - mutate P'(t)
  - evaluate P'(t)
  - P ← survive P, P'(t)
- Terminate when the pre-defined number of generations is reached; return the best solution
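The cycle above can be sketched as a small evolutionary loop over sets of representatives. The slides do not state which operators SCEC uses, so everything below is an assumption chosen for illustration: binary tournament selection, a subset-of-the-union recombination, an insert-or-delete mutation, elitist survival, and the toy fitness `q_toy`.

```python
import math
import random

def q_toy(points, labels, reps, beta=0.4):
    """Assumed fitness: cluster impurity plus a penalty for using more clusters than classes."""
    nearest = [min(reps, key=lambda r: math.dist(p, points[r])) for p in points]
    minority = 0
    for r in reps:
        members = [labels[i] for i, a in enumerate(nearest) if a == r]
        if members:
            minority += len(members) - max(members.count(c) for c in set(members))
    n, k, c = len(points), len(reps), len(set(labels))
    return minority / n + beta * (math.sqrt((k - c) / n) if k > c else 0.0)

def scec_sketch(points, labels, q, pop_size=20, generations=15, seed=0):
    rng = random.Random(seed)
    n = len(points)

    def fitness(sol):
        return q(points, labels, sorted(sol))

    def recombine(a, b):
        # Hypothetical operator: a random non-empty subset of both parents' representatives
        pool = sorted(a | b)
        return frozenset(rng.sample(pool, rng.randint(1, len(pool))))

    def mutate(sol):
        # Hypothetical operator: insert one non-representative or delete one representative
        sol = set(sol)
        outside = [i for i in range(n) if i not in sol]
        if outside and (len(sol) < 2 or rng.random() < 0.5):
            sol.add(rng.choice(outside))
        elif len(sol) > 1:
            sol.remove(rng.choice(sorted(sol)))
        return frozenset(sol)

    # init_population and evaluate
    population = sorted(
        (frozenset(rng.sample(range(n), rng.randint(2, max(2, n // 2)))) for _ in range(pop_size)),
        key=fitness,
    )
    history = [fitness(population[0])]
    for _ in range(generations):
        # select_parents: binary tournament on fitness (lower is better)
        parents = [min(rng.sample(population, 2), key=fitness) for _ in range(pop_size)]
        offspring = [mutate(recombine(rng.choice(parents), rng.choice(parents)))
                     for _ in range(pop_size)]
        # survive: keep the best pop_size of old and new solutions (elitist)
        population = sorted(population + offspring, key=fitness)[:pop_size]
        history.append(fitness(population[0]))
    return sorted(population[0]), history

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["A"] * 4 + ["B"] * 4
best, history = scec_sketch(points, labels, q_toy)
```

Because survival is elitist in this sketch, the best fitness recorded in `history` can never get worse from one generation to the next.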
14Fitness Functions for Region Discovery
- Three fitness functions to evaluate region discovery:
  - Traditional SC Fitness Function
  - Gerrymandering Fitness Function
  - Reward-based Fitness Function
15Traditional SC Fitness Function
- Tries to maximize purity of clusters while
keeping the number of clusters low
16Example of Traditional SC Fitness Function
- Identify majority class for each cluster
- Count minority examples for each cluster
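The two steps above (find each cluster's majority class, count the minority examples) can be sketched as follows. The impurity part follows the slide; the penalty term `sqrt((k - c) / n)` weighted by `beta` is an assumed form, since the slides only say that the number of clusters is kept low. The function name and toy data are hypothetical.

```python
import math
from collections import Counter

def traditional_sc_fitness(cluster_ids, class_labels, beta=0.3):
    n = len(class_labels)
    c = len(set(class_labels))
    # Class counts per cluster
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # Minority examples: everything in a cluster that is not in its majority class
    minority = sum(sum(cnt.values()) - max(cnt.values()) for cnt in by_cluster.values())
    k = len(by_cluster)
    # Assumed penalty form: grows once there are more clusters than classes
    penalty = math.sqrt((k - c) / n) if k > c else 0.0
    return minority / n + beta * penalty

# Hypothetical example: 8 objects in 3 clusters, one minority example in cluster 0
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
class_labels = ["A", "A", "B", "B", "B", "B", "A", "A"]
q = traditional_sc_fitness(cluster_ids, class_labels)
```

Lower values are better here: the impurity is 1/8, and the third cluster adds a penalty because only two classes are present.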
17Gerrymandering Fitness Function (1)
- Seeks a clustering in which a particular class (the class of interest) dominates as many clusters as possible while minimizing the imbalance among clusters
- Example: 15 objects in total; class A has 6 objects and class B has 9 objects. Let class A be the class of interest.
18Gerrymandering Fitness Function (2)
- The gerrymandering fitness function incorporates three criteria:
  - Maximize the number of clusters (regions) that are dominated by a particular class
  - Produce the number of regions specified by the user (controlled by parameter β; k denotes the user-specified number of regions desired)
  - Maintain equality of population (controlled by parameter ?)
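The three criteria can be measured separately, as in the sketch below. It does not reproduce the thesis's combined formula, which is not given in the text. It assumes that "dominates" means a strict majority within a cluster and uses a hypothetical imbalance measure (largest deviation from the mean cluster size). The class counts match the 15-object example (6 of class A, 9 of class B); the cluster layout is invented for illustration.

```python
from collections import Counter

def gerrymandering_criteria(cluster_ids, class_labels, class_of_interest, k_desired):
    sizes = Counter(cluster_ids)
    interest = Counter(cid for cid, lab in zip(cluster_ids, class_labels)
                       if lab == class_of_interest)
    # Criterion 1: clusters where the class of interest holds a strict majority (assumed meaning)
    dominated = sum(1 for cid, size in sizes.items() if interest[cid] > size / 2)
    # Criterion 2: distance from the user-specified number of regions
    region_gap = abs(len(sizes) - k_desired)
    # Criterion 3: population imbalance, relative to the mean cluster size (hypothetical measure)
    mean = len(cluster_ids) / len(sizes)
    imbalance = max(abs(size - mean) for size in sizes.values()) / mean
    return dominated, region_gap, imbalance

# 15 objects, 6 of class A and 9 of class B, split into three equal clusters
cluster_ids = [0] * 5 + [1] * 5 + [2] * 5
class_labels = ["A", "A", "A", "B", "B"] * 2 + ["B"] * 5
result = gerrymandering_criteria(cluster_ids, class_labels, "A", k_desired=3)
```

In this layout class A dominates two of the three clusters even though it is the minority class overall, which is the effect the fitness function rewards.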
19Gerrymandering Fitness Function (3)
20Reward-based Fitness Function (1)
- Evaluates a clustering based on the density of a class of focus C, and assigns rewards to regions in which the distribution of class C deviates significantly from the prior probability of class C in the whole dataset
- The quality of a clustering, q_C(X), is the sum of the rewards t_C(c) associated with each cluster c in X
- Rewards are higher for larger clusters; this is controlled by the exponent β (β > 1 rewards size more than proportionally)
21Reward-based Fitness Function (2)
22Example of Reward-based Fitness Function
- Parameters: γ1 = 0.5, γ2 = 1.5, R+ = 1, R- = 1, β = 1.1
- Prior(Poor) = 0.2, n = 1000
- p(c1, Poor) = 20/50 = 0.4
- p(c2, Poor) = 40/200 = 0.2
- p(c3, Poor) = 10/200 = 0.05
- p(c4, Poor) = 30/350 = 0.0857
- p(c5, Poor) = 100/200 = 0.5
- Position relative to the thresholds 0.1 and 0.3: c3, c4 < 0.1 < c2 < 0.3 < c1, c5
- Reward for c1: (0.4 - 0.3)/0.7 = 1/7
- Reward for c3: (0.1 - 0.05)/0.1 = 1/2
- q_Poor(X) = (1/7 × 50)^1.1/1000 + 0 + (1/2 × 200)^1.1/1000 + (0.143 × 350)^1.1/1000 + (2/7 × 200)^1.1/1000
  = 0.00869 + 0 + 0.15849 + 0.07402 + 0.08564
  = 0.32684
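The worked example can be checked with a short script. The formula below is inferred from the numbers in the example, not taken from a stated definition: the thresholds are Prior × γ1 and Prior × γ2, the reward is linear in the distance from the nearer threshold, and each cluster contributes (reward × size)^β / n. Function and parameter names are hypothetical.

```python
def reward(p, prior, gamma1, gamma2, r_plus, r_minus):
    """Reward of one cluster, given the fraction p of its objects in the class of interest."""
    low, high = prior * gamma1, prior * gamma2
    if p < low:
        # Class is rarer than expected: scaled by R-
        return (low - p) / low * r_minus
    if p > high:
        # Class is denser than expected: scaled by R+
        return (p - high) / (1.0 - high) * r_plus
    return 0.0

def q_reward(clusters, prior, n, beta, gamma1, gamma2, r_plus, r_minus):
    """Sum of (reward * cluster size)^beta / n over all clusters."""
    total = 0.0
    for in_class, size in clusters:
        t = reward(in_class / size, prior, gamma1, gamma2, r_plus, r_minus)
        total += (t * size) ** beta / n
    return total

# (objects of class Poor, cluster size) for c1..c5, as in the example
clusters = [(20, 50), (40, 200), (10, 200), (30, 350), (100, 200)]
q_poor = q_reward(clusters, prior=0.2, n=1000, beta=1.1,
                  gamma1=0.5, gamma2=1.5, r_plus=1, r_minus=1)
```

The result is about 0.327, which agrees with the 0.32684 in the example up to the example's rounding of the c4 reward to 0.143.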
24An Environment for Region Discovery
- Components of the environment for region discovery:
  - Spatial datasets
  - Graphic display tool
  - Fitness function support
  - RSC algorithms
26Objectives of the Experiments
- To illustrate how SRIDHCR and SCEC work in region discovery on four Wyoming state spatial datasets and two artificial spatial datasets
- To evaluate the performance of SRIDHCR and SCEC with each of the three fitness functions in region discovery
- To study how the parameter values of the three fitness functions affect the clustering results (regions discovered) and to select a set of good parameters for the fitness functions
- To analyze and compare the performance of SRIDHCR and SCEC in region discovery
27Datasets Used in the Experiments
- Artificial datasets: Matlab datasets
- Wyoming datasets: created from U.S. Census Bureau data
28Creation of Wyoming Datasets
29Original Wyoming Datasets (Census 2000)
- Household Income in 1999
- Poverty Status in 1999
- Age
- Race
30Household Income in 1999
31Poverty Status in 1999
32Age
33Race
34Wyoming Poverty Dataset and Modified Poverty Dataset
[Figure panels: Wyoming Poverty Dataset; Modified Poverty Dataset]
35Example Output of Clustering
- Each color represents a cluster
- Classes are represented by different point shapes
- Representatives are circled in white
36Clustering using Traditional SC Fitness Function
37Modified Poverty Dataset
38SCEC Traditional SC Fitness Function (1)
- Clustering Output of Modified Poverty Dataset (parameter β = 0.3)
39SCEC Traditional SC Fitness Function (2)
- Clustering Output of Modified Poverty Dataset (parameter β = 0.1)
40SRIDHCR Traditional SC Fitness Function
- Clustering Output of Modified Poverty Dataset (parameter β = 0.3)
41SRIDHCR Traditional SC Fitness Function
- Clustering Output of Modified Poverty Dataset (parameter β = 0.1)
42Clustering using Gerrymandering Fitness Function
43Wyoming Age Dataset
44Wyoming Modified Age Dataset
45SCEC Gerrymandering Fitness Function (1)
- Clustering Output of Wyoming Age Dataset (parameters k = 7, β = 30000, ? = 0.01)
5 7
46SCEC Gerrymandering Fitness Function (2)
- Clustering Output of Wyoming Age Dataset (parameters k = 12, β = 30000, ? = 0.01)
10 12
47SCEC Gerrymandering Fitness Function (3)
- Clustering Output of Wyoming Age Dataset (parameters k = 12, β = 30000, ? = 0.08)
7 12
48SRIDHCR Gerrymandering Fitness Function (1)
- Clustering Output of Wyoming Age Dataset (parameters k = 12, β = 30000, ? = 0.01)
11 12
49SRIDHCR Gerrymandering Fitness Function (2)
- Clustering Output of Wyoming Age Dataset (parameters k = 12, β = 30000, ? = 0.08)
8 12
50Clustering using Reward-based Fitness Function
51Wyoming Race Dataset
52SCEC Reward-based Fitness Function (1)
- Clustering Output of Wyoming Race Dataset (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R- = 10, β = 1.4)
53Wyoming Poverty Dataset
54SCEC Reward-based Fitness Function (2)
- Clustering Output of Wyoming Poverty Dataset (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R- = 10, β = 1.1)
55Modified Poverty Dataset
56SCEC Reward-based Fitness Function (3)
- Clustering Output of Modified Poverty Dataset (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R- = 10, β = 1.1)
57Wyoming Income Dataset
58SCEC Reward-based Fitness Function (4)
- Clustering Output of Wyoming Income Dataset, class of interest: class 1 (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R- = 10, β = 2)
59SCEC Reward-based Fitness Function (5)
- Clustering Output of Wyoming Income Dataset, class of interest: class 4 (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R- = 10, β = 2)
60SCEC Reward-based Fitness Function (6)
- Clustering Output of Wyoming Income Dataset, class of interest: class 4 (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R- = 0, β = 1.1)
61SCEC Reward-based Fitness Function (7)
- Clustering Output of Wyoming Income Dataset, class of interest: class 4 (parameters γ1 = 0.5, γ2 = 1.5, R+ = 0, R- = 10, β = 1.1)
62Comparison of the performance of SRIDHCR and SCEC
63Execution of Algorithm
- SCEC
  - generation 0: 300 clusterings
  - generation 1: 300 clusterings
  - ...
  - generation 50: 300 clusterings
  - total: 50 x 300 clusterings
- SRIDHCR (dataset size 5000)
  - random restart 0: 5000 insertion/deletion clusterings per iteration, for i iterations
  - random restart 1: 5000, ...
  - ...
  - random restart 25: 5000, ...
  - total: 25 x i x 5000 clusterings
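The totals above can be turned into a quick count of how many clusterings each algorithm evaluates. The defaults come from the slide (50 generations of 300 clusterings; 25 restarts on a dataset of 5000 objects); the number of hill-climbing iterations i is not given, so the value used below is purely hypothetical.

```python
def scec_clusterings(generations=50, population_size=300):
    """Clusterings evaluated by SCEC: one population per generation."""
    return generations * population_size

def sridhcr_clusterings(iterations_per_restart, restarts=25, dataset_size=5000):
    """Clusterings evaluated by SRIDHCR: about one insertion/deletion per object, per iteration."""
    return restarts * iterations_per_restart * dataset_size

scec_total = scec_clusterings()
# i = 10 iterations per restart is an assumed value for illustration only
sridhcr_total = sridhcr_clusterings(iterations_per_restart=10)
```

Under these counts SRIDHCR's workload grows with the dataset size, while SCEC's depends only on the population size and the number of generations.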
64Time Cost Comparison
65Algorithm comparison
- Solution Quality
  - SRIDHCR finds better solutions with the traditional SC fitness function
  - SCEC finds better solutions with the gerrymandering and reward-based fitness functions
- Time Cost
  - SCEC runs very fast
  - SRIDHCR is very slow on large datasets
67Conclusion
- Both the traditional SC fitness function and the reward-based fitness function show some merit on a purity basis
- It is difficult to satisfy all three criteria at once with the gerrymandering fitness function
- SCEC is fast for datasets of size 100-3000; SRIDHCR is fast on 200 examples, acceptable on 1000 examples, slow on 3000 examples, and cannot run beyond 5000 examples
- The graphic display tool works for visualizing and summarizing the clustering results
- SCEC is the better SC algorithm for region discovery in terms of both solution quality and time cost
68Future Work
- More experiments with the fitness functions and a more thorough experimental evaluation are still needed
- Improve the data structures to handle larger distance matrices
- Faster algorithms are needed to process datasets with more than 5000 examples
- Improve the graphic display tool to handle more than 25 clusters