Region Discovery Using Supervised Clustering Algorithms

1
Region Discovery Using Supervised Clustering
Algorithms

Kim Keen Wee

2
Outline

Goals of the Thesis
Introduction
Supervised Clustering (SC)
Fitness Function for Region Discovery
An Environment for Region Discovery
Experimental Results
Conclusion and Future Work

3
Goals of the Thesis

Investigate using Supervised Clustering (SC)
Algorithms for region discovery
Design a graphic display program to aid in
visualizing the results of region discovery
Create census-based spatial datasets for the
state of Wyoming
Analyze and compare the performance of SC
algorithms in region discovery

5
Introduction

Spatial Data Mining
seeks to discover meaningful and interesting
patterns from data where a key dimension of data
is geographical location
Region Discovery
subdivides a territory into disjoint, contiguous
regions while minimizing some measure of
interestingness

6
Overview of Clustering

Identify groups of objects (or clusters) in a
dataset according to their similarity with
respect to a particular distance metric
Three types of clustering: unsupervised (or
traditional) clustering, semi-supervised
clustering, and supervised clustering

8
Supervised Clustering (SC)

Representative-based
Find a set of objects (or representatives) that
best represent the objects in a dataset
A solution is a set of representatives
Objects are assigned to the nearest
representatives to form clusters
The goal of SC is to find a clustering that
minimizes the given fitness function or measure of
interestingness
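The assignment step described above can be sketched in a few lines of Python. This is only an illustration: Euclidean distance is an assumption (the slides say only "a particular distance metric"), and the function name and sample points are hypothetical.

```python
import math

def assign_to_representatives(points, rep_indices):
    """For each point, return the position (within rep_indices) of its nearest representative."""
    assignment = []
    for p in points:
        # Distance from p to every representative (Euclidean distance is an assumption).
        distances = [math.dist(p, points[r]) for r in rep_indices]
        assignment.append(distances.index(min(distances)))
    return assignment

# Hypothetical 2-D objects; objects 0 and 2 act as the representatives.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.5, 4.8), (0.1, 0.9)]
reps = [0, 2]
print(assign_to_representatives(points, reps))  # [0, 0, 1, 1, 0]
```

Each distinct value in the returned list corresponds to one cluster, so a set of representatives fully determines a clustering.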

9
Supervised Clustering Algorithms Used

SRIDHCR
Single Representative Insertion/Deletion steepest
descent Hill Climbing with Randomized Restart
algorithm
SCEC
Supervised Clustering using Evolutionary
Computation

10
SRIDHCR Algorithm Pseudo-code

REPEAT r TIMES
  curr := a randomly created set of
  representatives (with size between c+1 and 2*c)
  WHILE NOT DONE DO
    1. Create new solutions S by adding a
       single non-representative to
       curr and by removing a single
       representative from curr
    2. Determine the element s in S for
       which q(s) is minimal (if there is
       more than one minimal element,
       randomly pick one)
    3. IF q(s) < q(curr) THEN curr := s
       ELSE IF q(s) = q(curr) AND |s| > |curr|
       THEN curr := s
       ELSE terminate and return curr as
       the solution for this run
Report the best out of the r solutions found
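The pseudo-code above can be turned into a small runnable Python sketch. It is not the thesis implementation: a solution is modeled as a frozenset of object indices, the fitness function q is supplied by the caller, and the toy fitness in the usage lines is purely hypothetical. The sketch assumes 2*c is at most the number of objects.

```python
import random

def sridhcr(n_objects, q, r, c, seed=0):
    """Hill climbing over sets of representative indices; q returns a cost (lower is better)."""
    rng = random.Random(seed)
    best = None
    for _ in range(r):                                   # REPEAT r TIMES
        size = rng.randint(c + 1, 2 * c)                 # size between c+1 and 2*c
        curr = frozenset(rng.sample(range(n_objects), size))
        while True:
            # Step 1: all solutions reachable by one insertion or one deletion.
            neighbors = [curr | {i} for i in range(n_objects) if i not in curr]
            if len(curr) > 1:                            # never delete the last representative
                neighbors += [curr - {i} for i in curr]
            # Step 2: pick a neighbor with minimal q, breaking ties at random.
            scores = [q(s) for s in neighbors]
            q_min = min(scores)
            s = rng.choice([nb for nb, sc in zip(neighbors, scores) if sc == q_min])
            # Step 3: move if strictly better, or equally good with more representatives.
            q_curr = q(curr)
            if q_min < q_curr or (q_min == q_curr and len(s) > len(curr)):
                curr = s
            else:
                break                                    # local optimum for this run
        if best is None or q(curr) < q(best):
            best = curr
    return best                                          # best of the r runs

# Hypothetical toy fitness: number of indices by which a solution differs from a target set.
target = frozenset({1, 4, 7})
toy_q = lambda s: len(s ^ target)
print(sorted(sridhcr(10, toy_q, r=3, c=2)))  # [1, 4, 7]
```

Each iteration evaluates one candidate per object in the dataset, which is why the cost of a run grows quickly with dataset size.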

11
SCEC applies biological principles of natural
evolution

A Genetic Algorithm cycle

12
SCEC Algorithm Pseudo-code

t ← 0
init_population P(t)
evaluate P(t)
WHILE NOT DONE DO
  t ← t + 1
  P ← select_parents P(t)
  recombine P(t)
  mutate P(t)
  evaluate P(t)
  P ← survive P, P(t)
Terminate when the predefined number of
generations is reached and return the best
solution
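The loop above follows the usual evolutionary-computation cycle. The Python sketch below shows one way such a cycle could look for representative-based clustering. Every operator in it is an assumption made for illustration (binary tournament selection, recombination by sampling from the union of two parents, insert/delete mutation, elitist survival); the slides do not specify the operators SCEC actually uses. It assumes at least two objects.

```python
import random

def scec_sketch(n_objects, q, pop_size=20, generations=30, seed=0):
    """Generic evolutionary loop over sets of representative indices (lower q is better)."""
    rng = random.Random(seed)

    def random_solution():
        size = rng.randint(2, max(2, n_objects // 2))
        return frozenset(rng.sample(range(n_objects), size))

    def recombine(a, b):
        # Hypothetical crossover: a random non-empty subset of both parents' representatives.
        pool = sorted(a | b)
        return frozenset(rng.sample(pool, rng.randint(1, len(pool))))

    def mutate(s):
        # Hypothetical mutation: insert one non-representative or delete one representative.
        outside = [i for i in range(n_objects) if i not in s]
        if outside and (len(s) == 1 or rng.random() < 0.5):
            return s | {rng.choice(outside)}
        return s - {rng.choice(sorted(s))}

    population = [random_solution() for _ in range(pop_size)]     # init_population
    for _ in range(generations):                                  # fixed number of generations
        # select_parents: binary tournament on the fitness q.
        parents = [min(rng.sample(population, 2), key=q) for _ in range(pop_size)]
        # recombine and mutate to create the offspring.
        offspring = [mutate(recombine(rng.choice(parents), rng.choice(parents)))
                     for _ in range(pop_size)]
        # survive: keep the best pop_size individuals of parents plus offspring.
        population = sorted(population + offspring, key=q)[:pop_size]
    return min(population, key=q)                                 # best solution found

# Hypothetical toy fitness, used only to exercise the loop.
target = frozenset({1, 4, 7})
toy_q = lambda s: len(s ^ target)
best = scec_sketch(10, toy_q)
print(sorted(best), toy_q(best))
```

Unlike hill climbing, the number of fitness evaluations per generation depends on the population size rather than on the dataset size.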

14
Fitness Functions for Region Discovery

Three fitness functions to evaluate region
discovery
Traditional SC Fitness Function
Gerrymandering Fitness Function
Reward-based Fitness Function

15
Traditional SC Fitness Function

Tries to maximize purity of clusters while
keeping the number of clusters low

16
Example of Traditional SC Fitness Function

Identify majority class for each cluster
Count minority examples for each cluster
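The two steps above amount to computing the impurity of a clustering. The Python sketch below illustrates this, together with one possible penalty on the number of clusters. The penalty form sqrt((k - c)/n), weighted by β, is an assumption borrowed from the supervised clustering literature and is not shown on the slides; the function names and sample data are hypothetical.

```python
import math
from collections import Counter

def impurity(clusters):
    """Fraction of minority examples; clusters is a list of non-empty lists of class labels."""
    n = sum(len(c) for c in clusters)
    # For each cluster: size minus the count of its majority class.
    minority = sum(len(c) - Counter(c).most_common(1)[0][1] for c in clusters)
    return minority / n

def traditional_sc_fitness(clusters, n_classes, beta):
    """Impurity plus beta times a penalty for using more clusters than classes (assumed form)."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    penalty = math.sqrt((k - n_classes) / n) if k >= n_classes else 0.0
    return impurity(clusters) + beta * penalty

# Hypothetical clustering of 9 objects with classes A and B.
clusters = [["A", "A", "B"], ["B", "B", "B", "A"], ["A", "A"]]
print(impurity(clusters))                                      # 2 minority examples out of 9
print(traditional_sc_fitness(clusters, n_classes=2, beta=0.3))
```

With this form, a larger β pushes the search toward fewer clusters and a smaller β toward purer ones.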

17
Gerrymandering Fitness Function (1)

Seeks a clustering in which a particular class
(the class of interest) dominates as many clusters
as possible while minimizing the imbalance among
clusters

Example: a total of 15 objects, where class A has
6 objects and class B has 9 objects. Let the class
of interest be class A.
18
Gerrymandering Fitness Function (2)

The Gerrymandering fitness function incorporates
three different criteria
Maximize the number of clusters (regions) that
are dominated by a particular class
Match the number of regions specified by the user
(controlled by parameter β; k denotes the
user-specified number of regions desired)
Maintain equality of population (controlled by
parameter ?)
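The three criteria can each be measured separately, as the Python sketch below illustrates. It does not reproduce the gerrymandering fitness function itself, whose formula is not available in this transcript. The dominance rule (strictly more objects than any other class), the imbalance measure, and the sample clustering of the 15-object example (class A with 6 objects, class B with 9) are all assumptions made for illustration.

```python
from collections import Counter

def dominates(cluster, cls):
    """True if cls has strictly more objects in the (non-empty) cluster than any other class."""
    counts = Counter(cluster)
    return all(counts[cls] > v for key, v in counts.items() if key != cls)

def gerrymandering_ingredients(clusters, class_of_interest, k):
    """Return (dominated clusters, deviation from k regions, relative size imbalance)."""
    dominated = sum(1 for c in clusters if dominates(c, class_of_interest))
    region_deviation = abs(len(clusters) - k)
    sizes = [len(c) for c in clusters]
    mean_size = sum(sizes) / len(sizes)
    # Assumed imbalance measure: largest relative deviation from the mean cluster size.
    imbalance = max(abs(s - mean_size) for s in sizes) / mean_size
    return dominated, region_deviation, imbalance

# Hypothetical clustering of 15 objects: class A (6 objects) is the minority overall.
clusters = [["A", "A", "B"], ["A", "A", "B"], ["A", "A", "B"], ["B"] * 6]
print(gerrymandering_ingredients(clusters, "A", k=4))
```

In this made-up clustering class A dominates three of the four regions even though class B has more objects overall, at the price of one region being twice as large as the others.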

19
Gerrymandering Fitness Function (3)
20
Reward-based Fitness Function (1)

Evaluates a clustering based on the density of a
class of focus C and assigns rewards to regions
in which the distribution of class C
significantly deviates from the prior probability
of class C in the whole dataset.
The quality of a clustering qC(X) is the sum of
the rewards tC(c) associated with each cluster c
in X
Rewards are higher for larger clusters, an effect
controlled by parameter β

21
Reward-based Fitness Function (2)
22
Example of Reward-based Fitness Function

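Parameters: ?1 = 0.5, ?2 = 1.5, R+ = 1, R- = 1, β = 1.1
Prior(Poor) = 0.2, n = 1000
p(c1,Poor) = 20/50 = 0.4
p(c2,Poor) = 40/200 = 0.2
p(c3,Poor) = 10/200 = 0.05
p(c4,Poor) = 30/350 = 0.0857
p(c5,Poor) = 100/200 = 0.5
Ordering against the thresholds 0.1 and 0.3:
c3, c4 < 0.1 < c2 < 0.3 < c1, c5
Reward for c1: (0.4 - 0.3)/0.7 = 1/7
Reward for c3: (0.1 - 0.05)/0.1 = 1/2
qPoor(X) = (1/7 x 50)^1.1/1000 + 0
         + (1/2 x 200)^1.1/1000
         + (0.143 x 350)^1.1/1000
         + (2/7 x 200)^1.1/1000
         = 0.00869 + 0 + 0.15849 + 0.07402 + 0.08564
         = 0.32684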
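The worked example can be checked numerically. The Python sketch below is a minimal version under stated assumptions: the reward of a cluster is taken to be the scaled distance of its class density from the nearer threshold, a form inferred from the numbers in the example rather than quoted from the thesis, and each cluster contributes (reward x size) raised to β, divided by n. The names `low_factor` and `high_factor` stand for the two threshold parameters (0.5 and 1.5) and are hypothetical.

```python
def reward(p, prior, low_factor, high_factor, r_plus, r_minus):
    """Reward for a cluster whose density of the class of focus is p (inferred form)."""
    low, high = prior * low_factor, prior * high_factor
    if p > high:                       # class is over-represented in the cluster
        return r_plus * (p - high) / (1.0 - high)
    if p < low:                        # class is under-represented in the cluster
        return r_minus * (low - p) / low
    return 0.0                         # density close to the prior: no reward

def reward_fitness(clusters, prior, n, beta,
                   low_factor=0.5, high_factor=1.5, r_plus=1.0, r_minus=1.0):
    """clusters is a list of (objects of the class of focus, cluster size) pairs."""
    total = 0.0
    for hits, size in clusters:
        t = reward(hits / size, prior, low_factor, high_factor, r_plus, r_minus)
        total += (t * size) ** beta / n
    return total

# Clusters c1..c5 from the example: (poor objects, cluster size).
example = [(20, 50), (40, 200), (10, 200), (30, 350), (100, 200)]
q_poor = reward_fitness(example, prior=0.2, n=1000, beta=1.1)
print(round(q_poor, 3))  # 0.327
```

The result agrees with the slide's 0.32684 to three decimal places; the small remaining difference comes from the slide rounding the reward of c4 to 0.143.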
24
An Environment for Region Discovery
Spatial Datasets
Graphic Display Tool
Fitness Functions Support
RSC Algorithms
Environment for Region Discovery
26
Objectives of the Experiments

To illustrate how SRIDHCR and SCEC work in region
discovery for four Wyoming state spatial datasets
and two artificial spatial datasets
To evaluate the performance of SRIDHCR and SCEC
with three individual fitness functions in region
discovery
To study how the parameter values of the three
fitness functions affect the clustering results
(regions discovered) and to select a set of good
parameters for the fitness functions
To analyze and compare the performance of
SRIDHCR and SCEC in region discovery

27
Datasets Used in the Experiments

Artificial datasets: Matlab datasets
Wyoming datasets: created from U.S. Census Bureau
data

28
Creation of Wyoming Datasets
29
Original Wyoming Datasets (Census 2000)
Household Income in 1999
Poverty Status in 1999
Age
Race
30
Household Income in 1999
31
Poverty Status in 1999
32
Age
33
Race
34
Wyoming Poverty Dataset and Modified Poverty
Dataset
Wyoming Poverty Dataset
Modified Poverty Dataset
35
Example Output of Clustering

Each color represents a cluster
Classes are represented by different point
shapes
Representatives are circled in white

Clustering using
Traditional SC Fitness Function

37
Modified Poverty Dataset
38
SCEC Traditional SC Fitness Function (1)

Clustering Output of Modified Poverty Dataset
(parameter β = 0.3)

39
SCEC Traditional SC Fitness Function (2)

Clustering Output of Modified Poverty Dataset
(parameter β = 0.1)

40
SRIDHCR Traditional SC Fitness Function

Clustering Output of Modified Poverty Dataset
(parameter β = 0.3)

41
SRIDHCR Traditional SC Fitness Function

Clustering Output of Modified Poverty Dataset
(parameter β = 0.1)

Clustering using
Gerrymandering Fitness Function

43
Wyoming Age Dataset
44
Wyoming Modified Age Dataset
45
SCEC Gerrymandering Fitness Function (1)

Clustering Output of Wyoming Age Dataset
(parameters k = 7, β = 30000, ? = 0.01)

5 7
46
SCEC Gerrymandering Fitness Function (2)

Clustering Output of Wyoming Age Dataset
(parameters k = 12, β = 30000, ? = 0.01)

10 12
47
SCEC Gerrymandering Fitness Function (3)

Clustering Output of Wyoming Age Dataset
(parameters k = 12, β = 30000, ? = 0.08)

7 12
48
SRIDHCR Gerrymandering Fitness Function (1)

Clustering Output of Wyoming Age Dataset
(parameters k = 12, β = 30000, ? = 0.01)

11 12
49
SRIDHCR Gerrymandering Fitness Function (2)

Clustering Output of Wyoming Age Dataset
(parameters k = 12, β = 30000, ? = 0.08)

8 12
50

Clustering using
Reward-based Fitness Function

51
Wyoming Race Dataset
52
SCEC Reward-based Fitness Function (1)

Clustering Output of Wyoming Race Dataset
(parameters ?1 = 0.5, ?2 = 1.5, R+ = 10, R- = 10, β = 1.4)

53
Wyoming Poverty Dataset
54
SCEC Reward-based Fitness Function (2)

Clustering Output of Wyoming Poverty Dataset
(parameters ?1 = 0.5, ?2 = 1.5, R+ = 10, R- = 10, β = 1.1)

55
Modified Poverty Dataset
56
SCEC Reward-based Fitness Function (3)

Clustering Output of Modified Poverty Dataset
(parameters ?1 = 0.5, ?2 = 1.5, R+ = 10, R- = 10, β = 1.1)

57
Wyoming Income Dataset
58
SCEC Reward-based Fitness Function (4)

Clustering Output of Wyoming Income Dataset
class of interest: class 1
(parameters ?1 = 0.5, ?2 = 1.5, R+ = 10, R- = 10, β = 2)

59
SCEC Reward-based Fitness Function (5)

Clustering Output of Wyoming Income Dataset
class of interest: class 4
(parameters ?1 = 0.5, ?2 = 1.5, R+ = 10, R- = 10, β = 2)

60
SCEC Reward-based Fitness Function (6)

Clustering Output of Wyoming Income Dataset
class of interest: class 4
(parameters ?1 = 0.5, ?2 = 1.5, R+ = 10, R- = 0, β = 1.1)

61
SCEC Reward-based Fitness Function (7)

Clustering Output of Wyoming Income Dataset
class of interest: class 4
(parameters ?1 = 0.5, ?2 = 1.5, R+ = 0, R- = 10, β = 1.1)

62
Performance Comparison of SRIDHCR and SCEC
63
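Execution of Algorithm

SCEC
generation 0: 300 clusterings
generation 1: 300 clusterings
...
generation 50: 300 clusterings
Total: 50 x 300 clusterings

SRIDHCR (dataset size 5000)
random restart 0: 5000, 5000, ... insertion/deletion
clusterings (5000 per iteration, for i iterations)
random restart 1: 5000, ...
...
random restart 25: 5000, ...
Total: 25 x i x 5000 clusterings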
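The totals above are simple products, which the short Python sketch below spells out. The number of hill-climbing iterations per restart (i) is not given on the slide, so the value 10 used here is purely hypothetical, as are the function names.

```python
def scec_evaluations(generations, clusterings_per_generation):
    """Clusterings evaluated by SCEC: one batch per generation."""
    return generations * clusterings_per_generation

def sridhcr_evaluations(restarts, iterations_per_restart, dataset_size):
    """Clusterings evaluated by SRIDHCR: one candidate per object in every iteration."""
    return restarts * iterations_per_restart * dataset_size

print(scec_evaluations(50, 300))           # 15000
print(sridhcr_evaluations(25, 10, 5000))   # 1250000, with a hypothetical i = 10
```

Even with a modest number of iterations per restart, SRIDHCR evaluates far more clusterings than SCEC on a dataset of this size.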
64
Time Cost Comparison
65
Algorithm comparison

Solution Quality
SRIDHCR finds better solutions with the
traditional SC fitness function
SCEC finds better solutions with the gerrymandering
and reward-based fitness functions
Time Cost
SCEC runs very fast
SRIDHCR is very slow on large datasets

66
Outline

Goals of the Thesis
Introduction to Clustering
Supervised Clustering (SC)
Fitness Function
Experimental Results
Conclusion and Future Work

67
Conclusion

Both the traditional SC fitness function and the
reward-based fitness function show some merit in
terms of purity
It is difficult to satisfy all three criteria at
once with the gerrymandering fitness function
SCEC is fast for datasets of 100 to 3000 examples.
SRIDHCR is fast on 200 examples, acceptable on
1000 examples, slow on 3000 examples, and cannot
run beyond 5000 examples
The graphic display tool works for visualizing and
summarizing the clustering results
SCEC is the better SC algorithm for region
discovery in terms of both solution quality and
time cost

68
Future Work

More experiments and a more thorough experimental
evaluation of the fitness functions are still
needed
Improve the data structures to handle larger
distance matrices
Faster algorithms are needed to process datasets
with more than 5000 examples
Improve the graphic display tool to handle more
than 25 clusters