Region Discovery Using Supervised Clustering Algorithms

1
Region Discovery Using Supervised Clustering
Algorithms
  • Kim Keen Wee

2
Outline
  • Goals of the Thesis
  • Introduction
  • Supervised Clustering (SC)
  • Fitness Function for Region Discovery
  • An Environment for Region Discovery
  • Experimental Results
  • Conclusion and Future Work

3
Goals of the Thesis
  • Investigate using Supervised Clustering (SC)
    Algorithms for region discovery
  • Design a graphic display program to aid in
    visualizing the results of region discovery
  • Create census-based spatial datasets for the
    state of Wyoming
  • Analyze and compare the performance of SC
    algorithms in region discovery

4
Outline
  • Goals of the Thesis
  • Introduction
  • Supervised Clustering (SC)
  • Fitness Function for Region Discovery
  • An Environment for Region Discovery
  • Experimental Results
  • Conclusion and Future Work

5
Introduction
  • Spatial Data Mining: seeks to discover meaningful
    and interesting patterns from data where a key
    dimension of the data is geographical location
  • Region Discovery: subdivides a territory into
    disjoint, contiguous regions, optimizing some
    measure of interestingness

6
Overview of Clustering
  • Identify groups of objects (clusters) in a
    dataset according to their similarity with
    respect to a particular distance metric
  • Three types of clustering: unsupervised (or
    traditional) clustering, semi-supervised
    clustering, and supervised clustering

7
Outline
  • Goals of the Thesis
  • Introduction
  • Supervised Clustering (SC)
  • Fitness Function for Region Discovery
  • An Environment for Region Discovery
  • Experimental Results
  • Conclusion and Future Work

8
Supervised Clustering (SC)
  • Representative-based: finds a set of objects
    (representatives) that best represent the
    objects in a dataset
  • A solution is a set of representatives
  • Objects are assigned to their nearest
    representative to form clusters
  • The goal of SC is to find a clustering that
    minimizes the given fitness function (measure
    of interestingness)

9
Supervised Clustering Algorithms Used
  • SRIDHCR: Single Representative Insertion/Deletion
    steepest descent Hill Climbing with Randomized
    Restart algorithm
  • SCEC: Supervised Clustering using Evolutionary
    Computation

10
SRIDHCR Algorithm Pseudo-code
  • REPEAT r TIMES
      curr := a randomly created set of
              representatives (with size between c+1 and 2c)
      WHILE NOT DONE DO
        1. Create new solutions S by adding a single
           non-representative to curr and by removing a
           single representative from curr
        2. Determine the element s in S for which q(s)
           is minimal (if there is more than one minimal
           element, randomly pick one)
        3. IF q(s) < q(curr) THEN curr := s
           ELSE IF q(s) = q(curr) AND |s| > |curr|
                THEN curr := s
           ELSE terminate and return curr as the
                solution for this run
  • Report the best out of the r solutions found
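
The hill-climbing loop above can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the fitness built by `make_fitness` (impurity plus a small cost per cluster), the toy one-dimensional data, and every function and parameter name are assumptions made for the example. The add/remove neighbourhood, the steepest-descent choice with random tie-breaking, and the rule of accepting an equal-quality solution only when it has more representatives follow the pseudo-code.

```python
import random
from collections import Counter

def make_fitness(points, labels, beta=0.1):
    """Stand-in q(reps): impurity plus beta * (number of clusters) / n (an assumption)."""
    n = len(points)

    def q(reps):
        clusters = {rep: [] for rep in reps}
        for i, p in enumerate(points):
            # Assign each object to its nearest representative (squared Euclidean).
            nearest = min(reps, key=lambda rep: sum((a - b) ** 2
                                                    for a, b in zip(p, points[rep])))
            clusters[nearest].append(labels[i])
        minority = sum(len(m) - max(Counter(m).values())
                       for m in clusters.values() if m)
        return minority / n + beta * len(reps) / n

    return q

def sridhcr(q, n, r=5, c=2, seed=0):
    """Run r randomized restarts; a solution is a frozenset of representative indices."""
    rng = random.Random(seed)
    best = None
    for _ in range(r):
        size = rng.randint(min(c + 1, n), min(2 * c, n))
        curr = frozenset(rng.sample(range(n), size))
        while True:
            # Step 1: all solutions reachable by one insertion or one deletion.
            candidates = [curr | {i} for i in range(n) if i not in curr]
            if len(curr) > 1:
                candidates += [curr - {i} for i in curr]
            if not candidates:
                break
            # Step 2: steepest descent, ties broken at random.
            scores = [q(s) for s in candidates]
            q_min = min(scores)
            s = rng.choice([cand for cand, score in zip(candidates, scores)
                            if score == q_min])
            # Step 3: accept if strictly better, or equal with more representatives.
            q_curr = q(curr)
            if q_min < q_curr or (q_min == q_curr and len(s) > len(curr)):
                curr = s
            else:
                break
        if best is None or q(curr) < q(best):
            best = curr
    return best

# Toy data: two well-separated groups, one per class.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
labels = ["A", "A", "A", "B", "B", "B"]
best = sridhcr(make_fitness(points, labels), n=len(points))
```

On this toy data the search settles on two representatives, one inside each group, because any extra representative only adds cost and any single representative mixes the two classes.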

11
SCEC applies biological principles of natural
evolution
  • A Genetic Algorithm cycle

12
SCEC Algorithm Pseudo-code
  • t := 0
  • init_population P(t)
  • evaluate P(t)
  • WHILE NOT DONE DO
      t := t + 1
      P' := select_parents P(t)
      recombine P'(t)
      mutate P'(t)
      evaluate P'(t)
      P := survive P, P'(t)
  • Terminate when the pre-defined number of
    generations is reached; return the best solution
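
A generic version of this evolutionary loop might look like the sketch below. The slides do not say which selection, recombination, or mutation operators SCEC uses, so the binary tournament, the union-and-resample crossover, the add-or-drop mutation, and the toy fitness are all hypothetical choices made only to get a runnable loop. The one property the sketch relies on is that survival keeps the best individuals from parents and offspring combined, so the best fitness never gets worse from one generation to the next.

```python
import random

def scec_sketch(q, n, pop_size=20, generations=30, k0=3, seed=0):
    """Evolve sets of representative indices (subsets of range(n)) that minimise q."""
    rng = random.Random(seed)

    def select_parent(pop):
        # Hypothetical operator: binary tournament.
        x, y = rng.sample(pop, 2)
        return x if q(x) <= q(y) else y

    def recombine(a, b):
        # Hypothetical operator: resample from the union of both parents.
        pool = sorted(a | b)
        size = max(1, (len(a) + len(b)) // 2)
        return frozenset(rng.sample(pool, min(size, len(pool))))

    def mutate(s):
        # Hypothetical operator: drop one representative or add a new one.
        s = set(s)
        if len(s) > 1 and rng.random() < 0.5:
            s.remove(rng.choice(sorted(s)))
        else:
            outside = [i for i in range(n) if i not in s]
            if outside:
                s.add(rng.choice(outside))
        return frozenset(s)

    pop = [frozenset(rng.sample(range(n), k0)) for _ in range(pop_size)]
    history = [min(q(s) for s in pop)]          # best fitness per generation
    for _ in range(generations):
        offspring = [mutate(recombine(select_parent(pop), select_parent(pop)))
                     for _ in range(pop_size)]
        # survive: keep the pop_size best of parents and offspring together.
        pop = sorted(pop + offspring, key=q)[:pop_size]
        history.append(q(pop[0]))
    return pop[0], history

# Toy fitness (an assumption): prefer exactly three representatives, all even indices.
def toy_q(s):
    return abs(len(s) - 3) + sum(1 for i in s if i % 2)

best, history = scec_sketch(toy_q, n=12)
```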

13
Outline
  • Goals of the Thesis
  • Introduction
  • Supervised Clustering (SC)
  • Fitness Function for Region Discovery
  • An Environment for Region Discovery
  • Experimental Results
  • Conclusion and Future Work

14
Fitness Functions for Region Discovery
  • Three fitness functions to evaluate region
    discovery
  • Traditional SC Fitness Function
  • Gerrymandering Fitness Function
  • Reward-based Fitness Function

15
Traditional SC Fitness Function
  • Tries to maximize purity of clusters while
    keeping the number of clusters low

16
Example of Traditional SC Fitness Function
  • Identify majority class for each cluster
  • Count minority examples for each cluster
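
The two steps above amount to computing cluster impurity, which can be sketched as follows. The transcript does not show the full formula, so the penalty term is an assumption: the form β · sqrt((k − c)/n) for k ≥ c (k clusters, c classes, n objects) is the one usually given for the traditional supervised-clustering fitness function, and the example clusters are made up.

```python
from collections import Counter
from math import sqrt

def impurity(clusters):
    """Fraction of objects that do not belong to the majority class of their cluster."""
    n = sum(len(c) for c in clusters)
    minority = sum(len(c) - max(Counter(c).values()) for c in clusters if c)
    return minority / n

def traditional_fitness(clusters, num_classes, beta):
    """Impurity plus beta * penalty; the penalty form is an assumption (see lead-in)."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    penalty = sqrt((k - num_classes) / n) if k >= num_classes else 0.0
    return impurity(clusters) + beta * penalty

# Made-up clustering of 10 objects: each cluster is a list of class labels.
clusters = [list("AAAB"), list("BBB"), list("AAB")]
q = traditional_fitness(clusters, num_classes=2, beta=0.1)
```

Here two of the ten objects are minority examples, so the impurity is 0.2; a larger β penalizes the third cluster more heavily and pushes the search toward fewer clusters.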

17
Gerrymandering Fitness Function (1)
  • Seeks a clustering in which a particular class
    (the class of interest) dominates as many
    clusters as possible while minimizing the
    imbalance among clusters

Example: 15 objects in total; class A has 6 objects
and class B has 9 objects. Let the class of interest be class A.
18
Gerrymandering Fitness Function (2)
  • The Gerrymandering fitness function incorporates
    three different criteria:
  • Maximize the number of clusters (regions) that
    are dominated by a particular class
  • Match the number of regions specified by the user
    (controlled by parameter β; k denotes the
    user-specified number of regions desired)
  • Maintain equality of population (controlled by
    parameter ?)

19
Gerrymandering Fitness Function (3)
20
Reward-based Fitness Function (1)
  • Evaluates a clustering based on the density of a
    class of focus C and assigns rewards to regions
    in which the distribution of class C deviates
    significantly from the prior probability of
    class C in the whole dataset
  • The quality of a clustering, q_C(X), is the sum of
    the rewards τ_C(c) associated with each cluster c
    in X
  • With β > 1, the reward is higher for larger
    clusters

21
Reward-based Fitness Function (2)
22
Example of Reward-based Fitness Function
  • Parameters: γ1 = 0.5, γ2 = 1.5, R+ = 1, R− = 1, β = 1.1
  • Prior(Poor) = 0.2, n = 1000
  • p(c1,Poor) = 20/50 = 0.4
  • p(c2,Poor) = 40/200 = 0.2
  • p(c3,Poor) = 10/200 = 0.05
  • p(c4,Poor) = 30/350 = 0.0857
  • p(c5,Poor) = 100/200 = 0.5

  • Thresholds: c3, c4 < 0.1 ≤ c2 ≤ 0.3 < c1, c5
  • Rewards, e.g. c1: (0.4 − 0.3)/0.7 = 1/7;
    c3: (0.1 − 0.05)/0.1 = 1/2
  • q_Poor(X) = (1/7 x 50)^1.1/1000 + 0
    + (1/2 x 200)^1.1/1000 + (0.143 x 350)^1.1/1000
    + (2/7 x 200)^1.1/1000
    = 0.00869 + 0 + 0.15849 + 0.07402 + 0.08564
    = 0.32684
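
The arithmetic of this example can be replayed with a short Python sketch. The formula slide is missing from the transcript, so the reward function below is inferred from the numbers shown: a cluster earns a reward only when its density of the class of interest falls below Prior × γ1 or rises above Prior × γ2, and each cluster contributes (reward × cluster size)^β / n. Function and parameter names are assumptions, and any further parameters the original formula may have are left out.

```python
def reward(p, prior, gamma1, gamma2, r_plus, r_minus):
    """Reward of a cluster whose density of the class of interest is p (inferred form)."""
    low, high = prior * gamma1, prior * gamma2
    if p < low:                      # class is unusually rare in this cluster
        return (low - p) / low * r_minus
    if p > high:                     # class is unusually dense in this cluster
        return (p - high) / (1 - high) * r_plus
    return 0.0

def reward_fitness(clusters, prior, n, gamma1, gamma2, r_plus, r_minus, beta):
    """Sum of (reward * size)^beta / n over clusters given as (in_class, size) pairs."""
    total = 0.0
    for in_class, size in clusters:
        tau = reward(in_class / size, prior, gamma1, gamma2, r_plus, r_minus)
        total += (tau * size) ** beta / n
    return total

# (number of Poor objects, cluster size) for c1..c5, taken from the example.
clusters = [(20, 50), (40, 200), (10, 200), (30, 350), (100, 200)]
q_poor = reward_fitness(clusters, prior=0.2, n=1000,
                        gamma1=0.5, gamma2=1.5, r_plus=1, r_minus=1, beta=1.1)
```

The result agrees with the slide's 0.32684 to about three decimal places; the small difference comes from the slide rounding the reward of c4 to 0.143.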
23
Outline
  • Goals of the Thesis
  • Introduction
  • Supervised Clustering (SC)
  • Fitness Function for Region Discovery
  • An Environment for Region Discovery
  • Experimental Results
  • Conclusion and Future Work

24
An Environment for Region Discovery
  • Spatial Datasets
  • Graphic Display Tool
  • Fitness Functions Support
  • RSC Algorithms
25
Outline
  • Goals of the Thesis
  • Introduction
  • Supervised Clustering (SC)
  • Fitness Function for Region Discovery
  • An Environment for Region Discovery
  • Experimental Results
  • Conclusion and Future Work

26
Objectives of the Experiments
  • To illustrate how SRIDHCR and SCEC work in region
    discovery for four Wyoming state spatial datasets
    and two artificial spatial datasets
  • To evaluate the performance of SRIDHCR and SCEC
    with three individual fitness functions in region
    discovery
  • To study how the parameter values of the three
    fitness functions affect the clustering results
    (regions discovered) and to select a set of good
    parameters for the fitness functions
  • To analyze and compare the performances of
    SRIDHCR and SCEC in region discovery

27
Datasets Used in the Experiments
  • Artificial datasets: Matlab datasets
  • Wyoming datasets: created from U.S. Census
    Bureau data

28
Creation of Wyoming Datasets
29
Original Wyoming Datasets (Census 2000)
Household Income in 1999
Poverty Status in 1999
Age
Race
30
Household Income in 1999
31
Poverty Status in 1999
32
Age
33
Race
34
Wyoming Poverty Dataset and Modified Poverty
Dataset
Wyoming Poverty Dataset
Modified Poverty Dataset
35
Example Output of Clustering
  • Each color represents a cluster
  • Classes are represented by different point
    shapes
  • Representatives are circled in white

36
  • Clustering using
  • Traditional SC Fitness Function

37
Modified Poverty Dataset
38
SCEC Traditional SC Fitness Function (1)
  • Clustering Output of Modified Poverty Dataset
    (parameter β = 0.3)

39
SCEC Traditional SC Fitness Function (2)
  • Clustering Output of Modified Poverty Dataset
    (parameter β = 0.1)

40
SRIDHCR Traditional SC Fitness Function
  • Clustering Output of Modified Poverty Dataset
    (parameter β = 0.3)

41
SRIDHCR Traditional SC Fitness Function
  • Clustering Output of Modified Poverty Dataset
    (parameter β = 0.1)

42
  • Clustering using
  • Gerrymandering Fitness Function

43
Wyoming Age Dataset
44
Wyoming Modified Age Dataset
45
SCEC Gerrymandering Fitness Function (1)
  • Clustering Output of Wyoming Age Dataset
    (parameters k = 7, β = 30000, ? = 0.01)

5 7
46
SCEC Gerrymandering Fitness Function (2)
  • Clustering Output of Wyoming Age Dataset
    (parameters k = 12, β = 30000, ? = 0.01)

10 12
47
SCEC Gerrymandering Fitness Function (3)
  • Clustering Output of Wyoming Age Dataset
    (parameters k = 12, β = 30000, ? = 0.08)

7 12
48
SRIDHCR Gerrymandering Fitness Function (1)
  • Clustering Output of Wyoming Age Dataset
    (parameters k = 12, β = 30000, ? = 0.01)

11 12
49
SRIDHCR Gerrymandering Fitness Function (2)
  • Clustering Output of Wyoming Age Dataset
    (parameters k = 12, β = 30000, ? = 0.08)

8 12
50
  • Clustering using
  • Reward-based Fitness Function

51
Wyoming Race Dataset
52
SCEC Reward-based Fitness Function (1)
  • Clustering Output of Wyoming Race Dataset
    (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R− = 10, β = 1.4)

53
Wyoming Poverty Dataset
54
SCEC Reward-based Fitness Function (2)
  • Clustering Output of Wyoming Poverty Dataset
    (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R− = 10, β = 1.1)

55
Modified Poverty Dataset
56
SCEC Reward-based Fitness Function (3)
  • Clustering Output of Modified Poverty Dataset
    (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R− = 10, β = 1.1)

57
Wyoming Income Dataset
58
SCEC Reward-based Fitness Function (4)
  • Clustering Output of Wyoming Income Dataset
    (class of interest: class 1)
  • (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R− = 10, β = 2)

59
SCEC Reward-based Fitness Function (5)
  • Clustering Output of Wyoming Income Dataset
    (class of interest: class 4)
  • (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R− = 10, β = 2)

60
SCEC Reward-based Fitness Function (6)
  • Clustering Output of Wyoming Income Dataset
    (class of interest: class 4)
  • (parameters γ1 = 0.5, γ2 = 1.5, R+ = 10, R− = 0, β = 1.1)

61
SCEC Reward-based Fitness Function (7)
  • Clustering Output of Wyoming Income Dataset
    (class of interest: class 4)
  • (parameters γ1 = 0.5, γ2 = 1.5, R+ = 0, R− = 10, β = 1.1)

62
Comparison of the Performance of SRIDHCR and SCEC
63
Execution of Algorithm
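  • SCEC (by generation)
  • generation 0: 300 clusterings
  • generation 1: 300 clusterings
  • ...
  • generation 50: 300 clusterings
  • Total: 50 x 300 clusterings
  • SRIDHCR (by random restart, dataset size 5000)
  • restart 0: 5000 insertion/deletion clusterings
    per iteration, for i iterations
  • restart 1: 5000 clusterings per iteration, ...
  • ...
  • restart 25: 5000 clusterings per iteration, ...
  • Total: 25 x i x 5000 clusterings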
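
The two totals on this slide can be compared with a few lines of Python. The defaults below are the figures from the slide (50 generations of 300 clusterings for SCEC; 25 restarts on a dataset of 5000 objects for SRIDHCR); the value i = 10 used for the number of hill-climbing iterations per restart is a made-up illustration, since the slide leaves i open.

```python
def scec_clusterings(generations=50, population=300):
    """Clusterings evaluated by SCEC: one population per generation."""
    return generations * population

def sridhcr_clusterings(iterations, restarts=25, dataset_size=5000):
    """Clusterings evaluated by SRIDHCR: about dataset_size neighbours per iteration."""
    return restarts * iterations * dataset_size

scec_total = scec_clusterings()          # 50 x 300
sridhcr_total = sridhcr_clusterings(10)  # 25 x 10 x 5000, with i = 10 assumed
ratio = sridhcr_total / scec_total
```

Even with a modest number of iterations per restart, SRIDHCR evaluates far more clusterings than SCEC on a dataset of this size, which is consistent with the time-cost comparison that follows.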
64
Time Cost Comparison
65
Algorithm comparison
  • Solution Quality
  • SRIDHCR finds better solutions with the
    traditional SC fitness function
  • SCEC finds better solutions with the
    gerrymandering and reward-based fitness functions
  • Time Cost
  • SCEC runs very fast
  • SRIDHCR is very slow on large datasets

66
Outline
  • Goals of the Thesis
  • Introduction to Clustering
  • Supervised Clustering (SC)
  • Fitness Function
  • Experimental Results
  • Conclusion and Future Work

67
Conclusion
  • Both the traditional SC fitness function and the
    reward-based fitness function show some merit in
    terms of purity
  • It is difficult to satisfy all three criteria at
    once with the gerrymandering fitness function
  • SCEC is fast for datasets of 100-3000 examples;
    SRIDHCR is fast on 200 examples, acceptable on
    1000 examples, slow on 3000 examples, and cannot
    run beyond 5000 examples
  • The graphic display tool works for visualizing and
    summarizing the clustering results
  • SCEC is the better SC algorithm for region
    discovery in terms of both solution quality and
    time cost

68
Future Work
  • More experiments are needed on the fitness
    functions, along with a more thorough
    experimental evaluation
  • Improve the data structures to handle larger
    distance matrices
  • Faster algorithms are needed to process datasets
    with more than 5000 examples
  • Improve the graphic display tool to handle more
    than 25 clusters

69
  • Thank you!