Region Discovery Using Hierarchical Supervised Clustering

1
Region Discovery Using Hierarchical Supervised
Clustering
  • Jing Wang

2
Outline
  • Goals of the thesis
  • Introduction to Spatial Data Mining and Region
    Discovery
  • Introduction to Supervised Clustering
  • A Measure of Interestingness: Reward-based
    Fitness Function
  • Agglomerative Hierarchical Technique (SCAH)
  • Supervised Clustering using Multi-Resolution
    Grids (SCMRG)
  • Experimental Evaluation
  • Summary and Future Work

3
1. Goals of this thesis
  • Design Supervised Clustering Algorithms for
    Region Discovery
  • Evaluate Reward-based Fitness Function for Region
    Discovery
  • Analyze the performance of the developed
    algorithms for a benchmark of spatial datasets.

4
2. Introduction to Spatial Data Mining
  • Spatial data mining
    • is the process of discovering interesting,
      unexpected, and previously unknown but
      potentially useful patterns from large spatial
      datasets.
  • Typical tasks
    • Revelation of interesting groups
    • Co-location discovery
    • Location prediction
    • Spatial interaction
    • Hot spot discovery

5
Spatial dataset format
  • The raster data model
    • represents map features as cells in a grid
      matrix.
    • The matrix is organized by continuous, evenly
      spaced rows and columns.
    • Each cell is coded with an attribute that
      represents a data parameter that appears within
      the cell.
  • The vector data model
    • is used to represent discrete features that are
      defined as points, lines, and polygons in a
      geographic information system (GIS).
    • Vector data represent features as pairs of x, y
      coordinates.
    • A point is defined as a single x, y coordinate
      pair.
    • Lines and polygons are defined by a set or
      series of coordinate pairs.

6
Difference between Traditional Dataset and
Spatial Dataset
  • Spatial data usually comes in a continuous space,
    whereas traditional datasets are often discrete.
  • Spatial data mining usually focuses on the
    discussion of local patterns whereas traditional
    data mining techniques often focus on global
    patterns.
  • Data samples are assumed to be independently
    generated in traditional statistical analysis,
    whereas spatial data tends to be highly
    auto-correlated.

7
Region Discovery
  • Discovering interesting regions in spatial
    datasets
  • In particular, identifying disjoint, contiguous
    regions that are unusual with respect to the
    distribution of a given class
  • e.g. a region that contains an unusually low or
    high number of instances of a particular class
  • In a reasonable time.

8
Region Discovery and Related Work
  • [MAH98] Characterization and Trend Detection in
    Spatial Databases.
    • Starts with a small set of target objects (see
      Figure, left).
    • Expands regions around the target objects,
      simultaneously selecting those attributes of the
      regions for which the distribution of values
      differs significantly from the distribution in
      the whole database (Figure, right).

9
3. Supervised Clustering
10
Category of Clustering Algorithm
  • Hierarchical clustering methods
    • start either with all patterns in a single
      cluster or with each pattern as its own cluster,
      and successively split or merge clusters until a
      stopping criterion is met.
  • Partitional / representative-based clustering
    methods
    • start with an initial set of clusters and
      iteratively reassign each data object to one of
      the clusters, forming new clusters, until a
      stopping criterion is met.
  • Density-based clustering methods
    • try to find clusters based on the density of data
      points in a region. Dense regions of objects in
      the data space form a cluster.
  • Grid-based clustering algorithms
    • quantize the clustering space into a finite
      number of cells, and then perform the required
      operations on the quantized space. Cells that
      contain more than a certain number of points are
      treated as dense. The dense cells are then merged
      or split to form the clusters.
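
The grid-based idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not code from the thesis: the function names `dense_cells` and `merge_adjacent`, the cell size, the density threshold, and the sample points are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]

def dense_cells(points: Iterable[Tuple[float, float]],
                cell_size: float, min_points: int) -> Set[Cell]:
    """Quantize points into square cells; keep cells with more than min_points."""
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    return {cell for cell, n in counts.items() if n > min_points}

def merge_adjacent(cells: Set[Cell]) -> List[Set[Cell]]:
    """Merge dense cells that share an edge into clusters (flood fill)."""
    remaining = set(cells)
    clusters = []
    while remaining:
        stack = [remaining.pop()]
        cluster = set(stack)
        while stack:
            cx, cy = stack.pop()
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    cluster.add(nb)
                    stack.append(nb)
        clusters.append(cluster)
    return clusters

# Made-up sample: two neighboring dense cells and one isolated point.
points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.4),
          (1.1, 0.2), (1.5, 0.5), (1.9, 0.9),
          (5.5, 5.5)]
dense = dense_cells(points, cell_size=1.0, min_points=2)
clusters = merge_adjacent(dense)
```

Here the isolated point falls into a sparse cell and is dropped, while the two neighboring dense cells end up in the same cluster.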

11
4. Measuring the Interestingness of a Region using a
Reward-based Fitness Function
  • Evaluates a clustering based on the density of a
    class of interest C and assigns rewards to
    regions in which the distribution of class C
    deviates significantly from the prior probability
    of class C in the whole dataset.
  • The quality of a clustering: the sum of the
    rewards associated with each cluster.
  • The reward of a cluster c is computed as follows:
    $\mathrm{Reward}(c) = \mathrm{interestingness}(c) \cdot \mathrm{size}(c)^{\beta}$
  • Idea: a cluster's reward increases non-linearly
    with cluster size ($\beta > 1$).
  • Consequence: merging two clusters with equal
    reward leads to a better clustering, because
    $a^{\beta} + b^{\beta} < (a + b)^{\beta}$, with a and b
    being the sizes of the two clusters.
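
A quick numeric check of the size term makes the consequence concrete. The sketch below looks only at the size-dependent factor and assumes both clusters and the merged cluster have the same interestingness; the cluster sizes 40 and 60 and the helper name `size_term` are made up for the illustration.

```python
def size_term(size: int, beta: float) -> float:
    """Size-dependent factor of the reward: size raised to the power beta."""
    return size ** beta

beta = 1.1       # any beta > 1 shows the same effect
a, b = 40, 60    # hypothetical cluster sizes

separate = size_term(a, beta) + size_term(b, beta)  # two clusters kept apart
merged = size_term(a + b, beta)                     # the same objects as one cluster

print(f"separate: {separate:.2f}, merged: {merged:.2f}")
```

For beta = 1 the two values coincide, so it is the exponent above 1 that favors larger clusters.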

12
(No Transcript)
13
Example of Reward-based Fitness Function
  • Class of interest: Poor
  • Prior probability: 20%
  • γ1 = 0.5, γ2 = 1.5
  • R+ = 1, R- = 1
  • β = 1.1, η = 1
  • (Figure axis labels: 10, 30)

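
The parameter values on this slide can be plugged into a small sketch. The slides list only the values, so the piecewise form of `interestingness` below is an assumption: a cluster earns nothing while the density of the class of interest stays between prior × 0.5 and prior × 1.5, and is rewarded in proportion to how far it falls outside that band. All function and parameter names are hypothetical. Under this reading, the two thresholds are 10% and 30%, which would match the values 10 and 30 that appear on the slide.

```python
import math

def interestingness(p: float, prior: float, gamma1: float, gamma2: float,
                    r_plus: float, r_minus: float, eta: float) -> float:
    """Assumed piecewise form: zero inside the band, scaled deviation outside."""
    low = prior * gamma1    # below this density the class is unusually rare
    high = prior * gamma2   # above this density the class is unusually common
    if p < low:
        return ((low - p) / low) ** eta * r_minus
    if p > high:
        return ((p - high) / (1.0 - high)) ** eta * r_plus
    return 0.0

def reward(p: float, size: int, beta: float, **params: float) -> float:
    """Reward of a cluster: interestingness times size to the power beta."""
    return interestingness(p, **params) * size ** beta

# Values from the slide; the keyword names are this sketch's own.
PARAMS = dict(prior=0.20, gamma1=0.5, gamma2=1.5,
              r_plus=1.0, r_minus=1.0, eta=1.0)

no_reward = interestingness(0.20, **PARAMS)       # density equals the prior
high_density = interestingness(0.65, **PARAMS)    # well above the 30% threshold
cluster_reward = reward(0.65, 100, 1.1, **PARAMS)
```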
14
Outline
  • Goals of the thesis
  • Introduction to Spatial Data Mining and Region
    Discovery
  • Introduction to Supervised Clustering
  • Measures of Interestingness for Region Discovery
  • Agglomerative Hierarchical Technique (SCAH)
  • Supervised Clustering using Multi-Resolution
    Grids (SCMRG)
  • Experimental Evaluation
  • Summary and Future Work

15
SCAH-Description
  • SCAH constructs a two-dimensional dissimilarity
    matrix (N × N) that stores the spatial distances
    between clusters.
  • SCAH then computes the merge candidates, which
    are pairs of clusters that can potentially be
    merged.
  • A pair of clusters (ci, cj) is a merge candidate
    if ci is the closest cluster to cj or cj is the
    closest cluster to ci.
  • Merge candidates and distance information are
    updated incrementally when clusters are merged.
  • In each step, SCAH selects two clusters to merge
    from the pool of merge candidates so that this
    merge improves the overall quality the most.
  • The algorithm terminates if no such pair of
    clusters exists.
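
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it recomputes average-linkage distances in every iteration instead of maintaining the N × N matrix incrementally, and the `quality` function is a toy stand-in (purity-weighted size to the power β) rather than the reward-based fitness function. All names and the sample data are hypothetical.

```python
from math import dist
from typing import Callable, List, Sequence, Set, Tuple

Point = Tuple[float, float]
Cluster = List[int]  # indices into the point list

def average_distance(points: Sequence[Point], c1: Cluster, c2: Cluster) -> float:
    total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def merge_candidates(points: Sequence[Point],
                     clusters: List[Cluster]) -> Set[Tuple[int, int]]:
    """Pairs (i, j) in which one cluster is the closest cluster to the other."""
    candidates = set()
    for i, ci in enumerate(clusters):
        others = [j for j in range(len(clusters)) if j != i]
        nearest = min(others, key=lambda j: average_distance(points, ci, clusters[j]))
        candidates.add((min(i, nearest), max(i, nearest)))
    return candidates

def scah(points: Sequence[Point],
         quality: Callable[[List[Cluster]], float]) -> List[Cluster]:
    clusters = [[i] for i in range(len(points))]  # start with singletons
    best_q = quality(clusters)
    while len(clusters) > 1:
        best = None
        for i, j in merge_candidates(points, clusters):
            trial = [c for k, c in enumerate(clusters) if k not in (i, j)]
            trial.append(clusters[i] + clusters[j])
            q = quality(trial)
            if q > best_q:            # keep the merge that improves quality most
                best_q, best = q, trial
        if best is None:              # no candidate merge improves the quality
            break
        clusters = best
    return clusters

# Toy data: two well-separated groups with different class labels.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [1, 1, 1, 0, 0, 0]

def quality(clusters: List[Cluster], beta: float = 1.1) -> float:
    """Toy stand-in: pure clusters are rewarded, evenly mixed ones get nothing."""
    total = 0.0
    for c in clusters:
        p = sum(labels[i] for i in c) / len(c)
        total += abs(2 * p - 1) * len(c) ** beta
    return total

result = scah(points, quality)
```

On this toy input the loop stops with two pure clusters, because merging them would produce a mixed cluster with a lower quality.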

16
Inter-Cluster Distance Measures for SCAH
  • Single-linkage
    • suffers from the chaining phenomenon.
  • Complete-linkage
    • if the dataset contains noise or outliers, the
      farthest objects are very likely outliers.
  • Average-linkage
    • is computationally more expensive, but
    • the chaining problem is avoided and outliers have
      much less negative impact on clustering decisions.
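
The three measures differ only in how they aggregate the pairwise distances between two clusters. The following sketch uses made-up points, with one far-away point standing in for an outlier, to show how strongly each measure reacts to it.

```python
from math import dist
from typing import Sequence, Tuple

Point = Tuple[float, float]

def single_linkage(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Distance between the two closest members."""
    return min(dist(p, q) for p in a for q in b)

def complete_linkage(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Distance between the two farthest members."""
    return max(dist(p, q) for p in a for q in b)

def average_linkage(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean over all pairwise distances."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (10.0, 0.0)]  # the second point plays the outlier

d_single = single_linkage(cluster_a, cluster_b)      # 2.0
d_complete = complete_linkage(cluster_a, cluster_b)  # 10.0
d_average = average_linkage(cluster_a, cluster_b)    # 6.0
```

Complete linkage is determined entirely by the outlier, single linkage ignores it, and average linkage lands in between at the cost of visiting every pair.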

17
Flow chart
18
Outline
  • Goals of the thesis
  • Introduction to Spatial Data Mining and Region
    Discovery
  • Introduction to Supervised Clustering
  • Measure of Interestingness: Reward-based Fitness
    Function
  • Agglomerative Hierarchical Technique (SCAH)
  • Supervised Clustering using Multi-Resolution
    Grids (SCMRG)
  • Experimental Evaluation
  • Summary and Future Work

19
6. Supervised Clustering using
Multi-Resolution Grids
  • SCMRG is a hierarchical grid based method for
    reward-based region discovery.
  • SCMRG uses rectangular grid cells and searches in
    a top-down direction, subdividing a rectangular
    grid cell into sets of smaller rectangular grid
    cells.

STING [Wang97]
20
Example of SCMRG
  • If a cell receives a reward, and its reward is
    larger than the sum of the rewards of its
    children and larger than the sum of the rewards
    of its grandchildren, this cell is returned as a
    cluster by the algorithm.
  • If a cell does not receive a reward and its
    children and grandchildren do not receive a
    reward, neither the cell nor any of its
    descendants will be included in the computed
    clustering.
  • Otherwise, all the children of the cell are put
    on a queue for further processing.
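
The three rules can be sketched as a queue-driven traversal. The example assumes a cell tree whose rewards are already known and non-negative; an actual implementation would compute rewards from the data and subdivide cells on demand. The `Cell` class, the function names, and the reward values are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cell:
    name: str
    reward: float
    children: List["Cell"] = field(default_factory=list)

def children_reward(cell: Cell) -> float:
    return sum(ch.reward for ch in cell.children)

def grandchildren_reward(cell: Cell) -> float:
    return sum(g.reward for ch in cell.children for g in ch.children)

def scmrg(root: Cell) -> List[str]:
    clusters = []
    queue = deque([root])
    while queue:
        cell = queue.popleft()
        kids = children_reward(cell)
        grandkids = grandchildren_reward(cell)
        if cell.reward > 0 and cell.reward > kids and cell.reward > grandkids:
            clusters.append(cell.name)       # rule 1: return the cell as a cluster
        elif cell.reward == 0 and kids == 0 and grandkids == 0:
            continue                         # rule 2: drop the cell and its descendants
        else:
            queue.extend(cell.children)      # rule 3: look at finer resolution
    return clusters

# Made-up reward values for a small two-level grid.
root = Cell("root", 0.0, [
    Cell("A", 5.0, [Cell("A1", 2.0), Cell("A2", 1.0)]),
    Cell("B", 0.0),
    Cell("C", 1.0, [Cell("C1", 3.0), Cell("C2", 0.0)]),
    Cell("D", 0.0),
])
found = scmrg(root)
```

Cell A beats its children and is returned whole, while C is outscored by its children and is refined, so only C1 is returned from that branch.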

21
Pseudo code
22
Indexing Structure for SCMRG
23
Outline
  • Goals of the thesis
  • Introduction to Spatial Data Mining and Region
    Discovery
  • Introduction to Supervised Clustering
  • Measure of Interestingness: Reward-based Fitness
    Function
  • Agglomerative Hierarchical Technique (SCAH)
  • Supervised Clustering using Multi-Resolution
    Grids (SCMRG)
  • Experimental Evaluation
  • Summary and Future Work

24
Objectives of the Experiments
  • To illustrate how SCAH and SCMRG perform for
    different kinds of spatial datasets
  • To study how changes in the parameters of the
    fitness functions affect the clustering results
    and region discovery, in general.
  • To determine which supervised clustering
    algorithm performs well for which dataset and to
    identify the strengths and weaknesses of each
    algorithm.

25
Datasets Used
These two datasets were obtained from Salvador,
S. and Chan, P., who used them in their
publication "Determining the Number of
Clusters/Segments in Hierarchical
Clustering/Segmentation Algorithms."
26
Datasets Used
  • Obtained from the Geosciences Department at the
    University of Houston.
  • The Volcano dataset contains basic geographic and
    geologic information for volcanoes thought to be
    active in the last 10,000 years.
  • The original data include a unique volcano
    number, volcano name, location, latitude and
    longitude, summit elevation, volcano type, status,
    and the time range of the last recorded eruption.
  • The subset of the volcano dataset used in this
    thesis contains longitude, latitude, and a class
    variable that indicates whether a volcano is
    non-violent (blue) or violent (red).

27
Datasets Used
  • Obtained from the Geosciences Department at the
    University of Houston.
  • The Earthquake dataset contains worldwide
    earthquake data collected by the United States
    Geological Survey (USGS) National Earthquake
    Information Center (NEIC).
  • The modified Earthquake dataset contains the
    longitude, latitude, and a class variable that
    indicates the depth of the earthquake:
    0 (shallow), 1 (medium), or 2 (deep).

28
Datasets Used
  • Wyoming datasets were created from U.S. Census
    2000 data.
  • The Wyoming Modified Poverty Status in 1999 is a
    modified version of the original dataset, Wyoming
    Poverty Status.
  • The Wyoming Poverty Datasets were created using
    county statistics. For each county, random
    population coordinates were generated using the
    complete spatial randomness (CSR) functions in
    S-PLUS.
  • Then, the background information was attached to
    each individual county based on the county's
    distribution for the class of interest. Finally,
    all counties were merged into a single dataset
    that describes the whole state.
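
The generation procedure can be imitated with a short sketch. The thesis used the CSR functions in S-PLUS; here, uniform sampling inside an axis-aligned rectangle stands in for CSR, and the county names, bounds, populations, and poverty rates are invented placeholders rather than Census figures.

```python
import random
from typing import List, Tuple

Row = Tuple[float, float, str]

def generate_county(bounds: Tuple[float, float, float, float],
                    population: int, poor_rate: float,
                    rng: random.Random) -> List[Row]:
    """Scatter `population` points uniformly and label each by the county rate."""
    x_min, y_min, x_max, y_max = bounds
    rows = []
    for _ in range(population):
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(y_min, y_max)
        label = "Poor" if rng.random() < poor_rate else "NotPoor"
        rows.append((x, y, label))
    return rows

# Hypothetical county statistics: (bounds, population, share of class "Poor").
counties = {
    "county_a": ((0.0, 0.0, 1.0, 1.0), 200, 0.10),
    "county_b": ((1.0, 0.0, 2.0, 1.0), 300, 0.30),
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
state = [row
         for bounds, population, rate in counties.values()
         for row in generate_county(bounds, population, rate, rng)]
```

Merging the per-county lists gives one dataset for the whole state, with each point carrying its coordinates and class label.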

29
Datasets Used
30
Evaluation Measures Used
  • Number of clusters: the number of regions
    discovered.
    • SCMRG: the number of black cells that correspond
      to the set of discovered regions receiving the
      highest reward.
    • SCAH: the number of clusters is reported.
  • Outliers
    • SCMRG: if a region does not receive a reward,
      the objects belonging to this region are
      considered to be outliers.
    • SCAH: every object has to belong to a cluster;
      therefore, there are no outliers when running
      SCAH.
  • Quality: the result of the reward-based fitness
    function.
  • Cluster purity
    • is measured by the percentage of majority
      examples in the clusters that are returned by
      the supervised clustering algorithm.
    • A majority example in a cluster is an instance
      that belongs to the most frequent class in that
      cluster.
  • Time complexity: the total time used to process
    the whole dataset and produce the final
    clustering result. In the experiments, we used
    Wall-Clock Time (WCT) in seconds.
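
Cluster purity as defined above can be computed directly from the class labels in each cluster. The helper name `cluster_purity` and the labels below are made up for the illustration.

```python
from collections import Counter
from typing import List, Sequence

def cluster_purity(clusters: Sequence[Sequence[str]]) -> float:
    """Share of objects that belong to the majority class of their cluster."""
    majority = sum(max(Counter(c).values()) for c in clusters if c)
    total = sum(len(c) for c in clusters)
    return majority / total

# Each inner list holds the class labels of the objects in one cluster.
clusters: List[List[str]] = [
    ["poor", "poor", "poor", "rich"],  # 3 majority examples
    ["rich", "rich", "poor"],          # 2 majority examples
    ["rich"],                          # 1 majority example
]
purity = cluster_purity(clusters)      # 6 of 8 objects -> 0.75
```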

31
Input Parameters Tested
32
Using different Inter-Cluster Distance Measures
for SCAH
33
Complex9 Experimental Result Analysis
  • The outside elliptical shape belongs to class 1
    and the two spots inside belong to class 0.
  • Based on average distance, the merge candidates
    are (1, 6), (5, 6), (3, 7), (2, 7).
  • SCAH stops here, since no single merge improves
    the quality. However, for β = 3 a higher reward
    could be obtained by merging all 7 clusters, but
    SCAH fails to do so.

34
Experimental Results for SCAH
35
SCAH Earthquake Results
36
SCAH Volcano Original Dataset
37
SCAH Volcano Experimental Result
38
SCAH - Advantage and Weakness
  • SCAH does not need prior knowledge of the
    dataset to run, and it processes the objects
    based on the distance matrix. It can precisely
    find the nearest neighbor of each cluster and
    produce the merge candidates for the next
    iteration.
  • SCAH does a good job of picking up small pure
    clusters, such as the chain-like patterns in the
    Volcano dataset, which representative-based
    clustering algorithms approximate by a union of
    clusters.
  • SCAH's restriction to one merge candidate per
    cluster is too restrictive to get good results
    as β increases.
  • SCAH also has no look-ahead option and terminates
    at a certain step in the experiments on the
    Complex9 dataset even when β increases.
  • Moreover, SCAH has to store a two-dimensional
    distance matrix, so the complexity of this
    algorithm is O(N²), where N is the total number
    of objects in the dataset. Therefore, space
    complexity and time complexity remain the main
    challenges for this algorithm.
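
To get a feeling for the quadratic space cost, the following back-of-the-envelope sketch estimates the memory needed for the distance matrix. The 8 bytes per entry (a double-precision float) and the dataset size of 10,000 objects are assumptions chosen for the example.

```python
def distance_matrix_bytes(n: int, bytes_per_entry: int = 8) -> int:
    """Memory for a full N x N matrix."""
    return n * n * bytes_per_entry

def condensed_bytes(n: int, bytes_per_entry: int = 8) -> int:
    """Memory when only the N*(N-1)/2 entries above the diagonal are stored."""
    return n * (n - 1) // 2 * bytes_per_entry

n = 10_000
full = distance_matrix_bytes(n)   # 800,000,000 bytes, about 800 MB
half = condensed_bytes(n)         # 399,960,000 bytes, about 400 MB
```

Storing only one triangle of the symmetric matrix roughly halves the footprint, but the growth stays quadratic in N.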

39
SCMRG - Experimental Results
  • The color yellow is used to indicate reward
    regions that have a high density with respect to
    the class of interest.
  • The color black indicates reward regions with a
    very low density of the class of interest.
  • Non-reward regions will not be colored at all.

40
β = 1.01, η = 6
41
β = 1.3, η = 1
42
β = 3, η = 1
43
SCMRG Complex9
44
SCMRG Complex8
45
SCMRG Volcano
46
SCMRG Earthquake
47
SCMRG - Advantage and Weakness
  • The main advantage of this approach is its fast
    processing time compared with SCAH. The time
    complexity is determined by the number of cells
    added to the queue for further processing at
    each level, which makes this algorithm much
    faster than SCAH.
  • The disadvantage of SCMRG is that the boundaries
    of the regions it discovers can only be vertical
    or horizontal, since the discovered regions are
    composed of grid cells.

48
Comparison of Experimental Results
49
Comparison of Experimental Results
50
Purity and Quality
51
Comparison
  • SCAH outperforms SCMRG and SCEC on the purity and
    quality of the clustering result. However, SCAH
    does not do well for β = 3: it only considers
    one merge candidate per cluster, which is too
    restrictive to get good results. As β increases,
    the algorithm gets stuck because of this
    restriction.
  • SCMRG dramatically outperforms SCAH with respect
    to time consumed. SCMRG needs several seconds to
    process thousands of objects in a dataset, while
    SCAH takes more than an hour to process the same
    dataset.
  • Compared with SCAH, SCEC takes much more time to
    process the same dataset. SCEC takes more than
    14 hours to process the earthquake 1 dataset,
    whereas SCMRG takes just 4 minutes to process
    the same dataset. Normally, the time consumed
    using SCEC is 3 to 5 times more than that of
    SCAH. With respect to running time, SCMRG has
    the best performance, and SCAH is better than
    SCEC.

52
Summary
  • We introduced two supervised clustering
    approaches and a reward-based evaluation
    framework suitable for region discovery problems.
  • Finding interesting regions in spatial datasets
    is viewed as a clustering problem, in which the
    sum of rewards for the obtained clusters is
    maximized, and where the reward associated with
    each cluster reflects its degree of
    interestingness for the problem at hand.
  • This approach is unique and quite different from
    most other work in spatial data mining, which
    mostly uses association rules. Different measures
    of interestingness can easily be supported in the
    proposed framework by designing different cluster
    reward functions that correspond to different
    measures of interestingness.
  • In this case, neither the supervised clustering
    algorithm itself nor our general evaluation
    framework has to be modified.
  • We also discussed how hierarchical and grid-based
    clustering algorithms can be adapted for
    supervised clustering in general, and for region
    discovery in particular, and provided evidence
    for the usefulness of the proposed framework for
    hotspot discovery problems.

53
Future Work
  • The algorithms should be tested with other reward
    functions.
  • For SCAH, strategies should be developed to keep
    the algorithm from getting stuck and returning
    too many regions.
  • SCMRG needs a better way to visualize regions and
    has to approximate them more precisely.
  • More comparisons with other region discovery
    algorithms are needed as well.

54
References
  • [1] [Fayyad96] Usama M. Fayyad, "Tutorial on
    Knowledge Discovery and Data Mining,"
    IJCAI-95/KDD-95.
  • [2] [NgHan94] R. Ng and J. Han, "Efficient and
    Effective Clustering Method for Spatial Data
    Mining," Proc. of 1994 Int'l Conf. on Very Large
    Data Bases (VLDB'94), Santiago, Chile, September
    1994, pp. 144-155.
  • [3] [MAHKDD98] Ester M., Frommelt A., Kriegel
    H.-P., Sander J., "Algorithms for
    Characterization and Trend Detection in Spatial
    Databases," Proc. 4th Int. Conf. on Knowledge
    Discovery and Data Mining (KDD'98), New York
    City, NY, 1998, pp. 44-50.
  • [4] [KR87] [KR90] clustering large applications
    based on randomized search.
  • [5] [NH02] Raymond T. Ng and Jiawei Han,
    "CLARANS: A Method for Clustering Objects for
    Spatial Data Mining," IEEE Transactions on
    Knowledge and Data Engineering, Vol. 14, No. 5,
    September/October 2002.
  • [6] [EST96] Ester, Kriegel, Sander, and Xu,
    "DBSCAN: Density Based Spatial Clustering of
    Applications with Noise," Knowledge Discovery and
    Data Mining (KDD96).
  • [7] [ABHS99] Mihael Ankerst, Markus M. Breunig,
    Hans-Peter Kriegel, Jorg Sander, "OPTICS:
    Ordering Points To Identify the Clustering
    Structure," Proc. ACM SIGMOD'99 Int. Conf. on
    Management of Data, Philadelphia PA, 1999.
  • [8] [WANG97] Wei Wang, Jiong Yang, Richard Muntz,
    "STING: A Statistical Information Grid Approach
    to Spatial Data Mining," Twenty-Third
    International Conference on Very Large Data
    Bases, 1997.
  • [9] [BBM03] Basu, S., Bilenko, M., Mooney, R.,
    "Comparing and Unifying Search-based and
    Similarity-Based Approaches to Semi-Supervised
    Clustering," in Proc. ICML03, pp. 42-49,
    Washington DC, August 2003.
  • [10] [BHSW03] Bar-Hillel, A., Hertz, T., Shental,
    N., Weinshall, D., "Learning Distance Functions
    Using Equivalence Relations," in Proc. ICML03,
    Washington DC, August 2003.
  • [11] [DBE99] Demiriz, A., Bennett, K.-P.,
    Embrechts, M.J., "Semi-supervised Clustering
    using Genetic Algorithms," in Proc. ANNIE99.
  • [12] [ERCBV04] Eick, C., Rouhana, A., Chen, C.,
    Bagherjeiran, A., Vilalta, R., "Using Clustering
    to Learn Distance Functions for Supervised
    Similarity Assessment," submitted for
    publication.
  • [13] [EZ05] C. Eick, N. Zeidat, "Using Supervised
    Clustering to Enhance Classifiers," in Proc. 15th
    International Symposium on Methodologies for
    Intelligent Systems (ISMIS), Saratoga Springs,
    New York, May 2005.
  • [14] [EZV04] "Using Representative-Based
    Clustering for Nearest Neighbor Dataset Editing,"
    in Proc. IEEE Int. Conference on Data Mining,
    Brighton, England, Nov. 2004.
  • [15] [KKM02] Klein, D., Kamvar, S.-D., Manning,
    C., "From Instance-level Constraints to
    Space-level Constraints: Making the Most of
    Prior Knowledge in Data Clustering," in Proc.
    ICML02, Sydney, Australia.
  • [16] [KR90] Kaufman L. and Rousseeuw P. J.,
    "Finding Groups in Data: an Introduction to
    Cluster Analysis," John Wiley & Sons, 1990.
  • [17] [ST99] Slonim, N. and Tishby, N.,
    "Agglomerative Information Bottleneck," Neural
    Information Processing Systems (NIPS-1999).
  • [18] [Z04] Zhao, Z., "Evolutionary Computing and
    Splitting Algorithms for Supervised Clustering,"
    Master's Thesis, University of Houston,
    Department of Computer Science, May 2004,
    cs.uh.edu/zhenzhao/ZhenghongThesis.zip.
  • [19] [ZE04] N. Zeidat and C. Eick, "K-medoid-style
    Clustering Algorithms for Supervised Summary
    Generation," in Proc. 2004 International
    Conference on Machine Learning Models,
    Technologies and Applications (MLMTA'04), Las
    Vegas, Nevada, June 2004.