Automated Hierarchical Density Shaving: A robust, automated clustering and visualization framework

1
Automated Hierarchical Density Shaving: A
robust, automated clustering and visualization
framework
  • Gunjan Gupta, Alex Liu, Joydeep Ghosh
  • Nov 3, 2006

2
Why cluster only a part of the data into dense
clusters?
4
Why cluster only a part of the data into dense
clusters?
Exhaustive clustering (K-Means) result
5
Why cluster only a part of the data into dense
clusters?
  • Little or no labeled data available.
  • Only a part of the data clusters well.
  • Or only a fraction of the data is relevant.

6
Biological Applications
  • A few genes cluster well and form tight
    clusters
  • - Gene expression data
  • - Phylogenetic profiles
  • - Protein clustering
  • - Finding functional groupings in pathway networks
  • Goals
  • - Cluster only a (small) part of the data into
    multiple dense clusters.
  • - Visualization for understanding and verifying
    clusters.

7
Other Application Scenarios
  • Document retrieval
  • - The user is interested in a few highly relevant
    documents.
  • Market basket data
  • - Only some customers have highly correlated
    behaviors.
  • Feature selection

8
Practical Issues
  • How many dense clusters are there, and where are
    they located?
  • What fraction of the data should be clustered?
  • Which notion of density?
  • Clusters are not necessarily all equally dense.
  • Choice of model and distance measure.

9
Density Based Clustering Algorithms
  • HMA (Wishart, 1968)
  • DBSCAN (Ester et al., 1996)
  • Density Shaving (DS) and Auto-HDS framework.
  • It can be shown that there is a strong connection
    between the above algorithms.
  • All three use a uniform density kernel (density is
    proportional to the number of points within some
    radius).
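
The uniform-kernel notion of density on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `uniform_kernel_density`, the Euclidean distance, and the choice not to count a point as its own neighbor are not from the slides.

```python
import math

def uniform_kernel_density(points, radius):
    """For each point, count the OTHER points that lie within `radius`.

    Assumption: Euclidean distance; a point is not its own neighbor.
    """
    counts = []
    for i, p in enumerate(points):
        c = sum(1 for j, q in enumerate(points)
                if j != i and math.dist(p, q) <= radius)
        counts.append(c)
    return counts

# Three points close together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(uniform_kernel_density(points, radius=0.5))  # [2, 2, 2, 0]
```

The isolated point gets a count of zero, which is the kind of point a density-based method would leave unclustered.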

10
Density Shaving (DS)
12
Density Shaving (DS)
Two inputs:
1. f_shave: fraction of the data to shave (0.38)
2. n_ε: minimum number of neighbors (3)
Uses a trick to automatically compute the correct
ball radius from f_shave and n_ε.
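
The slide does not spell out the trick. One plausible reading, offered here only as an assumption, is to look at each point's distance to its n-th nearest neighbor (with n the minimum-neighbor input, written `n_eps` below) and pick the radius at which exactly the densest (1 - f_shave) fraction of points still have at least `n_eps` neighbors. The function name `shaving_radius` and the rounding rule are hypothetical.

```python
import math

def shaving_radius(points, f_shave, n_eps):
    """Radius at which about (1 - f_shave) of the points have >= n_eps neighbors.

    Assumptions (not from the slides): Euclidean distance, a point is not
    its own neighbor, and the number of points kept is rounded up.
    Requires n_eps <= len(points) - 1.
    """
    n = len(points)
    kth = []  # distance from each point to its n_eps-th nearest neighbor
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kth.append(d[n_eps - 1])
    n_keep = max(1, math.ceil(n * (1.0 - f_shave)))  # dense points to keep
    return sorted(kth)[n_keep - 1]

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(shaving_radius(points, f_shave=0.25, n_eps=1))  # 1.0
```

With this radius, the three points on the left each have a neighbor within distance 1.0, while the point at x = 10 does not and would be shaved.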
13
Density Shaving (DS)
Performs graph traversal on dense points to
identify clusters.
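
A minimal sketch of this step, assuming that two dense points belong to the same cluster when they are linked by a chain of dense points lying within the ball radius of one another. The name `density_shave`, the breadth-first traversal, and the use of label 0 for unclustered points are choices made for this illustration.

```python
import math
from collections import deque

def density_shave(points, radius, n_eps):
    """Label dense points by connected component; other points get label 0."""
    n = len(points)
    # Neighbor lists under a uniform kernel (a point is not its own neighbor).
    nbrs = [[j for j in range(n)
             if j != i and math.dist(points[i], points[j]) <= radius]
            for i in range(n)]
    dense = [len(nb) >= n_eps for nb in nbrs]
    labels = [0] * n
    next_id = 0
    for start in range(n):
        if not dense[start] or labels[start] != 0:
            continue
        next_id += 1
        labels[start] = next_id
        queue = deque([start])
        while queue:  # breadth-first traversal over dense points only
            i = queue.popleft()
            for j in nbrs[i]:
                if dense[j] and labels[j] == 0:
                    labels[j] = next_id
                    queue.append(j)
    return labels

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0), (50.0, 0.0)]
print(density_shave(points, radius=1.0, n_eps=1))  # [1, 1, 1, 2, 2, 2, 0]
```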
14
Density Shaving (DS)
"Don't care" points
15
Properties of DS
  • Increasing n_ε has a smoothing effect on
    clustering.

Figure: panels for n_ε = 5 and n_ε = 50; the legend distinguishes dense points from "don't care" points.
16
Properties of DS
  • For a fixed n_ε, successive runs of DS with
    increasing data shaving (f_shave) result in a
    hierarchy of clusters.

Figure: 2-D Gaussian example (1298 pts, 5 Gaussians plus uniform background), with panels at 15% shaving and 38% shaving, both with n_ε = 25.
17
Properties of DS
  • With a fixed n_ε, successive runs of DS with
    increasing shaving (f_shave) result in a
    hierarchy of clusters.
  • Clusters can
  • - split
  • - vanish
  • Points in separate clusters never merge into one
    cluster.

Figure: 2-D Gaussian example (1298 pts, 5 Gaussians plus uniform background), with panels at 15%, 38%, 62%, and 85% shaving.
18
Hierarchical Density Shaving (HDS)
  • Uses geometric/exponential shaving to create the
    hierarchy from DS.
  • Starting from all the data, a fixed fraction
    r_shave of the remaining data is shaved each
    iteration.
  • Clusters that lose points without splitting keep
    the same ID.

Figure (example): clusters A and B carry the same labels at 38% shaving and at 55% shaving.
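
To make the geometric shaving concrete, the sketch below lists how many points would remain at each level if a fixed fraction of the remaining points is removed per iteration. The function name `shaving_schedule`, the rounding down, and the stopping rule are assumptions of this illustration, not details given on the slide.

```python
import math

def shaving_schedule(n, r_shave, min_points=1):
    """Points kept at each level when a fraction r_shave is shaved per iteration."""
    kept = [n]
    while True:
        nxt = math.floor(kept[-1] * (1.0 - r_shave))
        nxt = min(nxt, kept[-1] - 1)  # always remove at least one point
        if nxt < min_points:
            break
        kept.append(nxt)
    return kept

kept = shaving_schedule(16, r_shave=0.5)
print(kept)                           # [16, 8, 4, 2, 1]
print([1 - k / 16 for k in kept])     # [0.0, 0.5, 0.75, 0.875, 0.9375]
```

The second line shows the cumulative fraction shaved at each level, which is the value a single DS run at that level would receive as its shaving fraction. Because the schedule is geometric, the number of levels grows only logarithmically with the number of points.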
19
An important trick: Dictionary Row Sort on HDS
Label Matrix
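
A dictionary (lexicographic) row sort can be sketched as follows, assuming the label matrix has one row per point and one column per shaving iteration, with 0 marking a point that has been shaved. The layout, the function name `dictionary_row_sort`, and the toy matrix are assumptions made for illustration.

```python
def dictionary_row_sort(label_matrix):
    """Sort rows lexicographically; return the row order and the sorted matrix."""
    order = sorted(range(len(label_matrix)),
                   key=lambda i: tuple(label_matrix[i]))
    return order, [label_matrix[i] for i in order]

# Rows are points, columns are shaving iterations, 0 means shaved.
# Cluster 1 shrinks and vanishes; cluster 2 splits into clusters 3 and 4.
labels = [
    [1, 1, 0],
    [2, 3, 3],
    [1, 0, 0],
    [2, 4, 4],
    [1, 1, 0],
]
order, sorted_labels = dictionary_row_sort(labels)
print(order)          # [2, 0, 4, 1, 3]
print(sorted_labels)  # [[1, 0, 0], [1, 1, 0], [1, 1, 0], [2, 3, 3], [2, 4, 4]]
```

After sorting, points that share a cluster at every level sit in adjacent rows, so each cluster occupies a contiguous band of rows when the matrix is plotted.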
20
Visualization using the sorted Label Matrix
  • Sorted matrix plotted
  • Each cluster plotted in a unique color
  • "Don't care" points plotted in the background color
  • Shows the compact, 8-node hierarchy

Figure: sorted label matrix with labels A (38% shaving), B (62% shaving), and C (85% shaving); axes: sorted row index and shaving iteration.
21
Cluster Stability
Shaving iteration corresponds to level (22 iterations in this example).
Stability: the difference between the first and last level of
a cluster, i.e., the number of shaving
iterations for which the cluster exists.

Figure: spatially relevant projection, with axes labeled by level.
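
Under the same assumed label-matrix layout (rows are points, columns are levels, 0 means shaved), stability can be read off by recording the first and last level at which each cluster ID appears. Counting both end levels (the `+ 1` below) is a convention chosen for this sketch, and `cluster_stability` is a hypothetical name.

```python
def cluster_stability(label_matrix):
    """Map each cluster ID to the number of levels it spans."""
    levels = {}
    for row in label_matrix:
        for level, cid in enumerate(row):
            if cid != 0:  # 0 marks a shaved ("don't care") point
                levels.setdefault(cid, set()).add(level)
    return {cid: max(lv) - min(lv) + 1 for cid, lv in levels.items()}

labels = [
    [1, 1, 0],
    [2, 3, 3],
    [1, 0, 0],
    [2, 4, 4],
    [1, 1, 0],
]
print(cluster_stability(labels))  # {1: 2, 2: 1, 3: 2, 4: 2}
```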
22
Cluster Selection
  • We can show that relative stability is independent
    of the shaving rate r_shave.
  • All clusters can be ranked by stability, even
    parents and children.
  • One way of selecting clusters:
  • - The highest-stability clusters are picked first.
  • - Parents and children of picked clusters are
    eliminated.
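
The selection rule above can be sketched as a greedy loop. This version assumes that "parents and children" means all ancestors and descendants in the hierarchy, that the hierarchy is given as a child-to-parent map, and that ties are broken by cluster ID; the name `select_clusters` and the toy inputs are hypothetical.

```python
def select_clusters(stability, parent):
    """Pick clusters by decreasing stability, skipping relatives of earlier picks.

    stability: dict mapping cluster ID -> stability score
    parent:    dict mapping cluster ID -> parent ID (roots are absent)
    """
    def ancestors(c):
        out = set()
        while parent.get(c) is not None:
            c = parent[c]
            out.add(c)
        return out

    picked, blocked = [], set()
    for c in sorted(stability, key=lambda c: (-stability[c], c)):
        if c in blocked:
            continue
        picked.append(c)
        blocked |= ancestors(c)                                   # eliminate ancestors
        blocked |= {d for d in stability if c in ancestors(d)}    # and descendants
    return picked

# Cluster 2 is the parent of clusters 3 and 4; cluster 1 is unrelated.
parent = {3: 2, 4: 2}
print(select_clusters({1: 5, 2: 1, 3: 4, 4: 2}, parent))  # [1, 3, 4]
print(select_clusters({1: 5, 2: 6, 3: 4, 4: 2}, parent))  # [2, 1]
```

In the first call the children outrank their parent, so the parent is dropped; in the second the parent is the most stable, so both children are dropped.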

23
HDS + Visualization + Model selection = Auto-HDS
  • Auto-HDS
  • - Finds all modes/clusters.
  • - Finds clusters of varying density
    simultaneously.

Figure: Auto-HDS (546 pts) compared with DS (546 pts, f_shave = 0.58).
24
Datasets
  • Gasch dataset
  • - Microarray data
  • - 173 experiments on yeast across 6151 dimensions
  • Lee dataset
  • - Microarray data
  • - 5612 genes across 591 experiments

25
Results Gasch Dataset
26
Results Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Reference pool, not stressed
Heat Shock
YPD
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
27
Results Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Heat Shock
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
28
Gasch results (ARI)
29
Results Lee Dataset
30
Results Lee Dataset
31
Properties of Auto-HDS
  • Fast: O(n n_ε log n) using a heap-based
    implementation.
  • Gene DIVER: a memory-efficient heap-based
    implementation.
  • Extremely compact hierarchy of clusters.
  • Visualization
  • - Creates a spatially relevant 2-D projection of
    points and clusters.
  • - Spatially relevant 2-D projection of the compact
    hierarchy.
  • Model selection
  • - Can define a notion of stability (analogous to
    cluster height).
  • - Based on stability, can select the most stable
    clusters automatically.

32
Gene DIVER
  • Gene Density Interactive Visual ExplorER
  • A scalable implementation of Auto-HDS using
    streaming data instead of main memory.
  • Special features for browsing clusters.
  • Special features for biological data mining.
  • Available for download at
    http://www.ideal.ece.utexas.edu/gunjan/genediver

Let's see the Gene DIVER demo now.
33
Gene DIVER: loading data
34
Gene DIVER: clustering
35
Gene DIVER: browsing the clustering
36
Gene DIVER: browsing a gene in BioGRID
37
Gene DIVER: browsing a cluster in FunSpec
38
Gene DIVER: browsing the clustering with auto zoom
39
Conclusion
  • Auto-HDS improves upon non-parametric
    density-based clustering in many ways:
  • - Well suited for very high-dimensional datasets.
  • - A powerful visualization.
  • - Interactive clustering, compact hierarchy.
  • - Automatic selection of clusters.
  • Gene DIVER: a powerful tool for the data mining
    community, and especially for bioinformatics
    practitioners.
  • - Browsing clustered genes through BioGRID
  • - Verifying gene clusters (yeast only) using
    FunSpec
  • - Ability to specify custom descriptions for
    browsing

40
?
41
Conclusion
  • We have introduced a framework that incorporates:
  • - Robust hierarchical, density-based clustering
  • - Automatic cluster selection
  • - Powerful, compact visualization of the hierarchy,
    clusters, and points
  • Good empirical results on biological data
  • Gene DIVER
  • - A fast, memory-efficient, interactive Java
    implementation of Auto-HDS with Swing-based
    visualization:
    http://www.ideal.ece.utexas.edu/gunjan/genediver
  • - Supports features relevant to biologists, such as
    the ability to enter their own data and distance
    measure, browse discovered gene clusters, etc.

42
Results Lee Dataset