Automated Hierarchical Density Shaving: A robust, automated clustering and visualization framework


Provided by: lansEce


Transcript and Presenter's Notes


1
Automated Hierarchical Density Shaving: A
robust, automated clustering and visualization
framework

Gunjan Gupta, Alex Liu, Joydeep Ghosh
Nov 3, 2006

2
Why cluster only a part of the data into dense
clusters?
3
Why cluster only a part of the data into dense
clusters?
4
Why cluster only a part of the data into dense
clusters?
Exhaustive clustering (K-Means) result
5
Why cluster only a part of the data into dense
clusters?

Little or no labeled data is available.
Only a part of the data clusters well.
Or only a fraction of the data is relevant.

6
Biological Applications

A few genes cluster well and form tight
clusters
- Gene expression data
- Phylogenetic profiles
- Protein clustering
- Finding functional groupings in pathway networks
Goals
- Cluster only a (small) part of the data into
multiple dense clusters.
- Visualization for understanding and verifying
clusters.

7
Other Application Scenarios

Document Retrieval
User interested in a few highly relevant
documents.
Market Basket Data
Only some customers have highly correlated
behaviors
Feature selection

8
Practical Issues

How many dense clusters, where are they located?
What fraction of data to cluster?
Notion of density?
All clusters not necessarily equally dense.
Choice of model, distance measure.

9
Density Based Clustering Algorithms

HMA (Wishart, 1968)
DBSCAN (Ester et al., 1996)
Density Shaving (DS) and the Auto-HDS framework.
A strong connection can be shown between the
above algorithms.
All 3 use a uniform density kernel (density ∝ no.
of points within some radius)
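
A minimal sketch of the uniform-kernel idea stated above: the density at a point is taken to be proportional to the number of other points inside a ball of fixed radius. The function name `neighbor_counts`, the Euclidean distance, and the choice to exclude the point itself from its own count are assumptions made for illustration; the slides do not specify them.

```python
import math

def neighbor_counts(points, radius):
    """For each point, count the other points within `radius` (Euclidean).

    Assumption: a point does not count as its own neighbor.
    """
    counts = []
    for i, p in enumerate(points):
        c = sum(1 for j, q in enumerate(points)
                if j != i and math.dist(p, q) <= radius)
        counts.append(c)
    return counts

# Three nearby points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(neighbor_counts(points, radius=1.0))
```

This brute-force version is quadratic in the number of points and is meant only to make the definition concrete.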

10
Density Shaving (DS)
11
Density Shaving (DS)
12
Density Shaving (DS)
Two inputs:
1. f_shave: fraction of the data to shave (0.38 in this example)
2. n_ε: minimum number of neighbors (3 in this example)
Uses a trick to automatically compute the correct
ball radius from f_shave and n_ε.
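
The slide does not spell out the trick, so the following is only one plausible reading, sketched under stated assumptions: take each point's distance to its n_ε-th nearest neighbor, sort those distances, and pick the radius at which the desired fraction (1 - f_shave) of points have at least n_ε neighbors. The names `ball_radius` and `n_eps`, the Euclidean distance, and the rounding rule are hypothetical.

```python
import math

def ball_radius(points, f_shave, n_eps):
    """Pick a radius so that about (1 - f_shave) of the points are dense.

    A point is called dense if at least n_eps other points lie within
    the radius. Assumes n_eps < len(points).
    """
    n = len(points)
    kth_dist = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        kth_dist.append(dists[n_eps - 1])  # distance to n_eps-th neighbor
    n_keep = max(1, round(n * (1.0 - f_shave)))
    radius = sorted(kth_dist)[n_keep - 1]
    # Ties at the chosen radius can make slightly more than n_keep points dense.
    dense = [d <= radius for d in kth_dist]
    return radius, dense

# A unit square plus one far-away point; shave 20% of 5 points.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
radius, dense = ball_radius(pts, f_shave=0.2, n_eps=2)
print(radius, dense)
```

In this toy example the isolated point is the one shaved, and the four corner points remain dense.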
13
Density Shaving (DS)
Performs graph traversal on dense points to
identify clusters.
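
The traversal step can be sketched as a breadth-first search over dense points, assuming that two dense points are linked when they lie within the ball radius of each other. That linking rule, the function name `dense_clusters`, and the use of label 0 for don't-care points are assumptions for illustration, not details taken from the slides.

```python
import math
from collections import deque

def dense_clusters(points, dense, radius):
    """Label connected groups of dense points; 0 marks don't-care points."""
    labels = [0] * len(points)
    next_id = 0
    for start in range(len(points)):
        if not dense[start] or labels[start]:
            continue
        next_id += 1
        labels[start] = next_id
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(len(points)):
                if (dense[j] and not labels[j]
                        and math.dist(points[i], points[j]) <= radius):
                    labels[j] = next_id
                    queue.append(j)
    return labels

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
is_dense = [True, True, True, True, False]
print(dense_clusters(pts, is_dense, radius=1.5))
```

The two nearby pairs come out as separate clusters and the non-dense point keeps label 0.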
14
Density Shaving (DS)
Don't-care points
15
Properties of DS

Increasing n_ε has a smoothing effect on
clustering.

(Figure: DS results for n_ε = 5 and n_ε = 50, with dense and don't-care points marked.)
16
Properties of DS

For a fixed n_ε, successive runs of DS with
increasing data shaving (f_shave) result in a
hierarchy of clusters.

(Figure: 2-D Gaussian example, 1298 pts, 5 Gaussians + uniform background, shown at 15% and 38% shaving with n_ε = 25.)
17
Properties of DS

With a fixed n_ε, successive runs of DS with
increasing shaving (f_shave) result in a
hierarchy of clusters.

As shaving increases, clusters can
- split
- vanish
Points in separate clusters never merge into one.

(Figure: 2-D Gaussian example, 1298 pts, 5 Gaussians + uniform background, shown at 15%, 38%, 62%, and 85% shaving.)
18
Hierarchical Density Shaving (HDS)

Uses geometric/exponential shaving to create the
hierarchy from DS.
Starting from all the data, a fixed fraction r_shave
of the remaining data is shaved each iteration.
Clusters that lose points without splitting keep
the same id.

(Figure: example in which clusters A and B at 38% shaving keep their ids at 55% shaving.)
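
To make the geometric shaving concrete, here is a small sketch of how many points would remain at each level if a fixed fraction r_shave of the remaining data is removed per iteration. The rounding (ceiling), the `min_points` stopping rule, and the function name `shave_schedule` are assumptions added for illustration.

```python
import math

def shave_schedule(n, r_shave, min_points=1):
    """Number of points kept at each HDS level: ceil(n * (1 - r_shave)**j).

    Stops once fewer than min_points would remain; repeated sizes are skipped.
    """
    if not 0.0 < r_shave < 1.0:
        raise ValueError("r_shave must be strictly between 0 and 1")
    kept = []
    j = 0
    while True:
        n_j = math.ceil(n * (1.0 - r_shave) ** j)
        if n_j < min_points:
            break
        if not kept or n_j < kept[-1]:
            kept.append(n_j)
        if n_j <= 1:
            break
        j += 1
    return kept

print(shave_schedule(100, 0.5, min_points=10))
```

Because the sizes shrink geometrically, the number of levels grows only logarithmically with the number of points.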
19
An important trick: Dictionary Row Sort on HDS
Label Matrix
20
Visualization using the sorted Label Matrix

Sorted matrix plotted
Each cluster plotted in a unique color
Don't-care points plotted in the background color

Shows the compact, 8-node hierarchy

(Figure: sorted label matrix, with sorted row index on one axis and shaving iteration on the other; labels A (38% shaving), B (62%), and C (85%) are marked.)
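
A rough sketch of the dictionary row sort, assuming the label matrix has one row per point and one column per shaving iteration, with 0 for don't-care points. Rows are compared lexicographically, here with the left-most (earliest) column taking precedence; the slides do not state the column precedence, so that choice and the name `sort_label_matrix` are assumptions.

```python
def sort_label_matrix(labels):
    """Lexicographically sort rows of the label matrix.

    Returns the row order and the sorted matrix. Points that share a
    cluster id at some level end up in contiguous rows.
    """
    order = sorted(range(len(labels)), key=lambda i: tuple(labels[i]))
    return order, [labels[i] for i in order]

# Cluster 1 splits into clusters 2 and 3 at the second level.
label_matrix = [
    [1, 2, 0],
    [1, 3, 3],
    [0, 0, 0],
    [1, 2, 2],
    [1, 3, 0],
]
order, sorted_rows = sort_label_matrix(label_matrix)
print(order)
for row in sorted_rows:
    print(row)
```

Plotting the sorted matrix with one color per id then shows each cluster as a contiguous band that narrows, splits, or vanishes across iterations.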
21
Cluster Stability
Each shaving iteration corresponds to one level (22 iterations in this example).
Spatially relevant projection.
Stability: the difference between the first and last level of
a cluster, which is approximately the number of shaving
iterations for which the cluster exists.
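
Under the same assumed label-matrix layout (rows are points, columns are shaving levels, 0 means don't care), stability as described above could be computed roughly as follows. The function name `cluster_stability` is hypothetical, and the sketch uses the plain difference between last and first level.

```python
def cluster_stability(labels):
    """Map each cluster id to (last level - first level) at which it appears."""
    first, last = {}, {}
    for row in labels:
        for level, cid in enumerate(row):
            if cid == 0:  # don't-care entry
                continue
            first[cid] = min(first.get(cid, level), level)
            last[cid] = max(last.get(cid, level), level)
    return {cid: last[cid] - first[cid] for cid in first}

label_matrix = [
    [1, 1, 2, 0],
    [1, 1, 2, 0],
    [1, 1, 3, 3],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(cluster_stability(label_matrix))
```

With this definition a cluster seen at a single level gets stability 0; adding one would give the count of levels instead.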
22
Cluster Selection

We can show relative stability is independent
of shaving rate r_shave
All clusters can be ranked by stability, even
parents and children.
One way of selecting clusters:
- Highest-stability clusters are picked first.
- Parents and children of picked clusters are
eliminated.
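
The selection rule above can be sketched as a greedy pass over clusters sorted by stability. The inputs (a `stability` dict and a `parent` dict mapping each cluster id to its parent or None) are hypothetical data structures, and the sketch assumes that "parents and children" extends to all ancestors and descendants in the hierarchy.

```python
def select_clusters(stability, parent):
    """Greedily pick clusters by decreasing stability.

    After a cluster is picked, its ancestors and descendants are excluded.
    Ties are broken by cluster id to keep the result deterministic.
    """
    def ancestors(c):
        found = set()
        while parent.get(c) is not None:
            c = parent[c]
            found.add(c)
        return found

    picked, blocked = [], set()
    for c in sorted(stability, key=lambda c: (-stability[c], c)):
        if c in blocked:
            continue
        picked.append(c)
        blocked.add(c)
        blocked |= ancestors(c)
        blocked |= {d for d in stability if c in ancestors(d)}
    return picked

# Cluster 1 is the parent of clusters 2 and 3; cluster 4 is unrelated.
stability = {1: 5, 2: 8, 3: 2, 4: 6}
parent = {1: None, 2: 1, 3: 1, 4: None}
print(select_clusters(stability, parent))
```

Here cluster 2 is picked first, which removes its parent 1, while its sibling 3 remains eligible.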

23
HDS + Visualization + Model selection = Auto-HDS

Auto-HDS
- Finds all modes/clusters.
- Finds clusters of varying density
simultaneously.

(Figure: Auto-HDS result (546 pts) compared with DS (546 pts, f_shave = 0.58).)
24
Datasets

Gasch dataset
Microarray data
173 experiments on yeast across 6151 dimensions
Lee dataset
Microarray data
5612 genes across 591 experiments

25
Results Gasch Dataset
26
Results Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Reference pool, not stressed
Heat Shock
YPD
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
27
Results Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Heat Shock
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
28
Gasch results (ARI)
29
Results Lee Dataset
30
Results Lee Dataset
31
Properties of Auto-HDS

Fast: O(n n_ε log n) using a heap-based implementation.
Gene DIVER: a memory-efficient heap-based
implementation.
Extremely compact hierarchy of clusters.
Visualization
- Creates a spatially relevant 2-D projection of
points and clusters.
- Spatially relevant 2-D projection of the compact
hierarchy.
Model selection
- Can define a notion of stability (analogous to
cluster height).
- Based on stability, can select the most stable
clusters automatically.

32
Gene DIVER

Gene Density Interactive Visual ExplorER
A scalable implementation of Auto-HDS using
streaming data instead of main memory.
Special features for browsing clusters.
Special features for biological data mining.
Available for download at
http://www.ideal.ece.utexas.edu/gunjan/genediver

Let's see the Gene DIVER demo now.
33
Gene DIVER loading data
34
Gene DIVER clustering
35
Gene DIVER browse clustering
36
Gene DIVER browsing gene in BioGRID
37
Gene DIVER browsing cluster in FunSpec
38
Gene DIVER browse clustering auto zoom
39
Conclusion

Auto-HDS improves upon non-parametric
density-based clustering in many ways:
- Well-suited for very high-d datasets.
- A powerful visualization.
- Interactive clustering, compact hierarchy.
- Automatic selection of clusters.
Gene DIVER: a powerful tool for the data mining
community, and especially for bioinformatics
practitioners.
- Browsing clustered genes through BioGRID
- Verifying (yeast only) gene clusters using
FunSpec
- Ability to specify custom descriptions for
browsing

40
?
41
Conclusion

We have introduced a framework that incorporates:
- Robust hierarchical, density-based clustering
- Automatic cluster selection
- Powerful, compact visualization of the hierarchy,
clusters, and points
Good empirical results on biological data.
Gene DIVER
- A fast, memory-efficient, interactive Java
implementation of Auto-HDS with Swing-based
visualization: http://www.ideal.ece.utexas.edu/gunjan/genediver
- Supports features relevant to biologists, such as
the ability to enter their own data and distance measure,
browse discovered gene clusters, etc.