Title: Automated Hierarchical Density Shaving: A robust, automated clustering and visualization framework
1 Automated Hierarchical Density Shaving: A robust, automated clustering and visualization framework
- Gunjan Gupta, Alex Liu, Joydeep Ghosh
- Nov 3, 2006
2 Why cluster only a part of the data into dense clusters?
Exhaustive clustering (K-Means) result
- Little or no labeled data available.
- Only a part of the data clusters well.
- Or only a fraction of the data is relevant.
6 Biological Applications
- A few genes cluster well, and form tight clusters
- Gene expression data
- Phylogenetic profiles
- Protein clustering
- Finding functional groupings in pathway networks
- Goals
- - Cluster only a (small) part of the data into multiple dense clusters.
- - Visualization for understanding and verifying clusters.
7 Other Application Scenarios
- Document Retrieval
- - User interested in a few highly relevant documents.
- Market Basket Data
- - Only some customers have highly correlated behaviors
- Feature selection
8 Practical Issues
- How many dense clusters, where are they located?
- What fraction of data to cluster?
- Notion of density?
- All clusters not necessarily equally dense.
- Choice of model, distance measure.
9 Density Based Clustering Algorithms
- HMA (Wishart, 1968)
- DBSCAN (Ester et al., 1996)
- Density Shaving (DS) and Auto-HDS framework.
- Can show there is a strong connection between the above algorithms.
- All 3 use a uniform density kernel (density is measured by the no. of points within some radius)
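A minimal sketch of the uniform density kernel mentioned above: the density at a point is taken to be the number of other points inside a ball of a given radius. The function name `neighbor_counts`, the Euclidean distance, and the choice to exclude the point itself are assumptions made for illustration, not details from the slides.

```python
import math

def neighbor_counts(points, radius):
    """Uniform-kernel density: for each point, count the OTHER points
    lying within `radius` (Euclidean distance is an assumption)."""
    counts = []
    for i, p in enumerate(points):
        c = sum(1 for j, q in enumerate(points)
                if j != i and math.dist(p, q) <= radius)
        counts.append(c)
    return counts

# Three points close together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
counts = neighbor_counts(points, radius=0.5)
print(counts)  # [2, 2, 2, 0]
```

The isolated point gets a count of zero, so any threshold on the count separates it from the dense group.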
12 Density Shaving (DS)
Two inputs:
1. f_shave: fraction of the data to shave (0.38)
2. n_ε: min. no. of neighbors (3)
Uses a trick to automatically compute the correct ball radius from f_shave and n_ε.
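The slide does not spell out the trick. One plausible reading, sketched below, is to rank points by the distance to their n_ε-th nearest neighbor and keep the (1 - f_shave) fraction with the smallest such distance; the largest distance among the kept points then serves as the ball radius. The names `knn_distance` and `shave`, the Euclidean metric, and the rounding of the kept count are assumptions for illustration only.

```python
import math

def knn_distance(points, i, n_eps):
    """Distance from points[i] to its n_eps-th nearest other point."""
    d = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return d[n_eps - 1]

def shave(points, f_shave, n_eps):
    """Return (radius, dense_indices), keeping roughly the densest
    (1 - f_shave) fraction of the points. Assumed reading of the slide."""
    n = len(points)
    n_keep = max(1, n - int(n * f_shave))  # rounding rule is an assumption
    order = sorted(range(n), key=lambda i: knn_distance(points, i, n_eps))
    dense = order[:n_keep]
    # Smallest radius at which every kept point has >= n_eps neighbors.
    radius = knn_distance(points, dense[-1], n_eps)
    return radius, sorted(dense)

# Four corners of a unit square plus four far-away outliers.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (20.0, 20.0), (-10.0, 5.0), (5.0, -15.0)]
radius, dense = shave(points, f_shave=0.5, n_eps=2)
print(radius, dense)  # 1.0 [0, 1, 2, 3]
```

With half of the data shaved, the four outliers become don't care points and the radius falls out of the data rather than being chosen by hand.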
13 Density Shaving (DS)
Performs graph traversal on dense points to identify clusters.
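The graph traversal can be sketched as a breadth-first search over dense points, assuming two dense points are connected when they lie within the ball radius of each other. The function name `dense_clusters`, the use of -1 for don't care points, and the connectivity rule are assumptions for illustration.

```python
import math
from collections import deque

def dense_clusters(points, dense, radius):
    """Label connected components among dense points.
    Returns one label per point; -1 marks don't care points."""
    labels = [-1] * len(points)
    next_id = 0
    for start in dense:
        if labels[start] != -1:
            continue  # already reached from an earlier start point
        labels[start] = next_id
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in dense:
                # Assumed edge rule: dense points within `radius` are linked.
                if labels[j] == -1 and math.dist(points[i], points[j]) <= radius:
                    labels[j] = next_id
                    queue.append(j)
        next_id += 1
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
labels = dense_clusters(points, dense=[0, 1, 2, 3, 4], radius=1.0)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

This brute-force version compares every pair of dense points; it is meant to show the idea, not the efficient implementation the talk refers to later.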
14 Density Shaving (DS)
Don't care points
15 Properties of DS
- Increasing n_ε has a smoothing effect on clustering.
Figure: panels for n_ε = 5 and n_ε = 50; the legend distinguishes dense points from don't care points.
16 Properties of DS
- For a fixed n_ε, successive runs of DS with increasing data shaving (f_shave) result in a hierarchy of clusters.
Figure: 2-D Gaussian example (1298 pts, 5 Gaussians plus uniform background) at 15% shaving and 38% shaving, n_ε = 25.
17 Properties of DS
- With a fixed n_ε, successive runs of DS with increasing shaving (f_shave) result in a hierarchy of clusters.
- Clusters can
- - split
- - vanish
- Points in separate clusters never merge into one.
Figure: 2-D Gaussian example (1298 pts, 5 Gaussians plus uniform background) at 15%, 38%, 62%, and 85% shaving.
18 Hierarchical Density Shaving (HDS)
- Uses geometric/exponential shaving to create the hierarchy from DS.
- Starting from all the data, a fixed fraction r_shave of the remaining data is shaved in each iteration.
- Clusters that lose points without splitting get the same id.
Example: clusters A and B at 38% shaving and at 55% shaving.
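To make the geometric shaving concrete, the sketch below lists how many points would remain at each iteration, assuming the fixed fraction r_shave is removed from the points still present (so the kept count decays as n(1 - r_shave)^j). The function name `shaving_schedule` and the use of a ceiling for rounding are assumptions for illustration.

```python
import math

def shaving_schedule(n, r_shave, n_iters):
    """Points kept at iterations 0..n_iters-1 under geometric shaving.
    Iteration 0 keeps all the data (assumed reading of the slide)."""
    return [math.ceil(n * (1.0 - r_shave) ** j) for j in range(n_iters)]

kept = shaving_schedule(n=1000, r_shave=0.5, n_iters=4)
print(kept)  # [1000, 500, 250, 125]

# Equivalent f_shave value that DS would be run with at each level.
f_shave_per_level = [1.0 - k / 1000 for k in kept]
print(f_shave_per_level)  # [0.0, 0.5, 0.75, 0.875]
```

A small r_shave gives many closely spaced levels, while a large one gives a coarse hierarchy with few levels.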
19 An important trick: Dictionary Row Sort on HDS Label Matrix
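A minimal sketch of a dictionary (lexicographic) row sort, assuming the HDS label matrix has one row per point and one column per shaving iteration, with 0 standing for a don't care point. The layout, the 0 convention, and the function name `dictionary_row_sort` are assumptions for illustration.

```python
def dictionary_row_sort(label_matrix):
    """Sort rows lexicographically, comparing the first column first.
    Returns (row order, sorted matrix)."""
    order = sorted(range(len(label_matrix)),
                   key=lambda i: tuple(label_matrix[i]))
    return order, [label_matrix[i] for i in order]

# Rows: points. Columns: shaving iterations. 0 = don't care (assumed).
label_matrix = [
    [1, 2, 0],
    [1, 0, 0],
    [1, 3, 3],
    [0, 0, 0],
    [1, 2, 2],
    [1, 3, 0],
]
order, sorted_matrix = dictionary_row_sort(label_matrix)
print(order)  # [3, 1, 0, 4, 5, 2]
for row in sorted_matrix:
    print(row)
```

After sorting, the rows that share a cluster id in any column sit next to each other, which is what allows each cluster to be drawn as one contiguous block.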
20 Visualization using the sorted Label Matrix
- Sorted matrix plotted
- Each cluster plotted in unique color
- Don't care points plotted in background color
- Shows the compact, 8-node hierarchy
Figure: sorted label matrix with axes "Sorted rows index" and "Shaving iteration"; clusters A (38% shaving), B (62% shaving), and C (85% shaving).
21 Cluster Stability
Figure: spatially relevant projection; axis label: level (a shaving iteration corresponds to a level); 22 iterations.
Stability: difference between the first and last level of a cluster, i.e. the no. of shaving iterations for which the cluster exists.
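Stability as described above can be sketched directly from a label matrix, assuming rows are points, columns are levels, 0 marks a don't care point, and a cluster keeps its id for as long as it exists. Counting the levels inclusively (last - first + 1) is one reading of the slide; the function name `cluster_stability` is hypothetical.

```python
def cluster_stability(label_matrix, dont_care=0):
    """Map each cluster id to the number of levels it spans,
    counted as last level - first level + 1 (assumed convention)."""
    first, last = {}, {}
    for row in label_matrix:
        for level, cid in enumerate(row):
            if cid == dont_care:
                continue
            first[cid] = min(first.get(cid, level), level)
            last[cid] = max(last.get(cid, level), level)
    return {cid: last[cid] - first[cid] + 1 for cid in first}

label_matrix = [
    [1, 2, 0],
    [1, 0, 0],
    [1, 3, 3],
    [0, 0, 0],
    [1, 2, 2],
    [1, 3, 0],
]
stability = cluster_stability(label_matrix)
print(stability)
```

Here cluster 1 exists only at the first level, while clusters 2 and 3 each survive for two levels and would therefore rank as more stable.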
22 Cluster Selection
- We can show relative stability is independent of shaving rate r_shave
- All clusters can be ranked by stability, even parents and children.
- One way of selecting clusters
- - Highest stability clusters picked first.
- - Parents and children of picked clusters eliminated.
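The selection rule above can be sketched as a greedy pass over clusters in decreasing order of stability. The sketch assumes the hierarchy is given as a child-to-parent map and reads "parents and children" as all ancestors and descendants of a picked cluster; the cluster names, the stability values, and the function name `select_clusters` are made up for illustration.

```python
def select_clusters(stability, parent):
    """Pick clusters greedily by stability, dropping every ancestor
    and descendant of each picked cluster (assumed interpretation)."""
    def ancestors(c):
        out = set()
        while parent.get(c) is not None:
            c = parent[c]
            out.add(c)
        return out

    selected, eliminated = [], set()
    # Ties are broken by cluster name so the result is deterministic.
    for c in sorted(stability, key=lambda c: (-stability[c], c)):
        if c in eliminated:
            continue
        selected.append(c)
        eliminated |= ancestors(c)
        eliminated |= {d for d in stability if c in ancestors(d)}
    return selected

# Hypothetical hierarchy: root splits into A and B; A splits into A1 and A2.
parent = {"root": None, "A": "root", "B": "root", "A1": "A", "A2": "A"}
stability = {"root": 2, "A": 5, "B": 7, "A1": 6, "A2": 3}
print(select_clusters(stability, parent))  # ['B', 'A1', 'A2']
```

Picking A1 removes its parent A, but its sibling A2 remains eligible, so the result is a set of clusters with no parent and child selected together.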
23 HDS + Visualization + Model selection = Auto-HDS
- Auto-HDS
- - Finds all modes/clusters.
- - Finds clusters of varying density simultaneously.
Figure: panels for Auto-HDS (546 pts) and DS (546 pts, f_shave = 0.58).
24 Datasets
- Gasch dataset
- Microarray data
- 173 experiments on yeast across 6151 dimensions
- Lee dataset
- Microarray data
- 5612 genes across 591 experiments
25 Results: Gasch Dataset
26 Results: Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Reference pool, not stressed
Heat Shock
YPD
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
27 Results: Gasch Dataset
H2O2
Menadione
Diauxic Shift
Heat Shock
Heat Shock
Heat Shock
Nitrogen Depletion
Stationary Phase
Sorbitol osmotic shock
28 Gasch results (ARI)
29 Results: Lee Dataset
30 Results: Lee Dataset
31 Properties of Auto-HDS
- Fast: O(n n_ε log n) using a heap-based implementation.
- Gene DIVER: a memory-efficient heap-based implementation.
- Extremely compact hierarchy of clusters.
- Visualization
- - Creates a spatially relevant 2-D projection of points and clusters.
- - Spatially relevant 2-D projection of the compact hierarchy.
- Model selection
- - Can define a notion of stability (analogous to cluster height).
- - Based on stability, can select the most stable clusters automatically.
32 Gene DIVER
- Gene Density Interactive Visual ExplorER
- A scalable implementation of Auto-HDS using streaming data instead of main memory.
- Special features for browsing clusters.
- Special features for biological data mining.
- Available for download at
- http://www.ideal.ece.utexas.edu/gunjan/genediver
Let's see the Gene DIVER demo now
33 Gene DIVER: loading data
34 Gene DIVER: clustering
35 Gene DIVER: browse clustering
36 Gene DIVER: browsing gene in BioGRID
37 Gene DIVER: browsing cluster in FunSpec
38 Gene DIVER: browse clustering, auto zoom
39 Conclusion
- Auto-HDS improves upon non-parametric density-based clustering in many ways
- - Well-suited for very high-d datasets.
- - A powerful visualization.
- - Interactive clustering, compact hierarchy.
- - Automatic selection of clusters.
- Gene DIVER: a powerful tool for the data mining community, and especially for Bioinformatics practitioners.
- - Browsing clustered genes through BioGRID
- - Verifying (yeast only) gene clusters using FunSpec
- - Ability to specify custom descriptions for browsing
40 ?
41 Conclusion
- Have introduced a framework that incorporates
- - A robust hierarchical, density-based clustering
- - Automatic cluster selection
- - Powerful, compact visualization of hierarchy, clusters and points.
- Good empirical results on biological data
- Gene DIVER
- - A fast, memory-efficient, interactive Java implementation of Auto-HDS with Swing-based visualization: http://www.ideal.ece.utexas.edu/gunjan/genediver
- - Supports features relevant to biologists, such as the ability to enter own data and distance measure, browse discovered gene clusters, etc.
42 Results: Lee Dataset