Title: Subspace Clustering/Biclustering
1Subspace Clustering/Biclustering
- CS 685 Special Topics in Data Mining
- Spring 2008
- Jinze Liu
2Data Mining Clustering
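K-means clustering minimizes the within-cluster sum of squared distances
$$\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$
where $\mu_k$ is the centroid (mean) of cluster $C_k$.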
3Clustering by Pattern Similarity (p-Clustering)
- The raw micro-array data shows 3 genes and their values in a multi-dimensional space
- Parallel coordinates plots
- Difficult to find their patterns
- Non-traditional clustering
4Clusters Are Clear After Projection
5Motivation
- E-Commerce collaborative filtering
Movie 1 Movie 2 Movie 3 Movie 4 Movie 5 Movie 6 Movie 7
Viewer 1 1 2 4 3 5
Viewer 2 4 6 7 1
Viewer 3 2 3 4 6 3
Viewer 4 3 4 5 7
Viewer 5 5 5 3 4
6Motivation
7Motivation
8Motivation
9Gene Expression Data
10Biclustering of Gene Expression Data
- Genes not regulated under all conditions
- Genes regulated by multiple factors/processes concurrently
- Key to determine function of genes
- Key to determine classification of conditions
11Motivation
CH1I CH1B CH1D CH2I CH2B
CTFC3 4392 284 4108 280 228
VPS8 401 281 120 275 298
EFB1 318 280 37 277 215
SSA1 401 292 109 580 238
FUN14 2857 285 2576 271 226
SP07 228 290 48 285 224
MDM10 538 272 266 277 236
CYS3 322 288 41 278 219
DEP1 312 272 40 273 232
NTG1 329 296 33 274 228
12Motivation
13Motivation
- Strong coherence is exhibited by the selected objects on the selected attributes.
- They are not necessarily close to each other but rather bear a constant shift.
- Object/attribute bias
- bi-cluster
14Challenges
- The set of objects and the set of attributes are usually unknown.
- Different objects/attributes may possess different biases, and such biases
  - may be local to the set of selected objects/attributes
  - are usually unknown in advance
- May have many unspecified entries
15What's Biclustering?
- Given an n x m matrix, A, find a set of submatrices, Bk, such that the contents of each Bk follow a desired pattern
- Row/column order need not be consistent between different Bks
16Bipartite Graphs
- The matrix can be thought of as a graph: rows are one set of vertices L, columns are another set R
- Edges are weighted by the corresponding entries in the matrix
- If all weights are binary, biclustering becomes biclique finding
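As a small illustration of the binary case, the sketch below treats a 0/1 matrix as the adjacency of a bipartite graph and checks whether a chosen row set and column set form a biclique (every selected row connected to every selected column). The function name `is_biclique` and the toy matrix are hypothetical, chosen only for this example.

```python
def is_biclique(weights, rows, cols):
    """Return True if every (row, col) pair in the selection has an edge (weight 1)."""
    return all(weights[i][j] == 1 for i in rows for j in cols)


# Toy binary matrix: rows are vertices in L, columns are vertices in R.
weights = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

print(is_biclique(weights, rows=[0, 1], cols=[0, 1]))  # all four entries are 1
print(is_biclique(weights, rows=[0, 1], cols=[1, 2]))  # entry (0, 2) is 0
```

Checking a given selection is easy; the hard part of biclustering is searching over all possible row and column subsets.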
17Bicluster Structures
18Previous Work
- Subspace clustering
  - Identifying a set of objects and a set of attributes such that the set of objects are physically close to each other on the subspace formed by the set of attributes.
- Collaborative filtering: Pearson R
  - Only considers global offset of each object/attribute.
19bi-cluster
- Consists of a (sub)set of objects and a (sub)set of attributes
  - Corresponds to a submatrix
- Occupancy threshold α
  - Each object/attribute has to be filled by a certain percentage.
- Volume: number of specified entries in the submatrix
- Base: average value of each object/attribute (in the bi-cluster)
20bi-cluster
           CH1I  CH1B  CH1D  CH2I  CH2B  Obj base
CTFC3
VPS8        401         120         298       273
EFB1        318          37         215       190
SSA1
FUN14
SP07
MDM10
CYS3        322          41         219       194
DEP1
NTG1
Attr base   347          66         244       219
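The bases in the table above can be recomputed directly from the three selected genes and three selected conditions. The sketch below does that and also evaluates, for each entry, the quantity value - object base - attribute base + cluster base; using that expression as the residue is an assumption based on the additive model discussed later in the deck, and the variable names are illustrative only.

```python
# Selected objects (genes) and attributes (conditions) of the bi-cluster.
objects = ["VPS8", "EFB1", "CYS3"]
attributes = ["CH1I", "CH1D", "CH2B"]
values = [
    [401, 120, 298],  # VPS8
    [318, 37, 215],   # EFB1
    [322, 41, 219],   # CYS3
]


def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


obj_base = [mean(row) for row in values]          # one base per object
attr_base = [mean(col) for col in zip(*values)]   # one base per attribute
cluster_base = mean(v for row in values for v in row)

# Assumed residue: value - object base - attribute base + cluster base.
residue = [
    [values[i][j] - obj_base[i] - attr_base[j] + cluster_base
     for j in range(len(attributes))]
    for i in range(len(objects))
]

print(obj_base)      # object bases
print(attr_base)     # attribute bases
print(cluster_base)  # base of the whole bi-cluster
print(residue)       # all zeros means the objects differ only by a constant shift
```

For this submatrix every residue is zero, which matches the claim that the selected objects bear a constant shift on the selected attributes.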
21Bicluster Structures
22Cheng and Church
- Example
  - Correlation between any two columns = correlation between any two rows = 1.
  - $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$, where $a_{iJ}$ = mean of row i, $a_{Ij}$ = mean of column j, $a_{IJ}$ = mean of A.
  - Biological meaning: the genes have the same (amount of) response to the conditions.

Background = 5     Col 0 (effect 1)  Col 1 (effect 3)  Col 2 (effect 2)
Row 0 (effect 2)          8                10                 9
Row 1 (effect 4)         10                12                11
Row 2 (effect 1)          7                 9                 8
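The example can be checked numerically. The sketch below rebuilds the matrix as background + row effect + column effect and confirms that every entry equals its row mean plus its column mean minus the overall mean, so the squared-residue score is exactly zero. Exact fractions are used to avoid rounding noise; the variable names are illustrative, not taken from the original slides.

```python
from fractions import Fraction

background = 5
row_effect = [2, 4, 1]
col_effect = [1, 3, 2]

# Additive model: entry = background + row effect + column effect.
A = [[background + r + c for c in col_effect] for r in row_effect]

n_rows, n_cols = len(A), len(A[0])
row_mean = [Fraction(sum(row), n_cols) for row in A]
col_mean = [Fraction(sum(col), n_rows) for col in zip(*A)]
all_mean = Fraction(sum(map(sum, A)), n_rows * n_cols)

# Residue of each entry: a_ij - a_iJ - a_Ij + a_IJ.
residue = [
    [A[i][j] - row_mean[i] - col_mean[j] + all_mean for j in range(n_cols)]
    for i in range(n_rows)
]

# Mean squared residue over the whole matrix.
H = sum(r * r for row in residue for r in row) / (n_rows * n_cols)

print(A)
print(H)
```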
23bi-cluster
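- Perfect δ-cluster
- Imperfect δ-cluster
- Residue of entry (i, j): $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$, where $d_{iJ}$ is the base of object i, $d_{Ij}$ is the base of attribute j, and $d_{IJ}$ is the base of the bi-cluster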
24Cheng and Church
- Model
  - A bicluster is represented by the submatrix A of the whole expression matrix (the involved rows and columns need not be contiguous in the original matrix).
  - Each entry Aij in the bicluster is the superposition (summation) of
    - The background level
    - The row (gene) effect
    - The column (condition) effect
  - A dataset contains a number of biclusters, which are not necessarily disjoint.
25Cheng and Church
- Finding the largest δ-bicluster
  - The problem of finding the largest square δ-bicluster (|I| = |J|) is NP-hard.
  - Objective function for heuristic methods (to minimize): the mean squared residue
    $$H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\ j \in J} \left(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}\right)^2$$
    => a sum of the components from each row and column, which suggests simple greedy algorithms that evaluate each row and column independently.
26Cheng and Church
- Greedy methods
  - Algorithm 0: Brute-force deletion (skipped)
  - Algorithm 1: Single node deletion
    - Parameter(s): δ (maximum squared residue).
    - Initialization: the bicluster contains all rows and columns.
    - Iteration:
      1. Compute all aIj, aiJ, aIJ and H(I, J) for reuse.
      2. Remove a row or column that gives the maximum decrease of H.
    - Termination: when no action will decrease H or H < δ.
    - Time complexity: O(MN)
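The single node deletion loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes H is the mean squared residue of the current submatrix, it picks the row or column with the largest mean squared residue as a stand-in for "gives the maximum decrease of H", and it stops once H is at most δ. The names `mean_squared_residue` and `single_node_deletion` and the toy matrix are hypothetical.

```python
def mean_squared_residue(a, rows, cols):
    """Return (H, per-row scores, per-column scores) for the submatrix a[rows][cols]."""
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean[i] for i in rows) / len(rows)
    sq = {(i, j): (a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return h, row_score, col_score


def single_node_deletion(a, delta):
    """Greedily drop one row or column at a time until H <= delta."""
    rows = list(range(len(a)))
    cols = list(range(len(a[0])))
    while True:
        h, row_score, col_score = mean_squared_residue(a, rows, cols)
        if h <= delta or (len(rows) == 1 and len(cols) == 1):
            return rows, cols, h
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        # Remove whichever node contributes the larger mean squared residue.
        if row_score[worst_row] >= col_score[worst_col] and len(rows) > 1:
            rows.remove(worst_row)
        elif len(cols) > 1:
            cols.remove(worst_col)
        else:
            rows.remove(worst_row)


# Three rows that follow the additive model plus one row that does not.
data = [
    [8, 10, 9],
    [10, 12, 11],
    [7, 9, 8],
    [20, 1, 30],
]
rows, cols, h = single_node_deletion(data, delta=0.01)
print(rows, cols)
```

On this toy input the incoherent last row is removed first, after which the remaining 3 x 3 block has a residue of essentially zero and the loop stops.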
27Cheng and Church
- Greedy methods
  - Algorithm 2: Multiple node deletion (takes one more parameter α. In iteration step 2, delete all rows and columns with row/column residue > αH(I, J)).
  - Algorithm 3: Node addition (allow both additions and deletions of rows/columns).
28Cheng and Church
- Handling missing values and masking discovered biclusters: replace by random numbers so that no recognizable structures will be introduced.
- Data preprocessing
  - Yeast: x → 100 log(10^5 x)
  - Lymphoma: x → 100x (original data is already log-transformed)
29Cheng and Church
- Some results on yeast cell cycle data (2884 × 17)
30Cheng and Church
- Some results on lymphoma data (4026 × 96)
Each entry: no. of genes, no. of conditions (three biclusters per row)
4, 96      10, 29     11, 25
103, 25    127, 13    13, 21
10, 57     2, 96      25, 12
9, 51      3, 96      2, 96
31Cheng and Church
- Discussion
  - Biological validation: comparing with the clusters in previously published results.
  - No evaluation of the statistical significance of the clusters.
  - Both the model and the algorithm are not tailored for discovering multiple non-disjoint clusters.
  - Normalization is of utmost importance for the model, but this issue is not well-discussed.
32The FLOC algorithm
1. Generate initial clusters
2. Determine the best action for each row and each column
3. Perform the best action of each row and column sequentially
4. Improved? If yes (Y), go back to step 2; if no (N), stop
33The FLOC algorithm
- Action: the change of membership of a row (or column) with respect to a cluster
- M + N actions are performed at each iteration
(Figure: example data matrix with M = 4 columns and N = 3 rows)
34Performance
- Microarray data: 2884 genes, 17 conditions
- The 100 bi-clusters with the smallest residue were returned.
- Average residue = 10.34
- The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54
- The average volume is 25% bigger
- The response time is an order of magnitude faster
35Coherent Cluster
Want to accommodate noise but not outliers
36Coherent Cluster
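- Coherent cluster
- Subspace clustering
- Pair-wise disparity
  - For a 2×2 (sub)matrix consisting of objects x, y and attributes a, b:
    $$D = \left| (d_{xa} - d_{ya}) - (d_{xb} - d_{yb}) \right|$$
    where $d_{xa} - d_{ya}$ is the mutual bias of attribute a and $d_{xb} - d_{yb}$ is the mutual bias of attribute b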
37Coherent Cluster
- A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.
- An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster.
- A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
- Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
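The definition translates directly into a brute-force membership test: enumerate every pair of rows and every pair of columns and check the 2×2 disparity. The sketch below assumes D is the absolute difference between the mutual biases of the two attributes, |(d_xa - d_ya) - (d_xb - d_yb)|; the function names are hypothetical. It only verifies a given matrix and does not search for maximum clusters.

```python
from itertools import combinations


def disparity(m, x, y, a, b):
    """Assumed D value of the 2x2 submatrix on objects x, y and attributes a, b."""
    return abs((m[x][a] - m[y][a]) - (m[x][b] - m[y][b]))


def is_delta_coherent(m, delta):
    """True if every 2x2 submatrix of m has a D value of at most delta."""
    n_rows, n_cols = len(m), len(m[0])
    return all(
        disparity(m, x, y, a, b) <= delta
        for x, y in combinations(range(n_rows), 2)
        for a, b in combinations(range(n_cols), 2)
    )


shifted = [[8, 10, 9], [10, 12, 11], [7, 9, 8]]  # rows differ by a constant shift
noisy = [[1, 2], [2, 5]]                          # D = |(1 - 2) - (2 - 5)| = 2

print(is_delta_coherent(shifted, 0))
print(is_delta_coherent(noisy, 1), is_delta_coherent(noisy, 2))
```

The check costs on the order of (number of row pairs) × (number of column pairs) evaluations, which is fine for verifying a candidate but is one reason the search problem itself is hard.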
38Coherent Cluster
- Challenges
  - Finding subspace clusters based on distance is itself already a difficult task due to the curse of dimensionality.
  - The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
  - The actual values of the objects in a coherent cluster may be far apart from each other.
  - Each object or attribute in a coherent cluster may bear some relative bias (that is unknown in advance), and such bias may be local to the coherent cluster.
39References
- J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
- H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
- S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
- J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM 2003.