Title: BiClustering II
1. Bi-Clustering II
- COMP 790-90 Seminar
- Spring 2009
2. Coherent Cluster
Goal: accommodate noise but not outliers
3. Coherent Cluster
- Coherent cluster
- Subspace clustering
- Pair-wise disparity
- For a 2×2 (sub)matrix consisting of objects x, y and attributes a, b:
$D = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$
where $(d_{xa} - d_{ya})$ is the mutual bias of attribute a and $(d_{xb} - d_{yb})$ is the mutual bias of attribute b.
4. Coherent Cluster
- A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.
- An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster.
- A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
- Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
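The definitions above can be checked directly on a small matrix. The following is a minimal sketch, assuming the D value of a 2×2 submatrix is the absolute difference between the mutual biases of its two attributes; the function names and the toy matrix are hypothetical and only illustrate the definition.

```python
from itertools import combinations

def d_value(m, x, y, a, b):
    """Assumed D value of the 2x2 submatrix on objects x, y and attributes a, b."""
    return abs((m[x][a] - m[y][a]) - (m[x][b] - m[y][b]))

def is_delta_coherent(m, objects, attributes, delta):
    """True if every 2x2 submatrix of the selected rows and columns has D <= delta."""
    return all(
        d_value(m, x, y, a, b) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(attributes, 2)
    )

# Hypothetical toy matrix: rows are objects, columns are attributes.
matrix = [
    [1.0, 4.0, 2.0],  # object 0
    [3.0, 6.0, 4.5],  # object 1: roughly object 0 shifted by 2
    [9.0, 1.0, 7.0],  # object 2: unrelated pattern
]

print(is_delta_coherent(matrix, [0, 1], [0, 1, 2], delta=1.0))     # objects 0 and 1 only
print(is_delta_coherent(matrix, [0, 1, 2], [0, 1, 2], delta=1.0))  # all three objects
```

This brute-force test examines every 2×2 submatrix, so it is only practical for verifying a candidate cluster, not for searching the whole matrix.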
5. Coherent Cluster
- Challenges
- Finding subspace clusters based on distance is itself already a difficult task due to the curse of dimensionality.
- The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
- The actual values of the objects in a coherent cluster may be far apart from each other.
- Each object or attribute in a coherent cluster may bear some relative bias (that is unknown in advance), and such bias may be local to the coherent cluster.
6. Coherent Cluster
- Compute the maximum coherent attribute sets for each pair of objects
- Two-way pruning
- Construct the lexicographical tree
- Post-order traverse the tree to find maximum coherent clusters
7. Coherent Cluster
- Observation: given a pair of objects o1, o2 and a (sub)set of attributes a1, a2, ..., ak, the 2×k submatrix is a δ-coherent cluster iff the mutual biases $(d_{o_1 a_i} - d_{o_2 a_i})$ of the attributes ai differ from one another by no more than δ.
If δ = 1.5, then {a1, a2, a3, a4, a5} is a coherent attribute set (CAS) of (o1, o2).
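One way to turn this observation into a procedure is to sort the attributes by mutual bias and slide a window whose bias spread stays within δ. The sketch below follows that idea; the function name, the `min_cols` parameter, and the two rows of values are hypothetical and are not the example from the slide.

```python
def maximal_coherent_attribute_sets(row1, row2, delta, min_cols=2):
    """Return the maximal attribute sets on which two objects are delta-coherent."""
    # Mutual bias of each attribute, sorted in ascending order.
    biases = sorted((v1 - v2, a) for a, (v1, v2) in enumerate(zip(row1, row2)))
    result = []
    start = 0
    for end in range(len(biases)):
        # Shrink the window from the left until its bias spread is within delta.
        while biases[end][0] - biases[start][0] > delta:
            start += 1
        # The window is maximal if the next attribute cannot be added to it.
        is_last = end == len(biases) - 1
        if is_last or biases[end + 1][0] - biases[start][0] > delta:
            if end - start + 1 >= min_cols:
                result.append(sorted(a for _, a in biases[start:end + 1]))
    return result

# Hypothetical values for two objects over six attributes (indices 0 to 5).
row1 = [4.0, 2.0, 5.0, 9.0, 1.0, 7.0]
row2 = [2.0, 1.0, 3.5, 2.0, 0.5, 1.0]
print(maximal_coherent_attribute_sets(row1, row2, delta=1.5))
# prints [[0, 1, 2, 4], [3, 5]]
```

Because the biases are sorted, each window is a contiguous run, and one pass over the sorted list is enough after the sort.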
8. Coherent Cluster
- Observation: given a subset of objects {o1, o2, ..., ol} and a subset of attributes {a1, a2, ..., ak}, the l×k submatrix is a δ-coherent cluster iff {a1, a2, ..., ak} is a coherent attribute set for every pair of objects (oi, oj), where 1 ≤ i, j ≤ l.
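The equivalence stated in this observation can be checked numerically. The sketch below compares a pair-by-pair test (bias spread at most δ for every object pair) with the 2×2 submatrix test, assuming the D value is the absolute difference of two mutual biases; the matrix and function names are hypothetical.

```python
from itertools import combinations

def coherent_by_pairs(m, objects, attributes, delta):
    """Attribute set is coherent for every object pair (bias spread <= delta)."""
    for x, y in combinations(objects, 2):
        biases = [m[x][a] - m[y][a] for a in attributes]
        if max(biases) - min(biases) > delta:
            return False
    return True

def coherent_by_2x2(m, objects, attributes, delta):
    """Every 2x2 submatrix has an (assumed) D value <= delta."""
    return all(
        abs((m[x][a] - m[y][a]) - (m[x][b] - m[y][b])) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(attributes, 2)
    )

# Hypothetical 4x4 matrix: rows are objects, columns are attributes.
matrix = [
    [1.0, 4.0, 2.0, 8.0],
    [3.0, 6.0, 4.5, 3.0],
    [9.0, 1.0, 7.0, 2.0],
    [2.0, 5.0, 3.0, 9.0],
]
delta = 1.0
rows, cols = range(len(matrix)), range(len(matrix[0]))

# Compare both tests on every submatrix with at least two rows and two columns.
agree = all(
    coherent_by_pairs(matrix, objs, attrs, delta)
    == coherent_by_2x2(matrix, objs, attrs, delta)
    for n in range(2, 5) for objs in combinations(rows, n)
    for k in range(2, 5) for attrs in combinations(cols, k)
)
print(agree)
```

The pair-by-pair form is the one that makes per-pair attribute sets useful as a search space: a cluster can only use attributes that are coherent for all of its object pairs.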
9. Coherent Cluster
- Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ.
δ = 1
The maximum coherent attribute sets define the search space for maximum coherent clusters.
10. Two-Way Pruning
δ = 1, nc = 3, nr = 3
MCAS
(o0,o2) → (a0,a1,a2)
(o1,o2) → (a0,a1,a2)
MCOS
(a0,a1) → (o0,o1,o2)
(a0,a2) → (o1,o2,o3)
(a1,a2) → (o1,o2,o4)
(a1,a2) → (o0,o2,o4)
11. Coherent Cluster
- Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s).
- Implementation: use a lexicographical tree to organize the object pairs and to generate all maximum coherent clusters with a single post-order traversal of the tree.
12.
assume δ = 1
a0,a1: (o0,o1), (o1,o2), (o0,o2)
a0,a2: (o1,o3), (o2,o3), (o1,o2), (o0,o2)
a1,a2: (o0,o4), (o1,o4), (o2,o4), (o1,o2), (o0,o2)
a2,a3: (o0,o1), (o1,o2), (o0,o2)
a0,a1,a2: (o1,o2), (o0,o2)
a0,a1,a2,a3: (o0,o2)
[Figure: lexicographical tree over attributes a0, a1, a2, a3 with the object pairs attached to its nodes]
13.
{o0,o2} × {a0,a1,a2,a3}
{o1,o2} × {a0,a1,a2}
{o0,o1,o2} × {a0,a1}
{o1,o2,o3} × {a0,a2}
{o0,o2,o4} × {a1,a2}
{o1,o2,o4} × {a1,a2}
{o0,o1,o2} × {a2,a3}
(o0,o2)
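The grouping-and-clique strategy can be sketched on the two-attribute groups of the example above. This is a minimal illustration that swaps the lexicographical tree for a plain dictionary and a basic Bron-Kerbosch clique enumeration; the function names and the `min_rows` filter are assumptions, not the implementation from the slides.

```python
def maximal_cliques(edges):
    """Enumerate the maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored vertices.
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

# Object pairs grouped by attribute set, copied from the example above.
pairs_by_attrs = {
    ("a0", "a1"): [("o0", "o1"), ("o1", "o2"), ("o0", "o2")],
    ("a0", "a2"): [("o1", "o3"), ("o2", "o3"), ("o1", "o2"), ("o0", "o2")],
    ("a1", "a2"): [("o0", "o4"), ("o1", "o4"), ("o2", "o4"), ("o1", "o2"), ("o0", "o2")],
    ("a2", "a3"): [("o0", "o1"), ("o1", "o2"), ("o0", "o2")],
}

min_rows = 3  # assumed minimum number of objects per cluster
clusters = {
    attrs: [c for c in maximal_cliques(pairs) if len(c) >= min_rows]
    for attrs, pairs in pairs_by_attrs.items()
}
for attrs, object_sets in clusters.items():
    print(attrs, object_sets)
```

Each clique is a set of objects in which every pair shares the attribute set, which is the condition for the corresponding submatrix to be a coherent cluster.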
14. Coherent Cluster
- High expressive power
- The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods.
- Efficient and highly scalable
- Wide applications
- Gene expression analysis
- Collaborative filtering
[Figure: subspace cluster; coherent cluster]
15. Remark
- Compared with Bicluster
- Separates noise from outliers well
- No random data insertion and replacement
- Produces the optimal solution
16. Definition of OP-Cluster
- Let I be a subset of genes in the database. Let J be a subset of conditions. We say <I, J> forms an Order Preserving Cluster (OP-Cluster) if one of the following relationships exists for any pair of conditions.
[Figure: expression levels over conditions A1, A2, A3, A4]
17. Problem Statement
- Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters. The significance is ensured by the minimal size thresholds nc and nr.
18. Conversion to Sequence Mining Problem
[Figure: expression levels over conditions A1, A2, A3, A4 and the corresponding sequence]
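The conversion can be sketched as sorting the condition labels of each gene by expression level. The sketch assumes ascending order and ignores the grouping of nearly equal values; the function name and the expression values are hypothetical.

```python
def to_sequence(levels, labels):
    """Order condition labels by increasing expression level (ties keep label order)."""
    order = sorted(range(len(levels)), key=lambda j: levels[j])
    return [labels[j] for j in order]

labels = ["A1", "A2", "A3", "A4"]
gene = [50.0, 20.0, 80.0, 35.0]  # hypothetical expression levels for one gene
print(to_sequence(gene, labels))
# prints ['A2', 'A4', 'A1', 'A3']
```

Two genes that rise and fall across the same conditions map to sequences sharing a common subsequence, so finding clusters becomes a search for subsequences shared by enough genes.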
19. Mining OP-Clusters: A Naïve Approach
- A naïve approach
- Enumerate all possible subsequences in a prefix tree.
- For each subsequence, collect all genes that contain the subsequence.
- Challenge
- The total number of distinct subsequences is
[Figure: a complete prefix tree with 4 items {a, b, c, d}]
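The naïve approach can be sketched by generating every ordered arrangement of every non-empty subset of the items and scanning all sequences for each one. The three input sequences and the function names are hypothetical.

```python
from itertools import permutations

def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same relative order."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def naive_support(sequences, items):
    """Map every candidate subsequence to the indices of the sequences containing it."""
    support = {}
    for k in range(1, len(items) + 1):
        for pattern in permutations(items, k):
            support[pattern] = [
                i for i, s in enumerate(sequences) if is_subsequence(pattern, s)
            ]
    return support

sequences = ["adbc", "abdc", "badc"]  # hypothetical gene sequences over items a, b, c, d
support = naive_support(sequences, "abcd")
print(len(support))              # number of candidates visited; prints 64
print(support[("a", "d", "c")])  # prints [0, 1, 2]
print(support[("c", "a")])       # prints []
```

Most candidates, such as ("c", "a") here, occur in no sequence at all, which is the cost the compact prefix tree is meant to avoid.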
20. Mining OP-Clusters: Prefix Tree
- Goal
- Build a compact prefix tree that includes exactly the subsequences that occur in the original database.
- Strategies
- Depth-first traversal
- Suffix concatenation: visit only subsequences that exist in the input sequences.
- Apriori property: visit only subsequences that are sufficiently supported, in order to derive longer subsequences.
[Figure: prefix tree with a root node and item nodes a, b, c, d annotated with sequence IDs 1, 2, 3]
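The depth-first traversal with Apriori pruning can be illustrated with a simplified pattern-growth search. It is a sketch under stated assumptions, not the prefix-tree construction with suffix concatenation from the slide: it assumes each item appears at most once per sequence, and the function name, `min_rows`, and `min_cols` are hypothetical.

```python
def mine_op_patterns(sequences, min_rows, min_cols):
    """Grow subsequences depth-first, extending only sufficiently supported prefixes."""
    # Position of every item in every sequence, for quick order checks.
    positions = [{item: i for i, item in enumerate(s)} for s in sequences]
    items = sorted(set().union(*sequences))
    results = {}

    def grow(prefix, rows):
        if len(prefix) >= min_cols:
            results[tuple(prefix)] = rows
        for item in items:
            if item in prefix:
                continue
            if prefix:
                last = prefix[-1]
                # Keep rows where the new item comes after the end of the prefix.
                new_rows = [
                    r for r in rows
                    if item in positions[r] and positions[r][item] > positions[r][last]
                ]
            else:
                new_rows = [r for r in rows if item in positions[r]]
            # Apriori pruning: an unsupported prefix cannot have a supported extension.
            if len(new_rows) >= min_rows:
                grow(prefix + [item], new_rows)

    grow([], list(range(len(sequences))))
    return results

sequences = ["adbc", "abdc", "badc"]  # hypothetical gene sequences
print(mine_op_patterns(sequences, min_rows=3, min_cols=3))
# prints {('a', 'd', 'c'): [0, 1, 2]}
```

Here `min_rows` and `min_cols` play the role of the size thresholds nr and nc: a pattern is reported only if enough sequences support it and it spans enough items.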
21. References
- J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
- H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
- S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
- J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM'03.