BiClustering II - PowerPoint PPT Presentation


Provided by: anselmo9


Transcript and Presenter's Notes

Title: BiClustering II

1
Bi-Clustering II

COMP 790-90 Seminar
Spring 2009

2
Coherent Cluster
We want to accommodate noise but not outliers.
3
Coherent Cluster

Coherent cluster
Subspace clustering
Pair-wise disparity
For a 2×2 (sub)matrix consisting of objects x,
y and attributes a, b:

$D = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$

where $d_{xa} - d_{ya}$ is the mutual bias of attribute a
and $d_{xb} - d_{yb}$ is the mutual bias of attribute b.
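As a minimal sketch of the pair-wise disparity, assuming D is the absolute difference between the two mutual biases (the reading suggested by the mutual-bias description on the later slides), the computation could look like this. The function names and the sample values are illustrative and not taken from the slides.

```python
def mutual_bias(matrix, x, y, a):
    """Mutual bias of attribute a between objects x and y: d_xa - d_ya."""
    return matrix[x][a] - matrix[y][a]


def pairwise_disparity(matrix, x, y, a, b):
    """D value of the 2x2 submatrix (objects x, y; attributes a, b).

    Assumed definition: |(d_xa - d_ya) - (d_xb - d_yb)|.
    """
    return abs(mutual_bias(matrix, x, y, a) - mutual_bias(matrix, x, y, b))


# Hypothetical data: rows are objects, columns are attributes.
data = [
    [1.0, 5.0],
    [3.0, 7.5],
]
d_value = pairwise_disparity(data, 0, 1, 0, 1)
print(d_value)
```

A small D value means the two objects differ by nearly the same offset on both attributes, even if their raw values are far apart.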
4
Coherent Cluster

A 2×2 (sub)matrix is a δ-coherent cluster if its
D value is less than or equal to δ.
An m×n matrix X is a δ-coherent cluster if every
2×2 submatrix of X is a δ-coherent cluster.
A δ-coherent cluster is a maximum δ-coherent
cluster if it is not a submatrix of any other
δ-coherent cluster.
Objective: given a data matrix and a threshold δ,
find all maximum δ-coherent clusters.
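The definition can be checked by brute force. The sketch below assumes the D value of a 2×2 submatrix is the absolute difference of the two mutual biases, and tests every pair of objects against every pair of attributes. The function name and the sample matrix are hypothetical, and this is only a direct check of the definition, not the mining algorithm.

```python
from itertools import combinations


def is_delta_coherent(matrix, objects, attributes, delta):
    """Return True if every 2x2 submatrix has a D value <= delta."""
    for x, y in combinations(objects, 2):
        for a, b in combinations(attributes, 2):
            bias_a = matrix[x][a] - matrix[y][a]
            bias_b = matrix[x][b] - matrix[y][b]
            if abs(bias_a - bias_b) > delta:
                return False
    return True


# Hypothetical data: rows are objects, columns are attributes.
data = [
    [1, 4, 2],
    [3, 6, 4],
    [0, 3, 9],
]
print(is_delta_coherent(data, [0, 1], [0, 1, 2], 0))      # objects 0 and 1 differ by a constant shift
print(is_delta_coherent(data, [0, 1, 2], [0, 1, 2], 1))   # object 2 breaks the pattern on attribute 2
```

The check costs one test per pair of objects and pair of attributes, which is why the later slides look for a cheaper way to organize the search.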

5
Coherent Cluster

Challenges
Finding subspace clusters based on distance
is itself already a difficult task due to the
curse of dimensionality.
The (sub)set of objects and the (sub)set of
attributes that form a cluster are unknown in
advance and may not be adjacent to each other in
the data matrix.
The actual values of the objects in a coherent
cluster may be far apart from each other.
Each object or attribute in a coherent cluster
may bear some relative bias (which is unknown in
advance), and such bias may be local to the
coherent cluster.

6
Coherent Cluster
Compute the maximum coherent attribute sets for
each pair of objects
Two-way Pruning
Construct the lexicographical tree
Post-order traverse the tree to find maximum
coherent clusters
7
Coherent Cluster

Observation: given a pair of objects o1, o2 and
a (sub)set of attributes {a1, a2, ..., ak}, the 2×k
submatrix is a δ-coherent cluster iff the mutual
biases $d_{o_1 a_i} - d_{o_2 a_i}$ of the attributes ai
do not differ from each other by more than δ.

If δ = 1.5, then {a1, a2, a3, a4, a5} is a coherent
attribute set (CAS) of (o1, o2).
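The observation reduces the 2×k test to a range check: the attribute set is coherent when the largest and smallest mutual bias differ by at most δ. The sketch below illustrates this. The two object rows are made-up values (the figure for this slide did not survive), and the function name is an assumption.

```python
def is_coherent_attribute_set(row1, row2, attributes, delta):
    """Check whether `attributes` is a coherent attribute set for two objects.

    The mutual bias of attribute a is row1[a] - row2[a]; the set is coherent
    when no two mutual biases differ by more than delta.
    """
    biases = [row1[a] - row2[a] for a in attributes]
    return max(biases) - min(biases) <= delta


# Hypothetical objects over five attributes a1..a5 (indices 0..4).
o1 = [3.0, 4.0, 6.0, 5.0, 7.0]
o2 = [1.0, 2.5, 4.5, 3.0, 6.0]
print(is_coherent_attribute_set(o1, o2, range(5), 1.5))
print(is_coherent_attribute_set(o1, o2, range(5), 0.5))
```

Comparing the maximum and minimum bias is equivalent to comparing every pair of biases, so one pass over the attributes is enough.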
8
Coherent Cluster

Observation: given a subset of objects {o1, o2,
..., ol} and a subset of attributes {a1, a2, ...,
ak}, the l×k submatrix is a δ-coherent cluster
iff {a1, a2, ..., ak} is a coherent attribute set
for every pair of objects (oi, oj) where 1 ≤ i, j
≤ l.

9
Coherent Cluster

Strategy: find the maximum coherent attribute
sets for each pair of objects with respect to the
given threshold δ.

δ = 1
The maximum coherent attribute sets define the
search space for maximum coherent clusters.
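One possible way to find the maximum coherent attribute sets of a pair of objects is to sort the attributes by mutual bias and slide a window whose bias range stays within δ. The slide does not spell out the procedure, so the sketch below is an assumption, and the function name, the `min_size` parameter, and the sample rows are hypothetical.

```python
def maximum_coherent_attribute_sets(row1, row2, delta, min_size=2):
    """Return the attribute sets that are coherent and cannot be extended.

    Attributes are sorted by mutual bias (row1[a] - row2[a]); every maximal
    run of sorted attributes whose bias range is <= delta is reported.
    """
    biases = sorted((row1[a] - row2[a], a) for a in range(len(row1)))
    n = len(biases)
    result = []
    last_end = -1  # right end of the window found for the previous start
    for start in range(n):
        end = start
        # Grow the window while the bias range stays within delta.
        while end + 1 < n and biases[end + 1][0] - biases[start][0] <= delta:
            end += 1
        # A window is maximal only if it reaches past the previous one.
        if end > last_end:
            if end - start + 1 >= min_size:
                result.append(sorted(a for _, a in biases[start:end + 1]))
            last_end = end
    return result


# Hypothetical objects over five attributes (indices 0..4).
o1 = [1, 4, 5, 9, 2]
o2 = [0, 1, 2, 2, 1]
print(maximum_coherent_attribute_sets(o1, o2, 1))
print(maximum_coherent_attribute_sets(o1, o2, 2))
```

Because the window ends never move left as the start advances, each maximal set is reported exactly once.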
10
Two Way Pruning
δ = 1, nc = 3, nr = 3
MCAS:
(o0,o2) → (a0,a1,a2)
(o1,o2) → (a0,a1,a2)
MCOS:
(a0,a1) → (o0,o1,o2)
(a0,a2) → (o1,o2,o3)
(a1,a2) → (o1,o2,o4)
(a1,a2) → (o0,o2,o4)
11
Coherent Cluster

Strategy: group object pairs by their CAS and,
for each group, find the maximum clique(s).
Implementation: use a lexicographical tree to
organize the object pairs and to generate all
maximum coherent clusters with a single
post-order traversal of the tree.

12
Assume δ = 1.

Attribute set: object pairs
a0,a1: (o0,o1), (o1,o2), (o0,o2)
a0,a2: (o1,o3), (o2,o3), (o1,o2), (o0,o2)
a1,a2: (o0,o4), (o1,o4), (o2,o4), (o1,o2), (o0,o2)
a2,a3: (o0,o1), (o1,o2), (o0,o2)
a0,a1,a2: (o1,o2), (o0,o2)
a0,a1,a2,a3: (o0,o2)

[Figure: lexicographical tree over attributes a0 to a3, with the object pairs attached to the node of their attribute set]
13
{o0,o2} → {a0,a1,a2,a3}
{o1,o2} → {a0,a1,a2}
{o0,o1,o2} → {a0,a1}
{o1,o2,o3} → {a0,a2}
{o0,o2,o4} → {a1,a2}
{o1,o2,o4} → {a1,a2}
{o0,o1,o2} → {a2,a3}
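The grouping-and-clique strategy can be illustrated on the example above. The sketch below is not the lexicographical-tree implementation from the slides: it simply treats the object pairs of each attribute set as edges of a graph and lists the maximal cliques with a plain Bron-Kerbosch search. The input dictionary is transcribed from the two-attribute rows of the example, and all names plus the `min_objects` parameter are hypothetical.

```python
from collections import defaultdict


def maximal_cliques(edges):
    """List all maximal cliques of an undirected graph (Bron-Kerbosch, no pivoting)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


def coherent_clusters(pairs_by_attribute_set, min_objects):
    """Map each attribute set to the object cliques with at least min_objects members."""
    found = set()
    for attributes, pairs in pairs_by_attribute_set.items():
        for clique in maximal_cliques(pairs):
            if len(clique) >= min_objects:
                found.add((tuple(sorted(clique)), attributes))
    return found


# Object pairs (by object index) grouped by attribute set, from the example.
pairs_by_attribute_set = {
    ("a0", "a1"): [(0, 1), (1, 2), (0, 2)],
    ("a0", "a2"): [(1, 3), (2, 3), (1, 2), (0, 2)],
    ("a1", "a2"): [(0, 4), (1, 4), (2, 4), (1, 2), (0, 2)],
    ("a2", "a3"): [(0, 1), (1, 2), (0, 2)],
}
clusters = coherent_clusters(pairs_by_attribute_set, min_objects=3)
for objects, attributes in sorted(clusters):
    print(objects, attributes)
```

With a minimum of three objects, this reproduces the five three-object clusters listed in the example.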
14
Coherent Cluster

High expressive power
The coherent cluster can capture many interesting
and meaningful patterns overlooked by previous
clustering methods.
Efficient and highly scalable
Wide applications
Gene expression analysis
Collaborative filtering

subspace cluster
coherent cluster
15
Remark

Compared to Bicluster:
Separates noise and outliers well
No random data insertion or replacement
Produces the optimal solution

16
Definition of OP-Cluster

Let I be a subset of genes in the database.
Let J be a subset of conditions. We say <I, J>
forms an Order Preserving Cluster
(OP-Cluster) if one
of the following relationships exists for any
pair of conditions.

[Figure: expression levels across conditions A1, A2, A3, A4]
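The relationships themselves did not survive in this transcript. As a rough sketch under a simplifying assumption, namely that for any pair of conditions all genes in I must agree on which condition has the higher expression level (or all tie), an order-preservation check could look like this. The function names and the sample matrix are hypothetical, and the published definition may treat near-equal values differently.

```python
from itertools import combinations


def order_sign(value_a, value_b):
    """Return -1, 0, or 1 depending on how value_a compares to value_b."""
    return (value_a > value_b) - (value_a < value_b)


def is_order_preserving(matrix, genes, conditions):
    """Assumed test: every pair of conditions is ordered the same way by all genes."""
    for a, b in combinations(conditions, 2):
        signs = {order_sign(matrix[g][a], matrix[g][b]) for g in genes}
        if len(signs) > 1:
            return False
    return True


# Hypothetical expression matrix: rows are genes, columns are conditions A1..A4.
expression = [
    [10, 30, 20, 40],
    [1, 8, 5, 9],
    [7, 2, 5, 9],
]
print(is_order_preserving(expression, [0, 1], [0, 1, 2, 3]))
print(is_order_preserving(expression, [0, 1, 2], [0, 1, 2, 3]))
```

Genes 0 and 1 rise and fall together even though their magnitudes differ, which is the kind of tendency an OP-Cluster is meant to capture.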
17
Problem Statement

Given a gene expression matrix, our goal is to
find all the statistically significant
OP-Clusters. The significance is ensured by the
minimum size thresholds nc and nr.

18
Conversion to Sequence Mining Problem
[Figure: expression levels across conditions A1, A2, A3, A4 and the corresponding sequence]
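The conversion can be sketched as sorting the condition labels of each gene by expression level, so that each row of the matrix becomes a sequence of conditions. The function name and the sample values are illustrative, and ties are broken by column order here, which is an assumption.

```python
def to_sequence(expression_row, labels):
    """Order the condition labels by increasing expression level."""
    order = sorted(range(len(expression_row)), key=lambda i: expression_row[i])
    return [labels[i] for i in order]


# Hypothetical expression levels of one gene under four conditions.
labels = ["A1", "A2", "A3", "A4"]
sequence = to_sequence([30, 10, 40, 20], labels)
print(sequence)
```

Two genes then follow the same tendency on a set of conditions when those conditions appear in the same relative order in both sequences, which turns the clustering problem into finding common subsequences.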
19
Mining OP-Clusters: A naïve approach

A naïve approach
Enumerate all possible subsequences in a prefix
tree.
For each subsequence, collect all genes that
contain it.
Challenge
The total number of distinct subsequences is

[Figure: a complete prefix tree with 4 items a, b, c, d]
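To get a feel for the size of a complete prefix tree, the sketch below counts its nodes under the assumption that each item occurs at most once per subsequence, so the tree holds every ordered arrangement of every non-empty subset of items. That gives the sum of n!/(n-k)! for k from 1 to n. The function names are hypothetical, and the count is offered as an illustration rather than as the expression on the original slide.

```python
from itertools import permutations
from math import perm


def count_subsequences(n_items):
    """Number of non-empty ordered subsequences without repeated items."""
    return sum(perm(n_items, k) for k in range(1, n_items + 1))


def enumerate_subsequences(items):
    """Explicitly list every node of the complete prefix tree (root excluded)."""
    return [p for k in range(1, len(items) + 1) for p in permutations(items, k)]


total = count_subsequences(4)
listed = enumerate_subsequences("abcd")
print(total, len(listed))
print(count_subsequences(10))
```

The count grows faster than n!, so enumerating the complete tree is impractical for more than a handful of conditions.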
20
Mining OP-Clusters Prefix Tree

Goal
Build a compact prefix tree that includes only
the subsequences occurring in the original
database.
Strategies
Depth-first traversal
Suffix concatenation: visit only subsequences
that exist in the input sequences.
Apriori property: visit only subsequences that are
sufficiently supported in order to derive longer
subsequences.

[Figure: compact prefix tree over items a, b, c, d; nodes are annotated with sequence IDs, for example a:1,2 and d:1,2,3]
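The two pruning ideas can be sketched with a depth-first search that only extends a subsequence with items that actually follow it in some input sequence, and only keeps extensions with enough supporting sequences. This is a simplified projection-based sketch, not the compact prefix tree from the slides. It assumes each item appears at most once per sequence, and the function name, parameters, and sample sequences are hypothetical.

```python
from collections import defaultdict


def mine_subsequences(sequences, min_support, min_length):
    """Return {subsequence: supporting sequence indices} for frequent subsequences.

    min_support plays the role of the gene threshold and min_length the role
    of the condition threshold.
    """
    results = {}

    def grow(prefix, projections):
        # projections holds (sequence index, position right after the prefix).
        if len(prefix) >= min_length:
            results[tuple(prefix)] = sorted(i for i, _ in projections)
        extensions = defaultdict(list)
        for seq_index, start in projections:
            sequence = sequences[seq_index]
            # Only items that really follow the prefix are considered.
            for position in range(start, len(sequence)):
                extensions[sequence[position]].append((seq_index, position + 1))
        for item in sorted(extensions):
            # Apriori-style pruning: drop extensions with too little support.
            if len(extensions[item]) >= min_support:
                grow(prefix + [item], extensions[item])

    grow([], [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical input: three genes, each a sequence of conditions a..d.
sequences = ["abdc", "adbc", "badc"]
common = mine_subsequences(sequences, min_support=3, min_length=3)
partial = mine_subsequences(sequences, min_support=2, min_length=3)
print(common)
```

Each reported subsequence, together with its supporting sequences, corresponds to a candidate OP-Cluster: the conditions in the subsequence and the genes that contain it.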
21
References

J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
Y. Sungroh, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM'03.