BiClustering II - PowerPoint PPT Presentation

About This Presentation
Title:

BiClustering II

Description:

Each object or attribute in a coherent cluster may bear some relative bias (that ... Strategy: find the maximum coherent attribute sets for each pair of objects with ... – PowerPoint PPT presentation

Slides: 22
Provided by: anselmo9

Transcript and Presenter's Notes



1
Bi-Clustering II
  • COMP 790-90 Seminar
  • Spring 2009

2
Coherent Cluster
Goal: accommodate noise but not outliers.
3
Coherent Cluster
  • Coherent cluster
  • Subspace clustering
  • pair-wise disparity
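  • For a 2×2 (sub)matrix consisting of objects x,
    y and attributes a, b:
    $D = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$,
    where $(d_{xa} - d_{ya})$ is the mutual bias of attribute a
    and $(d_{xb} - d_{yb})$ is the mutual bias of attribute b.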
4
Coherent Cluster
  • A 2×2 (sub)matrix is a δ-coherent cluster if its
    D value is less than or equal to δ.
  • An m×n matrix X is a δ-coherent cluster if every
    2×2 submatrix of X is a δ-coherent cluster.
  • A δ-coherent cluster is a maximum δ-coherent
    cluster if it is not a submatrix of any other
    δ-coherent cluster.
  • Objective: given a data matrix and a threshold δ,
    find all maximum δ-coherent clusters.
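
The definitions above can be made concrete with a short Python sketch. The function names and the sample matrix are illustrative assumptions, and D is assumed to be the absolute difference of the two mutual biases, |(d_xa - d_ya) - (d_xb - d_yb)|. The brute-force check only illustrates the definition; it is not meant to be efficient.

```python
from itertools import combinations

def d_value(m, x, y, a, b):
    """D of the 2x2 submatrix with rows x, y and columns a, b (assumed form)."""
    return abs((m[x][a] - m[y][a]) - (m[x][b] - m[y][b]))

def is_delta_coherent(m, rows, cols, delta):
    """True if every 2x2 submatrix of the selected rows and columns has D <= delta."""
    return all(
        d_value(m, x, y, a, b) <= delta
        for x, y in combinations(rows, 2)
        for a, b in combinations(cols, 2)
    )

# Hypothetical data: rows 0 and 1 differ by a constant shift, row 2 does not follow.
data = [
    [1.0, 5.0, 3.0],
    [4.0, 8.0, 6.0],
    [2.0, 2.0, 9.0],
]
print(is_delta_coherent(data, [0, 1], [0, 1, 2], delta=0.5))     # True
print(is_delta_coherent(data, [0, 1, 2], [0, 1, 2], delta=0.5))  # False
```

Rows 0 and 1 are far apart in value yet perfectly coherent, which is the situation distance-based clustering tends to miss.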

5
Coherent Cluster
  • Challenges
  • Finding subspace clustering based on distance
    itself is already a difficult task due to the
    curse of dimensionality.
  • The (sub)set of objects and the (sub)set of
    attributes that form a cluster are unknown in
    advance and may not be adjacent to each other in
    the data matrix.
  • The actual values of the objects in a coherent
    cluster may be far apart from each other.
  • Each object or attribute in a coherent cluster
    may bear some relative bias (unknown in
    advance), and such bias may be local to the
    coherent cluster.

6
Coherent Cluster
  • Compute the maximum coherent attribute sets for
    each pair of objects
  • Two-way pruning
  • Construct the lexicographical tree
  • Post-order traverse the tree to find maximum
    coherent clusters
7
Coherent Cluster
  • Observation: given a pair of objects o1, o2 and
    a (sub)set of attributes a1, a2, ..., ak, the 2×k
    submatrix is a δ-coherent cluster iff the mutual
    biases $(d_{o_1 a_i} - d_{o_2 a_i})$ of the attributes ai
    differ from each other by at most δ.

If δ = 1.5, then {a1, a2, a3, a4, a5} is a coherent
attribute set (CAS) of (o1, o2).
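
A minimal sketch of this test for one pair of objects, assuming the mutual bias of attribute a is the difference between the two objects' values on a. The values of `o1` and `o2` are hypothetical and are not the ones from the slide's table. Checking that the largest and smallest bias differ by at most δ is equivalent to checking every pair of attributes.

```python
def is_coherent_attribute_set(o1, o2, attrs, delta):
    """True if the mutual biases o1[a] - o2[a] over attrs differ by at most delta."""
    biases = [o1[a] - o2[a] for a in attrs]
    return max(biases) - min(biases) <= delta

# Hypothetical objects over six attributes (indices 0..5).
o1 = [3.0, 4.0, 2.0, 0.0, 6.0, 9.0]
o2 = [1.0, 2.5, 0.5, -1.0, 4.0, 2.0]
# Mutual biases: 2.0, 1.5, 1.5, 1.0, 2.0, 7.0

print(is_coherent_attribute_set(o1, o2, [0, 1, 2, 3, 4], delta=1.5))     # True
print(is_coherent_attribute_set(o1, o2, [0, 1, 2, 3, 4, 5], delta=1.5))  # False
```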
8
Coherent Cluster
  • Observation: given a subset of objects o1, o2,
    ..., ol and a subset of attributes a1, a2, ...,
    ak, the l×k submatrix is a δ-coherent cluster
    iff {a1, a2, ..., ak} is a coherent attribute set
    for every pair of objects (oi, oj) where 1 ≤ i, j
    ≤ l.

9
Coherent Cluster
  • Strategy: find the maximum coherent attribute
    sets for each pair of objects with respect to the
    given threshold δ.

δ = 1
The maximum coherent attribute sets define the
search space for maximum coherent clusters.
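
One way to enumerate the maximum coherent attribute sets of a pair, assumed here rather than taken from the slides, is to sort the attributes by mutual bias and slide a window whose spread stays within δ. The function name, the `min_size` parameter, and the data are hypothetical.

```python
def maximum_coherent_attribute_sets(o1, o2, delta, min_size=2):
    """Return the maximal attribute sets whose mutual biases span at most delta."""
    # Sort attributes by mutual bias so that coherent sets become contiguous runs.
    biases = sorted((v1 - v2, a) for a, (v1, v2) in enumerate(zip(o1, o2)))
    n = len(biases)
    result = []
    end = 0       # exclusive right edge of the current window
    prev_end = 0  # right edge of the previous window
    for start in range(n):
        while end < n and biases[end][0] - biases[start][0] <= delta:
            end += 1
        # A window is maximal only if it reaches past the previous window.
        if end > prev_end and end - start >= min_size:
            result.append(sorted(a for _, a in biases[start:end]))
        prev_end = end
    return result

# Hypothetical pair of objects; mutual biases are 2.0, 1.5, 1.5, 1.0, 2.0, 7.0.
o1 = [3.0, 4.0, 2.0, 0.0, 6.0, 9.0]
o2 = [1.0, 2.5, 0.5, -1.0, 4.0, 2.0]
print(maximum_coherent_attribute_sets(o1, o2, delta=0.5))
# [[1, 2, 3], [0, 1, 2, 4]]
```

Attributes 1 and 2 appear in both sets, so the sets of a single pair may overlap.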
10
Two Way Pruning
δ = 1, nc = 3, nr = 3
MCAS:
(o0,o2) → (a0,a1,a2)
(o1,o2) → (a0,a1,a2)
MCOS:
(a0,a1) → (o0,o1,o2)
(a0,a2) → (o1,o2,o3)
(a1,a2) → (o1,o2,o4)
(a1,a2) → (o0,o2,o4)
11
Coherent Cluster
  • Strategy: group object pairs by their CAS and,
    for each group, find the maximum clique(s).
  • Implementation: use a lexicographical tree to
    organize the object pairs and to generate all
    maximum coherent clusters with a single
    post-order traversal of the tree.
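
The clique step can be sketched as follows. This is not the lexicographical-tree implementation; it swaps in a plain Bron-Kerbosch enumeration of maximal cliques over the object pairs that share one attribute set. The function name, the `min_objects` parameter, and the pair list are illustrative assumptions.

```python
from collections import defaultdict

def maximal_object_cliques(pairs, min_objects=3):
    """Maximal sets of objects in which every two objects form a listed pair."""
    adj = defaultdict(set)
    for u, v in pairs:
        adj[u].add(v)
        adj[v].add(u)
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored vertices
        if not p and not x:
            if len(r) >= min_objects:
                cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Hypothetical object pairs that all share the coherent attribute set {a1, a2}.
pairs = [("o0", "o4"), ("o1", "o4"), ("o2", "o4"), ("o1", "o2"), ("o0", "o2")]
for clique in sorted(maximal_object_cliques(pairs), key=sorted):
    print(sorted(clique))
# ['o0', 'o2', 'o4']
# ['o1', 'o2', 'o4']
```

Each clique, together with the shared attribute set, is a candidate coherent cluster because every pair of its objects is coherent on those attributes.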

12
Assume δ = 1. Object pairs grouped by coherent attribute set:
a0,a1: (o0,o1), (o1,o2), (o0,o2)
a0,a2: (o1,o3), (o2,o3), (o1,o2), (o0,o2)
a1,a2: (o0,o4), (o1,o4), (o2,o4), (o1,o2), (o0,o2)
a2,a3: (o0,o1), (o1,o2), (o0,o2)
a0,a1,a2: (o1,o2), (o0,o2)
a0,a1,a2,a3: (o0,o2)
[Figure: lexicographical tree over attributes a0 to a3, with the object pairs attached to its nodes]
13
o0,o2 → a0,a1,a2,a3
o1,o2 → a0,a1,a2
o0,o1,o2 → a0,a1
o1,o2,o3 → a0,a2
o0,o2,o4 → a1,a2
o1,o2,o4 → a1,a2
o0,o1,o2 → a2,a3
14
Coherent Cluster
  • High expressive power
  • The coherent cluster can capture many interesting
    and meaningful patterns overlooked by previous
    clustering methods.
  • Efficient and highly scalable
  • Wide applications
  • Gene expression analysis
  • Collaborative filtering

subspace cluster
coherent cluster
15
Remark
  • Compared to bicluster
  • Separates noise from outliers well
  • No random data insertion and replacement
  • Produces the optimal solution

16
Definition of OP-Cluster
  • Let I be a subset of genes in the database.
    Let J be a subset of conditions. We say <I, J>
    forms an Order Preserving Cluster
    (OP-Cluster) if one of the following
    relationships exists for any pair of conditions.

[Figure: expression levels across conditions A1, A2, A3, A4]

when
17
Problem Statement
  • Given a gene expression matrix, our goal is to
    find all the statistically significant
    OP-Clusters. Significance is ensured by the
    minimum size thresholds nc and nr.

18
Conversion to Sequence Mining Problem
[Figure: expression levels across conditions A1, A2, A3, A4 and the corresponding sequence]
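
A minimal sketch of the conversion, assuming each gene's row is turned into the sequence of condition labels ordered by increasing expression level, so that an order-preserving pattern becomes a common subsequence. Ties and any similarity threshold are ignored here, and the helper names and expression values are hypothetical.

```python
def row_to_sequence(row, labels):
    """Order the condition labels by increasing expression level."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    return [labels[j] for j in order]

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in the same relative order."""
    it = iter(sequence)
    return all(item in it for item in pattern)

labels = ["A1", "A2", "A3", "A4"]
genes = {  # hypothetical expression levels
    "g1": [2.0, 8.0, 5.0, 9.0],
    "g2": [1.0, 4.0, 3.0, 7.0],
    "g3": [6.0, 1.0, 3.0, 9.0],
}
sequences = {g: row_to_sequence(row, labels) for g, row in genes.items()}
print(sequences["g1"])  # ['A1', 'A3', 'A2', 'A4']
print(sequences["g3"])  # ['A2', 'A3', 'A1', 'A4']

pattern = ["A1", "A3", "A4"]
print([g for g, s in sequences.items() if is_subsequence(pattern, s)])  # ['g1', 'g2']
```

Genes g1 and g2 rise in the order A1, A3, A4 even though their absolute levels differ, so together with those three conditions they would form a candidate OP-Cluster under this simplified reading.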
19
Mining OP-Clusters: A Naïve Approach
  • A naïve approach
  • Enumerate all possible subsequences in a prefix
    tree.
  • For each subsequence, collect all genes that
    contain it.
  • Challenge
  • The total number of distinct subsequences is

[Figure: a complete prefix tree with 4 items a, b, c, d]
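
To give a sense of the scale of the naïve enumeration, the sketch below counts the non-root nodes of a complete prefix tree, assuming every node is a sequence of distinct items (so there are n!/(n-k)! sequences of length k). The function name is hypothetical, and the count is a standard combinatorial fact rather than a figure taken from the slide.

```python
import math

def count_subsequences(n):
    """Number of non-empty sequences of distinct items drawn from n items."""
    return sum(math.perm(n, k) for k in range(1, n + 1))

print(count_subsequences(4))   # 64
print(count_subsequences(10))  # 9864100
```

The count grows faster than n!, which is why enumerating the complete tree is impractical beyond a handful of conditions.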
20
Mining OP-Clusters: Prefix Tree
  • Goal
  • Build a compact prefix tree that includes only
    the subsequences that occur in the original
    database.
  • Strategies
  • Depth-first traversal
  • Suffix concatenation: visit only subsequences
    that exist in the input sequences.
  • Apriori property: extend only subsequences that
    are sufficiently supported when deriving longer
    subsequences.

[Figure: compact prefix tree rooted at Root; nodes carry an item followed by sequence IDs, with labels such as a1,2, b3, c1,2,3 and d1,3]
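
The depth-first traversal with Apriori pruning can be sketched as below. This is a simplified stand-in, not the compact prefix tree with suffix concatenation from the slide: it grows a prefix one item at a time and abandons a branch as soon as fewer than `min_support` sequences contain it. All names and the input sequences are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in the same relative order."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def mine_supported_subsequences(sequences, min_support, min_length=2):
    """Map each sufficiently supported subsequence to the IDs of its supporters."""
    items = sorted({item for seq in sequences for item in seq})
    results = {}

    def extend(prefix, supporters):
        for item in items:
            if item in prefix:
                continue
            candidate = prefix + (item,)
            # Only sequences that support the prefix can support the candidate.
            ids = [i for i in supporters if is_subsequence(candidate, sequences[i])]
            if len(ids) < min_support:
                continue  # Apriori pruning: no longer sequence can recover support
            if len(candidate) >= min_length:
                results[candidate] = ids
            extend(candidate, ids)

    extend((), list(range(len(sequences))))
    return results

# Hypothetical input: three genes, each given as its ordered condition sequence.
sequences = [list("adbc"), list("abdc"), list("badc")]
print(mine_supported_subsequences(sequences, min_support=3, min_length=3))
# {('a', 'd', 'c'): [0, 1, 2]}
```

Here `min_support` plays the role of nr (minimum number of genes) and `min_length` the role of nc (minimum number of conditions), under the assumption that each supported subsequence corresponds to one candidate OP-Cluster.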
21
References
  • J. Yang, W. Wang, H. Wang, P. Yu, "Delta-cluster:
    capturing subspace correlation in a large data
    set," Proceedings of the 18th IEEE International
    Conference on Data Engineering (ICDE), pp.
    517-528, 2002.
  • H. Wang, W. Wang, J. Yang, P. Yu, "Clustering by
    pattern similarity in large data sets," to appear
    in Proceedings of the ACM SIGMOD International
    Conference on Management of Data (SIGMOD), 2002.
  • S. Yoon, C. Nardini, L. Benini, G. De
    Micheli, "Enhanced pClustering and its
    applications to gene expression data,"
    Bioinformatics and Bioengineering, 2004.
  • J. Liu and W. Wang, "OP-Cluster: clustering by
    tendency in high dimensional space," ICDM 2003.