1
Subspace Clustering/Biclustering
  • CS 685 Special Topics in Data Mining
  • Spring 2008
  • Jinze Liu

2
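Data Mining: Clustering
K-means clustering minimizes
  $$\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$
where $\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$ is the centroid of cluster $C_k$.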
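As a minimal sketch of the quantity k-means minimizes, assuming the usual within-cluster sum of squared distances to each centroid, the objective can be evaluated for a given assignment as follows. The function name `kmeans_objective` and the toy data are hypothetical and not taken from the slides.

```python
import numpy as np

def kmeans_objective(X, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Hypothetical toy data: two tight groups on a line.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
print(kmeans_objective(X, labels))  # 4.0
```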
3
Clustering by Pattern Similarity (p-Clustering)
  • The micro-array raw data shows 3 genes and
    their values in a multi-dimensional space
  • Parallel Coordinates Plots
  • Difficult to find their patterns
  • Non-traditional clustering

4
Clusters Are Clear After Projection
5
Motivation
  • E-Commerce collaborative filtering

Movie 1 Movie 2 Movie 3 Movie 4 Movie 5 Movie 6 Movie 7
Viewer 1 1 2 4 3 5
Viewer 2 4 6 7 1
Viewer 3 2 3 4 6 3
Viewer 4 3 4 5 7
Viewer 5 5 5 3 4
6
Motivation
7
Motivation
8
Motivation
9
Gene Expression Data
10
Biclustering of Gene Expression Data
  • Genes not regulated under all conditions
  • Genes regulated by multiple factors/processes
    concurrently
  • Key to determine function of genes
  • Key to determine classification of conditions

11
Motivation
  • DNA microarray analysis

         CH1I   CH1B   CH1D   CH2I   CH2B
CTFC3    4392    284   4108    280    228
VPS8      401    281    120    275    298
EFB1      318    280     37    277    215
SSA1      401    292    109    580    238
FUN14    2857    285   2576    271    226
SP07      228    290     48    285    224
MDM10     538    272    266    277    236
CYS3      322    288     41    278    219
DEP1      312    272     40    273    232
NTG1      329    296     33    274    228
12
Motivation
13
Motivation
  • Strong coherence is exhibited by the selected objects
    on the selected attributes.
  • They are not necessarily close to each other but
    rather bear a constant shift.
  • Object/attribute bias
  • Bi-cluster
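
The constant shift can be checked directly on the DNA microarray table from the earlier Motivation slide. The sketch below takes the genes VPS8, EFB1 and CYS3 under the conditions CH1I, CH1D and CH2B (values copied from that table) and shows that each gene differs from VPS8 by the same amount under every selected condition. The variable names are illustrative only.

```python
import numpy as np

# Expression values of three genes under the conditions CH1I, CH1D, CH2B,
# copied from the DNA microarray table in the deck.
genes = ["VPS8", "EFB1", "CYS3"]
sub = np.array([
    [401, 120, 298],
    [318,  37, 215],
    [322,  41, 219],
])

# Shift of every gene relative to the first gene, per condition.
shifts = sub - sub[0]
for name, row in zip(genes, shifts):
    print(name, row.tolist())

# A constant shift means each row of `shifts` repeats a single value.
is_constant_shift = bool((shifts == shifts[:, [0]]).all())
print(is_constant_shift)  # True
```

The raw values are far apart (401 versus 37, for example), yet the rows are identical up to an offset of 83 or 79.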

14
Challenges
  • The set of objects and the set of attributes are
    usually unknown.
  • Different objects/attributes may possess
    different biases, and such biases
  • may be local to the set of selected
    objects/attributes
  • are usually unknown in advance
  • The data may have many unspecified entries

15
What's Biclustering?
  • Given an n x m matrix A, find a set of
    submatrices B_k such that the contents of each
    B_k follow a desired pattern
  • Row/column order need not be consistent between
    different B_k's

16
Bipartite Graphs
  • The matrix can be thought of as a graph: rows are
    one set of vertices L, columns are another set R
  • Edges are weighted by the corresponding entries
    in the matrix. If all weights are binary,
    biclustering becomes biclique finding
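
For the binary case, a selection of rows and columns is a biclique exactly when the corresponding submatrix contains only ones. A minimal sketch of that check, using a hypothetical matrix and a hypothetical helper `is_biclique`:

```python
import numpy as np

def is_biclique(A, rows, cols):
    """True if every selected (row, column) pair is an edge, i.e. an entry of 1."""
    return bool(A[np.ix_(rows, cols)].all())

# Hypothetical binary matrix: rows are the vertex set L, columns the vertex set R.
A = np.array([
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
])
print(is_biclique(A, [0, 2], [0, 2]))  # True
print(is_biclique(A, [0, 2], [0, 3]))  # False
```

This only verifies a candidate; finding the largest biclique is the hard part.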

17
Bicluster Structures
18
Previous Work
  • Subspace clustering
  • Identifying a set of objects and a set of
    attributes such that the set of objects are
    physically close to each other on the subspace
    formed by the set of attributes.
  • Collaborative filtering Pearson R
  • Only considers global offset of each
    object/attribute.

19
bi-cluster
  • Consists of a (sub)set of objects and a (sub)set
    of attributes
  • Corresponds to a submatrix
  • Occupancy threshold α
  • Each object/attribute has to be filled by a
    certain percentage.
  • Volume: number of specified entries in the
    submatrix
  • Base: average value of each object/attribute (in
    the bi-cluster)

20
bi-cluster
            CH1I   CH1B   CH1D   CH2I   CH2B   Obj base
CTFC3
VPS8         401           120           298        273
EFB1         318            37           215        190
SSA1
FUN14
SP07
MDM10
CYS3         322            41           219        194
DEP1
NTG1
Attr base    347            66           244        219
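
The bases in this table can be reproduced with a few lines of NumPy. The sketch below assumes unspecified entries would be stored as `np.nan` (this particular bi-cluster has none), and the variable names are illustrative.

```python
import numpy as np

# Bi-cluster from the slide: genes VPS8, EFB1, CYS3 on CH1I, CH1D, CH2B.
# np.nan would mark an unspecified entry; this example has none.
B = np.array([
    [401.0, 120.0, 298.0],
    [318.0,  37.0, 215.0],
    [322.0,  41.0, 219.0],
])

specified = ~np.isnan(B)
volume = int(specified.sum())           # number of specified entries
obj_base = np.nanmean(B, axis=1)        # average per object (row)
attr_base = np.nanmean(B, axis=0)       # average per attribute (column)
cluster_base = float(np.nanmean(B))     # average over the whole bi-cluster
row_occupancy = specified.mean(axis=1)  # filled fraction per object

print(volume)              # 9
print(obj_base.tolist())   # [273.0, 190.0, 194.0]
print(attr_base.tolist())  # [347.0, 66.0, 244.0]
print(cluster_base)        # 219.0
```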
21
Bicluster Structures
22
Cheng and Church
  • Example
  • Correlation between any two columns = correlation
    between any two rows = 1.
  • $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$, where $a_{iJ}$ = mean of row i,
    $a_{Ij}$ = mean of column j, $a_{IJ}$ = mean of A.
  • Biological meaning: the genes have the same
    (amount of) response to the conditions.

Background: 5      Col 0 (effect 1)   Col 1 (effect 3)   Col 2 (effect 2)
Row 0 (effect 2)           8                 10                  9
Row 1 (effect 4)          10                 12                 11
Row 2 (effect 1)           7                  9                  8
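
The example can be verified numerically. The sketch below rebuilds the matrix as background plus row effect plus column effect, using the numbers from the slide, and then checks that every entry equals its row mean plus its column mean minus the overall mean. The variable names are illustrative.

```python
import numpy as np

background = 5
row_effect = np.array([2, 4, 1])
col_effect = np.array([1, 3, 2])

# Build the matrix as background + row effect + column effect.
A = background + row_effect[:, None] + col_effect[None, :]
print(A.tolist())  # [[8, 10, 9], [10, 12, 11], [7, 9, 8]]

row_mean = A.mean(axis=1, keepdims=True)  # mean of each row
col_mean = A.mean(axis=0, keepdims=True)  # mean of each column
all_mean = A.mean()                       # mean of the whole matrix

# In a perfect bicluster every entry equals row mean + column mean - overall mean.
is_perfect = bool(np.allclose(A, row_mean + col_mean - all_mean))
print(is_perfect)  # True
```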
23
bi-cluster
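  • Perfect δ-cluster
  • Imperfect δ-cluster
  • Residue: $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$, where $d_{iJ}$ is
    the base of object i, $d_{Ij}$ the base of attribute j, and $d_{IJ}$
    the base of the bi-cluster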
24
Cheng and Church
  • Model
  • A bicluster is represented by a submatrix A of the
    whole expression matrix (the involved rows and
    columns need not be contiguous in the original
    matrix).
  • Each entry $a_{ij}$ in the bicluster is the
    superposition (summation) of:
  • The background level
  • The row (gene) effect
  • The column (condition) effect
  • A dataset contains a number of biclusters, which
    are not necessarily disjoint.

25
Cheng and Church
  • Finding the largest δ-bicluster
  • The problem of finding the largest square
    δ-bicluster (|I| = |J|) is NP-hard.
  • Objective function for heuristic methods (to
    minimize): the mean squared residue
    $H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$
    => a sum of the components from each row and column,
    which suggests simple greedy algorithms that evaluate
    each row and column independently.

26
Cheng and Church
  • Greedy methods
  • Algorithm 0: Brute-force deletion (skipped)
  • Algorithm 1: Single node deletion
  • Parameter(s): δ (maximum squared residue).
  • Initialization: the bicluster contains all rows
    and columns.
  • Iteration:
  • Compute all $a_{Ij}$, $a_{iJ}$, $a_{IJ}$ and H(I, J) for reuse.
  • Remove the row or column that gives the maximum
    decrease of H.
  • Termination: when no action will decrease H or
    H < δ.
  • Time complexity: O(MN)
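
A literal, unoptimized reading of the single node deletion steps above can be sketched as follows. It assumes H(I, J) is the mean squared residue, recomputes H from scratch for every candidate deletion instead of reusing the row and column means, and never shrinks below 2 rows or 2 columns (an added guard, since H is trivially zero for a single row or column). The function names and the test matrix are hypothetical.

```python
import numpy as np

def mean_squared_residue(S):
    """H(I, J) for a submatrix S, assumed to be the mean squared residue."""
    residue = (S - S.mean(axis=1, keepdims=True)
                 - S.mean(axis=0, keepdims=True) + S.mean())
    return float((residue ** 2).mean())

def single_node_deletion(A, delta):
    rows = list(range(A.shape[0]))
    cols = list(range(A.shape[1]))
    H = mean_squared_residue(A)
    while H >= delta:
        best = None  # (new_H, axis, index) of the best single deletion
        for axis, members in (("row", rows), ("col", cols)):
            if len(members) <= 2:
                continue  # guard: keep at least 2 rows and 2 columns
            for idx in members:
                r = [i for i in rows if not (axis == "row" and i == idx)]
                c = [j for j in cols if not (axis == "col" and j == idx)]
                new_H = mean_squared_residue(A[np.ix_(r, c)])
                if best is None or new_H < best[0]:
                    best = (new_H, axis, idx)
        if best is None or best[0] >= H:
            break  # no deletion decreases H
        H = best[0]
        (rows if best[1] == "row" else cols).remove(best[2])
    return rows, cols, H

# Three additive rows plus one hypothetical outlier row.
A = np.array([[8, 10, 9], [10, 12, 11], [7, 9, 8], [1, 20, 3]], dtype=float)
rows, cols, H = single_node_deletion(A, delta=1e-9)
print(rows, cols)  # [0, 1, 2] [0, 1, 2]
```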

27
Cheng and Church
  • Greedy methods
  • Algorithm 2: Multiple node deletion (takes one
    more parameter α. In iteration step 2, delete all
    rows and columns with row/column residue > αH(I,
    J)).
  • Algorithm 3: Node addition (allows both additions
    and deletions of rows/columns).
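
One pass of multiple node deletion might look like the sketch below. It assumes the row/column residue is the mean squared residue of that row or column, that the threshold multiplier (called `alpha` here) is at least 1, and that deletion is skipped once H falls below the residue bound `delta`. These names and the test matrix are assumptions for illustration, not the authors' code.

```python
import numpy as np

def scores(M):
    """Return H plus the per-row and per-column mean squared residues of M."""
    r2 = (M - M.mean(axis=1, keepdims=True)
            - M.mean(axis=0, keepdims=True) + M.mean()) ** 2
    return float(r2.mean()), r2.mean(axis=1), r2.mean(axis=0)

def multiple_node_deletion_pass(S, delta, alpha):
    """Drop all rows, then all columns, whose residue exceeds alpha * H."""
    keep_rows = np.ones(S.shape[0], dtype=bool)
    keep_cols = np.ones(S.shape[1], dtype=bool)
    H, row_score, _ = scores(S)
    if H >= delta:
        keep_rows = row_score <= alpha * H
    # Recompute on the surviving rows before judging the columns.
    H, _, col_score = scores(S[keep_rows])
    if H >= delta:
        keep_cols = col_score <= alpha * H
    return keep_rows, keep_cols

# Three additive rows plus one hypothetical outlier row.
A = np.array([[8, 10, 9], [10, 12, 11], [7, 9, 8], [1, 20, 3]], dtype=float)
keep_rows, keep_cols = multiple_node_deletion_pass(A, delta=0.01, alpha=1.2)
print(keep_rows.tolist())  # [True, True, True, False]
print(keep_cols.tolist())  # [True, True, True]
```

Removing several nodes per pass trades some precision for speed, which matters most on the first iterations when the matrix is still large.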

28
Cheng and Church
  • Handling missing values and masking discovered
    biclusters: replace by random numbers so that no
    recognizable structures will be introduced.
  • Data preprocessing
  • Yeast: x → 100 log(10^5 x)
  • Lymphoma: x → 100x (original data is already
    log-transformed)

29
Cheng and Church
  • Some results on yeast cell cycle data (2884 × 17)

30
Cheng and Church
  • Some results on lymphoma data (4026 × 96)

No. of genes, no. of conditions   No. of genes, no. of conditions   No. of genes, no. of conditions
4, 96                             10, 29                            11, 25
103, 25                           127, 13                           13, 21
10, 57                            2, 96                             25, 12
9, 51                             3, 96                             2, 96
31
Cheng and Church
  • Discussion
  • Biological validation: comparing with the
    clusters in previously published results.
  • No evaluation of the statistical significance of
    the clusters.
  • Both the model and the algorithm are not tailored
    for discovering multiple non-disjoint clusters.
  • Normalization is of utmost importance for the
    model, but this issue is not well-discussed.

32
The FLOC algorithm
  • Generate initial clusters
  • Determine the best action for each row and each
    column
  • Perform the best action of each row and column
    sequentially
  • Improved? If yes, go back to determining the best
    actions; if no, stop.
33
The FLOC algorithm
  • Action: the change of membership of a row (or
    column) with respect to a cluster
  • M + N actions are performed at each iteration

[Figure: example data matrix with N = 3 rows and M = 4 columns]
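
To illustrate what an action is, the sketch below scores every possible membership toggle for a single cluster. It is a simplification under stated assumptions: the gain of an action is taken to be the plain decrease in mean absolute residue, clusters with fewer than 2 rows or 2 columns are treated as undesirable, and only one cluster is considered. The function names and the data are hypothetical, and the gain definition is not taken from the slides.

```python
import numpy as np

def cluster_residue(A, row_in, col_in):
    """Mean absolute residue of the cluster selected by two boolean masks."""
    S = A[np.ix_(row_in, col_in)]
    if S.shape[0] < 2 or S.shape[1] < 2:
        return np.inf  # assumption: degenerate clusters are never preferred
    r = (S - S.mean(axis=1, keepdims=True)
           - S.mean(axis=0, keepdims=True) + S.mean())
    return float(np.abs(r).mean())

def action_gains(A, row_in, col_in):
    """Gain (residue decrease) of toggling each row's and column's membership."""
    base = cluster_residue(A, row_in, col_in)
    gains = {}
    for i in range(A.shape[0]):
        toggled = row_in.copy()
        toggled[i] = not toggled[i]
        gains[("row", i)] = base - cluster_residue(A, toggled, col_in)
    for j in range(A.shape[1]):
        toggled = col_in.copy()
        toggled[j] = not toggled[j]
        gains[("col", j)] = base - cluster_residue(A, row_in, toggled)
    return gains

# Three additive rows plus one hypothetical outlier row, all currently members.
A = np.array([[8, 10, 9], [10, 12, 11], [7, 9, 8], [1, 20, 3]], dtype=float)
row_in = np.ones(4, dtype=bool)
col_in = np.ones(3, dtype=bool)
gains = action_gains(A, row_in, col_in)
best_action = max(gains, key=gains.get)
print(len(gains), best_action)  # 7 ('row', 3)
```

Each row and each column contributes exactly one candidate action, so a 4 x 3 matrix yields 7 candidates for this cluster.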
34
Performance
  • Microarray data: 2884 genes, 17 conditions
  • The 100 bi-clusters with the smallest residue were
    returned.
  • Average residue: 10.34
  • The average residue of clusters found via the
    state-of-the-art method in the computational biology
    field is 12.54
  • The average volume is 25% bigger
  • The response time is an order of magnitude faster

35
Coherent Cluster
We want to accommodate noise but not outliers
36
Coherent Cluster
  • Coherent cluster
  • Subspace clustering
  • Pair-wise disparity
  • For a 2×2 (sub)matrix consisting of objects x,
    y and attributes a, b:
    $D = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$,
    where $d_{xa} - d_{ya}$ is the mutual bias of attribute a
    and $d_{xb} - d_{yb}$ is the mutual bias of attribute b
37
Coherent Cluster
  • A 2×2 (sub)matrix is a δ-coherent cluster if its
    D value is less than or equal to δ.
  • An m×n matrix X is a δ-coherent cluster if every
    2×2 submatrix of X is a δ-coherent cluster.
  • A δ-coherent cluster is a maximum δ-coherent
    cluster if it is not a submatrix of any other
    δ-coherent cluster.
  • Objective: given a data matrix and a threshold δ,
    find all maximum δ-coherent clusters.
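
The definition translates into a brute-force check over all 2×2 submatrices. The sketch below assumes the D value is the absolute difference between the mutual biases of the two attributes, |(d_xa - d_ya) - (d_xb - d_yb)|. The function names are hypothetical, and the data reuses the VPS8/EFB1/CYS3 bi-cluster from the earlier slide.

```python
import numpy as np
from itertools import combinations

def disparity(X, x, y, a, b):
    """D value of the 2x2 submatrix on objects x, y and attributes a, b."""
    return abs((X[x, a] - X[y, a]) - (X[x, b] - X[y, b]))

def is_delta_coherent(X, delta):
    """True if every 2x2 submatrix of X has a D value of at most delta."""
    n_rows, n_cols = X.shape
    return all(
        disparity(X, x, y, a, b) <= delta
        for x, y in combinations(range(n_rows), 2)
        for a, b in combinations(range(n_cols), 2)
    )

# Genes VPS8, EFB1, CYS3 on conditions CH1I, CH1D, CH2B.
B = np.array([[401, 120, 298], [318, 37, 215], [322, 41, 219]])
print(is_delta_coherent(B, 0))  # True: a perfect constant shift

noisy = B.copy()
noisy[0, 0] += 5                # perturb a single entry by 5
print(is_delta_coherent(noisy, 4))  # False
print(is_delta_coherent(noisy, 5))  # True
```

The check is only a verifier for a given submatrix; enumerating all maximum δ-coherent clusters needs a search strategy on top of it.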

38
Coherent Cluster
  • Challenges
  • Finding subspace clusters based on distance
    is itself already a difficult task due to the
    curse of dimensionality.
  • The (sub)set of objects and the (sub)set of
    attributes that form a cluster are unknown in
    advance and may not be adjacent to each other in
    the data matrix.
  • The actual values of the objects in a coherent
    cluster may be far apart from each other.
  • Each object or attribute in a coherent cluster
    may bear some relative bias (that is unknown in
    advance), and such bias may be local to the
    coherent cluster.

39
References
  • J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster:
    capturing subspace correlation in a large data
    set, Proceedings of the 18th IEEE International
    Conference on Data Engineering (ICDE), pp.
    517-528, 2002.
  • H. Wang, W. Wang, J. Yang, P. Yu, Clustering by
    pattern similarity in large data sets, to appear
    in Proceedings of the ACM SIGMOD International
    Conference on Management of Data (SIGMOD), 2002.
  • S. Yoon, C. Nardini, L. Benini, G. De
    Micheli, Enhanced pClustering and its
    applications to gene expression data,
    Bioinformatics and Bioengineering, 2004.
  • J. Liu and W. Wang, OP-Cluster: clustering by
    tendency in high dimensional space, ICDM 2003.