Title: Probation Talk: On Mining Micro-array Data by Order-Preserving Submatrix
1Probation Talk: On Mining Micro-array Data by
Order-Preserving Submatrix
- Lin Cheung
- Kevin Y. Yip
- David W. Cheung
- Ben Kao
- Michael K. Ng
2Abstract
- Study the problem of pattern-based subspace clustering
- Unlike traditional clustering methods, which focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rises and falls in subspaces.
- Applications
- DNA micro-array data analysis
- Automatic recommendation systems
- Target marketing systems
3Outline
- Introduction
- Related Work
- Problem Definition
- Algorithm
- Experiment Results
- Future Plan
4Introduction
- The invention of DNA micro-array technologies has revolutionized the experimental study of gene expression.
- Thousands of genes are probed every day.
- Various gene expression data analysis techniques have been intensively studied.
5Introduction
- In DNA micro-array data analysis, gene expression data is organized as matrices.
- A row carries the information of a gene.
- A column represents a sample for an experiment.
- The number in each cell records the expression value of a particular gene under a particular sample.
- In this presentation, we will use the terms
- object to mean a row (gene) of a dataset
- dimension (or attribute) to mean a column (sample) of a dataset
[Figure: example data matrix; rows are genes, columns are samples]
6Clustering
- Clustering
- One of the most popular methods of discovering useful biological insights from gene expression data.
- Groups together data points that are close or similar to each other over all dimensions into clusters.
- Traditional distance-based clustering
- Depends heavily on the distance (or similarity) measure.
- Euclidean distance, Manhattan distance, etc.
7Subspace clustering
- DNA micro-array data is typically high-dimensional.
- Clustering over the global set of attributes often fails to extract all meaningful clusters.
- Objects tend to exhibit strong similarity only over an (unknown) subset of the dimensions.
- Subspace clustering
- Groups together data points that are close or similar to each other over a subset of the dimensions into clusters.
- Discovers clusters that are embedded in certain subspaces of high-dimensional data, such as gene expression data.
8Pattern-based subspace clustering
- A special type of subspace clustering that uses
pattern similarity as a measure of object
distances.
9Pattern-based subspace clustering
[Figure: raw data, 3 rows and 10 columns]
10Pattern-based subspace clustering
- In columns b, c, f, h, i
- The expression values of the rows follow the same rise-and-fall pattern.
- The 3 rows form a cluster in the subspace {b, c, f, h, i}.
- Traditional distance functions are sometimes inadequate in capturing correlations among rows.
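A minimal sketch of why a distance function can miss a shared rise-and-fall pattern. The two gene profiles are hypothetical values chosen for illustration, not data from the talk: the second profile is the first one shifted by a constant, so the two are far apart in Euclidean distance yet rise and fall together.

```python
import math

# Hypothetical expression values for two genes over five samples
# (illustrative only; not taken from the talk's dataset).
gene_a = [10.0, 40.0, 20.0, 50.0, 30.0]
gene_b = [110.0, 140.0, 120.0, 150.0, 130.0]  # gene_a shifted by +100


def euclidean(u, v):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def rise_fall(values):
    """Sign of each consecutive change: +1 for a rise, -1 for a fall, 0 for flat."""
    return [(b > a) - (b < a) for a, b in zip(values, values[1:])]


distance = euclidean(gene_a, gene_b)                     # large, about 223.6
same_pattern = rise_fall(gene_a) == rise_fall(gene_b)    # identical rises and falls
```

A distance-based method would likely place these two genes in different clusters, while a pattern-based method treats them as similar.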
11Pattern-based subspace clustering
- Rearrange the columns so that expression values are listed in ascending order.
- Form a column sequence <f, c, b, i, h>.
- All rows are increasing under the new column sequence.
- <f, c, b, i, h> is referred to as an order-preserving pattern.
- The length of a pattern is the number of columns.
- A row is a supporting row if its values exhibit an increasing order with respect to the column sequence; the support of a pattern is its number of supporting rows.
- An order-preserving pattern together with its set of supporting rows forms an Order-Preserving Submatrix, or OPSM.
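The definitions of supporting row and support can be sketched as follows. The data values are hypothetical (the slide's raw data is not reproduced here), and `supporting_rows` is an illustrative helper name, not part of the authors' algorithm.

```python
def supporting_rows(matrix, column_sequence):
    """Return the indices of rows whose values strictly increase along column_sequence."""
    result = []
    for index, row in enumerate(matrix):
        values = [row[column] for column in column_sequence]
        if all(a < b for a, b in zip(values, values[1:])):
            result.append(index)
    return result


# Hypothetical rows restricted to columns b, c, f, h, i.
data = [
    {"b": 30, "c": 20, "f": 10, "h": 50, "i": 40},
    {"b": 33, "c": 25, "f": 12, "h": 90, "i": 45},
    {"b": 5, "c": 20, "f": 10, "h": 50, "i": 40},
]

pattern = ["f", "c", "b", "i", "h"]
rows = supporting_rows(data, pattern)   # rows 0 and 1 support the pattern
support = len(rows)                     # support of the pattern

# Sorting one row's columns by value recovers the sequence that row follows.
sequence_of_row0 = sorted(data[0], key=data[0].get)
```

Here the pattern together with rows 0 and 1 would form an OPSM; row 2 breaks the order at column b.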
12Challenges
- Computationally challenging problem
- The complexity lies in the requirement of simultaneously determining both cluster members and relevant dimensions.
- The number of potential order-preserving patterns grows exponentially with respect to the number of attributes.
- A dataset D with n attributes has $\sum_{i} n!/(n-i)!$ potential order-preserving patterns, where i ranges over the pattern lengths.
- There could be tens to hundreds of attributes in DNA micro-array datasets.
- The number of potential OPSMs is huge.
- If a set of rows R and a set of columns C form a valid OPSM, then any subset of those rows together with any subset of those columns forms a valid OPSM too.
- FOCUS
- Mine ALL maximal OPSMs only
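To give a feel for the growth, the count of potential patterns can be computed directly. This sketch assumes the sum runs over pattern lengths from `min_len` to n; the lower bound is an assumption, since the slide does not state the summation range.

```python
from math import factorial


def count_patterns(n, min_len=1):
    """Number of ordered column sequences of length min_len..n drawn from n attributes.

    The summation range is an assumption; the slide gives only the summand n!/(n-i)!.
    """
    return sum(factorial(n) // factorial(n - i) for i in range(min_len, n + 1))


small = count_patterns(3)     # 3 + 6 + 6 = 15
ten = count_patterns(10)      # 9,864,100 sequences for only 10 attributes
```

Even at 10 attributes the search space is close to ten million sequences, which is why exhaustive enumeration is impractical for datasets with tens to hundreds of attributes.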
13Related Work
- OPSM model, Ben-Dor et al., RECOMB 2002 [3]
- Proposed a partial model that returns k good-quality OPSMs
14Related Work
- δ-pCluster, Wang et al., SIGMOD 2002 [11]
- Can discover only shift patterns and scaling patterns
15Formal description of the OPSM problem
- Consider a gene-expression dataset D, represented as a matrix.
- We use O and C to denote the set of rows and columns in D, respectively.
- We use $d_{i,j}$ to denote the entry of D in row i and column j.
16Formal description of the OPSM problem
- A cluster S is a submatrix of D formed by a subset of $n_S$ ($\ge 2$) rows and a subset of $m_S$ ($\ge 2$) columns of D.
- Rows and columns in S need not be contiguous in D. The rows in S are referenced by their row indices in D, each of which is a distinct integer in $\{1, 2, \ldots, n_D\}$. The set of row indices of S is denoted as $R_S$.
- Columns in S are similarly referenced. The set of columns in S is denoted by $C_S$.
17Formal description of the OPSM problem
- The columns in S are enclosed in curly brackets, e.g., $C_S = \{c_1, c_2, \ldots, c_{m_S}\}$. A sequence $\vec{C}_S$ of the columns in S is enclosed in angled brackets, e.g., $\vec{C}_S = \langle c_1, c_2, \ldots, c_{m_S} \rangle$. The columns in a sequence are totally ordered.
- For the basic OPSM problem, a cluster is a set of rows and a set of columns such that the entries in every row are increasing w.r.t. a particular column sequence.
- A cluster S is thus written as $S = (R_S, \vec{C}_S)$.
18Problem Definition
- Definition 1: A cluster S is an OPSM if there exists a sequence of its columns such that, with the columns of S arranged in that order, $s_{i,j} < s_{i,j+1}$ for all $i \in \{1, 2, \ldots, n_S\}$ and all $j \in \{1, 2, \ldots, m_S - 1\}$.
- A cluster S is a subcluster of a cluster S' if $R_S \subseteq R_{S'}$ and $C_S \subseteq C_{S'}$. A cluster S is a proper subcluster of a cluster S' if S is a subcluster of S' and either $R_S \subset R_{S'}$ or $C_S \subset C_{S'}$.
- Definition 2: An OPSM is a maximal OPSM if it is not a proper subcluster of any OPSM.
19Problem Definition
- (Maximal size-constrained OPSM problem) Given a data matrix D, a supporting row threshold $n_{min}$, and a column threshold $m_{min}$, find all maximal OPSMs S in D such that $n_S \ge n_{min}$ and $m_S \ge m_{min}$.
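The problem statement can be made concrete with a brute-force reference sketch. This is not the talk's algorithm: it enumerates every column sequence, so it is only usable on tiny matrices. The matrix values and the name `maximal_opsms` are hypothetical, and increasing order is taken to be strict.

```python
from itertools import permutations


def maximal_opsms(matrix, n_min, m_min):
    """Brute force: return all maximal OPSMs as (row index set, column sequence) pairs."""
    n_cols = len(matrix[0])
    candidates = []
    for size in range(m_min, n_cols + 1):
        for seq in permutations(range(n_cols), size):
            # All rows that strictly increase along seq (the full support of seq).
            rows = frozenset(
                r for r, row in enumerate(matrix)
                if all(row[a] < row[b] for a, b in zip(seq, seq[1:]))
            )
            if len(rows) >= n_min:
                candidates.append((rows, seq))

    def is_proper_subcluster(small, big):
        (r1, s1), (r2, s2) = small, big
        c1, c2 = set(s1), set(s2)
        return r1 <= r2 and c1 <= c2 and (r1 < r2 or c1 < c2)

    # Keep only candidates that are not a proper subcluster of another candidate.
    return [c for c in candidates
            if not any(is_proper_subcluster(c, other) for other in candidates)]


# Hypothetical 3 x 4 matrix (rows are genes, columns are samples 0..3).
D = [
    [1, 2, 3, 4],
    [2, 3, 4, 1],
    [4, 3, 2, 1],
]
result = maximal_opsms(D, n_min=2, m_min=2)
```

On this matrix, rows 0 and 1 share the sequence <0, 1, 2>, while rows 1 and 2 agree only on the length-2 sequences <3, 0>, <3, 1> and <3, 2>, so four maximal OPSMs are returned.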
20Algorithm
- Devised an apriori-like algorithm to solve the problem
- Invented a new data structure
- Head-Tail Trees
- Efficient processing of
- identifying all pairs of column sequences with length k where the first k - 1 indices of the first cluster's column sequence equal the last k - 1 indices of the second cluster's column sequence, and
- intersecting two sets of row indices
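One level of the apriori-like join described above can be sketched as follows. This sketch deliberately swaps the Head-Tail Trees for a plain dictionary lookup, since the slides do not spell out the tree layout; `join_level` and the sample clusters are hypothetical.

```python
from collections import defaultdict


def join_level(clusters, n_min):
    """Join length-k clusters into length-(k+1) clusters.

    clusters maps a column sequence (tuple) to the frozenset of its supporting rows.
    A dictionary keyed on the first k-1 columns stands in for the Head-Tail Trees.
    """
    by_head = defaultdict(list)
    for seq in clusters:
        by_head[seq[:-1]].append(seq)          # index each sequence by its first k-1 columns

    joined = {}
    for second in clusters:
        tail = second[1:]                      # last k-1 columns of the second cluster
        for first in by_head.get(tail, ()):    # first clusters whose head equals that tail
            if first[-1] in second:
                continue                       # a column may appear only once in a sequence
            rows = clusters[first] & clusters[second]   # intersect the row index sets
            if len(rows) >= n_min:
                joined[second + first[-1:]] = rows
    return joined


# Hypothetical length-2 clusters with their supporting rows.
level2 = {
    (0, 1): frozenset({0, 1}),
    (0, 2): frozenset({0, 1}),
    (1, 2): frozenset({0, 1}),
    (3, 0): frozenset({1, 2}),
    (3, 1): frozenset({1, 2}),
    (3, 2): frozenset({1, 2}),
}
level3 = join_level(level2, n_min=2)
```

Because the two sequences overlap in at least one column, a row supports the joined sequence exactly when it supports both inputs, so intersecting the row sets needs no further check against the data.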
21Algorithm
Given a dataset D, $n_{min} = 2$ and $m_{min} = 2$
22Algorithm
The head tree for clusters with 2 columns
The tail tree for clusters with 2 columns
23Algorithm
The head tree for clusters with 3 columns
The tail tree for clusters with 3 columns
24Algorithm
- Property 1 (A priori property): A cluster S is an OPSM if and only if all proper subclusters of S are also OPSMs.
- Property 2 (Transitivity): If $S_1 = (R_{S_1}, \langle x_1, x_2, \ldots, x_i, y_1, y_2, \ldots, y_j \rangle)$ and $S_2 = (R_{S_2}, \langle y_1, y_2, \ldots, y_j, z_1, z_2, \ldots, z_k \rangle)$ are two maximal OPSMs and $R_{S_1} \cap R_{S_2}$ contains $n_{min}$ or more indices, then $S = (R_{S_1} \cap R_{S_2}, \langle x_1, x_2, \ldots, x_i, y_1, y_2, \ldots, y_j, z_1, z_2, \ldots, z_k \rangle)$ is a maximal OPSM.
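A short derivation of why the merged cluster in Property 2 is order-preserving, assuming the shared part has at least one column ($j \ge 1$) and that increasing means strictly increasing. It covers only the ordering claim, not maximality. Here $d_{r,c}$ is the entry of D in row r and column c.

$$
\begin{aligned}
r \in R_{S_1} &\;\Rightarrow\; d_{r,x_1} < \cdots < d_{r,x_i} < d_{r,y_1} < \cdots < d_{r,y_j}, \\
r \in R_{S_2} &\;\Rightarrow\; d_{r,y_1} < \cdots < d_{r,y_j} < d_{r,z_1} < \cdots < d_{r,z_k}, \\
r \in R_{S_1} \cap R_{S_2} &\;\Rightarrow\; d_{r,x_1} < \cdots < d_{r,x_i} < d_{r,y_1} < \cdots < d_{r,y_j} < d_{r,z_1} < \cdots < d_{r,z_k}.
\end{aligned}
$$

The last line chains the first two through the shared column $y_j$, so every row in the intersection supports the merged sequence.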
25Experiments
- Yeast Dataset
- Used in Y. Cheng and G.M. Church. Biclustering of expression data. ISMB, 2000 [2].
- 2884 rows x 17 columns
- 54 gene groups
- Procedure
1. Pick a gene group and put all the genes of the gene group in a cluster.
2. Find all dimension pairs along which the expression values of all the genes move in the same direction.
3. Merge the results of step 2 to find OPSMs with more than 2 dimensions.
4. Repeat from step 1.
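The dimension-pair step of the experiment can be sketched as below. It covers only the pair-finding step, reads "move in the same direction" as "every gene's value rises from the first dimension to the second", and uses hypothetical values rather than the yeast dataset.

```python
from itertools import permutations


def consistent_pairs(group_rows):
    """Ordered dimension pairs (a, b) such that row[a] < row[b] for every gene in the group."""
    n_cols = len(group_rows[0])
    return [
        (a, b)
        for a, b in permutations(range(n_cols), 2)
        if all(row[a] < row[b] for row in group_rows)
    ]


# Hypothetical gene group with two genes measured over four dimensions.
gene_group = [
    [1, 2, 3, 4],
    [2, 3, 4, 1],
]
pairs = consistent_pairs(gene_group)
```

Each returned pair is a length-2 order-preserving pattern supported by the whole gene group; the merge step would then combine these pairs into longer patterns.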
26Scalability with respect to number of columns
Number of rows = 1000, $n_{min} = 10$, $m_{min} = 2$
27Scalability with respect to number of rows
Number of columns = 10, $n_{min} = 10$, $m_{min} = 2$
28Future Work
- Uncertainty and experimental errors exist in micro-array datasets
- Mine error-tolerant OPSMs
29References
- [1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. VLDB, 1994.
- [2] Y. Cheng and G.M. Church. Biclustering of expression data. ISMB, 2000.
- [3] A. Ben-Dor et al. Discovering local structure in gene expression data: the order-preserving submatrix problem. RECOMB, 2002.
- [4] J. P. et al. MaPle: A fast algorithm for maximal pattern-based clustering. ICDM, 2003.
- [5] J. P. B. et al. Significance and statistical errors in the analysis of DNA microarray data. PNAS, 2002.
- [6] Y. K. et al. Spectral biclustering of microarray cancer data: Co-clustering genes and conditions. Genome Research, 13(4):703-716, 2003.
- [7] L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statistica Sinica, 12:61-86, 2002.
- [8] J. Liu and W. Wang. OP-Cluster: Clustering by tendency in high dimensional space. ICDM, 2003.
- [9] J. Liu, J. Yang, and W. Wang. Biclustering in gene expression data by tendency. CSB, 2004.
- [10] H. Wang, F. Chu, W. Fan, P. Yu, and J. Pei. A fast algorithm for subspace clustering by pattern similarity. SSDBM, 2004.
- [11] H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD, 2002.
- [12] J. Yang, H. Wang, W. Wang, and P. Yu. Enhanced biclustering on expression data. BIBE, 2003.